URL scraping with JavaScript

Quite some time ago I wrote a small entry on Hyperlink to URL && Extracting all the links from a site, which was mostly terminal-based scraping.

Lately I was playing around with different possible ways of extracting URLs from a given webpage using plain, raw JavaScript. The idea was to figure out which of the different possible ways is the best, and why.

A simple comparison of different methods is demonstrated in this entry.

The focus of this writeup:

  • List out the possible ways of extracting URLs.

  • Do a performance review of them.

  • Select the best out of the lot.

Different possible ways of extracting URLs:

document.links

document.getElementsByTagName('a')

document.querySelectorAll('a')

Understanding them all:

  • document.links is of type HTMLCollection and is read-only. It's a DOM Level 2 API.

The links property returns a collection of all AREA elements and anchor elements in a document with a value for the href attribute.

interface HTMLCollection {
  readonly attribute unsigned long   length;
  Node               item(in unsigned long index);
  Node               namedItem(in DOMString name);
};
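
A quick sketch of walking it; the collection is live and array-like, so a plain index loop does the job:

// every <a> and <area> carrying an href shows up here automatically
for (var i = 0; i < document.links.length; i++) {
  console.log(document.links[i].href);
}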

  • document.getElementsByTagName is also a DOM Level 2 API.

Returns a NodeList of all the Elements with a given tag name in the order in which they are encountered in a preorder traversal of the Document tree.
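
Note that this returns every <a> element, including named anchors that have no href, so a small sketch of URL extraction with it needs a guard:

var anchors = document.getElementsByTagName('a');
for (var i = 0; i < anchors.length; i++) {
  if (anchors[i].href) { // skip anchors without a URL
    console.log(anchors[i].href);
  }
}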

  • document.querySelectorAll is a Selectors API Level 1

module dom {
  [Supplemental, NoInterfaceObject]
  interface NodeSelector {
    Element   querySelector(in DOMString selectors);
    NodeList  querySelectorAll(in DOMString selectors);
  };
  Document implements NodeSelector;
  DocumentFragment implements NodeSelector;
  Element implements NodeSelector;
};

This is an example table written in HTML 4.01, taken from w3.org:
 
<table id="score">
  <thead>
    <tr>
      <th>Test
      <th>Result
  <tfoot>
    <tr>
      <th>Average
      <td>82%
  <tbody>
    <tr>
      <td>A
      <td>87%
    <tr>
      <td>B
      <td>78%
    <tr>
      <td>C
      <td>81%
</table>
 
In order to obtain the cells containing the results in the table, which might be done, for example, to plot the values on a graph, there are at least two approaches that may be taken. Using only the APIs from DOM Level 2, it requires a script like the following that iterates through each tr within each tbody in the table to find the second cell of each row.
 
var table = document.getElementById("score");
var groups = table.tBodies;
var rows = null;
var cells = [];
 
for (var i = 0; i < groups.length; i++) {
  rows = groups[i].rows;
  for (var j = 0; j < rows.length; j++) {
    cells.push(rows[j].cells[1]);
  }
}
Alternatively, using the querySelectorAll() method, that script becomes much more concise.
 
var cells = document.querySelectorAll("#score>tbody>tr>td:nth-of-type(2)");
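
To actually pull the values out for plotting, one could then map over the result; a quick sketch (trim() strips the whitespace left by the table's implied end tags):

var cells = document.querySelectorAll("#score>tbody>tr>td:nth-of-type(2)");
var values = [].map.call(cells, function (td) {
  return td.textContent.trim(); // "87%", "78%", "81%"
});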

Performance review

On Chrome 18, this was the best of three runs using benchmark.js -> Result
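
For reference, a minimal sketch of such a comparison, assuming benchmark.js is loaded on the page under test (the suite entries mirror the three lookups above):

var suite = new Benchmark.Suite();

suite
  .add('document.links', function () {
    // touch .length so the live collection is actually walked
    return document.links.length;
  })
  .add('getElementsByTagName', function () {
    return document.getElementsByTagName('a').length;
  })
  .add('querySelectorAll', function () {
    return document.querySelectorAll('a').length;
  })
  .on('cycle', function (event) {
    console.log(String(event.target)); // ops/sec for each candidate
  })
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
  })
  .run({ async: true });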

So the clear winner is document.links in this particular case, but my preference would still go to querySelectorAll(), based on standards compliance and cross-browser compatibility.

Based on the above, some silly code to get all the images from Pinterest:

[].forEach.call(document.querySelectorAll('img.PinImageImg'), function (elem) {
  console.log(elem.src);
});

This has been the case study so far; do let me know your views below!

Update, May 15, 2012:

W.r.t. blindwanderer's question below, XPath turned out to be slower than querySelectorAll().

XPath vs querySelectorAll():

// 0 is XPathResult.ANY_TYPE
document.evaluate("//a[@href='#']", document, null, 0, null);
document.querySelectorAll("a[href='#']");
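
The two calls also differ in what they hand back; document.evaluate() returns an XPathResult that has to be iterated, while querySelectorAll() returns a NodeList that can be indexed directly. A quick sketch of consuming each:

var xp = document.evaluate("//a[@href='#']", document, null,
                           XPathResult.ANY_TYPE, null);
var node;
while ((node = xp.iterateNext())) {
  console.log(node.href);
}

var list = document.querySelectorAll("a[href='#']");
for (var i = 0; i < list.length; i++) {
  console.log(list[i].href);
}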

It's not only faster, but also simpler! :)
