URL scraping with JavaScript
Quite some time ago I wrote small entries on Hyperlink to URL && Extracting all the links from a site, both of which were mostly terminal-based scraping.
Lately I have been playing around with the different possible ways of extracting URLs from a given webpage using plain, raw JavaScript; the idea was to figure out which of the possible approaches is the best, and why.
A simple comparison of different methods is demonstrated in this entry.
The focus of this writeup:
List out the possible ways of extracting URLs.
Do a performance review of them.
Select the best out of the lot.
Different possible ways of extracting URLs:
document.links
document.getElementsByTagName('a')
document.querySelectorAll('a')
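All three can be exercised with the same iteration pattern from the browser console. A minimal sketch (`extractHrefs` is a hypothetical helper, not part of any of these APIs):

```javascript
// Works on any array-like collection of elements (HTMLCollection,
// NodeList, or a plain array) and returns the href of each element.
function extractHrefs(collection) {
  return [].map.call(collection, function (elem) {
    return elem.href;
  });
}

// In a browser, each of these yields the same list of URLs:
//   extractHrefs(document.links);
//   extractHrefs(document.getElementsByTagName('a'));
//   extractHrefs(document.querySelectorAll('a'));
```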
Understanding them all:
- document.links is of type HTMLCollection and is read-only. It's a DOM Level 2 API.
The links property returns a collection of all AREA elements and anchor elements in a document with a value for the href attribute.
```
interface HTMLCollection {
  readonly attribute unsigned long length;
  Node item(in unsigned long index);
  Node namedItem(in DOMString name);
};
```
- document.getElementsByTagName is also a DOM Level 2 API.
Returns a NodeList of all the Elements with a given tag name in the order in which they are encountered in a preorder traversal of the Document tree.
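One practical wrinkle: document.links and getElementsByTagName return *live* collections that update as the document changes, whereas querySelectorAll returns a static NodeList. A common pattern is to snapshot a live collection into a plain array before iterating (`toArray` is a hypothetical helper):

```javascript
// Copying a live collection into a plain array gives a stable
// snapshot that is safe to iterate while the DOM is being mutated.
function toArray(collection) {
  return [].slice.call(collection);
}

// Browser usage (assumed):
//   var anchors = toArray(document.getElementsByTagName('a'));
```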
- document.querySelectorAll is part of the Selectors API Level 1.
```
module dom {
  [Supplemental, NoInterfaceObject]
  interface NodeSelector {
    Element querySelector(in DOMString selectors);
    NodeList querySelectorAll(in DOMString selectors);
  };
  Document implements NodeSelector;
  DocumentFragment implements NodeSelector;
  Element implements NodeSelector;
};
```
This is an example table written in HTML 4.01, from w3.org:

```html
<table id="score">
  <thead>
    <tr> <th>Test <th>Result
  <tfoot>
    <tr> <th>Average <td>82%
  <tbody>
    <tr> <td>A <td>87%
    <tr> <td>B <td>78%
    <tr> <td>C <td>81%
</table>
```

In order to obtain the cells containing the results in the table (which might be done, for example, to plot the values on a graph), there are at least two approaches that may be taken. Using only the APIs from DOM Level 2, it requires a script like the following, which iterates through each tr within each tbody in the table to find the second cell of each row:

```javascript
var table = document.getElementById("score");
var groups = table.tBodies;
var rows = null;
var cells = [];

for (var i = 0; i < groups.length; i++) {
  rows = groups[i].rows;
  for (var j = 0; j < rows.length; j++) {
    cells.push(rows[j].cells[1]);
  }
}
```

Alternatively, using the querySelectorAll() method, that script becomes much more concise:

```javascript
var cells = document.querySelectorAll("#score>tbody>tr>td:nth-of-type(2)");
```
Performance review
On Chrome 18 this was the best of three runs using benchmark.js -> Result
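For readers without benchmark.js at hand, a crude timing loop gives a rough feel for the comparison (`timeIt` is a hypothetical helper; benchmark.js does far more, with warmup runs and statistical analysis, so treat this only as a quick sanity check):

```javascript
// Run fn() `iterations` times and report elapsed milliseconds.
function timeIt(fn, iterations) {
  var start = Date.now();
  for (var i = 0; i < iterations; i++) {
    fn();
  }
  return Date.now() - start;
}

// Browser usage (assumed):
//   timeIt(function () { document.links; }, 100000);
//   timeIt(function () { document.getElementsByTagName('a'); }, 100000);
//   timeIt(function () { document.querySelectorAll('a'); }, 100000);
```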
So the clear winner in this particular case is document.links, but my preference would still go to querySelectorAll(), based on standards compliance and cross-browser compatibility.
Based on the above, some silly code to get all the images from Pinterest:

```javascript
[].forEach.call(document.querySelectorAll('img.PinImageImg'), function (elem) {
  console.log(elem.src);
});
```
This has been the case study so far; do let me know your views below!
Update May 15 2012 :
With regard to blindwanderer's question below, XPath turned out to be slower than querySelectorAll().

XPath vs querySelectorAll():

```javascript
document.evaluate("//a[@href='#']", document, null, 0, null);
document.querySelectorAll("a[href='#']");
```
It's not only faster, but even simpler! :)