Introduction
In this article, we will look at how to scrape web page data with Node.js and a few helpful npm modules.
Since I have already built a site for web scraping, I would also like to introduce it to you.
Background
Web scraping has always carried a negative connotation, because most popular services provide APIs, and those should be used to retrieve data instead of scraping. But scraping is still necessary when we want data from a web page that no API provides, or when the API has license or quota limitations.
Scrapers can be written in any language, but in my view the essence of a web page is its DOM, so retrieving data through the DOM should be the best approach. jQuery is the most popular library for DOM operations, so I wanted to use jQuery selectors to retrieve web page data.
After some research, Node.js was the only option I found that met this requirement. Let's get started!
Scraping with Cheerio
OK, to be honest, we cannot actually use the full jQuery syntax.
As the cheerio project describes itself, it is a Node.js module that provides a "Fast, flexible, and lean implementation of core jQuery designed specifically for the server."
We can install the module using npm:
npm install cheerio
We also need the "request" module, which will be used to retrieve the web page.
npm install request
The module is extremely simple: we can use it just like jQuery. Let us scrape a real-world web page:
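var request = require('request');
var cheerio = require('cheerio');

var url = "http://www.imdb.com/chart/";

request({ "uri": url }, function (err, resp, body) {
    // Stop early if the request failed or the server did not return 200 OK.
    if (err || resp.statusCode !== 200) {
        console.error("Request failed:", err || resp.statusCode);
        return;
    }
    // Load the HTML so it can be queried with jQuery-style selectors.
    var $ = cheerio.load(body);
    var strContent = "";
    // Find the table whose header contains "Gross" and walk its rows.
    $('th:contains(Gross)').parents('table').find('tr').each(function (index, item) {
        // Row 0 is the header row, so skip it.
        if (index > 0) {
            var tds = $(item).find('td');
            // Cell 1 holds the linked title; cells 2 and 3 hold the two amounts.
            strContent += tds.eq(1).find('a').text().trim() + ","
                + tds.eq(2).text().trim() + "," + tds.eq(3).text().trim() + "\r\n";
        }
    });
    console.log(strContent);
});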
As you can see, we can scrape the data with jQuery-like syntax. The output is shown below:
The Hobbit: The Desolation of Smaug,$29.8M,$190.3M
Frozen,$28.8M,$248.4M
Anchorman 2: The Legend Continues,$20.2M,$83.7M
American Hustle,$19.6M,$60M
The Wolf of Wall Street,$18.5M,$34.3M
Saving Mr. Banks,$14M,$37.8M
The Secret Life of Walter Mitty,$13M,$25.6M
The Hunger Games: Catching Fire,$10.2M,$391.1M
47 Ronin,$9.9M,$20.6M
A Madea Christmas,$7.4M,$43.7M
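The script above joins the three fields with plain commas, so a title that itself contains a comma would shift the columns of that row. As a minimal sketch of one way to guard against that, the helpers below quote a field when needed. The names `csvField` and `toCsvRow` are hypothetical, and the quoting rule is an assumption: the common convention of wrapping a field in double quotes and doubling any embedded quotes.

```javascript
// Hypothetical helpers, not part of cheerio or request.
// Quotes a field only when it contains a comma, a double quote, or a line break.
function csvField(value) {
    const text = String(value);
    if (/[",\r\n]/.test(text)) {
        return '"' + text.replace(/"/g, '""') + '"';
    }
    return text;
}

// Joins the fields of one row and ends it with CRLF, like the script above.
function toCsvRow(fields) {
    return fields.map(csvField).join(",") + "\r\n";
}

// The first row is taken from the output above;
// the second is made up to show the quoting.
console.log(toCsvRow(["Frozen", "$28.8M", "$248.4M"]));
console.log(toCsvRow(["Title, With Comma", "$1M", "$2M"]));
```

Inside the row loop, the string concatenation could then be replaced by a call such as `strContent += toCsvRow([title, col2, col3])`, where the three variables stand for the trimmed cell texts.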
Points of Interest
I believe you can use XPath or regular expressions to get the same result, but I think that code would not be as clear as the JavaScript code above.
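To make the comparison concrete, here is a rough sketch of the regular-expression route. The HTML fragment is made up for illustration and only assumes the same column order as the cheerio example (rank, linked title, two amounts); it is not the real IMDb markup, and `extractRows` is a hypothetical name.

```javascript
// Made-up sample markup, NOT copied from the real page.
const sampleHtml =
    '<table>' +
    '<tr><th>Rank</th><th>Title</th><th>Weekend</th><th>Gross</th></tr>' +
    '<tr><td>1</td><td><a href="/title/a">Frozen</a></td><td>$28.8M</td><td>$248.4M</td></tr>' +
    '<tr><td>2</td><td><a href="/title/b">47 Ronin</a></td><td>$9.9M</td><td>$20.6M</td></tr>' +
    '</table>';

// Hypothetical helper: pulls the title and the two amounts out of each data row.
// The pattern must spell out the full tag sequence of a row.
function extractRows(html) {
    const rowPattern = /<tr>\s*<td>[^<]*<\/td>\s*<td>\s*<a[^>]*>([^<]*)<\/a>\s*<\/td>\s*<td>([^<]*)<\/td>\s*<td>([^<]*)<\/td>\s*<\/tr>/g;
    const rows = [];
    let match;
    while ((match = rowPattern.exec(html)) !== null) {
        rows.push([match[1].trim(), match[2].trim(), match[3].trim()]);
    }
    return rows;
}

console.log(extractRows(sampleHtml));
```

Because the pattern hard-codes the exact tag sequence, a single attribute added to a `<tr>` or `<td>` in the markup would stop it from matching, whereas the selector-based version only names the elements it cares about.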
The performance is also good: it takes only about 160 ms on my PC, which is acceptable, isn't it?
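A timing like this will vary by machine and network. One way to take such a measurement in Node.js is `process.hrtime.bigint()`. In the sketch below, `timeMs` is a hypothetical helper and the timed function is a stand-in workload, not the actual scraper. It only times synchronous work, so the asynchronous network request would have to be measured separately, for example by taking timestamps before the request and inside its callback.

```javascript
// Hypothetical helper: runs fn once and returns its result plus elapsed milliseconds.
function timeMs(fn) {
    const start = process.hrtime.bigint();      // nanoseconds as a BigInt
    const result = fn();
    const elapsedNs = process.hrtime.bigint() - start;
    return { result: result, ms: Number(elapsedNs) / 1e6 };
}

// Stand-in workload; in the scraper this could wrap cheerio.load and the row loop.
const timed = timeMs(function () {
    let total = 0;
    for (let i = 0; i < 100000; i++) {
        total += i;
    }
    return total;
});

console.log("took " + timed.ms.toFixed(3) + " ms");
```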
History
Created on 2013-12-21 by Nathan Xu (owner of the online scraping site www.datafiddle.net).