Introduction
Web scraping has a poor reputation in web development, but in some cases it is very useful. jQuery is a great help for cross-domain scraping, and plenty of examples are available. Web scraping has also been shaped by Yahoo Query Language (YQL). This article gives a basic overview of web scraping using jQuery and YQL. To present the data I have used Mustache as the HTML template, so I will also give a short overview of Mustache.js here.
Background
If you are familiar with HTML you will follow this article easily. Basic knowledge of JavaScript and AJAX is enough. Don’t worry, I have added some useful links to make things clear.
YQL Overview
YQL has a SQL-like syntax and can be used to work with different APIs. YQL is popular partly because of its fast responses. A detailed overview can be found in the YQL documentation.
To get the HTML of a page, YQL offers different data sources called ‘tables’. The most commonly used ones are given below with examples…
In my simple project I’ll be using the html table.
Querying With YQL Console
Now, let’s get familiar with the YQL Console. In the following example we are going to scrape data from gsmarena. We’ll search gsmarena by phone manufacturer’s name and scrape the result. So, let’s go to gsmarena and look at the data we are going to fetch, as follows:
- Search using any phone manufacturer’s name, for example, ‘nokia’.
- We have got the search result page. Copy the page’s URL (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia).
- Now, let’s go to the YQL Console and run a query to get the result for the URL above. (You need to be logged in to Yahoo!)
- Go to YQL Console.
- In the textbox write the query “select * from html where url="http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia"”.
- To get the data in JSON format select JSON and press the ‘TEST’ Button.
- That’s all; the result is in the result box. Don’t worry if you do not understand the result: it has scraped the full page (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia) in JSON format.
- But we do not need the full page; we only want the results generated by the search. To do this, we need the XPath of the div that contains the results.
- To get the XPath, right-click the result div on the page (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia) and copy its XPath, as shown in the picture.
- Now go to the YQL Console and run the query again with the XPath. The result now contains only the search results div. To know more about XPath you can follow this w3schools tutorial.
- Lastly, to access the YQL API using AJAX we need the ‘REST QUERY’. Simply copy the ‘REST QUERY’.
If we look at the ‘REST QUERY’ closely we will find three parts:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.gsmarena.com%2Fresults.php3%3FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D'&format=json&diagnostics=true&callback=
• The URL used to access the YQL API (http://query.yahooapis.com/v1/public/yql?)
• Our full query. It is encoded because a URI must not contain whitespace or other reserved characters.
(q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.gsmarena.com%2Fresults.php3%3FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D')
• The data format and other parameters (&format=json&diagnostics=true&callback=)
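The three parts can also be pulled apart in code, which makes the encoded query readable again. This is a minimal sketch that uses only the REST query shown above; the function name splitRestQuery and the shape of its return value are illustrative and not part of YQL.

```javascript
// Split a YQL REST query into endpoint, decoded query and remaining parameters.
// splitRestQuery is a hypothetical helper written for this illustration.
function splitRestQuery(restQuery) {
    var questionMark = restQuery.indexOf('?');
    var endpoint = restQuery.substring(0, questionMark + 1);
    var pairs = restQuery.substring(questionMark + 1).split('&');

    var query = '';
    var options = {};
    for (var i = 0; i < pairs.length; i++) {
        var eq = pairs[i].indexOf('=');
        var key = pairs[i].substring(0, eq);
        var value = decodeURIComponent(pairs[i].substring(eq + 1));
        if (key === 'q') {
            query = value;          // the YQL statement, decoded back to plain text
        } else {
            options[key] = value;   // format, diagnostics, callback
        }
    }
    return { endpoint: endpoint, query: query, options: options };
}

var restQuery = "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww." +
    "gsmarena.com%2Fresults.php3%3FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2F" +
    "div%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D'&format=json&diagnostics=true&callback=";
var parts = splitRestQuery(restQuery);
// parts.query now holds the same statement that was typed into the YQL Console,
// including the xpath condition.
```

Splitting on ‘&’ is safe here because the ‘&’ inside the gsmarena URL is encoded as %26 in the REST query.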
Requirements
Using the code
Now our environment is ready to work with. If you run into any trouble, arrange the script references as follows:
<script src="scripts/jquery-1.9.1.min.js" type="text/javascript"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/jquery.cycle/2.9999.8/jquery.cycle.all.min.js" type="text/javascript"></script>
<script src="scripts/mustache.js" type="text/javascript"></script>
<script src="scripts/json2.js" type="text/javascript"></script>
<script src="scripts/jsonpath-0.8.0.js" type="text/javascript"></script>
So, our task is simple. We’ll take the manufacturer’s name as user input, scrape the results from www.gsmarena.com using the YQL API, and receive them through an AJAX request.
So, let’s start…
var item = $('#valueText').val();
To build the query, first examine where the user input appears in the query, as shown in the picture below.
So, we’ll simply append the user input to the query.
var query = "SELECT * FROM html WHERE url=" + '"' +
"http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=" + item + '"' +
" and xpath='/html/body/div/div[2]/div[2]/div/div[2]/div[2]'";
The URL below also appends a cacheBuster value; for simplicity we will not discuss it here. The REST query used as the AJAX URL is:
var url = 'http://query.yahooapis.com/v1/public/yql?q=' +
encodeURIComponent(query) + '&format=json&_nocache=' + cacheBuster;
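The two steps above, building the query and then the URL, can be sketched as small functions. Two things in this sketch are assumptions and not taken from the article: the user input is encoded before it is appended, so that spaces or quote characters cannot break the query, and cacheBuster is computed as a time-based counter, since the article does not show how its own value is produced. The function names are hypothetical.

```javascript
// Build the YQL statement for a manufacturer name typed by the user.
// Encoding the name is an addition in this sketch; the code above appends it as-is.
function buildQuery(manufacturer) {
    var name = encodeURIComponent(String(manufacturer).trim());
    return 'SELECT * FROM html WHERE url="' +
        'http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=' + name + '"' +
        " and xpath='/html/body/div/div[2]/div[2]/div/div[2]/div[2]'";
}

// Hypothetical cacheBuster: a counter that changes once per interval, so requests
// made within the same interval share one URL. The article does not show its own.
function makeCacheBuster(intervalMinutes, nowMillis) {
    var millis = (nowMillis === undefined) ? new Date().getTime() : nowMillis;
    return Math.floor(millis / (intervalMinutes * 60 * 1000));
}

// Assemble the REST URL in the same way as the snippet above.
function buildYqlUrl(query, cacheBuster) {
    return 'http://query.yahooapis.com/v1/public/yql?q=' +
        encodeURIComponent(query) + '&format=json&_nocache=' + cacheBuster;
}

var exampleUrl = buildYqlUrl(buildQuery('sony ericsson'), makeCacheBuster(20));
```

The name is encoded once for the gsmarena URL, and the whole statement is encoded again when it becomes the q parameter of the YQL URL.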
Now we have everything needed to access the YQL API using AJAX. So, run the AJAX request and get the scraped data.
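// Invoked with the JSONP response once the request completes.
window['wxCallback'] = function (data) {
    console.log(data);
    ParseData(data);
};
$.ajax({
    url: url,
    dataType: 'jsonp',
    cache: true,                 // stop jQuery from appending its own "_" timestamp parameter
    jsonpCallback: 'wxCallback'  // use a fixed callback name instead of a generated one
});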
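A search that matches nothing may still come back as a normal response, so the callback can receive data without any results in it. The following is a minimal sketch of a guard, assuming such a response carries a null or missing query.results field; the helper names hasResults and handleResponse are hypothetical and not part of jQuery or YQL.

```javascript
// Returns true only when the response object carries a results payload.
function hasResults(data) {
    return Boolean(data && data.query && data.query.results);
}

// Routes the response to one of two callbacks supplied by the caller.
function handleResponse(data, onResults, onEmpty) {
    if (hasResults(data)) {
        onResults(data);
    } else {
        onEmpty();
    }
}
```

Inside the callback, ParseData could be passed as onResults, and a function that shows a "no phones found" message as onEmpty.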
When the AJAX request succeeds, ParseData(data) is called with the data as its parameter. This function simply parses the data and shows it inside the carousel using a Mustache template.
function ParseData(data) {
    // Collect every <li> of the search result list from the YQL response.
    var result = jsonPath(data, "$.query.results[*].ul.li[*]");
    var template = $('#speakerstpl').html(); // read the template once, not per item
    var html = "";
    $('#carousel').empty();
    for (var i = 0; i < result.length; i++) {
        html += Mustache.to_html(template, result[i]);
    }
    $('#carousel').append(html);
    $('#carousel').cycle({
        fx: 'fade',
        pause: 1,
        next: '#next_btn',
        prev: '#prev_btn',
        speed: 500,
        timeout: 10000
    });
}
Here, jsonPath finds the specific ‘li’ elements that contain the data, using the provided query.
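To make the path expression less opaque, here is a rough plain-JavaScript equivalent of "$.query.results[*].ul.li[*]". It is a sketch and not a reimplementation of jsonPath. It assumes the response has the shape query → results → some wrapper element → ul → li, and that a single ‘li’ may arrive as one object instead of an array. The function name extractItems and the sample data are made up for illustration.

```javascript
// Walk the response by hand: every child of results, then its ul, then each li.
function extractItems(data) {
    var items = [];
    var results = (data && data.query && data.query.results) || {};
    for (var key in results) {                       // results[*]
        if (!Object.prototype.hasOwnProperty.call(results, key)) { continue; }
        var child = results[key];
        var li = child && child.ul && child.ul.li;   // .ul.li
        if (!li) { continue; }
        // li[*]: each entry of the list; a lone item is wrapped so it is kept too
        items = items.concat(Array.isArray(li) ? li : [li]);
    }
    return items;
}

// Made-up sample in the assumed shape, not real gsmarena data.
var sample = {
    query: { results: { div: { ul: { li: [
        { a: { title: 'Phone A' } },
        { a: { title: 'Phone B' } }
    ] } } } }
};
var found = extractItems(sample);   // two items, in their original order
```

Each element of the returned array plays the role of result[i] in ParseData, that is, the object handed to the Mustache template.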
Now run the app and see the scraped result in the carousel.
Points of Interest
No doubt, web scraping is an interesting job to do. It is much more interesting with YQL, I think.
History