Screen scraping using YQL and AJAX

30 Sep 2013, CPOL license
A simple application that scrapes HTML data and returns it in JSON format.

Introduction

Web scraping often has a bad reputation in web development, but in some cases it is very useful. jQuery is a great help for cross-domain scraping, and plenty of examples are available. Yahoo Query Language (YQL) has also had a strong influence on web scraping. This article gives a basic overview of web scraping using jQuery and YQL. To present the data I have used Mustache as an HTML template, so I will also give a short overview of Mustache.js here.

Background

If you are familiar with HTML, you will follow this article easily. Basic knowledge of JavaScript and AJAX is enough. Don’t worry, I have added some useful links to make things clear.

YQL Overview

YQL is a query language with a SQL-like syntax that can be used to work with different APIs. YQL is popular because of its fast response. A detailed overview can be found here.

To get HTML off a page, YQL offers different data sources called ‘tables’. The most commonly used ones are given below with examples…

In my simple project I’ll be using the html table.

Querying With YQL Console

Now, let’s become familiar with the YQL Console. In the following example we are going to scrape data from gsmarena. We’ll search gsmarena by phone manufacturer’s name and scrape the result. So, let’s go to gsmarena and look at the data we are going to fetch:

  1. Search using any phone manufacturer’s name, for example, ‘nokia’.
  2. We now have the search result page. Copy the page’s URL (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia).
  3. Now, let’s go to the YQL Console and run a query to get the result for that URL. (You need to be logged in to Yahoo!)
    1. Go to the YQL Console.
    2. In the text box, write the query “select * from html where url="http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia"”.
    3. To get the data in JSON format, select JSON and press the ‘TEST’ button.
  4. That’s all, the result is in the result box. Don’t worry if you do not understand the result; it is the full page (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia) scraped into JSON format.
  5. But we do not need the full page; we will take only the results generated by the search. To do this, we need the XPath of the div that holds the results.
  6. To get the XPath, right-click the result element on the page (http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia), inspect it with the browser’s developer tools, and copy its XPath.
  7. Now go to the YQL Console and run the query again with the XPath added: “select * from html where url="http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=nokia" and xpath='/html/body/div/div[2]/div[2]/div/div[2]/div[2]'”. Now the result contains only the search result’s div.
  8. To know more about XPath you can follow this w3schools tutorial.
  9. Lastly, to access the YQL API using AJAX we need the ‘REST QUERY’. Simply copy the ‘REST QUERY’ from the console.

If we look at the ‘REST QUERY’ closely, we find three parts:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.
  gsmarena.com%2Fresults.php3%3FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2F
  div%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D'&format=json&diagnostics=true&callback=

•   The URL to access the YQL API (http://query.yahooapis.com/v1/public/yql?)
•   Our full query. It is URL-encoded here because a URI must not contain whitespace or other reserved characters.
   (q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.gsmarena.com%2Fresults.php3%3
   FsQuickSearch%3Dyes%26sName%3Dnokia%22%20and%20xpath%3D'%2Fhtml%2Fbody%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D')
•   The data format and other parameters (&format=json&diagnostics=true&callback=)
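
To make the three parts easier to see, here is a minimal sketch that splits the ‘REST QUERY’ at the ‘?’ and decodes each parameter. The function name splitRestQuery is only an illustration and is not part of YQL or jQuery.

```javascript
// Split a REST query into its base URL and its decoded parameters.
// splitRestQuery is a hypothetical helper written for this illustration.
function splitRestQuery(restQuery) {
    var questionMark = restQuery.indexOf('?');
    var base = restQuery.slice(0, questionMark + 1);
    var pairs = restQuery.slice(questionMark + 1).split('&');
    var params = {};
    for (var i = 0; i < pairs.length; i++) {
        var eq = pairs[i].indexOf('=');
        // A pair without '=' is treated as a name with an empty value.
        var name = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
        var value = eq === -1 ? '' : pairs[i].slice(eq + 1);
        params[name] = decodeURIComponent(value);
    }
    return { base: base, params: params };
}

// The REST query from the article, joined back into a single string.
var restQuery = "http://query.yahooapis.com/v1/public/yql?q=" +
    "select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww." +
    "gsmarena.com%2Fresults.php3%3FsQuickSearch%3Dyes%26sName%3Dnokia%22" +
    "%20and%20xpath%3D'%2Fhtml%2Fbody%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D" +
    "%2Fdiv%2Fdiv%5B2%5D%2Fdiv%5B2%5D'" +
    "&format=json&diagnostics=true&callback=";

var parts = splitRestQuery(restQuery);
console.log(parts.base);          // the URL of the YQL API
console.log(parts.params.q);      // the readable YQL query
console.log(parts.params.format); // the requested data format
```

Decoding the q parameter gives back the same query that was typed into the YQL Console, which is a quick way to check that a copied ‘REST QUERY’ is the one you expect.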

Requirements

  • jQuery: In this project jQuery is used for two purposes…

    1. To run the AJAX request that gets the data from YQL.
    2. To display the data using the jQuery Cycle plugin. To get this feature, just add this script:
      XML
      <script src="//cdnjs.cloudflare.com/ajax/libs/jquery.cycle/2.9999.8/jquery.cycle.all.min.js" type="text/javascript"></script>

    For both purposes I am using jQuery v1.9.1. You can download the latest version from jQuery.

  • json2.js: It is important to know that the AJAX request to YQL returns data in JSON format. json2.js is very helpful for handling this JSON data. Download json2.js from here and include it in your project. To know more about JSON you can go here.
  • jsonpath-0.8.0.js: When we want a specific part of the received JSON data, we need to query among the divs, tables, etc. that it contains. jsonpath serves this purpose. Get the latest version of jsonpath from here and include it in your project.
  • mustache.js: The last requirement for this project is mustache.js. Mustache is a “logic-less” template syntax. It is very helpful for decoupling HTML markup from data. Mustache is implemented in different languages: Ruby, JavaScript, Python, PHP, Perl, Objective-C, Java, .NET, Android, C++, Go, Lua, Scala, etc. Mustache.js is the JavaScript implementation. Get mustache.js from here; this one is a helpful tutorial for Mustache.
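
To give a feel for what a “logic-less” template does, here is a rough sketch of variable substitution. It is a simplified stand-in written for this illustration, not mustache.js itself, and it only handles plain {{name}} tags. The function names renderSimple and escapeHtml, the template and the field name are all hypothetical.

```javascript
// Escape characters that have a special meaning in HTML.
function escapeHtml(text) {
    return text
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Replace every {{key}} tag with the escaped value of view[key].
// Unknown keys render as an empty string.
function renderSimple(template, view) {
    return template.replace(/\{\{\s*(\w+)\s*\}\}/g, function (match, key) {
        var value = view[key];
        if (value === undefined || value === null) {
            return '';
        }
        return escapeHtml(String(value));
    });
}

// Hypothetical template and data; the real project keeps its own template.
var template = '<div class="phone"><strong>{{name}}</strong></div>';
var html = renderSimple(template, { name: 'Nokia <Lumia>' });
console.log(html);
```

The template contains no loops or conditions of its own; it only marks where the data goes, which is what keeps the markup separate from the data.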

Using the code  

Now our environment is ready to work with. If you face any trouble, arrange the script references in the following order:

XML
<script src="scripts/jquery-1.9.1.min.js" type="text/javascript"></script>
<!--jquery cycle library to slide the results-->
<script src="//cdnjs.cloudflare.com/ajax/libs/jquery.cycle/2.9999.8/jquery.cycle.all.min.js" type="text/javascript"></script>
<!--mustache.js template for javascript -->
<script src="scripts/mustache.js" type="text/javascript"></script>
<!--json2.js is to work with json data -->
<script src="scripts/json2.js" type="text/javascript"></script>
<!--jsonpath is helpful for querying json data -->
<script src="scripts/jsonpath-0.8.0.js" type="text/javascript"></script>

So, our task is simple. We’ll take the manufacturer’s name as user input, scrape the result from www.gsmarena.com using the YQL API, and receive the result through an AJAX request. So, let’s start…

  • To take the input, we need a text box. A button click event fires a JavaScript function named GetResult(). A div is used to hold the entire result. So, the HTML markup is as follows…

    HTML
    <body>
        <input id="valueText" type="text" />
        <button type="button" onclick="GetResult()">Get Result</button>
        <div id="speakerbox" style="float:left">
            <a href="#" id="prev_btn">&laquo;</a>
            <a href="#" id="next_btn">&raquo;</a>
            <div id="carousel"></div>
        </div>
    </body>
  • JavaScript’s GetResult() function fetches the scraped data using AJAX. First, it reads the user input from the text box.

    JavaScript
    var item = $('#valueText').val();
  • To build the query, first look at where the search term appears in the page URL: it is the value of the sName parameter (sName=nokia in the example above).
  • So, we’ll simply append the user input to the query at that position.

    JavaScript
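    // Encode the user input so that spaces, quotes or '&' cannot break the URL or the query.
    var query = "SELECT * FROM html WHERE url=" + '"' + 
      "http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=" + encodeURIComponent(item) + '"' + 
      " and xpath='/html/body/div/div[2]/div[2]/div/div[2]/div[2]'";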
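  • The URL carries a cacheBuster value. For simplicity we will not go into how it is computed; any value that changes over time will do, so a timestamp is assumed here. The REST query for the AJAX URL is:

    JavaScript
    // Assumption: a plain timestamp stands in for the project's cacheBuster.
    var cacheBuster = new Date().getTime();
    var url = 'http://query.yahooapis.com/v1/public/yql?q=' + 
      encodeURIComponent(query) + '&format=json&_nocache=' + cacheBuster;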
  • Now we have everything we need to access the YQL API using AJAX. So, run the AJAX request and get the scraped data.

    JavaScript
    // JSONP callback: the response from YQL arrives as a call to this function.
    window['wxCallback'] = function (data) {
        console.log(data);
        ParseData(data); // To show the result
    };
    $.ajax({
        url: url,
        dataType: 'jsonp',
        cache: true, // do not let jQuery append its own cache-busting parameter
        jsonpCallback: 'wxCallback'
    });
  • When the AJAX request completes successfully, the ParseData(data) function is called with the data as its parameter. This function simply parses the data and shows it inside the carousel using a Mustache template. The template is read from an element with the id speakerstpl, which is not part of the markup shown above.

    JavaScript
    function ParseData(data) {
        // Pick every result 'li' out of the YQL response.
        var result = jsonPath(data, "$.query.results[*].ul.li[*]");
        // The template does not change, so read it once.
        var template = $('#speakerstpl').html();
        var html = "";
        $('#carousel').empty();
        for (var i = 0; i < result.length; i++) {
            html += Mustache.to_html(template, result[i]);
        }
        $('#carousel').append(html);
        $('#carousel').cycle({
            fx: 'fade',
            pause: 1,
            next: '#next_btn',
            prev: '#prev_btn',
            speed: 500,
            timeout: 10000
        });
    }

    Here, jsonPath uses the given expression to find the specific ‘li’ elements that contain the data.

  • Now, run the app and see the scraped result in the carousel.
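
The request-building and result-picking steps above can also be sketched as small functions that do not depend on jQuery or jsonpath, which makes them easy to try on their own. This is only a sketch under some assumptions: the names buildQuery, buildYqlUrl, makeCacheBuster and extractItems are hypothetical, the cache buster shown here (a counter that changes once per interval) is just one possible choice, and extractItems only roughly mirrors the "$.query.results[*].ul.li[*]" expression by assuming that li is an array (a single li object is wrapped into a one-element array).

```javascript
var SEARCH_URL = 'http://www.gsmarena.com/results.php3?sQuickSearch=yes&sName=';
var RESULT_XPATH = '/html/body/div/div[2]/div[2]/div/div[2]/div[2]';
var YQL_ENDPOINT = 'http://query.yahooapis.com/v1/public/yql?q=';

// Build the YQL statement for a search term. The term is encoded so that
// spaces, double quotes or '&' cannot break the URL or the quoted string.
function buildQuery(item) {
    var pageUrl = SEARCH_URL + encodeURIComponent(item);
    return 'SELECT * FROM html WHERE url="' + pageUrl + '"' +
        " and xpath='" + RESULT_XPATH + "'";
}

// Hypothetical cache buster: the value changes once per interval.
function makeCacheBuster(nowMs, intervalMs) {
    return Math.floor(nowMs / intervalMs);
}

// Build the full REST URL for the AJAX request.
function buildYqlUrl(item, cacheBuster) {
    return YQL_ENDPOINT + encodeURIComponent(buildQuery(item)) +
        '&format=json&_nocache=' + cacheBuster;
}

// Collect the 'li' entries found under query.results.<any key>.ul.li.
// Returns an empty array when the response holds no results.
function extractItems(data) {
    var results = data && data.query && data.query.results;
    var items = [];
    if (!results) {
        return items;
    }
    for (var key in results) {
        if (!Object.prototype.hasOwnProperty.call(results, key)) {
            continue;
        }
        var li = results[key] && results[key].ul && results[key].ul.li;
        if (!li) {
            continue;
        }
        // Assumption: a single 'li' may arrive as an object instead of an array.
        items = items.concat(Object.prototype.toString.call(li) === '[object Array]' ? li : [li]);
    }
    return items;
}

// Usage with made-up response data (the real 'li' fields are not shown here).
var tenMinutes = 10 * 60 * 1000;
var url = buildYqlUrl('nokia', makeCacheBuster(new Date().getTime(), tenMinutes));
var sample = { query: { results: { div: { ul: { li: [{ id: 1 }, { id: 2 }] } } } } };
console.log(url);
console.log(extractItems(sample).length);
```

Returning an empty array for a response without results keeps the rendering loop simple, because it can run over the items without a separate check.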

Points of Interest

No doubt, web scraping is an interesting job to do. It is much more interesting with YQL, I think.

History

  • 28th September, 2013.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)