Web Scraping using Node.js

29 Dec 2015 | CPOL | 4 min read
How to create a web scraper using Node.js

Web Scraping

Web scraping is the software technique of extracting information from websites. In this post, we see how it works by building a simple web scraper with the DOM parsing technique. The tool I am using is Node.js.

Before we proceed, I want you to be aware of the following concepts.

Serialization and Deserialization

Serialization is the process of converting an object into a stream of bytes in order to store the object or transmit it to memory, a database, or a file. Its main purpose is to save the state of an object in order to be able to recreate it when needed. The reverse process is called Deserialization.

So the data on the web reaches us in serialized form, and we use deserialization to recover that data.

JSON

JavaScript Object Notation, or JSON, is a syntax for storing and exchanging data, and an easier-to-use alternative to XML. JSON is a language-independent, lightweight data interchange format.

We are going to use JSON in our process. Our data will be in JSON format.
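As a minimal sketch of the two ideas above, the snippet below serializes a JavaScript object to JSON text and deserializes it again. The record and its field names (source, headlines) are made up for illustration and do not come from the scraper built later.

```javascript
// A hypothetical scraped record; the field names are illustrative only.
const scraped = {
  source: 'bbc.com',
  headlines: ['First headline', 'Second headline']
};

// Serialization: object -> JSON text that can be stored or sent over the network.
const json = JSON.stringify(scraped);

// Deserialization: JSON text -> a new object with the same state.
const restored = JSON.parse(json);

console.log(json);                  // {"source":"bbc.com","headlines":["First headline","Second headline"]}
console.log(restored.headlines[1]); // Second headline
```

Note that restored is a separate object from scraped: it carries the same data, but it was rebuilt from the text.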

Node.js

An open source, cross-platform runtime environment for developing server-side web applications. Node.js will be our tool during the scraping process.

Request and Cheerio

Request and Cheerio are our npm packages. Cheerio doesn’t try to emulate a full implementation of the DOM. It specifically focuses on the scenario where you want to manipulate an HTML document using jQuery-like syntax. As such, it compares to jsdom favorably in some cases, but not in every situation.

Cheerio itself doesn’t include a mechanism for making HTTP requests, and that’s something that can be tedious to handle manually. It’s a bit easier to use a module called request to facilitate requesting remote HTML documents. Request handles common tasks like caching cookies between multiple requests, setting the content length on POSTs, and generally makes life easier.

If you don’t understand any of the above concepts, simply ignore them and let's create a scraper from here. :)

Set Up IDE

I am using:

  • Windows 10 x64
  • Visual Studio 2015 (Community)
  • Node.js (visit the Node.js website and download the installer that matches your system)

After Node.js is installed, open Visual Studio 2015 and create a new project.

[Image 1]

Select Template

Now it's time to select your template.

  • Select Node.js
  • Select Basic Azure Node.js Express 4.
  • Name it, for instance, MyScrapper

[Image 2]

Install NPM Package

Now install your NPM packages, as shown in the images.

[Image 3]

[Image 4]

Once the package window has loaded, type request and cheerio, and then click Install.

[Image 5]

[Image 6]

Uninstall Jade

When you are done, uninstall Jade.

[Image 7]

Changes in app.js

  • Go to app.js.
  • Comment out the view setup lines as shown in the images, as we are not displaying any views.

Before

[Image 8]

After

[Image 9]

When you are done, make some further changes as shown in the images.

Before

[Image 10]

After

[Image 11]

Request and Cheerio

  • Go to the routes node.
  • Select users.js.
  • Add request and cheerio as shown in the image.

[Image 12]

Website URL

Select the website you want to scrape and save its URL in a variable, as shown in the image. For instance, I chose bbc.com.

[Image 13]

Edit Function

Edit your router.get function, which appears in the image above, by writing the code shown in the image below.

[Image 14]
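The route's first job is to download the page's HTML. The article does this with the request package. The sketch below is not that code: it swaps in Node's built-in http and https modules so that it runs without installing anything. The helper name fetchHtml and its callback shape are assumptions made for illustration.

```javascript
// Sketch only: the article uses the request package; this version uses
// Node's built-in http/https modules instead, so it has no dependencies.
const http = require('http');
const https = require('https');

// Hypothetical helper: download the page at `url` and hand its HTML to `callback`.
// It does not follow redirects; anything other than a 200 response is an error.
function fetchHtml(url, callback) {
  const client = url.startsWith('https:') ? https : http;
  client.get(url, (response) => {
    if (response.statusCode !== 200) {
      response.resume(); // discard the body so the socket is freed
      callback(new Error('Unexpected status code: ' + response.statusCode));
      return;
    }
    let body = '';
    response.setEncoding('utf8');
    response.on('data', (chunk) => { body += chunk; });
    response.on('end', () => callback(null, body));
  }).on('error', callback);
}
```

Inside router.get, a call such as fetchHtml(url, function (err, html) { ... }) would then pass html to the scraping function and send the result back to the browser, for example with res.json.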

DOM Parsing

By embedding a full web browser, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages.

The DOM (Document Object Model) is a language-independent, cross-platform convention for representing and interacting with objects in HTML, XHTML, and XML documents.
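To make the idea of a DOM tree concrete, here is a small sketch that walks a hand-built tree and collects the text of every node with a given tag. The tree shape (tag, text, children) and the function name collectText are simplifications invented for this example. A real parser such as cheerio builds a richer structure from actual HTML.

```javascript
// A hand-built stand-in for a parsed page: each node has a tag,
// optional text, and a list of child nodes.
const dom = {
  tag: 'html',
  children: [
    { tag: 'head', children: [{ tag: 'title', text: 'Example', children: [] }] },
    { tag: 'body', children: [
      { tag: 'div', children: [
        { tag: 'span', text: 'Headline one', children: [] },
        { tag: 'span', text: 'Headline two', children: [] }
      ] }
    ] }
  ]
};

// Walk the tree depth-first and collect the text of every node with the given tag.
function collectText(node, tag, found = []) {
  if (node.tag === tag && node.text !== undefined) {
    found.push(node.text);
  }
  for (const child of node.children) {
    collectText(child, tag, found);
  }
  return found;
}

console.log(collectText(dom, 'span')); // [ 'Headline one', 'Headline two' ]
```

Scraping by DOM parsing is essentially this: locate the nodes that hold the data, then read their text or attributes.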

  • Open the website you want to scrape in a browser.
  • For instance, I opened bbc.com in Google Chrome.
  • Right-click the element you are interested in and click Inspect.
  • The image below shows what this looks like.

[Image 15]

Code Function

Now code the function. As the image above shows, we are traversing the DOM: I selected the data in the red circled region, the Inspect window showed the corresponding DOM node, and from that node you can write the code.

scrapeDataFromHtml is our function. Inside it, we create a variable for every item we want to scrape from the website. The website's data reaches us in serialized form, and once it is deserialized we hold it in JSON format. In this case, the red circled region gives me its corresponding node in the Inspect window.

  • First, we request the URL.
  • Then, we traverse the DOM.
  • Select the nodes that hold the data we want to scrape.
  • Create your function, for instance scrapeDataFromHtml.
  • In this function, store all the data you want to scrape from the website in variables.
  • Write your logic. For multiple values, you can use an array.
  • In this example, span and image are the two things we want to scrape.

[Image 16]

Run Application

Now run your application, and it works. :)

[Image 17]

Conclusion

The simple example above helps you understand what scraping is and how it works. Happy coding. :)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)