Web Scraping (Problems & Solutions)

dpalash

5 Nov 2013

This article introduces the basic idea of web scraping and walks through a few problems I ran into while working with it, along with their solutions.

Download source - 3.5 MB

Introduction

Web scraping is often considered one of the most efficient, programmatic ways to grab data from different web sources. It works on web pages: it is a simple technique for collecting the information you need from other sites' pages into your own database.

Things to consider:

HTML structure
Proper tagging

1. HTML Structure:

Our first consideration for web scraping is the HTML structure. Scraping needs the content's HTML to be well structured. Without properly structured HTML, scraping becomes messy, time-consuming and error-prone. If the content is well structured, scraping is an excellent way to collect data.

2. Proper Tagging:

The content's HTML tags need to be properly marked up with an id or a class. If the content is only bare inline HTML with no identifiers, scraping will be a mess, because we need some identifier to locate the data we want to fetch. The proper way is for the elements to carry an id or a class name that we can use. If the content's HTML provides this, scraping is a good option.

Uses of Web Scraping

Online price comparison
Contact scraping
Weather data monitoring
Website change detection
Research
Web mashups
Web data integration
Telephone number collection
Address collection
Country/city/state name collection

In this article, I will discuss a few useful web scraping techniques using HtmlAgilityPack. A particularly welcome feature of HTML Agility Pack is that it now supports LINQ, which means you can write an ordinary LINQ query to get your result. For more information about HTML Agility Pack, see its documentation on CodePlex.

Okay, so let’s begin now.

Problem Statement-1

Suppose we have the following HTML code. From the underlined HTML, we want to extract only the links (the href values) of the anchor tags.

Solution-1

Step 1: Process the raw content (that is, the HTML). Load the complete HTML source and convert it to a string. Through an HTTP web request and response, we get the entire HTML from the given link. A StreamReader then reads the content to the end, which gives us the HTML source as a string. Following is the code for the above procedure.
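As a rough, language-neutral illustration of Step 1, the sketch below does the same thing in Python with the standard library instead of the .NET classes described above. The function name `get_source_code` mirrors the article's `getSourceCode()`; the `timeout` parameter, the User-Agent value and the UTF-8 fallback are assumptions made for the example.

```python
import urllib.request


def get_source_code(url, timeout=10.0):
    """Fetch a page and return its complete HTML as one string."""
    # Some sites reject requests without a User-Agent; this value is arbitrary.
    request = urllib.request.Request(
        url, headers={"User-Agent": "scraper-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Read the body to the end, much like StreamReader.ReadToEnd().
        raw = response.read()
        # Prefer the charset the server declares; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(charset, errors="replace")
```

Calling `get_source_code("https://example.com/")` (a placeholder address) would return that page's markup as a string, ready to be handed to a parser in Step 2.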

Step 2: Return the string and convert it into an HTML document (HtmlDocument).

In the above code, we have the getSourceCode() method in the WorkerClass class. This method loads the complete HTML from the given link and returns it as a string. The returned string is then converted to an HtmlDocument and returned. The underlined images show that we have the HTML document ready. Our content is now ready for a LINQ query to get our desired result.

Here primaryDivId is a Boolean variable that is true if a div with the id divAchors is found. anchorsHref holds the collection of the anchors’ links, and anchorsInnerText holds the collection of the anchors’ inner text.
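The idea behind Solution-1 can be sketched in Python with the standard library's `html.parser`, as a stand-in for the HtmlDocument and LINQ query described above (it is not the HtmlAgilityPack API). The sample markup, the class `AnchorCollector` and the function `collect_anchors` are hypothetical; `found_div`, `hrefs` and `texts` play the roles of primaryDivId, anchorsHref and anchorsInnerText.

```python
from html.parser import HTMLParser

# Hypothetical markup; only the div id "divAchors" comes from the article.
SAMPLE_HTML = """
<div id="other"><a href="/skip">Skip</a></div>
<div id="divAchors">
  <div><a href="/one">One</a></div>
  <a href="/two">Two <b>bold</b></a>
</div>
<a href="/outside">Outside</a>
"""


class AnchorCollector(HTMLParser):
    """Collect href and inner text of every <a> inside <div id=target_id>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.div_depth = 0      # > 0 while we are inside the target div
        self.found_div = False  # role of primaryDivId
        self.in_anchor = False
        self.hrefs = []         # role of anchorsHref
        self.texts = []         # role of anchorsInnerText

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1  # nested div inside the target
            elif attrs.get("id") == self.target_id:
                self.div_depth = 1
                self.found_div = True
        elif tag == "a" and self.div_depth:
            self.in_anchor = True
            self.hrefs.append(attrs.get("href") or "")
            self.texts.append("")

    def handle_endtag(self, tag):
        # Depth counting assumes every <div> is properly closed.
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.texts[-1] += data  # includes text of nested tags such as <b>


def collect_anchors(html, div_id):
    parser = AnchorCollector(div_id)
    parser.feed(html)
    parser.close()
    return parser.found_div, parser.hrefs, parser.texts


found, anchors_href, anchors_inner_text = collect_anchors(SAMPLE_HTML, "divAchors")
```

Anchors outside the target div are ignored, while anchors nested in inner divs are still collected, which matches the behaviour the article describes for its query.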

Problem Statement-2

Suppose we need to download images. The HTML format may be like the following:

Solution-2

To download all the images and also to get their alternative information text, we need to do the following:

The //img expression passed to SelectNodes looks for img tags inside the div with the id divImage. If it finds any image tag within the scope of this div, it fetches its source (src) and alternative text (alt). It does not matter where the image tag resides: even if it is nested several div levels deep, this query will fetch it. Keep in mind that in XPath a leading // searches from the document root, so to stay within the div the expression has to be anchored to it, for example //div[@id='divImage']//img, or written as .//img relative to the div node.

From the above code, we will be able to get the collection of the image source links in the imageSrc list and their alt text in the imageInnerText list. Using a foreach loop, we can download and save the images in our desired folder.
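A minimal Python sketch of the same collection step, again using the standard library's `html.parser` rather than HtmlAgilityPack's SelectNodes. The sample markup, the base URL `https://example.com/gallery/`, and the names `ImageCollector` and `collect_images` are assumptions made for illustration; only the div id divImage comes from the article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup; the first image sits outside the target div.
SAMPLE_HTML = """
<img src="logo.png" alt="Logo">
<div id="divImage">
  <img src="a.png" alt="First">
  <div><div><img src="/img/b.jpg" alt="Second" /></div></div>
</div>
"""


class ImageCollector(HTMLParser):
    """Collect src and alt of every <img> nested anywhere inside one div."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.div_depth = 0    # > 0 while we are inside the target div
        self.image_src = []   # role of the article's imageSrc list
        self.image_alt = []   # role of the article's imageInnerText list

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1
            elif attrs.get("id") == self.div_id:
                self.div_depth = 1
        elif tag == "img" and self.div_depth:
            self.image_src.append(attrs.get("src") or "")
            self.image_alt.append(attrs.get("alt") or "")

    def handle_endtag(self, tag):
        # Depth counting assumes every <div> is properly closed.
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def collect_images(html, div_id, base_url):
    """Return (absolute_url, alt_text) pairs for images inside the div."""
    parser = ImageCollector(div_id)
    parser.feed(html)
    parser.close()
    # Relative src values are resolved against the page address.
    absolute = [urljoin(base_url, src) for src in parser.image_src]
    return list(zip(absolute, parser.image_alt))


images = collect_images(SAMPLE_HTML, "divImage", "https://example.com/gallery/")
```

Each resulting URL could then be saved in a loop, for instance with `urllib.request.urlretrieve(url, path)`, which corresponds to the foreach download step mentioned above.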

Problem Statement-3

Suppose we need to find the inner text of a div with its class name. The HTML for this problem may look like the following:

Solution-3

Here is the solution for this problem statement:

The innerText string gives you the full, uncut text as a single string, whereas innerTextList gives you the inner texts as a list.
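To make the difference between the two results concrete, here is a small Python sketch (standard library only, not the HtmlAgilityPack query itself). The class name `demoText`, the sample markup, and the names `ClassTextCollector` and `inner_text_by_class` are hypothetical; the two return values correspond to innerText and innerTextList.

```python
from html.parser import HTMLParser

# Hypothetical markup with two matching divs and one that should be skipped.
SAMPLE_HTML = (
    '<div class="demoText">Hello <b>world</b></div>'
    '<div class="other">skip</div>'
    '<div class="demoText highlight">Second   block</div>'
)


class ClassTextCollector(HTMLParser):
    """Collect the inner text of each <div> that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0             # > 0 while inside a matching div
        self.inner_text_list = []  # one entry per matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1        # nested div inside a match
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.inner_text_list.append("")

    def handle_endtag(self, tag):
        # Depth counting assumes every <div> is properly closed.
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.inner_text_list[-1] += data


def inner_text_by_class(html, class_name):
    """Return (inner_text, inner_text_list) for divs with class_name."""
    parser = ClassTextCollector(class_name)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each entry reads as plain text.
    texts = [" ".join(text.split()) for text in parser.inner_text_list]
    return " ".join(texts), texts


inner_text, inner_text_list = inner_text_by_class(SAMPLE_HTML, "demoText")
```

Splitting the class attribute on whitespace means a div such as `class="demoText highlight"` still matches, which a plain string comparison on the attribute value would miss.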

Problem Statement-4

Suppose we have a problem similar to the one above, with a slight change: the class name toggles between two classes, and I cannot be sure which class name will be present when the page renders. The HTML for this problem statement may look like the following:

Here the class toggles between demoText1 and demoText2.

Solution-4

Here is the solution for the above problem statement:

The solution is similar to Solution-3, with an extra OR (|) condition in the query. You can also use an AND (&) condition if you need to.

These are the problems I have faced recently in my work, and this is how I solved them. I think these solutions will help you with your own problems because they cover many common web scraping cases. If you encounter more problems, please let me know and I will try to solve them. Thanks for reading. Happy coding. :)
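The OR and AND variants can be sketched in Python as set operations on the class attribute. This example uses `xml.etree.ElementTree`, so it assumes the markup is well-formed XML (XHTML-style), which real pages often are not; the function `text_of_divs`, its `require_all` flag and the sample markup are hypothetical, and only the class names demoText1 and demoText2 come from the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
SAMPLE = """<body>
  <div class="demoText1">Shown first</div>
  <div class="other">Ignored</div>
  <div class="demoText2">Shown <b>second</b></div>
  <div class="demoText1 demoText2">Both</div>
</body>"""


def text_of_divs(markup, class_names, require_all=False):
    """Return the text of each <div> whose class attribute matches.

    require_all=False -> OR: the div carries at least one of class_names.
    require_all=True  -> AND: the div carries every one of class_names.
    """
    wanted = set(class_names)
    # Raises xml.etree.ElementTree.ParseError if markup is not well-formed.
    root = ET.fromstring(markup)
    texts = []
    for div in root.iter("div"):
        present = set(div.get("class", "").split())
        if require_all:
            matched = wanted <= present        # every wanted class is present
        else:
            matched = bool(wanted & present)   # at least one is present
        if matched:
            # itertext() also yields the text of nested tags such as <b>.
            texts.append("".join(div.itertext()).strip())
    return texts


either = text_of_divs(SAMPLE, ["demoText1", "demoText2"])
both = text_of_divs(SAMPLE, ["demoText1", "demoText2"], require_all=True)
```

With the OR form, the div is picked up whichever of the two class names happens to be rendered, which is the situation this problem statement describes.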

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
