Introduction
Web scraping is considered one of the most efficient, programmatic ways to grab data from different web sources. Scraping is done on web pages: it is a simple technique for collecting the information you need from other sites into your own database.
Things to consider:
- HTML structure
- Proper tagging
1. HTML Structure:
Our first consideration for web scraping is the HTML structure. To scrape content, its HTML needs to be well structured. Without properly structured HTML, scraping becomes messy, time-consuming, and error-prone. If the content is well structured, scraping is an excellent way to collect data.
2. Proper Tagging:
The content's HTML tags need to be properly formed, and they need an id or a class. If the content has only inline HTML with no identifiers, scraping becomes messy, because fetching data requires something to identify an element by. The proper way is for elements to carry an id or a class name that we can query. If the content's HTML provides this, scraping is a good option.
Uses of Web Scraping
- Online price comparison
- Contact scraping
- Weather data monitoring
- Website change detection
- Research
- Web mashups
- Web data integration
- Telephone number collection
- Address collection
- Country, city, and state name collection
In this article, I will discuss a few useful web scraping techniques using HtmlAgilityPack. The most surprising feature of HTML Agility Pack is that it now supports LINQ. This means you can write the usual LINQ queries to get your results. If you need more information about HTML Agility Pack, you can visit its documentation at CodePlex.
Okay, so let’s begin now.
Problem Statement-1
Suppose we have the following HTML code. From the underlined HTML, we want to extract only the links of the anchor tags.
Solution-1
Step 1: Process the raw content (that is, the HTML). Load the entire HTML source code and convert it to a string. Through the HTTP web request and response, we get the entire HTML code from the given link. Then, using the StreamReader, the content is read to the end, which gives us the HTML source code as a string. Following is the code for the above procedure.
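The original listing is not reproduced here, so what follows is only a minimal sketch of the step as described. It assumes the page is UTF-8 encoded; the getSourceCode(string url) signature and the ReadAll helper are assumptions made for illustration, not the author's exact code.

```csharp
using System.IO;
using System.Net;
using System.Text;

public class WorkerClass
{
    // Hypothetical helper: reads a stream to the end and returns it as one string.
    // Assumes the content is UTF-8 encoded.
    public static string ReadAll(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Assumed signature: fetches the page at `url` and returns its HTML source.
    public string getSourceCode(string url)
    {
        // WebRequest.Create is flagged as obsolete on recent .NET versions
        // (HttpClient is the suggested replacement), but it matches the
        // request/response/StreamReader flow described in the text.
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            return ReadAll(stream);
        }
    }
}
```

A call such as new WorkerClass().getSourceCode("https://example.com/") would then return the page source as a string, provided the request succeeds.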
Step 2: Return the converted string and convert it again, this time to the HTML document type.
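As a rough sketch of this step, HtmlAgilityPack can parse a string with HtmlDocument.LoadHtml. The HtmlLoader class and ToDocument method names below are hypothetical and only serve the illustration.

```csharp
using HtmlAgilityPack;

public static class HtmlLoader
{
    // Hypothetical helper: parses an HTML string into an HtmlAgilityPack document.
    public static HtmlDocument ToDocument(string sourceCode)
    {
        var document = new HtmlDocument();
        document.LoadHtml(sourceCode); // parses in memory; no network access involved
        return document;
    }
}
```

For example, HtmlLoader.ToDocument("<div id='x'><a href='/a'>A</a></div>") yields a document whose DocumentNode can then be queried with XPath or LINQ.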
In the above code, we have the getSourceCode() method in the WorkerClass class. This method loads the entire HTML provided and returns it as a string. The returned string is then converted to an HtmlDocument and returned. The underlined images show that the HTML document is ready. Now our content is ready for a LINQ query that gets our desired result.
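The LINQ query itself might look like the sketch below. The sample markup is hypothetical (the article's HTML is not shown here), the AnchorScraper class and Extract method are names invented for the example, and the id divAchors is spelled as it appears in the text.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class AnchorScraper
{
    // Hypothetical sample markup standing in for the article's HTML.
    public const string SampleHtml = @"
<div id='divAchors'>
  <a href='https://example.com/one'>One</a>
  <a href='https://example.com/two'>Two</a>
</div>
<a href='https://example.com/outside'>Outside</a>";

    public static (bool primaryDivId, List<string> anchorsHref, List<string> anchorsInnerText)
        Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // All divs whose id attribute equals "divAchors".
        var divs = document.DocumentNode.Descendants("div")
            .Where(d => d.GetAttributeValue("id", "") == "divAchors")
            .ToList();
        bool primaryDivId = divs.Any();

        // Anchors nested at any depth inside those divs that carry an href.
        var anchors = divs.SelectMany(d => d.Descendants("a"))
            .Where(a => a.Attributes.Contains("href"))
            .ToList();

        var anchorsHref = anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
        var anchorsInnerText = anchors.Select(a => a.InnerText.Trim()).ToList();
        return (primaryDivId, anchorsHref, anchorsInnerText);
    }
}
```

With the sample markup, only the two anchors inside the div are collected; the anchor outside it is skipped.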
Here primaryDivId is a Boolean variable that will be true if any div with the id divAchors is found. anchorsHref holds the collection of the anchors' links, and anchorsInnerText is the collection of the anchors' inner text.
Problem Statement-2
Suppose we need to download images. The HTML format may be like the following:
Solution-2
To download all the images and also to get their alternative information text, we need to do the following:
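The original listing is not reproduced here. A minimal sketch, assuming an XPath query through SelectNodes against a div with the id divImage, could look like this. The sample markup, the ImageScraper class, and the Extract method are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageScraper
{
    // Hypothetical sample markup; one image is nested two div levels deep.
    public const string SampleHtml = @"
<div id='divImage'>
  <img src='/img/a.png' alt='First image' />
  <div><div><img src='/img/b.png' alt='Nested image' /></div></div>
</div>";

    public static (List<string> imageSrc, List<string> imageInnerText) Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var imageSrc = new List<string>();
        var imageInnerText = new List<string>();

        // "//img" after the div predicate matches img elements at any depth below it.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = document.DocumentNode.SelectNodes("//div[@id='divImage']//img");
        if (nodes != null)
        {
            foreach (HtmlNode img in nodes)
            {
                imageSrc.Add(img.GetAttributeValue("src", ""));
                imageInnerText.Add(img.GetAttributeValue("alt", ""));
            }
        }
        return (imageSrc, imageInnerText);
    }
}
```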
The //img part of the SelectNodes query means that img tags are looked up inside the div with the id divImage. If any image tag is found within the scope of this div, its source and alternative text are fetched. Note that it does not matter where the image tag resides: even if it is nested a few div levels deep, this query fetches them all.
From the above code, we get the collection of image source links in the imageSrc list and their alt text in the imageInnerText list. Using a foreach loop, we can download and save the images in our desired folder.
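The download loop is not shown in the text, so the following is one possible sketch rather than the author's code. It assumes HttpClient for the transfer, resolves relative src values against the page URL, and names each file after the last segment of the image URL. ImageDownloader, Resolve, TargetPath, and DownloadAllAsync are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class ImageDownloader
{
    // Resolves a possibly relative src value against the URL of the page it came from.
    public static Uri Resolve(Uri pageUri, string src) => new Uri(pageUri, src);

    // Builds the local file path from the last path segment of the image URL.
    public static string TargetPath(string folder, Uri imageUri) =>
        Path.Combine(folder, Path.GetFileName(imageUri.LocalPath));

    // Downloads every image in imageSrc into folder, one after another.
    public static async Task DownloadAllAsync(Uri pageUri, IEnumerable<string> imageSrc, string folder)
    {
        Directory.CreateDirectory(folder); // no-op if the folder already exists
        using (var client = new HttpClient())
        {
            foreach (string src in imageSrc)
            {
                Uri imageUri = Resolve(pageUri, src);
                byte[] bytes = await client.GetByteArrayAsync(imageUri);
                File.WriteAllBytes(TargetPath(folder, imageUri), bytes);
            }
        }
    }
}
```

In this sketch, two images that share a file name overwrite each other, and a failed request throws and stops the loop; real use would likely need unique names and error handling.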
Problem Statement-3
Suppose we need to find the inner text of a div by its class name. The HTML for this problem may look like the following:
Solution-3
Here is the solution for this problem statement:
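The original listing is not reproduced here. A minimal sketch, assuming an XPath query on the class attribute, could look like this. The class name demoText, the sample markup, and the TextScraper class are assumptions for illustration, and joining the pieces with a space is one possible way to build the single string.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TextScraper
{
    // Hypothetical sample markup; "demoText" is an assumed class name.
    public const string SampleHtml = @"
<div class='demoText'>First block</div>
<div class='other'>Ignored</div>
<div class='demoText'>Second block</div>";

    public static (string innerText, List<string> innerTextList) Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when no div carries the class.
        var nodes = document.DocumentNode.SelectNodes("//div[@class='demoText']");
        var innerTextList = nodes == null
            ? new List<string>()
            : nodes.Select(n => n.InnerText.Trim()).ToList();

        // One uncut string built from all matches.
        string innerText = string.Join(" ", innerTextList);
        return (innerText, innerTextList);
    }
}
```

Note that @class='demoText' compares the whole attribute value, so an element written as class='demoText highlighted' would not match this query.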
The innerText string gives you the full, uncut text as a single string, whereas innerTextList gives you the inner texts as a list.
Problem Statement-4
Suppose we have a problem similar to the one above, with a slight change: the class name toggles between two classes, and we cannot be sure which class name will be present when the page renders. The HTML for this problem statement may look like the following:
Here the class toggles between demoText1 and demoText2.
Solution-4
Here is the solution for the above problem statement:
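The original listing is not reproduced here. Assuming the query is XPath, the | operator unions two node sets, so a div carrying either class is matched. The sample markup and the ToggleTextScraper class are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ToggleTextScraper
{
    // Hypothetical sample markup covering both class names from the text.
    public const string SampleHtml = @"
<div class='demoText1'>Shown with the first class</div>
<div class='demoText2'>Shown with the second class</div>
<div class='other'>Ignored</div>";

    public static List<string> Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Union of two node sets: divs with class demoText1 or class demoText2.
        var nodes = document.DocumentNode.SelectNodes(
            "//div[@class='demoText1'] | //div[@class='demoText2']");

        // SelectNodes returns null when neither class is present.
        return nodes == null
            ? new List<string>()
            : nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

The same match can be written with a single predicate, //div[@class='demoText1' or @class='demoText2']; inside XPath predicates the Boolean operators are spelled or and and.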
The solution is similar to Solution-3, with an extra or (|) condition in the query. You can also use an and (&) condition if you need to.
These are problems I have faced recently in my work, and this is how I solved them. I think these solutions will help you with your own problems, because they cover a lot of common web scraping tasks. If you encounter more problems, please let me know and I will try to solve them. Thanks for reading. Happy coding. :)
References
- Wikipedia
- CodePlex