WebSite Scraping

shah vatsal

4.97/5 (19 votes)

1 Jun 2013CPOL7 min read

93.1K

Website scraping using Watin

Introduction

The main objective of this article is to demonstrate scraping of web pages using Testing tools like Watin testing tool.

Generally, scraping of web pages is done with HttpWebRequest and HttpWebResponse method of C# in ASP.NET. However, it is observed that when server side navigation is performed using AJAX in the application, then it becomes very difficult to fetch page data using HttpWebRequest method (we need to perform tricks to fetch next page data).

The same thing can be done with Watin Tool very easily and quickly. My objective here is not to challenge HttpWebRequest and HttpWebResponse methods, but to show how effectively we can do web site scraping using testing tools like Watin.

Background

In this article, I have created a demonstration web site with Category and subsequent Item Listing page. I will be scraping this web site using .NET testing tools like Watin.
Here, I have made use of third party tools like NUnit, Watin to demonstrate this example. Please refer below for a brief introduction for each tool and respective URL for further reference.

About Watin: Watin is a third party web application testing tool designed for .NET.

You can get more information about this tool by visiting this link.
About NUnit: NUnit is a third party unit-testing framework for all .NET languages. More information can be gathered by visiting this site.

Using the Code

This article consists of two applications. Brief details about this application have been given below.

The first application is a web based application created in Visual Studio 2010 (.NET 4.0). This is a demonstration web site with category and item listing pages. This web site needs to be deployed on local / remote server IIS.

The second application is a Windows based class library project created using Visual Studio 2010 (.NET 4.0) and Watin DLL.

The pre-requisite software required to execute this demonstration are as follows:

.NET Framework 4.0
NUnit 2.6.2

Please perform the step given below to configure web application:

Deploy web application in your IIS and assign .NET Framework 4.0 to this application and check that the application is running correctly on your workstation.

Perform the following to configure Web Scraping application:

Open the configuration file (App.config or WatinWebScraping.dll.config) of this application and make changes in values of below configuration keys accordingly:
- WebApplicationPath: This is the application path where the demonstration web site is deployed. In the current application, I have deployed it on local host,
  therefore I have given path "http://localhost/WebApplication/CategoryListing.aspx".
- ScraperEnginePhysicalLocation: This is the physical location where the scraper web application is hosted.
  I have defined value of this path as: "D:\WebScraping\Web Scraper\WatinWebScraping\WatinWebScraping". I use this path to store the scraped data in the text file.

Code Snippet of Demo Web Application

Below is the brief level understanding of the code that resides in respective pages:

CategoryListing.aspx: contains just the listing of categories in the form of hyperlink.
ItemListing.aspxsting.aspx: In this page, I have used Grid View control and have used XMLDataSource instead of Database (for easy configuration) in page.

Refer to the below code snippet:

ASP.NET

<asp:XmlDataSource ID="xmlSource" runat="server" DataFile="~/XMLDataBase/MenFashion.xml">
</asp:XmlDataSource>

<asp:GridView ID="gvItemListing" runat="server" DataSourceID="xmlSource" 
AutoGenerateColumns = "false"
AllowPaging="true" PageSize="5" Width="100%" PagerSettings-Position="Bottom">

Code Snippet of WatinWebScraper Application

Here, I will be explaining the following things:

Initialization of Watin and NUnit in the application
Using Watin for website navigation
Using Regular Expression features (RegEx and MatchCollection) of .NET to fetch respective data from the HTML page source
Execute this application using NUnit

Please refer to the following explanation of the respective sections.

1) Initializing Watin and NUnit in the Application

To use Watin and NUnit in the application, add reference to nunit.framework.dll, "Interop.SHDocVw.dll" and "WatiN.Core.dll".
Now add reference to "NUnit.Framework" and "WatiN.Core" in this project.

As we will be using NUnit for scraping this application; it requires mentioning "[TestFixture]" while creating class for the same and usage of "[Test]" and "[STAThread]" at the top of this method.

You can get more details of these attributes by referring to this web site.

2) Using Watin for Web Scraping

// Create an instance of Internet Explorer browser 
IE ieInstance = new IE(webSitePath);

// This will open Internet Explorer browser in maximized mode 
ieInstance.ShowWindow(WatiN.Core.Native.Windows.NativeMethods.WindowShowStyle.ShowMaximized);

Watin window can be made hidden to the user while performing web scraping using the below code snippet. Currently, this code is kept in comment. The user can also un-comment this code snippet.

//ieInstance.Visible = false;

// This will wait for the browser to complete loading of the page 
ieInstance.WaitForComplete(); 

// This will store page source in categoryPageSource variable 
string categoryPageSource = ieInstance.Html;

3) Using Regular Expression Features (RegEx and MatchCollection) of .NET to fetch respective data from the HTML page source

I have used regular expression for fetching categories and do iterative logic to fetch items in the respective categories and to move to next page use regular expression for fetching all the pages for the respective category items.

Please refer to the below regular expression used for Category, Item fetching and page navigation respectively.

Category Regular Expression

<A\S.*?class=bold\s.*?href="(?<href>.*?)"></span>

Code Explanation

The above regular expression will fetch all the categories URL from the CategoryListing.aspx page and will navigate in a recursive loop.

Item Regular Expression

<P\s*id=.*?>ProductID:\s*<B>(?<ProductID>.*?)</B>.*?</P>\s*.*?<P\s*id=.*?>
ProductName:\s*<B>(?<ProductName>.*?)</B>.*?</P>\s*.*?<P\s*id=.*?>
ProductPrice:\s*<B>(?<ProductPrice>.*?)</B>.*?</P>

Code Explanation

The above regular expression will fetch ProductID, Product Name and Product Price for the respective item that resides in the given page.

Paging Regular Expression

(?(?=<SPAN>.*?</SPAN>)<SPAN>(?<PageNumber>.*?)\s*</SPAN>|
<A\s*href="javascript.*?>(?<PageNumber>.*?)\s*</A>)

Code Explanation

The above regular expression will fetch respective pages from the ItemListing.aspx page.

To use this regular expression in this application, I have used "RegEx" class of "System.Text.RegularExpression" namespace. RegEx will compile respective regular expression pattern using different options like "RegexOptions.Compiled", "RegexOptions.IgnoreCase", "RegexOptions.IgnorePatternWhitespace" and "RegexOptions.CultureInvariant".

Refer to the below code snippet for the same:

// Regular expression for Category listing page 
private const string _categoryRegEx = <A\S.*?class=bold\s.*?href="(?<href>.*?)"></span>
Regex categoryMatches = new Regex(_categoryRegEx, RegexOptions.Compiled | 
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);

To fetch records based upon regular expression, it requires using MatchCollection to fetch list of successful matches for the respective HTML source generated (string categoryPageSource = ieInstance.Html;) as per #2 above. Refer to the below code:

MatchCollection categoryMatchCollection = categoryMatches.Matches(categoryPageSource);

Use for loop to fetch this respective result of single match: Refer to the below code:

foreach (Match categoryMatch in categoryMatchCollection)

If regular expression is created based upon the group, then use GroupCollection method to fetch groups of the respective result. Refer to the below code:

GroupCollection categoryGroup = categoryMatch.Groups;

GroupCollection contains multiple groups associated. To fetch category, I have used "href" as a group. Refer to the below code for reference:

string itemListingURL = Convert.ToString(categoryGroup["href"].Value);

Now, itemListingURL variable will contain href for the respective category. Now Watin will navigate to this URL as shown below.
itemListingPath variable contains the full path of the item listing page for the respective category.

ieInstance.GoTo(itemListingpath);

I have used "WaitForComplete" method to wait using respective page loading is completed. Refer to the below code:

ieInstance.WaitForComplete();

Using the above code application will navigate to the item listing page. Similar operation needs to be performed for fetching items.

Once all items of the respective page are fetched and to move to the next page, Watin provides Click event to perform click on specific page.

Click event can also be performed based upon other different criteria. Please refer below:

`Find.ByAlt`	`Find.ByClass`	`Find.ByDefault`	`Find.ByElement`	`Find.ByExistence` `OfRelatedElement`	`Find.ByFor`
`Find.ById`	`Find.ByIndex`	`Find.ByLabelText`	`Find.ByName`	`Find.BySelector`	`Find.BySrc`
`Find.ByStyle`	`Find.ByText`	`Find.ByTextInColumn`	`Find.ByTitle`	`Find.ByUrl`	`Find.ByValue`

You can get more information on all the above criteria by visiting this link.

In this article, I have used "Find.ByText" to find link by text and then perform click event. You can also attach
regular expression with the above criteria.

Refer to the below code:

// Fetches the page number of the current page.
string linkText = Convert.ToString(pagingGroup[_pageNumber].Value); 

// Performs click event on the given link. For e.g if linkText contains "2" as a value 
// then Watin will perform click event on this second link.
ieInstance.Link(Find.ByText(linkText)).Click(); 
// Wait for the operation to complete
ieInstance.WaitForComplete(); 

// Store the result of the page in itemListingPageSource variable
itemListingPageSource = ieInstance.Html;

Once the respective items in the web page are scraped, the current application will store respective items in "Output.txt" using StreamWriter of System.IO namespace.

Now open "Output.txt" file and observe that it will contain all the items of Men, Women and Children category.

4) Execute this Application using NUnit

To execute WatinWebScraper application, it requires following the steps given below:

Open NUnit application.
Now click on File --> Open Project and navigate to the DLL file ("WatinWebScraping.dll") of Watin Web Scraper application. Refer to the below image:

Now click on "Run" button as shown in the above image. Observe that application will start scraping of Demo Web application by Navigating to Category and all its respective items will be stored in "Output.txt" file.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)