Introduction
The main objective of this article is to demonstrate scraping of web pages using Testing tools like Watin testing tool.
Generally, scraping of web pages is done with HttpWebRequest
and HttpWebResponse
method of C# in ASP.NET. However, it is observed that when server side navigation is performed using AJAX in the application, then it becomes very difficult to fetch page data using HttpWebRequest
method (we need to perform tricks to fetch next page data).
The same thing can be done with Watin Tool very easily and quickly. My objective here is not to challenge HttpWebRequest
and HttpWebResponse
methods, but to show how effectively we can do web site scraping using testing tools like Watin.
Background
In this article, I have created a demonstration web site with Category and subsequent Item Listing page. I will be scraping this web site using .NET testing tools like Watin.
Here, I have made use of third party tools like NUnit, Watin to demonstrate this example. Please refer below for a brief introduction for each tool and respective URL for further reference.
About Watin: Watin is a third party web application testing tool designed for .NET.
You can get more information about this tool by visiting this link.
About NUnit: NUnit is a third party unit-testing framework for all .NET languages. More information can be gathered by visiting this site.
Using the Code
This article consists of two applications. Brief details about this application have been given below.
The first application is a web based application created in Visual Studio 2010 (.NET 4.0). This is a demonstration web site with category and item listing pages. This web site needs to be deployed on local / remote server IIS.
The second application is a Windows based class library project created using Visual Studio 2010 (.NET 4.0) and Watin DLL.
The pre-requisite software required to execute this demonstration are as follows:
- .NET Framework 4.0
- NUnit 2.6.2
Please perform the step given below to configure web application:
- Deploy web application in your IIS and assign .NET Framework 4.0 to this application and check that the application is running correctly on your workstation.
Perform the following to configure Web Scraping application:
- Open the configuration file (App.config or WatinWebScraping.dll.config) of this application and make changes in values of below configuration keys accordingly:
Code Snippet of Demo Web Application
Below is the brief level understanding of the code that resides in respective pages:
- CategoryListing.aspx: contains just the listing of categories in the form of hyperlink.
- ItemListing.aspxsting.aspx: In this page, I have used Grid View control and have used
XMLDataSource
instead of Database (for easy configuration) in page.
Refer to the below code snippet:
<asp:XmlDataSource ID="xmlSource" runat="server" DataFile="~/XMLDataBase/MenFashion.xml">
</asp:XmlDataSource>
<asp:GridView ID="gvItemListing" runat="server" DataSourceID="xmlSource"
AutoGenerateColumns = "false"
AllowPaging="true" PageSize="5" Width="100%" PagerSettings-Position="Bottom">
Code Snippet of WatinWebScraper Application
Here, I will be explaining the following things:
- Initialization of Watin and NUnit in the application
- Using Watin for website navigation
- Using Regular Expression features (RegEx and MatchCollection) of .NET to fetch respective data from the HTML page source
- Execute this application using NUnit
Please refer to the following explanation of the respective sections.
1) Initializing Watin and NUnit in the Application
To use Watin and NUnit in the application, add reference to nunit.framework.dll, "Interop.SHDocVw.dll" and "WatiN.Core.dll".
Now add reference to "NUnit.Framework
" and "WatiN.Core
" in this project.
As we will be using NUnit for scraping this application; it requires mentioning "[TestFixture]
" while creating class for the same and usage of "[Test]
" and "[STAThread]
" at the top of this method.
You can get more details of these attributes by referring to this web site.
2) Using Watin for Web Scraping
IE ieInstance = new IE(webSitePath);
ieInstance.ShowWindow(WatiN.Core.Native.Windows.NativeMethods.WindowShowStyle.ShowMaximized);
Watin window can be made hidden to the user while performing web scraping using the below code snippet. Currently, this code is kept in comment. The user can also un-comment this code snippet.
ieInstance.WaitForComplete();
string categoryPageSource = ieInstance.Html;
3) Using Regular Expression Features (RegEx and MatchCollection) of .NET to fetch respective data from the HTML page source
I have used regular expression for fetching categories and do iterative logic to fetch items in the respective categories and to move to next page use regular expression for fetching all the pages for the respective category items.
Please refer to the below regular expression used for Category, Item fetching and page navigation respectively.
Category Regular Expression
<A\S.*?class=bold\s.*?href="(?<href>.*?)"></span>
Code Explanation
The above regular expression will fetch all the categories URL from the CategoryListing.aspx page and will navigate in a recursive loop.
Item Regular Expression
<P\s*id=.*?>ProductID:\s*<B>(?<ProductID>.*?)</B>.*?</P>\s*.*?<P\s*id=.*?>
ProductName:\s*<B>(?<ProductName>.*?)</B>.*?</P>\s*.*?<P\s*id=.*?>
ProductPrice:\s*<B>(?<ProductPrice>.*?)</B>.*?</P>
Code Explanation
The above regular expression will fetch ProductID
, Product Name and Product Price for the respective item that resides in the given page.
Paging Regular Expression
(?(?=<SPAN>.*?</SPAN>)<SPAN>(?<PageNumber>.*?)\s*</SPAN>|
<A\s*href="javascript.*?>(?<PageNumber>.*?)\s*</A>)
Code Explanation
The above regular expression will fetch respective pages from the ItemListing.aspx page.
To use this regular expression in this application, I have used "RegEx
" class of "System.Text.RegularExpression
" namespace. RegEx
will compile respective regular expression pattern using different options like "RegexOptions.Compiled
", "RegexOptions.IgnoreCase
", "RegexOptions.IgnorePatternWhitespace
" and "RegexOptions.CultureInvariant
".
Refer to the below code snippet for the same:
private const string _categoryRegEx = <A\S.*?class=bold\s.*?href="(?<href>.*?)"></span>
Regex categoryMatches = new Regex(_categoryRegEx, RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);
To fetch records based upon regular expression, it requires using MatchCollection
to fetch list of successful matches for the respective HTML source generated (string categoryPageSource = ieInstance.Html;
) as per #2 above. Refer to the below code:
MatchCollection categoryMatchCollection = categoryMatches.Matches(categoryPageSource);
Use for
loop to fetch this respective result of single match: Refer to the below code:
foreach (Match categoryMatch in categoryMatchCollection)
If regular expression is created based upon the group, then use GroupCollection
method to fetch groups of the respective result. Refer to the below code:
GroupCollection categoryGroup = categoryMatch.Groups;
GroupCollection
contains multiple groups associated. To fetch category, I have used "href
" as a group. Refer to the below code for reference:
string itemListingURL = Convert.ToString(categoryGroup["href"].Value);
Now, itemListingURL
variable will contain href
for the respective category. Now Watin will navigate to this URL as shown below.
itemListingPath
variable contains the full path of the item listing page for the respective category.
ieInstance.GoTo(itemListingpath);
I have used "WaitForComplete
" method to wait using respective page loading is completed. Refer to the below code:
ieInstance.WaitForComplete();
Using the above code application will navigate to the item listing page. Similar operation needs to be performed for fetching items.
Once all items of the respective page are fetched and to move to the next page, Watin provides Click
event to perform click on specific page.
Click
event can also be performed based upon other different criteria. Please refer below:
Find.ByAlt | Find.ByClass | Find.ByDefault | Find.ByElement | Find.ByExistence OfRelatedElement | Find.ByFor |
Find.ById | Find.ByIndex | Find.ByLabelText | Find.ByName | Find.BySelector | Find.BySrc |
Find.ByStyle | Find.ByText | Find.ByTextInColumn | Find.ByTitle | Find.ByUrl | Find.ByValue |
You can get more information on all the above criteria by visiting this link.
In this article, I have used "Find.ByText
" to find link by text and then perform click
event. You can also attach
regular expression with the above criteria.
Refer to the below code:
string linkText = Convert.ToString(pagingGroup[_pageNumber].Value);
ieInstance.Link(Find.ByText(linkText)).Click();
ieInstance.WaitForComplete();
itemListingPageSource = ieInstance.Html;
Once the respective items in the web page are scraped, the current application will store respective items in "Output.txt" using StreamWriter
of System.IO
namespace.
Now open "Output.txt" file and observe that it will contain all the items of Men
, Women
and Children
category.
4) Execute this Application using NUnit
To execute WatinWebScraper
application, it requires following the steps given below:
- Open NUnit application.
- Now click on File --> Open Project and navigate to the DLL file ("WatinWebScraping.dll") of Watin Web Scraper application. Refer to the below image:
- Now click on "Run" button as shown in the above image. Observe that application will start scraping of Demo Web application by Navigating to Category and all its respective items will be stored in "Output.txt" file.