Introduction
Searching and collecting data published on web sites has always been a long and boring manual task. With this project, I try to give you a tool that can help to automate some of these tasks and save results in an ordered way.
It is simply another web scraper written for the Microsoft .NET Framework (C# and VB.NET), but finally without the use of the Microsoft mshtml parser!
I often use this light version because it is simple to customize and to include in new projects.
These are the components it is made of:
- A parser object, gkParser, that uses Jamietre's version of the HtmlParserSharp (https://github.com/jamietre/HtmlParserSharp) and that provides the navigation functions
- A ScrapeBot object, gkScrapeBot, that provides the search, extraction and data purging functions
- Some helper classes to speed up the development of database operations
Architecture
The search and extraction method requires that the HTML be transformed into XML, even when it is not well formed. Doing so makes it simpler to locate data inside a web page. The base architecture, then, focuses on making this transformation and executing queries on the resulting XML document.
The Parser class includes all the functions to navigate and parse. In this version, parsing is limited to HTML and JSON.
When navigation functions return a successful response, you have an XML DOM representation of the web page.
At this point, another object, the ScrapeBot, can execute queries, extract and purge desired data with XPath syntax.
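To give an idea of why the XML transformation matters, here is a minimal sketch using plain System.Xml (not the gkParser itself, and with a made-up HTML fragment): once the page has been normalized to well-formed XML, a single XPath expression is enough to locate the data.
Imports System.Xml

Module XPathIdea
    Sub Main()
        ' A tiny, hypothetical fragment already normalized to well-formed XML
        Dim doc As New XmlDocument()
        doc.LoadXml("<HTML><BODY><DIV class='price'>19,90</DIV></BODY></HTML>")

        ' One XPath expression locates the data, no manual tree walking needed
        Dim node As XmlNode = doc.SelectSingleNode("//DIV[@class='price']")
        Console.WriteLine(node.InnerText)   ' prints: 19,90
    End Sub
End Module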
The gkScrapeBot is the main object you will use in your project. It already uses the Parser for you. It provides some wrappers to the navigation functions, some useful query functions and some other functions to extract and purge data.
Let’s take a look inside these objects.
The Parser Class (gkParser)
This component is written in VB.NET and uses Jamietre's version of the port of the Validator.nu parser (http://about.validator.nu/htmlparser/).
Why not the Microsoft parser? Ah, OK. I want to spend a little time writing about this painful choice. ;-)
The first version of this project was made using mshtml. Here is why I decided to change:
First: my intent was to use this component windowless. There are many documents on the web about using mshtml, but not one official from Microsoft; plenty about users' troubles, though… The only useful documents from Microsoft are dated 1999 (the walkall example from the inet SDK)! It works, but I quickly found its limitations.
Second: I then started coding in .NET based on the walkall example. After overcoming the COM interop difficulties, I found that mshtml is able to make only GET requests. And POST? … Somewhere, Microsoft writes that it could be possible to customize the request process by implementing an interface and writing some callback functions… NO, it doesn't work!
Third: I needed to control the download of linked documents, JavaScript, images, CSS, … Oh, yes. Microsoft writes about this. It writes that you have total control over it… NO!
I used Wireshark to see what my process was downloading, and this feature didn't work. I saw that it works only when hosted by the heavy MS WebBrowser component.
Then I understood: Microsoft does not like developers using its parser.
The Component
Navigation functions are implemented with the WebRequest and WebResponse classes, and the HTML parser is implemented using the HtmlParserSharp.SimpleHtmlParser object. The Navigate method is the only public function used to make both GET and POST requests. It has 4 overloads to permit different behaviors.
Public Sub Navigate(ByVal url As String)
Public Sub Navigate(ByVal url As String, ByVal dontParse As Boolean)
Public Sub Navigate(ByVal url As String, ByVal postData As String)
Public Sub Navigate(ByVal url As String, ByVal postData As String, ByVal dontParse As Boolean)
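As an illustration, here is a minimal sketch of how these overloads might be called. It assumes a parameterless constructor, and the URLs and form data are only placeholders:
Dim parser As New gkParser()

' GET request; the response is parsed into the XML DOM
parser.Navigate("https://www.example.com/")

' GET request, download only (assumption: dontParse = True skips the parsing step)
parser.Navigate("https://www.example.com/export.csv", True)

' POST request: the second argument is the url-encoded form data
parser.Navigate("https://www.example.com/login", "username=foo&password=bar")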
It's not easy to create a class that fully implements all navigation features; I've created one that implements basic cookie management and that doesn't fully implement the HTTPS protocol.
All methods are synchronous: when they return, an XML DOM document is ready.
After a web request gets a successful response, the class checks the content type and instantiates the correct parser.
Jamietre's parser returns a very formal XML, too formal for our purpose. Moreover, some web pages are very large and complex, and it would be useful to have a smaller XML. For this reason, I implemented an interesting algorithm that filters tags and attributes: you can instruct the parser to consider only the desired tags and attributes and to exclude the undesired ones.
The following two properties control this behavior:
Public Property ExcludeInstructions() As String
Public Property IncludeInstructions() As String
p_tag2ExcInstruction = "SCRIPT|META|LINK|STYLE"
p_tag2IncInstruction = "A:href|IMG:src,alt|INPUT:type,value,name"
With this feature, you can customize the result XML and make it easier to understand and to teach the bot.
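For example, a short sketch of how these properties could be set before navigating. The instruction strings are the ones shown above; the comments reflect my reading of the format (tag names separated by "|", and for the include list a ":" followed by the attributes to keep):
Dim parser As New gkParser()

' Tags to filter out of the resulting XML
parser.ExcludeInstructions = "SCRIPT|META|LINK|STYLE"

' Tags to keep, each with the attributes to preserve
parser.IncludeInstructions = "A:href|IMG:src,alt|INPUT:type,value,name"

parser.Navigate("https://www.example.com/")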
The Scraper
The other main class is the gkScrapeBot. This is the class you have to use.
It uses the gkParser to navigate, to get the XML to analyze, and to extract data from it.
It implements helper functions for these tasks:
Public Sub Navigate(ByVal url As String)
Public Sub Navigate(ByVal url As String, ByVal subel As String)
Public Sub Navigate(ByVal url As String, ByVal subel As String, ByVal wait As Integer)
Public Sub Post(ByVal url As String, ByVal postData As String)
Public Sub Post(ByVal url As String, ByVal postData As String, ByVal subel As String)
Public Sub Post(ByVal url As String, ByVal postData As String, _
ByVal subel As String, ByVal wait As Integer)
Public Function GetNode_byXpath(ByVal xpath As String, _
Optional ByRef relNode As XmlNode = Nothing, _
Optional ByVal Attrib As String = "") As XmlNode
Public Function GetNodes_byXpath(ByVal xpath As String, _
Optional ByRef relNode As XmlNode = Nothing, _
Optional ByVal Attrib As String = "") As XmlNodeList
Public Function GetText_byXpath(ByVal xpath As String, _
Optional ByRef relNode As XmlNode = Nothing, _
Optional ByVal Attrib As String = "") As String
Public Function GetValue_byXpath(ByVal xpath As String, _
Optional ByRef relNode As XmlNode = Nothing, _
Optional ByVal Attrib As String = "") As String
Public Function GetHtml_byXpath(ByVal xpath As String, _
Optional ByRef relNode As XmlNode = Nothing) As String
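Before the full example, here is a short sketch of the optional parameters as I read them from the example below: relNode lets you run a query relative to a node you already found, and Attrib returns an attribute value instead of the element text. bot is assumed to be an already constructed gkScrapeBot; the URL and XPath expressions are placeholders:
bot.Navigate("https://www.example.com/")

' Absolute query over the whole document: text of the first H1
Dim title As String = bot.GetText_byXpath("//H1")

' Find a container node first, then query relative to it
Dim row As XmlNode = bot.GetNode_byXpath("//DIV[@class='item']")

' Third argument: read the href attribute instead of the element text
Dim link As String = bot.GetText_byXpath(".//A", row, "href")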
Look at the example below to see it in action.
How to Use: Test Project Included
Warning: scraping is often forbidden by web sites' policies.
Before scraping, you need to be sure that the target site's policy permits it.
I assume that you know how the web site works (URLs, request methods and parameters, …). I use the developer tools provided by browsers both to discover all the parameters and requests sent to the server, and to navigate the HTML tree.
Let's see it in action:
The test project, included in the download package, shows you how to get product details from an online shop: https://testscrape.gekoproject.com
I chose this example because it uses the key features of the scraper: cookie management and a POST request for the login phase, node exploring, and the database facilities to get and store the extracted data.
Products are not visible to guest users; only registered users can view products and prices.
The login process is based on cookies, so, first of all, we simply navigate to the site to obtain the cookie.
url = "https://testscrape.gekoproject.com/index.php/author-login"
bot.Navigate(url)
In the login page, inside the form, there are two strings that need to be posted back to the server to successfully send a login request.
token1 = bot.GetText_byXpath("//DIV[@class='login']//INPUT[@type='hidden'][1]", , "value")
token2 = bot.GetText_byXpath("//DIV[@class='login']//INPUT[@type='hidden'][2]", , "name")
url = "https://testscrape.gekoproject.com/index.php/author-login?task=user.login"
data = "username=" & USER & "&password=" & PASS & "&return=" & token1 & "&" & token2 & "=1"
bot.Post(url, data)
If all goes right, you are redirected to the user page, and then you can verify it by getting the "Registered Date" information:
mytext = bot.GetText_byXpath("//DT[contains(.,'Registered Date')]/following-sibling::DD[1]")
Console.WriteLine("User {0}, Registered Date: {1}", USER, mytext.Trim)
Once you are logged in, you can navigate to the product listing page and start scraping data.
In the example, only the data from the first page is scraped, but you can repeat the task for each page in the pager (see the sketch after the listing below).
Below is the code to retrieve a list of products and their attributes:
Dim url As String
Dim name As String
Dim desc As String
Dim price_str As String
Dim price As Double
Dim img_path As String
url = "https://testscrape.gekoproject.com/index.php/front-end-store"
bot.Navigate(url)
Dim ns As XmlNodeList = bot.GetNodes_byXpath( _
    "//DIV[@class='row']//DIV[contains(@class, 'product ')]")
If ns.Count > 0 Then
Dim writer As XmlWriter = Nothing
Dim settings As XmlWriterSettings = New XmlWriterSettings()
settings.Indent = True
settings.IndentChars = (ControlChars.Tab)
settings.OmitXmlDeclaration = True
writer = XmlWriter.Create("data.xml", settings)
writer.WriteStartElement("products")
For Each n As XmlNode In ns
name = bot.GetText_byXpath(".//DIV[@class='vm-product-descr-container-1']/H2", n)
desc = bot.GetText_byXpath(".//DIV[@class='vm-product-descr-container-1']/P", n)
desc = gkScrapeBot.FriendLeft(desc, 50)
img_path = bot.GetText_byXpath(".//DIV[@class='vm-product-media-container']//IMG", n, "src")
price_str = bot.GetText_byXpath(".//DIV[contains(@class,'PricesalesPrice')]", n)
If price_str <> "" Then
price = gkScrapeBot.GetNumberPart(price_str, ",")
End If
writer.WriteStartElement("product")
writer.WriteElementString("name", name)
writer.WriteElementString("description", desc)
writer.WriteElementString("price", price)
writer.WriteElementString("image", img_path)
writer.WriteEndElement()
db.CommandType = DBCommandTypes.INSERT
db.Table = "Articles"
db.Fields("Name") = name
db.Fields("Description") = desc
db.Fields("Price") = price
Dim ra As Integer = db.Execute()
If ra = 1 Then
Console.WriteLine("Inserted new article: {0}", name)
End If
Next
writer.WriteEndElement()
writer.Flush()
writer.Close()
End If
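As mentioned above, the demo only scrapes the first page. Below is a hedged sketch of how you could follow the pager; the pager XPath, the host prefix and the markup are assumptions about the demo site, so check them with the browser's developer tools first.
' Collect the pager links (the XPath is an assumption; adapt it to the real markup)
Dim pageLinks As XmlNodeList = bot.GetNodes_byXpath("//UL[contains(@class,'pagination')]//A")

For Each pageLink As XmlNode In pageLinks
    ' href is often relative, so prepend the host if needed
    Dim pageUrl As String = "https://testscrape.gekoproject.com" & pageLink.Attributes("href").Value
    bot.Navigate(pageUrl)

    ' ... repeat the same product extraction loop as above for this page ...
Next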
Conclusion
I hope that this project will help you in collecting data from the web.
I know that it's not simple to discover how a web site works, especially if it makes heavy use of JavaScript for asynchronous requests.
So this project won't be a solution for all websites; if you need something more than this project, you can contact me by leaving a comment below. And be sure you are authorized to scrape. ;-)
Happy scraping!
Updates
- 28-06-2019
- Updated to target .NET Framework 4.7.2 and VS 2019
- Test Project was updated to work with new Test Site https://testscrape.gekoproject.com
- Feature improvements and bug fixes
- 16-07-2015
- Fixed a permission error on the demo site that caused a runtime exception while running the test project