Introduction
This article, and the included fully functional code, describe how I built a screen scraping utility and a front end for searching through the newly acquired data on your .NET CF enabled device. My example extracts data (physical addresses near a zip code the user provides) from the web pages of several well known restaurants and inserts the data into a database consisting of XML files. I've also included 'plug-in' functionality, where additional restaurant definitions (patterns for finding relevant data on a given website) can be supplied through a web service.
Before you write this application off like a few others have, please note that it is merely a proof of concept, meant to be technically innovative in a retro kind of way. The code below is limited - please download it onto your own Pocket PC (into \Program Files\RestaurantScraper) and try it out. Source code is of course included and documented.
Likely, you are reading this out of curiosity, mumbling "He did what? Why!?!?". I must be honest and say that I have asked myself this question many times over the past week. Making such a sophisticated platform (.NET CF) be responsible for extracting data from HTML goes against everything we've been taught in this bubbling age of technology. But I've found that, even with so much technology available, many companies have not yet embraced Web Services for general consumption, whether it be on purpose or out of ignorance. There may be a time in the future when everything will be available, a digital utopia if you will, but that time has yet to come. Until then, getting our hands on priceless data (spice?) may require some less than conventional techniques and, more than likely, a blatant disregard for those pesky "Terms of Use" agreements.
A word of caution: even though my program only does an HTTP request, identical to the call your browser makes, the process of screen scraping and the storage of the collected data is not permitted in the Terms Of Use for the web sites I am using. I am not responsible for your actions and am not liable for any damages that may result from your use of my program.
Background
It has always been my opinion that the process of screen scraping is almost as ugly as Richard Gere. And just like Runaway Bride, it should be avoided at all costs. For those who've never heard of it (screen scraping, not Runaway Bride), screen scraping is the process of finding patterns in a pool of text and extracting the nuggets of data found within. It can really be compared to finding a needle in a haystack (but it must be said that it is a well organized haystack). For example, try finding this text in the source HTML document (View - Source for all of us Billy G followers). Likely, it is several hundred lines down, even though there are only a few paragraphs preceding this one. You'll find this paragraph nestled in between a host of <p> and <td> tags used for formatting. One major reason screen scraping is not a popular way of getting data is that if the format of the web page changes, the application will likely no longer be able to parse it properly. That is a significant issue that really cannot be resolved.
At the programming level, getting at data inside an HTML document requires traversing these hundreds or thousands of lines of HTML markup, looking for patterns that have already been defined. That is what I have done: written logic that traverses HTML documents looking for predefined patterns and stores the resulting data locally, so that it can be referenced while offline. This app primarily shows how using XML files as data sources can make a single-purpose stand-alone app full featured and powerful.
Using the code
The meat of the application resides in the extract.cs class. Just about all of the logic for the entire app is in that class. The heart of the application, the web page request, is below. There is obviously more code - I encourage you to download it and check it out for yourself.
// Requires System.Net, System.IO, System.Text, and System.Data.
private static void makeRequest(string url,
    DataSet ds, DataRow settings, string companyName)
{
    // Fetch the raw HTML, exactly as a browser would
    HttpWebRequest hr = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)hr.GetResponse();
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8);
    string fullPage = readStream.ReadToEnd();
    response.Close();
    readStream.Close();

    // Pull the scraping patterns from the site definition (settings) row
    string endIndicator = settings["endIndicator"].ToString();
    string startIndicator =
        settings["startIndicator"].ToString().Replace("NEWLINE", "\n");
    int begin = fullPage.IndexOf(startIndicator, startIndex);
    // ... remainder of the parsing loop omitted; startIndex is the loop's
    // position marker as it walks through the page ...
}
Pretty straightforward. The really cool concept, at least in my mind, is in the endIndicator and startIndicator lines. You'll notice their values are set based on input from the settings DataRow. This DataRow is straight from the definition (or configuration) file, custom tailored to the web site (see the next section - mcdonalds.xml).
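To show where those indicators come into play, here is a minimal sketch of how the parsing loop might continue once the page has been downloaded. The loop structure and cleanup comments are my own illustration, not the actual code from extract.cs:

// Sketch only: walk the page, slicing out each block of text that sits
// between startIndicator and endIndicator.
int startIndex = 0;
while (true)
{
    int begin = fullPage.IndexOf(startIndicator, startIndex);
    if (begin == -1)
        break;                                   // no more matches on the page
    begin += startIndicator.Length;              // skip past the marker itself

    int end = fullPage.IndexOf(endIndicator, begin);
    if (end == -1)
        break;

    // Raw address text, still containing the contentSeperator string
    string rawAddress = fullPage.Substring(begin, end - begin);

    // ... split on contentSeperator, clean up the pieces, and add a row
    // to the addresses DataSet (details vary per site definition) ...

    startIndex = end + endIndicator.Length;      // continue after this match
}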
Here is an example of my using XSD & XML as data sources.
DataSet ds = new DataSet("addresses");
ds.ReadXmlSchema(schemaPath + @"\addresses.xsd");
ds.ReadXml(dataPath+@"\locations.xml");
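Presumably the flip side of this is writing the DataSet back out whenever locations are added or cleared; a minimal sketch, assuming the same dataPath used above:

// Sketch: flush the in-memory DataSet back to the XML "database" file.
ds.WriteXml(dataPath + @"\locations.xml");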
If you are going to install this code onto your PDA or PDA emulator, please create a directory in the \Program Files folder named RestaurantScraper and extract all the compiled files to that folder.
On running the app, you will be asked to acknowledge a license agreement (see the note above about websites' Terms Of Use pages). Once you have accepted, click File - Data Maintenance. Assuming your PDA has network connectivity (either through ActiveSync or a wireless connection), enter a zip code at the top of the form and press Download. After a few moments, the number of records should jump to 20 or so, depending on how many McDonalds there are in that zip code's area. I believe this captures only USA McDonalds, so try 90210 if you don't know any zip codes. Press OK to close the form, and you can then do a search against this data. Back in the Data Maintenance form, you can also clear out the locations that you have stored. Additionally, you can download new definitions (I have one from the ice cream shop Baskin Robbins available for download) by pressing the Download New Definitions button. This calls a web service, and the app downloads and installs the new definitions automatically.
How XML & XSD are used for defining sites
My code comes with a site definition for the store locator site of the McDonalds website (maintained by vicinity.com). The site definition (mcdonalds.xml) contains the following elements (enforced by an XML schema, definition.xsd):
startIndicator - An identifiable string that is immediately followed by relevant data (a McDonalds physical address) in the HTML page
endIndicator - A string that marks the end of relevant data (note that the data still needs to be cleaned up - this simply identifies the end of the data)
contentSeperator - A string, usually a break, that differentiates the street address from the [city, state, zip, country] text
baseURL - unused at this time.
contentURLBegin - if a user did a real search for a McDonalds location, this would be the text in the address bar that precedes the zip code.
contentURLEnd - if a user did a real search for a McDonalds location, this would be the text in the address bar after the zip code (the sketch after this list shows how these pieces are assembled).
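To make the URL construction concrete, here is a minimal sketch of loading a definition and kicking off a request. The variable names, file locations, and the addressesDataSet passed in are my own assumptions; only the element names and the contentURLBegin + zip + contentURLEnd layout come from the definition format above.

// Sketch: load the McDonalds definition and request the locations for one zip code.
DataSet def = new DataSet("definition");
def.ReadXmlSchema(schemaPath + @"\definition.xsd");   // assumed location of the schema
def.ReadXml(dataPath + @"\mcdonalds.xml");            // assumed location of the definition

// A definition file boils down to a single row of settings
DataRow settings = def.Tables[0].Rows[0];

// The search URL is just the zip code wedged between the two URL fragments
string zip = "90210";
string url = settings["contentURLBegin"].ToString()
           + zip
           + settings["contentURLEnd"].ToString();

// addressesDataSet is the DataSet backed by locations.xml / addresses.xsd
makeRequest(url, addressesDataSet, settings, "McDonalds");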
A list of installed site definitions is maintained in a separate XML file, definitions.xml (maintained by the app and enforced by definitions.xsd). Incidentally, the web service (not included in the attached code) relies on the same schema and returns a DataSet using this schema layout. Its fields are:
displayName - name of the company (i.e., McDonalds)
fileName - [definition].xml (i.e., mcdonalds.xml - the same definition file described above)
The physical locations that have been downloaded are stored in locations.xml (enforced by addresses.xsd). This file is created on first use. Fields of note are:
unid - a string that is the primary key. It is most of the address scrunched together. On future lookups, if this key already exists, another location record (with the same values) will not be created (see the sketch after this list).
street & city & state & zip fields - self explanatory
companyName - yeah, this could have been a relation to the definitions.xml doc, but it isn't.
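Here is a minimal sketch of that dedup idea: build the unid from the address pieces pulled out of the page and only insert a row when no row with that key exists yet. The exact way the key is scrunched together and the table name are my assumptions, not lifted from the source:

// Sketch: only add a location if its unid is not already in the DataSet.
string unid = (street + city + state + zip).Replace(" ", "").ToLower();  // assumed key format

DataTable addresses = ds.Tables[0];   // the table loaded from locations.xml
DataRow[] existing = addresses.Select("unid = '" + unid.Replace("'", "''") + "'");
if (existing.Length == 0)
{
    DataRow row = addresses.NewRow();
    row["unid"] = unid;
    row["street"] = street;
    row["city"] = city;
    row["state"] = state;
    row["zip"] = zip;
    row["companyName"] = companyName;
    addresses.Rows.Add(row);
}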
Points of Interest
My primary goal was to make this a snap-in type application - make it so other websites could be configured and installed with a minimum of legwork. Hard coding the starting and end points for a single site in order to screen scrape is not terribly difficult, but building an architecture that permits an unlimited number of sites to be scraped is very attractive, at least to me (unlike Richard Gere). I must admit that this application has been created to meet the needs of at least two sites. It can only handle sites that have the zip code in the URL, and it cannot get phone numbers or other information (like a link to a map of the location). It could, but this app is more a proof of concept than anything.
Let me say that my most frustrating roadblock came from the difference in how the locations of the XML file and the XSD (XML schema) file must be specified on the file system. The example below, taken straight out of the code (and slightly formatted), points to the same location. But for some reason, the schema path needs to have spaces replaced by %20s, while the XML path cannot have %20s. I didn't waste too much time on it, but once I found out what the solution was, I could only roll my eyes.
// Schema path: spaces must be escaped as %20
public string schemaPath = @"\Program%20Files\RestaurantScraper";
ds.ReadXmlSchema(schemaPath + @"\addresses.xsd");

// Data path: spaces must NOT be escaped
public string dataPath = @"\Program Files\RestaurantScraper";
ds.ReadXml(dataPath + @"\locations.xml");
I am most satisfied with my use of a web service in this application. While not an integral piece of the application as a whole, the application has the ability to call a web service (currently on my personal site), which will return a list of definitions (i.e., mcdonalds.xml & baskinrobbins.xml) in a DataSet. The Pocket PC app will compare that DataSet to the list of local definitions (definitions.xml), and if any are missing, download them (also from my site) and make them available for use instantly. I think that is really slick.
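For readers curious about the mechanics, here is a minimal sketch of that compare-and-download step. The proxy class name (DefinitionService), its GetDefinitions method, and the download URL are all placeholders of my own; only the idea of comparing the returned DataSet against definitions.xml and saving the missing files comes from the app:

// Sketch: fetch the master list of definitions and pull down any we don't have.
DefinitionService svc = new DefinitionService();   // hypothetical generated web service proxy
DataSet remote = svc.GetDefinitions();             // same layout as definitions.xsd

DataSet local = new DataSet("definitions");
local.ReadXmlSchema(schemaPath + @"\definitions.xsd");
local.ReadXml(dataPath + @"\definitions.xml");

foreach (DataRow remoteRow in remote.Tables[0].Rows)
{
    string fileName = remoteRow["fileName"].ToString();
    if (local.Tables[0].Select("fileName = '" + fileName + "'").Length == 0)
    {
        // Placeholder URL - the real app downloads from the author's site
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(
            "http://example.com/definitions/" + fileName);
        HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
        StreamReader reader = new StreamReader(resp.GetResponseStream());
        string definitionXml = reader.ReadToEnd();
        reader.Close();
        resp.Close();

        // Save the definition file and register it locally so it is usable immediately
        StreamWriter writer = new StreamWriter(dataPath + @"\" + fileName);
        writer.Write(definitionXml);
        writer.Close();

        DataRow localRow = local.Tables[0].NewRow();
        localRow["displayName"] = remoteRow["displayName"];
        localRow["fileName"] = fileName;
        local.Tables[0].Rows.Add(localRow);
    }
}
local.WriteXml(dataPath + @"\definitions.xml");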
Have fun. Thanks for looking - feel free to comment and rate.