Introduction
As technologies evolve and new content distribution methods are developed, there remains a need to handle, parse, manipulate, and aggregate data in ways the original content distributor perhaps never anticipated. With RSS, XML, web browsers, and so on, consuming data has never been easier; however, refining the data delivered by these means to include only the content one wants can be quite a challenge.
For example, suppose we want to aggregate the local weather for 10 business locations from a website we visit regularly. To do this by hand, we would have to open a web browser, connect to the website, and run 10 queries through the site's frontend. Using a handful of .NET classes and regular expressions, the same task can be completed with very little code and significantly faster.
Alternatively, we could apply these techniques to aggregate data from one source, such as a website, and format it for consumption on multiple devices such as handhelds or media center applications. There have been some fascinating implementations of this concept on platforms such as the XBOX Media Center and MythTV, which have applied these principles to aggregating music videos, movie show times, RSS feeds, and web radio streams.
Legality
Of course, there are legal and moral questions that arise from consuming content which you may not own and altering it to suit your needs. This article assumes that you own the content you want to aggregate, or you have permission from the content owner to consume the data outside of the manner in which it is intended.
The Sample Application - Stock Ratings Aggregator
The Problem
John Q. Public has an aggressive investment strategy where he follows the same process every two weeks after receiving his paycheck from the ACME corporation:
- He logs into MSN's Money website and checks his portfolio.
- He already has a pool of 15 companies that he invests in regularly, so he glances over their performance.
- His broker only allows him to trade once a week, so he changes his stock every week (from his list of 15).
- Since he wants to make a wise decision about which stock to invest in this week, he logs into MSN's MoneyCentral website and goes through his list one at a time, entering each stock symbol to see the much-respected Stock Scouter rating.
- After being burned on SIRI (Sirius Satellite Radio), John decides to not invest in any stock with a rating below 8 (scale of 10).
John longs for the day when MSN will offer him some kind of RSS feed that aggregates his stock ratings for him, but since that's not available, he accepts that he will waste up to an hour every 2 weeks (on ACME time) doing this manually. If only John Q. Public were a .NET programmer, he'd know how easy tasks such as this are to automate!
The Solution
John is a beginner, so we want to keep this easy for him and avoid hard coding anything in the source. (For the example, we are going to break the OOP commandment to always program to an interface and not an implementation, but John doesn't know any better!) To hold the variables, we'll use an XML file. The XML file has too many tags to display here, so you can see it in the project files. It stores all of John's stock symbols, the regular expression he's going to use to parse the web stream, and the URL of the site he wants to parse.
We only need three classes and the main entry point (very simple). The classes are:
Stock
- Parses the HTML to find the stock rating.
HTTPParser
- Fetches the stream (the raw HTML).
Settings
- Encapsulates the values in the XML settings file.
The Application Entry Point
static void Main(string[] args)
{
    Program stockScouter = new Program();
    stockScouter.init();

    // Keep the console window open until Enter is pressed.
    Console.ReadLine();
}

private void init()
{
    Settings xmlConfiguration = new Settings();

    // Fetch and print the rating for each symbol listed in the settings file.
    foreach (string stockSymbol in xmlConfiguration.GetStocks)
    {
        string stockSearchUrl = xmlConfiguration.BaseUrl + stockSymbol;
        Stock getStock = new Stock(stockSearchUrl, xmlConfiguration.Pattern);
        Console.WriteLine(stockSymbol + " " + getStock.GetRating);
    }
}
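John's rule of skipping any stock rated below 8 could be applied to the ratings once they have been collected. The following is a minimal sketch under a few assumptions: the rating text is a whole number, and RatingFilter and AtOrAbove are hypothetical names that are not part of the posted source.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RatingFilter
{
    // Returns the symbols whose rating parses as a whole number and meets the minimum.
    // Ratings that are empty or non-numeric (for example, when the pattern found
    // no match on the page) are skipped rather than treated as zero.
    public static List<string> AtOrAbove(IDictionary<string, string> ratings, int minimum)
    {
        List<string> keep = new List<string>();
        foreach (KeyValuePair<string, string> entry in ratings)
        {
            int value;
            bool parsed = int.TryParse(entry.Value, NumberStyles.Integer,
                                       CultureInfo.InvariantCulture, out value);
            if (parsed && value >= minimum)
            {
                keep.Add(entry.Key);
            }
        }
        return keep;
    }
}
```

Inside init(), each symbol and its GetRating value could be added to a dictionary and passed to RatingFilter.AtOrAbove(ratings, 8) after the loop finishes.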
The HTTP Parser (StreamReader)
There are a few properties in this class, but I'll cover only the core method. This class needs the System.Net, System.IO, and System.Text namespaces referenced (System.Text supplies the Encoding class). It's quite simple: the constructor takes a search URL and stores it in a local field exposed through a public property. The Parse method creates a new web request and saves the entire page's HTML into another local field/public property.
public HTTPParser(string url)
{
    this.url = url;
}

public void Parse()
{
    // Url and Html are properties of this class (see the posted source).
    HttpWebRequest myWebRequest = (HttpWebRequest)WebRequest.Create(Url);

    // The using blocks close the response, stream, and reader even if reading throws.
    using (HttpWebResponse siteResponse = (HttpWebResponse)myWebRequest.GetResponse())
    using (Stream streamResponse = siteResponse.GetResponseStream())
    using (StreamReader reader = new StreamReader(streamResponse, Encoding.UTF8))
    {
        Html = reader.ReadToEnd();
    }
}
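The stream-reading step of Parse can be tried without a network connection by substituting an in-memory stream for the web response. This is only a sketch: StreamText, ReadAll, and FromString are hypothetical names introduced for illustration, not members of the posted HTTPParser class.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamText
{
    // Reads the whole stream as UTF-8 text. Disposing the reader also
    // disposes the underlying stream.
    public static string ReadAll(Stream source)
    {
        using (StreamReader reader = new StreamReader(source, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Builds an in-memory stream that stands in for a web response body.
    public static Stream FromString(string text)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(text));
    }
}
```

Calling StreamText.ReadAll(StreamText.FromString("<html>8</html>")) returns the same markup as a string, which makes it easy to feed canned pages to the rest of the application while the regular expression is being tuned.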
The Stock Class
Ideally, this class would hold the stock's name, symbol, price, rating, etc., but we're keeping this simple, so this class exists just for a physical representation of the stock. Its purpose is to extract the stock's rating from the string (which is stored in the HTTP parser).
// Requires the System.Text.RegularExpressions namespace.
class Stock
{
    HTTPParser msnStockStream;
    private string cachedMsnStream = string.Empty;
    private string rating = string.Empty;
    private string RegExPattern = string.Empty;

    public string GetRating
    {
        get { return rating; }
    }

    // Downloads the page at the given URL and extracts the rating immediately.
    public Stock(string url, string RegExPattern)
    {
        this.RegExPattern = RegExPattern;
        msnStockStream = new HTTPParser(url);
        msnStockStream.Parse();
        cachedMsnStream = msnStockStream.GetHtml;
        getRating();
    }

    // Takes the first match of the pattern; Value is an empty string if nothing matches.
    private void getRating()
    {
        rating = Regex.Match(cachedMsnStream, RegExPattern).Value;
        rating = cleanUnwantedChars(rating);
    }

    // Strips the double quotes that surround the matched text.
    private string cleanUnwantedChars(string matchedString)
    {
        return matchedString.Replace("\"", "");
    }
}
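The real pattern lives in settings.xml and is not reproduced here, so the following sketch uses invented markup and an invented pattern purely to show how the match-then-clean step behaves. RatingExtractor, SampleHtml, and SamplePattern are hypothetical; they do not reflect the actual MSN page or the pattern in the posted source.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RatingExtractor
{
    // Hypothetical markup and pattern, invented for illustration only.
    public const string SampleHtml = "<img src=\"rating.gif\" alt=\"9\">";

    // Matches a quoted run of digits that directly follows alt=
    public const string SamplePattern = "(?<=alt=)\"\\d+\"";

    // Returns the first match with double quotes removed,
    // or an empty string when the pattern matches nothing.
    public static string Extract(string html, string pattern)
    {
        Match match = Regex.Match(html, pattern);
        return match.Success ? match.Value.Replace("\"", "") : string.Empty;
    }
}
```

With the sample values, the pattern matches the quoted digit after alt= and the quote removal leaves just the digit. Checking match.Success makes the no-match case explicit, although Regex.Match already yields an empty Value in that situation.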
The Settings Class
The Settings class encapsulates our settings.xml file, so the System.Xml code lives in one place and every setting is available through a single object. This class has only one method; the rest of the code (in the posted source) exists only to provide friendly getters.
private void Read()
{
    symbols = new ArrayList();

    // Path.Combine supplies the directory separator for us.
    string xmlFilePath = System.IO.Path.Combine(
        System.IO.Directory.GetCurrentDirectory(), "settings.xml");
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(xmlFilePath);

    // Every child of the Stocks element holds one stock symbol.
    XmlNode stockSettings = xmlDoc.SelectSingleNode("//Stocks");
    foreach (XmlNode symbolNode in stockSettings.ChildNodes)
        symbols.Add(symbolNode.InnerText);

    // The first child of the RegEx element is the pattern; the second is the base URL.
    XmlNode RegExSettings = xmlDoc.SelectSingleNode("//RegEx");
    pattern = RegExSettings.ChildNodes[0].InnerText;
    baseUri = RegExSettings.ChildNodes[1].InnerText;
}
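Since the settings file itself is only available in the project files, here is a rough sketch of a layout that the Read method's lookups would accept. Only the Stocks and RegEx element names come from the code above; the Settings, Symbol, Pattern, and BaseUrl element names, the sample values, and the SettingsSketch class are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SettingsSketch
{
    // Hypothetical layout: only Stocks and RegEx are taken from the Read method.
    // All other element names and all values are placeholders.
    public const string SampleXml =
        "<Settings>" +
          "<Stocks><Symbol>AAA</Symbol><Symbol>BBB</Symbol></Stocks>" +
          "<RegEx>" +
            "<Pattern>\\d+</Pattern>" +
            "<BaseUrl>http://example.com/rating?symbol=</BaseUrl>" +
          "</RegEx>" +
        "</Settings>";

    // Mirrors the lookups in Read(), but loads from a string instead of a file.
    public static void Read(string xml, List<string> symbols,
                            out string pattern, out string baseUrl)
    {
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(xml);

        foreach (XmlNode symbolNode in xmlDoc.SelectSingleNode("//Stocks").ChildNodes)
            symbols.Add(symbolNode.InnerText);

        // Position matters here: the pattern must come first, the URL second.
        XmlNode regExSettings = xmlDoc.SelectSingleNode("//RegEx");
        pattern = regExSettings.ChildNodes[0].InnerText;
        baseUrl = regExSettings.ChildNodes[1].InnerText;
    }
}
```

Because the two RegEx values are read by position rather than by name, reordering them in the file, or placing an XML comment ahead of them, would change what ends up in each variable.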
Conclusion
In five simple minutes or less, John Q. Public was able to write a few simple C# classes and automate a process that has eaten away many hours of his life! Because of this improvement in efficiency, he now has more time to ignore his ringing telephone at work and pretend to be busy!
The Net.HttpWebRequest, Net.HttpWebResponse, IO.Stream, IO.StreamReader, and RegularExpressions.Regex classes seem like a match made in heaven when used in conjunction with one another.
Code responsibly!