
Aggregating Web Content in C# (Stock Ratings Sample)

thund3rstruck


21 Mar 2007

A demonstration of aggregation and pattern matching in web content.

Screenshot - stockScout.jpg

Introduction

As technologies evolve and new content distribution methods are developed, there remains a need to handle, parse, manipulate, and aggregate data in ways that the original content distributor perhaps never anticipated. With RSS, XML, web browsers, and so on, consuming data has never been easier; however, refining the data delivered by these means so that it includes only the content one wants can be quite a challenge.

For example, suppose we want to aggregate the local weather for 10 business locations from a website we visit regularly. To do this by hand, we would have to open a web browser, connect to the website, and run 10 queries through the site's frontend. Using a few .NET classes and regular expressions, the same task can be completed with very little code and at a significantly faster pace.

Alternatively, we could apply these techniques to aggregate data from one source, such as a website, and format it for consumption on multiple devices, such as handhelds or media center applications. There have been some fascinating implementations of this concept on platforms such as Xbox Media Center and MythTV, which have applied these principles to aggregating music videos, movie show times, RSS feeds, and web radio streams.

Legality

Of course, there are legal and moral questions that arise from consuming content you may not own and altering it to suit your needs. This article assumes that you own the content you want to aggregate, or that you have permission from the content owner to consume the data in ways other than the owner intended.

The Sample Application - Stock Ratings Aggregator

The Problem

John Q. Public has an aggressive investment strategy where he follows the same process every two weeks after receiving his paycheck from the ACME corporation:

He logs into MSN's Money website and checks his portfolio.
He already has a pool of 15 companies that he invests in regularly, so he glances over their performance.
His broker only allows him to trade once a week, so he changes his stock every week (from his list of 15).
Since he wants to make a wise decision about which stock to invest in this week, he logs into MSN's MoneyCentral website and goes through his list one at a time, entering each stock symbol to see its much-respected Stock Scouter rating.
After being burned on SIRI (Sirius Satellite Radio), John decides not to invest in any stock with a rating below 8 (on a scale of 10).

John longs for the day when MSN will offer him some kind of RSS feed that aggregates his stock ratings for him, but since that's not available, he accepts the fact that he will waste up to an hour every two weeks (on ACME time) doing this manually. If only John Q. Public were a .NET programmer, he'd know how easy tasks such as this are to automate!

The Solution

John is a beginner, so we want to keep this easy for him and avoid hard-coding anything in the source. (For this example, we are going to break the OOP commandment to always program to an interface rather than an implementation, but John doesn't know any better!) To hold the variable values, we'll use an XML file. It has too many tags to display here, but you can see it in the project files. The XML file stores all of John's stock symbols, the regular expression he's going to use to parse the web stream, and the base URL of the site he wants to parse.

We only need three classes and the main entry point (very simple). The classes are:

Stock - Parses the HTML to find the stock rating.
HTTPParser - Fetches the stream (the raw HTML).
Settings - Encapsulates the values in the XML settings file.

The Application Entry Point

static void Main(string[] args)
{
    Program stockScouter = new Program();
    stockScouter.init();
    Console.ReadLine();
}
private void init()
{
    //Load the Application settings from the xml file
    Settings xmlConfiguration = new Settings();
    //Enumerate the stock symbols
    foreach (string stockSymbol in xmlConfiguration.GetStocks)
    {
        //Parse the searchUrl and display the rating
        string stockSearchUrl = xmlConfiguration.BaseUrl + stockSymbol;
        Stock getStock = new Stock(stockSearchUrl, 
                         xmlConfiguration.Pattern);
        Console.WriteLine(stockSymbol + " " + getStock.GetRating);
    }
}
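
The loop above prints every rating, while John's rule is to skip anything rated below 8. A minimal sketch of how that rule might be applied is shown here. The RatingFilter class and its method names are hypothetical and are not part of the posted source; the sketch also assumes the rating arrives as plain digit text, which is what the Stock class is meant to return.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the posted source.
public static class RatingFilter
{
    // Returns true when the rating text parses as an integer at or above the threshold.
    // Text that does not parse (for example, an empty string when the pattern
    // found nothing) is rejected.
    public static bool MeetsThreshold(string ratingText, int minimumRating)
    {
        int rating;
        return int.TryParse(ratingText, out rating) && rating >= minimumRating;
    }

    // Keeps only the symbols whose rating meets the threshold.
    public static List<string> SelectSymbols(
        IDictionary<string, string> ratingsBySymbol, int minimumRating)
    {
        List<string> selected = new List<string>();
        foreach (KeyValuePair<string, string> entry in ratingsBySymbol)
        {
            if (MeetsThreshold(entry.Value, minimumRating))
            {
                selected.Add(entry.Key);
            }
        }
        return selected;
    }
}
```

Inside the foreach loop, a call such as RatingFilter.MeetsThreshold(getStock.GetRating, 8) could decide whether a line is printed at all.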

The HTTP Parser (StreamReader)

There are a few properties in this class, but I'll cover only the core method. This class needs the System.Net, System.IO, and System.Text namespaces (the last one for Encoding). It's quite simple: the constructor takes a search URL and stores it in a local field exposed through a public property. The Parse method creates a new web request and saves the entire page's HTML into another field, also exposed through a public property.

public HTTPParser(string url)
{
    this.url = url;
}

public void Parse()
{
    HttpWebRequest myWebRequest = (HttpWebRequest)WebRequest.Create(Url);
    //The using blocks release the response, stream, and reader
    //even if reading the page throws an exception
    using (HttpWebResponse siteResponse = (HttpWebResponse)myWebRequest.GetResponse())
    using (Stream streamResponse = siteResponse.GetResponseStream())
    using (StreamReader reader = new StreamReader(streamResponse, Encoding.UTF8))
    {
        //Save the entire page's HTML
        Html = reader.ReadToEnd();
    }
}
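
Parse lets any network failure propagate to the caller, so one bad symbol or a dropped connection would stop the whole run. The following is a rough sketch of a more forgiving fetch, assuming it is acceptable to skip a page that cannot be retrieved. SafeFetcher and TryFetch are hypothetical names and are not part of the posted source.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical wrapper, not part of the posted source.
public static class SafeFetcher
{
    // Returns true and the page text on success.
    // Returns false and an empty string when the URL is malformed,
    // the scheme is unsupported, or the request fails.
    public static bool TryFetch(string url, out string html)
    {
        html = string.Empty;
        try
        {
            WebRequest request = WebRequest.Create(url);
            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                html = reader.ReadToEnd();
                return true;
            }
        }
        catch (UriFormatException) { return false; }   // malformed URL
        catch (NotSupportedException) { return false; } // unregistered URL scheme
        catch (WebException) { return false; }          // network or HTTP error
        catch (IOException) { return false; }           // connection dropped mid-read
    }
}
```

With something like this in place, the caller could print a placeholder for a symbol whose page failed to load and carry on with the rest of the list.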

The Stock Class

Ideally, this class would hold the stock's name, symbol, price, rating, and so on, but we're keeping this simple, so it exists only to represent a stock and its rating. Its purpose is to extract the stock's rating from the HTML string fetched by the HTTP parser.

class Stock
{
    HTTPParser msnStockStream;
    private string cachedMsnStream = string.Empty;
    private string rating = string.Empty;
    private string RegExPattern = string.Empty;
    
    public string GetRating
    {
        get{ return rating; }
    }

    public Stock(string url, string RegExPattern)
    {
        this.RegExPattern = RegExPattern;
        //Create the instance of the Parser
        msnStockStream = new HTTPParser(url);
        //kick off the HTTPParser to get the page
        msnStockStream.Parse();
        //Save the HTML in a local variable
        cachedMsnStream = msnStockStream.GetHtml;

        getRating();
    }

    private void getRating()
    {
        //Match our Pattern
        rating = Regex.Match(cachedMsnStream, 
                             RegExPattern).Value;
        rating = cleanUnwantedChars(rating);
    }
    private string cleanUnwantedChars(string matchedString)
    {
        //custom function to remove the extra quotes
        return matchedString.Replace("\"", "");
    }
}
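
The real pattern lives in settings.xml and is not shown here, so the following is only an illustration of the matching step. The sample markup and both patterns used in the note below are invented for the example and do not reflect the actual MSN page. The first method mirrors the approach above (take the whole match, then strip the quotes); the second shows an alternative in which a named group captures just the digits, so no clean-up is needed. RatingExtractor and its method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the posted source.
public static class RatingExtractor
{
    // Mirrors the Stock class: take the whole match, then remove the quotes.
    // Returns an empty string when the pattern does not match.
    public static string FromWholeMatch(string html, string pattern)
    {
        return Regex.Match(html, pattern).Value.Replace("\"", "");
    }

    // Alternative: the pattern is assumed to define a group named "rating"
    // that captures only the digits.
    // Returns an empty string when the pattern does not match.
    public static string FromNamedGroup(string html, string pattern)
    {
        Match match = Regex.Match(html, pattern);
        return match.Success ? match.Groups["rating"].Value : string.Empty;
    }
}
```

For the made-up fragment <input name="rating" value="8" />, the whole-match pattern "\d{1,2}" (quotes included) and the named-group pattern value="(?<rating>\d{1,2})" both yield 8.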

The Settings Class

The Settings class encapsulates our settings.xml file, so System.Xml is used in only this one place and the rest of the application reads its settings through this class. It has only one method; the rest of the code (in the posted source) exists only to provide friendly getters.

private void Read()
{
    symbols = new ArrayList();
    //Path.Combine inserts the directory separator for us
    string xmlFilePath = System.IO.Path.Combine(
                         System.IO.Directory.GetCurrentDirectory(), "settings.xml");
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(xmlFilePath);
    XmlNode stockSettings = xmlDoc.SelectSingleNode("//Stocks");

    for (int i = 0; i < stockSettings.ChildNodes.Count; i++)
      symbols.Add(stockSettings.ChildNodes[i].InnerText);
    
    XmlNode RegExSettings = xmlDoc.SelectSingleNode("//RegEx");
    pattern = RegExSettings.ChildNodes[0].InnerText;
    baseUri = RegExSettings.ChildNodes[1].InnerText;
}
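
Since settings.xml itself is only available in the project files, here is a guess at a layout that would satisfy Read(). Only the Stocks and RegEx element names come from the code above; the root element, the child element names, the symbols, the pattern, and the URL are all placeholders invented for this sketch. SettingsSketch is a hypothetical class that reads the document the same way Read() does.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical sketch, not part of the posted source.
public static class SettingsSketch
{
    // Assumed layout: only "Stocks" and "RegEx" are taken from Read();
    // every other name and value here is a placeholder.
    public const string SampleXml =
        "<Settings>" +
          "<Stocks><Symbol>AAA</Symbol><Symbol>BBB</Symbol></Stocks>" +
          "<RegEx>" +
            "<Pattern>\"\\d{1,2}\"</Pattern>" +
            "<BaseUrl>http://example.com/quote?symbol=</BaseUrl>" +
          "</RegEx>" +
        "</Settings>";

    public static XmlDocument Load(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc;
    }

    // Every child of <Stocks> is treated as one symbol, as in Read().
    public static List<string> ReadSymbols(XmlDocument doc)
    {
        List<string> symbols = new List<string>();
        foreach (XmlNode node in doc.SelectSingleNode("//Stocks").ChildNodes)
        {
            symbols.Add(node.InnerText);
        }
        return symbols;
    }

    // Read() relies on position: first child is the pattern, second is the base URL.
    public static string ReadPattern(XmlDocument doc)
    {
        return doc.SelectSingleNode("//RegEx").ChildNodes[0].InnerText;
    }

    public static string ReadBaseUrl(XmlDocument doc)
    {
        return doc.SelectSingleNode("//RegEx").ChildNodes[1].InnerText;
    }
}
```

Because the pattern and URL are looked up by position rather than by name, swapping the two children of RegEx in the file would silently swap the values.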

Conclusion

In five minutes or less, John Q. Public was able to write a few simple C# classes and automate a process that has eaten away many hours of his life! Because of this improvement in efficiency, he now has more time to ignore his ringing telephone at work and pretend to be busy!

The Net.HttpWebRequest, Net.HttpWebResponse, IO.Stream, IO.StreamReader, and RegularExpressions.Regex classes seem like a match made in heaven when used together.

Code responsibly!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
