Using Information Retrieval Techniques as an Alternative to Commercial Web Services

14 Aug 2008, CPOL, 4 min read
In this article, I discuss an example of how you can use information retrieval to grab data out of MSN Money pages to get a free Web Service for currency exchange rates and other quotes.

CurrencyExchangeDemo

Introduction

Tired of searching for a free currency exchange and stock info Web Service? Here is a solution. I came up with this idea after failing to find a good, free Web Service that provides currency exchange rates and other stock info.

Information retrieval helps us get at the information hidden inside HTML pages and put it into a standard format that is easier for us to use.

In this example, I present a class called DataGrabber, which I use to retrieve selected information from MSN Money's page. Let's have a look at this page first and determine the areas of interest inside it.

[Figure: points.JPG, the MSN Money quote page with the areas of interest marked]

I have marked some contents inside this web page: rate, direction arrow, change, change ratio, and currency converter. These are services that DataGrabber will provide by parsing the page's HTML code. Additionally, we should be able to get the page's title to accurately specify the quote's name and determine if a symbol was not found.

Getting the Page HTML

First, we have to get the HTML of the desired page. MSN's page URL looks like this: http://moneycentral.msn.com/detail/stock_quote?Symbol=SYMBOL&FormatAs=Index.

All we need is to specify the symbol for which we are getting the information and send the request to MSN using the WebClient object.

VB
'Requires: Imports System.IO and Imports System.Net
'Assuming Symbol property is defined
Private Const URL As String = "http://moneycentral.msn.com/detail/stock_quote?" _
                              & "Symbol={0}&FormatAs=Index"

Private Function GetPageCode() As String
    'Using blocks close and dispose each object, even if an exception is thrown
    Using c As New WebClient()
        Using data As Stream = c.OpenRead(String.Format(URL, Symbol))
            Using reader As New StreamReader(data)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Using
End Function

'Backing field that keeps the HTML alive inside the object
Private _page As String

'The page is downloaded on first access only
Private ReadOnly Property Page() As String
    Get
        If _page Is Nothing Then _page = GetPageCode()
        Return _page
    End Get
End Property

'To refresh the data on the next query, just set _page to Nothing
Public Sub RefreshData()
    _page = Nothing
End Sub

Now we have the HTML code we need. Let's start parsing.

Simple HTML Parsing using String Functions

Before going into the HTML parsing functions, let's review some basic functions in the String class. These functions are very useful in our case.

  • IndexOf(s As String): This function returns the index of the first appearance of s, returning -1 if s was not found.
  • IndexOf(s As String, startIndex As Integer): Like the previous one, but starts from startIndex instead of the beginning of the string.
  • LastIndexOf(s As String): Returns the index of the last appearance of s, and returns -1 if s was not found.
  • Substring(startIndex As Integer, length As Integer): Returns the substring starting from startIndex with a length of length.
  • ToLower(): Returns the whole string in lower case.
  • Trim(): Removes white-space characters from the beginning and the end of the string.
  • Trim(ParamArray trimChars() As Char): Removes all specified characters from the beginning and end of the string.
  • StartsWith(s As String): Returns true if the string begins with s.
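To make the list above concrete, here is a small sketch that runs these functions against a made-up string. The Sample constant is a hypothetical piece of markup, not something taken from the MSN page.

```vb
Imports System

Module StringFunctionsDemo
    'Hypothetical sample markup, not taken from the real page
    Public Const Sample As String = "<b>one</b> <b>two</b>"

    Sub Main()
        Console.WriteLine(Sample.IndexOf("<b>"))         'prints 0 (first occurrence)
        Console.WriteLine(Sample.IndexOf("<b>", 1))      'prints 11 (search starts at index 1)
        Console.WriteLine(Sample.LastIndexOf("<b>"))     'prints 11 (last occurrence)
        Console.WriteLine(Sample.IndexOf("<i>"))         'prints -1 (not found)
        Console.WriteLine(Sample.Substring(3, 3))        'prints one
        Console.WriteLine(Sample.ToLower())              'unchanged here, already lower case
        Console.WriteLine("  12.5% ".Trim().Trim("%"c))  'prints 12.5
        Console.WriteLine("/USDEUR".StartsWith("/"))     'prints True
    End Sub
End Module
```

Note the "%"c literal: the c suffix makes it a Char, which is what Trim expects when Option Strict is on.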

Getting the Page Title

This read-only property parses the HTML code looking for the <title></title> tag to get the page's title which includes the details about the current quote.

VB
Public ReadOnly Property Title() As String
    Get
        Dim i1, i2 As Integer
        'Remember to add the length of the tag itself
        'in order to get rid of it
        i1 = Page.ToLower.IndexOf("<title>") + 7
        i2 = Page.ToLower.IndexOf("</title>")
        Return Page.Substring(i1, i2 - i1)
    End Get
End Property
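The Title property assumes that both tags are present. If IndexOf returns -1 (for example, when the request returns an error page), the Substring call receives bad indexes. The following is a minimal sketch of a guarded variant; ExtractBetween and the HtmlText module are hypothetical names, not part of the original DataGrabber class.

```vb
Imports System

Module HtmlText
    'Hypothetical helper: returns the text between startTag and endTag,
    'or Nothing when either marker is missing.
    Public Function ExtractBetween(ByVal html As String, ByVal startTag As String, _
                                   ByVal endTag As String) As String
        'Case-insensitive search, so no lower-case copy of the page is needed
        Dim i1 As Integer = html.IndexOf(startTag, StringComparison.OrdinalIgnoreCase)
        If i1 = -1 Then Return Nothing

        'Skip past the opening tag itself
        i1 += startTag.Length
        Dim i2 As Integer = html.IndexOf(endTag, i1, StringComparison.OrdinalIgnoreCase)
        If i2 = -1 Then Return Nothing

        Return html.Substring(i1, i2 - i1)
    End Function

    Sub Main()
        'Made-up page content used only for this demonstration
        Dim page As String = "<html><head><TITLE>Sample quote</TITLE></head></html>"
        Console.WriteLine(ExtractBetween(page, "<title>", "</title>"))
    End Sub
End Module
```

A caller can then treat a Nothing result as "tag not found" instead of catching an exception.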

Getting the Rate

The rate value is placed inside a span with a CSS class called s1. So we search for this span and read the value stored in it. We use the same method we used to get the title.

VB
Private Const S1 As String = "<span class=""s1"">" 'CSS class for rate
Public ReadOnly Property Rate() As Double
    Get
        Dim i1, i2 As Integer
        i1 = Page.ToLower.IndexOf(S1) + S1.Length
        i2 = Page.ToLower.IndexOf("</span>", i1)
        Dim d As Double = CDbl(Page.Substring(i1, i2 - i1))
        Return d
    End Get
End Property
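One thing to keep in mind: CDbl converts text using the regional settings of the machine, so a value such as 1.4925 may be misread or rejected where the decimal separator is a comma. Here is a hedged sketch of a culture-independent alternative; TryParseRate is a hypothetical helper, and allowing thousands separators is my assumption about what the page may contain.

```vb
Imports System
Imports System.Globalization

Module RateParsing
    'Hypothetical helper: parses text such as "1.4925" regardless of the
    'regional settings of the machine running the code.
    Public Function TryParseRate(ByVal text As String, ByRef value As Double) As Boolean
        Return Double.TryParse(text.Trim(), _
                               NumberStyles.Float Or NumberStyles.AllowThousands, _
                               CultureInfo.InvariantCulture, value)
    End Function

    Sub Main()
        Dim rate As Double
        If TryParseRate(" 1.4925 ", rate) Then
            Console.WriteLine(rate.ToString(CultureInfo.InvariantCulture))
        Else
            Console.WriteLine("Not a number")
        End If
    End Sub
End Module
```

Returning False instead of throwing also gives a simple way to detect that the span did not contain a number.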

Getting the Change and Change Ratio

Change can be UP, DOWN, or UNCH (unchanged). We have to know which one we are dealing with before trying to read the value. We can tell by looking for images/up.gif or images/down.gif in the page. If neither exists, we return 0 (unchanged).

VB
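'UP, DOWN, S4, and S5 are String constants defined elsewhere in DataGrabber (not shown here):
'UP and DOWN are the arrow image paths mentioned above; S4 and S5 are assumed to be span tags like S1
Public ReadOnly Property Change() As Double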
    Get
        Dim s As String
        If Page.ToLower.IndexOf(UP) <> -1 Then
            'Change is UP
            s = S4
        ElseIf Page.ToLower.IndexOf(DOWN) <> -1 Then
            'Change is DOWN
            s = S5
        Else
            'No change
            Return 0
        End If

        Dim i1, i2 As Integer
        'The location of change span is just after rate span, so we
        'start searching from there by using S1 CSS class
        i1 = Page.ToLower.IndexOf(s, Page.ToLower.IndexOf(S1)) + s.Length
        i2 = Page.ToLower.IndexOf("</span>", i1)
        Dim d As Double = CDbl(Page.Substring(i1, i2 - i1))
        Return d
    End Get
End Property 

Public ReadOnly Property ChangeRatio() As Double
    Get
        Dim s As String
        If Page.ToLower.IndexOf(UP) <> -1 Then
            'Change is UP
            s = S4
        ElseIf Page.ToLower.IndexOf(DOWN) <> -1 Then
            'Change is DOWN
            s = S5
        Else
            'No change
            Return 0
        End If

        Dim i1, i2 As Integer
        'Change ratio is the last S4 or S5 span in the page, so we
        'search for the last index of the tag.
        i1 = Page.ToLower.LastIndexOf(s) + s.Length
        i2 = Page.ToLower.IndexOf("</span>", i1)
        'Strip the percent sign before converting ("%"c is a Char literal)
        Dim d As Double = CDbl(Page.Substring(i1, i2 - i1).Trim("%"c))
        Return d
    End Get
End Property
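Both properties repeat the same direction test. As a rough sketch, that test could live in one place. The Direction enum and the Detect function below are hypothetical names; the image paths are the ones mentioned in the text above.

```vb
Imports System

Module ChangeDirection
    Public Enum Direction
        Unchanged
        Up
        Down
    End Enum

    'Decides the direction by looking for the arrow image in the page markup
    Public Function Detect(ByVal html As String) As Direction
        If html.IndexOf("images/up.gif", StringComparison.OrdinalIgnoreCase) <> -1 Then
            Return Direction.Up
        End If
        If html.IndexOf("images/down.gif", StringComparison.OrdinalIgnoreCase) <> -1 Then
            Return Direction.Down
        End If
        Return Direction.Unchanged
    End Function

    Sub Main()
        'Made-up markup used only for this demonstration
        Console.WriteLine(Detect("<img src=""images/down.gif"">"))
    End Sub
End Module
```

With something like this, Change and ChangeRatio would only need to map the detected direction to the matching span tag.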

Getting the List of Currencies with Exchange Rates

The last part of the code I will explain performs the exchange rate calculation between currencies based on their exchange rates against USD. This information is stored in JavaScript code rather than HTML, which makes parsing it much easier.

First, let's look at the format in which the currency name, symbol, and value (against USD) are stored. Here is part of the long list you will find in the MSN Money page's HTML code:

JavaScript
curUSD2X['AED'] = new currency(0.272279262542725, 'Emirati Dirham');
curUSD2X['ARS'] = new currency(0.330906689167023, 'Argentine Peso');
curUSD2X['AUD'] = new currency(0.870776772499084, 'Australian Dollar');
curUSD2X['BHD'] = new currency(2.65561938285828, 'Bahraini Dinar');

It is obvious that we can split each line as follows. These constant strings can be used with the IndexOf() and Substring() functions, as we have seen before, to extract the values we need from each line.

VB
'Currency exchange rates are found in the following format in MSN's page
'where {0} is the symbol, {1} is the value, and {2} is the name
'curUSD2X['{0}'] = new currency({1}, '{2}');
Private Const CUR0 As String = "curUSD2X['"
Private Const CUR1 As String = "'] = new currency("
Private Const CUR2 As String = ", '"
Private Const CUR3 As String = "')"

Note that this list appears in a page's HTML only when the symbol is an exchange rate symbol (e.g., /USDEUR, /ILSUSD), and not for other symbols (like $INDU, MSFT, or -CL).

To ease dealing with currency info, we define the Currency class as follows.

VB
Public Class Currency
    Private _name As String
    Private _symbol As String
    Private _amount As Double

    Public Sub New(ByVal symbol As String, _
           ByVal amount As Double, ByVal name As String)
        Me.Symbol = symbol
        Me.Amount = amount
        Me.Name = name
    End Sub

    'Typed as String, like the backing field
    Public Property Name() As String
        Get
            Return _name
        End Get
        Set(ByVal value As String)
            _name = value
        End Set
    End Property

    Public Property Symbol() As String
        Get
            Return _symbol
        End Get
        Set(ByVal value As String)
            _symbol = value
        End Set
    End Property

    Public Property Amount() As Double
        Get
            Return _amount
        End Get
        Set(ByVal value As Double)
            _amount = value
        End Set
    End Property

    Public Shared Function Convert(ByVal fromCur As Currency, _
        ByVal toCur As Currency, ByVal amount As Double) As Double
        Dim result As Double
        result = fromCur.Amount * (1 / toCur.Amount)
        Return result * amount
    End Function

End Class

'This comparer will help us in sorting...
Public Class CurrencyNameComparer
    Implements IComparer(Of Currency)

    Public Function Compare(ByVal x As Currency, ByVal y As Currency) _
        As Integer Implements System.Collections.Generic.IComparer(Of Currency).Compare
        Return String.Compare(x.Name, y.Name)
    End Function
End Class
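To see what Currency.Convert computes, here is a small worked sketch using two of the values from the JavaScript list above. It assumes, as the formula implies, that each value is the price of one unit of the currency in USD. ConvertAmount is a hypothetical stand-alone copy of the same arithmetic.

```vb
Imports System

Module ConversionExample
    'Same arithmetic as Currency.Convert, with both values expressed in USD per unit
    Public Function ConvertAmount(ByVal fromUsdValue As Double, ByVal toUsdValue As Double, _
                                  ByVal amount As Double) As Double
        Return fromUsdValue * (1 / toUsdValue) * amount
    End Function

    Sub Main()
        Dim aed As Double = 0.272279262542725   'Emirati Dirham, from the list above
        Dim aud As Double = 0.870776772499084   'Australian Dollar, from the list above

        '100 AED is first worth about 27.23 USD, which is then divided by the AUD value
        Console.WriteLine(Math.Round(ConvertAmount(aed, aud, 100), 2))
    End Sub
End Module
```

With these sample values, 100 AED comes out at roughly 31.27 AUD.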

Back to DataGrabber; define the following property which will help us in determining the type of the symbol.

VB
Public ReadOnly Property IsCurrency() As Boolean
    Get
        Return Symbol.StartsWith("/")
    End Get
End Property

Finally, this is the function which will iterate over all lines of currency rates inside a page's HTML and return a filled list of currencies.

VB
Public Function GetCurrencyList() As List(Of Currency)
    If Not IsCurrency Then
        Throw New ArgumentException("This works only for currency exchange symbols")
    End If

    Dim l As New List(Of Currency)
    Dim s As String
    Dim startIndex As Integer = Page.IndexOf(CUR0)
    Dim endIndex As Integer = Page.IndexOf("}", startIndex)
    s = Page.Substring(startIndex, endIndex - startIndex)
    'Each statement ends with a semicolon (";"c is a Char literal)
    Dim lines() As String = s.Split(";"c)

    For Each str As String In lines
        str = str.Trim
        If Not str.StartsWith(CUR0) Then Exit For

        Dim nam, sym As String
        Dim amt As Double
        Dim i1, i2 As Integer

        i1 = str.IndexOf(CUR0) + CUR0.Length
        i2 = str.IndexOf(CUR1)
        sym = str.Substring(i1, i2 - i1)

        i1 = str.IndexOf(CUR1) + CUR1.Length
        i2 = str.IndexOf(CUR2)
        amt = CDbl(str.Substring(i1, i2 - i1))

        i1 = str.IndexOf(CUR2) + CUR2.Length
        i2 = str.IndexOf(CUR3)
        nam = str.Substring(i1, i2 - i1)

        Dim cur As New Currency(sym, amt, nam)
        l.Add(cur)
    Next

    l.Sort(New CurrencyNameComparer)
    Return l
End Function

Wait! What if Microsoft Changes the HTML Code?

It is obvious from the beginning that the whole solution is highly vulnerable to breaking if Microsoft changes the HTML code of its MSN Money page. To reduce this risk, I tried to bind the parsing to the CSS classes, which are more likely to remain constant: Microsoft may change the definition of a class but is less likely to change its name, even if the whole design of the page is altered. Anyway, this is a general solution that conveys the idea; you may try to write a more sophisticated one that uses Regular Expressions and parsing rules stored in an external file, so you can keep up with future changes without rewriting or recompiling your code.
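As a rough idea of what the Regular Expressions approach might look like, the sketch below parses the currency lines with a single pattern. The pattern is my own guess based on the four sample lines shown earlier, and the CurrencyRegex module and ParseRates function are hypothetical names. In a real solution, the pattern string could be loaded from an external file instead of being hard-coded.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text.RegularExpressions

Module CurrencyRegex
    'Assumed pattern for lines such as:
    'curUSD2X['AED'] = new currency(0.27, 'Emirati Dirham');
    Private ReadOnly LinePattern As New Regex( _
        "curUSD2X\['(?<sym>[^']+)'\]\s*=\s*new currency\((?<amt>[0-9]+(?:\.[0-9]+)?),\s*'(?<name>[^']*)'\)")

    'Returns a symbol-to-value map for every line that matches the pattern
    Public Function ParseRates(ByVal script As String) As Dictionary(Of String, Double)
        Dim rates As New Dictionary(Of String, Double)
        For Each m As Match In LinePattern.Matches(script)
            rates(m.Groups("sym").Value) = _
                Double.Parse(m.Groups("amt").Value, CultureInfo.InvariantCulture)
        Next
        Return rates
    End Function

    Sub Main()
        Dim script As String = _
            "curUSD2X['AED'] = new currency(0.272279262542725, 'Emirati Dirham');" & _
            "curUSD2X['AUD'] = new currency(0.870776772499084, 'Australian Dollar');"
        For Each pair As KeyValuePair(Of String, Double) In ParseRates(script)
            Console.WriteLine(pair.Key & " = " & pair.Value.ToString(CultureInfo.InvariantCulture))
        Next
    End Sub
End Module
```

Lines that do not match are simply skipped, so a small layout change produces an empty or shorter result rather than an exception. Whether that is preferable depends on how the caller checks the result.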

For more articles, please visit my blog at http://vbnet4arab.blogspot.com (in Arabic only).

Happy coding!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)