#Region " RegExFeedParser "
The major task for coders who deal with syndication is writing XML parsers. With so many implementations using a variety of approaches and methodologies, we might assume the topic is covered. Unfortunately, it's not. Different formats with different versions become a nightmare, as it's obvious we will have to rewrite, expand, or upgrade our parsers in the future. Let's try to solve the case for good (hopefully) by approaching feeds in a way that is independent of the XML machinery: a cross-format/version parser built on Regular Expressions.
The Facts
Two popular formats are in use (for the moment): RSS and Atom. Additionally, OPML is directly connected with feeds.
Each format has versions with different XML schemas that are "alive", as they are still used by feed providers. As an example, in RSS the root element may be rdf or rss, item nodes may be children of the root or of the channel node, and title and link nodes are common to all versions, in contrast with description and pubDate.
XML nodes for the same kind of information, beyond their different names, may keep their content in XML attributes. As an example, URL information is the link node value for RSS but the href attribute value of a link node for Atom, and publication date information can be found in pubDate nodes for RSS and in issued or published nodes for Atom.
Information may be plain text or (X)HTML.
Regular Expression Patterns
Our first conclusion is that we need a total of three Regular Expression patterns for feed parsing.
The first one defines the existence of a node in a way that covers any valid node syntax:
<name\b[^>]*?(/>|>(\w|\W)*?</name>)
where the sequence "name" represents a node name.
The second one defines the existence of an attribute:
\bname="[^"]*?"
where the sequence "name" represents an attribute name.
The last one defines the stripping of tagged elements:
</{0,1}name[^>]*?>
where, again, the sequence "name" represents a node name.
The above patterns (believe it or not) cover all our needs for parsing a feed document. To analyze their structure, we can use any online or offline Regular Expression analyzer.
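To see the first two patterns at work before any theory, here is a minimal sketch; the XML fragments and values are illustrative only.
Imports System.Text.RegularExpressions

' A minimal demonstration of the node and attribute patterns.
Const nodePattern As String = "<name\b[^>]*?(/>|>(\w|\W)*?</name>)"
Const attrPattern As String = "\bname=""[^""]*?"""

' RSS keeps the URL as the node value ...
Dim rss As String = "<item><link>http://example.com/a</link></item>"
Dim m As Match = Regex.Match(rss, nodePattern.Replace("name", "link"))
Debug.WriteLine(m.Value) ' <link>http://example.com/a</link>

' ... while Atom keeps it in the href attribute of a (self-closing) link node.
Dim atom As String = "<entry><link href=""http://example.com/a"" /></entry>"
m = Regex.Match(atom, nodePattern.Replace("name", "link"))
m = Regex.Match(m.Value, attrPattern.Replace("name", "href"))
Debug.WriteLine(m.Value) ' href="http://example.com/a"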
Logic vs. Logic
The decision to parse a feed document using Regular Expression patterns means that we need the document content as a string. So it becomes very easy to write something like:
Dim doc As New XmlDocument
doc.Load(path)
Dim content As String = doc.OuterXml
The Load method is not so direct. In reality, it uses an XmlTextReader in the background as an argument to an XmlLoader, which reconstructs the whole XML document from scratch via its Load method (defining the settings) and finally calls the LoadDocSequence method, which reads nodes sequentially and uses the XmlNode class for child nodes via its AppendChildForNode method... Finally, the OuterXml property (inherited from XmlNode) returns a string representation of the XmlDocument using a StringWriter and an XmlDOMTextWriter... Do we need all this?
We all know that the above approach has the advantage of ensuring we get back a valid XML document, but the "but" case is more important. XML documents are tagged documents, just like HTML documents, and the Web has a significant number of badly formatted feeds! Can you imagine a browser that stops rendering because the document misses a </td> and raises an exception alert? Our predefined Regular Expression patterns ensure that only valid parts of feed documents return results, with no need for exception handling: any problematic part is simply skipped, as it will not be included in the Regular Expression matches.
As an alternative, we can use System.IO.File for local documents and System.Net.WebClient for web documents, as follows:
If path.StartsWith("http:", StringComparison.CurrentCultureIgnoreCase) Then
    Using down As New WebClient
        down.Encoding = Encoding.UTF8
        down.Proxy = WebRequest.DefaultWebProxy
        content = down.DownloadString(path)
    End Using
Else
    content = File.ReadAllText(path)
End If
System.Net.WebClient also lets us directly use properties like Credentials or UseDefaultCredentials for authenticating our request, Headers and ResponseHeaders, Proxy for adding proxy support ...
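A sketch of these extras (the URL and header values are illustrative only):
Using down As New WebClient
    down.Encoding = Encoding.UTF8
    down.UseDefaultCredentials = True                 ' or down.Credentials = ...
    down.Proxy = WebRequest.DefaultWebProxy
    down.Headers.Add(HttpRequestHeader.UserAgent, "RegExFeedParser")
    Dim content As String = down.DownloadString("http://example.com/feed.xml")
    Debug.WriteLine(down.ResponseHeaders(HttpResponseHeader.ContentType))
End Using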
Case Analysis
Our goal is to develop a "global" parser with flexibility for future formats and versions, or even for other types of XML documents with simple schemas (where "simple" refers to schemas of similar or lower complexity compared to RSS and Atom). So we need a function that performs the parsing task. Let's name it Magic. Magic must have the ability to return easily accessible results for different feed formats and versions; in other words, there is a need for a common reference to results. An easy approach is to retrieve results as key/value pairs, so it's time for an initial scenario that will help us move on.
We assume that we care about RSS and Atom feed information related to "title", "description", "URL" (of text/HTML pages), and "publication date". A string array (named keys) can represent this assumption code-wise.
In order to retrieve the relative information, we must define information containers depending on the feed format. So, for RSS feeds, we have another string array (named fields) for the title, link, description, and pubDate nodes. Atom is different, as the information container may be an attribute of a node. Using the "Commercial At" character (@) as a semantic that marks this special case, we come up with a string array (fields) for title, link@href, content, and issued or published, following the node@attribute notation.
What is missing is the top container of the information (named container), which is item for RSS and entry for Atom.
While it is always possible to have results as plain text or (X)HTML, it is a good idea to use a boolean flag (named allowHTML), allowing Magic to optionally remove tags from the results.
We have already made good progress, and our final step is to decide how Magic will return results. The usage of key/value pairs points us to a Hashtable (if we want to execute the ToString method from 1 to n times) or a System.Collections.Specialized.NameValueCollection (names -or keys- and values are strings by definition) or... but how do we handle multiple results and avoid possible exceptions for keys already in use (the Hashtable case) or confusingly merged value arrays (the NameValueCollection case)? One easy answer, with one easy-to-remember word, is ArrayList.
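A quick sketch of the trade-off just described (illustrative values):
Dim ht As New Hashtable
ht.Add("title", "first")
' ht.Add("title", "second")  ' throws ArgumentException: key already in use

Dim nvc As New NameValueCollection
nvc.Add("title", "first")
nvc.Add("title", "second")
Debug.WriteLine(nvc("title")) ' "first,second" - values get merged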
It is more than clear now that Magic will be a function with content (string), container (string), keys (string array), fields (string array), and allowHTML (boolean) as arguments, returning an ArrayList. Another useful argument is maxResults (integer), as it helps with the "top n" logic. Assuming that Magic is a method of a class (named RegExFeedParser), let's mix code and English in order to describe the how-to of the Magic function:
All predefined Regular Expression patterns are private constants of RegExFeedParser, named in presentation order: containerPattern, attributePattern, and stripPattern.
Argument exception handling can be considered optional, due to the fact that the arguments are based on our tested cases and not on user input. In any case, the basic rules are: no empty content or container arguments, and the same array Length (greater than zero) for the keys and fields arguments. A minimal sketch of these guards follows the walkthrough below.
We declare results as a new ArrayList.
We declare items as a Regular Expressions MatchCollection for the matches of the content argument combined with the containerPattern constant.
For Each item In the items collection
    We declare pairs As a New NameValueCollection
    For Each field In the fields argument
        We declare with direct assignment the Integer index of the field In the fields argument
        We declare with direct assignment an empty String value
        We declare with direct assignment a mask String equal to field
        We declare with direct assignment the Integer position (pos) of the character @ in field
        We apply an If condition for pos greater than -1
            We set the mask equal to the left sub part of field up to pos
        We close the condition block
        We define found As the Regular Expression Match of item combined with the containerPattern
        We apply an If condition for pos greater than -1 AndAlso found Not empty
            We set the mask equal to the right sub part of field starting from pos increased by 1
            We set found equal to the Regular Expression Match of found combined with the attributePattern (mask being the attribute name)
        We close the condition block
        We apply an If condition for found Not empty
            We apply an inner If condition for pos greater than -1
                We set value equal to found.Value modified by string replacements
            Inner condition Else
                We set value equal to found.Value modified by string replacements via Regex
            We close the inner condition block
            We unwrap the value from a CDATA section, if present
            We apply an inner If condition for False allowHTML
                We remove HTML tags from value with string replacements
            We close the inner condition block
            We add the pair keys(index) and value to the pairs collection
        We close the condition block
    We proceed to the Next field
    We add the pairs collection to the results
    We stop parsing If the results quantity equals maxResults
We proceed to the Next item
We return the results ArrayList
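As promised, a minimal sketch of the optional argument guards; these lines would sit at the top of Magic, and the exception messages are illustrative:
If String.IsNullOrEmpty(content) OrElse String.IsNullOrEmpty(container) Then
    Throw New ArgumentException("content and container must not be empty")
End If
If keys Is Nothing OrElse fields Is Nothing OrElse _
   keys.Length = 0 OrElse keys.Length <> fields.Length Then
    Throw New ArgumentException("keys and fields must share the same non-zero Length")
End If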
The RegExFeedParser Class
It's much easier to write in VB.NET compared to the above mixed language.
Imports System.Collections.Specialized
Imports System.Text.RegularExpressions
Imports System.Web

Public Class RegExFeedParser

    ' "name" acts as a placeholder that gets replaced at parse time.
    Private Const containerPattern As String = _
        "<name\b[^>]*?(/>|>(\w|\W)*?</name>)"
    Private Const attributePattern As String = "\bname=""[^""]*?"""
    Private Const stripPattern As String = "</{0,1}name[^>]*?>"

    Public Shared Function Magic(ByVal content As String, _
                                 ByVal container As String, _
                                 ByVal keys As String(), _
                                 ByVal fields As String(), _
                                 ByVal maxResults As Integer, _
                                 ByVal allowHTML As Boolean) As ArrayList
        Dim results As New ArrayList
        Dim items As MatchCollection = Regex.Matches(content, _
            containerPattern.Replace("name", container))
        For Each item As Match In items
            Dim pairs As New NameValueCollection
            For Each field As String In fields
                Dim index As Integer = Array.IndexOf(fields, field)
                Dim value As String = String.Empty
                Dim mask As String = field
                Dim pos As Integer = field.IndexOf("@"c)
                If pos > -1 Then
                    mask = field.Substring(0, pos) ' node part of node@attribute
                End If
                Dim found As Match = Regex.Match(item.Value, _
                    containerPattern.Replace("name", mask))
                If pos > -1 AndAlso Not found.Equals(Match.Empty) Then
                    mask = field.Substring(pos + 1) ' attribute part
                    found = Regex.Match(found.Value, _
                        attributePattern.Replace("name", mask))
                End If
                If Not found.Equals(Match.Empty) Then
                    If pos > -1 Then
                        value = found.Value.Replace(mask & "=", String.Empty) _
                            .Replace(Chr(34), String.Empty)
                    Else
                        value = Regex.Replace(found.Value, _
                            stripPattern.Replace("name", field), _
                            String.Empty)
                    End If
                    ' Unwrap CDATA: drop the 3-char suffix and the 9-char prefix.
                    If value.IndexOf("<![CDATA[") = 0 Then
                        value = value.Substring(0, value.Length - 3).Substring(9)
                    End If
                    If allowHTML = False Then
                        value = HttpUtility.HtmlDecode(value)
                        value = Regex.Replace(value, _
                            stripPattern.Replace("name", "br"), _
                            vbCrLf)
                        value = Regex.Replace(value, _
                            stripPattern.Replace("name", ""), _
                            String.Empty)
                        value = value.Replace(" " & vbCrLf, vbCrLf) _
                            .Trim(New Char() {Chr(13), Chr(10)})
                    End If
                    pairs.Add(keys(index), value)
                End If
            Next
            results.Add(pairs)
            If results.Count = maxResults Then Exit For
        Next
        Return results
    End Function

End Class
The usage of Array.IndexOf in the inner For Each loop is not optimal, but it's very effective for reading the listing easily, as the eye focuses only on the variable names. OK, this is not for us, but we are not alone. A faster variant is sketched right below.
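The same inner loop driven by an index instead of Array.IndexOf - a sketch:
For index As Integer = 0 To fields.Length - 1
    Dim field As String = fields(index)
    ' ... the loop body stays exactly the same ...
Next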
HTML content handling is just a sample and can be adapted to more specific requirements.
Having the RegExFeedParser class, we proceed with a draft sample defining a usage approach.
The Test Module
Imports System.Collections.Specialized
Imports System.Net
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module Test

    Sub Main()
        ' Uncomment one line at a time to test (the OPML path is a placeholder).
        'FeedTest("http://rss.slashdot.org/Slashdot/slashdot")   ' RSS
        'FeedTest("http://news.google.com/?output=atom")         ' Atom
        'FeedTest("mySubscriptions.opml")                        ' OPML, local file
    End Sub

    Sub FeedTest(ByVal path As String)
        Dim feedFormat As String = String.Empty
        Dim content As String = FeedContent(path, feedFormat)
        If String.IsNullOrEmpty(content) Then
            Debug.WriteLine(New String("x"c, 80))
            Debug.WriteLine("no content for " & path)
            Debug.WriteLine(New String("x"c, 80))
        Else
            Dim container As String
            Dim keys As String()
            Dim fields As String()
            Dim results As ArrayList
            Dim maxRecords As Integer = 10
            Dim allowHTML As Boolean = True
            Dim isList As Boolean = False
            If feedFormat.StartsWith("rss") OrElse feedFormat.StartsWith("rdf") Then
                container = "item"
                keys = New String() {"title", "url", "description", "date"}
                fields = New String() {"title", "link", "description", "pubDate"}
            ElseIf feedFormat.StartsWith("feed") Then
                container = "entry"
                keys = New String() {"title", "url", "description", "date"}
                fields = New String() {"title", "link@href", "content", _
                                       "(published|issued)"}
            ElseIf feedFormat.StartsWith("opml") Then
                container = "outline"
                keys = New String() {"url"}
                fields = New String() {"outline@xmlUrl"}
                isList = True
            Else
                Debug.WriteLine(New String("x"c, 80))
                Debug.WriteLine("no implementation for " & feedFormat)
                Debug.WriteLine(New String("x"c, 80))
                Exit Sub
            End If
            results = RegExFeedParser.Magic(content, container, keys, fields, _
                                            maxRecords, allowHTML)
            If isList = True Then
                ' List results are feed URLs: parse each one recursively.
                For Each result As NameValueCollection In results
                    FeedTest(result("url"))
                Next
            Else
                Debug.WriteLine(New String("="c, 80))
                Debug.WriteLine("results for: " & path)
                Debug.WriteLine(New String("="c, 80))
                For Each result As NameValueCollection In results
                    For Each key As String In keys
                        Debug.WriteLine(key & ": " & result(key))
                    Next
                Next
            End If
        End If
    End Sub

    Function FeedContent(ByVal path As String, _
                         ByRef feedFormat As String) As String
        Dim content As String
        Try
            If path.StartsWith("http:", _
                               StringComparison.CurrentCultureIgnoreCase) Then
                Using down As New WebClient
                    down.Encoding = Encoding.UTF8
                    content = down.DownloadString(path)
                End Using
            Else
                content = File.ReadAllText(path)
            End If
            ' The name of the last closing tag reveals the document format.
            Dim lastTag As Integer = content.LastIndexOf("</")
            If lastTag > -1 Then
                feedFormat = content.Substring(lastTag + 2)
            End If
            Return content
        Catch ex As Exception
            Debug.WriteLine(path & vbTab & ex.Message)
            Return Nothing
        End Try
    End Function

End Module
Using the Test Code
We uncomment each line of the Main sub (one at a time) to check our Magic function for RSS, Atom, and OPML. In the OPML case, the whole code can be considered an aggregator kernel.
While testing, we may run into unsupported stuff or tricky XML cases. As an example, the RSS version 1.0 feed from http://rss.slashdot.org/Slashdot/slashdot does not provide publication date information in pubDate nodes. If we check its XML, we find that this information is (optionally) included as dc:date metadata instead. What can we do now? We change the fields variable for RSS from:
fields = New String() {"title", "link", "description", "pubDate"}
to:
fields = New String() {"title", "link", "description", "(pubDate|dc:date)"}
and voila! Each field is not just a string; it's a part of a Regular Expression pattern!
In case we are interested in using the code for an aggregator, we will need to retrieve information from the "header" part of the feed; in other words, to parse the XML document in two steps. Let's focus on the "header".
For an RSS feed, we can use the following set of definitions:
maxRecords = 1
container = "channel"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link", "description", "pubDate"}
and for an Atom feed:
maxRecords = 1
container = "feed"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link", "subtitle", "updated"}
The key definition is maxRecords, as some fields can be found more than once in a feed document. A sketch combining the two steps follows.
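A sketch of the two-step parsing for an RSS feed, assuming content already holds the document; the first call reads the "header", the second the items:
Dim header As ArrayList = RegExFeedParser.Magic(content, "channel", _
    New String() {"title", "url", "description", "date"}, _
    New String() {"title", "link", "description", "pubDate"}, _
    1, False)
Dim items As ArrayList = RegExFeedParser.Magic(content, "item", _
    New String() {"title", "url", "description", "date"}, _
    New String() {"title", "link", "description", "(pubDate|dc:date)"}, _
    10, True)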
We can "play" with it forever...
Effective Friendship
An email from a friend, saying that the presentation of the idea up to this point does not reveal its power, is the reason for extending the article with the following content. Thanks, George :)
The Hidden Visibility
We already have enough information to get a good sense of the power of the Regular Expression patterns. The patterns themselves, and the way we use them, allow us to create any generic or targeted feed (and related) solution without limits. In simple English: whenever there is a new implementation or an updated version, we only change variables (or settings, if we keep them externally) instead of rewriting or modifying parsing routines!
Syndication Extensions
The discussed case of using a field as a pattern - "(pubDate|dc:date)" - hides an unexpected fact: fields can take values covering all existing and future syndication extensions (blogChannel, creativeCommons, Dublin Core, etc.), with the demonstrated ability for merged usage with basic feed nodes. Let's see some how-to examples by defining values for keys and fields (in sequential presentation).
blogChannel Extension
Keys: siteBlogRoll, authorSubscriptions, authorLink, changes
Fields: blogChannel:blogRoll, blogChannel:mySubscriptions, blogChannel:blink, blogChannel:changes
blogChannel:blogRoll and blogChannel:mySubscriptions each return the URL of an OPML file (this case is covered in the Test Module code sample)
Creative Commons Extension
Keys: license
Fields: creativeCommons:license
Dublin Core Extension
Keys: title, subject, type, source, relation, creator, date ...
Fields: dc:title, dc:subject, dc:type, dc:source, dc:relation, dc:creator, dc:date ...
Basic Geo (WGS84 lat/long)
Keys: latitude, longitude
Fields: geo:lat, geo:long
... and voila! No code, no new classes, methods ...
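More precisely, "no code" means settings only. As a sketch using the Test module's variable names, the Dublin Core pair becomes:
keys = New String() {"title", "subject", "type", "creator", "date"}
fields = New String() {"dc:title", "dc:subject", "dc:type", _
                       "dc:creator", "dc:date"}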
XML as eXtended Mess Language
The title does not represent the un-contradictable value of XML itself, but the feelings we may have whenever we try to deal with various implementations of the language that hide high complexity. Our approach for parsing has the logic of a base path extended to unique paths. As an example based on RSS, we have the base path: /item extended to /title and /link. The Magic
function can parse /item/title and /item/link paths only once per item. This fact looks like a limitation, but the reality is different.
Let's look again at the usual usage of the containerPattern constant:
<name\b[^>]*?(/>|>(\w|\W)*?</name>)
This pattern is in use for both the container and the fields: by replacing the sequence "name" with the container or a field value, we get the desired tag in return. Once again, we note that any field value can be a Regular Expression on its own.
Now, let's have a look at a blog-related case. By checking the XML document from http://news.google.com/?output=atom, we see that each entry node does not necessarily have a unique link node. Thankfully, each link node describes itself with a rel attribute (a logic similar to metadata), meaning that there is no problem at all.
The field value for unique parsing is "link". Assume that our application needs link information for alternate, replies, self, and edit (the rel attribute values). If we creatively treat the value as a Regular Expression pattern, we can come up with fields like:
link[^>]*(\brel="alternate"){1}
link[^>]*(\brel="replies"){1}
link[^>]*(\brel="self"){1}
link[^>]*(\brel="edit"){1}
Now we apply the @ semantic for defining the attribute that holds the desired information as follows:
link[^>]*(\brel="alternate"){1}@href
link[^>]*(\brel="replies"){1}@href
link[^>]*(\brel="self"){1}@href
link[^>]*(\brel="edit"){1}@href
and voila again! No code, no new classes, methods ...
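In Test module terms, these four fields are again a settings-only change - a sketch:
keys = New String() {"alternate", "replies", "self", "edit"}
fields = New String() {"link[^>]*(\brel=""alternate""){1}@href", _
                       "link[^>]*(\brel=""replies""){1}@href", _
                       "link[^>]*(\brel=""self""){1}@href", _
                       "link[^>]*(\brel=""edit""){1}@href"}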
Non Stop
The above blog-related XML document includes metadata information from the openSearch extension. We use the following pair of keys and fields:
Keys: results, first, count
Fields: openSearch:totalResults, openSearch:startIndex, openSearch:itemsPerPage
and voila again! No code, no new classes, methods ...
Google Sitemaps
Google sitemaps are a good example of using our code for non-feed-related XML documents. We assume that we want to add sitemap support to our application. At this point, we have to extend our Test.FeedTest method in order to cover sitemap and sitemap index files.
First, by copying the condition block for "feed" and pasting it twice, we have:
ElseIf feedFormat.StartsWith("feed") Then
container = "entry"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link@href", "content", _
"(published|issued)"}
ElseIf feedFormat.StartsWith("feed") Then
container = "entry"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link@href", "content", _
"(published|issued)"}
We modify the pasted code as follows:
ElseIf feedFormat.StartsWith("urlset") Then
maxResults = 10000
container = "url"
keys = New String() {"url"}
fields = New String() {"loc"}
ElseIf feedFormat.StartsWith("sitemapindex") Then
container = "sitemap"
keys = New String() {"url"}
fields = New String() {"loc"}
Finally, the isList = True line in the "sitemapindex" block does for a sitemap index exactly what it does for OPML: each result is treated as the URL of another document and is fed back to FeedTest recursively.
Now we are ready to test our application using our own sitemap files or the huge one from Google (almost 4MB, with more than 35,000 records located at http://www.google.com/sitemap.xml)...
...and voila once again! Easy coding only for settings, and no new classes, methods ...
#End Region
All comments, questions, ideas, requests for help ... are very welcome.
Thanks for your time.