#Region " RegExFeedParser "
The major task for coders who deal with syndication is writing XML parsers. With so many implementations using a variety of approaches and methodologies, we might assume the topic is covered. Unfortunately, it's not. Different formats with different versions become a nightmare, as it's obvious we will have to rewrite, expand, or upgrade our parsers in the future. Let's try to solve the case for good (hopefully) by approaching feeds in a way that is independent of the XML machinery: a cross-format/version parser built on Regular Expressions.
The Facts
Two popular formats are in use (for the moment): RSS and Atom. Additionally, OPML is directly connected with feeds.
Each format has versions with different XML schemas that are "alive", as they are still used by feed providers. As an example, in RSS the root element may be rdf or rss, item nodes may be children of the root or of the channel node, and title and link nodes are common to all versions, in contrast with description and pubDate.
XML nodes for the same kind of information, beyond their different names, may keep their content in XML attributes. As an example, URL information is the link node value for RSS but the href attribute value of a link node for Atom, and publication date information can be found in pubDate nodes for RSS and in issued or published nodes for Atom.
Information may be plain text or (X)HTML.
Regular Expression Patterns
Our first conclusion is that we need a total of three Regular Expression patterns for feed parsing.
The first one defines the existence of a node in a way that covers any valid node syntax:
<name\b[^>]*?(/>|>(\w|\W)*?</name>)
where the sequence "name" represents a node name.
The second one defines the existence of an attribute:
\bname="[^"]*?"
where the sequence "name" represents an attribute name.
The last one defines the stripping of tagged elements:
</{0,1}name[^>]*?>
where, again, the sequence "name" represents a node name.
The above patterns (believe it or not) cover all our needs for parsing a feed document. To analyze their structure, we can use any online or offline Regular Expression analyzer.
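To see the first two patterns at work before any theory, here is a minimal sketch; the XML fragments and values are illustrative only.
Imports System.Text.RegularExpressions

' A minimal demonstration of the node and attribute patterns.
Const nodePattern As String = "<name\b[^>]*?(/>|>(\w|\W)*?</name>)"
Const attrPattern As String = "\bname=""[^""]*?"""

' RSS keeps the URL as the node value ...
Dim rss As String = "<item><link>http://example.com/a</link></item>"
Dim m As Match = Regex.Match(rss, nodePattern.Replace("name", "link"))
Debug.WriteLine(m.Value) ' <link>http://example.com/a</link>

' ... while Atom keeps it in the href attribute of a (self-closing) link node.
Dim atom As String = "<entry><link href=""http://example.com/a"" /></entry>"
m = Regex.Match(atom, nodePattern.Replace("name", "link"))
m = Regex.Match(m.Value, attrPattern.Replace("name", "href"))
Debug.WriteLine(m.Value) ' href="http://example.com/a"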
Logic vs. Logic
The decision to parse a feed document using Regular Expression patterns means that we need the document content as a string. So it becomes very easy to write something like:
Dim doc As New XmlDocument
doc.Load(path)
Dim content As String = doc.OuterXml
The Load method is not so direct. In reality, it uses an XmlTextReader in the background as an argument to an XmlLoader, which reconstructs the whole XML document from scratch via its Load method (defining the settings) and finally calls the LoadDocSequence method, which reads nodes sequentially and uses the XmlNode class for child nodes via its AppendChildForNode method... Finally, the OuterXml property (inherited from XmlNode) returns a string representation of the XmlDocument using a StringWriter and an XmlDOMTextWriter... Do we need all this?
We all know that the above approach has the advantage of ensuring we get back a valid XML document, but the "but" case is more important. XML documents are tagged documents, just like HTML documents, and the Web has a significant number of badly formatted feeds! Can you imagine a browser that stops rendering because the document misses a </td> and raises an exception alert? Our predefined Regular Expression patterns ensure that only valid parts of feed documents return results, with no need for exception handling: any problematic part is simply skipped, as it will not be included in the Regular Expression matches.
As an alternative, we can use System.IO.File for local documents and System.Net.WebClient for web documents, as follows:
If path.StartsWith("http:", StringComparison.CurrentCultureIgnoreCase) Then
    Using down As New WebClient
        down.Encoding = Encoding.UTF8
        down.Proxy = WebRequest.DefaultWebProxy
        content = down.DownloadString(path)
    End Using
Else
    content = File.ReadAllText(path)
End If
System.Net.WebClient also lets us directly use properties like Credentials or UseDefaultCredentials for authenticating our request, Headers and ResponseHeaders, Proxy for adding proxy support ...
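A sketch of these extras (the URL and header values are illustrative only):
Using down As New WebClient
    down.Encoding = Encoding.UTF8
    down.UseDefaultCredentials = True                 ' or down.Credentials = ...
    down.Proxy = WebRequest.DefaultWebProxy
    down.Headers.Add(HttpRequestHeader.UserAgent, "RegExFeedParser")
    Dim content As String = down.DownloadString("http://example.com/feed.xml")
    Debug.WriteLine(down.ResponseHeaders(HttpResponseHeader.ContentType))
End Using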
Case Analysis
Our goal is to develop a "global" parser with flexibility for future formats and versions, or even for other types of XML documents with simple schemas (where "simple" refers to schemas of similar or lower complexity compared to RSS and Atom). So we need a function that performs the parsing task. Let's name it Magic. Magic must have the ability to return easily accessible results for different feed formats and versions; in other words, there is a need for a common reference to results. An easy approach is to retrieve results as key/value pairs, so it's time for an initial scenario that will help us move on.
We assume that we care about RSS and Atom feed information related to "title", "description", "URL" (of text/HTML pages), and "publication date". A string array (named keys) can represent this assumption code-wise.
In order to retrieve the relative information, we must define information containers depending on the feed format. So, for RSS feeds, we have another string array (named fields) for the title, link, description, and pubDate nodes. Atom is different, as the information container may be an attribute of a node. Using the "Commercial At" character (@) as a semantic that marks this special case, we come up with a string array (fields) for title, link@href, content, and issued or published, following the node@attribute notation.
What is missing is the top container of the information (named container), which is item for RSS and entry for Atom.
While it is always possible to have results as plain text or (X)HTML, it is a good idea to use a boolean flag (named allowHTML), allowing Magic to optionally remove tags from the results.
We have already made good progress, and our final step is to decide how Magic will return results. The usage of key/value pairs points us to a Hashtable (if we want to execute the ToString method from 1 to n times) or a System.Collections.Specialized.NameValueCollection (names -or keys- and values are strings by definition) or... but how do we handle multiple results and avoid possible exceptions for keys already in use (the Hashtable case) or confusingly merged value arrays (the NameValueCollection case)? One easy answer, with one easy-to-remember word, is ArrayList.
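A quick sketch of the trade-off just described (illustrative values):
Dim ht As New Hashtable
ht.Add("title", "first")
' ht.Add("title", "second")  ' throws ArgumentException: key already in use

Dim nvc As New NameValueCollection
nvc.Add("title", "first")
nvc.Add("title", "second")
Debug.WriteLine(nvc("title")) ' "first,second" - values get merged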
It is more than clear now that Magic will be a function with content (string), container (string), keys (string array), fields (string array), and allowHTML (boolean) as arguments, returning an ArrayList. Another useful argument is maxResults (integer), as it helps with the "top n" logic. Assuming that Magic is a method of a class (named RegExFeedParser), let's mix code and English in order to describe the how-to of the Magic function:
All predefined Regular Expression patterns are private constants of RegExFeedParser, named in presentation order: containerPattern, attributePattern, and stripPattern.
Argument exception handling can be considered optional, due to the fact that the arguments are based on our tested cases and not on user input. In any case, the basic rules are: no empty content or container arguments, and the same array Length (greater than zero) for the keys and fields arguments. A minimal sketch of these guards follows the walkthrough below.
We declare results as a new ArrayList.
We declare items as a Regular Expressions MatchCollection for the matches of the content argument combined with the containerPattern constant.
For Each item In the items collection
    We declare pairs As a New NameValueCollection
    For Each field In the fields argument
        We declare with direct assignment the Integer index of the field In the fields argument
        We declare with direct assignment an empty String value
        We declare with direct assignment a mask String equal to field
        We declare with direct assignment the Integer position (pos) of the character @ in field
        We apply an If condition for pos greater than -1
            We set the mask equal to the left sub part of field up to pos
        We close the condition block
        We define found As the Regular Expression Match of item combined with the containerPattern
        We apply an If condition for pos greater than -1 AndAlso found Not empty
            We set the mask equal to the right sub part of field starting from pos increased by 1
            We set found equal to the Regular Expression Match of found combined with the attributePattern (mask being the attribute name)
        We close the condition block
        We apply an If condition for found Not empty
            We apply an inner If condition for pos greater than -1
                We set value equal to found.Value modified by string replacements
            Inner condition Else
                We set value equal to found.Value modified by string replacements via Regex
            We close the inner condition block
            We unwrap the value from a CDATA section, if present
            We apply an inner If condition for False allowHTML
                We remove HTML tags from value with string replacements
            We close the inner condition block
            We add the pair keys(index) and value to the pairs collection
        We close the condition block
    We proceed to the Next field
    We add the pairs collection to the results
    We stop parsing If the results quantity equals maxResults
We proceed to the Next item
We return the results ArrayList
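As promised, a minimal sketch of the optional argument guards; these lines would sit at the top of Magic, and the exception messages are illustrative:
If String.IsNullOrEmpty(content) OrElse String.IsNullOrEmpty(container) Then
    Throw New ArgumentException("content and container must not be empty")
End If
If keys Is Nothing OrElse fields Is Nothing OrElse _
   keys.Length = 0 OrElse keys.Length <> fields.Length Then
    Throw New ArgumentException("keys and fields must share the same non-zero Length")
End If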
The RegExFeedParser Class
It's much easier to write in VB.NET compared to the above mixed language.
Imports System.Collections.Specialized
Imports System.Text.RegularExpressions
Imports System.Web

Public Class RegExFeedParser

    ' "name" acts as a placeholder that gets replaced at parse time.
    Private Const containerPattern As String = _
        "<name\b[^>]*?(/>|>(\w|\W)*?</name>)"
    Private Const attributePattern As String = "\bname=""[^""]*?"""
    Private Const stripPattern As String = "</{0,1}name[^>]*?>"

    Public Shared Function Magic(ByVal content As String, _
                                 ByVal container As String, _
                                 ByVal keys As String(), _
                                 ByVal fields As String(), _
                                 ByVal maxResults As Integer, _
                                 ByVal allowHTML As Boolean) As ArrayList
        Dim results As New ArrayList
        Dim items As MatchCollection = Regex.Matches(content, _
            containerPattern.Replace("name", container))
        For Each item As Match In items
            Dim pairs As New NameValueCollection
            For Each field As String In fields
                Dim index As Integer = Array.IndexOf(fields, field)
                Dim value As String = String.Empty
                Dim mask As String = field
                Dim pos As Integer = field.IndexOf("@"c)
                If pos > -1 Then
                    mask = field.Substring(0, pos) ' node part of node@attribute
                End If
                Dim found As Match = Regex.Match(item.Value, _
                    containerPattern.Replace("name", mask))
                If pos > -1 AndAlso Not found.Equals(Match.Empty) Then
                    mask = field.Substring(pos + 1) ' attribute part
                    found = Regex.Match(found.Value, _
                        attributePattern.Replace("name", mask))
                End If
                If Not found.Equals(Match.Empty) Then
                    If pos > -1 Then
                        value = found.Value.Replace(mask & "=", String.Empty) _
                            .Replace(Chr(34), String.Empty)
                    Else
                        value = Regex.Replace(found.Value, _
                            stripPattern.Replace("name", field), _
                            String.Empty)
                    End If
                    ' Unwrap CDATA: drop the 3-char suffix and the 9-char prefix.
                    If value.IndexOf("<![CDATA[") = 0 Then
                        value = value.Substring(0, value.Length - 3).Substring(9)
                    End If
                    If allowHTML = False Then
                        value = HttpUtility.HtmlDecode(value)
                        value = Regex.Replace(value, _
                            stripPattern.Replace("name", "br"), _
                            vbCrLf)
                        value = Regex.Replace(value, _
                            stripPattern.Replace("name", ""), _
                            String.Empty)
                        value = value.Replace(" " & vbCrLf, vbCrLf) _
                            .Trim(New Char() {Chr(13), Chr(10)})
                    End If
                    pairs.Add(keys(index), value)
                End If
            Next
            results.Add(pairs)
            If results.Count = maxResults Then Exit For
        Next
        Return results
    End Function

End Class
The usage of Array.IndexOf in the inner For Each loop is not optimal, but it's very effective for reading the listing easily, as the eye focuses only on the variable names. OK, this is not for us, but we are not alone. A faster variant is sketched right below.
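The same inner loop driven by an index instead of Array.IndexOf - a sketch:
For index As Integer = 0 To fields.Length - 1
    Dim field As String = fields(index)
    ' ... the loop body stays exactly the same ...
Next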
HTML content handling is just a sample and can be adapted to more specific requirements.
Having the RegExFeedParser class, we proceed with a draft sample defining a usage approach.
The Test Module
Imports System.Collections.Specialized
Imports System.Net
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module Test

    Sub Main()
        ' Uncomment one line at a time to test (the OPML path is a placeholder).
        'FeedTest("http://rss.slashdot.org/Slashdot/slashdot")   ' RSS
        'FeedTest("http://news.google.com/?output=atom")         ' Atom
        'FeedTest("mySubscriptions.opml")                        ' OPML, local file
    End Sub

    Sub FeedTest(ByVal path As String)
        Dim feedFormat As String = String.Empty
        Dim content As String = FeedContent(path, feedFormat)
        If String.IsNullOrEmpty(content) Then
            Debug.WriteLine(New String("x"c, 80))
            Debug.WriteLine("no content for " & path)
            Debug.WriteLine(New String("x"c, 80))
        Else
            Dim container As String
            Dim keys As String()
            Dim fields As String()
            Dim results As ArrayList
            Dim maxRecords As Integer = 10
            Dim allowHTML As Boolean = True
            Dim isList As Boolean = False
            If feedFormat.StartsWith("rss") OrElse feedFormat.StartsWith("rdf") Then
                container = "item"
                keys = New String() {"title", "url", "description", "date"}
                fields = New String() {"title", "link", "description", "pubDate"}
            ElseIf feedFormat.StartsWith("feed") Then
                container = "entry"
                keys = New String() {"title", "url", "description", "date"}
                fields = New String() {"title", "link@href", "content", _
                                       "(published|issued)"}
            ElseIf feedFormat.StartsWith("opml") Then
                container = "outline"
                keys = New String() {"url"}
                fields = New String() {"outline@xmlUrl"}
                isList = True
            Else
                Debug.WriteLine(New String("x"c, 80))
                Debug.WriteLine("no implementation for " & feedFormat)
                Debug.WriteLine(New String("x"c, 80))
                Exit Sub
            End If
            results = RegExFeedParser.Magic(content, container, keys, fields, _
                                            maxRecords, allowHTML)
            If isList = True Then
                ' List results are feed URLs: parse each one recursively.
                For Each result As NameValueCollection In results
                    FeedTest(result("url"))
                Next
            Else
                Debug.WriteLine(New String("="c, 80))
                Debug.WriteLine("results for: " & path)
                Debug.WriteLine(New String("="c, 80))
                For Each result As NameValueCollection In results
                    For Each key As String In keys
                        Debug.WriteLine(key & ": " & result(key))
                    Next
                Next
            End If
        End If
    End Sub

    Function FeedContent(ByVal path As String, _
                         ByRef feedFormat As String) As String
        Dim content As String
        Try
            If path.StartsWith("http:", _
                               StringComparison.CurrentCultureIgnoreCase) Then
                Using down As New WebClient
                    down.Encoding = Encoding.UTF8
                    content = down.DownloadString(path)
                End Using
            Else
                content = File.ReadAllText(path)
            End If
            ' The name of the last closing tag reveals the document format.
            Dim lastTag As Integer = content.LastIndexOf("</")
            If lastTag > -1 Then
                feedFormat = content.Substring(lastTag + 2)
            End If
            Return content
        Catch ex As Exception
            Debug.WriteLine(path & vbTab & ex.Message)
            Return Nothing
        End Try
    End Function

End Module
Using the Test Code
We uncomment each line of the Main sub (one at a time) to check our Magic function for RSS, Atom, and OPML. In the OPML case, the whole code can be considered an aggregator kernel.
While testing, we may run into unsupported stuff or tricky XML cases. As an example, the RSS version 1.0 feed from http://rss.slashdot.org/Slashdot/slashdot does not provide publication date information in pubDate nodes. If we check its XML, we find that this information is (optionally) included as dc:date metadata instead. What can we do now? We change the fields variable for RSS from:
fields = New String() {"title", "link", "description", "pubDate"}
to:
fields = New String() {"title", "link", "description", "(pubDate|dc:date)"}
and voila! Each field is not just a string; it's a part of a Regular Expression pattern!
In case we are interested in using the code for an aggregator, we will need to retrieve information from the "header" part of the feed; in other words, to parse the XML document in two steps. Let's focus on the "header".
For an RSS feed, we can use the following set of definitions:
maxRecords = 1
container = "channel"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link", "description", "pubDate"}
and for an Atom feed:
maxRecords = 1
container = "feed"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link", "subtitle", "updated"}
The key definition is maxRecords, as some fields can be found more than once in a feed document. A sketch combining the two steps follows.
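A sketch of the two-step parsing for an RSS feed, assuming content already holds the document; the first call reads the "header", the second the items:
Dim header As ArrayList = RegExFeedParser.Magic(content, "channel", _
    New String() {"title", "url", "description", "date"}, _
    New String() {"title", "link", "description", "pubDate"}, _
    1, False)
Dim items As ArrayList = RegExFeedParser.Magic(content, "item", _
    New String() {"title", "url", "description", "date"}, _
    New String() {"title", "link", "description", "(pubDate|dc:date)"}, _
    10, True)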
We can "play" with it forever...
Effective Friendship
An email from a friend, saying that the presentation of the idea up to this point does not reveal its power, is the reason for extending the article with the following content. Thanks, George :)
The Hidden Visibility
We already have enough information to get a good sense of the power of the Regular Expression patterns. The patterns themselves, and the way we use them, allow us to create any generic or targeted feed (and related) solution without limits. In simple English: whenever there is a new implementation or an updated version, we only change variables (or settings, if we keep them externally) instead of rewriting or modifying parsing routines!
Syndication Extensions
The discussed case of using a field as a pattern - "(pubDate|dc:date)" - hides an unexpected fact: fields can take values covering all existing and future syndication extensions (blogChannel, creativeCommons, Dublin Core, etc.), with the demonstrated ability for merged usage with basic feed nodes. Let's see some how-to examples by defining values for keys and fields (in sequential presentation).
blogChannel Extension
Keys: siteBlogRoll, authorSubscriptions, authorLink, changes
Fields: blogChannel:blogRoll, blogChannel:mySubscriptions, blogChannel:blink, blogChannel:changes
blogChannel:blogRoll and blogChannel:mySubscriptions each return the URL of an OPML file (this case is covered in the Test Module code sample)
Creative Commons Extension
Keys: license
Fields: creativeCommons:license
Dublin Core Extension
Keys: title, subject, type, source, relation, creator, date ...
Fields: dc:title, dc:subject, dc:type, dc:source, dc:relation, dc:creator, dc:date ...
Basic Geo (WGS84 lat/long)
Keys: latitude, longitude
Fields: geo:lat, geo:long
... and voila! No code, no new classes, methods ...
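More precisely, "no code" means settings only. As a sketch using the Test module's variable names, the Dublin Core pair becomes:
keys = New String() {"title", "subject", "type", "creator", "date"}
fields = New String() {"dc:title", "dc:subject", "dc:type", _
                       "dc:creator", "dc:date"}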
XML as eXtended Mess Language
The title does not represent the un-contradictable value of XML itself, but the feelings we may have whenever we try to deal with various implementations of the language that hide high complexity. Our approach for parsing has the logic of a base path extended to unique paths. As an example based on RSS, we have the base path: /item extended to /title and /link. The Magic
function can parse /item/title and /item/link paths only once per item. This fact looks like a limitation, but the reality is different.
Let's look again at the usual usage of the containerPattern constant:
<name\b[^>]*?(/>|>(\w|\W)*?</name>)
This pattern is in use for both the container and the fields: by replacing the sequence "name" with the container or a field value, we get the desired tag in return. Once again, we note that any field value can be a Regular Expression on its own.
Now, let's have a look at a blog-related case. By checking the XML document from http://news.google.com/?output=atom, we see that each entry node does not necessarily have a unique link node. Thankfully, each link node describes itself with a rel attribute (a logic similar to metadata), meaning that there is no problem at all.
The field value for unique parsing is "link". Assume that our application needs link information for alternate, replies, self, and edit (the rel attribute values). If we creatively treat the value as a Regular Expression pattern, we can come up with fields like:
link[^>]*(\brel="alternate"){1}
link[^>]*(\brel="replies"){1}
link[^>]*(\brel="self"){1}
link[^>]*(\brel="edit"){1}
Now we apply the @ semantic for defining the attribute that holds the desired information as follows:
link[^>]*(\brel="alternate"){1}@href
link[^>]*(\brel="replies"){1}@href
link[^>]*(\brel="self"){1}@href
link[^>]*(\brel="edit"){1}@href
and voila again! No code, no new classes, methods ...
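In Test module terms, these four fields are again a settings-only change - a sketch:
keys = New String() {"alternate", "replies", "self", "edit"}
fields = New String() {"link[^>]*(\brel=""alternate""){1}@href", _
                       "link[^>]*(\brel=""replies""){1}@href", _
                       "link[^>]*(\brel=""self""){1}@href", _
                       "link[^>]*(\brel=""edit""){1}@href"}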
Non Stop
The above blog-related XML document includes metadata information from the openSearch extension. We use the following pair of keys and fields:
Keys: results, first, count
Fields: openSearch:totalResults, openSearch:startIndex, openSearch:itemsPerPage
and voila again! No code, no new classes, methods ...
Google Sitemaps
Google sitemaps are a good example of using our code for non-feed-related XML documents. We assume that we want to add sitemap support to our application. At this point, we have to extend our Test.FeedTest method in order to cover sitemap and sitemap index files.
First, by copying the condition block for "feed" and pasting it twice, we have:
ElseIf feedFormat.StartsWith("feed") Then
container = "entry"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link@href", "content", _
"(published|issued)"}
ElseIf feedFormat.StartsWith("feed") Then
container = "entry"
keys = New String() {"title", "url", "description", "date"}
fields = New String() {"title", "link@href", "content", _
"(published|issued)"}
We modify the pasted code as follows:
ElseIf feedFormat.StartsWith("urlset") Then
maxResults = 10000
container = "url"
keys = New String() {"url"}
fields = New String() {"loc"}
ElseIf feedFormat.StartsWith("sitemapindex") Then
container = "sitemap"
keys = New String() {"url"}
fields = New String() {"loc"}
Finally, the isList = True line in the "sitemapindex" block does for a sitemap index exactly what it does for OPML: each result is treated as the URL of another document and is fed back to FeedTest recursively.
Now we are ready to test our application using our own sitemap files or the huge one from Google (almost 4MB, with more than 35,000 records located at http://www.google.com/sitemap.xml)...
...and voila once again! Easy coding only for settings, and no new classes, methods ...
#End Region
All comments, questions, ideas, requests for help ... are very welcome.
Thanks for your time.