Introduction
This article discusses a markup parsing engine that can be used to work with markup data (such as HTML or XML) when a DOM is not available.
Background
Recently, I was working on a project in which I was interacting with a Web site in code using HttpWebRequest and HttpWebResponse objects. I needed to simulate form submissions, and in order to do that, I needed to set the form values in the Request object. I decided that I would use the HTML text returned by the HttpWebResponse to collect the names of the form controls so that I could set values for them as needed and push them into my Request object. It seemed like a straightforward approach, so I began looking at how I could load the HTML text from the WebResponse into some kind of DOM that I could use. Although I thought it would be a simple thing to do, I could not find Framework classes that supported this. The XML DOM in the .NET Framework would not work because the HTML text was not well-formed XHTML, and the XML document objects in the .NET Framework would throw exceptions rather than load it. Searches on the Internet turned up wacky COM implementations that almost gave me nightmares. Next, I looked for home-grown HTML parsers. I found a few free ones, but none seemed to support the kind of interaction I wanted.
I wanted to load the HTML text into an object model that would allow intuitive access to the hierarchy of the markup tags without having to jump through a million hoops. As I thought about this, I came up with an idea for parsing the text myself, and for doing it in a generic way that would support any markup format. The idea is to use a set of regular expressions to identify markup tags within a given string. I chose to define a markup tag as a single...
<....>
... occurrence. This means that an opening tag like...
<html>
... is a markup tag occurrence, as is a closing tag like:
</html>
So, the idea was to use the regular expressions to identify the individual tags and store only the index into the original text of where each tag's text begins, along with the length of that text. In this manner, I could create a map of all the tags in the text. Only one copy of the text would actually exist, and I could retrieve a specific tag by taking a substring of the original text using the tag's index and length. From there, I saw that I would need to take a tag and do a similar analysis to determine its attributes. Again, by feeding an individual tag's text into a regular expression, I was able to identify the tag's attributes.
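As a rough illustration of this index-and-length idea (the pattern below is a simplified stand-in, not one of the engine's actual expressions), the tag-identification pass looks something like this:
using System;
using System.Text.RegularExpressions;

// Simplified stand-in pattern: anything between '<' and '>' is treated as one tag.
string markup = "<html><body><input id=\"myinputcontrol\"></body></html>";
foreach (Match match in Regex.Matches(markup, "<[^>]+>"))
{
    int index = match.Index;    // where the tag's text begins in the original string
    int length = match.Length;  // how many characters the tag occupies
    // Only the index and length need to be stored; the tag text can always be
    // recovered by substringing the single original copy of the markup.
    Console.WriteLine("{0} starts at {1}, length {2}",
        markup.Substring(index, length), index, length);
}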
I defined an attribute as a...
name=value
... occurrence within an opening tag. For example:
<input id="myinputcontrol">
Here, id="myinputcontrol" is the attribute text, 'id' is the name, and 'myinputcontrol' is the value. Attributes are also stored as an index of where the attribute's text begins and the length of that text. I defined an inline tag as a tag that ends with:
/>
and a comment tag as a tag that begins with:
<!--
Last, I defined a badly formed tag as an opening tag that is not an inline tag and has no corresponding closing tag. For example, here badTag is a badly formed tag:
<goodTag>
<badTag>
</goodTag>
This is quite common in HTML and browsers handle it without problems.
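Going back to the attribute definition above, the attribute scan applies the same idea to an individual tag's text. A minimal sketch (a simplified pattern that only handles quoted values, not the engine's real expression) might look like:
using System;
using System.Text.RegularExpressions;

// Simplified attribute scan over one opening tag's text. The real engine records
// only the index and length of each name=value occurrence instead of copying it.
string tagText = "<input id=\"myinputcontrol\" type=\"text\">";
foreach (Match match in Regex.Matches(tagText, "(\\w+)\\s*=\\s*\"([^\"]*)\""))
{
    Console.WriteLine("name: {0}, value: {1}",
        match.Groups[1].Value,   // e.g. id
        match.Groups[2].Value);  // e.g. myinputcontrol (quotes excluded)
}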
Using the Code
The work of parsing the text is done in the ParseMarkup() function of the MarkupDocument class. The parsing consists of the following steps:
- Identification of tags in the raw text - The raw text is fed through the regular expressions to identify the individual tags. When an individual tag is identified, a MarkupTag object is created for it. At this point, the tags are stored in a linear collection with no hierarchy.
- Correction of faulty inline tags - Opening tags that are identified as badly formed tags are flagged.
- Construction of the document hierarchy - The linear collection of tags is analyzed to determine the nesting levels and construct the parent-child relationships between the tags (a condensed sketch of this step follows the list).
- Association of opening tags with their closing tags - At this point, the hierarchy of the document exists, and a connection can be made between each opening tag object and its closing tag object based on the nesting level of the tags and a case-insensitive text comparison.
- Validation of the document - Validation simply checks that every opening tag is either flagged as a faulty inline tag or has a closing tag associated with it.
- Removal of closing tags from the document's RootTags collection as well as from all children - In essence, this makes a closing tag accessible only from the ClosingTag property of its opening tag. In general, when you are working with a markup document, you aren't much concerned with the closing tags.
- Clearing of the internal cache - Through testing, I found that large markup documents benefit from caching strings in the markup tag objects to prevent unnecessary, repetitive substringing of the underlying text. The cache is essentially a local copy of the substring result that is created at first use. After the parsing is complete, these local copies are cleared.
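Here is the condensed sketch of the hierarchy-construction step mentioned above. It uses plain strings instead of MarkupTag objects, so it illustrates the technique rather than the actual ParseMarkup() code:
using System;
using System.Collections.Generic;

// Walk a linear list of tags and derive each tag's nesting depth with a Stack.
string[] tags = { "<html>", "<body>", "<input id=\"a\" />", "</body>", "</html>" };
Stack<string> openTags = new Stack<string>();
foreach (string tag in tags)
{
    if (tag.StartsWith("</"))
    {
        // Closing tag: the most recent opening tag is complete, so leave its level.
        openTags.Pop();
    }
    else
    {
        // The current stack depth is this tag's nesting level; when the stack is
        // not empty, openTags.Peek() is this tag's parent.
        Console.WriteLine("{0} at depth {1}", tag, openTags.Count);

        // Only opening tags that expect a closing tag create a deeper level.
        if (!tag.EndsWith("/>") && !tag.StartsWith("<!--"))
        {
            openTags.Push(tag);
        }
    }
}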
The MarkupParser uses three classes to represent the markup text in an object-oriented manner:
- MarkupDocument
- MarkupTag
- MarkupAttribute
The MarkupDocument class, as its name implies, represents the document as a whole. It contains a RootTags property, which is a MarkupTag[] holding the tags at the root level of the document. The MarkupTag class represents an individual tag in the document, and it has a Children property, also an array of MarkupTag objects, representing the tags nested within it in the document hierarchy. Last, each MarkupTag has an Attributes property, which is an array of MarkupAttribute objects representing the attributes of the tag. Each MarkupAttribute has a Name and a Value property that supply the name and value of the attribute as strings. When an attribute value is quoted, the quotes are removed, so you will always get the text inside the quotes. For example:
<product id="1" />
Here, the 'id' attribute's Value property will return 1 rather than "1".
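To make these relationships concrete, here is a small traversal sketch built only from the members described above (RootTags, Children, Attributes, Name, and Value); it dumps every attribute in a document along with its nesting depth:
using System;

// Recursively walk the tag hierarchy and print each attribute with its depth.
static void DumpAttributes(MarkupTag[] tags, int depth)
{
    foreach (MarkupTag tag in tags)
    {
        foreach (MarkupAttribute attribute in tag.Attributes)
        {
            Console.WriteLine("{0}{1} = {2}",
                new string(' ', depth * 2), attribute.Name, attribute.Value);
        }
        DumpAttributes(tag.Children, depth + 1);
    }
}

// Usage: start from the root of a parsed document.
// MarkupDocument doc = new MarkupDocument(markupText);
// DumpAttributes(doc.RootTags, 0);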
The parsing of the markup text is done automatically when a MarkupDocument is instantiated, as a required constructor parameter is a string of the markup text to load and parse. Once the constructor has executed, the text has been parsed and the document has been completely filled with MarkupTag objects; it's ready for use! Here's an example of parsing some HTML text from a string:
string htmlText = GetHTMLText();
MarkupDocument doc = new MarkupDocument(htmlText);
To access the inner text of a tag named 'html' at the root of the document:
string innerText = doc["html"][0].InnerText;
Note the use of the array index [0] following the reference to the tag name html. This is necessary because the string indexer returns a MarkupTag[] matching the supplied name (note that the MarkupTag class also has a string indexer that indexes into its Children property). With markup documents, it is generally valid to allow multiple tags of the same name. For example, the following is valid markup...
<products>
<product id="1" name="lamp" />
<product id="2" name="pillow" />
</products>
... even though there are multiple product tags defined. Cases where a tag is limited to a single occurrence are specific to markup implementations, such as HTML with its html, head, body, etc. tags. For this reason, I added an HTMLDocument class that is a wrapper around the MarkupDocument class and provides Head, Body, and Form properties that give direct access to their respective tags. It can be used like this:
string htmlText = GetHTMLText();
HTMLDocument doc = HTMLDocument.Load(htmlText);
string innerText = doc.Head.InnerText;
Notice here that htmlText is provided to the HTMLDocument class through a static method rather than a constructor. This is because some validation must be done to ensure the text can be loaded as an HTMLDocument (an html tag must be at the root of the document, and it can have only one head and one body tag).
Also, all string comparisons are case-insensitive by default. This is desirable because an opening tag and its closing tag can legitimately differ in case. It also means that when accessing a tag via the string indexer, you don't need to worry about case. So,...
string innerText = doc["html"][0].InnerText;
... and...
string innerText = doc["HTML"][0].InnerText;
... are the same.
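As another quick usage example, the products markup shown earlier can be walked like this (productsMarkup is assumed to be a string variable holding that markup):
// Enumerate every product tag under products and print its attributes.
MarkupDocument doc = new MarkupDocument(productsMarkup);
foreach (MarkupTag product in doc["products"][0]["product"])
{
    foreach (MarkupAttribute attribute in product.Attributes)
    {
        Console.WriteLine("{0} = {1}", attribute.Name, attribute.Value);
    }
}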
When testing performance, I found that loading a complex XML document that was roughly 1 MB and contained more than 65,000 tags took about 6 seconds. Likewise, a typical HTML document of roughly 1.5 KB took less than a second. You can manipulate some of the options in the MarkupDocument constructor to see how they affect performance. In particular, setting the fixBadlyFormedInlineTags option to false can give a big performance increase, because that step in the parsing process is skipped; this, of course, is only beneficial if you are certain the markup is well formed. The caseSensitiveComparisons parameter may also provide a performance gain when set to true, as case-sensitive comparisons generally perform better; likewise, this is only beneficial if you are certain the opening and closing tags in the document have matching case.
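For example, assuming the constructor overload that exposes these two flags after the markup text (check the actual parameter list and ordering in the source before relying on this), the call might look like:
// Assumed overload and parameter order; verify against the actual constructor.
// false -> fixBadlyFormedInlineTags: skip the correction pass (markup must be well formed)
// true  -> caseSensitiveComparisons: faster matching when tag casing is consistent
MarkupDocument doc = new MarkupDocument(xmlText, false, true);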
Points of Interest
- When the useCaching parameter is not available in the constructor or static creation method, caching is determined automatically based on the size of the text: caching is used when the raw text is larger than 4K characters.
- There are static 'Known Inline' members in the MarkupDocument class that account for tags that may otherwise be erroneously flagged as faulty inline tags. The only one currently in place is the <?xml ... ?> tag used by XML, since it never has a corresponding closing tag and is a well-known standard. You can add others to the KnownInlineTags static property as needed (see the sketch after this list).
- The generic Queue and Stack classes in the .NET Framework were invaluable in the parsing process, and it was a good refresher in using a stack.
- XML's CDATA tag is currently not supported, as the main regular expression that identifies the tags in the document text does not account for it.
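Regarding the 'Known Inline' point above, registering an additional known inline tag might look something like the following; this assumes KnownInlineTags is a modifiable collection of tag-name strings, and the !DOCTYPE entry is purely illustrative:
// Hypothetical usage; check the actual type of the KnownInlineTags property.
MarkupDocument.KnownInlineTags.Add("!DOCTYPE");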
History
- 11/07/2008 - Minor grammatical corrections; added note about XML CDATA tag
- 11/06/2008 - Initial publication