Introduction
This article presents a simple HTML parser library that I developed for an automated application. The parser mainly detects tag syntax, and it can collect a matching tag pair as a group. I considered using a parser generator such as ANTLR, but I was in a hurry and had no time to study its syntax, so I ended up writing the parser myself.
The parser is intended to be used with HTML content retrieved by the .NET WebResponse class, so I have also developed a tool named NativeWebSurf that downloads HTML content through WebResponse and uses my parser to parse it into an HTML structure.
Using the Code
The library and the tool are written for .NET 3.5 with Visual Studio 2008. There are now two archive files: NativeWebSurf and NativeWebSurf_1.0.1. The former makes little use of LINQ or extension methods, while the latter uses LINQ more heavily. (I eventually found it to be a rather convenient language feature.) If you would like to convert the code to .NET 2.0, start from the former archive.
The NativeWebSurf solution contains three projects:
- NativeWebSurf: The main application, which uses the parser from RZLib.
- RZLib: A class library that contains the parser source code. This project uses the C5 Generic Collection Library, which is not included. Please download it from the link provided.
- RZLib.UnitTest: A unit test module for RZLib. It uses NUnit 2.4.3.
The parser class is RZ.Web.HtmlParser. To create a parser object, pass the HTML text to its constructor.
using RZ.Web;

namespace TestLib
{
    class Program
    {
        static void Main()
        {
            // The constructor takes the HTML text to parse.
            HtmlParser parser = new HtmlParser(
                "<html><body>any HTML/text here...</body></html>");
        }
    }
}
HtmlParser.CurrentContent represents the content object that the parser has just read. A content object is an instance of one of the content classes, whose names all start with HtmlContent. The content class hierarchy is as follows:
HtmlContent
HtmlContentText
HtmlContentTag
HtmlContentHeadTag
HtmlContentOpenTag
HtmlContentCompleteTag
HtmlContentBlock
HtmlContentCloseTag
Class names shown in bold are abstract classes.
HtmlContentText holds all text that is not part of a tag. HtmlContentOpenTag, HtmlContentCloseTag, and HtmlContentCompleteTag hold the information of an open tag, a close tag, and a complete tag (e.g. <br />), respectively.
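To make the three tag kinds concrete, here is a minimal, self-contained sketch of how a raw tag string could be classified. It is not the code used in RZLib; the names TagKind and TagClassifier are hypothetical, and the sketch assumes the input is a single tag that starts with '<' and ends with '>'.

```csharp
using System;

// Hypothetical names for illustration only; they are not part of RZLib.
enum TagKind { Open, Close, Complete }

static class TagClassifier
{
    // Classifies one raw tag such as "<b>", "</b>" or "<br />".
    public static TagKind Classify(string rawTag)
    {
        if (rawTag == null || rawTag.Length < 3 ||
            rawTag[0] != '<' || rawTag[rawTag.Length - 1] != '>')
            throw new ArgumentException("Not a tag: " + rawTag);

        if (rawTag[1] == '/')
            return TagKind.Close;                 // </name>

        // Inspect the last non-space character before the closing '>'.
        string inner = rawTag.Substring(1, rawTag.Length - 2).TrimEnd();
        if (inner.Length > 0 && inner[inner.Length - 1] == '/')
            return TagKind.Complete;              // <name ... />

        return TagKind.Open;                      // <name ...>
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(TagClassifier.Classify("<b>"));     // Open
        Console.WriteLine(TagClassifier.Classify("</b>"));    // Close
        Console.WriteLine(TagClassifier.Classify("<br />"));  // Complete
    }
}
```

A tag whose last unquoted attribute value ends in a slash (for example <a href=/>) would be reported as Complete by this simple check, so a real parser needs to look at the attributes as well.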
When the parser has just been created, its CurrentContent is null. Calling any of the parser's public methods moves it to a valid content object.
- FetchNextContent() moves CurrentContent to the next content.
- MoveToHeadTag() moves CurrentContent to the next open/complete tag.
- MoveToTag() moves CurrentContent to the open/complete tag that has the specified name or satisfies the specified predicate (either one is passed as its parameter).
Lambda expressions in C# 3 make the predicate more compact than an anonymous method, so MoveToTag() can be used like this:
parser.MoveToTag( tag => tag.TagName == "meta" && tag.Attributes["name"] == "Rating" );
GrabCurrentTag() can be used only when CurrentContent is at an open/complete tag. It finds the matching end tag and puts the whole content into an HtmlContentBlock, which has a tree structure.
Example:
HtmlParser parser = new HtmlParser(
@"<html>
<head>
<title>abc</title>
</head>
<body>any text here...
</body>
</html>"
);
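// Call MoveToTag() outside Debug.Assert(): Debug.Assert calls are omitted from
// builds without the DEBUG symbol, so a call nested inside would not run there.
Boolean found = parser.MoveToTag("head");
Debug.Assert( found );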
Boolean hasCloseTag;
HtmlContentBlock headBlock = parser.GrabCurrentTag(out hasCloseTag);
Debug.Assert( hasCloseTag );
Debug.Assert( headBlock.TagName == "head" );
Debug.Assert( headBlock.Attributes.Count == 0 );
Debug.Assert( headBlock.Count == 3 );
Debug.Assert( headBlock[0] is HtmlContentText );
Debug.Assert( headBlock[1] is HtmlContentBlock );
Debug.Assert( headBlock[2] is HtmlContentText );
HtmlContentBlock titleBlock = (HtmlContentBlock) headBlock[1];
Debug.Assert( titleBlock.Count == 1 );
Debug.Assert( titleBlock[0].ToString() == "abc" );
Debug.Assert( parser.CurrentContent is HtmlContentCloseTag );
The current parser reads forward only. To read backward, grab the HTML into a block first and navigate the block instead.
HtmlContentBlock
An HtmlContentBlock is an object that collects only two types of content:
- HtmlContentText
- HtmlContentBlock
The Count property gives the number of child items (text or block) in the block, and an item can be retrieved through the indexer. However, using its iterator with LINQ (or foreach) is probably easier.
It provides two iterators: the default one enumerates HtmlContentBlock children, and the other enumerates HtmlContent children. (Note that the default iterator has been changed from IEnumerable<HtmlContent> to IEnumerable<HtmlContentBlock>, because I believe people scan for blocks more often than for text. The IterateChild() method was added for HtmlContent enumeration instead.)
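The two-iterator design can be sketched with a small stand-in tree. The classes below (ContentNode, TextNode, BlockNode) are hypothetical and only mimic the shape described above; they are not the RZLib classes, and the real HtmlContentBlock may be implemented differently.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for HtmlContent, HtmlContentText and HtmlContentBlock.
abstract class ContentNode { }

class TextNode : ContentNode
{
    public readonly string Value;
    public TextNode(string value) { Value = value; }
}

class BlockNode : ContentNode, IEnumerable<BlockNode>
{
    public readonly string TagName;
    private readonly List<ContentNode> children = new List<ContentNode>();

    public BlockNode(string tagName, params ContentNode[] items)
    {
        TagName = tagName;
        children.AddRange(items);
    }

    // Second iterator: every child, text and block alike.
    public IEnumerable<ContentNode> IterateChild() { return children; }

    // Default iterator: child blocks only, text children are skipped.
    public IEnumerator<BlockNode> GetEnumerator()
    {
        return children.OfType<BlockNode>().GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

class Demo
{
    static void Main()
    {
        BlockNode head = new BlockNode("head",
            new TextNode("\n"),
            new BlockNode("title", new TextNode("abc")),
            new TextNode("\n"));

        foreach (BlockNode child in head)
            Console.WriteLine(child.TagName);               // title

        Console.WriteLine(head.IterateChild().Count());     // 3
    }
}
```

Because the default iterator yields blocks only, a LINQ query such as head.Where(b => b.TagName == "title") works directly on the block without an OfType filter at every call site.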
Another way to search for a tag is the FindTag() method, which can also find by either name or predicate. It returns an array of Int32 as an index, which can be passed to the indexer to get the content, and which can also be used as the start position for the next search.
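One way to read an Int32[] index is as a path of child positions, one entry per tree level. The sketch below works under that assumption; the Node class and its FindPath() method are hypothetical, do not reproduce the RZLib FindTag() signature, and leave out resuming a search from a previous index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree node; not an RZLib class.
class Node
{
    public readonly string Name;
    public readonly List<Node> Children = new List<Node>();

    public Node(string name, params Node[] children)
    {
        Name = name;
        Children.AddRange(children);
    }

    // Follows a path of child positions, one entry per tree level.
    public Node this[int[] path]
    {
        get
        {
            Node current = this;
            foreach (int i in path)
                current = current.Children[i];
            return current;
        }
    }

    // Depth-first search; returns the path to the first match, or null.
    public int[] FindPath(Predicate<Node> match)
    {
        List<int> path = new List<int>();
        return Search(this, match, path) ? path.ToArray() : null;
    }

    private static bool Search(Node node, Predicate<Node> match, List<int> path)
    {
        for (int i = 0; i < node.Children.Count; i++)
        {
            path.Add(i);                          // try child i
            Node child = node.Children[i];
            if (match(child) || Search(child, match, path))
                return true;
            path.RemoveAt(path.Count - 1);        // backtrack
        }
        return false;
    }
}

class Demo
{
    static void Main()
    {
        Node html = new Node("html",
            new Node("head", new Node("meta"), new Node("title")),
            new Node("body"));

        int[] path = html.FindPath(n => n.Name == "title");
        Console.WriteLine(string.Join(",", Array.ConvertAll(path, i => i.ToString())));  // 0,1
        Console.WriteLine(html[path].Name);                                              // title
    }
}
```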
Issues
Script Support
The parser handles JavaScript script tags, but not other scripting languages, and only in a simple way. It does not understand the full JavaScript syntax, but it recognises JS strings and comments, and it treats all JavaScript code as normal text.
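The idea of recognising only strings and comments can be sketched as a scan for the end of a script body. This is an illustration of the approach described above, not the code in RZLib; ScriptScanner and FindScriptEnd are hypothetical names, and regular expression literals are deliberately not handled.

```csharp
using System;

// Hypothetical helper for illustration; not part of RZLib.
static class ScriptScanner
{
    // Returns the index of the "</script" that ends the script body starting
    // at 'start', skipping JS string literals and comments; -1 if none is found.
    public static int FindScriptEnd(string html, int start)
    {
        int i = start;
        while (i < html.Length)
        {
            char c = html[i];
            if (c == '"' || c == '\'')
            {
                i = SkipString(html, i);
            }
            else if (c == '/' && i + 1 < html.Length && html[i + 1] == '/')
            {
                int eol = html.IndexOf('\n', i);          // line comment
                i = eol < 0 ? html.Length : eol + 1;
            }
            else if (c == '/' && i + 1 < html.Length && html[i + 1] == '*')
            {
                int end = html.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? html.Length : end + 2;      // block comment
            }
            else if (c == '<' && string.Compare(html, i, "</script", 0, 8,
                         StringComparison.OrdinalIgnoreCase) == 0)
            {
                return i;
            }
            else
            {
                i++;
            }
        }
        return -1;
    }

    // Skips a quoted literal starting at 'open', honouring backslash escapes.
    // Returns the position just after the closing quote.
    private static int SkipString(string s, int open)
    {
        char quote = s[open];
        int i = open + 1;
        while (i < s.Length && s[i] != quote)
            i += (s[i] == '\\') ? 2 : 1;
        return i + 1;
    }
}

class Demo
{
    static void Main()
    {
        string html = "<script>var s = \"</script>\"; // </script>\n" +
                      "/* </script> */ x();</script>";
        int end = ScriptScanner.FindScriptEnd(html, "<script>".Length);
        Console.WriteLine(end == html.LastIndexOf("</script>"));
    }
}
```

Everything between the open tag and the returned index can then be stored as plain text, which matches the behaviour described above.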
Bugs
There is at least one case that causes the parser to fail, and when it fails it throws an HtmlParserException. The case I know of is a malformed tag such as:
This is <b>HTML <i>text</i</b>.
The parser expects a well-formed close tag, but </i is missing its closing >. I have marked the point in HtmlLegacyParser.cs where this case can be handled with a TODO: comment.
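One possible recovery strategy, offered only as a sketch, is to treat a new '<' that appears before the expected '>' as the end of the close tag. The helper below is hypothetical and is not the code behind the TODO in HtmlLegacyParser.cs; it assumes the position passed in points at the "</" of a close tag.

```csharp
using System;

// Hypothetical helper for illustration; not part of RZLib.
static class CloseTagReader
{
    // Reads the close tag that starts at html[pos] ("</") and returns its name.
    // 'next' receives the position where parsing should resume.
    // If a new '<' appears before '>', the tag is treated as if '>' had been there.
    public static string ReadCloseTag(string html, int pos, out int next)
    {
        if (pos < 0 || pos + 1 >= html.Length || html[pos] != '<' || html[pos + 1] != '/')
            throw new ArgumentException("Not at a close tag.");

        int nameStart = pos + 2;
        int i = nameStart;
        while (i < html.Length && html[i] != '>' && html[i] != '<')
            i++;

        string name = html.Substring(nameStart, i - nameStart).Trim();

        // Consume '>' when present; stay on '<' so the next tag is read normally.
        next = (i < html.Length && html[i] == '>') ? i + 1 : i;
        return name;
    }
}

class Demo
{
    static void Main()
    {
        string html = "This is <b>HTML <i>text</i</b>.";
        int next;
        string first = CloseTagReader.ReadCloseTag(html, html.IndexOf("</i"), out next);
        string second = CloseTagReader.ReadCloseTag(html, next, out next);
        Console.WriteLine(first);    // i
        Console.WriteLine(second);   // b
    }
}
```

Whether silent recovery is preferable to throwing HtmlParserException depends on the application; a strict mode could keep the exception and offer this behaviour as an option.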
Some Unrelated Issue...
While developing NativeWebSurf, I noticed that the Cookie object returned from WebResponse always contains a path, even when the Set-Cookie response header does not specify one. I don't know whether this is a framework bug, but it could cause trouble because we can never tell whether the path really came from the Web server. This may not be the right place to ask, but if anyone knows something about it, I would be grateful if you could share it. :)
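One way to tell whether the server really sent a path is to inspect the raw Set-Cookie header value instead of the Cookie object. The sketch below assumes the string holds the value of a single cookie; SetCookieInspector and HasExplicitPath are hypothetical names. As far as I understand, the cookie specifications define a default path derived from the request URL when the attribute is absent, which may be why the framework fills one in.

```csharp
using System;

// Hypothetical helper for illustration; not part of RZLib or NativeWebSurf.
static class SetCookieInspector
{
    // Returns true when the Set-Cookie value of ONE cookie carries a Path attribute.
    public static bool HasExplicitPath(string setCookieValue)
    {
        string[] parts = setCookieValue.Split(';');
        for (int i = 1; i < parts.Length; i++)     // parts[0] is name=value
        {
            string attr = parts[i].Trim();
            int eq = attr.IndexOf('=');
            string name = eq < 0 ? attr : attr.Substring(0, eq).TrimEnd();
            if (string.Equals(name, "path", StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(SetCookieInspector.HasExplicitPath("id=42; Path=/app; HttpOnly"));  // True
        Console.WriteLine(SetCookieInspector.HasExplicitPath("id=42; HttpOnly"));             // False
    }
}
```

The raw value can be read from the response's Headers collection. Note that several cookies may be combined into one comma-separated string there, and Expires dates also contain commas, so splitting that string into individual cookies needs extra care that this sketch does not attempt.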
Finally
I hope this library is useful to others too. If you fix a bug or add a feature, please share it with me.
History
- 2008-02-27: Updated article
  Added a few features and refactored the lib.
  NativeWebSurf
    + Supports Cookie cache (with all paths in same host)
    + Supports HTML content charset
    * Fixed code to support new HtmlContentBlock iterator
  RZLib
    HtmlContentBlock
      * Changed default iterator to return HtmlContentBlock instead of HtmlContent
      + Indexer with Int32[] index
      + IterateChild() for HtmlContent enumeration
      + FindTag() with predicate
    HtmlParser
      + MoveToTag() with predicate
- 2008-02-24: Posted initial article