
Another C# Legacy HTML Parser Using Tag Processing

Ruxo Zheng


26 Feb 2008

A class library containing an HTML parser for working with HTML tags

Introduction

This article presents a simple HTML parser library that I developed for an automated application. The parser mainly detects tag syntax, and it can collect a tag pair as a group. I tried to use a parser generator such as ANTLR, but I was in a hurry and did not have time to study its syntax, so I ended up writing the parser myself.

The parser is intended to be used with HTML content retrieved by the .NET WebResponse class. I have therefore also developed a tool, named NativeWebSurf, that downloads HTML content through WebResponse and uses my parser to parse it into an HTML structure.

Using the Code

The library and the tool target .NET 3.5 and were written with Visual Studio 2008. There are now two archive files: NativeWebSurf and NativeWebSurf_1.0.1. The former does not make heavy use of LINQ or extension methods, while the latter applies LINQ more. (I finally found it to be a rather convenient language feature.) Anyone who would like to convert the code to .NET 2.0 should start from the former archive.

The NativeWebSurf solution contains three projects.

NativeWebSurf: This is the main application that uses the parser from RZLib.
RZLib: This is a class library that contains the parser source code. This project uses the C5 Generic Collection Library, which is not included. Please download it from the link provided.
RZLib.UnitTest: A Unit test module for RZLib. It uses NUnit 2.4.3.

The parser class is RZ.Web.HtmlParser. To create a parser object, pass the HTML text to its constructor.

using RZ.Web;

namespace TestLib
{
    class Program
    {
        HtmlParser parser = new HtmlParser(
            "<html><body>any HTML/text here...</body></html>");
    }
}

HtmlParser.CurrentContent represents the content object that the parser has just read. Content objects are instances of the content classes, whose names all start with HtmlContent. The content class hierarchy is as follows:

HtmlContent
- HtmlContentText
- HtmlContentTag
  - HtmlContentHeadTag
    - HtmlContentOpenTag
    - HtmlContentCompleteTag
    - HtmlContentBlock
  - HtmlContentCloseTag

Class names shown in bold are abstract classes.

HtmlContentText holds all text that is not part of a tag.

HtmlContentOpenTag, HtmlContentCloseTag, and HtmlContentCompleteTag hold information about an open tag, a close tag, and a complete tag (e.g. <br />), respectively.
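To make the four kinds of content concrete, here is a minimal sketch of how a forward-only scanner might split input into text, open tags, close tags, and complete tags. It is not the RZLib implementation: TokenKind, Token, and TinyTokenizer are hypothetical names made up for this illustration, and the sketch assumes that every '<' starts a tag (no comments, scripts, or malformed tags).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in types for illustration; these are NOT the RZLib classes.
enum TokenKind { Text, OpenTag, CloseTag, CompleteTag }

sealed class Token
{
    public readonly TokenKind Kind;
    public readonly string Value;   // raw text, or the lower-cased tag name for tags
    public Token(TokenKind kind, string value) { Kind = kind; Value = value; }
}

static class TinyTokenizer
{
    // Splits the input into text and tag tokens, reading forward only.
    // Assumption: every '<' starts a tag; comments and scripts are not handled.
    public static List<Token> Tokenize(string html)
    {
        var tokens = new List<Token>();
        int pos = 0;
        while (pos < html.Length)
        {
            int lt = html.IndexOf('<', pos);
            if (lt < 0) lt = html.Length;
            if (lt > pos)
                tokens.Add(new Token(TokenKind.Text, html.Substring(pos, lt - pos)));
            if (lt == html.Length) break;

            int gt = html.IndexOf('>', lt);
            if (gt < 0)
            {
                // No closing '>' at all: keep the remainder as plain text.
                tokens.Add(new Token(TokenKind.Text, html.Substring(lt)));
                break;
            }

            string inner = html.Substring(lt + 1, gt - lt - 1).Trim();
            TokenKind kind = TokenKind.OpenTag;
            if (inner.Length > 0 && inner[0] == '/')
            {
                kind = TokenKind.CloseTag;            // e.g. </b>
                inner = inner.Substring(1);
            }
            else if (inner.Length > 0 && inner[inner.Length - 1] == '/')
            {
                kind = TokenKind.CompleteTag;         // e.g. <br />
                inner = inner.Substring(0, inner.Length - 1);
            }

            // The tag name runs up to the first whitespace character.
            string name = inner.Trim();
            int ws = name.IndexOfAny(new[] { ' ', '\t', '\r', '\n' });
            if (ws >= 0) name = name.Substring(0, ws);

            tokens.Add(new Token(kind, name.ToLowerInvariant()));
            pos = gt + 1;
        }
        return tokens;
    }
}
```

Calling TinyTokenizer.Tokenize("a<b>c<br />d</b>") yields six tokens in order: text, open tag b, text, complete tag br, text, close tag b. Attribute parsing is deliberately left out to keep the sketch short.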

When the parser has just been created, its CurrentContent is null. Calling any of the parser's public methods moves it to a valid content object.

FetchNextContent() moves CurrentContent to the next content.

MoveToHeadTag() moves CurrentContent to the next open or complete tag.

MoveToTag() moves CurrentContent to the open or complete tag that has the specified name or satisfies the specified predicate (passed as its parameter).

Lambda expressions in C# 3 make the predicate more compact than an anonymous method, so MoveToTag() can now be used like this:

parser.MoveToTag( tag => tag.TagName == "meta" && tag.Attributes["name"] == "Rating" );
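For comparison, the sketch below writes the same kind of predicate once as a C# 2 anonymous method and once as a C# 3 lambda. It does not use RZLib: TagInfo, PredicateDemo, FirstMatch, and Attr are hypothetical stand-ins invented for this example, and the attribute lookup through a Dictionary is an assumption, not a statement about how the library stores attributes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a tag object; NOT the RZLib class.
sealed class TagInfo
{
    public string TagName;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
}

static class PredicateDemo
{
    // Returns the first tag accepted by the predicate, or null if none matches.
    public static TagInfo FirstMatch(IEnumerable<TagInfo> tags, Predicate<TagInfo> match)
    {
        foreach (TagInfo tag in tags)
            if (match(tag)) return tag;
        return null;
    }

    // Returns the attribute value, or null when the attribute is absent.
    public static string Attr(TagInfo tag, string key)
    {
        string value;
        return tag.Attributes.TryGetValue(key, out value) ? value : null;
    }

    // C# 2 style: the predicate is an anonymous method.
    public static TagInfo WithAnonymousMethod(List<TagInfo> tags)
    {
        return FirstMatch(tags, delegate(TagInfo t)
        {
            return t.TagName == "meta" && Attr(t, "name") == "Rating";
        });
    }

    // C# 3 style: the same predicate as a lambda expression.
    public static TagInfo WithLambda(List<TagInfo> tags)
    {
        return FirstMatch(tags, t => t.TagName == "meta" && Attr(t, "name") == "Rating");
    }
}
```

Both methods return the same tag for the same input; only the syntax of the predicate differs.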

GrabCurrentTag() can be used only when CurrentContent is an open or complete tag. It matches the end tag and puts the whole content into an HtmlContentBlock, which has a tree structure.

Example:

HtmlParser parser = new HtmlParser(
@"<html>
    <head>
        <title>abc</title>
    </head>
    <body>any text here...
    </body>
</html>");

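// Debug.Assert (System.Diagnostics) is compiled out of release builds,
// so the call with the side effect is kept outside the assertion.
Boolean found = parser.MoveToTag("head");  // move to the <head> tag
Debug.Assert( found );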

Boolean hasCloseTag;  // false means the parser could not find the matching close tag
HtmlContentBlock headBlock = parser.GrabCurrentTag(out hasCloseTag);

Debug.Assert( hasCloseTag );   // since we have </head>

Debug.Assert( headBlock.TagName == "head" );
Debug.Assert( headBlock.Attributes.Count == 0 );  // no attributes in <head>
Debug.Assert( headBlock.Count == 3 );  // there are 3 contents inside headBlock
Debug.Assert( headBlock[0] is HtmlContentText );  // \r\n between <head> and <title>
Debug.Assert( headBlock[1] is HtmlContentBlock ); // block of <title>
Debug.Assert( headBlock[2] is HtmlContentText );  // \r\n between </title> and </head>

HtmlContentBlock titleBlock = (HtmlContentBlock) headBlock[1];

Debug.Assert( titleBlock.Count == 1 );
Debug.Assert( titleBlock[0].ToString() == "abc" );

Debug.Assert( parser.CurrentContent is HtmlContentCloseTag );  // it is at </head>

The parser currently reads forward only. To read backward, first parse the HTML into a block.

HtmlContentBlock

An HtmlContentBlock is an object that collects only two types of content:

HtmlContentText
HtmlContentBlock

The Count property gives the number of child items (text or block) in the block, and each item can be retrieved through the indexer.

However, using its iterator with LINQ (or with foreach) is probably easier.

It provides two iterators. The default one enumerates only the child HtmlContentBlock items; the other, IterateChild(), enumerates every child HtmlContent. (Note that the default iterator has been changed from IEnumerable<HtmlContent> to IEnumerable<HtmlContentBlock>, and IterateChild() was added for HtmlContent enumeration, because I believe people scan for blocks more often than for text.)
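The two-iterator design can be sketched as follows. This is an illustration only, not the RZLib source: Node, TextNode, and BlockNode are hypothetical stand-ins, and only the name IterateChild() is taken from the article.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-ins for the content classes; NOT the RZLib classes.
abstract class Node { }

sealed class TextNode : Node
{
    public readonly string Text;
    public TextNode(string text) { Text = text; }
}

sealed class BlockNode : Node, IEnumerable<BlockNode>
{
    public readonly string TagName;
    private readonly List<Node> children = new List<Node>();

    public BlockNode(string tagName, params Node[] items)
    {
        TagName = tagName;
        children.AddRange(items);
    }

    // Second iterator: every child, text and blocks alike.
    public IEnumerable<Node> IterateChild()
    {
        return children;
    }

    // Default iterator (used by foreach and LINQ): child blocks only.
    public IEnumerator<BlockNode> GetEnumerator()
    {
        foreach (Node child in children)
        {
            BlockNode block = child as BlockNode;
            if (block != null) yield return block;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

With a head block that contains a line break, a title block, and another line break, a plain foreach over the block visits only the title block, while IterateChild() visits all three children.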

Another way to search for a tag is the FindTag() method, which can also find by either name or predicate. It returns an array of Int32 as an index. The index can be passed to the indexer to get the content, and it can also serve as the start position for the next search.
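One way to read an Int32[] index is as a path into the tree, where each element selects a child one level deeper. That interpretation is an assumption made for this sketch and is not confirmed by the library source; PathNode and FindPath are likewise hypothetical names. Resuming a search from a previous index is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree node; NOT the RZLib HtmlContentBlock.
sealed class PathNode
{
    public readonly string Name;
    public readonly List<PathNode> Children = new List<PathNode>();

    public PathNode(string name, params PathNode[] children)
    {
        Name = name;
        Children.AddRange(children);
    }

    // Indexer taking an index path. ASSUMPTION: each element picks
    // the child at that position, one level deeper than the last.
    public PathNode this[int[] path]
    {
        get
        {
            PathNode current = this;
            foreach (int i in path) current = current.Children[i];
            return current;
        }
    }

    // Depth-first search; returns the path of the first match, or null.
    public int[] FindPath(Predicate<PathNode> match)
    {
        var path = new List<int>();
        return Search(this, match, path) ? path.ToArray() : null;
    }

    private static bool Search(PathNode node, Predicate<PathNode> match, List<int> path)
    {
        for (int i = 0; i < node.Children.Count; i++)
        {
            path.Add(i);
            PathNode child = node.Children[i];
            if (match(child) || Search(child, match, path)) return true;
            path.RemoveAt(path.Count - 1);   // backtrack
        }
        return false;
    }
}
```

For a tree html(head(title), body), searching for title gives the path {0, 0}, and passing that path to the indexer returns the title node again.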

Issues

Script Support

This parser can handle a script tag that contains JavaScript, but not other script languages, and only by simple means. It does not understand all JavaScript syntax, but it can recognise JavaScript strings and comments, and it treats all JavaScript code as normal text.

Bugs

There is at least one case that can cause the parser to fail. When it fails, it throws HtmlParserException. The case I know of is an invalid tag form such as:

This is <b>HTML <i>text</i</b>.

The parser expects a well-formed close tag, but here </i is missing its closing >. I have marked with TODO: the point in HtmlLegacyParser.cs where you can handle this case.

Some Unrelated Issue...

While developing NativeWebSurf, I noticed that the Cookie object returned from WebResponse always contains a path, even when the Set-Cookie response header does not specify one. I don't know whether this is a framework bug, but it could cause trouble because we can never know whether the path was really returned by the Web server. This may not be the right place to ask, but if anyone knows something, I'd be grateful if you could share it :).
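One possible explanation, offered as an assumption rather than a verified answer: the cookie specifications define a default path derived from the request URL when the header omits Path, and the framework may simply be filling that in. The sketch below checks this with CookieContainer directly, without any network access. CookiePathDemo and PathForHeaderWithoutPath are hypothetical names, and whether WebResponse goes through the same defaulting in the situation described above has not been confirmed.

```csharp
using System;
using System.Net;

static class CookiePathDemo
{
    // Stores a Set-Cookie header value that has no Path attribute and
    // returns the Path that CookieContainer reports for the stored cookie.
    public static string PathForHeaderWithoutPath(Uri requestUri)
    {
        var container = new CookieContainer();
        container.SetCookies(requestUri, "session=abc");   // no "path=" attribute
        CookieCollection cookies = container.GetCookies(requestUri);
        return cookies.Count > 0 ? cookies[0].Path : null;
    }
}
```

Calling CookiePathDemo.PathForHeaderWithoutPath(new Uri("http://example.com/app/page.html")) and printing the result shows which path was filled in. The exact value may differ between framework versions, so none is stated here.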

Finally

I hope this library will be useful to others too. If you fix a bug or add a feature, please share it with me.

History

2008-02-27: Updated article
Added a few features and refactored the lib.

NativeWebSurf
+ Supports Cookie cache (with all paths in same host)
+ Supports HTML content charset
* Fixed code to support new HtmlContentBlock iterator
RZLib
HtmlContentBlock
* Changed default iterator to return HtmlContentBlock instead of HtmlContent
+ Indexer with Int32[] index
+ IterateChild() for HtmlContent enumeration
+ FindTag() with predicate
HtmlParser
+ MoveToTag() with predicate

2008-02-24: Posted initial article

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
