HTML fragments parsing and creation

15 Mar 2003
Classes to parse HTML parts into an object tree and back

Introduction

When you look at the Controls collection of the Page object, you quickly realize that all the interesting markup ends up in LiteralControl objects as plain strings, so there is no comfortable way to insert or change that text. I therefore wrote a few classes that take a string apart into objects. These objects can be changed safely and then generate a new string for a LiteralControl.

I ran into this problem while writing a page template class. You can take the literal code out of the .aspx file, but then the designer does not seem to work very well. I prefer to keep using the designer and change the header literal from the page template instead.

Background

Parsing HTML is not much fun, so I made a few restrictions:

  • The parser does not really understand HTML, only text, tags and attributes. It does not care what their names and values are.
  • Badly formed input will result in poor output. (For example, <meta> is often not closed, so the only way to place a following tag is as a child tag. The source text must then be changed to something like <meta ... />.)
  • Since you can insert plain text into the resulting object tree, you can easily ruin the output. (For example, inserted text like "<junk" will be rendered as "<junk" and not as "&lt;junk".) Remember, the brain is in front of the screen :-)
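
Because of the last restriction, text that should appear literally has to be encoded by the caller before it goes into the tree. The following is a minimal sketch of that step, not part of the article's classes: it uses the standard WebUtility.HtmlEncode method from System.Net (in ASP.NET code of this era, HttpUtility.HtmlEncode from System.Web plays the same role). The EscapeDemo class and its Escape helper are hypothetical names used only for illustration; the result would be what you pass on to a FragmentText.

```csharp
using System;
using System.Net;

public static class EscapeDemo
{
    // Encodes characters such as '<', '>' and '&' so that a browser
    // shows the text literally instead of interpreting it as markup.
    public static string Escape(string raw)
    {
        return WebUtility.HtmlEncode(raw);
    }

    public static void Main()
    {
        Console.WriteLine(Escape("<junk"));   // prints &lt;junk
    }
}
```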

Using the code

The main class is Fragments. Its constructor takes a string, which is parsed into objects. Fragments is a collection of (guess what?) Fragment objects. Fragment is the base class of FragmentText (representing simple plain text), FragmentTag (representing a tag <tagname attr="value" ... >), FragmentComment (for a comment <!-- ... -->) and FragmentDoctype (e.g. <!DOCTYPE HTML ... >).

The objects can be changed, added or removed as in any collection. Objects of type FragmentTag have a property Nodes representing the sub tags. Since we parse a fragment, there can be unmatched tags (i.e. only opening or only closing tags). Therefore FragmentTag has a property Type, which states whether an opening tag, a closing tag or both are present. The value OpenCloseShort stands for tags of the kind <br/>. Obviously these tags cannot have Nodes.

Finally, calling the ToString() method transforms the Fragment back into a plain HTML string.

// Parse an HTML fragment into an object tree.
Fragments fragments = new Fragments( someString );
foreach ( Fragment fragment in fragments )
{
  if ( fragment is FragmentTag )
  {
    // Append a plain-text child to this tag.
    FragmentTag tag = (FragmentTag)fragment;
    tag.Nodes.Add( new FragmentText( "plain text" ) );
  }
  if ( fragment is FragmentText )
  {
  ...
  }
}
// Turn the modified tree back into an HTML string.
string s = fragments.ToString();
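
To make the idea of a tag type and the round trip back to a string more concrete, here is a small self-contained sketch. It does not use the article's classes: Node, TextNode, TagNode and TagKind are hypothetical stand-ins, attributes are left out, and of the enum values only OpenCloseShort is taken from the text above; the other names are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-ins, NOT the article's Fragment classes.
// Only the name OpenCloseShort comes from the article; the rest is assumed.
public enum TagKind { Open, Close, OpenClose, OpenCloseShort }

public abstract class Node { }

public sealed class TextNode : Node
{
    private readonly string text;
    public TextNode(string text) { this.text = text; }
    public override string ToString() { return text; }
}

public sealed class TagNode : Node
{
    public readonly string Name;
    public readonly TagKind Kind;
    public readonly List<Node> Children = new List<Node>();

    public TagNode(string name, TagKind kind) { Name = name; Kind = kind; }

    public override string ToString()
    {
        switch (Kind)
        {
            // Short and closing-only tags never render children.
            case TagKind.OpenCloseShort: return "<" + Name + "/>";
            case TagKind.Close: return "</" + Name + ">";
        }
        var sb = new StringBuilder("<" + Name + ">");
        foreach (Node child in Children) sb.Append(child);
        // An opening-only tag has no matching closing tag in the fragment.
        if (Kind == TagKind.OpenClose) sb.Append("</" + Name + ">");
        return sb.ToString();
    }
}

public static class TreeDemo
{
    public static string Build()
    {
        var p = new TagNode("p", TagKind.OpenClose);
        p.Children.Add(new TextNode("line one"));
        p.Children.Add(new TagNode("br", TagKind.OpenCloseShort));
        p.Children.Add(new TextNode("line two"));
        return p.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build());   // prints <p>line one<br/>line two</p>
    }
}
```

Keeping the tag kind on the node is what lets an unmatched opening or closing tag survive the round trip unchanged instead of being "repaired" on output.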

You can also start with an empty Fragments object, insert everything into it and generate the output.

There is a small sample program with the sources, which I use for testing. It demonstrates most of the usage.

Points of interest

I use the Regex class to split the input into pieces. The pattern is rather unreadable, but its basic structure is pattern1|pattern2|pattern3|.... It took me some time to understand that each match corresponds to exactly one of these alternatives. Therefore I gave each alternative its own named group, plus some sub groups for parameters and names. Also note that the next match does not necessarily start exactly where the last one ended; the search merely resumes from there. So we have to keep track ourselves of whether all of the input has been parsed.
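
The two observations above can be sketched with a much simpler pattern than the one in the sources. This is an illustration only, not the article's parser: TokenizerSketch, its two-alternative pattern and the "text:"/"tag:"/"comment:" prefixes are made up for this example, and the tag alternative does not cope with a '>' inside a quoted attribute value. It relies only on the standard Regex, Match.NextMatch and named-group APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Simplified pattern (an assumption, not the article's):
    // two alternatives, each wrapped in its own named group.
    private static readonly Regex Pattern = new Regex(
        @"(?<comment><!--.*?-->)|(?<tag><[^>]*>)",
        RegexOptions.Singleline);

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        int pos = 0; // end of the last piece we have consumed

        for (Match m = Pattern.Match(input); m.Success; m = m.NextMatch())
        {
            // The match may start after pos; whatever lies in between
            // was skipped by the regex and must be kept as plain text.
            if (m.Index > pos)
                tokens.Add("text:" + input.Substring(pos, m.Index - pos));

            // Exactly one of the named groups succeeded for this match.
            if (m.Groups["comment"].Success)
                tokens.Add("comment:" + m.Value);
            else
                tokens.Add("tag:" + m.Value);

            pos = m.Index + m.Length;
        }

        // Text after the final match would otherwise be lost.
        if (pos < input.Length)
            tokens.Add("text:" + input.Substring(pos));

        return tokens;
    }

    public static void Main()
    {
        foreach (string t in Tokenize("a<b>c<!-- d -->e"))
            Console.WriteLine(t);
    }
}
```

For the input "a<b>c<!-- d -->e" this yields five pieces: the text "a", the tag "<b>", the text "c", the comment and the text "e". Without the pos bookkeeping, the three text pieces would silently disappear.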

History

  • Version 1.0 - first release
  • Version 1.1 - bug fixes (exception inside exception, parsing of nested quotes)

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.