Introduction
This class library helps you produce valid XHTML from HTML. It also supports tag and attribute filtering: you specify exactly which tags and attributes are allowed in the output, and everything else is filtered out. You can use it to clean the bulky HTML that Microsoft Word produces when a document is converted to HTML. You can also use it to clean up HTML before posting to a blog, so that your HTML is not rejected by blog engines such as WordPress or b2evolution.
How it Works
There are two classes: HtmlReader and HtmlWriter.
HtmlReader extends the well-known SgmlReader by Chris Lovett. As it reads the HTML, it skips every element whose name carries a prefix. As a result, tags such as <o:p>, <o:Document>, <st1:personname> and hundreds of others are filtered out, so the HTML you read contains only core HTML tags.
HtmlWriter extends the regular XmlWriter, so it produces XML. XHTML is basically HTML in XML format. The familiar tags that have no closing tag, such as <img>, <br> and <hr>, must be written as empty elements in XHTML, i.e. <img .. />, <br/> and <hr/>. Because an XHTML document is well-formed XML, you can read it with any XML parser, which in turn lets you search it with XPath.
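As a rough illustration of the XPath point, the sketch below loads a small XHTML fragment with the standard XmlDocument class and collects every image source. The class name XhtmlQuery, the method ImageSources and the sample markup are made up for this example. It also assumes the fragment declares no default namespace; a document that declares the XHTML namespace would need an XmlNamespaceManager and prefixed element names in the query.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of the library described in this article.
public static class XhtmlQuery
{
    // Returns the value of every src attribute found on an <img> element.
    public static List<string> ImageSources(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed

        List<string> sources = new List<string>();
        foreach (XmlNode attribute in doc.SelectNodes("//img/@src"))
            sources.Add(attribute.Value);
        return sources;
    }

    public static void Main()
    {
        // Sample input assumed for the demo: empty elements are self-closed.
        string xhtml = "<div><p>Logo: <img src=\"logo.png\" /></p><hr /><img src=\"photo.jpg\" /></div>";
        foreach (string src in ImageSources(xhtml))
            Console.WriteLine(src);
    }
}
```

The same query against the original HTML, with its unclosed <img> and <hr> tags, would fail at LoadXml, which is why the conversion to XHTML has to come first.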
HtmlReader
HtmlReader is pretty simple. Here's the entire class:
using System.IO;
using System.Xml;

public class HtmlReader : Sgml.SgmlReader
{
    public HtmlReader( TextReader reader ) : base( )
    {
        base.InputStream = reader;
        base.DocType = "HTML";
    }

    public HtmlReader( string content ) : base( )
    {
        base.InputStream = new StringReader( content );
        base.DocType = "HTML";
    }

    public override bool Read()
    {
        bool status = base.Read();
        if( status )
        {
            if( base.NodeType == XmlNodeType.Element )
            {
                // A colon in the name means the element has a prefix
                // (e.g. <o:p>), so skip it together with its children
                if( base.Name.IndexOf(':') > 0 )
                    base.Skip();
            }
        }
        return status;
    }
}
HtmlWriter
This class is a bit trickier. Here are the tricks that have been used:
- Overrides the WriteString method of XmlWriter and prevents it from encoding content using regular XML encoding. The encoding is done manually for HTML documents.
- WriteStartElement is overridden to prevent tags from being written to the output that are not allowed.
- WriteAttributes is overridden to prevent unwanted attributes.
Let's take a look at the entire class part-by-part:
Configurability
You can configure HtmlWriter by modifying the following block:
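public class HtmlWriter : XmlTextWriter
{
    // When true, tags, attributes and whitespace are filtered as configured below
    public bool FilterOutput = false;
    // Collapse consecutive spaces in text into one
    public bool ReduceConsecutiveSpace = true;
    // Tags that are written to the output unchanged
    public string [] AllowedTags =
        new string[] { "p", "b", "i", "u", "em", "big", "small",
            "div", "img", "span", "blockquote", "code", "pre", "br", "hr",
            "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt"};
    // Tag written in place of any tag that is not allowed
    public string ReplacementTag = "dd";
    // Replace line breaks in text with a space
    public bool RemoveNewlines = true;
    // Attributes that are written to the output
    public string [] AllowedAttributes = new string[]
    {
        "class", "href", "target", "border", "src",
        "align", "width", "height", "color", "size"
    };
}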
WriteString Method
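public override void WriteString(string text)
{
    // WriteRaw performs no escaping, so the text is encoded by hand.
    // Escape '&' first so the entities added below are not escaped again.
    text = text.Replace( "&", "&amp;" );
    // Write non-breaking spaces as a character reference
    text = text.Replace( "\u00A0", "&#160;" );
    // Strip CDATA markers before '<' and '>' are encoded
    text = text.Replace( "<![CDATA[", "" );
    text = text.Replace( "]]>", "" );
    text = text.Replace( "<", "&lt;" );
    text = text.Replace( ">", "&gt;" );
    text = text.Replace( "'", "&apos;" );
    text = text.Replace( "\"", "&quot;" );
    if( this.FilterOutput )
    {
        text = text.Trim();
        // We want to replace consecutive spaces
        // to one space in order to save horizontal width
        if( this.ReduceConsecutiveSpace )
        {
            while( text.IndexOf( "  " ) >= 0 )
                text = text.Replace( "  ", " " );
        }
        if( this.RemoveNewlines )
            text = text.Replace( Environment.NewLine, " " );
    }
    base.WriteRaw( text );
}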
WriteStartElement: Applying Tag Filtering
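public override void WriteStartElement(string prefix,
    string localName, string ns)
{
    if( this.FilterOutput )
    {
        // Compare the lower-cased tag name against the list of allowed tags
        bool canWrite = false;
        string tagLocalName = localName.ToLower();
        foreach( string name in this.AllowedTags )
        {
            if( name == tagLocalName )
            {
                canWrite = true;
                break;
            }
        }
        // A disallowed tag is not dropped; it is written as the replacement
        // tag so that the matching end tag still has an element to close
        if( !canWrite )
            localName = this.ReplacementTag;
    }
    base.WriteStartElement(prefix, localName, ns);
}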
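To see the effect of this override in isolation, here is a small stand-alone sketch. It is not the article's HtmlWriter: the names TagFilterWriter and TagFilterDemo, the two-tag allow list and the sample input are all invented for the demo, and it reads from a plain XmlReader rather than HtmlReader so that it has no dependency on SgmlReader. It relies on XmlWriter.WriteNode calling the virtual WriteStartElement for every element it copies.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical demo writer that applies only the tag-replacement idea.
public class TagFilterWriter : XmlTextWriter
{
    private readonly HashSet<string> allowed;
    private readonly string replacement;

    public TagFilterWriter(TextWriter output, IEnumerable<string> allowedTags, string replacementTag)
        : base(output)
    {
        // Case-insensitive set, so "DIV" and "div" are treated alike
        allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);
        replacement = replacementTag;
    }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        // Swap the name of a disallowed tag; its content is still written
        if (!allowed.Contains(localName))
            localName = replacement;
        base.WriteStartElement(prefix, localName, ns);
    }
}

public static class TagFilterDemo
{
    // Copies well-formed markup through the filtering writer.
    public static string Clean(string xml)
    {
        StringWriter output = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (TagFilterWriter writer = new TagFilterWriter(output, new string[] { "div", "b" }, "dd"))
        {
            writer.WriteNode(reader, true);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Clean("<div><script>x</script><b>ok</b></div>"));
        // prints: <div><dd>x</dd><b>ok</b></div>
    }
}
```

Note that the text inside the disallowed tag survives; only the tag name changes. If the content of certain tags must disappear as well, that has to be handled separately, for example by skipping those elements in the reader.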
WriteAttributes Method: Applying Attribute Filtering
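// Body of the overridden WriteAttributes(XmlReader reader, bool defattr),
// executed for each attribute the reader is positioned on.

// Check whether the attribute is in the allowed list
bool canWrite = false;
string attributeLocalName = reader.LocalName.ToLower();
foreach( string name in this.AllowedAttributes )
{
    if( name == attributeLocalName )
    {
        canWrite = true;
        break;
    }
}
if( canWrite )
    this.WriteStartAttribute(reader.Prefix,
        attributeLocalName, reader.NamespaceURI);
// The attribute value is always consumed from the reader,
// but it is written only when the attribute is allowed
while (reader.ReadAttributeValue())
{
    if (reader.NodeType == XmlNodeType.EntityReference)
    {
        if( canWrite ) this.WriteEntityRef(reader.Name);
        continue;
    }
    if( canWrite ) this.WriteString(reader.Value);
}
if( canWrite ) this.WriteEndAttribute();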
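The fragment above is only the per-attribute part of the method. For a complete picture, the following self-contained sketch shows one way a whole WriteAttributes override could look. It is a simplified variant, not the article's implementation: it writes each allowed attribute with WriteAttributeString instead of walking the value with ReadAttributeValue, it ignores the defattr flag, and the names AttributeFilterWriter and AttributeFilterDemo, the allow list and the input are assumptions made for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical demo writer that keeps only allow-listed attributes.
public class AttributeFilterWriter : XmlTextWriter
{
    private readonly HashSet<string> allowed;

    public AttributeFilterWriter(TextWriter output, IEnumerable<string> allowedAttributes)
        : base(output)
    {
        allowed = new HashSet<string>(allowedAttributes);
    }

    // Called by WriteNode after it has written the start tag of an element.
    public override void WriteAttributes(XmlReader reader, bool defattr)
    {
        if (reader.NodeType != XmlNodeType.Element || !reader.MoveToFirstAttribute())
            return; // nothing to copy

        do
        {
            string name = reader.LocalName.ToLowerInvariant();
            if (allowed.Contains(name))
                WriteAttributeString(name, reader.Value);
        }
        while (reader.MoveToNextAttribute());

        // Put the reader back on the element so the caller can continue
        reader.MoveToElement();
    }
}

public static class AttributeFilterDemo
{
    public static string Clean(string xml)
    {
        StringWriter output = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (AttributeFilterWriter writer = new AttributeFilterWriter(output, new string[] { "href", "class" }))
        {
            writer.WriteNode(reader, true);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Clean("<a href=\"x.html\" onclick=\"evil()\">t</a>"));
        // prints: <a href="x.html">t</a>
    }
}
```

Because this variant drops every attribute that is not on the list, namespace declarations are removed too, which is acceptable for prefix-free HTML but would need care in a general XML pipeline.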
Conclusion
The sample application is a utility that you can use right now to clean HTML files. You can use this class in applications like blogging tools where you need to post HTML to some web service.