The .NET Framework provides a plethora of tools for generating HTML markup, and for both generating and parsing XML markup. However, it provides very little in the way of support for parsing HTML markup.
I had some pretty old code (written in classic Visual Basic) for spidering websites and I had ported it over to C#. Spidering generally involves parsing out all the links on a particular web page and then following those links and doing the same for those pages. Spidering is how companies like Google scour the Internet.
My ported code worked pretty well, but it wasn’t very forgiving. For example, I had a website that allowed users to enter a URL of a page that had a link to our site in return for a free promotion. The code would scan the given URL for a backlink. However, sometimes it would report there was no backlink when there really was.
The problem occurred when the user’s web page contained syntax errors, such as an attribute value with no closing quote. My code would skip ahead past large amounts of markup, looking for that quote.
So I rewrote the code to be more flexible—as most browsers are. In the case of attribute values missing closing quotes, my code assumes the value has terminated whenever it encounters a line break. I made other changes as well, primarily designed to make the code simpler and more robust.
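To make the closing-quote rule concrete, here is a minimal sketch of the idea in isolation. The class and method names (LenientQuoteDemo, ReadQuotedValue) are hypothetical and exist only for this illustration; the parser in Listing 1 applies the same rule inside its own ParseAttributeValue() method.

```csharp
using System;

public static class LenientQuoteDemo
{
    // Hypothetical helper for illustration only.
    // 'start' must be the index of the opening quote character.
    public static string ReadQuotedValue(string html, int start)
    {
        char quote = html[start];
        int valueStart = start + 1;

        // Stop at the matching quote OR at a line break, whichever comes first
        int end = html.IndexOfAny(new[] { quote, '\r', '\n' }, valueStart);

        // No terminator at all: treat the rest of the document as the value
        if (end < 0)
            end = html.Length;

        return html.Substring(valueStart, end - valueStart);
    }
}
```

With this rule, a missing closing quote costs at most the rest of the current line instead of everything up to the next quote character in the document.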
Listing 1 is the HtmlParser class I came up with. Note that there are many ways you can parse HTML. My code is only interested in tags and their attributes and does not look at text that comes between tags. This is perfect for spidering links in a page.
The ParseNext() method is called to find the next occurrence of a tag. When it finds one, it returns true and sets its out parameter to an HtmlTag object that describes the tag. The caller indicates the type of tag it wants information about (or “*” if it wants information about all tags).
Parsing HTML markup is fairly simple. As I mentioned, much of my time was spent making the code handle markup errors intelligently. There were a few other special cases as well. For example, if the code finds a <script> tag, it automatically scans to the closing </script> tag, if any. Scripts can contain HTML markup characters (such as < and >) that would confuse the parser, so I simply jump over them. I take similar action with HTML comments and have special handling for !DOCTYPE tags as well.
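The script-skipping step can be sketched on its own as follows. This is a simplified, standalone illustration rather than the code from Listing 1, and the names ScriptSkipDemo and SkipScriptBody are made up for the example.

```csharp
using System;

public static class ScriptSkipDemo
{
    // Hypothetical helper for illustration only.
    // Given a position just past an opening <script> tag, returns the
    // position just past the matching </script> tag, or the end of the
    // document if there is no closing tag.
    public static int SkipScriptBody(string html, int pos)
    {
        const string endScript = "</script";
        int end = html.IndexOf(endScript, pos, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
            return html.Length;

        end += endScript.Length;

        // Allow whitespace before the closing '>' (e.g. "</script >")
        while (end < html.Length && Char.IsWhiteSpace(html[end]))
            end++;
        if (end < html.Length && html[end] == '>')
            end++;

        return end;
    }
}
```

Because the script body is never examined, an expression such as a < b inside the script cannot be mistaken for the start of a tag.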
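using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HtmlParser
{
    // Describes a single tag found by HtmlParser.ParseNext()
    public class HtmlTag
    {
        // Tag name as it appears in the markup (e.g. "a")
        public string Name { get; set; }
        // Attribute names and their values
        public Dictionary<string, string> Attributes { get; set; }
        // True if the tag ends with a trailing slash (e.g. <br />)
        public bool TrailingSlash { get; set; }
    }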
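
    // Simple, forgiving parser that extracts tags and their attributes
    public class HtmlParser
    {
        protected string _html;       // Markup being parsed
        protected int _pos;           // Current position within _html
        protected bool _scriptBegin;  // True if the last tag parsed opened a script block

        public HtmlParser(string html)
        {
            Reset(html);
        }

        // Moves the current position back to the start of the document
        public void Reset()
        {
            _pos = 0;
        }

        // Sets a new document and moves to the start of it
        public void Reset(string html)
        {
            _html = html;
            _pos = 0;
        }

        // Indicates if the current position is at the end of the document
        public bool EOF
        {
            get { return (_pos >= _html.Length); }
        }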
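
        // Parses the next tag that matches the given tag name.
        // Pass "*" to match any tag. Returns true and sets 'tag' when a
        // matching tag is found; returns false at the end of the document.
        public bool ParseNext(string name, out HtmlTag tag)
        {
            tag = null;

            // Nothing to do if no tag name was specified
            if (String.IsNullOrEmpty(name))
                return false;

            // Loop until a matching tag is found or the document ends
            while (MoveToNextTag())
            {
                // Skip the opening '<'
                Move();

                char c = Peek();
                if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
                {
                    // Skip over HTML comments
                    const string endComment = "-->";
                    _pos = _html.IndexOf(endComment, _pos, StringComparison.Ordinal);
                    NormalizePosition();
                    Move(endComment.Length);
                }
                else if (c == '/')
                {
                    // Skip over closing tags
                    _pos = _html.IndexOf('>', _pos);
                    NormalizePosition();
                    Move();
                }
                else
                {
                    // Parse the tag
                    bool result = ParseTag(name, ref tag);

                    // Scripts may contain markup characters, so skip
                    // ahead past the closing </script> tag
                    if (_scriptBegin)
                    {
                        const string endScript = "</script";
                        _pos = _html.IndexOf(endScript, _pos,
                            StringComparison.OrdinalIgnoreCase);
                        NormalizePosition();
                        Move(endScript.Length);
                        SkipWhitespace();
                        if (Peek() == '>')
                            Move();
                    }

                    // Return true if the requested tag was found
                    if (result)
                        return true;
                }
            }
            return false;
        }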
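
        // Parses the tag at the current position (just past the '<').
        // Returns true and populates 'tag' if it is a requested tag.
        protected bool ParseTag(string name, ref HtmlTag tag)
        {
            string s = ParseTagName();

            // Special handling for !DOCTYPE and script tags
            bool doctype = _scriptBegin = false;
            if (String.Compare(s, "!DOCTYPE", true) == 0)
                doctype = true;
            else if (String.Compare(s, "script", true) == 0)
                _scriptBegin = true;

            // Is this a tag the caller requested?
            bool requested = false;
            if (name == "*" || String.Compare(s, name, true) == 0)
            {
                tag = new HtmlTag();
                tag.Name = s;
                // HTML attribute names are not case sensitive, so that
                // "href" also finds HREF
                tag.Attributes = new Dictionary<string, string>(
                    StringComparer.OrdinalIgnoreCase);
                requested = true;
            }

            // Parse attributes. The EOF test prevents an endless loop
            // when the document ends inside an unterminated tag.
            SkipWhitespace();
            while (!EOF && Peek() != '>')
            {
                if (Peek() == '/')
                {
                    // Handle trailing forward slash
                    if (requested)
                        tag.TrailingSlash = true;
                    Move();
                    SkipWhitespace();
                    // A self-closing script tag has no body to skip
                    _scriptBegin = false;
                }
                else
                {
                    // Parse attribute name (!DOCTYPE arguments may be quoted)
                    s = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
                    SkipWhitespace();

                    // Parse attribute value, if any
                    string value = String.Empty;
                    if (Peek() == '=')
                    {
                        Move();
                        SkipWhitespace();
                        value = ParseAttributeValue();
                        SkipWhitespace();
                    }

                    // Store the attribute; a repeated name replaces the earlier value
                    if (requested)
                        tag.Attributes[s] = value;
                }
            }

            // Skip over the closing '>'
            Move();

            return requested;
        }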
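
        // Parses a tag name starting at the current position
        protected string ParseTagName()
        {
            int start = _pos;
            // Also stop at '/' so that <br/> yields "br" rather than "br/"
            while (!EOF && !Char.IsWhiteSpace(Peek()) && Peek() != '>'
                && Peek() != '/')
                Move();
            return _html.Substring(start, _pos - start);
        }

        // Parses an attribute name starting at the current position
        protected string ParseAttributeName()
        {
            int start = _pos;
            while (!EOF && !Char.IsWhiteSpace(Peek()) && Peek() != '>'
                && Peek() != '=')
                Move();
            return _html.Substring(start, _pos - start);
        }

        // Parses an attribute value starting at the current position.
        // The value may be quoted or unquoted.
        protected string ParseAttributeValue()
        {
            int start, end;
            char c = Peek();
            if (c == '"' || c == '\'')
            {
                // Skip the opening quote
                Move();
                start = _pos;
                // A missing closing quote is assumed to end at the line break
                _pos = _html.IndexOfAny(new char[] { c, '\r', '\n' }, start);
                NormalizePosition();
                end = _pos;
                // Skip the closing quote, if present
                if (Peek() == c)
                    Move();
            }
            else
            {
                // Unquoted value runs to the next whitespace or '>'
                start = _pos;
                while (!EOF && !Char.IsWhiteSpace(c) && c != '>')
                {
                    Move();
                    c = Peek();
                }
                end = _pos;
            }
            return _html.Substring(start, end - start);
        }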
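
        // Moves to the start of the next tag. Returns false if there
        // are no more tags.
        protected bool MoveToNextTag()
        {
            _pos = _html.IndexOf('<', _pos);
            NormalizePosition();
            return !EOF;
        }

        // Returns the character at the current position, or (char)0
        // at the end of the document
        public char Peek()
        {
            return Peek(0);
        }

        // Returns the character the given number of positions ahead,
        // or (char)0 if that is past the end of the document
        public char Peek(int ahead)
        {
            int pos = (_pos + ahead);
            if (pos < _html.Length)
                return _html[pos];
            return (char)0;
        }

        // Moves the current position ahead one character
        protected void Move()
        {
            Move(1);
        }

        // Moves the current position ahead, stopping at the end of the document
        protected void Move(int ahead)
        {
            _pos = Math.Min(_pos + ahead, _html.Length);
        }

        // Moves the current position to the next non-whitespace character
        protected void SkipWhitespace()
        {
            while (!EOF && Char.IsWhiteSpace(Peek()))
                Move();
        }

        // Converts the -1 returned by a failed IndexOf() search into
        // the end-of-document position
        protected void NormalizePosition()
        {
            if (_pos < 0)
                _pos = _html.Length;
        }
    }
}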
Listing 1: The HtmlParser class.
Using the class is very easy. Listing 2 shows sample code that scans a web page for all the HREF values in A (anchor) tags. It downloads a URL and loads the contents into an instance of the HtmlParser class. It then calls ParseNext() with a request to return information about all A tags.
When ParseNext() returns, tag is set to an instance of the HtmlTag class with information about the tag that was found. This class includes a collection of attribute values, which my code uses to locate the value of the HREF attribute.
When ParseNext() returns false, the end of the document has been reached.
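// Requires: using System.Net;
protected void ScanLinks(string url)
{
    string html;

    // Download the page; WebClient is disposable
    using (WebClient client = new WebClient())
    {
        html = client.DownloadString(url);
    }

    // Scan all anchor tags
    HtmlTag tag;
    HtmlParser parse = new HtmlParser(html);
    while (parse.ParseNext("a", out tag))
    {
        // See if this anchor has an HREF attribute
        string value;
        if (tag.Attributes.TryGetValue("href", out value))
        {
            // value contains the URL referenced by this link
        }
    }
}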
Listing 2: Code that demonstrates using the HtmlParser class
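Listing 2 leaves open what to do with each HREF value. If the goal is spidering, one likely next step is to turn relative links into absolute URLs before following them. The sketch below is an assumption about how that could be done with System.Uri, not part of my original code; the LinkResolver class and ResolveLink method are hypothetical names.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper for illustration only.
    // Resolves 'href' against the URL of the page it was found on.
    // Returns null if the result is not an http or https URL.
    public static string ResolveLink(string pageUrl, string href)
    {
        Uri baseUri, resolved;

        // The page URL itself must be absolute
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri))
            return null;

        // Combines the base URL with a relative or absolute href
        if (!Uri.TryCreate(baseUri, href, out resolved))
            return null;

        // Ignore links a spider cannot download (e.g. mailto:)
        if (resolved.Scheme != Uri.UriSchemeHttp &&
            resolved.Scheme != Uri.UriSchemeHttps)
            return null;

        return resolved.AbsoluteUri;
    }
}
```

Inside the if block of Listing 2, a call such as LinkResolver.ResolveLink(url, value) would then give a URL that can be compared against a backlink or queued for scanning.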
While I’ll probably find that this code needs a few tweaks and fixes, it seems to work well. I found similar code on the web, but didn’t like it. My code is fairly simple, does not rely on large library routines, and seems to perform well. I hope you are able to benefit from it.