HTML Parsers: The Journey to a More Semantic yet Forgiving Web

danielthesolver

8 Dec 2011 | CPOL | 6 min read

HTML5 is the next major revision of the HTML standard. If all goes well, it should become the dominant markup language in the near future, ousting both HTML4 and XHTML1 from their cozy positions. A lot of people say HTML5 is the next big thing. In one sense, yes; in another, no. HTML5 isn’t an entirely different markup language. It’s a specification that adds some features to, and removes others from, the existing HTML4 specification. It is the next big thing in that it will change the way we mark up our pages: it attaches more meaning to elements, making HTML pages more semantic. Apart from making the web more semantic, HTML5 will also standardize a lot of features across the major browsers. Finally, there will be a set of elements that all browsers implement and that will hopefully behave the same way everywhere. No browser will be left out, including IE. The IE6 death countdown might even run faster now; check out the IE6 countdown website at http://www.ie6countdown.com/. OK, that’s HTML5. What about XHTML1 and HTML4? Do they still exist, and will they keep existing? They are still around and will be for a while, until the browsers are standardized and old browsers start to wither away.

All the HTML (and XHTML1) standards have parsers implemented in most of the non-trivial languages that power web applications. There are XHTML1 and HTML4 parsers written in PHP, Ruby, C++, and others. Most of them use the C library libxml to build and traverse the DOM. It was made for XML, so its XML parser is very strict. The documentation and code for libxml live at http://xmlsoft.org/. libxml is therefore appropriate for parsing XML, but not for parsing the transitional flavors of HTML or XHTML, and not for HTML5 either. HTML5 allows the developer some laxity. That’s why there are parsers written specifically for HTML5: an XHTML or HTML4 parser cannot be relied on to parse HTML5 properly. It’s different and COOL!!

HTML5 adds some new tags to the spec: <article>, <aside>, <header>, and <section>, to mention a few. These tags make HTML pages more descriptive and, in turn, make the web more semantic. They will make developing and deploying web bots easier, because a bot can now identify the different parts of a page and know what the data inside each element represents. It can tell whether a piece of writing is a stand-alone article, whether it is only tangentially related to its surrounding content, whether it is a header section, and even how to outline the headers (using hgroup), and so on. For more information on the semantics of the new HTML5 tags and their use, please see Dive into HTML5. I think it’s a really practical and non-trivial guide to the new and emerging HTML5 specification. These new tags alone could throw the existing HTML4 and XHTML parsers off the edge. But HTML5 parsers have more complications to handle. HTML5 is so FORGIVING! The <html>, <head>, and <body> tags, which XHTML requires and which most authors have always treated as mandatory, are now IMPLIED! That means your HTML5 page need not write these pivotal tags at all. You can have a page that looks like this:

<li>My boy is coming

The above markup is represented the same way in dom as this:

<html>
<head>
</head>
<body><li>My boy is coming</li>
</body>
</html>

THOSE PIVOTAL ALL-IMPORTANT CAN'T-DO-WITHOUT TAGS ARE NOW IMPLIED. It seems like a mess, but it's not. Since every page has to have these elements anyway, why not let the parser supply them for the author as the page is read into the DOM? HTML5 parsers have to handle this situation. Soon, you'll see how HTML5 parsers handle this odd syntax.
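As a rough illustration of implied elements, the sketch below feeds the one-line fragment to PHP's bundled DOMDocument and walks up from the <li> to the root. Note the assumption: DOMDocument::loadHTML() relies on libxml2's older HTML parser, not on the HTML5 parsing algorithm, so it only approximates what an HTML5 parser does (whether a <head> appears, for instance, depends on the parser). The variable names are made up for this example.

```php
<?php
// Sketch only: libxml2's HTML parser stands in for a real HTML5 parser here.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$dom = new DOMDocument();
$dom->loadHTML('<li>My boy is coming');

// Start at the <li> the author wrote and climb towards the document root.
$li = $dom->getElementsByTagName('li')->item(0);

$chain = [];
for ($node = $li; $node instanceof DOMElement; $node = $node->parentNode) {
    $chain[] = $node->nodeName;
}

// Every name after "li" is an element the parser supplied on its own.
echo implode(' < ', $chain), "\n";
```

If the parser recovers the way the DOM listing above suggests, the printed chain should include body and html even though neither tag appears in the input.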

HTML5 parsing has been specified in detail by the WHATWG to help authors build effective HTML5 parsers. The group’s document explains how to parse HTML5 documents. It is very important to parser authors and should be consulted continually, since the HTML5 specification is still changing (it will not be fixed for some time, I assure you). The document is very detailed. It even covers the input stream and the character encodings that an HTML5 parser should be able to handle. That’s why it didn’t take long for HTML5 parsers to spring up.

One prominent suite of HTML5 parsers, implemented in multiple languages, is html5lib. There are currently Python, PHP, and Ruby implementations available for download. Only the Python version, though, is still being actively developed. The Ruby version is dead, while the PHP version is still in its alpha release state, I think. html5lib has also been ported to JavaScript/node.js, but that port seems to be an event-driven parser, so it might be a SAX (Simple API for XML) parser rather than a DOM parser (which most people use). The SAX model fits node.js well, since node.js is also event-driven. But event-driven XML parsers usually stop when they encounter malformed syntax, and errors discourage HTML authors, who most of the time don’t know what the standard is (it might change tomorrow).

There’s also an implementation for Java programmers (expected, right?). It’s called the Validator.nu HTML Parser, and it offers SAX, DOM, and XOM APIs (jack of all trades). There’s no port to functional languages like Clojure yet. Awwww… Maybe I’ll port it to Clojure.
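To make the SAX-versus-DOM distinction concrete, here is a minimal sketch using PHP's own event-driven XML parser (the ext/xml functions). It is not the node.js port mentioned above, and the helper name sax_events() is invented for this example. Handlers fire as tags stream past and no tree is built; when the input is not well-formed, the parser reports an error instead of repairing it.

```php
<?php
// Sketch: record the events an XML SAX-style parser emits for a string.
function sax_events(string $markup): array
{
    $events = [];
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$events) {
            $events[] = "start:$name";
        },
        function ($p, string $name) use (&$events) {
            $events[] = "end:$name";
        }
    );

    // Third argument true = this is the final chunk, so unclosed tags are an error.
    if (xml_parse($parser, $markup, true) !== 1) {
        $events[] = 'error:' . xml_error_string(xml_get_error_code($parser));
    }
    return $events;
}

print_r(sax_events('<ul><li>closed</li></ul>'));  // well-formed: only start/end events
print_r(sax_events('<li>Tell me all about ya!')); // unclosed <li>: ends with an error entry
```

A DOM parser would hand back a tree to query afterwards; here the caller only ever sees the stream of events, which is why recovery from bad markup is harder to bolt on.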

You might ask: what about HTML5 validation? HTML5 validation isn’t really necessary anymore, thanks to HTML5’s very forgiving syntax. What’s there to validate when your page does not even need a root tag?

Some time ago, I was playing with some HTML parsers and comparing how they handle malformed HTML. My tests were written entirely in PHP. I fed some malformed markup to DOMDocument, html5lib, and the PHP Simple HTML DOM Parser. The Simple HTML DOM Parser is basically PHP’s DOMDocument on steroids: it allows for easy traversal of the DOM. For example, suppose you want to find all the image elements in an HTML page. Using the DOMDocument library in PHP, you would write something like:

$document->getElementsByTagName("img"); // returns a DOMNodeList of the image nodes in the DOM representation of your
// just-created HTML document ($document is a DOMDocument)

Using the simple html dom, you can do this:

$html->find('img'); // where $html is the root of your document, comparable to document.documentElement in a browser
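For a complete, runnable version of the DOMDocument lookup, the following sketch lists the src attribute of every image in a small page. The markup and the variable names are invented for illustration.

```php
<?php
// Sketch: collect the src of every <img> with PHP's built-in DOMDocument.
libxml_use_internal_errors(true); // keep libxml warnings out of the output

$document = new DOMDocument();
$document->loadHTML('<p><img src="a.png" alt="A"><img src="b.png" alt="B"></p>');

$sources = [];
foreach ($document->getElementsByTagName('img') as $img) {
    // Each item in the DOMNodeList is a DOMElement.
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```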

I can show only part of the rundown and results of the tests I ran on the HTML parsers, and not in exact terms either. I had some strings containing well-formed HTML4 Transitional, XHTML Transitional, and HTML5, as well as malformed versions of each.

$first = "<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>";
$second = "<li>Tell me all about ya!";
$third = "<body>
<p>I was with you</p>
</body>";

Then I ran these strings through the HTML5 parser and the DOMDocument library in PHP like this:

$dom1 = new DOMDocument("1.0", "utf-8");
$dom1->formatOutput = true;
$dom1->loadXML($first); // did the same thing for $second and $third
echo $dom1->saveXML();
echo "\n";
/*
The above code loads the string as XML, so $first and $third pass, since they are both
well-formed XML, but $second is malformed: libxml complains that the <li> tag isn't matched by a closing tag.
*/

$impl = new DOMImplementation(); // createDocument() and createDocumentType() are instance methods
$dom3 = $impl->createDocument(null, 'html',
$impl->createDocumentType("html",
"-//W3C//DTD XHTML 1.0 Transitional//EN",
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"));
$dom3->formatOutput = true;

// loadHTML() and saveHTML() belong to the document, not to its root element
$dom3->loadHTML($first); // did the same for $second and $third too.
echo $dom3->saveHTML();

/*
The above snippet creates a document in XHTML1 Transitional.
Results:
$first is valid XHTML1.
$second -- the parser helps close the <li> tag.
$third -- valid XHTML1 Transitional. The parser adds an <html> tag, but no <head> tag is present.
*/

$impl = new DOMImplementation(); // createDocument() and createDocumentType() are instance methods
$dom2 = $impl->createDocument(null, 'html',
$impl->createDocumentType("html",
"-//W3C//DTD HTML 4.01 Transitional//EN",
"http://www.w3.org/TR/html4/loose.dtd"));
$dom2->formatOutput = true;

// loadHTML() and saveHTML() belong to the document, not to its root element
$dom2->loadHTML($first); // did the same for $second and $third too.
echo $dom2->saveHTML();

/*
The above snippet creates a document in HTML4 Transitional.
Results:
$first is valid.
$second -- since it's transitional, the parser just helps close the <li> tag.
$third -- the parser puts the <html> (root) element in the DOM but not the <head> element. Valid HTML4 Transitional.
*/

$dom1 = HTML5_Parser::parse($first); // do the same for $
echo $dom1->saveHTML();
echo "\n";

/*
The above snippet uses the HTML5 parser to parse the HTML strings $first, $second, and $third.
Results:
$first, $second, and $third all pass.
It closes the <li> tag as needed and puts the <html>, <head>, and <body> elements in the DOM when absent.
Nice!
*/
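The strict-versus-lenient part of the comparison can be condensed into one loop. The sketch below is not the original test script: try_load() and $samples are names invented here, $samples simply mirrors the three strings defined earlier, and the html5lib run is left out because it needs the third-party library. It uses libxml_use_internal_errors() so that parser complaints are counted rather than printed.

```php
<?php
// Sketch: compare DOMDocument's strict XML loader with its lenient HTML loader.
libxml_use_internal_errors(true); // buffer libxml errors so we can count them

function try_load(string $markup, string $method): array
{
    libxml_clear_errors();
    $dom = new DOMDocument();
    $ok = $dom->$method($markup);           // $method is 'loadXML' or 'loadHTML'
    $errors = count(libxml_get_errors());   // complaints raised during this load
    libxml_clear_errors();
    return ['loaded' => $ok, 'errors' => $errors];
}

$samples = [
    'first'  => "<html>\n<body>\n<h1>My First Heading</h1>\n<p>My first paragraph.</p>\n</body>\n</html>",
    'second' => "<li>Tell me all about ya!",
    'third'  => "<body>\n<p>I was with you</p>\n</body>",
];

foreach ($samples as $name => $markup) {
    foreach (['loadXML', 'loadHTML'] as $method) {
        $r = try_load($markup, $method);
        printf("%-6s %-8s loaded=%s errors=%d\n",
            $name, $method, $r['loaded'] ? 'yes' : 'no', $r['errors']);
    }
}
```

Run from the command line, this prints one line per sample and loader, which makes the unclosed <li> in the second sample stand out as the only XML load that fails.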

The tests I ran were more numerous and more complex than this. I stripped out some details to make them easier to follow. Plus, I ran everything on the command line. That’s why I use newlines to separate individual tests instead of <br> tags.

So we love HTML5. It’s forgiving. It’s modern. It might eventually replace Flash. It’s already on our iPhones and other smartphones, and it is implemented in all recent versions of the major browsers (including IE). We don’t need to validate our pages anymore, because we know the browsers’ built-in parsers won’t spew out errors (good or bad thing? You be the judge…). We can start using it right away, even on older browsers: Modernizr detects whether a given HTML5 feature is present in a browser, and the HTML5 shiv helps old versions of IE cope with the new elements. There are tools out there to help us handle old browsers! Ain’t that great? We’ve already started our tortuous journey to a more semantic yet forgiving web!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)