Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / desktop / MFC

Most Important Regular Expression for parsing HTML

3.50/5 (2 votes)
8 Dec 2011CPOL2 min read 36.8K  
The usefulness of Regexes (Regular Expressions) is ineffable. Especially in parsing documents, it’s a well-suited and indispensable tool.

The usefulness of Regexes (Regular Expressions) is ineffable. Especially in parsing documents, it’s a well-suited and indispensable tool. All good HTML and XML Parsers basically use Regexes to extract cardinal information in HTML documents like names of tags, whether the tag being examined is well-formed, empty, or even malformed by checking the tags against a bunch of rules. Of all the regular expressions needed by a HTML parser, the most important/complex is the one that matches the start tag of an element.

There are a lot of Regular Expression solutions on how to parse HTML tags and attributes. The most popular one (on the Internet that I’ve seen so far) is something like this: "<(\/?)(\w+)[^>]*(\/?)>." This is a non-greedy regular expression that matches both a start and an end tag. For example, it will match a "<pre>" and a "</pre>." It will also match a "<br/>" which is an Empty element. This Regex is so inefficient because it fails to consider a bunch of cases: How would the parser know if the tag is an empty, block, or inline element? How would it know if the tag has attributes and how would it handle these attributes? These questions cannot be answered by using this overly simplistic Regex.

I found a very efficient Regular Expression to parse HTML tags and attributes: /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/. This Regex makes it easier to not only identify empty tags but also parse the attributes in these tags.This intelligent piece of Regex was in John Resig’s “HTMLparser.js”–a simple HTML parser made by John Resig. At first sight, it seems complicated. But if you look more closely into this Regex, you will notice it simply extracts necessary information about the tag: its tag name, its attributes, and its type (empty or not).

BREAK DOWN OF THE REGEX

There are three main groups:

  • The first group – (\w+) – captures the name of the tag being examined.
  • The second group is the largest of the three. It contains sub-groups, most of which are non-matching (groups with ?: are non-matching which means that information about the group wouldn’t be available after the group is matched). This second group matches the attributes in a tag (if there exists any). The group checks for quotes (double or single) around an attribute’s value. It also handles the case where there are  no quotes around the attribute’s value: [^>\s]+ . This also makes sure that the regular expression matching is non-greedy ([^>]* prevents capturing more than one right-angled bracket).
  • The third group is (\/?) and it simply checks if the tag being examined is an empty element (like <br/>’s and <hr/>’s).

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)