Introduction
Parsing HTML documents is a task that applications occasionally need to perform, but it tends to be tedious. In this tip, I will show how easy it can be using the NSoup and ExCSS libraries.
NSoup is an HTML parser. As described on its website, its main features include:
- Find and extract data from HTML pages, using jQuery-like CSS selectors
- Sanitize HTML sent from untrusted sources
- Manipulate HTML documents
At the time of writing this tip, the current release is 0.7.1, but that version has some serious issues, so I downloaded and compiled the library from source (version 0.8.0.0). The newer build does not exhibit the problems I faced with the previous one.
The second library is ExCSS (pronounced "Excess"), a CSS 2.1 and CSS 3 parser for .NET. Its website states: "The goal of ExCSS is to make it easy to read and parse stylesheets into a friendly object model with full LINQ support".
We use ExCSS to extract CSS rules and classes from HTML documents.
Using the Code
To add NSoup to our project, we have to add a reference to NSoup.dll. There is no NuGet package available yet, so the process is manual. Unlike NSoup, ExCSS does have a NuGet package; running the following command in the Visual Studio Package Manager Console window adds it to our project:
install-package excss
With the libraries set up, the fun part begins (CODING!!!). The accompanying project (HtmlAnalyzer4DotNet) is a WPF application that takes a URL, fetches the page, and extracts different types of data from its HTML document. It is developed and tested with Visual Studio 2012 and Microsoft .NET 4.5.
To begin, we have to declare a simple class that holds extracted pieces of data:
public class HtmlData
{
    public string Url { get; set; }
    public string RawHtml { get; set; }
    public string Title { get; set; }
    public List<string> MetadataList { get; set; }
    public List<string> AnchorList { get; set; }
    public List<string> ImageList { get; set; }
    public List<string> CssClassList { get; set; }
    public List<string> CssRuleList { get; set; }

    public HtmlData()
    {
        MetadataList = new List<string>();
        AnchorList = new List<string>();
        ImageList = new List<string>();
        CssClassList = new List<string>();
        CssRuleList = new List<string>();
    }

    public void Clear()
    {
        Url = null;
        RawHtml = null;
        Title = null;
        MetadataList.Clear();
        AnchorList.Clear();
        ImageList.Clear();
        CssClassList.Clear();
        CssRuleList.Clear();
    }
}
The main work is done inside the Fetch method, a member of the MainWindow class (it can be found inside MainWindow.xaml.cs).
The operation begins with connecting to the specified URL and fetching the HTML data:
IConnection connection = NSoupClient.Connect(txtUrl.Text);
connection.Timeout(30000);
Document document = connection.Get();
The first line connects to the server specified by the txtUrl.Text property and returns a connection object. The second line sets how long to wait (in milliseconds) for data before a timeout exception is raised; in our example, we wait up to 30 seconds. The last line fetches the data and creates a Document object, which contains all the pieces of information we need.
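Because Get() performs network I/O, it can fail on slow or unreachable hosts. Below is a minimal defensive sketch; the exact exception types NSoup throws are an assumption here, so a catch-all is used:

Document document = null;
try
{
    IConnection connection = NSoupClient.Connect(txtUrl.Text);
    connection.Timeout(30000); // wait up to 30 seconds for the response
    document = connection.Get();
}
catch (Exception ex) // assumption: NSoup surfaces timeouts/IO errors as exceptions
{
    MessageBox.Show("Failed to fetch the page: " + ex.Message);
    return;
}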
In order for the NSoup types to be recognized, its namespaces must be imported at the beginning of the file:
using NSoup;
using NSoup.Nodes;
Now we have to parse the HTML to extract the data we are looking for. We have created an instance of the HtmlData class called parsedData. Filling its Url, RawHtml, and Title properties requires nothing special:
parsedData.Url = document.BaseUri;
parsedData.RawHtml = document.Html();
parsedData.Title = document.Title;
To find HTML tags inside the document, you can use your CSS skills. The Document class provides a useful method called Select, which accepts a CSS-selector query and returns a collection of matching elements, each of type Element. To find all <meta> tags, we can write the following line:
var metaTagList = document.Select("meta");
Now, to add all the meta tags to the parsedData object, the following code snippet is used:
foreach (Element meta in document.Select("meta"))
    parsedData.MetadataList.Add(meta.OuterHtml());
The Element.OuterHtml() method returns the element's entire HTML (including its opening and closing tags), so the preceding loop adds every HTML meta tag to the parsedData object.
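For contrast, Element.Html() (which we will use later on <style> tags) returns only the element's inner HTML, without the surrounding tags. A quick illustrative sketch:

foreach (Element p in document.Select("p"))
{
    string outer = p.OuterHtml(); // e.g. "<p class=\"note\">Hello</p>" -- tag included
    string inner = p.Html();      // e.g. "Hello" -- inner content only
    break; // one paragraph is enough for this illustration
}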
Similarly, we can find all image and anchor tags:
foreach (Element image in document.Select("img"))
parsedData.ImageList.Add(image.Attr("src"));
foreach (Element anchor in document.Select("a"))
parsedData.AnchorList.Add(anchor.Attr("href"));
Element.Attr() returns the value of a given attribute.
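Note that Select() is not limited to bare tag names; it accepts jQuery-like queries. A few hedged examples (standard jsoup-style selector syntax, which NSoup ports; exact coverage is an assumption):

var externalLinks = document.Select("a[href^=http]");       // anchors whose href starts with "http"
var headings = document.Select("h1, h2, h3");               // all top-level headings
var contentParagraphs = document.Select("div.content > p"); // direct <p> children of <div class="content">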
So far so good. Now let's do something a bit more challenging. Let's gather CSS information. We will gather two types of CSS-related data:
- CSS classes (those which start with a dot)
- CSS rules (everything else)
To do so, we first have to find all HTML <style> tags and parse their inner HTML code using an ExCSS parser object.
To use ExCSS, an instance of its Parser class must be created (remember to import the ExCSS namespaces at the top of the file, just as we did for NSoup). Parser has a method called Parse() that accepts a string containing a CSS document and returns an object of type StyleSheet. Through the StyleSheet properties, different sorts of data can be retrieved from the parsed CSS document.
We iterate over the CSS rules; every rule whose selector is a simple selector starting with a "." is treated as a CSS class, and everything else as a general CSS rule.
Parser parser = new Parser();
foreach (Element style in document.Select("style"))
{
    foreach (StyleRule rule in parser.Parse(style.Html()).StyleRules)
    {
        if (rule.Selector is SimpleSelector && rule.Value.StartsWith("."))
            parsedData.CssClassList.Add(rule.ToString());
        else
            parsedData.CssRuleList.Add(rule.ToString());
    }
}
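Because ExCSS exposes the stylesheet as an object model with LINQ support, the rules can also be queried directly. A hedged sketch using only the members the loop above already relies on (StyleRules, Value, ToString()), where cssText stands for any CSS source string and using System.Linq; is assumed:

// Collect every rule whose selector text mentions "body" (illustrative only)
var bodyRules = new Parser().Parse(cssText).StyleRules
    .Where(rule => rule.Value.Contains("body"))
    .Select(rule => rule.ToString())
    .ToList();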
So far, we have collected everything we were interested in. Now it is time to populate the UI with the data. There is a TextBox (txtRawHtml) that displays the raw HTML document, and a TreeView (treeParseData) that displays the other pieces of data. Tree items that contain image URLs must act as links, showing the image in a popup window when the user double-clicks them. The following code snippet shows how to fill the UI:
txtRawHtml.Text = parsedData.RawHtml;
treeParseData.Items.Add("URL: " + parsedData.Url);
treeParseData.Items.Add("Title: " + parsedData.Title);
treeParseData.Items.Add(new TreeViewItem
    { Header = "Metadata List", ItemsSource = parsedData.MetadataList });
treeParseData.Items.Add(new TreeViewItem
    { Header = "Anchor List", ItemsSource = parsedData.AnchorList });
treeParseData.Items.Add(new TreeViewItem
    { Header = "CSS Class List", ItemsSource = parsedData.CssClassList });
treeParseData.Items.Add(new TreeViewItem
    { Header = "CSS Rule List", ItemsSource = parsedData.CssRuleList });

TreeViewItem imageList = new TreeViewItem { Header = "Image List" };
foreach (string imageUrl in parsedData.ImageList)
{
    TreeViewItem item = new TreeViewItem { Header = imageUrl, Cursor = Cursors.Hand };
    item.MouseDoubleClick += (sender1, e1) =>
    {
        new ImageViewerWindow(
            imageUrl.StartsWith("/")
                ? parsedData.Url.Remove(parsedData.Url.Length - 1) + imageUrl
                : imageUrl
        ).ShowDialog();
    };
    imageList.Items.Add(item);
}
treeParseData.Items.Add(imageList);
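The ternary above only handles relative paths that start with "/" and assumes the page URL ends with "/". A more robust alternative (a sketch, not part of the original project) is to let System.Uri resolve the path:

// Sketch: resolve any relative image path against the page URL
Uri baseUri = new Uri(parsedData.Url);
Uri absoluteUri = new Uri(baseUri, imageUrl); // handles "a.png", "/a.png", "../a.png", and absolute URLs
new ImageViewerWindow(absoluteUri.ToString()).ShowDialog();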
Summary
In this tip, I tried to introduce the NSoup and ExCSS libraries, which I have found useful whenever I need to deal with HTML documents. As you can probably guess, we only scratched the surface: both libraries offer more capabilities than what we covered in this short tip. For example, you can also manipulate HTML documents using NSoup.