Convert HTML to Open XML Word document using MariGold.OpenXHTML and AngleSharp

Kannan Ar

14 Dec 2017 - CPOL - 4 min read

A step-by-step guide to integrating the AngleSharp HTML parser into MariGold.OpenXHTML

Download source - 10.7 KB

Introduction

MariGold.OpenXHTML is an open source library, hosted on GitHub, that converts HTML documents into Open XML Word documents. It uses an internal HTML parser to read the HTML elements, but this parser can be completely replaced with your favorite HTML parser. Here, I demonstrate how to do this with the popular AngleSharp HTML parser.

Using the Code

MariGold.OpenXHTML depends on two interfaces to read the HTML. All we need to do is implement these two interfaces:

MariGold.HtmlParser.IHtmlNode - A container for a single HTML element. Our custom HTML parser uses it to build the hierarchy of HTML nodes that MariGold.OpenXHTML processes.
MariGold.OpenXHTML.IParser - The entry point of the custom HTML parser. We will integrate AngleSharp here to process an HTML string.

Open Visual Studio and create a console application project. Create a source.html file in the project root and fill it with the HTML content you want to convert. Open the properties of this HTML file and set the Copy to Output Directory property to Copy always, so that the file is copied to the bin folder where our demo can read it.

The next step is to install the MariGold.OpenXHTML NuGet package. Execute the command Install-Package MariGold.OpenXHTML from Tools -> NuGet Package Manager -> Package Manager Console. This installs the packages listed below into your project. MariGold.HtmlParser is the default HTML parser; we will use only one interface from this library (IHtmlNode) in our custom implementation.

DocumentFormat.OpenXml
MariGold.HtmlParser
MariGold.OpenXHTML

Implement the MariGold.HtmlParser.IHtmlNode

Create a class named AngleNode that implements MariGold.HtmlParser.IHtmlNode. Add a constructor that accepts an AngleSharp.Dom.INode object and stores it in a private field. We will use this node to supply the required properties of IHtmlNode.

private readonly INode node;

internal AngleNode(INode node)
{
    this.node = node;
}

MariGold.OpenXHTML uses the IHtmlNode properties to create the equivalent properties of Open XML elements. Not all the properties of IHtmlNode are used by MariGold.OpenXHTML; the required ones are listed below. You can find the implementation details of these properties in the attached sample project.

Attributes - A dictionary of all HTML attributes
Children - An IEnumerable collection of child elements, including text elements
HasChildren - A boolean property that indicates whether the element has child elements
InnerHtml - Inner HTML of the element as a string. Mostly used to get the content of text elements
IsText - Indicates whether the node is a text element
Next - The next sibling element. This will be null if there is no such element
Parent - Parent element. Will be null if it is a root element
Previous - Preceding sibling element. This will be null if there is no such element
Styles - A dictionary of all style attributes
Tag - Tag name of the HTML element
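As a rough illustration of how the properties listed above might be backed by an AngleSharp node, here is a minimal sketch. The NodeMapping class and its method names are hypothetical and exist only for this example; they are not part of AngleSharp or MariGold.HtmlParser, and the attached sample project remains the reference implementation. Styles is left out on purpose because AngleSharp exposes inline styles differently from version to version, and skipping comment nodes in Children is an assumption about what the converter needs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Dom;

// Hypothetical helper (not from either library): shows which AngleSharp
// members could supply the values of the IHtmlNode properties.
internal static class NodeMapping
{
    // Tag: lower-case element name; non-element nodes fall back to NodeName
    // (for example "#text" for text nodes).
    internal static string Tag(INode node) =>
        node is IElement element ? element.LocalName : node.NodeName;

    // IsText: true only for text nodes.
    internal static bool IsText(INode node) => node.NodeType == NodeType.Text;

    // HasChildren: direct pass-through to the DOM.
    internal static bool HasChildren(INode node) => node.HasChildNodes;

    // InnerHtml: markup for elements, plain text content for text nodes.
    internal static string InnerHtml(INode node) =>
        node is IElement element ? element.InnerHtml : node.TextContent;

    // Attributes: copy the element's attributes into a dictionary.
    internal static Dictionary<string, string> Attributes(INode node)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        if (node is IElement element)
        {
            foreach (IAttr attribute in element.Attributes)
            {
                result[attribute.Name] = attribute.Value;
            }
        }
        return result;
    }

    // Children: element and text nodes only (assumption: comments are not needed).
    internal static IEnumerable<INode> Children(INode node) =>
        node.ChildNodes.Where(child =>
            child.NodeType == NodeType.Element || child.NodeType == NodeType.Text);

    // Parent, Next and Previous: null when there is no such node.
    internal static INode Parent(INode node) => node.Parent;
    internal static INode Next(INode node) => node.NextSibling;
    internal static INode Previous(INode node) => node.PreviousSibling;
}
```

Inside AngleNode, each property getter could delegate to the matching helper and wrap any returned INode in a new AngleNode, so that MariGold.OpenXHTML only ever sees IHtmlNode instances.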

Implement the MariGold.OpenXHTML.IParser

This interface resides in the MariGold.OpenXHTML namespace and is responsible for parsing the HTML document. It includes two properties and two methods. To implement them, create a class named AngleParser that implements IParser. Implement the two properties as shown below. They need no logic; they simply hold the base URL and the URI scheme used to resolve relative or protocol-free URL references in the HTML document.

public string BaseURL { get; set; }
public string UriSchema { get; set; }

Next, we will implement the CalculateRelativeChildFontSize method. MariGold.OpenXHTML sometimes needs to calculate the relative font size of an HTML element. For more information, see the HTML parsing section of the project's GitHub page. If you don't want to write this calculation yourself, you can delegate to the default implementation, as in the method below.

public decimal CalculateRelativeChildFontSize(string parentFontSize, string childFontSize)
{
    return MariGold.HtmlParser.CSSUtility.CalculateRelativeChildFontSize
                                             (parentFontSize, childFontSize);
}

Next, add a constructor to AngleParser that accepts an HTML string and stores it in a private field for later processing.

private readonly string html;

public AngleParser(string html)
{
    this.html = html;
}

The last method in the IParser interface is FindBodyOrFirstElement. This is the core method of our custom parser implementation. Before implementing it, install AngleSharp using the NuGet package manager. Then, write a private method in AngleParser that parses the HTML string and returns the HTML body element using AngleSharp.

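// Uses the AngleSharp 0.9.x API. Later releases moved HtmlParser to
// AngleSharp.Html.Parser and renamed Parse to ParseDocument.
private IHtmlElement ParseBody()
{
    // WithCss() enables CSS parsing so that style information is available.
    var parser = new AngleSharp.Parser.Html.HtmlParser(Configuration.Default.WithCss());
    var doc = parser.Parse(html);
    return doc.Body;
}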

The FindBodyOrFirstElement method is expected to return the IHtmlNode implementation of the HTML body element, or of the root element if the body element is missing from the HTML string.

public MariGold.HtmlParser.IHtmlNode FindBodyOrFirstElement()
{
    return new AngleNode(ParseBody());
}
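The method above returns only the body element. If you also want the "root element when there is no body" behavior described for FindBodyOrFirstElement, a defensive lookup could look like the sketch below. BodyLookup and BodyOrRoot are hypothetical names introduced for this example, and the namespaces are those of AngleSharp 0.9.x, matching the parser call used in this article. AngleSharp normally synthesizes a body element while parsing, so the fallback is a safety net rather than a common path.

```csharp
using AngleSharp.Dom;
using AngleSharp.Dom.Html; // AngleSharp 0.9.x namespace for IHtmlDocument

// Hypothetical helper (not part of AngleSharp or MariGold.OpenXHTML).
internal static class BodyLookup
{
    // Returns <body> when present, otherwise the document element,
    // otherwise the first node of the document (may be null for an empty document).
    internal static INode BodyOrRoot(IHtmlDocument doc)
    {
        return (INode)doc.Body ?? (INode)doc.DocumentElement ?? doc.FirstChild;
    }
}
```

With this helper, ParseBody could return INode instead of IHtmlElement and hand the result of BodyLookup.BodyOrRoot(doc) to the AngleNode constructor, which already accepts an INode.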

The custom HTML parser implementation is now ready to integrate with MariGold.OpenXHTML. Write the following code in the Main method of the console application. It reads the HTML string from source.html, wraps it in an AngleParser, and passes it to the WordDocument class, which processes it and saves the result as sample.docx.

WordDocument doc = new WordDocument("sample.docx");

using (StreamReader sr = new StreamReader("source.html"))
{
    doc.Process(new AngleParser(sr.ReadToEnd()));
}

doc.Save();
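For reference, here is how the complete Program.cs might look with the using directives filled in. This is a sketch under a few assumptions: it assumes WordDocument lives in the MariGold.OpenXHTML namespace alongside IParser, it relies on the AngleParser class built in this article, and it swaps the StreamReader for File.ReadAllText purely for brevity.

```csharp
using System.IO;
using MariGold.OpenXHTML; // assumed namespace of WordDocument

internal static class Program
{
    private static void Main()
    {
        // source.html sits next to the executable thanks to the
        // "Copy always" setting on the file.
        string html = File.ReadAllText("source.html");

        // The output file name is passed to the constructor, as in the snippet above.
        WordDocument doc = new WordDocument("sample.docx");

        // AngleParser is the custom IParser implementation from this article.
        doc.Process(new AngleParser(html));
        doc.Save();
    }
}
```

After running the application, sample.docx should appear in the same output folder as the executable, since both file names are relative paths.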

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)