Introduction
MariGold.OpenXHTML is an open source library, hosted on GitHub, that converts HTML documents into Open XML Word documents. It uses an internal HTML parser to read the HTML elements, but this parser can be completely replaced with your favorite HTML parser. This article demonstrates how to do that with the popular AngleSharp HTML parser.
Using the Code
MariGold.OpenXHTML depends on two interfaces to read the HTML. All we need to do is create implementations of these two interfaces:
MariGold.HtmlParser.IHtmlNode - A container for a single HTML element. Our custom HTML parser uses it to build the hierarchy of HTML nodes that MariGold.OpenXHTML processes.
MariGold.OpenXHTML.IParser - The entry point for the custom HTML parser. We will integrate AngleSharp here to process an HTML string.
Open Visual Studio and create a console application project. Create a source.html file in the project root and fill it with the HTML content you want to convert. In the properties of this file, set Copy to Output Directory to Copy always so that the file is available in the bin folder for the demo.
The next step is to install the MariGold.OpenXHTML NuGet package. Run the command Install-Package MariGold.OpenXHTML from Tools -> NuGet Package Manager -> Package Manager Console. This installs the packages listed below in your project. MariGold.HtmlParser is the default HTML parser; we will use only one interface from it in our custom implementation.
DocumentFormat.OpenXml
MariGold.HtmlParser
MariGold.OpenXHTML
Implement the MariGold.HtmlParser.IHtmlNode
Create a class named AngleNode that implements MariGold.HtmlParser.IHtmlNode. Add a constructor that accepts an AngleSharp.Dom.INode object and stores it in a private field. We will use this node to supply the values required by the IHtmlNode properties.
// The wrapped AngleSharp node that backs every IHtmlNode property.
private readonly INode node;

internal AngleNode(INode node)
{
    this.node = node;
}
MariGold.OpenXHTML uses these properties to create the equivalent properties of Open XML elements. Not all the properties of IHtmlNode are used in MariGold.OpenXHTML; the required ones are listed below. You can find the implementation details of these properties in the attached sample project.
Attributes - A dictionary of all HTML attributes
Children - An IEnumerable collection of child elements, including text elements
HasChildren - A boolean property that indicates whether the element has child elements
InnerHtml - The inner HTML of the element as a string. Mostly used to get the content of text elements
IsText - Identifies whether the node is a text element
Next - The next sibling element. This is null if there is no such element
Parent - The parent element. This is null for a root element
Previous - The preceding sibling element. This is null if there is no such element
Styles - A dictionary of all style attributes
Tag - The tag name of the HTML element
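The list above can be related to AngleSharp's DOM roughly as follows. This is a minimal sketch and not the code from the attached sample project: AngleNodeValues is a hypothetical helper name, the Styles method is a simplification that reads only the inline style attribute (it ignores stylesheet rules and would mis-split values that contain a semicolon), and the exact shape expected by IHtmlNode should be checked against the interface itself.

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Dom;

// Hypothetical helper (not the attached sample's code): shows which AngleSharp
// member could supply each value that IHtmlNode exposes.
internal static class AngleNodeValues
{
    // IsText: AngleSharp marks text nodes with NodeType.Text.
    public static bool IsText(INode node)
    {
        return node.NodeType == NodeType.Text;
    }

    // Tag: the element's local name, or the node name (for example "#text") otherwise.
    public static string Tag(INode node)
    {
        var element = node as IElement;
        return element != null ? element.LocalName : node.NodeName;
    }

    // InnerHtml: markup inside an element, or the text of a non-element node.
    public static string InnerHtml(INode node)
    {
        var element = node as IElement;
        return element != null ? element.InnerHtml : node.TextContent;
    }

    // Attributes: copy every attribute of an element into a dictionary.
    public static Dictionary<string, string> Attributes(INode node)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        var element = node as IElement;
        if (element != null)
        {
            foreach (IAttr attribute in element.Attributes)
            {
                result[attribute.Name] = attribute.Value;
            }
        }
        return result;
    }

    // Styles (assumption: simplified): split the inline style attribute
    // ("name: value; name: value") into a dictionary.
    public static Dictionary<string, string> Styles(INode node)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string style;
        if (!Attributes(node).TryGetValue("style", out style))
        {
            return result;
        }
        foreach (string declaration in style.Split(';'))
        {
            int colon = declaration.IndexOf(':');
            if (colon > 0)
            {
                string name = declaration.Substring(0, colon).Trim();
                result[name] = declaration.Substring(colon + 1).Trim();
            }
        }
        return result;
    }

    // HasChildren, Children, Parent, Previous, Next: direct tree navigation.
    public static bool HasChildren(INode node) { return node.HasChildNodes; }
    public static IEnumerable<INode> Children(INode node) { return node.ChildNodes; }
    public static INode Parent(INode node) { return node.Parent; }
    public static INode Previous(INode node) { return node.PreviousSibling; }
    public static INode Next(INode node) { return node.NextSibling; }
}
```

Inside AngleNode, each navigation result would presumably be wrapped in a new AngleNode, returning null when the underlying AngleSharp node is null, so that MariGold.OpenXHTML only ever sees IHtmlNode instances.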
Implement the MariGold.OpenXHTML.IParser
This interface resides in the MariGold.OpenXHTML namespace and is responsible for parsing the HTML document. It includes two properties and two methods. To implement them, create a class named AngleParser that implements IParser. Implement the two properties as shown below. They need no logic; they are just containers for the base URL and URI schema applied to any relative or protocol-free URL references in the HTML document.
public string BaseURL { get; set; }
public string UriSchema { get; set; }
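To make the purpose of these two values concrete, here is a small sketch of how a base URL and a URI schema are typically applied to such references. It is an illustration only and does not reproduce how MariGold.OpenXHTML resolves URLs internally: UrlSketch and Resolve are hypothetical names, and the sketch assumes the schema is given without a trailing colon (for example "https").

```csharp
using System;

// Hypothetical helper, not part of MariGold.OpenXHTML: illustrates what a base
// URL and a URI schema are typically used for.
public static class UrlSketch
{
    public static string Resolve(string url, string baseUrl, string uriSchema)
    {
        // Protocol-free reference, e.g. //cdn.example.com/logo.png:
        // prefix it with the schema (assumed to look like "https").
        if (url.StartsWith("//", StringComparison.Ordinal))
        {
            return uriSchema + ":" + url;
        }

        // Relative references are resolved against the base URL;
        // absolute references pass through unchanged.
        return new Uri(new Uri(baseUrl), url).AbsoluteUri;
    }
}
```

With a base URL of http://example.com/docs/, a relative reference such as images/logo.png would resolve under that folder, while a protocol-free reference only picks up the schema.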
Next, we will implement the CalculateRelativeChildFontSize method. MariGold.OpenXHTML sometimes needs to calculate the relative font size of an HTML element. For more information, see the HTML parsing section of the project's GitHub page. If you don't want to write this logic yourself, you can delegate to the default implementation, as in the method below.
public decimal CalculateRelativeChildFontSize(string parentFontSize, string childFontSize)
{
    // Delegate to the default implementation shipped with MariGold.HtmlParser.
    return MariGold.HtmlParser.CSSUtility.CalculateRelativeChildFontSize(parentFontSize, childFontSize);
}
Next, create a constructor in AngleParser that accepts an HTML string. The constructor stores the HTML string in a private field for later processing.
private readonly string html;

public AngleParser(string html)
{
    this.html = html;
}
The last member of the IParser interface is the FindBodyOrFirstElement method. This is the core method of our custom parser implementation. Before implementing it, install AngleSharp using the NuGet package manager. Then write a private method in AngleParser that reads the HTML string and gets the HTML body element using AngleSharp.
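// Requires: using AngleSharp; using AngleSharp.Dom.Html;
// Written against the AngleSharp 0.9.x API; later releases moved the parser to
// AngleSharp.Html.Parser and renamed Parse to ParseDocument.
private IHtmlElement ParseBody()
{
    var parser = new AngleSharp.Parser.Html.HtmlParser(Configuration.Default.WithCss());
    var doc = parser.Parse(html);
    return doc.Body;
}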
The FindBodyOrFirstElement method is expected to return the IHtmlNode implementation of the HTML body element, or of the root element if the body element is missing from the HTML string.
public MariGold.HtmlParser.IHtmlNode FindBodyOrFirstElement()
{
    return new AngleNode(ParseBody());
}
The custom HTML parser is now ready to integrate with MariGold.OpenXHTML. Write the following code in the Main method of the console application. AngleParser receives the HTML string read from source.html, and the WordDocument class processes it.
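// Requires: using System.IO; using MariGold.OpenXHTML;
WordDocument doc = new WordDocument("sample.docx");

// source.html is copied to the output folder by the Copy always setting.
using (StreamReader sr = new StreamReader("source.html"))
{
    doc.Process(new AngleParser(sr.ReadToEnd()));
}

doc.Save();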