
How to interleave two HTML files into one .DOCX file with C#, the HTMLAgilityPack, and the DOCX Library

B. Clay Shannon


18 Feb 2014

Using HTMLAgilityPack and the DOCX Library with C#, create a .DOCX file from two HTML files


Why Combine Two Files into One?

You may wonder why anyone would want to merge two files into one. In my case, it's to help me learn Spanish. I take the English version and the Spanish version of the same (HTML) document and create a .DOCX file that contains alternating English and Spanish paragraphs. I find this the easiest way to learn a new language; it's how I learned German half my life ago: with an English magazine or book in one hand and its German counterpart in the other, while listening to the German audio.

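To use this tip, you will need to install the HTMLAgilityPack and the DOCX Library through NuGet. The code also calls HttpUtility.HtmlDecode, so the project needs a reference to System.Web.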

You can supply your own files in whichever two languages you want (presumably one will be your mother tongue and the other the language you want to learn). If you know of no source for free publications like this, you can go to the Publications tab here, where there are dozens of publications in hundreds of languages. You can download these publications in several formats, depending on the publication: usually PDF and one or two others, such as EPUB and MOBI. Audio files are available for many of these publications as well.

What You Get and How to Get It

What you end up with is a document that is not exactly pristine or beautifully formatted, but it does contain the information you need for this purpose. Here's an example (I cleaned up the formatting a little to make it look better):

Here is the code I used. It's not "normalized" (it's basically one big block of code in a single method), but my excuse for that is twofold: it's a relatively simple one-trick pony, and it's a utility for my personal use, not really meant to be a programming showpiece. Anyway, without further palabra, here's the code:

        
private void ParseHTMLFilesAndSaveAsDOCX()
{
    const string BOLD_MARKER = "~";
    const string HEADING_MARKER = "^";
    List<string> sourceText = new List<string>();
    List<string> targetText = new List<string>();
    HtmlAgilityPack.HtmlDocument htmlDocSource = new HtmlAgilityPack.HtmlDocument();
    HtmlAgilityPack.HtmlDocument htmlDocTarget = new HtmlAgilityPack.HtmlDocument();

    // There are various options, set as needed
    htmlDocSource.OptionFixNestedTags = true;
    htmlDocTarget.OptionFixNestedTags = true;

    htmlDocSource.Load(sourceHTMLFilename);
    htmlDocTarget.Load(targetHTMLFilename);

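    // Populate the generic list of strings with the source text lines
    if (htmlDocSource.DocumentNode != null)
    {
        // SelectNodes returns null when nothing matches, so fall back to an empty sequence (requires using System.Linq)
        IEnumerable<HtmlAgilityPack.HtmlNode> pSourceNodes = htmlDocSource.DocumentNode.SelectNodes("//text()") ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>();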
        string sourcePar;
        foreach (HtmlNode sText in pSourceNodes)
        {
            if (!string.IsNullOrWhiteSpace(sText.InnerText))
            {
                string formattingMarker = string.Empty;
                if (sText.OuterHtml.Contains("FONT SIZE=4"))
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (sText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                sourcePar = string.Format("{0}{1}", formattingMarker, sText.InnerText);
                sourceText.Add(HttpUtility.HtmlDecode(sourcePar));                    
            }
        }
    }

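    // Populate the generic list of strings with the target text lines
    if (htmlDocTarget.DocumentNode != null)
    {
        // SelectNodes returns null when nothing matches, so fall back to an empty sequence (requires using System.Linq)
        IEnumerable<HtmlAgilityPack.HtmlNode> pTargetNodes = htmlDocTarget.DocumentNode.SelectNodes("//text()") ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>();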
        string targetPar;
        foreach (HtmlNode tText in pTargetNodes)
        {
            if (!string.IsNullOrWhiteSpace(tText.InnerText))
            {
                string formattingMarker = string.Empty;
                if (tText.OuterHtml.Contains("FONT SIZE=4"))
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (tText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                targetPar = string.Format("{0}{1}", formattingMarker, tText.InnerText);
                targetText.Add(HttpUtility.HtmlDecode(targetPar));
            }
        }
    }

    // Alternate through the two generic lists, writing to a doc file that will write the source
    // as regular text and the target bolded.
    int sourceLineCount = sourceText.Count;
    int targetLineCount = targetText.Count;
    int higherCount = Math.Max(sourceLineCount, targetLineCount);
    string sourceParagraph = string.Empty;
    string targetParagraph = string.Empty;

    // Write it out
    string docxFilename = string.Format("{0}.docx", textBoxDOCXFile2BCre8ed.Text.Trim());
    using (DocX document = DocX.Create(docxFilename))
    {
        for (int i = 0; i < higherCount; i++)
        {
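            // Reset both paragraphs on every pass; otherwise, once the shorter list
            // runs out, its last paragraph would be written again on each remaining pass.
            sourceParagraph = (i < sourceLineCount) ? sourceText[i] : string.Empty;
            targetParagraph = (i < targetLineCount) ? targetText[i] : string.Empty;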

            if (!string.IsNullOrWhiteSpace(sourceParagraph))
            {
                Paragraph pSource = document.InsertParagraph();
                if (sourceParagraph.Contains(BOLD_MARKER))
                {
                    sourceParagraph = sourceParagraph.Replace(BOLD_MARKER, "");
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(13).Bold();
                }
                else if (sourceParagraph.Contains(HEADING_MARKER))
                {
                    sourceParagraph = sourceParagraph.Replace(HEADING_MARKER, "");
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(16);
                }
                else
                {
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(11);
                }
                Paragraph pSpacer = document.InsertParagraph();
                pSpacer.Append(Environment.NewLine);
            }
            if (!string.IsNullOrWhiteSpace(targetParagraph))
            {
                Paragraph pTarget = document.InsertParagraph();
                if (targetParagraph.Contains(BOLD_MARKER))
                {
                    targetParagraph = targetParagraph.Replace(BOLD_MARKER, "");
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(13).Bold();
                }
                else if (targetParagraph.Contains(HEADING_MARKER))
                {
                    targetParagraph = targetParagraph.Replace(HEADING_MARKER, "");
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(16).Bold();
                }
                else
                {
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(11).Bold();
                }
                Paragraph pTargetSpacer = document.InsertParagraph();
                pTargetSpacer.Append(Environment.NewLine);
            }
        }
        document.Save();
    }
    MessageBox.Show("done!");
}
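
The alternation step in the listing above can also be looked at on its own, apart from the HTML parsing and the DOCX writing. What follows is only a minimal sketch, not part of the uploaded utility: the class name ParagraphInterleaver and the method name Interleave are hypothetical, and it assumes both lists already hold decoded paragraph text. It walks both lists by index, skips blank entries, and keeps going with the longer list once the shorter one is exhausted.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphInterleaver
{
    // Hypothetical helper, not part of the original utility.
    // Yields source[0], target[0], source[1], target[1], ... and continues
    // with whichever list is longer once the shorter one runs out.
    public static IEnumerable<string> Interleave(
        IReadOnlyList<string> source, IReadOnlyList<string> target)
    {
        int higherCount = Math.Max(source.Count, target.Count);
        for (int i = 0; i < higherCount; i++)
        {
            // Only emit an entry when this list still has one at index i
            if (i < source.Count && !string.IsNullOrWhiteSpace(source[i]))
            {
                yield return source[i];
            }
            if (i < target.Count && !string.IsNullOrWhiteSpace(target[i]))
            {
                yield return target[i];
            }
        }
    }
}
```

A caller could then loop over ParagraphInterleaver.Interleave(sourceText, targetText) and hand each string to whatever writes the paragraph, although it would still have to track which language each entry belongs to in order to pick the font.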

I have uploaded the source code, too. Feel free to clean it up or refactor it, but if you do, please post it back here to Code Project. From the source you can see which controls you need to add to the form and what to name them.
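
As one possible starting point for such a cleanup, the marker handling that the listing repeats for source and target could move into a small helper. This is a sketch under assumptions, not the uploaded code: LineStyle, MarkerParser, and Parse are hypothetical names. It also swaps the listing's Contains/Replace check for a StartsWith check, so that a "~" or "^" appearing later in a sentence is not mistaken for a formatting marker.

```csharp
using System;

public enum LineStyle { Normal, Bold, Heading }

public static class MarkerParser
{
    // Same marker characters as BOLD_MARKER and HEADING_MARKER in the listing
    private const string BoldMarker = "~";
    private const string HeadingMarker = "^";

    // Hypothetical helper: splits a marked line into its style and its text.
    // Only a marker at the very start of the line counts; the marker is
    // removed from the returned text.
    public static LineStyle Parse(string line, out string text)
    {
        if (line == null) throw new ArgumentNullException("line");

        if (line.StartsWith(BoldMarker, StringComparison.Ordinal))
        {
            text = line.Substring(BoldMarker.Length);
            return LineStyle.Bold;
        }
        if (line.StartsWith(HeadingMarker, StringComparison.Ordinal))
        {
            text = line.Substring(HeadingMarker.Length);
            return LineStyle.Heading;
        }

        text = line;
        return LineStyle.Normal;
    }
}
```

With something like this in place, the writing loop would need only one switch on the returned LineStyle per paragraph, with the font family passed in for the source or target language.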

I did not tell you how to get the files from EPUB (or whatever format you download) to HTML; that is sort of an "exercise left to the reader." In my case, I use AVS Document Converter to convert EPUB files to DOCX, then manually save those as HTML files before running my utility against them. You may find a better way; my experimentation did not: converting directly to HTML created a rather malformed file, and saving a PDF as text produced a very "ugly" text file.
