Why Combine Two Files into One?
First of all, you may wonder why anyone would want to merge two files into one. In my case, it's to help me learn Spanish. I take the English and Spanish versions of the same (HTML) document and create a .DOCX file that contains alternating English and Spanish paragraphs. I find this the easiest way to learn a new language. It's how I learned German half my life ago: with an English magazine or book in one hand and its German counterpart in the other (while simultaneously listening to the German audio).
To use this tip, you will need to install two NuGet packages: HtmlAgilityPack and the DocX library.
You can supply your own files in whichever two languages you want (presumably one will be your mother tongue and the other the language you want to learn). If you know of no source for free publications like this, you can go to the Publications tab here, where there are dozens of publications in hundreds of languages. You can download these publications in several formats, depending on the publication; usually PDF and one or two others, such as EPUB and MOBI. Audio files are also available for many of them.
What You Get and How to Get It
What you end up with is a document that is not exactly pristine or beautifully formatted, but it does contain the information you need for this purpose. Here's an example (I cleaned up the formatting a little to make it look better):
Here is the code I used. It's not "normalized" (it's basically one big block of code in a single method), but my excuse is twofold: it's a relatively simple one-trick pony, and it's a utility for my personal use, not really meant to be a programming showpiece. Anyway, without further palabra, here's the code:
private void ParseHTMLFilesAndSaveAsDOCX()
{
    // Markers prepended to a paragraph's text to carry its formatting through to the DOCX step.
    const string BOLD_MARKER = "~";
    const string HEADING_MARKER = "^";
    List<string> sourceText = new List<string>();
    List<string> targetText = new List<string>();
    HtmlAgilityPack.HtmlDocument htmlDocSource = new HtmlAgilityPack.HtmlDocument();
    HtmlAgilityPack.HtmlDocument htmlDocTarget = new HtmlAgilityPack.HtmlDocument();
    htmlDocSource.OptionFixNestedTags = true;
    htmlDocTarget.OptionFixNestedTags = true;
    // sourceHTMLFilename and targetHTMLFilename are defined elsewhere in the form class.
    htmlDocSource.Load(sourceHTMLFilename);
    htmlDocTarget.Load(targetHTMLFilename);

    // Collect every non-blank text node of the source document.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection pSourceNodes = htmlDocSource.DocumentNode.SelectNodes("//text()");
    if (pSourceNodes != null)
    {
        string sourcePar;
        foreach (HtmlNode sText in pSourceNodes)
        {
            if (!string.IsNullOrWhiteSpace(sText.InnerText))
            {
                string formattingMarker = string.Empty;
                if (sText.OuterHtml.Contains("FONT SIZE=4"))
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (sText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                sourcePar = string.Format("{0}{1}", formattingMarker, sText.InnerText);
                sourceText.Add(HttpUtility.HtmlDecode(sourcePar));
            }
        }
    }

    // Same extraction for the target-language document.
    HtmlNodeCollection pTargetNodes = htmlDocTarget.DocumentNode.SelectNodes("//text()");
    if (pTargetNodes != null)
    {
        string targetPar;
        foreach (HtmlNode tText in pTargetNodes)
        {
            if (!string.IsNullOrWhiteSpace(tText.InnerText))
            {
                string formattingMarker = string.Empty;
                if (tText.OuterHtml.Contains("FONT SIZE=4"))
                {
                    formattingMarker = BOLD_MARKER;
                }
                else if (tText.OuterHtml.Contains("FONT SIZE=5"))
                {
                    formattingMarker = HEADING_MARKER;
                }
                targetPar = string.Format("{0}{1}", formattingMarker, tText.InnerText);
                targetText.Add(HttpUtility.HtmlDecode(targetPar));
            }
        }
    }

    int sourceLineCount = sourceText.Count;
    int targetLineCount = targetText.Count;
    int higherCount = Math.Max(sourceLineCount, targetLineCount);
    string sourceParagraph = string.Empty;
    string targetParagraph = string.Empty;
    string docxFilename = string.Format("{0}.docx", textBoxDOCXFile2BCre8ed.Text.Trim());
    using (DocX document = DocX.Create(docxFilename))
    {
        for (int i = 0; i < higherCount; i++)
        {
            // Reset on every pass; otherwise, once the shorter list runs out,
            // its last paragraph would be written again on each remaining pass.
            sourceParagraph = string.Empty;
            targetParagraph = string.Empty;
            if ((i < sourceLineCount) && (null != sourceText[i]))
            {
                sourceParagraph = sourceText[i];
            }
            if ((i < targetLineCount) && (null != targetText[i]))
            {
                targetParagraph = targetText[i];
            }

            // Source-language paragraph: Palatino Linotype.
            if (!string.IsNullOrWhiteSpace(sourceParagraph))
            {
                Paragraph pSource = document.InsertParagraph();
                if (sourceParagraph.Contains(BOLD_MARKER))
                {
                    sourceParagraph = sourceParagraph.Replace(BOLD_MARKER, "");
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(13).Bold();
                }
                else if (sourceParagraph.Contains(HEADING_MARKER))
                {
                    sourceParagraph = sourceParagraph.Replace(HEADING_MARKER, "");
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(16);
                }
                else
                {
                    pSource.Append(sourceParagraph).Font(new FontFamily("Palatino Linotype")).FontSize(11);
                }
                Paragraph pSpacer = document.InsertParagraph();
                pSpacer.Append(Environment.NewLine);
            }

            // Target-language paragraph: Georgia, always bold.
            if (!string.IsNullOrWhiteSpace(targetParagraph))
            {
                Paragraph pTarget = document.InsertParagraph();
                if (targetParagraph.Contains(BOLD_MARKER))
                {
                    targetParagraph = targetParagraph.Replace(BOLD_MARKER, "");
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(13).Bold();
                }
                else if (targetParagraph.Contains(HEADING_MARKER))
                {
                    targetParagraph = targetParagraph.Replace(HEADING_MARKER, "");
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(16).Bold();
                }
                else
                {
                    pTarget.Append(targetParagraph).Font(new FontFamily("Georgia")).FontSize(11).Bold();
                }
                Paragraph pTargetSpacer = document.InsertParagraph();
                pTargetSpacer.Append(Environment.NewLine);
            }
        }
        document.Save();
    }
    MessageBox.Show("done!");
}
I have uploaded the source code, too. Feel free to clean it up or refactor it, but please post it back here to Code Project if you do. From the source you can see which controls you need to add to the form and what to name them.
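As one possible starting point for such a cleanup (a sketch, not the uploaded source), the pairing of paragraphs could be separated from the DOCX writing so that it can be exercised without HtmlAgilityPack or DocX. The names ParagraphStyle, StyledParagraph, and Interleaver are hypothetical; only the "~" and "^" markers come from the method above. Note one deliberate difference: this sketch looks for the marker only at the start of the text, whereas the method above uses Contains and Replace on the whole string.

```csharp
using System;
using System.Collections.Generic;

public enum ParagraphStyle { Normal, Bold, Heading }

// One paragraph ready to be written, with its marker already removed.
public sealed class StyledParagraph
{
    public string Text { get; }
    public ParagraphStyle Style { get; }
    public bool IsTarget { get; }   // false = source language, true = target language

    public StyledParagraph(string text, ParagraphStyle style, bool isTarget)
    {
        Text = text;
        Style = style;
        IsTarget = isTarget;
    }
}

public static class Interleaver
{
    private const string BoldMarker = "~";
    private const string HeadingMarker = "^";

    // Strips a leading marker, if any, and records the style it stood for.
    public static StyledParagraph Parse(string raw, bool isTarget)
    {
        if (raw.StartsWith(BoldMarker, StringComparison.Ordinal))
        {
            return new StyledParagraph(raw.Substring(BoldMarker.Length), ParagraphStyle.Bold, isTarget);
        }
        if (raw.StartsWith(HeadingMarker, StringComparison.Ordinal))
        {
            return new StyledParagraph(raw.Substring(HeadingMarker.Length), ParagraphStyle.Heading, isTarget);
        }
        return new StyledParagraph(raw, ParagraphStyle.Normal, isTarget);
    }

    // Alternates source and target paragraphs by index. When one list is
    // shorter, the remaining paragraphs of the longer list follow on their own.
    public static List<StyledParagraph> Interleave(IList<string> source, IList<string> target)
    {
        var result = new List<StyledParagraph>();
        int count = Math.Max(source.Count, target.Count);
        for (int i = 0; i < count; i++)
        {
            if (i < source.Count && !string.IsNullOrWhiteSpace(source[i]))
            {
                result.Add(Parse(source[i], false));
            }
            if (i < target.Count && !string.IsNullOrWhiteSpace(target[i]))
            {
                result.Add(Parse(target[i], true));
            }
        }
        return result;
    }
}
```

The DOCX-writing loop would then only need to walk the returned list and pick a font and size from Style and IsTarget.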
I did not tell you how to get the files from EPUB (or whatever format you download) to HTML; that is something of an "exercise left to the reader." In my case, I use AVS Document Converter to convert EPUB files to DOCX, then manually save those as HTML files before running my utility against them. You may find a better way; I did not. Converting directly to HTML created a rather malformed file, and saving a PDF as text also produced a very "ugly" text file.
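One alternative that may be worth trying, offered as an untested-against-these-publications sketch: an EPUB file is a ZIP archive whose chapters are stored as XHTML/HTML files, so they can be pulled out directly with System.IO.Compression. EpubHtmlExtractor and ExtractHtml are hypothetical names, not part of the uploaded source. The sketch assumes the EPUB is not encrypted, and it ignores the reading order defined in the package's OPF file, so the extracted chapter files may still need to be ordered and combined by hand before they suit a utility that expects one HTML file per language.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class EpubHtmlExtractor
{
    // Copies every .xhtml/.html/.htm entry of the EPUB into outputDir
    // and returns the paths of the files that were written.
    public static List<string> ExtractHtml(string epubPath, string outputDir)
    {
        var written = new List<string>();
        Directory.CreateDirectory(outputDir);

        using (ZipArchive archive = ZipFile.OpenRead(epubPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                // entry.Name is the file name without its folder; it is empty for folder entries.
                string extension = Path.GetExtension(entry.Name).ToLowerInvariant();
                if (extension != ".xhtml" && extension != ".html" && extension != ".htm")
                {
                    continue;
                }

                // Folders are flattened, so files with the same name in different
                // folders would overwrite each other (acceptable for a sketch).
                string destination = Path.Combine(outputDir, entry.Name);
                entry.ExtractToFile(destination, true);
                written.Add(destination);
            }
        }
        return written;
    }
}
```

On .NET Framework this needs references to the System.IO.Compression and System.IO.Compression.FileSystem assemblies.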