How to Replace a List of Words in a .DOCX File using the DocX Library

16 Dec 2014 | CPOL | 3 min read
Replace words in .docx files using the DocX Library

Mission

I recently came upon a tedious need to replace a slew (not slough) of words in a long document (John Ormsby's 1885 translation of Miguel de Cervantes' Don Quixote). 

Specifically, I wanted to replace the British spellings of words with their American dialect equivalents. For example, I wanted to replace "colour" with "color", "centre" with "center", "plough" with "plow", etc.

I could replace these one word at a time using Find > Replace, but that quickly becomes a pain in the arse...I mean, donkey. After all, WE BE PROGRAMMERS!

So, I located a handy library for working with .docx files named, appropriately if dully or even near-redundantly, DocX.

Commission

To use the DocX library, simply download it (docx.dll) from here, add a reference to it in your project, and then add this using directive:

C#
using Novacode;

You first need to load the document that contains the "fawlty" spellings, like so (this assumes you have dropped an OpenFileDialog control on a Windows Forms form and kept its default name, openFileDialog1):

C#
string filename = string.Empty;
DialogResult result = openFileDialog1.ShowDialog();
if (result == DialogResult.OK)
{
    filename = openFileDialog1.FileName;
}
else
{
    MessageBox.Show("No file selected - hasta la vista and Ciao, baby!");
    return;
}
using (DocX document = DocX.Load(filename))
{
    document.ReplaceText("travelled", "traveled");
    document.Save();
}

But, of course, we want to do all the words at once. First, we need to have that list of words, so code like this is needed:

C#
List<string> wordPairs;
public Form1()
{
    InitializeComponent();
    Popul8WordPairs();
}
. . .
private void Popul8WordPairs()
{
    wordPairs = new List<string>();
    ExpandWordPairs("æroplane", "airplane");
    ExpandWordPairs("æsthetic", "esthetic");
    ExpandWordPairs("ageing", "aging");
    ExpandWordPairs("aluminium", "aluminum");
    ExpandWordPairs("amœba", "ameba");
    ExpandWordPairs("anæmia", "anemia");
    ExpandWordPairs("anæsthesia", "anesthesia");
    ExpandWordPairs("analyse", "analyze");
    . . .
    ExpandWordPairs("victual", "vittle");
    ExpandWordPairs("vigour", "vigor");
    ExpandWordPairs("vigourous", "vigorous");
    ExpandWordPairs("vigourously", "vigorously");
    ExpandWordPairs("whisky", "whiskey");
    ExpandWordPairs("woollen", "woolen");
    ExpandWordPairs("yoghurt", "yogurt");
}

But hold on there, pard! What is this "ExpandWordPairs" jazz? Well, if a "word" (sequence of letters) in the list appears in the middle of another word, we don't want to "mess with it", so as to avoid any potentially embarrassing mishaps. So we want to look for the word and only the word, and so it is bookended with spaces. But then again, what if it is at the start of a sentence (capitalized), or at the end of a sentence or clause and does not have a space after it, but some form of punctuation, such as a comma or period, etc.?

Those are the situations the ExpandWordPairs() method handles, thusly:

C#
private void ExpandWordPairs(string britSpelling, string amiSpelling)
{
    wordPairs.Add(SpacesForeAndAft(britSpelling, amiSpelling));
    wordPairs.Add(CapitalizedAndSpaceAft(britSpelling, amiSpelling));
    wordPairs.Add(SpaceForePeriodAft(britSpelling, amiSpelling));
    wordPairs.Add(SpaceForeCommaAft(britSpelling, amiSpelling));
    wordPairs.Add(SpaceForeColonAft(britSpelling, amiSpelling));
    wordPairs.Add(SpaceForeSemicolonAft(britSpelling, amiSpelling));
    wordPairs.Add(SpaceForeDashAft(britSpelling, amiSpelling));
}

An example of the methods it calls is shown here:

C#
private string SpaceForeDashAft(string britSpelling, string amiSpelling)
{
    return string.Format(" {0}-# {1}-", britSpelling, amiSpelling);
}
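The other helpers are not shown in this tip, so here is a minimal sketch of what they might look like, assuming they follow the same "find#replace" pattern as SpaceForeDashAft(). The class name WordPairFormats and the Capitalize() helper are hypothetical names introduced only for this illustration; the real implementations are in the accompanying download and may differ.

```csharp
using System;

// Hypothetical container class; in the article these methods live on the form.
public static class WordPairFormats
{
    // Upper-cases only the first character ("colour" becomes "Colour").
    // Assumes the word is not empty.
    private static string Capitalize(string word)
    {
        return char.ToUpperInvariant(word[0]) + word.Substring(1);
    }

    // Each method returns "<text to find>#<replacement text>".
    public static string SpacesForeAndAft(string britSpelling, string amiSpelling)
    {
        return string.Format(" {0} # {1} ", britSpelling, amiSpelling);
    }

    // No leading space here: the word is assumed to start a sentence.
    public static string CapitalizedAndSpaceAft(string britSpelling, string amiSpelling)
    {
        return string.Format("{0} #{1} ", Capitalize(britSpelling), Capitalize(amiSpelling));
    }

    public static string SpaceForePeriodAft(string britSpelling, string amiSpelling)
    {
        return string.Format(" {0}.# {1}.", britSpelling, amiSpelling);
    }

    public static string SpaceForeCommaAft(string britSpelling, string amiSpelling)
    {
        return string.Format(" {0},# {1},", britSpelling, amiSpelling);
    }

    public static string SpaceForeColonAft(string britSpelling, string amiSpelling)
    {
        return string.Format(" {0}:# {1}:", britSpelling, amiSpelling);
    }

    public static string SpaceForeSemicolonAft(string britSpelling, string amiSpelling)
    {
        return string.Format(" {0};# {1};", britSpelling, amiSpelling);
    }

    public static void Main()
    {
        // Brackets make the leading and trailing spaces visible.
        Console.WriteLine("[" + SpacesForeAndAft("scrutinising", "scrutinizing") + "]");
        Console.WriteLine("[" + CapitalizedAndSpaceAft("scrutinising", "scrutinizing") + "]");
    }
}
```

The leading and trailing spaces are part of the pattern, so nothing downstream should trim these strings.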

This way, passing "scrutinising" to ExpandWordPairs() causes all of the following to be found and replaced with their American equivalents (a "zing" ending instead of "sing"):

  • " scrutinising "
  • "Scrutinising "
  • " scrutinising."
  • " scrutinising,"
  • " scrutinising;"
  • " scrutinising:"
  • " scrutinising-"
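To see why the padded patterns matter, here is a small sketch that applies the seven variants listed above to an ordinary string and compares the result with a bare replace. It uses plain string.Replace() as a stand-in for document.ReplaceText(), which is an assumption made so that the example runs without DocX. The names PaddedReplaceDemo, Expand() and Americanize() are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class PaddedReplaceDemo
{
    // Builds the seven (find, replace) variants for one word pair.
    // Assumes both words are non-empty.
    public static List<KeyValuePair<string, string>> Expand(string brit, string ami)
    {
        string capBrit = char.ToUpperInvariant(brit[0]) + brit.Substring(1);
        string capAmi = char.ToUpperInvariant(ami[0]) + ami.Substring(1);

        var pairs = new List<KeyValuePair<string, string>>();
        // Word surrounded by spaces.
        pairs.Add(new KeyValuePair<string, string>(" " + brit + " ", " " + ami + " "));
        // Capitalized word followed by a space (start of a sentence).
        pairs.Add(new KeyValuePair<string, string>(capBrit + " ", capAmi + " "));
        // Leading space, then one trailing punctuation mark.
        foreach (char mark in ".,:;-")
        {
            pairs.Add(new KeyValuePair<string, string>(" " + brit + mark, " " + ami + mark));
        }
        return pairs;
    }

    // Applies every variant in turn; string.Replace is case-sensitive.
    public static string Americanize(string text, string brit, string ami)
    {
        foreach (KeyValuePair<string, string> pair in Expand(brit, ami))
        {
            text = text.Replace(pair.Key, pair.Value);
        }
        return text;
    }

    public static void Main()
    {
        string sample = "Programme notes: the programme, once programmed, ran.";

        // Padded variants leave "programmed" alone.
        Console.WriteLine(Americanize(sample, "programme", "program"));

        // A bare replace mangles "programmed" and misses the capitalized form.
        Console.WriteLine(sample.Replace("programme", "program"));
    }
}
```

The first line printed is "Program notes: the program, once programmed, ran." and the second is "Programme notes: the program, once programd, ran.", which is exactly the kind of embarrassing mishap the padding avoids. Note that the seven variants do not cover every case: a word followed by "?" or "!", for example, would be left as it is.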

Note: I used a pound/hash sign as a separator (instead of the traditional comma or semicolon) between the "bad" (British English) and the "good" (American English) spellings because those punctuation marks (comma and semicolon) would then be more difficult to handle. Using a "#" was simply easier. I could have used a tilde or the symbol that represents The Artist Formerly Known As Prince, or something else just as well. Well, maybe not just as well. That's why I stuck with the "#" over TAFKAP.

At any rate, the spartan code shown earlier now becomes this:

C#
string filename = string.Empty;
string britSpelling = string.Empty;
string amiSpelling = string.Empty;
DialogResult result = openFileDialog1.ShowDialog();
if (result == DialogResult.OK)
{
    filename = openFileDialog1.FileName;
}
else
{
    MessageBox.Show("No file selected - cheerio and later days, dude!!");
    return;
}
using (DocX document = DocX.Load(filename))
{
    foreach (string s in wordPairs)
    {
        britSpelling = GetFirstHalf(s);
        amiSpelling = GetSecondHalf(s);
        document.ReplaceText(britSpelling, amiSpelling);
    }
    document.Save();
}
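GetFirstHalf() and GetSecondHalf() are not listed in this tip. A minimal sketch, assuming they simply split each entry at the "#" separator, could look like the following. The class name PairSplitter and the fallback behavior for a missing separator are assumptions for illustration; the versions in the accompanying download may differ.

```csharp
using System;

// Hypothetical container class; in the article these methods live on the form.
public static class PairSplitter
{
    private const char Separator = '#';

    // Everything before the first '#'. Spaces are kept on purpose,
    // because they are part of the text to find.
    public static string GetFirstHalf(string pair)
    {
        int pos = pair.IndexOf(Separator);
        return pos < 0 ? pair : pair.Substring(0, pos);
    }

    // Everything after the first '#'. Returns an empty string
    // if no separator is present.
    public static string GetSecondHalf(string pair)
    {
        int pos = pair.IndexOf(Separator);
        return pos < 0 ? string.Empty : pair.Substring(pos + 1);
    }

    public static void Main()
    {
        string pair = " scrutinising-# scrutinizing-";
        // Brackets make the leading space visible.
        Console.WriteLine("[" + GetFirstHalf(pair) + "]");
        Console.WriteLine("[" + GetSecondHalf(pair) + "]");
    }
}
```

Calling Trim() on either half would defeat the purpose, since the surrounding spaces and punctuation are what keep the match from landing in the middle of a longer word.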

Completion

All of the code is available in the accompanying file. All you need to do for it to run (besides what was already mentioned) is to drop a button on the form, retaining its default name (button1), and name the project "AmericanizeBritSpeak" (or give it any name you like and replace "AmericanizeBritSpeak" in the code with your own name).

This just scratches the surface of what can be accomplished when using the DocX library to work with .docx files. Download it from here, and donate if you find it useful and are able to.

If you cannot, or do not want to, create a utility based on the source code, you can download the .exe, which I have zipped up and added to this tip. It looks like this when you run it:

[Image 1: screenshot of the running utility]

Just click the button and you will be able to load a document and have its British English spellings replaced with the spellings used in American English. As you can see, the utility also contains links to two of my web sites as well as to all three volumes of the dual-language (Spanish and English) volumes of Don Quixote assembled and generated by "Found in the Translation".

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)