"Weird" Characters
In an earlier tip, I showed
how to wedge a space between words that were run together, such as "DennisRodman" or "theWorm", making them "Dennis Rodman" and
"the Worm" respectively (so to speak).
BTW: It's not funny, anymore, Dennis; you're not Marilyn Monroe, and that homicidal maniac is not JFK.
That tip, though, only dealt with the "normal" English alphabet (a..z and A..Z). Since I'm currently working with foreign-language
documents (Spanish and German, with French and perhaps Italian and Dutch coming later), I realized that I need to consider other possible
characters, too, both as the ending lowercase letter or other ending character (such as é, í, ñ, ?, !, ", », and ß) and as the beginning uppercase
letter or other beginning character (such as ¿, ¡, ", and «).
So, if you had a sentence such as this:
quéSera, Sera. Was zumTeuful ist hier los!¿se habla aleman?¡No!He said«Hola, muchacha»Das ist gewißMerkwürdig!
...running it through this helper method would "aerate" it like so:
qué Sera, Sera. Was zum Teuful ist hier los! ¿se habla aleman? ¡No! He said «Hola, muchacha» Das ist gewiß Merkwürdig!
Rather than clutter up and complicate the previous code, I wrote another helper function to handle those situations.
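At its core, the aeration is a plain find-and-replace on one ending/beginning pair at a time. Here is a minimal sketch of that single step on an ordinary string, with no DocX involved. The names PairAerator and InsertSpace are hypothetical, made up for illustration only:

```csharp
using System;

public static class PairAerator
{
    // Hypothetical helper: puts one space between every occurrence of
    // endChar that is immediately followed by beginChar.
    public static string InsertSpace(string text, string endChar, string beginChar)
    {
        string soughtCombo = string.Format("{0}{1}", endChar, beginChar);
        string desiredCombo = string.Format("{0} {1}", endChar, beginChar);
        return text.Replace(soughtCombo, desiredCombo);
    }
}
```

For example, PairAerator.InsertSpace("quéSera, Sera", "é", "S") returns "qué Sera, Sera". The rest of the tip amounts to repeating this step for every pair of interest.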
Preliminary Setting Up of Figurative Chairs
Take these steps to prepare for the code that follows:
- Download the DocX DLL library from here.
- In your Visual Studio project, right-click References, select "Add Reference..." and add docx.dll to the project from
wherever you saved it.
- Add this to your using section:
using Novacode;
Add this code to the top of your class, too:
const int FIRST_CAP_POS = 65;
const int LAST_CAP_POS = 90;
const int FIRST_LOWER_POS = 97;
const int LAST_LOWER_POS = 122;
List<string> specialWordEndings;
List<string> specialWordBeginnings;
string soughtCombo = string.Empty;
string desiredCombo = string.Empty;
As usually happens, this ends up being a little more complicated than I first reckoned, because I have to deal with four different situations:
- An "odd" character at the end of a word followed by a "normal" (A..Z) character
- A "normal" (a..z) character at the end of a word followed by an "odd" character
- An "odd" ending character followed by an "odd" beginning character
- A "normal" (a..z) character followed by a "normal" (A..Z) character
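All four situations boil down to one rule: a character from the "ending" set immediately followed by a character from the "beginning" set. As an alternative to the nested loops used in this tip (and not the method the tip itself uses), that rule could be sketched as a single regular expression with a lookbehind and a lookahead. The class name RegexAerator is hypothetical, and the sketch works on a plain string only; hooking it up to a DocX document is left aside here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexAerator
{
    // Lookbehind: a lowercase a..z or one of the tip's special endings.
    // Lookahead: an uppercase A..Z or one of the tip's special beginnings.
    // Both are zero-width, so the match is the empty gap between the two characters.
    private static readonly Regex Gap = new Regex(
        @"(?<=[a-zéíñß?!,.:;""»])(?=[A-ZÉ¿¡""«])");

    // Inserts a single space into every such gap, in one pass over the text.
    public static string Aerate(string text)
    {
        return Gap.Replace(text, " ");
    }
}
```

Because the straight double quote is in both sets, two adjacent quotes would also be split apart, just as they would be by the pair-by-pair replacement.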
And now, without further ado, adieux, or adios, straight from Carmel Valley, California, comes the illustrious and much-ballyhooed and
anticipated code, entering from stage left, welcome:
The Nitty Gritty Prettifier/Aerator
// Fills the lists of "unusual" characters that can end or begin a word.
private void Popul8UnusualCharLists()
{
    specialWordEndings = new List<string>() { "é", "í", "ñ", "?", "!", ",", ".", ":", ";", "\"", "»", "ß" };
    specialWordBeginnings = new List<string>() { "¿", "¡", "\"", "É", "«" };
}

// Unusual ending followed by unusual beginning, e.g. "!¿" becomes "! ¿".
private void AerateUnusualCombo(string filename)
{
    using (DocX document = DocX.Load(filename))
    {
        foreach (string endChar in specialWordEndings)
        {
            foreach (string beginChar in specialWordBeginnings)
            {
                soughtCombo = string.Format("{0}{1}", endChar, beginChar);
                desiredCombo = string.Format("{0} {1}", endChar, beginChar);
                document.ReplaceText(soughtCombo, desiredCombo);
            }
        }
        document.Save();
    }
}

// Unusual ending followed by A..Z, e.g. "éS" becomes "é S".
private void AerateUnusualEndNormalBegin(string filename)
{
    using (DocX document = DocX.Load(filename))
    {
        foreach (string endChar in specialWordEndings)
        {
            for (int i = FIRST_CAP_POS; i <= LAST_CAP_POS; i++)
            {
                char upperChar = (char)i;
                soughtCombo = string.Format("{0}{1}", endChar, upperChar);
                desiredCombo = string.Format("{0} {1}", endChar, upperChar);
                document.ReplaceText(soughtCombo, desiredCombo);
            }
        }
        document.Save();
    }
}

// a..z followed by unusual beginning, e.g. "d«" becomes "d «".
private void AerateNormalEndUnusualBegin(string filename)
{
    using (DocX document = DocX.Load(filename))
    {
        for (int i = FIRST_LOWER_POS; i <= LAST_LOWER_POS; i++)
        {
            char lowerChar = (char)i;
            foreach (string beginChar in specialWordBeginnings)
            {
                soughtCombo = string.Format("{0}{1}", lowerChar, beginChar);
                desiredCombo = string.Format("{0} {1}", lowerChar, beginChar);
                document.ReplaceText(soughtCombo, desiredCombo);
            }
        }
        document.Save();
    }
}

// a..z followed by A..Z, e.g. "mT" becomes "m T".
private void AerateNormalEndNormalBegin(string filename)
{
    using (DocX document = DocX.Load(filename))
    {
        for (int i = FIRST_LOWER_POS; i <= LAST_LOWER_POS; i++)
        {
            char lowerChar = (char)i;
            for (int j = FIRST_CAP_POS; j <= LAST_CAP_POS; j++)
            {
                char upperChar = (char)j;
                soughtCombo = string.Format("{0}{1}", lowerChar, upperChar);
                desiredCombo = string.Format("{0} {1}", lowerChar, upperChar);
                document.ReplaceText(soughtCombo, desiredCombo);
            }
        }
        document.Save();
    }
}
}
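Each of the four methods above loads and saves the file on its own, and they differ only in which characters feed the two loops. One possible consolidation, sketched here under assumptions, is to generate every sought/desired pair up front so the document only has to be opened once. The names ComboBuilder and BuildCombos are hypothetical and not part of DocX or of the attached source:

```csharp
using System;
using System.Collections.Generic;

public static class ComboBuilder
{
    // Yields (soughtCombo, desiredCombo) for every ending/beginning pair:
    // special or a..z endings crossed with special or A..Z beginnings.
    public static IEnumerable<KeyValuePair<string, string>> BuildCombos(
        IEnumerable<string> specialWordEndings,
        IEnumerable<string> specialWordBeginnings)
    {
        var endings = new List<string>(specialWordEndings);
        var beginnings = new List<string>(specialWordBeginnings);
        for (char c = 'a'; c <= 'z'; c++) endings.Add(c.ToString());
        for (char c = 'A'; c <= 'Z'; c++) beginnings.Add(c.ToString());

        foreach (string endChar in endings)
        {
            foreach (string beginChar in beginnings)
            {
                // Key is the run-together pair, Value is the same pair with a space.
                yield return new KeyValuePair<string, string>(
                    endChar + beginChar, endChar + " " + beginChar);
            }
        }
    }
}
```

Inside one using (DocX document = DocX.Load(filename)) block, a loop over ComboBuilder.BuildCombos(specialWordEndings, specialWordBeginnings) that calls document.ReplaceText(combo.Key, combo.Value), followed by a single document.Save(), would then stand in for the four separate passes. The replacements run in a different order than the four calls do, which should not matter here, since the inserted space belongs to neither list.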
Call it like so:
Cursor.Current = Cursors.WaitCursor;
try
{
    Popul8UnusualCharLists();
    string filename = string.Empty;
    DialogResult result = openFileDialog1.ShowDialog();
    if (result == DialogResult.OK)
    {
        filename = openFileDialog1.FileName;
    }
    else
    {
        MessageBox.Show("No file selected - exiting");
        return;
    }
    AerateUnusualCombo(filename);
    AerateUnusualEndNormalBegin(filename);
    AerateNormalEndUnusualBegin(filename);
    AerateNormalEndNormalBegin(filename);
}
finally
{
    Cursor.Current = Cursors.Default;
}
MessageBox.Show("Scrunched together words have been normalized!");
A Parting Plaintive Plea
If you find this tip useful, pay it forward and do something nice to somebody today, even if it surprises them.
Note: I have added two source code files: the smaller one is just for this tip; the larger one contains all the DocX code for various articles I wrote on CodeProject December 2013 and January 2014.