Prerequisites
To run the sample application, the Microsoft .NET Framework 2.0 or higher must be installed. Microsoft Office 2003 or higher must also be installed, along with the Microsoft Office 2003 Primary Interop Assemblies (PIAs) redistributable. The PIAs are installed by a full install of Microsoft Office 2003, or they can be downloaded for free from Microsoft.
For more information on how to install and use the Primary Interop Assemblies in .NET programs, please refer to
this link.
I would like to emphasize that Visual Studio Tools for Office is not needed to run or modify this program.
Introduction
Regular Expressions are a very powerful tool for text processing. Sophisticated expressions can be used to find all kinds of text patterns, and Regular Expression engines are integrated into many text editors. Most Regular Expression examples show how to manipulate either ASCII or Unicode text. Beyond the plain text formats that such editors handle, there are millions (or probably billions) of documents encoded in one of Microsoft’s many Office formats, such as Word (DOC), Rich Text Format (RTF), and Excel (XLS). While one can search Microsoft Office documents with Regular Expressions through Smart Tags, that approach is cumbersome for many document processing purposes. In this article, I will present a simple methodology for applying the power of Regular Expressions to Microsoft Word documents through the Microsoft .NET Framework. The methodology makes use of the System.Text.RegularExpressions
namespace and the Microsoft Word interop assemblies to realize this solution. In addition, through the use of dynamically loadable assemblies, every Regular Expression match can be validated to ensure that the match is correct. For example, it is quite easy to write a Regular Expression for a numerical date of the form 02/07/2007 for February 7, 2007. But to include in the Regular Expression checks for invalid dates such as 04/31/2002 or 02/30/2007 is quite difficult without code that performs such checks.
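To make the date problem concrete, here is a minimal sketch that is not taken from the sample application. The pattern and the names DateMatchSketch, FindCandidates, and IsRealDate are assumptions made for illustration. The pattern accepts anything shaped like a date, and a separate code check rejects the impossible ones.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DateMatchSketch
{
    // Hypothetical pattern: two digits, slash, two digits, slash, four digits.
    private static readonly Regex UsDate =
        new Regex(@"\b(\d{2})/(\d{2})/(\d{4})\b");

    // Returns every substring that has the shape of a date, valid or not.
    public static List<string> FindCandidates(string text)
    {
        List<string> found = new List<string>();
        foreach (Match m in UsDate.Matches(text))
            found.Add(m.Value);
        return found;
    }

    // Returns true only if the candidate is a real calendar date in MM/dd/yyyy form.
    public static bool IsRealDate(string candidate)
    {
        DateTime ignored;
        return DateTime.TryParseExact(candidate, "MM/dd/yyyy",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out ignored);
    }
}
```

For the text "Due 02/07/2007, not 02/30/2007." the pattern reports both substrings as candidates, but only 02/07/2007 passes IsRealDate. This is the gap that the validation plug-ins in this article are meant to close.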
In future articles, I plan to present ways of using Regular Expressions to perform sophisticated text search and replace algorithms through the use of the MSOFFICE interop assemblies and .NET technologies. I will also apply these techniques to other MSOFFICE documents such as EXCEL.
Background
Support for Regular Expressions in Microsoft applications first appeared in Word 97. Using it was quite tedious because its syntax differed significantly from standard Regular Expression syntax. Microsoft recognized the shortfalls of this implementation and reintroduced Regular Expressions as part of the Smart Tags library 2.0, which first became available with Microsoft Office 2003. Smart Tags, of which Regular Expression operations form a small part, represented a generalized, integrated way to enable users to present data from their documents. However, because Smart Tags are non-intuitive and complicated, Microsoft itself admits on its MSDN Web site that a poll showed developers have not taken the necessary steps to develop them or to use the Microsoft .NET Framework to do so. Please refer to this MSDN article for more information: Realize the Potential of Office 2003 by Creating Smart Tags in Managed Code. The focus of this article is devising a simple, yet powerful way of using Regular Expressions (along with validation code).
Using the Code
On startup, the program reads the XML file Searches.XML. This file contains information for all built-in Regular Expression searches. Included in this XML file are searches for URLs, IP addresses, US dates, European dates, US phone numbers, and email addresses. One can add as many search options as she or he wants to this file. Each search option can be activated by placing a check by the desired search.
Each search group contains the following information in the XML file:
- Search Regex – The Regular Expression used in the search
- Identifier – The search title that appears in the check list box
- FindColor – The color used to highlight the found text in the document
- Action – The operation used (this version only supports Find)
- PlugInName – The name of the assembly associated with the search. If no assembly is associated, “None” is used.
- PlugInFunction – The function called for this search block that is found in its plug-in assembly
- Description – The description text that is displayed in the check list box
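As a rough illustration of how such a file could be read, the sketch below parses one search entry with XmlDocument. The element names and nesting (Searches, Search, Regex, and so on) are guesses based on the field list above and may not match the real Searches.XML. The SearchEntry and SearchFileSketch names are likewise hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical container for a subset of the fields listed above.
public class SearchEntry
{
    public string Pattern;
    public string Identifier;
    public string FindColor;
    public string PlugInName;
    public string PlugInFunction;
}

public static class SearchFileSketch
{
    // Assumed layout (not confirmed by the article):
    // <Searches><Search><Regex>...</Regex><Identifier>...</Identifier>...</Search></Searches>
    public static List<SearchEntry> Parse(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<SearchEntry> entries = new List<SearchEntry>();
        foreach (XmlNode node in doc.SelectNodes("/Searches/Search"))
        {
            SearchEntry entry = new SearchEntry();
            entry.Pattern = ChildText(node, "Regex");
            entry.Identifier = ChildText(node, "Identifier");
            entry.FindColor = ChildText(node, "FindColor");
            entry.PlugInName = ChildText(node, "PlugInName");
            entry.PlugInFunction = ChildText(node, "PlugInFunction");
            entries.Add(entry);
        }
        return entries;
    }

    // Returns the text of a child element, or an empty string if it is absent.
    private static string ChildText(XmlNode parent, string name)
    {
        XmlNode child = parent.SelectSingleNode(name);
        return child == null ? "" : child.InnerText;
    }
}
```

Reading a missing element as an empty string keeps a partially filled entry from aborting the whole load. Whether the sample application is that forgiving is not stated in the article.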
Finding the Text
MSWordRegExDemo
contains methods which manipulate the Microsoft Word or RTF document using automation by way of the Microsoft Word interop assembly. All of these methods are contained in the DocumentEngine
class. The two main Microsoft Word objects that are used in this application are:
Word.Application app;
Word.Document theDoc;
To open the document, we perform the following call which is triggered by the file open event in the GUI:
public void OpenDocument(string documentName)
{
object optional = Missing.Value;
object visible = true;
object fileName = documentName;
if (app == null)
app = new Word.Application();
app.Visible = true;
try
{
theDoc = app.Documents.Open(ref fileName, ref optional,
ref optional, ref optional, ref optional, ref optional, ref optional,
ref optional, ref optional, ref optional, ref optional, ref visible,
ref optional, ref optional, ref optional, ref optional);
paraCount = theDoc.Paragraphs.Count;
}
catch(Exception ex)
{
MessageBox.Show(ex.Message + ": Error opening document");
}
}
The first step is converting the content of the Word document into plain text. Once we have the document in the text domain, we can perform a Regular Expression search on the text and see if there are any matches. See below:
docText = docEngine.GetRng(currentParaNum).Text;
If one or more matches occur, we then take the match text and feed it through the Microsoft Word Find function. To search, we first need to choose a range of the document to convert to text. I have chosen the paragraph range. This means that we loop through the document paragraph by paragraph, performing our searches on each paragraph. For short documents, we could select the entire range of the document. If we wanted to iterate through footnotes, Word provides a footnote range. To get the range of each paragraph, the following function is used:
public Word.Range GetRng(int nParagraphNumber)
{
try
{
return theDoc.Paragraphs[nParagraphNumber].Range;
}
catch (System.Runtime.InteropServices.COMException ex)
{
MessageBox.Show(ex.Message + "\nParagraph Number: " +
nParagraphNumber.ToString() + " does not exist.");
return null;
}
}
The main function which performs the "find" of text is RegularExpressionFind.
public void RegularExpressionFind(int paraNum, string docText,
SearchStruct selSearchStruct, out List<HitInfo> hits)
{
HitInfo hitInfo = new HitInfo();
hits = new List<HitInfo>();
System.Text.RegularExpressions.Regex r;
Word.WdColor color = GetSearchColor(selSearchStruct.TextColor);
r = new Regex(selSearchStruct.RegExpression);
MatchCollection matches = r.Matches(docText);
if (matches.Count == 0)
return;
try
{
if (!LoadSearchAssembly(selSearchStruct.PlugInName,
selSearchStruct.PlugInFunction))
return;
}
catch (Exception ex)
{
throw; // rethrow without resetting the stack trace
}
int index = 0;
int startSearchPos = GetRng(paraNum).Start;
foreach (Match match in matches)
{
if (hasValidationAssembly)
{
Object[] objList = new Object[1];
objList[0] = (Object)match;
if (!Convert.ToBoolean(validationMethod.Invoke
(assemblyInstance, objList)))
continue;
}
index = docText.IndexOf(match.Value, index);
string matchStr = docText.Substring(index, match.Value.Length);
index += matchStr.Length - 1;
FindTextInDoc(OperationMode.DotNetRegExMode, paraNum,
matchStr, color, startSearchPos, out startSearchPos,
out hitInfo.StartDocPosition);
hitInfo.Text = match.Value;
hits.Add(hitInfo);
}
}
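As a side note on the index bookkeeping in the loop above: each Match object already carries its own offset and length through the Match.Index and Match.Length properties, so the position within the paragraph text can be read directly and need not be recovered with IndexOf. The sketch below works on plain strings only. The TextHit and MatchOffsetSketch names are hypothetical and are not part of the sample application.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical record of where a match sits inside the paragraph text.
public struct TextHit
{
    public int Start;   // zero-based offset within the paragraph text
    public int Length;  // number of characters matched
    public string Text; // the matched text itself
}

public static class MatchOffsetSketch
{
    // Collects the offset, length, and text of every match in the paragraph.
    public static List<TextHit> Locate(string paragraphText, string pattern)
    {
        List<TextHit> hits = new List<TextHit>();
        foreach (Match match in Regex.Matches(paragraphText, pattern))
        {
            TextHit hit = new TextHit();
            hit.Start = match.Index;
            hit.Length = match.Length;
            hit.Text = match.Value;
            hits.Add(hit);
        }
        return hits;
    }
}
```

These offsets are relative to the extracted string. They may not map one-to-one onto Word range positions in every paragraph, which is one possible reason to locate the text again through Word's own Find, as the article does.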
First, we search for the Regular Expression in the imported paragraph, by using the Regex
.NET functions.
r = new Regex(selSearchStruct.RegExpression);
MatchCollection matches = r.Matches(docText);
if (matches.Count == 0)
return;
If there is a match, we load the search assembly if it has not already been loaded, and perform additional validation on the match.
try
{
if (!LoadSearchAssembly(selSearchStruct.PlugInName,
selSearchStruct.PlugInFunction))
return;
}
The following method dynamically loads the validation assembly for the Regular Expression, if one exists. If the assembly was previously loaded, the LoadFrom
method will return it.
public bool LoadSearchAssembly(string plugginName, string plugInFunction)
{
try
{
if (plugginName.ToLower() == "none")
{
hasValidationAssembly = false;
return true;
}
hasValidationAssembly = true;
string plugginPath = Path.GetDirectoryName
(Application.ExecutablePath) + @"\Plugins\" + plugginName;
if (!File.Exists(plugginPath))
throw new Exception("Cannot find path to assembly: " +
plugginName);
Assembly a = Assembly.LoadFrom(plugginPath);
Type[] types = a.GetTypes();
validationMethod = types[0].GetMethod(plugInFunction);
assemblyInstance = Activator.CreateInstance(types[0]);
return true;
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
return false;
}
}
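The method above assumes that the first type returned by GetTypes is the validator class. A slightly more defensive variant, sketched below, scans the assembly for a type that actually exposes the requested function and wraps it in a delegate. The SampleValidator class stands in for a real plug-in DLL, and the PluginLookupSketch and FindValidator names are hypothetical, so treat this as one possible approach and not the article's implementation.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

// Stand-in for a plug-in class. In the article's design this would live in a
// separate DLL under the Plugins folder.
public class SampleValidator
{
    public bool AcceptEvenLength(Match m)
    {
        return m.Value.Length % 2 == 0;
    }
}

public static class PluginLookupSketch
{
    // Returns a delegate bound to the first public instance method named
    // functionName that takes a Match and returns bool, or null if none exists.
    public static Predicate<Match> FindValidator(Assembly assembly, string functionName)
    {
        foreach (Type type in assembly.GetTypes())
        {
            // Skip types that cannot be instantiated with a parameterless constructor.
            if (type.IsAbstract || type.IsGenericTypeDefinition ||
                type.GetConstructor(Type.EmptyTypes) == null)
                continue;

            MethodInfo method = type.GetMethod(functionName, new Type[] { typeof(Match) });
            if (method == null || method.IsStatic || method.ReturnType != typeof(bool))
                continue;

            object instance = Activator.CreateInstance(type);
            return (Predicate<Match>)Delegate.CreateDelegate(
                typeof(Predicate<Match>), instance, method);
        }
        return null;
    }
}
```

A caller could pass the result of Assembly.LoadFrom here and invoke the returned delegate once per match. Binding a delegate once also avoids a reflective Invoke call on every match.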
Below is the assembly that validates a numerical date:
using System.Text.RegularExpressions;
namespace SaelSoft.RegExPlugIn.NumericalDateValidator
{
public class NumericalDateValidatorClass
{
int month = 0;
int day = 0;
int year = 0;
public bool ValidateUSDate(Match matchResult)
{
if (matchResult.Groups.Count < 4) // group 0 is the whole match, so three captures give a count of 4
return false;
int nResult = 0;
if (int.TryParse(matchResult.Groups[1].ToString(), out nResult))
month = nResult;
else
return false;
if (int.TryParse(matchResult.Groups[2].ToString(), out nResult))
day = nResult;
else
return false;
if (int.TryParse(matchResult.Groups[3].ToString(), out nResult))
year = nResult;
else
return false;
return CommonDateValidation();
}
public bool ValidateEuropeanDate(Match matchResult)
{
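if (matchResult.Groups.Count < 4) // group 0 is the whole match, so three captures give a count of 4
return false;
int nResult = 0;
// European dates are written day/month/year, so the first capture is the day
// (assuming the pattern captures the three parts in textual order).
if (int.TryParse(matchResult.Groups[1].ToString(), out nResult))
day = nResult;
else
return false;
if (int.TryParse(matchResult.Groups[2].ToString(), out nResult))
month = nResult;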
else
return false;
if (int.TryParse(matchResult.Groups[3].ToString(), out nResult))
year = nResult;
else
return false;
return CommonDateValidation();
}
private bool CommonDateValidation()
{
if (day == 31 && (month == 4 || month == 6 || month == 9 || month == 11))
{
return false;
}
else if (day >= 30 && month == 2)
{
return false;
}
else if (month == 2 && day == 29 && !(year % 4 == 0
&& (year % 100 != 0 || year % 400 == 0)))
{
return false;
}
else
{
return true;
}
}
}
}
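The month and leap-year table above can also be delegated to the framework. The following sketch uses DateTime.DaysInMonth, which is part of the .NET base class library. The CalendarCheckSketch class and IsValidDate method are hypothetical names, and this is offered as an alternative that does not replace the plug-in's own logic.

```csharp
using System;

public static class CalendarCheckSketch
{
    // Returns true when year/month/day form a real calendar date.
    public static bool IsValidDate(int year, int month, int day)
    {
        // DaysInMonth throws for out-of-range years or months, so guard first.
        if (year < 1 || year > 9999 || month < 1 || month > 12 || day < 1)
            return false;
        return day <= DateTime.DaysInMonth(year, month);
    }
}
```

Unlike the hand-written table, this variant also rejects months outside 1 to 12 and days above 31, which the plug-in presumably leaves to the Regular Expression itself.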
Finally, if we have a real match, we perform a search for the match string in the Word document by calling the DocumentEngine function, FindTextInDoc.
internal bool FindTextInDoc(OperationMode opMode, int currentParaNum,
string textToFind, Word.WdColor color, int start, out int end,
out int textStartPoint)
{
string strFind = textToFind;
textStartPoint = 0;
Word.Range rngDoc = GetRng(currentParaNum);
if (start >= rngDoc.End)
{
end = 0;
return false;
}
rngDoc.Start = start;
rngDoc.Find.ClearFormatting();
rngDoc.Find.Forward = true;
rngDoc.Find.Text = textToFind;
object caseSensitive = "1";
object missingValue = Type.Missing;
object matchWildCards = Type.Missing;
if (opMode == OperationMode.Word97Mode)
matchWildCards = "1";
// The fourth argument of Find.Execute is MatchWildcards, so the flag set above is passed here.
rngDoc.Find.Execute(ref missingValue, ref caseSensitive,
ref missingValue, ref matchWildCards, ref missingValue,
ref missingValue, ref missingValue, ref missingValue,
ref missingValue, ref missingValue, ref missingValue,
ref missingValue, ref missingValue, ref missingValue,
ref missingValue);
if (hilightText)
rngDoc.Select();
end = rngDoc.End + 1;
textStartPoint = rngDoc.Start;
if (rngDoc.Find.Found)
{
rngDoc.Font.Color = color;
return true;
}
return false;
}
Points of Interest
The DocumentEngine
class makes use of Microsoft Office events in order to detect the situation when the user closes the Microsoft Word document that was loaded by the application. When the Quit
event is invoked, the app and the document objects are set to null. They are reinitialized when the user opens a new document.
public DocumentEngine()
{
app = new Word.Application();
((Word.ApplicationEvents4_Event)app).Quit += new Microsoft.Office.
Interop.Word.ApplicationEvents4_QuitEventHandler(App_Quit);
}
private void App_Quit()
{
app = null;
theDoc = null;
}
This project can serve as the first step of a complex document processing application for Microsoft Word and RTF documents. Basically, everything that can be accomplished with Regular Expressions with ASCII or UNICODE files can now be done almost as easily for *.doc and *.rtf files. In my next article, I will show how, by means of dynamic assemblies, we can perform complex formatting using Regular Expressions.
For more online information on Microsoft Office Interop Assemblies, please refer to MSDN.
For Further Investigation
For those who would like to find out more about regular expressions and Microsoft Office automation, I recommend the following excellent books: Mastering Regular Expressions by Jeffrey E. F. Friedl, and Visual Studio Tools for Office: Using C# with Excel, Word, Outlook, and InfoPath by Eric Carter and Eric Lippert.
History
- 13th June, 2008: First version
- 14th June, 2008: Fixed the *.sln (solution files) so it is a bit tidier
- 16th June, 2008: Added a ColorCheckedBoxList component (subclassed from CheckedListBox) so that the user can see which color corresponds to which Regular Expression match. Drag and drop functionality was also added.