Introduction
This tip shows how to perform a string or regex search on multiple DOCX files in a specified directory.
The accompanying application demonstrates how to read DOCX files, convert them to text, and search that text for a specific string or regex. It is based on the Show Word file in WPF article, which explains the DOCX file format and implements the DOCX reader used in this tip, so I recommend reading it before this one.
Implementation
We will use the same DocxReader class from the article mentioned above to unzip the DOCX files and to read the DOCX main part (document.xml) with XmlReader. We will also implement a converter (DocxToStringConverter) that converts specific XML elements (or their content) from document.xml to strings.
DocxToStringConverter
This class inherits from DocxReader and overrides its virtual reading methods to build strings as follows:
- While DocxReader is reading the document element (<document>), we create a new StringBuilder to which all of the DOCX text content will be appended:
    protected override void ReadDocument(XmlReader reader)
    {
        // Start with a fresh buffer for each document.
        this.text = new StringBuilder();
        base.ReadDocument(reader);
    }
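- After DocxReader reads a paragraph element (<p>), we append two line breaks to the StringBuilder, so that paragraphs are separated by a blank line:
    protected override void ReadParagraph(XmlReader reader)
    {
        base.ReadParagraph(reader);
        // Terminate the paragraph and add an empty line after it.
        this.text.AppendLine().AppendLine();
    }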
- While DocxReader is reading a text element (<t>), we append the content of that element to the StringBuilder:
    protected override void ReadText(XmlReader reader)
    {
        // Append the text content of the current <t> element.
        this.text.Append(reader.ReadString());
    }
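For readers who do not have the DocxReader class at hand, the three overrides above can be approximated in one self-contained method. This is only a sketch, not the DocxReader implementation from the referenced article: the class name DocxTextSketch and the method ExtractText are hypothetical, it assumes the main part is stored at the conventional path word/document.xml, and it uses ZipArchive (on .NET Framework this needs a reference to the System.IO.Compression assembly).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original application.
public static class DocxTextSketch
{
    // WordprocessingML main namespace used by the elements in document.xml.
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ExtractText(Stream docxStream)
    {
        var text = new StringBuilder();
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, true))
        {
            // Assumption: the main part lives at the conventional path.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                return string.Empty;

            using (Stream xml = entry.Open())
            using (XmlReader reader = XmlReader.Create(xml))
            {
                bool inText = false; // true while inside a <w:t> element
                while (reader.Read())
                {
                    bool isW = reader.NamespaceURI == W;
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (isW && reader.LocalName == "t")
                                inText = !reader.IsEmptyElement;
                            else if (isW && reader.LocalName == "p" && reader.IsEmptyElement)
                                text.AppendLine().AppendLine(); // empty paragraph
                            break;
                        case XmlNodeType.EndElement:
                            if (isW && reader.LocalName == "t")
                                inText = false;
                            else if (isW && reader.LocalName == "p")
                                text.AppendLine().AppendLine(); // end of paragraph
                            break;
                        case XmlNodeType.Text:
                        case XmlNodeType.CDATA:
                        case XmlNodeType.Whitespace:
                        case XmlNodeType.SignificantWhitespace:
                            if (inText)
                                text.Append(reader.Value);
                            break;
                    }
                }
            }
        }
        return text.ToString();
    }
}
```

Like the converter above, the sketch keeps only the content of text elements and separates paragraphs with a blank line; everything else in document.xml is ignored.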
MainForm
This simple Windows Forms user interface lets you search DOCX files in a specified directory (and, optionally, its subdirectories) and shows the search results in the ListView control using the code below:
private void btnSearch_Click(object sender, EventArgs e)
{
    foreach (var filePath in Search(this.txtDirectory.Text, this.txtSearch.Text,
        this.cBoxUseSubdirectories.Checked, this.cBoxCaseSensitive.Checked, this.rBtnRegex.Checked))
    {
        var file = new FileInfo(filePath);
        // Columns: file name, size in KB (one decimal place), full path.
        this.resultListView.Items.Add(new ListViewItem(new string[]
            { file.Name, string.Format("{0:0.0}", file.Length / 1024d), file.FullName }));
    }
}
Depending on the user's choice, we perform a regex or a string search on the current DOCX file. To accomplish this, we use the Predicate<T> delegate to implement the two search options, as in the following code:
var isMatch = useRegex ?
    new Predicate<string>(x => Regex.IsMatch(x, searchString,
        caseSensitive ? RegexOptions.None : RegexOptions.IgnoreCase)) :
    new Predicate<string>(x => x.IndexOf(searchString,
        caseSensitive ? StringComparison.Ordinal : StringComparison.OrdinalIgnoreCase) >= 0);
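The same choice between the two search options can also be wrapped in a small factory method. The sketch below is one possible arrangement, not the code from the accompanying application; the names MatcherSketch and CreateMatcher are hypothetical. It constructs the Regex once, so an invalid pattern is reported (as an ArgumentException) before any file is opened.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original application.
public static class MatcherSketch
{
    public static Predicate<string> CreateMatcher(
        string searchString, bool useRegex, bool caseSensitive)
    {
        if (useRegex)
        {
            // Constructing the Regex here validates the pattern up front.
            var regex = new Regex(searchString,
                caseSensitive ? RegexOptions.None : RegexOptions.IgnoreCase);
            return text => regex.IsMatch(text);
        }

        // Plain substring search with ordinal comparison, as in the tip.
        StringComparison comparison = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;
        return text => text.IndexOf(searchString, comparison) >= 0;
    }
}
```

A caller would then write, for example, var isMatch = MatcherSketch.CreateMatcher(searchString, useRegex, caseSensitive); and use isMatch exactly as before.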
The isMatch delegate is used in a method that iterates over all DOCX files in the specified directory, converts them to text, and returns the path of every DOCX file that satisfies the delegate. The method is a C# iterator (it uses the yield return statement), as in the following code:
foreach (var filePath in Directory.GetFiles(directory, "*.docx",
    searchSubdirectories ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly))
{
    string docxText;
    // FileShare.ReadWrite lets us read files that another process has open.
    using (var stream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        docxText = new DocxToStringConverter(stream).Convert();
    if (isMatch(docxText))
        yield return filePath;
}
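One practical concern with a loop like this is that a single unreadable file stops the whole search. For example, the owner files that Word creates next to an open document (names starting with ~$) carry the .docx extension but, as far as I know, are not ZIP archives. The sketch below shows one way to skip such files. It is an assumption-laden variant, not the method from the accompanying application: SearchSketch is a hypothetical name, and the readText parameter is a hypothetical stand-in for the DOCX-to-text conversion. Because C# does not allow yield return inside a try block that has a catch clause, the conversion is done inside the try and the yield return stays outside it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original application.
public static class SearchSketch
{
    public static IEnumerable<string> Search(
        string directory, bool searchSubdirectories,
        Func<string, string> readText, Predicate<string> isMatch)
    {
        SearchOption option = searchSubdirectories
            ? SearchOption.AllDirectories
            : SearchOption.TopDirectoryOnly;

        // EnumerateFiles returns paths lazily, which suits an iterator.
        foreach (string filePath in Directory.EnumerateFiles(directory, "*.docx", option))
        {
            string text;
            try
            {
                text = readText(filePath);
            }
            catch (Exception ex) when (ex is IOException
                || ex is InvalidDataException
                || ex is UnauthorizedAccessException)
            {
                continue; // skip files that cannot be read as DOCX
            }

            if (isMatch(text))
                yield return filePath;
        }
    }
}
```

Since the method is an iterator, nothing is read until the caller starts enumerating the result, and each match is reported as soon as it is found.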
The resulting DOCX files listed in the ListView control can be activated to open them in your default DOCX viewer (usually Microsoft Word):
private void resultListView_ItemActivate(object sender, EventArgs e)
{
    // The third column (index 2) holds the full path of the file.
    string filePath = ((ListView)sender).SelectedItems[0].SubItems[2].Text;
    if (File.Exists(filePath))
        Process.Start(filePath);
}
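Calling Process.Start with a document path relies on the operating system shell to find the associated application. On .NET Framework this works because ProcessStartInfo.UseShellExecute defaults to true; on .NET Core and later the default is false, and starting a .docx file this way typically fails. If the application is built for a newer runtime, a sketch like the following may help. The names OpenFileSketch, CreateOpenInfo and OpenWithDefaultApp are hypothetical.

```csharp
using System.Diagnostics;

// Hypothetical helper, not part of the original application.
public static class OpenFileSketch
{
    // Builds start info that asks the shell to open the file
    // with whatever application is registered for its extension.
    public static ProcessStartInfo CreateOpenInfo(string filePath)
    {
        return new ProcessStartInfo(filePath) { UseShellExecute = true };
    }

    public static void OpenWithDefaultApp(string filePath)
    {
        // Process.Start can return null when no new process is created,
        // for example when an already running viewer is reused.
        Process.Start(CreateOpenInfo(filePath))?.Dispose();
    }
}
```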
Conclusion
Show Word file in WPF demonstrated how to convert DOCX to WPF's FlowDocument, and this tip demonstrated how to convert DOCX to plain text using the same DOCX reading code. By combining these two articles, you could, for example, convert DOCX to HTML. Hopefully, this tip has shown you the basics of reading DOCX files and how to convert DOCX to other representations by reusing the same DOCX reading code in all of these conversions.