Introduction
Office 2007 introduced the OpenXML file format. Word, Excel and PowerPoint files are now compressed (ZIP) packages that contain a set of predefined XML files. All the data and metadata for these files is stored in those XML files.
For more details about the OpenXML file format, please visit this. There you can find the description and details of the file format for every Office 2007 application that supports OpenXML. Because all the data is stored as XML, I used this key fact to search the text in the files.
In my example code, I iterate through all .docx files and open each one using the Package object from the System.IO.Packaging namespace. The System.IO.Packaging namespace is part of .NET Framework 3.0. You can download it from here. The search runs on a background thread, which also updates the progress bar and other controls in the GUI.
File Format
Here we will talk a little about the .docx file format. Create a .docx file, type some text in it and save it. Then go to Windows Explorer and change the file extension to .zip (Word, Excel and PowerPoint files are actually ZIP files). Double-click the file to open it, and you will see the folder structure and the files inside.
Now navigate to the word folder and open the document.xml file. All the text that you typed in the Word file is there. One more important point: all of that text is enclosed in <w:t> tags. This is the key we will use to search in each file.
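To make the role of <w:t> concrete, here is a minimal sketch that parses a small hand-written fragment shaped like word/document.xml and collects the text of each <w:t> element. The fragment and the names WtDemo, SampleXml and ExtractRuns are made up for illustration; a real document.xml carries many more elements and attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class WtDemo
{
    // Namespace URI that the "w" prefix is bound to in WordprocessingML.
    public const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample for illustration; not taken from a real file.
    public const string SampleXml =
        "<w:document xmlns:w=\"" + WordNs + "\">" +
        "<w:body><w:p>" +
        "<w:r><w:t>Hello</w:t></w:r>" +
        "<w:r><w:t xml:space=\"preserve\"> world</w:t></w:r>" +
        "</w:p></w:body></w:document>";

    // Returns the inner text of every w:t element, in document order.
    public static List<string> ExtractRuns(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<string> runs = new List<string>();
        // Matching on local name + namespace URI does not depend on
        // the prefix that a particular file happens to use.
        foreach (XmlNode node in doc.GetElementsByTagName("t", WordNs))
            runs.Add(node.InnerText);
        return runs;
    }

    public static void Main()
    {
        foreach (string run in ExtractRuns(SampleXml))
            Console.WriteLine("[" + run + "]");
    }
}
```

Note that the sentence is stored as two separate <w:t> elements here. Word can split text like this, which matters for any search that looks at one element at a time.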
Using the Code
Download the code and open the solution. Go to Docfileparse.cs. In this file you will find the class Docfileparse, which has the following constructor.
// Stores the search settings. For a case-insensitive search,
// the search text is lowercased once here.
public Docfileparse(string _filename, string _searchtext, bool iscasesenstive,
    bool _matchword)
{
    filename = _filename;
    searchtext = _searchtext;
    casesenstive = iscasesenstive;
    if (casesenstive == false)
        searchtext = searchtext.ToLower();
    matchword = _matchword;
}
The class supports two additional search options:
- Case sensitive
- Match whole word
We pass these two parameters to the constructor to tell the class what type of search we want.
In a case-sensitive search, we look for the exact text. If the search is not case sensitive, we convert both the search text and the document text to lowercase before comparing.
If the user wants to match whole words, we split the document text into words and compare the search text with each word, one by one.
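The two options can be sketched as a small helper that works on a single string. This is an illustration rather than the code from Docfileparse: the names SearchOptionsDemo and Matches are hypothetical, and the sketch lowercases with ToLowerInvariant and splits on spaces only, which is the same simple notion of a "word" described above.

```csharp
using System;

public static class SearchOptionsDemo
{
    // Returns true when 'search' occurs in 'text' under the given options.
    public static bool Matches(string text, string search,
                               bool caseSensitive, bool wholeWord)
    {
        if (!caseSensitive)
        {
            // Case-insensitive: compare lowercased copies of both strings.
            text = text.ToLowerInvariant();
            search = search.ToLowerInvariant();
        }

        if (!wholeWord)
            return text.IndexOf(search, StringComparison.Ordinal) >= 0;

        // Whole-word: the search text must equal one space-separated word.
        foreach (string word in text.Split(' '))
        {
            if (word == search)
                return true;
        }
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(Matches("Open the Package", "pack", false, false));   // True
        Console.WriteLine(Matches("Open the Package", "pack", false, true));    // False
        Console.WriteLine(Matches("Open the Package", "package", true, false)); // False
    }
}
```

Because words are split on spaces only, punctuation attached to a word (for example "Package.") prevents a whole-word match in this sketch.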
The main form of the GUI creates a Docfileparse object for each file inside the background thread method ThreadProc.
public void ThreadProc()
{
    DirectoryInfo df = new DirectoryInfo(textBox1.Text);
    int i = 0;
    foreach (FileInfo f in df.GetFiles("*.docx"))
    {
        // Advance the progress bar on the UI thread, or reset it once it is full.
        if (progressBar1.Value < progressBar1.Maximum)
        {
            MethodInvoker mi = new MethodInvoker(this.UpdateProgress);
            this.BeginInvoke(mi);
        }
        else
        {
            MethodInvoker mi = new MethodInvoker(this.resetProgress);
            this.BeginInvoke(mi);
        }
        i++;

        // Search the current file with the options chosen in the form.
        Docfileparse docparser = new Docfileparse(
            f.FullName, textBox2.Text, checkBox1.Checked, checkBox2.Checked);
        string _message = "";
        if (docparser.searchText(ref _message) == true)
        {
            SetlistboxText(f.FullName);
        }
        docparser = null;
        SetlabelText(i);
    }

    // Tell the UI thread that the search has finished.
    MethodInvoker mdone = new MethodInvoker(this.Processdone);
    IAsyncResult ia = this.BeginInvoke(mdone);
}
This code looks a bit complicated because I am trying to provide a complete working application. If you only want to plug the search into another program, Docfileparse.cs is all you need.
Step by Step Code Walkthrough
The caller creates a Docfileparse object, and the constructor initializes the search parameters in the class. When the caller invokes the searchText method, it performs the following steps.
Open the Word (.docx) file package:
private bool openpack()
{
    try
    {
        filepack = Package.Open(filename, FileMode.Open,
            FileAccess.Read);
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}
Navigate to the document.xml part in the Package and load it into an XmlDocument.
private XmlDocument loadXmlDoc()
{
    try
    {
        XmlDocument xmdoc = new XmlDocument();
        Uri pathtodoc = new Uri("/word/document.xml", UriKind.Relative);
        PackagePart newPart = filepack.GetPart(pathtodoc);
        // Dispose the part stream as soon as the XML has been loaded.
        using (Stream partStream = newPart.GetStream(FileMode.Open, FileAccess.Read))
        {
            xmdoc.Load(partStream);
        }
        return xmdoc;
    }
    catch (Exception)
    {
        return null;
    }
}
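The open and load steps can also be combined so that the package is always disposed, even when loading fails. The sketch below is one way to do that and makes a few assumptions: the names PackageLoadDemo, LoadMainDocument and BuildSamplePackage are hypothetical, the package is read from a Stream instead of a file name, and the sample package is built in memory only so that the code runs without a .docx on disk (it is not a valid Word document). On .NET Framework, System.IO.Packaging comes from the WindowsBase assembly; on newer .NET versions it is available as the System.IO.Packaging NuGet package.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Text;
using System.Xml;

public static class PackageLoadDemo
{
    static readonly Uri MainPartUri =
        new Uri("/word/document.xml", UriKind.Relative);

    // Loads /word/document.xml from a package stream.
    // Returns null when the part is missing.
    public static XmlDocument LoadMainDocument(Stream packageStream)
    {
        using (Package pack = Package.Open(packageStream, FileMode.Open, FileAccess.Read))
        {
            if (!pack.PartExists(MainPartUri))
                return null;

            PackagePart part = pack.GetPart(MainPartUri);
            using (Stream partStream = part.GetStream(FileMode.Open, FileAccess.Read))
            {
                XmlDocument doc = new XmlDocument();
                doc.Load(partStream);
                return doc;
            }
        }
    }

    // Builds a tiny in-memory package that holds only /word/document.xml.
    public static MemoryStream BuildSamplePackage(string xml)
    {
        MemoryStream buffer = new MemoryStream();
        using (Package pack = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart part = pack.CreatePart(MainPartUri, "application/xml");
            using (Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(xml);
                partStream.Write(bytes, 0, bytes.Length);
            }
        }
        // Copy the finished bytes into a fresh stream positioned at the start.
        return new MemoryStream(buffer.ToArray());
    }

    public static void Main()
    {
        using (MemoryStream sample = BuildSamplePackage("<root><child/></root>"))
        {
            XmlDocument doc = LoadMainDocument(sample);
            Console.WriteLine(doc == null ? "part not found" : doc.DocumentElement.Name);
        }
    }
}
```

For a file on disk, the same method could be called with the stream returned by File.OpenRead for the .docx path.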
Now close the package by calling the Package.Close() method inside the ClosePackage method. We now have the XmlDocument where all the data is stored, so we no longer need to keep the package open. The next step is to search for the text in this XmlDocument: if the text is found, the searchText method returns true, otherwise it returns false. As described in the File Format section, all the data is stored in <w:t> tags, so we query the XmlDocument for all elements with this tag and then iterate through them, checking the InnerText of each one for a match.
private bool IstextinFile(XmlDocument xmldocument)
{
    XmlNodeList textNodes = xmldocument.GetElementsByTagName("w:t");
    string text = "";
    foreach (XmlNode xmnode in textNodes)
    {
        text = xmnode.InnerText;
        // searchtext was already lowercased in the constructor.
        if (casesenstive == false)
            text = text.ToLower();
        if (matchword == false)
        {
            // Plain substring search.
            if (text.IndexOf(searchtext) >= 0)
                return true;
        }
        else
        {
            // Whole-word search: compare against each space-separated word.
            char[] separator = { ' ' };
            string[] words = text.Split(separator);
            foreach (string wordc in words)
            {
                if (wordc.Equals(searchtext) == true)
                    return true;
            }
        }
    }
    return false;
}
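One thing to be aware of: Word often splits a sentence across several <w:t> elements (for example, when part of it is formatted differently), so a phrase that spans two of them is not found when each <w:t> is checked on its own. A possible refinement, sketched here under the assumption that matching within a paragraph is what you want, joins the <w:t> text of each <w:p> element before searching. The names ParagraphSearchDemo and ContainsText are hypothetical and are not part of the downloadable code.

```csharp
using System;
using System.Text;
using System.Xml;

public static class ParagraphSearchDemo
{
    public const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns true when 'search' occurs in the joined text of any paragraph.
    public static bool ContainsText(XmlDocument doc, string search, bool caseSensitive)
    {
        StringComparison comparison = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;

        foreach (XmlNode paragraph in doc.GetElementsByTagName("p", WordNs))
        {
            // Join every w:t inside this paragraph into one string.
            StringBuilder joined = new StringBuilder();
            XmlElement element = (XmlElement)paragraph;
            foreach (XmlNode textNode in element.GetElementsByTagName("t", WordNs))
                joined.Append(textNode.InnerText);

            if (joined.ToString().IndexOf(search, comparison) >= 0)
                return true;
        }
        return false;
    }

    public static void Main()
    {
        // Hand-written sample: "OpenXML" split across two w:t elements.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<w:document xmlns:w=\"" + WordNs + "\"><w:body><w:p>" +
            "<w:r><w:t>Open</w:t></w:r><w:r><w:t>XML</w:t></w:r>" +
            "</w:p></w:body></w:document>");

        Console.WriteLine(ContainsText(doc, "openxml", false)); // True
        Console.WriteLine(ContainsText(doc, "openxml", true));  // False
    }
}
```

The trade-off is a little extra string building per paragraph in exchange for finding text that the per-element check would miss.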
That's it. We are done here. In the UI I have added a few extras: after a search, you can right-click a file name to open it in Word, or double-click it to open the file. Right now the example works for Word only, and I am planning to extend it to Excel and PowerPoint too.
You can explore this further to provide an online tool in your organization for searching documents and presentations. In addition, I think this search is faster than Windows search: in my tests with more than 60 files, it returned results faster.
Please give your feedback if I am missing something or if you think I can enhance it further.