Fast search for Word 2007 documents

sharadbajaj

15 Nov 2006, 4 min read

How to search quickly using the Office 2007 Open XML file format

Download source files and demo project - 419 Kb

Introduction

Office 2007 introduced the Open XML file format. Word, Excel and PowerPoint files are now compressed (ZIP) packages that contain a set of predefined XML files. All the data and metadata for these files are defined in those XML files.

For more details, see the Open XML file format documentation, which describes the format for every Office 2007 application that supports it. Because all the data is stored as XML, I use that fact to search the text in the files.

In my example code, I iterate through all .docx files in a folder and open each one with the Package class from the System.IO.Packaging namespace.

The System.IO.Packaging namespace is part of .NET Framework 3.0. The search itself runs on a background thread, which posts updates for the progress bar and other controls back to the GUI.

File Format

Here we will look briefly at the .docx file format. Create a .docx file and type a few words into it. Save the file, then go to Windows Explorer and change its extension to .zip (Word, Excel and PowerPoint files are actually ZIP files). Double-click the renamed file to open it, and you will see the folders and files it contains.

Now navigate to the word folder and open the document.xml file. All the text that you typed into the Word file is there. The important point is that the text is enclosed in <w:t> elements. This is the key we will use to search each file.
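The manual inspection above can also be done in code. The following is a minimal sketch, not part of the download: it assumes a newer framework than the article targets (System.IO.Compression.ZipFile needs .NET Framework 4.5 or later), and the names DocxPeek and ReadTextRuns are made up for illustration. It opens the .docx as a plain ZIP archive and returns the text of every <w:t> element in word/document.xml.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

// Hypothetical helper, for illustration only.
static class DocxPeek
{
    // WordprocessingML main namespace; Word binds it to the "w" prefix.
    const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every <w:t> element in word/document.xml,
    // or an empty array if the archive has no such entry.
    public static string[] ReadTextRuns(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                return new string[0];

            XmlDocument xml = new XmlDocument();
            using (Stream s = entry.Open())
                xml.Load(s);

            // Look up <w:t> by local name and namespace rather than by prefix.
            XmlNodeList nodes = xml.GetElementsByTagName("t", WordNs);
            string[] runs = new string[nodes.Count];
            for (int i = 0; i < nodes.Count; i++)
                runs[i] = nodes[i].InnerText;
            return runs;
        }
    }
}
```

Printing the returned strings for a small test document shows the same text you see when you open document.xml by hand.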

Using the Code

Download the code and open the solution. Open Docfileparse.cs; in this file you will find the Docfileparse class, which has the following constructor.

/// <summary>
/// Constructor
/// </summary>
/// <param name="_filename">The name of the file to parse</param>
/// <param name="_searchtext">The text to search for</param>
/// <param name="iscasesenstive">Whether the search is case sensitive</param>
/// <param name="_matchword">Whether to match whole words only</param>
public Docfileparse(string _filename, string _searchtext, bool iscasesenstive,
                    bool _matchword)
{
    filename = _filename;
    searchtext = _searchtext;
    casesenstive = iscasesenstive;
    if (casesenstive == false)
        searchtext = searchtext.ToLower();
    matchword = _matchword;
}

The class supports two additional search options:

Case sensitive
Match whole word

We pass these two options to the constructor to tell the class what type of search to perform.

For a case-sensitive search, we look for the exact text. If the search is not case sensitive, we convert both the search text and the document text to lowercase before comparing.

For a whole-word search, we split the document text into words and compare the search text with each word in turn.
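As a point of comparison, here is a minimal sketch of the two options written as a standalone function. It is not the article's implementation: instead of lowercasing and splitting on spaces, it uses StringComparison.OrdinalIgnoreCase and checks the characters on either side of each hit, so a word followed by punctuation ("Hello," for example) still counts as a whole-word match. The names TextMatcher and Matches are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only.
static class TextMatcher
{
    public static bool Matches(string text, string search,
                               bool caseSensitive, bool wholeWord)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(search))
            return false;

        StringComparison cmp = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;

        // Plain substring search.
        if (!wholeWord)
            return text.IndexOf(search, cmp) >= 0;

        // Whole-word search: accept a hit only when the characters just
        // before and just after it are not letters or digits.
        int pos = text.IndexOf(search, cmp);
        while (pos >= 0)
        {
            int end = pos + search.Length;
            bool startOk = pos == 0 || !char.IsLetterOrDigit(text[pos - 1]);
            bool endOk = end == text.Length || !char.IsLetterOrDigit(text[end]);
            if (startOk && endOk)
                return true;
            pos = text.IndexOf(search, pos + 1, cmp);
        }
        return false;
    }
}
```

For example, Matches("Hello, world", "hello", false, true) finds a match, while Matches("worldwide", "world", false, true) does not.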

The main form creates a Docfileparse object for each file in the background thread method ThreadProc.

public void ThreadProc()
{
    DirectoryInfo df = new DirectoryInfo(textBox1.Text);
    int i = 0;
    foreach (FileInfo f in df.GetFiles("*.docx"))
    {
        // Advance the progress bar on the UI thread, or reset it once it is full.
        if (progressBar1.Value < progressBar1.Maximum)
        {
            MethodInvoker mi = new MethodInvoker(this.UpdateProgress);
            this.BeginInvoke(mi);
        }
        else
        {
            MethodInvoker mi = new MethodInvoker(this.resetProgress);
            this.BeginInvoke(mi);
        }
        i++;

        // Create a Docfileparse object with the file path and the search options.
        Docfileparse docparser = new Docfileparse(
            f.FullName, textBox2.Text, checkBox1.Checked, checkBox2.Checked);
        string _message = "";
        // searchText returns true when the file contains the search text.
        if (docparser.searchText(ref _message))
        {
            SetlistboxText(f.FullName);
        }
        SetlabelText(i);
    }
    // Tell the UI thread that the search has finished.
    MethodInvoker mdone = new MethodInvoker(this.Processdone);
    this.BeginInvoke(mdone);
}
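If you want to reuse the loop outside this form, one option is to separate the file iteration from the UI calls. The sketch below is an assumption about how that could look, not code from the download: FolderSearch, Run, fileMatches and reportProgress are hypothetical names, and fileMatches stands in for a call to Docfileparse.searchText.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, for illustration only.
static class FolderSearch
{
    // Runs fileMatches over every .docx file in folder and returns the paths
    // that matched. reportProgress (optional) receives the number of files
    // processed so far and the total number of files.
    public static List<string> Run(string folder,
                                   Func<string, bool> fileMatches,
                                   Action<int, int> reportProgress)
    {
        string[] files = Directory.GetFiles(folder, "*.docx");
        List<string> hits = new List<string>();
        for (int i = 0; i < files.Length; i++)
        {
            if (fileMatches(files[i]))
                hits.Add(files[i]);
            if (reportProgress != null)
                reportProgress(i + 1, files.Length);
        }
        return hits;
    }
}
```

In a form, you would read the folder path and the search options from the controls on the UI thread before starting the worker, and pass a reportProgress callback that uses BeginInvoke to update the progress bar.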

This code looks a bit complicated because I am trying to provide a complete working application. If you only want to plug this search into something else, Docfileparse.cs is all you need.

Step-by-Step Code Walkthrough

The caller creates a Docfileparse object, and the constructor initializes the search parameters. When the caller then invokes the searchText method, it performs the following steps.

Open the Word (.docx) file package:

private bool openpack()
{
    try
    {
        filepack = Package.Open(filename, FileMode.Open, 
                            FileAccess.Read);
        return true;
    }
    catch (Exception )
    {
        return false;
    }
}

Navigate to the document.xml part in the package and load it into an XmlDocument:

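private XmlDocument loadXmlDoc()
{
    try
    {
        XmlDocument xmdoc = new XmlDocument();
        // The main document part of a Word file is normally stored at this path.
        Uri pathtodoc = new Uri("/word/document.xml", UriKind.Relative);
        PackagePart newPart = filepack.GetPart(pathtodoc);
        // Dispose the part stream as soon as the XML has been loaded.
        using (Stream partStream = newPart.GetStream(FileMode.Open, FileAccess.Read))
        {
            xmdoc.Load(partStream);
        }
        return xmdoc;
    }
    catch (Exception)
    {
        // A missing part or malformed XML is reported as "no document".
        return null;
    }
}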
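The open and load steps above can also be combined so that the package is always closed, even when loading fails. The following is a minimal sketch under a few assumptions: DocxLoader and LoadMainDocument are hypothetical names, exceptions are left to the caller instead of being swallowed, and the main part is located through the package's officeDocument relationship rather than the fixed /word/document.xml path. On .NET Framework, System.IO.Packaging requires a reference to WindowsBase.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

// Hypothetical helper, for illustration only.
static class DocxLoader
{
    // Relationship type that points from the package root to the main document part.
    const string OfficeDocumentRel =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Returns the main document part as an XmlDocument, or null if the
    // package has no usable officeDocument relationship.
    public static XmlDocument LoadMainDocument(string path)
    {
        // The using block closes the package on every exit path.
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            foreach (PackageRelationship rel in
                     package.GetRelationshipsByType(OfficeDocumentRel))
            {
                Uri partUri = PackUriHelper.ResolvePartUri(
                    new Uri("/", UriKind.Relative), rel.TargetUri);
                if (!package.PartExists(partUri))
                    continue;

                XmlDocument doc = new XmlDocument();
                using (Stream s = package.GetPart(partUri)
                                         .GetStream(FileMode.Open, FileAccess.Read))
                {
                    doc.Load(s);
                }
                return doc;
            }
            return null;
        }
    }
}
```

Because the XmlDocument is fully loaded before the using block ends, it stays usable after the package is closed.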

Next, close the package by calling Package.Close() in the ClosePackage method. The XmlDocument now holds all the data, so the package no longer needs to stay open.

The last step is to search for the text in the XmlDocument; searchText returns true if the text is found and false otherwise. As described in the File Format section, all the text is stored in <w:t> elements, so we query the XmlDocument for every element with this tag and check the InnerText of each one for a match.

private bool IstextinFile(XmlDocument xmldocument)
{
    // "w:t" is matched against the qualified name, so this relies on the
    // "w" prefix that Word writes for text elements.
    XmlNodeList textNodes = xmldocument.GetElementsByTagName("w:t");
    foreach (XmlNode xmnode in textNodes)
    {
        string text = xmnode.InnerText;
        // searchtext was already lowercased in the constructor.
        if (!casesenstive)
            text = text.ToLower();
        if (!matchword)
        {
            // Substring search within this text element.
            if (text.IndexOf(searchtext) >= 0)
                return true;
        }
        else
        {
            // Whole-word search: compare each space-separated word.
            string[] words = text.Split(' ');
            foreach (string word in words)
            {
                if (word.Equals(searchtext))
                    return true;
            }
        }
    }
    return false;
}
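One limitation of checking each <w:t> element on its own is that Word can split a sentence, or even a single word, across several runs (for example when part of it is formatted differently), so a phrase that spans two <w:t> elements is not found. A possible refinement, sketched here with hypothetical names (ParagraphText, Extract), is to join the text of all <w:t> elements inside each <w:p> paragraph and search the joined string instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
static class ParagraphText
{
    // WordprocessingML main namespace; Word binds it to the "w" prefix.
    const string WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per <w:p> element, built by concatenating the
    // text of every <w:t> element inside that paragraph.
    public static List<string> Extract(XmlDocument doc)
    {
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", WordNs);

        List<string> paragraphs = new List<string>();
        foreach (XmlNode p in doc.SelectNodes("//w:p", ns))
        {
            StringBuilder sb = new StringBuilder();
            foreach (XmlNode t in p.SelectNodes(".//w:t", ns))
                sb.Append(t.InnerText);
            paragraphs.Add(sb.ToString());
        }
        return paragraphs;
    }
}
```

Each returned string can then be tested with the same case-sensitive and whole-word logic as before. This is a sketch only: paragraphs nested inside other paragraphs (text boxes, for instance) would have their text included more than once.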

That's it. In the UI, I have added a few extras: after a search, you can right-click a file name to open it in Word, or double-click it to open the file. For now this example only covers Word; I am planning to extend it to Excel and PowerPoint too.

You can explore this further to build an online tool for searching documents and presentations within an organization. In my tests with more than 60 files, this search was also faster than Windows search.

Please give your feedback if I am missing something or if you think I can enhance it further.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.