
OfficeQuery using XLINQ

17 Jan 2010 · CPOL · 9 min read
Using XLINQ (LINQ to XML) to search through zipped Word 2007 documents.

Introduction

I'll start by letting you know that the basic idea and code were originally provided by Microsoft on their 'Virtual Labs' site. It's a great service; not all of the labs work, but the ones that do are generally very well done. It's a life-saver when your last employer 'accidentally' formats your computer along with all of your projects and your only copy of Visual Studio.

You are basically connecting remotely to a virtual computer, where different labs let you play with different versions of Windows, SQL Server, and Visual Studio, which makes it incredibly useful. It's not the "normal" kind of lab that forces you through every step being taught; if you want to, you can play around and test out your own code, or try different methods than the ones the lab suggests.

The only drawbacks I found are you can't save or copy the code you used to your computer or email yourself a copy of the solution. I can go on and on, but you should give it a look.

This is a simple app that gives you a pretty cool way to search through MS Word 2007 .docx files. Below is a quick requirement list for the UI if you really want to dive right into the code and avoid following every minute detail of this article.

Streamlined start for the UI: here are the basic requirements:

  • BackgroundWorker
  • FolderBrowserDialog
  • LinkLabel (that will be used to open the folder dialog)
  • Button to invoke the search
  • TextBox to hold the query string
  • Label to display how many occurrences of the search item were found
  • RichTextBox to hold the results

Take a quick look at the paragraph under "Quick Background"; it explains how and what we are searching; also take note of the using statements we will need for this project.

Quick Background

With the release of Office 2007, we have a new file format for Word, Excel, and PowerPoint documents, called Office Open XML (often shortened to OpenXML).

To start, we need to find a Word 2007 document that has a .docx extension with a decent amount of text within it. Then, we can make a copy and re-name it with the .zip extension.

Here it starts to get pretty cool. Double-click the zip file and you should notice a few folders and files other than the document that you just saved. You will see a folder named word; within that folder, there is an XML file called document.xml: this is what we will be searching through later. You can double-click the document.xml file to view the XML in Internet Explorer.

Now, while you are looking at the document.xml file, you will notice that there are t elements, which hold the text of the document. Each one carries a w prefix, which is bound to the WordprocessingML (Word) namespace. They should look something like: <w:t>Your text here bla bla bla...</w:t>. These are the elements we will parse and search through.
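To make the shape of those elements concrete, here is a minimal sketch that parses a hand-written fragment and reads its text elements with LINQ to XML. The fragment is an assumption for illustration only (a real document.xml carries many more elements and attributes), and the names WordXmlSample, SampleXml, and ReadTextRuns are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WordXmlSample
{
    // The namespace that the "w" prefix is bound to in document.xml.
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written fragment shaped like document.xml (illustrative, not a real file).
    public const string SampleXml =
        "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:body>" +
        "<w:p><w:r><w:t>Hello world</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>" +
        "</w:body></w:document>";

    // Returns the text of every w:t element, in document order.
    public static List<string> ReadTextRuns(string xml)
    {
        XElement root = XElement.Parse(xml);
        return root.Descendants(W + "t").Select(t => t.Value).ToList();
    }
}
```

Calling WordXmlSample.ReadTextRuns(WordXmlSample.SampleXml) yields the two strings stored in the t elements. Note that the element name has to be combined with the namespace (W + "t"); a plain "t" would match nothing.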

Getting Started - Creating the UI

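Create a new C# Windows Forms project. System.IO.Packaging lives in the WindowsBase assembly, so add a reference to WindowsBase (and to System.Xml.Linq, if your project template does not already include it). Then make sure to include the following:

C#
using System.IO;            // FileInfo, DirectoryInfo, FileMode
using System.IO.Packaging;  // Package, PackagePart (WindowsBase.dll)
using System.Xml;
using System.Xml.Linq;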

We should put together the UI before diving into the code. First, we find and add a FolderBrowserDialog from the Dialogs section of the toolbox. Next, we will add a BackgroundWorker to the form (which is found in the Components section of the toolbox).

At the top of the form, add a LinkLabel control, set its Text property to "Click to Select Folder", and its Name property to linkFolderSelect; this will open the folder browser.

Below the link label, we can add a Button control; set its Text property to "Search" and its Name property to btnSearch. We also want to make sure that its Enabled property is set to false in the property window. We 'disable' the button because we don't want the user to click Search before they have selected a folder to search, which helps prevent unexpected exceptions.

Next, we can add a Label to the left of the form and set its Text to something like "Query String". Adjacent to the Label, we will add a TextBox and set its Name property to tbSearchParam. This will be where the user types the string that they want to search for within the documents in the selected folder.

Alright, almost done. Let's place a Label below the Label and TextBox we just put up. We can set its Text property to something like "Results" and its Name property to lblResults. Finally, we will add a RichTextBox control below the "Results" label and stretch it out to the bottom of the form; this should be a decent size since it will be displaying the results from our search. Let's set the Name property of the RichTextBox to tbResults.

Now, Let's Begin to Code

Add two private strings that will be accessible throughout the project. Create a string that will hold the search parameter from the textbox we named tbSearchParam. The next string will hold the name of the selected folder. Finally, add a private List<string> that will contain the results of our query. You can view this code below:

C#
namespace officeQuery
{
  public partial class Form1: Form
  {
    //here are a few variables we'll be using throughout the project
    private string _searchParam;
    private string _selectedFolder;
    private List<string> _results;

Within the Form constructor, we will initialize the FolderBrowserDialog to the folder we want it to open by default when it is first clicked. I simply assigned it to the "C:\" directory, but you can just as easily assign it "C:\Documents" ...etc.

C#
public Form1()
{
   InitializeComponent();
  
   //Here we are initializing where the folder dialog will open by default
   folderBrowserDialog1.SelectedPath = @"C:\";
}

Now, we can go back to the designer and double-click the link label to create an empty handler for its click event. Here is where you add the code to open the folder browser. This is pretty general code for opening a folder browser, and can be ported to other applications. We open and show the FolderBrowserDialog by calling ShowDialog(this) and assigning its result to a DialogResult variable. When the user selects a folder and clicks OK, we assign its path to our private string variable _selectedFolder. Now that we have a folder to query, we enable the Search button, so the user can query the documents in it.

C#
/*
method: linkFolderSelect_LinkClicked
accepts: object, LinkLabelLinkClickedEventArgs
returns: nothing
Desc: Opens a folder dialog box and assigns the selected folder
      to our private variable _selectedFolder.
      It will also enable the search button
*/
private void linkFolderSelect_LinkClicked(object sender,
             LinkLabelLinkClickedEventArgs e)
{
    //begin opening the dialog - by calling the ShowDialog method
    DialogResult res = folderBrowserDialog1.ShowDialog(this);

    //ShowDialog blocks until the user has made a selection
    //or cancelled the dialog
    if (res == DialogResult.OK)
    {
        //get the path of the folder and assign it to our variable
        _selectedFolder = folderBrowserDialog1.SelectedPath;

        //show the folder name on the link label
        //(DirectoryInfo, because the path is a folder and not a file)
        DirectoryInfo dInfo = new DirectoryInfo(_selectedFolder);
        linkFolderSelect.Text = dInfo.Name;

        //enable the search btn, since we have a folder to search
        btnSearch.Enabled = true;
    }
}

Go back to the designer and double-click the search button that we named btnSearch; this will bring us to the button's Click event handler.

This handler does a bit of work. First, we create the BackgroundWorker and subscribe our QueryComplete method to its RunWorkerCompleted event, using a RunWorkerCompletedEventHandler delegate. The QueryComplete method will handle the formatting and display of our results when we are done with the query. Next, we subscribe our Query method to the worker's DoWork event, using a DoWorkEventHandler delegate. Query handles the bulk of our processing, as we will see later.

Here is the code so far, for our buttonClick event.

C#
/*
method: btnSearch_Click
accepts: object sender, EventArgs e
returns: nothing
Desc: Sets up the BackgroundWorker object and begins
      to query the documents for the query string.
*/
private void btnSearch_Click(object sender, EventArgs e)
{
    //create a fresh BackgroundWorker for this search, so that
    //handlers from earlier searches are not attached twice
    backgroundWorker1 = new BackgroundWorker();

    //subscribe QueryComplete to the RunWorkerCompleted event
    //to handle the formatting when the query is complete
    backgroundWorker1.RunWorkerCompleted +=
        new RunWorkerCompletedEventHandler(QueryComplete);

    //subscribe Query to the DoWork event to handle the LINQ query
    backgroundWorker1.DoWork += new DoWorkEventHandler(Query);

Let's finish up our button click handler. Once we have taken care of the BackgroundWorker, we should make sure to clear the RichTextBox that will hold our results (so that no old search results are left behind). We then set the Enabled property of the Search button and the LinkLabel to false, retrieve the search parameter, and assign it to our variable _searchParam. Lastly, we call RunWorkerAsync() on the BackgroundWorker to begin the work.

C#
    //Now we should make sure the RichTextBox result window is clear
    tbResults.Clear();

    //here we display 'searching' on the results label
    lblResults.Text = "Searching...";

    //disable the search button and the link label while the query is running
    btnSearch.Enabled = false;
    linkFolderSelect.Enabled = false;

    //here we copy the text from the search box into _searchParam
    _searchParam = tbSearchParam.Text.Trim();

    //finally begin the backgroundWorker
    backgroundWorker1.RunWorkerAsync();
}
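As a side note on the wiring above, a BackgroundWorker can also carry its input and output itself, through e.Argument and e.Result, instead of through shared fields. The following is only a sketch under stated assumptions: WorkerSketch and RunAndWait are hypothetical names, the upper-casing stands in for real work, and the blocking wait is meant for a console program, not for a form (on the UI thread you would update the controls in the completed handler rather than wait for it).

```csharp
using System;
using System.ComponentModel;
using System.Threading;

public static class WorkerSketch
{
    // Runs some work on a background thread and waits (up to five seconds)
    // for the completion handler. Intended for console use only.
    public static string RunAndWait(string input)
    {
        string output = null;
        var done = new ManualResetEvent(false);
        var worker = new BackgroundWorker();

        // DoWork runs on a background thread; the value passed to
        // RunWorkerAsync arrives as e.Argument.
        worker.DoWork += (sender, e) =>
        {
            e.Result = ((string)e.Argument).ToUpperInvariant();
        };

        // RunWorkerCompleted fires when DoWork returns or throws;
        // e.Error holds any exception raised inside DoWork.
        worker.RunWorkerCompleted += (sender, e) =>
        {
            if (e.Error == null) output = (string)e.Result;
            done.Set();
        };

        worker.RunWorkerAsync(input);
        done.WaitOne(TimeSpan.FromSeconds(5));
        return output;
    }
}
```

Checking e.Error in the completed handler matters because an exception thrown inside DoWork is not rethrown on the calling thread; it is only reported through that property.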

I'll quickly describe the QueryComplete method, because it's fairly simple and it does exactly what the name suggests: it deals with the completed query and displays any of the found items. This is the method we subscribed to the BackgroundWorker's RunWorkerCompleted event.

First, we show on our lblResults label how many occurrences of the search item were found. Then, we loop through our _results variable using a foreach construct. As we loop through, we find and highlight every piece of text that matches the user's search parameter.

Below is the complete QueryComplete method:

C#
/*
Method: QueryComplete
Accepts: object sender, RunWorkerCompletedEventArgs e
Returns: nothing
Desc: This method deals with the completed query and displays
      any of the found items in the RichTextBox as well
      as highlighting the query string.
*/
void QueryComplete(object sender, RunWorkerCompletedEventArgs e)
{
    //display how many items we found within the query
    lblResults.Text = string.Format(
      "Results [{0} result(s) found]", _results.Count);

    //loop thru the results and split them
    foreach (string s in _results)
    {
        //each entry is "matched text|file name"; split at the LAST '|'
        //so that a '|' inside the document text cannot break the split
        int sep = s.LastIndexOf('|');
        string t = s.Substring(0, sep);
        string fileName = s.Substring(sep + 1);
        int i = t.IndexOf(_searchParam);

        //highlight every occurrence of the search string (a match at
        //index 0 counts too); the length check avoids an endless loop
        //when the search box was left empty
        while (_searchParam.Length > 0 && i >= 0)
        {
            tbResults.AppendText(t.Substring(0, i));
            tbResults.SelectionColor = Color.Red;
            tbResults.AppendText(_searchParam);
            tbResults.SelectionColor = Color.Black;
            t = t.Substring(i + _searchParam.Length);
            i = t.IndexOf(_searchParam);
        } //end while

        //append the remaining text to the RichTextBox
        tbResults.AppendText(t);
        //append the file name in a different color
        tbResults.SelectionColor = Color.DarkGreen;
        tbResults.AppendText(string.Format(" [{0}] ", fileName));
        tbResults.AppendText(Environment.NewLine);
        tbResults.SelectionColor = Color.Black;
    } //end foreach

    //re-enable the Button and LinkLabel
    btnSearch.Enabled = true;
    linkFolderSelect.Enabled = true;
}
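The highlighting loop in QueryComplete is easier to reason about (and to test) when the string splitting is separated from the RichTextBox calls. Here is one possible sketch of that idea; HighlightSketch and SplitOnMatches are hypothetical names that are not part of the original project, and the ordinal (case-sensitive) comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class HighlightSketch
{
    // Splits text into (segment, isMatch) pairs so that a caller can
    // append each segment in the appropriate color.
    public static List<KeyValuePair<string, bool>> SplitOnMatches(string text, string term)
    {
        var parts = new List<KeyValuePair<string, bool>>();

        // An empty search term matches nothing; return the text as one plain segment.
        if (string.IsNullOrEmpty(term))
        {
            parts.Add(new KeyValuePair<string, bool>(text, false));
            return parts;
        }

        int start = 0;
        int hit = text.IndexOf(term, start, StringComparison.Ordinal);
        while (hit >= 0)
        {
            // Plain text between the previous match and this one.
            if (hit > start)
                parts.Add(new KeyValuePair<string, bool>(text.Substring(start, hit - start), false));

            // The match itself.
            parts.Add(new KeyValuePair<string, bool>(term, true));

            start = hit + term.Length;
            hit = text.IndexOf(term, start, StringComparison.Ordinal);
        }

        // Whatever is left after the last match.
        if (start < text.Length)
            parts.Add(new KeyValuePair<string, bool>(text.Substring(start), false));

        return parts;
    }
}
```

In the form, each pair would then be appended with SelectionColor set to Color.Red when the Value is true and Color.Black otherwise.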

The Query method is the one we subscribed to the BackgroundWorker's DoWork event.

This method creates a new List<string> object to hold our results. Then, we create a DirectoryInfo object for the folder that we want to search and loop through its .docx files. Finally, we send each .docx file into the WordDocumentQuery method, which is where some of the LINQ magic happens.

C#
/*
Method: Query
Accepts: object sender, DoWorkEventArgs e
Returns: nothing
Desc: This method initializes _results, gets the selected folder
  and calls the WordDocumentQuery method, which will perform the LINQ query
*/
void Query(object sender, DoWorkEventArgs e)
{
    //create the _results object
    _results = new List<string>();

    //create a DirectoryInfo object of the folder to be searched
    DirectoryInfo dir = new DirectoryInfo(_selectedFolder);

    //now we will search the folder for '.docx' extensions
    foreach (FileInfo f in dir.GetFiles("*.docx"))
    {
        WordDocumentQuery(f);
    }
}
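One practical wrinkle with the "*.docx" pattern: while a document is open, Word keeps a small owner file next to it whose name starts with "~$" and also ends in .docx. Such a file is not a valid package, so opening it would be expected to fail. A possible filter is sketched below; DocxFilterSketch, IsSearchableDocx, and FilterSearchable are hypothetical names, not part of the original project.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DocxFilterSketch
{
    // True for names that end in .docx and are not Word's "~$" owner files.
    public static bool IsSearchableDocx(string fileName)
    {
        string name = Path.GetFileName(fileName);
        return name.EndsWith(".docx", StringComparison.OrdinalIgnoreCase)
            && !name.StartsWith("~$", StringComparison.Ordinal);
    }

    // Keeps only the names that pass IsSearchableDocx, preserving order.
    public static List<string> FilterSearchable(IEnumerable<string> fileNames)
    {
        var keep = new List<string>();
        foreach (string f in fileNames)
        {
            if (IsSearchableDocx(f)) keep.Add(f);
        }
        return keep;
    }
}
```

Inside the foreach loop of Query, this would amount to skipping any FileInfo f for which IsSearchableDocx(f.Name) returns false.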

We finally get to the last method of this project; the WordDocumentQuery. This method will accept a FileInfo object. First, we add an XNamespace object. Next, we add an instance of the Package class. This class allows us to access the entire contents of the file. We create a Uri object that contains the XML file we want to search, which is the document.xml file located in the zip file. The PackagePart object represents the contents of the URI - as the name "docPart" suggests - it's just part of the overall package. Next, we create an XmlReader object based on the PackagePart that we are interested in.

So, now we have created a way to read through the XML contents of the document.xml part of the Word document zip file.

C#
/*
Method: WordDocumentQuery
Accepts: FileInfo wordDocPath
         this is the document we want to search
Returns: nothing
Desc: This method creates a Package of what we are going to search
      through then a PackagePart, defining the part of the document
      we want to search. Then we will perform a LINQ query and loop through
      the results while assigning the query results to the _results variable.
*/
void WordDocumentQuery(FileInfo wordDocPath)
{
    //the namespace that the w: prefix in document.xml is bound to
    XNamespace wordNamespace =
      "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    //open the Package read-only, since we only search the document
    Package package = Package.Open(wordDocPath.FullName,
                                   FileMode.Open, FileAccess.Read);
    Uri uri = new Uri("/word/document.xml", UriKind.Relative);

    //create a PackagePart of the document.xml
    PackagePart docPart = package.GetPart(uri);
    XmlReader reader = XmlReader.Create(
        docPart.GetStream(FileMode.Open, FileAccess.Read));

Here, we will be using LINQ to XML. First, we need to create an XElement object by calling the static XElement.Load method with the XmlReader that we created earlier. This is where it changes from the traditional System.Xml to the LINQ API.

Alright, we have reached the LINQ query, which is pretty simple, although the syntax is a little different from a traditional SQL query. In this sample, we are looking for any of those t elements (the ones containing text). We also want to filter them to select only the ones that contain our search parameter.

After creating the query, we loop over its results in a foreach statement. Within the foreach, we append each match, together with the name of the file it came from, to our variable _results. And finally, let's not forget to close the Package object.

C#
    //create XElement and load it with our reader object
    XElement wordDoc = XElement.Load(reader);

    //create the XML query (C# query keywords are lowercase)
    var query =
        from c in wordDoc.Descendants(wordNamespace + "t")
        where c.Value.Contains(_searchParam)
        select c.Value;

    //loop through the LINQ to XML result set
    foreach (string s in query.ToArray())
    {
        string res = string.Format("{0}|{1}",
                            s, wordDocPath.Name);
        _results.Add(res);
    }

    //close the reader and the package when you are done
    reader.Close();
    package.Close();
}
  } //end of class
} //end of namespace
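One limitation of querying individual t elements is worth knowing about: Word often splits the text of a paragraph across several runs (for example, where the formatting changes), so a phrase can straddle two t elements and be missed. A possible variation, sketched here under that assumption, joins the text of each paragraph (the p element) before testing it. ParagraphSearchSketch and FindParagraphs are hypothetical names and not part of the original lab code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ParagraphSearchSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the full text of every paragraph that contains the search term.
    public static List<string> FindParagraphs(XElement root, string term)
    {
        var query =
            from p in root.Descendants(W + "p")
            // join the text of all runs in this paragraph into one string
            let text = string.Join("",
                p.Descendants(W + "t").Select(t => t.Value).ToArray())
            where text.Contains(term)
            select text;

        return query.ToList();
    }
}
```

In WordDocumentQuery, the XElement loaded from the reader could be passed as root and _searchParam as term. The trade-off is that each result is then a whole paragraph instead of a single run of text.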

Using the Code

I am hoping this is a straightforward article that anyone can pick up and write. I have added a .cs file, but was unable to debug it - so if anyone finds errors, please let me know and I'll correct them.

Points of Interest

Again, the basic idea for this program came from Microsoft's Virtual Labs, and I would strongly suggest that you do a Google search and give them a try.

History

This is the first version, but if anyone comes across errors, I'll gladly keep this article up to date and correct.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)