Searcharoo 2007 (Medium Trust and Office 2007 indexing)

craigd

4.95/5 (11 votes)

28 Apr 2007CPOL10 min read

852

Remove Binary Serialization to solve Medium Trust problem; index OpenXML document formats in ASP.NET/C# free search engine

Download source code: Searcharoo_5.zip - 286.6 KB

Background

This article follows on from the previous four Searcharoo samples:

Searcharoo 1 describes building a simple search engine that crawls the file system. A basic design and object model was developed to support simple, single-word searches, whose results were displayed ina rudimentary query/results page.

Searcharoo 2 focused on adding a 'spider' to find data to index by following web links (downloading files via HTTP and parsing the HTML). Also discusses how multiple search words results are combined into a single set of 'matches'.

Searcharoo 3 implemented a 'save to disk' function for the catalog, so it could be reloaded across IIS application restarts without having to be generated each time. It also spidered FRAMESETs and added Stop words, Go words and Stemming to the indexer.

Searcharoo 4 added IFilter support for non-text filetypes (eg Word, PDF and Powerpoint), better robots.txt support, a remote-indexing console application and a lot of code tidy-up (refactoring!).

Introduction to version 5

This article is shorter than most, covering just two topics:

Allowing Searcharoo to run on websites where the ASP.NET application is restricted to Medium Trust. The remote-indexing console app in v4 was intended to addrsess this issue - but just building the catalog remotely isn't enough because you cannot binary-deserialize the file under Medium Trust. Rather than advise people to try and get the trust level on their server changed or customised (difficult!), the file format has been changed (to XML) to allow it to work in Medium Trust.
Extend the Document object hierarchy introduced in v4 to index Office 2007 (OpenXML) file types. I received a *.docx file from a collegue recently, and since I don't intended to upgrade to Office 2007 any time soon, it seemed like a good idea to investigate how the file could be indexed/searched without having the application/IFilter installed.

ASP.NET has 'Trust Issues'

When Searcharoo v4 is run under Medium Trust, you get one of these errors:

WebPermission denied if Search.aspx cannot find a catalog file and triggers SearchSpider.aspx (accessing websites or webservices is not allowed under Medium Trust by default).

[SecurityException: Request for the permission of type 'System.Net.WebPermission, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' failed.]
   System.Security.CodeAccessSecurityEngine.Check(Object demand, StackCrawlMark& stackMark, Boolean isPermSet) +0
   System.Security.CodeAccessPermission.Demand() +59
   System.Net.HttpWebRequest..ctor(Uri uri, ServicePoint servicePoint) +166
   System.Net.HttpRequestCreator.Create(Uri Uri) +26
   System.Net.WebRequest.Create(Uri requestUri, Boolean useUriBase) +373
   System.Net.WebRequest.Create(String requestUriString) +81
   Searcharoo.Indexer.RobotsTxt..ctor(Uri startPageUri, String userAgent) +250
   Searcharoo.Indexer.Spider.BuildCatalog(Uri startPageUri) +116

SecurityPermission denied if Search.aspx finds a binary-serialized catalog file and tries to deserialize it (Binary Serialization is not allowed under Medium Trust).

[SecurityException: Request for the permission of type 'System.Security.Permissions.SecurityPermission, mscorlib, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' failed.]
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.CheckSecurity(ParseRecord pr) +1644388
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.ParseObject(ParseRecord pr) +363
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.Parse(ParseRecord pr) +64
   System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryObjectWithMapTyped record) +1050
   System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryHeaderEnum binaryHeaderEnum) +62
   System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run() +144
   System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) +183
   System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) +190
   System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream) +12
   Searcharoo.Common.Catalog.Load() +461

The combination of errors -- cannot create a new catalog, and cannot load an existing catalog file (even if it was generated elsewhere) -- means that Searcharoo v4 doesn't work under Medium Trust. There are two options to fixing this problem:

Update your server with a custom Code Access Security policy to allow the Searcharoo code to perform these functions. This could be very difficult if your site is on shared hosting and you need to convince the ISP to make changes 'just for you'.
Make changes to Searcharoo so that at least one of those errors does not occur.

We'll do #2, since it's easier! There was a long discussion in v4 about why Binary Serialization was a good idea and Xml Serialization was bad: in this article we'll turn that around by fixing the problems with the Xml output so that we can build it remotely using the Indexer Console Application then uploaded to a Medium Trust website. Xml-serialized data can be de-serialized even under Medium Trust, so it can be loaded and searched.

About Option #2: Xml redux

Original (v4) Xml Catalog format
Way back in v4, the Xml-serialized Catalog object was dismissed as bloated, inefficient and (as implemented) unable to be de-serialized. It looked like this:

Recall that each Word object contains a collection (Hashtable) of File objects, indicating which File/s that Word appeared in. That works in memory because the File objects in Word._FileCollection are references - there's only one File 'object' per indexed file.

The problem with the resulting Xml is that the File object references are 'flattened out' (repeated every time they are referenced). You can see above that the document http://localhost:3359/content/Kilimanjaro.pdf is represented twice in the small excerpt. This repetition occurs for EVERY WORD in each File, creating a huge amount of redundant data in the Catalog file.

What's needed is a more succinct way to represent the relationship between Word and File: a 'foreign key' in database terms.

New (v5) Xml Format
This 'foreign key' will be represented by a new object CatalogWordFile, which will act as a 'proxy' for Word objects (which we will no longer serialize). The Word object will continue to be the basis of the Catalog, but when we load and save it via Xml Serialization, we will use attributes to ignore Word and treat the File and CatalogWordFile like two 'database tables' joined by a 'foreign key': the FileId.

Catalog object's relationship with Word and File classes

Now the File objects are serialized once and their FileId is their implicit order in the serialized Xml (starting from zero, of course). The content we mentioned above - http://localhost:3359/content/Kilimanjaro.pdf - appears in the new Xml as FileId=2 (below) just once.

In the same Xml file the individual CatalogWordFile objects reference just the FileId, resulting in a significantly smaller Xml than when Word objects were used.

Repeating the Original (v4) Xml Catalog example, you can see the two words boxed above shown here again, with just the FileId rather than a whole serialized File object.

Note that the markup shown still has some complete element names; in the actual code the element names are overridden to further reduce the Xml file size using attributes: [XmlElement("w")] and [XmlElement("f")] (see right).

The test data used during development created a 178 Kb file when Binary Serialization was used. This equated to a 1.1 Mb Xml file in the old format.

Using the new, improved Xml format, the file shrunk to 194 Kb; and after applying XmlElement attributes to shorten the element names shrunk even further to 97 Kb - actually smaller than the Binary version.

Behind the Xml-serialization Scenes

So that's the Xml format we need - how do we get it? Unfortunately, just replacing the Word[] with CatalogWordFile[] isn't all we needed to do to make this work. The FileId needs to be 'in-sync' between the CatalogWordFile and File arrays, but we don't really know what order the XmlSerializer will access the properties (nor whether they'll be accessed multiple times). To avoid having to populate the internal CatalogWordFile collection unnecessarily, we use pre/post methods in the Property accessors to create it on-demand.

The two property declarations look like this (below): the PrepareForSerialization() does the work of 'flattening' the _Index Hashtable of Word objects into CatalogWordFile proxies with FileIds, it's called in both get accessors to ensure they return the 'synchronized' collections.

The PostDeserialization() method waits until both File and WordFile set accessors have been called (because we need both collections to re-build the original _Index via our 'foreign key'), then loops through the data and calling the Add() method just like the Spider does when it builds the Catalog while indexing.

If you check the Catalog.Load() code, you'll also notice the XmlSerialization uses the Kelvin generic serialization helper (another CodeProject article).

Catalog c1 = Kelvin<Catalog>.FromXmlFile(xmlFileName);

One final note: rather than remove the Binary Serialization feature, both methods are still available, controlled by a new web.config/app.config setting (for your Website and Indexer Console application).

XML

<appSettings>
  <add key="Searcharoo_InMediumTrust" value="True" />
</appSettings>

If set to True, the Catalog will be saved as an Xml file, if set to False it will be written as *.dat. Don't forget to update the other .config file settings to match your environment - including the Searcharoo_VirtualRoot, Searcharoo_CatalogFilepath and Searcharoo_TempFilepath which will be used in the DownloadDocument class discussed in the remainder of this article...

Office 2007 File Formats

The rest of the article discusses indexing the new Office 2007 file formats.

Microsoft Word Docx file 'structure'
This blog on getting started with OpenXML discusses how to use the Open XML File Formats. It explains the basic structure of OpenXML documents: they are actually a series of related Xml (and other) files, 'hidden' inside a single ZIP file with an Office 2007 file extension like docs, xlsx, pptx, etc).

A Microsoft Word 2007 file looks like this 'inside' the ZIP:

You can read all about the details of the format in the references, but the key file we're interested in is the document.xml part. To search it, we'll need to do the following steps:

Download the OpenXML file/ZIP archive from the web link
Extract the file we need from the ZIP archive
Learn a bit about the Xml format so we can extract the plaintext we want to index, and ignore all the formatting and other data.

Step 1: Subclassing Document to share download code

The v4 article describes how the FilterDocument needed to download files for IFilter processing (whereas previously downloads were loaded into/parsed from a MemoryStream). The new Office 2007 classes need the same behaviour, so the SaveDownloadedFile method is pushed up to a superclass they can all implement.

Step 2: unZIP

The System.IO.Packaging API in .NET 3.0 provides built-in capabilities for accessing ZIP archives (some might say specifically to facilitate Office 2007/OpenXML interoperability). However, to keep Searcharoo accessible we're not going to upgrade to 3.0 just yet; luckily the System.IO.Compression namespace in .NET 2.0 contains the building blocks needed to build a ZipFile implementation that reads/writes ZIP files (and therefore also OpenXML documents).

Using the ZipFile to access a data stream to process is easy:

using (ZipFile zip = ZipFile.Read(filename))
{
    using (MemoryStream streamroot = new MemoryStream())
    {   
        MemoryStream stream = new MemoryStream();
        zip.Extract(@"word/document.xml", streamroot);
        stream.Seek(0, SeekOrigin.Begin);   // important!
        // TODO: code here to process Xml from the stream
    }
}

Step 3: Extract text

Turns out the Word 2007 OpenXML format is very Html-like in it's treatment of formatting and content: all document structure and formatting present in document.xml is contained in Xml attributes and the relevent plaintext in the InnerXml of each element. For our purposes, we'll assume that's all the text we wish to index (more research is required to determine whether headers/footers/tables/references are included, and more work would be required to detect and index other embedded Office documents).

DocxDocument in 3 easy steps

The new Docx file indexer inherits most of it's functionality from the abstract Document and DownloadDocument classes. All we really need to do is override the GetResponse() method to extract the file contents and set the WordsOnly property which is used to generate the Catalog.

This same pattern can be easily applied to PowerPoint 2007 (.pptx files) and Excel 2007 (.xlsx files) - see the XlsxDocument and PptxDocument code for the additional work that was required to loop through sheets/slides to get all the text in those file types.

Lastly, our new classes will never be instantiated unless we update DocumentFactory to be aware of the new MIME types we can 'handle', and which MIME type/file extension maps to which class.

More on Open XML Office Formats

Erika's blog is an excellent source of Office 2003 and 2007: MSDN Technical Articles, How-To Content, and Code Samples. Other links include:

Wrap-up

These additions to Searcharoo are quite minor, and have been posted mainly to help anyone wishing to use the code under Medium Trust. Many users may have Office 2007 installed (or the relevent IFilter on their server) and may not even need the additional Document subclasses - if this is the case, simply remove the new case statements from DocumentFactory and let the existing FilterDocument direct the Indexer.

History

2004-06-30: Version 1 on CodeProject
2004-07-03: Version 2 on CodeProject
2006-05-24: Version 3 on CodeProject
2007-03-18: Version 4 on CodeProject
2007-04-29: Version 5 (this page) on CodeProject

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)