Note: this all applies to Xapian 1.0.18. Details such as line numbers in files may differ in later versions.
Introduction
If you're looking to build a search function into your website or application, there are a ton of choices out there. Xapian is one of those choices, and on the surface, it seems like a pretty good option as its feature list is appealing and complete. It also includes an indexer (omega) that can index a long list of document formats and add them to the Xapian database, which is extremely appealing once you start actually building an index and search component.
However, all of the documentation and support seems to be built around various *nix platforms (either compiling from source or installing precompiled packages, depending on the distribution). There are bindings for C#, including precompiled ones that you can download. This article discusses how to get started with these bindings, how to compile the rest of the Xapian package (omega), and the pitfalls of this library in the Windows environment.
Background
Search technologies have three components (typically).
The first component is typically referred to by the misnomer of document indexing. This is the process of actually extracting the text from documents (PDFs, web sites, Office documents, etc.). There are various technologies for doing this in the Windows space. The typical method is the use of IFilters (COM objects). There are also command line tools for most formats (more on this later).
The second component is the actual indexing of the text from those documents. There is a lot of theory involved in finding the best way to separate the words in a document, how to store them, and language tricks such as stemming a word. This is where a tool such as Xapian is very important, as the effort required to build an indexer on your own is very high.
The third component is the search component -- how to actually retrieve the stored information and documents from the created index. This component typically is tightly integrated with the indexing component as it will be searching against the created index. Again, a tool such as Xapian is far preferable to something home grown in most cases as the theory of querying is fairly complex.
Getting Started
The initial step is to download the bindings for C#. There are two important files: XapianCSharp.dll (the actual C# binding to Xapian's C++ DLL) and _XapianSharp.dll (the C++ Xapian core functionality).
You will also need to download zlib; the file you need from that download is zlib1.dll.
Create a new command line project in Visual Studio. Add a reference to XapianCSharp.dll. Add _XapianSharp.dll and zlib1.dll to the project and make sure that they are set to be copied to the output directory during compilation.
Add a new class that will broker calls to Xapian (SearchManager.cs). Add an OpenWriteDatabase method to open a Xapian database in write mode. Add an AddDocument method that will add a document to the index, storing some information about that document that can be used later when we search.
using Xapian;

public class SearchManager
{
    private const string DB_PATH = @"c:\temp\xap.db";

    // Opens the database for writing, creating it if it does not exist yet.
    private static WritableDatabase OpenWriteDatabase()
    {
        return new WritableDatabase(DB_PATH,
            Xapian.Xapian.DB_CREATE_OR_OPEN);
    }

    // Indexes the body text and stores "type_id" as the document data
    // so that a search hit can be mapped back to the original item.
    public static int AddDocument( int id, string type, string body )
    {
        using( var db = OpenWriteDatabase() )
        using( var indexer = new TermGenerator() )
        using( var stemmer = new Stem("english") )
        using( var doc = new Document())
        {
            doc.SetData(string.Format( "{0}_{1}", type, id));
            indexer.SetStemmer(stemmer);
            indexer.SetDocument(doc);
            indexer.IndexText(body);
            return (int)db.AddDocument(doc);
        }
    }
}
Add another class to your project, SearchResult.cs, to handle the results of the queries.
using System;

public class SearchResult
{
    public int Id { get; set; }
    public string Type { get; set; }
    public int ResultRank { get; set; }
    public int ResultPercentage { get; set; }

    // Parses the "type_id" string that AddDocument stored as the document data.
    public SearchResult( string combinedId )
    {
        var parts = combinedId.Split('_');
        if ( parts.Length == 2 )
        {
            Type = parts[0];
            int i;
            if ( !int.TryParse( parts[1], out i ))
                throw new ApplicationException(string.Format(
                    "CombinedId ID part incorrectly formatted: {0}",
                    combinedId));
            Id = i;
            return;
        }
        throw new ApplicationException( string.Format(
            "CombinedId incorrectly formatted: {0}", combinedId ));
    }
}
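One thing to watch in the constructor above: it splits on every underscore, so a type name that itself contains an underscore (say, "client_note") would be rejected. A minimal sketch of one way around that, assuming you are free to change the parsing, is to split on the last underscore only. The CombinedId class and its method names are hypothetical; they are not part of Xapian or of the code above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of Xapian): builds and parses the
// "type_id" strings stored in each document's data.
public static class CombinedId
{
    public static string Format(string type, int id)
    {
        return string.Format(CultureInfo.InvariantCulture, "{0}_{1}", type, id);
    }

    // Splits on the LAST underscore so the type part may contain underscores.
    // Returns false instead of throwing when the input is malformed.
    public static bool TryParse(string combinedId, out string type, out int id)
    {
        type = null;
        id = 0;
        if (string.IsNullOrEmpty(combinedId))
            return false;

        int split = combinedId.LastIndexOf('_');
        if (split <= 0 || split == combinedId.Length - 1)
            return false; // no underscore, empty type, or empty id

        if (!int.TryParse(combinedId.Substring(split + 1),
                          NumberStyles.AllowLeadingSign,
                          CultureInfo.InvariantCulture, out id))
            return false;

        type = combinedId.Substring(0, split);
        return true;
    }
}
```

Whether to throw (as SearchResult does) or return false is a design choice; returning false makes it easier to skip a single bad document without abandoning the whole result set.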
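Now add a Search method to SearchManager to search the index. It takes the query string, the index of the first result to return, and the number of results to return.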
// These members go inside SearchManager; the file also needs
// "using System.Collections.Generic;" at the top.
private static Database OpenQueryDatabase()
{
    return new Database(DB_PATH);
}

public static IEnumerable<SearchResult> Search( string queryString,
    int beginIndex, int count )
{
    var results = new List<SearchResult>();
    using( var db = OpenQueryDatabase() )
    using( var enquire = new Enquire( db ) )
    using( var qp = new QueryParser() )
    using( var stemmer = new Stem("english") )
    {
        // Use the same stemmer as at index time so query terms match.
        qp.SetStemmer(stemmer);
        qp.SetDatabase(db);
        qp.SetStemmingStrategy(QueryParser.stem_strategy.STEM_SOME);
        var query = qp.ParseQuery(queryString);
        enquire.SetQuery(query);
        using (var matches = enquire.GetMSet((uint)beginIndex, (uint)count))
        {
            // Walk the match set and rebuild a SearchResult from each
            // document's stored "type_id" data.
            var m = matches.Begin();
            while (m != matches.End())
            {
                results.Add(
                    new SearchResult(m.GetDocument().GetData())
                    {
                        ResultPercentage = m.GetPercent(),
                        ResultRank = (int)m.GetRank()
                    }
                );
                m++;
            }
        }
    }
    return results;
}
Edit the main function to add some data and then query it.
var docId = SearchManager.AddDocument(1, "upload", "this is my upload");
Console.WriteLine( "added: " + docId );
docId = SearchManager.AddDocument(2, "upload",
    "This will eventually be the contents of a PDF");
Console.WriteLine("added: " + docId);
docId = SearchManager.AddDocument(1, "client", "McAdams Enterprises");
Console.WriteLine("added: " + docId);
docId = SearchManager.AddDocument(1, "Message",
    "I think MSFT is wincakes!");
Console.WriteLine("added: " + docId);

var results = SearchManager.Search("upload", 0, 10);
foreach( var result in results )
    Console.WriteLine( result.Id + " " + result.Type);

results = SearchManager.Search("MSFT", 0, 10);
foreach (var result in results)
    Console.WriteLine(result.Id + " " + result.Type);

results = SearchManager.Search("PDF", 0, 10);
foreach (var result in results)
    Console.WriteLine(result.Id + " " + result.Type);
Compile the program, jump out to a shell, and try to run it. If you're lucky, it just works. If you're unlucky (like I was), then it just doesn't.
What went wrong?
If you're like most people these days, you're likely running a 64 bit version of Windows. Even if your development computer isn't, your server most likely is running a 64 bit version of the OS. Once you try to call into Xapian, you'll get this bit of niceness:
Unhandled Exception:
System.TypeInitializationException: The type initializer
for 'Xapian.XapianPINVOKE' threw an exception.
---> System.TypeInitializationException: The type initializer
for 'SWIGExceptionHelper' threw an exception.
---> System.BadImageFormatException: An attempt was made to load a program
with an incorrect format. (Exception from HRESULT: 0x8007000B)
The issue is that C#/ASP.NET code compiled for "Any CPU" (the default setting) runs as a 64 bit process on a 64 bit OS, and a 64 bit process cannot load 32 bit DLLs through P/Invoke (which is how the Xapian wrapper works).
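If you want to confirm that this is what you're hitting before rebuilding anything, you can compare the bitness of your process with that of the native DLL. The following is a minimal sketch and not part of the Xapian bindings: the NativeDllCheck class and its method names are made up for illustration. It relies on two facts: IntPtr.Size is 8 in a 64 bit process and 4 in a 32 bit one, and the PE header of a DLL records the machine type it was built for (0x014c for x86, 0x8664 for x64).

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Xapian): checks whether a native DLL
// was built for the same bitness as the current process.
public static class NativeDllCheck
{
    public const ushort MachineI386 = 0x014c;   // 32 bit x86
    public const ushort MachineAmd64 = 0x8664;  // 64 bit x64

    // Reads the COFF "Machine" field from a PE image (DLL or EXE).
    // The stream must be seekable; it is left open for the caller to close.
    public static ushort ReadMachine(Stream pe)
    {
        var reader = new BinaryReader(pe);
        pe.Seek(0x3C, SeekOrigin.Begin);        // e_lfanew: offset of the PE header
        int peOffset = reader.ReadInt32();
        pe.Seek(peOffset, SeekOrigin.Begin);
        if (reader.ReadUInt32() != 0x00004550)  // the bytes "PE\0\0"
            throw new InvalidDataException("Not a PE image.");
        return reader.ReadUInt16();             // Machine follows the signature
    }

    // True if a DLL with this machine type can load into a process
    // whose pointers are pointerSize bytes wide.
    public static bool MatchesPointerSize(ushort machine, int pointerSize)
    {
        if (pointerSize == 8)
            return machine == MachineAmd64;
        return machine == MachineI386;
    }

    public static bool MatchesCurrentProcess(ushort machine)
    {
        return MatchesPointerSize(machine, IntPtr.Size);
    }
}
```

Opening _XapianSharp.dll with File.OpenRead, passing the stream to ReadMachine, and checking the result with MatchesCurrentProcess before the first Xapian call lets you fail with a readable message instead of the nested TypeInitializationException.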
What does this mean?
It means that you'll either have to compile your code in x86 (32 bit) mode, or get 64 bit binaries for Xapian. Unfortunately, the downloads that you got earlier don't include 64 bit binaries. To make matters worse, you need a 64 bit version of zlib1.dll as well, which isn't provided either. Running a 32 bit compiled ASP.NET application on a 64 bit server is a pain (it works, but you lose a lot of benefits due to running in WoW64 mode).
Compiling Xapian and zlib for 64 bit operation
So if you really want to use Xapian in your Windows 64 bit environment, you're going to have to get your hands dirty. I also can't guarantee that there won't be any issues, since you're going to get a lot of compiler warnings about lost precision.
Prerequisites
- You're going to need Visual Studio 2005 or 2008 with C++ installed (if you're like me, you never thought you'd need it, so you didn't install it). Go install it.
- Get the source code for zlib.
- Get the build files (one zip) and source code (three gzip-ed archives) from the Flax hosting site.
- Unzip the source code to a common location (I recommend c:\xapian to make your life easier). Under this directory, you should have three directories: xapian-bindings-x.x.x, xapian-core-x.x.x, and xapian-omega-x.x.x.
- Unzip the Win32 build scripts from Flax into the xapian-core directory (it should unzip into a win32 subdirectory).
- Install ActivePerl (32 bit is fine, use the MSI). C:\perl is a good place for it.
Compiling zlib
Unzip the zlib source (to, say, c:\zlibsrc), then:
- Browse to the projects\visualc6 directory in the source and open the zlib.dsw file. You'll likely be asked to convert the project; say Yes To All.
- Add an x64 build target: click the Win32 drop down, select Configuration Manager, under Active solution platform click <new>, select x64, click OK, then Close.
- Select the LIB Release project from the dropdown and build it.
- Select the DLL Release project from the dropdown and build it.
Create a zlib directory for use in building Xapian (say, c:\zlib), then:
- Copy everything from the zlib source\projects\visualc6\win32_dll_release directory to the zlib directory.
- Create an include folder in the zlib directory and copy the zlib source to it.
- Create a lib directory in the zlib directory and copy everything from zlib source\projects\visualc6\win32_lib_release to it.
- In the lib directory, make a copy of the zlib.lib file and rename it to zdll.lib.
Compiling Xapian
Edit the xapian-core\win32\config.mak file (use Notepad). Edit the following lines:
- (line 32): set this to an appropriate directory for your Perl installation
- (line 155): set this to an appropriate directory for your Visual Studio installation
- (line 166): set this to the zlib directory created above (c:\zlib)
- (line 212): remove -D "_USE_32BIT_TIME_T"
Edit the xapian-core\win32\makedepend\makedepend.mak file (use Notepad). Edit the following line:
- (line 36): change /machine:I386 to /machine:AMD64
Edit xapian-core\common\utils.h and add the following line:
string om_tostring(unsigned __int64 a);
Edit xapian-core\common\utils.cc and add the following lines:
string
om_tostring(unsigned __int64 val)
{
    // %I64u is the MSVC format specifier for an unsigned 64 bit integer.
    static const char fmt[] = { '%', 'I', '6', '4', 'u', 0 };
    CONVERT_TO_STRING(fmt)
}
Open a "Visual Studio 2005 x64 Win64 Command Prompt" (it's under Visual Studio Tools on your Start menu). Change directories to c:\xapian\xapian-core-x.x.x\win32. Run "nmake". If everything goes according to plan, it should finish compiling after a while and a lot of compiler warnings about possible loss of data.
Now run "nmake COPYMAKFILES" (this will copy the mak files to the appropriate places).
There's a "bug" in the coding for the compilation of the bindings for x64. Run "set libpath", and if the result ends in a semicolon (;), then you need to reset the libpath in order to compile the bindings. In my case, I needed to run "set LIBPATH=C:\windows\Microsoft.NET\Framework64\v2.0.50727".
Change directories to c:\xapian\xapian-bindings-x.x.x\csharp and run "nmake". This should compile the bindings. The bindings end up in c:\xapian\xapian-core-x.x.x\win32\Release\CSharp. Copy those files to the project that you created above, and it should work on 64 bit computers (you'll have to remove and re-add the reference to XapianCSharp.dll since it will be signed differently, and then recompile).
Change directories to c:\xapian\xapian-omega-x.x.x and run "nmake". This should compile the omega component of Xapian.
What about Omega and Document Processing?
The Xapian features page is a bit of a bait and switch. It says "The indexer supplied can index HTML, PHP, PDF, PostScript, OpenOffice/StarOffice, OpenDocument, Microsoft Word/Excel/PowerPoint/Works, Word Perfect, AbiWord, RTF, DVI, Perl POD documentation, and plain text." However, once you dig into the Omega documentation, you find that it relies on other components to actually parse most document types:
- HTML (.html, .htm, .shtml)
- PHP (.php) - our HTML parser knows to ignore PHP code
- text files (.txt, .text)
- PDF (.pdf) if pdftotext is available (comes with xpdf)
- PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes with xpdf) are available
- OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) if unzip is available
- OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is available
- MS Word documents (.doc, .dot) if antiword is available
- MS Excel documents (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)
- MS Powerpoint documents (.ppt, .pps) if catppt is available, (comes with catdoc)
- MS Office 2007 documents (.docx, .dotx, .xlsx, .xlst, .pptx, .potx, .ppsx) if unzip is available
- Wordperfect documents (.wpd) if wpd2text is available (comes with libwpd)
- MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
- AbiWord documents (.abw)
- Compressed AbiWord documents (.zabw) if gzip is available
- Rich Text Format documents (.rtf) if unrtf is available
- Perl POD documentation (.pl, .pm, .pod) if pod2text is available
- TeX DVI files (.dvi) if catdvi is available
- DjVu files (.djv, .djvu) if djvutxt is available
- XPS files (.xps) if unzip is available
So for the vast majority of the documents you'll be interested in parsing, Omega will require other third party applications -- which may or may not be available for your use in the Windows environment. It also appears that Omega will call these external applications and read their output to parse the text of the document.
While this isn't necessarily a bad way to accomplish the task of extracting text out of documents, it can potentially become a point of failure that will be extremely difficult to track down because there will be little logging (if any) of the errors that occur when calling external applications.
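If you end up calling the same external programs from your own code, you can at least keep the diagnostics. The following is a minimal sketch, assuming the converter writes the extracted text to standard output and its errors to standard error. The ExternalFilter class is hypothetical and is not how Omega itself is implemented. It captures the exit code and the error output so that a failed conversion can be logged.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper (not part of Xapian or Omega): runs an external
// text extractor and keeps its output, error output, and exit code.
public static class ExternalFilter
{
    public sealed class Result
    {
        public int ExitCode { get; set; }
        public string Output { get; set; }
        public string Error { get; set; }
    }

    // Throws System.ComponentModel.Win32Exception if the program cannot be started.
    public static Result Run(string exe, string arguments)
    {
        var psi = new ProcessStartInfo(exe, arguments)
        {
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            // Read stderr asynchronously while stdout is read synchronously,
            // so neither pipe can fill up and block the child process.
            var error = new StringBuilder();
            process.ErrorDataReceived += (sender, e) =>
            {
                if (e.Data != null) error.AppendLine(e.Data);
            };
            process.BeginErrorReadLine();

            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();          // also waits for the stderr reader to finish

            return new Result
            {
                ExitCode = process.ExitCode,
                Output = output,
                Error = error.ToString()
            };
        }
    }
}
```

For example, ExternalFilter.Run("pdftotext", "\"c:\\docs\\report.pdf\" -") would return the text of a PDF at that (made-up) path, since pdftotext writes to standard output when the output file is given as "-". A nonzero ExitCode or a non-empty Error is worth logging alongside the document's path.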
Alternatives to Xapian
In the C# world, Lucene.NET is the most common search "engine" used. It has its own issues (no recent official releases, poor documentation, performance concerns over large data sets since it's running in managed code, etc.), but it should be evaluated as well. It does not offer a tool such as Omega, so you will be responsible for extracting the data from documents (via IFilter or the same external programs as Omega).
History
- 4/8/2010: Initial release.
- 4/9/2010: Added a note that this is for the 1.0.18 version of Xapian.