LuceneWrap: A Compact Lucene.net Wrapper

george2giga

4 May 2011

Providing a generic wrap to Lucene.net basic search functions

Download source - 1.67 MB

Introduction

Lucene.net is a port of the popular Lucene text search engine for Java. It lets us index and store data from different kinds of data sources, and it provides high-performance querying over the stored text.

The purpose of this article is not to dig into the Lucene.Net architecture or behaviour (plenty of resources cover that, and some are listed at the end of the article) but to give an overview of the most common usage of the library and to show a way to simplify the operations we have to perform.

Background

Before running through the code, let's summarize the process flow of indexing/searching data with Lucene.

Indexing

Get the data you want to index (from a database, XML, etc.)
Open the IndexDirectory, then create and initialize an IndexWriter with its dependent objects.
Create a Field for each value you want to store. Once you have all the fields that make up a "row of data", add them to a Document, then add the document to the index.
Once all the documents have been written, optimize the index and close it.

Search

Open the IndexDirectory, then create and initialize an IndexReader with its dependent objects.
Perform a Query over the index.
Iterate through the returned Document resultset and get the values of each field of each row.

LuceneWrap Scope

LuceneWrap is nothing more than a simple wrapper around the original Lucene.net DLL. It doesn't claim to cover all the Lucene features (though feel free to extend or rework it); its goal is to simplify and reduce the calls needed to interact with indexes.

LuceneWrap Essentials

LuceneWrap mainly does two things:

It shields the developer from having to create and handle Lucene.net objects such as directories, searchers, indexes, etc. correctly.
By decorating the members of a class with a LuceneWrapAttribute, we avoid hand-writing the mapping between the data that needs to be stored and a Lucene.Net.Documents.Field. The same applies to searching: we can call Search<T> and retrieve a strongly typed list from the index.

The Code

The code is pretty simple. The key feature of the wrapper is its set of generic insert/update/search methods, which are driven by a custom attribute.

[System.AttributeUsage(System.AttributeTargets.Property)]
public class LuceneWrapAttribute : System.Attribute
{
    public string Name { get; set; }
    public string Value { get; set; }
    public bool IsStored { get; set; }
    public bool IsSearchable { get; set; }

    public LuceneWrapAttribute(){}
}

We then pick the class we are planning to store and decorate the members that need to be indexed with the custom attribute, which flags them for indexing. Let's say we would like to store the result of a query done with Entity Framework: in that case, we just have to decorate the members of the entity's POCO class with LuceneWrapAttribute. In my sample, I am using a simple class representing a Feed with only three fields.

public class FeedResult
{
	[LuceneWrap(IsSearchable = false, Name = "Id", IsStored = true)]
	public string Id { get; set; }
	[LuceneWrap(IsSearchable = true, Name = "Title", IsStored = true)]
	public string Title { get; set; }
	[LuceneWrap(IsSearchable = true, Name = "Summary", IsStored = true)]
	public string Summary { get; set; }
}
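To make the role of the attribute concrete, here is a minimal, standalone sketch of how decorated properties can be discovered through reflection. It does not use Lucene.net at all and is not the wrapper's actual code: the attribute is a trimmed copy of the one above, and `DescribeFields`, `AttributeMappingDemo` and the `Notes` property are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Trimmed copy of the article's attribute (assumption: Value is not needed here).
[AttributeUsage(AttributeTargets.Property)]
public class LuceneWrapAttribute : Attribute
{
    public string Name { get; set; }
    public bool IsStored { get; set; }
    public bool IsSearchable { get; set; }
}

public class FeedResult
{
    [LuceneWrap(Name = "Id", IsStored = true, IsSearchable = false)]
    public string Id { get; set; }

    [LuceneWrap(Name = "Summary", IsStored = true, IsSearchable = true)]
    public string Summary { get; set; }

    // Hypothetical undecorated property: the helper below skips it.
    public string Notes { get; set; }
}

public static class AttributeMappingDemo
{
    // Hypothetical helper: returns one description line per decorated property.
    public static List<string> DescribeFields(object obj)
    {
        var lines = new List<string>();
        foreach (PropertyInfo property in obj.GetType().GetProperties())
        {
            object[] attributes =
                property.GetCustomAttributes(typeof(LuceneWrapAttribute), false);
            if (attributes.Length == 0)
                continue; // not flagged for indexing

            var attribute = (LuceneWrapAttribute)attributes[0];
            object value = property.GetValue(obj, null);
            lines.Add(string.Format("{0}={1} stored={2} analyzed={3}",
                attribute.Name ?? property.Name,
                value ?? string.Empty,
                attribute.IsStored,
                attribute.IsSearchable));
        }
        return lines;
    }

    public static void Main()
    {
        var feed = new FeedResult
        {
            Id = "42",
            Summary = "Slides and presentations",
            Notes = "internal"
        };
        foreach (string line in DescribeFields(feed))
            Console.WriteLine(line);
    }
}
```

In the real wrapper, the stored and searchable flags are what end up selecting Field.Store and Field.Index values, as the LuceneReflection class shows.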

Once the class is decorated properly, I can create an index and execute a search. Here's a snippet of a couple of tests.

LuceneManager<FeedResult> _luceneManager = new LuceneManager<FeedResult>();
LuceneSearcher _luceneSearcher = new LuceneSearcher();
 
[Test]
public void WriteIndex_Test()
{
    //We retrieve a list of feeds from a website and get a list of FeedResult
    List<FeedResult> feeds = FeedManager.GetFeeds();
    foreach (var feed in feeds)
    {
        _luceneManager.AddItemToIndex(feed);    
    }
    _luceneManager.FinalizeWriter(true);            
}
 
[Test]
public void SearchInIndex_Test()
{
    //we retrieve a list of FeedResult by searching on the 
    //Summary field any occurrence of "presentations"
    var result = _luceneSearcher.Search<FeedResult>("Summary", "presentations");
    foreach (var feedResult in result)
    {
        Console.WriteLine(feedResult.Id);
        Console.WriteLine(feedResult.Title);
        Console.WriteLine(feedResult.Summary);
        Console.WriteLine(Environment.NewLine);
    }
}

LuceneManager is responsible for both inserts and updates. Note that Lucene does not modify an indexed document in place: to update an entry, we have to delete it first and then insert it again.

public class LuceneManager<T> : ILuceneManager<T>
{
    private readonly string _INDEX_FILEPATH = 
	ConfigurationManager.AppSettings.Get("LuceneIndexFilePath");

    private Analyzer _analyzer = null;
    private IndexWriter _indexWriter = null;
    private IndexReader _indexReader = null;
    private Directory _luceneIndexDirectory = null;
        
    public LuceneManager()
    {
        Create();
    }

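    public LuceneManager(string indexFilePath)
    {
        // Set the path before Create() runs; chaining to this() would open
        // the index at the configured default path and ignore the argument.
        _INDEX_FILEPATH = indexFilePath;
        Create();
    }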

    public void Create()
    {
        _analyzer = new StandardAnalyzer(Version.LUCENE_29);
        _luceneIndexDirectory = FSDirectory.Open(new DirectoryInfo(_INDEX_FILEPATH));
        _indexWriter = new IndexWriter
	(_luceneIndexDirectory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        _indexReader = IndexReader.Open(_luceneIndexDirectory, false);
        _indexReader.Close();
    }

    #region Insert index

    #region Public methods

    public void AddItemToIndex(T obj) 
    {
        AddObjectToIndex(obj);
    }

    public void AddItemsToIndex(List<T> objects) 
    {
        foreach (var obj in objects)
        {
            AddObjectToIndex(obj);
        }
    }

    #endregion

    #region Private methods

    public void AddObjectToIndex(T obj) 
    {
        Document document = new Document();
        var newFields = LuceneReflection.GetLuceneFields(obj, false);
        foreach (var newField in newFields)
        {
            document.Add(newField);    
        }
        _indexWriter.AddDocument(document);
    }

    #endregion

    #endregion

    #region UpdateIndex

    #region Public methods

    public void ModifyItemFromIndex(T oldObj, T newObj) 
    {
        DeleteObjectFromIndex(oldObj);
        InsertUpdateFieldFromIndex(newObj);
    }

    public void ModifyItemFromIndex(List<T> oldObj, List<T> newObj) 
    {
        foreach (var field in oldObj)
        {
            DeleteObjectFromIndex(field);
        }
        foreach (var field in newObj)
        {
            InsertUpdateFieldFromIndex(field);
        }
    }

    #endregion

    #region Private methods

    public void DeleteObjectFromIndex(T oldObj)
    {
        var oldFields = LuceneReflection.GetLuceneFields(oldObj, false);
        foreach (var oldField in oldFields)
        {
            _indexWriter.DeleteDocuments(new Term
		(oldField.Name(), oldField.StringValue()));
        }
    }

    public void InsertUpdateFieldFromIndex(T newfield)
    {
        AddObjectToIndex(newfield);
    }

    #endregion

    #endregion

    public void FinalizeWriter(bool optimize)
    {
        if (optimize)
            _indexWriter.Optimize();
        _indexWriter.Commit();
        _indexWriter.Close();
        _luceneIndexDirectory.Close();
    }
}

LuceneSearcher is responsible for retrieving a list of objects of a specified type. If, for instance, the index contains Employees, we can search using LuceneSearcher.Search<Employee>. In our sample, we are using FeedResult as the type for our index, so we will search using LuceneSearcher.Search<FeedResult>.

public class LuceneSearcher : ILuceneSearcher
{
    private readonly string _INDEX_FILEPATH = 
	ConfigurationManager.AppSettings.Get("LuceneIndexFilePath");
    private Directory _luceneIndexDirectory = null;
    private IndexSearcher _indexSearcher = null;
    private QueryParser _queryParser = null;
    private StandardAnalyzer _analyzer = null;

    public LuceneSearcher()
    {
        Create();
    }

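    public LuceneSearcher(string indexFilePath)
    {
        // Set the path before Create() runs; chaining to this() would open
        // the index at the configured default path and ignore the argument.
        _INDEX_FILEPATH = indexFilePath;
        Create();
    }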

    public void Create()
    {
        _luceneIndexDirectory = FSDirectory.Open(new DirectoryInfo(_INDEX_FILEPATH));
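        // Use the same analyzer version as LuceneManager so that queries are
        // tokenized the same way as the indexed text
        _analyzer = new StandardAnalyzer(Version.LUCENE_29);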
        _indexSearcher = new IndexSearcher(_luceneIndexDirectory);            
    }

    public List<T> Search<T>(string property, string textsearch) where T : new()
    {
        _queryParser = new QueryParser(property, _analyzer);
        var result = GetResults<T>(textsearch);

        return result;
    }

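    public List<T> Search<T>(string textSearch) where T: new()
    {
        // Reuses the parser (and its default field) created by the last
        // Search<T>(property, textsearch) call
        if (_queryParser == null)
            throw new InvalidOperationException(
                "Call Search<T>(property, textsearch) first to choose the field to search.");
        return GetResults<T>(textSearch);
    }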

    public List<T> GetResults<T>(string textSearch) where T: new()
    {
        List<T> results = new List<T>();
        Query query = _queryParser.Parse(textSearch);
        //Do the search
        Hits hits = _indexSearcher.Search(query);
        int resultsCount = hits.Length();
        for (int i = 0; i < resultsCount; i++)
        {
            Document doc = hits.Doc(i);
            var obj = LuceneReflection.GetObjFromDocument<T>(doc);
            results.Add(obj);
        }

        return results;
    }
}

Here's the class that handles the reflection work for both writing and reading.

public class LuceneReflection
{
public static List<Field> GetLuceneFields<T>(T obj, bool isSearch)
{
	List<Field> fields = new List<Field>();
	Field field = null;
	// get all properties of the object type
	PropertyInfo[] propertyInfos = obj.GetType().GetProperties();
	foreach (var propertyInfo in propertyInfos)
	{
		//If property is not null add it as field and it is not a search
		if (obj.GetType().GetProperty(propertyInfo.Name).GetValue
				(obj, null) != null && !isSearch)
		{
			field = GetLuceneFieldsForInsertUpdate
					(obj, propertyInfo, false);
		}
		else
		{
			field = GetLuceneFieldsForInsertUpdate
					(obj, propertyInfo, true);
				
		}
		// Properties without a LuceneWrapAttribute yield null: skip them
		if (field != null)
			fields.Add(field);
	}

	return fields;
}

private static Field GetLuceneFieldsForInsertUpdate<T>
	(T obj, PropertyInfo propertyInfo, bool isSearch)
{
	Field field = null;

	object[] dbFieldAtts = propertyInfo.GetCustomAttributes
				(typeof(LuceneWrapAttribute), isSearch);
	if (dbFieldAtts.Length > 0 && propertyInfo.PropertyType == typeof(System.String))
        {
            var luceneWrapAttribute = ((LuceneWrapAttribute)dbFieldAtts[0]);
            field = GetLuceneField(obj, luceneWrapAttribute, propertyInfo, isSearch);
        }
        else if (dbFieldAtts.Length > 0) // decorated, but not a string
        {
            throw new InvalidCastException(string.Format(
                "{0} must be a string in order to get indexed", propertyInfo.Name));
        }	   

	return field;
}	

private static Field GetLuceneField<T>(T obj, LuceneWrapAttribute luceneWrapAttribute, 
	PropertyInfo propertyInfo, bool isSearch)
{
	Field.Store store = luceneWrapAttribute.IsStored ? 
		Field.Store.YES : Field.Store.NO;
	Lucene.Net.Documents.Field.Index index = luceneWrapAttribute.IsSearchable ? 
		Field.Index.ANALYZED : Field.Index.NOT_ANALYZED;
	//if it is not a search assign the object value to the field
	string propertyValue = isSearch ? string.Empty : 
	obj.GetType().GetProperty(propertyInfo.Name).GetValue(obj, null).ToString();
	Field field = new Field(propertyInfo.Name, propertyValue, store, index);
	return field;
}

public static T GetObjFromDocument<T>(Document document) where T : new()
{
	T obj = new T();
	var fields = GetLuceneFields(obj, true);
	foreach (var field in fields)
	{
		//setting values to properties of the object via reflection
		obj.GetType().GetProperty(field.Name()).SetValue
			(obj, document.Get(field.Name()), null);
	}

	return (T)obj;
}
}
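The read/write symmetry of this class can be hard to see with the Lucene types in the way. The following standalone sketch mimics the same round trip with a plain Dictionary<string, string> standing in for a Lucene Document. It is an illustration under that assumption, not the wrapper's code: `RoundTripDemo`, `ToFields` and `FromFields` are hypothetical names, and attribute handling is left out to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public class FeedResult
{
    public string Id { get; set; }
    public string Title { get; set; }
}

public static class RoundTripDemo
{
    // Write side: one name/value pair per string property
    // (the dictionary is a stand-in for a Lucene Document).
    public static Dictionary<string, string> ToFields(object obj)
    {
        var fields = new Dictionary<string, string>();
        foreach (PropertyInfo property in obj.GetType().GetProperties())
        {
            if (property.PropertyType != typeof(string))
                continue;
            // Null values are stored as empty strings, as the wrapper does.
            fields[property.Name] =
                (string)property.GetValue(obj, null) ?? string.Empty;
        }
        return fields;
    }

    // Read side: create a new T and copy matching values back via reflection.
    public static T FromFields<T>(IDictionary<string, string> fields)
        where T : new()
    {
        T obj = new T();
        foreach (PropertyInfo property in typeof(T).GetProperties())
        {
            string value;
            if (property.PropertyType == typeof(string)
                && property.CanWrite
                && fields.TryGetValue(property.Name, out value))
            {
                property.SetValue(obj, value, null);
            }
        }
        return obj;
    }

    public static void Main()
    {
        var original = new FeedResult { Id = "7", Title = "Hello" };
        Dictionary<string, string> fields = ToFields(original);
        FeedResult copy = FromFields<FeedResult>(fields);
        Console.WriteLine(copy.Id + " / " + copy.Title);
    }
}
```

Like the wrapper, this sketch only handles string properties and assumes T is a class with a parameterless constructor.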

Once the index is built, we can search it, or modify an entry and then search for the new data:

[Test]
public void UpdateIndex_Test()
{
    var feeds = FeedManager.GetFeeds();
    var oldFeed = feeds.First();
            
    FeedResult newField = new FeedResult(){Id = oldFeed.Id, 
		Summary = "CIAO CIAO",Title = oldFeed.Title};
    _luceneManager.ModifyItemFromIndex(oldFeed,newField);
    _luceneManager.FinalizeWriter(true);
}

[Test]
public void SearchModifiedEntry_Test()
{
    //we retrieve a list of FeedResult by searching on the 
    //Summary field any occurrence of the updated text "CIAO CIAO"
    var result = _luceneSearcher.Search<FeedResult>("Summary", "CIAO CIAO");
    foreach (var feedResult in result)
    {
        Console.WriteLine(feedResult.Id);
        Console.WriteLine(feedResult.Title);
        Console.WriteLine(feedResult.Summary);
        Console.WriteLine(Environment.NewLine);
    }
}

Points of Interest

Lucene is an excellent framework for full text search. The main alternative in the .NET world is the Full Text Search (FTS) provided by SQL Server. Here are the main differences between the two:

SQL Server FTS

You don't have to add anything to your solution, it's there in SQL Server
Much easier administration of indexes
It is database dependent

Lucene.net

It is free and open source, and it offers more features than SQL Server FTS.
It is not tied to any product, and you can easily scale horizontally by adding more indexes to your web servers.
You have to handle every phase of the indexing programmatically, from creation to update.

Conclusions

Rumor has it that Lucene.net performs better than SQL Server on large sets of data. I haven't done any comparison yet, and the topic is outside the scope of this article.

But if you're thinking of adding full text search on your website, the suggestion is: if the data you want to index is already in SQL Server and you want something quick and simple to implement, just go for the SQL Server FTS.

Otherwise, if you want something fast and more sophisticated because you know you will have to handle millions of records, or because some features are not present in SQL Server or simply because you don't have SQL Server... well, in that case, go for Lucene.net and use LuceneWrap. :)

References

Lucene.Net: http://incubator.apache.org/lucene.net/
SQL Server 2008 FTS: http://msdn.microsoft.com/en-us/library/ms142571.aspx
There are also a few projects built around Lucene. Here are the main ones:

LinqToLucene: http://linqtolucene.codeplex.com/
SimpleLucene: http://simplelucene.codeplex.com/

The source code is at the top of the article.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
