Introduction
Lucene.Net is a port of the popular Lucene text search engine, which was originally written in Java. It lets us store and index data coming from different kinds of data sources, and it provides high-performance querying over the stored text.
The purpose of this article is not to dig into the Lucene.Net architecture or behaviour (there are plenty of resources for that; you can find some at the end of the article) but to give an overview of the most common usage of the library and to show a way to simplify the operations we have to perform.
Background
Before running through the code, let's summarize the process flow of indexing/searching data with Lucene.
Indexing
- Get the data you want to index (from a database, XML, etc.)
- Open the IndexDirectory, then create and initialize an IndexWriter with its dependent objects.
- Create a Field for each value you want to store. Once you have all the fields that compose a "row of data", add them to a Document and finally add the document to the index.
- Once all the documents have been written, Optimize the index and close it.
Search
- Open the IndexDirectory, then create and initialize an IndexReader with its dependent objects.
- Perform a Query over the index.
- Iterate through the returned Document result set and get the values of each field of each row.
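The two flows above can be sketched end to end as follows. This is only a minimal illustration, assuming Lucene.Net 2.9.x (the version the article's code targets) and reusing the same calls that appear later in the article. The class name FlowSketch, the sample values and the use of an in-memory RAMDirectory instead of an index on disk are assumptions made to keep the snippet short.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, not part of LuceneWrap.
public static class FlowSketch
{
    // Indexes one sample document, searches it, and returns the number of hits.
    public static int IndexAndCount(string searchText)
    {
        // Assumption: an in-memory directory; the article opens an FSDirectory on disk.
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        // Indexing: one Field per value, fields grouped into a Document.
        var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        var document = new Document();
        document.Add(new Field("Id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("Summary", "slides from recent presentations",
                               Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(document);
        writer.Optimize();
        writer.Close();

        // Searching: parse the text against the "Summary" field and walk the hits.
        var searcher = new IndexSearcher(directory);
        var parser = new QueryParser("Summary", analyzer);
        Query query = parser.Parse(searchText);
        Hits hits = searcher.Search(query);
        int count = hits.Length();
        for (int i = 0; i < count; i++)
        {
            Console.WriteLine(hits.Doc(i).Get("Id"));
        }
        searcher.Close();
        return count;
    }
}
```

Some of these calls (Hits, the one-argument IndexSearcher constructor) are marked obsolete in 2.9 and are used here only because the article's code relies on them.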
LuceneWrap Scope
LuceneWrap is nothing more than a simple wrapper around the original Lucene.Net DLL. It doesn't pretend to cover all the Lucene features (though feel free to extend or rework it); it only aims to simplify and reduce the calls needed to interact with indexes.
LuceneWrap Essentials
LuceneWrap mainly does two things:
- It relieves the developer from correctly creating and handling Lucene.Net objects such as directories, searchers, indexes, etc.
- By decorating the members of a class with a LuceneWrapAttribute, we avoid handling the mapping between the data that needs to be stored and a Lucene.Net.Documents.Field. The same goes for searching: we can call Search<T> and retrieve a strongly typed list from the index.
The Code
The code is pretty simple. The key feature of the wrapper resides in the generic insert/update/search methods, and it is achieved using a custom attribute.
[System.AttributeUsage(System.AttributeTargets.Property)]
public class LuceneWrapAttribute : System.Attribute
{
public string Name { get; set; }
public string Value { get; set; }
public bool IsStored { get; set; }
public bool IsSearchable { get; set; }
public LuceneWrapAttribute(){}
}
We then take the class we are planning to store and decorate the members that need to be indexed with the custom attribute, which flags them for indexing. Let's say we would like to store the result of a query done with Entity Framework: in that case, we just have to decorate the members of the entity's POCO class with LuceneWrapAttribute. In my sample, I am using a simple class representing a feed with only three fields.
public class FeedResult
{
[LuceneWrap(IsSearchable = false, Name = "Id", IsStored = true)]
public string Id { get; set; }
[LuceneWrap(IsSearchable = true, Name = "Title", IsStored = true)]
public string Title { get; set; }
[LuceneWrap(IsSearchable = true, Name = "Summary", IsStored = true)]
public string Summary { get; set; }
}
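To make the role of the attribute more concrete, here is a small sketch that lists the decorated properties of a type using plain reflection. It is not part of LuceneWrap: AttributeInspector and the undecorated Scratch property are hypothetical, and the attribute and a shortened FeedResult are repeated (without the Value property) only so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

[AttributeUsage(AttributeTargets.Property)]
public class LuceneWrapAttribute : Attribute
{
    public string Name { get; set; }
    public bool IsStored { get; set; }
    public bool IsSearchable { get; set; }
}

public class FeedResult
{
    [LuceneWrap(IsSearchable = false, Name = "Id", IsStored = true)]
    public string Id { get; set; }
    [LuceneWrap(IsSearchable = true, Name = "Title", IsStored = true)]
    public string Title { get; set; }
    // Hypothetical undecorated property: the inspector below ignores it.
    public string Scratch { get; set; }
}

// Hypothetical helper used only for illustration.
public static class AttributeInspector
{
    // Returns one "Name|stored=...|searchable=..." entry per decorated property.
    public static List<string> Describe(Type type)
    {
        var lines = new List<string>();
        foreach (PropertyInfo property in type.GetProperties())
        {
            object[] attributes =
                property.GetCustomAttributes(typeof(LuceneWrapAttribute), false);
            if (attributes.Length == 0)
            {
                continue; // not flagged for indexing
            }
            var wrap = (LuceneWrapAttribute)attributes[0];
            lines.Add(string.Format("{0}|stored={1}|searchable={2}",
                                    wrap.Name, wrap.IsStored, wrap.IsSearchable));
        }
        return lines;
    }
}
```

Calling AttributeInspector.Describe(typeof(FeedResult)) yields one entry for Id and one for Title, which is the same information the wrapper needs in order to choose the store and index options of each field.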
Once the class is decorated properly, I can create an index and execute a search. Here's a snippet of a couple of tests.
LuceneManager<FeedResult> _luceneManager = new LuceneManager<FeedResult>();
LuceneSearcher _luceneSearcher = new LuceneSearcher();
[Test]
public void WriteIndex_Test()
{
List<FeedResult> feeds = FeedManager.GetFeeds();
foreach (var feed in feeds)
{
_luceneManager.AddItemToIndex(feed);
}
_luceneManager.FinalizeWriter(true);
}
[Test]
public void SearchInIndex_Test()
{
var result = _luceneSearcher.Search<FeedResult>("Summary", "presentations");
foreach (var feedResult in result)
{
Console.WriteLine(feedResult.Id);
Console.WriteLine(feedResult.Title);
Console.WriteLine(feedResult.Summary);
Console.WriteLine(Environment.NewLine);
}
}
LuceneManager is responsible for both inserts and updates. Note that in order to update an item in the index, we have to delete it first and then insert it again.
public class LuceneManager<T> : ILuceneManager<T>
{
private readonly string _INDEX_FILEPATH =
ConfigurationManager.AppSettings.Get("LuceneIndexFilePath");
private Analyzer _analyzer = null;
private IndexWriter _indexWriter = null;
private IndexReader _indexReader = null;
private Directory _luceneIndexDirectory = null;
public LuceneManager()
{
Create();
}
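public LuceneManager(string indexFilePath)
{
// Set the path before calling Create(); chaining to this() would open the configured path first.
_INDEX_FILEPATH = indexFilePath;
Create();
}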
public void Create()
{
_analyzer = new StandardAnalyzer(Version.LUCENE_29);
_luceneIndexDirectory = FSDirectory.Open(new DirectoryInfo(_INDEX_FILEPATH));
_indexWriter = new IndexWriter
(_luceneIndexDirectory, _analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
_indexReader = IndexReader.Open(_luceneIndexDirectory, false);
_indexReader.Close();
}
#region Insert index
#region Public methods
public void AddItemToIndex(T obj)
{
AddObjectToIndex(obj);
}
public void AddItemsToIndex(List<T> objects)
{
foreach (var obj in objects)
{
AddObjectToIndex(obj);
}
}
#endregion
#region Private methods
public void AddObjectToIndex(T obj)
{
Document document = new Document();
var newFields = LuceneReflection.GetLuceneFields(obj, false);
foreach (var newField in newFields)
{
document.Add(newField);
}
_indexWriter.AddDocument(document);
}
#endregion
#endregion
#region UpdateIndex
#region Public methods
public void ModifyItemFromIndex(T oldObj, T newObj)
{
DeleteObjectFromIndex(oldObj);
InsertUpdateFieldFromIndex(newObj);
}
public void ModifyItemFromIndex(List<T> oldObj, List<T> newObj)
{
foreach (var field in oldObj)
{
DeleteObjectFromIndex(field);
}
foreach (var field in newObj)
{
InsertUpdateFieldFromIndex(field);
}
}
#endregion
#region Private methods
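public void DeleteObjectFromIndex(T oldObj)
{
// DeleteDocuments removes every document that contains the given term, for each field in turn.
// An exact Term matches reliably only on NOT_ANALYZED fields (such as Id), so keep a unique key among them.
var oldFields = LuceneReflection.GetLuceneFields(oldObj, false);
foreach (var oldField in oldFields)
{
_indexWriter.DeleteDocuments(new Term
(oldField.Name(), oldField.StringValue()));
}
}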
public void InsertUpdateFieldFromIndex(T newfield)
{
AddObjectToIndex(newfield);
}
#endregion
#endregion
public void FinalizeWriter(bool optimize)
{
if (optimize)
_indexWriter.Optimize();
_indexWriter.Commit();
_indexWriter.Close();
_luceneIndexDirectory.Close();
}
}
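As a possible alternative to the delete-then-insert update used by LuceneManager, IndexWriter also offers UpdateDocument(Term, Document), which deletes the documents containing the term and adds the new document in one call. The sketch below assumes Lucene.Net 2.9.x and assumes that Id is a unique key indexed as NOT_ANALYZED; UpdateSketch and ReplaceById are hypothetical names, not part of LuceneWrap.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical helper, not part of LuceneWrap.
public static class UpdateSketch
{
    // Replaces the document whose "Id" term equals id with a freshly built one.
    public static void ReplaceById(IndexWriter writer, string id, string title, string summary)
    {
        var document = new Document();
        // Assumption: Id is a unique key, stored and indexed without analysis.
        document.Add(new Field("Id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("Title", title, Field.Store.YES, Field.Index.ANALYZED));
        document.Add(new Field("Summary", summary, Field.Store.YES, Field.Index.ANALYZED));

        // Deletes all documents containing the term, then adds the new document.
        writer.UpdateDocument(new Term("Id", id), document);
    }
}
```

Keying the update on a single non-analyzed field avoids deleting unrelated documents that merely share a word with the item being replaced.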
LuceneSearcher is responsible for retrieving a list of objects of a specified type. If, for instance, the index contains employees, we can search using LuceneSearcher.Search<Employee>. In our sample, we are using FeedResult as the type for our index, so we will search using LuceneSearcher.Search<FeedResult>.
public class LuceneSearcher : ILuceneSearcher
{
private readonly string _INDEX_FILEPATH =
ConfigurationManager.AppSettings.Get("LuceneIndexFilePath");
private Directory _luceneIndexDirectory = null;
private IndexSearcher _indexSearcher = null;
private QueryParser _queryParser = null;
private StandardAnalyzer _analyzer = null;
public LuceneSearcher()
{
Create();
}
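public LuceneSearcher(string indexFilePath)
{
// Set the path before calling Create(); chaining to this() would open the configured path first.
_INDEX_FILEPATH = indexFilePath;
Create();
}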
public void Create()
{
_luceneIndexDirectory = FSDirectory.Open(new DirectoryInfo(_INDEX_FILEPATH));
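// Use the same analyzer version as LuceneManager so index-time and query-time analysis match.
_analyzer = new StandardAnalyzer(Version.LUCENE_29);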
_indexSearcher = new IndexSearcher(_luceneIndexDirectory);
}
public List<T> Search<T>(string property, string textsearch) where T : new()
{
_queryParser = new QueryParser(property, _analyzer);
var result = GetResults<T>(textsearch);
return result;
}
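public List<T> Search<T>(string textSearch) where T: new()
{
// Reuses the parser (and its default field) created by the two-argument overload.
if (_queryParser == null)
throw new System.InvalidOperationException("Call Search<T>(property, textsearch) first to set the default field.");
return GetResults<T>(textSearch);
}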
public List<T> GetResults<T>(string textSearch) where T: new()
{
List<T> results = new List<T>();
Query query = _queryParser.Parse(textSearch);
Hits hits = _indexSearcher.Search(query);
int resultsCount = hits.Length();
for (int i = 0; i < resultsCount; i++)
{
Document doc = hits.Doc(i);
var obj = LuceneReflection.GetObjFromDocument<T>(doc);
results.Add(obj);
}
return results;
}
}
Here's the class responsible for the Reflection in both writing and reading.
public class LuceneReflection
{
public static List<Field> GetLuceneFields<T>(T obj, bool isSearch)
{
List<Field> fields = new List<Field>();
Field field = null;
PropertyInfo[] propertyInfos = obj.GetType().GetProperties();
foreach (var propertyInfo in propertyInfos)
{
if (obj.GetType().GetProperty(propertyInfo.Name).GetValue
(obj, null) != null && !isSearch)
{
field = GetLuceneFieldsForInsertUpdate
(obj, propertyInfo, false);
}
else
{
field = GetLuceneFieldsForInsertUpdate
(obj, propertyInfo, true);
}
// String properties without a LuceneWrapAttribute produce no field; skip them.
if (field != null)
fields.Add(field);
}
return fields;
}
private static Field GetLuceneFieldsForInsertUpdate<T>
(T obj, PropertyInfo propertyInfo, bool isSearch)
{
Field field = null;
object[] dbFieldAtts = propertyInfo.GetCustomAttributes
(typeof(LuceneWrapAttribute), isSearch);
if (dbFieldAtts.Length > 0 && propertyInfo.PropertyType == typeof(System.String))
{
var luceneWrapAttribute = ((LuceneWrapAttribute)dbFieldAtts[0]);
field = GetLuceneField(obj, luceneWrapAttribute, propertyInfo, isSearch);
}
else if (propertyInfo.PropertyType != typeof(System.String))
{
throw new InvalidCastException(string.Format("{0} must be a string in order to get indexed", propertyInfo.Name));
}
return field;
}
private static Field GetLuceneField<T>(T obj, LuceneWrapAttribute luceneWrapAttribute,
PropertyInfo propertyInfo, bool isSearch)
{
Field.Store store = luceneWrapAttribute.IsStored ?
Field.Store.YES : Field.Store.NO;
Lucene.Net.Documents.Field.Index index = luceneWrapAttribute.IsSearchable ?
Field.Index.ANALYZED : Field.Index.NOT_ANALYZED;
string propertyValue = isSearch ? string.Empty :
obj.GetType().GetProperty(propertyInfo.Name).GetValue(obj, null).ToString();
Field field = new Field(propertyInfo.Name, propertyValue, store, index);
return field;
}
public static T GetObjFromDocument<T>(Document document) where T : new()
{
T obj = new T();
var fields = GetLuceneFields(obj, true);
foreach (var field in fields)
{
obj.GetType().GetProperty(field.Name()).SetValue
(obj, document.Get(field.Name()), null);
}
return (T)obj;
}
}
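The round trip performed by LuceneReflection (object to fields when writing, document back to object when reading) can be illustrated without Lucene at all. In the sketch below, a Dictionary stands in for the Lucene Document, so this is a simplified analogy under that assumption and not the wrapper's actual code. RoundTripSketch, ToPairs and FromPairs are hypothetical names; the attribute and a shortened FeedResult are repeated so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

[AttributeUsage(AttributeTargets.Property)]
public class LuceneWrapAttribute : Attribute
{
    public string Name { get; set; }
    public bool IsStored { get; set; }
    public bool IsSearchable { get; set; }
}

public class FeedResult
{
    [LuceneWrap(IsSearchable = false, Name = "Id", IsStored = true)]
    public string Id { get; set; }
    [LuceneWrap(IsSearchable = true, Name = "Summary", IsStored = true)]
    public string Summary { get; set; }
}

// Hypothetical helper: a Dictionary plays the role of the Lucene Document.
public static class RoundTripSketch
{
    // Object -> name/value pairs (the role GetLuceneFields plays when writing).
    public static Dictionary<string, string> ToPairs<T>(T obj)
    {
        var pairs = new Dictionary<string, string>();
        foreach (PropertyInfo property in typeof(T).GetProperties())
        {
            if (!IsMapped(property)) continue;
            object value = property.GetValue(obj, null);
            pairs[property.Name] = value == null ? string.Empty : value.ToString();
        }
        return pairs;
    }

    // Name/value pairs -> object (the role GetObjFromDocument plays when reading).
    public static T FromPairs<T>(Dictionary<string, string> pairs) where T : new()
    {
        T obj = new T();
        foreach (PropertyInfo property in typeof(T).GetProperties())
        {
            string value;
            if (IsMapped(property) && pairs.TryGetValue(property.Name, out value))
            {
                property.SetValue(obj, value, null);
            }
        }
        return obj;
    }

    // Only decorated string properties take part in the mapping.
    private static bool IsMapped(PropertyInfo property)
    {
        return property.PropertyType == typeof(string)
            && property.GetCustomAttributes(typeof(LuceneWrapAttribute), false).Length > 0;
    }
}
```

Unlike the article's class, this sketch silently skips properties that are not decorated strings instead of throwing, which is a design choice of the example only.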
Once the index is built, we can search it, or modify an entry and then search for the new data:
[Test]
public void UpdateIndex_Test()
{
var feeds = FeedManager.GetFeeds();
var oldFeed = feeds.First();
FeedResult newField = new FeedResult(){Id = oldFeed.Id,
Summary = "CIAO CIAO",Title = oldFeed.Title};
_luceneManager.ModifyItemFromIndex(oldFeed,newField);
_luceneManager.FinalizeWriter(true);
}
[Test]
public void SearchModifiedEntry_Test()
{
var result = _luceneSearcher.Search<FeedResult>("Summary", "CIAO CIAO");
foreach (var feedResult in result)
{
Console.WriteLine(feedResult.Id);
Console.WriteLine(feedResult.Title);
Console.WriteLine(feedResult.Summary);
Console.WriteLine(Environment.NewLine);
}
}
Points of Interest
Lucene is an excellent framework for full-text search. The main alternative in the .NET world is the Full-Text Search provided by SQL Server. Here are the main differences between the two:
SQL Server FTS
- You don't have to add anything to your solution, it's there in SQL Server
- Much easier administration of indexes
- It is database dependent
Lucene.Net
- It is free and open source, and it offers more possibilities than SQL Server.
- It is not tied to any product, you can easily scale horizontally by adding more indexes to your web servers.
- You have to programmatically handle every phase of the indexing, from the creation to the update.
Conclusions
Rumor has it that Lucene.Net performs better than SQL Server on large sets of data. I haven't done any comparison yet, and that topic is beyond the scope of this article.
But if you're thinking of adding full text search on your website, the suggestion is: if the data you want to index is already in SQL Server and you want something quick and simple to implement, just go for the SQL Server FTS.
Otherwise, if you want something fast and more sophisticated because you know you will have to handle millions of records, or because some features are not present in SQL Server, or simply because you don't have SQL Server... well, in that case, go for Lucene.Net and use LuceneWrap. :)
References
The source code is at the top of the article.