What is Lucene.Net?
Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. It contains powerful APIs for creating full-text indexes and for building advanced, precise search into your programs. Some people confuse Lucene.Net with a ready-to-use application such as a web crawler or a file search tool, but it is not an application; it is a framework library that lets you implement these difficult technologies yourself. Lucene.Net places no restrictions on what you can index and search, which gives you a lot more power than other full-text indexing and searching implementations: you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.
Lucene.Net is an API-by-API port of the original Lucene project, which is written in Java; even the unit tests were ported to guarantee the quality. The Lucene.Net index is also fully compatible with the Lucene index, and both libraries can be used on the same index together with no problems. A number of products have used Lucene and Lucene.Net to build their searches; some well-known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. But it’s not just websites that have used Lucene: there is also a product built on Lucene.Net called Lookout, a search tool for Microsoft Outlook that made Outlook’s integrated search look painfully slow and inaccurate.
Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository and can be found here. If you need help downloading the source, you can use the free TortoiseSVN or RapidSVN clients. The Lucene.Net project always welcomes new contributors, and remember, there are many ways to contribute to an open source project other than writing code.
Creating a search solution
There are two main parts to a search solution: indexing the content you wish to search, and actually searching that content. It is pretty much as simple as that. We will build an index first, and then perform a search against it.
What you need to create an index
Let’s see an example of what it takes to create an index and to populate it.
// Create (or overwrite) a file-based index at this location.
string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir =
    Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, true);

// The analyzer breaks field text down into searchable terms.
Lucene.Net.Analysis.Analyzer analyzer =
    new Lucene.Net.Analysis.Standard.StandardAnalyzer();

// 'true' tells the writer to create a new index.
Lucene.Net.Index.IndexWriter indexWriter =
    new Lucene.Net.Index.IndexWriter(dir, analyzer, true);

// Build a document with a single stored, tokenized field.
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldContent =
    new Lucene.Net.Documents.Field("content",
        "The quick brown fox jumps over the lazy dog",
        Lucene.Net.Documents.Field.Store.YES,
        Lucene.Net.Documents.Field.Index.TOKENIZED,
        Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldContent);

// Index the document, compact the index, and release the writer.
indexWriter.AddDocument(doc);
indexWriter.Optimize();
indexWriter.Close();
Alright, not bad. Let’s take a look at what we just did. There are five main classes in use here: Directory, Analyzer, IndexWriter, Document, and Field. We create a Directory that lets Lucene know where we want to store the index. The Analyzer is used to analyze the text. We have an IndexWriter that uses the Directory and Analyzer to create and write out the index. Then, we create a new Document object, and create a Field that has its field name set to “content” and its value set to “The quick brown fox jumps over the lazy dog”. We add the Field to the Document, and now we can index the newly created Document with the IndexWriter. Then, we have a funny-looking call to Optimize (more on this later), and we call Close to close the writer when we are done. We have successfully created a full-text index that’s ready to be searched. First, let’s elaborate a little bit on some of the classes that we just used.
Lucene.Net.Store.Directory – The Directory is a base class that provides an abstract view of a directory. There are two implementations packaged with Lucene.Net. FSDirectory stores the index in a file system directory. RAMDirectory is an in-memory directory that you can use to store the index. You can inherit from the Directory class to implement your own custom directory object to store the index.
Lucene.Net.Analysis.Analyzer – The Analyzer is a base class that is responsible for breaking the text down into single words or terms, and removing any noise words, or what Lucene.Net calls stop words; stop words include “and”, “a”, “the”, etc. For now, we will just use the StandardAnalyzer class, as it’s a very good first choice. You can pass a list of your own stop words to the constructor of the StandardAnalyzer as a string array; the default constructor uses the default list of stop words. You can inherit from the Analyzer to implement a custom way to handle the documents that are to be indexed.
Lucene.Net.Index.IndexWriter – The IndexWriter takes on the responsibility of coordinating the Analyzer and handing the results to the Directory for storage. During the creation of the index, the writer will create some files in the Directory. When we add documents to the index writer, it uses the Analyzer to break down each of the fields and finds a place to store the indexed document in the Directory. After a session of indexing documents, you are encouraged to optimize the index, which merges it into a more compact form that takes fewer resources to search. Note that calling Optimize for every Document you add is not recommended; call it just once after an indexing session, if you can. The last argument to the IndexWriter’s constructor is true to create a new index. To add more documents to an existing index, you would specify false here, to avoid overwriting the index.
Lucene.Net.Documents.Document – The Document class is what gets indexed by the IndexWriter. You can think of a Document as an entity that you want to retrieve; a Document could represent an email, a web page, a recipe, or even a CodeProject article.
Lucene.Net.Documents.Field – The Document contains a list of Fields that are used to describe the document. Every field has a name and a value. Each field’s value contains the text that you want to make searchable. The other parts of the field's constructor contain instructions for how to handle an individual field. The Field.Store instructions tell the IndexWriter whether you want to store the field’s value inside the index, so that the value can later be retrieved and acted upon, for example to show the data to the user in the search results, or to keep an identifier such as the primary key of the object that this field's document represents.
Other instructions are the Field.Index values, which tell the IndexWriter how to index (if at all) the field. Possible values include Field.Index.TOKENIZED, meaning that we want to break down the string with the IndexWriter’s supplied Analyzer and make it searchable. Another option is Field.Index.UN_TOKENIZED, which still indexes the field, but as a single whole value that is not broken down by the Analyzer. The difference between storing a value and indexing a value is that you store a value in order to retrieve it back from the index, whereas you index a value in order to make it searchable. It is totally acceptable to store a value but not index it, as you would probably do with an identifier value, and it is equally possible to index the content of an email without wanting to display that content within the search results.
The other set of instructions defines how to handle TermVectors. Storing TermVectors in an index supports an advanced feature of Lucene that goes beyond exactly matching a search query term: with term vectors, you will be able to retrieve related documents, as in documents that are about the same subject.
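The store and index options described above can be sketched in a few lines. This is only an illustration, and it assumes the same Lucene.Net 2.x-era API that the snippets in this article use (later releases renamed or removed several of these members). The class name FieldOptionsSketch, the field names, and the sample values are made up for the example. Field.Index.NO, which is not named in the text above, is the "do not index at all" option, and the untokenized option is spelled UN_TOKENIZED in that API. A RAMDirectory is used so that nothing is written to disk.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical example class; not part of Lucene.Net.
public static class FieldOptionsSketch
{
    // Builds a one-document index in memory and returns the directory.
    public static RAMDirectory BuildIndex()
    {
        RAMDirectory dir = new RAMDirectory();   // in-memory, nothing touches disk
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document email = new Document();

        // Stored but not indexed: can be read back from a hit, cannot be searched.
        email.Add(new Field("id", "42", Field.Store.YES, Field.Index.NO));

        // Indexed as one exact term; the analyzer does not break it up.
        email.Add(new Field("sender", "alice@example.com",
                            Field.Store.YES, Field.Index.UN_TOKENIZED));

        // Analyzed and searchable, but the text itself is not kept in the index.
        email.Add(new Field("body", "The quick brown fox jumps over the lazy dog",
                            Field.Store.NO, Field.Index.TOKENIZED));

        writer.AddDocument(email);
        writer.Optimize();   // once, at the end of the indexing session
        writer.Close();
        return dir;
    }

    public static void Main()
    {
        IndexReader reader = IndexReader.Open(BuildIndex());
        Console.WriteLine("Documents in index: " + reader.NumDocs());
        reader.Close();
    }
}
```

With this layout, a search on "body" can find the email, and the hit can hand back the stored "id" and "sender" values, while asking the retrieved document for "body" yields nothing because that field was never stored.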
After a good indexing of some documents, I’m sure that we are ready for the fun part.
What you need to search an index
Let’s take a look at an example of what you need to perform a simple search.
// Open the existing index. Pass 'false' so the index is not re-created (and erased).
string indexFileLocation = @"C:\Index";
Lucene.Net.Store.Directory dir =
    Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
Lucene.Net.Search.IndexSearcher searcher =
    new Lucene.Net.Search.IndexSearcher(dir);

// Search the "content" field for the single term "fox".
Lucene.Net.Index.Term searchTerm =
    new Lucene.Net.Index.Term("content", "fox");
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);
Lucene.Net.Search.Hits hits = searcher.Search(query);

// Print the stored "content" value of every matching document.
for (int i = 0; i < hits.Length(); i++)
{
    Lucene.Net.Documents.Document doc = hits.Doc(i);
    string contentValue = doc.Get("content");
    Console.WriteLine(contentValue);
}

// Release the searcher when we are done.
searcher.Close();
With this small bit of code, we defined where our index is stored, again through the use of a Directory class. But now we have this IndexSearcher object, which does all the heavy lifting of the actual search. To use the IndexSearcher, we have to pass it a Query object: you call the Search method on the IndexSearcher, passing in the Query, and it returns a Hits object. Finally, by iterating through the Hits, we are able to pull out the Documents that match the query. After we have a document, we can pull out a field's value that was previously stored with the document when it was indexed. Let's look into the classes a little more closely!
Lucene.Net.Search.IndexSearcher – The IndexSearcher object, again, does all the heavy lifting of the actual search. When a search is performed, it uses the Directory object passed into the IndexSearcher’s constructor to open the index read-only. There are more methods on the IndexSearcher object that provide other ways to query an index.
Lucene.Net.Index.Term – A Term is the most basic construct for searching. A Term consists of two parts: the name of the field you wish to search, and the value to look for in that field.
Lucene.Net.Search.Query – The Query is an abstract base class that works with the IndexSearcher to provide the results. In the example above, we used a TermQuery object, which makes a query from a single Term. There are many other ways to create a query. Some implementations of the Query class, besides TermQuery, include BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery. With all these choices in how to query, we need a way to let the user make a powerful query from a single textbox. This is where another class comes in: the QueryParser. More on this soon.
Lucene.Net.Search.Hits – This represents a list of documents that were returned by the search. A Hits object can be iterated over, and is responsible for getting the documents from the search. Note that the Hits object doesn’t load all the documents initially; it only loads a portion of them. For that reason, iterating over all the search results is not recommended for larger indexes, as it will lead to performance issues. After you have a Hits object, you can call the Doc(int index) method, which returns the document associated with a single hit.
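To make the Query and Hits descriptions more concrete, here is a rough sketch that builds a BooleanQuery by hand from two TermQuery objects and then reads only the first page of hits rather than walking the whole list. It assumes the same Lucene.Net 2.x-era API as the rest of this article (Hits in particular was removed in later releases). The class BooleanQuerySketch, the method CountMatches, the pageSize parameter, and the sample sentences are all invented for the illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical example class; not part of Lucene.Net.
public static class BooleanQuerySketch
{
    // Indexes the given sentences into an in-memory directory.
    static RAMDirectory BuildIndex(string[] texts)
    {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        foreach (string text in texts)
        {
            Document doc = new Document();
            doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }
        writer.Close();
        return dir;
    }

    // Returns how many documents contain both terms; prints at most pageSize of them.
    public static int CountMatches(string[] texts, string term1, string term2, int pageSize)
    {
        IndexSearcher searcher = new IndexSearcher(BuildIndex(texts));

        // TermQuery terms are not analyzed, so they must match the indexed
        // (lower-cased) form exactly.
        BooleanQuery query = new BooleanQuery();
        query.Add(new TermQuery(new Term("content", term1)), BooleanClause.Occur.MUST);
        query.Add(new TermQuery(new Term("content", term2)), BooleanClause.Occur.MUST);

        Hits hits = searcher.Search(query);
        int total = hits.Length();

        // Only touch the first page of hits instead of the whole result list.
        int shown = Math.Min(pageSize, total);
        for (int i = 0; i < shown; i++)
        {
            Console.WriteLine(hits.Score(i) + "  " + hits.Doc(i).Get("content"));
        }

        searcher.Close();
        return total;
    }

    public static void Main()
    {
        string[] texts = {
            "The quick brown fox jumps over the lazy dog",
            "A quick red fox",
            "A slow green turtle"
        };
        CountMatches(texts, "quick", "fox", 10);
    }
}
```

Capping the loop at a page size is one simple way to follow the advice about not iterating over every result in a large index.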
As I mentioned earlier, there are many implementations of the Query class, and each of them has its place. Mostly, though, you wouldn’t create a query object yourself, but let a powerful parser build a complex query for you from some simple syntax, much like how you search Google. This is where I introduce you to the QueryParser. A QueryParser instance has a method called Parse(string query). Here is a small example of using the QueryParser:
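// Use the same kind of analyzer that was used to build the index.
Lucene.Net.Analysis.Analyzer analyzer =
    new Lucene.Net.Analysis.Standard.StandardAnalyzer();

// "content" is the default field for terms that do not name a field.
Lucene.Net.QueryParsers.QueryParser queryParser =
    new Lucene.Net.QueryParsers.QueryParser("content", analyzer);
Lucene.Net.Search.Query query = queryParser.Parse("fox");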
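For a slightly fuller picture, the sketch below feeds a few user-style query strings through the QueryParser and runs each one against a tiny in-memory index. It is only an illustration under the same assumption as before, namely the Lucene.Net 2.x-era API used throughout this article. The class QueryParserSketch, the method CountHits, and the sample sentences are hypothetical. Because the parser runs the query text through the analyzer, a capitalized word such as "Fox" can still match the lower-cased term in the index.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical example class; not part of Lucene.Net.
public static class QueryParserSketch
{
    // Parses userInput against the "content" field and returns the number of hits.
    public static int CountHits(string userInput)
    {
        Analyzer analyzer = new StandardAnalyzer();

        // Build a two-document index in memory.
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        string[] texts = {
            "The quick brown fox jumps over the lazy dog",
            "A slow green turtle"
        };
        foreach (string text in texts)
        {
            Document doc = new Document();
            doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }
        writer.Close();

        // Let the parser turn the user's text into a Query object.
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.Parse(userInput);
        Console.WriteLine(userInput + "  =>  " + query.ToString("content"));

        IndexSearcher searcher = new IndexSearcher(dir);
        int count = searcher.Search(query).Length();
        searcher.Close();
        return count;
    }

    public static void Main()
    {
        string[] inputs = { "Fox", "quick AND turtle", "fox OR turtle", "content:slow" };
        foreach (string input in inputs)
        {
            Console.WriteLine(CountHits(input) + " hit(s)");
        }
    }
}
```

Printing the parsed query next to the raw input is a handy way to see which Query implementation the parser chose for a given piece of syntax.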
And, if you think all this stuff is neat, we have barely even scratched the surface. But, this will be all of the article for now. If you want to find out some more, let me know, and I’ll work on another article about Lucene.Net.