What is Lucene.Net?
Lucene.Net is a high-performance Information Retrieval (IR) library, also known as a search engine library. It provides powerful APIs for creating full-text indexes and for building advanced, precise search features into your programs. Some people confuse Lucene.Net with a ready-to-use application such as a web crawler or a file search tool, but it is not an application; it is a library that gives you the building blocks for implementing these difficult technologies yourself. Lucene.Net places no restrictions on what you can index and search, which gives you a lot more power than other full-text indexing and searching implementations: you can index anything that can be represented as text. There are also ways to get Lucene.Net to index HTML, Office documents, PDF files, and much more.
Lucene.Net is an API-for-API port of the original Lucene project, which is written in Java. Even the unit tests were ported to guarantee quality. A Lucene.Net index is also fully compatible with a Lucene index, and both libraries can be used on the same index without problems. A number of products have used Lucene and Lucene.Net to build their search features; well-known websites include Wikipedia, CNET, Monster.com, Mayo Clinic, FedEx, and many more. It is not just websites, though: Lookout, a search tool for Microsoft Outlook, was built on Lucene.Net and made Outlook's integrated search look painfully slow and inaccurate.
Lucene.Net is currently undergoing incubation at the Apache Software Foundation. Its source code is held in a Subversion repository and can be found here. If you need help downloading the source, you can use a free client such as TortoiseSVN or RapidSVN. The Lucene.Net project always welcomes new contributors, and remember that there are many ways to contribute to an open source project other than writing code.
How Do I Get Lucene.Net to Work with Synonyms?
The goal is to search for a word and retrieve results that contain other words with the same meaning. This lets you search by meaning rather than by exact keywords.
We can easily get Lucene.Net to work with synonyms by creating a custom Analyzer class. The Analyzer will inject the synonyms into the full-text index. For details on the internals of an Analyzer, please see my previous article, Lucene.Net – Text Analysis.
Creating the Analyzer
The first thing we want to do is abstract the work of getting the synonyms, so we will create a simple interface for it.
public interface ISynonymEngine
{
    IEnumerable<string> GetSynonyms(string word);
}
Great, now let’s work on an implementation of the synonym engine.
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Xml;

public class XmlSynonymEngine : ISynonymEngine
{
    // Each group holds words that are treated as synonyms of one another.
    private List<ReadOnlyCollection<string>> SynonymGroups =
        new List<ReadOnlyCollection<string>>();

    public XmlSynonymEngine(string xmlSynonymFilePath)
    {
        XmlDocument Doc = new XmlDocument();
        Doc.Load(xmlSynonymFilePath);

        var groupNodes = Doc.SelectNodes("/synonyms/group");
        foreach (XmlNode g in groupNodes)
        {
            // Select only the <syn> children so that comments and other
            // child nodes of <group> are not added as synonyms.
            XmlNodeList synNodes = g.SelectNodes("child::syn");
            List<string> synonymGroupList = new List<string>();
            foreach (XmlNode synNode in synNodes)
            {
                synonymGroupList.Add(synNode.InnerText.Trim());
            }
            SynonymGroups.Add(new ReadOnlyCollection<string>(synonymGroupList));
        }
    }

    #region ISynonymEngine Members

    // Returns the first group that contains the word (the group includes
    // the word itself), or null when the word has no synonyms.
    public IEnumerable<string> GetSynonyms(string word)
    {
        foreach (var synonymGroup in SynonymGroups)
        {
            if (synonymGroup.Contains(word))
            {
                return synonymGroup;
            }
        }
        return null;
    }

    #endregion
}
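Because the rest of the code depends only on ISynonymEngine, the XML file can be swapped for any other source. As a minimal sketch, here is a hypothetical InMemorySynonymEngine (the class name and its constructor are my own and are not part of the attached code) that can be handy in unit tests. The interface is repeated only so that the snippet compiles on its own.

```csharp
using System.Collections.Generic;

// Repeated from above so the snippet is self-contained.
public interface ISynonymEngine
{
    IEnumerable<string> GetSynonyms(string word);
}

// Hypothetical engine that keeps the synonym groups in memory.
public class InMemorySynonymEngine : ISynonymEngine
{
    // Maps each word to the group it belongs to.
    private readonly Dictionary<string, string[]> groupsByWord =
        new Dictionary<string, string[]>();

    public InMemorySynonymEngine(params string[][] groups)
    {
        foreach (string[] group in groups)
        {
            foreach (string word in group)
            {
                // Keep the first group a word appears in, which mirrors
                // the first-match lookup of XmlSynonymEngine.
                if (!groupsByWord.ContainsKey(word))
                {
                    groupsByWord[word] = group;
                }
            }
        }
    }

    // Returns the whole group (including the word itself), or null
    // when the word is unknown, like XmlSynonymEngine does.
    public IEnumerable<string> GetSynonyms(string word)
    {
        string[] group;
        return groupsByWord.TryGetValue(word, out group) ? group : null;
    }
}
```

It could be created with something like new InMemorySynonymEngine(new[] { "fast", "quick", "rapid" }, new[] { "slow", "decrease" }). Like the XML version, the lookup is case sensitive, so the words should be stored in lower case if the engine is used behind a LowerCaseFilter.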
Now let's look at a sample document that our XmlSynonymEngine will read:
<?xml version="1.0" encoding="utf-8"?>
<synonyms>
<group>
<syn>fast</syn>
<syn>quick</syn>
<syn>rapid</syn>
</group>
<group>
<syn>slow</syn>
<syn>decrease</syn>
</group>
<group>
<syn>google</syn>
<syn>search</syn>
</group>
<group>
<syn>check</syn>
<syn>lookup</syn>
<syn>look</syn>
</group>
</synonyms>
When creating an analyzer that adds a new capability to Lucene, it is usually best to put the logic in a Tokenizer or TokenFilter class rather than in the Analyzer class itself. Injecting synonyms is a job for a TokenFilter, so I will create a SynonymFilter class that derives from TokenFilter. This implementation only needs to override one method of the TokenFilter base class: the Next() method, which returns the next token. Each synonym is emitted with a position increment of 0, so it is indexed at the same position as the original term. Here is the implementation for the SynonymFilter class:
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;

public class SynonymFilter : TokenFilter
{
    // Synonyms waiting to be returned after the token they belong to.
    private Queue<Token> synonymTokenQueue
        = new Queue<Token>();

    public ISynonymEngine SynonymEngine { get; private set; }

    public SynonymFilter(TokenStream input, ISynonymEngine synonymEngine)
        : base(input)
    {
        if (synonymEngine == null)
            throw new ArgumentNullException("synonymEngine");

        SynonymEngine = synonymEngine;
    }

    public override Token Next()
    {
        // Drain synonyms queued for the previous token first.
        if (synonymTokenQueue.Count > 0)
        {
            return synonymTokenQueue.Dequeue();
        }

        // Read the next token from the wrapped stream.
        Token t = input.Next();
        if (t == null)
            return null;

        IEnumerable<string> synonyms = SynonymEngine.GetSynonyms(t.TermText());
        if (synonyms == null)
        {
            return t;
        }

        foreach (string syn in synonyms)
        {
            // The group contains the original word too, so skip it.
            if ( ! t.TermText().Equals(syn))
            {
                Token synToken = new Token(syn, t.StartOffset(),
                    t.EndOffset(), "<SYNONYM>");
                // Increment 0 puts the synonym at the original token's position.
                synToken.SetPositionIncrement(0);
                synonymTokenQueue.Enqueue(synToken);
            }
        }

        return t;
    }
}
And finally, the SynonymAnalyzer:
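using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class SynonymAnalyzer : Analyzer
{
    public ISynonymEngine SynonymEngine { get; private set; }

    public SynonymAnalyzer(ISynonymEngine engine)
    {
        // Fail early here instead of later inside SynonymFilter.
        if (engine == null)
            throw new ArgumentNullException("engine");

        SynonymEngine = engine;
    }

    public override TokenStream TokenStream
        (string fieldName, System.IO.TextReader reader)
    {
        // Same chain as the StandardAnalyzer, with the synonym filter last
        // so that it sees lower-cased terms with stop words removed.
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
        result = new SynonymFilter(result, SynonymEngine);
        return result;
    }
}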
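If you want to inspect the tokens without the Analyzer Viewer, a small console helper is enough. The sketch below is not part of the attached code: TokenDump and the field name "content" are names I made up, and it assumes the same Lucene.Net 2.x token API (Next(), TermText(), Type(), GetPositionIncrement()) that the filter above is written against.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical helper that prints every token an analyzer produces.
public static class TokenDump
{
    public static void Print(Analyzer analyzer, string text)
    {
        // The field name is arbitrary here; SynonymAnalyzer ignores it.
        TokenStream stream = analyzer.TokenStream("content", new StringReader(text));
        try
        {
            Token t;
            while ((t = stream.Next()) != null)
            {
                // Injected synonyms have type <SYNONYM> and a position
                // increment of 0.
                Console.WriteLine("{0} [{1}-{2}] type={3} posIncr={4}",
                    t.TermText(), t.StartOffset(), t.EndOffset(),
                    t.Type(), t.GetPositionIncrement());
            }
        }
        finally
        {
            stream.Close();
        }
    }
}
```

Calling TokenDump.Print once with a StandardAnalyzer and once with a SynonymAnalyzer on the same text should make the injected tokens easy to spot.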
Now let's see the results:
Analyzer Viewer, Looking at the Tokens using The StandardAnalyzer
Analyzer Viewer, Looking at the Tokens using The SynonymAnalyzer
Points of Interest
The SynonymAnalyzer is great for indexing, but I think it could clutter a Query if you also use it with a QueryParser to construct queries, because every query term would be expanded with its synonyms as well. One way around this is to modify the SynonymFilter and SynonymAnalyzer to take a bool switch that turns synonym injection on and off. That way, you can turn synonym injection off while using the analyzer with a QueryParser.
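Here is one possible shape for such a switch, as a sketch only. SwitchableSynonymAnalyzer and its InjectSynonyms property are hypothetical names that do not exist in the attached code. Instead of changing the SynonymFilter, this variant keeps the switch in the analyzer and simply leaves the filter out of the chain when injection is off, which has the same effect with less code. It relies on the ISynonymEngine and SynonymFilter types from this article.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Hypothetical variant of SynonymAnalyzer with an on/off switch.
public class SwitchableSynonymAnalyzer : Analyzer
{
    private readonly ISynonymEngine synonymEngine;

    // True: inject synonyms (indexing). False: plain analysis (querying).
    public bool InjectSynonyms { get; private set; }

    public SwitchableSynonymAnalyzer(ISynonymEngine engine, bool injectSynonyms)
    {
        if (engine == null)
            throw new ArgumentNullException("engine");

        synonymEngine = engine;
        InjectSynonyms = injectSynonyms;
    }

    public override TokenStream TokenStream
        (string fieldName, System.IO.TextReader reader)
    {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);

        // Only add the synonym filter when injection is switched on.
        if (InjectSynonyms)
        {
            result = new SynonymFilter(result, synonymEngine);
        }
        return result;
    }
}
```

You would then index with new SwitchableSynonymAnalyzer(engine, true) and hand an instance created with false to the QueryParser. Since the rest of the chain is identical in both modes, query terms are still lower-cased and stripped of stop words in the same way as the indexed text.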
The attached code includes the Analyzer Viewer application from my last article, updated to include our brand new synonym analyzer.
History
- 1/2/2009 - Initial release