Lucene Full Text Search - A Very Basic Tutorial

Han Bo Sun

26 Sep 2019 | MIT | 13 min read

A simple tutorial on using Apache Lucene for full text search

Download demo project - 8.5 KB

Introduction

This is the fourth tutorial I have written this year. For this one, I wanted to research one of my favorite subjects: full text search engines. The one most people use is Apache Lucene. I have used it before and wrote a tutorial about using it with Hibernate. This time, I want to explore it on its own, without mixing in other technologies. I will build a simple Java console application that performs three functions:

Index some documents
Full text search to find the target document
Get document by unique identifier

The program also performs some miscellaneous functions, such as deleting all documents from the index or deleting only some of them.

The program uses a file system directory as the index repository. The version of Apache Lucene used here is 8.2.0. Only the lucene-core library is needed to get all of this to work.

Background

Working with Lucene seems complicated, but it is quite simple if you compare it with a relational database. Let me lay out some terminology. First, there is the concept of an index. An index is like a table: it can store a lot of documents. A document is like a row in that table. As we all know, a table has one or more columns. Likewise, a document has one or more fields, and these fields are like the columns of a table in a relational database.

So, adding a document is like adding a row to a table, and finding documents in an index is like querying the table for the rows that match the query criteria. Querying a table in a relational database means specifying criteria against its columns. Finding documents in an index works the same way: you specify search terms against the fields of the documents.
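To make the analogy concrete, here is a minimal in-memory sketch in plain Java. It does not use Lucene at all; the names ToyIndex, addDocument and findByField are hypothetical and exist only for illustration. Each document is a map of field names to values, and a lookup filters on one field, much like a WHERE clause on one column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model, not Lucene: an "index" holds documents,
// and each document is a set of named fields.
class ToyIndex {
    private final List<Map<String, String>> documents = new ArrayList<>();

    // Adding a document is like inserting a row.
    void addDocument(Map<String, String> doc) {
        documents.add(new LinkedHashMap<>(doc));
    }

    // Finding documents is like querying by a column value.
    List<Map<String, String>> findByField(String field, String value) {
        List<Map<String, String>> found = new ArrayList<>();
        for (Map<String, String> doc : documents) {
            if (value.equals(doc.get(field))) {
                found.add(doc);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("DOCID", "1");
        doc.put("CATEGORY", "Index File Sample");
        index.addDocument(doc);
        System.out.println(index.findByField("CATEGORY", "Index File Sample").size());
    }
}
```

A real Lucene index differs in an important way: it matches on analyzed tokens rather than on whole field values, which is what the rest of this tutorial is about.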

For this sample application, I will use the file system to store the document index. After documents are added to an index, the index directory is populated with the index files that Lucene generates.

Let's start with the way in which indexing a document works.

Indexing a Document

To perform a full text search, the first thing you have to do is add some documents to the index. The Apache Lucene library provides two types for this: Document and IndexableField. A document contains multiple indexable fields. Once a document with its indexable fields is created, it can be added to the full text search index. IndexableField is an interface; its implementations include TextField, StringField, IntPoint, FloatPoint, IntRange, FloatRange, and many other field types. For this tutorial, I use TextField and StringField, plus a StoredField for the document time.

The reason there are so many field types is that different kinds of values can be analyzed differently and still be added to the same searchable index as part of a single document. The difference between TextField and StringField is that the value of a TextField is broken into words (tokens). In English, a sentence is made of words separated by spaces and punctuation marks. If a sentence is stored in a TextField, all the words are extracted and each becomes a searchable token; Lucene associates the document with all of these words. If the sentence is stored in a StringField, the entire sentence is treated as one token. Values in numeric fields are treated as numbers, so equality, greater-than, less-than, and other range-based comparisons can be used to query these fields and locate documents.
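The difference can be illustrated with a small plain-Java sketch. This is a deliberately simplified model and not Lucene's actual analysis chain: the helper names textFieldTokens and stringFieldTokens are hypothetical, and the toy tokenizer just lowercases the value and splits on anything that is not a letter or digit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model of analyzed vs. un-analyzed fields (not Lucene code).
class ToyFieldTokens {
    // TextField-like: break the value into lowercase word tokens.
    static List<String> textFieldTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // StringField-like: the whole value is one token, kept verbatim.
    static List<String> stringFieldTokens(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String title = "Quick brown fox and the lazy dog";
        System.out.println(textFieldTokens(title));   // seven separate tokens
        System.out.println(stringFieldTokens(title)); // one token
    }
}
```

In this model, a search for "fox" can hit the TextField-like tokens, but it can never equal the single StringField-like token, which has to be matched as a whole.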

Let's take a look at how to create an indexable document. Here is the code:

Java

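// Imports needed at the top of the file (Lucene 8.x):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexableField;

public Document createIndexDocument(IndexableDocument docToAdd)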
{
   Document retVal = new Document();

   IndexableField docIdField = new StringField("DOCID",
      docToAdd.getDocumentId(),
      Field.Store.YES);
   IndexableField titleField = new TextField("TITLE",
      docToAdd.getTitle(),
      Field.Store.YES);
   IndexableField contentField = new TextField("CONTENT",
      docToAdd.getContent(),
      Field.Store.NO);
   IndexableField keywordsField = new TextField("KEYWORDS",
      docToAdd.getKeywords(),
      Field.Store.YES);
   IndexableField categoryField = new StringField("CATEGORY",
      docToAdd.getCategory(),
      Field.Store.YES);
   IndexableField authorNameField = new TextField("AUTHOR",
      docToAdd.getAuthorName(),
      Field.Store.YES);
   long createTime = docToAdd.getDocumentDate().getTime();
   IndexableField documentTimeField = new StoredField("DOCTIME", createTime);
   IndexableField emailField = new StringField("AUTHOREMAIL",
      docToAdd.getAuthorEmail(),
      Field.Store.YES);

   retVal.add(docIdField);
   retVal.add(titleField);
   retVal.add(contentField);
   retVal.add(keywordsField);
   retVal.add(categoryField);
   retVal.add(authorNameField);
   retVal.add(documentTimeField);
   retVal.add(emailField);

   return retVal;
}

The above code snippet is not hard to understand. It creates a Document object, then creates a number of IndexableField objects and adds them all to the Document. Finally, the Document object is returned from the method. This method can be found in the file "FileBasedDocumentIndexer.java".

In the above code snippet, I have used TextField and StringField. The constructors of these two types take three parameters. The first is the field name. The second is the value of the field. The last is an enum value that indicates whether to store the value in the index or to index it and then discard the original value. The difference is that when a value is stored, you can read it back when you retrieve the document; a field that is indexed but not stored (CONTENT here) can still be searched, but its value cannot be retrieved from the index. The DOCTIME field uses StoredField, which does the opposite: the value is stored so it can be retrieved, but it is not indexed for searching.

The next step is the actual indexing of the document. Here is the code:

Java

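// Imports needed at the top of the file (Lucene 8.x):
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

String indexDirectory;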
...

public void indexDocument(Document docToAdd) throws Exception
{
   IndexWriter writer = null;
   try
   {
      Directory indexWriteToDir =
            FSDirectory.open(Paths.get(indexDirectory));

      writer = new IndexWriter(indexWriteToDir, new IndexWriterConfig());
      writer.addDocument(docToAdd);
      writer.flush();
      writer.commit();
   }
   finally
   {
      if (writer != null)
      {
         writer.close();
      }
   }
}

The above code snippet uses the static open() method of the FSDirectory class to get a reference to the index directory, an object of type Directory (a type in Apache Lucene). Next, I instantiate an IndexWriter object. Its constructor takes two arguments: the first is the Directory object; the second is a configuration object of type IndexWriterConfig. I use the default configuration, which uses the StandardAnalyzer. The standard analyzer works with English sentences out of the box. There are many other analyzers, which you can use by including an additional jar, and you can implement your own analyzer if you want to. For the sake of simplicity, this tutorial uses only the StandardAnalyzer.

Next, I use the IndexWriter object's addDocument() method to add the document to the full text search index. This is where the document indexing happens. It is followed by calls to flush() and commit() to make sure the change is fully committed to the index. Strictly speaking, commit() alone is enough, because it flushes pending changes before committing.

I wrapped the whole operation in a try-finally block with no catch block, so that any exception can be handled by the caller. The finally block closes the writer to release its resources. If you run the sample project and step past this method, you will see Lucene's index files appear in the index directory.
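The same cleanup guarantee can also be written with Java's try-with-resources statement, which works for any AutoCloseable (IndexWriter implements Closeable). The sketch below is plain Java with a hypothetical FakeWriter stand-in so that it runs without Lucene. It shows that the resource is closed even when the body throws, and that the exception still reaches the caller.

```java
// Hypothetical sketch of the cleanup pattern; FakeWriter is not a Lucene class.
class TryWithResourcesSketch {
    static class FakeWriter implements AutoCloseable {
        static boolean closed = false;

        void add(String doc) {
            if (doc == null) {
                throw new IllegalArgumentException("document must not be null");
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // No catch block: exceptions propagate to the caller,
    // but close() is always invoked on the way out.
    static void index(String doc) {
        try (FakeWriter writer = new FakeWriter()) {
            writer.add(doc);
        }
    }

    public static void main(String[] args) {
        try {
            index(null);
        } catch (IllegalArgumentException ex) {
            System.out.println("caller saw: " + ex.getMessage());
        }
        System.out.println("closed: " + FakeWriter.closed);
    }
}
```

Whether to use try-finally or try-with-resources is mostly a matter of taste; the latter is shorter and harder to get wrong when more than one resource is involved.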

Now that we know how to index a document, next we will take a look at how to search for this document.

Full Text Search to Locate Documents

Before I present the code, I would like to explain my design intention. I want to search all fields of the documents in the index. If any field matches the full text search criteria, the document is considered found.

Here is the code:

Java

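// Imports needed at the top of the file (Lucene 8.x):
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.QueryBuilder;

public List<FoundDocument> searchForDocument(String searchVal)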
{
   List<FoundDocument> retVal = new ArrayList<FoundDocument>();

   try
   {
      Directory dirOfIndexes =
            FSDirectory.open(Paths.get(indexDirectory));

      IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dirOfIndexes));

      QueryBuilder bldr = new QueryBuilder(new StandardAnalyzer());
      Query q1 = bldr.createPhraseQuery("TITLE", searchVal);
      Query q2 = bldr.createPhraseQuery("KEYWORDS", searchVal);
      Query q3 = bldr.createPhraseQuery("CONTENT", searchVal);
      Query q4 = new TermQuery(new Term("CATEGORY", searchVal));
      Query q5 = bldr.createPhraseQuery("AUTHOR", searchVal);
      Query q6 = new TermQuery(new Term("AUTHOREMAIL", searchVal));

      BooleanQuery.Builder chainQryBldr = new BooleanQuery.Builder();
      chainQryBldr.add(q1, Occur.SHOULD);
      chainQryBldr.add(q2, Occur.SHOULD);
      chainQryBldr.add(q3, Occur.SHOULD);
      chainQryBldr.add(q4, Occur.SHOULD);
      chainQryBldr.add(q5, Occur.SHOULD);
      chainQryBldr.add(q6, Occur.SHOULD);

      BooleanQuery finalQry = chainQryBldr.build();

      TopDocs allFound = searcher.search(finalQry, 100);
      if (allFound.scoreDocs != null)
      {
         for (ScoreDoc doc : allFound.scoreDocs)
         {
            System.out.println("Score: " + doc.score);

            int docidx = doc.doc;
            Document docRetrieved = searcher.doc(docidx);
            if (docRetrieved != null)
            {
               FoundDocument docToAdd = new FoundDocument();

               IndexableField field = docRetrieved.getField("TITLE");
               if (field != null)
               {
                  docToAdd.setTitle(field.stringValue());
               }

               field = docRetrieved.getField("DOCID");
               if (field != null)
               {
                  docToAdd.setDocumentId(field.stringValue());
               }

               field = docRetrieved.getField("KEYWORDS");
               if (field != null)
               {
                  docToAdd.setKeywords(field.stringValue());
               }

               field = docRetrieved.getField("CATEGORY");
               if (field != null)
               {
                  docToAdd.setCategory(field.stringValue());
               }

               if (docToAdd.validate())
               {
                  retVal.add(docToAdd);
               }
            }
         }
      }
   }
   catch (Exception ex)
   {
      ex.printStackTrace();
   }

   return retVal;
}

This code snippet can be split into several parts. The first part opens the directory of the full text search index:

Java

Directory dirOfIndexes = 
   FSDirectory.open(Paths.get(indexDirectory));

IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dirOfIndexes));

This part is similar to the way the directory is opened for adding a document to the index, but instead of an IndexWriter, I am using an IndexSearcher object. Again, the analyzer I use is the StandardAnalyzer. On the indexing side you won't see StandardAnalyzer referenced, because IndexWriterConfig uses it by default. On the search side, I pass a StandardAnalyzer to the QueryBuilder so that the search text is analyzed the same way as the indexed text.

Next, I need to create a query. The query basically specifies that if any field contains the phrase (passed in as the parameter of my method), then the document is considered found. Since I have multiple fields, I have to create the query like this:

Java

QueryBuilder bldr = new QueryBuilder(new StandardAnalyzer());
Query q1 = bldr.createPhraseQuery("TITLE", searchVal);
Query q2 = bldr.createPhraseQuery("KEYWORDS", searchVal);
Query q3 = bldr.createPhraseQuery("CONTENT", searchVal);
Query q4 = new TermQuery(new Term("CATEGORY", searchVal));
Query q5 = bldr.createPhraseQuery("AUTHOR", searchVal);
Query q6 = new TermQuery(new Term("AUTHOREMAIL", searchVal));

BooleanQuery.Builder chainQryBldr = new BooleanQuery.Builder();
chainQryBldr.add(q1, Occur.SHOULD);
chainQryBldr.add(q2, Occur.SHOULD);
chainQryBldr.add(q3, Occur.SHOULD);
chainQryBldr.add(q4, Occur.SHOULD);
chainQryBldr.add(q5, Occur.SHOULD);
chainQryBldr.add(q6, Occur.SHOULD);

BooleanQuery finalQry = chainQryBldr.build();

The above code snippet first creates six different Query objects, each targeting one field of the document. As you can see, I used two different kinds of query. One is a phrase query, which analyzes the input string into tokens and matches documents where those tokens appear next to each other, in order, in the field value. The other is a term query, which matches a single indexed term exactly. The reason for using two kinds is that a StringField value is indexed as one un-analyzed token, so an analyzed phrase query generally will not match it. For the StringField fields, I therefore use a term query, which matches only when the search value equals the entire field value. This simple approach is enough to make the query work.
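The behavioural difference between the two kinds of query can be sketched in plain Java. This is a simplified model rather than Lucene's implementation, and the helper names are hypothetical: a phrase match looks for the query tokens as a consecutive run inside the field's tokens, while a term match compares the search value against one whole, un-analyzed field value.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model of phrase vs. term matching (not Lucene code).
class ToyMatching {
    // Simplified analysis: lowercase, then split on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Phrase-like: query tokens must appear consecutively in the field tokens.
    static boolean phraseMatches(String fieldValue, String query) {
        return Collections.indexOfSubList(analyze(fieldValue), analyze(query)) >= 0;
    }

    // Term-like on an un-analyzed field: the whole value must match exactly.
    static boolean termMatches(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    public static void main(String[] args) {
        String content = "gloves are available as either sterile or non-sterile";
        System.out.println(phraseMatches(content, "available as either")); // true
        System.out.println(phraseMatches(content, "either as available")); // false
        System.out.println(termMatches("Index File Sample", "Index File Sample")); // true
        System.out.println(termMatches("Index File Sample", "index file"));        // false
    }
}
```

The last line shows the practical consequence for fields like CATEGORY: a partial or differently-cased search value does not match a term query.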

Once I have constructed the queries, I need to create a master query that combines all six of them. The logic should be: if field #1 matches the input query value, or field #2 matches the input query value, or field #3 matches the input query value, ..., or field #6 matches the input query value, then the document should be retrieved. We can create such a master query using the builder of BooleanQuery. Once I have created the builder object, I add all six queries to it one by one. Each query is added with the Occur value Occur.SHOULD, which is equivalent to the logical operator "OR". If you want something equivalent to the logical operator "AND", you can use Occur.MUST. For my scenario, Occur.SHOULD is the one I need.

The last line constructs the final master query. Then I need to run the query against the index. Here is how to do it:
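The OR-style combination can be modelled with java.util.function.Predicate. This is only an analogy in plain Java with hypothetical names. Unlike real SHOULD clauses, it does not capture scoring, where each additional matching clause also raises a document's score.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical toy model of SHOULD (OR) vs. MUST (AND) clauses, not Lucene code.
class ToyBooleanQuery {
    // A clause that tests whether one field contains the search value.
    static Predicate<Map<String, String>> fieldContains(String field, String value) {
        return doc -> doc.getOrDefault(field, "").contains(value);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("TITLE", "Quick brown fox and the lazy dog");
        doc.put("AUTHOR", "Lucene Test User");

        Predicate<Map<String, String>> inTitle = fieldContains("TITLE", "fox");
        Predicate<Map<String, String>> inAuthor = fieldContains("AUTHOR", "fox");

        System.out.println(inTitle.or(inAuthor).test(doc));  // SHOULD-like: true
        System.out.println(inTitle.and(inAuthor).test(doc)); // MUST-like: false
    }
}
```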

Java

TopDocs allFound = searcher.search(finalQry, 100);

I used the search() method of the IndexSearcher object to find the most relevant documents. This method takes two parameters. The first is the master query. The second is the maximum number of documents to return. TopDocs holds the documents found to be most relevant to the query. Each entry in its scoreDocs array is a ScoreDoc, which carries an integer document number (doc) and a score. The score indicates how relevant the document is to the search criteria.

Now that I have a collection of hits, I will fetch the documents and extract the information I need. Here is the full code:
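Conceptually, the result is the list of hits sorted by descending score and cut off at the requested maximum. The plain-Java sketch below models that with a hypothetical Hit class standing in for ScoreDoc. It illustrates the idea only and is not how Lucene collects results internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical toy model of "top N hits by score" (not Lucene code).
class ToyTopDocs {
    static class Hit {
        final int doc;      // internal document number
        final float score;  // relevance score

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Return at most n hits, highest score first.
    static List<Hit> top(List<Hit> hits, int n) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(0, 0.31f));
        hits.add(new Hit(1, 0.39f));
        hits.add(new Hit(2, 0.12f));
        for (Hit h : top(hits, 2)) {
            System.out.println("doc=" + h.doc + " score=" + h.score);
        }
    }
}
```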

Java

if (allFound.scoreDocs != null)
{
   for (ScoreDoc doc : allFound.scoreDocs)
   {
      System.out.println("Score: " + doc.score);
      
      int docidx = doc.doc;
      Document docRetrieved = searcher.doc(docidx);
      if (docRetrieved != null)
      {
         FoundDocument docToAdd = new FoundDocument();

         IndexableField field = docRetrieved.getField("TITLE");
         if (field != null)
         {
            docToAdd.setTitle(field.stringValue());
         }
         
         field = docRetrieved.getField("DOCID");
         if (field != null)
         {
            docToAdd.setDocumentId(field.stringValue());
         }
         
         field = docRetrieved.getField("KEYWORDS");
         if (field != null)
         {
            docToAdd.setKeywords(field.stringValue());
         }
         
         field = docRetrieved.getField("CATEGORY");
         if (field != null)
         {
            docToAdd.setCategory(field.stringValue());
         }
         
         if (docToAdd.validate())
         {
            retVal.add(docToAdd);
         }
      }
   }
}

The above code snippet iterates through the found documents. For each one, I first output the relevance score. Next, I take the integer document number via doc.doc. Finally, I use the searcher object to retrieve the document by that number. Here is the code snippet:

Java

System.out.println("Score: " + doc.score);

int docidx = doc.doc;
Document docRetrieved = searcher.doc(docidx);

Once I have retrieved the document, I extract the field values and store them in my own document object. This is the code that does it:

Java

if (docRetrieved != null)
{
   FoundDocument docToAdd = new FoundDocument();

   IndexableField field = docRetrieved.getField("TITLE");
   if (field != null)
   {
      docToAdd.setTitle(field.stringValue());
   }
   
   field = docRetrieved.getField("DOCID");
   if (field != null)
   {
      docToAdd.setDocumentId(field.stringValue());
   }
   
   field = docRetrieved.getField("KEYWORDS");
   if (field != null)
   {
      docToAdd.setKeywords(field.stringValue());
   }
   
   field = docRetrieved.getField("CATEGORY");
   if (field != null)
   {
      docToAdd.setCategory(field.stringValue());
   }
   
   if (docToAdd.validate())
   {
      retVal.add(docToAdd);
   }
}

This is the end of the most exciting part of this tutorial. Next, I will go over some miscellaneous operations.

Some Other Fun Stuff

We all know that in relational databases, a row in a table usually has a unique identifier. When I use Lucene, I do the same: I use one field of the document to store a unique identifier. I use Java's UUID class to create a GUID value, take its string representation, and then remove the dash characters. Here is an example of a GUID value I use:

77bbd895bb6f4c16bb637a44d8ea6f1e
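Generating such an identifier needs only the JDK. Here is a minimal sketch; the class and helper names (DocumentIds, newDocumentId) are hypothetical and chosen for illustration.

```java
import java.util.UUID;

class DocumentIds {
    // Random UUID rendered as 32 lowercase hex characters, dashes removed.
    static String newDocumentId() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(newDocumentId());
    }
}
```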

Assume the field that stores the unique identifier is called "DOCID". To locate a document with a given ID, here is how I do it:

Java

public Document getDocumentById(String docId)
{
   Document retVal = null;
   try
   {
      Directory dirOfIndexes =
            FSDirectory.open(Paths.get(indexDirectory));

      StandardAnalyzer analyzer = new StandardAnalyzer();
      IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dirOfIndexes));
      QueryBuilder quryBldr = new QueryBuilder(analyzer);

      Query idQury = quryBldr.createPhraseQuery("DOCID", docId);
      TopDocs foundDocs = searcher.search(idQury, 1);
      if (foundDocs != null)
      {
         if (foundDocs.scoreDocs != null && foundDocs.scoreDocs.length > 0)
         {
            System.out.println("Score: " + foundDocs.scoreDocs[0].score);
            retVal = searcher.doc(foundDocs.scoreDocs[0].doc);
         }
      }
   }
   catch (Exception ex)
   {
      ex.printStackTrace();
   }

   return retVal;
}

This is pretty similar to the search method presented in the previous section. First, I open the Directory that contains the full text index, then I create an IndexSearcher object. Using a QueryBuilder, I create a phrase query that searches for the input GUID value in just one field, "DOCID". This appears to work on the StringField because the dash-free, lowercase identifier passes through the analyzer as a single token; a TermQuery on "DOCID" would be a more direct choice. I specify that at most one document should be returned, so it is a one-result-or-nothing query. Whatever is found is the document I expect. Once the document is found, I use its integer document number to retrieve it and return the Lucene Document.

Next, I would like to discuss two operations that are useful for cleaning up. One is deleting a document from the full text index. The other is deleting all documents in a full text index. Both are easy to perform. First, let's check how to delete a single document from the index. Here is the code:

Java

public void deleteDocument(String docId) throws Exception
{
   IndexWriter writer = null;
   try
   {
      Directory indexWriteToDir =
            FSDirectory.open(Paths.get(indexDirectory));

      writer = new IndexWriter(indexWriteToDir, new IndexWriterConfig());
      writer.deleteDocuments(new Term("DOCID", docId));
      writer.flush();
      writer.commit();
   }
   finally
   {
      if (writer != null)
      {
         writer.close();
      }
   }
}

This code snippet shows why it is important to have a unique identifier for each document. It calls the IndexWriter's deleteDocuments() method, which uses a Term object to find all documents that match the term and then removes them. In the above code, I again use the field "DOCID" to find the document that matches the unique identifier. This method is not limited to deleting just one document: if the Term matches multiple documents, the method removes all of them.

Similarly, deleting all documents in the index can be done with a single call to deleteAll(). Here is how it can be done:

Java

public void deleteAllIndexes() throws Exception
{
   IndexWriter writer = null;
   try
   {
      Directory indexWriteToDir =
            FSDirectory.open(Paths.get(indexDirectory));

      writer = new IndexWriter(indexWriteToDir, new IndexWriterConfig());
      writer.deleteAll();
      writer.flush();
      writer.commit();
   }
   finally
   {
      if (writer != null)
      {
         writer.close();
      }
   }
}

In both methods, I open the directory and create an IndexWriter object with the directory object and the default configuration. Then I call the delete method of the IndexWriter object. Lastly, I flush the IndexWriter object and commit the changes.

This is it! All the little trinkets needed for basic document indexing and full text search. They are not much. But they work.

Test Run

The big question now is how to test all of this code. In my sample application, I have a class called IndexingMain. Inside, there is the main entry and a bunch of helper methods. Let me begin with the method that creates a document. Here it is:

Java

public static IndexableDocument prepareDocForTesting(String docId)
{
   IndexableDocument doc = new IndexableDocument();

   Calendar cal = Calendar.getInstance();
   cal.set(2018, Calendar.SEPTEMBER, 21, 13, 13, 13);

   doc.setDocumentId(docId);
   doc.setAuthorEmail("testuser@lucenetest.com");
   doc.setAuthorName("Lucene Test User");
   doc.setCategory("Index File Sample");
   doc.setContent("There are two main types of medical gloves: "
      + "examination and surgical. Surgical gloves have more "
      + "precise sizing with a better precision and sensitivity "
      + "and are made to a higher standard. Examination gloves "
      + "are available as either sterile or non-sterile, while "
      + "surgical gloves are generally sterile.");
   doc.setDocumentDate(cal.getTime());
   doc.setKeywords("Joseph, Brian, Clancy, Connery, Reynolds, Lindsay");
   doc.setTitle("Quick brown fox and the lazy dog");

   return doc;
}
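One detail in this helper is easy to misread: java.util.Calendar months are zero-based, so the month argument for September is 8 (the value of Calendar.SEPTEMBER), not 9. A quick check in plain Java, using a hypothetical class name:

```java
import java.util.Calendar;

class CalendarMonthCheck {
    // Builds the same date as the test document helper.
    static Calendar documentDate() {
        Calendar cal = Calendar.getInstance();
        cal.set(2018, Calendar.SEPTEMBER, 21, 13, 13, 13);
        return cal;
    }

    public static void main(String[] args) {
        // Prints the zero-based month index of the document date.
        System.out.println(documentDate().get(Calendar.MONTH));
    }
}
```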

IndexableDocument is a document type I created, so I have to convert an object of my document type into an Apache Lucene Document object. This is done with this code snippet, which you can find in the main entry:

Java

FileBasedDocumentIndexer indexer = new FileBasedDocumentIndexer("c:/DevJunk/Lucene/indexes");
...
Document lucDoc1 = indexer.createIndexDocument(doc1);
indexer.indexDocument(lucDoc1);

The last line in the above code snippet indexes the document into the Lucene file index. Now that we can successfully index a document, it is time to see how search works. Here it is:

Java

public static void testFindDocument(String searchTerm)
{
   LuceneDocumentLocator locator = new LuceneDocumentLocator("c:/DevJunk/Lucene/indexes");
   List<FoundDocument> foundDocs = locator.searchForDocument(searchTerm);
   
   if (foundDocs != null)
   {
      for (FoundDocument doc : foundDocs)
      {
         System.out.println("------------------------------");
         System.out.println("Found document...");
         System.out.println("Document Id: " + doc.getDocumentId());
         System.out.println("Title: " + doc.getTitle());
         System.out.println("Category: " + doc.getCategory());
         System.out.println("Keywords: " + doc.getKeywords());
         System.out.println("------------------------------");
      }
   }
}

Here is how this helper method is used in the main entry:

Java

...        
System.out.println("********************************");
System.out.println("Search first document");
testFindDocument("available as either");
System.out.println("********************************");
...

Lastly, I have created a helper method that finds a document by the "DOCID". Here it is:

Java

public static Document testGetDocumentById(String docId)
{
   LuceneDocumentLocator locator = new LuceneDocumentLocator("c:/DevJunk/Lucene/indexes");
   Document retVal = locator.getDocumentById(docId);

   if (retVal != null)
   {
      System.out.println("Get Document by Id [" +  docId + "] found.");
   }
   else
   {
      System.out.println("Get Document by Id [" +  docId + "] **not** found.");
   }

   return retVal;
}

This helper method is used like this:

Java

...
testGetDocumentById(id1);
...

Here it is, the main entry of the sample application:

Java

public static void main(String[] args)
{
   UUID x = UUID.randomUUID();
   String id1 = x.toString();
   id1 = id1.replace("-", "");
   System.out.println("Document #1 with id [" + id1 + "] has been created.");

   x = UUID.randomUUID();
   String id2 = x.toString();
   id2 = id2.replace("-", "");
   System.out.println("Document #2 with id [" + id2 + "] has been created.");

   IndexableDocument doc1 = prepareDocForTesting(id1);
   IndexableDocument doc2 = prepare2ndTestDocument(id2);

   FileBasedDocumentIndexer indexer =
            new FileBasedDocumentIndexer("c:/DevJunk/Lucene/indexes");
   try
   {
      indexer.deleteAllIndexes();

      Document lucDoc1 = indexer.createIndexDocument(doc1);
      indexer.indexDocument(lucDoc1);

      System.out.println("********************************");
      System.out.println("Search first document");
      testFindDocument("available as either");
      System.out.println("********************************");

      Document lucDoc2 = indexer.createIndexDocument(doc2);
      indexer.indexDocument(lucDoc2);

      testGetDocumentById(id1);

      System.out.println("********************************");
      System.out.println("Search second document");
      testFindDocument("coocoobird@moomootease.com");
      System.out.println("********************************");

      testGetDocumentById(id2);

      indexer.deleteAllIndexes();
   }
   catch (Exception ex)
   {
      ex.printStackTrace();
      return;
   }
}

The sample application includes an Apache Maven pom.xml. To build the application, just run:

mvn clean install

If you wish, you can create an Eclipse project from this Maven pom.xml file. Then you can import the project into Eclipse. To create the Eclipse project files, do this:

mvn eclipse:eclipse

When you run the application, you will see this:

Document #1 with id [ae1541e5051743e5af310bcfb50a19e8] has been created.
Document #2 with id [c1f20e79043d4b40aa2b9f3ac74e287b] has been created.
********************************
Search first document
Score: 0.39229375
------------------------------
Found document...
Document Id: ae1541e5051743e5af310bcfb50a19e8
Title: Quick brown fox and the lazy dog
Category: Index File Sample
Keywords: Joseph, Brian, Clancy, Connery, Reynolds, Lindsay
------------------------------
********************************
Score: 0.3150669
Get Document by Id [ae1541e5051743e5af310bcfb50a19e8] found.
********************************
Search second document
Score: 0.3150669
------------------------------
Found document...
Document Id: c1f20e79043d4b40aa2b9f3ac74e287b
Title: The centre of kingfisher diversity is the Australasian region
Category: Once upon a Time
Keywords: Liddy, Yellow, Fisher, King, Stevie, Nickolas, Feng Feng
------------------------------
********************************
Score: 0.3150669
Get Document by Id [c1f20e79043d4b40aa2b9f3ac74e287b] found.

It is messy, but it is proof that all the methods I have created above work as expected. There might be some bugs in it. I hope you can catch some. At the least, feel free to change the search strings in the calls to testFindDocument(...), like this:

Java

...
testFindDocument("<Your test search string here>");
...

Summary

Finally, it is time to write the summary of this tutorial. This has not been the fun kind of tutorial I am used to writing. I was not very familiar with the subject matter, and I struggled. In the end, the outcome looks OK, and I am fairly happy with it.

In this tutorial, I have discussed the following topics:

How to open a directory as a document index
How to index a document into a document index
How to search documents in a document index. It is simple but does the job OK.
How to locate a document by unique identifier
How to delete documents by using a unique identifier. And how to delete all documents

I still had fun writing this tutorial, and I hope you enjoy it as well.

History

09/24/2019 - Initial draft

License

This article, along with any associated source code and files, is licensed under The MIT License