This article provides a comprehensive, step-by-step guide to implementing inverted indexes in the Scala programming language. Inverted indexes are a fundamental component of efficient document retrieval systems, akin to what you might find in a search engine or a database. The article covers the basics of how inverted indexes work, including the mapping of words to documents and the ability to perform complex queries, then provides code examples for building an inverted index, including text processing and multi-threading to improve performance. It offers practical insights for anyone interested in understanding and implementing document search systems.
Before we start with the implementation, let's talk about why you would actually need an inverted index in real life.
Why Would Anyone Need Inverted Indexes at All?
Imagine you need to create a system that can quickly look up a document, given several words from it - something like a wiki search. The simplest option I can think of would be to scan through each document, marking the ones that contain all the necessary words. That might work at first, but such a solution wouldn't scale, as each new document increases the time to answer any query.
Can we do something better? Actually, we can! If a user wants to search by words, then words should be the keys in our "database" (index). What would the values be in that case? All the document ids (or any other unique references / pointers) of the documents that contain the word.
How Inverted Indexes Work
Imagine we have documents like this:
id: 1; Text: "In a hole in the ground there lived a hobbit."
id: 2; Text: "The sky above the port was the color of television, tuned to a dead channel."
id: 3; Text: "Hobbits Are Amazing Creatures"
In this case, the map representing our index would look like this:
{
"hobbit": [1, 3],
"television": [2],
"hole": [1],
...
}
Seems quite obvious how we got here, right?
Now if the user queries the system for "hobbit", we look up this key in constant time (as it's a key in a map) and return documents 1 and 3.
This structure also allows us to execute complex queries like NOT, AND, and OR - we get the sets of documents for each key, and then perform the usual set operations: intersect in the case of AND, or difference in the case of NOT. Hence for a query like "hobbit AND hole" we look up the sets [1, 3] and [1]; their intersection is [1], so we return the document with id=1.
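The set algebra above can be sketched directly with Scala's immutable sets. The posting lists here are hard-coded from the example index; note that NOT needs the full set of document ids to subtract from:

```scala
object BooleanQueryDemo extends App {
  // Posting lists taken from the example index above
  val hobbit  = Set(1, 3)
  val hole    = Set(1)
  val allDocs = Set(1, 2, 3) // every document id in the index

  val andResult = hobbit.intersect(hole) // "hobbit AND hole"
  val orResult  = hobbit.union(hole)     // "hobbit OR hole"
  val notResult = allDocs.diff(hobbit)   // "NOT hobbit"

  println(s"AND: $andResult")            // AND: Set(1)
  println(s"OR: $orResult")              // OR: Set(1, 3)
  println(s"NOT: $notResult")            // NOT: Set(2)
}
```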
Obviously, all this is just the tip of the iceberg, the most naive implementation - real-world document indexing / querying systems rank results based on relevance, do different sorts of fuzzy search, faceted search, etc. But the naive implementation is a great place to start, and building it gives us a deeper understanding of the idea. So let's go ahead with an implementation.
Implementation Process
I would start somewhat "from the bottom" - so we'll first implement just a class representing the inverted index, with the ability to add and look up tokens, then add some helpers to index full documents, and then provide the scaffolding that allows us to actually run this as an application.
So, let's start with the class representing the InvertedIndex itself:
import scala.collection.immutable.HashMap

type Token = String
type Filename = String

case class InvertedIndex(
    tokenIndex: Map[Token, Set[Filename]] = HashMap[Token, Set[Filename]]()
)
As I mentioned at the beginning of the article, our index itself is powered by a usual Map[String, Set[String]] - a mapping from a token (indexed word) to a set of document ids (or, in our case, just document names).
Not much is going on just yet - let's do something useful: at the very least, add a method to add a word to the index, and another to get the added documents back:
def add(token: Token, filename: Filename): InvertedIndex =
  tokenIndex.get(token) match {
    case Some(set) =>
      InvertedIndex(tokenIndex.updated(token, set + filename))
    case None =>
      InvertedIndex(tokenIndex.updated(token, Set(filename)))
  }

def getDocumentsForToken(token: Token): Set[Filename] =
  tokenIndex.getOrElse(token, Set.empty[Filename])
Not a lot is going on here, but our inverted index is already usable!
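To see how the pieces fit together, here is a self-contained sketch - the class body is condensed from the snippets above, and the filenames are invented for illustration:

```scala
object IndexDemo extends App {
  type Token = String
  type Filename = String

  // Condensed version of the InvertedIndex from the article
  case class InvertedIndex(tokenIndex: Map[Token, Set[Filename]] = Map.empty) {
    def add(token: Token, filename: Filename): InvertedIndex =
      InvertedIndex(tokenIndex.updated(
        token, tokenIndex.getOrElse(token, Set.empty[Filename]) + filename))

    def getDocumentsForToken(token: Token): Set[Filename] =
      tokenIndex.getOrElse(token, Set.empty[Filename])
  }

  val index = InvertedIndex()
    .add("hobbit", "tolkien.txt")
    .add("hobbit", "fanfic.txt")
    .add("television", "gibson.txt")

  println(index.getDocumentsForToken("hobbit"))  // Set(tolkien.txt, fanfic.txt)
  println(index.getDocumentsForToken("dragon"))  // Set() - unknown tokens return an empty set
}
```

Since InvertedIndex is immutable, each add returns a new index, which is why the calls chain naturally.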
Though we still need something to add words into the index. We'll create an IndexGenerator class for that purpose:
class IndexGenerator(filesToIndex: List[String]) {

  def generateIndex(): InvertedIndex = {
    logger.info("Indexer started")
    filesToIndex
      .map(readFileFromDisk)
      .foldLeft(InvertedIndex()) { case (idx, doc) =>
        processDocument(idx, doc)
      }
  }

  private def processDocument(index: InvertedIndex,
                              document: GenericDocument): InvertedIndex =
    document.text
      .split(" ")
      .flatMap(processToken)
      .foldLeft(index)(_.add(_, document.fileName))

  private def processToken(token: String): Option[String] = {
    val tokenLower = token.toLowerCase
    if (tokenLower.isEmpty ||
        StringUtils.getStopWords.contains(tokenLower)) {
      None
    } else {
      Some(tokenLower.removeTags().removeNumbers())
    }
  }
}
So what do we have here? IndexGenerator takes a list of files to index, reads them from disk, and "processes" them. Processing in our case is just splitting the whole text into words (tokens), a little bit of cleanup via the processToken function, and adding them one by one to the InvertedIndex.
I won't get into the details of the StringUtils functions here - their implementation is quite obvious from the names, and in any real-life application you would actually spend some time coming up with good data cleanup rules. One thing worth mentioning is stop words - these are the words you often encounter in any text but which are not very helpful in actual search - words like "the", "a", "and", etc. In a real-world scenario, you'll also want to do some lemmatization, converting each word into its basic form - for instance, the word "democratical" would be transformed into the token "democracy", providing a more concise token dictionary and simplifying lookup. But all these text processing methods are a topic for a whole separate discussion, and there are many open-source libraries available.
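For illustration only, here is one plausible minimal shape of those helpers. The names (getStopWords, removeTags, removeNumbers) match the usage above, but the bodies are my assumptions, not the article's actual code:

```scala
object StringUtilsDemo extends App {
  // A tiny illustrative stop-word list; real systems use much larger ones
  val stopWords: Set[String] = Set("a", "an", "the", "in", "and", "or", "of", "to")

  // Hypothetical extension methods matching the removeTags() / removeNumbers() calls above
  implicit class TokenOps(s: String) {
    def removeTags(): String    = s.replaceAll("<[^>]*>", "") // strip HTML-like tags
    def removeNumbers(): String = s.replaceAll("[0-9]", "")
  }

  println("<b>hobbit42</b>".removeTags().removeNumbers()) // hobbit
  println(stopWords.contains("the"))                      // true
}
```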
At this point, we already have everything we need for our indexer to work. Let's add some methods to peek inside the index, fetch some results, or execute user queries.
Accessing Index
We've already seen the simplest method of getting all documents for one specific token. But as I’ve mentioned, with some set magic, we can answer more complex queries. For instance, get documents that have ALL tokens:
def getDocumentsForAllTokens(tokens: Set[Token]): Set[Filename] =
  tokens
    .map(getDocumentsForToken)
    .reduce((d1: Set[Filename], d2: Set[Filename]) => d1.intersect(d2))
Or that have at least one of the tokens:
def getDocumentsForAnyToken(tokens: Set[Token]): Set[Filename] =
  tokens
    .map(getDocumentsForToken)
    .reduce((d1: Set[Filename], d2: Set[Filename]) => d1.union(d2))
The last one gives us some stats on the words in our index - we'll check which words are the most popular:
def getTopWords(count: Int): List[(Token, Int)] =
  tokenIndex.toList
    .sortBy(_._2.size)
    .reverse
    .map { case (token, docs) =>
      token -> docs.size
    }
    .take(count)
This method can be helpful when you're deciding what should be considered "stop words" in your indexing process. Remove any filtering, add all the words to the index, and take the top 50 - you'll see all those "the"s and "and"s there.
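The same idea can be demonstrated on plain word counts - a standalone sketch that doesn't need the InvertedIndex class:

```scala
object TopWordsDemo extends App {
  val text = "the cat and the dog and the bird"

  // Count occurrences of each word and keep the two most frequent
  val topWords = text.split(" ").toList
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.size }
    .toList
    .sortBy { case (_, count) => -count }
    .take(2)

  println(topWords) // List((the,3), (and,2))
}
```

On real text, such a list is dominated by exactly the words you'd want to filter out as stop words.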
There's just one more thing I want to add to spice things up - multiple threads.
Running in Parallel
The absolute majority of modern CPUs have more than one core, and hence can execute several operations in parallel. But our code currently doesn't use this fact to its advantage - we just read files one by one, and add tokens, also one by one. Why not utilize more cores if we have them? Adding such a powerful capability is surprisingly easy, and on our "business logic" side (in InvertedIndex) very little will change.
We'll take the “divide and conquer” approach - so we’ll split our initial task into bits that are smaller and easier to handle, and then we’ll combine all the results together in an efficient way.
Combining our separate indices will be done via the merge method that we will add now. We could've just added the words one by one, but in that case we would lose all the advantage of already having two indexes. Instead, we first create a union of all the keys, and for each key we union the document sets corresponding to that key:
def merge(other: InvertedIndex): InvertedIndex = {
  val unionTokens = this.tokenIndex.keys.toSet
    .union(other.tokenIndex.keys.toSet)
  val unionMap = unionTokens.map { token =>
    val unionSet = this.tokenIndex
      .getOrElse(token, Set.empty[Filename])
      .union(other.tokenIndex.getOrElse(token, Set.empty[Filename]))
    token -> unionSet
  }.toMap
  InvertedIndex(unionMap)
}
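A quick check of how merge behaves on two tiny indexes (the filenames are invented; the class body is condensed to just what the demo needs):

```scala
object MergeDemo extends App {
  type Token = String
  type Filename = String

  case class InvertedIndex(tokenIndex: Map[Token, Set[Filename]] = Map.empty) {
    def merge(other: InvertedIndex): InvertedIndex = {
      val unionTokens = tokenIndex.keySet.union(other.tokenIndex.keySet)
      InvertedIndex(unionTokens.map { token =>
        token -> tokenIndex.getOrElse(token, Set.empty[Filename])
          .union(other.tokenIndex.getOrElse(token, Set.empty[Filename]))
      }.toMap)
    }
  }

  val left  = InvertedIndex(Map("hobbit" -> Set("a.txt"), "hole" -> Set("a.txt")))
  val right = InvertedIndex(Map("hobbit" -> Set("b.txt")))

  val merged = left.merge(right)
  println(merged.tokenIndex("hobbit")) // Set(a.txt, b.txt) - document sets are unioned
  println(merged.tokenIndex("hole"))   // Set(a.txt) - keys missing on one side survive
}
```

Since merge is commutative and associative, the partial indexes can be combined in any order - which is exactly what reduce over parallel results needs.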
Let's also do a bit of scaffolding to run several instances of our existing IndexGenerator in parallel:
import scala.collection.parallel.CollectionConverters._

def runIndexer(inputDir: String, numThreads: Int): InvertedIndex = {
  val filesToIndex = FileOps.getFilesToIndex(inputDir)
  val groupSize = getGroupSize(filesToIndex, numThreads)
  filesToIndex
    .grouped(groupSize)
    .toList
    .par
    .map { files =>
      val indexGenerator = new IndexGenerator(files)
      indexGenerator.generateIndex()
    }
    .reduce { (i1, i2) =>
      i1.merge(i2)
    }
}

def getGroupSize(filesToIndex: List[String], numThreads: Int): Int = {
  val groupSize = filesToIndex.size / numThreads
  if (groupSize == 0) 1 else groupSize
}
We split the files into numThreads roughly equal groups, run generateIndex for each group, and then merge the results together. Where does all the parallelization come from? Just from the par method, courtesy of the "org.scala-lang.modules" %% "scala-parallel-collections" library.
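The grouped + par combination can be seen in isolation in a toy example (this sketch assumes the scala-parallel-collections dependency is on the classpath; the file names are made up):

```scala
import scala.collection.parallel.CollectionConverters._

object ParDemo extends App {
  val files = (1 to 10).map(i => s"file$i.txt").toList

  // grouped(4) over 10 items yields groups of sizes 4, 4, and 2
  val groups = files.grouped(4).toList
  println(groups.size) // 3

  // .par distributes the groups over the parallel collection's thread pool;
  // each group is mapped independently, then the partial sums are reduced
  val totalChars = groups.par.map(g => g.map(_.length).sum).reduce(_ + _)
  println(totalChars) // 91
}
```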
I won't get into the details of using JMH for benchmarking the code (you can find all the code in the repo provided), but here are some results using different numbers of threads - the speedup is clearly visible:
numThreads   Score / Error    Units
1            3.487 ± 0.181    s/op
2            1.859 ± 0.010    s/op
4            0.996 ± 0.020    s/op
Conclusion
Summing things up, in this article we:
- discovered an efficient way to implement document search using an inverted index
- wrote a basic implementation of said index and briefly discussed the text processing that should go along with it
- added multi-threading capability to our code to utilize the multi-core architecture of modern hardware
I hope this was helpful; feel free to post your feedback or any questions in the comments section.
History
- 25th September, 2023: Initial version