This article provides a comprehensive, step-by-step guide to implementing inverted indexes in the Scala programming language. Inverted indexes are a fundamental component of efficient document retrieval systems, akin to what you might find in a search engine or a database. The article covers the basics of how inverted indexes work, including the mapping of words to documents and the ability to perform complex queries, then provides code examples for building an inverted index, including text processing and multi-threading to improve performance. It offers practical insights for anyone interested in understanding and implementing document search systems.
Before we start with the implementation, let's talk about why you would actually need an inverted index in real life.
Why Would Anyone Need Inverted Indexes at All?
Imagine you need to create a system that can quickly look up a document, given several words from it - something like a wiki search. The simplest option I can think of would be to scan through each document, marking the ones that contain all the necessary words. That might work at first, but such a solution wouldn't scale, as each new document increases the time to answer any query.
Can we do something better? Actually, we can! If a user wants to search by words, then words should be the keys in our "database" (index). What would the values be in that case? All the document ids (or any other unique references / pointers) of the documents that contain the word.
How Inverted Indexes Work
Imagine we have documents like this:
id: 1; Text: "In a hole in the ground there lived a hobbit."
id: 2; Text: "The sky above the port was the color of television, tuned to a dead channel."
id: 3; Text: "Hobbits Are Amazing Creatures"
In this case, the map representing our index would look like this:
{
"hobbit": [1, 3],
"television": [2],
"hole": [1],
...
}
Seems quite obvious how we got here, right?
Now if the user queries the system for "hobbit", we look up this key in constant time (as it's a key in a map) and return documents 1 and 3.
This structure also allows us to execute complex queries like NOT, AND, and OR - we get the sets of documents for each key, and then perform the usual set operations: intersect in the case of AND, or difference in the case of NOT. Hence for a query like "hobbit AND hole" we look up the sets [1, 3] and [1]; their intersection is [1], so we return the document with id=1.
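The set algebra above can be sketched directly with Scala's immutable sets. The posting lists here are hard-coded from the example index; note that NOT needs the full set of document ids to subtract from:

```scala
object BooleanQueryDemo extends App {
  // Posting lists taken from the example index above
  val hobbit  = Set(1, 3)
  val hole    = Set(1)
  val allDocs = Set(1, 2, 3) // every document id in the index

  val andResult = hobbit.intersect(hole) // "hobbit AND hole"
  val orResult  = hobbit.union(hole)     // "hobbit OR hole"
  val notResult = allDocs.diff(hobbit)   // "NOT hobbit"

  println(s"AND: $andResult")            // AND: Set(1)
  println(s"OR: $orResult")              // OR: Set(1, 3)
  println(s"NOT: $notResult")            // NOT: Set(2)
}
```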
Obviously, all this is just the tip of the iceberg, the most naive implementation - real-world document indexing / querying systems rank results based on relevance, do different sorts of fuzzy search, faceted search, etc. But the naive implementation is a great place to start, and building it gives us a deeper understanding of the idea. So let's go ahead with an implementation.
Implementation Process
I would start somewhat "from the bottom" - so we'll first implement just a class representing the inverted index, with the ability to add and look up tokens, then add some helpers to index full documents, and then provide the scaffolding that allows us to actually run this as an application.
So, let's start with the class representing the InvertedIndex itself:
import scala.collection.immutable.HashMap

type Token = String
type Filename = String

case class InvertedIndex(
    tokenIndex: Map[Token, Set[Filename]] = HashMap[Token, Set[Filename]]()
)
As I mentioned at the beginning of the article, our index itself is powered by a usual Map[String, Set[String]] - a mapping from a token (indexed word) to a set of document ids (or, in our case, just document names).
Not much is going on just yet - let's do something useful: at the very least, add a method to add a word to the index, and another to get the added documents back:
def add(token: Token, filename: Filename): InvertedIndex =
  tokenIndex.get(token) match {
    case Some(set) =>
      InvertedIndex(tokenIndex.updated(token, set + filename))
    case None =>
      InvertedIndex(tokenIndex.updated(token, Set(filename)))
  }

def getDocumentsForToken(token: Token): Set[Filename] =
  tokenIndex.getOrElse(token, Set.empty[Filename])
Not a lot is going on here, but our inverted index is already usable!
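To see how the pieces fit together, here is a self-contained sketch - the class body is condensed from the snippets above, and the filenames are invented for illustration:

```scala
object IndexDemo extends App {
  type Token = String
  type Filename = String

  // Condensed version of the InvertedIndex from the article
  case class InvertedIndex(tokenIndex: Map[Token, Set[Filename]] = Map.empty) {
    def add(token: Token, filename: Filename): InvertedIndex =
      InvertedIndex(tokenIndex.updated(
        token, tokenIndex.getOrElse(token, Set.empty[Filename]) + filename))

    def getDocumentsForToken(token: Token): Set[Filename] =
      tokenIndex.getOrElse(token, Set.empty[Filename])
  }

  val index = InvertedIndex()
    .add("hobbit", "tolkien.txt")
    .add("hobbit", "fanfic.txt")
    .add("television", "gibson.txt")

  println(index.getDocumentsForToken("hobbit"))  // Set(tolkien.txt, fanfic.txt)
  println(index.getDocumentsForToken("dragon"))  // Set() - unknown tokens return an empty set
}
```

Since InvertedIndex is immutable, each add returns a new index, which is why the calls chain naturally.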
Though we still need something to add words into the index. We'll create an IndexGenerator class for that purpose:
class IndexGenerator(filesToIndex: List[String]) {

  def generateIndex(): InvertedIndex = {
    logger.info("Indexer started")
    filesToIndex
      .map(readFileFromDisk)
      .foldLeft(InvertedIndex()) { case (idx, doc) =>
        processDocument(idx, doc)
      }
  }

  private def processDocument(index: InvertedIndex,
                              document: GenericDocument): InvertedIndex =
    document.text
      .split(" ")
      .flatMap(processToken)
      .foldLeft(index)(_.add(_, document.fileName))

  private def processToken(token: String): Option[String] = {
    val tokenLower = token.toLowerCase
    if (tokenLower.isEmpty ||
        StringUtils.getStopWords.contains(tokenLower)) {
      None
    } else {
      Some(tokenLower.removeTags().removeNumbers())
    }
  }
}
So what do we have here? IndexGenerator takes a list of files to index, reads them from disk, and "processes" them. Processing in our case is just splitting the whole text into words (tokens), a little bit of cleanup via the processToken function, and adding them one by one to the InvertedIndex.
I won't get into the details of the StringUtils functions here - their implementation is quite obvious from the names, and in any real-life application you would actually spend some time coming up with good data cleanup rules. One thing worth mentioning is stop words - these are the words you often encounter in any text but which are not very helpful in actual search - words like "the", "a", "and", etc. In a real-world scenario, you'll also want to do some lemmatization, converting each word into its basic form - for instance, the word "democratical" would be transformed into the token "democracy", providing a more concise token dictionary and simplifying lookup. But all these text processing methods are a topic for a whole separate discussion, and there are many open-source libraries available.
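For illustration only, here is one plausible minimal shape of those helpers. The names (getStopWords, removeTags, removeNumbers) match the usage above, but the bodies are my assumptions, not the article's actual code:

```scala
object StringUtilsDemo extends App {
  // A tiny illustrative stop-word list; real systems use much larger ones
  val stopWords: Set[String] = Set("a", "an", "the", "in", "and", "or", "of", "to")

  // Hypothetical extension methods matching the removeTags() / removeNumbers() calls above
  implicit class TokenOps(s: String) {
    def removeTags(): String    = s.replaceAll("<[^>]*>", "") // strip HTML-like tags
    def removeNumbers(): String = s.replaceAll("[0-9]", "")
  }

  println("<b>hobbit42</b>".removeTags().removeNumbers()) // hobbit
  println(stopWords.contains("the"))                      // true
}
```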
At this point, we already have everything we need for our indexer to work. Let's add some methods to peek inside the index, fetch some results, or execute user queries.
Accessing Index
We've already seen the simplest method of getting all documents for one specific token. But as I’ve mentioned, with some set magic, we can answer more complex queries. For instance, get documents that have ALL tokens:
def getDocumentsForAllTokens(tokens: Set[Token]): Set[Filename] =
  tokens
    .map(getDocumentsForToken)
    .reduce((d1: Set[Filename], d2: Set[Filename]) => d1.intersect(d2))
Or that have at least one of the tokens:
def getDocumentsForAnyToken(tokens: Set[Token]): Set[Filename] =
  tokens
    .map(getDocumentsForToken)
    .reduce((d1: Set[Filename], d2: Set[Filename]) => d1.union(d2))
The last one gives us some stats on the words in our index - we'll check which words are the most popular:
def getTopWords(count: Int): List[(Token, Int)] =
  tokenIndex.toList
    .sortBy(_._2.size)
    .reverse
    .map { case (token, docs) =>
      token -> docs.size
    }
    .take(count)
This method can be helpful when you're deciding what should be considered "stop words" in your indexing process. Remove any filtering, add all the words to the index, and take the top 50 - you'll see all those "the"s and "and"s there.
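The same idea can be demonstrated on plain word counts - a standalone sketch that doesn't need the InvertedIndex class:

```scala
object TopWordsDemo extends App {
  val text = "the cat and the dog and the bird"

  // Count occurrences of each word and keep the two most frequent
  val topWords = text.split(" ").toList
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.size }
    .toList
    .sortBy { case (_, count) => -count }
    .take(2)

  println(topWords) // List((the,3), (and,2))
}
```

On real text, such a list is dominated by exactly the words you'd want to filter out as stop words.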
There's just one more thing I want to add to spice things up - multiple threads.
Running in Parallel
The absolute majority of modern CPUs have more than one core, and hence can execute several operations in parallel. But our code currently doesn't use this fact to its advantage - we just read files one by one, and add tokens, also one by one. Why not utilize more cores if we have them? Adding such a powerful capability is surprisingly easy, and on our "business logic" side (in InvertedIndex) very little will change.
We'll take the “divide and conquer” approach - so we’ll split our initial task into bits that are smaller and easier to handle, and then we’ll combine all the results together in an efficient way.
Combining our separate indices will be done via the merge method that we will add now. We could've just added the words one by one, but in that case we would lose all the advantage of already having two indexes. Instead, we first create a union of all the keys, and for each key we union the document sets corresponding to that key:
def merge(other: InvertedIndex): InvertedIndex = {
  val unionTokens = this.tokenIndex.keys.toSet
    .union(other.tokenIndex.keys.toSet)
  val unionMap = unionTokens.map { token =>
    val unionSet = this.tokenIndex
      .getOrElse(token, Set.empty[Filename])
      .union(other.tokenIndex.getOrElse(token, Set.empty[Filename]))
    token -> unionSet
  }.toMap
  InvertedIndex(unionMap)
}
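A quick check of how merge behaves on two tiny indexes (the filenames are invented; the class body is condensed to just what the demo needs):

```scala
object MergeDemo extends App {
  type Token = String
  type Filename = String

  case class InvertedIndex(tokenIndex: Map[Token, Set[Filename]] = Map.empty) {
    def merge(other: InvertedIndex): InvertedIndex = {
      val unionTokens = tokenIndex.keySet.union(other.tokenIndex.keySet)
      InvertedIndex(unionTokens.map { token =>
        token -> tokenIndex.getOrElse(token, Set.empty[Filename])
          .union(other.tokenIndex.getOrElse(token, Set.empty[Filename]))
      }.toMap)
    }
  }

  val left  = InvertedIndex(Map("hobbit" -> Set("a.txt"), "hole" -> Set("a.txt")))
  val right = InvertedIndex(Map("hobbit" -> Set("b.txt")))

  val merged = left.merge(right)
  println(merged.tokenIndex("hobbit")) // Set(a.txt, b.txt) - document sets are unioned
  println(merged.tokenIndex("hole"))   // Set(a.txt) - keys missing on one side survive
}
```

Since merge is commutative and associative, the partial indexes can be combined in any order - which is exactly what reduce over parallel results needs.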
Let's also do a bit of scaffolding to run several instances of our existing IndexGenerator in parallel:
import scala.collection.parallel.CollectionConverters._

def runIndexer(inputDir: String, numThreads: Int): InvertedIndex = {
  val filesToIndex = FileOps.getFilesToIndex(inputDir)
  val groupSize = getGroupSize(filesToIndex, numThreads)
  filesToIndex
    .grouped(groupSize)
    .toList
    .par
    .map { files =>
      val indexGenerator = new IndexGenerator(files)
      indexGenerator.generateIndex()
    }
    .reduce { (i1, i2) =>
      i1.merge(i2)
    }
}

def getGroupSize(filesToIndex: List[String], numThreads: Int): Int = {
  val groupSize = filesToIndex.size / numThreads
  if (groupSize == 0) 1 else groupSize
}
We split the files into numThreads roughly equal groups, run generateIndex for each group, and then merge the results together. Where does all the parallelization come from? Just from the par method, courtesy of the "org.scala-lang.modules" %% "scala-parallel-collections" library.
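The grouped + par combination can be seen in isolation in a toy example (this sketch assumes the scala-parallel-collections dependency is on the classpath; the file names are made up):

```scala
import scala.collection.parallel.CollectionConverters._

object ParDemo extends App {
  val files = (1 to 10).map(i => s"file$i.txt").toList

  // grouped(4) over 10 items yields groups of sizes 4, 4, and 2
  val groups = files.grouped(4).toList
  println(groups.size) // 3

  // .par distributes the groups over the parallel collection's thread pool;
  // each group is mapped independently, then the partial sums are reduced
  val totalChars = groups.par.map(g => g.map(_.length).sum).reduce(_ + _)
  println(totalChars) // 91
}
```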
I won't get into the details of using JMH for benchmarking the code (you can find all the code in the repo provided), but here are some results using different numbers of threads - the speedup is clearly visible:
numThreads   Score / Error    Units
1            3.487 ± 0.181    s/op
2            1.859 ± 0.010    s/op
4            0.996 ± 0.020    s/op
Conclusion
Summing things up, in this article we:
- discovered an efficient way to implement document search using an inverted index
- wrote a basic implementation of said index and briefly discussed the text processing that should go along with it
- added multi-threading capability to our code to utilize the multi-core architecture of modern hardware
I hope this was helpful; feel free to post your feedback or any questions in the comments section.
History
- 25th September, 2023: Initial version