Introduction
Text Indexer Using SQLite
SQLite is a powerful, free, open source database. It has an excellent implementation of the B-Tree, which is fast and small. SQLite is not copyrighted and therefore you can use it freely in any commercial projects. This project uses the SQLite’s B-Tree extensively to index text files. Each text document is scanned and the content is tokenized to build a dictionary in the SQLite database. Later searches are conducted against the dictionary and related map tables to retrieve the file names containing the search string.
This article assumes the reader is well versed in C++ and data structures. An in-depth understanding of Windows APIs is also necessary.
Why use B-Tree?
A well balanced B-Tree has several advantages over a Hashtable:
- It is faster to insert a new item.
- It is fast and easy to search for ‘Nearest Hit’ items. This is very important in building a search dictionary. Searches can be done with partial
string
, for example to search for the word ‘where
’, we can start with the string
‘whe
’.
Once the first “Nearest Hit” node is obtained, it is easy to walk through the B-Tree list to obtain the matching nodes.
Design
The indexer consists of a set of B-Trees (Word Dictionary and MAP tables) and a collection of ‘observers’ looking for ‘user input’, ‘file changes’ and ‘system idle time’ to update or query the B-Trees. Each observer runs in separate threads.
- ‘User Input’ observer reads user’s input and queries the indexer for matches
- ‘File Change’ observer waits for file change notifications from the observed directories and queue the change information in a list.
- ‘System Idle Time’ observer wakes up when the system is idling and processes the queued file change information. This observer is responsible for indexing modified files.
File Change Observer
This observer opens the directory names specified in the ‘settings.ini’ file (located in the same folder as ‘INDEXER.EXE’). Later it creates an IoCompletionPort
and waits for file changes. ‘IoCompletionPort
’ allows a single thread to observe several directories at the same time. It uses ‘ReadDirectoryChangesW
’ API to retrieve the file changes from the system whenever an IoCompletionPort
reports a completion status. All the new, modified or deleted files names are put in a file name list.
Class Name: CFileObserver
(filechg.h and filechg.cpp)
System Idle Observer
This observer sleeps most of the time and whenever it wakes up it uses the ‘NtQuerySystemInformation
’ API exposed by NTDLL.LL to calculate the CPU usage. If the CPU usage is less than 5% and there are file names in the file name list (queued up by the File Change Observer), it will de-queue the file names, will parse and index the content of the files.
Class Name: CIdleObserver
(observer.h and observer.cpp)
User Input Observer
This observer waits for user inputs on ‘STDIN
’ and queries the indexer to retrieve file names containing the input string. It runs in the main application thread and will continue to run until user presses CTRL+C.
Class Name: CInputObserver
(observer.h and observer.cpp)
The Indexer
This class indexes the file content. It uses the dictionary lookup algorithm to index and search words. There is only ONE dictionary in the sample code provided. If you need high performance lookup, you should create an array of dictionaries based on the hash value of a portion of the word (say, the first three characters of the word). In such a case, the dictionary is a Hashtable of B-Trees and should provide fast lookups. See below for the schema design of the indexer.
Dictionary
(m_dict
of CIndexer
class)
This table maps words to its address. An address is 64bit integer. Every query will look up the address of the word from this table and later uses the address for all other table lookups.
Word1 | Address1 |
Word2 | Address2 |
Word3 | Address3 |
- File Map Table (
m_file
of CIndexer
class)
This table maps file names to FILE IDs. Each FILE ID is a unique 64bit integer.
File name1 | FILE ID1 |
File nam2 | FILE ID2 |
File name3 | FILE ID3 |
- File ID to File Name map table (
m_fileId
of CIndexer
class)
This table is a reverse lookup table for looking up a file name given the FILE ID.
FILE ID1 | File name1 |
FILE ID2 | File name2 |
FILE ID3 | File name3 |
- WORD address to FILE ID map table (
MAP
table) (m_map
of CIndexer
class)
This table maps a word address to a set of FILE IDs. Each word address entry is a primary key to a vector record containing the FILE IDs of all the files containing at least one occurrence of the word.
Address1 | FILE ID1, FILE ID2 |
Address2 | FILE ID1 |
Address3 | FILE ID3 |
Indexing Algorithm
- Create an entry in the FILE ID table if the file name is new. Also, create an entry in the reverse lookup table.
- Parse the content of the file by reading one line at a time and tokenizing the string. Each token is assumed to be a WORD and index the word.
- Lookup the Dictionary to get the word’s address. If the word is not present in the dictionary, create an entry.
- Lookup up the MAP table using the word’s address as the primary key to retrieve the vector of FILE IDs. Insert the new FILE ID into the vector and save it in the database.
- Create a line number and column number entry in the Line number map table.
Searches Algorithm
- Get the nearest hit word addresses from Dictionary (For example, a search of ‘w’ may result in ‘where’, ‘while’, ‘which’, etc.).
- For every word address, lookup the MAP table to retrieve the FILE IDs
- For every FILE IDs, lookup the FILE ID to File Name reverse lookup table to retrieve the file name.
- Lookup the line number map table to obtain the line number and column number of the word in the file.
Source Code
The source code is arranged as follows.
Directory ‘./sqlite’ contains all the SQLite source code. I have picked only the relevant ‘C’ and ‘H’ files from the SQLite source tree and have modified some of the files to remove the extra dependencies. If you want to use the latest code from the SQLite site (http://www.sqlite.org), you will have to modify the code to make it work.
- Directory ‘./indexer’ contains all the indexer code
- All the project files are based on Visual Studio 2005
- Binary file is created in ./bin/win32 directory
- Settings file (settins.ini) is located in ./bin/win32 directory
- The index file is created in %TEMP%/index directory
History
- 14th June, 2006: Initial post