Simple Text Indexer Using SQLite Database

Chulliyan

14 Jun 2006 · CPOL · 5 min read

Download source files - 481.23 KB

Introduction

SQLite is a powerful, free, open source database. It has an excellent implementation of the B-Tree, which is fast and small. SQLite is in the public domain, so you can use it freely in any commercial project. This project uses SQLite’s B-Tree extensively to index text files. Each text document is scanned and its content is tokenized to build a dictionary in the SQLite database. Later, searches are run against the dictionary and the related map tables to retrieve the names of the files containing the search string.

This article assumes the reader is well versed in C++ and data structures. An in-depth understanding of Windows APIs is also necessary.

Why use B-Tree?

A well-balanced B-Tree has several advantages over a Hashtable:

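Inserting a new item can be faster on a large table, because a B-Tree grows by splitting individual nodes and never has to rehash the whole table.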
It is fast and easy to search for ‘Nearest Hit’ items, because the keys are kept in sorted order. This is very important in building a search dictionary. Searches can be done with a partial string; for example, to search for the word ‘where’, we can start with the string ‘whe’.

Once the first “Nearest Hit” node is obtained, it is easy to walk through the B-Tree list to obtain the matching nodes.
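The nearest-hit walk can be illustrated with a minimal sketch. It uses std::map as a stand-in for the SQLite B-Tree (both keep their keys in sorted order); the Dictionary alias and the nearest_hits function are hypothetical names, not part of the indexer source.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the word dictionary: an ordered map from
// word to 64-bit address. The real indexer stores this in a SQLite B-Tree.
using Dictionary = std::map<std::string, std::int64_t>;

// Return every word that starts with `prefix`, in sorted order.
std::vector<std::string> nearest_hits(const Dictionary& dict,
                                      const std::string& prefix) {
    std::vector<std::string> hits;
    // lower_bound lands on the first key >= prefix: the "nearest hit".
    for (auto it = dict.lower_bound(prefix); it != dict.end(); ++it) {
        // Keys are sorted, so the first key that does not begin with
        // the prefix ends the run of matches.
        if (it->first.compare(0, prefix.size(), prefix) != 0) break;
        hits.push_back(it->first);
    }
    return hits;
}
```

With the words ‘what’, ‘when’, ‘where’ and ‘while’ in the dictionary, a call with the prefix ‘whe’ starts at ‘when’ and stops at ‘while’. A Hashtable offers no such starting point, because its keys are not ordered.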

Design

The indexer consists of a set of B-Trees (the Word Dictionary and the MAP tables) and a collection of ‘observers’ that watch for ‘user input’, ‘file changes’ and ‘system idle time’ in order to update or query the B-Trees. Each observer runs in its own thread.

The ‘User Input’ observer reads the user’s input and queries the indexer for matches.
The ‘File Change’ observer waits for file change notifications from the observed directories and queues the change information in a list.
The ‘System Idle Time’ observer wakes up when the system is idle and processes the queued file change information. This observer is responsible for indexing modified files.
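Because the ‘File Change’ and ‘System Idle Time’ observers run in different threads, the list between them has to be protected. The following is a minimal sketch of such a list, assuming a mutex-guarded queue; the class name FileChangeQueue and its methods are hypothetical and do not come from the indexer source.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical queue shared by the two observers.
class FileChangeQueue {
public:
    // Called by the 'File Change' observer for every reported file.
    void push(std::string path) {
        std::lock_guard<std::mutex> lock(mutex_);
        paths_.push_back(std::move(path));
    }

    // Called by the 'System Idle Time' observer. Returns false when
    // nothing is queued; otherwise moves the oldest entry into `out`.
    bool try_pop(std::string& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (paths_.empty()) return false;
        out = std::move(paths_.front());
        paths_.pop_front();
        return true;
    }

    // Number of file names currently waiting to be indexed.
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return paths_.size();
    }

private:
    mutable std::mutex mutex_;        // guards paths_
    std::deque<std::string> paths_;   // oldest change at the front
};
```

A non-blocking try_pop fits the design described above: the idle observer wakes up on its own schedule, so it only needs to drain whatever is queued at that moment.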

File Change Observer

This observer opens the directories named in the ‘settings.ini’ file (located in the same folder as ‘INDEXER.EXE’). It then creates an IoCompletionPort and waits for file changes. An IoCompletionPort allows a single thread to observe several directories at the same time. Whenever the IoCompletionPort reports a completion status, the observer uses the ‘ReadDirectoryChangesW’ API to retrieve the file changes from the system. The names of all new, modified or deleted files are put in a file name list.

Class Name: CFileObserver (filechg.h and filechg.cpp)

System Idle Observer

This observer sleeps most of the time. Whenever it wakes up, it uses the ‘NtQuerySystemInformation’ API exposed by NTDLL.DLL to calculate the CPU usage. If the CPU usage is less than 5% and there are file names in the file name list (queued up by the File Change Observer), it de-queues the file names, then parses and indexes the content of those files.

Class Name: CIdleObserver (observer.h and observer.cpp)
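The text above does not spell out how the CPU usage is derived from the system counters. A common approach, sketched here as an assumption, is to take two samples of cumulative idle time and total time and compare the differences. The CpuSample struct and both function names are hypothetical; the Windows call that would fill the samples is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One reading of cumulative CPU time counters. The unit does not matter
// as long as both fields use the same one. (Hypothetical type.)
struct CpuSample {
    std::uint64_t idle;   // time spent idle since boot
    std::uint64_t total;  // idle time plus busy time since boot
};

// CPU usage in percent over the interval between two samples.
double cpu_usage_percent(const CpuSample& prev, const CpuSample& cur) {
    assert(cur.total >= prev.total && cur.idle >= prev.idle);
    const std::uint64_t total = cur.total - prev.total;
    if (total == 0) return 0.0;  // no time elapsed between the samples
    const std::uint64_t idle = cur.idle - prev.idle;
    return 100.0 * (1.0 - static_cast<double>(idle) /
                              static_cast<double>(total));
}

// The rule described in the text: index only when usage is below 5%
// and at least one file name is waiting in the list.
bool should_index(double usage_percent, std::size_t queued_files) {
    return usage_percent < 5.0 && queued_files > 0;
}
```

For example, if 1000 time units pass between two samples and 970 of them were idle, the usage is about 3%, so the observer would start indexing provided the list is not empty.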

User Input Observer

This observer waits for user input on ‘STDIN’ and queries the indexer to retrieve the names of the files containing the input string. It runs in the main application thread and continues to run until the user presses CTRL+C.

Class Name: CInputObserver (observer.h and observer.cpp)

The Indexer

This class indexes the file content. It uses the dictionary lookup algorithm to index and search words. There is only ONE dictionary in the sample code provided. If you need high performance lookup, you should create an array of dictionaries based on the hash value of a portion of the word (say, the first three characters of the word). In such a case, the dictionary is a Hashtable of B-Trees and should provide fast lookups. See below for the schema design of the indexer.
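The ‘Hashtable of B-Trees’ idea is not implemented in the sample code. The following is one possible sketch of it, again with std::map standing in for a B-Tree; the class name ShardedDictionary, the shard count of 16 and the sequential address scheme are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical dictionary split into several ordered maps ("shards").
class ShardedDictionary {
public:
    static constexpr std::size_t kShards = 16;  // arbitrary choice

    // Pick a shard from the hash of the first three characters.
    static std::size_t shard_of(const std::string& word) {
        return std::hash<std::string>{}(word.substr(0, 3)) % kShards;
    }

    // Return the word's address, assigning a new one if it is unseen.
    std::int64_t address_of(const std::string& word) {
        std::map<std::string, std::int64_t>& shard = shards_[shard_of(word)];
        auto it = shard.find(word);
        if (it != shard.end()) return it->second;
        const std::int64_t address = next_address_++;
        shard.emplace(word, address);
        return address;
    }

private:
    std::array<std::map<std::string, std::int64_t>, kShards> shards_;
    std::int64_t next_address_ = 1;  // addresses are handed out in order
};
```

One consequence of hashing on the first three characters: a partial-string search with three or more characters only has to visit one shard, while a shorter prefix such as ‘w’ has to visit all of them.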

Dictionary (m_dict of CIndexer class)

This table maps each word to its address. An address is a 64-bit integer. Every query looks up the address of the word in this table and then uses that address for all other table lookups.

Word1    Address1
Word2    Address2
Word3    Address3

File Map Table (m_file of CIndexer class)

This table maps file names to FILE IDs. Each FILE ID is a unique 64-bit integer.

File name1    FILE ID1
File name2    FILE ID2
File name3    FILE ID3

File ID to File Name map table (m_fileId of CIndexer class)

This table is a reverse lookup table for looking up a file name given the FILE ID.

FILE ID1    File name1
FILE ID2    File name2
FILE ID3    File name3

WORD address to FILE ID map table (MAP table) (m_map of CIndexer class)

This table maps a word address to a set of FILE IDs. Each word address is the primary key of a vector record containing the FILE IDs of all the files that contain at least one occurrence of the word.

Address1    FILE ID1, FILE ID2
Address2    FILE ID1
Address3    FILE ID3
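A B-Tree record is a flat run of bytes, so the vector of FILE IDs in the MAP table has to be serialized before it is stored. The encoding used by the indexer is not described here; the sketch below assumes the simplest layout, 8 little-endian bytes per FILE ID, and the function names pack_file_ids and unpack_file_ids are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack FILE IDs into a byte string: 8 bytes per ID, least significant
// byte first. (Assumed layout, not taken from the indexer source.)
std::string pack_file_ids(const std::vector<std::int64_t>& ids) {
    std::string blob;
    blob.reserve(ids.size() * 8);
    for (std::int64_t id : ids) {
        const std::uint64_t u = static_cast<std::uint64_t>(id);
        for (int i = 0; i < 8; ++i)
            blob.push_back(static_cast<char>((u >> (8 * i)) & 0xFF));
    }
    return blob;
}

// Inverse of pack_file_ids: rebuild the vector from the stored bytes.
std::vector<std::int64_t> unpack_file_ids(const std::string& blob) {
    assert(blob.size() % 8 == 0);  // a record holds whole IDs only
    std::vector<std::int64_t> ids;
    for (std::size_t pos = 0; pos + 8 <= blob.size(); pos += 8) {
        std::uint64_t u = 0;
        for (int i = 0; i < 8; ++i) {
            const unsigned char byte =
                static_cast<unsigned char>(blob[pos + i]);
            u |= static_cast<std::uint64_t>(byte) << (8 * i);
        }
        ids.push_back(static_cast<std::int64_t>(u));
    }
    return ids;
}
```

A fixed byte order keeps the index file readable regardless of the machine that wrote it; a variable-length encoding would save space at the cost of a more involved decoder.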

Indexing Algorithm

Create an entry in the FILE ID table if the file name is new. Also, create an entry in the reverse lookup table.
Parse the content of the file by reading one line at a time and tokenizing the string. Each token is assumed to be a WORD and is indexed as follows.
Look up the Dictionary to get the word’s address. If the word is not present in the dictionary, create an entry.
Look up the MAP table using the word’s address as the primary key to retrieve the vector of FILE IDs. Insert the new FILE ID into the vector and save it in the database.
Create a line number and column number entry in the Line number map table.
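The indexing steps above can be sketched in memory as follows. This is an illustration under several assumptions, not the CIndexer implementation: std::map and std::set stand in for the SQLite B-Trees, tokens are split on whitespace only, IDs and addresses are assigned sequentially, and the line number map table is left out. The struct and function names are hypothetical; the member names follow the table names given in the schema.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// In-memory stand-in for the four tables of the schema.
struct Index {
    std::map<std::string, std::int64_t> m_dict;    // word -> address
    std::map<std::string, std::int64_t> m_file;    // file name -> FILE ID
    std::map<std::int64_t, std::string> m_fileId;  // FILE ID -> file name
    std::map<std::int64_t, std::set<std::int64_t>> m_map;  // address -> FILE IDs
};

// Look up `key`; if it is new, assign the next sequential number.
// (Unique as long as entries are never removed from the table.)
std::int64_t get_or_assign(std::map<std::string, std::int64_t>& table,
                           const std::string& key) {
    auto it = table.find(key);
    if (it != table.end()) return it->second;
    const std::int64_t id = static_cast<std::int64_t>(table.size()) + 1;
    table.emplace(key, id);
    return id;
}

// Index one file whose content is available as a stream.
void index_text(Index& ix, const std::string& file_name,
                std::istream& content) {
    // Create the FILE ID and the reverse lookup entry.
    const std::int64_t file_id = get_or_assign(ix.m_file, file_name);
    ix.m_fileId[file_id] = file_name;

    // Read one line at a time and tokenize it.
    std::string line;
    while (std::getline(content, line)) {
        std::istringstream tokens(line);
        std::string word;
        while (tokens >> word) {
            // Dictionary lookup, creating the entry when missing.
            const std::int64_t address = get_or_assign(ix.m_dict, word);
            // Add this file to the word's MAP record.
            ix.m_map[address].insert(file_id);
        }
    }
}
```

A std::set is used for the MAP record so that a word occurring many times in one file adds its FILE ID only once; with a plain vector, the insert step would need its own duplicate check.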

Search Algorithm

Get the nearest hit word addresses from the Dictionary (for example, a search for ‘w’ may result in ‘where’, ‘while’, ‘which’, etc.).
For every word address, look up the MAP table to retrieve the FILE IDs.
For every FILE ID, look up the FILE ID to File Name reverse lookup table to retrieve the file name.
Look up the line number map table to obtain the line number and column number of the word in the file.
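The first three search steps can be sketched in the same in-memory style. As before, std::map and std::set are stand-ins for the SQLite B-Trees, the Index struct and the search function are hypothetical names, and the line number lookup is omitted.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>

// In-memory stand-in for the tables that a search reads.
struct Index {
    std::map<std::string, std::int64_t> m_dict;    // word -> address
    std::map<std::int64_t, std::string> m_fileId;  // FILE ID -> file name
    std::map<std::int64_t, std::set<std::int64_t>> m_map;  // address -> FILE IDs
};

// Return the names of all files containing a word that starts
// with `prefix`.
std::set<std::string> search(const Index& ix, const std::string& prefix) {
    std::set<std::string> files;
    // Walk the dictionary from the nearest hit onwards.
    for (auto word = ix.m_dict.lower_bound(prefix);
         word != ix.m_dict.end(); ++word) {
        if (word->first.compare(0, prefix.size(), prefix) != 0) break;
        // Word address -> FILE IDs.
        auto record = ix.m_map.find(word->second);
        if (record == ix.m_map.end()) continue;
        // FILE ID -> file name, through the reverse lookup table.
        for (std::int64_t file_id : record->second) {
            auto name = ix.m_fileId.find(file_id);
            if (name != ix.m_fileId.end()) files.insert(name->second);
        }
    }
    return files;
}
```

The result is a set because several matching words often point at the same file, and each file name should be reported once.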

Source Code

The source code is arranged as follows.

Directory ‘./sqlite’ contains all the SQLite source code. I have picked only the relevant ‘C’ and ‘H’ files from the SQLite source tree and have modified some of the files to remove the extra dependencies. If you want to use the latest code from the SQLite site (http://www.sqlite.org), you will have to modify the code to make it work.

Directory ‘./indexer’ contains all the indexer code.
All the project files are based on Visual Studio 2005.
The binary file is created in the ./bin/win32 directory.
The settings file (settings.ini) is located in the ./bin/win32 directory.
The index file is created in the %TEMP%/index directory.

History

14th June, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
