Introduction
Every C# programmer might have used class “hashtable
”. A hashtable
is a mechanism to store and retrieve key/value pairs. Hashtable
is efficient in many ways and data is stored in the RAM that makes it extremely fast. What if the number of key/value pairs is very large? Alternatives are windows registry or a relational database table. Obviously one may not wish to store a large amount of application data in the Windows registry and databases are resource hungry and expensive.
HashFile
class is an attempt to find an answer to the problem.
HashFile Class
HashFile
is a technique to insert/delete/find key value pairs in a file in the hard disk. It uses fixed length records to store data and index. There are two files “name.dat” and “name.idx” corresponding to a "name
". "name
" is the identification for key/value pairs, and is given at the time of initialization.
"name.dat”
This is the data file. The Data file stores fixed length records of key/value pairs. Name of key/value pair, length of key and value should be specified at the time of initialization. Maximum size of the data file could stretch up to 2^32 bytes, which is the maximum value of unsigned integer (uint32
).
“name.idx”
This is the index file. It stores fixed length records of record pointer and index pointer. Record pointer and index pointer are unsigned integers.
Indexing Technique
It uses Index Block (IB) of 91 x 2 Matrix to store record and index pointers. When Hashfile
is initialized for the first time, an empty IB will be created and every element will be filled with max value of data type uint
. Why 91 rows? 91 is the number of characters in the ASCII range 32-122. Each row in the IB represents an ASCII character. Columns of the IB are record pointer and index pointer. Data types of columns are unsigned integers. Record pointer represents the starting byte index number of a data record with respect to the origin, similarly index pointer represents starting byte index number of an Index record with respect to the origin.
Pseudo code for InsertKey
method is given below:
- Take a byte from the new key
- If byte is within range 32-122, then jump to the matching index record within an IB that corresponds to distance from the left most byte, otherwise return error
- Read record pointer and index pointer
- If index pointer is equal to maximum value of data type
uint
, then, save record pointer and index pointer, save key and value in the data record and return record pointer - Otherwise read data record for the record pointer
- Compare new key and existing key
- If new key matches existing key in the data record, then go to step 10
- If new key is greater than existing key, then go to step 11
- Otherwise step 13
- If the status of the data record is deleted then, save new value, change deleted flag to false, return record pointer, otherwise return error
- If index pointer points to next IB, take next byte and jump to step 2
- Otherwise create new IB, save record pointer and index pointer in the new IB, assign start index pointer of new IB to the old IB, save key and value in the data record, return record pointer
- If index pointer points to next IB, assign new record pointer to the index record, swap new and existing keys jump to step 2
- Otherwise assign new record pointer to the index record, create new IB, assign existing record pointer, new index pointer to new IB, assign start index pointer of new IB to the old IB, save key and value in the data record, return record pointer
Sounds bit fuzzy, isn't it? But the technique seems to work in the test environment. As new keys are added, indexing mechanism will attempt to place record pointer at a minimum distance as possible from the root IB. Binary tree search algorithm eliminates items to be searched by half after each branch, whereas the above described technique eliminates probably about 90/91th of the outstanding records after every IB assuming that population is well distributed. Test results show that average IB reads to find a record is about 6.2, in a population of 528,461 keys, which sounds good. See the test results section for details.
Using the HashFile Class
HashFile.HashFile hf = new HashFile.HashFile();
hf.Initialize("Test1",50,100);
String key = "Microsoft";
String Value = "Software company";
int success = hf.InsertKey(key,Value,false)
String key = "Microsoft";
uint success = hf.FindKey(key)
String key = "Microsoft";
uint success = hf.DeleteKey(key)
int Success = hf.GetFirstKey();
int Success = hf.GetNextKey();
int Success = hf.GetPrevKey();
Test code is hashfileTest.cs.
A Method to Test HashFile.InsertKey
public static void InsertKeys()
{
HashFile.HashFile HashFileTest1 = new HashFile.HashFile();
HashFileTest1.Initialize("Test1",50,50);
StreamReader StreamReaderTestKeys = File.OpenText
(@"D:\Documents and Settings\All Users\Documents\vs\EmailIDTEst\
EmailIDTEst\bin\Debug\Keys1.txt");
Stopwatch StopWatchtest1 = new Stopwatch();
int CountofTestKeys = 0;
Console.WriteLine("Please Wait...");
string TestKeyRecord = StreamReaderTestKeys.ReadLine();
while (TestKeyRecord != null)
{
string[] TestKeyFields = TestKeyRecord.Split(' ',',');
for (int i = 0; i < TestKeyFields.Length; i++)
{
if (TestKeyFields[i].Contains("@"))
{
StopWatchtest1.Start();
int success = HashFileTest1.InsertKey
(TestKeyFields[i].Trim(),CountofTestKeys.ToString(),
false);
StopWatchtest1.Stop();
if (success < 1)
Console.WriteLine(TestKeyFields[i].Trim() + "," + success );
CountofTestKeys++;
}
}
TestKeyRecord = StreamReaderTestKeys.ReadLine();
}
StreamReaderTestKeys.Close();
Console.WriteLine("Records=" + CountofTestKeys + ",Ticks=" +
StopWatchtest1.ElapsedTicks + ",Frequency=" + Stopwatch.Frequency +
",Average Write Time(ms)=" + (double)StopWatchtest1.ElapsedTicks /
(double)Stopwatch.Frequency * 1000 / (double)CountofTestKeys + ",
Avg Reads=" + (double)HashFileTest1.count / (double)CountofTestKeys);
}
Test Results
InsertKey
Keys=528461,Ticks=1022397419386, Frequency=2992560000, Average Write Time(milliseconds)=0.646493162076644, Avg Reads=5.77592859264922
Keys=8125, Ticks=10044699390, Frequency=2992560000, Average Write Time(milliseconds)=0.413114755979444, Avg Reads=5.50141538461538
FindKey
Keys=528461,Ticks=201771307445, Frequency=2992560000, Average Read Time(milliseconds)=0.127586169617676, Avg Reads=6.14659738372368
Keys=8125, Ticks=2945715784, Frequency=2992560000, Average Read Time(milliseconds)=0.121150331139174, Avg Reads=5.48246153846154
Remarks
The version of the class released is just a prototype, which obviously means that this software should not be used in the production environment. Any losses arising out of error or bugs in this version of the software are the sole responsibility of the end user.
Any suggestions to improve the HashFile
class is most welcome.
History
- 4th October, 2009: Initial version