Save Key/Value Pairs in a File

Bizken

3.92/5 (7 votes)

9 Oct 2009GPL34 min read

51.8K

408

A HashFile class to save key/value pairs in a file

Download source - 4.58 KB

Introduction

Every C# programmer might have used class “hashtable”. A hashtable is a mechanism to store and retrieve key/value pairs. Hashtable is efficient in many ways and data is stored in the RAM that makes it extremely fast. What if the number of key/value pairs is very large? Alternatives are windows registry or a relational database table. Obviously one may not wish to store a large amount of application data in the Windows registry and databases are resource hungry and expensive.

HashFile class is an attempt to find an answer to the problem.

HashFile Class

HashFile is a technique to insert/delete/find key value pairs in a file in the hard disk. It uses fixed length records to store data and index. There are two files “name.dat” and “name.idx” corresponding to a "name". "name" is the identification for key/value pairs, and is given at the time of initialization.

"name.dat”

This is the data file. The Data file stores fixed length records of key/value pairs. Name of key/value pair, length of key and value should be specified at the time of initialization. Maximum size of the data file could stretch up to 2^32 bytes, which is the maximum value of unsigned integer (uint32).

“name.idx”

This is the index file. It stores fixed length records of record pointer and index pointer. Record pointer and index pointer are unsigned integers.

Indexing Technique

It uses Index Block (IB) of 91 x 2 Matrix to store record and index pointers. When Hashfile is initialized for the first time, an empty IB will be created and every element will be filled with max value of data type uint. Why 91 rows? 91 is the number of characters in the ASCII range 32-122. Each row in the IB represents an ASCII character. Columns of the IB are record pointer and index pointer. Data types of columns are unsigned integers. Record pointer represents the starting byte index number of a data record with respect to the origin, similarly index pointer represents starting byte index number of an Index record with respect to the origin.

Pseudo code for InsertKey method is given below:

Take a byte from the new key
If byte is within range 32-122, then jump to the matching index record within an IB that corresponds to distance from the left most byte, otherwise return error
Read record pointer and index pointer
If index pointer is equal to maximum value of data type uint, then, save record pointer and index pointer, save key and value in the data record and return record pointer
Otherwise read data record for the record pointer
Compare new key and existing key
If new key matches existing key in the data record, then go to step 10
If new key is greater than existing key, then go to step 11
Otherwise step 13
If the status of the data record is deleted then, save new value, change deleted flag to false, return record pointer, otherwise return error
If index pointer points to next IB, take next byte and jump to step 2
Otherwise create new IB, save record pointer and index pointer in the new IB, assign start index pointer of new IB to the old IB, save key and value in the data record, return record pointer
If index pointer points to next IB, assign new record pointer to the index record, swap new and existing keys jump to step 2
Otherwise assign new record pointer to the index record, create new IB, assign existing record pointer, new index pointer to new IB, assign start index pointer of new IB to the old IB, save key and value in the data record, return record pointer

Sounds bit fuzzy, isn't it? But the technique seems to work in the test environment. As new keys are added, indexing mechanism will attempt to place record pointer at a minimum distance as possible from the root IB. Binary tree search algorithm eliminates items to be searched by half after each branch, whereas the above described technique eliminates probably about 90/91^th of the outstanding records after every IB assuming that population is well distributed. Test results show that average IB reads to find a record is about 6.2, in a population of 528,461 keys, which sounds good. See the test results section for details.

Using the HashFile Class

// Instantiate class
HashFile.HashFile hf = new HashFile.HashFile();
 
// Initialize Key & Value attributes
hf.Initialize("Test1",50,100);
 
// "test1" is the indentification
// 50 is maximum length of key
// 100 is maximum length of Value

// Call  InsertKey
//
String key = "Microsoft";
String Value = "Software company";
int success = hf.InsertKey(key,Value,false)
 
// Call FindKey
String key = "Microsoft";
uint success = hf.FindKey(key)

// Call DeleteKey
String key = "Microsoft";
uint success = hf.DeleteKey(key)

// Call GetFirstKey
 int Success = hf.GetFirstKey();

// Call GetNextKey
 int Success = hf.GetNextKey();

// Call GetPrevKey
 int Success = hf.GetPrevKey();

Test code is hashfileTest.cs.

A Method to Test HashFile.InsertKey

public static void InsertKeys()
{
    HashFile.HashFile HashFileTest1 = new HashFile.HashFile();
    HashFileTest1.Initialize("Test1",50,50);
 
    //StreamReader StreamReaderTestKeys = File.OpenText
    //(@"D:\Documents and Settings\All Users\Documents\vs\
    //EmailIDTEst\EmailIDTEst\bin\Debug\Keys.txt");
    StreamReader StreamReaderTestKeys = File.OpenText
	(@"D:\Documents and Settings\All Users\Documents\vs\EmailIDTEst\
	EmailIDTEst\bin\Debug\Keys1.txt");
 
    Stopwatch StopWatchtest1 = new Stopwatch();
 
    int CountofTestKeys = 0;
 
    Console.WriteLine("Please Wait...");
 
    string TestKeyRecord = StreamReaderTestKeys.ReadLine();
    while (TestKeyRecord != null)
    {
        string[] TestKeyFields = TestKeyRecord.Split(' ',',');
        for (int i = 0; i < TestKeyFields.Length; i++)
        {
            if (TestKeyFields[i].Contains("@"))
            {
                StopWatchtest1.Start();
                int success = HashFileTest1.InsertKey
                       	(TestKeyFields[i].Trim(),CountofTestKeys.ToString(),
                        		false); // Call to HashFile.InsertKey
                StopWatchtest1.Stop();
 
                if (success < 1) // Failure to insert key
                    Console.WriteLine(TestKeyFields[i].Trim() + "," + success );
 
                CountofTestKeys++;
            }
        }
        TestKeyRecord = StreamReaderTestKeys.ReadLine();
    }
 
    StreamReaderTestKeys.Close();
    Console.WriteLine("Records=" + CountofTestKeys + ",Ticks=" +
            	StopWatchtest1.ElapsedTicks + ",Frequency=" + Stopwatch.Frequency +
            	",Average Write Time(ms)=" + (double)StopWatchtest1.ElapsedTicks /
            	(double)Stopwatch.Frequency * 1000 / (double)CountofTestKeys + ",
            	Avg Reads=" + (double)HashFileTest1.count / (double)CountofTestKeys);
}

Test Results

InsertKey

Keys=528461,Ticks=1022397419386, Frequency=2992560000, Average Write Time(milliseconds)=0.646493162076644, Avg Reads=5.77592859264922

Keys=8125, Ticks=10044699390, Frequency=2992560000, Average Write Time(milliseconds)=0.413114755979444, Avg Reads=5.50141538461538

FindKey

Keys=528461,Ticks=201771307445, Frequency=2992560000, Average Read Time(milliseconds)=0.127586169617676, Avg Reads=6.14659738372368

Keys=8125, Ticks=2945715784, Frequency=2992560000, Average Read Time(milliseconds)=0.121150331139174, Avg Reads=5.48246153846154

Remarks

The version of the class released is just a prototype, which obviously means that this software should not be used in the production environment. Any losses arising out of error or bugs in this version of the software are the sole responsibility of the end user.

Any suggestions to improve the HashFile class is most welcome.

History

4^th October, 2009: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)