Introduction
I was working on a project where I had to process about 1,000 records. Eventually, this set of records grew to several thousand. In some cases, the time complexity of routines that processed this data was O(n2). Using an array to randomly access these records was no longer feasible. It was time to go back to my computer science course notes to work with a more exciting data structure. You are probably here because you are in the same situation. (I'm no longer in that situation. Now I have to process several hundred thousand records!)
In my search for work already done that I could fall back on, the only serious C++ hash template class I could find was in SGI's extension to STL. Like much of STL, I found it difficult to understand. Unlike STL, I found it difficult to use as well. If you have a lightweight task to perform, you have to go through a relatively cumbersome process of defining this, implementing that, etc. I wanted something that could be quickly defined and instantiated -- I would have to write my own.
Why My Hash Tables are Better than SGIs
- You can implement them easily without too much fuss: The implementation for the more common use of the templates simply involves deriving a class and implementing 3 pure virtual methods, which on average would only have about 3 short lines of code each. The way I've written my implementations, I only need 1 line of code each.
- You can quickly familiarize yourself with them: It's easier to understand what your parts of the code are doing, and the consequences of how you write them.
- If you want to get adventurous, you can have more control over optimization and the hashing function.
Why SGI's Hash Tables are Better than Mine
- SGIs are richer in features: For instance, SGI boasts that their hash's bucket array is self-adjusting. This was not necessary for the roles in which my hash tables were used. You have to set the array size yourself and once you do that, you can't change it again. (The beta version no longer has this constraint.)
- There's a chance the performance of my hash tables is not optimal: I'm only a senior-intermediate programmer and I'm sure the guys at SGI know what they are doing. While I do have an acute understanding of optimization, compilers and the like, there's a chance I missed something.
Limitations
There are some things that other hash tables can do that this one cannot:
- The keys must be unique: I believe that SGI's
hash_multiset
is free from this restriction. If you try adding an element which has a key that already exists in the table, then only the first key will ever be referenced in a random access lookup. - The bucket array size must be set only once: As stated above, my hash tables were not used in such a way that adjustments to the array were necessary. In some situations, a hash table that has a fixed bucket array size is unacceptable. (No longer a restriction in the beta version.)
Background
This article assumes that you know how a hash table works. If you don't, then you should familiarize yourself and consider the possibility that a hash table is not even an appropriate solution. Since you must also supply the hash function, you must know what those are all about. You don't have to write your own. There are plenty available on-line, but you'll need to incorporate them, understand their principles, and judge if they are appropriate to your data.
Platforms, Environments and Requirements
I've written these templates to be friendly to all compilers, environments and flavours of C++. However, I have only actually tried the code in GNU C++ and Visual C++. The only environmental requirement is that they must have access to the STL vector template class. There is also an #include <assert.h>
, which is part of the ANSI standard. If, for whatever reason, that causes a compiler error, you can easily comment out the line and define your own assert macro to keep the compiler happy.
Using the Code
There are 3 hash table templates that you can use. The Doxygen-generated documentation is included with the source code. However, let's get you started on a "Hello world
," which you can use as an outline to hash table applications.
Getting Started: A Tutorial
First, you'll already have your class defined, to which you will need fast random access, called T
in the template code (or the "data class"). The declaration to such a class might look like this:
class Student {
public:
Student();
string m_sName;
int m_nGrade;
int m_nID;
};
Next, we will declare your hash class:
#include <ahashp.h>
class StudentHash : public _HashP<Student> {
public:
StudentHash();
private:
virtual bool COMPAREREFERENCESPREATTRIBUTES CompareReferences(
const Student &ref1, const char *ref2) const COMPAREREFERENCESPOSTATTRIBUTES;
virtual bucket_array_unit HASHREFERENCEPREATTRIBUTES GetHashReference(
const char *ref) const HASHREFERENCEPOSTATTRIBUTES;
virtual bucket_array_unit HASHREFERENCEPREATTRIBUTES GetHashReference(
const Student &ref) const HASHREFERENCEPOSTATTRIBUTES;
};
Note that you can dispense with those *ATTRIBUTES
macros if you expect never to use them. However, if you do use them, there's a chance that you might forget that you'll need to update your old code and forget about this class. So, if you are a stickler for correctness, make sure you leave them in.
Also note that there are actually two template parameters being passed to _HashP
. The second one defaults to const char *
. You'll also notice that those template parameters are the types of the parameters being passed to the virtual methods. This is not a coincidence. The template parameters decide the types of the parameters passed to the virtuals. Finally, define your class.
CompareReferences()
is used so that when _HashP
needs to match a key with a particular element in the hash table, it can. The STL way of doing things might require that Student
have a comparison operator. This way, instead of making changes to your data class, you can just implement a pure virtual in your hash class.
bool StudentHash::CompareReferences(const Student &ref1, const char *ref2) const {
return ref1.m_sName == ref2;
}
Now for the hash function. These return an index to the bucket array. Notice that the modulus operation with GetBucketSize()
will usually be necessary:
StudentHash::bucket_array_unit StudentHash::GetHashReference(const char *ref) const {
sdbm(ref)%GetBucketSize();
}
StudentHash::bucket_array_unit StudentHash::GetHashReference(const Student &ref) const {
return GetHashReference(ref.m_sName.c_str());
}
It is up to you to choose which hash function to use. Sdbm()
is an attractive choice for keys which are string
s, with good performance and good randomization, so I recommend that one. And there: your hash class is implemented! All you need to do is use it, and for that you should read about the Add()
, Get()
and SetBucketSize()
methods.
Points of Interest
Optimization
I've made use of 4 macros whose names are of the form *ATTRIBUTES
. You can define these as compiler-specific attributes, which may make the calling of the pure virtuals slightly more efficient. One such implementation might be of the GNU attribute __attribute__((fastcall))
or the Visual C++ __fastcall
. Use of attributes can be very dangerous in GNU, and probably to a lesser extent in Visual C++, so you should be very careful that you know what you are doing when using those. So far, I've been using __attribute__((fastcall))
quite happily without incident, so due diligence should yield safe usage.
Characteristics
This hash table is of the chaining variety. The chains are maintained by linked list. This construct is ideal for roles where the number of lookups on the table do not greatly exceed the insertions and deletions on the table.
If You Think You Can Do Better...
If you'd like to make improvements to my code, I'll give you some tips on understanding it.
Notation
Hungarian notation is a favourite of mine, but I prefer an improvement that I call Canadian notation.
Noise/Clutter
I've tried to clean up the code. However, I have left in some debug code that I was reluctant to trim, since I've been able to make use of it at one time or another. Unlike all the other methods, it's not very well documented and, in some cases, usage might not be very intuitive. So, keep in mind that you can ignore it without losing any key points in understanding the code.
STL Philosophy
One can iterate through a hash table using the "hasherators," which borrow from the STL iterators. I found the iterator concept difficult to digest at first, but it is probably the best way to sequentially traverse such a random access data structure efficiently, should it be necessary.
Private Pure Virtual Philosophy
The C++ culture seems to believe in only making public
pure virtuals. This would probably be why, if you've enabled such a warning, GNU C++ will always give you a warning about a class with virtual methods not having a virtual destructor, even if the virtual methods are not public
or the destructor is not public
. So far, I have not discovered a reason why pure virtuals should not be private
, so that the abstract
class can call them, rather than having the owner call them through what is probably a pointer to the instance. This philosophy makes pure virtuals like callbacks, and still takes full advantage of the strengths of abstract
classes.
Miscellaneous
One might notice that I used the STL array template vector
, but did not use the linked list. I found the linked list frustrating at the time and was getting a lot of debug errors about the improper use of iterators. So, I just dumped it and created my own linked list. Also, I'm not entirely certain of the efficiency of STL's linked list, but I am of the one I wrote.
Differences in Versions
The beta version of the code has been tested but not extensively. V1.0, however, will generate compiler errors if you use certain methods. There are also some glaring inefficiencies in some of the less important code. So the beta version is characterized by:
- Self-adjusting bucket array: You no longer have to worry about the size of the bucket array at all. The array will increase when it feels that the table has gotten too big. You also have the ability to control when those adjustments are made and how big they are.
- It's still in beta.
- More efficient code: Certain code blocks are faster, but not likely the crucial
Get()
methods. - Names changes: I was using naming schemes that broke with convention, so I've had to change some of the class names.
History
- 4th February, 2008 -- Original version posted
- 8th February, 2008 -- Article content updated
- 18th March, 2008 -- Article updated - some code fixes
- 22nd August, 2011 -- Article updated - beta version introduced
- 2nd July, 2015 -- Discovered some quirks in the latest GNU, so some strange tweaks had to be made. There probably will be more.