Introduction
While working for a company that produced software development tools, I had to write an application that encapsulated an entire file system within a single file. Part of that problem involved creating an associative lookup structure in which alphanumeric keys -- full paths, in that case -- could be used both to look up data associated with those keys, i.e. pointers into a FAT, and to perform fast globing using the keys as common path prefixes. On top of it all, this funky lookup structure had to be fast, namely:
- Its lookup time had to beat that of a brain-dead hash table with string keys.
- Finding all the elements in the structure with a common prefix had to run in sub-linear time.
- The structure had to provide an element removal operation.
My approach was to implement a so-called Patricia trie, which is a compressed trie that uses certain properties of string keys, as described below. This is done in order to collapse trivial branches.
Background
As a historical note, this data structure's name is really just an acronym. It stands for "Practical Algorithm to Retrieve Information Coded in Alphanumeric." The ADT was described in a paper published in 1968 by Donald R. Morrison.
A Patricia trie is a compressed trie that uses common substrings in unique keys as a starting point for collapsing trie branches. As a result of this compression, Patricia tries are "denser" than regular tries. That is specifically the reason why operations on the structure -- i.e. addition, removal, lookup, prefix lookup -- and its memory consumption are so efficient.
A Patricia trie is a binary tree in which each node has a "bit index" that specifies a bit position in a key (string). In my implementation below, the tree levels have increasingly higher bit indices, so a parent's bit index is always lower than those of its children. That is, unless the node has a self edge or an upward edge, as discussed below. Other implementations assume that all the keys are of equal length, which is not the case here because you may have arbitrarily long sequences of characters in file names. They start from high bit indices, typically equal to the length of the key, and they assign descending bit indices to nodes as one goes deeper into the tree.
The keys in a Patricia trie are bit strings and the indices are labeled in ascending order from left to right. You can also think of keys in terms of byte strings, where a bit-index of 0 specifies bit 0 of byte 0 of a key. A bit-index of 8 specifies bit 0 of byte 1 of a key and so on.
An observation that I should make is that P-tries have back edges at the leaves. So if binary trees have leaves that "point to nothing" so to speak, P-tries have leaves that contain two edges that either point to the leaves themselves or back into the structure to some other nodes.
Searching starts at the root node. When the lookup algorithm arrives at one of these nodes in the process of searching a specific key, it tests the bit at the specified index in the lookup key. If the bit is 0, the algorithm follows the left outgoing edge. If the bit is 1, the algorithm follows the right outgoing edge to the next node. A search on a Patricia trie concludes when the algorithm follows an upward edge, i.e. an edge from a leaf back into the tree. Whatever the upward edge points to, that is the result of the search. When looking for specific keys, a key comparison has to be performed in this final step. If the key at that node matches the key we are looking for, the search succeeds. Otherwise, the search fails and thus the key is not in our trie.
Example
Say you have the following collection of keys:
SOME
ABACUS
SOMETHING
B
ABRACADABRA
THIS
SOMERSET
The trie will look like this:
Each node encapsulates a key (e.g. "SOME") and a bit index (0), among other things.
The simple case: Say you want to look for the key "B" in your trie. You start at the root ("SOME", 0), whose bit index is 0. So you look at bit 0 in your key: is it 0 or 1? It is 0, so you take the left edge to the next node: ("B",1). You look at bit 1 in your key: it is 1, so you take the right edge, which takes you backwards (i.e. to a node with a less-than or equal-to bit index) to the node ("B", 1). You do a string comparison between your key "B" and the key at the node where you ended up. They match! So there is your node, in 2 + 1 steps (2 bit comparisons + 1 strcmp
).
A more complicated case: Say you want to look for "SOME", whose key happens to be encapsulated by the root of the structure. You start at the root, whose bit index is 0. Look at bit 0 in your key "SOME". It is 1, so you take the right edge to ("ABACUS",1). Look at bit 1 in your key. It is 1, so you take the right edge to ("SOMERSET",33). Look at bit 33 in your key. It's 0, so you take the left edge to ("SOMETHING",34). Look at bit 34 in your key: it's 0, i.e. it doesn't exist so the keys are "padded" with as many zeroes as necessary. So you take the left edge again, which takes you backwards to the node whose key is "SOME". Do a string comparison: is "SOME" equal to your key? It is, so you have just found "SOME" in the structure in 6 steps (6 bit comparisons + 1 strcmp
).
An even more complicated case: Say you want to find all the elements in the trie with the common prefix "AB". Start at the root again. Look at bit 0 in your key. It's 1, so you take the right edge to ("ABACUS", 1). Look at bit 1 in your key. It is 0, so you take the left edge to ("ABRACADABRA", 16). At this point, the bit index at that node is 16, which is greater than the length in bits of your search key. So you STOP at that point and that becomes the "root" of the sub-tree that stores all the elements with the same common prefix "AB". This node was reached in 2 steps (2 bit comparisons). Say you want to enumerate all the nodes that have keys with a common prefix "AB". You can perform a depth first search starting at that root, stopping whenever you encounter back edges. The root has two outgoing edges -- both of which are back edges -- to nodes ("ABRACADABRA", 16), i.e. a self-edge, and ("ABACUS", 1). So that is your collection of nodes that have keys with a common prefix "AB".
Using the code
The Patricia trie implementation that I am presenting here consists of a single template class. A client can instantiate type-specific versions of the template, based on what type of information the structure has to store. While my initial implementation had to store FAT pointers -- FILE*
pointers, essentially -- I recognize that other applications of this may end up having entirely different requirements. A template class is ideally suited to address that.
The code consists of two classes, nPatriciaTrieNode
and nPatriciaTrie
. The former defines a trie node and it encapsulates the following:
- A bit index
- A key, i.e. the target of an upward edge somewhere else in the trie
- A pointer to the '"eft" child
- A pointer to the "right" child
The latter defines the trie itself. It encapsulates a single pointer to the root of the trie. Any lookup, addition or removal operation starts at the root and goes down the trie comparing key bits at the indices specified by the visited nodes. The trie node interface looks like this:
nPatriciaTrieNode();
nPatriciaTrieNode(nPatriciaTrieKey, T, int,
nPatriciaTrieNode<T>*, nPatriciaTrieNode<T>*);
virtual ~nPatriciaTrieNode();
void Initialize(nPatriciaTrieKey, T, int,
nPatriciaTrieNode<T>*, nPatriciaTrieNode<T>*);
T GetData();
bool SetData(T);
nPatriciaTrieKey GetKey();
nPatriciaTrieNode<T>* GetLeft();
nPatriciaTrieNode<T>* GetRight();
The trie interface looks as follows:
nPatriciaTrie();
virtual ~nPatriciaTrie();
virtual nPatriciaTrieNode<T>* Insert(nPatriciaTrieKey, T);
virtual T Lookup(nPatriciaTrieKey);
virtual nPatriciaTrieNode<T>* LookupNode(nPatriciaTrieKey);
virtual bool Delete(nPatriciaTrieKey);
The demo application included in the package shows how to instantiate an <int>
version of the class -- i.e. keys are associated with integer numbers -- and how to add elements to the structure, as well as how to remove elements and how to search for selected keys. These are simple calls to nPatriciaTrie<int>::nPatriciaTrie()
, nPatriciaTrie::Insert
, nPatriciaTrie::Delete
, and nPatriciaTrie::Lookup
, respectively. For example:
nPatriciaTrie* p = new nPatriciaTrie();
p->Insert("My first element", 1234);
printf("%d", p->Lookup("My first element"));
printf("Foo is %s the trie", p->LookupNode("Foo") ? "in" : "not in");
p->Delete("My first element");
I am sure that more interesting operations are also possible. In fact, my commercial implementation of this trie has a bit of functionality that this implementation does not: it can find all of the elements associated with keys using a common given prefix, as described in the example above. I will leave the implementation up to the reader, but the bottom line is that all such elements are in a sub-tree, including back edges and their targets. The root of that sub-tree can be reached using the given common prefix.
Points of interest
While writing this code, I noticed that no one at the time, i.e. 2001, seemed to have an implementation for an element removal operation. All the literature on the subject dismissed removal as being "trivial" or "similar to the addition operation." As a result, my own implementation may or may not differ conceptually from those written by other people. All I can say is that the removal was not entirely trivial and was certainly different from addition in a number of ways.
Contributions to this code are always appreciated.
History
- 02.03.2005 - Initial public domain release
- 02.07.2005 - Clarifications, thanks to WREY
- 07.11.2007 - Changes to work on arbitrary templated key-types, plus additional methods by Christoph Mayer