Introduction
The trie is a family of tree data structures used to hold sorted sets or maps (dictionaries) in which each node of the tree only holds, or represents, parts of keys, not entire keys.
I am using a loose definition of "trie" here because I have developed a special kind of trie. I call it a "Compact Patricia Trie" or CPTrie. Unlike standard tries, which have very high memory requirements, CPTrie has lower memory requirements than other collection types, particularly in certain cases such as keys with a limited alphabet (e.g., ASCII text). It's generally slower than SortedDictionary and much slower than Dictionary, but it potentially uses much less memory than either of them.
So, while normal tries are sometimes used to build a sorted list quickly in exchange for high memory use, CPTrie provides the opposite tradeoff.
Please do read the Wikipedia article if you have never heard of tries; from now on, I'll assume you understand the concept.
Background
A few months ago, I read about something called the Judy array, a sparse array data structure written in C that organizes itself in one of several ways in order to keep itself both fast and memory-efficient at the same time. You can assign a value at any sequence of indices between 0 and uint.MaxValue, and Judy will organize itself into a kind of trie up to 4 levels deep (one level for each byte of the index) that is always fast and memory-efficient regardless of the key distribution. For example, if you assign a value at every index between 0 and 1000, Judy will produce a data structure similar to an ordinary array, so that the values are closely packed. Meanwhile, if you use clusters of indices, say 23-70, 200-210, and 500-550, with a few unused slots, Judy will choose a different arrangement called a "bitmap". In any case, it minimizes the number of processor cache lines that are accessed.
Thanks Judy...
I was so captivated by Judy's ideas that I wanted to make something like it for .NET, but after thinking and coding for several days, I came to the conclusion that there was no way, in C#, to come close to the speed or memory efficiency of Judy.
For example, Judy often stores data inside pointers. The bottom 2 or 3 bits of any pointer to any heap block are always zero, so you can either use those bits to store information along with the pointer, or you can use one of those bits to indicate that the pointer holds an integer instead. The dynamic language Ruby uses this technique to store integers in variables. That way, a variable can either point to an object, or (if the low bit is 1) it can store an integer directly with no heap space required. Sadly, this seems to be outright impossible in C#, even with unsafe code, since the garbage collector would probably crash when it encounters one of these non-pointers.
Another important limitation of C# is that you cannot embed a fixed-length object[] array inside a class--you can only use fixed-length arrays of primitive types. Even then, you must sprinkle the unsafe keyword everywhere, so the code would be unusable in "safe" contexts. An external object[] array is distinctly less efficient, as it requires a 4-DWORD header and may need one extra cache line to access. Oh, and in .NET, the allocation alignment of ordinary objects cannot be controlled, so it is difficult to align to cache line boundaries.
...I left you for Patricia
Then, after reading about Burst tries and Patricia tries, I decided maybe there was hope that I could make some kind of useful new data structure for .NET. Plus, while porting some C++ code to C#, I was irritated because there was no .NET equivalent for map::lower_bound, and I figured I would solve that problem too.
Introducing the CPTrie
CPTrie combines...
- the Patricia trie idea of storing multi-byte prefixes in a single node
- the Burst trie idea of storing a limited number of keys together without factoring out common prefixes
- the Judy ideas of using multiple node types, storing data in a very compact form, and gaining speed by being cache-friendly
I guess I could have called it the PBJ trie. That would have been yummy.
I also wanted to use the Judy idea of using different types of nodes for different key densities, but I couldn't figure out an efficient way to port Judy's "bitmap" and "uncompressed" nodes to .NET, under the constraint that I wanted to support variable-length keys. But I'm really getting ahead of myself.
CPTrie itself is an abstract base class that only understands keys as arrays of bytes. Each kind of key you want to support needs a derived class to translate keys into byte arrays. In order to store strings in a CPTrie, I provide the CPStringTrie class. When you store a string in CPStringTrie, the string is encoded in UTF-8 format before being placed in the trie. To store integer keys, use CPIntTrie. If you actually want to use byte arrays as keys, use CPByteTrie.
If you use CPStringTrie to hold strings, it will likely end up as a tree of "sparse" (a.k.a. "standard") nodes (CPSNode). When using CPIntTrie, up to three different kinds of nodes are used depending on the distribution of keys. Small collections (regardless of key distribution) always use CPSNodes.
It helps to explain the structure of a CPTrie in picture form. Suppose you run the following code:
CPStringTrie<int> t = new CPStringTrie<int>();
t["This little piggy went to market"] = 1;
t["This little piggy stayed at home"] = 2;
t["This little piggy had roast beef"] = 3;
t["This little piggy had none"] = 4;
t["And this little piggy went"] = 5;
t["'Wee wee wee' all the way home."] = 6;
t["And then some wolf came along"] = 7;
t["and blew down two of their houses!"] = 8;
CPStringTrie will end up containing two nodes, structured as follows:
As you add keys, the node must either enlarge itself or split itself (creating a child node), to make room for new keys and values. At first, it will enlarge itself, but at some point, it will notice a common prefix and decide to create a child node instead.
Notice that the trie is sorted, but the sort order is case-sensitive (uppercase comes before lowercase, the natural order of ASCII). A SortedDictionary supports various sort orders, but case-sensitive sorting is an inherent limitation of a trie (a kind of workaround is possible, but I have not tried it).
Like a Patricia trie, CPSNode can hold several bytes of a key within a single node. Somewhat like a Burst trie, CPSNode does not immediately create new nodes when two keys have a common prefix, because it is more efficient to wait until there are several keys before splitting the node. Unlike a Burst trie, however, a CPSNode splits by creating children, not by creating a new parent node.
Nitty-gritty details (skip if you don't care)
CPSNode<T>: the "standard" or "sparse" node
A CPSNode<T> consists of:
- An array of "cells" (SCell[]) which are used to encode parts of keys
- An array of "children" (CPNode[]) which associate child nodes with common prefixes
- An array of "values" (T[]) associated with keys that terminate in the current node (i.e., keys that do not have children)
Some nodes have no children, so the array of children may be null. Likewise, there may be no values; in fact, storage is not allocated for the null value, so if you want to store a set of strings instead of a dictionary, use null for all values to save memory.
The cell is the innovative part that allows nodes to be very compact. Normal .NET strings take up a relatively large amount of space, because each string has a 12-byte header as well as a 2-byte null terminator and sometimes 2 bytes of padding (in a 32-bit process, I think). Moreover, if the text is English, half the space is wasted on null bytes of UTF-16. By encoding several strings into a single array of "cells", only one 12-byte header is needed for all the strings, and there are no null terminators.
Cells are 4-byte groups that encode partial (or complete) keys and pointers to values or child nodes. If a CPSNode contains _count keys (or partial keys), the first _count cells are sorted and describe the beginning of those keys. Each cell contains up to 3 bytes of a key (K0, K1, and K2), so the first _count cells tell us the first three bytes of each key. The fourth byte, called P, acts as a pointer to another cell, to a child, or to a value, depending on its value (the ranges below are tested in order):
- A child node (if P < _count); [P] is an index into the child array
- Another cell (if P < 222); [P] is an index into the cell array
- A non-null value (if P < 254), where [253-P] is an index into the value array
- The null value (if P == 254)
If P == 255, it means that the cell is free, and it is part of a doubly-linked list of cells that are not in use.
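The rules for interpreting P can be sketched as a small classifier. This is only an illustration, not the actual CPSNode code: the names CellPointerSketch, Target, and Classify are hypothetical, and the sketch assumes the ranges are tested in the order listed above.

```csharp
using System;

static class CellPointerSketch
{
    public enum Target { Child, Cell, Value, NullValue, FreeCell }

    // Classifies the P byte of a cell. 'count' plays the role of _count.
    // 'index' receives the array index implied by P, or -1 when there is none.
    public static Target Classify(byte p, int count, out int index)
    {
        index = -1;
        if (p < count) { index = p; return Target.Child; }        // index into the child array
        if (p < 222)   { index = p; return Target.Cell; }         // index into the cell array
        if (p < 254)   { index = 253 - p; return Target.Value; }  // P = 222..253 maps to value slots 31..0
        return p == 254 ? Target.NullValue : Target.FreeCell;     // 254 = null value, 255 = free cell
    }
}
```

Note that the values 222 through 253 give exactly 32 distinct value slots.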
Because P is only one byte, the size of a CPSNode must be fairly limited. There are a maximum of 32 values; all keys together cannot use more than 222 cells; and I arbitrarily selected limits of 34 for _count and 50 bytes for the length of any single key*. The number of keys in a single node must be strictly limited anyway, because the first _count cells form a sorted array, and building a sorted array with N items by repeated insertion requires O(N^2) time. Building an unsorted array first and sorting it afterward would be faster (O(N log N)), but for various reasons, that seems impractical. The limit should not be too low either, because as the number of keys per node decreases, CPSNode becomes less memory-efficient as more objects and arrays are allocated, leading to more overhead. Also, traversing from one node to another may take extra time, as the processor stalls more often waiting for cache lines to fill.
* Note: I am not saying keys are limited to 50 bytes, only that keys longer than 50 bytes will be broken up into a chain of nodes.
If P points to a value or child node, the key might not use the entire three bytes that are available in that cell. Special values of K2 indicate this situation; for instance, K2==254 means that the key ends after only one byte (K0). Of course, if the number 254 actually happens to be the third byte of the key, special handling is required.
There can be (at most) one zero-length key in any given node, but there can sometimes be a zero-length cell following a cell of length three, when a special number like 254 is encountered at the end of a key.
CPBNode<T>
Since a CPSNode can contain up to 34 items, it can comfortably hold a set of words starting with each letter of the alphabet (e.g., "apple", "banana", "coconut", "donut", etc.). However, if there are more than 34 different characters at the beginning of the keys (or following a common prefix), a different data structure is required.
I would have liked to do something resembling Judy's "bitmap nodes", but I couldn't think of a very efficient way to do it in .NET. So instead, I implemented a simple kind of bitmap node that I call CPBNode. CPBNode simply divides the "byte space" into 8 slices, with a child CPSNode for each slice. There is a CPSNode for keys that start with a byte between 0 and 31 (child #0), another CPSNode for keys between 32 and 63 (child #1), and so forth. This way, there can never be more than 32 unique starting bytes in any given child. When CPBNode is asked to create a key such as "~foo~", it observes that '~' (0x7E) is in slice #3, so it forwards the "create" request to child #3. If some slices are not needed (e.g., if there are no non-ASCII characters), CPSNodes are not allocated for unused ranges.
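The slice selection described above amounts to dividing the first key byte by 32. Here is a minimal sketch of that arithmetic; SliceSketch and SliceOf are hypothetical names used for illustration, not members of CPBNode.

```csharp
static class SliceSketch
{
    // Maps the first byte of a key to one of 8 slices of 32 byte values each:
    // 0..31 -> 0, 32..63 -> 1, ..., 224..255 -> 7.
    public static int SliceOf(byte firstByte)
    {
        return firstByte >> 5;   // same as firstByte / 32
    }
}
```

For example, SliceOf((byte)'~') is 0x7E >> 5, which is 3, matching the "~foo~" case above.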
CPBNode will switch back to CPSNode when the number of items drops below 24. CPBNode never converts itself to CPBitArrayLeaf.
CPBitArrayLeaf<T>
CPBitArrayLeaf is a new node type I added to improve the space efficiency of CPIntTrie when densely packed keys are detected, i.e., when sequences of integers are close together. CPBitArrayLeaf is also the fastest node type. Theoretically, string tries can also use this node type, but only when there are a large number of constant-length strings that differ only in their last character.
As its name implies, CPBitArrayLeaf is a leaf node with an array of bits (called _flags). It only supports 1-byte keys, and the bit array has two bits for each of the 256 possible keys. There is also an array of _values (type T[]) and a table of _indices that map keys to slots in the _values array. _flags is really two arrays of 256 bits: one array tracks which keys are in use, and the other tracks which entries in _values are in use, so that a free slot can be found quickly.
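To make the "256 bits, one per possible key" idea concrete, here is a minimal sketch of just the half of the bit array that tracks which keys are in use (in other words, the set case with no values). ByteKeySetSketch and its members are hypothetical names; the real CPBitArrayLeaf also maintains _values and _indices, which this sketch leaves out.

```csharp
sealed class ByteKeySetSketch
{
    // 8 x 32 bits = 256 bits, one per possible 1-byte key.
    private readonly uint[] _keysInUse = new uint[8];

    public int Count { get; private set; }

    public bool Contains(byte key)
    {
        return (_keysInUse[key >> 5] & (1u << (key & 31))) != 0;
    }

    // Returns true if the key was newly added.
    public bool Add(byte key)
    {
        if (Contains(key)) return false;
        _keysInUse[key >> 5] |= 1u << (key & 31);
        Count++;
        return true;
    }

    // Returns true if the key was present and has been removed.
    public bool Remove(byte key)
    {
        if (!Contains(key)) return false;
        _keysInUse[key >> 5] &= ~(1u << (key & 31));
        Count--;
        return true;
    }
}
```

The storage is 32 bytes no matter how many keys are present, which is why this kind of layout pays off only when the node is reasonably full.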
If a key is ever added that is not 1 byte, the node must be converted immediately to CPSNode or CPBNode.
CPBitArrayLeaf has a fixed size for keys - the number of non-null values affects its size, but the number of keys does not. Therefore, CPBitArrayLeaf is very efficient when it is almost full (near 256 keys), but inefficient when it is nearly empty (near 0 keys).
Normally, CPSNode switches to CPBitArrayLeaf when it is full with 34 items and every key is one byte (with no children); CPBitArrayLeaf switches back if the number of items drops below 24, which is the approximate point at which both node types occupy the same amount of memory. However, CPBitArrayLeaf is more compact when there are no values associated with the keys, because of the overhead of the _indices map. Therefore, if you are storing a set (with all null values) rather than a dictionary, CPSNode will switch to CPBitArrayLeaf at 16 items and switch back to CPSNode when the number of items drops below 12.
Deletions
CPTrie is quite memory-efficient if you only insert keys, but what about deletions? Well, rest assured, deleting keys is as fast as inserting them... at least I think so, but I didn't make a benchmark for it. However, after deleting the majority of keys, the trie may have a less efficient structure.
Remember, during insertion, a node is not eager to divide because multiple small nodes are typically less efficient than one large node. This is because if nodes are small, a larger fraction of their size goes to .NET object headers, and on average, more nodes must be traversed when scanning a key (keys are "scanned" as we insert, remove, find, or enumerate them). When deleting keys, small nodes are not consolidated into larger nodes; a small node is only eliminated when its last key is removed.
The worst case is not terrible, however. In the worst case, after removing most of the keys from a very large CPTrie, the "Patricia trie" property can be lost, increasing the space overhead. But tries cannot be "unbalanced" the way a binary search tree can be, so it is not crucial for CPTrie to have a balancing algorithm. CPSNode does notice when the majority of its cells are unused, and it will switch to a smaller array in that case.
Optimizing the trie
You may have heard that as you add items to a List<T>, the items are added to a fixed-size internal array. When that array is full, List<T> copies the items to another array that is twice as big, before discarding the original array.
CPSNode<T> does roughly the same thing for each of its arrays (_cells[], _children[], and _values[]); consequently, there are usually some unused array entries sitting around, wasting space.
CPTrie can optimize itself when it makes a copy of itself, so you can eliminate this space by calling Clone() and then discarding the original version of the trie. Unfortunately, at this time, it does not support optimizing in-place, so when you make the clone, memory will be required for the original trie and its clone simultaneously (which is only a problem if the trie is extremely large).
In some applications, you will build a trie and then scan it many times without modifying it at all. In such cases, you should optimize the trie. Note that the optimization process merely discards unused array entries; it does not optimize the structure of the trie (e.g., to account for deleted items).
My benchmark program shows that an optimized CPTrie tends to be about 20% smaller.
In a program (such as my benchmark) that uses datasets larger than the processor cache, I have found that an optimized CPTrie is about 10% faster when looking up keys. I believe this is for two reasons related to locality of reference:
- Because there are fewer blank spaces in the cells array, a key longer than 3 bytes, but in the same cell array, is more likely to be on the same cache line.
- An optimized trie node allocates all three of its arrays together. The .NET memory manager tends to put objects adjacent in memory when they are allocated at the same time, so the node itself and its three arrays are likely to be contiguous in memory. Therefore, they share some cache lines and can be accessed with fewer processor stalls.
When should you use a trie?
When managing any large amount of data, it's important to choose the right data structure for the job. If your keys are not strings and can't be represented as a byte array, then you can't use CPTrie anyway. If you do not need to access keys in sorted order, you are probably best off using a hash table. .NET's Dictionary<K,V> is an excellent hash table design that is both simple (small code size) and fast.
If you need keys in sorted order, but not until all the keys are added to the list, it might still be better to use a standard Dictionary, or even a List<KeyValuePair<K,V>>, because both of those are so much faster than SortedDictionary and CPTrie. Add all your key-value pairs to the Dictionary or List, then (if you used Dictionary) call ToList() to convert it to a List, then sort the list with List<T>.Sort().
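The "fill first, sort afterward" approach can be sketched as follows. SortAfterFillSketch and ToSortedList are hypothetical names; the sketch assumes string keys compared in ordinal order. Since KeyValuePair<K,V> does not define an ordering of its own, the sketch passes an explicit comparison on the keys to Sort().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SortAfterFillSketch
{
    // Copies the dictionary's pairs into a list, then sorts the list by key.
    public static List<KeyValuePair<string, int>> ToSortedList(Dictionary<string, int> dict)
    {
        List<KeyValuePair<string, int>> list = dict.ToList();
        list.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));
        return list;
    }
}
```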
On the other hand, if you need to interleave queries (that depend on the sort order) with modifications to the collection, SortedDictionary and CPTrie make sense. If you need to sort strings in case-insensitive or culture-sensitive order, you should probably use SortedDictionary.
In some cases, you should consider some kind of parallel data structure or algorithm (standard Dictionary, SortedDictionary, and CPTrie are all single-threaded).
Okay, so when should you use CPTrie?
One reason to use CPTrie is that it offers four operations that SortedDictionary lacks:
- Reverse-order enumeration: get an enumerator from GetEnumerator(), then call MovePrev() repeatedly instead of MoveNext().
- Find the nearest larger key: call Find() or FindAtLeast() to get an enumerator that points to the key equal to or greater than the requested key.
- Get the next key/value and get the previous key/value: SortedDictionary will only let you get the next key if you start enumerating from the beginning. With CPTrie, you can start enumerating (backward or forward) from any key you want.
There is no reason SortedDictionary couldn't offer these operations, but for some reason, it doesn't.
Here are some more reasons to consider CPTrie:
- Your keys have many common prefixes. For example, a set of URLs all start with one of a few prefixes, such as "http://", "http://www.", "ftp://", "file:///", and so forth. In such cases, CPTrie will store the common prefix only once, saving a lot of memory in a large set. Also, SortedDictionary has a lookup cost of at worst O(K log N) (where K is the key length and N is the collection size), whereas CPTrie is no worse than O(K).
- Your data set is extremely large, and you need to avoid using more virtual memory than physical RAM. For instance, if the machine has 1 GB of RAM, the computer will slow to a crawl when your data structure exceeds that size. In a 32-bit Windows process, you can't normally use more than 2 GB total; if you might approach that limit, consider CPTrie. In some cases, CPTrie can encode the trie itself and all the keys in the same amount of memory it takes Dictionary and SortedDictionary to hold the data structure and merely pointers to the keys. Provided that you do not need to store copies of the keys in string form, you can often use half the memory by using CPTrie (see graph below).
- You need a set, not a dictionary, and you are not using .NET 4.0 (which offers a sorted set collection). If you use a reference type as the value type of a CPTrie, CPTrie will save at least 4 bytes of memory every time you use null as the value for a key. This allows CPTrie to store sets very compactly.
- You want to store a set of integers that are highly clustered, meaning that there are long runs of almost-consecutive integers, e.g., 1, 2, 3, 4, 5, 7, 8, 9, 97, 98, 99, 101, 102, 104, etc. CPIntTrie can store large clusters efficiently with CPBitArrayLeaf nodes, especially if no values need to be associated with the integers.
Performance for string storage
I compared CPStringTrie to Dictionary and SortedDictionary for three data sets:
- A list of English words from 12dicts.
- A collection of 200,000 strings. Each key was built by concatenating two random English words with a space in between.
- A collection of 1,000,000 strings. Each key was built by concatenating one of 31 prefixes to a random English word with a space in between. With this data set, I expect the CPTrie to receive some benefit (in speed and space) from the relatively small number of prefixes.
You can see the results of the first data set in the first two lines of the benchmark in the screenshot at the beginning of this article (or, just run the benchmark on your own machine - be sure to use a Release build). The first line uses a normal non-optimized CPTrie, while the second line (marked "opt.") shows the performance difference when the trie is optimized, but since the English word list fits in my processor's 2 MB cache, it doesn't make much difference.
Speed
After first publishing this article, I realized I had not been fair to SortedDictionary because I had used the default sort order. Since CPStringTrie sorts by ordinal, SortedDictionary ought to use the StringComparer.Ordinal comparer instead, to make the comparison fair. This change makes SortedDictionary about twice as fast, so CPStringTrie is generally slower for collections with fewer than 200,000 items, but it is still faster for very large collections.
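For reference, here is a minimal sketch of constructing a SortedDictionary that sorts by ordinal, as the benchmark does. OrdinalOrderSketch and Create are hypothetical names; the constructor overload taking an IComparer<TKey> and StringComparer.Ordinal are standard .NET.

```csharp
using System;
using System.Collections.Generic;

static class OrdinalOrderSketch
{
    // Creates a SortedDictionary whose keys are ordered by their UTF-16 code
    // units rather than by the current culture's collation rules.
    public static SortedDictionary<string, int> Create()
    {
        return new SortedDictionary<string, int>(StringComparer.Ordinal);
    }
}
```

With ordinal ordering, "Zebra" enumerates before "apple" because uppercase letters have lower code points than lowercase ones, which is the same order CPStringTrie produces for ASCII keys.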
The timer used for measurements has a 10-15 ms resolution, so the tests on the first data set are repeated 10 times to get more accurate results. For the larger data sets, I only used 1 or 2 reps so that the benchmark didn't take too long.
I only benchmarked the speed of insertions and retrievals ("scanning"). There are no benchmarks for enumeration or removals. I suspect that CPTrie has relatively slow enumeration because when you ask for the value of each key, it must be converted from byte[] to string, whereas Dictionary and SortedDictionary store strings directly. However, I don't expect removals to be slow.
For the last two data sets, I did a sequence of tests to show the effect of varying collection sizes. In the first test of each data set, I stored the entire data set (200,000 or 1,000,000 items) in a single collection. For the other tests, I divided the data set into "sections" of size 1000, 500, 250, or smaller, and I placed each section in its own collection. This is intended to simulate common applications that manage large amounts of data but do not store all the data in one place--that is, the data is spread out over many collections.
Only the first data set is small enough to (almost) fit in a typical 2 MB L2 cache. In all other cases, I tried to access the data in such a way that the processor cache is stressed. Before scanning the keys, their order is randomized, and one key is added or retrieved from each collection in turn, so that the cache tends to be "cold" when retrieving each key. I thought this access pattern would be relatively friendly to CPTrie because of its compact size, but it doesn't seem to matter much.
So, here are the results:
These graphs show that (on my workstation) CPTrie only beats SortedDictionary for very large collections. Note that SortedDictionary must be constructed with ordinal sort order for this to be true; the default sort order (which is culture-aware, though still case-sensitive) cuts SortedDictionary's speed in half. In any case, SortedDictionary gets incrementally slower as the collection size increases, which makes sense because it uses a red-black tree, which requires O(log N) time for each insertion or retrieval.
CPTrie has a more complex insertion performance. Insertion speed suddenly becomes slower for collections with 64 rather than 32 items; I suspect this is because the trie is forced to split into multiple nodes after it exceeds its 34-item-per-node limit, and the splitting process is somewhat expensive. However, performance never gets much worse; a collection with 200,000 items is constructed almost as fast as 1600 collections with 125 items each.
It almost seems as though CPTrie defies the law I was taught in school that sorting algorithms generally cannot be faster than O(N log N). Yet theoretically, I believe a trie can produce a sorted data set in O(N * K) time, where K is the (average?) key length, and there is no "log N" penalty. Does some academic care to explain this discrepancy?
Memory
As I mentioned, CPTrie can use less memory than standard collections, and it does especially well for large data sets.
I performed memory measurements in a purely analytical way, under the assumption of a 32-bit processor. To determine the memory use of these collections, I assumed 8-byte headers on normal objects, 12- and 16-byte headers on arrays (see here), 4-byte object alignment and standard alignment rules, and then I examined each of the relevant classes to figure out its memory requirements.
For CPTrie, I wrote a method (CountMemoryUsage()) to compute the exact memory usage; for Dictionary and SortedDictionary, I used Reflector to examine the implementation details to find out how much memory is used, and I wrote methods to calculate memory use. If you trust my arithmetic, the memory measurement for SortedDictionary is exact, whereas Dictionary is only approximate. Dictionary, like List<T>, uses arrays that roughly double in size when they are enlarged, which means that between 0% and 50% of entries in a Dictionary are unused. A Dictionary cannot be asked for its current capacity, so for my calculation, I assumed that 25% of the entries were unused.
If you are using a 64-bit process instead, I suspect CPTrie will look relatively more favorable, although all types of collections will require substantially more memory.
So, here are the memory graphs to go along with the speed graphs above. The trie measurements are for a non-optimized trie.
There is an important "cheat" here: the memory consumption of the keys is added to the total memory use of the Dictionary and SortedDictionary, but not for the CPStringTrie. So, while it appears CPTrie uses less memory, this is only true if you discard the keys after adding them to the trie. If you keep all the keys lying around in the form of strings (as the benchmark does, admittedly), you will not save memory by using CPStringTrie!
So, keeping in mind the rule about keys, the 200,000 item CPStringTrie can use 57% less memory than the Dictionary, assuming 4-byte values associated with each key. The million-item trie is even better, using 67% less memory.
I thought that the million-item trie would save quite a bit of memory by having only 31 prefixes, but if I use random word pairs instead (not shown on the graph), the savings are still a healthy 65% compared to the Dictionary. Note that there are only 41,238 unique words, so there are still many common prefixes in the million-item test.
If you optimize your trie with Clone() after you are done building it, you can save about 20% more memory than shown on the graphs, plus you'll get 10% faster queries if your access pattern stresses the processor cache (but you'll take a performance hit if you modify the trie after optimizing it). Notice the memory spike for collections with 64 items - I believe this happens because most of the 32-item tries fit in one node, but the 64-item tries require two or more, increasing overhead disproportionately. All three collection types require more memory if there are very few items per collection (because the sheer number of collections is very high, and each one has some overhead).
Remember that in these tests, the keys are ASCII. This helps save memory because CPStringTrie uses UTF-8, compared to normal UTF-16 strings which use two bytes per character. However, if you store mainly non-ASCII strings such as Chinese, CPStringTrie will need substantially more memory (I believe a typical Chinese character uses 3 bytes in UTF-8 versus 2 for UTF-16).
CPIntTrie
I wrote a single class, CPIntTrie<TValue>, for storing integers of all sizes from 8 bits to 64 bits, signed or unsigned. It contains overloaded methods for adding integers of various sizes, and it implements two dictionary interfaces, IDictionary<int, TValue> and IDictionary<long, TValue>, which should cover most use cases.
In order to store data efficiently in CPSNodes, CPIntTrie encodes each key into bytes in one of three formats:
- A 3-byte form for 24-bit integers in the range -0x10000 to 0xFAFFFF
- A 6-byte form for 41-bit integers in the range -0xFFFFFFFFFF to 0xFFFFFFFFFF
- A 9-byte form for large signed and unsigned integers
Using multiples of 3 bytes makes sense for storing numbers in a CPSNode, where there are 3 key bytes per cell. In large collections, the encoding method is not really important as long as it maintains sort order, but in small collections, this encoding is as compact as possible.
To maintain sort order, all the 6-byte negative integers need to appear "smaller" than the 3-byte ones, and all the 6-byte positive integers need to appear "larger". Likewise, the negative 9-byte integers need to appear the smallest, and positive 9-byte integers need to appear the largest. Therefore, the keys are stored in a kind of big-endian format, with a prefix byte in front that indicates magnitude and length:
- 0x00: 64-bit signed integer below -0xFFFFFFFFFF in 9 bytes
- 0x01: 41-bit signed integer between -0xFFFFFFFFFF and -0x10001 in 6 bytes
- 0x02-0xFD: 24-bit integer between -0x10000 and 0xFAFFFF in 3 bytes; subtract 3 from this byte to compute the first byte of the 24-bit number
- 0xFE: 40-bit unsigned integer between 0xFB0000 and 0xFFFFFFFFFF in 6 bytes
- 0xFF: 64-bit unsigned integer above 0xFFFFFFFFFF in 9 bytes
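The prefix scheme can be sketched for signed 64-bit keys as follows. This is a guess at one encoding consistent with the description, not the actual CPIntTrie code: IntKeySketch, Encode, and Compare are hypothetical names, the sketch assumes big-endian two's-complement bytes after the prefix, and it does not handle unsigned values above long.MaxValue.

```csharp
using System;

static class IntKeySketch
{
    // Encodes a signed 64-bit key so that comparing the results byte by byte
    // gives the same order as comparing the numbers themselves.
    public static byte[] Encode(long n)
    {
        unchecked
        {
            if (n >= -0x10000 && n <= 0xFAFFFF)
            {
                // 3-byte form: prefix = (top byte of the 24-bit value) + 3, i.e. 0x02..0xFD.
                return new byte[] { (byte)((n >> 16) + 3), (byte)(n >> 8), (byte)n };
            }
            if (n >= -0xFFFFFFFFFF && n <= 0xFFFFFFFFFF)
            {
                // 6-byte form: prefix 0x01 (negative) or 0xFE (positive), then the low 40 bits.
                byte prefix = n < 0 ? (byte)0x01 : (byte)0xFE;
                return new byte[] { prefix, (byte)(n >> 32), (byte)(n >> 24),
                                    (byte)(n >> 16), (byte)(n >> 8), (byte)n };
            }
            // 9-byte form: prefix 0x00 (negative) or 0xFF (positive), then all 64 bits.
            var key = new byte[9];
            key[0] = n < 0 ? (byte)0x00 : (byte)0xFF;
            for (int i = 0; i < 8; i++)
                key[1 + i] = (byte)(n >> (56 - 8 * i));
            return key;
        }
    }

    // Ordinal (byte-wise) comparison of two encoded keys.
    public static int Compare(byte[] a, byte[] b)
    {
        int len = Math.Min(a.Length, b.Length);
        for (int i = 0; i < len; i++)
            if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
        return a.Length.CompareTo(b.Length);
    }
}
```

Under these assumptions, 0 encodes as the three bytes 03 00 00 and -1 as 02 FF FF, so small negative numbers sort just below small positive ones.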
The original data type is not encoded: the "byte" 100 and the "int" 100 are stored the same way. Although it would be very unusual to use signed and unsigned 64-bit numbers as keys simultaneously in the same collection, CPTrie does allow it. Therefore, there is no primitive data type that can represent all possible keys.
Performance for integer storage
For the integer benchmarks, I took a different approach than the string benchmarks. In the string benchmarks, I examined how the trie's performance is affected by collection size, by repeating the tests with keys divided into "sections" of various sizes. I simplified the integer benchmarks by only testing a single large collection at once, but I used a variety of key distributions, thinking this could affect performance and space consumption of the trie.
Speed
The executive summary: CPIntTrie is usually slower than SortedDictionary. However, if the keys are very densely clustered, CPIntTrie is faster and much smaller. CPIntTrie is also a good choice for very large collections. However, as before, Dictionary is much faster than either of the sorted collections.
I did tests involving 24-bit, 32-bit, and 64-bit keys. I tried three kinds of key distribution:
- Random: a uniform distribution.
- Exponential: in 32-bit mode, a random number is chosen with a random number of bits between 15 and 32 so that the high bits are zero; this should approximate picking a random point on a logarithmic scale. In 64-bit mode, a random 32-bit number is left-shifted by up to 32 bits.
- Clustered: clusters are groups of integers that are close together. But how close? There are infinitely many variations on the idea of "clustered" data. In my tests, I build random clustered key sets using parameters (C, S, D, start) where C is the maximum number of integers in each cluster (C/2 is the minimum), S is the maximum space between clusters (S/2 is the minimum), D is the maximum distance between adjacent integers in the cluster (1 is the minimum), and start is the first integer in the series (start is set to 0x0123456789ABCDEF for 64-bit clusters, and 0x1000000 for 32-bit clusters). I picked various arbitrary values for C, S, and D in the tests. Clustered data often appears in ID columns of databases: each new row is given an ID one greater than the last, but over time, some rows or series of adjacent rows are deleted, leaving small or large gaps in the integer space. CPIntTrie stores such sets very compactly. Clustered data also comes from various other sources, such as histograms.
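The exponential and clustered distributions described above could be generated along the following lines. This is a minimal sketch under stated assumptions, not the benchmark's actual generator: the class, the method names, and the count parameter (total number of keys wanted) are hypothetical, and "up to 32 bits" is read as a shift of 0 to 32 inclusive.

```csharp
using System;
using System.Collections.Generic;

static class KeyGeneratorSketch
{
    // Returns 32 uniformly random bits.
    static uint Random32(Random rand)
    {
        var buf = new byte[4];
        rand.NextBytes(buf);
        return BitConverter.ToUInt32(buf, 0);
    }

    // Exponential, 32-bit mode: keep a random number of low bits (15 to 32)
    // so that the remaining high bits are zero.
    public static uint Exponential32(Random rand)
    {
        int bits = rand.Next(15, 33);           // upper bound is exclusive
        return Random32(rand) >> (32 - bits);
    }

    // Exponential, 64-bit mode: a random 32-bit number shifted left by 0..32 bits.
    public static ulong Exponential64(Random rand)
    {
        return (ulong)Random32(rand) << rand.Next(0, 33);
    }

    // Clustered keys with parameters (C, S, D, start) as described above.
    public static List<long> Clusters(Random rand, int count, int c, int s, int d, long start)
    {
        var keys = new List<long>(count);
        long next = start;
        while (keys.Count < count)
        {
            int size = rand.Next(Math.Max(1, c / 2), c + 1);     // C/2 .. C integers
            for (int i = 0; i < size && keys.Count < count; i++)
            {
                if (i > 0)
                    next += rand.Next(1, d + 1);                 // 1 .. D apart
                keys.Add(next);
            }
            next += rand.Next(Math.Max(1, s / 2), s + 1);        // S/2 .. S between clusters
        }
        return keys;
    }
}
```

For example, KeyGeneratorSketch.Clusters(new Random(), 100000, 25, 30000, 5, 0x1000000) would produce a strictly increasing key list in the spirit of the "Clusters(25,30000,5)" scenario.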
Of course, the key distribution has little or no effect on the performance of Dictionary or SortedDictionary. CPIntTrie, it turns out, is usually not dramatically affected by the key distribution either, except that CPIntTrie handles densely clustered data faster, and it is faster when storing a set (null values) rather than a dictionary.
Here's some raw data from my machine:
                                       |--Int Dictionary--|  |-SortedDictionary-|  |----CPIntTrie-----|
Scenario                Reps Set size    Fill   Scan Memory    Fill   Scan Memory    Fill   Scan Memory
--------                ---- --------    ----   ---- ------    ----   ---- ------    ----   ---- ------
1-100,000, sorted         10   100000    15ms   12ms   2.5M   128ms   58ms   2.7M    63ms   46ms   0.9M
1-100,000, random         10   100000    15ms    4ms   2.5M   132ms   62ms   2.7M    58ms   44ms   0.9M
1-100,000 w/ null vals    10   100000    15ms    6ms   2.5M   129ms   58ms   2.7M    47ms   43ms   0.0M
24-bit keys with 100K items:
Random 24-bit ints        10   100000    19ms    9ms   2.5M   134ms   66ms   2.7M   157ms   65ms   2.0M
Random set (null vals.)   10   100000    18ms   13ms   2.5M   134ms   62ms   2.7M   117ms   63ms   1.3M
Clusters(20, 100,2)       10   100000    15ms    9ms   2.5M   134ms   66ms   2.7M   101ms   60ms   2.2M
Clusters(same w/ nulls)   10   100000    18ms    9ms   2.5M   135ms   63ms   2.7M    81ms   46ms   0.2M
Clusters(20, 100,9)       10   100000    16ms    9ms   2.5M   131ms   60ms   2.7M   134ms   65ms   2.3M
Clusters(20,1000,2)       10   100000    21ms    9ms   2.5M   131ms   63ms   2.7M   126ms   70ms   1.4M
Clusters(20,1000,9)       10   100000    15ms   12ms   2.5M   132ms   55ms   2.7M   129ms   74ms   1.4M
Clusters(50, 100,2)       10   100000    15ms   10ms   2.5M   129ms   60ms   2.7M    84ms   47ms   1.6M
Clusters(50, 100,9)       10   100000    21ms   12ms   2.5M   132ms   58ms   2.7M   128ms   52ms   3.1M
Clusters(50,1000,2)       10   100000    15ms   12ms   2.5M   131ms   62ms   2.7M   137ms   62ms   1.7M
Clusters(50,1000,9)       10   100000    16ms   10ms   2.5M   128ms   58ms   2.7M   143ms   70ms   2.2M
Tests with 32-bit keys:
Random 32-bit ints        10   100000    23ms   16ms   2.5M   154ms   58ms   2.7M   175ms   96ms   2.0M
Random 32-bit ints         5   200000    49ms   43ms   5.1M   315ms  165ms   5.3M   468ms  237ms   5.1M
Random 32-bit ints         3   500000   135ms  130ms  12.7M  1051ms  530ms  13.4M  1291ms  614ms   8.5M
Random 32-bit ints         2  1000000   335ms  281ms  25.4M  2593ms 1257ms  26.7M  2390ms 1328ms  13.3M
Exponential 32-bit        10   100000    21ms   13ms   2.5M   134ms   62ms   2.7M   185ms   88ms   1.9M
Exponential 32-bit         5   200000    49ms   49ms   5.1M   321ms  162ms   5.3M   390ms  218ms   3.8M
Exponential 32-bit         3   500000   140ms  135ms  12.7M  1056ms  525ms  13.4M  1234ms  567ms   9.1M
Exponential 32-bit         2  1000000   335ms  288ms  25.4M  2671ms 1218ms  26.7M  2601ms 1218ms  17.8M
Clusters(25,25,1)         10   100000    15ms    0ms   2.5M   129ms   62ms   2.7M    81ms   62ms   1.5M
Clusters(25,30000,5)      10   100000    16ms   10ms   2.5M   134ms   60ms   2.7M   155ms   93ms   1.3M
Clusters(50,50000,5)      10   100000    16ms   12ms   2.5M   134ms   60ms   2.7M   180ms   90ms   2.0M
Clusters(75,90000,5)      10   100000    19ms    9ms   2.5M   129ms   60ms   2.7M   169ms   82ms   2.1M
Clusters(75,90000,5)       5   200000    40ms   40ms   5.1M   318ms  165ms   5.3M   368ms  205ms   4.3M
Clusters(75,90000,5)       3   500000   130ms  130ms  12.7M  1031ms  530ms  13.4M  1202ms  577ms  10.6M
Clusters(75,90000,5)       2  1000000   320ms  296ms  25.4M  2624ms 1241ms  26.7M  2562ms 1234ms  21.3M
Clusters(75,90000,5)       1  2000000   687ms  593ms  50.9M  6671ms 2796ms  53.4M  5265ms 2656ms  42.7M
Clusters(99,90000,2)       1  2000000   703ms  578ms  50.9M  6640ms 2796ms  53.4M  5328ms 3062ms  28.1M
Tests with 64-bit keys:
Clusters(25,50000,9)      10   100000    23ms   15ms   3.1M   151ms   71ms   3.1M   177ms   97ms   1.4M
Clusters(50,20000,5)      10   100000    23ms   15ms   3.1M   148ms   68ms   3.1M   179ms   91ms   2.0M
Clusters(75,1000,3)       10   100000    19ms   13ms   3.1M   144ms   71ms   3.1M   156ms   70ms   1.8M
Random 32-bit longs       10   100000    26ms   15ms   3.1M   149ms   71ms   3.1M   179ms   96ms   2.0M
Random 40-bit longs       10   100000    24ms   15ms   3.1M   174ms   78ms   3.1M   190ms   93ms   2.5M
Random 64-bit longs       10   100000    26ms   15ms   3.1M   148ms   68ms   3.1M   201ms   94ms   3.1M
Random set (null vals.)   10   100000    27ms   16ms   3.1M   144ms   71ms   3.1M   187ms   90ms   2.3M
Exponential longs         10   100000    26ms   15ms   3.1M   154ms   74ms   3.1M   210ms  112ms   2.3M
Exponential longs          5   200000    55ms   46ms   6.1M   365ms  187ms   6.1M   534ms  246ms   4.7M
Exponential longs          3   500000   156ms  135ms  15.3M  1296ms  567ms  15.3M  1406ms  697ms  11.4M
Exponential longs          2  1000000   343ms  296ms  30.5M  3218ms 1452ms  30.5M  3343ms 1531ms  22.3M
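For readers who want to reproduce the Dictionary and SortedDictionary columns, a harness might look roughly like the sketch below. It rests on assumptions: "Fill" is taken to mean inserting every key and "Scan" to mean looking every key up again, which this section does not spell out, and the class and method names are hypothetical. CPIntTrie itself is not used here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class FillScanSketch
{
    // Times inserting every key ("Fill") and then looking every key up again
    // ("Scan"). ASSUMPTION: these definitions are a guess at what the table measures.
    public static (long fillMs, long scanMs, int found) Measure(
        IDictionary<int, int> dict, int[] keys)
    {
        var timer = Stopwatch.StartNew();
        foreach (int k in keys)
            dict[k] = k;
        long fillMs = timer.ElapsedMilliseconds;

        timer.Restart();
        int found = 0;
        foreach (int k in keys)
            if (dict.TryGetValue(k, out _))
                found++;
        long scanMs = timer.ElapsedMilliseconds;

        return (fillMs, scanMs, found);
    }
}
```

Calling Measure once with a new Dictionary<int, int>() and once with a new SortedDictionary<int, int>() over the same key array gives a like-for-like comparison; repeating each run several times, as the Reps column suggests, smooths out timer noise.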
I only made a graph of the 32-bit section of these results:
Note that this graph doesn't show cases of all-null values, which are faster, but you can see that for the dense cluster (25,25,1), CPIntTrie is faster than SortedDictionary.
Memory
Compared to CPStringTrie, CPIntTrie's memory savings are not as dramatic, but still distinct. The graph below shows how CPIntTrie handles various distributions and data sizes of 32-bit integers.
The graph doesn't show any cases of null values. If you look at the raw data, you'll see that in cases where all values are set to null (e.g., "1-100,000 w/ null vals"), dramatically less memory may be required, particularly if the data is densely clustered. In such cases, the trie acts more like a bit array in regions of the integer space that contain dense clusters of keys. In the best such case, fewer than 4 bits per key are required, so 100,000 keys take under 50,000 bytes and the total comes to 0.0 MB after rounding to the nearest 0.1 MB. You can choose whether or not to use a value on a key-by-key basis, but in order to get such dramatic memory savings, large regions of the number space must only use null values.
Curiously, there is a spike in memory usage for the case of 200K random integers. I am not certain why, but I suspect the key density is the problem. CPTrie is efficient when the keys are either sparse (using CPSNode) or dense (using CPBitArrayLeaf), but at certain densities it has to use CPBNode, and since the data is random, it requires 8 CPSNodes per CPBNode, which adds a lot of overhead.
On the other hand, memory use dips for 1 million random numbers and for certain kinds of clustered data. In the case of 1 million random integers, I can only assume this key density is a sweet spot, even though it is not dense enough to use CPBitArrayLeaf. The clusters (25,25,1) and (99,90000,2) are probably benefiting from CPBitArrayLeaf, while perhaps the clusters (25,30000,5) make efficient use of CPSNode leaves.
The end
I hope you enjoyed reading about my new data structure. Let me know if you find a practical application!
History
- March 30, 2010: Added CPIntTrie (and CPBitArrayLeaf) with benchmarks and tests.
- February 26, 2010: Initial release with CPStringTrie and CPByteTrie.