(untagged)

Patricia and Huffman, Sitting in a Trie

Michael Jaworski

0.00/5 (No votes)

27 Dec 2004

Article demonstrates a text-based Patricia trie and adds new text-compression features.

Introduction

A while back, I was looking for a data structure that could store strings quickly and retrieve them very quickly. I stumbled upon the Patricia (PAT) trie ("Practical Algorithm To Retrieve Information Coded in Alphanumeric"). Unfortunately, I could not find a good enough explanation of the trie. Every explanation I read, glossed over too many details. Every code piece (there are not many) that I could find had very nice implementations, but did not store text only numbers. So, I wrote my own class that stores text.

If you want to store a good amount of unique text strings (duplicates can be handled, just not by the class itself), retrieve them super fast, and associate data or functions with them, then read on!

This class could be used for (as examples, but not limited to):

command lists (like menus or OSI(7) protocols)
a dictionary or thesaurus
associative arrays
catalogue ports on a TCP server
store/parse XML

Not bad, huh?

Patricia Trie

I would like to give a brief overview of what a trie is and then speak about PAT tries more specifically. It is assumed you understand how a binary tree works.

A trie is a binary tree that uses key space decomposition (as opposed to object space decomposition). The idea is that instead of branching left or right based on what data value happens to be at a particular node in the tree (that is not a mistake, tries are trees), we branch based on predefined key values. So the root node of a tree could send the first thirteen letters of the alphabet to the left child while the others go to the right. This does not guarantee a balanced tree, but it helps to keep the tree balanced and ensures that the order in which data is input will not affect the performance of the data structure (i.e., insert a sorted list into a binary tree and end up with a list).

In a PAT trie, all of the internal nodes (a node with at least one child) hold only enough information to make branching decisions (go left or right), but no actual data. The leaves of the trie (a node with no children) hold the data: a key and a link to one class (that holds whatever we need it to).

My PAT trie uses integers (on a 32-bit machine) for keys. If you want to use a long for the key, then you must change the type of the key (keyType). If you change the type of the key or you compile this class on a machine that encodes integers (int) with more than 32 bits, then you must set the constant MAX_KEY_BITS to the number of bits that your data type is wide (default is 32). We will see later how to convert text to integers.

With PAT tries, at every level of the trie, we evaluate one bit of the key, and based on that bit, move left or right. To find any key in our trie, we only have to check at most the number of bits that our key is wide (32 bits).

a PAT trie

The image above shows a PAT trie that stores a few keys (binary representation shown). The black circles are internal nodes. The yellow circles are leaf nodes.

Text To Number

Now, we turn our attention to making this a useful PAT trie. We need a way to turn a string of text into our key. All of the three methods described below are based on the fact that Unicode or ASCII characters are really just numbers. The character 'a' is also 97.

The first method involves converting all of our characters into 8-bit numbers (a byte) and concatenating them. There are only 26 letters in the alphabet; we can represent them with a byte. This is not a very good solution, because we could only discriminate against four letters (4 letters * 8 bits = 32 bits). Also, most of the branching would be on the same characters (eight bits to store an ASCII character, eight identical internal node checks).

The second method allows us to store more letters in our key. We could add our characters ('a' + 'e' + ... = 4532). In this way, with the largest ASCII character being no more than 255 (2^8-1) or in Unicode 65535 (2^16-1), we could have our key encode many numbers and we would not be wasting space on comparing eight internal nodes for one character. This method has one fatal flaw: encoding "was" or "saw" we will get the same number (3 + 5 = 5 + 3).

The last method uses a Huffman code. Huffman codes are primarily used for compression. They encode letters in a new format using the least amount of bits possible by assigning the least number of bits to the most frequent character. So, if 'e' is the most common letter in a document, then we will represent 'e' with a zero (or just one bit). We would represent the next most frequent letter with a one and the third most frequent with a two. In this way, the Huffman code saves the most space of the three methods, by replacing the most frequent letter with the smallest representation.

Please note that with any of these three methods, there is no guarantee that the generated key will be unique. If all of the letters are identical until the point where our key is full (we have represented as many characters as possible), then we will have generated the same key for two distinct strings ("a baby ate the fox" and "the baby ate the fox" will translate to the same key). To ease this issue, use a key with greater than 32 bits. With a 32 bit key, we can represent at most 32 characters (provided those characters can be represented by one bit ~ 32 e's or t's).

Huffman Is Our Man

The Huffman code used in my PAT trie is an array of twenty-six numbers. The first cell holds the Huffman code for the letter 'a', the next cell for 'b', and so on. The code stored in each cell is the reduced bit representation of the corresponding letter. For instance, the fifth cell is for 'e' and holds the number zero. 'e' is the most frequent letter in the English alphabet and so should map to the smallest number. To build our key:

for each letter right to left (is there more variability at the end of words?)
look up the Huffman code for the letter (we will refer to as "Huffing")
shift the bits on the key to the left for the number of bits that the Huffman code is long (make space to add the code to the key)
add (arithmetic) the new Huffman code to the key

To better explain the last three steps, here is an example:

'l' huffs to 8 or 1000 (in binary)
the key 0000 0000 0000 0000 0101 1101 0110 0011, becomes 0000 0000 0000 0101 1101 0110 0011 0000
we can add 8 to the key to get 0000 0000 0000 0101 1101 0110 0011 1000

How was the decision made that the letter 'e' is the most frequent and 'z' is the twenty-third most frequent? Initially, I wrote a program to mill through text and produce a list of the most frequent letters in the English alphabet. Fortunately, it was brought to my attention (by Don Clugston - on Code Project) that there is an official list of the most frequent letters. I should have thought of that... Irregardless, the code has been updated and kudos/thanks to Don! The change will improve the speed of the code. Note that if you use this class to store command lists or some type of structured document where the same words are being used over and over, it pays to rearrange the frequency to match the most frequently used letters. If you were storing C programming code, then if and while would crop up quite a bit. The most frequent letter should (possibly) be i.

Huffing Again and Again

What if you Huff duplicates? My PAT trie is set up so that you associate an entire class with a key. It will only store a key once, but the class associated with it could remember information about duplicates. For instance, here are three scenarios:

You could keep a variable to count how many times you found a particular word. Instead of inserting, just search for the key and increment the variable in the associated class.
You could build a linked-list in the associated class to store the position of the text string in a text document. Then, you would know that the word was stored at positions 45, 64, 134, and 856. Note: this would be down right horrible if you wanted to piece the document back together.
You could build a linked-list that, instead of storing the position of the string, pointed to the next string in the document (or just stored the next word in text). That would help if you wanted to rebuild the document a little more efficiently later on. You could follow the bread crumbs of each string.

Code Overview

There are three classes that together create my Patricia Trie:

PatriciaTrieC ~ the Patricia trie.
PatriciaNodeC ~ an internal or leaf node in the trie.
ASSOCIATED_CLASS (PayloadC) ~ the associated class, originally PayloadC, but it can be changed by changing the constant ASSOCIATED_CLASS.

The Huffman Code [CODE]

The Huffman code array is a static member of the PatriciaTrieC class. It is initialized once for the class and available for all objects of PatriciaTrieC. If you want to change the values, you will have to go in and tweak the array itself.

//STATIC HUFFMAN CODE - ONLY CREATED ONCE

//*****************************************************************

//build huffman tree

//based on common letter usage in the English language

//source: UNIX manual pages (find, ls, mv, cp, ...) all added together

//

//e t a o i n s r h l d c u m f p g w y b v k x j q z

//

char PatriciaTrieC::huffman_code [] = 
{
/* A-D */
2,          //A

18,         //B

11,         //C

10,         //D


/* E-H */
0,          //E

The code that Huffs text to keys will ignore all characters that are not A through Z (case-insensitive). So, the strings "HAPPY" and "H A P P Y" would Huff to the same key. Also note that the character strings should be terminated by the end of string character ('\0' or zero).

  //convert text to integer

  do
  {
    //if we get to the end of the text string

    //before the key is filled that is okay

    //we will just use what we have!

    if (txt [txt_count] == END_OF_STRING)
      break;

    //if big caps, then make it small

    if (txt [txt_count] >= 'a' && txt [txt_count] <= 'z')
      txt [txt_count] -= LOWER_TO_UPPER_FACTOR;

    //skip all weird characters (only normal letters are handled)

    if (txt [txt_count] < 'A' || txt [txt_count] > 'Z')
    {
      //go to next character

      txt_count += 1;

      //try next character

      continue;
    }
  .
  .
  .

Using The Code [CODE]

You can insert, search, and remove a key (or one data element) out of the trie. There is a method BuildKey that you can use to convert text into keys if you wanted to store or find out the exact key signature for a string of text. You can also call Print to print the entire trie (if DEBUG is set) or Clear to wipe the entire trie.

The class associated with every leaf node (where data is stored) is PayloadC. If you want your own class, just change the constant ASSOCIATED_CLASS in the pat.h file to the name of your class.

The included project has a nice little program that loads words from a text file and inserts, searches and then removes them all. Below is a simple example of those three operations (insert, search, remove):

#include "pat.h"

#include <iostream.h>


int main ()
{

  //CREATE VARIABLES

  //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  //create two associated classes

  PayloadC* p_dyn = new PayloadC ();
  PayloadC p_stc;

  //for getting returns on associated class from remove and search

  PayloadC* catch_payload = NULL;

  //the PAT trie

  PatriciaTrieC pat;
  //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


  //insert HAPPY and search for it

  //////////////////////////////////////////////////

  if (pat.Insert (p_dyn, "HAPPY"))
      cout << "inserted HAPPY" << endl;
  else
      cout << "did not insert HAPPY" << endl;

  catch_payload = pat.Search ("HAPPY");


  //insert SAD and search for it

  //////////////////////////////////////////////////

  if (pat.Insert (&p_stc, "SAD"))
      cout << "inserted SAD" << endl;
  else
      cout << "did not insert SAD" << endl;
  
  catch_payload = pat.Search ("SAD");

  
  //search for WHO, it should fail

  //////////////////////////////////////////////////

  catch_payload = pat.Search ("WHO");
  
  if (!catch_payload)
      cout << "WHO does not exist" << endl;
  
  
  //REMOVE EVERYTHING, WHO WILL NOT BE REMOVED

  //////////////////////////////////////////////////

  catch_payload = pat.Remove ("HAPPY");
  catch_payload = pat.Remove ("SAD");
  catch_payload = pat.Remove ("WHO");
  
  return 0;
}

Points of Interest

The word frequency list I used came from here.

Worst-Case Running Times

`Insert`	O(bc+bc*bc)	where bc is the maximum number of bits for the key (32 bit comparisons to search for the spot to insert, plus for every break you have a full 32 bit comparison for duplication - this could be removed, but then duplicates would automatically create a path 32 bits deep, which is fine, but the extra work is better in the average case)
`Search`	O(bc)	where bc is the maximum number of bits for the key (using an `int` on a 32 bit machine, 32 bit comparisons)
`Delete`	O(bc)	where bc is the maximum number of bits for the key
`BuildKey`	O(n)	where n is the length of the string

The PAT trie with a 32 bit key can store (2^32-1) / 2 = 2,147,483,647 words and find any of them in 32 bit comparisons.

History

version 0.0 [November 9, 2004]
version 0.1 [December 3, 2004]
- The Huffman encoding code is better optimized based on this.
- No GPL. If you want it, use it. If I goofed and did not remove something, remove it and use it. I am sure I got everything though.
version 0.2 [December 18, 2004]
- Academic Free License, Version 1.2 applied to source.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here