Introduction
One day, I was talking with a colleague about compression, and I was surprised when he told me there is an algorithm that uses a dictionary during compression, then sets it aside and saves the compressed data without the dictionary. I asked him, "How can it reconstruct the original data?" He replied: "Simply by building the dictionary again during decompression!"
That algorithm is LZW compression. I decided that day to read about it and try to implement this elegant idea, and I spent a long time making it as fast as I could, but I think I fell short, as it compresses about 1 MB/sec in normal cases. The code may seem hard to understand at first, but it becomes easy once you take a look at the algorithm in the Background section or in any of the References at the bottom of the page.
Background
In 1984, Terry Welch was working on a compression algorithm for high-performance disk controllers. He developed a rather simple algorithm, based on the LZ78 algorithm, that is now called LZW. LZW is a lossless, dictionary-based compression algorithm. Dictionary-based algorithms scan a file for sequences of data that occur more than once. These sequences are then stored in a dictionary, and within the compressed file, references are put wherever repetitive data occurred.
LZW Idea
LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters. Since every code in the dictionary corresponds to a string that appears in the compressed buffer the first time it occurs, the dictionary only needs to keep references to the uncompressed strings in the compressed buffer; it does not need to store copies of them. Consequently, it is not necessary to preserve the dictionary to decode the LZW data stream, which saves quite a bit of space when storing the LZW-encoded data.
The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight-bit characters) are, by default, assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12-bit to 31-bit codes. This means codes 0-255 refer to individual bytes, while codes 256 to 2^n-1 refer to substrings, where n is the number of bits per code.
LZW Algorithm
Compression
The LZW Compression Algorithm can be summarized as follows:
w = NIL;
while ( read a character k )
{
    if wk exists in the dictionary
        w = wk;
    else
    {
        add wk to the dictionary;
        output the code for w;
        w = k;
    }
}
output the code for w;
The original algorithm used a dictionary of 4096 entries: the first 256 for the ASCII codes, and the rest for substrings. As you can see, the algorithm scans one character at a time. If the current string plus the new character is in the dictionary, it appends the character to the current string and goes on to the next character. If the work string is not in the dictionary, it adds the work string to the dictionary, outputs the code for the work string without the new character, and then sets the work string to the new character.
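The compression pseudocode above can be sketched as a small, self-contained C++ function that emits the codes as plain integers (an illustrative sketch only; the article's actual implementation stores strings by reference in a binary tree and packs the codes into nBitsPerSample bits):

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal LZW compression sketch: codes 0-255 are single bytes,
// codes from 256 up are assigned to new strings as they are seen.
std::vector<int> lzwCompress(const std::string& src)
{
    std::map<std::string, int> dict;   // multi-character strings only
    int nextCode = 256;                // 0-255 are reserved for single bytes
    std::vector<int> out;
    std::string w;
    for (char k : src)
    {
        std::string wk = w + k;
        if (wk.size() == 1 || dict.count(wk))
            w = wk;                    // wk is known: keep extending the match
        else
        {
            // output the code for w, then remember wk under a new code
            out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
            dict[wk] = nextCode++;
            w.assign(1, k);
        }
    }
    if (!w.empty())                    // flush the final match at EOF
        out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
    return out;
}
```

Running it on the example string "^WED^WE^WEE^WEB^WET" reproduces the output column of the table below.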
Compression Example
Input string is "^WED^WE^WEE^WEB^WET".
w k output index symbol
-----------------------------------------
NIL ^
^ W ^ 256 ^W
W E W 257 WE
E D E 258 ED
D ^ D 259 D^
^ W
^W E 256 260 ^WE
E ^ E 261 E^
^ W
^W E
^WE E 260 262 ^WEE
E ^
E^ W 261 263 E^W
W E
WE B 257 264 WEB
B ^ B 265 B^
^ W
^W E
^WE T 260 266 ^WET
T EOF T
[1] Stepping through the start of the algorithm for this string, you can see that in the first pass through the loop, a check is performed to see if the string "^W" is in the table. Since it isn't, the code for '^' is output, and the string "^W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code "WE" is added to the table and the code for letter 'W' is output. This continues until in the second word, the characters '^' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.
Decompression
The LZW Decompression Algorithm is as follows:
read a character k;
output k;
w = k;
while ( read a character k )   // k could be a character or a code
{
    entry = dictionary entry for k;
    output entry;
    add w + entry[0] to dictionary;
    w = entry;
}
As we said before, the decompressor builds its own dictionary that exactly matches the compressor's, so only the codes need to be saved during compression.
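The decompression pseudocode can likewise be sketched in C++ (illustrative only, not the article's implementation). The dictionary is rebuilt from the codes alone; the else branch below covers the one code that can arrive before it has been defined, a case the article discusses in the Points of Interest section:

```cpp
#include <string>
#include <vector>

// Minimal LZW decompression sketch: rebuilds the dictionary while decoding.
std::string lzwDecompress(const std::vector<int>& codes)
{
    std::vector<std::string> dict;
    for (int i = 0; i < 256; ++i)      // codes 0-255 decode to single bytes
        dict.push_back(std::string(1, (char)i));
    std::string w = dict[codes[0]];
    std::string out = w;
    for (size_t i = 1; i < codes.size(); ++i)
    {
        std::string entry;
        if (codes[i] < (int)dict.size())
            entry = dict[codes[i]];
        else
            entry = w + w[0];          // code not yet defined: w + first char of w
        out += entry;
        dict.push_back(w + entry[0]);  // same entry the compressor just added
        w = entry;
    }
    return out;
}
```

Feeding it the codes produced for "^WED^WE^WEE^WEB^WET" returns the original string, and the table it builds matches the compressor's table entry for entry.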
Decompression Example
Input string is "^WED<256>E<260><261><257>B<260>T".
w k output index symbol
-------------------------------------------
^ ^
^ W W 256 ^W
W E E 257 WE
E D D 258 ED
D <256> ^W 259 D^
<256> E E 260 ^WE
E <260> ^WE 261 E^
<260> <261> E^ 262 ^WEE
<261> <257> WE 263 E^W
<257> B B 264 WEB
B <260> ^WE 265 B^
<260> T T 266 ^WET
[1] Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.
Code Usage
Code usage is very simple: just the input buffer and its length, and the output buffer and its length. The compress function takes an additional parameter, nBitsPerSample, which determines the number of bits used for each dictionary code, so the dictionary can hold up to 2^nBitsPerSample entries.
bool CompressLZW(BYTE* pSrc, int nSrcLen, BYTE* &pDes,
int &nDesLen, int nBitsPerSample = 12);
bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen);
Code Description
There are only two functions, CompressLZW and DecompressLZW; I will start with the first.
CompressLZW Function
- In this function, I scan the input buffer pSrc and add new entries to the dictionary component CBinaryTreeNode. CBinaryTreeNode is a binary tree class that keeps a unique instance of each input key, and it does so very quickly using binary search. For more information about it, you can check my article on my Articles page.
- I precede the output buffer with a header of 5 bytes: 4 bytes for the original length and 1 byte for the bits per sample.
1:bool CompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes,
int &nDesLen, int nBitsPerSample = 12)
2:{
3: ...
4: CBinaryTree<CBuffer, CBuffer*, int, int> dictionary;
5:
6: dictionary.Serial = 256;
7:
8: while(nSrcLen-- > 0)
9: {
10: ...
11: pNode = dictionary.Insert(&node, -1, pNode);
12: if(pNode->Count > 1 && nSrcLen)
13:
14: nSample = pNode->ID, node.m_nLength++;
15: else
16: {
17: *(DWORD*)(pDes+sizeof(DWORD)+1+
(nDesLen>>3)) |= nSample << (nDesLen&7);
18: nDesLen += nBitsPerSample;
19:
20: node.m_pBuffer += node.m_nLength-1;
21: node.m_nLength = 2;
22:
23: nSample = *node.m_pBuffer;
24: ...
25: }
26: }
27: ...
28:}
I included only the main points of the algorithm. Line 17 is the only line that needs a description:
17: *(DWORD*)(pDes+sizeof(DWORD)+1+(nDesLen>>3)) |= nSample << (nDesLen&7);
It copies the sample code into the compressed buffer, starting at the right bit. I know some people always invert the function of the shift operators in their minds, so I will dig into this line a little:
- sizeof(DWORD)+1: skips the buffer header.
- (nDesLen>>3): >>3 divides by 8 to reach the right byte to start with.
- |=: avoids writing over the previously stored code.
- (nDesLen&7): &7 takes the remainder of dividing by 8, giving the start bit.
- nSample << (nDesLen&7): shifts the code left to match the starting bit in the destination buffer.
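The same bit-level write can be sketched portably, one bit at a time (the article's single DWORD |= write is faster but assumes little-endian, unaligned 32-bit stores; packCode is a hypothetical helper name, not part of the article's code):

```cpp
#include <vector>

// Write nBits bits of `code` into `buf`, starting at bit offset `bitPos`.
// The buffer must be zero-initialized, matching the |= in line 17 above.
void packCode(std::vector<unsigned char>& buf, int bitPos, unsigned code, int nBits)
{
    for (int i = 0; i < nBits; ++i)
        if (code & (1u << i))                        // copy each set bit of the code
            buf[(bitPos + i) >> 3] |= (unsigned char)(1u << ((bitPos + i) & 7));
}
```

For example, packing the 12-bit codes 94 and 256 back to back fills the first three bytes with 0x5E, 0x00, 0x10.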
DecompressLZW Function
- This function is faster than the compression, as it doesn't need any comparisons like the insertions into the CBinaryTree during compression.
- In this function, I scan the input buffer pSrc, read a code from it, then get the node's string pointer from the dictionary.
- I use the simple STL vector template to hold the dictionary.
- The compressed input buffer begins with the 5-byte header written during compression: 4 bytes for the original length, 1 byte for the bits per sample.
1:bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen)
2:{
3:
4: int nBitsPerSample = *(pSrc+sizeof(DWORD));
5: ...
6:
7: vector<CBuffer *> dictionary;
8:
9: while(nDesIndex < nDesLen)
10: {
11:
12: nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+
(nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1);
13: nSrcIndex += nBitsPerSample;
14:
15: if(nSample >= 256)
16: if(nSample-256 < (int)dictionary.size())
17: {
18: pNodeSample = dictionary.at(nSample-256);
19: nSampleLen = pNodeSample->m_nLength+1;
20:
21: memcpy(pDes+nDesIndex, pNodeSample->m_pBuffer,
pNodeSample->m_nLength);
22: nDesIndex += pNodeSample->m_nLength;
23: }
24: else
25: ...
26: else
27: nDesIndexSave = nDesIndex, *(pDes+nDesIndex++)
= (BYTE)nSample, nSampleLen = 2;
28: ...
29:
30: dictionary.push_back(new CBuffer(node));
31:
32: node.m_pBuffer += node.m_nLength-1;
33: node.m_nLength = nSampleLen;
34: }
35: ...
36:}
As in the compression, I included only the main points of the algorithm. Line 12 is the only line that needs a description at this point:
12: nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+
(nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1);
It extracts the sample code from the compressed buffer, starting at the right bit:
- sizeof(DWORD)+1: skips the buffer header.
- (nSrcIndex>>3): >>3 divides by 8 to reach the right byte to start with.
- (nSrcIndex&7): &7 takes the remainder of dividing by 8, giving the start bit.
- >>(nSrcIndex&7): shifts right from the start bit so the code lands at bit 0.
- (nMaxSamples-1): nMaxSamples must be a power of 2, as it is calculated as 2^nBitsPerSample, or 1<<nBitsPerSample; subtracting 1 turns it into a mask.
- & (nMaxSamples-1): masks off the extra bits that exceed the code size (bits per sample).
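The matching read can be sketched portably the same way (unpackCode is a hypothetical helper name; the (nMaxSamples-1) mask is implicit here because only nBits bits are ever read):

```cpp
#include <vector>

// Read an nBits-wide code from `buf`, starting at bit offset `bitPos`.
unsigned unpackCode(const std::vector<unsigned char>& buf, int bitPos, int nBits)
{
    unsigned code = 0;
    for (int i = 0; i < nBits; ++i)                  // gather each bit into place
        if (buf[(bitPos + i) >> 3] & (1u << ((bitPos + i) & 7)))
            code |= 1u << i;
    return code;
}
```

Reading back the bytes 0x5E, 0x00, 0x10 as two 12-bit codes recovers 94 and 256.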
Points of Interest
- In the compression function, I use a binary tree to hold the dictionary. I considered using a hash table instead, which might increase insertion speed. In any case, it takes less than 1 second to compress 1 MB, and less than 200 ms to decompress it.
- This code's advantage is the variable number of bits for the dictionary index; the compression function takes a parameter nBitsPerSample. This means that if the number of entries in the dictionary reaches the maximum index value, we need to free the dictionary and start a new, empty one. This can be seen in the following section of the CompressLZW function:
int nMaxSamples = 1 << nBitsPerSample;
...
while(nSrcLen-- > 0)
{
if(dictionary.Serial == nMaxSamples)
{
dictionary.RemoveAll();
dictionary.Serial = 256;
}
...
}
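The reset policy can be sketched end to end as follows (illustrative only; lzwCompressBounded is a hypothetical name, and it emits the codes as plain ints instead of packed bits):

```cpp
#include <map>
#include <string>
#include <vector>

// LZW compression with a bounded table: when all 2^nBitsPerSample codes
// are in use, the table is freed and rebuilt from scratch, mirroring the
// dictionary.RemoveAll() snippet above.
std::vector<int> lzwCompressBounded(const std::string& src, int nBitsPerSample)
{
    const int nMaxSamples = 1 << nBitsPerSample;
    std::map<std::string, int> dict;   // multi-character strings only
    int serial = 256;                  // codes 0-255 are single bytes
    std::vector<int> out;
    std::string w;
    for (char c : src)
    {
        if (serial == nMaxSamples)     // table full: free it and restart
        {
            dict.clear();
            serial = 256;
        }
        std::string wk = w + c;
        if (wk.size() == 1 || dict.count(wk))
            w = wk;                    // keep extending the current match
        else
        {
            out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
            dict[wk] = serial++;       // remember the new string
            w.assign(1, c);
        }
    }
    if (!w.empty())
        out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
    return out;
}
```

Every emitted code then fits in nBitsPerSample bits, which is what lets the bit-packing step assume a fixed code width.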
- Likewise, the decompression function reads nBitsPerSample from the compressed buffer (the fifth byte). In the same way, if the number of entries in the dictionary reaches the maximum index value, we free the dictionary and start a new, empty one. This can be seen in the following section of the DecompressLZW function:
int nBitsPerSample = *(pSrc+sizeof(DWORD));
...
while(nDesIndex < nDesLen)
{
...
if(dictionary.size() == nMaxSamples-256)
ClearDictionary(dictionary);
...
}
- As mentioned in the first reference [1], there is a single exception case in the LZW compression algorithm that causes some trouble for the decompression function. Suppose a (string, char) pair is already defined in the table, and the input stream then contains the sequence (string, char, string, char, string); in other words, the encoder reads the very string it added to the dictionary in the previous step. During decoding, this string is not yet present in the dictionary. A string can occur twice in a row in the character stream only if its first and last characters are equal, because the next string always starts with the last character of the previous one. The following example shows the case during compression:
Input string is "...JOEYNA...JOEYNJOEYNJOEY...".
w k output index dictionary
--------------------------------------------
JOEY N 288(JOEY) 300 JOEYN
N A N 301 NA
...
JOEYN J 300(JOEYN) 400 JOEYNJ
JOEYNJ O 400(JOEYNJ) 401 JOEYNJO
O E O 402 OE
...
In the decompression algorithm, it is like that:
1:Input string is "...<288>NA...<300><400>O..."
2: w k output index dictionary
3: --------------------------------------------
4: ... <288> JOEY 299 ...J
5: <288> N N 300 JOEYN
6: ...
7: ... <300> JOEYN 399 ...J
8: <300> <400> JOEYNJ 400 JOEYNJ
9: ...
The problem is at line 8: the new symbol to be added to the dictionary is JOEYN plus the incoming character, but the incoming code is 400, which has not yet been added to the dictionary. So we need a special case: we take the previous symbol, JOEYN, and append its first character to it, so (JOEYN?) becomes (JOEYNJ). The following code handles this case:
bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen)
{
...
while(nDesIndex < nDesLen)
{
nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+
(nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1);
nSrcIndex += nBitsPerSample;
if(nSample >= 256)
if(nSample-256 < (int)dictionary.size())
{
nDesIndexSave = nDesIndex;
...
}
else
{
nSampleLen = nDesIndex-nDesIndexSave+2;
memcpy(pDes+nDesIndex, pDes+nDesIndexSave,
nDesIndex-nDesIndexSave);
nDesIndex += nDesIndex-nDesIndexSave;
*(pDes+nDesIndex++) = *(pDes+nDesIndexSave);
nDesIndexSave += nSampleLen-2;
}
else
...
...
}
...
}
In this code, I save the position of the previous symbol in the variable nDesIndexSave, to be used in the case of an out-of-range symbol.
Source Code Files
LZW.cpp, LZW.h and BinaryTree.h.
Updates
- 20-01-2005: Solved a very special bug in the decompression; if anyone took the code before this date, please update it.
- 01-01-2008: Solved the bug of end-of-file mismatch when compressing big files. Thanks go to "Wael AlGhool."
- 15-01-2008: Updated the pseudocode of the compression.
References
- [1] Mark Nelson, "LZW Data Compression," Dr. Dobb's Journal, October 1989.
Thanks to...
I owe a lot to my colleagues for helping me in implementing and testing this code. (JAK)