Introduction
One day, I was talking with a colleague about compression, and I was surprised when he told me there is an algorithm that uses a dictionary during compression, then sets it aside and saves the compressed data without the dictionary. I asked him, "How can it reconstruct the original data?" He replied: "Simply by building the dictionary again during decompression!"
That algorithm is LZW compression. I decided that day to read about it and try to implement this elegant idea, and I spent a long time making it as fast as I could, but I think I fell short, as it compresses about 1 MB/sec in normal cases. The code may seem hard to understand at first, but it becomes easy once you take a look at the algorithm in the Background section or in any of the References at the bottom of the page.
Background
In 1984, Terry Welch was working on a compression algorithm for high-performance disk controllers. He developed a rather simple algorithm, based on the LZ78 algorithm, that is now called LZW. LZW is a lossless, dictionary-based compression algorithm. Dictionary-based algorithms scan a file for sequences of data that occur more than once. These sequences are then stored in a dictionary, and within the compressed file, references are put wherever repetitive data occurred.
LZW Idea
LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters. Since every code in the dictionary corresponds to a string that appears in the compressed buffer the first time it occurs, the dictionary only needs to keep references to the uncompressed strings in the compressed buffer; it does not need to store copies of them. Consequently, it is not necessary to preserve the dictionary to decode the LZW data stream, which saves quite a bit of space when storing the LZW-encoded data.
The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight-bit characters) are, by default, assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12-bit to 31-bit codes. This means codes 0-255 refer to individual bytes, while codes 256 to 2^n-1 refer to substrings, where n is the number of bits per code.
LZW Algorithm
Compression
The LZW Compression Algorithm can be summarized as follows:
w = NIL;
while ( read a character k )
{
    if wk exists in the dictionary
        w = wk;
    else
    {
        add wk to the dictionary;
        output the code for w;
        w = k;
    }
}
output the code for w;
The original algorithm used a dictionary of 4096 entries: the first 256 for the ASCII codes, and the rest for substrings. As you can see, the algorithm scans one character at a time. If the current string plus the new character is in the dictionary, it appends the character to the current string and goes on to the next character. If the work string is not in the dictionary, it adds the work string to the dictionary, outputs the code for the work string without the new character, and then sets the work string to the new character.
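The compression pseudocode above can be sketched as a small, self-contained C++ function that emits the codes as plain integers (an illustrative sketch only; the article's actual implementation stores strings by reference in a binary tree and packs the codes into nBitsPerSample bits):

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal LZW compression sketch: codes 0-255 are single bytes,
// codes from 256 up are assigned to new strings as they are seen.
std::vector<int> lzwCompress(const std::string& src)
{
    std::map<std::string, int> dict;   // multi-character strings only
    int nextCode = 256;                // 0-255 are reserved for single bytes
    std::vector<int> out;
    std::string w;
    for (char k : src)
    {
        std::string wk = w + k;
        if (wk.size() == 1 || dict.count(wk))
            w = wk;                    // wk is known: keep extending the match
        else
        {
            // output the code for w, then remember wk under a new code
            out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
            dict[wk] = nextCode++;
            w.assign(1, k);
        }
    }
    if (!w.empty())                    // flush the final match at EOF
        out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
    return out;
}
```

Running it on the example string "^WED^WE^WEE^WEB^WET" reproduces the output column of the table below.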
Compression Example
Input string is "^WED^WE^WEE^WEB^WET".
w k output index symbol
-----------------------------------------
NIL ^
^ W ^ 256 ^W
W E W 257 WE
E D E 258 ED
D ^ D 259 D^
^ W
^W E 256 260 ^WE
E ^ E 261 E^
^ W
^W E
^WE E 260 262 ^WEE
E ^
E^ W 261 263 E^W
W E
WE B 257 264 WEB
B ^ B 265 B^
^ W
^W E
^WE T 260 266 ^WET
T EOF T
[1] Stepping through the start of the algorithm for this string, you can see that in the first pass through the loop, a check is performed to see if the string "^W" is in the table. Since it isn't, the code for '^' is output, and the string "^W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code "WE" is added to the table and the code for letter 'W' is output. This continues until in the second word, the characters '^' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.
Decompression
The LZW Decompression Algorithm is as follows:
read a character k;
output k;
w = k;
while ( read a character k )   // k could be a character or a code
{
    entry = dictionary entry for k;
    output entry;
    add w + entry[0] to dictionary;
    w = entry;
}
As we said before, the decompressor builds its own dictionary that exactly matches the compressor's, so only the codes need to be saved during compression.
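The decompression pseudocode can likewise be sketched in C++ (illustrative only, not the article's implementation). The dictionary is rebuilt from the codes alone; the else branch below covers the one code that can arrive before it has been defined, a case the article discusses in the Points of Interest section:

```cpp
#include <string>
#include <vector>

// Minimal LZW decompression sketch: rebuilds the dictionary while decoding.
std::string lzwDecompress(const std::vector<int>& codes)
{
    std::vector<std::string> dict;
    for (int i = 0; i < 256; ++i)      // codes 0-255 decode to single bytes
        dict.push_back(std::string(1, (char)i));
    std::string w = dict[codes[0]];
    std::string out = w;
    for (size_t i = 1; i < codes.size(); ++i)
    {
        std::string entry;
        if (codes[i] < (int)dict.size())
            entry = dict[codes[i]];
        else
            entry = w + w[0];          // code not yet defined: w + first char of w
        out += entry;
        dict.push_back(w + entry[0]);  // same entry the compressor just added
        w = entry;
    }
    return out;
}
```

Feeding it the codes produced for "^WED^WE^WEE^WEB^WET" returns the original string, and the table it builds matches the compressor's table entry for entry.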
Decompression Example
Input string is "^WED<256>E<260><261><257>B<260>T".
w k output index symbol
-------------------------------------------
^ ^
^ W W 256 ^W
W E E 257 WE
E D D 258 ED
D <256> ^W 259 D^
<256> E E 260 ^WE
E <260> ^WE 261 E^
<260> <261> E^ 262 ^WEE
<261> <257> WE 263 E^W
<257> B B 264 WEB
B <260> ^WE 265 B^
<260> T T 266 ^WET
[1] Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.
Code Usage
Code usage is very simple: just the input buffer and its length, and the output buffer and its length. The compress function takes an additional parameter, nBitsPerSample, which determines the number of bits used for each dictionary code, so the dictionary can hold up to 2^nBitsPerSample entries.
bool CompressLZW(BYTE* pSrc, int nSrcLen, BYTE* &pDes,
int &nDesLen, int nBitsPerSample = 12);
bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen);
Code Description
There are only two functions, CompressLZW and DecompressLZW; I will start with the first.
CompressLZW Function
- In this function, I scan the input buffer pSrc and add new entries to the dictionary component CBinaryTreeNode. CBinaryTreeNode is a binary tree class that keeps a unique instance of each input key, and it does so very quickly using binary search. For more information about it, you can check my article on my Articles page.
- I precede the output buffer with a header of 5 bytes: 4 bytes for the original length and 1 byte for the bits per sample.
1:bool CompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes,
int &nDesLen, int nBitsPerSample = 12)
2:{
3: ...
4: CBinaryTree<CBuffer, CBuffer*, int, int> dictionary;
5:
6: dictionary.Serial = 256;
7:
8: while(nSrcLen-- > 0)
9: {
10: ...
11: pNode = dictionary.Insert(&node, -1, pNode);
12: if(pNode->Count > 1 && nSrcLen)
13:
14: nSample = pNode->ID, node.m_nLength++;
15: else
16: {
17: *(DWORD*)(pDes+sizeof(DWORD)+1+
(nDesLen>>3)) |= nSample << (nDesLen&7);
18: nDesLen += nBitsPerSample;
19:
20: node.m_pBuffer += node.m_nLength-1;
21: node.m_nLength = 2;
22:
23: nSample = *node.m_pBuffer;
24: ...
25: }
26: }
27: ...
28:}
I included only the main points of the algorithm. Line 17 is the only line that needs a description:
17: *(DWORD*)(pDes+sizeof(DWORD)+1+(nDesLen>>3)) |= nSample << (nDesLen&7);
It copies the sample code into the compressed buffer, starting at the right bit. I know some people always invert the function of the shift operators in their minds, so I will dig into this line a little:
- sizeof(DWORD)+1: skips the buffer header.
- (nDesLen>>3): >>3 divides by 8 to reach the right byte to start with.
- |=: avoids writing over the previously stored code.
- (nDesLen&7): &7 takes the remainder of dividing by 8, giving the start bit.
- nSample << (nDesLen&7): shifts the code left to match the starting bit in the destination buffer.
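The same bit-level write can be sketched portably, one bit at a time (the article's single DWORD |= write is faster but assumes little-endian, unaligned 32-bit stores; packCode is a hypothetical helper name, not part of the article's code):

```cpp
#include <vector>

// Write nBits bits of `code` into `buf`, starting at bit offset `bitPos`.
// The buffer must be zero-initialized, matching the |= in line 17 above.
void packCode(std::vector<unsigned char>& buf, int bitPos, unsigned code, int nBits)
{
    for (int i = 0; i < nBits; ++i)
        if (code & (1u << i))                        // copy each set bit of the code
            buf[(bitPos + i) >> 3] |= (unsigned char)(1u << ((bitPos + i) & 7));
}
```

For example, packing the 12-bit codes 94 and 256 back to back fills the first three bytes with 0x5E, 0x00, 0x10.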
DecompressLZW Function
- This function is faster than the compression, as it doesn't need any comparisons like the insertions into the CBinaryTree during compression.
- In this function, I scan the input buffer pSrc, read a code from it, then get the node's string pointer from the dictionary.
- I use the simple STL vector template to hold the dictionary.
- The compressed input buffer begins with the 5-byte header written during compression: 4 bytes for the original length, 1 byte for the bits per sample.
1:bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen)
2:{
3:
4: int nBitsPerSample = *(pSrc+sizeof(DWORD));
5: ...
6:
7: vector<CBuffer *> dictionary;
8:
9: while(nDesIndex < nDesLen)
10: {
11:
12: nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+
(nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1);
13: nSrcIndex += nBitsPerSample;
14:
15: if(nSample >= 256)
16: if(nSample-256 < (int)dictionary.size())
17: {
18: pNodeSample = dictionary.at(nSample-256);
19: nSampleLen = pNodeSample->m_nLength+1;
20:
21: memcpy(pDes+nDesIndex, pNodeSample->m_pBuffer,
pNodeSample->m_nLength);
22: nDesIndex += pNodeSample->m_nLength;
23: }
24: else
25: ...
26: else
27: nDesIndexSave = nDesIndex, *(pDes+nDesIndex++)
= (BYTE)nSample, nSampleLen = 2;
28: ...
29:
30: dictionary.push_back(new CBuffer(node));
31:
32: node.m_pBuffer += node.m_nLength-1;
33: node.m_nLength = nSampleLen;
34: }
35: ...
36:}
As in the compression, I included only the main points of the algorithm. Line 12 is the only line that needs a description at this point:
12: nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+
(nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1);
It extracts the sample code from the compressed buffer, starting at the right bit:
- sizeof(DWORD)+1: skips the buffer header.
- (nSrcIndex>>3): >>3 divides by 8 to reach the right byte to start with.
- (nSrcIndex&7): &7 takes the remainder of dividing by 8, giving the start bit.
- >>(nSrcIndex&7): shifts right from the start bit so the code lands at bit 0.
- (nMaxSamples-1): nMaxSamples must be a power of 2, as it is calculated as 2^nBitsPerSample, or 1<<nBitsPerSample; subtracting 1 turns it into a mask.
- & (nMaxSamples-1): masks off the extra bits that exceed the code size (bits per sample).
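The matching read can be sketched portably the same way (unpackCode is a hypothetical helper name; the (nMaxSamples-1) mask is implicit here because only nBits bits are ever read):

```cpp
#include <vector>

// Read an nBits-wide code from `buf`, starting at bit offset `bitPos`.
unsigned unpackCode(const std::vector<unsigned char>& buf, int bitPos, int nBits)
{
    unsigned code = 0;
    for (int i = 0; i < nBits; ++i)                  // gather each bit into place
        if (buf[(bitPos + i) >> 3] & (1u << ((bitPos + i) & 7)))
            code |= 1u << i;
    return code;
}
```

Reading back the bytes 0x5E, 0x00, 0x10 as two 12-bit codes recovers 94 and 256.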
Points of Interest
- In the compression function, I use a binary tree to hold the dictionary. I considered using a hash table instead, which might increase insertion speed. In any case, it takes less than 1 second to compress 1 MB, and less than 200 ms to decompress it.
- This code's advantage is the variable number of bits for the dictionary index; the compression function takes a parameter nBitsPerSample. This means that if the number of entries in the dictionary reaches the maximum index value, we need to free the dictionary and start a new, empty one. This can be seen in the following section of the CompressLZW function:
int nMaxSamples = 1 << nBitsPerSample;
...
while(nSrcLen-- > 0)
{
if(dictionary.Serial == nMaxSamples)
{
dictionary.RemoveAll();
dictionary.Serial = 256;
}
...
}
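The reset policy can be sketched end to end as follows (illustrative only; lzwCompressBounded is a hypothetical name, and it emits the codes as plain ints instead of packed bits):

```cpp
#include <map>
#include <string>
#include <vector>

// LZW compression with a bounded table: when all 2^nBitsPerSample codes
// are in use, the table is freed and rebuilt from scratch, mirroring the
// dictionary.RemoveAll() snippet above.
std::vector<int> lzwCompressBounded(const std::string& src, int nBitsPerSample)
{
    const int nMaxSamples = 1 << nBitsPerSample;
    std::map<std::string, int> dict;   // multi-character strings only
    int serial = 256;                  // codes 0-255 are single bytes
    std::vector<int> out;
    std::string w;
    for (char c : src)
    {
        if (serial == nMaxSamples)     // table full: free it and restart
        {
            dict.clear();
            serial = 256;
        }
        std::string wk = w + c;
        if (wk.size() == 1 || dict.count(wk))
            w = wk;                    // keep extending the current match
        else
        {
            out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
            dict[wk] = serial++;       // remember the new string
            w.assign(1, c);
        }
    }
    if (!w.empty())
        out.push_back(w.size() == 1 ? (unsigned char)w[0] : dict[w]);
    return out;
}
```

Every emitted code then fits in nBitsPerSample bits, which is what lets the bit-packing step assume a fixed code width.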
- Likewise, the decompression function reads nBitsPerSample from the compressed buffer (the fifth byte). In the same way, if the number of entries in the dictionary reaches the maximum index value, we free the dictionary and start a new, empty one. This can be seen in the following section of the DecompressLZW function:
int nBitsPerSample = *(pSrc+sizeof(DWORD));
...
while(nDesIndex < nDesLen)
{
...
if(dictionary.size() == nMaxSamples-256)
ClearDictionary(dictionary);
...
}
- As mentioned in the first reference [1], there is a single exception case in the LZW compression algorithm that causes some trouble for the decompression function. Suppose a (string, char) pair is already defined in the table, and the input stream then contains the sequence (string, char, string, char, string); in other words, the encoder reads the very string it added to the dictionary in the previous step. During decoding, this string is not yet present in the dictionary. A string can occur twice in a row in the character stream only if its first and last characters are equal, because the next string always starts with the last character of the previous one. The following example shows the case during compression:
Input string is "...JOEYNA...JOEYNJOEYNJOEY...".
w k output index dictionary
--------------------------------------------
JOEY N 288(JOEY) 300 JOEYN
N A N 301 NA
...
JOEYN J 300(JOEYN) 400 JOEYNJ
JOEYNJ O 400(JOEYNJ) 401 JOEYNJO
O E O 402 OE
...
In the decompression algorithm, it is like that:
1:Input string is "...<288>NA...<300><400>O..."
2: w k output index dictionary
3: --------------------------------------------
4: ... <288> JOEY 299 ...J
5: <288> N N 300 JOEYN
6: ...
7: ... <300> JOEYN 399 ...J
8: <300> <400> JOEYNJ 400 JOEYNJ
9: ...
The problem is at line 8: the new symbol to be added to the dictionary is JOEYN plus the incoming character, but the incoming code is 400, which has not yet been added to the dictionary. So we need a special case: we take the previous symbol, JOEYN, and append its first character to it, so (JOEYN?) becomes (JOEYNJ). The following code handles this case:
bool DecompressLZW(BYTE *pSrc, int nSrcLen, BYTE *&pDes, int &nDesLen)
{
...
while(nDesIndex < nDesLen)
{
nSample = (*(DWORD*)(pSrc+sizeof(DWORD)+1+
(nSrcIndex>>3)))>>(nSrcIndex&7) & (nMaxSamples-1);
nSrcIndex += nBitsPerSample;
if(nSample >= 256)
if(nSample-256 < (int)dictionary.size())
{
nDesIndexSave = nDesIndex;
...
}
else
{
nSampleLen = nDesIndex-nDesIndexSave+2;
memcpy(pDes+nDesIndex, pDes+nDesIndexSave,
nDesIndex-nDesIndexSave);
nDesIndex += nDesIndex-nDesIndexSave;
*(pDes+nDesIndex++) = *(pDes+nDesIndexSave);
nDesIndexSave += nSampleLen-2;
}
else
...
...
}
...
}
In this code, I save the position of the previous symbol in the variable nDesIndexSave, to be used in the case of an out-of-range symbol.
Source Code Files
LZW.cpp, LZW.h and BinaryTree.h.
Updates
- 20-01-2005: Solved a very special bug in the decompression; if anyone took the code before this date, please update it.
- 01-01-2008: Solved the bug of end-of-file mismatch when compressing big files. Thanks go to "Wael AlGhool."
- 15-01-2008: Updated the pseudocode of the compression.
References
- [1] Mark Nelson, "LZW Data Compression," Dr. Dobb's Journal, October 1989.
Thanks to...
I owe a lot to my colleagues for helping me in implementing and testing this code. (JAK)