Introduction
I've looked around the internet for an implementation of dynamic LZW compression in MFC. Unfortunately, since I couldn't find one, I've decided to write one myself. This article will not explain the basics of the LZW algorithm. To find that all you need to do is to look for "LZW algorithm" in google, and you'll find a lot information.
But I will explain what is the dynamic and static compression in the LZW implementations, and some problems that someone might face during implementing LZW algorithm.
The files LZWCompression.h, LZWCompression.cpp, Dictionary.h, Dictionary.cpp are all you need to use, to insert the LZW into your project.
Background
what is static compression in LZW?
The static compression is when we use a static number of bits to make the compression. The total number of bits we have to our disposal is 32, which gives us 4,294,967,295 possible entries in our dictionary. In most cases we don't get that far, actually, most of the times we won't get passed the 16 bit barrier (which gives us 65,535 possible entries in the dictionary). So in order to speed up the processing time, we set the number of bits used for compression to a limit (most applications put it on 14). That way every information written to the destination buffer, will be in the same bit size.
So if someone wants to represent the letter 'A' in the 14 bit format it will look like this: 00000001000001. If the dictionary have the value of 245 in its dictionary it will look like so: 00000011110101. As you can see, we have some leading 0 in the number, which means that the file can be compressed even more.
what is dynamic compression in LZW?
The dynamic compression changes the number of bits used to compress the data. It starts with 9 bits for each new value, and goes up until it reaches 32 or until the file ends.
The number of bits to use in the dynamic compression is set by the size of the dictionary you have. Since the dictionary begins with all the 256 (0-255) ASCII codes. Once the dictionary reached the point were the next entry will have to have another bit to represent her (256, 512, ...) we change the bit size of the compression from that point. This also explains why we start with 9 bit compression.
Since the dynamic compression have only one leading 0 in front of each number, you get a more compressed file then the static method. But you pay in extra calculation time. for each entry you put in the dictionary, you have to calculate the number of bits to use.
So if someone wants to represent the letter 'A' it will look like this: 001000001.
If the dictionary have the value of 245 in its dictionary it will look like so: 011110101.
As you can see, we have one leading 0 in the number, and the number of bits to use is still 9.
If the dictionary have the value of 276 in its dictionary it will look like so: 0100010100. And the limit will now be 512, which means we are now dealing with 10 bits.
Compressing the file
Compressing the file is simply saving all the bits in a buffer, and then writing them into the destination file in bytes. To accomplish that we can't just write the data into the file (writing the value 245 into the file will cause the number to look like this: 0000000011110101, which we don't want).
So we have to push the buffer data to the left, in order to make room for the new data to be inserted.
The number of bits we push the data is changing according to the number of bits we need for the data. After pushing the data in the buffer we add to it the data we want it to have. After doing so, we write some information into the destination file, and continue with the application. The following function receive the destination file to write in, and the new data to combine:
void CLZWCompression::CompressData(CFile &dest, long toSave)
{
DWORD writeData = 0;
m_SavedData |= (DWORD) toSave << (32 - m_MaxBits - m_TotalBits);
m_TotalBits += m_MaxBits;
while (m_TotalBits >= 8)
{
writeData = m_SavedData;
writeData >>= 24;
dest.Write(&writeData, 1);
m_SavedData <<= 8;
m_TotalBits -= 8;
}
}
m_MaxBits
holds the number of bits that represent the number held by toSave
m_SavedData
is the buffer (DWORD) that holds the data to be written to the file, and m_TotalBits
is the remaining bits in the buffer that represent a number.
Since we have to use bytes when we write in to files, and since the bit size we are using changes, we remain with bits in the buffer of a previous data. This is the main reason for the use of m_TotalBits
to prevent from us to write information on the previous number information, and to make sure that all the data will reach the file correctly.
Decompressing the file
Decompressing the file is to take all the data we had from the previous compression, and translate it into sequences of information that we can understand.
In the decompression we take the code we had in the file, and using the dictionary (which we build again while decompressing), we get the sequence of letters that construct this code.
The decode buffer is filled in the opposite way, meaning from the end byte that needs to be written, to the beginning byte of the sequence.
void CDictionary::GetBytesFromCode(CByteArray *Buffer, DWORD code)
{
dicElement *tmpEl = NULL;
while (code > 255)
{
tmpEl = (dicElement*)GetAt(code - 256);
Buffer->Add(tmpEl->m_Letter);
code = tmpEl->m_Prefix;
}
Buffer->Add((BYTE)code);
}
The buffer then is written into the file when the letters are written from the last to end.
while (counter > 0)
{
writeData = (BYTE)decodeString.GetAt(--counter);
destination.Write(&writeData, 1);
}
decodeString.RemoveAll();
Unlike in the compression, we can't stop the decompression process when the file is over, since we still have information inside the buffer that needs to be decoded. So to end the file, in the compression we entered at the end of the file the maximum number possible by the current bit size.
void CLZWCompression::CloseCompressedFile(CFile &source)
{
CompressData(source, m_MaxCode[m_MaxBits]);
CompressData(source, 0);
}
We flash the end after the m_MaxCode[m_MaxBits]
with '0' to make sure that the file will contain all the code. Since the '0' after it will never be used. This value is compared with the currently read data, and if exist means that the end of file arrived, and that we need to end the decompression process.
while ((data = DecompressData(source)) != m_MaxCode[m_MaxBits])
Using the code
All you need to use this code, is just to insert the code into your project. The class creates its own dictionary, and clears it after the use of the compression or decompression process, so using one class to manage all your LZW compression in the project is easy.
Unfortunately, when using this class in Thread, you must make a new instance of the class for every thread you are running, since the class can have only one dictionary at a time.
Code History
- 25 Apr 2004 - The code was added with
convertASCIIToText
function for showing the code actual character in decompression, as suggested by WREY. The previous log string was commented and a new one is in use.