Introduction
I've written, some many years ago, a dynamic Huffman algorithm to compress and decompress data. It is mainly targeted to data with some symbols occurring more often than the rest (e.g., having some data file consisting of three different symbols, and their total number of occurrences in that file is s1(1000), s2(200), s3(30), so the total length of the file is 1000+200+30=1230 bytes; it is encoded, assigning one bit to s1 and two bits to s2, s3, so the encoded length is 1*1000+2*(200+30)=1460 bits=182 bytes). In the best case, a file consisting of just one symbol will be encoded with a compression ratio of 1:8. Huffman coding is used in image compression; however, in JPEG2000, an arithmetic codec is employed.
Background
The article intends to provide the code only, and is not a Huffman tutorial. You may dig for online tutorials on the subject.
Using the code
The console is straightforward to use to encode a source file to a Huffman compressed one:
>>huffman.exe e sourcefile destinationfile
or to decode a Huffman compressed source file back to the original file:
>>huffman.exe d sourcefile destinationfile
The Huffman class provides three functions to use:
void encode(unsigned char *dest, int &csize, unsigned char *sour, int usize)
- to encodevoid decode(unsigned char *dest, int &usize, unsigned char *sour)
- to decodeint get_uncompressed_size(unsigned char *sour)
- to get the uncompressed file size from Huffman encoded data file
dest
and sour
are the destination and source buffers, usize
and csize
are the uncompressed size of the original data and the compressed size of the Huffman encoded data. To encode, just provide your data in the sour
buffer and specify its size in usize
. The csize
will be given the size of the compressed Huffman array you need to provide in the dest
pointer. Make sure it is allocated at least the size of your original data. To decode, just provide in sour
your Huffman encoded data, and query with get_uncompressed_size(sour)
the size of the dest
buffer you need to allocate. Upon decompression completion, the usize
will be overwritten with the uncompressed data size.
Points of interest
I know storing data symbols and their frequencies into Huffman encoded files is space consuming, and that more compact representations are present.