Introduction
Huffman tree compression is almost as simple as RLE compression, but can be equally fast and gives more reasonable compression ration, thus is more effective. This source code, along with the binary, is an example of Huffman tree compression.
Background
Huffman tree compression and tree construction is fairly described on Wiki. Huffman tree coding is based on prefixes. So there cannot be two different leaves with exact same paths, even more, there must be a path, that ends up only in one leaf (no intersection leaves). Huffman tree coding, furthermore, is based on a frequency or probability of appearance of each one particular symbol. Given this idea, we work with a priority queue, that is initially filled with small trees, each with its symbol/char and probability of appearance/ref.count
. We sort our queue/array descending, so most frequent symbols are topmost, while least common are the last. We then merge trees, starting from least one end, going topmost, while merging two trees into one new. After each merging, sorting is performed, so the queue/array is ordered descending again. We do merging, sorting until there is no more than 1 tree in the queue. Then we can finish with tree construction and go to compression, where each symbol is replaced by its path in bits from tree.
Using the Code
The code contain one public
class simple_Huffman
in simple_huffman.cpp.
Main Class
#include <memory.h> // memset
#include <stdlib.h> // qsort
typedef unsigned char BYTE;
class simple_Huffman{
private:
class node{
public:
node *left, *right;
int count; int code; int codelength; BYTE chr; node(void) {memset(this, 0, sizeof(node));}
~node() {
if (left) {delete left; left = NULL;}
if (right) {delete right; right = NULL;}
}
};
node *trees[256];
node *leaves[256]; node *trees_backup[256];
int STEPS[256];
int stepscount;
node nodes[256];
void MakeTree();
void MakeTreeD(); int treescount;
BYTE *allocatedoutput;
void InitializeLeaves(node *, int, int);
static int comparefreq(const void *A, const void *B);
void moveTreesRight(node **toTree);
void moveTreesRight2(node **toTree);
void tryToPop();
void moveToTop();
void performInternalCleanup();
void initialize();
int LastError;
public:
simple_Huffman();
int Compress(BYTE *input, int inputlength);
int Decompress(BYTE *input, int inputlength);
int CompressFile(char *InFile, char *OutFile);
int DecompressFile(char *InFile, char *OutFile);
void Finalize(); BYTE *getOutput(); int getLastError(); };
Using the Code
int Compress(BYTE *input, int inputlength); int Decompress(BYTE *input, int inputlength); int CompressFile(char *InFile, char *OutFile); int DecompressFile(char *InFile, char *OutFile);
Points of Interest
Huffman compression gives fairly better ratio than RLE, and is more effective in general. But, of course, it does not consume patterns, nor repeats. Giving ratio of saving ranges from 20% - 50% in general.
Huffman tree coding is used worldwide, e.g., in deflate gzip, JPEG images, ....
If you want to learn huffman in instant, feel free to visit best Czech page, intended for this subject. Go here and fill up a textfield with some letters. Then click "Cetnosti" to count symbols. You see the small trees? Nice. Now click "Vytvor strom" to create final tree, or try "Jeden krok" for one per click tree merging.
Storing path information in output tree structure
When storing a tree inside an output file, it would be interesting to determine whether int
is sufficiently large enough, to contain all possible paths, and if there are any bad cases, where it fails. While writing a Huffman table (output tree structure), basically we have several choices to do so. The choices are discussed below.
The header
In a two-pass Huffman implementation (get probabilities, then perform compression/decompression), we need to inform decoder, about a tree structure,
used in compression. That structure can be represented by a table, simply with a list of characters and their probabilities. Basically you have a few choices
on how to generate an original tree. You can:
- Send a tree as a whole (path length, path itself, symbol) or
- send character counts (count, character) or
- send sorted sequence of characters, present in the original stream and send a "recipe" on how to reconstruct a tree or
- send a canonic representation of tree
The details:
- Sending a tree with a path length, path itself and a symbol requires 4+4+1 = 9 bytes (4 bytes length, 4 bytes path, 1 byte char).
When there is 256 possible characters, resulting header would be 256*9 = 2304 bytes.
- Sending character counts require 4 bytes for counts and 1 byte for char. It sums 4+1 = 5 bytes. A 256 characters would cost 256 * 5 = 1280 bytes.
- Sorted sequence is present in current implementation (3.0), and we will discuss that more closely here. A tree-making algorithm goes
from end to the begining (of trees array), passing through all character trees (a tree with only 1 character) and merging 2 last trees into 1 new tree.
Each time, when merging is executed, all remaining trees are re-sorted, to reflect actual change in trees, and to assume, that it will be 2 with minimal probabilities,
which are last. Merging takes 2 trees and creates a new tree, which have 1 left and 1 right node (sub-tree), both are those minimal probability trees taken.
Their probabilities is summed and included in newly-created tree. Now we need to assume, that trees are in correct order (most to last probable). To do this,
we can simply re-sort whole tree. A key thing here is, that re-sorting is not needed at all. Think just deeper about this: re-sorting checks all trees,
to make sure they are sorted. But we know, that it is only 1 tree, that changed actually and needs to be checked. So, we simply check, whether there
is a better position for that new tree (from last to first). Whether a better position is found, we call this checking step.
We store each step (int value = new position). A step list can be named "recipe" as it tells us, how to reconstruct original tree using exact steps.
Our sorted sequence is just characters sorted along their probabilities (from most to last). This is initial array, just before any step has been performed.
So we need at most 256 bytes for characters and 256 bytes for steps, the sum of 512 bytes. We saved 768 bytes in comparsion to 2. Another benefit is the fact,
that we are worrying about path length storing (4 bytes sufficient? see ) no longer.
For a canonical representation you can visit [1]. Or you can see [2], which seems to be easier to read.
References
[1] http://www.arturocampos.com/ac_canonical_huffman.html#Canonical%20Huffman
[2] http://www.compressconsult.com/huffman/
History
- 01.04.2012 - Added
int CompressFile
and int DecompressFile
. Header included in the compressed output is now shorter (max before (bytes): 1280, current max: 512). Compression and decompression is now slight faster. This version is not compatible with any previous version. - Complete history list can be found in the header of the huffman.cpp C++ source file.
- 30.8.2010 - First source revision, classes split into separate headers, plus made public only Compress and Decompress (plus constructor).
- 29.8.2010 - Comments and class rename.