Fast and Simple Huffman Compressor

Jan Mojzis

4.80/5 (16 votes)

2 Apr 2012CPOL5 min read

78.6K

4.8K

Fast and simple huffman compressor

Introduction

Huffman tree compression is almost as simple as RLE compression, but can be equally fast and gives more reasonable compression ration, thus is more effective. This source code, along with the binary, is an example of Huffman tree compression.

Background

Huffman tree compression and tree construction is fairly described on Wiki. Huffman tree coding is based on prefixes. So there cannot be two different leaves with exact same paths, even more, there must be a path, that ends up only in one leaf (no intersection leaves). Huffman tree coding, furthermore, is based on a frequency or probability of appearance of each one particular symbol. Given this idea, we work with a priority queue, that is initially filled with small trees, each with its symbol/char and probability of appearance/ref.count. We sort our queue/array descending, so most frequent symbols are topmost, while least common are the last. We then merge trees, starting from least one end, going topmost, while merging two trees into one new. After each merging, sorting is performed, so the queue/array is ordered descending again. We do merging, sorting until there is no more than 1 tree in the queue. Then we can finish with tree construction and go to compression, where each symbol is replaced by its path in bits from tree.

Using the Code

The code contain one public class simple_Huffman in simple_huffman.cpp.

Main Class

C++

#include <memory.h>     // memset
#include <stdlib.h>     // qsort
typedef unsigned char BYTE;
class simple_Huffman{
private:
    class node{
    public:
        node *left, *right;
        int count;       // ref. count
        int code;        // code in bits
        int codelength;  // code length
        BYTE chr;        // ASCII char, serves in compression and then in writing of frequencies
        // constructor, memset is faster than individ.
        node(void)  {memset(this, 0, sizeof(node));}
        ~node() {
            if (left) {delete left; left = NULL;}
            if (right) {delete right; right = NULL;}
        }
    };
    // Array of trees, here will be only one tree at pos.
    // 0 when finished with merging 
    node *trees[256];
    node *leaves[256]; // array of leaves, each contains ASCII char,
         // ref. count, later code and codelength too. It is a reference to original trees
    node *trees_backup[256];
    // as the tree is constructed, this stores
    // exact steps (movings) of the last tree
    int STEPS[256];
    int stepscount;
    node nodes[256];

    void MakeTree();
    void MakeTreeD(); // maketree for decompression (no searching)
    int treescount;
    BYTE *allocatedoutput;
    // initialization of leaves with codes and code lengths
    void InitializeLeaves(node *, int, int);
    // sorting, we want to sort descending
    static int comparefreq(const void *A, const void *B);
    void moveTreesRight(node **toTree);
    void moveTreesRight2(node **toTree);
    void tryToPop();
    void moveToTop();
    void performInternalCleanup();
    void initialize();
    int LastError;
public:
    simple_Huffman();
    int Compress(BYTE *input, int inputlength);
    int Decompress(BYTE *input, int inputlength);
    int CompressFile(char *InFile, char *OutFile);
    int DecompressFile(char *InFile, char *OutFile);
    void Finalize(); // call after each stream compress/decompress to deallocate
    BYTE *getOutput(); // call after each stream compress/decompress to obtain output
    int getLastError(); // get code of last error (-1 input file failed
                        // to open, -2 output failed, -3 output allocation failed)
};

Using the Code

C++

int Compress(BYTE *input, int inputlength);   // compress streams, return value is output size
int Decompress(BYTE *input, int inputlength); // decompress streams, return value is output size
int CompressFile(char *InFile, char *OutFile);  // compress file directly, return value is output size
int DecompressFile(char *InFile, char *OutFile); // decompress file directly return value is output size

Points of Interest

Huffman compression gives fairly better ratio than RLE, and is more effective in general. But, of course, it does not consume patterns, nor repeats. Giving ratio of saving ranges from 20% - 50% in general.

Huffman tree coding is used worldwide, e.g., in deflate gzip, JPEG images, ....

If you want to learn huffman in instant, feel free to visit best Czech page, intended for this subject. Go here and fill up a textfield with some letters. Then click "Cetnosti" to count symbols. You see the small trees? Nice. Now click "Vytvor strom" to create final tree, or try "Jeden krok" for one per click tree merging.

Storing path information in output tree structure

When storing a tree inside an output file, it would be interesting to determine whether int is sufficiently large enough, to contain all possible paths, and if there are any bad cases, where it fails. While writing a Huffman table (output tree structure), basically we have several choices to do so. The choices are discussed below.

The header

In a two-pass Huffman implementation (get probabilities, then perform compression/decompression), we need to inform decoder, about a tree structure, used in compression. That structure can be represented by a table, simply with a list of characters and their probabilities. Basically you have a few choices on how to generate an original tree. You can:

Send a tree as a whole (path length, path itself, symbol) or
send character counts (count, character) or
send sorted sequence of characters, present in the original stream and send a "recipe" on how to reconstruct a tree or
send a canonic representation of tree

The details:

Sending a tree with a path length, path itself and a symbol requires 4+4+1 = 9 bytes (4 bytes length, 4 bytes path, 1 byte char). When there is 256 possible characters, resulting header would be 256*9 = 2304 bytes.
Sending character counts require 4 bytes for counts and 1 byte for char. It sums 4+1 = 5 bytes. A 256 characters would cost 256 * 5 = 1280 bytes.
Sorted sequence is present in current implementation (3.0), and we will discuss that more closely here. A tree-making algorithm goes from end to the begining (of trees array), passing through all character trees (a tree with only 1 character) and merging 2 last trees into 1 new tree. Each time, when merging is executed, all remaining trees are re-sorted, to reflect actual change in trees, and to assume, that it will be 2 with minimal probabilities, which are last. Merging takes 2 trees and creates a new tree, which have 1 left and 1 right node (sub-tree), both are those minimal probability trees taken. Their probabilities is summed and included in newly-created tree. Now we need to assume, that trees are in correct order (most to last probable). To do this, we can simply re-sort whole tree. A key thing here is, that re-sorting is not needed at all. Think just deeper about this: re-sorting checks all trees, to make sure they are sorted. But we know, that it is only 1 tree, that changed actually and needs to be checked. So, we simply check, whether there is a better position for that new tree (from last to first). Whether a better position is found, we call this checking step. We store each step (int value = new position). A step list can be named "recipe" as it tells us, how to reconstruct original tree using exact steps.

Our sorted sequence is just characters sorted along their probabilities (from most to last). This is initial array, just before any step has been performed.

So we need at most 256 bytes for characters and 256 bytes for steps, the sum of 512 bytes. We saved 768 bytes in comparsion to 2. Another benefit is the fact, that we are worrying about path length storing (4 bytes sufficient? see ) no longer.

For a canonical representation you can visit [1]. Or you can see [2], which seems to be easier to read.

References

[1] http://www.arturocampos.com/ac_canonical_huffman.html#Canonical%20Huffman

[2] http://www.compressconsult.com/huffman/

History

01.04.2012 - Added int CompressFile and int DecompressFile. Header included in the compressed output is now shorter (max before (bytes): 1280, current max: 512). Compression and decompression is now slight faster. This version is not compatible with any previous version.
Complete history list can be found in the header of the huffman.cpp C++ source file.
30.8.2010 - First source revision, classes split into separate headers, plus made public only Compress and Decompress (plus constructor).
29.8.2010 - Comments and class rename.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)