(untagged)

Templated Burrows-Wheeler transformation

Rasmus Kaae

0.00/5 (No votes)

28 May 2005

An article on how to use the templated class for Burrows-Wheeler transformation.

Download source files - 3.4 Kb

Introduction

This piece of code is something that I wrote while finishing my master thesis at Aalborg University. We were, among other things, researching on methods to compress inverted indexes in order to keep the disk space requirements low. The attached code demonstrates a way of implementing the Burrows-Wheeler transformation in a templated C++ environment. I so far only tested it with G++ on Sun Solaris -- however the code is platform independent and should be compilable anywhere.

Background

While researching on methods to rearrange data in order to achieve higher compression ratios, I stumbled across a multitude of code snippets on many different codecs. However I never found any code for the Burrows-Wheeler transformation (also known as BWT).

But what is BWT?

BWT is an advanced technique for rearranging data in order to achieve higher compression rates by the following codecs. In general, BWT takes a sequence of symbols as input and tries to order the data by rearranging the positions without tampering the actual values. As a standalone codec, the BWT does nothing but add a slight overhead to the data, however when developing customized compression scehems, several codecs may be set in sequence in order to achieve higher compression ratios. A simple codec scenario would be first to execute the BWT and then execute a Run-Length-Encoding (RLE). A sample application of the BWT followed by RLE follows:

Given the following input to the BWT:

The quick brown fox jumped over the silvermoon. The quick brown fox jumped 
over the silvermoon. The quick brown fox jumped over the silvermoon. The 
quick brown fox jumped over the silvermoon.

The BWT will output the following code:

...kkkknnnnxxxxddddeeeeeeeerrrr.nnnn|       iiiieeeehhhhhhhhppppvvvvvvvv    
TTTTttttuuuussss    cccciiiirrrruuuuwwwwoooooooommmm    rrrrffffmmmm    eeee
eeeebbbb        qqqqjjjjoooolllloooooooo|

Where "|" is the BWT-delimiter symbol.

This code is highly compressible compared to the original text. Below is shown a simple RLE encoding on the above BWT code:

3.#4k#4n#4x#4d#8e#4r#.#4n#|#7 #4i#4e#8h#4p#8v#4T#4t#4u#4s#4 #4c#4i#4r#4u
#4w#8o#4m#4 #8e#4b#4q#4j#3o#4l#8o#|#

Where "#" is the RLE-delimiter symbol.

Applying other more efficient codecs than the RLE will result in better compression ratios.

Using the code

Since the code is templated you can put more or less any kind of data into the collection and perform the BWT. However please make sure that you allocate one symbol as delimiter to be used in the BWT and inverse-BWT.

The first example shows how to transform a short string into easily compress-able BWT form:

    // instantiate the BWT and allocate the value 127 as delimiter

    cBWTransform<char> transform(127);
    
    // push the string "BANANA" of length 6 onto the BWT-list

    transform.push_back("BANANA", 6);
    
    // perform the BWT transformation

    cBWTransformationCode<char> code = transform.Transform();
    
    // prepare for the inverse-BWT transformation

    // and allocate the value 127 delimiter

    // the delimiter value must be

    // the same for the inverse-BWT and the BWT

    cBWTransformation<char> inverse_transform(127);
    
    // execute the inverse-BWT on the BWT-code from above

    inversetransform.InverseTransform(code);
    
    // output the value to std out

    for (cBWTransform< char>::iterator i=inversetransform.begin(); 
                                       i!=inversetransform.end(); i++) {
        cout << *i << " ";
    }
    cout << endl;

Points of Interest

The above code is written using STL and relies heavily as such on that. Through our master thesis project, we have discovered that being under time pressure makes the STL very useful, however for special purposes one might want to write hand-optimized code instead :-)

For more information on the BWT transformation, please refer to these sources:

History

28/05/05 - Initial version.
28/05/05 - Added (hopefully) explanatory samples.
29/05/05 - Updates ZIP with VS.NET 2003 project.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here