Introduction
This piece of code is something that I wrote while finishing my master thesis at Aalborg University. We were, among other things, researching on methods to compress inverted indexes in order to keep the disk space requirements low. The attached code demonstrates a way of implementing the Burrows-Wheeler transformation in a templated C++ environment. I so far only tested it with G++ on Sun Solaris -- however the code is platform independent and should be compilable anywhere.
Background
While researching on methods to rearrange data in order to achieve higher compression ratios, I stumbled across a multitude of code snippets on many different codecs. However I never found any code for the Burrows-Wheeler transformation (also known as BWT).
But what is BWT?
BWT is an advanced technique for rearranging data in order to achieve higher compression rates by the following codecs. In general, BWT takes a sequence of symbols as input and tries to order the data by rearranging the positions without tampering the actual values. As a standalone codec, the BWT does nothing but add a slight overhead to the data, however when developing customized compression scehems, several codecs may be set in sequence in order to achieve higher compression ratios. A simple codec scenario would be first to execute the BWT and then execute a Run-Length-Encoding (RLE). A sample application of the BWT followed by RLE follows:
Given the following input to the BWT:
The quick brown fox jumped over the silvermoon. The quick brown fox jumped
over the silvermoon. The quick brown fox jumped over the silvermoon. The
quick brown fox jumped over the silvermoon.
The BWT will output the following code:
...kkkknnnnxxxxddddeeeeeeeerrrr.nnnn| iiiieeeehhhhhhhhppppvvvvvvvv
TTTTttttuuuussss cccciiiirrrruuuuwwwwoooooooommmm rrrrffffmmmm eeee
eeeebbbb qqqqjjjjoooolllloooooooo|
Where "|" is the BWT-delimiter symbol.
This code is highly compressible compared to the original text. Below is shown a simple RLE encoding on the above BWT code:
3.#4k#4n#4x#4d#8e#4r#.#4n#|#7 #4i#4e#8h#4p#8v#4T#4t#4u#4s#4 #4c#4i#4r#4u
#4w#8o#4m#4 #8e#4b#4q#4j#3o#4l#8o#|#
Where "#" is the RLE-delimiter symbol.
Applying other more efficient codecs than the RLE will result in better compression ratios.
Using the code
Since the code is templated you can put more or less any kind of data into the collection and perform the BWT. However please make sure that you allocate one symbol as delimiter to be used in the BWT and inverse-BWT.
The first example shows how to transform a short string into easily compress-able BWT form:
cBWTransform<char> transform(127);
transform.push_back("BANANA", 6);
cBWTransformationCode<char> code = transform.Transform();
cBWTransformation<char> inverse_transform(127);
inversetransform.InverseTransform(code);
for (cBWTransform< char>::iterator i=inversetransform.begin();
i!=inversetransform.end(); i++) {
cout << *i << " ";
}
cout << endl;
Points of Interest
The above code is written using STL and relies heavily as such on that. Through our master thesis project, we have discovered that being under time pressure makes the STL very useful, however for special purposes one might want to write hand-optimized code instead :-)
For more information on the BWT transformation, please refer to these sources:
History
- 28/05/05 - Initial version.
- 28/05/05 - Added (hopefully) explanatory samples.
- 29/05/05 - Updates ZIP with VS.NET 2003 project.