Word Aligned Hybrid (WAH) Compression for BitArrays

Mehdi Gholam

4.89/5 (35 votes)

28 Feb 2015CPOL6 min read

138.5K

2.8K

Word Aligned Hybrid (WAH) compression for BitArrays

Introduction

After searching the internet for a .NET implementation for WAH compressed BitArray and not finding one, I decided to write and publish an article here so the .NET community are not left out as all the available implementations are in the Java language. This ties into a pretty advanced topic of bitmap indexing techniques for databases and I needed this for my RaptorDB Document data-store database engine.

What Is It?

A BitArray is a data structure in the .NET library for storing true/false or bits of data in a compact form within an array of Int32 objects. A WAH or word-aligned hybrid BitArray is a special run length compressed version of a BitArray which saves a lot of space and memory. All the implementations that exist in Java essentially duplicate the functionality of a BitSet, that is the AND, OR, NOT, and XOR operations with the compressed internal storage format.

In my implementation, I defer the functionality of the BitArray to itself and just add compression and decompression routines. This is much faster than the Java way at the expense of memory usage. To overcome this, I have also added a FreeMemory method to release the BitArray contents and keep the compressed contents. Arguably, if you are using 100 million bits, then a full implementation is more performant than my implementation, but for most of our Use Cases, we are at most in the 10 millions of bits range.

This original method was invented at the Berkeley Labs of US Department of Energy; it is a project named FastBit and is used for high energy physics department experiments; you can see it here: FastBit site.

Why Should I Care?

So what?, you ask. Well, as mentioned before, BitArrays are used in an indexing technique called bitmap indexes (Wiki) and column based databases which store data in columns instead of rows. An example which you might know is Microsoft's PowerPivot for Excel which can process millions of rows in seconds. Interestingly, Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform, post 2008 R2. It has long been in use by other RDBM vendors like Oracle.

How it Works

The WAH compression algorithm is as follows:

Take 31 bits from the array.
If all zeros, then increment the zero count by 31.
1. if ones count > 0, then output 32 bits with bit 32 =1 and bit 31=1 and 0-30 = ones count.
If all ones, then increment the ones count by 31.
1. If zeros count >0, then output 32 bits with bit 32=1 and bit 31=0 and 0-30 = zeros count.
Else output the 31 bits as 32 bits with bit 32 = 0.
1. If zeros or ones count >0, then output as above.

From the above, in the worst case, you will get N/31 more bits encoded or about 3% increase in size to the original.

What You Get

WAHBitArray is essentially the same as the standard BitArray in the .NET Framework, with the following additions:

FreeMemory(): This will first compress the internal BitArray then free the memory it uses.
GetCompressed(): This will compress the current BitArray then return it as an array of uint.
CountOnes(), CountZeros(): will count the respective bits in the array.
GetBitIndexes(bool): will return an enumeration using yield of the respective bit position; for example, if the bit array contains 10001... this will return integers 0,4,... if the bool parameter was true, and 1,2,3,... if bool was false.
Get(), Set(): Methods implemented with auto resizing and no exceptions.
DebugPrint(): generates a string of 1 and 0 values.

Using the Code

To create a WAHBitArray, you can do the following:

WAHBitArray ba1 = new WAHBitArray(1000000); // 1 million bits
// 2 million bits from another bitarray
WAHBitArray ba2 = new WAHBitArray(new BitArray(2000000));
WAHBitArray ba3 = new WAHBitArray(1000000, new uint[] { /* compressed values here */});
// from a compressed list of uint

To perform operations, you can do the following:

WAHBitArray result1 = ba1.And( ba2);
WAHBitArray result2 = ba1.Or( ba3);
WAHBitArray result3 = ba1.Xor( ba2);

long count = result1.CountOnes(); // will count the true values

result2.Set(2000000, true); // will resize as needed
bool b = result2.Get(3000000); // will resize as needed

foreach(int pos in result2.GetBitIndexes(true))
{
    ReadDatabaseRecordNumber(pos); // pos is the index of true values
}

Using the Code v1.3

As of version 1.3, you don't need to specify the size of the BitArray and all operations will auto resize as needed.

WAHBitArray ba1 = new WAHBitArray(); // no need to specify the size
WAHBitArray ba3 = new WAHBitArray(new uint[] { /* compressed values here */});
// from a compressed list of uint

Points of Interest

BitArray class is sealed by Microsoft so inheriting from it was not possible.
BitArray throws an exception if the length of two BitArrays are not equal on bit operations, WAHBitArray makes them the same as the largest before operations.
BitArray must be resized in 32 increments, otherwise it mangles the compression bits.

Version 2.0

For extra speed in compressing and uncompressing the bits, and the fact that the .NET Framework implementation does not give access to the internal data structures in the BitArray, I had to rewrite all the BitArray functionality in WAHBitArray.

Using Reflector to see the internal implementation of the BCL BitArray one can see the following snippets:

// AND operation
for (int i = 0; i < array.Length; i++)    
    array[i] &= val[i];

// OR operation
for (int i = 0; i < array.Length; i++)
    array[i] |= val[i];

// XOR operation
for (int i = 0; i < array.Length; i++)
    array[i] ^= val[i];

As you can see, the bit operations are done on the int32 values.

Now with access to the internal uint[] bits, the compression method gets 31 bit blocks of data instead of one by one. This is done in the Take31Bits() method, which finds the two adjacent uint values in the _uncompressed list and does some bit manipulations as follows:

public WAHBitArray And(WAHBitArray op)
{
    this.CheckBitArray(); // check the bit array is uncompressed

    uint[] ints = op.GetUncompressed(); // get the values

    FixSizes(ints, _uncompressed); // make the sizes the same

    for (int i = 0; i < ints.Length; i++)
        ints[i] &= _uncompressed[i]; // do the AND operation

    return new WAHBitArray(false, ints); // return a new object
}

The compression and uncompression routines were rewritten to operate in the uint[] arrays as follows:

private void Compress()
{
    _compressed = new List<uint>();
    uint zeros = 0;
    uint ones = 0;
    int count = _uncompressed.Count << 5;
    for (int i = 0; i < count; )
    {
        uint num = Take31Bits(i);
        i += 31;
        if (num == 0)
        {
            zeros += 31;
            FlushOnes(ref ones);
        }
        else if (num == 0x7fffffff)
        {
            ones += 31;
            FlushZeros(ref zeros);
        }
        else
        {
            FlushOnes(ref ones);
            FlushZeros(ref zeros);
            _compressed.Add(num);
        }
    }
    FlushOnes(ref ones);
    FlushZeros(ref zeros);
}

private void Uncompress()
{
    int index = 0;
    List<uint> list = new List<uint>();
    if (_compressed == null)
        return;

    foreach (uint ci in _compressed)
    {
        if ((ci & 0x80000000) == 0) // literal
        {
            this.Write31Bits(list, index, ci & 0x7fffffff);
            index += 31;
        }
        else
        {
            uint c = ci & 0x3ffffff;
            if ((ci & 0x40000000) > 0) // ones count
                this.WriteBits(list, index, c);

            index += (int)c;
        }
    }
    this.ResizeAsNeeded(list, index);
    _uncompressed = list;
}

Because taking or updating 31 bits of data can overlap two adjacent uint values, some bit massage is necessary as follows:

private uint Take31Bits(int index)
{
    long l1 = 0;
    long l2 = 0;
    long l = 0;
    long ret = 0;
    int off = (index % 32);
    int pointer = index >> 5;

    l1 = _uncompressed[pointer];
    pointer++;
    if (pointer < _uncompressed.Count)
        l2 = _uncompressed[pointer];

    l = (l1 << 32) + l2;
    ret = (l >> (32 - off)) & 0x07fffffff;

    return (uint)ret;
}

Version 2.5

This version is a post back from the work done in RaptorDB, which is an overhaul of all the routines focusing on multi thread access and concurrency issues. A very important lesson learned was locking at the source of the resource being used as opposed to a higher level, which takes care of concurrency issues at the lowest levels, and saves a lot of headaches and debugging.

Much of the code has been reformatted and optimized. I am very confident in the correctness of this version as it is being tested to it's maximum in RaptorDB.

History

Initial release v1.0: 22 June 2011
Update v1.1: 24 June 2011
- Bit operations now return a WAHBitArray instead of BitArray
- A bit operation will take either a WAHBitArray or a BitArray as the input
- CountZeros(), CountOnes() methods added
- GetBitIndexes() method added
Update v1.2: 28 June 2011
- Get, set methods with auto resizing
Update v1.3: 23 July 2011
- Removed the need to specify the initial size
- Bug fix resize in 32 increments
- Bug fix in BitArray arithmetic
- DebugPrint() method implemented
Update v2.0
- Complete rewrite
- Optimized compression and decompression ~9x speed increase
- All bit operations are done internally without BCL BitArray
Update v2.5: 10 May 2012
- Thread safe access to bits
- Optimized Count() method
- Bits set in int[] in high order
- Bug fix getting uncompressed bits
- GetBitIndexes() will return the one bits only
Update v2.6 : 15^th June 2013
- bug fix edge case decompress code (thanks to andrey269 )
Update v2.6.1 : 22^nd June 2013
- bug fix a logic mistake
- added a unit test project
Update v2.6.2 : 6^th October 2013
- bug fix last remaining bits output in WriteOnes()
Update v2.7.0 : 1^st March 2015
- post back fixes and changes from RaptorDB

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)