Introduction
Modern CPUs are very fast, but RAM is not able to keep up with the speed. If RAM turns out to be a bottleneck, then compression can become attractive.
In this tip, I want to introduce you to compression algorithms for (unsigned) 32bit integers. A variety of algorithms exist, with several patent-free open source implementations available.
Delta Encoding
All algorithms have one thing in common: the smaller the integer, the better the compression. After all, small integers require fewer bits. Reducing the number of bits is especially easy when the integers are sorted, because then it's sufficient to store the delta values. This is called "Delta Encoding":
A sorted sequence
N1, N2, ... Nn
can be encoded as:
N1, N2 - N1, N3 - N2, ... Nn - Nn-1
During decompression, each delta is added to the previous value and the original value is reconstructed. The actual compression algorithm only stores the deltas. Since deltas are usually relatively small, the compression ratio improves.
But, as mentioned previously, delta encoding will only work with sorted sequences. And it slows down some of the operations, i.e., appending a value to the end of the sequence.
Variable Byte Compression
Variable Byte (or Varint
, Varbyte
) compression is quite old and simple. Small numbers (< 127) are stored in a single byte. If the number is larger, a "continuation bit" is set, and an additional byte is used. Here are a few examples:
Decimal Binary Variable Byte encoded
127 00000000 01111111 01111111
128 00000000 10000000 10000000 00000000
16384 01000000 00000000 10000000 10000000 00000001
It requires just a few lines of code to uncompress an integer:
static inline int
read_int(const uint8_t *in, uint32_t *out)
{
*out = in[0] & 0x7Fu;
if (in[0] < 128)
return 1;
*out = ((in[1] & 0x7Fu) << 7) | *out;
if (in[1] < 128)
return 2;
*out = ((in[2] & 0x7Fu) << 14) | *out;
if (in[2] < 128)
return 3;
*out = ((in[3] & 0x7Fu) << 21) | *out;
if (in[3] < 128)
return 4;
*out = ((in[4] & 0x7Fu) << 28) | *out;
return 5;
}
However, the variable byte compression is relatively slow. There are just too many "if
" branches which are impossible to predict for the CPU. Therefore the CPU cannot run multiple instructions in parallel, and the CPU pipeline stalls frequently.
Nevertheless, it is possible to use SIMD instructions to speed up the decoding. The continuation bits of up to 16 bytes can be collected with the SSE instruction _mm_movemask_epi8
and these bits can be used as an index into a lookup table for further instructions. The Masked Vbyte library from Daniel Lemire, Jeff Plaisance and Nathan Kurz does exactly this, although it uses only 12 bits for the lookup table (a 16bit lookup table would be too large). The table itself contains parameters for another SSE instruction, pshufb
, which then rearranges the bytes.
Here is an example code which uses the Masked Vbyte library. It uses delta compression, and it has functions to lookup values or to select an integer from a compressed sequence, without decompressing it. I have contributed some of these functions. You will not find a faster variable byte compression library.
static void
test_maskedvbyte()
{
simdvbyteinit();
std::vector<uint8_t> compressed(input.size() * 5);
size_t used = vbyte_encode_delta(&input[0], input.size(), &compressed[0], 0);
std::cout << "MVByte: compressed " <<
input.size() << " integers ("
<< input.size() * 4 << " bytes) to "
<< used << " bytes" << std::endl;
uint32_t v = masked_vbyte_select_delta(&compressed[0], input.size(), 0, 3);
std::cout << "integer at position 3: " << v << std::endl;
std::vector<uint32_t> output(input.size());
masked_vbyte_decode_delta(&compressed[0], &output[0], input.size(), 0);
assert(input == output);
}
Frame Of Reference Compression
The Masked Vbyte library requires a relatively new-ish Intel CPU for its SIMD instructions. For older CPUs, or non-Intel platforms, you will have to resort to other implementations. But even without SIMD, it is possible to come up with very efficient implementations. One of them is libfor, which implements the "Frame Of Reference" (FOR) algorithm. FOR does not store the deltas to the previous value, but to the beginning of the sequence. And unlike Variable Byte, it does not store a sequence of bytes, but a stream of bits and therefore can
have an even better compression ratio.
With FOR, a sequence
N1, N2, ... Nn
will be encoded as:
N1, N2 - N1, N3 - N1, ... Nn - N1
libfor has a clean API and also implements functions which directly operate on the compressed sequence, i.e., to append values, perform searches or to select a specific value. It has different APIs for sorted and unsorted sequences. FOR's implementation of select is very fast, and FOR can search sorted sequences very efficiently with binary search.
static void
test_for()
{
uint32_t bytes_reqd = for_compressed_size_sorted(&input[0], input.size());
std::vector<uint8_t> compressed(bytes_reqd);
uint32_t used = for_compress_sorted(&input[0], &compressed[0], input.size());
std::cout << "libfor: compressed " <<
input.size() << " integers ("
<< input.size() * 4 << " bytes) to "
<< used << " bytes" << std::endl;
uint32_t v = for_select(&compressed[0], 3);
std::cout << "integer at position 3: " << v << std::endl;
std::vector<uint32_t> output(input.size());
for_uncompress(&compressed[0], &output[0], input.size());
assert(input == output);
}
Other Libraries and Projects
FOR is implemented in pure scalar C code, SimdFOR is a very similar implementation with SIMD instructions. It is faster, but only works on Intel CPUs. Both are not compatible.
The SimdComp library is similar to SimdFOR, but uses delta compression. It therefore requires sorted integers as input, and its compression ratio is by far the best of all the libraries listed here.
Here are a few questions you have to answer before choosing a library:
- Are your integers sorted? Then pick one with delta compression.
- Is your software also running on non-Intel CPUs? Then avoid implementations using SSE instructions.
- Do you want to directly work on compressed data, without uncompressing? Then compare the libraries and check whether your requirements are met.
- And finally run benchmarks for your specific data. Algorithms will compress better or worse depending on your input.
You can download the sample code from the following link:
You can find in-depth benchmarks of multiple compression libraries here: http://upscaledb.com/0009-32bit-integer-compression-algorithms.html.