(untagged)

Better Than Zip Algorithm For Compressing In-Memory Data

Andrey Dryazgov

0.00/5 (No votes)

23 May 2006

An article on the in-memory data compression engine for .NET

Introduction

Sometimes you may need to compress data in memory without creating disk files. For example, it is possible to compress binary data in a CDATA section of an XML file. Also, you may want to compress data transmitted between the client and server machines over a low-bandwidth connection.

The proposed classes implement the original compression algorithm. This method is based on the "DEFLATE Compressed Data Format Specification" (RFC 1951) with some enhancements. It works slightly better than the generally used zip compression. The algorithm is especially effective on Centrino and other platforms with 2 MB L2 cache. The above picture shows results of the mscorlib.xml file compression/decompression on a machine with Pentium M processor (1.8 GHz). Running the same test on a machine with AMD Athlon 2100+ processor (with 256 KB L2 cache) gives 721 ms as the compression time and 40 ms as the decompression time. It is interesting that Intel and AMD platforms have such an evident difference in performance for various operations (here 110 ms on the Pentium M is quite imperfect data, however, because of the big inaccuracy).

You can compare this compression engine with the standard DeflateStream compressor added in .NET 2.0. To do so, please open the source code for the demo project (Form1.cs) and comment/uncomment several code blocks. They are marked in the source code. Then, try the demo project with various data files and compare results with the MemCompressor engine. You will see, the standard DeflateStream engine is rather poor.

Using MemCompressor

The MemCompressor assembly includes two main classes: AcedDeflator and AcedInflator. The AcedDeflator class has the Compress method. This method accepts a byte array and returns new byte array with the same data in the compressed form. For example:

private int _originalSize;
private byte[] _originalData;

private int _originalChecksum;
private byte[] _compData;
private int _compSize;

private void CompressData()
{
    _originalChecksum = AcedUtils.Adler32(_originalData, 0, _originalSize);
    _compData = AcedDeflator.Instance.Compress(_originalData, 0, _originalSize,
        AcedCompressionLevel.Normal, 0, 0);
    _compSize = _compData.Length;
}

The AcedInflator class works quite the contrary. It has the Decompress method which accepts a binary array with compressed data. This method restores the original data in the output array.

private byte[] _decompData;
private int _decompSize;
private int _decompChecksum;

private void DecompressData()
{
    _decompData = AcedInflator.Instance.Decompress(_compData, 0, 0, 0);
    _decompSize = _decompData.Length;
    _decompChecksum = AcedUtils.Adler32(_decompData, 0, _decompSize);
    Debug.Assert(_decompChecksum == _originalChecksum);
}

The AcedDeflator.Compress method accepts, among other, the compressionLevel parameter. This parameter chooses balance between the speed and the compression ratio. Here you may pass one of the following values:

Store (simple copying data to the output array without compression)
Fastest (poor compression ratio but the best speed)
Fast (optimal for less-redundant data)
Normal (better choice for higher-redundant data, such as texts)
Maximum (slow but has the best compression ratio)

AcedDeflator Class

public class AcedDeflator

Provides methods for compressing binary data with a combination of the LZ77 algorithm and Huffman coding similar to the method described in RFC 1951. You should synchronize calls to the methods of this class in a multithreaded application.

public AcedDeflator()

Creates a new instance of the AcedDeflator class. It is not recommended to call this constructor directly. When possible, please use a single cached instance of the AcedDeflator class which is returned by the Instance static property.

public static AcedDeflator Instance { get; }

Returns a cached instance of the AcedDeflator class. Such an instance occupies a lot of memory (a little more than 2 MB). So, a single instance is created and cached in an internal static field of this class. This instance can be used each time when you want to compress some data. The only exception may be a multithreaded application where you compress data in several threads simultaneously. If so, you should create each instance with the regular constructor.

public byte[] Compress(byte[] sourceBytes, int sourceIndex, int sourceLength,
    AcedCompressionLevel compressionLevel, int beforeGap, int afterGap)

Compresses binary data passed in the sourceBytes array starting from sourceIndex with sourceLength bytes in length. The compression ratio depends on the compressionLevel parameter. This method creates a new byte array which is enough to store the compressed data. You may also reserve beforeGap bytes before the compressed data block and afterGap bytes after the compressed data block. This may be useful if you want to supply a custom header and/or tail to the compressed data and you don't want to reallocate memory unnecessarily. The result byte array consists of (beforeGap + Compressed_Data_Length + afterGap) bytes.

public int Compress(byte[] sourceBytes, int sourceIndex, int sourceLength,
    AcedCompressionLevel compressionLevel, byte[] destinationBytes,
    int destinationIndex)

This method is similar to the previous one but it doesn't allocate the memory for the output array. The output array is passed as the destinationBytes argument. The compressed data will be stored in that array starting from the destinationIndex. The maximum length of the compressed data is (sourceLength + 4) bytes. The output array must have enough space. This method returns the number of bytes stored in the output array. If you pass null in the destinationBytes argument, the method performs "dummy" compression. In such a way, you can find out the length of the output data. However, the "dummy" compression takes almost the same time as the regular compression.

public static void Release()

Call this method to release an internal static reference to the cached instance of the AcedDeflator class. After that, you may call GC.Collect() to free memory which was occupied by that cached instance.

AcedInflator Class

public class AcedInflator

Provides methods for decompressing binary data which was compressed using the AcedDeflator class. You should synchronize calls to the methods of this class in a multithreaded application.

public AcedInflator()

Creates a new instance of the AcedInflator class. It is not recommended to call this constructor directly. When possible, to avoid unnecessary memory reallocation, please use a single cached instance of the AcedInflator class which is returned by the Instance static property.

public static AcedInflator Instance { get; }

Returns a cached instance of the AcedInflator class. Such an instance is created when you access this property for the first time. Then, the instance is cached in an internal static field of this class. You can use it to decompress any data in a single-threaded application. In a multithreaded application, please create an AcedInflator for each the thread with the regular constructor.

public static int GetDecompressedLength(byte[] sourceBytes, int sourceIndex)

Returns the minimum length, in bytes, of the binary array where you can store the decompressed data for the specified compressed binary array (sourceBytes). The sourceIndex argument specifies the start index in the sourceBytes array from which the compressed data begins.

public byte[] Decompress(byte[] sourceBytes, int sourceIndex,
    int beforeGap, int afterGap)

Decompresses binary data passed in the sourceBytes array starting from sourceIndex. This method creates a new byte array which is enough to store the decompressed data. You may also reserve beforeGap bytes before the decompressed data block and afterGap bytes after the decompressed data block. This may be useful if you want to supply a custom header and/or tail to decompressed data and you don't want to reallocate memory unnecessarily. The result array will consist of (beforeGap + Decompressed_Data_Length + afterGap) bytes.

public int Decompress(byte[] sourceBytes, int sourceIndex,
    byte[] destinationBytes, int destinationIndex)

The same as the previous method but the memory for the output array must be allocated before calling this method. You should pass the destination array in the destinationBytes argument. The destinationIndex specifies offset in that array from which the decompressed data will be stored. The output array must have enough space to store all the data. To obtain the decompressed data size, please call the GetDecompressedLength method of this class. The Decompress method returns the number of bytes stored in the output (destinationBytes) array. If you pass null in the destinationBytes argument, this method does nothing, just returns the decompressed data length, in bytes.

public static void Release()

Call this method to release an internal static reference to the cached instance of the AcedInflator class. After that, you may call GC.Collect() to free memory which was occupied by that cached instance.

AcedUtils Class

public class AcedUtils

This class has one public static method to compute the Adler-32 checksum.

public static int Adler32(byte[] bytes, int offset, int length)

Computes the Adler-32 checksum in accordance with RFC 1950 for length bytes of the bytes array starting from the offset index.

Conclusion

Just have these simple classes in mind when you will consider adding some compression capabilities to your application.

History

23^rd May, 2006: Initial post

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here