Introduction
Persisting a cached DataSet using the WriteXml method can produce very large XML files. .NET 2.0 introduces a new System.IO.Compression namespace that provides stream classes for compressing data. Of course, compression and decompression are of little use if they significantly degrade the performance of reading or writing the XML file. The image at the beginning of this article shows the performance differences of reading and writing XML files with and without compression. A 35 MB XML file was used during the development of this article, but a smaller file is included in the sample code. The remainder of this article highlights important sections of the sample code. You will need Visual Studio 2005 Beta 2 to build the sample, or, as a minimum, the .NET 2.0 redistributable installed to run the demo.
Using the code
As a baseline, the time to read the raw XML file into the DataSet and to write it back out was measured.
Dim ds As New DataSet()
Dim ts1, ts2 As TimeSpan

' Time reading the raw XML file into the DataSet.
ts1 = New TimeSpan(Now.Ticks)
ds.ReadXml("..\input.xml")
ts2 = New TimeSpan(Now.Ticks)
Console.WriteLine("Took " & ts2.Subtract(ts1).ToString() _
    & " to read raw XML")

' Time writing the DataSet back out as raw XML.
ts1 = New TimeSpan(Now.Ticks)
ds.WriteXml("test.xml")
ts2 = New TimeSpan(Now.Ticks)
Console.WriteLine("Took " & ts2.Subtract(ts1).ToString() _
    & " to write raw XML")
DeflateStream Class
According to Microsoft's documentation:
This class represents the Deflate algorithm, an industry-standard algorithm for lossless file compression and decompression. It uses a combination of the LZ77 algorithm and Huffman coding. Data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage. The format can be implemented readily in a manner not covered by patents. For more information, see the RFC 1951: DEFLATE 1.3 specification.
To write the DataSet's XML to a compressed stream, the only additional steps prior to calling WriteXml are to create a file to store the data and to create the DeflateStream object, passing the file's stream and setting the compression mode to Compress.
Note that if the DataSet is small, the resulting compressed XML file might actually be larger than the original due to the overhead of the initial header data.
Looking at the image at the beginning of the article, the time to compress and write the output stream is only about 10-30% slower than writing the raw XML to a file. However, the resulting file is about 9 times smaller.
Dim outfile As FileStream
Dim DefStream As DeflateStream

outfile = New FileStream("test.xmd", FileMode.Create, FileAccess.Write)
DefStream = New DeflateStream(outfile, CompressionMode.Compress, False)
ds.WriteXml(DefStream)
Reading the compressed XML file is roughly the same. Prior to calling the DataSet's ReadXml method, the file is opened in Read mode and the DeflateStream object is created, passing the file's stream and setting the compression mode to Decompress. The time to decompress and read the input stream is only about 10-30% lower than reading the raw XML from a file.
Dim infile As FileStream
Dim DefStream As DeflateStream

infile = New FileStream("test.xmd", FileMode.Open, FileAccess.Read)
DefStream = New DeflateStream(infile, CompressionMode.Decompress, False)
ds.ReadXml(DefStream)
GZipStream Class
Again, according to Microsoft's documentation:
This class represents the gzip data format that uses an industry-standard algorithm for lossless file compression and decompression. The format includes a cyclic redundancy check value for detecting data corruption. gzip uses the same algorithm as the DeflateStream class, but can be extended to use other compression formats. The format can be implemented readily in a manner not covered by patents. The format for gzip is available from the RFC 1952: GZIP 4.3 specification.
In the prior code snippets, DeflateStream can simply be replaced with GZipStream. What's interesting to note is that the GZipStream class is slightly slower than the DeflateStream class. This is probably due to the additional overhead of being extensible.
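For example, reading a gzip-compressed file mirrors the earlier DeflateStream snippet; this is a sketch assuming the file was written with GZipStream using the .xmz extension from the demo:

```vbnet
Dim infile As FileStream
Dim ZipStream As GZipStream

' Open the gzip-compressed file and decompress it while reading.
infile = New FileStream("test.xmz", FileMode.Open, FileAccess.Read)
ZipStream = New GZipStream(infile, CompressionMode.Decompress, False)
ds.ReadXml(ZipStream)
ZipStream.Close()
infile.Close()
```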
BinaryFormatter Class (Update)
Based on feedback from the initial article, someone suggested I look at the size and timing of a DataSet stored using the BinaryFormatter class and the DataSet's RemotingFormat property. As DataSets are typically passed between tiers of a distributed application, XML can generate rather large data packets. In an effort to avoid this, Microsoft has provided a way to persist a DataSet in binary form by setting the DataSet's RemotingFormat property to SerializationFormat.Binary.
Dim formatter As New BinaryFormatter()

' Switch the DataSet's serialization format to binary before serializing.
ds.RemotingFormat = SerializationFormat.Binary
formatter.Serialize(outfile, ds)
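For completeness, reading the binary file back is the mirror image. This is a sketch; the "test.bin" file name is illustrative, and the demo may use a different name:

```vbnet
Dim formatter As New BinaryFormatter()
Dim infile As FileStream

' Deserialize the binary stream back into a DataSet.
infile = New FileStream("test.bin", FileMode.Open, FileAccess.Read)
Dim ds As DataSet = CType(formatter.Deserialize(infile), DataSet)
infile.Close()
```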
Based on the image above, it can be seen that the binary file is about 1/3 the size of the original XML file, but roughly 3 times larger than the compressed XML file. However, the time to load the binary DataSet into memory is almost 6 times faster than reading an XML file. Out of interest, I decided to compress the binary file just to see if the binary data could be compressed any further.
Dim formatter As New BinaryFormatter()

ds.RemotingFormat = SerializationFormat.Binary
DefStream = New DeflateStream(outfile, CompressionMode.Compress, False)
formatter.Serialize(DefStream, ds)
DefStream.Close()
outfile.Close()
The result was that the additional time to compress the binary data gives a marginal decrease in size and so is probably not worth using.
Points of Interest
I found out that it's very important to explicitly close both the compression stream and the output file stream after calling the WriteXml method of the DataSet.
outfile = New FileStream("test.xmz", FileMode.Create, FileAccess.Write)
ZipStream = New GZipStream(outfile, CompressionMode.Compress, False)
ds.WriteXml(ZipStream)
' Explicitly close the compression stream first, then the file stream.
ZipStream.Close()
outfile.Close()
If these streams are not explicitly closed, the file will be corrupt, and an "Unexpected End of File" exception will be thrown later when using ReadXml to read the compressed file. This wasn't mentioned in Microsoft's documentation and may be due to a bug in the beta software.
After running the demo, try renaming the resulting .xmd or .xmz files to .zip. The resulting file cannot be read by Windows as a valid Zip archive, since the Deflate and gzip formats are raw compressed streams rather than Zip archives. While some might consider this a limitation, I think this is a simple way to protect the contents of the XML data from the casual user.
It's worth mentioning that for those looking for more flexibility in compression formats, the SharpZipLib project has been around since the early days of .NET and provides Zip, GZip, Tar and BZip2 archive formats.
Conclusions
The new GZipStream and DeflateStream classes greatly reduce the size of a DataSet persisted to a file with little additional cost.
Storing a DataSet in binary format also reduces the size of the persisted file, but not as much as the compression scheme. However, the binary file is much faster to load afterwards.
I hope this article has been helpful to someone. Don't forget to vote.
History
- 05-06-2005 - Initial release.
- 14-06-2005 - Updated with serializing to binary format.