Introduction
Persisting a cached DataSet using the WriteXml method can produce very large XML files. .NET 2.0 introduces a new System.IO.Compression namespace that provides stream classes for compressing data. Of course, compression and decompression are of little use if they significantly degrade the performance of reading or writing the XML file. The image at the beginning of this article shows the performance differences of reading and writing XML files with and without compression. A 35 MB XML file was used during the development of this article, but a smaller file is included in the sample code. The remainder of this article highlights important sections of the sample code. You will need Visual Studio 2005 Beta 2 to build the sample, or, as a minimum, the .NET 2.0 redistributable installed to run the demo.
Using the code
As a baseline, the time to read the raw XML file into the DataSet and to write it back out was measured.
Dim ds As New DataSet()
Dim ts1, ts2 As TimeSpan

' Time reading the raw XML file into the DataSet.
ts1 = New TimeSpan(Now.Ticks)
ds.ReadXml("..\input.xml")
ts2 = New TimeSpan(Now.Ticks)
Console.WriteLine("Took " & ts2.Subtract(ts1).ToString() _
    & " to read raw XML")

' Time writing the DataSet back out as raw XML.
ts1 = New TimeSpan(Now.Ticks)
ds.WriteXml("test.xml")
ts2 = New TimeSpan(Now.Ticks)
Console.WriteLine("Took " & ts2.Subtract(ts1).ToString() _
    & " to write raw XML")
DeflateStream Class
According to Microsoft's documentation:
This class represents the Deflate algorithm, an industry-standard algorithm for lossless file compression and decompression. It uses a combination of the LZ77 algorithm and Huffman coding. Data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage. The format can be implemented readily in a manner not covered by patents. For more information, see the RFC 1951: DEFLATE 1.3 specification.
To write the DataSet's XML to a compressed stream, the only additional steps prior to calling WriteXml are to create a file to store the data and to create the DeflateStream object, passing the file's stream and setting the compression mode to Compress.
Note that if the DataSet is small, the resulting compressed XML file might actually be larger than the original due to the overhead of the initial header data.
Looking at the image at the beginning of the article, the time to compress and write the output stream is only about 10-30% slower than writing the raw XML to a file. However, the resulting file is about 9 times smaller.
Dim outfile As FileStream
Dim DefStream As DeflateStream

outfile = New FileStream("test.xmd", FileMode.Create, FileAccess.Write)
DefStream = New DeflateStream(outfile, CompressionMode.Compress, False)
ds.WriteXml(DefStream)
Reading the compressed XML file is roughly the same. Prior to calling the DataSet's ReadXml method, the file is opened in Read mode and the DeflateStream object is created, passing the file's stream and setting the compression mode to Decompress. The time to decompress and read the input stream is only about 10-30% lower than reading the raw XML from a file.
Dim infile As FileStream
Dim DefStream As DeflateStream

infile = New FileStream("test.xmd", FileMode.Open, FileAccess.Read)
DefStream = New DeflateStream(infile, CompressionMode.Decompress, False)
ds.ReadXml(DefStream)
GZipStream Class
Again, according to Microsoft's documentation:
This class represents the gzip data format that uses an industry-standard algorithm for lossless file compression and decompression. The format includes a cyclic redundancy check value for detecting data corruption. gzip uses the same algorithm as the DeflateStream class, but can be extended to use other compression formats. The format can be implemented readily in a manner not covered by patents. The format for gzip is available from the RFC 1952: GZIP 4.3 specification.
In the prior code snippets, DeflateStream can simply be replaced with GZipStream. What's interesting to note is that the GZipStream class is slightly slower than the DeflateStream class. This is probably due to the additional overhead of being extensible.
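For example, reading a gzip-compressed file mirrors the earlier DeflateStream snippet; this is a sketch assuming the file was written with GZipStream using the .xmz extension from the demo:

```vbnet
Dim infile As FileStream
Dim ZipStream As GZipStream

' Open the gzip-compressed file and decompress it while reading.
infile = New FileStream("test.xmz", FileMode.Open, FileAccess.Read)
ZipStream = New GZipStream(infile, CompressionMode.Decompress, False)
ds.ReadXml(ZipStream)
ZipStream.Close()
infile.Close()
```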
BinaryFormatter Class (Update)
Based on feedback from the initial article, someone suggested I look at the size and timing of a DataSet stored using the BinaryFormatter class and the DataSet's RemotingFormat property. As DataSets are typically passed between tiers of a distributed application, XML can generate rather large data packets. In an effort to avoid this, Microsoft has provided a way to persist a DataSet in binary form by setting the DataSet's RemotingFormat property to SerializationFormat.Binary.
Dim formatter As New BinaryFormatter()

' Switch the DataSet's serialization format to binary before serializing.
ds.RemotingFormat = SerializationFormat.Binary
formatter.Serialize(outfile, ds)
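For completeness, reading the binary file back is the mirror image. This is a sketch; the "test.bin" file name is illustrative, and the demo may use a different name:

```vbnet
Dim formatter As New BinaryFormatter()
Dim infile As FileStream

' Deserialize the binary stream back into a DataSet.
infile = New FileStream("test.bin", FileMode.Open, FileAccess.Read)
Dim ds As DataSet = CType(formatter.Deserialize(infile), DataSet)
infile.Close()
```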
Based on the image above, it can be seen that the binary file is about 1/3 the size of the original XML file, but roughly 3 times larger than the compressed XML file. However, the time to load the binary DataSet into memory is almost 6 times faster than reading an XML file. Out of interest, I decided to compress the binary file just to see if the binary data could be compressed any further.
Dim formatter As New BinaryFormatter()

ds.RemotingFormat = SerializationFormat.Binary
DefStream = New DeflateStream(outfile, CompressionMode.Compress, False)
formatter.Serialize(DefStream, ds)
DefStream.Close()
outfile.Close()
The result was that the additional time to compress the binary data gives a marginal decrease in size and so is probably not worth using.
Points of Interest
I found out that it's very important to explicitly close both the compression stream and the output file stream after calling the WriteXml method of the DataSet.
outfile = New FileStream("test.xmz", FileMode.Create, FileAccess.Write)
ZipStream = New GZipStream(outfile, CompressionMode.Compress, False)
ds.WriteXml(ZipStream)
' Explicitly close the compression stream first, then the file stream.
ZipStream.Close()
outfile.Close()
If these streams are not explicitly closed, the file will be corrupt, and an "Unexpected End of File" exception will be thrown later when using ReadXml to read the compressed file. This wasn't mentioned in Microsoft's documentation and may be due to a bug in the beta software.
After running the demo, try renaming the resulting .xmd or .xmz files to .zip. The resulting file cannot be read by Windows as a valid Zip archive, since the Deflate and gzip formats are raw compressed streams rather than Zip archives. While some might consider this a limitation, I think this is a simple way to protect the contents of the XML data from the casual user.
It's worth mentioning that for those looking for more flexibility in compression formats, the SharpZipLib project has been around since the early days of .NET and provides Zip, GZip, Tar and BZip2 archive formats.
Conclusions
The new GZipStream and DeflateStream classes greatly reduce the size of a DataSet persisted to a file with little additional cost.
Storing a DataSet in binary format also reduces the size of the persisted file, but not as much as the compression scheme. However, the binary file is much faster to load afterwards.
I hope this article has been helpful to someone. Don't forget to vote.
History
- 05-06-2005 - Initial release.
- 14-06-2005 - Updated with serializing to binary format.