.NET Native Multiple File Compression

16 Jan 2012 · CPOL · 5 min read
Multiple-file, searchable, streaming compression library implemented natively in .NET.

Introduction

The .NET built-in compression library has a nifty implementation of the Deflate algorithm. It provides simple compression and decompression tools through streams.

These are great if your app is talking to a website that uses compression, or if you wish to compress a single file.

If you want to compress multiple files into an archive though, you either have to do a lot of work, or utilize a third-party library.

I recently hit a situation where I needed to compress a lot of little files into one file, and I could not use a third-party library.

This left me with one option: implement my own archive file system entirely within .NET. I needed to be able to search the archive for files and extract them individually. I did not need to open the archive with another system, so I had no dependency on any existing file format, giving me free rein to develop my own.

Now I have a little spare time and I thought I would share the results.

Background

My first idea was to build a class that would contain a list of ArchiveFile objects. The ArchiveFile objects would contain details of the file, and a byte array filled with the contents of the file. I would store and compress the data by serializing the class through a compression stream.

I even implemented this solution (I left it in the library as the TinyArchive class). I was going to use it, but realized it had a fatal flaw: the entire object and all the uncompressed data would need to be loaded into memory. That put definite limits on the maximum size of the archive.

Back I went to the drawing board. I could not use the compressed binary serialization code I had written to handle the file structure, because it could not selectively load parts of the archive file. I would need to create my own.

I wanted to be able to read just the details of the files in the archive, then use those details to selectively read specific portions of the archive. I also realized I could not put all the indexes at the front of the file, because this would make it very difficult to add more files into the archive.

I settled on a layout in which each entry starts with two bytes that contain the length of the next section of the archive. That next section holds the index details of the compressed file: its name and its length in the archive. The compressed contents of the file follow immediately after it.

The next file is added in an identical manner. To read the archive, the code reads in the first two bytes, converts them to a ushort, and uses that value as the length of the next block of bytes to read, which are the index details. From the index details, it gets the length of the compressed data block, which it uses to skip to the next index.

In this fashion, the reader can catalogue the entire archive very quickly. It builds a searchable index that specifies the start byte index number and length of every file compressed into the archive.
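The layout described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the names SketchIndexEntry and ArchiveLayoutSketch are hypothetical, and the header here is written with BinaryWriter (a name string plus an Int32 data length), which is an assumption made to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical record: where one file's compressed bytes live in the archive.
public sealed class SketchIndexEntry
{
    public string Name;
    public long DataOffset;   // position of the first compressed byte
    public int DataLength;    // number of compressed bytes
}

public static class ArchiveLayoutSketch
{
    // Appends one entry: [ushort header length][header][compressed data].
    public static void Append(Stream archive, string name, byte[] content)
    {
        byte[] compressed;
        using (var buffer = new MemoryStream())
        {
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
                deflate.Write(content, 0, content.Length);
            compressed = buffer.ToArray();
        }

        // Assumed header format: the name, then the compressed length.
        byte[] header;
        using (var buffer = new MemoryStream())
        {
            using (var w = new BinaryWriter(buffer, Encoding.UTF8, true))
            {
                w.Write(name);
                w.Write(compressed.Length);
            }
            header = buffer.ToArray();
        }

        archive.Seek(0, SeekOrigin.End);
        using (var writer = new BinaryWriter(archive, Encoding.UTF8, true))
        {
            // Throws OverflowException if the header does not fit in two bytes.
            writer.Write(checked((ushort)header.Length));
            writer.Write(header);
            writer.Write(compressed);
        }
    }

    // Walks the archive, reading only the headers and seeking past the data.
    public static List<SketchIndexEntry> ReadIndex(Stream archive)
    {
        var index = new List<SketchIndexEntry>();
        archive.Seek(0, SeekOrigin.Begin);
        using (var reader = new BinaryReader(archive, Encoding.UTF8, true))
        {
            while (archive.Position < archive.Length)
            {
                ushort headerLength = reader.ReadUInt16();
                byte[] header = reader.ReadBytes(headerLength);
                using (var h = new BinaryReader(new MemoryStream(header), Encoding.UTF8))
                {
                    var entry = new SketchIndexEntry
                    {
                        Name = h.ReadString(),
                        DataLength = h.ReadInt32(),
                        DataOffset = archive.Position
                    };
                    index.Add(entry);
                    // Skip the compressed block to land on the next header.
                    archive.Seek(entry.DataLength, SeekOrigin.Current);
                }
            }
        }
        return index;
    }
}
```

Because ReadIndex never touches the compressed blocks, the cost of cataloguing depends on the number of entries rather than on the size of the stored files.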

Extracting a compressed file is just a matter of opening a file-stream on the archive file, seeking to the index location of the file you wish to extract, then reading in the correct number of bytes. The compressed bytes are then expanded to their original size through the DeflateStream class.
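As a rough sketch of that extraction step, assuming the start offset and compressed length have already been taken from the index: the class ExtractSketch and its method names are hypothetical, and the Compress helper exists only so the example can be run on its own.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ExtractSketch
{
    // Helper for the demo only: produces a raw Deflate block.
    public static byte[] Compress(byte[] content)
    {
        using (var buffer = new MemoryStream())
        {
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
                deflate.Write(content, 0, content.Length);
            return buffer.ToArray();
        }
    }

    // Seeks to the data block, reads exactly dataLength bytes, then inflates them.
    public static byte[] Extract(Stream archive, long dataOffset, int dataLength)
    {
        archive.Seek(dataOffset, SeekOrigin.Begin);

        byte[] compressed = new byte[dataLength];
        int read = 0;
        while (read < dataLength)
        {
            // Stream.Read may return fewer bytes than requested, so loop.
            int n = archive.Read(compressed, read, dataLength - read);
            if (n == 0)
                throw new EndOfStreamException("Archive ended inside a data block.");
            read += n;
        }

        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Only the requested block is read from the archive, so memory use scales with the size of the one file being extracted rather than with the whole archive.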

The next most difficult part was removing files from the archive. It wasn't required for my initial application, but I felt it was required to make the library complete.

I had a lot of trouble with this, and I'm still not entirely happy with my solution. The remove method basically creates a copy of the current archive in a temporary file location, skipping those files it was told to remove. It then deletes the original and moves the copy into its place.
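The copy-and-swap approach might look something like the sketch below. It is not the library's remove method: RemoveSketch is a hypothetical name, the ".tmp" path next to the archive is an assumed temporary location, and the entry format (a two-byte header length, then a header holding the name and an Int32 data length, then the data) is an assumption made so the example is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class RemoveSketch
{
    // Rewrites the archive without the named entries, then swaps the copy in.
    public static void Remove(string archivePath, ICollection<string> namesToRemove)
    {
        string tempPath = archivePath + ".tmp"; // assumed temp location

        using (var source = File.OpenRead(archivePath))
        using (var target = File.Create(tempPath))
        using (var reader = new BinaryReader(source, Encoding.UTF8))
        using (var writer = new BinaryWriter(target, Encoding.UTF8))
        {
            while (source.Position < source.Length)
            {
                ushort headerLength = reader.ReadUInt16();
                byte[] header = reader.ReadBytes(headerLength);

                // Assumed header format: the name, then the compressed length.
                string name;
                int dataLength;
                using (var h = new BinaryReader(new MemoryStream(header), Encoding.UTF8))
                {
                    name = h.ReadString();
                    dataLength = h.ReadInt32();
                }

                if (namesToRemove.Contains(name))
                {
                    // Skip the data block of an entry that is being removed.
                    source.Seek(dataLength, SeekOrigin.Current);
                    continue;
                }

                // Copy the entry unchanged; the data stays compressed throughout.
                writer.Write(headerLength);
                writer.Write(header);
                writer.Write(reader.ReadBytes(dataLength));
            }
        }

        // Delete the original and move the copy into its place.
        File.Delete(archivePath);
        File.Move(tempPath, archivePath);
    }
}
```

The sketch copies one data block at a time through memory, and the cost of a removal grows with the size of the whole archive, which is the main drawback of this approach.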

The Example Application

[Screenshot of the demo application: NativeFileArchive2/ScreenShot.png]

The attached Visual Studio project includes a demo archive forms application. It is a quick, basic implementation of the multiple file archive. You can create archives, add files to them, extract files, and remove them. You can navigate the files in the archive with a tree-view control.

Using the Code

The main class handling the archive is StreamingArchiveFile.

The code below shows how to create a new archive and add a folder of files to it:

C#
// create a new or access an existing streaming archive file:
StreamingArchiveFile arc = new StreamingArchiveFile(@"C:\Temp\streamingArchive.arc");

// now iterate through the files in a specific directory and add them to the archive:
foreach (string fileName in Directory.GetFiles(@"C:\Temp\Test\", "*.*"))
{
    // add the file
    arc.AddFile(new ArchiveFile(fileName));
}

Extracting files: This snippet shows how to enumerate the files in an archive, extract them to a temp folder, and shell open them.

C#
// open the existing archive file:
StreamingArchiveFile arc = new StreamingArchiveFile(@"C:\Temp\streamingArchive.arc");

// iterate the files in the archive:
foreach (string fileName in arc.FileIndex.IndexedFileNames)
{
    // write the name of the file
    Debug.Print("File: " + fileName);

    // extract the file:
    ArchiveFile file = arc.GetFile(fileName);

    // save it to disk:
    string tempFileName = Path.Combine(Path.GetTempPath(), file.Name);
    file.SaveAs(tempFileName);

    // open the file:
    Process.Start(tempFileName);
}

There is also a search method that uses Regular Expressions and LINQ to find files in the archive:

C#
/// <summary>
/// search the index for a file matching the expression.
/// </summary>
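/// <param name="fileNameExpression">regular expression to test each file name against</param>
/// <returns>the index entries whose file names match the expression</returns>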
public IEnumerable<ArchiveFileIndex> Search(String fileNameExpression)
{
    return (from file in _fileIndex
            where Regex.IsMatch(file.FileName, fileNameExpression)
            select file);
}
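One detail worth illustrating is that Regex.IsMatch is unanchored, so an expression matches anywhere in the file name unless it is anchored explicitly. The sketch below applies the same filter shape to a plain list of names; SearchSketch and the sample file names are hypothetical and only there to make the example runnable without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SearchSketch
{
    // Same filter shape as the Search method, applied to plain file names.
    public static List<string> Match(IEnumerable<string> fileNames, string expression)
    {
        return (from name in fileNames
                where Regex.IsMatch(name, expression)
                select name).ToList();
    }

    public static void Demo()
    {
        var names = new[] { "report.txt", "txt-notes.doc", "image.png" };

        // Unanchored: matches any name containing "txt".
        foreach (string name in Match(names, "txt"))
            Console.WriteLine("contains txt: " + name);

        // Anchored: matches only names ending in ".txt".
        foreach (string name in Match(names, @"\.txt$"))
            Console.WriteLine("ends with .txt: " + name);
    }
}
```

With these sample names, the first expression selects both report.txt and txt-notes.doc, while the anchored one selects only report.txt.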

To really see how the archive can be used, check out the code in the FrmStreamingArchiveUI form.

Points of Interest

.NET provides two compression classes: GZipStream and DeflateStream. They both use the Deflate algorithm. The GZipStream class is actually built on DeflateStream, and adds some additional header information and CRC checks.

I have used the DeflateStream class because it skips that header and CRC work, which makes it a little quicker and its output a little smaller.
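The size difference is easy to check with a small experiment. The sketch below compresses the same bytes with both classes; FootprintSketch and its Compress method are hypothetical names. The gzip format (RFC 1952) wraps the Deflate data in a header of at least 10 bytes and an 8-byte trailer holding the CRC-32 and the original length, so the GZipStream output should come out slightly larger.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FootprintSketch
{
    // Compresses data with either GZipStream or DeflateStream.
    public static byte[] Compress(byte[] data, bool useGZip)
    {
        using (var buffer = new MemoryStream())
        {
            Stream compressor = useGZip
                ? (Stream)new GZipStream(buffer, CompressionMode.Compress, true)
                : new DeflateStream(buffer, CompressionMode.Compress, true);

            // Disposing the compressor flushes the final block into the buffer.
            using (compressor)
                compressor.Write(data, 0, data.Length);

            return buffer.ToArray();
        }
    }

    public static void Demo()
    {
        byte[] data = Encoding.UTF8.GetBytes(new string('a', 1000));
        Console.WriteLine("Deflate bytes: " + Compress(data, false).Length);
        Console.WriteLine("GZip bytes:    " + Compress(data, true).Length);
    }
}
```

For an archive of many small files the wrapper is paid once per file, so the saving adds up even though it is small for any single entry.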

The files are stored in the archive as ArchiveFile objects. This object stores properties of the file, like creation date and size, as well as a byte array filled with the actual contents of the file.

The ArchiveFile object is serialized into the archive using the TinySerializer class. It was developed to produce the smallest possible serialization of a class. It has an optional custom SerializationBinder that strips out the AssemblyName and TypeName, which means that the object you are serializing must contain only simple objects as fields or properties. It can also serialize or deserialize through the DeflateStream class to provide quick and easy compression and decompression.

C#
/// <summary>
/// deserialize an object from compressed data.
/// </summary>
/// <typeparam name="T">the type of object to deserialize</typeparam>
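/// <param name="compressedInputStream">stream of compressed
///         data containing an object to deserialize</param>
/// <param name="useCustomBinder">true to assign the custom binder
///         to the formatter before deserializing</param>
/// <returns>the deserialized object, cast to T</returns>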
public static T DeSerializeCompressed<T>(Stream compressedInputStream, bool useCustomBinder = false)
{
    // construct the binary formatter and assign the custom binder:
    BinaryFormatter formatter = new BinaryFormatter();
    if (useCustomBinder)
        formatter.Binder = new TinySerializer(typeof(T));
            
    // read the stream through a Deflate decompression stream.
    using (DeflateStream decompressionStream = 
           new DeflateStream(compressedInputStream, CompressionMode.Decompress, true))
    {
        // deserialize to an object:
        object graph = formatter.Deserialize(decompressionStream);

        // check the type is correct and return.
        if (graph is T)
            return (T)graph;
        else
            throw new ArgumentException("Invalid Type!");
    }
}

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).