Introduction
The .NET built-in compression library has a nifty implementation of the Deflate algorithm. This provides simple compression and decompression tools via the use of Streams.
These are great if your app is talking to a website that uses compression, or if you wish to compress a single file.
If you want to compress multiple files into an archive though, you either have to do a lot of work, or utilize a third-party library.
I recently hit a situation where I needed to compress a lot of little files into one file - and I could not use a third-party library.
This left me with one option: implement my own archive file system entirely within .NET. I needed to be able to search the archive for files and extract them individually. I did not need to open the archive with another system, so I had no dependency on any existing file format, giving me free rein to develop my own.
Now I have a little spare time and I thought I would share the results.
Background
My first idea was to build a class that would contain a list of ArchiveFile objects. The ArchiveFile objects would contain details of the file, and a byte array filled with the contents of the file. I would store and compress the data by serializing the class through a compression stream.
I even implemented this solution (I left it in the library as the TinyArchive class) - I was going to use it, but realized it had a fatal flaw: the entire object and all the uncompressed data would need to be loaded into memory. That put definite limits on the maximum size of the archive.
Back I went to the drawing board. I could not use the compressed binary serialization code I had written to handle the file structure, because it could not selectively load parts of the archive file. I would need to create my own.
I wanted to be able to read just the details of the files in the archive, then use those details to selectively read specific portions of the archive. I also realized I could not put all the indexes at the front of the file, because this would make it very difficult to add more files into the archive.
I settled for starting with two bytes that contained the length of the next section of the archive. That next section would be details of the compressed file: its name and its length in the archive, followed immediately by the compressed contents of the file.
The next file would be added in an identical manner. The code would read in the first two bytes, convert that to a ushort, and use that value to specify the length of the next block of bytes read in from the archive, which are the index details. From the index, it gets the length of the compressed data block, which it uses to skip to the next index.
In this fashion, the reader can catalogue the entire archive very quickly. It builds a searchable index that specifies the start byte index number and length of every file compressed into the archive.
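The length-prefixed layout and the cataloguing pass described above can be sketched roughly as follows. This is only an illustration under assumptions: the names ArchiveSketch, Entry, Append and Catalogue are hypothetical, and the index block here is simply an int data length followed by the UTF-8 file name, which is not necessarily how the library encodes its index.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical layout, not the library's actual format:
// [ushort indexLength][index: int dataLength + UTF-8 name][compressed data]
public static class ArchiveSketch
{
    public sealed class Entry
    {
        public string Name;
        public long DataOffset;   // where the compressed bytes start
        public int DataLength;    // how many compressed bytes follow
    }

    // Appends one file to the end of the archive.
    public static void Append(string archivePath, string name, byte[] content)
    {
        byte[] compressed;
        using (var buffer = new MemoryStream())
        {
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
                deflate.Write(content, 0, content.Length);
            compressed = buffer.ToArray();
        }

        byte[] nameBytes = Encoding.UTF8.GetBytes(name);
        using (var stream = new FileStream(archivePath, FileMode.Append, FileAccess.Write))
        using (var writer = new BinaryWriter(stream))
        {
            // Two-byte prefix holding the length of the index block.
            writer.Write(checked((ushort)(sizeof(int) + nameBytes.Length)));
            writer.Write(compressed.Length);
            writer.Write(nameBytes);
            writer.Write(compressed);
        }
    }

    // Walks the archive, reading only the index blocks and skipping the data.
    public static List<Entry> Catalogue(string archivePath)
    {
        var entries = new List<Entry>();
        using (var stream = File.OpenRead(archivePath))
        using (var reader = new BinaryReader(stream))
        {
            while (stream.Position < stream.Length)
            {
                ushort indexLength = reader.ReadUInt16();
                int dataLength = reader.ReadInt32();
                byte[] nameBytes = reader.ReadBytes(indexLength - sizeof(int));
                entries.Add(new Entry
                {
                    Name = Encoding.UTF8.GetString(nameBytes),
                    DataOffset = stream.Position,
                    DataLength = dataLength
                });
                stream.Seek(dataLength, SeekOrigin.Current); // jump to the next index
            }
        }
        return entries;
    }
}
```

Because each entry carries its own index, adding a file is a pure append, and building the catalogue never touches the compressed data itself.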
Extracting a compressed file is just a matter of opening a file-stream on the archive file, seeking to the index location of the file you wish to extract, then reading in the correct number of bytes. The compressed bytes are then expanded to their original size through the DeflateStream class.
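A minimal sketch of that seek-read-inflate sequence might look like the following. ExtractSketch and Extract are hypothetical names, and the offset and length are assumed to come from a previously built index; the library's own extraction code may differ.

```csharp
using System.IO;
using System.IO.Compression;

public static class ExtractSketch
{
    // Reads `length` compressed bytes starting at `offset` and inflates them.
    public static byte[] Extract(string archivePath, long offset, int length)
    {
        byte[] compressed = new byte[length];
        using (var stream = File.OpenRead(archivePath))
        {
            stream.Seek(offset, SeekOrigin.Begin);

            // Stream.Read may return fewer bytes than requested, so loop.
            int read = 0;
            while (read < length)
            {
                int n = stream.Read(compressed, read, length - read);
                if (n == 0)
                    throw new EndOfStreamException("Archive is shorter than the index claims.");
                read += n;
            }
        }

        using (var input = new MemoryStream(compressed))
        using (var inflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Only the requested file's bytes are read and expanded, so memory use depends on the size of that one file rather than on the size of the archive.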
The next most difficult part was removing files from the archive. It wasn't required for my initial application, but I felt it was required to make the library complete.
I had a lot of trouble with this, and I'm still not entirely happy with my solution. The remove method basically creates a copy of the current archive in a temporary file location, skipping those files it was told to remove. It then deletes the original and moves the copy into its place.
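The copy-and-replace approach can be illustrated with the sketch below. It assumes a hypothetical entry layout of a ushort index length, an int data length, a UTF-8 name and then the compressed bytes; RemoveSketch and Remove are made-up names, and the library's real remove method works on its own index format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class RemoveSketch
{
    // Assumed layout per entry (hypothetical):
    // [ushort indexLength][int dataLength][UTF-8 name][dataLength bytes]
    public static void Remove(string archivePath, ICollection<string> namesToRemove)
    {
        string tempPath = Path.GetTempFileName();
        using (var source = File.OpenRead(archivePath))
        using (var reader = new BinaryReader(source))
        using (var target = File.Create(tempPath))
        using (var writer = new BinaryWriter(target))
        {
            while (source.Position < source.Length)
            {
                ushort indexLength = reader.ReadUInt16();
                int dataLength = reader.ReadInt32();
                byte[] nameBytes = reader.ReadBytes(indexLength - sizeof(int));

                if (namesToRemove.Contains(Encoding.UTF8.GetString(nameBytes)))
                {
                    // Skip the removed file's data without reading it.
                    source.Seek(dataLength, SeekOrigin.Current);
                    continue;
                }

                // Copy the entry unchanged; the data is never decompressed.
                // For simplicity this buffers one entry's compressed bytes in memory.
                writer.Write(indexLength);
                writer.Write(dataLength);
                writer.Write(nameBytes);
                writer.Write(reader.ReadBytes(dataLength));
            }
        }

        // Swap the filtered copy into place.
        File.Delete(archivePath);
        File.Move(tempPath, archivePath);
    }
}
```

The cost is proportional to the size of the whole archive, which is the main drawback of this design: removing one small file still rewrites everything that is kept.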
The Example Application
The attached Visual Studio project includes a demo Windows Forms application. It is a quick, basic implementation of the multiple file archive. You can create archives, add files to them, extract files, and remove them. You can navigate the files in the archive with a tree-view control.
Using the Code
The main class handling the archive is StreamingArchiveFile.
The code below shows how to create a new archive, and add a folder of files into it:
StreamingArchiveFile arc = new StreamingArchiveFile(@"C:\Temp\streamingArchive.arc");
foreach (string fileName in Directory.GetFiles(@"C:\Temp\Test\", "*.*"))
{
    arc.AddFile(new ArchiveFile(fileName));
}
Extracting files: This snippet shows how to enumerate the files in an archive, extract them to a temp folder, and open each one with its associated application.
StreamingArchiveFile arc = new StreamingArchiveFile(@"C:\Temp\streamingArchive.arc");
foreach (string fileName in arc.FileIndex.IndexedFileNames)
{
    Debug.Print("File: " + fileName);
    ArchiveFile file = arc.GetFile(fileName);
    // Path.GetTempPath() already ends with a separator; Path.Combine avoids doubling it.
    string tempFileName = Path.Combine(Path.GetTempPath(), file.Name);
    file.SaveAs(tempFileName);
    Process.Start(tempFileName);
}
There is also a search method that uses Regular Expressions and LINQ to find files in the archive:
public IEnumerable<ArchiveFileIndex> Search(String fileNameExpression)
{
    return (from file in _fileIndex
            where Regex.IsMatch(file.FileName, fileNameExpression)
            select file);
}
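One detail worth keeping in mind with this kind of search is that Regex.IsMatch succeeds when the pattern matches anywhere in the name, so expressions usually need anchoring. The sketch below shows the same filter applied to plain strings; SearchSketch is a hypothetical stand-in, not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SearchSketch
{
    // Same idea as the Search method above, but over plain file names.
    public static List<string> Search(IEnumerable<string> fileNames, string fileNameExpression)
    {
        return fileNames
            .Where(name => Regex.IsMatch(name, fileNameExpression))
            .ToList();
    }
}
```

For example, with the names report.txt, report.txt.bak and image.png, the expression \.txt matches the first two, while \.txt$ matches only report.txt.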
To really see how the archive can be used, check out the code in the FrmStreamingArchiveUI form.
Points of Interest
.NET provides two compression classes: GZipStream and DeflateStream. They both use the Deflate algorithm; the GZipStream class is actually built on the DeflateStream class, and adds some additional header information and CRC checks.
I have used the DeflateStream class because it is quicker and leaves a smaller footprint: it skips the extra header and the CRC check.
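The size difference is easy to check by compressing the same bytes with both classes. The helper below is a small sketch for that purpose; CompressionComparison and its method names are hypothetical and not part of the library.

```csharp
using System.IO;
using System.IO.Compression;

public static class CompressionComparison
{
    // Raw Deflate output: no header, no checksum.
    public static byte[] Deflate(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            using (var stream = new DeflateStream(buffer, CompressionMode.Compress, true))
                stream.Write(data, 0, data.Length);
            return buffer.ToArray();
        }
    }

    // GZip output: Deflate data wrapped in a gzip header and a CRC/length trailer.
    public static byte[] GZip(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            using (var stream = new GZipStream(buffer, CompressionMode.Compress, true))
                stream.Write(data, 0, data.Length);
            return buffer.ToArray();
        }
    }
}
```

Calling both methods on the same input should show the GZip result to be slightly larger. The overhead is a fixed cost per stream, so it matters most for an archive made of many small files, each compressed separately.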
The files are stored in the archive as ArchiveFile objects. This object stores properties of the file, like creation date and size, as well as a byte array filled with the actual contents of the file.
The ArchiveFile object is serialized into the archive using the class TinySerializer. This was developed to produce the smallest possible serialization of a class. It has an optional custom SerializationBinder that strips out the AssemblyName and TypeName (this means that the object you are serializing must contain only simple objects as fields or properties) and it can serialize or de-serialize through the DeflateStream class to provide quick and easy compression/decompression.
public static T DeSerializeCompressed<T>(Stream compressedInputStream, bool useCustomBinder = false)
{
    BinaryFormatter formatter = new BinaryFormatter();
    // Optionally use the custom binder that resolves the stripped type information.
    if (useCustomBinder)
        formatter.Binder = new TinySerializer(typeof(T));
    // leaveOpen = true: the caller keeps ownership of the input stream.
    using (DeflateStream decompressionStream =
        new DeflateStream(compressedInputStream, CompressionMode.Decompress, true))
    {
        object graph = formatter.Deserialize(decompressionStream);
        if (graph is T)
            return (T)graph;
        else
            throw new ArgumentException("Invalid Type!");
    }
}