Introduction
This article explains a technique for accessing files stored inside a ZIP file without downloading the whole archive. It can be used to inspect a large ZIP archive or to download only specific files from it. It also works with OpenOffice documents, which are ZIP archives.
The standard approach to serving parts of an archive through a Web server is a server-side script that sends the individual parts uncompressed. The solution presented here is entirely client-side and uses the Range header of the HTTP protocol to retrieve specific parts of a ZIP archive.
The analysis and decompression of the ZIP archives is performed using the Open Source SharpZipLib.
Architecture
We start with a description of the ZIP archive format, which is needed to explain how this technique works. A ZIP archive is organized as shown in the figure: first come the compressed files, each preceded by a local header, and finally a central directory that stores all the file details again. The central directory speeds up reading the file listing, and it is stored at the end so that the ZIP archive can be created on a stream. The central directory has a header, then a record for each file, and finally an ending marker.
A record in the central directory contains the offset of the file's Local Header, which can be used to extract that file. Unfortunately, the Local Header still has to be read, because the central directory record carries no information about the size of the Local Header.
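To make the record layout concrete, here is a minimal sketch that reads one central directory record from a byte buffer. The field offsets follow the published ZIP format specification; the class and method names are hypothetical and are not part of SharpZipLib or of the article's code, and the file name is assumed to be plain ASCII.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class CentralDirectoryRecord
{
    const uint Signature = 0x02014b50; // bytes "PK\x01\x02" on disk
    const int FixedSize = 46;          // bytes before the variable-length fields

    // ZIP fields are little-endian, so assemble them byte by byte.
    static int U16(byte[] b, int i) => b[i] | b[i + 1] << 8;
    static uint U32(byte[] b, int i) =>
        (uint)(b[i] | b[i + 1] << 8 | b[i + 2] << 16 | b[i + 3] << 24);

    // Reads the record starting at 'pos' and returns the position of the next record.
    public static int Read(byte[] buf, int pos, out string name,
                           out uint compressedSize, out uint localHeaderOffset)
    {
        if (U32(buf, pos) != Signature)
            throw new FormatException("Not a central directory record.");

        compressedSize = U32(buf, pos + 20);
        int nameLen = U16(buf, pos + 28);
        int extraLen = U16(buf, pos + 30);
        int commentLen = U16(buf, pos + 32);
        localHeaderOffset = U32(buf, pos + 42);

        // Assumption: ASCII file names (real archives may use other encodings).
        name = Encoding.ASCII.GetString(buf, pos + FixedSize, nameLen);
        return pos + FixedSize + nameLen + extraLen + commentLen;
    }
}
```

Calling Read repeatedly, each time starting at the returned position, walks through all the records of a downloaded central directory.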
Additional details on the ZIP file format can be obtained here.
The idea is to use the Range option of the HTTP protocol to extract the file listing from the Central Directory. When the client application requests a file in the archive, only the compressed content is transferred, and decompression is performed on the client side. This technique works best when the connection uses the Keep-Alive option of HTTP, because several requests are sent to the same server.
As stated in RFC 2616 (HTTP/1.1), the Range request header specifies the first byte position and, optionally, the last byte position of the requested data. A range can also be given as a suffix length (for example, bytes=-500 for the last 500 bytes), in which case the offset is counted from the end of the data stream.
The Web server responds to a Range request with a "206 - Partial Content" response that specifies the ranges sent back to the client. The Content-Range response header reports the returned range. The HTTP protocol can also handle multiple ranges in a single HTTP request.
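As an illustration of client-side decompression, the sketch below inflates a raw deflate payload such as the compressed content of a ZIP entry stored with method 8. The article's implementation relies on SharpZipLib; this example swaps in System.IO.Compression.DeflateStream from the .NET base class library, and the RawDeflate class name is hypothetical.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper; uses the BCL DeflateStream rather than SharpZipLib.
public static class RawDeflate
{
    // Inflates a raw deflate payload (ZIP compression method 8).
    public static byte[] Inflate(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflater.CopyTo(output);
            return output.ToArray();
        }
    }

    // Only used to produce sample data for a round trip.
    public static byte[] Deflate(byte[] plain)
    {
        using (var output = new MemoryStream())
        {
            // Disposing the DeflateStream flushes the final compressed block.
            using (var deflater = new DeflateStream(output, CompressionMode.Compress))
                deflater.Write(plain, 0, plain.Length);
            return output.ToArray();
        }
    }
}
```

Entries stored without compression (method 0) need no inflation at all; their bytes can be passed through unchanged.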
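The Content-Range value of a 206 response tells the client where the returned bytes sit in the whole resource, which is handy after a request relative to the end of the file. A minimal sketch of reading it, assuming the System.Net.Http.Headers types are available; the ContentRangeReader name is hypothetical.

```csharp
using System.Net.Http.Headers;

// Hypothetical helper for illustration only.
public static class ContentRangeReader
{
    // Extracts the absolute offset of the first returned byte and the total
    // length of the resource; -1 is used when the server did not report a value.
    public static void Read(string contentRange, out long firstByte, out long totalLength)
    {
        ContentRangeHeaderValue value = ContentRangeHeaderValue.Parse(contentRange);
        firstByte = value.From ?? -1;
        totalLength = value.Length ?? -1;
    }
}
```

For a value such as "bytes 7720-7999/8000", the first byte is at offset 7720 of an 8000-byte resource, so offsets found inside the returned block can be converted to absolute archive offsets.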
Implementation
The RemoteZip class contains the whole implementation of this technique. The most interesting part is the search for the End of Central Directory record of the ZIP archive, because it sits at the end of the file but has a variable length.
To access a web resource, we use the HttpWebRequest class of the .NET Framework; its AddRange method hides all the details of using the Range option. When the range has a negative value, the offset is relative to the end of the resource.
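A minimal sketch of the two kinds of range request, built with HttpWebRequest.AddRange. The requests are only constructed, not sent; the RangeRequests class and its method names are hypothetical, and recent .NET versions mark WebRequest.Create as obsolete in favor of HttpClient.

```csharp
using System.Net;

// Hypothetical helper for illustration only.
public static class RangeRequests
{
    // Builds (but does not send) a request for the last 'count' bytes.
    public static HttpWebRequest ForTail(string url, int count)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AddRange(-count);   // negative value: offset relative to the end
        return request;
    }

    // Builds (but does not send) a request for the inclusive interval [from, to].
    public static HttpWebRequest ForInterval(string url, int from, int to)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AddRange(from, to);
        return request;
    }
}
```

Calling GetResponse on either request would then perform the transfer; a server that honors the range answers with status 206.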
These are the HTTP requests required to extract a file:
- Request the last 280 bytes of the archive to find the End of Central Directory. From the End of Central Directory, obtain the offset and length of the Central Directory;
- Load the Central Directory with the information for all the entries. Find the requested file and obtain the offset of its Local Header in the archive;
- Request a block of data starting at the Local Header and sized as the maximum size of the Local Header plus the compressed size; skip the Local Header part of the returned data. Then, serve the decompressed data as if it came from the Web server.
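The first step can be sketched as a backward scan over the downloaded tail of the archive. The field offsets come from the ZIP format specification; the class and method names are hypothetical, and the sketch assumes that the buffer ends exactly at the end of the archive.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class EndOfCentralDirectory
{
    const uint Signature = 0x06054b50; // bytes "PK\x05\x06" on disk
    const int MinSize = 22;            // fixed part; an archive comment may follow

    static int U16(byte[] b, int i) => b[i] | b[i + 1] << 8;
    static uint U32(byte[] b, int i) =>
        (uint)(b[i] | b[i + 1] << 8 | b[i + 2] << 16 | b[i + 3] << 24);

    // Scans 'tail' (the last bytes of the archive) backwards for the record.
    public static bool TryLocate(byte[] tail, out uint dirOffset,
                                 out uint dirSize, out int entryCount)
    {
        for (int i = tail.Length - MinSize; i >= 0; i--)
        {
            if (U32(tail, i) != Signature) continue;

            // The comment length must account for every byte after the record,
            // which rejects a signature that merely appears inside the comment.
            if (U16(tail, i + 20) != tail.Length - i - MinSize) continue;

            entryCount = U16(tail, i + 10);
            dirSize = U32(tail, i + 12);
            dirOffset = U32(tail, i + 16);
            return true;
        }
        dirOffset = 0; dirSize = 0; entryCount = 0;
        return false;
    }
}
```

If the scan fails, the archive comment is probably longer than the requested tail, and a larger tail would have to be requested.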
Because the Local Header has a variable size, we request 16+64K*2+compressedSize bytes. The 64K*2 term is the maximum size of the variable part of the Local Header (file name and extra field), although in practice it is usually very small. An alternative would be to download only the fixed part of the Local Header and then issue a second request for the compressed data, but this should be avoided because of the overhead of the additional HTTP request.
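Skipping the Local Header in the returned block amounts to reading its two length fields. A minimal sketch, assuming the 30-byte fixed layout described in the ZIP format specification; the LocalHeaderSkipper name is hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class LocalHeaderSkipper
{
    const int FixedSize = 30; // fixed part of a Local Header per the ZIP specification

    // Returns the index in 'block' at which the compressed data begins,
    // where 'block' starts at the Local Header of the entry.
    public static int DataStart(byte[] block)
    {
        if (block.Length < FixedSize ||
            block[0] != 0x50 || block[1] != 0x4b || block[2] != 0x03 || block[3] != 0x04)
            throw new FormatException("Not a local file header.");

        int nameLen = block[26] | block[27] << 8;   // file name length
        int extraLen = block[28] | block[29] << 8;  // extra field length
        return FixedSize + nameLen + extraLen;
    }
}
```

The compressed bytes are then the compressedSize bytes that follow the returned index, and anything after them in the oversized block can be discarded.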
Extension to SharpZipLib
This technique is presented as an extension to SharpZipLib. The RemoteZipFile class should be used instead of the ZipFile class: it provides an enumeration of the entries in the archive, and a Stream can be obtained for each entry to read its data.
Usage
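RemoteZipFile zip = new RemoteZipFile();
zip.Load(url);
// Enumerate the entries listed in the Central Directory
foreach (ZipEntry entry in zip)
{
    // e.g. inspect entry.Name and entry.Size
}
// Pick an entry and read its decompressed content
ZipEntry ze = ...;
Stream uncompressedStream = zip.GetInputStream(ze);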
Subclasses of the Stream class
An interesting implementation detail is the definition of two Stream classes that wrap other streams:
- NoCloseSubStream is a stream attached to another Stream that detaches itself on Close. It can be used when the underlying Stream should not be closed once the user has finished working with it;
- PartialInputStream is a stream that presents a part of a stream as a whole stream. It is used when decompressing data coming from the Web server, where the size of the data differs from the size reported by the original stream.
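The idea behind such a wrapper can be sketched as follows. This is not the article's PartialInputStream: BoundedReadStream is a hypothetical, read-only stand-in that exposes at most a given number of bytes of an inner stream and leaves that inner stream open when it is closed.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for illustration; not the article's class.
public sealed class BoundedReadStream : Stream
{
    private readonly Stream inner;
    private long remaining;

    public BoundedReadStream(Stream inner, long length)
    {
        this.inner = inner;
        this.remaining = length;
    }

    // Reads from the inner stream but never past the configured length.
    public override int Read(byte[] buffer, int offset, int count)
    {
        if (remaining <= 0) return 0; // behaves like end of stream
        int n = inner.Read(buffer, offset, (int)Math.Min(count, remaining));
        remaining -= n;
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

Because Dispose is not overridden, closing the wrapper does not close the inner stream, so the same HTTP response stream could in principle be handed to a decompressor without being closed by it.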
Example Application
In the demo project, there is the RemoteZip application that can be used to test this technique. It accesses a remote zip file and shows all the contained files with the associated information. Each file can be saved on disk, or previewed as text or as an image (as long as .NET recognizes the image data).
This snapshot shows the preview of a text file:
In this case, an image of an OpenOffice Text Document (.sxw) is previewed:
Conclusion and Future Work
This article presents one usage of the HTTP Range option for accessing parts of a big resource like ZIP archives. The technique can be used to recover specific files in archives, like metadata from OpenOffice files or JARs, and it is efficient when the whole archive is not required.
It could be very interesting to explore the possibilities of multi-range requests in the HTTP protocol. This is an example of a multi-range response from the HTTP specification:
HTTP/1.1 206 Partial Content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-type: multipart/byteranges; boundary=THIS_STRING_SEPARATES

--THIS_STRING_SEPARATES
Content-type: application/pdf
Content-range: bytes 500-999/8000

...the first range...
--THIS_STRING_SEPARATES
Content-type: application/pdf
Content-range: bytes 7000-7999/8000

...the second range
--THIS_STRING_SEPARATES--
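On the request side, a multi-range request simply lists several ranges in one Range header. The sketch below builds such a request with the System.Net.Http types instead of the HttpWebRequest class used by the article; the MultiRangeRequest name is hypothetical, and the request is only constructed, not sent.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical helper for illustration only.
public static class MultiRangeRequest
{
    // Builds (but does not send) a GET request asking for several inclusive
    // byte ranges at once; each element of 'ranges' is a { first, last } pair.
    public static HttpRequestMessage Build(string url, params long[][] ranges)
    {
        var header = new RangeHeaderValue(); // the unit defaults to "bytes"
        foreach (long[] r in ranges)
            header.Ranges.Add(new RangeItemHeaderValue(r[0], r[1]));

        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Range = header;
        return request;
    }
}
```

For the response shown above, the call would be MultiRangeRequest.Build(url, new long[] { 500, 999 }, new long[] { 7000, 7999 }). A server is free to ignore the ranges and return the whole resource, so the client still has to check the status code and the content type of the response.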
Currently, the technique is limited to HTTP, but it could also be applied to local and FTP files. The case of FTP files is tricky because FTP uses the REST command to start the transfer from a specific position in the file, and the download then has to be interrupted once the required number of bytes has been transferred.
References