Introduction
In my recently published article about the Zeta Uploader application (in short, a website to upload files and send e-mail messages with links to the uploaded files), a discussion came up (thanks to Phil.Benson) about the need to administer the uploaded files in order to avoid copyright infringements.
This article introduces a library that I wrote last evening and this morning (so it is "really" fresh) as a first step in that direction.
What the library does
Since I wanted to avoid (at least for now) forcing users of the Zeta Uploader to register and log in to use the service, I decided to try a different approach:
After a file is uploaded, it is checked to see whether it is "prohibited", that is, whether it is a kind of file that must not be uploaded with Zeta Uploader. Currently I treat movies (AVI, MOV, etc.) and music (MP3, WAV, etc.) as prohibited.
How the library works
The detection algorithm uses the following mechanisms to test a file for being prohibited or allowed:
- File extension: Look at the file extension. If it matches an extension on the prohibited list, the file is considered "prohibited".
- File content: Look inside the first few bytes of the file for known binary patterns ("magic bytes") and match them against a list of prohibited patterns.
- Archive extraction: If the file is detected to be an archive, it is temporarily extracted and the extracted files are scanned as well (recursively, if they contain further archives).
The next section briefly discusses these different mechanisms.
File extension checking
This check looks only at the extension of the file name. Since an extension is easy to fake, it serves only as a quick first check. If the extension matches the prohibited list, the file is considered prohibited and detection for that file is finished.
If not, a content analysis is done, as described next.
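The quick extension check described above could be sketched roughly as follows. This is a minimal illustration and not the library's actual code: the class name ExtensionCheckSketch, the method name HasProhibitedExtension, and the short extension list (taken from the examples mentioned earlier) are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of the library's source.
internal static class ExtensionCheckSketch
{
    // Illustrative list only; the library's real list is likely longer.
    private static readonly HashSet<string> ProhibitedExtensions =
        new HashSet<string>( StringComparer.OrdinalIgnoreCase )
        {
            ".avi", ".mov", ".mp3", ".wav"
        };

    public static bool HasProhibitedExtension( FileInfo filePath )
    {
        // FileInfo.Extension includes the leading dot, e.g. ".mp3".
        // The comparison ignores case, so ".MP3" matches as well.
        return ProhibitedExtensions.Contains( filePath.Extension );
    }
}
```

A case-insensitive set keeps the lookup cheap, which fits the role of this step as a fast pre-filter before the more expensive content analysis.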
Content analysis
The main work of the library is to apply simple "pattern matching" to the content of a file. Through an extensible ISignatureChecker interface, more complex tests can be added later. I've included a simple check for MP3s that does a little bit more than just pattern matching (class Mp3SignatureChecker).
The ISignatureChecker interface is defined as follows:
internal interface ISignatureChecker
{
    bool MatchesSignature(
        byte[] buffer );

    int FirstNumberOfBytesToRead
    {
        get;
    }

    int MinimumRequiredBufferLength
    {
        get;
    }
}
Through this interface, the check engine communicates with the individual signature checkers. See the source files for details and examples.
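To make the idea more concrete, here is a minimal sketch of what a simple checker behind this interface might look like. It is an illustration under assumptions, not code from the library: PrefixSignatureChecker and HeadReader are hypothetical names, the interface is repeated only so that the sketch compiles on its own, and I assume that FirstNumberOfBytesToRead tells the engine how many leading bytes to read and that MinimumRequiredBufferLength is the shortest buffer a checker can work with.

```csharp
using System;
using System.IO;

// Copy of the interface shown above, repeated so the sketch is self-contained.
internal interface ISignatureChecker
{
    bool MatchesSignature( byte[] buffer );
    int FirstNumberOfBytesToRead { get; }
    int MinimumRequiredBufferLength { get; }
}

// Hypothetical checker: matches a fixed byte pattern at offset 0.
internal sealed class PrefixSignatureChecker : ISignatureChecker
{
    private readonly byte[] _pattern;

    public PrefixSignatureChecker( byte[] pattern )
    {
        if ( pattern == null || pattern.Length == 0 )
            throw new ArgumentException( "Pattern must not be empty.", "pattern" );
        _pattern = (byte[])pattern.Clone();
    }

    // Assumed meaning: number of leading bytes the engine should read.
    public int FirstNumberOfBytesToRead { get { return _pattern.Length; } }

    // Assumed meaning: shortest buffer this checker can evaluate.
    public int MinimumRequiredBufferLength { get { return _pattern.Length; } }

    public bool MatchesSignature( byte[] buffer )
    {
        if ( buffer == null || buffer.Length < MinimumRequiredBufferLength )
            return false;

        for ( int i = 0; i < _pattern.Length; i++ )
        {
            if ( buffer[i] != _pattern[i] )
                return false;
        }
        return true;
    }
}

// Hypothetical engine-side helper: reads the first bytes of a stream.
internal static class HeadReader
{
    // Returns up to 'count' bytes; fewer if the stream is shorter.
    public static byte[] ReadHead( Stream stream, int count )
    {
        byte[] buffer = new byte[count];
        int total = 0;
        while ( total < count )
        {
            int read = stream.Read( buffer, total, count - total );
            if ( read == 0 )
                break; // end of stream
            total += read;
        }
        Array.Resize( ref buffer, total );
        return buffer;
    }
}
```

For example, both WAV and AVI files start with the ASCII bytes "RIFF" (0x52 0x49 0x46 0x46), so a checker constructed with that pattern would flag both. A plain prefix match like this is easy to fool and can produce false positives, which is why the interface leaves room for more thorough checkers.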
Archive extraction
Since many uploaded files are compressed archives, it is important to extract and scan their contents too.
Again, I've built an extensible mini-framework based on the IArchiveExtractor interface to allow for adding more archive extractors in the future.
The interface is defined as follows:
internal interface IArchiveExtractor
{
    void Extract(
        FileInfo filePath,
        DirectoryInfo folderPathToExtractInto );
}
Currently I am using SharpZipLib to provide extractors for ZIP, gzip and bzip2.
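The following sketch shows how an extractor and the recursive scan might fit together. It is a simplified illustration, not the library's code. To keep the example free of external dependencies, it uses the framework's own System.IO.Compression.ZipFile class (available from .NET Framework 4.5 onward, where it requires a reference to System.IO.Compression.FileSystem) instead of SharpZipLib. The names BclZipExtractor and RecursiveScanSketch, the isProhibited callback, the depth limit, and the detection of archives by their ".zip" extension are all assumptions made for this example.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Copy of the interface shown above, repeated so the sketch is self-contained.
internal interface IArchiveExtractor
{
    void Extract( FileInfo filePath, DirectoryInfo folderPathToExtractInto );
}

// Stand-in extractor based on System.IO.Compression rather than SharpZipLib.
internal sealed class BclZipExtractor : IArchiveExtractor
{
    public void Extract( FileInfo filePath, DirectoryInfo folderPathToExtractInto )
    {
        ZipFile.ExtractToDirectory(
            filePath.FullName,
            folderPathToExtractInto.FullName );
    }
}

internal static class RecursiveScanSketch
{
    // isProhibited is a hypothetical callback standing in for the
    // extension and content checks. maxDepth is an added safety limit
    // against deeply nested archives.
    public static bool ContainsProhibited(
        FileInfo file,
        IArchiveExtractor zipExtractor,
        Func<FileInfo, bool> isProhibited,
        int maxDepth )
    {
        if ( isProhibited( file ) )
            return true;

        // Simplification: archives are recognized by extension only.
        bool isZip = string.Equals(
            file.Extension, ".zip", StringComparison.OrdinalIgnoreCase );
        if ( !isZip || maxDepth <= 0 )
            return false;

        DirectoryInfo temp = new DirectoryInfo(
            Path.Combine( Path.GetTempPath(), Guid.NewGuid().ToString( "N" ) ) );
        temp.Create();
        try
        {
            zipExtractor.Extract( file, temp );
            foreach ( FileInfo inner in
                temp.GetFiles( "*", SearchOption.AllDirectories ) )
            {
                if ( ContainsProhibited(
                    inner, zipExtractor, isProhibited, maxDepth - 1 ) )
                    return true;
            }
            return false;
        }
        finally
        {
            // Always remove the temporary extraction folder.
            temp.Delete( true );
        }
    }
}
```

The depth limit and the cleanup in the finally block are precautions that matter on a public upload service, where a crafted archive could otherwise fill the temporary folder or nest archives very deeply.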
Test application
There is no test application in the download. Instead, the following code snippet is the complete Main function of my own testing console application.
// Requires: using System; using System.IO; plus a reference to the library.
private static void Main()
{
    ContentDetectorEngine engine = new ContentDetectorEngine();

    // Check a handful of individual files.
    FileInfo[] filePaths = new FileInfo[]
    {
        new FileInfo( @"c:\AnotherFolder\112431940.mp3" ),
        new FileInfo( @"c:\AnotherFolder\247293565.txt" ),
        new FileInfo( @"c:\AnotherFolder\008284502.zip" ),
        new FileInfo( @"c:\AnotherFolder\190243241.mdb" ),
        new FileInfo( @"c:\AnotherFolder\182944456.zip" ),
    };

    foreach ( FileInfo filePath in filePaths )
    {
        bool contains =
            engine.ContainsFileProhibitedContent( filePath );

        Console.WriteLine(
            @"Contains '{0}': {1}.",
            filePath.Name,
            contains );
    }

    // Check a whole folder and list the prohibited files found in it.
    FileInfo[] prohibitedPaths =
        engine.ContainsFolderProhibitedContent(
            new DirectoryInfo(
                @"C:\SomeFolder" ) );

    Console.WriteLine( @"Folder contains {0} prohibited files.",
        prohibitedPaths.Length );

    foreach ( FileInfo prohibitedPath in prohibitedPaths )
    {
        // A regular (non-verbatim) string, so that \t is emitted as a tab.
        Console.WriteLine(
            "\tProhibited file: '{0}'.", prohibitedPath );
    }
}
Simply copy it into your own console application and you are done.
Conclusion
In this article I've shown you a library to detect file types based on their content. Although this is only a first version of the library and probably some approaches are somewhat naive, I'm sure the code is useful and can be extended in the future to be even more usable.
If you have feedback, questions or comments, simply post them in the comments section below. I'm looking forward to your messages!
References
- HeaderSig.txt - Several signatures for file types
- Magic number (programming) - Wikipedia article
History
- 2007-05-01: Initial release of the library