A small Content Detection Library

Uwe Keim

1 May 2007

Introducing a library to detect content based on file content (and extension)

Download source files - 117.6 KB

Screenshot - ContentDetectorLib_01.png

Introduction

In my recently published article about the Zeta Uploader application (in short, a website to upload files and send e-mail messages with links to the uploaded files), the discussion came up (Thanks to Phil.Benson) about the need to administer the uploaded files in order to avoid copyright infringements.

This article introduces a library that I wrote last evening and this morning (so it is "really" fresh) as a first step in the right direction.

What the library does

Since I wanted to avoid (at least for now) forcing users of the Zeta Uploader to register and log in to use the service, I decided to try a different approach:

After a file is uploaded, it is checked for whether it is "prohibited", that is, whether it is of a type that may not be uploaded with Zeta Uploader. Currently I treat movie files (AVI, MOV, etc.) and music files (MP3, WAV, etc.) as prohibited.

How the library works

The detection algorithm uses the following mechanisms to test a file for being prohibited or allowed:

File extension

Look at the file extension. If it matches an extension on the prohibited list, the file is considered "prohibited".

File content

Look at the first few bytes of the file for known binary patterns ("magic bytes") and match them against a list of prohibited patterns.

Archive extraction

If the file is detected to be an archive, it is temporarily extracted and the extracted files are scanned too (recursively, if they contain further archives).

The next section briefly discusses these different mechanisms.

File extension checking

This check looks only at the extension of the file name. Since an extension is easy to fake, it serves as a first quick check only. If the extension matches the prohibited list, detection is finished for that file.

If not, a content analysis is done, as described next.
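The quick check described above could look roughly like the following sketch. The class name, the method name and the sample extension list are assumptions made for illustration; they are not taken from the library's source.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the library's actual source.
public static class ExtensionCheckSketch
{
    // Assumed sample list, based on the file types named in the article.
    // The library's real list may differ.
    private static readonly HashSet<string> ProhibitedExtensions =
        new HashSet<string>( StringComparer.OrdinalIgnoreCase )
        {
            ".avi", ".mov", ".mp3", ".wav"
        };

    // Returns true if the file name ends with a prohibited extension.
    public static bool HasProhibitedExtension( string fileName )
    {
        // Path.GetExtension returns the extension including the leading
        // dot, or an empty string if the name has no extension.
        string extension = Path.GetExtension( fileName );
        return ProhibitedExtensions.Contains( extension );
    }
}
```

With this sketch, `ExtensionCheckSketch.HasProhibitedExtension( "song.MP3" )` is true because the comparison ignores case, while a renamed file such as `song.txt` passes and has to be caught by the content analysis.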

Content analysis

The main work of the library is to apply simple "pattern matching" to the content of a file. Through an extensible ISignatureChecker interface, more complex tests can be added later. I've included a simple check for MP3s that does a little bit more than just pattern matching (class Mp3SignatureChecker).

The ISignatureChecker interface is defined as follows:

/// <summary>
/// Interface to implement when checking a buffer
/// for a certain signature.
/// </summary>
internal interface ISignatureChecker
{
    /// <summary>
    /// Check whether a given buffer matches the signature.
    /// </summary>
    /// <param name="buffer">The buffer.</param>
    /// <returns>True if the buffer matches the signature,
    /// false otherwise.</returns>
    bool MatchesSignature(
        byte[] buffer );

    /// <summary>
    /// Gets the first number of bytes to read.
    /// </summary>
    /// <value>The first number of bytes to read.</value>
    int FirstNumberOfBytesToRead
    {
        get;
    }

    /// <summary>
    /// Gets the minimum length of the required buffer.
    /// </summary>
    /// <value>The minimum length of the required buffer.</value>
    int MinimumRequiredBufferLength
    {
        get;
    }
}

Through this interface, the check engine communicates with the individual signature checkers. See the source files for details and examples.
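To illustrate how a checker might plug into this interface, here is a minimal sketch that matches a fixed byte pattern at the start of the buffer. The class PrefixSignatureChecker is hypothetical and not one of the library's checkers. The interface is repeated only so that the sketch compiles on its own, and the way the two properties are used here is my reading of their doc comments, not confirmed behaviour of the engine.

```csharp
using System;

// Repeated from the article so that the sketch compiles on its own.
internal interface ISignatureChecker
{
    bool MatchesSignature( byte[] buffer );
    int FirstNumberOfBytesToRead { get; }
    int MinimumRequiredBufferLength { get; }
}

// Hypothetical checker: matches a fixed byte pattern at offset 0.
internal sealed class PrefixSignatureChecker : ISignatureChecker
{
    private readonly byte[] _pattern;

    public PrefixSignatureChecker( byte[] pattern )
    {
        if ( pattern == null || pattern.Length == 0 )
        {
            throw new ArgumentException(
                "Pattern must not be empty.", "pattern" );
        }

        // Copy the pattern so that later changes by the caller
        // do not affect this checker.
        _pattern = (byte[])pattern.Clone();
    }

    // Assumption: the engine reads this many bytes from the file start.
    public int FirstNumberOfBytesToRead
    {
        get { return _pattern.Length; }
    }

    // Assumption: shorter buffers can never match.
    public int MinimumRequiredBufferLength
    {
        get { return _pattern.Length; }
    }

    public bool MatchesSignature( byte[] buffer )
    {
        if ( buffer == null || buffer.Length < MinimumRequiredBufferLength )
        {
            return false;
        }

        for ( int i = 0; i < _pattern.Length; i++ )
        {
            if ( buffer[i] != _pattern[i] )
            {
                return false;
            }
        }

        return true;
    }
}
```

For example, `new PrefixSignatureChecker( new byte[] { 0x52, 0x49, 0x46, 0x46 } )` matches the ASCII bytes "RIFF", the container header shared by AVI and WAV files. A prefix match alone cannot tell those two formats apart, which is one reason why more specific checkers can be worthwhile.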

Archive extraction

Since many uploaded files are compressed archives, it is important to extract and scan these too.

Again, I've built an extensible mini-framework based on the IArchiveExtractor interface to allow for adding more archive extractors in the future.

The interface is defined as follows:

/// <summary>
/// Interface for archive extractors.
/// </summary>
internal interface IArchiveExtractor
{
    /// <summary>
    /// Extracts the specified file path.
    /// </summary>
    /// <param name="filePath">The file path.</param>
    /// <param name="folderPathToExtractInto">The folder path
    /// to extract into.</param>
    void Extract(
        FileInfo filePath,
        DirectoryInfo folderPathToExtractInto );
}

Currently I am using SharpZipLib to provide extractors for ZIP, gzip and bzip2.
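The recursive scan of extracted archives could be sketched as follows. This is not the library's implementation: to keep the example free of external dependencies it swaps SharpZipLib for System.IO.Compression.ZipFile (available from .NET 4.5, where it needs a reference to System.IO.Compression.FileSystem), and the names ZipArchiveExtractor, RecursiveScanSketch and the isProhibited delegate are hypothetical. The interface is repeated only so that the sketch compiles on its own.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Repeated from the article so that the sketch compiles on its own.
internal interface IArchiveExtractor
{
    void Extract( FileInfo filePath, DirectoryInfo folderPathToExtractInto );
}

// Hypothetical extractor; uses System.IO.Compression, not SharpZipLib.
internal sealed class ZipArchiveExtractor : IArchiveExtractor
{
    public void Extract(
        FileInfo filePath, DirectoryInfo folderPathToExtractInto )
    {
        ZipFile.ExtractToDirectory(
            filePath.FullName, folderPathToExtractInto.FullName );
    }
}

internal static class RecursiveScanSketch
{
    // Returns true if the file itself, or any file inside it when it
    // is a ZIP archive, is flagged by the supplied check.
    public static bool ContainsProhibited(
        FileInfo file,
        Func<FileInfo, bool> isProhibited,
        IArchiveExtractor extractor )
    {
        if ( isProhibited( file ) ) return true;
        if ( !LooksLikeZip( file ) ) return false;

        // Extract into a fresh temporary folder that is always removed.
        DirectoryInfo tempFolder = new DirectoryInfo( Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString( "N" ) ) );
        tempFolder.Create();
        try
        {
            extractor.Extract( file, tempFolder );
            foreach ( FileInfo inner in
                tempFolder.GetFiles( "*", SearchOption.AllDirectories ) )
            {
                // Recurse, so that nested archives are scanned too.
                if ( ContainsProhibited( inner, isProhibited, extractor ) )
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            tempFolder.Delete( true );
        }
    }

    // Non-empty ZIP files start with the bytes 'P', 'K', 0x03, 0x04.
    private static bool LooksLikeZip( FileInfo file )
    {
        byte[] header = new byte[4];
        using ( FileStream stream = file.OpenRead() )
        {
            if ( stream.Read( header, 0, header.Length ) < header.Length )
            {
                return false;
            }
        }
        return header[0] == 0x50 && header[1] == 0x4B
            && header[2] == 0x03 && header[3] == 0x04;
    }
}
```

A call might look like `RecursiveScanSketch.ContainsProhibited( file, f => f.Extension.ToLowerInvariant() == ".mp3", new ZipArchiveExtractor() )`. The sketch places no limit on nesting depth or extracted size and lets extraction errors propagate; a public upload service would probably want to cap both and handle corrupt archives.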

Test application

There is no test application in the download. Instead the following code snippet is the complete Main function of my own testing console application.

/// <summary>
/// The main function.
/// </summary>
private static void Main()
{
    // Instantiate the engine.
    ContentDetectorEngine engine = new ContentDetectorEngine();

    // --
    // Testing discrete files.

    // Collect some files to test.
    FileInfo[] filePaths = new FileInfo[]
    {
        new FileInfo( @"c:\AnotherFolder\112431940.mp3" ),
        new FileInfo( @"c:\AnotherFolder\247293565.txt" ),
        new FileInfo( @"c:\AnotherFolder\008284502.zip" ),
        new FileInfo( @"c:\AnotherFolder\190243241.mdb" ),
        new FileInfo( @"c:\AnotherFolder\182944456.zip" ),
    };

    // Iterate over the files.
    foreach ( FileInfo filePath in filePaths )
    {
        bool contains =
            engine.ContainsFileProhibitedContent( filePath );
        Console.WriteLine(
            @"Contains '{0}': {1}.",
            filePath.Name,
            contains );
    }

    // --
    // Testing a complete folder.

    // Find all prohibited files in the given folder.
    FileInfo[] prohibitedPaths =
        engine.ContainsFolderProhibitedContent(
        new DirectoryInfo(
        @"C:\SomeFolder" ) );

    Console.WriteLine( @"Folder contains {0} prohibited files.",
        prohibitedPaths.Length );

    foreach ( FileInfo prohibitedPath in prohibitedPaths )
    {
        // Regular (non-verbatim) string, so that \t is emitted as a tab.
        Console.WriteLine(
            "\tProhibited file: '{0}'.", prohibitedPath );
    }
}

Simply copy it into your own console application and you are done.

Conclusion

In this article I've shown you a library to detect file types based on their content. Although this is only a first version of the library and probably some approaches are somewhat naive, I'm sure the code is useful and can be extended in the future to be even more usable.

If you have feedback, questions or comments, simply post them in the comments section below. I'm looking forward to your messages!

References

HeaderSig.txt - Several signatures for file types
Magic number (programming) - Wikipedia article

History

2007-05-01: Initial release of the library

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

