Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / C#

EHFileDiff - File Compare Utility

5.00/5 (6 votes)
2 Sep 2009CPOL2 min read 33.4K   929  
My entry to the Code Lean and Mean File Comparison Contest.

Introduction

This is my C# entry to the Code Lean and Mean Competition. The idea of the contest is basically to create the leanest and meanest file diff program that can show the differences of Grid1.html and Grid2.html, running as fast as possible and using the least amount of memory.

I have also submitted an entry for the C++ version, called MeanFileCompare. I found the C# version much more challenging because it seems like no matter what you do, a ton of memory is allocated. My intentions were to use the least amount of memory possible and get a fast time. I will explain more of my memory optimizations further down in the article.

Using my techniques, I was able to get diff times for a first time run in the 30 to 50 ms time ranges, and of 4 to 10 ms running a second time after all the JITting has taken place.

Image 1

Using the Code

I used a byte-to-byte comparison to do the diff because I wanted to display line numbers as well as the text, so I needed to be able to count the number of newlines in the file. Here is the main loop of the file comparison:

C#
public void CompareFiles(string file1, string file2)
{
    //
    // Load the files to compare
    //
    try
    {
        _compareFile1 = new CompareFile(file1);
        _compareFile2 = new CompareFile(file2);
    }
    catch (FileNotFoundException ex)
    {
        _output.Append(ex.Message);
        return;
    }

    //
    // Keep comparing until both files have been processed
    //
    while (true)
    {
        //
        // Compare until we have a mismatch
        //
        if (_compareFile1._buffer[_compareFile1._currentBufferIndex] ==
            _compareFile2._buffer[_compareFile2._currentBufferIndex])
        {
            //
            // Check for a new line so we can increment the line number and
            // save the position for the beginning of the line(just in case the line
            // is different).
            //
            if (_compareFile1._buffer[_compareFile1._currentBufferIndex] == '\n')
            {
                _compareFile1._currentLineNumber++;
                _compareFile2._currentLineNumber++;

                _compareFile1._startOfLine = _compareFile1._currentBufferIndex + 1;
                _compareFile2._startOfLine = _compareFile2._currentBufferIndex + 1;
            }

            _compareFile1._currentBufferIndex++;
            _compareFile2._currentBufferIndex++;

            //
            // Check to see if we have reached the end of the buffers 
            // and read more if necessary.  both buffers need to be
            // checked independently because they can get out of sync
            // with differences
            //
            if (_compareFile1._currentBufferIndex == _compareFile1._totalBytesInBuffer)
            {
                _compareFile1.ReadMoreData();
            }

            if (_compareFile2._currentBufferIndex == _compareFile2._totalBytesInBuffer)
            {
                _compareFile2.ReadMoreData();
            }

            //
            // No need to check if we have processed all bytes until
            // we know we've read both complete files
            //
            if (_compareFile1._fileCompletelyRead || _compareFile2._fileCompletelyRead)
            {
                if (_compareFile1.EndOfFile || _compareFile2.EndOfFile)
                {
                    //
                    // Do one of the files have extra data?
                    //
                    if (!_compareFile1.EndOfFile || !_compareFile2.EndOfFile)
                    {
                        FindDifferences();
                    }

                    //
                    // We're finished
                    //
                    break;
                }
            }
        }
        else
        {
            _compareFile1.ReadMoreData();
            _compareFile2.ReadMoreData();
            FindDifferences();

            if (_compareFile1.EndOfFile && _compareFile2.EndOfFile)
            {
                //
                // We're done
                //
                break;
            }
        }
    }

    _compareFile1.CloseFile();
    _compareFile2.CloseFile();

}

Lean and Mean

The whole idea of this competition was to keep your program Lean and Mean, so to keep in the spirit of the competition, I allocated a 2K buffer and read in chunks to perform the diff, while an easier way would have been to read in the full file and perform the diff using the entire 187,731 bytes, but I don't think that would be keeping your program "lean".

buffer_chart.png

If I need to increase the buffer due to comparing a line that exceeds the 2K size limit, or if a comparison of differences is exceeding the limit, I increase the buffer size in 2K blocks until the line or comparison has been satisfied.

To make things even leaner, I did not even create a separate buffer to read the data into. I created an unsafe pointer so I could append the data being read from the file directly into the buffer.

C#
[DllImport("kernel32.dll", SetLastError = true)]
public static extern unsafe bool ReadFile(IntPtr handle, IntPtr buffer, 
                            UInt32 numberOfBytesToRead, 
                            out UInt32 numberOfBytesRead, 
                            IntPtr overlapped);
     .
     .
     .
unsafe bool ReadBytes(int offset, UInt32 bytesToRead, ref UInt32 bytesRead)
{
    // The following fixed statement pins the location of the buffer
    // in memory so that it will not be moved by garbage collection.
    fixed (byte* pBuffer = _buffer)
    {
        IntPtr p = (IntPtr)(pBuffer + offset);

        if (ReadFile(_fileHandle, p, bytesToRead, out bytesRead, IntPtr.Zero))
        {
            return true;
        }

        return false;
    }
}

History

  • 8/31/2009 - Initial release.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)