NLineDiff: A diff Implementation for Text Files

Nicholas Butler

4.42/5 (4 votes)

28 Aug 2009CPOL3 min read

31.2K

952

An entry for the Lean and Mean competition

Key: Red lines are those deleted from the first file and green lines are those inserted from the second file.

Introduction

This article is an entry for the Lean and Mean competition, which ran on The Code Project during August 2009.

I have implemented a slight variation of Eugene W. Myers' diff algorithm with the linear space refinement from the paper listed in the downloads.

The algorithm is O(ND) in time and O(N) in space, where:

N = number of lines in FileA + number of lines in FileB
D = number of different lines

The output is rendered using a simple WinForms WebBrowser control.

Results

For the test files, N = 3,573 and D = 19. On my 2.5 GHz machine, the solution takes about 5 ms and uses about 450 KB of memory.

Time

A large proportion of the time is spent reading the files from disk and splitting them into string arrays. Using File.ReadAllText takes about 1.7 ms for both files, while File.ReadAllLines takes about 1.9 ms. Using File.OpenRead and StreamReader.ReadLine executes in almost identical times.

Building the HTML from the results also takes some time. Just showing the changes takes about 0.15 ms, which rises to 2.6 ms if the common lines are also included. I have chosen to include the common lines to give more useful results.

As for the diff algorithm itself, the average of 100 iterations is about 0.33 ms. The time for a single run obviously varies more than the average, but consistently takes about 0.4 ms.

These operations give the total time of 1.9 + 0.4 + 2.6 ≈ 5 ms.

The diff is run once before the timing run to allow the JIT compiler to work. This takes about 27 ms, which dwarfs the calculation time. I think that not including this time is legitimate as Chris said in the Notes:

"... there's no need to report memory usage or speed of the framework you use to support your code."

If you disagree, you can add in the JIT time to give a total time of about 32 ms.

Space

The space used is the sum of three areas:

The files are loaded into memory, which takes about 360 KB.
They are split into string arrays, which takes about 30 KB.
The algorithm itself uses an int array, which is about 60 KB.

The sum gives a maximum working memory of 450 KB.

I see that other entries have used Process.PeakWorkingSet64, but this is mostly a measure of how much memory the framework is using. Running this code reports about 900 KB being used by the Thread.Sleep call, which is obviously not correct.

static void Main( string[] args )
{
  var p = Process.GetCurrentProcess();
  var peak = p.PeakWorkingSet64;

  Thread.Sleep( 1000 );

  p.Refresh();

  Console.WriteLine( String.Format( "peak: {0:N0} bytes", p.PeakWorkingSet64 - peak ) );
}

Using the Code

The binary is a single executable that has the test files included as embedded resources. It is configured to load the two texts and split them into string arrays, then time the diff algorithm.

You can also supply two file paths on the command line to use those files.

The source contains some variations which are commented out in Form1.cs. You can edit this file and recompile to verify my time and space measurements on your machine.

Points of Interest

Myers' algorithm is a work of art and I have thoroughly enjoyed learning it. It has taken me about a week to implement it because that's how long it took for me to fully understand it. It is definitely worthy of an article on its own (coming soon!)

History

2009 08 27: First edition
2009 08 28: Fixed bug with overlapping diagonal when D=1 && Δ is even

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)