An LCS based diff-ing library in C#

Nish Nishant

4.98/5 (28 votes)

24 Aug 2009CPOL7 min read

108.2K

1.5K

This is an LCS based diff library that was written for the Code Project Lean and Mean diff engine contest

Introduction

I wrote this article mostly for fun and as a candidate entry for The Code Project's Lean and Mean competition to write a text diffing tool. The library and demo application are both written in C# using VS 2010 Beta and tested on a 64 bit Windows 7 RC machine. Some of the articles that have been submitted so far do not really do a diff in the proper sense of the word and so that gave me more of an impetus to write one that actually does a standard diff. I primarily referred to the Wikipedia entry on the LCS algorithm but also referred to this article (university lecture notes). I've called the class SimpleDiff<T> because while it does do a proper LCS based diff, it could probably be further optimized, though I've applied some bare optimizations of my own.

Performance/Memory info

The implementation is intended to be more mean than lean and the primary objective was to write a reasonably fast diff-engine. This does not mean it cannot be made faster, nor does it mean it cannot be made leaner without losing out majorly on speed. For the test files provided by Code Project, here're the timing results for five iterations, along with the average.

35.0069 ms

33.3978 ms

34.1942 ms

34.9919 ms

34.2274 ms

Average : 34.3636 ms

The tests were performed on a 64-bit Core i7 machine with 12 GB RAM.

The memory usage (in bytes) is probably not as impressive as it may have been for a C++ app or for one that did not load both files into memory in their entirety.

Paged memory	6,492,160
Virtual memory	0
Working set	6,766,592

Not surprisingly (given the 12 GB ram), virtual memory usage was 0. But the working set (delta) was 6.7 MB. The two biggest memory hogs are the fact that we store both files in memory throughout the process, and the fact that the LCS matrix can be pretty huge for files of any non-trivial size. I've considered a version that never fully loaded either file but I couldn't think of a reasonable way to avoid a performance fall (which invariably accompanies frequent disk reads for fetching/seeking data).

Note on memory calculation

The memory consumption was calculated using the

Process

class's PeakWorkingSet64 and related properties. I took the values once, invoked the code, and read the values again and calculated the delta. To account for JIT memory, the diff class was created once but not used prior to calculating the memory.

Using the demo app

The demo app takes 2 parameters, the source and target files (or the left and right files as you may prefer). And it prints out all the lines, and uses a ++ prefix to indicate a line added to the left file, and a -- prefix to denote a line removed from the left file. Here are some screenshots that show it run against Chris M's sample files and also against my simple test files.

Figure 1: The screenshot shows the left and right files as well as the console output.

Figure 2: The screenshot shows a partial output of comparing the test files provided by the competition creators.

If you are wondering about line numbers, note that this display is the function of the calling program, my diff library does not really provide any output - it just provides an event that callers can hook onto. For my console app I chose to imitate the Unix diff program (though not identically), but it'd be trivial to add line numbers. It'd be equally simple to write a WPF or WinForms UI for this library.

Class design

It's a generic class where the generic argument specifies the type of the comparable item. In our case, since we are comparing text files, this would be System.String (representing a line of text). Using the class would be something like :

leftLines = File.ReadAllLines(args[0]);
rightLines = File.ReadAllLines(args[1]);

var simpleDiff = new SimpleDiff<string>(leftLines, rightLines);
simpleDiff.LineUpdate += simpleDiff_LineUpdate;
simpleDiff.RunDiff();
Console.WriteLine(simpleDiff.ElapsedTime);

There is only one constructor and that takes two arrays - one for the left entity and one for the right one.

public SimpleDiff(T[] left, T[] right)
{
    _left = left;
    _right = right;

    InitializeCompareFunc();
}

InitializeCompareFunc sets the comparer delegate which is defined as:

private Func<T, T, bool> _compareFunc;

I have special-cased for String so we get the most performant comparison available.

private void InitializeCompareFunc()
{
    // Special case for String types
    if (typeof(T) == typeof(String))
    {
        _compareFunc = StringCompare;
    }
    else
    {
        _compareFunc = DefaultCompare;
    }
}

/// <summary>
/// This comparison is specifically
/// for strings, and was nearly thrice as 
/// fast as the default comparison operation.
/// </summary>
/// <param name="left"></param>
/// <param name="right"></param>
/// <returns></returns>
private bool StringCompare(T left, T right)
{
    return Object.Equals(left, right);
}

private bool DefaultCompare(T left, T right)
{
    return left.CompareTo(right) == 0;
}

There's also a public event that's fired for each line :

public event EventHandler<DiffEventArgs<T>> LineUpdate;

public class DiffEventArgs<T> : EventArgs
{
    public DiffType DiffType { get; set; }

    public T LineValue { get; set; }

    public DiffEventArgs(DiffType diffType, T lineValue)
    {
        this.DiffType = diffType;
        this.LineValue = lineValue;
    }
}

public enum DiffType
{
    None = 0,
    Add = 1,
    Subtract = 2
}

And in the demo app it's handled in a awfully unfussy manner. But for a more UI-ish app, the possibilities are boundless.

static void simpleDiff_LineUpdate(object sender, DiffEventArgs<string> e)
{
    String indicator = "  ";
    switch (e.DiffType)
    {
        case DiffType.Add:
            indicator = "++";
            break;

        case DiffType.Subtract:
            indicator = "--";
            break;
    }

    Console.WriteLine("{0}{1}", indicator, e.LineValue);
}

When RunDiff is called the first time, the most important and also the most time consuming part of the code is executed, once. That's where we calculate the LCS matrix for the two arrays. But prior to that, as an optimization there's code that will look for any potential skip text at the beginning or end of the diff-arrays. The reason we can safely do this is that if the identical lines are removed from both ends, and the LCS calculated for the middle portion, then we can add back the trimmed content to get the LCS for the original array-pair. If that's not clear consider these two strings (imagine we are comparing characters and not strings) :

<font color="#FF0000">abc</font>hijk<font color="#FF0000">xyz</font> and <font color="#FF0000">abc</font>hujkw<font color="#FF0000">xyz</font>

If we remove abc from the left, and xyz from the right for both char arrays, we get :

hijk and hujkw.

The LCS for these two char arrays is hjk. Now add back the removed items to both ends and we get :

abchjkxyz

which is the LCS for the original pair of char arrays. I've applied the same concept here. This method calculates any items that can be skipped at the beginning of the content :

/// <summary>
/// This method is an optimization that
/// skips matching elements at the start of
/// the arrays being diff'ed
/// </summary>
private void CalculatePreSkip()
{
    int leftLen = _left.Length;
    int rightLen = _right.Length;
    while (_preSkip < leftLen && _preSkip < rightLen &&
        _compareFunc(_left[_preSkip], _right[_preSkip]))
    {
        _preSkip++;
    }
}

Now the post-skip is calculated (carefully avoiding an overlap with the pre-Skip which further improves performance).

/// <summary>
/// This method is an optimization that
/// skips matching elements at the end of the 
/// two arrays being diff'ed.
/// Care's taken so that this will never
/// overlap with the pre-skip.
/// </summary>
private void CalculatePostSkip()
{
    int leftLen = _left.Length;
    int rightLen = _right.Length;
    while (_postSkip < leftLen && _postSkip < rightLen &&
        _postSkip < (leftLen - _preSkip) &&
        _compareFunc(_left[leftLen - _postSkip - 1], 
            _right[rightLen - _postSkip - 1]))
    {
        _postSkip++;
    }
}

Next we calculate the LCS matrix :

/// <summary>
/// This is the core method in the entire class,
/// and uses the standard LCS calculation algorithm.
/// </summary>
private void CreateLCSMatrix()
{
    int totalSkip = _preSkip + _postSkip;
    if (totalSkip >= _left.Length || totalSkip >= _right.Length)
        return;

    // We only create a matrix large enough for the
    // unskipped contents of the diff'ed arrays
    _matrix = new int[_left.Length - totalSkip + 1, _right.Length - totalSkip + 1];

    for (int i = 1; i <= _left.Length - totalSkip; i++)
    {
        // Simple optimization to avoid this calculation
        // inside the outer loop (may have got JIT optimized 
        // but my tests showed a minor improvement in speed)
        int leftIndex = _preSkip + i - 1;

        // Again, instead of calculating the adjusted index inside
        // the loop, I initialize it under the assumption that
        // incrementing will be a faster operation on most CPUs
        // compared to addition. Again, this may have got JIT
        // optimized but my tests showed a minor speed difference.
        for (int j = 1, rightIndex = _preSkip + 1; 
            j <= _right.Length - totalSkip; j++, rightIndex++)
        {
            _matrix[i, j] = _compareFunc(_left[leftIndex], _right[rightIndex - 1]) ?
                _matrix[i - 1, j - 1] + 1 :
                Math.Max(_matrix[i, j - 1], _matrix[i - 1, j]);
        }
    }

    _matrixCreated = true;
}

The inline comments should be self explanatory. Interestingly my assumptions about what the JIT optimizer would take care of turned out to be rather inaccurate and hazy. Of course I did not run detailed enough tests to make any serious conclusions, but to be safe it's probably best to do some level of optimizing on your own instead of always thinking, hey the pre-JIT will catch that one. Once the LCS is calculated, all that's left is to traverse the matrix and fire the events as required, and also remembering to iterate through the skipped entries if any :

for (int i = 0; i < _preSkip; i++)
{
    FireLineUpdate(DiffType.None, _left[i]);
}

int totalSkip = _preSkip + _postSkip;
ShowDiff(_left.Length - totalSkip, _right.Length - totalSkip);

int leftLen = _left.Length;
for (int i = _postSkip; i > 0; i--)
{
    FireLineUpdate(DiffType.None, _left[leftLen - i]);
}

I did not spent too much time trying to optimize ShowDiff since my profiling showed that it was not anywhere near as time consuming as the LCS calculation. 87% of the execution time was spent in the LCS matrix loops.

/// <summary>
/// This traverses the elements using the LCS matrix
/// and fires appropriate events for added, subtracted, 
/// and unchanged lines.
/// It's recursively called till we run out of items.
/// </summary>
/// <param name="leftIndex"></param>
/// <param name="rightIndex"></param>
private void ShowDiff(int leftIndex, int rightIndex)
{
    if (leftIndex > 0 && rightIndex > 0 &&
        _compareFunc(_left[_preSkip + leftIndex - 1], 
            _right[_preSkip + rightIndex - 1]))
    {
        ShowDiff(leftIndex - 1, rightIndex - 1);
        FireLineUpdate(DiffType.None, _left[_preSkip + leftIndex - 1]);
    }
    else
    {
        if (rightIndex > 0 &&
            (leftIndex == 0 ||
            _matrix[leftIndex, rightIndex - 1] >= _matrix[leftIndex - 1, rightIndex]))
        {
            ShowDiff(leftIndex, rightIndex - 1);
            FireLineUpdate(DiffType.Add, _right[_preSkip + rightIndex - 1]);
        }
        else if (leftIndex > 0 &&
            (rightIndex == 0 ||
            _matrix[leftIndex, rightIndex - 1] < _matrix[leftIndex - 1, rightIndex]))
        {
            ShowDiff(leftIndex - 1, rightIndex);
            FireLineUpdate(DiffType.Subtract, _left[_preSkip + leftIndex - 1]);
        }
    }
}

Note on hashing

Many C++ implementations compute the hash of items before comparing them. In our case, since we are using .NET that'd actually be a de-optimization because we'd lose out on the benefits of string interning. In most cases, the majority of lines would be the same (in real life scenarios, only a small percentage of lines are changed between file versions). And since we use Object.Equals which does a reference comparison as the first step, and because identical strings are interned, this comparison is extremely fast. Where we slow down on is when we compare long lines that may differ by one character at the right end of the line - that'd give us our worst case false-compare time.

Conclusion

I had initially thought of writing this in C++/CLI so I could mix types - which'd specially be useful when creating the array. The .NET array's big disadvantage is that it's zero-initialized, and while that's one of the fastest operations any CPU can typically perform, it's still time consuming because of the large size of the array. I could have avoided that by using a native array. But the lack of intellisense drove me nuts after a few minutes and I gave up and went back to C#. Maybe if I get time I'll write another version which may do part of the calculations in native code and the C# code can P/Invoke it, though that itself may bring in inefficiencies of its own. Anyway, any suggestions and criticisms are extremely welcome.

History

August 20, 2009
- Article first published
August 24, 2009
- Added binary download
- Added timing/memory usage stats

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)