Introduction
This is a generic algorithm for finding the difference between 2 objects, no matter what the type. The algorithm uses the ICompare
interface to tell if the items are the same. Maybe this could be used for producing a WiKi site in ASP.NET and track revisions.
I just wanted to let all you readers know, I don't have a degree in Computer Science and I haven't read or learned much about algorithm design. I haven't ever really analyzed a Diff algorithm before, but I know a little about them. If there is a way of improving this, or if I am going in the completely wrong direction, please let me know.
Background
I am not really sure of the intimate details of how other Diff algorithms work (like the UNIX Diff utility), but it does use some ideas that I got from general reading on the topic. The Diff Engine uses a bit matrix that employs a sliding window of the 2 files. The matrix looks kind of like this:
The black squares are the squares with matching items. The algorithm starts at 0,0 and looks all the way along the x-axis to find a diagonal sequence, which is 2 or more matching characters in a row. The algorithm looks for the longest sequence, using the the closer of any that are the same length. It then looks down the y-axis, and if it finds a sequence that is larger and closer than any found on the x-axis, it uses that sequence. If it finds a sequence, it performs whatever operation is necessary to get to that sequence. In this case, it wouldn't find a diagonal sequence on the 0 x-axis or the y-axis, so it performs its standard operation, which is to delete the 'B' and insert the 'N'. It then increments our x and y location, so we are now at 1,1. Again, we do not find a diagonal sequence, so we perform the standard operation again. However, at 2,2 we find that we are in a sequence.
Once we find a sequence, first we slide the window down to where we are, since we won't be needing any of the previous characters any more. Then, we continually increment x and y until they reach a non matching square. At 4,4 there is no match, so we switch states again, meaning that we slide the window down, and once again look for a diagonal. Now, on the x-axis 4, we find a diagonal at 4,8, so now must perform whatever is necessary to get there. Moving in the y direction is an insertion, so we insert 'T', 'h', 'e' and ' '. At that point, we enter enter the diagonal, save state, etc.
We follow the diagonal down to the end from there. Pretty straight forward, right?
Using the code
I wrote this code so that it would be as generic as possible, so I built it around 3 basic classes and an interface. There are actually more classes but, I will get to those in a minute.
First, to do a comparison, you will need to subclass ComparableStreamReader
. This is an abstract class that I created, and it has only one method, GetNext()
, which returns an object that implements an IComparable
object. This is really the trick to this. In the demo, I have included a class called ComparableStringReader
, which just returns characters from a string, one at a time. If you look at the code, you will notice that it is actually returning strings with one character and not char objects. What can I say - string already implements IComparable
, and I am lazy. However, if you want to compare items that are not strings, you may have to create your own custom object that implements IComparable
.
public class ComparableStringReader : Diff.ComparableStreamReader
{
protected string _source;
protected int _location;
public ComparableStringReader(string source)
{
_source = source;
_location = 0;
}
public override IComparable GetNext()
{
if ( _location < _source.Length )
return _source.Substring(_location++, 1);
else
return null;
}
}
Once you have created a reader class, you just need to create an instance of the DiffEngine
class. DiffEngine
has two properties, and one method, Compare()
. Compare takes in 2 ComparableStreamReader
s, the source (original), and the destination (changed document). Compare()
returns an EditScript
object, which is just a collection of DiffCommands
objects. Before I go into the DiffCommand
objects, however, I want to mention the two properties. The property WindowSize
is an integer that defaults to 256. You have to be careful with this number however. The higher it goes, the more exact and concise the Diff script will be, but it uses memory at a rate of (n2/8). That's right. So at 256, we are only using 8K of memory to store the matrix (it's actually called an Edit Graph). However, if we decide to move up to 1024, we are not using 128K. Also, the higher the window size, the longer it takes to compute, at about the same rate (I am not really sure how to use big-oh notation, only read it, so I don't give it). 255 should work fine for most cases, especially if you are comparing entire lines, one at a time.
DiffEngine de = new DiffEngine();
de.WindowSize = 10;
ComparableStringReader csrSource = new ComparableStringReader(txtSource.Text);
ComparableStringReader csrDest = new ComparableStringReader(txtDestination.Text);
EditScript es = de.Compare(csrSource, csrDest);
Now, once the Compare()
method returns, it returns a script (that same script is available through the Script
property). The EditScript
, like I said before, is a collection of DiffCommand
s. Each command represents an operation that must be performed against the source (original) to produce the destination (changed document). There are three types of commands, which are members of the DiffCommandType
enum
. There's Insert
, Delete
, and Skip
. The Skip
and Delete
commands utilize the Length
property of DiffCommand
class, and the Insert
command utilities the Value
property. The Length
property just represents the number of times to repeat the operation. The Value
property represents the object to be inserted, and it is also the exact object that was returned from ComparableStreamReader.GetNext()
.
In the sample, I have used the EditScript
to provide an exploded view of the source and destination strings so that you can see what they had in common. I also generated a very simple Diff script that could be used to recreate the destination from the original.
B i t Matrix
N ot The Matrix
-1+N-1+o*2+T+h+e+ *6
In this example, the Diff script is using three operators - '*' for Skip
followed by the number of characters to skip, '-' for Delete
followed by the number of characters to delete, and '+' for Insert
, followed by the character to insert.
Eventually, I will write a method to merge a source and an EditScript
to recreate the original, but first, I want to do some better testing just to make sure that this algorithm is actually viable. If you use it successfully, please let me know. Also, if you have any recommendations for improving the algorithm, also let me know.
Points of Interest
There is a reusable BitMatrix
class that this algorithm employs to cut down on memory use. If you want, you can change it to use just bool
values. This may or not speed up the algorithm, since you would be using 8 times more memory (unless the C# compiler does lots of optimizing!).
History
- 4/21/2004 - Version 0.9 - released.