Introduction
I originally wrote this code and article for the Code Lean and Mean[^] contest. The code presented in the first version of this article was specialized for working with strings (lines of text files). This version of the article presents a generic diff engine that can be used with characters, strings, or just about any other type. The included demo program still does a line-by-line diff of two text files.
Theory of this Algorithm
I began this project with the following postulates:
- A file diff is a sequence of copy operations and edit operations.
- The optimal edit distance is achieved by the longest copy operations and the shortest edit operations.
- Every item (character or line) of the source and target must be accounted for.
- Every source item must be copied or deleted.
- Every target item must be copied or inserted.
- Any sequence of copy operations can be reduced to one copy operation.
- Any sequence of delete operations can be reduced to one delete operation.
- Any sequence of insert operations can be reduced to one insert operation.
- Any sequence of delete and insert operations can be reduced to one replace operation.
- The order of delete and insert operations within a replace operation is not significant.
- All possible sequences of delete and insert operations that describe a particular edit operation are equivalent.
- While seeking the optimal edit operation, we need track only the number of delete and insert operations.
As we'll see, it's these last two postulates that yield the greatest savings in processing.
Two details of the implementation that are also very important are:
- The use of a breadth-first search for the end of an edit section.
- The desire to find the most-diagonal path by preferring replace operations.
First Things First
LibExt.Lines
I began with developing a character-by-character solution. This allowed me to test with strings, but I knew the eventual solution would need to work with text files. My first goal, then, was to enable my code to work with either strings or text files without needing to know the details. I knew that System.String
implements the System.Collections.Generic.IEnumerable<char>
interface, but I didn't find a file reader that did, so I wrote my own enumerator to do it. I then did likewise for System.Collections.Generic.IEnumerable<string>
. Here are the Extension Methods that I use to enumerate the lines in a text file:
namespace PIEBALD.Lib.LibExt.Lines
{
public static partial class LibExt
{
public static System.Collections.Generic.IEnumerable<string />
Lines
(
this System.IO.FileInfo File
)
{
return ( File.OpenRead().Lines() ) ;
}
public static System.Collections.Generic.IEnumerable<string />
Lines
(
this System.IO.FileStream Stream
)
{
return ( (new System.IO.StreamReader ( Stream )).Lines() ) ;
}
public static System.Collections.Generic.IEnumerable<string />
Lines
(
this System.IO.StreamReader Stream
)
{
string current ;
while ( ( current = Stream.ReadLine() ) != null )
{
yield return ( current ) ;
}
yield break ;
}
}
}
ItemBuffer
The next order of business was to implement a way to buffer the items read from the enumerator. The buffer needed to hold only the items currently under consideration, once a decision was made, the old items could be removed. As with the above code, I first wrote a char-based buffer (using a StringBuilder
), then used it as the basis for the following class:
public class ItemBuffer<T> : System.IDisposable
{
protected readonly object pickle ;
protected System.Collections.Generic.IEnumerator<T> source ;
protected System.Collections.Generic.List<T> buffer ;
public ItemBuffer
(
System.Collections.Generic.IEnumerable<T> Source
)
{
if ( Source == null )
{
throw ( new System.ArgumentNullException ( "Source" , "You must provide a source" ) ) ;
}
this.pickle = new object() ;
this.source = Source.GetEnumerator() ;
this.buffer = new System.Collections.Generic.List<T>() ;
return ;
}
public virtual int
Count
{
get
{
lock ( this.pickle )
{
return ( this.buffer == null ? -1 : this.buffer.Count ) ;
}
}
}
public virtual bool
TryGet
(
int Index
,
out T Value
)
{
bool result = false ;
Value = default(T) ;
lock ( this.pickle )
{
if ( this.source != null )
{
while
(
( this.buffer.Count <= Index )
&&
( this.source.MoveNext() )
)
{
this.buffer.Add ( this.source.Current ) ;
}
if ( this.buffer.Count > Index )
{
Value = this.buffer [ Index ] ;
result = true ;
}
}
}
return ( result ) ;
}
public virtual void
Clear
(
)
{
lock ( this.pickle )
{
this.buffer.Clear() ;
}
return ;
}
public virtual System.Collections.Generic.IEnumerable<T>
GetRange
(
int StartIndex
,
int Length
)
{
lock ( this.pickle )
{
return ( this.buffer.GetRange ( StartIndex , Length ).AsReadOnly() ) ;
}
}
public virtual void
RemoveRange
(
int StartIndex
,
int Length
)
{
lock ( this.pickle )
{
this.buffer.RemoveRange ( StartIndex , Length ) ;
}
return ;
}
}
2009-09-20: Unsealed, changed the fields to protected, and changed the methods to virtual. This allows the following char-specialized subclass:
CharBuffer
This is a char-specialized ItemBuffer
. It uses a StringBuilder
rather than a List<char>
to buffer the characters. It may or may not offer improved performance, but GetRange
returns a string
so I find it easier to use.
public sealed class CharBuffer : PIEBALD.Types.ItemBuffer<char>
{
private System.Text.StringBuilder builder ;
public CharBuffer
(
System.Collections.Generic.IEnumerable<char> Source
)
: base
(
Source
)
{
this.buffer = null ;
this.builder = new System.Text.StringBuilder() ;
return ;
}
public override int
Count
{
get
{
lock ( this.pickle )
{
return ( this.builder == null ? -1 : this.builder.Length ) ;
}
}
}
public override bool
TryGet
(
int Index
,
out char Value
)
{
bool result = false ;
Value = default(char) ;
lock ( this.pickle )
{
if ( this.source != null )
{
while
(
( this.builder.Length <= Index )
&&
( this.source.MoveNext() )
)
{
this.builder.Append ( this.source.Current ) ;
}
if ( this.builder.Length > Index )
{
Value = this.builder [ Index ] ;
result = true ;
}
}
return ( result ) ;
}
}
public override void
Clear
(
)
{
lock ( this.pickle )
{
this.builder.Length = 0 ;
}
return ;
}
public override System.Collections.Generic.IEnumerable<char>
GetRange
(
int StartIndex
,
int Length
)
{
lock ( this.pickle )
{
return ( this.builder.ToString ( StartIndex , Length ) ) ;
}
}
public override void
RemoveRange
(
int StartIndex
,
int Length
)
{
lock ( this.pickle )
{
this.builder.Remove ( StartIndex , Length ) ;
return ;
}
}
}
LimitedQueue
I also make use of my LimitedQueue[^]. The diff algorithm doesn't need to limit the number of items in the queue, but it does need to access the items by index, and LimitedQueue
allows that (as part of the "and mean" requirement of the contest :-D ).
Other Supporting Characters
DiffSectionType
Each DiffSection
has a type:
[System.FlagsAttribute()]
public enum DiffSectionType : byte
{
Copy = 0
,
Delete = 1
,
Insert = 2
,
Replace = 3
}
DiffSectionBase<T>
An abstract
base class for the various types of diff operations; it only needs to hold the type of operation it represents and the content associated with it. This class was one of the stumbling blocks in achieving a pleasing generic solution. To be generic, I want it to hold List<T>
, but when writing a specialized diff to perform character-by-character diffs, I want it to hold a string
. I eventually decided to use IEnumerable<T>
so I can store a List<T>
or a string
as appropriate. This has the added benefit of matching the type that is sent into the diff engine -- IEnumerable<T>
in, IEnumerable<T>
out.
public abstract class DiffSectionBase<T>
{
protected DiffSectionBase
(
DiffSectionType Type
,
System.Collections.Generic.IEnumerable<T> SourceContent
,
System.Collections.Generic.IEnumerable<T> TargetContent
)
{
this.Type = Type ;
this.SourceContent = SourceContent ;
this.TargetContent = TargetContent ;
return ;
}
public DiffSectionType Type
{
get ;
private set ;
}
public System.Collections.Generic.IEnumerable<T> SourceContent
{
get ;
private set ;
}
public System.Collections.Generic.IEnumerable<T> TargetContent
{
get ;
private set ;
}
}
CopyDiffSection
The classes for the four types of diff operations are implemented like this:
public sealed class CopyDiffSection<T> : DiffSectionBase<T>
{
public CopyDiffSection
(
System.Collections.Generic.IEnumerable<T> SourceContent
)
: base
(
DiffSectionType.Copy
,
SourceContent
,
null
)
{
return ;
}
}
DiffCandidate
While seeking the optimal edit operation, each sequence of deletes and inserts under consideration is held in one of these. As mentioned earlier, only the type of operation and the number of deletes and inserts needs to be held by each candidate. A great deal of memory and processing is saved by storing a hash and using it to determine whether or not the candidate is unique. This should satisfy the "lean" requirement of the contest.
This class is also used for copy operations, but that's not as interesting.
This class limits the implementation to tracking copy/edit sections of up to System.Int16.MaxValue
(32767) items. This ought to be enough for most diff operations and the application may hit an OutOfMemoryException
long before it gets to that point. The current implementation does not protect against OutOfMemory
situations, but if the limit of 32767 items is reached, the implementation will simply produce what it has and continue.
public sealed class DiffCandidate
{
public const int Limit = System.Int16.MaxValue ;
public readonly int Hash ;
internal DiffCandidate
(
PIEBALD.Types.DiffSectionType Type
,
int SourceIndex
,
int TargetIndex
)
{
this.Hash = (int) Type ;
this.Hash <<= 15 ;
this.Hash |= SourceIndex ;
this.Hash <<= 15 ;
this.Hash |= TargetIndex ;
return ;
}
public PIEBALD.Types.DiffSectionType
Type
{
get
{
return ( (PIEBALD.Types.DiffSectionType) ( ( this.Hash >> 30 ) & 3 ) ) ;
}
}
public int
SourceIndex
{
get
{
return ( ( this.Hash >> 15 ) & Limit ) ;
}
}
public int
TargetIndex
{
get
{
return ( this.Hash & Limit ) ;
}
}
}
2009-09-20: Changed from using ushort
to int
-- this avoids a bunch of casting later.
DiffEngine
This class implements the diff algorithm. It should work for any type T
where T
implements System.IComparable<T>
; it definitely works for char
and string
.
The tracker field is used to hold the various edit sequences that are under consideration. It needs to be FIFO and indexable, hence the use of my LimitedQueue
.
The Truncate
method is used after a decision has been made about a section of the diff
, it prepares the fields for the next section.
public sealed class DiffEngine<T> : System.IDisposable
where T : System.IComparable<T>
{
[System.FlagsAttribute()]
private enum EngineState : byte
{
None = 0
,
ReadSource = 1
,
ReadTarget = 2
,
ReadBoth = 3
,
Equal = 4 + ReadBoth
,
GotCandidate = 8
}
private readonly PIEBALD.Types.LimitedQueue<DiffCandidate> tracker ;
private PIEBALD.Types.DiffSectionType type ;
private readonly PIEBALD.Types.ItemBuffer<T> source ;
private readonly PIEBALD.Types.ItemBuffer<T> target ;
private int sourceindex ;
private int targetindex ;
private EngineState state ;
private DiffEngine
(
PIEBALD.Types.ItemBuffer<T> Source
,
PIEBALD.Types.ItemBuffer<T> Target
)
{
this.tracker = new PIEBALD.Types.LimitedQueue<DiffCandidate>() ;
this.source = Source ;
this.target = Target ;
this.Truncate() ;
return ;
}
private bool
HasCandidates
{
get
{
return ( this.tracker.Count > 0 ) ;
}
}
private void
Truncate
(
)
{
this.tracker.Clear() ;
this.type = PIEBALD.Types.DiffSectionType.Copy ;
this.source.RemoveRange ( 0 , this.sourceindex ) ;
this.target.RemoveRange ( 0 , this.targetindex ) ;
this.sourceindex = 0 ;
this.targetindex = 0 ;
return ;
}
...
}
2009-09-20: Changed sourceindex
and targetindex
from ushort
to int
.
AddCandidate
This method will instantiate a new DiffCandidate
, check to see if it's unique, and if so put it in the tracker. If the new candidate is not unique, it will be dropped.
private void
AddCandidate
(
PIEBALD.Types.DiffSectionType Type
,
int SourceIndex
,
int TargetIndex
)
{
PIEBALD.Types.DiffCandidate temp ;
temp = new PIEBALD.Types.DiffCandidate
(
Type
,
SourceIndex
,
TargetIndex
) ;
for ( int i = this.tracker.Count - 1 ; i >= 0 ; i-- )
{
if ( this.tracker [ i ].Hash == temp.Hash )
{
return ;
}
}
this.tracker.Enqueue ( temp ) ;
return ;
}
GetCandidate
This method retrieves the next candidate from the queue, gets the next pieces of text from the buffers, and determines whether or not the text matches.
private void
GetCandidate
(
)
{
this.state = EngineState.None ;
if ( this.HasCandidates )
{
PIEBALD.Types.DiffCandidate Candidate = this.tracker.Dequeue() ;
this.type = Candidate.Type ;
this.sourceindex = Candidate.SourceIndex ;
this.targetindex = Candidate.TargetIndex ;
this.state |= EngineState.GotCandidate ;
}
T sourcecontent ;
if ( this.source.TryGet ( this.sourceindex , out sourcecontent ) )
{
this.state |= EngineState.ReadSource ;
}
T targetcontent ;
if ( this.target.TryGet ( this.targetindex , out targetcontent ) )
{
this.state |= EngineState.ReadTarget ;
}
if
(
( ( this.state & EngineState.ReadBoth ) == EngineState.ReadBoth )
&&
( sourcecontent.CompareTo ( targetcontent ) == 0 )
)
{
this.state |= EngineState.Equal ;
}
return ;
}
PromoteCandidate
This method is used to construct a DiffSectionBase<T>
from a DiffCandidate
once the candidate has been selected as the optimal edit.
The buffers are then truncated in preparation for the next diff section.
private PIEBALD.Types.DiffSectionBase<T>
PromoteCandidate
(
)
{
PIEBALD.Types.DiffSectionBase<T> result = null ;
switch ( this.type )
{
case PIEBALD.Types.DiffSectionType.Copy :
{
result = new PIEBALD.Types.CopyDiffSection<T>
(
this.source.GetRange ( 0 , this.sourceindex )
) ;
break ;
}
case PIEBALD.Types.DiffSectionType.Delete :
{
result = new PIEBALD.Types.DeleteDiffSection<T>
(
this.source.GetRange ( 0 , this.sourceindex )
) ;
break ;
}
case PIEBALD.Types.DiffSectionType.Insert :
{
result = new PIEBALD.Types.InsertDiffSection<T>
(
this.target.GetRange ( 0 , this.targetindex )
) ;
break ;
}
case PIEBALD.Types.DiffSectionType.Replace :
{
result = new PIEBALD.Types.ReplaceDiffSection<T>
(
this.source.GetRange ( 0 , this.sourceindex )
,
this.target.GetRange ( 0 , this.targetindex )
) ;
break ;
}
}
this.Truncate() ;
return ( result ) ;
}
DoDiff
DoDiff
controls the flow of the algorithm.
- Get the next candidate from the queue and set the fields (there is no candidate on the first pass, but the fields are set properly by the constructor)
- The state field will hold flags indicating whether or not there was a candidate, which buffers hold data, and whether or not the specified source and target texts match.
- If the current candidate has hit the limit and there are no other candidates, then promote it. In either case, drop it and keep going.
- Otherwise, act in accordance with the state:
- If no text remains, we're done -- promote the final candidate (if any) and exit.
- If we have only source text left, then we'll be producing delete operations from here on; but first, if the latest candidate is a copy, we need to promote it.
- If we have only target text left, then we'll be producing insert operations from here on; but first, if the latest candidate is a copy, we need to promote it.
- If we have text from both streams, but they don't match, then we need to consider both a delete and an insert operation; but first, if the latest candidate is a copy, we need to promote it.
- If we have text from both streams, and they do match, then the next candidate will be a copy; but first, if the latest candidate is not a copy, we need to promote it.
private System.Collections.Generic.IEnumerable<PIEBALD.Types.DiffSectionBase<T>>
DoDiff
(
)
{
while ( true )
{
this.GetCandidate() ;
if
(
( this.sourceindex == PIEBALD.Types.DiffCandidate.Limit )
||
( this.targetindex == PIEBALD.Types.DiffCandidate.Limit )
)
{
if ( !this.HasCandidates )
{
yield return ( this.PromoteCandidate() ) ;
}
continue ;
}
switch ( this.state )
{
case EngineState.GotCandidate :
{
yield return ( this.PromoteCandidate() ) ;
goto case EngineState.None ;
}
case EngineState.None :
{
this.Dispose() ;
yield break ;
}
case EngineState.ReadSource | EngineState.GotCandidate :
{
if ( this.type == PIEBALD.Types.DiffSectionType.Copy )
{
yield return ( this.PromoteCandidate() ) ;
}
goto case EngineState.ReadSource ;
}
case EngineState.ReadSource :
{
this.AddCandidate
(
this.type | PIEBALD.Types.DiffSectionType.Delete
,
this.sourceindex + 1
,
this.targetindex
) ;
break ;
}
case EngineState.ReadTarget | EngineState.GotCandidate :
{
if ( this.type == PIEBALD.Types.DiffSectionType.Copy )
{
yield return ( this.PromoteCandidate() ) ;
}
goto case EngineState.ReadTarget ;
}
case EngineState.ReadTarget :
{
this.AddCandidate
(
this.type | PIEBALD.Types.DiffSectionType.Insert
,
this.sourceindex
,
this.targetindex + 1
) ;
break ;
}
case EngineState.ReadBoth | EngineState.GotCandidate :
{
if ( this.type == PIEBALD.Types.DiffSectionType.Copy )
{
yield return ( this.PromoteCandidate() ) ;
}
goto case EngineState.ReadBoth ;
}
case EngineState.ReadBoth :
{
if ( this.sourceindex > this.targetindex )
{
this.AddCandidate
(
this.type | PIEBALD.Types.DiffSectionType.Insert
,
this.sourceindex
,
this.targetindex + 1
) ;
this.AddCandidate
(
this.type | PIEBALD.Types.DiffSectionType.Delete
,
this.sourceindex + 1
,
this.targetindex
) ;
}
else
{
this.AddCandidate
(
this.type | PIEBALD.Types.DiffSectionType.Delete
,
this.sourceindex + 1
,
this.targetindex
) ;
this.AddCandidate
(
this.type | PIEBALD.Types.DiffSectionType.Insert
,
this.sourceindex
,
this.targetindex + 1
) ;
}
break ;
}
case EngineState.Equal | EngineState.GotCandidate :
{
if ( this.type != PIEBALD.Types.DiffSectionType.Copy )
{
yield return ( this.PromoteCandidate() ) ;
}
goto case EngineState.Equal ;
}
case EngineState.Equal :
{
this.AddCandidate
(
PIEBALD.Types.DiffSectionType.Copy
,
this.sourceindex + 1
,
this.targetindex + 1
) ;
break ;
}
}
}
}
2009-09-20: No more casting of sourceindex
and targetindex
.
In case EngineState.ReadBoth
, bias the addition of Delete
and Insert
candidates to favor replaces.
A Note about Replace-biasing
A difference section is a sequence of deletes and inserts. The order of the deletes and inserts is not important; the number of them is. On a Levenshtein table, an edit section with five deletes and three inserts will equate to three replaces and two deletes. If the number of deletes and inserts in a difference section are equal, then when mapped on a Levenshtein table it will be diagonal. If the number of deletes and inserts in a difference section are not equal, then it will depart from diagonal. The shortest edit distance will likely be close to diagonal, so we would prefer a more-diagonal edit sequence to a less-diagonal edit sequence. The first candidate that reaches a copy operation will be the one chosen. When we are adding both a delete and an insert to a DiffCandidate , we need to decide which to add first. By choosing to add an insert first when the candidate already has more deletes than inserts, or adding a delete first when the candidate already has more inserts than deletes, the result is that the resultant candidate that is more-diagonal will be evaluated (and perhaps promoted) before the one that is less-diagonal.
|
Diff
Diff
is static
. It handles the gruntwork of wrapping the documents in ItemBuffer
s, instantiating a DiffEngine
to perform the diff
, and kicking off the diff
.
public static System.Collections.Generic.IEnumerable<PIEBALD.Types.DiffSectionBase<T>>
Diff
(
System.Collections.Generic.IEnumerable<T> Source
,
System.Collections.Generic.IEnumerable<T> Target
)
{
DiffEngine<T> engine = new DiffEngine<T>
(
new PIEBALD.Types.ItemBuffer<T> ( Source )
,
new PIEBALD.Types.ItemBuffer<T> ( Target )
) ;
return ( engine.DoDiff() ) ;
}
If you choose to use the CharBuffer
or your own customized ItemBuffer
, you can use this overload of Diff
to perform the Diff
.
public static System.Collections.Generic.IEnumerable<PIEBALD.Types.DiffSectionBase<T>>
Diff
(
PIEBALD.Types.ItemBuffer<T> Source
,
PIEBALD.Types.ItemBuffer<T> Target
)
{
return ( ( new DiffEngine<T>
(
Source
,
Target
) ).DoDiff() ) ;
}
PIEBALDdiff
Which brings us to the demo app. PIEBALDdiff
is a console application; it takes the names of two files as parameters. If both files exist, they get wrapped in string
enumerators which get passed to the DiffEngine<string>
. The resulting DiffSection<string>
enumerator gets enumerated and the results displayed (with line numbers!).
namespace PIEBALD
{
using PIEBALD.Lib.LibExt.Lines ;
public static class PIEBALDdiff
{
[System.STAThreadAttribute()]
public static int
Main
(
string[] args
)
{
int result = 0 ;
try
{
if ( args.Length > 1 )
{
System.IO.FileInfo f1 = new System.IO.FileInfo ( args [ 0 ] ) ;
System.IO.FileInfo f2 = new System.IO.FileInfo ( args [ 1 ] ) ;
if ( f1.Exists && f2.Exists )
{
int sline = 0 ;
int tline = 0 ;
long bws = System.Diagnostics.Process.GetCurrentProcess().PeakWorkingSet64 ;
System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch() ;
sw.Start() ;
foreach
(
PIEBALD.Types.DiffSectionBase<string> section
in
PIEBALD.Types.DiffEngine<string>.Diff
(
f1.Lines()
,
f2.Lines()
)
)
{
switch ( section.Type )
{
case PIEBALD.Types.DiffSectionType.Copy :
{
foreach ( string s in section.SourceContent )
{
++sline ;
++tline ;
}
break ;
}
case PIEBALD.Types.DiffSectionType.Delete :
{
System.Console.WriteLine ( "*****Delete******" ) ;
foreach ( string s in section.SourceContent )
{
System.Console.WriteLine ( "{0,5}:{1}" , ++sline , s ) ;
}
System.Console.WriteLine ( "*****************\r\n" ) ;
break ;
}
case PIEBALD.Types.DiffSectionType.Insert :
{
System.Console.WriteLine ( "*****Insert******" ) ;
foreach ( string s in section.TargetContent )
{
System.Console.WriteLine ( "{0,5}:{1}" , ++tline , s ) ;
}
System.Console.WriteLine ( "*****************\r\n" ) ;
break ;
}
case PIEBALD.Types.DiffSectionType.Replace :
{
System.Console.WriteLine ( "*****Replace*****" ) ;
foreach ( string s in section.SourceContent )
{
System.Console.WriteLine ( "{0,5}:{1}" , ++sline , s ) ;
}
System.Console.WriteLine ( "*****With********" ) ;
foreach ( string s in section.TargetContent )
{
System.Console.WriteLine ( "{0,5}:{1}" , ++tline , s ) ;
}
System.Console.WriteLine ( "*****************\r\n" ) ;
break ;
}
}
}
sw.Stop() ;
long pws = System.Diagnostics.Process.GetCurrentProcess().PeakWorkingSet64 ;
System.Console.WriteLine ( "Elapsed : {0}" , sw.Elapsed ) ;
System.Console.WriteLine
(
"PeakWorkingSet64 : {0} ({1})"
,
pws
,
pws - bws
) ;
}
else
{
System.Console.WriteLine ( "At least one of the specified files does not exist." ) ;
}
}
else
{
System.Console.WriteLine ( "Syntax: PIEBALDdiff source target" ) ;
}
}
catch ( System.Exception err )
{
System.Console.WriteLine ( err ) ;
}
return ( result ) ;
}
}
}
(The included version also gives a summary of the difference sections.)
History
- 2009-08-28
- 2009-09-17
- Updated to present the generic version of the algorithm
- 2009-09-20
- Altered the interface of
DiffCandidate
to use int
rather than ushort
(to avoid a bunch of casting) - Biased the adding of
delete
and insert
candidates to favor replaces - Altered
ItemBuffer
so it can be derived - Added an overload of
Diff
that can take customized ItemBuffers