Introduction
The initial motivation for writing this code was to compare the staging version of an HTML document with the current production version inside RainbowPortal when workflow is turned on, so that the editor and/or approver can visually compare what has been modified. The code is not published in RainbowPortal.
After some research, I chose to use the algorithm proposed by Eugene W. Myers in his paper "An O(ND) Difference Algorithm and Its Variations", which is also the algorithm used by Unix diff. A copy of the paper can be found here.
The Unix diff implementation found here was used as a reference during the implementation.
Design
The overall requirements can be summarized briefly as follows:
- The text to be compared is HTML, which may contain HTML tags, HTML comments, whitespace, and special characters such as "&nbsp;" and "{".
- The documents should be compared as a whole instead of line by line.
- Only the differences in the content of the documents matter, because the formatting is already displayed visually.
- The differences should be highlighted.
Since this is going to be a document-wise comparison, for the sake of performance and the convenience of comparing only the content of the documents, I decided to compare the documents word by word instead of character by character. The process can thus be split into two parts. The first is to parse the document into tags, natural words, punctuation, special characters, whitespace, and so on. The second is to compare the words, punctuation, and some of the special characters to find the differences. Since the items ignored during the comparison are still needed to reconstruct the document, they are retained by associating them with the words they are adjacent to.
A class Word is defined to represent a parsed word together with the tags, whitespace, and special characters before and after it. The definition of the Word class looks like this:
internal class Word : IComparable
{
    private string _word = string.Empty;
    private string _prefix = string.Empty;
    private string _suffix = string.Empty;

    public Word() {...}
    public Word(string word, string prefix, string suffix) {...}
    ...
    public string reconstruct() {...}
    public string reconstruct(string beginTag, string endTag) {...}
    public int CompareTo(object obj) {...}
}
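The article shows only the skeleton of the class. As an illustration, here is one plausible way it could be filled in, assuming the prefix/word/suffix layout described above and the rule that only the word itself takes part in the comparison; the bodies below are my own sketch, not the article's actual code:

using System;

// A sketch only: the method bodies are assumptions based on the described design.
internal class Word : IComparable
{
    private string _word = string.Empty;
    private string _prefix = string.Empty;
    private string _suffix = string.Empty;

    public Word(string word, string prefix, string suffix)
    {
        _word = word;
        _prefix = prefix;
        _suffix = suffix;
    }

    // Unchanged word: emit it exactly as it appeared in the source document.
    public string reconstruct()
    {
        return _prefix + _word + _suffix;
    }

    // Changed word: wrap only the word itself in the caller's highlight tags,
    // leaving the surrounding tags and whitespace untouched.
    public string reconstruct(string beginTag, string endTag)
    {
        return _prefix + beginTag + _word + endTag + _suffix;
    }

    // Only the word takes part in the comparison; the prefix and suffix
    // (tags, whitespace) are ignored, as the design requires.
    public int CompareTo(object obj)
    {
        return string.Compare(_word, ((Word)obj)._word);
    }
}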
The parsing follows these rules:
- Anything starting with '<' and ending with '>' is treated as an HTML tag.
- HTML tags and whitespace are treated as a prefix or suffix to adjacent words and are put in the prefix or suffix fields of the Word object.
- English words separated by space(s), "&nbsp;", "&#xxx;" entities, or trailing punctuation are treated as words and are put in the word field of the Word class.
- Whitespaces == {' ', '\t', '\n'}.
A strongly typed collection WordsCollection is also defined to hold the parsed words.
The class HtmlTextParser is defined with one static method to do the parsing:
internal class HtmlTextParser
{
    static public WordsCollection parse(string s) {...}
}
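As an illustration of the rules above (the exact splitting shown here is my reading of those rules, not output captured from the actual parser), a small fragment would be parsed roughly like this:

// Illustrative only: the expected result below is an assumption based on the
// parsing rules listed earlier, not verified output of the parser.
WordsCollection words = HtmlTextParser.parse("<p>Hello <b>world</b></p>");

// Roughly the expected result:
//   word = "Hello"   prefix = "<p>"   suffix = " "
//   word = "world"   prefix = "<b>"   suffix = "</b></p>"
//
// The tags and the space never take part in the comparison; they simply
// travel with the adjacent word so the document can be reconstructed later.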
Finally, the class Merger is defined to do the dirty work of comparing and merging:
class Merger
{
    private WordsCollection _original;
    private WordsCollection _modified;
    private IntVector fwdVector;
    private IntVector bwdVector;

    public Merger(string original, string modified) {...}
    ...
    private MiddleSnake findMiddleSnake(Sequence src, Sequence des) {...}
    private string doMerge(Sequence src, Sequence des) {...}
    ...
    public string merge() {...}
}
The constructor of the Merger class takes the two HTML text strings and calls HtmlTextParser.parse() to parse them into two word collections. When the public method merge() is called, it constructs Sequence objects from the collections and calls the private method doMerge(), which recursively compares and merges the whole collections.
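doMerge() itself uses Myers' linear-space divide-and-conquer around the middle snake, which is too long to reproduce here. To illustrate the general idea of a word-level diff that wraps changes in highlight tags, here is a deliberately simplified, self-contained sketch that uses a quadratic LCS table instead of the O(ND) algorithm and plain strings instead of the Word and Sequence classes; the real engine additionally keeps each word's prefix and suffix and uses reconstruct(beginTag, endTag) to emit the highlights:

using System;
using System.Text;

// Deliberately simplified sketch: a quadratic LCS-based word diff, NOT the
// O(ND) middle-snake algorithm that the article's Merger class implements.
internal class SimpleWordDiff
{
    public static string Merge(string[] src, string[] des)
    {
        // lcs[i, j] = length of the longest common subsequence of
        // src[i..] and des[j..].
        int[,] lcs = new int[src.Length + 1, des.Length + 1];
        for (int i = src.Length - 1; i >= 0; i--)
            for (int j = des.Length - 1; j >= 0; j--)
                lcs[i, j] = src[i] == des[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table, wrapping insertions and deletions in highlight tags.
        StringBuilder result = new StringBuilder();
        int x = 0, y = 0;
        while (x < src.Length || y < des.Length)
        {
            if (x < src.Length && y < des.Length && src[x] == des[y])
            {
                result.Append(src[x]).Append(' ');          // unchanged word
                x++; y++;
            }
            else if (y < des.Length &&
                     (x == src.Length || lcs[x, y + 1] >= lcs[x + 1, y]))
            {
                result.Append("<span style='background:yellow'>")
                      .Append(des[y]).Append("</span> ");   // inserted word
                y++;
            }
            else
            {
                result.Append("<span style='text-decoration:line-through'>")
                      .Append(src[x]).Append("</span> ");   // deleted word
                x++;
            }
        }
        return result.ToString().TrimEnd();
    }
}

Splitting the two documents into word arrays and passing them to SimpleWordDiff.Merge() produces roughly the kind of highlighted output the real Merger generates, only without the retained tags and whitespace.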
The engine and the test code are separated into two projects. In the test/bin/release directory there are old.html and new.html files; running the test program generates a file called merged.html. Open them in a browser to view the results.
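Driving the engine looks roughly like this (the file names come from the test description above, and the Merger API is the one declared earlier):

// Mirrors what the test program does with old.html and new.html.
string oldHtml = System.IO.File.ReadAllText("old.html");
string newHtml = System.IO.File.ReadAllText("new.html");

Merger merger = new Merger(oldHtml, newHtml);   // parses both documents
string mergedHtml = merger.merge();             // compares and merges them

System.IO.File.WriteAllText("merged.html", mergedHtml);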
Conclusion
When the HTML document is about 170 pages long with roughly 50% of it modified, the engine takes about 15 seconds to parse, compare, and merge the two files. If the document is shorter than 10 pages, it takes one to two seconds or less. The compared results have been tested and are very accurate.