(untagged)

V Diff - A File Comparer with visual output

shankarp

0.00/5 (No votes)

5 Mar 2003

A simple program that uses the LCS algorithm to compare two files.

HTML output produced by VDiff

Introduction

This program compares two text files and produces the a difference analysis in various colours in plain HTML format.

Background

Comparing two files (or rather two versions of same file) is a problem programmers come across too often. Remember the way ClearCase and other versioning systems show the modifications you made to your code: how nice it would be if your application can provide such functionality to the user, There are famous tools under every platform which solves this problem, 'WinDiff' on Windows, 'diff' on UNIX platforms.

The source code for these programs are freely available too, But they are

Not simple (since they have passed through intensive optimization cycles)
Not well documented (as there were no good reason to)
Hard to understand; they never explain the underlying algorithms.
They are not designed to be integrated into your application.

VDiff is simple & easy to understand and it is written in pure C++, So it compiles literally under any platform (tested using VC++ 6.0, & g++ 2.95). The output of this program works fine with IE and Netscape (with minor compromise on scrolling functionality), and is easy to integrate into your own application. On the other hand, it is not as fast as the other famous tools, ut I am sure future versions will be comparable to those in terms of speed.

VDiff uses the Longest Common Sequence (LCS) as a algorithm to differenciate two files. LCS is a widely used algorithm in various fields (for instance in genetic engineering to compare DNA Sequences). Given two sequential array of entities, the algorithm will find the longest sequence of entities which is found in both arrays.

In our problem, an entity is a line of code. The two files are arrays of entities. The LCSequence produced by the algorithm in this case is a sequence of lines which are unchanged: keeping that as reference we can produce a picture of added/deleted/changed lines.

I strongly suggest you look at following material on LCS before going to code.

Dan Hirsberg is one of the main contributors to LCS. Find his work here.

Using the code

CCmpMngr is the interface class which provides access to the whole functionality. An example is worth a thousand words so please have a look at the code snippet below. This is the simple way of invoking it into your code.

int main(int argCount, char **argArray)
{
    if (argCount < 3)
    {
        cout<<"Usage 'cmp file1 file2 [output dir]' "<<"\n";
        return 0;
    }
    else
    {
        CCmpMngr cMgr;
        std::string file1,file2,oDir;
        
        cMgr.baseFile = std::string(argArray[1]);
        cMgr.compFile = std::string(argArray[2]);
        cMgr.compare();
        if (argCount > 3) // Visual diffrence is needed

        {
            oDir = std::string(argArray[3]);
           if (oDir[oDir.length()-1] != '\\') oDir.append(std::string("\\"));
            cMgr.print_diff_in_HTML(oDir);
        }
        cout<<"Added Lines = "<<cMgr.nAddedLines<<"\n";
        cout<<"Deleted Lines = "<<cMgr.nDeletedLines<<"\n";
        cout<<"Changed Lines = "<<cMgr.nChangedLines<<"\n";
    }
    return 1;
}

In the code above, the files to be compared are taken from command line arguments. Please notice the class CCmpMngr which has two members baseFile & compfile: they should be assigned with a valid filename including full path. Once that is done call the Compare() method which will do the real comparison. After the comparison the statistics of difference is available through members nAddedLines, nDeletedLines, and nChangedLines. Now the most interesting part, the visual presentation of the differences in various colours. That is produced by the method print_diff_in_HTML(oDir). The variable passed is a preferred path name where you want the three HTML files (explained later) to be produced.

How comparison is implemented

The LCS algorithm is fairly simple to understand, Let us say that we are given with two text files. The two files are first read in as lines. Since text lines are time consuming to compare, they are converted to a integer value through very simple hashing. It is done by hash(filemap& f, hashed_filemap& hf). Now we are left with 2 arrays bHm and cHm, (bHm = base Hashed Map, cHm = comp Hashed Map). For each member of bHm we are supposed to find a 'best' match from cHm, Or you can declare there is no match, Now imagine comparing two '.CPP' files. You are at line 3 and it is a single character line with '{'. You will get at least a hundred matches from the second file. The criteria to choose a best match from possible matches is...,.hmmmm.., Well it is a decision you make. If you take a decision at that point which leads to finding a long common sequence, then it is a good decision, however if it leads to the LONGEST common sequence, then it becomes BEST decision, thus the BEST match.

Since computers are stupid and have no common sense, the only way to choose the best decision is to take all the decisions, compare outcomes and decide best. In order to find the best match (decision) for a member in bHm we have to iterate through each element (decision) in cHm,

for (i = 0; i <= m; i++) // For each line in the first file....

{        
    for (j = 0; j <= n; j++) // ....Find the best match in second file

    {
    }
}

Now imagine somehow you have done it for all the elements and are now at the last element. The decision making is fairly simple because all the elements except the last few in cHm would have been declared as best matches for some element in bHm, and you are left with one or two unmatched elements, and most probably there will be no question of the best match because only one will be matching. Even if you are left with 3 lines and all the three matches perfectly, you can blindly choose any because this decision will not effect the matching of elements remaining (=zero) in bHm.

Please note here the last statement. The decision you make at point 'x' impacts matching of following elements, on the other hand, you need the data of matching for remaining elements in order to decide the best match. To avoid this mutual dependency, we can start from last element that had no elements following, thus no data is required.

for (i = m; i >= 0; i--) // For each line in the first file....

{
    for (j = n; j >= 0; j--) // ....Find the best match in second file

    {
    }
}

Inside the loop we prepare a matrix where row and column represent the elements in first and second sequence. The matrix is composed of lengths of the longest common subsequence for each match. Please note that for each match the length of best available LCS is increased by one, otherwise the the best LCS remains the same.

if (bHm[i].second == cHm[j].second) // match

{
    LCSMatrix(i,j) = 1 + LCSMatrix(i+1, j+1); // Increment

}
else 
{
    // retain the best among available LCS

    LCSMatrix(i,j) = max(LCSMatrix(i+1, j), LCSMatrix(i, j+1));
}

Please note the matrix LCSMatrix is very memory consuming if you are comparing big files (i.e. if you are comparing files with 30000 lines each then you need 3.3GB of memory). Given that we need only LCSMatrix(i+1, j) and LCSMatrix(i, j) at a given point we can manage with two one dimensional arrays one for (i)'th row and another for (i+1)'th row, altering the code to that given below. By the way, if you don't understand the algorithm don't worry, there is a much better explanation with good illustration graphics, available from http://www.ics.uci.edu/~eppstein/161/960229.html.

bool CCmpMngr::compare_hashed_files_LCS(hashed_filemap &bHm,
      hashed_filemap& cHm,match& dHm)
{    
    int i,j,k,m,n;    
    m = bHm.size(); n= cHm.size(); // length of two sequences to be compared

    LCSBitMatrix.resizeTo(m+2,n+2); // fix the size of bit matrix.

    std::vector ci,ci1;

    ci.resize(n+2,0);ci1.resize(n+2,0); // resize for ample space

    for (i=0;i<=n;i++) {ci[i]=0;ci1[i]=0;} // initialize arrays


    // Prepare LCS matrix

    for (i = m; i >= 0; i--) // For each line in the first file....

    {        
        // Make a copy for comparison, We r going to initialize

        for (k=0;k<=n;k++) {ci1[k]=ci[k];} 
        
        for (k=0;k<=n;k++) {ci[k]=0;} // initialize 

        for (j = n; j >= 0; j--) // ....Find the best match in second file

        {                
            if (bHm[i].second == cHm[j].second)
            {                    
                ci[j] = 1 + ci1[j+1];
            }
            else 
            {
                if (ci1[j] > ci[j+1]) // Equivalent to above line

                {
                    // to remember the condition result even beyond loop

                    LCSBitMatrix.put(i,j,true); 

                    // to continue looping according to condition result 

                    ci[j] = ci1[j]; 
                }
                else
                {
                    // to remember the condition result even beyond loop

                    LCSBitMatrix.put(i,j,false); 

                    // to continue looping according to condition result

                    ci[j] = ci[j+1]; 
                }
            }                                                
            /* Same above done with huuuuge 2D array named LCSMatrix.
            if (bHm[i].second == cHm[j].second)
            {
                LCSMatrix(i,j) = 1 + LCSMatrix(i+1, j+1);
            }
            else 
            {
                LCSMatrix(i,j) = max(LCSMatrix(i+1, j), LCSMatrix(i, j+1));
            }
            */
        }
    }
}

Once the matrix is populated with LCS lengths at each point, we can produce the longest common sub sequence by walking through it as follows; start with the top left element, if the elements match, then move one step in row & one step in column, and add the elements to the LCS match list. If the elements doesn't match move one step ahead in the column or row depending on which direction has the longest LCS path available.

    dHm.empty();
    i = 0;j = 0;
    // Following link walks through the bitmatrix we created by the loop above

    // and produces a LCS, A LCS here is a vector of pairs.  A pair has two line

    // numbers each from different file which is perfect match.

    // If the pair has <12,17>.., Then it means

    // The 12th line in the File1 has became 17th line in File2, User might have

    // simply added 5 lines in between or added 8 lines and deleted 3 lines.. Etc

    while (i < m && j < n)
    {
        if (bHm[i].second == cHm[j].second)
        {
            dHm.push_back(match_pair (i,j)); // A pair match added to LCS.

            i++; j++;
       }
       else 
        {
            // In case we were using that Huuuuge 2D array, The condition would

            // will be like below             

            // if (LCSMatrix(i+1,j) >= LCSMatrix(i,j+1)) 

            if (LCSBitMatrix.get(i,j) == true) 
                i++;
            else
                j++;
        }
    }

    print_diff_in_HTML(std::string(""));
    return true;
}

How HTML output is produced

Now that we have a array of matching pairs. We can identify the added/deleted/changed lines. Again an example is worth a thousand words so let us start with a sample array of pairs

(1,1)
(2,2)
(3,3)
(4,6)
(5,7)
(6,8)
(9,10)
(10,11)
(11,12)
(12,13)
(15,16)

Please note that you have an indication on added lines from pairs 3 and 4 (lines 4 & 5 are newly added lines in second file), you get a indication of deleted lines from pairs 6 and 7 (lines 7 & 8 are deleted from first file) and you get a indication of changed lines from pairs 10 and 11 (lines 13 and 14 from the first file and 14 and 15 from thesecond file are changed).

This information is then encoded into HTML by the print_diff_in_HTML(), method. The HTML output is composed of 3 different frames: one frame each to show both the files being compared and one main frame. The following script which is included in two sub frames enables synchronized scrolling of both;

function myScroll()
{
    if (document.layers) {//for netscape

        x  = window.pageXOffset;
        y  = window.pageYOffset;
    }
    else if (document.all) {//for IE

        x = document.body.scrollLeft;
        y = document.body.scrollTop;
    }
    parent.left.scrollTo(x,y); 
}

Points of Interest

Besides file comparison, this program uses a rare data structure; a matrix of bits. The reason to write a new class which acts as a bit matrix is the uncertainty in the way data type bool is implemented in various platforms under various memory packing schemes.

The class CBitMatrix is built on sample code from the official C++ FAQ maintained by Marshall Cline. I had a confusion on the code (what happens when overloaded functions vary only in return type & 'const'ness') and mailed him. I was surprised by his reply and his professionalism to answer questions from a "nobody".

Let me reproduce the mail as it will be useful for you. Also it answers the question why the CBitMatrix doesn't have a simple interface to put & get members.

-----Mail from Marshall Cline -----
Subject: RE: Constructive feedback on C++ FAQ Lite[13.8] (Martix sample has a bug...?)
Date: Mon, 7 Oct 2002 13:01:46 -0500
From: "Marshall Cline"

Thanks for writing.

Unfortunately the code works *exactly* according to its intention. The non-const version is *supposed* to be called in both places you said. In particular, if you have two methods in a class X that differ in their constness, e.g.,
  class Foo {
  public:
    void f();
    void f() const;
  };
then the compiler is *required* to call the non-const version if and only if the Foo object is non-const. The return type is irrelevant, and is not normally considered during this "overloading resolution" process.

Marshall

-----Original Message-----
From: shankar pratap
Sent: Monday, October 07, 2002 9:45 AM
Subject: Constructive feedback on C++ FAQ Lite[13.8] (Martix sample has a bug...?)

Dear Mr.Marshall Cline

The piece of code mentioned in the subject line does not work according to the intention.

Functions
------------
1) int& operator() (unsigned int row, unsigned int col);
2) int operator() (unsigned int row, unsigned int col) const;

Intention
---------
Matrix m(10,10);
m(5,8) = 106.15; //supposed to invoke function 1) which returns int&
std::cout << m(5,8); //supposed to invoke function 2)

but function 1) is invoked in both situations

The compiler i tried are
1) g++
2) MSVC++ 6.0.

Best Regards
Shankar Pratap

That's all for now. Happy Coding!

History

Feb 11 2003
Feb 24 2003
1. Easy GUI with Progress bar Introduced.
2. White spaces are not removed from HTML output anymore.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here