Introduction
Calculating the minimum difference between two sets of data is a common requirement. The
compare<>
class can be used to identify the minimum difference between two
sets of arbitrary types. It is a template class that uses a generic data source class
interface to supply data to it. This means that the algorithm is entirely independent
of the data, and the representation of the data.
Algorithm
To calculate the minimum difference between two data sets is to reverse the problem and
calculate the Longest Common Subsequence and use the result of this to determine
the differences. The LCS problem is well documented on the
Internet
and in research papers. I have implemented the Iterative LCS algorithm described by
Professor David Eppstein
at the University of California.
The LCS will identify the longest groups of elements common between the two sources. By
definition, extracting these groups from the original will identify the minimum non-common
elements between the two.
Implementation
The algorithm is encapsulated entirely within one class, compare<>
which is
implemented within the cmp
namespace. cmp::compare<>
is a template
class that is instantiated with a data source class to supply its data. I have included
two data sources for demonstration purposes. The first is another template class called
cmp::data_source<>
. This can be used for any basic data type, for example to
compare to text strings. The second is cmp::data_source_text_file
. used to compare
two text files
cmp::data_source<> example
To make life a little easier, we can typedef
the cumbersome template classes:
typedef cmp::data_source<const char> compare_data_source_t;
typedef cmp::compare<compare_data_source_t> compare_t;
Here's the data to compare:
const char str1[] = "little jack horner";
const char str2[] = "sat in a corner";
So we simply instantiate two data sources, and give them the data:
compare_data_source_t compare_data1(str1, strlen(str1));
compare_data_source_t compare_data2(str2, strlen(str2));
Now we can instantiate the cmp::compare<>
class:
compare_t compare(&compare_data1, &compare_data2);
And finally we can call process()
compare_t::result_set seq;
int lcs = compare.process(&seq);
process()
returns an integer which is the length of the Longest Common
Subsequence. This isn't really very useful to us when trying to determine the
difference, but the return value can give two useful pieces of information.
A return value of -1
indicates an error, and 0
(zero)
indicates that the two data sets are identical.
The seq
parameter will, on successful return, contain the result
sequence. This contains a list of records from the two data sets, along with
information about their relationship to the second dataset. Each record will
be marked as KEEP
, REMOVE
, or INSERT
.
These are descriptive of creating the second data set from the first, thus:
KEEP
- records in both data sources
REMOVE
- records in the first, but not the second data set
INSERT
- records not in the first, but in the second data set
Memory requirements
The LCS implementation is heavy on memory allocation. It requires
n*m*sizeof(short)
bytes of storage where n
and m
are the number of records in each of the data sets.
For example, to compare two 1KB datasets requires 1MB of storage!
To overcome this overhead, we compare larger blocks of data and subdivide
differing blocks and perform a separate LCS process on those. For example,
the cmp::data_source_text_file
class compares text files line
by line to yield a result similar to that of a version control system,
identifying which lines in the two files are different. To identify character
changes in a text file, each line that is different could be passed through a
cmp::data_source<char>
as in the first example.
Contact