See Document Distance Problem Definition for the algorithm.
From that, it should be easy to write a C# program that calculates the distance, e.g. in pseudo code:
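using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Angle (in radians) between the word-frequency vectors of both texts.
double DocumentDistance(string textA, string textB)
{
    Dictionary<string, int> binsA = CalculateWordFrequencies(textA);
    Dictionary<string, int> binsB = CalculateWordFrequencies(textB);
    double innerProduct = CalculateInnerProduct(binsA, binsB);
    double normA = CalculateNorm(binsA);
    double normB = CalculateNorm(binsB);
    // Clamp to 1.0: rounding can push the cosine slightly above 1, which makes Math.Acos return NaN.
    // If either text contains no words, its norm is 0 and the result is NaN.
    return Math.Acos(Math.Min(1.0, innerProduct / (normA * normB)));
}

// Counts how often each word occurs in the text.
Dictionary<string, int> CalculateWordFrequencies(string text)
{
    Dictionary<string, int> bins = new Dictionary<string, int>();
    foreach (string word in GetWords(text))
    {
        if (bins.ContainsKey(word)) bins[word]++;
        else bins.Add(word, 1);
    }
    return bins;
}

// Splits the text into words (runs of word characters).
IEnumerable<string> GetWords(string text)
{
    return Regex.Matches(text, @"\b\w+\b").Cast<Match>().Select(m => m.Value);
}

// Sum over all words of (frequency in A) * (frequency in B).
double CalculateInnerProduct(Dictionary<string, int> binsA, Dictionary<string, int> binsB)
{
    double product = 0.0;
    foreach (string word in binsA.Keys.Concat(binsB.Keys).Distinct())
    {
        int frequencyA = binsA.ContainsKey(word) ? binsA[word] : 0;
        int frequencyB = binsB.ContainsKey(word) ? binsB[word] : 0;
        // Convert before multiplying to avoid int overflow.
        product += (double)frequencyA * frequencyB;
    }
    return product;
}

// Euclidean length of the word-frequency vector.
double CalculateNorm(Dictionary<string, int> bins)
{
    double sum = 0.0;
    foreach (int frequency in bins.Values)
    {
        sum += (double)frequency * frequency;
    }
    return Math.Sqrt(sum);
}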
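To my understanding, it works for plain text files as well as for XML files: the word-splitting regex also treats tag names, attribute names and attribute values as words. If both files contain exactly the same words with the same frequencies, the distance will be 0. If some elements or attributes differ, the distance will be greater than 0. Keep in mind that only word counts are compared, not the order of the words.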
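To illustrate what the word splitting does with markup, here is a small sketch that applies the same pattern as GetWords above to an XML fragment. The class name XmlTokenDemo and the sample input are made up for this illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class XmlTokenDemo
{
    // Same word pattern as GetWords above: runs of word characters.
    public static string[] GetWords(string text)
    {
        return Regex.Matches(text, @"\b\w+\b").Cast<Match>().Select(m => m.Value).ToArray();
    }

    static void Main()
    {
        // Hypothetical sample input, not taken from the original question.
        string xml = "<item id=\"1\">apple</item>";
        Console.WriteLine(string.Join(", ", GetWords(xml)));
        // prints: item, id, 1, apple, item
    }
}
```

The tag name shows up twice (opening and closing tag), so tags add to the word counts just like ordinary text.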
Cheers
Andi
PS: The pseudo code above follows the description in the referenced document; optimization is left as an exercise for you. For example, calculating the inner product can be improved by iterating only over the intersection of both bins' keys, which removes the need to check for existence in each of the bins. Reason: a word that occurs in only one of the bins contributes 0 to the product.
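One way the intersection idea from the PS could look is sketched below. It is only one possible variant, assuming the bins are built as in CalculateWordFrequencies; the class name InnerProductSketch and the sample bins are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class InnerProductSketch
{
    // Only words present in both bins can contribute a non-zero term,
    // so iterate over the intersection of the key sets.
    public static double CalculateInnerProduct(
        Dictionary<string, int> binsA, Dictionary<string, int> binsB)
    {
        double product = 0.0;
        foreach (string word in binsA.Keys.Intersect(binsB.Keys))
        {
            // Both lookups are safe: the word is a key in both bins.
            product += (double)binsA[word] * binsB[word];
        }
        return product;
    }

    static void Main()
    {
        // Hypothetical word counts, for illustration only.
        var binsA = new Dictionary<string, int> { { "the", 2 }, { "cat", 1 } };
        var binsB = new Dictionary<string, int> { { "the", 1 }, { "dog", 3 } };
        Console.WriteLine(CalculateInnerProduct(binsA, binsB));
        // prints: 2  (only "the" is shared: 2 * 1)
    }
}
```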