Introduction
This article describes a way of capturing the similarity between two strings (or words). String similarity is a confidence score that reflects the relation between the meanings of two strings, which usually consist of multiple words or acronyms. In this approach, however, I am more concerned with a measurement that reflects the relation between the patterns of the two strings than with the meaning of the words.
I implemented this algorithm when I was developing a tool to make the matching between XML schemas semi-automatic.
Preparing the ground
In this article, I have implemented two algorithms:
- Levenshtein algorithm[1]
- The Kuhn-Munkres algorithm (also known as the Hungarian method)[2]
I will not go deep into the theory here; if you want to understand these algorithms in detail, please refer to algorithm textbooks. More information about them can be found at:
- Levenshtein
- The optimal assignment problem
Problem
The string similarity algorithm was developed to satisfy the following requirements:
- A true reflection of lexical similarity - strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.
- Robustness to changes of word order - two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
- Language independence - the algorithm should work not only in English, but also in many different languages.
Solution
The similarity is calculated in three steps:
- Partition each string into a list of tokens.
- Compute the similarity between tokens by using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).
- Compute the similarity between the two token lists.
Tokenization
A string is a list of words or abbreviations; it may follow Camel or Pascal casing, with no separator characters between the words. For example, 'fileName' is tokenized into "file" and "Name". This work is done by the tokeniser.cs class.
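As an illustration of the idea only (this is a minimal sketch, not the actual Tokeniser class shipped with the project, which may also handle separator characters, digits and acronyms), a Camel/Pascal-case splitter might look like this:

using System.Collections.Generic;

// Minimal sketch of Camel/Pascal-case splitting.
public static class SimpleTokenizer
{
    public static List<string> Tokenize(string input)
    {
        List<string> tokens = new List<string>();
        if (string.IsNullOrEmpty(input))
            return tokens;

        int start = 0;
        for (int i = 1; i < input.Length; i++)
        {
            // An upper-case letter marks the beginning of a new token.
            if (char.IsUpper(input[i]))
            {
                tokens.Add(input.Substring(start, i - start));
                start = i;
            }
        }
        tokens.Add(input.Substring(start));
        return tokens;
    }
}

Calling SimpleTokenizer.Tokenize("fileName") returns the two tokens "file" and "Name", matching the example above.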
Compute the similarity between two words
The first method uses an edit-distance string matching algorithm: Levenshtein. The string edit distance is the total cost of transforming one string into another using a set of edit rules, each of which has an associated cost. The Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of single-character insertion, deletion and substitution. In one common version, a substitution costs two units unless the source and target characters are identical (in which case it costs zero), while insertions and deletions cost half that. The implementation below uses the even simpler variant in which every operation costs one unit.
The following code shows a dynamic programming implementation of the algorithm:
private int ComputeDistance(string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] distance = new int[n + 1, m + 1];
    int cost = 0;

    // If one string is empty, the distance is the length of the other.
    if (n == 0) return m;
    if (m == 0) return n;

    // Initialize the first row and column of the distance matrix.
    for (int i = 0; i <= n; distance[i, 0] = i++);
    for (int j = 0; j <= m; distance[0, j] = j++);

    // Each cell holds the distance between the first i characters of s
    // and the first j characters of t.
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1)) ? 0 : 1;

            // Min3 returns the smallest of its three arguments:
            // deletion, insertion, or substitution.
            distance[i, j] = Min3(distance[i - 1, j] + 1,
                                  distance[i, j - 1] + 1,
                                  distance[i - 1, j - 1] + cost);
        }
    }
    return distance[n, m];
}
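For example, ComputeDistance("kitten", "sitting") returns 3: substitute 'k' with 's', substitute 'e' with 'i', and insert 'g'.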
The similarity score
I assume that all similarity scores are in the [0, 1] range: a score of 1 (the maximum value) means that the two strings are completely similar.
public float GetSimilarity(string string1, string string2)
{
    float dis = ComputeDistance(string1, string2);

    // Normalize by the length of the longer string.
    float maxLen = string1.Length;
    if (maxLen < string2.Length)
        maxLen = string2.Length;

    if (maxLen == 0.0F)
        return 1.0F;    // two empty strings are considered identical
    else
        return 1.0F - dis / maxLen;
}
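For example, GetSimilarity("test", "tet") returns 0.75: the edit distance is 1 (one deletion) and the longer string has four characters, so the score is 1 - 1/4 = 0.75.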
Similarity between two token lists
After splitting each string into a list of tokens, we capture the similarity between the two strings by computing the similarity of the two token lists, which reduces to the bipartite graph matching problem. A related classical problem on matching in bipartite graphs is the assignment problem: given non-negative ratings Cost[i,j] for assigning each worker i to each job j, find the assignment of workers to jobs that maximizes the sum of ratings.
The problem can now be described as follows:
Given a graph G(V, E), G can be partitioned into two disjoint sets of nodes, X (left) and Y (right), such that every edge connects a node in X with a node in Y, and each edge has a non-negative weight.
- X is the set of tokens of the first string.
- Y is the set of tokens of the second string.
E is the set of edges connecting pairs of vertices (x, y); the weight of the edge connecting x1 and y1 is computed as the similarity of token x1 and token y1 (using the GetSimilarity function).
Example: x1 = "test", y1 = "tet", e(x1, y1) = 0.75
The task is to find a subset of node-disjoint edges that has the maximum total weight. The similarity of the two strings is then computed from the weights of the matched token pairs.
Initialize the weight of the edges
The results of the GetSimilarity function are used to compute the weights of the edges:
w(x,y) = GetSimilarity(token[x], token[y])
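As a sketch of this step (the token arrays below are invented for the example; the real work happens inside MatchsMaker), the weight matrix can be filled like this:

// Sketch: build the bipartite weight matrix from pairwise token similarities.
// tokens1 and tokens2 are example token lists; GetSimilarity is the method shown earlier.
string[] tokens1 = { "file", "Name" };
string[] tokens2 = { "name", "of", "file" };

float[,] weight = new float[tokens1.Length, tokens2.Length];
for (int x = 0; x < tokens1.Length; x++)
{
    for (int y = 0; y < tokens2.Length; y++)
    {
        // w(x, y) = GetSimilarity(token[x], token[y])
        weight[x, y] = GetSimilarity(tokens1[x], tokens2[y]);
    }
}
// The next step (the Hungarian method in MatchsMaker) selects node-disjoint
// edges from this matrix so that the total weight is maximized.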
Connecting the edges to maximize the total weight
We use the Hungarian method to solve the maximum-weight bipartite matching problem. The algorithm is implemented in the MatchsMaker.cs class.
Future improvements
Computing semantic similarity between words using WordNet
The approach above gives us a similarity score between the patterns of two strings. However, sometimes we need a semantic measurement, which leads us to look for a semantic similarity. I am currently experimenting with using WordNet[1] to compute the similarity between words in the English dictionary.
WordNet
WordNet is a lexical database, available online, that provides a large repository of English lexical items. It was designed to establish connections between four types of POS (parts of speech): noun, verb, adjective, and adverb. The smallest unit in WordNet is the synset, which represents a specific meaning of a word. It includes the word, its explanation, and the synonyms of its meaning. The specific meaning of one word under one type of POS is called a sense. Each sense of a word is in a different synset; synsets are therefore equivalent to senses, i.e. structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example, the words night, nighttime and dark constitute a single synset that has the following gloss: "the time after sunset and before sunrise while it is dark outside".
Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernymy and hyponymy for nouns, hypernymy and troponymy for verbs) constitute kind-of hierarchies, while others (holonymy and meronymy for nouns) constitute part-of hierarchies. For example, tree is a kind of plant: tree is a hyponym of plant and plant is a hypernym of tree. Analogously, trunk is a part of tree: trunk is a meronym of tree and tree is a holonym of trunk. If one word has more than one sense for a given POS, WordNet orders them from the most frequently used to the least frequently used.
Based on WordNet and its .NET API (provided by Troy Simpson), I use synonym and hypernym relations to capture the semantic similarity of tokens. Given a pair of words, once a path that connects the two words is found, I determine their similarity based on two factors: the length of the path and the order of the senses involved in this path.
To find a connecting path between two words, we use a breadth-first search to check whether they are synonyms. Searching for a connection between two words in WordNet is an expensive operation because of the large search space, so we impose two restrictions to reduce the computational time. The first is that only synonym relations are considered (hyponym and hypernym relations will be considered later), since exhausting all the relations is too costly; this restriction is also adopted in some related work. The second is to limit the length of the search path. If no connecting path is found within that limit, we stop searching and report that no path was found; in this case we substitute the edit-distance measure instead.
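The following is a minimal sketch of such a depth-limited breadth-first search. GetSynonyms is a hypothetical helper standing in for the WordNet.Net lookup that returns the synonyms of a word (it is not a real API call), and maxDepth is the path-length limit mentioned above:

using System.Collections.Generic;

// Sketch of the depth-limited breadth-first search described above.
public static class SynonymPathSearch
{
    // Hypothetical stub: the real implementation would query WordNet.
    private static IEnumerable<string> GetSynonyms(string word)
    {
        return new string[0];
    }

    // Returns the length of a synonym path from source to target,
    // or -1 if no path is found within maxDepth.
    public static int FindPathLength(string source, string target, int maxDepth)
    {
        Queue<KeyValuePair<string, int>> queue = new Queue<KeyValuePair<string, int>>();
        HashSet<string> visited = new HashSet<string>();

        queue.Enqueue(new KeyValuePair<string, int>(source, 0));
        visited.Add(source);

        while (queue.Count > 0)
        {
            KeyValuePair<string, int> current = queue.Dequeue();

            if (current.Key == target)
                return current.Value;          // path found

            if (current.Value >= maxDepth)
                continue;                      // do not expand beyond the limit

            foreach (string synonym in GetSynonyms(current.Key))
            {
                if (visited.Add(synonym))
                    queue.Enqueue(new KeyValuePair<string, int>(synonym, current.Value + 1));
            }
        }
        return -1;                             // no path found: fall back to edit distance
    }
}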
The similarity score
Since only synonym relations are considered, we first use the following formula to calculate the semantic word-similarity score:
WordSim(s, t) = SenseWeight(s) * SenseWeight(t) / PathLength
Where:
- s and t: denote the source and target words being compared.
- SenseWeight: denotes a weight calculated according to the order of this sense and the count of total senses.
- PathLength: denotes the length of the connection path from s to t.
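For illustration only (the numeric sense weights and path length below are assumed values, not ones produced by the actual implementation), the formula could be evaluated like this:

// Hypothetical example values for the semantic word-similarity formula.
float senseWeightS = 0.9F;   // assumed SenseWeight of the source word s
float senseWeightT = 0.8F;   // assumed SenseWeight of the target word t
int pathLength = 2;          // assumed length of the synonym path from s to t

float wordSim = senseWeightS * senseWeightT / pathLength;   // 0.9 * 0.8 / 2 = 0.36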
(This work is experimental and is not available in this version; it will be released soon.)
Using the code
Add the similarity project to your workspace. Create a method like the following:
public void Test()
{
    string s1 = "The code project's article";
    string s2 = "Article of The CodeProject";

    // MatchsMaker computes the similarity of the two strings and
    // exposes it through the Score property.
    MatchsMaker match = new MatchsMaker(s1, s2);
    Console.WriteLine(match.Score);
}
where s1 and s2 are the two strings whose similarity is being captured. match.Score is the similarity score of the two strings. In this example, the score is 0.94.
References