Introduction
Levenshtein's distance measures the minimum number of character edits required to change a word into another. In this tip, we'll see a simple implementation of the Levenshtein algorithm in Visual Basic. It will be useful in several situations, when managing - for example - large amount of text, and we are in need of fast and massive modifications in our data, to spot and correct typos, or to match string
s in terms of similarity, and not simply in terms of equal/not equal. Levenshtein's algorithm allows to compute a score regarding string
similarity. We'll see how in a moment.
Standard Levenshtein Algorithm
Here follows the standard Levenshtein implementation in VB.NET, according to the algorithm as shown at Wikipedia.
Public Function Levenshtein(ByVal s As String, ByVal t As String) As Integer
Dim n As Integer = s.Length
Dim m As Integer = t.Length
Dim d(n + 1, m + 1) As Integer
If n = 0 Then
Return m
End If
If m = 0 Then
Return n
End If
Dim i As Integer
Dim j As Integer
For i = 0 To n
d(i, 0) = i
Next
For j = 0 To m
d(0, j) = j
Next
For i = 1 To n
For j = 1 To m
Dim cost As Integer
If t(j - 1) = s(i - 1) Then
cost = 0
Else
cost = 1
End If
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
Return d(n, m)
End Function
Examples
Using the above function(s) is trivial: it's sufficient to call it by passing two string
s, plus the optional boolean flag for case-sensitivity check. Some examples could be:
MsgBox(Levenshtein("Inwards", "inwards").ToString) MsgBox(Levenshtein("Inwards", "inwards").ToString) MsgBox(Levenshtein("towards", "towards").ToString) MsgBox(Levenshtein("dinner", "breakfast").ToString) MsgBox(Levenshtein("breakfast", "braekfast").ToString)
MsgBox(Levenshtein("efficient", "sufficient").ToString) MsgBox(Levenshtein("grandma", "anathema").ToString) MsgBox(Levenshtein("aunt", "ant").ToString)
The results (1, 0, 0, 9, 2, 2, 5, 1) are the Levenshtein's distances between given string
s, i.e., a score regarding string
s similarity. The lower the score, the nearer are the two entities. A value of zero means, obviously, a total convergence of the two. We could use a function like this (with predetermined conditions to be satisfied, like "the score must not exceed 3", for example) to correct typos (as in the "breakfast - braekfast" example), or to search for differences in hypothetical data (like "new York - New York"), and so on.
Bibliography
History
- 2015-01-06 Added standard algorithm, revised text, revised code
- 2015-01-03 First release for CodeProject