Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / PowerShell

Levenshtein Distance in Windows PowerShell

5.00/5 (3 votes)
16 Aug 2010CPOL2 min read 26.7K  
A Script for determing the levenshtein distance between two strings in PowerShell

Introduction


Many popular search engines and/or Information Retrieval libraries(ie Lucene) provide Fuzzy string matching. This allows approximate string matching when no exact string matches occur. This is very useful for handling typos or misspelled search terms. The Levenshtein Distance is a common metric/algorithm for measuring the similarity of two strings. The distance between two strings is the # of edits needed to get from one string to the other. Edits in this context include insertion,removal and substitution. The script provided by this article get-ld.ps1 makes it very easy to leverage this approximate string matching and add one more utility in your arsenal of handy Powershell scripts.

Levenshtein Distance


For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten -> sitten (substitution of 's' for 'k')
  2. sitten -> sittin (substitution of 'i' for 'e')
  3. sittin -> sitting (insert 'g' at the end)


Using get-ld.ps1


The get-ld.ps1 script takes 3 parameters(2 mandatory and 1 optional). The first two parameters are the 2 strings being compared. The third optional parameter is a flag indicated if it should ignore case First example shows how Madonna and madona are two edits apart. First edit is substituting capital M for lowercase m. Second edit is adding the second n. The Second example shows how doing a case-insensitive comparison has a distance of only 1. The lowercase m to capital M substitution is not needed anymore so the only needed edit is the insertion of the second n.
PS C:\scripts> ./get-ld.ps1 Madonna madona
2
PS C:\scripts> ./get-ld.ps1 Madonna madona -i
1


Why should I care about the Levenshtein distance


If you're building any type of search application you may have to deal with approximate string matching. This is especially true if you have text input from a user. People are relying on spell checkers and technology more and more to aid them in tasks like spelling and grammar. So you should plan on handling user input in a very flexible manner if you can. And Levenshtein distance is one tool to help handle spelling errors and typos. Keep in mind also that this is not the only metric for measuring string similarity. There are several and they all have different reasons why you might choos one over the other. For example, soundex uses phonetic spelling to find similar sounding syllables. This is a common source of errors in a world of spell checkers and our growing dependence on them.

The Code(get-ld.ps1)



# get-ld.ps1 (Levenshtein Distance)
# Levenshtein Distance is the # of edits it takes to get from 1 string to another
# This is one way of measuring the "similarity" of 2 strings
# Many useful purposes that can help in determining if 2 strings are similar possibly
# with different punctuation or misspellings/typos.
#
########################################################

# Putting this as first non comment or empty line declares the parameters
# the script accepts
###########
param([string] $first, [string] $second, [switch] $ignoreCase)

# No NULL check needed, why is that?
# PowerShell parameter handling converts Nulls into empty strings
# so we will never get a NULL string but we may get empty strings(length = 0)
#########################

$len1 = $first.length
$len2 = $second.length

# If either string has length of zero, the # of edits/distance between them
# is simply the length of the other string
#######################################
if($len1 -eq 0)
{ return $len2 }

if($len2 -eq 0)
{ return $len1 }

# make everything lowercase if ignoreCase flag is set
if($ignoreCase -eq $true)
{
  $first = $first.tolowerinvariant()
  $second = $second.tolowerinvariant()
}

# create 2d Array to store the "distances"
$dist = new-object -type 'int[,]' -arg ($len1+1),($len2+1)

# initialize the first row and first column which represent the 2
# strings we're comparing
for($i = 0; $i -le $len1; $i++) 
{  $dist[$i,0] = $i }
for($j = 0; $j -le $len2; $j++) 
{  $dist[0,$j] = $j }

$cost = 0

for($i = 1; $i -le $len1;$i++)
{
  for($j = 1; $j -le $len2;$j++)
  {
    if($second[$j-1] -ceq $first[$i-1])
    {
      $cost = 0
    }
    else   
    {
      $cost = 1
    }
    
    # The value going into the cell is the min of 3 possibilities:
    # 1. The cell immediately above plus 1
    # 2. The cell immediately to the left plus 1
    # 3. The cell diagonally above and to the left plus the 'cost'
    ##############
    # I had to add lots of parentheses to "help" the Powershell parser
    # And I separated out the tempmin variable for readability
    $tempmin = [System.Math]::Min(([int]$dist[($i-1),$j]+1) , ([int]$dist[$i,($j-1)]+1))
    $dist[$i,$j] = [System.Math]::Min($tempmin, ([int]$dist[($i-1),($j-1)] + $cost))
  }
}

# the actual distance is stored in the bottom right cell
return $dist[$len1, $len2];


Other Resources



Wikipedia


http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Soundex

History


Initial Revision.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)