Website of the project
For information about this project, please refer to http://www.horizonideas.com/spellingdice/
Intro
No matter you are a
native English speaker or not, you may make some spelling mistakes every now and
then. Isn't it neat, if the computer
can provide some spelling suggestion for your unsure input? The most famous one
should be Google Suggest. Every time you input some mistakes, there will be "Did
you mean? XXX XXX XXX" on the next page. We know the whole system is based on Bayesian network, but how? On the other hand,
if our own website or small size program could include this great function, that
would be a valuable asset to attract people.
Peter Norvig, the Director of Google
Research revealed the secrets of Google Suggest. I believe it's not the whole algorithm,
but it's good enough for us to add spices ourselves and utilize in our own small-scale
program.
I believe his essay is much better
than my following descriptions. You can find his paper at
Here [English]
Here[Chinese]
Here[Japanese]
Here[Russian]
Suppose someone input "Majar", the
program should be able to identify what the user actually meant was "major".
Ok, then how does it work?
We let P(s|i) denotes the probability of one suggested word s by providing an input
word i.
So we have P(s|i)=P(s,i)/P(i)
P(i) is the probability of one random input string. As the user can input anything,
so the probability is neglectable. We let it be 1. So we have:
P(s|i)=P(s,i)/1=P(s,i)
At the same time,
P(s,i)=P(i|s) * P(s) so we have :
P(s|i)=P(s,i)= P(i|s) * P(s)
Where P(s) is the probability of one word's probability. Question: How can we define
this number?
Answer: find any book and read it through then make statics of each word's occurrence.
Or download the API which includes the 5000 most commonly used English word. For
instance, the word "the" must have a lot higher probability than "tae" [therapeutic
arterial embolization].
Ok, what about P(i|s)? literately, P(i|s) is the probability of one input string
by providing an recommend string. But it does not make sense. So it becomes the
probability of one suggested string by calculating the editing distance.
Oh, edit distance. You don't know what? Read
http://en.wikipedia.org/wiki/Edit_distance
The most famous one is Levenshtein Distance which is a build-in function in PHP.
Another example is Hamming Distance in Information Theory. For instance D(01010,
10010)=2 . now you get what I mean by edit distance, right?
Ok, let's continue. They less edit distance you have, the higher probability you
have.
For instance if someone inputs "hallo", "hello" has much higher P(i|s) than "helen".
What if the word is correct? In other words, if one in the dictionary matches the
input. Then the program should not return any suggesition. To save time complexity,
ths program should break here.
Understand what I said? If not, again, please refer to his original article.
Ok. Let's take a closer look at "hallo". The program may find "Hallow" "Mallo"(if
indeed, there is such word) and "Hello". They may have the same P(s) as we won't
see too many people use tons of "hello" in their formal writings. And they have
the same edit distance, so should they be placed in the order of alphabets? No!
Absolutely not. Use your head to think! How many people would mis-type the first
letter? Not so many~ So get rid of "mallo". Cool! Then you will find people tend
to mis-type more vowels than consonants. The reason is because it's very obvious
for human eyes to discover the typos of consonants. So they may correct them before
submit to computer. Thus, "hello" must have much better P(i|s) "hallow". Is that
clear?
The above are all in Dr. Peter Norvig's essay. But I don't think that's totally
enough for determining P(i|s). So I observe how people type and find out some of
us like me have giant fingers. So instead of type "hello" , we type "hrllo" or "hekko".
If you take a closer look at your keyboard you will realize why. Because these 'e'
and 'r' are near so are 'k' and 'l'. So here we have another standard, the P(i|s)
will be increased if a letter is replaced by its neighbors on the keyboard. So when
type "hrllo", "hello" must have higher probability than "hillo" (if indeed this
is such word"). Ok, I hope you are not lost here. This is the idea of SpellingDice.
If you don't want to know about the Algorithm, you can skip the following and go
directly to download the APIs for PHP and C# at
HERE.
Examples
Try out the online SpellingDice Corrector PHP
Please click
HERE
Algorithm
Array<string> ReturnSuggestion(string inputWord)
{
Array<string> result=null;
if( inputWord[0].ToLowerCase()<='z' && inputWord[0].ToLowerCase()<='a')
// if continues only if it's a English letter
{
Array<string> Candidates = ReadFromXML(inputWord[0].ToLowerCase()); // To
speed up the process, only read the dictionary that start with the first letter.
For instance, if input is "hallo" only read 'h.xml'.
Maps<string, double> RefinedCandidates;
foreach(string candidate in Candidates)
{
if(LevenshteinDistance(candidate, inputWord)<=2) //only check those words who
have an edit distance of 2 comparing to the input. Otherwise the candidate list
will be long and it does not help that much.
{
RefinedCandidates[candidate]=CalculateConditionProbability(candidate, inputWord)*probability(candidate);
}
}
Sort(RefinedCandidates);
result=ConvertMapIntoArray(RefinedCandidates);
}
return result;
}
APIs
To Download the Set, please click HERE
PHP
SpellDice::ReturnSuggestion($inputString);
// no matter the input is a sentence or a word, it will return a suggested sentence
or 5 suggested words. if no suggestion, null will be returned. NOTE: When the input
is a word, the return value will be a string array, if it is a sentence, a sentence
string will be returned.
SpellDice::SentenceProcess($inputString);
//only process a sentence, return a suggested sentence
SpellDice::WordProcess($inputString)
//only process a word, a array of at most 5 suggested words will be returned.
C#
namespace SpellCheck
public class SpellingDice
{
public SpellingDice() //intilize method
public List<string>
ReturnSuggestion(string input); ///no matter the input is a sentence or a word,
it will return a suggested sentence or 5 suggested words. if no suggestion, null
will be returned. NOTE: When the input is a word, the return value will be a string
array, if it is a sentence, an array of only one element will be returned.
public List<string> WordProcess(string word) //only process a word,
a array of at most 5 suggested words will be returned.
}
A Snippet of Codes
PHP
function GuessCandidates($compareString)
{
global $vocabularyArray;
$letter=$compareString[0];
if(count($vocabularyArray[$letter])==0)
{
$parser = xml_parser_create();
xml_parser_set_option($parser,XML_OPTION_SKIP_WHITE,1); <o:p />
xml_parser_set_option($parser,XML_OPTION_CASE_FOLDING,0);
<o:p> </o:p>
$data = implode("",file('dictionary/'.$letter.'.xml'));
xml_parse_into_struct($parser,$data,&$d_ar,&$i_ar)
or print_error();
$counter=0;
for($i=0; $i<count($i_ar['WordEntry']);
$i++) {
if($d_ar[$i_ar['WordEntry'][$i]]['type']=='open') {
for($j=$i_ar['WordEntry'][$i]; $j<$i_ar['WordEntry'][$i+1];
$j++) {
if($d_ar[$j]['tag'] == 'word')
{
$word = $d_ar[$j]['value'];
}elseif($d_ar[$j]['tag'] == 'probability')
{
$probability = $d_ar[$j]['value'];
}
}
$editDistance=SpellDice::StringDistance($compareString,$word);
if($compareString==$word)
return $word;
else
if($editDistance<=20)
$CandidateArray[$compareString][$word]=$probability*(20/$editDistance);
}
}
xml_parser_free($parser);
}
arsort($CandidateArray[$compareString]);
if(count($CandidateArray[$compareString])>5)
$CandidateArray[$compareString]=array_slice($CandidateArray[$compareString],
0, 5);
$result=array();
foreach ($CandidateArray[$compareString]
as $key=> $value)
{
array_push($result,$key);
}
return $result; <o:p />
}
C#
public List<string> WordProcess(string
word)
{
List<string> result = new List<string>();
SortedDictionary<double, string>
listOfCandidates = new SortedDictionary<double, string>();
if (word[0] <= 'z' &&
word[0] >= 'a')
{
if (candidatesList[word[0]].Count
== 0)
ReadFromXMLFile(word[0]);
foreach (Dictionary
DictionaryItem in candidatesList[word[0]])
{
if
(DictionaryItem.word == word)
{
result.Add(word);
return result;
}
else
{
int LD = Levenshtein(word, DictionaryItem.word);
if (LD <= 2)
{
double editDistance = StringDistance(word, DictionaryItem.word,
LD);
listOfCandidates[1 / (DictionaryItem.probability * (1 / editDistance))] = DictionaryItem.word;
} } } <o:p />
foreach (KeyValuePair<double,string> kvp in listOfCandidates)
{
result.Add(kvp.Value);
}
if (listOfCandidates.Count > 5)
result.RemoveRange(5, listOfCandidates.Count - 5);
}
else
result.Add(word); <o:p />
return result;
}
Conclusion
The performance of this algorithm is actually very good as we have some heuristic to ensure the least loops.
It needs O(n) to read the dictionary
Then O(n) to loop the dictionary
O(m) to check every word( m) is the length of words)
O(nlogn) to sort
So in totally we are looking at a O(n)+O(nlogn)+o(n*m) algorithm.
The PHP version is slightly faster than C# as it has some neat build-in functions. C# don't have a auto-sorted data structure to handle the array (SortedDictionary only sort the keys not value, and it's from low to high which is something we want in opposite way).
The program also largely depends on your dictionary. My dictionary included has 5000 English words. But it's not enough. You may want to expand the dictionary yourself. I made them 26 XML files. You can do it more yourself. This program could deal with regular spelling corrections, but I cannot guarantee it provides the same effects as Google Suggest does as they have much larger data stored and other advantages.
In all, this is a small-size fast program to help you achieve a neat feature on your website or small project. If you have any question or comments, please contact me directly.