Introduction
We all make the odd tryping error, so spull checkers are an essential part of
any application that involves the user editing large amounts of text. (OK, I
promise, no more corny mis-spellings!) This article describes the spell checking
engine that I developed for the editing-intensive application I am currently
working on.
The spell checking engine described is simply that - it checks words, and
indicates whether or not they are spelt correctly. It is designed to sit behind
a front end which allows the user to control spell checking, and as such, the
design and operation of the user interface is beyond the scope of this article.
Dictionary Structure
Words are stored in a tree, with each node representing one character. Each
node can have two children - a Next node and an Alternate node. The diagram
below shows a dictionary containing three words, "horse", "horses" and "hose".
Notice how we reuse nodes where the words start with the same common substring.
Storing the words in this manner is actually quite efficient in terms of
storage space. When compared with an ASCII text file containing one word per
line, the corresponding dictionary file is often considerably smaller,
particularly when there are a large number of words.
Adding and Removing Words
To add a word, we traverse the tree to find as much of the word as possible,
and then the remaining letters of the word are inserted into the tree. The best
way to show this is by example, so say we wish to insert the word "hosts" into
the dictionary above. We would perform the following steps :
- Start at the first node
- Check the node. The character is equal to the first letter of "hosts", so we
move onto the Next node, and insert the word "osts"
- Check the node. The character is equal to the first letter of "osts", so we
move onto the Next node, and insert the word "sts"
- Check the node. The character is not equal to the first letter of "sts", so
we move onto the Alternate node, and insert the word "sts"
- Check the node. The character is equal to the first letter of "sts", so we
move onto the Next node, and insert the word "ts"
- Check the node. The character is not equal to the first letter of "ts". The
node has no Alternate node, so we create one and set the character to 't'. We
then move onto this new node
- The node has no Next node, so we add one and set it to the next character in
the word, which 's'. We then move onto this new node.
- The node has no Next node, so we add one and set it to the next character in
the word, which is the null terminator. Because we have reached the end of the
word, we end here.
Removing words is considerably easier. We traverse the tree to find the word,
and then set the terminating character to something other than the null
terminator. Although this does not reduce the size of the dictionary, it does
mean that the word will no longer be found.
Checking Words
To check a word, we traverse the tree in much the same way as we did when we
inserted a word. However, if we come to the point where there is no node with a
matching character, the word is not in the dictionary and we end the search.
Getting Suggestions
When getting suggestions for a word, we consider four types of error that the
user could have made. We assume only one error has been made in a word - if we
accounted for more errors, our suggestion list could grow to something which is
too large to be of use to the user.
The four errors that we account for are:
- Extra character : For example, "horpse" instead of "horse"
- Missing character : For example, "hrse" instead of "horse"
- Incorrect character : For example "hprse" instead of "horse"
- Two characters transposed: For example, "hrose" instead of "horse"
Because traversal of the tree is so fast, we individually check for each of
the combinations if characters that make up each of the errors.
So, when finding suggestions for the extra character error, we will try
and find the words "orpse", "hrpse", "hopse", "horse", "horpe" and "horps". Any
matches that we find are added to the suggestions list. Similarly, we try all
the combinations of transposed characters to find suggestions.
The missing and incorrect character errors use pattern matching to find
suggestions. For the missing character errors, we insert a wildcard character at
each character position, and see what words we can find that match the search
string. When looking for incorrect character errors, we replace each character
in turn with the wildcard character.
So, if we were getting suggestions for the word "hrse", we would get the
suggestions "horse" (missing character 'o') and "hose" (incorrect character
'r').
Case
Words are inserted into the dictionary in lower case, unless they contain
upper case characters. So, place names will be inserted in "correct" case.
When we try to find words, we always start by trying to get an exact match.
If we do not, we then see what the best match we can get is - either matching
apart from a capitalised first letter, matching but with mixed case, or no match
at all. If the match indicates a capitalised first letter, it is up to the
application to determine whether or not this is valid. For example, this may be
treated as valid for the first letter of a sentence, but invalid elsewhere.
When getting suggestions for a mis-spelt word, we get words regardless of
case. However, there is also an option to only return words where they differ
from the search word only by case. This may be used when the search indicated a
valid word with incorrect case.
Source Code
All of the code is contained in the class CDictionary. The class CNode
represents a node in the tree. CDictionary offers methods to find words, obtain
suggestions, load and save dictionaries, and populate the dictionary from an
ASCII text file (with one word per line).
Demo Application
The demo application demonstrates the functions offered by the dictionary
class. One neat "bonus" feature is pattern matching. Entering a word to find,
with a question mark for wildcard characters allows all matching words to be
found, a handy tool for any crossword fans!
Conclusion
Initial use of the dictionary has found that it has been suitably fast both
when loading the dictionary of approximately 220,000 words, and when searching
for words and suggestions. There are no future enhancements planned at the
moment - time will tell what changes are needed!
I started computer programming on the Spectrum (writing nothing more complicated than "Hello World" and a few programs that tunelessly Beeped ad infinitum) but then progressed to slightly more serious programming on the Amiga.
After A-Levels in Maths, Physics and Chemistry, I went to the University of East Anglia, Norwich, and studied beer, women and Computing Science.
Some years after graduating, I still have an appreciation of Computing Science, but as I am now married, my other studies are frowned upon.
Since graduating, I have worked on many diverse projects in areas including call centres, logistics, architecture and engineering, and heritage.