Spell Checker Dictionary Engine

Martyn Pearson

Rate me:

4.86/5 (34 votes)

20 May 2003CPOL6 min read

120.2K

4.6K

A dictionary engine for use in applications requiring spell checking

Introduction

We all make the odd tryping error, so spull checkers are an essential part of any application that involves the user editing large amounts of text. (OK, I promise, no more corny mis-spellings!) This article describes the spell checking engine that I developed for the editing-intensive application I am currently working on.

The spell checking engine described is simply that - it checks words, and indicates whether or not they are spelt correctly. It is designed to sit behind a front end which allows the user to control spell checking, and as such, the design and operation of the user interface is beyond the scope of this article.

Dictionary Structure

Words are stored in a tree, with each node representing one character. Each node can have two children - a Next node and an Alternate node. The diagram below shows a dictionary containing three words, "horse", "horses" and "hose". Notice how we reuse nodes where the words start with the same common substring.

Sample screenshot

Storing the words in this manner is actually quite efficient in terms of storage space. When compared with an ASCII text file containing one word per line, the corresponding dictionary file is often considerably smaller, particularly when there are a large number of words.

Adding and Removing Words

To add a word, we traverse the tree to find as much of the word as possible, and then the remaining letters of the word are inserted into the tree. The best way to show this is by example, so say we wish to insert the word "hosts" into the dictionary above. We would perform the following steps :

Start at the first node
Check the node. The character is equal to the first letter of "hosts", so we move onto the Next node, and insert the word "osts"
Check the node. The character is equal to the first letter of "osts", so we move onto the Next node, and insert the word "sts"
Check the node. The character is not equal to the first letter of "sts", so we move onto the Alternate node, and insert the word "sts"
Check the node. The character is equal to the first letter of "sts", so we move onto the Next node, and insert the word "ts"
Check the node. The character is not equal to the first letter of "ts". The node has no Alternate node, so we create one and set the character to 't'. We then move onto this new node
The node has no Next node, so we add one and set it to the next character in the word, which 's'. We then move onto this new node.
The node has no Next node, so we add one and set it to the next character in the word, which is the null terminator. Because we have reached the end of the word, we end here.

Removing words is considerably easier. We traverse the tree to find the word, and then set the terminating character to something other than the null terminator. Although this does not reduce the size of the dictionary, it does mean that the word will no longer be found.

Checking Words

To check a word, we traverse the tree in much the same way as we did when we inserted a word. However, if we come to the point where there is no node with a matching character, the word is not in the dictionary and we end the search.

Getting Suggestions

When getting suggestions for a word, we consider four types of error that the user could have made. We assume only one error has been made in a word - if we accounted for more errors, our suggestion list could grow to something which is too large to be of use to the user.

The four errors that we account for are:

Extra character : For example, "horpse" instead of "horse"
Missing character : For example, "hrse" instead of "horse"
Incorrect character : For example "hprse" instead of "horse"
Two characters transposed: For example, "hrose" instead of "horse"

Because traversal of the tree is so fast, we individually check for each of the combinations if characters that make up each of the errors.

So, when finding suggestions for the extra character error, we will try and find the words "orpse", "hrpse", "hopse", "horse", "horpe" and "horps". Any matches that we find are added to the suggestions list. Similarly, we try all the combinations of transposed characters to find suggestions.

The missing and incorrect character errors use pattern matching to find suggestions. For the missing character errors, we insert a wildcard character at each character position, and see what words we can find that match the search string. When looking for incorrect character errors, we replace each character in turn with the wildcard character.

So, if we were getting suggestions for the word "hrse", we would get the suggestions "horse" (missing character 'o') and "hose" (incorrect character 'r').

Case

Words are inserted into the dictionary in lower case, unless they contain upper case characters. So, place names will be inserted in "correct" case.

When we try to find words, we always start by trying to get an exact match. If we do not, we then see what the best match we can get is - either matching apart from a capitalised first letter, matching but with mixed case, or no match at all. If the match indicates a capitalised first letter, it is up to the application to determine whether or not this is valid. For example, this may be treated as valid for the first letter of a sentence, but invalid elsewhere.

When getting suggestions for a mis-spelt word, we get words regardless of case. However, there is also an option to only return words where they differ from the search word only by case. This may be used when the search indicated a valid word with incorrect case.

Source Code

All of the code is contained in the class CDictionary. The class CNode represents a node in the tree. CDictionary offers methods to find words, obtain suggestions, load and save dictionaries, and populate the dictionary from an ASCII text file (with one word per line).

Demo Application

The demo application demonstrates the functions offered by the dictionary class. One neat "bonus" feature is pattern matching. Entering a word to find, with a question mark for wildcard characters allows all matching words to be found, a handy tool for any crossword fans!

Conclusion

Initial use of the dictionary has found that it has been suitably fast both when loading the dictionary of approximately 220,000 words, and when searching for words and suggestions. There are no future enhancements planned at the moment - time will tell what changes are needed!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Martyn Pearson

Software Developer

United Kingdom

I started computer programming on the Spectrum (writing nothing more complicated than "Hello World" and a few programs that tunelessly Beeped ad infinitum) but then progressed to slightly more serious programming on the Amiga.

After A-Levels in Maths, Physics and Chemistry, I went to the University of East Anglia, Norwich, and studied beer, women and Computing Science.
Some years after graduating, I still have an appreciation of Computing Science, but as I am now married, my other studies are frowned upon.

Since graduating, I have worked on many diverse projects in areas including call centres, logistics, architecture and engineering, and heritage.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.