Introduction
This package lets you spell check a word and suggest correctly spelled words the user might have meant when a misspelled word is encountered (spell guessing). It also supports a user's personal auxiliary dictionary. An American English lexicon is provided, and instructions for creating a lexical database in other languages are given. A guide is also included if you wish to port this spelling checker code to another operating system. The code has been around for many years and has proven itself to be quite fast and stable. I call this spelling checker EDX.
EDXSPELL.DLL
The edxspell.dll contains the following four routines:
edx$dic_lookup_word | Checks the spelling of a word. You hand it a word, and it returns a value indicating whether the word is correctly spelled. It first checks the EDX lexical database (i.e., the "dictionary") and then the user's personal Aux1 dictionary. If the word is found, it returns EDX__WORDFOUND. If the word is not found, it returns EDX__WORDNOTFOUND. |
edx$spell_guess | Guesses what word the user meant to type. If edx$dic_lookup_word returns EDX__WORDNOTFOUND, you may then make repeated calls to edx$spell_guess to get some words the user might have meant to type. Each call to edx$spell_guess returns a correctly spelled word from the dictionary, similar to the misspelled word passed to edx$dic_lookup_word. You may continue making calls to edx$spell_guess until it returns EDX__WORDNOTFOUND, indicating there are no more suggestions. |
edx$add_persdic | Adds a word to the user's personal auxiliary dictionary. A user's personal auxiliary dictionary is a plain text file with one word per line. The user may also edit this file with an ordinary text editor to add or remove words. |
edx$dll_version | Returns the version number of edxspell.dll. Also returns information about the EDX dictionary and the user's personal auxiliary dictionary if they are loaded. |
edx$dic_lookup_word
You pass a word to edx$dic_lookup_word, and it returns a value indicating whether the word is correctly spelled.
int edx$dic_lookup_word(char *spellwordptr,
char *errbuf,
int errbuflen,
char *Dic_File_Name,
char *Aux1_File_Name);
char *spellwordptr
The word whose spelling you want to check (pointer to an ASCIIZ string). This string should contain just the word. Leading and trailing spaces are not trimmed for you, so trim them yourself before the call. The case of the word (uppercase or lowercase) is not important. (edx$dic_lookup_word makes a lowercase copy of the word before looking it up in the lexical database.)
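Since trimming is left to the caller, a small helper along these lines could be run on each word before the lookup. This is only a sketch: trim_word is a hypothetical name, not part of edxspell.dll, and treating tabs as blanks alongside spaces is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of edxspell.dll): trims leading and
   trailing spaces and tabs from `word` in place and returns `word`. */
char *trim_word(char *word)
{
    size_t start = 0;
    size_t len = strlen(word);

    /* Skip leading blanks. */
    while (word[start] == ' ' || word[start] == '\t')
        start++;

    /* Drop trailing blanks. */
    while (len > start && (word[len - 1] == ' ' || word[len - 1] == '\t'))
        len--;

    /* Shift the remaining characters to the front and terminate. */
    memmove(word, word + start, len - start);
    word[len - start] = '\0';
    return word;
}
```

The buffer is modified in place, so pass a writable copy of the word rather than a string literal.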
char *errbuf
int errbuflen
You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. One instance where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") can't be found. The error message returned in errbuf, in this case, will be something like:
"Error opening C:\Program Files\Multi-Edit 2006\EDXDIC.DIC
Error is: 2: The system cannot find the file specified."
It is up to you to display the error message to the user.
char *Dic_File_Name
ASCIIZ string containing the filename of the lexical database file. This is usually the full path/filename, e.g., Dic_File_Name = 'C:\Program Files\Multi-Edit 2006\EDXDIC.DIC'. If you specify just "EDXDIC.DIC", then the current directory is searched. If the file is not found, edx$dic_lookup_word returns EDX__ERROR, and the error message is placed in errbuf.
If you wish, you may rename EDXDIC.DIC to something else (perhaps EDX_DICTIONARY.DAT). Just specify the new name here.
On the first call to edx$dic_lookup_word, the lexical database file specified in Dic_File_Name is mapped into memory. Once the lexical database file is loaded, the value of Dic_File_Name is ignored on all future calls.
char *Aux1_File_Name
ASCIIZ string containing the filename of the user's personal auxiliary dictionary file. This is usually the full path/filename, e.g., Aux1_File_Name = 'C:\Program Files\Multi-Edit 2006\EDXAUX1.TXT'. If you specify just "EDXAUX1.TXT", then the current directory is searched. If a file is specified and is not found, it is created. You may pass a null string ("") if the user does not have a personal auxiliary dictionary file.
A user's personal auxiliary dictionary file is a plain text file with one word per line. The user may edit this file with an ordinary text editor to add or remove words. You may also use the edx$add_persdic function to append a word to this file.
On the first call to edx$dic_lookup_word, the user's personal auxiliary dictionary specified in Aux1_File_Name is loaded into memory (unless a null string is passed, in which case this step is skipped). The value of Aux1_File_Name is ignored on all future calls.
Return values:
Add the following three defines to your code:
#define EDX__WORDFOUND 1
#define EDX__WORDNOTFOUND 2
#define EDX__ERROR 4
These are the three possible return values. Note that there are two underscore characters after EDX. If the return value is EDX__ERROR, then errbuf will contain further information about the error. It's up to you to display this error message to the user.
Discussion:
On the first call to edx$dic_lookup_word, the main dictionary specified by Dic_File_Name is opened and mapped, and if Aux1_File_Name is not a null string ("") then that file is briefly opened and read into memory. Then the word in spellwordptr is searched for in the main dictionary and in the user's personal auxiliary dictionary. If the word is found, EDX__WORDFOUND is returned. If the word is not found, EDX__WORDNOTFOUND is returned.
Notes:
- Spell checking a zero length string will return EDX__WORDFOUND.
- The dictionary cannot store words longer than 31 characters.
- It's up to you to trim off any leading and trailing spaces of the word you want to spell check before handing it to edx$dic_lookup_word.
- You may, if you wish, initialize the spelling checker by spell checking a zero length string, passing the names of the EDX dictionary file Dic_File_Name and optional user's personal dictionary file Aux1_File_Name to edx$dic_lookup_word. You may then use null strings for these two parameters on all future calls to edx$dic_lookup_word, as these two parameters will be ignored on all future calls.
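The initialization idea in the last note can be sketched as follows. The function-pointer type mirrors the prototype given above. The names edx_init, edx_check, and the stand-in demo_lookup are hypothetical and exist only for illustration. The sketch also assumes that empty strings are acceptable for the ignored file name parameters on later calls. How the pointer to the real routine is obtained from edxspell.dll is left out.

```c
#include <stdio.h>
#include <string.h>

#define EDX__WORDFOUND    1
#define EDX__WORDNOTFOUND 2
#define EDX__ERROR        4

/* Pointer type matching the documented prototype of edx$dic_lookup_word. */
typedef int (*edx_lookup_fn)(char *spellwordptr, char *errbuf, int errbuflen,
                             char *Dic_File_Name, char *Aux1_File_Name);

/* Hypothetical helper: loads both dictionaries by spell checking a
   zero-length string. Returns 1 on success, 0 on error (errbuf is printed). */
int edx_init(edx_lookup_fn lookup, char *dic_path, char *aux_path)
{
    char empty[] = "";
    char errbuf[400];

    errbuf[0] = '\0';
    if (lookup(empty, errbuf, (int)sizeof errbuf, dic_path, aux_path) == EDX__ERROR) {
        fprintf(stderr, "%s\n", errbuf);
        return 0;
    }
    return 1;
}

/* Hypothetical helper for every later call: the file name parameters are
   ignored once the dictionaries are loaded, so empty strings are passed. */
int edx_check(edx_lookup_fn lookup, char *word, char *errbuf, int errbuflen)
{
    char none[] = "";
    return lookup(word, errbuf, errbuflen, none, none);
}

/* Stand-in for the real DLL routine so the sketch can be exercised without
   edxspell.dll: it knows one word and fails if no dictionary name is given
   on the first call. */
int demo_lookup(char *word, char *errbuf, int errbuflen, char *dic, char *aux)
{
    static int loaded = 0;

    (void)aux;
    if (!loaded) {
        if (dic[0] == '\0') {
            snprintf(errbuf, (size_t)errbuflen, "Error opening dictionary");
            return EDX__ERROR;
        }
        loaded = 1;
    }
    if (word[0] == '\0' || strcmp(word, "hello") == 0)
        return EDX__WORDFOUND;
    return EDX__WORDNOTFOUND;
}
```

Calling edx_init once at program start surfaces a missing dictionary file early, before the user starts a spell check.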
edx$spell_guess
Every call to edx$dic_lookup_word sets up spell guessing. The word passed to edx$dic_lookup_word is saved, and spell guessing is initialized. If that word is misspelled, you may then make repeated calls to edx$spell_guess to get suggested words the user might have meant to type. (You may make these calls even if the word is not misspelled, though I don't know why anyone would bother.) Each call to edx$spell_guess returns a correctly spelled word, from the main EDX dictionary or the user's personal auxiliary dictionary, which is similar to the misspelled word passed to edx$dic_lookup_word. You may continue making calls to edx$spell_guess until it returns EDX__WORDNOTFOUND, indicating there are no more suggestions.
int edx$spell_guess(char *guessword, char *errbuf, int errbuflen);
char *guessword
Pointer to a buffer which you supply to receive the guess word. The buffer should be large enough to hold a 31 character word (don't forget the trailing null byte, so you need 32 bytes).
char *errbuf
int errbuflen
You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. (I suggest an error buffer size of around 400 characters.) The only instance I know of where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") is located on a remote computer and the network connection to that remote computer is lost. In this case, the error message returned in errbuf will be:
"EDXspell.dll encountered
error EXCEPTION_IN_PAGE_ERROR. This error can
occur if the EDX dictionary file is on a
remote computer and the network connection
to that remote computer is lost."
It is up to you to display the error message to the user. (You don't have to display it. You could just treat this error as if EDX__WORDNOTFOUND were returned, and stop spell guessing. In this case, you may specify errbuflen = 0 and not receive the error message, since you're not going to display it.) Normally, the EDX dictionary file is on the same computer as the program, and this is not a problem.
Return values:
EDX__WORDFOUND - guessword is filled with another guess word.
EDX__WORDNOTFOUND - all out of guesses.
EDX__ERROR - EXCEPTION_IN_PAGE_ERROR (see above).
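One way to use these return values is to gather the suggestions into an array before showing them to the user. The sketch below assumes a pointer with the same signature as edx$spell_guess. The names collect_guesses, EDX_MAX_WORD, and the stand-in demo_guess are hypothetical, and the stand-in simply hands out three fixed words so that the loop can be exercised without the DLL.

```c
#include <string.h>

#define EDX__WORDFOUND    1
#define EDX__WORDNOTFOUND 2
#define EDX__ERROR        4
#define EDX_MAX_WORD      32   /* 31 characters plus the terminating zero byte */

/* Pointer type matching the documented prototype of edx$spell_guess. */
typedef int (*edx_guess_fn)(char *guessword, char *errbuf, int errbuflen);

/* Hypothetical helper: calls `guess` until it stops returning EDX__WORDFOUND
   or `max` suggestions are stored. Returns the number of suggestions and
   writes the final status to *last_status (EDX__WORDFOUND there means the
   array filled up before the guesses ran out). */
int collect_guesses(edx_guess_fn guess, char out[][EDX_MAX_WORD], int max,
                    char *errbuf, int errbuflen, int *last_status)
{
    int count = 0;
    int status = EDX__WORDNOTFOUND;

    while (count < max &&
           (status = guess(out[count], errbuf, errbuflen)) == EDX__WORDFOUND)
        count++;

    *last_status = status;
    return count;
}

/* Stand-in for the real DLL routine: returns three fixed guesses, then
   reports that it is out of guesses. */
int demo_guess(char *guessword, char *errbuf, int errbuflen)
{
    static const char *words[] = { "hello", "hell", "help" };
    static int next = 0;

    (void)errbuf;
    (void)errbuflen;
    if (next >= 3)
        return EDX__WORDNOTFOUND;
    strcpy(guessword, words[next++]);
    return EDX__WORDFOUND;
}
```

If *last_status comes back as EDX__ERROR, the caller can either display errbuf or treat the error as the end of the guesses, as described above.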
Here is an outline of how edx$spell_guess goes about spell guessing:
- reversals (test for transposed characters)
- vowels (test for the wrong vowel used)
- minus chars (test for extra characters in the word)
- plus chars (test for characters missing from the word)
- consonants (test for wrong characters used)
- give up (give up)
The code takes care not to guess the same word twice.
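To make the first step of the outline concrete, here is a rough sketch of what a test for transposed characters can look like. It is not the code used in edxspell.dll. The tiny word list, in_demo_words, and guess_by_reversal are all made up for illustration, and the real routine searches the lexical database rather than a four-word array.

```c
#include <string.h>

/* Tiny stand-in word list; the real checker searches its lexical database. */
static const char *const demo_words[] = { "form", "from", "the", "their" };

static int in_demo_words(const char *w)
{
    size_t i;

    for (i = 0; i < sizeof demo_words / sizeof demo_words[0]; i++)
        if (strcmp(demo_words[i], w) == 0)
            return 1;
    return 0;
}

/* Illustrative "reversals" pass: swap each adjacent pair of characters in
   `word` and copy the first variant found in the word list into `guess`
   (which must hold at least 32 bytes). Returns 1 if a guess was produced. */
int guess_by_reversal(const char *word, char *guess)
{
    char tmp[32];
    size_t len = strlen(word);
    size_t i;

    if (len < 2 || len > 31)
        return 0;

    for (i = 0; i + 1 < len; i++) {
        char c;

        strcpy(tmp, word);
        c = tmp[i];
        tmp[i] = tmp[i + 1];
        tmp[i + 1] = c;

        /* Skip swaps of identical letters, which reproduce the input. */
        if (strcmp(tmp, word) != 0 && in_demo_words(tmp)) {
            strcpy(guess, tmp);
            return 1;
        }
    }
    return 0;
}
```

The other steps in the outline would follow the same pattern of generating variants of the misspelled word and keeping those that are found in the dictionary.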
edx$add_persdic
Adds a word to the user's personal auxiliary dictionary. The user's personal auxiliary dictionary is a plain text file with one word per line. The user may also edit this file with an ordinary text editor.
char *newword
The word you want to add to the user's personal auxiliary dictionary (pointer to an ASCIIZ string). Leading and trailing spaces are trimmed, and the word is lowercased before it is added to the file. The resulting word can be no longer than 31 characters.
char *errbuf
int errbuflen
You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. It is up to you to display the error message to the user. (I suggest an error buffer size of around 400 characters.)
Return values:
EDX__WORDFOUND - Successfully added the word to the user's personal dictionary.
EDX__ERROR - Error adding the word to the user's personal dictionary. The error text is in errbuf.
edx$dll_version
Returns a long string containing information. The string may look something like:
EDX Spelling Checker file edxspell.dll version 7.2 November 26, 2006.
EDX dictionary file is version 5 (Extended ANSI character compatible)
There are no extended ANSI characters in the dictionary.
Extended ANSI Guessing is: OFF.
User's personal auxiliary dictionary file is: EDXMYAUX1DIC.TXT
char *buf
int buflen
You provide a buffer where the version message string can be written to. Provide a pointer to a buffer, and provide the length of the buffer. (I suggest a buffer size of around 550 characters.)
Running the demo
To try the demo:
- Download the demo.
- Read the "0Readme EDX Spellchecker Demo.txt" file.
Using the code
If you are spell checking a buffer, then the general outline of code you would write would be:
- Parse off the next word.
- Call edx$dic_lookup_word to spell-check the word.
- If edx$dic_lookup_word returns EDX__WORDNOTFOUND, then declare the word misspelled, and offer suggestions to the user by making repeated calls to edx$spell_guess until edx$spell_guess returns EDX__WORDNOTFOUND (meaning no more guess words).
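Here is a pseudo code example which spell checks testword (edxdic and edxaux1 stand for the dictionary and personal dictionary filenames):
status = edx$dic_lookup_word(testword, errbuf, errbuflen, edxdic, edxaux1);
switch (status)
{
    case EDX__WORDFOUND:
        /* Word is correctly spelled. Nothing to do. */
        break;
    case EDX__WORDNOTFOUND:
        /* Word is misspelled. Offer each guess word to the user. */
        while (EDX__WORDFOUND == (guess_status =
               edx$spell_guess(ResultBuf, errbuf, errbuflen)))
        {
            /* Display guess word in ResultBuf to user. */
        }
        if (guess_status == EDX__ERROR)
        {
            /* Bad. Display error message in errbuf and stop spell checking. */
        }
        /* Otherwise guess_status is EDX__WORDNOTFOUND: no more guesses. */
        break;
    case EDX__ERROR:
        /* Bad. Display error message in errbuf and stop spell checking. */
        break;
}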
For a simple working example, see file "CallEdxSpell.cpp" in the "CallEdxSpell Source" folder of the source download.
Considerations when adding a spelling checker to your program
Parsing off the next word
It's up to you to provide words to edx$dic_lookup_word, so if you're spell checking a buffer, you must write some code that will parse off the next word to spell check. See the file "Parsing off words.txt" in the "Documentation" folder of the source download for some suggestions.
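As one simple possibility (not taken from "Parsing off words.txt"), the sketch below walks a zero terminated buffer and copies out one word at a time. It treats letters and apostrophes as word characters, which is an assumption, and next_word and is_word_char are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical rule: a character is part of a word if it is a letter or an
   apostrophe (so "don't" stays one word). */
static int is_word_char(unsigned char c)
{
    return isalpha(c) || c == '\'';
}

/* Copies the next word found at or after *pos in `text` into `word`
   (capacity `cap` bytes, including the terminating zero; cap must be at
   least 1) and advances *pos past it. Returns 1 if a word was found, 0 at
   the end of the text. Words longer than cap - 1 characters are truncated. */
int next_word(const char *text, size_t *pos, char *word, size_t cap)
{
    size_t i = *pos;
    size_t n = 0;

    /* Skip everything that cannot start a word. */
    while (text[i] != '\0' && !is_word_char((unsigned char)text[i]))
        i++;
    if (text[i] == '\0') {
        *pos = i;
        return 0;
    }

    /* Copy the word, stopping at the first non-word character. */
    while (is_word_char((unsigned char)text[i])) {
        if (n + 1 < cap)
            word[n++] = text[i];
        i++;
    }
    word[n] = '\0';
    *pos = i;
    return 1;
}
```

In the default C locale isalpha only recognizes ASCII letters, so text with extended ANSI characters would need a different word-character rule. Quoted text such as 'word' would also keep its apostrophes under this rule.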
User's personal auxiliary dictionary
The code now supports an optional user's personal auxiliary dictionary (also called the "User's Aux1 dictionary" or the "User's Aux1 Lexical Database"). This is a plain text file with one word per line. The contents of the file are loaded on the first call to edx$dic_lookup_word. Leading and trailing spaces are trimmed, and the words are lowercased when loaded. Resulting words may be no longer than 31 characters. edx$dic_lookup_word will return an error message if it finds a word longer than 31 characters when loading the user's personal auxiliary dictionary.
Words in the user's personal auxiliary dictionary are checked when spell checking a word, and when spell guessing.
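The loading rules described above (one word per line, trim, lowercase, at most 31 characters) can be illustrated with a short reader. This is not the DLL's own loader. The name load_aux_words, the fixed-size array, and the choice to skip blank lines are assumptions made for the sketch.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD_LEN 31

/* Illustrative loader: reads one word per line from `fp`, trims blanks,
   lowercases, and stores up to `max` words in `words`. Returns the number
   of words stored, or -1 if a word is longer than 31 characters. */
int load_aux_words(FILE *fp, char words[][MAX_WORD_LEN + 1], int max)
{
    char line[256];
    int count = 0;

    while (count < max && fgets(line, sizeof line, fp) != NULL) {
        char *start = line;
        size_t len, i;

        /* Trim leading blanks, then trailing blanks and the line ending. */
        while (*start == ' ' || *start == '\t')
            start++;
        len = strlen(start);
        while (len > 0 && (start[len - 1] == '\n' || start[len - 1] == '\r' ||
                           start[len - 1] == ' '  || start[len - 1] == '\t'))
            len--;

        if (len == 0)
            continue;               /* skip blank lines (assumption) */
        if (len > MAX_WORD_LEN)
            return -1;              /* word too long for the dictionary */

        for (i = 0; i < len; i++)
            words[count][i] = (char)tolower((unsigned char)start[i]);
        words[count][len] = '\0';
        count++;
    }
    return count;
}
```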
Keep track of spelling corrections
A further enhancement would be to keep track of spelling corrections as they are made. If the user misspells a word, and selects a correctly spelled word from your list of guess words, then you can save that correction. If you encounter the same misspelled word again, offer to make the same change. (The EDX spelling checker does not do this for you.)
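A minimal sketch of such a correction memory is shown below. Everything in it is hypothetical (the struct, the function names, and the fixed table size), and it keeps corrections only in memory for the duration of the session.

```c
#include <string.h>

#define MAX_CORRECTIONS 64
#define MAX_WORD        32   /* 31 characters plus the terminating zero byte */

/* Hypothetical in-memory table of (misspelled word -> chosen replacement). */
struct correction_table {
    char wrong[MAX_CORRECTIONS][MAX_WORD];
    char right[MAX_CORRECTIONS][MAX_WORD];
    int  count;
};

/* Looks up a remembered replacement; returns NULL if there is none. */
const char *find_correction(const struct correction_table *t, const char *wrong)
{
    int i;

    for (i = 0; i < t->count; i++)
        if (strcmp(t->wrong[i], wrong) == 0)
            return t->right[i];
    return NULL;
}

/* Remembers a correction the user accepted. Returns 1 if stored, 0 if the
   table is full or either word does not fit in 31 characters. */
int remember_correction(struct correction_table *t,
                        const char *wrong, const char *right)
{
    int i;

    if (strlen(wrong) >= MAX_WORD || strlen(right) >= MAX_WORD)
        return 0;

    /* Update an existing entry instead of storing a duplicate. */
    for (i = 0; i < t->count; i++) {
        if (strcmp(t->wrong[i], wrong) == 0) {
            strcpy(t->right[i], right);
            return 1;
        }
    }
    if (t->count == MAX_CORRECTIONS)
        return 0;
    strcpy(t->wrong[t->count], wrong);
    strcpy(t->right[t->count], right);
    t->count++;
    return 1;
}
```

A caller would check find_correction before asking for guesses, and call remember_correction whenever the user picks a replacement.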
Performance speed
This spelling checker has proven itself to be quite fast. The secret to the speed of the EDX Spelling Checker is to keep page faults down to a minimum. The design of the EDX lexical database file reflects this goal by keeping memory reads near each other. For more information about the layout of the lexical database file and how to optimize it, see the file "Lexical Database File Layout.txt" in the source download. For more information about what page faults are and why a good understanding of them is so crucial to program execution speed, see the file "PAGE_FAULTS_AND_ARRAY_ADDRESSING.TXT" in the "Documentation" folder of the source download.
Loading the lexical database file
The other secret to speed is to map the lexical database file ("dictionary") into virtual memory instead of reading it in. The dictionary could be loaded by first allocating enough memory to hold the file, and then reading the entire file into the allocated memory. This would be quite slow due to the large size of the database. Also, a user's page file quota limits the total amount of memory a user may allocate, and the memory required to hold the database file is a considerable amount of memory.
Instead of this, we call the Microsoft Windows functions CreateFileMapping and MapViewOfFile to load the dictionary file. CreateFileMapping and MapViewOfFile accomplish the same result as allocating memory and then reading the file into memory, except they never allocate memory from the system and never read in the file. Instead, they expand the process region by the size of the file EDXDIC.DIC (the "dictionary", i.e., the lexical database file), thereby instantly making new virtual memory available, and then declare that the physical file EDXDIC.DIC itself is the read-only paging file for that section of memory. The initialization is now complete, with hardly any work having been done.
Now, when the program attempts to read some of the dictionary that's in that memory range, a page fault will occur if that page is not already in memory, and that page is automatically read into memory. And since we're not using the system paging file for this, the user's page file quota is not affected.
It also helps if you defragment the file EDXDIC.DIC, since it's being used as a paging file.
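The mapping step described above might look roughly like the following. This is a sketch and not the code in edxspell.dll. The name map_file_readonly is hypothetical, and the POSIX branch using mmap is included only as a comparison for readers on other systems.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Maps `path` read-only and returns the base address, or NULL on failure.
   The file size is stored in *size. Both handles are closed right away; the
   view stays valid until UnmapViewOfFile is called on the returned pointer. */
const void *map_file_readonly(const char *path, size_t *size)
{
    HANDLE file, mapping;
    LARGE_INTEGER li;
    const void *view = NULL;

    file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;

    if (GetFileSizeEx(file, &li) && li.QuadPart > 0) {
        mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (mapping != NULL) {
            view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
            CloseHandle(mapping);
        }
    }
    CloseHandle(file);
    if (view != NULL)
        *size = (size_t)li.QuadPart;
    return view;
}
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* POSIX counterpart using mmap, shown only for comparison. */
const void *map_file_readonly(const char *path, size_t *size)
{
    struct stat st;
    void *view;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    view = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping remains valid after the descriptor is closed */
    if (view == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return view;
}
#endif
```

In both branches no file data is read at mapping time. Pages are brought in only when the returned memory is first touched, which is the behavior the text above relies on.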
English Lexical Database Provided
An American English lexical database with more than 90,000 words is provided. Every effort has been made to ensure all the words are correctly spelled.
Other Languages
You may also create a lexical database in another language, if you wish. All you have to do is supply a file containing all the words in whatever language you want. The only requirements are that no word be longer than 31 characters and that the file be sorted by byte value. (This is the usual sort order, where we pay attention to the unsigned value of each byte, rather than what character the byte represents.) Below are links to a few places where you can get lexicons.
Lexicons (files that contain a list of words) for other languages may be found at SourceForge. (See SCOWL - Spelling Checker Oriented Word Lists.)
Be forewarned, the lexicons at the above web site contain a lot of words which aren't to be found in any dictionary! (They contain a lot of misspelled words, or words which should actually be hyphenated or two separate words.) A lot of work has gone into ensuring the words in my lexicon, EDX_DICTIONARY.TXT, are correctly spelled.
Another site that has Lexicons in various languages is: WinEdt Dictionaries.
For more info on creating the EDX lexical database file and optimizing your EDX_COMMONWORDS.TXT file, see "0Readme EDXBuildDictionary.txt" in the "Build EDX Dictionary Source" folder of the source download.
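To illustrate what sorting by byte value means for a word list, the sketch below sorts an array of words with qsort, comparing each byte as an unsigned value, and counts words that exceed the 31 character limit. It is not part of the EDX build tools. The names compare_bytes and sort_lexicon are hypothetical, and the sample words in the usage note are assumed to be in a single-byte ANSI encoding.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator ordering strings by the unsigned value of each byte,
   so bytes 128-255 sort after all ASCII characters. */
static int compare_bytes(const void *a, const void *b)
{
    const unsigned char *s = (const unsigned char *)*(const char *const *)a;
    const unsigned char *t = (const unsigned char *)*(const char *const *)b;

    while (*s != '\0' && *s == *t) {
        s++;
        t++;
    }
    return (*s > *t) - (*s < *t);
}

/* Hypothetical preparation step: sorts `words` in place and returns the
   number of entries longer than 31 characters (0 means none are too long). */
size_t sort_lexicon(const char **words, size_t count)
{
    size_t i, too_long = 0;

    for (i = 0; i < count; i++)
        if (strlen(words[i]) > 31)
            too_long++;
    qsort(words, count, sizeof words[0], compare_bytes);
    return too_long;
}
```

For example, with the words "zebra", "\xE9t\xE9" (the byte 0xE9 is é in ANSI), "apple", and "zoo", the sorted order is "apple", "zebra", "zoo", "\xE9t\xE9", because 0xE9 is greater than every ASCII letter.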
Update: The code has been updated to handle all ANSI characters 128 - 255, so it can now handle characters such as: š œ ž ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ. See the files "EDX Using Extended ANSI Characters.txt" and "EDX lowercasing extended letters.htm" in the "Documentation" folder for more information.
Other operating systems
See the file "Porting EDXspell to other operating systems.txt" in the "Documentation" folder of the source download if you wish to port this code to another operating system besides Microsoft Windows. (The code was originally written for the VMS operating system, and later modified to work with Microsoft Windows.)
History
- EDX was the name of a text editor I wrote in TPU for VMS operating systems, many years ago, back when dinosaurs roamed the earth.
- March 1990 - Original spelling checker code written by me from scratch, in VAX Macro Assembly Language.
- July 1993 - Forced to convert the code to the C language as VAX Macro became obsolete.
- March 2004 - Converted the C code to work on Microsoft Windows so I could add a spelling checker to the Multi-Edit text and code editor. (Tested on Windows XP, Windows 2000, and Windows ME.)
- July 2006 - Submitted code to The Code Project. I've been using this spelling checker code for many years with no problems. The code is stable and fast! If you use the code, give me credit for it, that's all I ask.
- Nov 2006 - Added support for extended ANSI characters (characters above ASCII 127, characters with accent marks, as found in some foreign languages such as French).
- Nov 2006 - Added support for a user's personal auxiliary dictionary file.
Glossary
- ASCIIZ - A zero terminated character string. The kind of string used in the C and C++ programming languages.
- Dictionary - Technically, a "dictionary" should contain definitions of words. However, with spelling checkers, it's common to say "dictionary" when what we really mean is "lexicon" or "lexical database file".
- Lexicon - a file that contains a list of words.
- The "EDX Dictionary" is also called the "EDX Lexical Database File". Both refer to the file EDXDIC.DIC. (It's just shorter to say "dictionary" than "Lexical Database File". Also you can change the name of file EDXDIC.DIC to anything you want.)