EDX Spelling Checker

27 Nov 2006 · CPOL · 15 min read
Add to your program the ability to spell check a word and perform spell guessing.

Introduction

This package gives you the ability to spell check a word, and to suggest correctly spelled words the user might have meant when a misspelled word is encountered (spell guessing). It also supports a user's personal auxiliary dictionary. An American English lexicon is provided, and instructions for creating a lexical database in other languages are given. A guide is included if you wish to port this spelling checker code to another operating system. The code has been around for many years, and has proven itself to be quite fast and stable. I call this spelling checker EDX.

EDXSPELL.DLL

The edxspell.dll contains the following four routines:

  • edx$dic_lookup_word - Checks the spelling of a word. You hand it a word, and it returns a value indicating whether the word is correctly spelled. It does this by first checking the EDX lexical database (i.e., "dictionary"), and then checking the user's personal Aux1 dictionary. If the word is found, it returns EDX__WORDFOUND. If it does not find the word, it returns EDX__WORDNOTFOUND.
  • edx$spell_guess - Guesses what word the user meant to type. If edx$dic_lookup_word returns EDX__WORDNOTFOUND, you may then make repeated calls to edx$spell_guess to get some words the user might have meant to type. Each call to edx$spell_guess returns a correctly spelled word from the dictionary, similar to the misspelled word passed to edx$dic_lookup_word. You may continue making calls to edx$spell_guess until it returns EDX__WORDNOTFOUND, indicating there are no more suggestions.
  • edx$add_persdic - Adds a word to the user's personal auxiliary dictionary. A user's personal auxiliary dictionary is a plain text file with one word per line. The user may also edit this file with an ordinary text editor to add or remove words.
  • edx$dll_version - Returns the version number of edxspell.dll. Also returns information about the EDX dictionary and the user's personal auxiliary dictionary if they are loaded.

edx$dic_lookup_word

You pass a word to edx$dic_lookup_word, and it will return a value indicating if the word is correctly spelled or not.

int edx$dic_lookup_word(char *spellwordptr,
                        char *errbuf,
                        int errbuflen,
                        char *Dic_File_Name,
                        char *Aux1_File_Name);

char *spellwordptr

The word you want to check the spelling of (pointer to ASCIIZ string). This string should contain just the word. Leading and trailing spaces are not trimmed for you, so trim them before the call. The case of the word (uppercase or lowercase) is not important: edx$dic_lookup_word makes a lowercase copy of the word before looking it up in the lexical database.

char *errbuf

int errbuflen

You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. One instance where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") can't be found. The error message returned in errbuf, in this case, will be something like:

"Error opening C:\Program Files\Multi-Edit 2006\EDXDIC.DIC
 Error is: 2: The system cannot find the file specified."

It is up to you to display the error message to the user.

char *Dic_File_Name

ASCIIZ string containing the filename of the lexical database file. This is usually the full path/filename, e.g., Dic_File_Name = 'C:\Program Files\Multi-Edit 2006\EDXDIC.DIC'. If you specify just "EDXDIC.DIC", then the current directory is searched. If the file is not found, edx$dic_lookup_word returns EDX__ERROR, and the error message is placed in errbuf.

If you wish, you may rename EDXDIC.DIC to something else (perhaps EDX_DICTIONARY.DAT). Just specify the new name here.

On the first call to edx$dic_lookup_word, the lexical database file specified in Dic_File_Name will be mapped into memory. The value of Dic_File_Name is ignored on all future calls once the lexical database file is loaded.

char *Aux1_File_Name

ASCIIZ string containing the filename of the user's personal auxiliary dictionary file. You may use a null string here if the user does not have a personal auxiliary dictionary file. This is usually the full path/filename, e.g., Aux1_File_Name = 'C:\Program Files\Multi-Edit 2006\EDXAUX1.TXT'. If you specify just "EDXAUX1.TXT", then the current directory is searched. If a file is specified, and the specified file is not found, it is created.

A user's personal auxiliary dictionary file is a plain text file with one word per line. The user may edit this file with an ordinary text editor to add or remove words. You may also use the edx$add_persdic function to append a word to this file.

On the first call to edx$dic_lookup_word, the user's personal auxiliary dictionary specified in Aux1_File_Name will be loaded into memory (unless a NULL string is passed, in which case this step is skipped). The value of Aux1_File_Name is ignored on all future calls.

Return values:

Add the following three defines to your code:

#define EDX__WORDFOUND 1
#define EDX__WORDNOTFOUND 2
#define EDX__ERROR 4

These are the three possible return values. Note there are two underscore characters after EDX. If the return value is EDX__ERROR then errbuf will contain further information about the error. It's up to you to display this error message to the user.

Discussion:

On the first call to edx$dic_lookup_word, the main dictionary specified by Dic_File_Name is opened and mapped, and if Aux1_File_Name is not a null string ("") then that file is briefly opened and read into memory. Then the word to spell check in spellwordptr is searched for in the main dictionary and in the user's personal auxiliary dictionary. If the word is found EDX__WORDFOUND is returned. If the word is not found EDX__WORDNOTFOUND is returned.

Notes:

  • Spell checking a zero length string will return EDX__WORDFOUND.
  • The dictionary cannot store words longer than 31 characters.
  • It's up to you to trim off any leading and trailing spaces of the word you want to spell check before handing it to edx$dic_lookup_word.
  • You may, if you wish, initialize the spelling checker by spell checking a zero length string, passing the names of the EDX dictionary file Dic_File_Name and optional user's personal dictionary file Aux1_File_Name to edx$dic_lookup_word. You may then use NULL strings for these two parameters on all future calls to edx$dic_lookup_word, as these two parameters will be ignored on all future calls.
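
Since trimming is left to the caller, a small helper along the following lines could be run on each word before it is handed to edx$dic_lookup_word. This is only a sketch: the name trim_word and its return convention are assumptions for illustration, not part of edxspell.dll.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of edxspell.dll).
   Copies src into dst with leading and trailing whitespace removed.
   Returns the trimmed length, or -1 if the result does not fit in dst. */
int trim_word(const char *src, char *dst, size_t dstlen)
{
    const char *end;
    size_t len;

    /* Skip leading whitespace. */
    while (*src != '\0' && isspace((unsigned char)*src))
        src++;

    /* Back up over trailing whitespace. */
    end = src + strlen(src);
    while (end > src && isspace((unsigned char)end[-1]))
        end--;

    len = (size_t)(end - src);
    if (len + 1 > dstlen)
        return -1;              /* no room for the word plus its NUL */

    memcpy(dst, src, len);
    dst[len] = '\0';
    return (int)len;
}
```

With a 32-byte destination buffer, a return value of -1 also tells you the word is longer than the 31 characters the dictionary can store.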

edx$spell_guess

Every call to edx$dic_lookup_word sets up spell guessing. The word passed to edx$dic_lookup_word is saved, and spell guessing is initialized. If the said word is misspelled, you may then make repeated calls to edx$spell_guess to get suggested words the user might have meant to type. (You may make these calls even if the word is not misspelled, though I don't know why anyone would bother.) Each call to edx$spell_guess will return a correctly spelled word from the main EDX dictionary or the user's personal auxiliary dictionary which is similar to the misspelled word passed to edx$dic_lookup_word. You may continue making calls to edx$spell_guess until edx$spell_guess returns EDX__WORDNOTFOUND indicating there are no more suggestions.

int edx$spell_guess(char *guessword, char *errbuf, int errbuflen);

char *guessword

Pointer to a buffer which you supply to receive the guess-word. The buffer should be large enough to hold a 31 character word plus the terminating NUL byte, so you need 32 bytes.

char *errbuf

int errbuflen

You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. (I suggest an error buffer size of around 400 characters.) The only instance I know of where you will get an error message is if the EDX lexical database file (i.e., the "dictionary") is located on a remote computer and the network connection to that remote computer is lost. In this case, the error message returned in errbuf will be:

"EDXspell.dll encountered
error EXCEPTION_IN_PAGE_ERROR. This error can 
occur if the EDX dictionary file is on a
remote computer and the network connection 
to that remote computer is lost."

It is up to you to display the error message to the user. (You don't have to display it. You could just treat this error as if EDX__WORDNOTFOUND were returned, and stop spell guessing. In this case, you may specify errbuflen = 0 and not receive the error message since you're not going to display it.) Normally, the EDX dictionary file is on the same computer as the program, and this is not a problem.

Return values:
EDX__WORDFOUND     - guessword is filled with another guess word.
EDX__WORDNOTFOUND  - all out of guesses.
EDX__ERROR         - EXCEPTION_IN_PAGE_ERROR (see above).

Here is an outline of how edx$spell_guess goes about spell guessing:

  1. reversals (test for transposed characters)
  2. vowels (test for the wrong vowel used)
  3. minus chars (test for extra characters in the word)
  4. plus chars (test for characters missing from the word)
  5. consonants (test for wrong characters used)
  6. give up (give up)

The code takes care not to guess the same word twice.
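
To make the first step of the outline concrete, here is a minimal sketch of a "reversals" pass. It is not the code inside edxspell.dll: the tiny in-memory word list demo_words and the functions demo_lookup and guess_reversal are hypothetical stand-ins, used only to show the idea of swapping each adjacent pair of characters and testing the result against a dictionary.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the lexical database: a few words,
   sorted by byte value so that bsearch() can be used. */
static const char *const demo_words[] = {
    "form", "from", "the", "their", "there"
};

static int cmp_word(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns 1 if word is in demo_words, 0 otherwise. */
int demo_lookup(const char *word)
{
    return bsearch(word, demo_words,
                   sizeof demo_words / sizeof demo_words[0],
                   sizeof demo_words[0], cmp_word) != NULL;
}

/* Tries every transposition of two adjacent characters in word.
   On the first dictionary hit, leaves the guess in guess and returns 1.
   Returns 0 if no transposition yields a known word. */
int guess_reversal(const char *word, char *guess, size_t guesslen)
{
    size_t len = strlen(word);
    size_t i;

    if (len < 2 || len + 1 > guesslen)
        return 0;
    memcpy(guess, word, len + 1);

    for (i = 0; i + 1 < len; i++) {
        char tmp = guess[i];
        if (tmp == guess[i + 1])
            continue;               /* swapping equal letters changes nothing */
        guess[i] = guess[i + 1];    /* swap the pair */
        guess[i + 1] = tmp;
        if (demo_lookup(guess))
            return 1;
        guess[i + 1] = guess[i];    /* undo the swap and move on */
        guess[i] = tmp;
    }
    return 0;
}
```

The other steps in the outline can be pictured the same way: generate candidate words by a simple edit (change a vowel, drop a character, insert a character), and keep the ones the dictionary recognizes.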

edx$add_persdic

Adds a word to the user's personal auxiliary dictionary. The user's personal auxiliary dictionary is a plain text file with one word per line. The user may also edit this file with an ordinary text editor.

char *newword

The word you want to add to the user's personal auxiliary dictionary (pointer to ASCIIZ string). Leading and trailing spaces are trimmed and the word is lowercased before it is added to the file. The resulting word can be no longer than 31 characters.

char *errbuf

int errbuflen

You provide an error buffer where error messages can be written to. Provide a pointer to a buffer, and provide the length of the buffer. It is up to you to display the error message to the user. (I suggest an error buffer size of around 400 characters.)

Return values:
  • EDX__WORDFOUND - Successfully added word to user's personal dictionary.
  • EDX__ERROR - Error adding word to user's personal dictionary. Error text is in errbuf.

edx$dll_version

Returns a long string containing information. The string may look something like:

EDX Spelling Checker file edxspell.dll version 7.2 November 26, 2006.
EDX dictionary file is version 5 (Extended ANSI character compatible)
There are no extended ANSI characters in the dictionary.
Extended ANSI Guessing is: OFF.
User's personal auxiliary dictionary file is: EDXMYAUX1DIC.TXT

char *buf

int buflen

You provide a buffer where the version message string can be written to. Provide a pointer to a buffer, and provide the length of the buffer. (I suggest a buffer size of around 550 characters.)

Running the demo

To try the demo:

  1. Download the demo.
  2. Read the "0Readme EDX Spellchecker Demo.txt" file.

Using the code

If you are spell checking a buffer, then the general outline of code you would write would be:

  1. Parse off the next word.
  2. Call edx$dic_lookup_word to spell-check the word.
  3. If edx$dic_lookup_word returns EDX__WORDNOTFOUND, then declare the word misspelled, and offer suggestions to the user by making repeated calls to edx$spell_guess until edx$spell_guess returns EDX__WORDNOTFOUND (meaning, no more guess words).

Here is a code fragment which spell checks testword. (edxdic holds the dictionary filename, and edxaux1, a name chosen for this example, holds the personal auxiliary dictionary filename, or "" if there is none.)

status = edx$dic_lookup_word(testword, errbuf, errbuflen, edxdic, edxaux1);

switch (status)
{
   case EDX__WORDFOUND:
      //Good. Word is correctly spelled.
      break;
   case EDX__WORDNOTFOUND:
      //Word misspelled. Let's spell guess.
      while (EDX__WORDFOUND == (guess_status =
             edx$spell_guess(ResultBuf, errbuf, errbuflen)))
      {
         //ResultBuf contains a guess word. Display it to the user.
         printf("Did you mean: %s\n", ResultBuf);
      }
      //When we drop out here, guess_status is either
      //EDX__WORDNOTFOUND (no more guesses)
      //or EDX__ERROR (which is very unlikely).
      if (guess_status == EDX__ERROR)
      {
         //Bad. Display error message and stop spell checking.
         fprintf(stderr, "%s\n", errbuf);
      }
      break;
   case EDX__ERROR:
      //Bad. Display error message and stop spell checking.
      fprintf(stderr, "%s\n", errbuf);
      break;
}

For a simple working example, see file "CallEdxSpell.cpp" in the "CallEdxSpell Source" folder of the source download.

Considerations when adding a spelling checker to your program

Parsing off the next word

It's up to you to provide words to edx$dic_lookup_word, so if you're spell checking a buffer, you must write some code that will parse off the next word to spell check. See the file "Parsing off words.txt" in the "Documentation" folder of the source download for some suggestions.
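
As a rough starting point, a word parser might look like the sketch below. The function name next_word and its rules (a word is a run of letters, with apostrophes allowed between letters) are assumptions made for this example, and isalpha in the default C locale only recognizes ASCII letters, so text with extended ANSI characters would need a different letter test.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of edxspell.dll).
   Scans forward from *cursor, copies the next word into out, and
   advances *cursor past it. Returns 1 if a word was found, or 0 at
   the end of the buffer. A word longer than outlen - 1 characters
   is truncated in out, but the cursor still skips all of it. */
int next_word(const char **cursor, char *out, size_t outlen)
{
    const char *p = *cursor;
    size_t n = 0;

    if (outlen == 0)
        return 0;

    /* Skip everything that cannot start a word. */
    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        out[0] = '\0';
        return 0;
    }

    /* Collect letters; allow an apostrophe only between letters. */
    while (isalpha((unsigned char)*p) ||
           (*p == '\'' && isalpha((unsigned char)p[1]))) {
        if (n + 1 < outlen)
            out[n++] = *p;
        p++;
    }
    out[n] = '\0';
    *cursor = p;
    return 1;
}
```

You would call next_word in a loop with a 32-byte buffer, passing each word it returns to edx$dic_lookup_word.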

User's personal auxiliary dictionary

The code now supports an optional user's personal auxiliary dictionary (also called the "User's Aux1 dictionary" or the "User's Aux1 Lexical Database"). This is a plain text file with one word per line. The contents of the file are loaded on the first call to edx$dic_lookup_word. Leading and trailing spaces are trimmed, and the words are lowercased when loaded. Resulting words may be no longer than 31 characters. edx$dic_lookup_word will return an error message if it finds a word longer than 31 characters when loading the user's personal auxiliary dictionary.
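
The loading rules just described (trim, lowercase, reject anything over 31 characters) can be pictured with a sketch like the one below. It is an illustration only, not the DLL's loader; normalize_aux_line is a hypothetical name, and tolower in the default C locale only lowercases ASCII letters.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of edxspell.dll).
   Normalizes one line of a personal dictionary file in place:
   strips leading/trailing whitespace (including the newline) and
   lowercases the word. Returns the resulting length, or -1 if the
   word is longer than 31 characters (line is then left unchanged). */
int normalize_aux_line(char *line)
{
    char *start = line;
    char *end;
    size_t len, i;

    while (isspace((unsigned char)*start))
        start++;
    end = start + strlen(start);
    while (end > start && isspace((unsigned char)end[-1]))
        end--;

    len = (size_t)(end - start);
    if (len > 31)
        return -1;

    /* Shift the word to the front of the buffer while lowercasing. */
    for (i = 0; i < len; i++)
        line[i] = (char)tolower((unsigned char)start[i]);
    line[len] = '\0';
    return (int)len;
}
```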

Words in the user's personal auxiliary dictionary are checked when spell checking a word, and when spell guessing.

Keep track of spelling corrections

A further enhancement would be to keep track of spelling corrections as they are made. If the user misspells a word, and selects a correctly spelled word from your list of guess words, then you can save that correction. If you encounter the same misspelled word again, offer to make the same change. (The EDX spelling checker does not do this for you.)
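
One simple way to do this in your own program is a small table of (misspelled word, chosen replacement) pairs. The sketch below is an assumption about how such a table could look; the type and function names are made up for this example, and a fixed-size array is used only to keep it short.

```c
#include <string.h>

#define MAX_CORRECTIONS 64
#define MAX_WORD 32   /* 31 characters plus the terminating NUL */

/* Hypothetical types and functions, not part of edxspell.dll. */
struct correction {
    char wrong[MAX_WORD];
    char right[MAX_WORD];
};

struct correction_table {
    struct correction items[MAX_CORRECTIONS];
    int count;
};

/* Records that the user replaced wrong with right. If wrong is already
   in the table, its replacement is updated. Returns 1 on success, 0 if
   a word is too long or the table is full. */
int remember_correction(struct correction_table *t,
                        const char *wrong, const char *right)
{
    int i;

    if (strlen(wrong) >= MAX_WORD || strlen(right) >= MAX_WORD)
        return 0;
    for (i = 0; i < t->count; i++) {
        if (strcmp(t->items[i].wrong, wrong) == 0) {
            strcpy(t->items[i].right, right);
            return 1;
        }
    }
    if (t->count == MAX_CORRECTIONS)
        return 0;
    strcpy(t->items[t->count].wrong, wrong);
    strcpy(t->items[t->count].right, right);
    t->count++;
    return 1;
}

/* Returns the remembered replacement for wrong, or NULL if none. */
const char *find_correction(const struct correction_table *t,
                            const char *wrong)
{
    int i;

    for (i = 0; i < t->count; i++)
        if (strcmp(t->items[i].wrong, wrong) == 0)
            return t->items[i].right;
    return NULL;
}
```

When edx$dic_lookup_word reports a misspelled word, you could call find_correction first and offer the remembered replacement ahead of the spell guesses.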

Performance speed

This spelling checker has proven itself to be quite fast. The secret to the speed of the EDX Spelling Checker is to keep page faults down to a minimum. The design of the EDX lexical database file reflects this goal by keeping memory reads near each other. For more information about the layout of the lexical database file and how to optimize it, see the file "Lexical Database File Layout.txt" in the source download. For more information about what page faults are and why a good understanding of them is so crucial to program execution speed, see the file "PAGE_FAULTS_AND_ARRAY_ADDRESSING.TXT" in the "Documentation" folder of the source download.

Loading the lexical database file

The other secret to speed is to map the lexical database file ("dictionary") into virtual memory instead of reading it in. The dictionary could be loaded by first allocating enough memory to hold the file, and then reading the entire file into the allocated memory. This would be quite slow due to the large size of the database. Also, a user's page file quota limits the total amount of memory a user may allocate, and the memory required to hold the database file is a considerable amount of memory.

Instead, we call the Microsoft Windows functions CreateFileMapping and MapViewOfFile to load the dictionary file. They achieve the same result as allocating memory and reading the file into it, except that they never allocate memory from the system and never read in the file. Instead, they expand the process's virtual address space by the size of the file EDXDIC.DIC (the "dictionary", i.e., the lexical database file), thereby instantly making new virtual memory available, and declare that the physical file EDXDIC.DIC itself is the read-only paging file for that section of memory. The initialization is now complete, with hardly any work having been done.

Now, when the program attempts to read some of the dictionary that's in that memory range, a page fault will occur if that page is not already in memory, and that page is automatically read into memory. And since we're not using the system paging file for this, the user's page file quota is not affected.

It also helps if you defragment the file EDXDIC.DIC, since it's being used as a paging file.
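
The mapping technique described in this section can be sketched as follows. This is not the code from edxspell.dll; map_file_readonly is a hypothetical name, error handling is reduced to returning NULL, and the non-Windows branch (POSIX mmap) is included only so the sketch compiles on other systems.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/* Hypothetical helper: maps a whole file read-only into memory.
   Returns the base address and stores the file size in *size,
   or returns NULL on failure (including an empty file). */
const void *map_file_readonly(const char *path, size_t *size)
{
#ifdef _WIN32
    HANDLE file, mapping;
    LARGE_INTEGER fsize;
    const void *view;

    file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;
    if (!GetFileSizeEx(file, &fsize)) {
        CloseHandle(file);
        return NULL;
    }
    /* Size 0,0 means "the current size of the file". */
    mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(file);        /* the mapping object keeps the file open */
    if (mapping == NULL)
        return NULL;
    view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping);     /* the view keeps the mapping alive */
    if (view == NULL)
        return NULL;
    *size = (size_t)fsize.QuadPart;
    return view;
#else
    struct stat st;
    void *view;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    view = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                /* the mapping stays valid after close */
    if (view == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return view;
#endif
}
```

No file data is read by this call. Pages are brought in only when the returned memory is first touched, which is why initialization is nearly free. A long-running program would also release the view when finished (UnmapViewOfFile on Windows, munmap on POSIX), which the sketch omits.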

English Lexical Database Provided

An American English lexical database is provided with over 90,000 words. Every effort has been made to ensure all the words are correctly spelled.

Other Languages

You may also create a lexical database in another language, if you wish. All you have to do is supply a file containing all the words in whatever language you want. The only limitation is that the maximum word length is 31 characters, and the file must be sorted by byte value. (This is the usual sort order, where we pay attention to the unsigned value of each byte, rather than what character the byte represents.) Below are links to a few places where you can get lexicons.

Lexicons (files that contain a list of words) for other languages may be found at SourceForge (see SCOWL, Spelling Checker Oriented Word Lists).

Be forewarned, the lexicons at the above web site contain a lot of words which aren't to be found in any dictionary! (They contain a lot of misspelled words, or words which should actually be hyphenated or two separate words.) A lot of work has gone into ensuring the words in my lexicon, EDX_DICTIONARY.TXT, are correctly spelled.

Another site that has Lexicons in various languages is: WinEdt Dictionaries.

For more info on creating the EDX lexical database file and optimizing your EDX_COMMONWORDS.TXT file, see "0Readme EDXBuildDictionary.txt" in the "Build EDX Dictionary Source" folder of the source download.
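
The two requirements on a word list mentioned above, a 31 character maximum and sorting by byte value, can be checked and applied with a sketch like this one. It says nothing about the actual build tool in the source download; prepare_lexicon and cmp_bytes are hypothetical names. It relies on the fact that the C standard defines strcmp to compare characters as unsigned char, which is exactly a sort by byte value.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders strings by unsigned byte value.
   strcmp compares bytes as unsigned char, so a byte such as 0xE9
   sorts after every ASCII letter. */
int cmp_bytes(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcmp(sa, sb);
}

/* Hypothetical helper: checks the 31 character limit, then sorts the
   word list in place by byte value. Returns -1 on success, or the
   index of the first word that is too long (list is left unsorted). */
int prepare_lexicon(const char **words, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (strlen(words[i]) > 31)
            return (int)i;
    qsort(words, count, sizeof words[0], cmp_bytes);
    return -1;
}
```

Note that a byte-value sort puts uppercase ASCII letters before lowercase ones, and extended ANSI characters after both.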

Update: The code has been updated to handle all ANSI characters 128 - 255. So it can now handle characters such as: š œ ž ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ. See files "EDX Using Extended ANSI Characters.txt" and "EDX lowercasing extended letters.htm" in the "Documentation" folder for more information about this.

Other operating systems

See the file "Porting EDXspell to other operating systems.txt" in the "Documentation" folder of the source download if you wish to port this code to another operating system besides Microsoft Windows. (The code was originally written for the VMS operating system, and later modified to work with Microsoft Windows.)

History

  • EDX was the name of a text editor I wrote in TPU for VMS operating systems, many years ago, back when dinosaurs roamed the earth.
  • March 1990 - Original spelling checker code written by me from scratch, in VAX Macro Assembly Language.
  • July 1993 - Forced to convert the code to the C language as VAX Macro became obsolete.
  • March 2004 - Converted the C code to work on Microsoft Windows so I could add a spelling checker to the Multi-Edit text and code editor. (Tested on Windows XP, Windows 2000, and Windows ME.)
  • July 2006 - Submitted code to The Code Project. I've been using this spelling checker code for many years with no problems. The code is stable and fast! If you use the code, give me credit for it, that's all I ask.
  • Nov 2006 - Added support for extended ANSI characters (characters above ASCII 127, characters with accent marks, as found in some foreign languages such as French).
  • Nov 2006 - Added support for a user's personal auxiliary dictionary file.

Glossary

  • ASCIIZ - A zero terminated character string. The kind of string used in the C and C++ programming languages.
  • Dictionary - Technically, a "dictionary" should contain definitions of words. However, with spelling checkers, it's common to say "dictionary" when what we really mean is "lexicon" or "lexical database file".
  • Lexicon - a file that contains a list of words.
  • The "EDX Dictionary" is also called the "EDX Lexical Database File". Both refer to the file EDXDIC.DIC. (It's just shorter to say "dictionary" than "Lexical Database File". Also you can change the name of file EDXDIC.DIC to anything you want.)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)