
A free Spell Checker with a dictionary editor program

6 Jan 2001
An article on adding a spell checker to your application.

Sample Image

Introduction

First things first: yes, it is a spelling checker; no, you don't need any DLLs or ActiveX controls; and yes, it is free.

But it's slow, and it needs quite a bit more work (as you can see from the screenshot - it's giving hundreds of suggestions!). The main reason I'm posting it here unfinished is so that other people can improve (rip out and start again) the code that is the main "engine" of it, and add other features to it.

Thanks

Sections of code come from the following people:

  • Zafir Anjum - CListCtrl sort code
  • Chris Maunder - For this site and the CProgressControl class

Background

A few months ago, I was in need of a spell checker to add to one of my applications. I looked at various DLLs and ActiveX controls, but they didn't seem to do things the way I wanted them to. I also found several Open Source spell checkers for Unix, but it wasn't really possible to easily convert them to Windows.

In the end, I decided to try making my own, and after experimenting a bit, I decided I needed a dictionary first, in order to test the accuracy of the results. For the last four months, I've been compiling the dictionary, and then I tested the original test code I had, to see how it worked. It was very slow, and as I don't have much experience with huge arrays and lists, I thought it best to post it here, and let others see and improve/change it.

Components

This project consists of three parts, each with its own download: the dictionary itself, a dictionary editor program, and the test spell checker program.

Dictionary

The dictionary consists of nearly 75,000 words, and is in English (UK/International). The bulk of the words came from several available dictionaries on the web, but surprisingly, though they had lots of place names, and words I'd never even heard of before (but checked OED and found they did exist :) ), they seemed to lack many everyday words (e.g. their, owner, to name a few). I've spent the months since I started this project adding words, and checking them (as well as most of the existing words), and while I'm fairly confident its accuracy is very good, it's not going to be perfect, and there'll probably be more than a few words missing :).

By all means, add words to it, or send the words to me in a text file and I'll update it, but it's UK English at the moment, so it's probably better to start an American English version if we're going to be adding words like "Color" and "Customize" rather than "Colour" and "Customise" <g>.

It is in the binary format the sample programs use, which I'll explain in the next section. You can use the Dictionary Editor program to export it to a plain text file.

Dictionary Editor

First of all, I'll talk about the format the dictionary is in - both programs use this format to read and edit the dictionary.

I'm basically just using a CStringArray array which contains all the words, and I serialize this to a file.

It's faster than using a CObArray array which I was using before, but it's still not fast enough.

The program itself is simply a default MFC AppWizard program, with a CListView view subclassed to a custom CListCtrl that has a sort function added (thanks to Zafir Anjum on CodeGuru), and the function modified slightly so that it is case insensitive.

It uses a CStringArray array to store each word.

Most of the code in the Dictionary Editor is commented, so I won't explain it completely here.

It has three toolbar buttons (Add, Delete and Modify) which allow you to add, delete and modify words.

It also has Import and Export commands on the File menu, which allow you to import and export words from and to text files with a single word on each line.

The Import function has code commented out that will check each word that is to be added, and make sure it doesn't exist already.

The reason it's commented out is that it doesn't work - it was still going after 7 hours importing the full dictionary, whereas UltraEdit manages to remove duplicate items AND sort them in under two seconds, which says much about this code :).
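One likely reason for the gap is that checking every imported word against every word already in the list costs time proportional to the square of the list size, whereas sorting first puts duplicates next to each other so they can be dropped in a single pass. The following is a minimal sketch of that idea in standard C++ rather than MFC. The names ToLower and SortAndRemoveDuplicates are made up for illustration and are not part of the download, and the lowercasing only handles plain ASCII.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Return a lowercase copy of the word (ASCII only; an assumption of this sketch).
std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: sort the word list case-insensitively, then remove
// neighbouring entries that differ only by case.
void SortAndRemoveDuplicates(std::vector<std::string>& words)
{
    std::sort(words.begin(), words.end(),
              [](const std::string& a, const std::string& b) {
                  return ToLower(a) < ToLower(b);
              });

    // After sorting, duplicates are adjacent, so std::unique finds them all.
    words.erase(std::unique(words.begin(), words.end(),
                            [](const std::string& a, const std::string& b) {
                                return ToLower(a) == ToLower(b);
                            }),
                words.end());
}
```

Lowercasing inside the comparison is wasteful; if every word is lowercased once on import (as the Lowercase command already does), the plain std::sort and std::unique overloads are enough.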

Clicking on the Word header control will sort the words.

The Tools menu contains three commands:

Trim

The Trim command goes through all the words and does a TrimLeft() and TrimRight() on all the words to remove leading and trailing spaces.

void CDictionaryEditorView::OnToolsTrim() 
{
    CWaitCursor wait; // Display a wait cursor

    CListCtrl& m_List = GetListCtrl(); // Get handle to the base List Control

    // Get the number of words in the list control
    int x = m_List.GetItemCount();

    GetDocument()->array.RemoveAll(); // Remove all words in our dictionary

    for (int j = 0; j < x; j++) // For each item in the list
    {
        CString strWord;

        strWord = m_List.GetItemText(j, 0); // Get the word

        strWord.TrimLeft(); // Remove leading spaces
        strWord.TrimRight(); // Remove trailing spaces

        // Overwrite the old word with the new trimmed word
        m_List.SetItemText(j, 0, strWord);

        GetDocument()->AddWord(strWord); // Add the word to the dictionary
    }

    GetDocument()->SetModifiedFlag(); // Set the document to be modified
}

Lowercase

The Lowercase command goes through all the words and does a MakeLower() on them to make them all lowercase.

void CDictionaryEditorView::OnToolsLowercase() 
{
    CWaitCursor wait; // Display a wait cursor

    CListCtrl& m_List = GetListCtrl(); // Get handle to the base List Control

    // Get the number of words in the list control
    int x = m_List.GetItemCount();

    GetDocument()->array.RemoveAll(); // Remove all words in our dictionary

    for (int j = 0; j < x; j++) // For each item in the list
    {
        CString strWord;

        strWord = m_List.GetItemText(j, 0); // Get the word

        strWord.MakeLower(); // Make the word lowercase

        // Overwrite the old word with the new lowercase word
        m_List.SetItemText(j, 0, strWord);

        GetDocument()->AddWord(strWord); // Add the word to the dictionary
    }

    GetDocument()->SetModifiedFlag(); // Set the document to be modified
}

Find and Find Next

The Find command goes through the list of words and finds words which have certain characters in them and then highlights them and selects them one at a time.

I used this to find words which weren't imported correctly (due to binary format) and had invalid characters on the sides that the Trim command didn't find, and then edit them manually. The Find Next command carries on from the last found word.

void CDictionaryEditorView::OnToolsFind() 
{
    CListCtrl& m_List = GetListCtrl(); // Get handle to the base List Control

    m_nFind = 0; // Reset FindNext counter to 0

    // Get the number of words in the list control
    int x = m_List.GetItemCount(); 

    for (int j = m_nFind; j < x; j++) // For each item in the list
    {
        CString strWord;

        strWord = m_List.GetItemText(j, 0); // Get the word

        // Does that word have one of the following characters in it?
        if (strWord.FindOneOf("����������������������������"
                "����������������������������������������") != -1)
        {
            // Select the word in the list
            m_List.SetItemState(j, LVIS_SELECTED | 
              LVIS_FOCUSED, LVIS_SELECTED | LVIS_FOCUSED); 

            // Make sure it's visible in the list
            m_List.EnsureVisible(j, FALSE); 

            m_nFind = j + 1; // Update the FindNext counter

            return; // Return as we've found what we wanted
        }
    }
}

Test spell checker

This is where the problems start. It's messy, and slow.

Anyway, the main test program consists of a main dialog which has a text box and two command buttons on it. The CSpellDlg class handles this dialog, as well as the main spell checking functions.

The Options button brings up the Options dialog that lets you configure where the dictionary is and where the custom dictionary should be stored.

You'll need to do this before you test the program.

There is also the spell checker dialog that you'll probably be familiar with in one form or the other (this one is like Visio's), which displays when a word that isn't recognized is found.

This displays the word and any spelling suggestions, and returns which button was clicked to the main dialog.

So, this is how it works (or doesn't as the case may be :) ): Assuming there is a sentence or two in the main text box, when you click the Spell button, the OnSpell() function is executed.

This does the following things:

First, it gets the text from the text box using the UpdateData() function, and then it checks to see if the dictionaries exist, and if they do, it loads them. It then creates a CStringArray to hold the words it should ignore for one session (when the user clicks Ignore All), and initializes the counters.

UpdateData(TRUE); // Get the text from the text box

// Check the main dictionary exists
CFileStatus status;
if(CFile::GetStatus(AfxGetApp()->GetProfileString("Settings", 
                                             "Main", ""), status)) 
{
    CFile cfSettingsFile (AfxGetApp()->GetProfileString("Settings", 
            "Main", ""), CFile::modeNoTruncate | CFile::modeReadWrite );
    CArchive ar ( &cfSettingsFile, CArchive::load );
    array.Serialize ( ar ); // Load the main dictionary
}

// See if a custom dictionary exists
CFileStatus status2;
if(CFile::GetStatus(AfxGetApp()->GetProfileString("Settings", 
                                          "Custom", ""), status2)) 
{
    CFile cfSettingsFile (AfxGetApp()->GetProfileString("Settings", 
            "Custom", ""), CFile::modeNoTruncate | CFile::modeReadWrite );
    CArchive ar ( &cfSettingsFile, CArchive::load );
    custarray.Serialize ( ar ); // Load the custom dictionary
}

CString strStart = m_strText;

CStringArray IgnoreAll; // Initialise Ignore All list

int nPos = 0; // Initialise position counters
int nPos2 = 0;

The next step is a while loop that looks for specific characters (using a custom FindOneOf() function that allows you to specify where to start looking from). These are the characters that usually separate words ("!'()[]<>,.). Each pass of the loop finds the next word.

Within this loop, the word is extracted, leading and trailing blank spaces are removed, and if it equals nothing, has a length of one character or contains numbers and certain other characters, it is ignored.

CString strWord = strStart.Mid(nPos, nPos2 - nPos);

strWord.TrimLeft();
strWord.TrimRight();

if (strWord == "" || strWord.GetLength() == 1 || 
    strWord.FindOneOf("0123456789+-/@?:*.,") != -1)
{
    // Ignore the word

}
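The word-finding loop as a whole can be sketched in standard C++ as follows. This is an illustration only, not the code from the download: WordSpan and SplitWords are hypothetical names, and the separator set (the characters listed above plus whitespace) is an assumption. std::string's find_first_of() already takes a start position, so no custom search function is needed here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A word together with its position in the original text, so the caller
// can later select or replace it in the text box.
struct WordSpan
{
    std::size_t start;
    std::size_t length;
    std::string word;
};

// Hypothetical helper: split the text on separator characters and skip
// the same kinds of token the article's loop ignores.
std::vector<WordSpan> SplitWords(const std::string& text)
{
    const std::string separators = " \t\r\n\"!'()[]<>,.";  // assumed set
    const std::string skipChars = "0123456789+-/@?:*.,";
    std::vector<WordSpan> result;

    std::size_t pos = 0;
    while (pos < text.size())
    {
        // Find where the next word starts and where it ends.
        std::size_t start = text.find_first_not_of(separators, pos);
        if (start == std::string::npos)
            break;
        std::size_t end = text.find_first_of(separators, start);
        if (end == std::string::npos)
            end = text.size();

        std::string word = text.substr(start, end - start);

        // Ignore single characters and words containing digits or symbols.
        if (word.size() > 1 && word.find_first_of(skipChars) == std::string::npos)
            result.push_back({start, word.size(), word});

        pos = end;
    }
    return result;
}
```

Note that treating the apostrophe as a separator splits "don't" into "don" and "t", so a real checker would probably want to handle apostrophes inside words separately.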

Otherwise, the program goes through the main dictionary, the custom dictionary and the IgnoreAll list to see if the word appears in any of them. If it finds the word in any of the three, it sets the bFound flag to TRUE, skips the word, and goes back to the main While statement to try the next word.

BOOL bFound = FALSE;

// Is the word in the main dictionary?
for (int i = 0; i < array.GetSize(); i++) 
{
    CString strCheckWord;

    strCheckWord = GetWord(i);

    if (strWord.CompareNoCase(strCheckWord) == 0)
    {
        bFound = TRUE; // Yes, exit this for statement
        break;
    }
}

// Is the word in the custom dictionary?
for (int j = 0; j < custarray.GetSize(); j++) 
{
    CString strCheckWord;

    strCheckWord = GetWordCustom(j);

    if (strWord.CompareNoCase(strCheckWord) == 0)
    {
        bFound = TRUE; // Yes, exit this for statement
        break;
    }
}

// Should we ignore the word?
for (int f = 0; f < IgnoreAll.GetSize(); f++)
{
    if (strWord.CompareNoCase((LPCTSTR)IgnoreAll[f]) == 0)
    {
        bFound = TRUE; // Yes, exit this for statement
        break;
    }
}
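Each of these loops walks a whole list for every word in the text, which for the main dictionary means up to 75,000 comparisons per word. One possible alternative, sketched here in standard C++ rather than MFC, is to keep each list in a hash set of lowercased words so that a lookup takes roughly constant time. The names WordSet and IsKnownWord are hypothetical, and the lowercasing is ASCII only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Return a lowercase copy of the word (ASCII only; an assumption of this sketch).
std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical wrapper: stores words lowercased so lookups ignore case.
class WordSet
{
public:
    void Add(const std::string& word) { m_words.insert(ToLower(word)); }

    bool Contains(const std::string& word) const
    {
        return m_words.count(ToLower(word)) != 0;
    }

private:
    std::unordered_set<std::string> m_words;
};

// Plays the role of the bFound flag: true if the word is in the main
// dictionary, the custom dictionary, or the Ignore All list.
bool IsKnownWord(const std::string& word, const WordSet& mainDict,
                 const WordSet& customDict, const WordSet& ignoreAll)
{
    return mainDict.Contains(word) || customDict.Contains(word) ||
           ignoreAll.Contains(word);
}
```

The || operators stop at the first list that contains the word, so the custom dictionary and the Ignore All list are only consulted when needed.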

Otherwise, if it hasn't found the word anywhere, we get to the fun bit, where the word is highlighted in the text box, possible suggestions of what the word should be are found, and the Spell Check dialog box is displayed, so the user can choose what action to take.

This is the section that needs re-writing, as it's messy, and doesn't work that well.

What I attempted to do was go through each word in the dictionary and compare it against progressively shorter left-hand portions of the misspelt word: all of its characters, then all but the last one, all but the last two, and so on. The closer a dictionary word was to the misspelt word, the nearer the top of the suggestions list it was added.

However, the code I used wastes a lot of time making these comparisons against dictionary words whose first character differs from the misspelt word's, so roughly 25/26 of the time the loop is useless, and this needs to be changed.

This method of finding the correct word will also not find "upon" if "apon" was typed, whereas most other spell checkers find this with no problem at all, so it would be more accurate if some form of vowel swapping was introduced, as well as doing the algorithm backwards, starting from the right.

m_Text.SetSel(nPos, nPos2); // Select the unfound word in the text box

CSpellDialog dlg;

dlg.m_strWord = strWord; // Initialise the Spelldialog

int nGood = 0; // Number of good matches inserted at the top of the list

for (int i = 0; i < array.GetSize(); i++) // For each word in the main dictionary
{
    CString strGetWord;

    strGetWord = GetWord(i); // Get the word

    // Loop through from the number of characters backwards
    for (int j = strWord.GetLength(); j >= 1; --j)
    {
        /* Does the left hand side of the
           dictionary word equal the left hand side
           of the misspelt word? */
        if (strGetWord.Left(j).CompareNoCase(strWord.Left(j)) == 0)
        {
            BOOL bGood = FALSE;

            if (strWord.GetLength() <= 3) // Is the misspelt word a short one?
            {
                // Are the first two characters the same?
                if (strGetWord.Left(2).CompareNoCase(strWord.Left(2)) == 0) 
                {
                    bGood = TRUE; // *Must* be a good match! <g>
                }
            }
            else
            {
                // Do all but the last two characters match?
                if (strGetWord.Left(strWord.GetLength() - 2).CompareNoCase(
                                 strWord.Left(strWord.GetLength() - 2)) == 0)
                {
                    bGood = TRUE;
                }
            }

            BOOL bFoundWord = FALSE;

            // Is this word already suggested?
            for (int k = 0; k < dlg.m_saSuggestions.GetSize(); k++)
            {
                if (strGetWord.CompareNoCase(
                  (LPCTSTR)dlg.m_saSuggestions[k]) == 0)
                {
                    bFoundWord = TRUE;
                    break;
                }
            }

            if (bFoundWord == FALSE)
            {
                // Only suggest words within one character of the same length
                if ((strGetWord.GetLength() >= (strWord.GetLength() - 1)) && 
                    (strGetWord.GetLength() <= (strWord.GetLength() + 1)))
                {
                    if (bGood == TRUE)
                    {
                        // Good match, so add it near
                        // the top of the suggestion list
                        dlg.m_saSuggestions.InsertAt(nGood, strGetWord);
                        nGood++;
                    }
                    else
                    {
                        dlg.m_saSuggestions.Add(strGetWord);
                    }
                }
                break;
            }
        }
    }
}

This process is then repeated for the custom dictionary.
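If the dictionary is kept sorted and lowercased, the words sharing a given prefix sit next to each other, so they can be located with a binary search instead of a scan of every entry. The following is a rough sketch of that idea in standard C++, not the code from the download. WordsWithPrefix is a hypothetical name, and the sketch assumes both the dictionary and the prefix are already lowercase.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: return every word in a sorted, lowercase dictionary
// that starts with the given lowercase prefix.
std::vector<std::string> WordsWithPrefix(const std::vector<std::string>& sortedDict,
                                         const std::string& prefix)
{
    std::vector<std::string> matches;

    // Binary search for the first word that is not less than the prefix.
    // In a sorted list, all words starting with the prefix follow from here.
    auto it = std::lower_bound(sortedDict.begin(), sortedDict.end(), prefix);

    // Collect words until one no longer starts with the prefix.
    for (; it != sortedDict.end() &&
           it->compare(0, prefix.size(), prefix) == 0; ++it)
    {
        matches.push_back(*it);
    }
    return matches;
}
```

Calling this with the first one or two characters of the misspelt word would restrict the suggestion loop to candidates that can actually match from the left.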

The final stage is to display the Spell Check dialog box, which allows the user to say if they want to Ignore the word, Add it to the custom dictionary, or Change it for one of the suggestions in the list box. The program then does what the user wants, depending on what button they clicked.

// Display the Spell Dialog and wait for the user to do something
if (dlg.DoModal() == IDOK) 
{
    if (dlg.m_nOption == 1) 
    {
        // Add Word was clicked, so add the word
        // to the custom dictionary
        CString strAddWord = strWord;
        strAddWord.MakeLower();

        AddWordCustom(strAddWord);

        CFile cfSettingsFile (AfxGetApp()->GetProfileString("Settings", 
           "Custom", ""), CFile::modeCreate | 
           CFile::modeNoTruncate | CFile::modeReadWrite );
        CArchive ar2 ( &cfSettingsFile, CArchive::store );
        custarray.Serialize ( ar2 ); // Save the custom dictionary
    }
    else if (dlg.m_nOption == 2) 
    {
        // Change button was clicked, so change the word
        strStart.Delete(nPos, nPos2 - nPos);
        m_strText.Delete(nPos, nPos2 - nPos);

        strStart.Insert(nPos, dlg.m_strChangeTo);
        m_strText.Insert(nPos, dlg.m_strChangeTo);

        nPos2 = nPos + dlg.m_strChangeTo.GetLength();

        UpdateData(FALSE); // Update the text box
    }
    else if (dlg.m_nOption == 3)
    {
        // Ignore button was clicked, so ignore the word this once
    }
    else if (dlg.m_nOption == 4) 
    {
        // Ignore All button was clicked, so add it to the IgnoreAll list
        CString strIgnoreAllWord = strWord;
        strIgnoreAllWord.MakeLower();
        IgnoreAll.Add(strIgnoreAllWord);
    }
}
else
{
    return; // Cancel button or Close button was clicked, so quit function
}

It should "work" now :).

Improvements

Lots, I know <g>.

The suggestion-finding algorithm needs a lot of work and a lot of speeding up, as it wastes quite a bit of time comparing strings when the first characters aren't even the same. This is the main bottleneck, but everything really needs optimizing.

Presently, it only looks for correct words that start with the same characters as the misspelt word, but the correct word doesn't always start that way. An algorithm that swaps vowels around would probably work better. Looking at the way Microsoft Word suggests words, though, it seems to know what word you mean even when it isn't that similar to what you typed, so perhaps something more complicated is going on, such as the dictionary being subdivided into sections so that words that are often misspelt as each other sit together, making it easier for the program to pick suggestions.
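One widely used way of ranking suggestions regardless of where the mistake falls is edit distance (Levenshtein distance): the number of single-character insertions, deletions or substitutions needed to turn one word into another. This is a different technique from the left-hand matching used in the test program, and nothing here says it is what Microsoft Word does. The sketch below is standard C++, and Suggest is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions or substitutions needed to turn a into b.
std::size_t EditDistance(const std::string& a, const std::string& b)
{
    // Only two rows of the dynamic programming table are needed at a time.
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Hypothetical helper: return the dictionary words within maxDistance edits
// of the misspelt word, closest first (ties in alphabetical order).
std::vector<std::string> Suggest(const std::string& word,
                                 const std::vector<std::string>& dict,
                                 std::size_t maxDistance)
{
    std::vector<std::pair<std::size_t, std::string>> scored;
    for (const std::string& candidate : dict)
    {
        std::size_t d = EditDistance(word, candidate);
        if (d <= maxDistance)
            scored.emplace_back(d, candidate);
    }
    std::sort(scored.begin(), scored.end());

    std::vector<std::string> result;
    for (const auto& entry : scored)
        result.push_back(entry.second);
    return result;
}
```

With this measure, "apon" and "upon" are one substitution apart, so "upon" is suggested even though the first letter differs. Running it against all 75,000 words for every misspelling is still a full scan, so in practice the candidates would need narrowing first, for example by word length.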

Another thing that probably should be added is a Change All button, where all words misspelt one way are corrected at once.
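A Change All command could be built on a whole-word replace over the text. The following is a minimal sketch in standard C++, not part of the download: ChangeAll is a hypothetical name, the match is case-sensitive, and a word boundary is assumed to be anything that is not an ASCII letter.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: replace every whole-word occurrence of `from` with
// `to` and return how many replacements were made.
std::size_t ChangeAll(std::string& text, const std::string& from,
                      const std::string& to)
{
    if (from.empty())
        return 0;

    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos)
    {
        std::size_t end = pos + from.size();

        // Only accept the match if it is not part of a longer word.
        bool startOk = (pos == 0) ||
            !std::isalpha(static_cast<unsigned char>(text[pos - 1]));
        bool endOk = (end == text.size()) ||
            !std::isalpha(static_cast<unsigned char>(text[end]));

        if (startOk && endOk)
        {
            text.replace(pos, from.size(), to);
            pos += to.size(); // Continue after the replacement
            ++count;
        }
        else
        {
            ++pos; // Not a whole word, keep searching
        }
    }
    return count;
}
```

Because the replacement can change the length of the text, any stored word positions (such as nPos and nPos2 in the main loop) would have to be recalculated afterwards.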

Half the problem is probably the fact that I'm using all MFC functions, but I tried using standard C functions for everything except the array, and it didn't seem to help, so I took them out again for the purpose of this article (I've left one function in there so you can see the sort of thing I tried).

Any suggestions or improvements, please either send them to me or post them here.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
