Introduction
First things first. Yes, it is a spelling checker, no, you don't need any DLLs or ActiveX controls and yes, it is free.
But it's slow, and it needs quite a bit more work (as you can see from the screenshot - it's giving hundreds of suggestions!). The main reason I'm posting it here unfinished is so that other people can improve (rip out and start again) the code that is the main "engine" of it, and add other features to it.
Thanks
Sections of code come from the following people:
- Zafir Anjum - CListCtrl sort code
- Chris Maunder - For this site and the CProgressControl class
Background
A few months ago, I was in need of a spell checker to add to one of my applications. I looked at various DLLs and ActiveX controls, but they didn't seem to do things the way I wanted them to. I also found several Open Source spell checkers for Unix, but it wasn't really possible to easily convert them to Windows.
In the end, I decided to try making my own, and after experimenting a bit, I decided I needed a dictionary first, in order to test the accuracy of the results. For the last four months, I've been compiling the dictionary, and then I tested the original test code I had, to see how it worked. It was very slow, and as I don't have much experience with huge arrays and lists, I thought it best to post it here, and let others see and improve/change it.
Components
This project/article consists of three sections and downloads: the dictionary itself, a dictionary editor program, and the test spell checker program.
Dictionary
The dictionary consists of nearly 75,000 words, and is in English (UK/International). The bulk of the words came from several available dictionaries on the web, but surprisingly, though they had lots of place names, and words I'd never even heard of before (but checked OED and found they did exist :) ), they seemed to lack many everyday words (e.g. their, owner, to name a few). I've spent the months since I started this project adding words, and checking them (as well as most of the existing words), and while I'm fairly confident its accuracy is very good, it's not going to be perfect, and there'll probably be more than a few words missing :).
By all means, add words to it, or send the words to me in a text file and I'll update it, but it's UK English at the moment, so it's probably better to start an American English version if we're going to be adding words like "Color" and "Customize" rather than "Colour" and "Customise" <g>.
It is in the binary format the sample programs use, which I'll explain in the next section. You can use the Dictionary Editor program to export it to a plain text file.
Dictionary Editor
First of all, I'll talk about the format the dictionary is in - both programs use this format to read and edit the dictionary.
I'm basically just using a CStringArray array which contains all the words, and I serialize this to a file.
It's faster than using a CObArray array which I was using before, but it's still not fast enough.
The program itself is simply a default MFC AppWizard program, with a CListView view subclassed to a custom CListCtrl that has a sort function added (thanks to Zafir Anjum on CodeGuru), and the function modified slightly so that it is case insensitive.
It uses a CStringArray array to store each word.
Most of the code in the Dictionary Editor is commented, so I won't explain it completely here.
It has three toolbar buttons (Add, Delete and Modify) which allow you to add, delete and modify words.
It also has Import and Export commands on the File menu, which allow you to import and export words from and to text files with a single word on each line.
The Import function has code commented out that will check each word that is to be added, and make sure it doesn't exist already.
The reason it's commented out is that it doesn't work - it was still going after 7 hours importing the full dictionary, whereas UltraEdit manages to remove duplicate items AND sort them in under two seconds, which says much about this code :).
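The commented-out check is presumably slow because every imported word is compared against every word already loaded. A minimal sketch of the sort-then-remove-duplicates approach follows, assuming the words are held in a std::vector<std::string> rather than the CStringArray the programs actually use. The names ToLower and SortAndRemoveDuplicates are hypothetical and are not part of the downloads.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns a lowercase copy of the word (ASCII only).
std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Sorts the words case-insensitively, then drops case-insensitive duplicates.
// After sorting, duplicates sit next to each other, so one pass removes them.
void SortAndRemoveDuplicates(std::vector<std::string>& words)
{
    std::sort(words.begin(), words.end(),
              [](const std::string& a, const std::string& b) { return ToLower(a) < ToLower(b); });
    words.erase(std::unique(words.begin(), words.end(),
                            [](const std::string& a, const std::string& b) { return ToLower(a) == ToLower(b); }),
                words.end());
}
```

Sorting is O(n log n), so the work grows far more slowly than a word-by-word comparison against the whole list.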
Clicking on the Word header control will sort the words.
The Tools menu contains three commands:
Trim
The Trim command goes through all the words and does a TrimLeft() and TrimRight() on all the words to remove leading and trailing spaces.
void CDictionaryEditorView::OnToolsTrim()
{
    CWaitCursor wait;
    CListCtrl& m_List = GetListCtrl();
    int x = m_List.GetItemCount();
    // Rebuild the document's word array from the trimmed list items
    GetDocument()->array.RemoveAll();
    for (int j = 0; j < x; j++)
    {
        CString strWord;
        strWord = m_List.GetItemText(j, 0);
        strWord.TrimLeft();
        strWord.TrimRight();
        m_List.SetItemText(j, 0, strWord);
        GetDocument()->AddWord(strWord);
    }
    GetDocument()->SetModifiedFlag();
}
Lowercase
The Lowercase command goes through all the words and does a MakeLower() on them to make them all lowercase.
void CDictionaryEditorView::OnToolsLowercase()
{
    CWaitCursor wait;
    CListCtrl& m_List = GetListCtrl();
    int x = m_List.GetItemCount();
    // Rebuild the document's word array from the lowercased list items
    GetDocument()->array.RemoveAll();
    for (int j = 0; j < x; j++)
    {
        CString strWord;
        strWord = m_List.GetItemText(j, 0);
        strWord.MakeLower();
        m_List.SetItemText(j, 0, strWord);
        GetDocument()->AddWord(strWord);
    }
    GetDocument()->SetModifiedFlag();
}
Find and Find Next
The Find command goes through the list of words and finds words which have certain characters in them and then highlights them and selects them one at a time.
I used this to find words which weren't imported correctly (due to binary format) and had invalid characters on the sides that the Trim command didn't find, and then edit them manually. The Find Next command carries on from the last found word.
void CDictionaryEditorView::OnToolsFind()
{
    CListCtrl& m_List = GetListCtrl();
    m_nFind = 0;
    int x = m_List.GetItemCount();
    for (int j = m_nFind; j < x; j++)
    {
        CString strWord;
        strWord = m_List.GetItemText(j, 0);
        if (strWord.FindOneOf("����������������������������"
            "����������������������������������������") != -1)
        {
            m_List.SetItemState(j, LVIS_SELECTED |
                LVIS_FOCUSED, LVIS_SELECTED | LVIS_FOCUSED);
            m_List.EnsureVisible(j, FALSE);
            // Remember where to resume for Find Next
            m_nFind = j + 1;
            return;
        }
    }
}
Test spell checker
This is where the problems start. It's messy, and slow.
Anyway, the main test program consists of a main dialog which has a text box and two command buttons on it. The CSpellDlg class handles this dialog, as well as the main spell checking functions.
The Options button brings up the Options dialog that lets you configure where the dictionary is and where the custom dictionary should be stored.
You'll need to do this before you test the program.
There is also the spell checker dialog that you'll probably be familiar with in one form or the other (this one is like Visio's), which is displayed when a word that isn't recognized is found.
This displays the word and any spelling suggestions, and returns which button was clicked to the main dialog.
So, this is how it works (or doesn't as the case may be :) ): Assuming there is a sentence or two in the main text box, when you click the Spell button, the OnSpell() function is executed.
This does the following things:
First, it gets the text from the text box using the UpdateData() function, and then it checks to see if the dictionaries exist, and if they do, it loads them. It then creates a CStringArray to hold the words it should ignore for one session (when the user clicks Ignore All), and initializes the counters.
UpdateData(TRUE);
// Load the main dictionary if the configured file exists
CFileStatus status;
if (CFile::GetStatus(AfxGetApp()->GetProfileString("Settings",
    "Main", ""), status))
{
    CFile cfSettingsFile(AfxGetApp()->GetProfileString("Settings",
        "Main", ""), CFile::modeNoTruncate | CFile::modeReadWrite);
    CArchive ar(&cfSettingsFile, CArchive::load);
    array.Serialize(ar);
}
// Load the custom dictionary if the configured file exists
CFileStatus status2;
if (CFile::GetStatus(AfxGetApp()->GetProfileString("Settings",
    "Custom", ""), status2))
{
    CFile cfSettingsFile(AfxGetApp()->GetProfileString("Settings",
        "Custom", ""), CFile::modeNoTruncate | CFile::modeReadWrite);
    CArchive ar(&cfSettingsFile, CArchive::load);
    custarray.Serialize(ar);
}
CString strStart = m_strText;
CStringArray IgnoreAll;   // words ignored for this session only
int nPos = 0;             // start of the current word
int nPos2 = 0;            // end of the current word
The next step is to do a while loop, looking for specific characters (using a custom FindOneOf() function that allows you to specify where to start looking from). These characters are the characters that usually separate words ("!'()[]<>,.). This loop basically finds each new word.
Within this loop, the word is extracted, leading and trailing blank spaces are removed, and if it is empty, has a length of one character or contains numbers and certain other characters, it is ignored.
CString strWord = strStart.Mid(nPos, nPos2 - nPos);
strWord.TrimLeft();
strWord.TrimRight();
if (strWord == "" || strWord.GetLength() == 1 ||
    strWord.FindOneOf("0123456789+-/@?:*.,") != -1)
{
    // Nothing to check: skip this token and move on to the next one
}
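The word-finding loop can also be written with std::string::find_first_of, which already accepts a start position, so no custom FindOneOf() is needed. This is a minimal sketch under assumptions: the Token struct and SplitWords function are hypothetical names, and the separator set (the article's characters plus whitespace) is a guess at what the real loop uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One word plus its offset in the text, so it can be highlighted or replaced later.
struct Token
{
    std::size_t start;
    std::string word;
};

// Splits text at any character in `separators` (an assumed set; adjust as needed).
// Tokens that are empty, one character long, or contain digits or symbols are
// skipped, mirroring the filter described above.
std::vector<Token> SplitWords(const std::string& text,
                              const std::string& separators = " \t\r\n\"!'()[]<>,.")
{
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < text.size())
    {
        std::size_t end = text.find_first_of(separators, pos);
        if (end == std::string::npos)
            end = text.size();              // last word runs to the end of the text
        std::string word = text.substr(pos, end - pos);
        if (word.size() > 1 && word.find_first_of("0123456789+-/@?:*") == std::string::npos)
            tokens.push_back({pos, word});
        pos = end + 1;                      // continue after the separator
    }
    return tokens;
}
```

Note that treating the apostrophe as a separator splits words such as "don't" into two tokens, so it may be worth removing it from the set.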
Otherwise, the program goes through the dictionary to see if the word exists. It also goes through the custom dictionary to see if it exists there, and has a look through the IgnoreAll list to see if it should ignore the word. If in any of these cases it finds the word, it sets the bFound flag to TRUE and stops searching that list.
If after checking through the dictionaries and the IgnoreAll list it has found the word, then it goes back to the main while statement, ignoring the word, and tries the next word.
BOOL bFound = FALSE;
// Main dictionary
for (int i = 0; i < array.GetSize(); i++)
{
    CString strCheckWord;
    strCheckWord = GetWord(i);
    if (strWord.CompareNoCase(strCheckWord) == 0)
    {
        bFound = TRUE;
        break;
    }
}
// Custom dictionary
for (int j = 0; j < custarray.GetSize(); j++)
{
    CString strCheckWord;
    strCheckWord = GetWordCustom(j);
    if (strWord.CompareNoCase(strCheckWord) == 0)
    {
        bFound = TRUE;
        break;
    }
}
// Words the user chose to Ignore All in this session
for (int f = 0; f < IgnoreAll.GetSize(); f++)
{
    if (strWord.CompareNoCase((LPCTSTR)IgnoreAll[f]) == 0)
    {
        bFound = TRUE;
        break;
    }
}
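Each of the three loops above scans its whole list for every word in the text. One alternative, sketched here with the standard library rather than MFC, is to keep lowercased words in a hash set so that a lookup does not depend on the size of the dictionary. WordSet, ToLower and IsKnownWord are hypothetical names, not part of the test program.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical helper: returns a lowercase copy of the word (ASCII only).
std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical wrapper: stores lowercased words in a hash set, which gives
// case-insensitive lookups without scanning every entry.
class WordSet
{
public:
    void Add(const std::string& word) { m_words.insert(ToLower(word)); }
    bool Contains(const std::string& word) const { return m_words.count(ToLower(word)) != 0; }

private:
    std::unordered_set<std::string> m_words;
};

// True if the word is in the main dictionary, the custom dictionary
// or the Ignore All list.
bool IsKnownWord(const std::string& word, const WordSet& mainDict,
                 const WordSet& customDict, const WordSet& ignoreAll)
{
    return mainDict.Contains(word) || customDict.Contains(word) || ignoreAll.Contains(word);
}
```

The || operators stop at the first list that contains the word, so the later lists are only consulted when needed.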
Otherwise, if it hasn't found the word anywhere, we get to the fun bit, where the word is highlighted in the text box, possible suggestions of what the word should be are found, and the Spell Check dialog box is displayed, so the user can choose what action to take.
This is the section that needs re-writing, as it's messy, and doesn't work that well.
What I attempted to do was to go through each word in the dictionary and see if it matched the misspelt word on all of its characters from the left except the last one, then all except the last two, three, and so on. The closer a word in the dictionary was to the misspelt word, the nearer the top of the suggestions list it was added.
However, the code I used wastes a lot of time running those comparisons against words whose starting character is different, so roughly 25/26 of the time the loop is useless. This needs to be changed.
This method of finding the correct word will also not find "upon" if "apon" was typed, whereas most other spell checkers find this with no problem at all. It would be more accurate if some form of vowel swapping was introduced, and if the algorithm was also run backwards, starting from the right.
m_Text.SetSel(nPos, nPos2);
CSpellDialog dlg;
dlg.m_strWord = strWord;
int nGood = 0;   // number of "good" suggestions kept at the top of the list
for (int i = 0; i < array.GetSize(); i++)
{
    // Fetch the dictionary word once per entry
    CString strGetWord = GetWord(i);
    for (int j = strWord.GetLength(); j >= 1; --j)
    {
        if (strGetWord.Left(j).CompareNoCase(strWord.Left(j)) == 0)
        {
            // A suggestion is "good" if it shares a long enough prefix
            BOOL bGood = FALSE;
            if (strWord.GetLength() <= 3)
            {
                if (strGetWord.Left(2).CompareNoCase(strWord.Left(2)) == 0)
                {
                    bGood = TRUE;
                }
            }
            else
            {
                if (strGetWord.Left(strWord.GetLength() - 2).CompareNoCase(
                    strWord.Left(strWord.GetLength() - 2)) == 0)
                {
                    bGood = TRUE;
                }
            }
            // Skip words that are already in the suggestions list
            BOOL bFoundWord = FALSE;
            for (int k = 0; k < dlg.m_saSuggestions.GetSize(); k++)
            {
                if (strGetWord.CompareNoCase(
                    (LPCTSTR)dlg.m_saSuggestions[k]) == 0)
                {
                    bFoundWord = TRUE;
                    break;
                }
            }
            if (bFoundWord == FALSE)
            {
                // Only suggest words within one character of the original length
                if ((strGetWord.GetLength() >= (strWord.GetLength() - 1)) &&
                    (strGetWord.GetLength() <= (strWord.GetLength() + 1)))
                {
                    if (bGood == TRUE)
                    {
                        dlg.m_saSuggestions.InsertAt(nGood, strGetWord);
                        nGood++;
                    }
                    else
                    {
                        dlg.m_saSuggestions.Add(strGetWord);
                    }
                }
                break;
            }
        }
    }
}
This process is then repeated for the custom dictionary.
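A different technique from the prefix matching used above is to rank candidates by Levenshtein edit distance, which counts single-character insertions, deletions and substitutions. This is not what the test program does. It is only a sketch of one common alternative, written with the standard library, and the names EditDistance and Suggest and the default limit of 2 edits are assumptions. With this measure "apon" is one substitution away from "upon", so it would be suggested.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: the number of single-character insertions,
// deletions and substitutions needed to turn `a` into `b`.
std::size_t EditDistance(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = j;                         // empty prefix of a -> j insertions
    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        cur[0] = i;                          // i deletions to reach an empty b
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            std::size_t substitute = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, substitute});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Returns dictionary words within `maxDistance` edits of `word`, closest first.
// Both the word and the dictionary are assumed to be lowercase already.
std::vector<std::string> Suggest(const std::string& word,
                                 const std::vector<std::string>& dictionary,
                                 std::size_t maxDistance = 2)
{
    std::vector<std::pair<std::size_t, std::string>> scored;
    for (const std::string& candidate : dictionary)
    {
        std::size_t d = EditDistance(word, candidate);
        if (d <= maxDistance)
            scored.push_back({d, candidate});
    }
    std::sort(scored.begin(), scored.end()); // by distance, then alphabetically
    std::vector<std::string> result;
    for (const auto& entry : scored)
        result.push_back(entry.second);
    return result;
}
```

This still visits every dictionary word, so it addresses the quality of the suggestions rather than the speed.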
The final stage is to display the Spell Check dialog box, which allows the user to say if they want to Ignore the word, Add it to the custom dictionary, or Change it for one of the suggestions in the list box. The program then does what the user wants, depending on what button they clicked.
if (dlg.DoModal() == IDOK)
{
    if (dlg.m_nOption == 1)
    {
        // Add: store the word in the custom dictionary and save it
        CString strAddWord = strWord;
        strAddWord.MakeLower();
        AddWordCustom(strAddWord);
        CFile cfSettingsFile(AfxGetApp()->GetProfileString("Settings",
            "Custom", ""), CFile::modeCreate |
            CFile::modeNoTruncate | CFile::modeReadWrite);
        CArchive ar2(&cfSettingsFile, CArchive::store);
        custarray.Serialize(ar2);
    }
    else if (dlg.m_nOption == 2)
    {
        // Change: replace the word with the chosen suggestion
        strStart.Delete(nPos, nPos2 - nPos);
        m_strText.Delete(nPos, nPos2 - nPos);
        strStart.Insert(nPos, dlg.m_strChangeTo);
        m_strText.Insert(nPos, dlg.m_strChangeTo);
        nPos2 = nPos + dlg.m_strChangeTo.GetLength();
        UpdateData(FALSE);
    }
    else if (dlg.m_nOption == 3)
    {
        // Presumably Ignore: nothing to do, carry on with the next word
    }
    else if (dlg.m_nOption == 4)
    {
        // Ignore All: remember the word for the rest of this session
        CString strIgnoreAllWord = strWord;
        strIgnoreAllWord.MakeLower();
        IgnoreAll.Add(strIgnoreAllWord);
    }
}
else
{
    // The dialog was cancelled: stop checking
    return;
}
It should "work" now :).
Improvements
Lots, I know <g>.
The suggestion-finding algorithm needs a lot of work, and a lot of speeding up, as it wastes quite a bit of time comparing strings when the first characters aren't even the same. This is the main bottleneck, but everything really needs optimizing.
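One way to avoid comparing against words with a different first character is to keep the dictionary sorted and binary-search for the block of words that share the first letter. A minimal sketch, assuming a lowercase, sorted std::vector<std::string> instead of the CStringArray. WordsStartingWith is a hypothetical name.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using WordIter = std::vector<std::string>::const_iterator;

// Returns the [first, last) range of words that start with `letter`.
// Assumes `sortedWords` is lowercase and sorted with the default
// std::string ordering, so words sharing a first letter are contiguous.
std::pair<WordIter, WordIter> WordsStartingWith(const std::vector<std::string>& sortedWords,
                                                char letter)
{
    // First word that is not less than the one-letter string, e.g. "c"
    WordIter first = std::lower_bound(sortedWords.begin(), sortedWords.end(),
                                      std::string(1, letter));
    // First word after that which no longer starts with the letter
    WordIter last = std::partition_point(first, sortedWords.end(),
        [letter](const std::string& w) { return !w.empty() && w[0] == letter; });
    return {first, last};
}
```

The suggestion loop would then iterate only from `first` to `last`, which for an evenly spread dictionary is about 1/26 of the words.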
Presently, it only looks for correct words starting with the same characters as the misspelt word, but the correct word doesn't always start that way. An algorithm that swaps around vowels would probably work best. Looking at the way Microsoft Word suggests words, though, it knows what word you mean even if it isn't that similar to the word you typed in some cases, so perhaps something more complicated is going on, like the dictionary being subdivided into sections so that words that are often misspelt as each other are kept together, making it easier for the program to pick suggestions.
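The vowel-swapping idea could look something like the sketch below: each vowel in the misspelt word is replaced in turn by every other vowel, and any result that exists in the dictionary becomes a suggestion. It assumes lowercase input and a std::unordered_set<std::string> dictionary, and VowelSwapSuggestions is a hypothetical name. Only one vowel is changed at a time.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Tries replacing each vowel in `word` with every other vowel, one position
// at a time, and keeps the variants that are found in `dictionary`.
std::vector<std::string> VowelSwapSuggestions(const std::string& word,
                                              const std::unordered_set<std::string>& dictionary)
{
    const std::string vowels = "aeiou";
    std::vector<std::string> suggestions;
    for (std::size_t i = 0; i < word.size(); ++i)
    {
        if (vowels.find(word[i]) == std::string::npos)
            continue;                        // not a vowel, leave this position alone
        for (char v : vowels)
        {
            if (v == word[i])
                continue;                    // same vowel, nothing to swap
            std::string candidate = word;
            candidate[i] = v;
            if (dictionary.count(candidate) != 0)
                suggestions.push_back(candidate);
        }
    }
    return suggestions;
}
```

For "apon" this generates "epon", "ipon", "opon" and "upon" for the first vowel, and only "upon" would be found in an English dictionary.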
Another thing that probably should be added is a Change All button, where all words misspelt one way are corrected at once.
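A Change All command could be built on a whole-word replace like the sketch below. It uses std::string rather than CString, matches case-sensitively, and treats any non-letter character (or the start or end of the text) as a word boundary. All of these are assumptions, and ChangeAll is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Replaces every whole-word occurrence of `wrong` in `text` with `right`
// and returns the number of replacements made.
std::size_t ChangeAll(std::string& text, const std::string& wrong, const std::string& right)
{
    if (wrong.empty())
        return 0;
    auto isLetter = [](char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; };
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = text.find(wrong, pos)) != std::string::npos)
    {
        std::size_t end = pos + wrong.size();
        bool startsWord = (pos == 0) || !isLetter(text[pos - 1]);
        bool endsWord = (end == text.size()) || !isLetter(text[end]);
        if (startsWord && endsWord)
        {
            text.replace(pos, wrong.size(), right);
            pos += right.size();             // continue after the inserted text
            ++count;
        }
        else
        {
            ++pos;                           // part of a longer word, keep looking
        }
    }
    return count;
}
```

Because only whole words are replaced, changing "teh" to "the" leaves a word such as "tehran" untouched.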
Half the problem is probably the fact that I'm using all MFC functions, but I tried using standard C functions for everything except the array, and it didn't seem to help, so I took them out again for the purpose of this article (I've left one function in there so you can see the sort of thing I tried).
Any suggestions or improvements, please either send them to me or post them here.