Introduction
This article describes an ActiveX control that can be embedded in an HTML web
page to provide a voice-activated menu tree.
To compile the code you will need VC6, Microsoft's Speech SDK 5.1 and the
Internet Explorer headers. (If you have Windows XP, you may already have the
required files installed.)
The Demo Program
The demo for this package is a simple web page with two <iframe>
elements: the first <iframe> embeds the ActiveX control while the second
displays the page contents.
After compiling and registering WebVoiceCtl.dll, look for a folder called demo
and double-click the file inside called WebVoice.html. You should see the tree
control in the left frame, shown above. Press the Voice button and be patient
while the large speech engines are loaded.
Once loaded, you can speak "go to class one" to start the navigation.
The control should respond with "Please confirm class one" to which you
may reply "positive". The requested item should then be displayed in the
right frame.
Speak "help" at any time to get a list of the active commands. If
you've just navigated to a page, the help response will be "[scroll] up,
down, top bottom; go back or navigate". Speak your scroll commands then say:
"navigate" to return to navigation mode.
Hint: turn the volume on your speakers down to avoid feedback into the
microphone.
Background
The code attached to this article demonstrates the following technologies:
- ATL, ActiveX (and wide character string manipulation)
- Tree view searching, expanding and collapsing
- Owner drawn buttons, edit controls and static controls
- Image lists, overlays (and painting inside an ATL composite control)
- Using Microsoft's MSXML parser to load and manipulate an XML file
- Using C++ to interface with the web browser and the html page
- SAPI 5.1, speech recognition and text-to-speech engines, and visemes
Of course you don't have to understand all of the items above to use this
control in your projects but you may find some of the solutions (a couple of
which credit other Code Project articles) interesting.
Creating Your Own Menu Tree
Your menu items are read from the file "data/WebVoice.xml" (the name is
currently hardcoded), which contains information for both the menu tree and the
SAPI grammar. Its contents are stored in an array of KEY structures for later
retrieval. A short XML sample file and the KEY structure are shown below:
<menu>
    <item>
        <mid>1</mid> <!-- menu item id -->
        <pid>0</pid> <!-- parent id -->
        <txt>Class One</txt> <!-- menu text and grammar phrase -->
        <ref>../html/class1.html</ref> <!-- hyperlink reference -->
    </item>
    <item>
        <mid>2</mid>
        <pid>1</pid>
        <txt>Source One</txt>
        <ref>../html/src1.html</ref>
    </item>
    <!-- more items here -->
</menu>
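typedef struct tag_key
{
    int mid;            // menu item id (from <mid>)
    int pid;            // parent id (from <pid>)
    int chd;
    HTREEITEM hItem;    // tree-view handle of this item
    HTREEITEM hParent;  // tree-view handle of the parent item
    char txt[32];       // menu text and grammar phrase (from <txt>)
    char ref[128];      // hyperlink reference (from <ref>)
}KEY;
KEY aKeys[NUMBER_OF_KEYS];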
You must be careful to ensure that the menu item IDs are numbered
sequentially and that the parent ID refers to an item that is above the current
item in the tree. No error checking is currently performed while loading so an
invalid XML file will cause the control to crash.
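The two constraints above are easy to check before the tree is built. The following is a minimal sketch of such a check, not code from the control: `MenuKey` is a cut-down, hypothetical stand-in for the KEY structure, and `ValidateKeys` is an illustrative name. It assumes that a parent ID of 0 marks a top-level item, as in the sample file.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the article's KEY structure.
struct MenuKey {
    int mid;   // menu item id, expected to run 1, 2, 3, ...
    int pid;   // parent id; 0 is assumed to mean "top-level item"
};

// Returns an empty string when the items satisfy both constraints,
// otherwise a short description of the first problem found.
std::string ValidateKeys(const std::vector<MenuKey>& keys)
{
    for (std::size_t i = 0; i < keys.size(); ++i) {
        const int expected = static_cast<int>(i) + 1;
        if (keys[i].mid != expected)
            return "item " + std::to_string(expected) +
                   ": ids must be numbered sequentially";
        // A valid parent is either 0 or the id of an earlier item.
        if (keys[i].pid < 0 || keys[i].pid >= keys[i].mid)
            return "item " + std::to_string(expected) +
                   ": parent must appear earlier in the file";
    }
    return std::string();
}
```

Running a check like this straight after the XML is parsed would let the control report a bad file instead of crashing on it.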
SAPI Initialization
The WebVoice control handles SAPI initialization in the function InitSapi() as
follows:
- Creates the speech engine.
- Creates a recognition context.
- Sets a notification mechanism (Windows message) for callbacks from the
recognition engine.
- Sets recognition event interests.
- Loads specific grammar files.
- Creates the text-to-speech (TTS) engine.
- Sets TTS event interests.
- Sets a notification mechanism (Windows message) for callbacks from the TTS
engine.
- Sets the active rule.
The Speech SDK documentation and examples clearly show the required SAPI
initialization calls so I won't cover that here. However, the static grammar
file and the dynamic grammar require some explanation.
SAPI Grammar
SAPI grammars may be loaded statically from an XML file or dynamically at
runtime. The WebVoice control uses both methods. The static part is loaded from
Grammar.xml, which has the following format:
<GRAMMAR LANGID="409">
    <DEFINE>
        <ID NAME="RID_Tree" VAL="1001"/>
        <ID NAME="RID_MenuItem" VAL="1004"/>
    </DEFINE>
    <RULE ID="RID_Tree" TOPLEVEL="ACTIVE">
        <L>
            <P>open</P>
            <P>go to</P>
        </L>
        <RULEREF REFID="RID_MenuItem" />
    </RULE>
    <RULE ID="RID_MenuItem" DYNAMIC="TRUE">
        <L PROPID="RID_MenuItem">
            <P VAL="1">Dummy Item</P>
        </L>
    </RULE>
    <!-- more rules -->
</GRAMMAR>
As you can see, this file snippet creates two rules: the first rule, RID_Tree,
defines the starting navigation phrases then references the second rule, called
RID_MenuItem. The second rule holds a dummy phrase that will be replaced at
runtime with the names of your menu items. This file is compiled into
Grammar.cfg by SAPI's gc.exe then loaded into a resource inside the DLL. The
dynamic rules are added as follows:
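HRESULT CWebVoice::LoadGrammar()
{
    USES_CONVERSION;
    HRESULT hr;
    SPSTATEHANDLE hRule;
    SPPROPERTYINFO pi;

    // The original listing does not show how hRule is obtained. Assumption:
    // it is the initial state of the dynamic rule RID_MenuItem, fetched and
    // cleared here so that the dummy phrase is removed.
    hr = m_cpGrammar->GetRule(NULL, RID_MenuItem, SPRAF_Dynamic, FALSE, &hRule);
    if (FAILED(hr)) return hr;
    hr = m_cpGrammar->ClearRule(hRule);
    if (FAILED(hr)) return hr;

    ZeroMemory(&pi, sizeof(SPPROPERTYINFO));
    pi.ulId = RID_MenuItem;
    pi.vValue.vt = VT_UI4;

    // One transition per menu item; the property value is the 1-based index.
    for (int i = 0; i < m_nNumKeys; i++) {
        pi.vValue.ulVal = i + 1;
        hr = m_cpGrammar->AddWordTransition(hRule, NULL,
            T2W(aKeys[i].txt), L" ", SPWT_LEXICAL, 1, &pi);
        if (FAILED(hr)) return hr;
    }

    // Catch-all transition with property value 0.
    pi.vValue.ulVal = 0;
    hr = m_cpGrammar->AddWordTransition(hRule,
        NULL, L"*", L" ", SPWT_LEXICAL, 1, &pi);
    if (FAILED(hr)) return hr;

    hr = m_cpGrammar->Commit(NULL);
    if (FAILED(hr)) return hr;
    hr = m_cpGrammar->SetGrammarState(SPGS_ENABLED);
    return hr;
}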
Note that each new phrase (taken from aKeys[i].txt) is assigned a property ID
of RID_MenuItem and a unique property value (between 1 and m_nNumKeys) then
added to the grammar with the AddWordTransition() function. Note also that a
wild card rule ("*") is added at the end to catch spoken phrases not covered in
the grammar.
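Because the property values are 1-based and the wild card uses 0, any code that turns a recognized value back into an array index has to treat 0 (and anything above the item count) as "no menu item". A small sketch of that mapping follows; `PropertyValueToIndex` is a hypothetical helper written for illustration and does not appear in the control's source.

```cpp
// Converts a recognized property value into an index into the key array.
// Values 1..numKeys identify menu items; 0 marks the wild card transition.
// Returns -1 when the value does not identify a menu item.
int PropertyValueToIndex(unsigned long value, int numKeys)
{
    if (numKeys <= 0)
        return -1;
    if (value == 0 || value > static_cast<unsigned long>(numKeys))
        return -1;
    return static_cast<int>(value) - 1;
}
```

A caller would check for -1 before indexing the array, so a wild card match never reaches the data lookup.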
Recognition
The recognition engine compares your spoken words to the active grammar rule.
When either a recognition or a false recognition is made by the engine, your
callback routine is called to handle the request. The following shows a section
of the recognition handler:
void CWebVoice::ExecuteCommand(ISpRecoResult *pPhrase, HWND hWnd)
{
    USES_CONVERSION;
    SPPHRASE *pElements;
    static int ind;   // index of the item awaiting confirmation
    int pos;
    // wcs is not declared in this excerpt; it is assumed to be a
    // wide-character buffer defined elsewhere in the class.
    if (SUCCEEDED(pPhrase->GetPhrase(&pElements))) {
        m_cpRecoCtxt->Pause(NULL);
        switch (pElements->Rule.ulId) {
        case RID_Tree:
            // Property value is the 1-based menu item number.
            pos = pElements->pProperties->vValue.ulVal;
            ind = pos - 1;
            SetActiveRule(RID_Confirm);
            wcscpy(wcs, L"Please confirm: \r\n");
            wcscat(wcs, T2W(aKeys[ind].txt));
            HandleReply(0, wcs);
            break;
        case RID_Confirm:
            pos = pElements->pProperties->vValue.ulVal;
            switch (pos) {
            case 1:
                HandleConfirm(ind);
                SetActiveRule(RID_View);
                break;
            case 2:
            default:
                SetActiveRule(RID_Tree);
                HandleReply(MID_Tree, NULL);
                break;
            }
            break;   // without this, control falls through to the default case
        default:
            SetActiveRule(RID_Tree);
            HandleReply(RID_Tree, NULL);
            break;
        }
        ::CoTaskMemFree(pElements);
        m_cpRecoCtxt->Resume(NULL);
    }
}
When a navigation rule is matched, its property value is stored in the static
variable ind and, after confirmation, is passed to the HandleConfirm(ind)
function, which uses it to index the data array (aKeys[ind]) and retrieve the
correct data item. If successful, the tree view will be opened to show the
selection and the hyperlink will be navigated.
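The navigate-then-confirm flow is a small state machine: the active rule decides how the next property value is interpreted. The sketch below models only that flow in portable C++ so it can be followed without SAPI. The names `Rule`, `VoiceState` and `OnRecognized` are hypothetical, scrolling in view mode is deliberately not modelled, and it assumes (as in the handler above) that a confirm value of 1 means "positive".

```cpp
// Hypothetical model of the rule switching; not the control's real types.
enum Rule { RULE_TREE, RULE_CONFIRM, RULE_VIEW };

struct VoiceState {
    Rule active;   // rule currently being listened for
    int pending;   // index awaiting confirmation, -1 if none
    int shown;     // index of the item last navigated to, -1 if none
};

// value is the property value attached to the recognized phrase.
void OnRecognized(VoiceState& s, unsigned long value, int numKeys)
{
    switch (s.active) {
    case RULE_TREE:
        // Only values 1..numKeys name a menu item; 0 is the wild card.
        if (numKeys > 0 && value >= 1 &&
            value <= static_cast<unsigned long>(numKeys)) {
            s.pending = static_cast<int>(value) - 1;
            s.active = RULE_CONFIRM;
        }
        break;
    case RULE_CONFIRM:
        if (value == 1 && s.pending >= 0) {   // assumed: 1 means "positive"
            s.shown = s.pending;
            s.active = RULE_VIEW;
        } else {
            s.active = RULE_TREE;             // anything else cancels
        }
        s.pending = -1;
        break;
    case RULE_VIEW:
        // Scroll commands and "navigate" are not modelled in this sketch.
        break;
    }
}
```

Keeping the pending index in an explicit state object, rather than a function-level static, also makes the flow easy to test in isolation.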
Points of Interest
Every time I write an ActiveX control or a Web Browser plugin in ATL I have
to re-learn how to use wide character strings, and SAPI uses wide character
strings exclusively. If your code does not have to run on Win98 then you can
just define UNICODE and, as long as your strings are defined as TCHAR*, the
usual API calls will work fine. But if discounting Win98 users is not an option
then you are forced to convert from multibyte to wide strings whenever you use
the Win32 API. Fortunately, ATL has a wonderful set of conversion macros
defined in <atlbase.h>. You just place the macro USES_CONVERSION at the
beginning of each function that needs to convert strings then use the W2T() or
T2W() macros to perform the conversion. I suspect the overhead of these macros
is significant: each conversion call has to allocate a temporary buffer (ATL
3.0 takes it from the stack with _alloca, so it is released only when the
function returns) and copy the string into it. However, these macros are so
convenient and tidy that I've even started including <atlbase.h> in my MFC
programs.
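For readers without ATL to hand, the following portable sketch shows roughly what a narrow-to-wide conversion involves: size a buffer, convert, and trim to the number of characters produced. It is an illustration only, not how the ATL macros are implemented. `NarrowToWide` is a hypothetical name, and `std::mbstowcs` converts according to the current C locale, whereas the ATL macros use the system ANSI code page.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Converts a narrow (multibyte) string to a wide string using the current
// C locale. Returns an empty string for null input or on conversion failure.
std::wstring NarrowToWide(const char* src)
{
    if (src == nullptr)
        return std::wstring();
    const std::size_t len = std::strlen(src);
    if (len == 0)
        return std::wstring();

    // A wide string never needs more characters than the input has bytes.
    std::wstring out(len, L'\0');
    const std::size_t n = std::mbstowcs(&out[0], src, len);
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();   // invalid multibyte sequence

    out.resize(n);
    return out;
}
```

Unlike the stack-based macros, this version returns an owned `std::wstring`, so the result can safely outlive the calling function.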
Another problem I encountered was the need to use owner-drawn buttons, because
the standard dialog-box grey does not cut it on a web page. In MFC I would
handle the WM_CTLCOLOR message and change the background colour there. In ATL I
found that I had to make the buttons owner-drawn then handle the WM_DRAWITEM
message. All well and good, but then I discovered that I needed both a toggle
button and a momentary button and that I now needed to code the required
responses myself. It was all great fun but it took some time before I was able
to get to the SAPI part of the code.
The Microsoft Speech SDK 5.1 is a 68 MB download and if you need to package
the SAPI runtime modules with your code, you must download the full
redistribution package which is 131.58 MB.
Unfortunately Microsoft does not package the runtime modules by themselves.
Either your clients must download the SDK (including the extra 30 MB of
developer code and documentation) or you must prepare a runtime module package
yourself as a separate download from your application.
Revisions
- 29 January 2004: Subclassed picture control to avoid Win98 problems and fixed
a small script error in the demo