Introduction
In this article I'll show you how to add Text-to-Speech (TTS) capabilities to your program.
You'll be able to do it with, essentially, 1 line of code, using the familiar standard ostream syntax.
Additionally, I'll show how using open source C++ tools can make your code shorter (my whole code is less than 50 lines), more reliable, more robust and more general than the original APIs.
What I'll show:
- How to add simple TTS to your program.
- A simple use of COMSTL and various other STLSoft components.
- A simple example of how to use boost::iostreams
Background
I recently had to add audio outputs to a program (running on Windows).
Microsoft's SAPI SDK provides a COM interface through which wide character strings can be spoken via SAPI's TTS engine. The Code Project has many articles explaining how to use SAPI to varying degrees of complexity. So why another?
Well, there were some additional features that I wanted that did not exist in those articles.
- As little or no COM hassle. Ideally, it should work within the simplest Console application.
- Full (transparent) support for types other than wide-char, e.g. char*, std::strings and even ints, floats, etc.
- Intuitive (or at least familiar) syntax.
To achieve these goals I developed audio_ostream.
audio_ostream is a full-fledged std::ostream which supports any type that has an operator<<().
You can have as many audio_ostreams as you like, all working in parallel.
To handle COM issues, I used the wonderful COMSTL library which takes care of all the delicate and brittle COMplications, such as (un-)initialization, resource (de-)allocation, reference counting etc.
boost::iostreams is used to provide the full std::ostream support with very little boilerplate code.
Since both boost::iostreams and COMSTL are header-only libraries, I decided to make my class header-only too. The minor price of this decision is that the SAPI headers will be included into any file that uses audio_ostream.
Using the code
Using the code could not be easier:
#include "audiostream.hpp"
using namespace std;
using namespace audiostream;
int main()
{
    audio_ostream aout;
    aout << "Hello World!" << endl;
    // some more code...
    return 0;
}
This little program will, obviously, say "Hello World!".
The audio stream is asynchronous, so the program continues running while the text is being spoken (that's why the line // some more code... is there, to allow it to finish speaking). This is conceptually similar to how std::ostreams buffer their output and display the text only once the internal buffer is full.
To use the class:
- #include the audiostream.hpp header file.
- Create an instance of audio_ostream (or waudio_ostream).
- Use the stream as you would any std::ostream.
That's really all you need to do to start using the class.
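Because audio_ostream is a std::ostream, whatever operator<<() would print is what ends up being spoken. The following is a minimal sketch of that idea. It uses std::ostringstream as a stand-in for the audio stream so that it runs without SAPI, and spoken_text() is a hypothetical helper that is not part of audiostream.hpp.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns the characters that operator<<() would
// hand to the stream (and hence to the speech engine) for a value.
template <class T>
std::string spoken_text(const T& value)
{
    std::ostringstream out;   // stand-in for audio_ostream
    out << value;
    return out.str();
}
```

For instance, spoken_text(42) yields the string "42", which is the text a TTS engine would receive for an int.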
Pre-Requisites
For the code to compile and run you will need 3 libraries:
- For the TTS engine, you will need to install the Microsoft Speech SDK (I used ver. 5.1).
- For COMSTL you will need the STLSoft libraries (you'll need STLSoft version 1.9.1 beta 44, or later).
- The Boost Iostreams library. You can download Boost here.
Set your compiler and linker paths accordingly (Boost and STLSoft are header only).
Advanced Usage
It's possible to change the voice gender, speed, language and many more parameters of the voice using the SAPI text-to-speech (TTS) XML tags.
Just insert the relevant XML tags into the stream to effect the change. The complete list of possible XML tags can be found here.
For example:
audio_ostream aout;
// Select a male voice.
aout << "<voice required='Gender=Male'>Hello World!" << endl;
aout << "Five hundred milliseconds of silence" << flush <<
"<silence msec='500'/> just occurred." << endl;
For some reason, the XML tags must be the first items in the SAPI spoken string, without any preceding text. Flushing the stream before the tag, as in the example, takes care of this.
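To see why the flush helps, here is a small sketch that records the chunks a stream hands over each time it is flushed. It assumes that the audio stream passes its buffered text to the sink on every flush. chunk_recorder is a hypothetical stand-in built on std::stringbuf, not part of the article's code, and each recorded chunk plays the role of one spoken string.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the speech sink: every flush hands the buffered text
// over as one chunk, much like one spoken string.
class chunk_recorder : public std::stringbuf
{
public:
    const std::vector<std::string>& chunks() const { return chunks_; }

protected:
    // Called by std::flush and std::endl.
    int sync() override
    {
        const std::string pending = str();
        if (!pending.empty())
        {
            chunks_.push_back(pending);
            str(std::string());   // start a fresh chunk
        }
        return 0;
    }

private:
    std::vector<std::string> chunks_;
};
```

Writing the silence example through a std::ostream attached to a chunk_recorder produces two chunks, and the second one starts with the tag, which is the position the tag needs to be in.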
You can also call setRate() with values in the range [-10, 10] to control the speed of the speech.
The Magic
The Core Class
The heart of the code is the audio_sink
class:
template < class SinkType >
class audio_sink: public SinkType
{
public:
    audio_sink()
    {
        // COM is initialized the first time a sink of this type is constructed.
        static comstl::com_initializer coinit;
        HRESULT hr;
        if(FAILED(hr = comstl::co_create_instance(CLSID_SpVoice, _pVoice)))
            throw comstl::com_exception(
                "Failed to create SpVoice COM instance",hr);
    }
    // Narrow characters: convert to wide and forward to the wide overload.
    std::streamsize write(const char* s, std::streamsize n)
    {
        std::string str(s,n);
        return write(winstl::a2w(str), str.size());
    }
    // Wide characters: copy into a null-terminated string and speak asynchronously.
    std::streamsize write(const wchar_t* s, std::streamsize n)
    {
        std::wstring str(s,n);
        _pVoice->Speak(str.c_str(), SPF_ASYNC, 0);
        return n;
    }
    void setRate(long n) { _pVoice->SetRate(n); }
private:
    stlsoft::ref_ptr< ISpVoice > _pVoice;
};
There's a lot going on in this little class. Let's tease apart the pieces one-by-one.
COMSTL, stlsoft::ref_ptr<> and ISpVoice
The only member of the class is stlsoft::ref_ptr< ISpVoice > _pVoice.
This is the smart pointer that will handle all the COM stuff for us. The STLSoft class stlsoft::ref_ptr<> provides RAII-safe handling of reference-counted interfaces (RCIs). Specifically, it is ideal for handling COM objects.
We are using it with the ISpVoice interface. From Microsoft's site:
The ISpVoice interface enables an application to perform text synthesis operations. Applications can speak text strings and text files, or play audio files through this interface. All of these can be done synchronously or asynchronously.
In the constructor, we first initialize COM usage via the comstl::com_initializer. This only happens once (since it is a static object), and should not trouble us anymore. To initialize _pVoice we call comstl::co_create_instance() with the CLSID_SpVoice ID. If all goes well, we are now holding an ISpVoice object handle. All reference counting issues will be handled by stlsoft::ref_ptr<>. If the call fails, a comstl::com_exception is thrown and the class instance will not be created.
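To make the reference-counting point concrete, here is a simplified sketch of what an RAII holder for a reference-counted interface does. It is not STLSoft's actual implementation: fake_interface and counted_ptr are hypothetical names, and the fake object only counts references instead of deleting itself.

```cpp
#include <utility>

// Hypothetical stand-in for a COM interface: only the two
// reference-counting methods matter here.
struct fake_interface
{
    fake_interface() : refs(1) {}                // created holding one reference
    unsigned long AddRef()  { return ++refs; }
    unsigned long Release() { return --refs; }   // a real object would delete itself at 0
    unsigned long refs;
};

// Simplified RAII holder in the spirit of stlsoft::ref_ptr<>.
template <class I>
class counted_ptr
{
public:
    // Adopts a reference that the caller already holds.
    explicit counted_ptr(I* p) : p_(p) {}
    // Copying shares the object and adds a reference.
    counted_ptr(const counted_ptr& rhs) : p_(rhs.p_) { if (p_) p_->AddRef(); }
    // Copy-and-swap assignment: the old reference is released with rhs.
    counted_ptr& operator=(counted_ptr rhs) { std::swap(p_, rhs.p_); return *this; }
    // Leaving scope releases the reference, even when an exception is thrown.
    ~counted_ptr() { if (p_) p_->Release(); }
    I* operator->() const { return p_; }

private:
    I* p_;
};
```

Every copy of the holder adds a reference and every destructor releases one, so no explicit AddRef() or Release() call appears in user code.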
To speak some text we just need to call _pVoice->Speak() with a wide character string.
However, we would like to support other character types like char*, std::string and more. In fact, we want to support any type that can be converted to a string or wide-string via an operator<<().
Boost Iostreams
boost::iostreams makes it easy to create standard C++ streams and stream buffers for accessing new Sources and Sinks. To rephrase from the site:
A Sink provides write-access to a sequence of characters of a given type. A Sink may expose this sequence by defining a member function write, invoked indirectly by the Iostreams library through the function boost::iostreams::write.
There are 2 pre-defined sinks, boost::iostreams::sink and boost::iostreams::wsink, for writing narrow and wide strings respectively.
To make our class a Sink and get all its functionality, all we have to do is derive our class from either of these classes (depending on whether we want narrow or wide character output). Thus, audio_sink is a template class that derives from its template parameter.
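For comparison, here is roughly the kind of boilerplate that boost::iostreams saves: a hand-written std::streambuf that forwards buffered characters to a write() function. This sketch uses only the standard library. text_sink and sink_buffer are hypothetical names, and the sink merely collects the text instead of speaking it.

```cpp
#include <cstddef>
#include <ios>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical sink with the same write() shape as a Boost.Iostreams Sink.
struct text_sink
{
    std::streamsize write(const char* s, std::streamsize n)
    {
        received.append(s, static_cast<std::size_t>(n));
        return n;
    }
    std::string received;
};

// Minimal buffered streambuf that drains into a text_sink.
class sink_buffer : public std::streambuf
{
public:
    explicit sink_buffer(text_sink& sink) : sink_(sink)
    {
        setp(buf_, buf_ + sizeof(buf_));
    }

protected:
    // Buffer full: drain it, then store the pending character.
    int_type overflow(int_type ch) override
    {
        drain();
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            sputc(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }

    // Explicit flush (std::flush, std::endl).
    int sync() override
    {
        drain();
        return 0;
    }

private:
    void drain()
    {
        const std::streamsize n = pptr() - pbase();
        if (n > 0)
            sink_.write(pbase(), n);
        setp(buf_, buf_ + sizeof(buf_));   // reset the put area
    }

    text_sink& sink_;
    char buf_[16];   // deliberately tiny so overflow() is exercised
};
```

Attaching a std::ostream to a sink_buffer gives the usual operator<<() syntax for free. With boost::iostreams only the write() function has to be written by hand.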
To use our sink and create a concrete ostream, we need to use the boost::iostreams::stream class.
The supporting class is audio_ostream_t:
template < class SinkType >
class audio_ostream_t: public boost::iostreams::stream< SinkType >,
                       public SinkType
{
public:
    audio_ostream_t()
    {
        // open() lives in a dependent base class, so it is qualified
        // with this-> for standard-conforming compilers.
        this->open(*this);
    }
};
typedef audio_ostream_t< audio_sink< boost::iostreams::sink > >
    audio_ostream;
typedef audio_ostream_t< audio_sink< boost::iostreams::wsink > >
    waudio_ostream;
This class allows us to combine both the sink and stream objects into a single entity.
Deriving from boost::iostreams::stream gives us all the ostream functionality. This stream object needs to be initialized with a sink object instance. Thus, we also derive from SinkType (the template parameter) and initialize the boost::iostreams::stream with *this. Another advantage of deriving from SinkType is that it gives us direct access to the sink object, for example to call the setRate() method and change the speech speed.
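The same "one object is both sink and stream" idea can be sketched with the standard library alone. Note that this is a different technique from the one in audio_ostream_t: here the sink is itself a std::stringbuf and is listed as the first base class, so that it is fully constructed before std::ostream receives a pointer to it. recording_sink and recording_ostream are hypothetical names, and rate() exists only so the effect of setRate() can be observed.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical sink: collects text and carries one adjustable setting.
class recording_sink : public std::stringbuf
{
public:
    recording_sink() : rate_(0) {}
    void setRate(long n) { rate_ = n; }
    long rate() const { return rate_; }

private:
    long rate_;
};

// One object that is both the sink and the stream. The sink base comes
// first, so it already exists when std::ostream is given its address.
class recording_ostream : public recording_sink, public std::ostream
{
public:
    recording_ostream()
        : recording_sink()
        , std::ostream(static_cast<recording_sink*>(this))
    {}
};
```

A recording_ostream can be written to with operator<<() and, because it also is the sink, setRate() can be called on it directly.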
Speaking the Text
The boost::iostreams machinery will take care of all the type conversions and ostream syntax. Eventually, audio_sink::write will be called. Although we provided both narrow and wide character string ostreams, SAPI supports only wide character strings. Also, the Sink's write() methods accept non-null-terminated strings and the number of characters to use from the stream.
To address these two issues, we'll convert the character buffer and its size to a null-terminated (w)string using the appropriate std::(w)string constructor.
To speak the narrow character string, we call the wide write version with STLSoft's winstl::a2w() to easily convert from narrow to wide. winstl::a2w() will take care of any required allocations and deallocation of temporary buffers, and of the conversion itself.
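The two conversion steps can be sketched in portable C++ as follows. This is not what winstl::a2w() does internally. It assumes plain single-byte text and uses the classic locale's std::ctype facet for widening, and to_wide() is a hypothetical helper name.

```cpp
#include <cstddef>
#include <ios>
#include <locale>
#include <string>

// Builds a null-terminated wide string from a character run that is
// not null-terminated. Assumes plain single-byte text.
std::wstring to_wide(const char* s, std::streamsize n)
{
    // The (pointer, count) constructor copies exactly n characters.
    const std::string narrow(s, static_cast<std::size_t>(n));
    std::wstring wide(narrow.size(), L'\0');
    if (!narrow.empty())
    {
        const std::ctype<wchar_t>& facet =
            std::use_facet< std::ctype<wchar_t> >(std::locale::classic());
        facet.widen(narrow.data(), narrow.data() + narrow.size(), &wide[0]);
    }
    return wide;
}
```

The result's c_str() is null-terminated, which is the form that a call such as Speak() expects.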
Possible Extensions
Having achieved my design goals, some possible extensions come to mind.
It might be interesting to extend the ostream support even further by using locales for language selection. Wrapping some of the XML tags as ostream manipulators will give a more natural (or, at least, familiar) syntax. Of course, similar extensions can convert the SAPI Speech Recognition Interfaces into an istream, but that's a completely different ball game.
It might also be desirable to support synchronous (blocking) speech.
Revision History
- March 30, 2007 Fixed code to compile and run on MSVS 2005, by using wchar_t instead of unsigned short.
Thanks to Jochen Berteld for pointing out the problem and to Matthew Wilson for pointing out the solution.