Introduction
Imagine a situation in which someone who is not computer literate, or who is visually impaired, wants to listen to songs on a computer. Can existing media players address these specific requirements? I guess not. So applications should be able to recognize voice commands, too. Perhaps five years from now there will be machines capable of recognizing signals originating in the human brain, but at present the only smarter option is “voice”.
You just need to say the name of the song you want to listen to. This concept scales very easily: you may, for example, want to listen to all the songs by “Bryan Adams” available on your machine. So just say “All songs of” “Bryan Adams”.
Note: Voice-driven applications are not limited to media players; almost any application can evolve into a voice-driven one. The concept of Voice Intelligent Applications will certainly revolutionize the usage of software, especially when the users are visually impaired or unable to operate computer systems because they lack the necessary computer literacy. However, making every application voice driven does not always make sense.
Background
When you go to a restaurant, do you place your order by writing it on paper? No, right? You simply say the names of the items you want to eat.
The idea is basically to make applications usable by people who cannot read or write for whatever reason. Apart from this, it is also very useful for regular users. For instance, assume you are doing some urgent work on your machine and want to listen to some songs for recreation, but do not want to pause your work. Simply say the name of the song you want to listen to, and the player starts playing it.
Using the Code
Development of a basic voice-driven media player can be broken into the following steps.
Creating a songs database
Songs can be anywhere on your drives. So the first task is to keep a record of all the songs in a database. I am creating an XML file called SongsBase.xml that stores the names of all the songs (here the song name works as a command, in addition to basic commands like “stop”, “pause”, “restart”, etc.). The song name is stored in the Name attribute of each XML node, and the path of the song is stored as the content of the node, as shown below:
<song Name="Hero">C:\Songs\MyFolder\Enrique\Hero.mp3</song>
The assembly GrammarBuilder creates this resource before the player can be used. It should run as a separate engine that keeps monitoring all the drives, so that the player is informed about any new song placed on the system.
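As an illustration of how such a resource could be produced, here is a minimal sketch that writes one <song> node per MP3 file. It is not the code of the GrammarBuilder assembly: the class name SongsBaseWriter, its methods, and the root element name "songs" are assumptions made for this example, and the file name without its extension is assumed to be the spoken command.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

// Hypothetical helper; not part of the article's GrammarBuilder assembly.
public static class SongsBaseWriter
{
    // Builds an XML document with one <song Name="...">path</song> node per file.
    // The root element name "songs" is an assumption.
    public static XDocument Build(IEnumerable<string> paths)
    {
        XElement root = new XElement("songs");
        foreach (string path in paths)
        {
            // Assumption: the file name without extension is the spoken command.
            string name = Path.GetFileNameWithoutExtension(path);
            root.Add(new XElement("song", new XAttribute("Name", name), path));
        }
        return new XDocument(root);
    }

    // Scans one folder tree for .mp3 files and saves the resulting document.
    public static void Scan(string folder, string outputFile)
    {
        string[] files = Directory.GetFiles(folder, "*.mp3", SearchOption.AllDirectories);
        Build(files).Save(outputFile);
    }
}
```

A call such as SongsBaseWriter.Scan(@"C:\Songs", "SongsBase.xml") would then write the file. Scanning a whole drive this way can hit folders the process is not allowed to read, so a real scanner would need to handle those errors.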
Grammar Building
Just as a child needs to know the meanings of all the frequently used words in order to perform daily activities, the speech engine also needs to know about the commands it is expected to receive from the user. I mean, if you say “play” “summer of 69”, all the elements of this sentence must be known to the speech engine for it to play the song.
This is where grammar building comes in. See the code below:
public void BuildGrammar(ref System.Speech.Recognition.Grammar grammar,
    ISongsLocations songs)
{
    // Every song name becomes a phrase the engine can recognize.
    int count = songs.Count;
    string[] phrases = new string[count];
    for (int i = 0; i < count; i++)
    {
        // Underscores in names are spoken as spaces.
        phrases[i] = songs[i].Name.Replace('_', ' ');
    }

    Choices choices = new Choices();
    choices.Add(phrases);

    // Basic playback commands are recognized alongside the song names.
    string[] basicCommands = new string[] { "stop", "pause", "restart" };
    choices.Add(basicCommands);

    System.Speech.Recognition.GrammarBuilder grammarBuilder =
        new System.Speech.Recognition.GrammarBuilder(choices);
    grammar = new Grammar(grammarBuilder);
}
A Grammar object receives a GrammarBuilder object composed of Choices. The Choices instance shown above receives two different arrays of strings, phrases and basicCommands. phrases holds the names of the songs, while basicCommands holds “stop”, “pause”, etc.
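The introduction mentions a spoken request such as “All songs of” “Bryan Adams”. A grammar for that kind of phrase could be sketched as follows, using GrammarBuilder.Append to put a fixed prefix in front of a list of choices. The class ArtistGrammar, its members, the grammar name, and the artist list are all hypothetical; the article does not show how the actual player handles this case.

```csharp
using System;
using System.Speech.Recognition;

// Hypothetical helper for the "all songs of <artist>" idea.
public static class ArtistGrammar
{
    public const string Prefix = "all songs of";

    // Builds a grammar that matches the fixed prefix followed by one artist name.
    // Assumes the artists array is not empty.
    public static Grammar Build(string[] artists)
    {
        GrammarBuilder builder = new GrammarBuilder(Prefix);
        builder.Append(new Choices(artists));
        Grammar grammar = new Grammar(builder);
        grammar.Name = "AllSongsOfArtist"; // illustrative name
        return grammar;
    }

    // Returns the artist part of the recognized text, or null if the
    // text does not start with the prefix.
    public static string ExtractArtist(string recognizedText)
    {
        if (recognizedText == null ||
            !recognizedText.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
        {
            return null;
        }
        string artist = recognizedText.Substring(Prefix.Length).Trim();
        return artist.Length == 0 ? null : artist;
    }
}
```

Such a grammar can be loaded next to the song-name grammar. A recognition handler could then call ExtractArtist on the recognized text and queue every song by that artist.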
Recognition of the “voice command”
We do not live in a vacuum. There will be some sound around the computer all the time, for instance the sound of your TV set or the conversation of your roommates, and it definitely has the potential to confuse your speech recognition engine. A good speech recognition algorithm should be able to filter out the noise and extract the command accurately.
Fortunately, the .NET Framework already ships with a speech recognition engine (the System.Speech.Recognition namespace) that is pretty good at recognizing commands. In the actual Voice Driven Media Player, however, I am using my own speech recognition algorithm, whose complexity is obviously much higher.
In order to enable speech recognition in your application, first create an object of the SpeechRecognizer class.
SpeechRecognizer spRecognizer = new SpeechRecognizer();
Then load the grammar object created.
spRecognizer.LoadGrammar(grammar);
Enable the recognizer.
spRecognizer.Enabled = true;
Finally, register the SpeechRecognized event handler.
spRecognizer.SpeechRecognized +=
    new EventHandler<SpeechRecognizedEventArgs>(spRecognizer_SpeechRecognized);
Here is the SpeechRecognized handler in my application.
void spRecognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    string str = e.Result.Text;
    Console.WriteLine(str);

    // Basic playback commands are handled first.
    if (str.Equals("stop"))
    {
        mPlayer.controls.stop();
        return;
    }
    if (str.Equals("pause"))
    {
        mPlayer.controls.pause();
        return;
    }
    if (str.Equals("restart"))
    {
        mPlayer.controls.play();
        return;
    }

    // Anything else is treated as a song name and resolved to a file path.
    string url = songs.TryToGetFullMatch(str);
    if (url != null)
    {
        mPlayer.URL = url;
        Console.WriteLine(songs[str].ToString());
        mPlayer.controls.play();
    }
}
mPlayer is an object of type WindowsMediaPlayer.
WindowsMediaPlayer mPlayer = new WindowsMediaPlayer();
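Background noise, as discussed earlier, can still produce false matches. One simple guard with the built-in engine is to ignore results whose confidence score is low. The sketch below is not the custom algorithm used in the actual player; the class NoiseFilter, its members, and the 0.7 threshold are assumptions chosen only for illustration.

```csharp
using System;
using System.Speech.Recognition;

// Hypothetical helper; the threshold is an assumed value, not taken from the article.
public static class NoiseFilter
{
    public const float MinConfidence = 0.7f;

    // True when the engine's confidence score is high enough to act on.
    public static bool IsReliable(float confidence)
    {
        return confidence >= MinConfidence;
    }

    // Forwards only confident results to the supplied callback.
    public static void Attach(SpeechRecognizer recognizer, Action<string> onCommand)
    {
        recognizer.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)
        {
            if (IsReliable(e.Result.Confidence))
            {
                onCommand(e.Result.Text);
            }
        };

        // Raised when the engine hears speech that matches no loaded grammar well enough.
        recognizer.SpeechRecognitionRejected +=
            delegate(object sender, SpeechRecognitionRejectedEventArgs e)
            {
                Console.WriteLine("Speech rejected: no confident match.");
            };
    }
}
```

A suitable threshold depends on the microphone and the room, so it would need tuning by experiment.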
Points of Interest
The main challenge was to extract the correct title of the song. A user can rename a song file to any arbitrary name, such as "123.mp3", while the actual song title is "summer of 69". So it becomes necessary to parse the MP3/MP4 header to obtain the correct song name.
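The article does not say which metadata format the player parses. As one possible illustration, the sketch below reads the title from an ID3v1 tag, the simplest MP3 tag format: the last 128 bytes of the file start with "TAG", followed by a 30-byte title field. The class Id3v1Reader is hypothetical, and files that carry only ID3v2 tags or MP4 metadata are not covered by it.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; handles ID3v1 only.
public static class Id3v1Reader
{
    private const int TagSize = 128;

    // Returns the title stored in an ID3v1 tag, or null when no usable tag is found.
    public static string ReadTitle(Stream stream)
    {
        if (!stream.CanSeek || stream.Length < TagSize)
        {
            return null;
        }

        // The tag occupies the last 128 bytes of the file.
        stream.Seek(-TagSize, SeekOrigin.End);
        byte[] tag = new byte[TagSize];
        int read = 0;
        while (read < TagSize)
        {
            int n = stream.Read(tag, read, TagSize - read);
            if (n == 0)
            {
                return null;
            }
            read += n;
        }

        // A valid tag starts with the ASCII marker "TAG".
        if (tag[0] != 'T' || tag[1] != 'A' || tag[2] != 'G')
        {
            return null;
        }

        // Bytes 3..32 hold the title, padded with zero bytes or spaces.
        string title = Encoding.GetEncoding("ISO-8859-1").GetString(tag, 3, 30);
        int end = title.IndexOf('\0');
        if (end >= 0)
        {
            title = title.Substring(0, end);
        }
        title = title.Trim();
        return title.Length == 0 ? null : title;
    }
}
```

Opening a file with File.OpenRead and passing the stream to ReadTitle would return the stored title when a tag is present. When it returns null, the file name is the natural fallback.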
History
- Nov-07 2010: Original version posted