Introduction
This article explains how you can write your own Siri-like mobile assistant application. Last year I was responsible for developing an Android assistant application of this kind. It is complete and now available on Google Play. I will share the experience I gained while building it.
What is a Mobile Assistant Application?
A mobile assistant application should provide the following functions:
- It should be a mobile application (Android, iOS, Windows Phone, etc.),
- You can ask questions in writing or by voice,
- You can get a written, spoken, or graphical response, or an activity, in answer to your questions,
- It should use the device's capabilities, such as the microphone, screen, GPS, internet, speaker, and the information stored on the device.
What a Mobile Assistant Application can Do
A mobile assistant application can offer many features. The first version of the mobile assistant that I developed could understand and respond to only 15 commands. Now it can understand and respond to more than 50 commands. The basic command types cover news, weather, setting an alarm, and calling a contact. When I surveyed the mobile assistants in the app markets, I found that these commands are common to most of them. In addition, you can add the commands below to your mobile assistant as an advanced command list.
- Set alarm,
- Get info about news, weather, match scores, or wiki articles,
- Run an application,
- Open a media file (video, music),
- Share something on Facebook, Twitter, etc.,
- Read/write SMS or email,
- Read shared feeds on social media,
- Find the nearest market, pharmacy, hospital, restaurant, etc.,
- Call someone,
- Solve basic math problems,
- Check your bank balance,
- Make a money transfer to someone,
- Check the latest currency or stock exchange rates,
- Read/set calendar entries,
- Buy a concert or travel ticket,
- Etc.
Some of these command types can be implemented only through integrations with third-party companies. For instance, you could integrate with Amazon or Best Buy to order an item through your mobile assistant.
Mobile Assistants in the Market
There are more than 60 known mobile assistants in the app markets. The most popular ones are Siri and Google Voice Search.
Here is a list of mobile assistants and mobile assistant development environments available in the markets:
- Siri,
- Google Voice Search,
- Nuance Nina,
- Dragon Mobile Assistant,
- Angel Lexee,
- AIVC,
- Iris,
- Skyvi,
- EverFriends,
- EasyLuncher,
- Speaktoit,
- Evi,
- Turkcell Mobil Asistan (Turkish).
Siri and Google Voice Search are already well known, so I will share some information and video links about Nina, Lexee, Dragon Mobile Assistant, and Turkcell Mobil Asistan instead.
Nuance Nina: Nuance offers large enterprise organizations an SDK for developing their own mobile assistant application, which can be used as a customer service application. The SDK can be integrated into iOS and Android applications. You can get more information on their website, Meet Nina.
I like the video on YouTube that introduces Nuance Nina.
Lexee: Lexee is the mobile assistant of Angel Labs. Lexee also offers a web environment for creating your own mobile assistant. You can add, update, and delete your scenarios through this web interface without coding. Another strength of Lexee is its analysis tools, an area where Angel Labs is strong. The Lexee environment offers professionals a variety of reports and data about usage.
You can get more information and watch the video via this link.
Dragon Mobile Assistant: Dragon Mobile Assistant is also a Nuance product. It lets users speak naturally to access a wide range of content and to do everyday tasks on their phone easily. You can get more information via this link.
You can download the application and watch my favorite mobile assistant video by clicking here.
Turkcell Mobil Asistan: Turkcell Mobil Asistan is the only Turkish mobile assistant on Google Play. Turkcell is one of the biggest GSM companies in Europe. Through this application you can get customer care services such as your phone bill details and tariff information. In addition, you can ask for information about news, weather, currency, and traffic in Istanbul.
To get more information and download Turkcell Mobil Asistan, click here.
I hope the information above helps you understand the basic concepts of mobile assistants. Let's look at some technical points about these applications. A mobile assistant application should include the technologies below:
- Speech to Text (STT) Engine,
- Text to Speech (TTS) Engine,
- Tagging (Intelligence),
- Noise Reduction Engine,
- Voice Biometrics,
- Speech Compression Engine,
- UI for Call Outs.
- STT: The speech-to-text engine takes the user's voice and converts it to text. The voice can arrive as an audio file or as a stream.
- TTS: The text-to-speech engine converts text to voice. This matters when the user needs to listen to the response, for example while driving.
- Tagging: The text produced by STT is not always simple. The tagging technology labels the text with what the user wants from that utterance. For example, if the user asks "What should I wear tomorrow?", the tagging engine can tag the request with a weather or calendar info tag.
- Noise Reduction Engine: The user's speech is not always clean; there may be noise around (for example, air-conditioner noise). The noise reduction engine should remove this background noise from the voice.
- Voice Biometrics: Mobile assistants can provide account-based information such as a monthly credit card report. Authentication is therefore important, and voice biometrics is one of the authentication methods. With voice biometrics, the mobile assistant can authenticate you to the system.
- Speech Compression Engine: If your assistant responds slowly, users quickly give up on the application and type their search on the web instead. Network communication is critical, and so is the packet size of each transaction: small packets transfer quickly, and the result comes back quickly. That is why a good mobile assistant application should have a speech compression engine, so that the client can send compressed voice to the server fast. Speech compression differs from general-purpose compression because voice data contains little repeating data. G.711 can be chosen as the compression algorithm. Note that G.711 is a lossy companding codec, but it preserves telephone-quality speech.
- UI for Call Outs: After the server sends the result, you should play the audio and also show some information on the device screen inside callouts. My advice: native components can limit your application, so a web-based UI hosted inside the native application can be more convenient for callouts.
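To make the tagging step from the list above more concrete, here is a minimal sketch of a keyword-based tagger. It is only an illustration: the class name KeywordTagger, the keyword table, and every tag name except "weather_info" are assumptions made for this example, and a production tagging engine would rely on real natural language processing rather than substring matching.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical keyword-based tagger; not the engine used in the real application. */
final class KeywordTagger {
    // Keyword -> tag table. Keywords and tag names are illustrative assumptions.
    private static final Map<String, String> KEYWORD_TO_TAG = new LinkedHashMap<>();
    static {
        KEYWORD_TO_TAG.put("weather", "weather_info");
        KEYWORD_TO_TAG.put("wear", "weather_info");
        KEYWORD_TO_TAG.put("alarm", "set_alarm");
        KEYWORD_TO_TAG.put("call", "call_contact");
        KEYWORD_TO_TAG.put("news", "news_info");
    }

    /**
     * Returns the tag of the first keyword (in table order) contained in the
     * recognized text, or "unknown" when no keyword matches.
     * Matching is by plain substring, so it is deliberately naive.
     */
    static String tag(String recognizedText) {
        String text = recognizedText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : KEYWORD_TO_TAG.entrySet()) {
            if (text.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "unknown";
    }
}
```

With this table, the question "What should I wear tomorrow?" is tagged as weather_info because it contains the keyword "wear".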
Architecture of Mobile Assistants
The mobile device and the main server should communicate by streaming, because users do not like waiting for voice data to transfer over a slow connection. Speed is really important for this kind of application: when it is fast, the interaction feels more natural, and the user can feel that they are speaking with a real agent or assistant.
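The streaming idea can be sketched with plain Java streams. This is an assumption-laden illustration, not the code of the real client: VoiceStreamer, stream, and chunkSize are hypothetical names, the microphone is represented by any InputStream, and the connection to the main server by any OutputStream (for example a socket's output stream).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper that forwards recorded audio in small chunks. */
final class VoiceStreamer {
    /**
     * Copies audio from the source to the server connection chunk by chunk,
     * flushing after every chunk so the server can start working before the
     * user has finished speaking. Returns the number of bytes sent.
     */
    static long stream(InputStream microphone, OutputStream server, int chunkSize)
            throws IOException {
        byte[] chunk = new byte[chunkSize];
        long total = 0;
        int read;
        while ((read = microphone.read(chunk)) != -1) {
            server.write(chunk, 0, read); // send only the bytes actually read
            server.flush();               // push the chunk out immediately
            total += read;
        }
        return total;
    }
}
```

A small chunk size keeps latency low; for 8 kHz, 16-bit mono audio, 320 bytes correspond to 20 ms of speech.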
When the user asks a question by pressing a button on the client, the client starts streaming the question byte by byte to the main server. The main server sends the data to the STT server, which finds the text of the speech and returns it to the main server. The main server then sends the text to the tagging server to find out what the user wants. The tagging server creates a tag for the request, such as “weather_info”, and sends it back to the main server, which forwards the tag to the information server. If the tag requires authentication, the security server checks the authentication before the request reaches the information server. Finally, the response comes back to the main server, which creates the response text, the response graphic, and the speech (by communicating with the TTS server) and sends the response object to the mobile device.
The information server can communicate with third-party servers for information that it does not store itself. The security server can include more than one authentication technology, such as voice biometrics, IMSI-IP RADIUS lookup, or account-password authentication.
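The request flow described above can be sketched as a small orchestration class. Everything here is a simplified assumption for illustration: the interface names, the handle method, the protectedTags set, and the AUTH_REQUIRED message are hypothetical, each server is reduced to a single in-process call, and the TTS and graphic steps are left out.

```java
import java.util.Set;

/** Hypothetical main server that chains the STT, tagging, security, and information steps. */
final class MainServer {
    interface SttServer { String toText(byte[] speech); }
    interface TaggingServer { String tag(String text); }
    interface SecurityServer { boolean isAuthenticated(String userId); }
    interface InformationServer { String lookup(String tag); }

    static final String AUTH_REQUIRED = "Please authenticate first.";

    private final SttServer stt;
    private final TaggingServer tagging;
    private final SecurityServer security;
    private final InformationServer information;
    private final Set<String> protectedTags; // tags that need authentication

    MainServer(SttServer stt, TaggingServer tagging, SecurityServer security,
               InformationServer information, Set<String> protectedTags) {
        this.stt = stt;
        this.tagging = tagging;
        this.security = security;
        this.information = information;
        this.protectedTags = protectedTags;
    }

    /** Runs one request through the pipeline and returns the response text. */
    String handle(String userId, byte[] speech) {
        String text = stt.toText(speech);   // speech -> text
        String tag = tagging.tag(text);     // text -> intent tag, e.g. "weather_info"
        if (protectedTags.contains(tag) && !security.isAuthenticated(userId)) {
            return AUTH_REQUIRED;           // stop before the information server
        }
        return information.lookup(tag);     // tag -> response text
    }
}
```

Keeping each server behind its own interface lets you replace one stage, for example the STT engine, without touching the others.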
Callout UI
If you develop your own native components for callouts, it is difficult to handle all the formats on the client, scroll through all the items, and so on. My advice is to create a custom web view, where formatted callouts can be added easily.
The picture on the left shows how your SiriWebView will appear on screen. The user can scroll the web view, and when a new callout arrives the web view scrolls to it automatically.
In this section I will briefly describe how to write your own SiriWebView. The article also includes a sample project for the web view. Apologies to users of other platforms: all my examples are for Android.
First, create a new class and name it SiriWebView. It should extend the standard Android WebView. The class should contain a constructor and an overridden onDraw method. In addition, we add two new methods to this class: one to initialize it, and one to add a new callout. The code snippet below shows how the method that adds a new callout works.
// Adds a new callout to the page. ismsgResponse selects the bubble style:
// false renders the gray bubble, true renders the blue one.
public void AddNewCallOut(String message, Boolean ismsgResponse) {
    elementId = elementId + 1;
    StringBuilder messageBuilder = new StringBuilder();
    if (!message.isEmpty()) {
        // Escape backslashes and double quotes so the message cannot
        // terminate the JavaScript string literal built below.
        String safeMessage = message.replace("\\", "\\\\").replace("\"", "\\\"");
        if (!ismsgResponse) {
            messageBuilder
                    .append("<table class='bubble-gray' cellspacing='0' cellpadding='0'><tr><td class='head'></td></tr>");
            messageBuilder
                    .append("<tr><td class='mid'><div class='txt shadow'>"
                            + safeMessage + "</div></td></tr>");
            messageBuilder
                    .append("<tr><td class='foot'></td></tr></table>");
        } else {
            messageBuilder
                    .append("<table class='bubble-blue' cellspacing='0' cellpadding='0'><tr><td class='bhead'></td></tr>");
            messageBuilder
                    .append("<tr><td class='bmid'><div class='txt shadow'>"
                            + safeMessage + "</div></td></tr>");
            messageBuilder
                    .append("<tr><td class='bfoot'></td></tr></table>");
        }
        // Write the bubble markup into the placeholder element "div<elementId>".
        loadUrl("javascript:document.getElementById(\"div" + elementId
                + "\").innerHTML=\"" + messageBuilder.toString() + "\";");
    }
    // For non-response messages after the first callout, scroll the page
    // down to the previous callout.
    if (!ismsgResponse && elementId != 1) {
        StringBuilder jvscr = new StringBuilder();
        // Compute the absolute position of the previous callout element.
        jvscr.append("var elem = document.getElementById('div"
                + (elementId - 1)
                + "'); var x = 0; var y = 0; while (elem != null) { x += elem.offsetLeft; y += elem.offsetTop; elem = elem.offsetParent; } ");
        // Scroll one pixel at a time; the inner loop is a busy-wait delay.
        jvscr.append("var endj=500; var i=window.scrollY; for(i=window.scrollY;i<y;i++){ var j=0; var a=0; for(j=0;j<endj;j++) {a=a+1; } window.scrollTo(x, i); } ");
        loadUrl("javascript:" + jvscr.toString());
    }
}
The function takes two parameters, message and ismsgResponse. Pass your message as a string and set ismsgResponse when you call the function to add a new callout. The ismsgResponse parameter indicates whether the message is a response from the assistant. It determines the color of the callout and whether the view scrolls. In the first lines of the function you can see the elementId field, which identifies the page element that receives the callout and the element to scroll to.
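If you want to test the callout markup without an Android device, the HTML that the function above builds can be produced by a plain Java helper. The sketch below is an assumption, not part of the sample project: CalloutHtml, build, and escapeHtml are hypothetical names, and unlike the original function it escapes HTML special characters so that the message is shown as text.

```java
/** Hypothetical helper that builds the callout bubble markup outside of the WebView. */
final class CalloutHtml {
    /** Escapes the characters that have a special meaning in HTML. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    /**
     * Builds the bubble table for one callout. The CSS class names are the
     * ones used by the function above: bubble-gray with head/mid/foot when
     * isResponse is false, bubble-blue with bhead/bmid/bfoot when it is true.
     */
    static String build(String message, boolean isResponse) {
        String bubble = isResponse ? "bubble-blue" : "bubble-gray";
        String prefix = isResponse ? "b" : "";
        return "<table class='" + bubble + "' cellspacing='0' cellpadding='0'>"
                + "<tr><td class='" + prefix + "head'></td></tr>"
                + "<tr><td class='" + prefix + "mid'><div class='txt shadow'>"
                + escapeHtml(message) + "</div></td></tr>"
                + "<tr><td class='" + prefix + "foot'></td></tr></table>";
    }
}
```

Because the helper has no Android dependency, it can be covered by ordinary unit tests, and the web view only has to inject the returned string.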
After you create your own component, you can add it to your main_activity.xml as shown below.
<com.example.siriui.SiriWebView
android:id="@+id/webview"
android:layout_width="fill_parent"
android:layout_height="fill_parent"
android:keepScreenOn="true"
android:layout_marginTop="0dp"
android:layout_gravity="fill"
android:layout_marginBottom="0dp"
android:layout_marginLeft="0dp"
android:layout_marginRight="0dp"
android:scrollbars="horizontal"
/>
You can find a working example of this component in this article's sample project.
Audio Compression
Audio compression reduces the size of audio data, so the compressed audio can be transferred more quickly over a GSM network. Compression can be lossy or lossless.
Lossy: This method discards some of the data during the coding process. However, the retained data is still acceptable for recognition. The advantage of the lossy method is that the data can be made smaller.
Lossless: With this method, the audio is compressed without losing its original quality. This is important if the recognition or recording tools do not have any noise reduction process.
Some data reduction does not directly affect the quality of the speech data. Put simply, if the recorded audio will be used for speech recognition, the data that is not useful for recognition can be removed. Human hearing covers the audible frequency range of roughly 20 Hz to 20 kHz, so content outside this range can be removed.
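To see why the size of the audio matters on a slow network, the arithmetic can be written down in a few lines. This is only a back-of-the-envelope sketch: AudioSize and its methods are hypothetical names, and the sample rate, sample width, and link speed used in the usage note are assumed values, not measurements from the real application.

```java
/** Hypothetical helper for estimating raw audio size and transfer time. */
final class AudioSize {
    /** Size in bytes of uncompressed PCM audio. */
    static long pcmBytes(int sampleRateHz, int bitsPerSample, int channels, double seconds) {
        return Math.round(sampleRateHz * (bitsPerSample / 8.0) * channels * seconds);
    }

    /** Ideal transfer time in seconds over a link with the given bit rate. */
    static double transferSeconds(long bytes, long linkBitsPerSecond) {
        return bytes * 8.0 / linkBitsPerSecond;
    }
}
```

For example, pcmBytes(8000, 16, 1, 5.0) gives 80,000 bytes for five seconds of 8 kHz, 16-bit mono speech, while the same recording at 8 bits per sample is 40,000 bytes, which halves the ideal transfer time on any link.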
G.711: You can use the G.711 standard for audio compression. It is a lossy companding method that encodes each sample in 8 bits, so it reduces 16-bit PCM audio by 50 percent. You can download the Java source code of G711.java via this link ( https://code.google.com/p/sipdroid/source/browse/trunk/src/org/sipdroid/media/G711.java?r=386 ).
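The following is a minimal sketch of the mu-law variant of G.711, written independently of the G711.java file linked above, so the class and method names (MuLawCodec, encode, decode) are assumptions. It shows where the 50 percent saving comes from: every 16-bit PCM sample becomes one 8-bit code. Note that decoding returns a value close to the input sample, not always the identical value.

```java
/** Hypothetical mu-law (G.711) codec sketch: 16-bit PCM sample <-> 8-bit code. */
final class MuLawCodec {
    private static final int BIAS = 0x84;  // 132, added before the segment search
    private static final int CLIP = 32635; // largest magnitude that still fits after adding BIAS

    /** Encodes one 16-bit PCM sample into one mu-law byte. */
    static byte encode(short pcm) {
        int sign = (pcm >> 8) & 0x80;           // 0x80 for negative samples
        int magnitude = sign != 0 ? -pcm : pcm; // computed as int, so -32768 is safe
        if (magnitude > CLIP) {
            magnitude = CLIP;
        }
        magnitude += BIAS;
        // Find the segment (exponent): position of the highest set bit.
        int exponent = 7;
        for (int mask = 0x4000; (magnitude & mask) == 0 && exponent > 0; mask >>= 1) {
            exponent--;
        }
        int mantissa = (magnitude >> (exponent + 3)) & 0x0F;
        return (byte) ~(sign | (exponent << 4) | mantissa); // mu-law codes are stored inverted
    }

    /** Decodes one mu-law byte back to an approximate 16-bit PCM sample. */
    static short decode(byte code) {
        int u = ~code & 0xFF;
        int sign = u & 0x80;
        int exponent = (u >> 4) & 0x07;
        int mantissa = u & 0x0F;
        int sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
        return (short) (sign != 0 ? -sample : sample);
    }

    /** Encodes a whole buffer; the result has one byte per input sample. */
    static byte[] encode(short[] pcm) {
        byte[] out = new byte[pcm.length];
        for (int i = 0; i < pcm.length; i++) {
            out[i] = encode(pcm[i]);
        }
        return out;
    }
}
```

Louder samples are quantized with larger steps than quiet ones, which is why the round-trip error stays small relative to the signal.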
Other methods that can be used include MPEG-1 Layer III (MP3), MPEG-1 Layer II Multichannel, MPEG-1 Layer I, AAC, HE-AAC, MPEG Surround, MPEG-4 ALS, MPEG-4 SLS, MPEG-4 DST, MPEG-4 HVXC, MPEG-4 CELP, USAC, G.718, G.719, G.722, G.722.1, G.722.2, G.723, G.723.1, G.726, G.728, G.729, G.729.1, Speex, Vorbis, WMA, and Codec2.
Revision History
I will add example code snippets about compression, streaming, buffer playback, Call Out UI, tagging, TTS, and STT, which can help programmers handle some of the difficult points.
18/04/13: Callout UI has been added to the article.
30/11/13: Audio Compression has been added to the article.