I have immersed myself recently in voice-driven applications and was asked to knock up a quick prototype of something “that looks and acts like Siri”. That’s a pretty tall order, I thought, but after some research I came up with the following…
The first problem we have to solve is speech recognition, i.e. converting the voice data into text. The data would have to be streamed to a server which then performs the actual recognition and sends back a string of what it thinks you said. That’s some complicated stuff right there. Voice recognition is a science in itself, and I also did not want to have to deal with the server setup. Luckily for me, it turns out that Google has built all of this into their Chrome browser already, courtesy of the HTML5 Speech Input API. All you have to do is add a special attribute to an <input> and it will allow users to simply
"click on an icon and then speak into your computer’s microphone. The recorded audio is sent to speech servers for transcription, after which the text is typed out for you."
Sounds about right to me, first problem solved!
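In Chrome, that special attribute really is all it takes — a minimal illustration:

```html
<!-- x-webkit-speech adds a microphone icon to the field (Chrome-only) -->
<input type="text" x-webkit-speech>
```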
The second challenge is to extract meaningful information from the text to understand what the user wants you to do. When the user says “What is the weather forecast for tomorrow”, you have to figure out from this string that the user … well … wants to see the weather forecast for tomorrow. If this is the only case your application has to handle, it’s pretty easy:
unless utterance =~ /weather forecast/i
  return “I do not know what you mean, try asking again (e.g. what is the weather forecast for tomorrow)”
end
But also pretty useless.
Clearly you could not write a case statement big enough to handle all possible scenarios, or even a fairly limited set of them: what would happen if the user asks “Show me the forecast of the weather”, not to mention “Is it going to rain tomorrow?”. You can see that this processing of natural language can get fairly complicated very quickly. As it turns out, this is another field of science (Natural Language Processing, or NLP) that people much smarter than myself have worked on for decades. One example of a website that uses NLP to answer questions is WolframAlpha. And guess who uses WolframAlpha … that’s right: Siri. So if it is good enough for Apple, it’s certainly good enough for my prototype, so I registered for a developer licence with them and that was it (I suggest you do the same if you want to follow this article). Now I just need to hook everything up; I’m going to create a Rails application to do just that.
It will be a very simple application with 1 page that has 1 form on it. This form in turn will have 1 field that the user can use to “enter” their question. To support voice entry, I will add the required attribute (“x-webkit-speech”) to this input field. To further emphasise the fact that this is a voice-driven application, I am going to style the input field:
Using the following CSS:
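(The exact stylesheet isn’t reproduced here; the rules below are my guess at the effect described — an oversized, centred field — and the selector and sizes are placeholders, not the original values.)

```css
/* Oversized, centred speech field; Chrome also lets you scale the mic
   icon via the ::-webkit-input-speech-button pseudo-element. */
input[x-webkit-speech] {
  width: 400px;
  height: 60px;
  font-size: 24px;
  text-align: center;
  border-radius: 30px;
}
```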
Furthermore, that same page will have an area that displays the data: what the user says and what WolframAlpha returns as the answer. We call this the stream and represent it as an ordered list, which gives us the following (using HAML):
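The original HAML isn’t shown here, but the whole page boils down to roughly this sketch (the `#query` and `#stream` ids are my placeholders):

```haml
%form#speech-form
  %input#query{ type: "text", "x-webkit-speech" => true }
%ol#stream
```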
Incredibly simple! The user presses on the microphone and starts talking. When he stops talking, Google processes the voice data and returns a text representation (actually it returns several, ranked in decreasing order of “correctness”; we just always use the top result). It inserts the text into the text field on which the voice input was triggered; essentially, Chrome fills in the form for us with the transcribed voice data. This is all handled by Chrome, we do not have to do anything for this to work.
When the result comes back from Google, Chrome also raises a JS event that we can listen for. We will use this to trigger an AJAX call to WolframAlpha, passing in the received text, i.e. we automatically submit the form to process_speech. process_speech is a controller method that handles the call to WolframAlpha (I am using the Faraday gem). When we receive an answer from WolframAlpha, we attach this to the stream (in CoffeeScript):
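The original handler isn’t reproduced here, but it amounts to something like this sketch (jQuery assumed; the `#query`/`#stream` ids and the `/process_speech` route are placeholders of mine):

```coffeescript
# When Chrome finishes transcribing, it fires 'webkitspeechchange' on the field.
$ ->
  $('#query').bind 'webkitspeechchange', ->
    utterance = $(@).val()
    $('#stream').prepend "<li class='question'>#{utterance}</li>"
    # Ask our controller (which in turn asks WolframAlpha) and show the answer.
    $.post '/process_speech', { q: utterance }, (answer) ->
      $('#stream').prepend "<li class='answer'>#{answer}</li>"
```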
And that is it really, some more CSS and more CoffeeScript to make it look pretty and you are good to go: Siri in the browser in less than 150 lines of code. I haven’t had a chance yet to clean up the code so it’s not public yet on GitHub, but here’s a video showing the end result.
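While the repo isn’t up yet, the server half is short enough to sketch. This is not my exact controller code: the post uses the Faraday gem, but I’m using stdlib Net::HTTP here so the snippet is self-contained, and APP_ID stands in for your WolframAlpha developer key. The parameters follow the WolframAlpha v2 query endpoint.

```ruby
require "net/http"
require "uri"

APP_ID = "YOUR-APP-ID" # placeholder: your WolframAlpha developer key

# Pull the first <plaintext> cell out of WolframAlpha's XML response.
def first_plaintext(xml)
  xml[%r{<plaintext>(.*?)</plaintext>}m, 1]
end

# Ask the WolframAlpha v2 query endpoint and return the first plain-text answer.
def ask_wolfram(utterance)
  url = URI("http://api.wolframalpha.com/v2/query?" +
            URI.encode_www_form(input: utterance, appid: APP_ID, format: "plaintext"))
  first_plaintext(Net::HTTP.get(url))
end
```

The `process_speech` action then just renders `ask_wolfram(params[:q])` back to the page.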