In conversation with a browser

Voice assistants have taken off, but can we build our own with web technologies? I’ve been building bots for other platforms, but I wanted to investigate how well one could work in the browser. Can we talk to a web application and get results?

Let’s dive into the Web Speech API, speech synthesis, and conversation design. We’ll find out whether browsers can be virtual assistants or virtually useless.

Bots have been a hot topic lately, but they’ve been around for ages – ELIZA was built back in the 1960s, possibly to quietly prove they didn’t work very well. It was mostly just pattern matching.

But technology moved on, and SmarterChild gained a surprisingly big following despite being little more than a pile of preset responses (this one was just if statements).

There have been bots on IRC and SMS, then suddenly Slack invented bots! [general laughter]

Now we have bots in our homes: Alexa and Google Home devices.

So the question is, how do we build our own conversational assistant using the web?

We have the Speech Synthesis API (text to speech), which lets the browser speak with just a couple of lines of code. Browsers provide different voices to allow a little bit of customisation. This works in everything except IE11.
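
Something like this minimal sketch (not the talk’s actual code, just to illustrate):

```javascript
// Make the browser speak: a minimal Speech Synthesis API example.
const utterance = new SpeechSynthesisUtterance("Hello from the browser!");

// Optionally pick one of the voices the browser provides.
// (Note: getVoices() can be empty until the voiceschanged event fires.)
const voices = speechSynthesis.getVoices();
const britishVoice = voices.find((voice) => voice.lang === "en-GB");
if (britishVoice) {
  utterance.voice = britishVoice;
}

speechSynthesis.speak(utterance);
```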

The Speech Recognition API does what it sounds like, but it doesn’t have much browser support yet. Also, Chrome sends all the audio to the Google Cloud Speech API, which is likely to bother people who are concerned about privacy.
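
A sketch of what recognition looks like in a supporting browser (the webkit prefix is what Chrome uses):

```javascript
// Listen for a single phrase and log the transcript.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-GB";

recognition.addEventListener("result", (event) => {
  const transcript = event.results[0][0].transcript;
  console.log(`Heard: ${transcript}`);
});

recognition.start();
```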

Then there’s the MediaRecorder API, which lets you easily record audio or video in the browser and use the result immediately as a webm file.

Demo: http://web-recorder.glitch.me/
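
A rough sketch of what that recording looks like in code (the five-second cutoff is arbitrary):

```javascript
// Record a few seconds of microphone audio with the MediaRecorder API.
async function recordAudio() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];

  recorder.addEventListener("dataavailable", (event) => chunks.push(event.data));
  recorder.addEventListener("stop", () => {
    // The recorded chunks combine into a webm blob, ready to play or upload.
    const blob = new Blob(chunks, { type: "audio/webm" });
    new Audio(URL.createObjectURL(blob)).play();
  });

  recorder.start();
  setTimeout(() => recorder.stop(), 5000); // arbitrary five-second recording
}
```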

So then what? You can send the recording to a speech-to-text service like Google Cloud Speech, Azure Cognitive Services or IBM Watson.
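
In practice that usually means posting the blob to your own server, which holds the service credentials and forwards the audio on. A hedged sketch – the /api/transcribe endpoint and the response shape are made up:

```javascript
// Hypothetical: upload the recorded webm blob to your own backend,
// which relays it to whichever speech-to-text service you've chosen.
async function transcribe(blob) {
  const response = await fetch("/api/transcribe", {
    method: "POST",
    headers: { "Content-Type": "audio/webm" },
    body: blob,
  });
  const { transcript } = await response.json();
  return transcript;
}
```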

The Web Audio API lets you use the raw audio bytes as they’re being recorded, via an AudioWorklet. Combine that with WebSockets and you can create live transcription… which kinda works. There are also some polyfills.
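
Roughly, a worklet forwards raw samples to the main thread, which streams them over a socket. The file name and WebSocket URL below are placeholders:

```javascript
// recorder-processor.js – runs on the audio rendering thread.
class RecorderProcessor extends AudioWorkletProcessor {
  process(inputs) {
    // inputs[0][0] is a Float32Array of 128 samples from the first channel.
    if (inputs[0] && inputs[0][0]) {
      this.port.postMessage(inputs[0][0]);
    }
    return true; // keep the processor alive
  }
}
registerProcessor("recorder-processor", RecorderProcessor);
```

```javascript
// Main thread: pipe the mic through the worklet and out over a WebSocket.
async function streamAudio() {
  const context = new AudioContext();
  await context.audioWorklet.addModule("recorder-processor.js");
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = context.createMediaStreamSource(stream);
  const worklet = new AudioWorkletNode(context, "recorder-processor");
  const socket = new WebSocket("wss://example.com/transcribe");

  worklet.port.onmessage = (event) => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(event.data); // raw Float32Array samples
    }
  };

  source.connect(worklet);
}
```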

It’s not great that all of these approaches send your audio off to a third-party service. Privacy is important, and these services cost money too. This is why devices have ‘wake words’ like “Alexa” or “OK Google”: the wake word is detected locally, so audio only goes to the cloud after you’ve asked for something.

So we need to build our own wake word. TensorFlow.js to the rescue! You can set up speech commands using pre-trained models, which translates to an in-browser wake word. (Demo of waking up a service with the name ‘baxter’.)
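
The speech-commands model for TensorFlow.js recognises a small fixed vocabulary out of the box; a custom word like ‘baxter’ would be added via its transfer-learning API. A sketch of the basic recogniser:

```javascript
import "@tensorflow/tfjs";
import * as speechCommands from "@tensorflow-models/speech-commands";

async function listenForWakeWord() {
  // BROWSER_FFT uses the browser's native FFT via the Web Audio API.
  const recognizer = speechCommands.create("BROWSER_FFT");
  await recognizer.ensureModelLoaded();
  const labels = recognizer.wordLabels();

  recognizer.listen(
    async (result) => {
      // result.scores holds one probability per label; log the best match.
      const scores = Array.from(result.scores);
      const best = scores.indexOf(Math.max(...scores));
      console.log(`Heard: ${labels[best]}`);
    },
    { probabilityThreshold: 0.75 }
  );
}
```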

Thinking about conversation design, the one piece that’s really important: speak your bot’s conversations out loud with someone else, someone who doesn’t know what the responses should be. They’ll expose the cases you haven’t thought of.

While the technical journey is interesting, what’s more interesting is the potential for the web platform to take over from mystery boxes like Alexa. The web is about experimentation and freedom.

People have built proofs of concept that add sign language detection to Alexa and reflect its spoken responses back as text. Gesture-based interaction can be very natural, and we can do it with open technology.

Phil is continuing with the original idea of building a web assistant. Feel free to join in with the project. This is just the start of the journey.

@philnash