Voice assistants have taken off, but can we build our own with web technologies? I’ve been building bots for other platforms, but I wanted to investigate how well one could work in the browser. Can we talk to a web application and get results?
Let’s dive into the Web Speech API, speech synthesis, and conversation design. We’ll find out whether browsers can be virtual assistants or virtually useless.
Bots have been a hot topic lately, but they’ve been around for ages – Eliza was built back in the 60s, possibly to quietly prove they didn’t work very well. It was mostly just pattern matching.
But technology moved on and SmarterChild gained a surprisingly big following, despite being a ton of preset responses (this one was
There have been bots on IRC, SMS, then suddenly Slack invented bots! (general laughter)
Now we have in-house bots, Alexa and Home devices.
So the question is, how do we build our own conversational assistant using the web?
We have the Speech Synthesis API (text to speech), which lets the browser speak with just a couple of lines of code. It provides different voices to allow a little bit of customisation. This works in everything except IE11.
The Speech Recognition API does what it sounds like; however, it doesn’t have much support yet. Also, Chrome sends all the data to the Google Cloud Speech API, which is likely to bother many people who are concerned about privacy.
Then there’s the MediaRecorder API, which lets you easily record audio or video in the browser; and use it immediately as a
So then what? You can send the recording to a speech to text service like Google Cloud Speech, Azure Cognitive Services or IBM Watson.
The Web Audio API lets you use the raw audio bytes as they are being recorded, using an AudioWorklet. Combine it with WebSockets and you can create live transcription... which kinda works. There are also some polyfills.
It’s not great that all of these services send the data off to a third party service. Privacy is important; and also these services cost money. This is why devices have ‘wake words’ like “Alexa”, or “OK Google”.
So we need to build our own wake word. TensorFlow.js to the rescue! You can set up speech commands using pre-trained models, which translates to an in-browser wake word. (Demo of waking up a service with the name ‘baxter’).
Thinking about Conversation Design: the one piece that’s really important – speak your bot conversations out loud with someone else. Someone who doesn’t know what the responses should be. They’ll expose the cases you haven’t thought of.
While the technical journey is interesting, what’s more interesting is the potential for the web platform to take over from mystery boxes like Alexa. The web is about experimentation and freedom.
People have been able to build a proof of concept that adds sign language detection and speech-to-text display for Alexa. Gesture-based interaction can be very natural, and we can do it with open technology.
Phil is continuing with the original idea to build a web assistant. Feel free to join in the project. This is just the start of the journey.
(upbeat music) (crowd applauding) – Uh, good afternoon.
Thank you very much for the introduction, John. As he said, my name is Phil Nash, I’m a developer evangelist at Twilio. If you don’t know what Twilio is, we are a communications platform for everything from SMS to email to contact centres and everything in between that you can build with. I’m gonna be out there at the stand I’ve got there. All the rest of the conference I’ve been panicking pre-talk, so that’s why it looked empty up until now. But I’ll be much more calm for the rest of this after I’m done.
But as we said, we’re here to talk about what we can do in conversation with the browser. And bots have been a bit of a hot topic I think for a while now, which is kind of exciting. But also I find kind of amusing, because they’ve been around since forever.
(Phil laughs) But, um, Eliza has been around forever.
It’s replying instantly because it knows, just out of a random bunch of things, that it’s gonna ship something back to that text box straight away. Meanwhile, technology moved on.
And I think the first real smart bot in the world lived on AOL Instant Messenger, AIM.
It was a bot called SmarterChild.
And you could use it to ask for the weather or stock information. Or you could try and talk to it about dogs and it might be interested in other things. This is the top image if you search for SmarterChild on Google, and apparently everyone’s very convinced it just wanted to talk about bots the whole time. (Phil and audience laughs) But it didn’t; it had these other uses.
It could do smart things.
The only problem with it was that literally every response had to be programmed in by somebody.
And it had quite a lot of responses. It actually had up to, I think, 8 million users at one point, which is quite impressive.
For what effectively was, back then, a whole bunch of statements, it gained a big following.
It lived on, I think, for a little while on MSN Messenger. And I never used AIM.
I tried to sign up for it one time and I couldn’t get my username, and then 10 more times I still couldn’t get a username, and I gave up because I assumed they’d all gone at that point. But I looked at this username, When Angels Dead Fall. (laughs) That’s such an early-2000s username. It could only be more so if it was on LiveJournal.
(audience laugh) (Phil laughs) ’Cause some people were on the internet back then. Sweet. But these bots then continued to evolve in their different channels. IRC has always had bots. And since Twilio launched its SMS API, people have been building bots on that kind of thing forever.
And then Slack invented bots.
(Phil and audience laughs) Someone’s felt that before. But, you know, we build these all over the place.
And it’s this increase in technology that makes them smarter; no longer do we have to programme in every single answer. And ultimately, it’s kinda led to these devices living in our homes and listening to us the entire time. Actually, I really love my Echos. My house is surrounded by them. It is entirely possible that if you were near my house, you’d hear me shout, “Alexa, play my shower jams.” And then 30 minutes later, I feel refreshed. (Phil laughs) But this, yeah, increase in technology has brought us to the point where they are understanding our voice.
And then they’re reading and dealing with that kind of thing.
And I brought these ones unplugged today because I don’t want to talk to them, and I don’t want them listening, because as a web developer, I want to replace them with the web.
The web is always in a way playing catch up to native environments.
I think that’s a good thing, because native environments are allowed to explore things and then throw them away again and disappoint a bunch of developers that build on top of them, and the web has to really be sure about stuff. And then when it implements it, it can never remove it, because that will make everybody angry.
We have some web APIs.
This is pretty exciting. It’s a good start. There’s the Web Speech API. Jason and Neil talked about this last year and will speak about something else tomorrow, I believe, so the Web Speech API was well covered by that, but I want to go back just to make sure that we know what it can do. So we can start from speech synthesis, just in case the sound’s not on.
Going to make sounds now.
So this is in Chrome right now.
– [System sound] Hello, web directions.
(audience laughs) – [Phil] Come on, Chrome. It’s okay.
(Phil and audience laughs) – [System sound] S-H-H-H (audience laughs) – [Phil] Disappointing.
(Phil and audience laughs) And then this is Firefox’s version of it. This is just the default voice.
– [System sound] Hello web directions.
– So both of those use the same code, which is almost simply this.
You create yourself a new SpeechSynthesisUtterance with the text that you want to say and then use the Speech API to speak that utterance. You can choose the voices that are available. I was surprised to find out that there were different defaults there, because they both use the operating system underneath eventually. Although I think Google does have a couple of extra ones, probably from their APIs abroad, because they have some extra languages.
But they mostly use the operating system APIs for this and you can get a list of available voices and choose the voice.
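As a rough illustration of the synthesis code described here, the following is a minimal sketch rather than the code from the slides. `SpeechSynthesisUtterance`, `speechSynthesis.getVoices()` and `speechSynthesis.speak()` are the real browser APIs; the helper names `pickVoice` and `sayAloud`, the `VoiceLike` shape and the language-prefix matching rule are assumptions made for the example.

```typescript
// Minimal shape of a browser voice; mirrors the `name` and `lang`
// fields of SpeechSynthesisVoice (hypothetical helper type).
interface VoiceLike {
  name: string;
  lang: string;
}

// Pick the first voice whose language tag starts with the wanted prefix
// (for example "en-GB"). Returns undefined when nothing matches, so the
// caller can fall back to the browser's default voice.
function pickVoice<T extends VoiceLike>(voices: T[], langPrefix: string): T | undefined {
  const wanted = langPrefix.toLowerCase();
  return voices.find((v) => v.lang.toLowerCase().startsWith(wanted));
}

// Browser-only wiring: speak `text`, using a matching voice if one exists.
function sayAloud(text: string, langPrefix: string = "en"): void {
  const w = globalThis as any; // keeps the sketch independent of DOM typings
  const utterance = new w.SpeechSynthesisUtterance(text);
  const voice = pickVoice(w.speechSynthesis.getVoices() as VoiceLike[], langPrefix);
  if (voice) {
    utterance.voice = voice;
  }
  w.speechSynthesis.speak(utterance);
}
```

In a browser, `sayAloud("Hello, web directions")` would speak with the first English voice found. Note that `getVoices()` can return an empty list until the `voiceschanged` event has fired, in which case this sketch simply uses the default voice.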
I felt like this one actually just sounded more like a butler, which is quite good, actually, I suppose, for an assistant.
– [System sound] Hello. (Phil laughs) – Moving on. Of course, speech synthesis, sorry, is actually quite well supported, which is quite exciting. All those greens, look at them. Sorry, Opera Mini, I wasn’t expecting that anyway. Obviously not Internet Explorer, but Edge, Firefox, Chrome, Safari, Opera, iOS Safari and Samsung Internet can all kind of use that. It’s pretty well supported: 88%, that says, on Can I Use from earlier today. The other side of the Web Speech API, of course, is speech recognition. Now this is a big test.
Because I haven’t practised this with a microphone and a crowd of people. So: hello, web directions.
“Give web directions.” It’s okay, so, you know, it’s all right. Let’s try again.
Hello, web directions.
Yeah, there we go, cool.
So that’s cool.
But if we were to try to do that in Firefox, it would fail. It would fail because Firefox currently has the Speech API, the speech recognition API hidden behind a flag. Of course if you turn that flag on, it will also fail. (Phil and audience laughs) Because it doesn’t work.
This is what it looks like though.
And actually, you might notice it’s still vendor prefixed. You remember those. It hasn’t got out of that yet, so it’s vendor prefixed. But to do speech recognition in Chrome, you just create a new webkitSpeechRecognition, add an event listener to it for the result event, and start it.
You can also stop it there, or it will detect the end, as it did just now.
And then that result event comes with a bunch of results. In this case, we’re picking the first result, and that comes with a bunch of alternatives, things that it thought you might have said. You can check the confidence on those as well and get them displayed in your text area, or whatever you want to do with it, like sending off your actual requests to do something about what was just said.
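The recognition flow described above could look something like the sketch below. It is an illustration under assumptions, not the code shown on stage: `webkitSpeechRecognition`, `maxAlternatives`, the `result` event and the `transcript` and `confidence` fields are real browser APIs, while `bestAlternative`, `listenOnce` and the choice to pick the most confident alternative are hypothetical details added for the example.

```typescript
// The two fields of a SpeechRecognitionAlternative that this sketch uses.
interface AlternativeLike {
  transcript: string;
  confidence: number;
}

// Return the alternative with the highest confidence, or undefined
// when the list is empty. Works on arrays and on array-like results.
function bestAlternative(alternatives: ArrayLike<AlternativeLike>): AlternativeLike | undefined {
  let best: AlternativeLike | undefined;
  for (let i = 0; i < alternatives.length; i++) {
    const candidate = alternatives[i];
    if (!best || candidate.confidence > best.confidence) {
      best = candidate;
    }
  }
  return best;
}

// Browser-only wiring: listen for one phrase and report the best guess.
function listenOnce(onText: (text: string, confidence: number) => void): void {
  const w = globalThis as any; // keeps the sketch independent of DOM typings
  // Chrome exposes the constructor with a vendor prefix.
  const Recognition = w.SpeechRecognition || w.webkitSpeechRecognition;
  if (!Recognition) {
    throw new Error("Speech recognition is not available in this browser");
  }
  const recognition = new Recognition();
  recognition.maxAlternatives = 3; // ask for a few guesses, not just one
  recognition.addEventListener("result", (event: any) => {
    // The first result holds the alternatives for what was just said.
    const best = bestAlternative(event.results[0]);
    if (best) {
      onText(best.transcript, best.confidence);
    }
  });
  recognition.start(); // stops by itself when it detects the end of speech
}
```

A caller might pass a callback that writes the transcript into a text area and ignores anything below a chosen confidence level.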
The support is horrific right now.
Edge 75 there, which of course is just Chrome. So big jump as well, 18 to 75.
But they’ll get there eventually I’m sure.
I mean, it’s surprising that even when you have this kind of muddy green, which is partial support, just in those versions of Chrome, it still equals 67.8% of all web users.
So it’s not out of the question to put speech recognition into your software and expect that actually the majority of people on the web will be able to use it. That’s quite cool.
But speech recognition in Chrome sends all of that speech to the Google Cloud Speech API. It does it for free, which is nice if you’ve used it from the server and paid for it: you just put it in a browser. It looks great. No. (Phil laughs) But it does send all that data to the Google Cloud Speech API, which you might be scared about, I don’t know.
There’s a lot of privacy and a lot of ethical kind of things to think about when we’re building an assistant. Again, it’s why they’re unplugged right now. I don’t want them listening to this.
So what else can we do? How else can we capture this audio and turn it into text so that we can use it? Now I know Jess is going to talk about this in a bit. But the MediaRecorder API sprung out to me, purely because it exists, and it is a way to easily record audio or video, in this case just audio, in the browser. And so I can get my microphone and say: hello, it’s good to be here on stage.
(Phil laughs) Put it straight into an audio tag in this case. – [System sound] Say hello, it’s.
– Weird, listening to myself speak while speaking. (Phil and audience laughs) And that’s cool.
And now we have a WebM file.
(Phil laughs) That’s one, one GF a WebM, cool.
It’s not all bad.
You know, you might want other formats.
And this is where we get to talking about WebAssembly again.
And I have seen that somebody tried to compile FFmpeg into WebAssembly, and I did not touch it, just in case I failed horribly.
So we can do that with the MediaRecorder API.
Again, you’re gonna see more of this, but it’s very straightforward to use, actually. For the most part, you use getUserMedia to get your stream of data.
Give that to a MediaRecorder, tell it you’re gonna make audio/webm, which as far as I’ve worked out so far is the only format we can actually make with it, and get yourself an array of chunks.
And then when data is available from that recorder, push that data into the array of chunks.
At the end of the day, take those chunks and push them into a Blob. Done. We have our audio WebM file, and we can put it into the audio player, like I did, or send it to the server.
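The recording steps just listed can be sketched as follows. This is a hedged illustration, not the talk's code: `getUserMedia`, `MediaRecorder`, its `dataavailable` and `stop` events, `MediaRecorder.isTypeSupported` and `Blob` are real browser APIs, but `pickMimeType`, `recordAudio`, the fixed recording duration and the candidate format list are assumptions for the example.

```typescript
// Return the first MIME type the recorder claims to support.
// `isSupported` is injected so the choice can be checked outside a browser;
// in a browser it would be MediaRecorder.isTypeSupported.
function pickMimeType(
  candidates: string[],
  isSupported: (mimeType: string) => boolean
): string | undefined {
  return candidates.find((mimeType) => isSupported(mimeType));
}

// Browser-only wiring: record the microphone for `durationMs` milliseconds
// and resolve with a Blob holding the recording.
async function recordAudio(durationMs: number): Promise<any> {
  const w = globalThis as any; // keeps the sketch independent of DOM typings
  const stream = await w.navigator.mediaDevices.getUserMedia({ audio: true });
  // Assumption: try audio/webm first, as in the talk, then a fallback.
  const mimeType = pickMimeType(["audio/webm", "audio/mp4"], (t) =>
    w.MediaRecorder.isTypeSupported(t)
  );
  const recorder = new w.MediaRecorder(stream, mimeType ? { mimeType } : undefined);
  const chunks: any[] = [];

  // Each dataavailable event carries one chunk of encoded audio.
  recorder.addEventListener("dataavailable", (event: any) => {
    if (event.data.size > 0) {
      chunks.push(event.data);
    }
  });

  return new Promise<any>((resolve) => {
    recorder.addEventListener("stop", () => {
      // Release the microphone, then glue the chunks into one Blob.
      stream.getTracks().forEach((track: any) => track.stop());
      resolve(new w.Blob(chunks, { type: recorder.mimeType }));
    });
    recorder.start();
    setTimeout(() => recorder.stop(), durationMs);
  });
}
```

The resulting Blob can be played back with `URL.createObjectURL(blob)` as the `src` of an audio element, or uploaded to a server.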
The MediaRecorder API is actually pretty well supported. It surprised me to find out that it was first released in Firefox 29.
Guess the year? No, okay 2014.
Quite a long time ago.
But it’s only coming to Safari and Edge soon.
Again, Edge is 75, so it’s great.
I built a little example of this on web-recorder.glitch.me so if you wanna go play with that you can go remix, my web recorder on glitch, it also allows you to download your audio file and maybe eventually I’ll get it to translate it to MP3 or something like that.
Then what? We have audio in the browser, what do we do about that? We have a whole bunch of speech to text services available to us. As we were just told up here, there’s a bunch from Microsoft in the Azure Cognitive Services APIs, I believe, as well as our friend the Google Cloud Speech API, and IBM Watson. I was pretty excited by Watson, actually, because it accepts WebM as the audio file. Google’s service will not, and I didn’t look at Azure to know if it supports WebM. No idea.
(Phil laughs) No it’s fine. So yes, you can send your audio, your WebM, straight onto Watson if you want and it’ll give you back some text for it.
Pretty excited about.
But I’ll move on from that.
Web video video is coming up.
Media recorder is coming up very soon.
Instead, I reached for the Web Audio API, because there was an interesting thing on the Can I Use page for the MediaRecorder API. It said that you can record streams from the user’s browser, rather than having to perform manual encoding operations on raw PCM data, etcetera. Not sure about the etcetera, but I know that with the Web Audio API you can produce raw PCM data. This is the raw audio bytes being recorded. And you can do all sorts of interesting stuff with that: you can do audio visualisations and make brilliant graphics inside the browser based on what it’s hearing.
You can also start to save those bits of data, or indeed send them off to another service. And so you can combine the AudioWorklet, which is almost like a Web Worker style thing but for the Web Audio API (it sits on the side, receives all the stuff, and doesn’t block anything), with a WebSocket. Then we go a bit further.
And there are actually a couple of services that make those WebSockets available even to the front end, and allow you to do that speech to text within the browser without having to record it all in one chunk and send it to a server.
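One piece of that streaming pipeline can be sketched without a browser. An AudioWorklet processor receives audio as blocks of 32-bit floats between -1 and 1; the sketch below assumes, purely for illustration, that the receiving service wants 16-bit signed PCM, which the talk does not specify. `floatTo16BitPcm`, `sendAudioBlock` and the `PcmSocketLike` shape are hypothetical names; a real `WebSocket` would satisfy that shape.

```typescript
// Anything with a WebSocket-style send method for binary audio.
interface PcmSocketLike {
  send(data: Int16Array): void;
}

// Convert Web Audio samples (floats in the range -1..1) into 16-bit
// signed integers. Out-of-range samples are clamped first.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  }
  return pcm;
}

// Forward one block of samples to the socket. In a browser this would be
// called on the main thread with each block the worklet posts over its port.
function sendAudioBlock(socket: PcmSocketLike, samples: Float32Array): void {
  socket.send(floatTo16BitPcm(samples));
}
```

The byte order of an `Int16Array` follows the platform, which is little-endian on almost all current devices; whether a given service accepts that, and at which sample rate, would need to be checked against its documentation.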
So this is the Watson example I think it’s funny to say anyway.
Hello web directions.
No, okay I guess you can have my right because remember that’s what I should have clicked, isn’t it? Hello web directions, What’s going on? Well yes okay L (audience laughs) Hello, it works it out.
(Phil laughs) It’s still going. All right, I pressed the button twice. (audience laughs) So that was using speech to text inside Firefox which, as I told you earlier, sorry, horribly fails if you’re trying to use the speech recognition web APIs.
So we can have it which is great.
And there are a few of these alternatives.
That’s the Watson Developer Cloud one.
And then there are two that have been built, speech polyfill and web speech cognitive services. They both try to polyfill the actual Speech API to work on any browser, and the speech polyfill particularly tries to fully polyfill it. And they both actually use Azure Cognitive Services in the back.
So that’s quite cool.
We can get that far.
So this is all great, I think it’s great that we can make, we can talk to our browser and have words appear on that screen.
And then we can spend the time working out what those words mean, and getting it to do actions and that’s a home assistant.
But you might have noticed every single one of these services is sending all of your microphone data to a third party. The Web Speech API is sending everything straight through to Google’s Cloud Speech API.
The polyfills and things here are sending it to a third party of your choice.
But I have two problems with this. One, privacy: it seems important that we’re not just piping everything off to people at all times, that we don’t just have an open microphone listening to see if we said something of interest. Secondly, I don’t have the kind of money to keep that service running full time.
Definitely not for more than one.
The Google one doesn’t cost, of course.
But if we want to do this cross platform, which we should, I can’t be just, yeah, opening up a third party connection. And this is why our friendly home assistants all come with their own wake words as well: why we’re expected to say “Alexa” before we say anything, why we’re expected to say “Hey Google”, or “Hey Siri” for Apple’s things.
Yeah, so we need to build ourselves some kind of wake word. And this brings me to machine learning.
I’m super glad I was straight up after SM, because I don’t have to do any explanation of what machine learning is anymore.
You’ve seen the neural networks, you know what to expect. And so this was a problem for me, because I’m not a machine learning expert.
However, I was excited to see that we have TensorFlow JS. And the TensorFlow JS comes with a whole bunch of pre-trained models, which is very exciting because I didn’t have to do any training or find the speech data from thousands of people all over the world saying different words. Because I didn’t have that, I didn’t have the time. So I can show you a demo of this.
TensorFlow JS, it’s just here.
This is a demo of TensorFlow working with speech. This is its pre-trained model.
It gives you a whole bunch of options, kind of directions and numbers.
It’s picking random things up as you can see right now, but if I tell it down, and if I tell it stop and if I tell it seven.
It’s picking those things up, and that’s really cool. Stop that one.
But it also gives you a chance to retrain this model. And this is the really important part, because it has a whole bunch of speech inside already a whole bunch of ideas about what this means. We can say some other words to it over and over again and use the existing model to then retrain just to hear for those words.
And so I have done that.
I had to load that so you can load the model which is quite cool.
And I decided my assistant was gonna be called Baxter. I think you need a distinct, distinguishable word. That’s why Alexa has that x in the middle, and Bixby similarly for Samsung, but also because I like the dog from Anchorman. It’s Baxter, it’s fine.
So if I say Baxter, it’s not gonna listen to you because I trained this model in my quiet office and not with a microphone. Baxter, there we go, it’s right there.
And so in the browser we can have a wake word, Baxter. Baxter. Yeah, there we go. (Phil and audience laughs) I just have to be closer.
(Phil laughs) We can have an in-browser wake word, and if I was to do more training I’d probably do it from further away from my microphone than when I was sitting next to it doing this. We can have that kind of wake word available for us too, and I’m gonna get this on AI JS.
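The wake word check itself can be sketched in a few lines. This is an illustration under assumptions rather than the demo's code: the recognizer is assumed to come from the `@tensorflow-models/speech-commands` package, whose `wordLabels()` and `listen()` methods and per-label `scores` are used here, and is passed in rather than created so the sketch stays self-contained. `heardWakeWord`, `waitForWakeWord` and both threshold values are hypothetical choices for the example.

```typescript
// Decide whether the wake word was heard. `scores` holds one probability
// per label, in the same order as `labels`. The wake word must be the
// top-scoring label and must reach the threshold.
function heardWakeWord(
  scores: ArrayLike<number>,
  labels: string[],
  wakeWord: string,
  threshold: number
): boolean {
  const index = labels.indexOf(wakeWord);
  if (index < 0) {
    return false; // the model was never trained on this word
  }
  for (let i = 0; i < scores.length; i++) {
    if (scores[i] > scores[index]) {
      return false; // some other label scored higher
    }
  }
  return scores[index] >= threshold;
}

// Browser-only wiring. `recognizer` is assumed to be a (re)trained
// speech-commands recognizer that knows the wake word.
async function waitForWakeWord(
  recognizer: any,
  wakeWord: string,
  onWake: () => void
): Promise<void> {
  const labels: string[] = recognizer.wordLabels();
  await recognizer.listen(
    async (result: any) => {
      // Assumption: 0.9 is an arbitrary confidence cut-off for this sketch.
      if (heardWakeWord(result.scores, labels, wakeWord, 0.9)) {
        onWake();
      }
    },
    { probabilityThreshold: 0.75 }
  );
}
```

Only once `onWake` fires would the page open a connection to a speech to text service, which keeps the microphone audio in the browser until the wake word has been said.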
(Phil laughs) We can have that wake word, which would be really cool. Now, you might have noticed that those are a few disparate demos, and I haven’t quite fitted it all together yet. A lot of that problem is because I was very new to TensorFlow JS when I came to this, and I haven’t managed to export that model out of that page yet, or train a new one elsewhere.
We’ll get to that.
I will get to that because this is an ongoing project. In the meantime (Phil laughs) In the meantime, I wanted to touch on one little bit of conversation design, which I learned recently with a couple of bots with a bot that I made or I was running over last weekend. We, I was outside JS which was fantastic.
So this is the second time I’ve seen this in this talk. (Phil laughs) And we had a T shirt bot going there.
People could order their own T shirts, and pick the size, colour, design and colour of the design and then that will be made for them.
And that bot was cool.
It mostly worked, until I talked to some people and they were like, “Oh, no, this bit didn’t work.” Never talk to your users; they always have problems. The one piece of conversation design that’s just super important is: always speak your bot conversations out loud with someone else.
Go through all those potential flows that it could do and speak them out loud, and speak them with somebody who isn’t supposed to know the ultimate response to this.
Because if you do that, you’ll find out all the things they could say to you, not just the things you’d like them to say to you. And that is basically the most important thing. And building your own personality into a bot is also important.
But just have those conversations out loud with other people, to see how they would interact with the voice interface. Because these voice interfaces have almost no discoverability.
And neither do the text interfaces. We have to discover how to use them, unless you’re still getting that “what to do this week with Alexa” email, which is just which jokes it can tell this week and what’s on in sports, I think.
Discovery of these things is important.
And so being able to discover how that’s gonna come to you is also important. So what do we do with this? I do have a bunch of little demos here, and if we did collect them up into a neat package (which, no, I don’t have yet),
we could have ourselves a browser based assistant. See, when I embarked upon this experiment, trying out these web APIs and how we could put them together in order to make this, I thought it was gonna be an interesting technical journey. And I’ve shown some code, and I thought that was good. It’s nice to know that sometimes it’s only a few lines to do something, or sometimes there’s a pre-trained model you can do it with.
But ultimately, it actually occurred to me that this is more about the web platform and our ability as developers to explore.
You see the problem with these little boxes is that’s exactly what they are.
This one is quite literally a little black box. I know there are seven directional microphones inside, but apart from that I don’t know what’s inside, and I can’t affect that. I can deploy bots to the Alexa platform, but then I’m living in Alexa’s world. Same for Google’s, same for whatever.
But that’s not what the web platform’s about. The web platform is about experimentation, about freedom to build whatever you want to, and discover how much better that can be.
We can take that experimentation that kind of thing and see what we do with it.
I think this is a wonderful example of that. This is a developer from New York who is not deaf, but thought that these voice interfaces could be better for people who are deaf. And so they built this gesture recognition in the browser, which then uses the speech synthesis API to talk to an Alexa, and then uses the speech recognition API, so it’s in Chrome, to listen to Alexa and put that back on screen to make a visual representation of what this conversational assistant is saying. I think that’s absolutely amazing.
That’s not something you could have built into that Echo device that’s up there; you’d have to have done something else.
So if we have our own bot, our own assistant to play with, we can experiment with these things. And I don’t know if they’re connected.
But I was fortunate enough to be at Google I/O earlier this year, and I swear one of the biggest gasps in the keynote, when they were announcing the Google Home hub, was when they said you can put your hand up and tell it to stop playing music without actually speaking to it. And so that’s a gesture based experience with a home assistant, which is something you’d do to a real person.
This is the kind of thing that we can experiment with, if we have our own bot, and then perhaps influence the future of the bots that are going to invade our house anyway.
I don’t think we can escape from the fact that I have five Alexas around my house, it’s terrifying but they’re there now.
But maybe I can influence that by making experiments and building something I want to see in the world and the way I want bots to work.
So I didn’t intend this necessarily as some kind of announcement, but I am continuing with this work to build a web assistant. And if you fancy joining me on this journey, there is a very empty GitHub repo right now,
Phil Nash slash web assistant.
It has some MediaRecorder stuff in it right now. And it’s gonna get more. I’m gonna figure out this TensorFlow JS thing.
We’re gonna get there and I’m gonna see a web assistant because it’s entirely possible.
And I’m very excited to see what as a community as a web community, and with the web platform we can build with these abilities.
So this is just the start of that journey.
I hope you’ll come with me on it.
And after that, I’d just like to leave you with the slightly more wise words of SmarterChild. When asked “Do you sleep?”, SmarterChild replies “No, but I dream.”
Do you have a better world? Well, when man and machine can coexist in peace and happiness.
(audience laughs) And with you that’s not a child.
Can we do that? That’s all I’ve got for you.
Thank you so much, my name is Phil Nash (audience applauding) I’ll be out there all day. (audience applauding)