Conversational interfaces in the browser

Hello.

We, that is to say humans, have been trying to talk to things other than ourselves for a remarkably long time.

As far back as the 1700s, we were working on mechanical devices designed to mimic the human anatomy: our lungs, vocal tract, vocal cords and such.

And we kept up with this analog experimentation until about midway through the 20th century.

When the clever people at Bell Laboratories gave us this.

And with it, of course, they gave Stanley Kubrick much to be thankful for.

An interesting thing about conversational interfaces is that they are inextricably linked to text. Speech recognition captures what we say and converts it into text for further processing.

In a reverse of that, when we create synthetic speech, we supply text to a text-to-speech engine, which then duly converts it into the synthetic speech output that we want.

One of the most basic forms of TTS is known as formant TTS.

It's based on a set of rules that let us manipulate very simple characteristics of human speech, like frequency (pitch) and amplitude (volume).

The results are intelligible, but they do sound quite robotic.

"For millions of years, humans lived just like the animals.

Then something happened that unleashed the power of our imagination.

We learned to talk."

The limitations of formant TTS mean that, to all intents and purposes, there is only one voice; we can change it within certain constraints, but really not very much.

For example, the most common way that a female-gendered voice is created is simply by doubling the pitch of a male-sounding one.

"For millions of years, humans live just like the animals.

Then something happened that unleash the power of our imagination.

We learned to talk." Then along came concatenative TTS, which was intended to overcome some of the problems with formant TTS.

And to do this, it starts off with many hours of prerecorded human speech.

The recordings are then broken down into tiny segments, often as small as individual phonemes or even phones. When synthetic speech is created, it's done by resequencing those tiny segments until they form the words and sentences that we want.

The results are something of an improvement over formant TTS, if not by much.

"For millions of years, humans live just like the animals.

Then something happened that unleashed the power of our imagination.

We learned to talk." One good thing about concatenative TTS is that because it's based on recordings of real people, it becomes more possible to have voices that have different characters, different genders even some aspect of age and accent.

"For millions of years, humans lived just like the animals.

Then something happened that unleashed the power of our imagination.

We learned to talk." Concatenative TTS is not without its problems, though.

It takes many, many hours of recorded speech to create a concatenative TTS voice.

And it simply isn't possible to record enough hours of original data for the synthetic speech to be able to be as expressive and as nuanced as real human speech is.

The other problem is that it's not terribly performant.

It takes time to search a database for all the tiny little segments of sound that are needed, and time to resequence them.

And that means that although concatenative TTS sounds a bit more human than formant TTS, it definitely isn't as responsive.

I'm going to take a brief moment here to talk about a subject that's important to text to speech and synthetic speech output.

In the two examples of TTS engines I've mentioned so far, I've already talked about gender.

But there's a bit of a problem.

Most, if not all, of the text-to-speech engines out there at the moment assume binary genders: there are male voices and there are female voices, and that's pretty much it.

Obviously that's not truly representative of the different types of gender identity that are out there.

So I would like now to introduce you to Q.

"Hi, I'm Q, the world's first genderless voice assistant.

I'm created for a future where we are no longer defined by a gender, but rather how we define ourselves.

My voice was recorded by people who neither identify as male, nor female, and then altered to sound gender neutral, putting my voice between 145 and 175 Hertz.

But for me to become a third option for voice assistants, I need your help.

Share my voice with Apple, Amazon, Google, and Microsoft, and together we can ensure that technology recognizes us all.

Thanks for listening.

Q." Meanwhile, back on text to speech engines, parametric TTS was intended to solve the shortcomings of both formant TTS and concatenative TTS, or to put it another way to utilize the best of both of them.

Like concatenative TTS, parametric TTS is based on recorded human speech, but instead of breaking it down for resequencing, it's converted into a set of rules or parameters that can be used to model that voice again.

And they're processed by something known as a vocoder.

There is a distinct improvement in how human the speech produced in this way sounds.

"For millions of years, humans live just like the animals.

Then something happened that unleashed the power of our imagination.

We learned to talk".

And with it, we get a much richer ability to express different voices.

Again, genders and ages and accents.

"For millions of years, humans lived just like the animals.

Then something happened that unleashed the power of our imagination.

We learned to talk." But there is still an oddly, flat quality to the speech produced by parametric TTS, that makes it still very clear that you're listening to something synthetic.

If you're building conversational interfaces in the browser, most likely it will be concatenative or parametric voices that you have at your disposal.

But more on that later.

The answer to the still slightly unnatural sound of parametric TTS, comes in the form of neural text-to-speech engines.

These are essentially based on the same model as parametric TTS, but there's one big difference.

They are supplied with truly vast amounts of data.

If you take Google's WaveNet neural TTS as an example, it is trained with data taken from the voices of people who've used Google's voice services.

If you opt out, this won't be the case, but if you've ever used one of Google's voice services, like voice search for example, there's every possibility that your voice is one of the many that has gone into training its WaveNet neural TTS.

And the difference is remarkable.

The quality of speech is so much more human sounding than any other form of TTS.

"For millions of years, humans lived just like the animals.

Then something happened that unleashed the power of our imagination.

We learned to talk" and even with voices of the same gender and approximately the same age, the ability to create subtle distinctions, like the difference between two English language accents is also really quite astonishing.

" For millions of years, humans lived just like the animals.

Then something happened that unleashed the power of our imagination.

We learn to talk." And if you didn't know better, you'd often be hard pressed to know whether that was a real human or synthetic speech and that's progress I think.

What of conversational interfaces in the browser?

Well, apart from a brief moment of excitement back in 1997, when Microsoft introduced MS Agent and momentarily made it possible to embed synthetic speech output in Internet Explorer, there wasn't much to report until about a decade or so ago.

Google proposed an API that they then implemented and, along with other browser engines, worked on what is now known as the Web Speech API.

It was produced as a draft W3C Community Group report back in 2012, and moved into the Web Incubator Community Group in 2017.

And there has been periodic activity in moving it forward since then, most recently, last year in 2020.

The Web Speech API has two interfaces: speech recognition, for capturing speech input, in other words letting you talk to your web application; and speech synthesis, for producing speech output, or having your web application say something back to you.

The component parts of a conversation.

We'll start with the speech recognition interface.

We'll begin by creating a speech recognition object.

The line of code on screen shows that we are actually creating a webkitSpeechRecognition object.

And you might well wonder why.

The answer is that the original intent was that each of the browser engines would experiment with their own implementation.

And that the best of the ideas would be drawn from each of the experiments to standardize what will then become the universally adopted feature.

Reality never quite works out the way we think it's going to though, and what has actually emerged is that webkitSpeechRecognition is essentially the only implementation we've got to play with.

On the one hand it's a good implementation; but on the other, of course, it means that we're restricted to using this interface only in browsers that support the webkit-prefixed implementation.
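In practice that means it's worth feature detecting before relying on it. A minimal sketch, not from the slides, assuming you simply want to fall back gracefully when neither the prefixed nor the unprefixed constructor exists:

const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (Recognition) {
  const recog = new Recognition();
  recog.lang = "en-GB"; // optional; the page language is typically used if unset
} else {
  console.warn("Speech recognition is not supported in this browser");
}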

Let's go on and take a look at what else it contains.

There are 11 events in the Web Speech API SpeechRecognition interface.

Some of the most commonly used include audiostart and audioend, when audio capture starts and ends, and soundstart and soundend, when the service first becomes aware that sound is being produced and when it ends.

And then simply start and end, when the service begins capturing that sound for the purposes of speech recognition, and equally when it stops doing that.

There are also events for result, when the results of speech recognition have been captured, and error, for the various kinds of error handling, of course.
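As a rough illustration, and assuming the recog object created earlier, handlers for those events might be wired up like this; the handler names simply mirror the event names:

recog.onaudiostart = () => console.log("Audio capture has started");
recog.onsoundstart = () => console.log("Sound has been detected");
recog.onstart = () => console.log("Recognition has started");
recog.onend = () => console.log("Recognition has ended");
recog.onerror = (event) => console.error("Recognition error:", event.error);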

There are three methods in the Web Speech API SpeechRecognition interface: start, stop, and abort.

If you're wondering what the difference between the last two is, it's simply that stop means speech recognition will stop but data processing will continue, while abort is the nuclear option.

It brings everything to a grinding halt.
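A hedged sketch of how those methods might be used, again assuming the recog object from earlier and a hypothetical button to start things off:

// Start listening when a (hypothetical) button is activated.
document.querySelector("#recog").addEventListener("click", () => recog.start());

// stop(): stop listening, but still process and return what was captured so far.
recog.onspeechend = () => recog.stop();

// abort(): stop listening and discard everything; no result event will fire.
// recog.abort();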

Permissions are something to give careful thought to when you're using the speech recognition interface.

Necessarily when you're capturing speech, you need to use the microphone.

When you do, browsers will automatically pop up a permissions dialogue; but if someone doesn't accept the use of their microphone, or misses it, or inadvertently hits the wrong button, it's a good idea to build in some error handling, just as a belt and braces approach.

So listening out for a not-allowed error event and producing a suitable message to let the user know why speech recognition isn't working is just a good touch from a UX point of view.
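In browsers that support the Permissions API for the microphone, you can also check the permission state up front and tailor the UI accordingly. This is an assumption on my part rather than something from the slides, and the output element is borrowed from the demo code:

async function checkMicPermission() {
  try {
    const status = await navigator.permissions.query({ name: "microphone" });
    if (status.state === "denied") {
      output.innerText = "Please grant permission to use your microphone";
    }
  } catch (err) {
    // The "microphone" permission name isn't recognised everywhere;
    // fall back to the onerror handling described above.
  }
}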

Transcript is the thing that's created when results are captured.

It's just an array of all the words that were spoken during the speech recognition phase, and as an array it means, of course, that we can get to any or all of the content inside it and utilize it for any number of different reasons.
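A slightly expanded, hedged version of the demo's onresult handler, looping through the results and logging the confidence score alongside each transcript:

recog.onresult = (event) => {
  for (let i = 0; i < event.results.length; i++) {
    const alternative = event.results[i][0]; // the most likely alternative
    console.log(alternative.transcript, alternative.confidence);
  }
  output.innerText = event.results[0][0].transcript;
};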

The most likely being to reprint it on screen as this demo shows.

" Open the pod bay doors Hal." So what of the speech synthesis interface?

Well, here things definitely look up in terms of support.

The basic features of the speech synthesis interface are supported by all of the browser engines.

So there's a really good base for using this particular part of the specification.

And we begin with the simplest of actions.

Creating a speech synthesis utterance and giving it something to say: it's as simple as creating the object and using the text property to supply the text that we want to be spoken.
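Putting that together, a minimal sketch; the two lines are from the slide, and the call to speak() is the extra step that actually queues the utterance:

var utterance = new SpeechSynthesisUtterance();
utterance.text = "This must be Thursday. I never could get the hang of Thursdays.";
window.speechSynthesis.speak(utterance);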

The example on screen would produce speech, something like this [Synthesized voice saying "This must be Thursday.

I never could get the hang of Thursdays"] With the Web Speech API, we can also manipulate the quality of the speech to a certain extent.

We can change the pitch.

We can make the voice sound higher or lower than the default.

We can also change the rate, the speed at which the voice speaks, again making it faster or slower than the default.

And similarly, we can change the volume.

We can make it louder or softer than the default.

Something that's worth remembering about all of these settings is that they are relative to the user's default.

It isn't possible to suddenly crank up the volume of synthetic speech output so that it would be distressingly loud or uncomfortable for the users.

There are limitations put in place on all of these things from a UX point of view, as much as anything.
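For reference, pitch runs from 0 to 2, rate from 0.1 to 10, and volume from 0 to 1, each with a default of 1; here is a hedged sketch (the utterance name is mine) of nudging all three at once:

var utterance5 = new SpeechSynthesisUtterance("One tequila");
utterance5.pitch = 1.5;  // a little higher than the default
utterance5.rate = 0.8;   // a little slower than the default
utterance5.volume = 0.7; // a little quieter than the default
window.speechSynthesis.speak(utterance5);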

And we can hear the differences for all three of these kinds of voice configuration in the following example.

[synthetic voice speaks] "One tequila, two tequila, three tequila, floor".

The speech synthesis interface works on the basis of a queue.

In that previous example, what we did was send four different speech objects into the queue. The slide on screen now has a simplified representation of that, without any of the changes to pitch, rate, and volume.

And as things are sent to the queue, that's the order they're spoken in. [synthetic voice speaks] "One tequila, two tequila, three tequila, floor."

You might reasonably think that once a speech object has been sent to the queue it is immutable, that there's nothing you can do to change its characteristics.

And you'd be wrong about that I'm afraid.

It turns out that in pretty much every browser, if you send an object through the queue and then change its characteristics, those changes get acknowledged. [synthetic voice speaks] "One tequila, two tequila".

My general recommendation is that once you've sent a speech object to the queue, don't muck around with it.

Yes, you can.

It doesn't mean to say that you should, and generally speaking, you'll keep yourself out of trouble by sticking to that really simple rule.
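While we're on the subject of the queue, the SpeechSynthesis object also offers some queue-level controls, shown here as a brief aside:

window.speechSynthesis.pause();  // pause the utterance currently being spoken
window.speechSynthesis.resume(); // carry on from where it was paused
window.speechSynthesis.cancel(); // stop speaking and empty the whole queue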

You can also change the voice with the speech synthesis interface.

On the screen at the moment, we've got an example that shows the voice that will be assigned is Hazel.

It's a UK voice provided by Microsoft.

[female sounding synthesized voice says] "Alice had begun to think that very few things indeed were really impossible." Voice selection is one of the areas where the speech synthesis interface gets a little bit complicated.

There is a method getVoices that will return an array of all of the voices that happen to be available on the platform or user agent.

And that's exactly where the problems begin.

On Windows, for example, Firefox will return three or four voices.

Edge a few more, perhaps 10 or 11.

Chrome, a few more again, 14 or 15 or so.

And the only voices that they have in common are the three or four that are default to Windows as a platform, which is great.

You could therefore choose one of the Windows default voices as your safe bet.

Except of course users aren't all using Windows.

They're using Mac OS, iOS, Android, and such.

So choosing a voice that is available on the given platform and the given user agent takes a little bit of logistical thinking and planning in your code.
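A hedged sketch of that kind of planning; the voice names in the preferred list are illustrative only, and because getVoices() can return an empty array until the voiceschanged event has fired in some browsers, we listen for that too:

function pickVoice() {
  var voices = window.speechSynthesis.getVoices();
  var preferred = [
    "Microsoft Hazel desktop - English (Great Britain)", // Windows
    "Google UK English Female",                          // Chrome
    "Daniel"                                             // macOS / iOS
  ];
  for (var name of preferred) {
    var match = voices.find((voice) => voice.name === name);
    if (match) return match;
  }
  // Otherwise fall back to any en-GB voice, or the platform default.
  return voices.find((voice) => voice.lang === "en-GB") || voices[0];
}

window.speechSynthesis.onvoiceschanged = () => {
  utterance.voice = pickVoice();
};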

We've talked a bit about choosing a voice and then even configuring it slightly for pitch and rate and volume, but really what about some proper expression in the voices?

It's really important to good conversational design.

The Web Speech API says that it supports SSML, the Speech Synthesis Markup Language, something that first became a W3C Recommendation in 2004 and was updated in 2010.

SSML is incredibly powerful.

It's really well supported by different home assistant platforms.

And with just a few SSML elements, like prosody, voice, and lang, we can change Alexa from sounding like this.

[a largely affectless female synthesized voice says] "Hello, my name is Inigo Montoya.

You killed my father.

Prepare to die" to something far more enjoyable like this.

[A Spanish accented synthesized male voice says] "Hello, my name is Inigo Montoya.

You killed my father." Unfortunately, browsers don't support SSML in any meaningful sense.

Happily, the Web Speech API specification deals with this by saying that if the user agent doesn't support SSML, the elements should just be stripped out from the string supplied to the text property, and the text itself should be processed as usual.

Sadly, with the exception of Edge, every other browser does this: [synthesized female voice says] "XML version equals 1.0, speak version equals 1.0, XMLNS equals H..." And before you get too excited about Edge not doing that, all Edge does is strip out the SSML and proceed with the text as though the SSML weren't present.
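If your content pipeline does produce SSML today, one hedged workaround of my own, not from the talk, is to strip the markup yourself and hand the plain text to the utterance; DOMParser treats the SSML as generic XML, so textContent returns just the words:

function ssmlToPlainText(ssml) {
  // Assumes the SSML is well-formed XML.
  var doc = new DOMParser().parseFromString(ssml, "application/xml");
  return doc.documentElement.textContent.trim();
}

utterance.text = ssmlToPlainText(ssmlString); // ssmlString is a hypothetical SSML string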

So SSML doesn't give us a way to choose the expressiveness or design the way the voice sounds for different types of content.

Another possibility might be something called Emotion Markup Language, or EmotionML.

It's a recommendation that was produced by the W3C in 2014.

And it's designed to mark up content to indicate that synthetic speech should render it in a way that is particularly emotive.

For example, something that sounds a little bit surprised or disappointed or happy.

This is an example of the IBM Watson voice being very emotive indeed.

[female synthesized voice says] "Wow.

I was getting tired of not being able to express myself.

The future looks really exciting." Emotion ML, unfortunately is only supported by text to speech engines.

And as far as I'm aware, none of those that are available on platforms like Windows, Mac OS and such.

But how wonderful would it be as authors if, when we're producing content for consumption on the web, we could mark it up to say: if synthetic speech is responsible for the output or rendering of this content, then make it sound happy or disgusted or surprised, or whatever the emotion may be.

But that's one for the future I'm afraid.

A really interesting thing about the lack of ability to style voice output in browsers is that it's in browsers that we find one of the most compelling use cases for doing so.

Lots of browsers now have reader mode.

And as part of that, the option to listen to the content of the page instead of reading it.

For example, this is a recent post on my blog.

[synthesized female voice says] "I've been thinking about conversational interfaces for some time and about the importance of voice quality as a part of user experience.

I use a screen reader and it sounds the same, whatever I'm doing".

That was Firefox reading that, and like other browsers, Firefox gives me as a user the ability to change the voice that's being used, and the volume and the rate it speaks at.

But from an authoring point of view, there's nothing I can do to have any sway over how that content is read aloud.

If I were to ask you to look at this partial screenshot of the same page on my website and ask you what's wrong with it, you would probably say it lacks styling, and you'd be right.

I've disabled style sheets and what's left behind is just the basic structure of the content.

We have ways for styling visual content in the browser.

But what we don't have right now is a way to style sound or speech output from the same browser.

The good news is that the answer exists in the form of the CSS Speech module.

The bad news, and it will come as almost no surprise to you, is that nothing supports it yet.

It is a Candidate Recommendation at the W3C and has been that way for a year or so now, but its history goes back quite a lot further than that.

In CSS 2, an aural media type was introduced.

It was then replaced in CSS 2.1 by the speech media type, which still exists to this day, even though it has no support.

At the same time, a number of properties were introduced, and those properties today exist as the CSS Speech module. As we take a look at them, you'll start to get a sense of déjà vu.

There is a speak property, which is really closely related to the visual display property.

In fact, the two can be entirely reflective of each other.

The speak property essentially indicates whether the piece of content should be spoken or not.

But, the clever trick is that you can set it to auto and it will mirror the state of the display property.

So if display is set to none, then speak will be set to none and vice versa.

So a nice relationship there: something that isn't intended to be displayed visually may well not be intended to be spoken either.

And then we have properties for manipulating the familiar characteristics that we've already seen in other techniques.

Pitch, this time using keyword values rather than numerical ones, but exactly the same idea.

The same goes for rate, changing the rate the content is spoken at, and also for volume.

And like the previous examples, these are also constrained.

You can't take any of them to extremes in either direction.

CSS would also give us the ability to introduce pauses into speech.

Pauses are a remarkably important part of human sounding speech.

Some people like Harold Pinter are even famous for their pauses because they were that dramatic and effective.

So having the ability to say to synthetic speech, pause a little while here, make a slightly longer pause for effect here is a really useful capability.

There's also voice family, which works a lot like font family in visual stylesheets.

You can define a number of characteristics for the voice that will be used.

You can select its character by name and also other characteristics like its age or gender.

The voice-family property is intended to work a lot like font-family in the sense that you can supply a number of choices and have each one be applicable to a different platform.

So it overcomes the constraints or at least the logistical complexities of getVoices in the web speech API.

And the nice thing is that once you've done all of this, the language of the content will also be selected.

So for a voice family that calls itself McFly and is male and of a young age, we can only assume that the content in question would have to be American English.

So although none of this is supported in browsers at the moment, I did want to share with you a possibility demo, because I think the possibility of being able to style the speech output of our content, in the same way that we style its visual output, is something that's really missing from the web platform today.

And with just a little bit of imagination and some of those CSS properties, we could change the earlier example of speech output of that blog post to something like this.

[more expressive female synthesized voice says]"I've been thinking about conversational interfaces for some time, and about the importance of voice quality as a part of user experience.

I use a screen reader and it sounds the same, whatever I'm doing, reading an email from a friend, reading news of a global disaster ..." And so, without any dramatic changes and with no sudden surprises, we can produce speech output in the browser that is styled.

It may be styled to match our brand.

It may be styled to produce certain emotional reactions, or for any number of other reasons; good reasons why we should have the ability to style speech output.

And how amazing would it be to be able to do that? As we start to see more and more conversation happening in the browser, more speech output, the more we're going to want to represent our brand design intentions, and the more we're going to want to express certain emotions or put nuances into the speech.

All of the things, in other words, that we currently do with visual design through spacing, layout, color choice, shading, and all the rest of the creative ideas that we have around visual design.

So I hope one day, perhaps with your help, if we make enough noise about it, we'll start to see support for the web speech API improve.

We'll start to see support for the CSS Speech module, and maybe even EmotionML in due course.

Because conversational interfaces are everywhere.

They're on every device and every platform that we use in some fashion or another.

And at the moment, the web platform is kind of missing a trick, I think, but I hope that the possibility will one day become actuality.

Thank you for listening.

And if you are interested in this topic, I thoroughly recommend you take a look at these articles by Brian Kardell.

He's done a great deal of research and investigation into the web speech API, including some of the gnarly bits and how to solve them.

Conversational interfaces in the browser

Web Directions Code @LeonieWatson

Daisy Bell

Photo of the 1961 IBM 704 room-filling mainframe computer

  • Cantando
  • IBM 704
  • Daisy Bell
  • 1961

Conversation is text

stylised image of people swearing in speech bubbles using characters like #, ? and % in place of letters

Formant TTS

“For millions of years humans lived just like the animals. Then something happened that unleashed the power of our imagination; we learned to talk.”

Concatenative TTS

“For millions of years humans lived just like the animals. Then something happened that unleashed the power of our imagination; we learned to talk.”

Meet Q

“Hello, I'm Q, the world's first genderless voice assistant.

I'm created for a future where we're no longer defined by gender, but rather, how we define ourselves.

My voice was recorded by people who neither identify as male nor female and then altered to sound gender neutral, putting my voice somewhere between 145 and 175 Hertz.

But, for me to become the third option for voice assistants, I need your help. Share my voice with Apple, Amazon, Google, and Microsoft, and together we can ensure that technology recognises us all.

Thanks for listening. Q.”

Parametric TTS

“For millions of years humans lived just like the animals. Then something happened that unleashed the power of our imagination; we learned to talk.”

Neural TTS

“For millions of years humans lived just like the animals. Then something happened that unleashed the power of our imagination; we learned to talk.”

Web Speech API

Image of W3C Logo

  • Draft Community Group Report (2012)
  • Draft Community Group Report (2020)

Interfaces

  • SpeechRecognition interface
  • SpeechSynthesis interface

SpeechRecognition interface

const recog = new webkitSpeechRecognition();

Events

  • audiostart / audioend
  • soundstart / soundend
  • start / end
  • result
  • error

Methods

  • start
  • stop
  • abort

Permissions

recog.onerror = (event) => {
  if (event.error == "not-allowed") {        
    output.innerText = "Please grant permission to use your microphone";
  }
}

Transcript

recog.onresult = (event) => {
  let transcript = event.results[0][0].transcript;
  output.innerText = transcript;
}

SpeechRecognition demo

screen shot of the demo page. Main heading reads "Demo: Web Speech API Speech Recognition interface". Underneath this is a button labelled "Recog". In the footer text reads "© Léonie Watson Carpe diem"

SpeechSynthesis interface

var utterance  = new SpeechSynthesisUtterance();
utterance.text = "This must be Thursday. I never could get the hang of Thursdays.";

Pitch

var utterance2 = new SpeechSynthesisUtterance();        
utterance2.text = "2 tequila";
utterance2.pitch = 5;
window.speechSynthesis.speak(utterance2);

Rate

var utterance3 = new SpeechSynthesisUtterance();
utterance3.text = "3 tequila";
utterance3.rate = 2;
window.speechSynthesis.speak(utterance3);	

Volume

var utterance4 = new SpeechSynthesisUtterance();
utterance4.text = "Floor!";
utterance4.volume = 1;
window.speechSynthesis.speak(utterance4);

Queue

window.speechSynthesis.speak(utterance1);
window.speechSynthesis.speak(utterance2);

window.speechSynthesis.speak(utterance3);
window.speechSynthesis.speak(utterance4);

Immutability

window.speechSynthesis.speak(utterance2);
utterance2.pitch = 4;

Voice

utterance.voice = "Microsoft Hazel desktop - English (Great Britain)";

Get voices

var voices = speechSynthesis.getVoices();

SSML

image of W3C Logo

  • SSML 1.0 W3C Recommendation (2004)
  • SSML 1.1 W3C Recommendation (2010)

Alexa demo

let responsePrompt = `
<voice name='Enrique'>
 <lang xml:lang='es-ES'>
  <p>Hello, <break time='500ms'/> my name is Inigo Montoya.
   You killed my father.
   <break strength='x-strong'/>Prepare to die!
  </p>
 </lang>
</voice>`;

Web Speech demo

utterance.text = `
  <?xml version="1.0"?>
  <speak version="1.0"
    xmlns="http://www.w3.org/2001/10/synthesis"
    xml:lang="en-US">
     …
   </speak>`;

Emotion ML

W3C Logo

Emotion ML W3C Recommendation (2014)

Emotion ML demo

<emotionml>
  <emotion>
   <category name="surprised">Wow!</category> 
   <category name="disappointed">I was getting really tired of not being able to express myself.</category> 
   <category name="happiness">The future looks really exciting.</category>
  </emotion>
</emotionml>

Reader mode

screen shot of a styled web page with the main heading "Notes on synthetic speech".

No style

Screenshot of the same page with no style.

CSS Speech

W3C Logo

  • Working Group Note (2019)
  • W3C Candidate Recommendation (2020)

Media type

<link rel="stylesheet" media="speech" href="speech.css">

Speak

.content {
  speak: auto;
}

Pitch

.content {
  voice-pitch: x-low;
}

Rate

.content {
  voice-rate: x-fast;
}

Volume

.content {
  voice-volume: loud;
}

Pause

.content {
  pause-after: strong;
}

Voice family

.content {
  voice-family: McFly, young, male;
}

Content language

<p class="content" xml:lang="en-US">
  Wait a minute Doc, uh, are you telling me you built a time machine … out of a DeLorean?
</p>

Screenshot of styled web page from the earlier example of Leonie's blog with a speech demo after she has applied speech output styling

Thank you!

Greetings, Professor Falken

Greetings, Professor Falken:

https://bkardell.com/blog/Greetings-Professor-Falken.html

You don’t say:

https://bkardell.com/blog/Basic-Voice-Speaker.html

Listen up:

https://bkardell.com/blog/Listen-Up.html