Designing for Voice: Alexa, Google Assistant, and beyond

There are few places where design is less evident than when you use a voice user interface, like the Amazon Echo or the Google Home. But as anyone who has used Alexa or the Google Assistant knows, it’s painfully obvious when a voice-based experience is not designed well. You go nowhere, fast. It’s the equivalent of a 404 page, but somehow more personally frustrating.

As a former voice designer for Alexa, and a current voice designer for the Google Assistant, I would like to talk about the ins and outs of designing for an eyes-free experience. What is voice design? What does it look like? Why is it important? What happens when you do it well? And what happens when it’s not designed at all?

Start here… is a huge collection of links created by Yury Vetrov that includes many of the links in this document (in a much nicer format), spanning current tools by function, examples of generative design in other disciplines, intros to AI/ML, and ethics.


Also, while it’s not specifically about AI – if you are interested in discussions around design ethics, the community in the How Might We Do Good Slack is tackling things like a design ethics framework, collective action, and a toolkit for overcoming the barriers to doing good.

Libratus poker AI and poker AI history (CW: this is a poker forum so proceed with caution)

Poker endgame theory/systems



I also highly recommend watching the AlphaGo documentary on Netflix!


AI design tools and projects


Ethics and AI


Automation and the future of work


Diversity and inclusion (or lack of) in AI


AI progress and current state overviews


Relationship between AI and humans


AI applications in mental health


Overview of design/UX + AI

AI/tech fails


Things that didn’t have a group!


Darla Sharp – Crafting Conversation, design in the age of AI

While all of us have experience designing screens, many of us don’t have experience designing for voice.

Darla currently works at Google (Assistant team) as a Conversation Designer, although the job may also be called Voice User Interface (VUI) Design, or Voice Interaction Designer. In the end it’s just interaction design with a focus on voice.

Google is moving away from mobile-first to AI-first. Google Assistant’s product line is expanding rapidly, including some devices that do actually add a screen (although not as the primary focus).

Design + AI – there is an increase in voice-forward design. The question, of course, is why: when we all have smartphones, why do we need this additional modality?

  1. speed and simplicity
  2. ubiquity

When voice works it really is quicker – it takes a surprisingly large number of taps to do simple things. For example, you can ask for the latest Gorillaz album in Spotify much faster than you can open the app, search for it, find the album and tap to start playing it.

Phones are considered ubiquitous, but as virtual assistants spread to other places they are getting more popular. You shouldn’t be using your phone in the car… right?! So the ubiquity is moving to the assistant and not the device.

Design considerations

  1. conversation design, which owes a lot to linguistics
  2. speakers (not the devices)
  3. the tools in the toolkit
  4. expanding ecosystem

Conversation design owes a lot to linguistics, and to the way humans process language.

Words (sound into words) → Syntax (words into phrases) → Semantics (derive meaning) → Pragmatics (interpret meaning in cultural context).

This is really easy in a first language – basically instinctual. However, it is incredibly fragile: if anything breaks, the entire interaction falls down. If someone makes a mistake in a second language, it confuses the people talking or listening to them. Or if someone’s accent makes the sounds hard to understand, the most basic level of comprehension has broken.

How does this break out into conversation design?

Front end:

  • Words: What’s the weather today?
  • Syntax: In Alameda today, it’s 72 degrees and sunny.

Back end (most of the time is spent after this, on logic and UX flows)

  • Semantics
  • Pragmatics

This interaction requires knowledge of the user’s location and preferred units of measurement (degrees F or C?).
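To make the four stages concrete, here is a toy sketch in Python of the weather exchange above. Every function, the logic, and the hard-coded context are invented for illustration; a real ASR/NLU stack looks nothing like this.

```python
# Toy sketch of the four comprehension stages applied to the weather
# example. All names, logic, and the hard-coded context are illustrative.

def words(transcript):
    """Sound into words: tokenize an (already transcribed) utterance."""
    return transcript.lower().rstrip("?!.").split()

def syntax(tokens):
    """Words into phrases: crude intent match for a weather question."""
    return {"intent": "ask_weather" if "weather" in tokens else "unknown"}

def semantics(parse, context):
    """Derive meaning: what does the user actually want?"""
    if parse["intent"] == "ask_weather":
        return {"need": "forecast", "location": context["location"]}
    return {"need": "fallback"}

def pragmatics(meaning, context):
    """Interpret in context: answer in the user's preferred units."""
    if meaning["need"] != "forecast":
        return "Sorry, I didn't catch that."
    temp = (context["temp_f"] if context["units"] == "F"
            else round((context["temp_f"] - 32) * 5 / 9))
    return f"In {meaning['location']} today, it's {temp} degrees and sunny."

context = {"location": "Alameda", "units": "F", "temp_f": 72}
reply = pragmatics(semantics(syntax(words("What's the weather today?")), context), context)
print(reply)  # In Alameda today, it's 72 degrees and sunny.
```

The point of the sketch is the division of labor: the first two stages deal with the utterance itself, while semantics and pragmatics depend on stored context (location, units) – which is where most of the design work lives.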

Cooperative principle – rules that we innately know, and use in order to be good conversational partners.

  • Quality – truthful
  • Quantity – as informative as required (neither too little nor too much)
  • Relevance – appropriate for the context
  • Manner – clear and unambiguous

When assistants get something wrong, they will have violated one or more of these principles.

Examples of Google Assistant getting these wrong…

  • Quality – “open uber” → “I can’t open apps”… but the user wanted to open an Action they know the Assistant can run
  • Quantity – (a question about politics/law) → the response had far too much information and wasn’t at the right level of detail
  • Relevance – “what was that last song” → (a long plot synopsis of a movie called The Last Song)
  • Manner – “ok google can you tell me directions” → “I can’t find that place” (the Assistant can give directions, so the response is simply untrue)

Cognitive load – this comes up all the time in voice design. When we listen to people talking, we form a syntax tree that lets us understand the words. We can listen and process at the same time, as long as it stays within our cognitive capacity.

“I shot an elephant in my pyjamas” can parse into two different syntax trees. One has you wearing the pyjamas; the other has the elephant wearing them. We know who is wearing them, but computers have a much harder time.

Example 1:

User: Hey Google, any flights to San Francisco on Thursday?
A (chunked, easy to follow): Yes, there are four flights. They’re at 1:15, 3:55, 5:05 and 6:35pm. Do you want to hear more about one of these?
A (overloaded): Yes, there are four flights. Big Blue Airlines 47 leaves New York at blah blah blah…
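The contrast between the two responses can be sketched in code. The flight data and helper names below are invented; a real assistant would generate these replies from live data.

```python
# Invented flight data and helpers contrasting a chunked reply that
# respects cognitive load with one that dumps every detail at once.

flights = [
    {"name": "Big Blue Airlines 47", "departs": "1:15"},
    {"name": "Big Blue Airlines 53", "departs": "3:55"},
    {"name": "Skyway 210", "departs": "5:05"},
    {"name": "Skyway 233", "departs": "6:35pm"},
]

def chunked_reply(flights):
    """First pass: just the count and times, then offer to go deeper."""
    times = ", ".join(f["departs"] for f in flights[:-1])
    times += " and " + flights[-1]["departs"]
    return (f"Yes, there are {len(flights)} flights. They're at {times}. "
            "Do you want to hear more about one of these?")

def overloaded_reply(flights):
    """Everything at once: hard to hold in working memory when spoken."""
    details = " ".join(f"{f['name']} leaves at {f['departs']}."
                       for f in flights)
    return f"Yes, there are {len(flights)} flights. {details}"

print(chunked_reply(flights))
```

The chunked version answers the question at the level of detail asked, then hands the turn back to the user – exactly the Quantity maxim at work.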

Speakers… people may be speaking in a very large range of scenarios. They may be hands-busy or eyes-busy, they may be multitasking, they may be in a private or public space. Users are all instant experts – we’ve been talking all our lives! So they have high expectations and low tolerance for error.

The other side of speakers is your assistant, which is representing your brand when it’s talking to the user. It manifests brand attributes, it has a back story and a role. If you don’t define all this, your users will!

Text-to-speech can really change the nature of the communication. Simply removing the exclamation mark from “Let’s go!” completely changes the tone. TTS makes the word “actually” sound incredibly rude and condescending, because all the tone and body language is stripped away. So where it might have said “actually”, you need to find another word, designing around this issue in the medium.
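Besides choosing better words, one tool for tuning how TTS says something is SSML, the W3C Speech Synthesis Markup Language, which most TTS engines support to some degree. The helper below is a sketch: `<speak>` and `<prosody>` are standard SSML, but exact attribute support varies by platform.

```python
# Sketch of wrapping a phrase in SSML prosody markup so it is spoken
# slightly slower and lower, softening the tone of a correction.
# The soften() helper is invented for illustration.

def soften(text):
    return ('<speak><prosody rate="90%" pitch="-2st">'
            + text +
            '</prosody></speak>')

print(soften("Here is what I found instead."))
```

Even then, prosody markup only softens delivery; the word choice itself (“actually” versus a neutral alternative) still matters more.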

We have many tools in the toolkit now – people can speak, type, tap and show things to a device. Most organisations still have siloed teams working on these modalities.

The nature of the speech-only signal is unusual. It’s linear, always moving forward (there’s no nesting or layers the way we work on a screen); and it’s ephemeral, constantly fading. The words were here and now they’re gone – imagine a screen interface that only shows for five seconds before fading away.

There is complexity in recognition and understanding – what users say and what they mean: automatic speech recognition (ASR) and natural language processing (NLP).

“What’s the weather in Springfield?”
→ which one? there are many across America and even around the world

“Play Yesterday”
→ do you mean the movie or the song..?
→ which version of the song? the original or one of the covers?
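A disambiguation step for queries like this can be sketched as follows; the catalog and helper are invented for illustration, and the real problem involves ranking by popularity, user history, and more.

```python
# Invented media catalog and resolver: when a query matches several
# items, ask a clarifying question instead of guessing.

catalog = [
    {"title": "Yesterday", "kind": "song", "by": "The Beatles"},
    {"title": "Yesterday", "kind": "song", "by": "Boyz II Men"},
    {"title": "Yesterday", "kind": "movie", "by": None},
]

def resolve(query, catalog):
    matches = [i for i in catalog if i["title"].lower() == query.lower()]
    if not matches:
        return ("fallback", "Sorry, I couldn't find that.")
    if len(matches) == 1:
        return ("play", matches[0])
    kinds = sorted({m["kind"] for m in matches})
    return ("clarify", f"Did you want the {' or the '.join(kinds)}?")

action, payload = resolve("Yesterday", catalog)
print(action, payload)  # clarify Did you want the movie or the song?
```

Note that one clarifying question may not be enough – after choosing “the song”, the assistant still has to pick or ask about a version – which is why these flows take up so much design time.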

Text-to-speech also has to carry the rhythm and melody of speech. It’s not just what you say, it’s how you say it.

As more devices become available, it gets more complex to work out how things work across all of them.

There is a spectrum from voice-only, to voice-forward, to intermodal, to visual-only.

There is a range of user conditions – static or in motion, public or private space, rich or poor touch interaction. Mobile phones move through all of these; the context can swing to the extremes for motion and privacy.

This is also why porting things doesn’t work. If you port a screen app straight to voice, it just doesn’t work.

The number one thing is to design for empathy. That’s a real challenge for a platform as big as Google, but it’s really important… it’s very hard but we try!

Day Two