Today, UX and UI designers can download asset packs to modularly build interfaces that are almost immediately recognizable, because of visual norms and established conventions. It’s now totally possible for newly minted interaction designers to build a career by copying and pasting pre-built elements.
However, voice experiences lack established design patterns. The imaginations of designers and developers have led us this far, but we need to start centralizing and standardizing the practice. While Google and Amazon have written volumes of documentation about building for voice, strangely neither company has taken any kind of solid stance on what a voice experience should look like.
Voice experience prototyping tools such as Voiceflow, Adobe XD, Dialogflow, Botmock, and more use variations of flow charts to enable people to see what they’re building; however, none of them offer intuitive visualization of invocations, replies, and other conversation elements.
A Rising Tide
Hundreds of thousands of voice apps have been built, but there has yet to be a breakout hit. There are only a handful of specialists thinking about conversation mapping right now, which means that the rate of innovation has been sluggish.
Documenting and sharing methods for visualizing voice interactions and conversation mapping will strengthen the voice design ecosystem, making it easier for more unique perspectives to join the conversation and leading to greater potential for success for all. The more people who jump in and debate emerging conventions, the more we can iterate and improve. After all, a rising tide lifts all boats.
I’ve been designing for voice for over two years now, which feels like no time at all in comparison to master practitioners such as Cathy Pearl and Margaret Urban who worked on IVR phone systems. However, in this second wave of voice where interaction has gone omnichannel, I am uniquely qualified and have frequently been called an expert. The following emerging conventions in conversation mapping are based on what I’ve seen and the work that my team has done.
We are decidedly still in the Wild West of voice interaction, and frankly it’s unsurprising that visual conventions haven’t meaningfully crystallized in a practice that can’t even widely agree on what it’s called. Voice Design? Voice App Design? Voice User Interface Design? VUI? CUI? Conversational Design? Conversational Interaction Design? Voice UX?
Conversation mapping is what I call the process of visualizing a voice experience or interaction. A conversation map is basically an elevated flow chart. Building one is a great way to quickly script and edit a voice experience from start to end, and to share it with team members and stakeholders for collaboration and feedback.
A basic conversation map can be sketched out with nothing more than a pen and paper. Though they’re quick and simple to create, conversation maps are extremely useful for brainstorming and communicating the flow of a voice interaction without writing a single line of code. Professionally, I have found that the web app Whimsical lets me quickly draft and update flows in real time and share them easily with engineers and clients. Having the latest map always available online removes the risk of me becoming a blocker, which can happen when I need to export a locally generated conversation map from Sketch. Whimsical isn’t perfect, but it’s done the job so far.
Conversation Map Components
To begin a voice experience, the user says an utterance containing the wake word to activate the device housing the assistant, and then a connecting word to link it to the invocation and launch a specific action or skill. This is where the conversation map starts.
The starting point of a conversation map is the invocation. This is the phrase a user says to wake the device housing the assistant and prompt it to connect them to the voice experience. I’ll be referencing a fictitious voice-first coffee ordering app called BrewBot for the rest of this article.
As silly as it may seem, an earcon is an icon for your ears: a short tone, sound effect, or a jingle. Earcons are useful for non-verbally indicating experience milestones to the user, such as successfully entering the voice experience, completing a task, or answering a question correctly or incorrectly.
Earcons may be labeled when differentiation is needed, for example with their source or file name.
Greeting & Onboarding
How do you want the voice app to greet the user? What is the personality and tone of the voice app? Since we’re making a coffee app, our first-use greeting for BrewBot could be a peppy, “Good morning! I’m BrewBot, your digital barista!” As a tip, remember to label conversation blocks as you create them. In the code, the greeting and onboarding copy will be presented as one, but calling them out separately in the conversation map helps with scripting, since they serve different functional purposes. Onboarding lets the user know what they can do within the voice app. Try to end Onboarding in a question or clear prompt for the user to respond to.
Onboarding is a great opportunity to let the user know what they can do within the voice app. Again, most people don’t know what they can do with voice experiences because they truly don’t know what is on the table.
Think of meeting a human being for the first time: you’d give a longer, more descriptive introduction, maybe including your title, your role (which is basically your function), or some context that helps the person you’re meeting relate to you.
I might say, “Hi, I’m Hilary Hayes: award-winning conversation designer and researcher. I can teach you about conversation mapping, testing voice experiences, or SSML. What would you like to learn first?” But the next time that you meet me, I might just say “It’s me: Hilary Hayes.” because we’ve already established my context and function. Voice experience greetings and onboarding should be shorter on subsequent interactions, building on previous interactions. Try to end in a question or clear prompt for the user to respond to.
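To make the first-use versus returning-user distinction concrete, here is a minimal Python sketch. The first-use greeting comes from the BrewBot example above; the onboarding lines, the returning-user copy, and the `build_welcome` helper are assumptions I made up for illustration, not part of any platform SDK.

```python
# First-use greeting is taken from the BrewBot example in this article.
FIRST_USE_GREETING = "Good morning! I'm BrewBot, your digital barista!"

# Everything below is hypothetical copy, invented for this sketch.
FIRST_USE_ONBOARDING = (
    "I can take a new coffee order or repeat your last one. "
    "What would you like to do?"
)
RETURNING_GREETING = "Welcome back to BrewBot!"
RETURNING_ONBOARDING = "New order, or your usual?"


def build_welcome(is_first_use: bool) -> str:
    """Join the greeting and onboarding blocks into one spoken response."""
    if is_first_use:
        parts = (FIRST_USE_GREETING, FIRST_USE_ONBOARDING)
    else:
        # Returning users already know the context, so both blocks are shorter.
        parts = (RETURNING_GREETING, RETURNING_ONBOARDING)
    return " ".join(parts)
```

The two blocks stay separate in the map and in the constants, and are only joined at the moment they are spoken. Both variants end in a question so the user knows it is their turn.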
Notice the green arrow that connects the invocation to the 1st Use Greeting block. This means that the user’s utterance was successful and linked them to a block within the conversation flow. Since there are so many things that can happen during a conversation, there are many other connection types, too. Some connections, like that green arrow, are directly reactive to a user’s input, but others (like the blue dotted line) indicate the conversation moving between states with no additional action needed by the user. Colour coding helps delineate different kinds of connections from each other, making conversation maps easier to read. That said, the connections are certainly the component that is the least resolved so far.
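If you ever want to keep a conversation map as data rather than as a drawing, blocks and colour-coded connections translate naturally into a small graph. The sketch below is one possible representation, not a standard: the class names, the `next_blocks` helper, the "Alexa, open BrewBot" invocation phrase, and the automatic link from greeting to onboarding are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class ConnectionType(Enum):
    USER_INPUT = "green arrow"       # reacts directly to something the user said
    AUTOMATIC = "blue dotted line"   # state change with no user action needed


@dataclass
class Block:
    label: str       # the label written on the block in the map
    kind: str        # e.g. "invocation", "greeting", "onboarding"
    script: str = "" # the copy spoken or heard in this block


@dataclass
class ConversationMap:
    blocks: Dict[str, Block] = field(default_factory=dict)
    connections: List[Tuple[str, str, ConnectionType]] = field(default_factory=list)

    def add_block(self, block: Block) -> None:
        self.blocks[block.label] = block

    def connect(self, source: str, target: str, kind: ConnectionType) -> None:
        # Refuse connections that point at blocks missing from the map.
        if source not in self.blocks or target not in self.blocks:
            raise KeyError("both blocks must be added before connecting them")
        self.connections.append((source, target, kind))

    def next_blocks(self, label: str, kind: ConnectionType) -> List[str]:
        """Labels reachable from `label` through connections of one type."""
        return [t for s, t, k in self.connections if s == label and k is kind]


# Hypothetical BrewBot opening; the invocation phrase is an assumed example.
brewbot = ConversationMap()
brewbot.add_block(Block("Invocation", "invocation", "Alexa, open BrewBot"))
brewbot.add_block(Block("1st Use Greeting", "greeting",
                        "Good morning! I'm BrewBot, your digital barista!"))
brewbot.add_block(Block("Onboarding", "onboarding"))
brewbot.connect("Invocation", "1st Use Greeting", ConnectionType.USER_INPUT)
brewbot.connect("1st Use Greeting", "Onboarding", ConnectionType.AUTOMATIC)
```

Keeping the connection type on the edge rather than on the block mirrors the colour coding: the same block can be reached by a user-driven arrow from one place and an automatic one from another.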
One of the most concerning issues with voice experiences is the inaccurate mental model that the vast majority of users have about them. Few users refer to their Amazon Echos by their product name, instead calling them “Alexa.” This flawed mental model is only exacerbated by the (as of right now) mandatory use of assistant-level wake words during skills and actions to reopen the mic. This undermines voice app-specific branding, since it confuses the user about who or what they are talking to and where they are.
Consider the metaphor of the device-level assistant as a telephone operator. You pick up the phone and ask the operator to connect you to your friend. They connect you, but then remain on the line for the entire conversation, because you will need to ask them to relay any messages to your friend whenever your friend makes a statement instead of asking a question. Telephones in the 1950s would never have become as ubiquitous as they have if they had the same limitations that smart speakers do as we approach 2020.
I strongly recommend using a custom synthetic or pre-recorded human voice to set your voice experience apart from device-level assistants, making it easier for users to understand where they are when interacting via voice. Using persona icons is especially helpful when a voice app contains more than one voice.
Once the functional options are presented via Onboarding, the user will reply with what they want to do. This is called an utterance, and utterances usually contain words or phrases that describe the user’s intentions.
Within an utterance there will be words or phrases that have been mapped to intents and slot values. An intent is an action that the user wants to take, and a slot value specifies exactly how they want it done.
To accurately fulfill intents, slots need to be populated with slot values. Slots are the variables of a voice experience. Each slot has a slot type, and only values that have been predetermined to be acceptable for that type can populate it.
Intents can either be built in or custom. A built-in intent is one that is already handled by the architecture that you’re building your voice application in, such as “Play” or “Pause” in an audio player experience.
Custom intents are created by designers and developers to accomplish actions outside of what is already built within the voice design toolset. Usually, custom intents have names to reflect what they do.
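Here is a rough Python sketch of how an utterance, a custom intent, slots, and slot values relate to each other. Real platforms do this matching with trained language models; the keyword lookup below is a deliberately naive stand-in. The `OrderCoffeeIntent` name, the `size` and `drink` slots, and their accepted values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical slot types for BrewBot: each slot only accepts the listed values.
SLOT_TYPES = {
    "size": ("small", "medium", "large"),
    "drink": ("latte", "americano", "cappuccino"),
}


@dataclass
class IntentRequest:
    name: str                         # custom intents are named for what they do
    slots: Dict[str, Optional[str]]   # slot name -> slot value, or None if unfilled


def fill_slots(utterance: str) -> IntentRequest:
    """Naive keyword matching: find an accepted value for each slot."""
    words = utterance.lower().replace(",", " ").replace(".", " ").split()
    slots: Dict[str, Optional[str]] = {}
    for slot_name, accepted in SLOT_TYPES.items():
        slots[slot_name] = next((w for w in words if w in accepted), None)
    return IntentRequest("OrderCoffeeIntent", slots)


def missing_slots(request: IntentRequest) -> List[str]:
    """Slots that still need a value before the intent can be fulfilled."""
    return [name for name, value in request.slots.items() if value is None]
```

For example, "I'd like a large latte, please" fills both slots, while "A latte please" leaves `size` empty, which is the cue for the voice app to ask a follow-up question.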
One intent that exists across all voice experiences is the HelpIntent. You can’t publish a voice app without it! At any time, a user must be able to ask for help in a voice experience. A good help message follows similar scripting to onboarding messages by restating what the user can do within the skill or action.
If your skill or action contains different functional sections, such as a first-use setup flow before the main voice app, you should also provide contextual help for that section, since information that the user might need could be very different depending on where they are in the voice experience.
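Contextual help can be sketched as a simple lookup with a general fallback. The section name and all of the help copy below are invented for BrewBot, and `help_message` is a hypothetical helper rather than a platform function.

```python
# General help restates what the user can do, much like onboarding.
GENERAL_HELP = (
    "You can place a new coffee order or repeat your last one. "
    "What would you like to do?"
)

# Section-specific help, keyed by where the user is in the experience.
CONTEXTUAL_HELP = {
    "setup": (
        "I just need to know your usual cafe to finish setting up. "
        "Which cafe do you usually visit?"
    ),
}


def help_message(section: str) -> str:
    """Return help for the current section, falling back to general help."""
    return CONTEXTUAL_HELP.get(section, GENERAL_HELP)
```

A user who asks for help during the hypothetical setup flow hears the setup message; anywhere else they hear the general one. Both end in a question so the conversation keeps moving.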
If the user doesn’t respond within a few seconds, you’ll want to consider a Reprompt to ensure that you get the slot values needed to fulfill an intent. Reprompts are a great way to provide contextual guidance and light error handling to the user. The downside is that the voice app will close if the reprompt doesn’t get a response.
Reprompts may also be combined in the map with their corresponding block.
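The prompt, reprompt, and close sequence can be simulated in a few lines of Python. This is only a model of the behaviour described above for reasoning about a map, not how any platform implements it: silence is represented as `None`, and `run_prompt` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def run_prompt(
    prompt: str,
    reprompt: str,
    responses: List[Optional[str]],
) -> Tuple[List[str], Optional[str], bool]:
    """Simulate one prompt with a single reprompt.

    `responses` holds what the user says after the prompt and after the
    reprompt; None (or a missing entry) means the user stayed silent.
    Returns (what the app said, what the user answered, session still open).
    """
    spoken = [prompt]
    first = responses[0] if len(responses) > 0 else None
    if first:
        return spoken, first, True

    # No answer within the listening window: say the reprompt once.
    spoken.append(reprompt)
    second = responses[1] if len(responses) > 1 else None
    if second:
        return spoken, second, True

    # Still silent after the reprompt: the voice app closes.
    return spoken, None, False
```

Calling `run_prompt("What size would you like?", "You can say small, medium, or large.", [None, "large"])` models a user who misses the first prompt and answers the reprompt. The reprompt copy here is an assumed example that adds guidance instead of repeating the question.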
Designers are familiar with designing for errors, and Conversation Repair should be considered as the error-state handling of the voice design world. When the user is no longer in the desired flow, this message helps them get back on track.
Identify that there’s been an error, and then when appropriate, suggest another way for the user to accomplish what they’re trying to do. For example: “Sorry, I’m not sure about that. Do you want to hear your options again?”
Remember that to fulfill intents, slots need to be populated with slot values, but only certain values are predetermined to be acceptable and compatible with the slot type. Conversation repair can also be used to handle system errors by reassuring the user that while there may have been a temporary hiccup, things should be back to business as usual shortly.
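A minimal Python sketch of both kinds of repair follows. The user-error line is the example from this article; the accepted sizes, the system-error copy, the confirmation line, and the `respond_to_size` helper are assumptions for illustration.

```python
from typing import Optional

# Hypothetical accepted values for a BrewBot "size" slot.
ACCEPTED_SIZES = ("small", "medium", "large")

# Repair line quoted from this article.
USER_REPAIR = "Sorry, I'm not sure about that. Do you want to hear your options again?"

# Assumed copy for a temporary system error.
SYSTEM_REPAIR = "Sorry, I hit a temporary hiccup. Please try again in a moment."


def respond_to_size(value: Optional[str], system_ok: bool = True) -> str:
    """Pick a response for a size slot value, repairing when needed."""
    if not system_ok:
        # System errors get reassurance rather than a list of options.
        return SYSTEM_REPAIR
    if value is None or value.lower() not in ACCEPTED_SIZES:
        # The value is missing or not accepted by the slot type.
        return USER_REPAIR
    return f"Got it, one {value.lower()} coffee."
```

Checking the system state first keeps the user from being blamed for a failure that wasn't theirs, which is the point of treating the two repairs as separate blocks in the map.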
If you’d like to learn more about Conversation Repair, check out my previous article on it here.
MP3 & Streaming Content Playback
Both audio content longer than an Earcon and streaming audio can be represented by this block.
Visuals & Connected Devices
Voice experiences are good, but they are not universally applicable. Voice has notable shortcomings when it comes to communicating a long list of options or for most browsing experiences. Using nearby screens, especially if the experience is happening on or central to a device that has its own screen, can help you create truly helpful and voice-forward experiences without the pitfalls of being voice-only.
Use as much or as little detail as you need when including representation of other devices in the ecosystem of your experience.
This block indicates that the included copy is handled by system-level dialogue management, such as this zip code reprompt.
Exit Message (StopIntent)
When all is said and done, remember to include an Exit message to wish the user well and even to encourage them to come back soon. This may be another excellent opportunity for an earcon.
This is what the whole example conversation flow map looks like:
This map is barely legible from this perspective, but just as you don’t consult a map of the world when trying to find the route to a friend’s house, conversation maps aren’t intended to be read or used at full scale. They are meant to offer insight into the divergent paths that voice interactions can take.
Now you’ve seen my approach to conversation mapping, and I’d love to know how you and your team have been approaching this unique and evolving challenge!
Thank you to Dominic Smith for help with editing this behemoth of an article.