2020 Voice Forecast: Wake Conditions

Conversation designers carefully craft voice interactions to feel as natural as possible, yet we’re still bound by the constraint of mandatory device wake words, rendering all interactions fundamentally, well, robotic.

Hilary Hayes
4 min read · Jan 10, 2020

Implicit wake conditions are the key to natural — and even intimate — voice experiences in 2020 and beyond.

Currently, a wake word alerts a device that it should listen to what the person says next. That utterance may contain an invocation: a trigger phrase that prompts the assistant to connect the user to a specific voice app. If the person asks for a skill or action by name, that is referred to as an explicit invocation, as opposed to an implicit invocation.

The Actions on Google documentation describes implicit invocation as “an interaction flow that occurs when a user makes a request to perform some task without invoking an Action by name. The Google Assistant attempts to match the user’s request to a suitable fulfillment, such as an Action, search result, or mobile app, then presents recommendations to the user… implicit invocation provides a way for users to discover your Action via the Assistant.” The strategy behind implicit invocation emphasizes discoverability: surfacing a potentially helpful voice app that a user may not know by name, based on their context. However, this offers little value when the user is already very familiar with the voice app and may use it daily.
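To make the distinction concrete, here is a minimal sketch of how an assistant might route an utterance, checking for an explicit invocation (the app named outright) before falling back to implicit matching against an app’s declared capabilities. The app names, keywords, and routing logic are purely illustrative and are not how the Google Assistant actually resolves requests:

```python
# Hypothetical sketch of explicit vs. implicit invocation routing.
# App names, keywords, and matching logic are illustrative only; this is
# not how the Google Assistant or Alexa actually resolve requests.

VOICE_APPS = {
    "headspace": {"keywords": ["meditate", "meditation", "relax"]},
    "transit tracker": {"keywords": ["bus", "train", "commute"]},
}

def route_utterance(utterance: str) -> str:
    text = utterance.lower()

    # Explicit invocation: the user asks for the app by name,
    # e.g. "talk to Transit Tracker".
    for app_name in VOICE_APPS:
        if app_name in text:
            return f"explicit invocation -> {app_name}"

    # Implicit invocation: no app named, so try to match the request
    # to an app's declared capabilities and recommend it.
    for app_name, config in VOICE_APPS.items():
        if any(keyword in text for keyword in config["keywords"]):
            return f"implicit invocation -> suggest {app_name}"

    return "fallback -> search or built-in response"

if __name__ == "__main__":
    print(route_utterance("Hey Google, talk to Transit Tracker"))
    print(route_utterance("Hey Google, when is the next bus downtown?"))
```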

Furthermore, zoom out and consider the user’s relationship to the device-level assistant — something that they may speak to several times a day. In the context of the device-level assistant, lowering barriers to invocation is not about discoverability, it’s about building a relationship and deepening intimacy.

Multi-turn exchanges are becoming slightly more prevalent in voice interactions, but because initiating a voice request depends on verbalizing a wake word like “Alexa” or “Hey Google,” users find themselves invoking, and re-invoking, and re-re-invoking the action or skill in order to continue the conversation they believe they’re having. Consider how awkwardly the same experience model would map onto a human-to-human interaction, which typically weaves several potentially (or seemingly) unrelated topics into one larger conversation. A wide chasm still exists between the user’s mental model and the actual functional reality of voice assistant interactions.

The use of wake words keeps our relationship with voice assistants at arm’s length, even though they are becoming increasingly omnipresent in our home and work lives.

By building more human conversational cues into device wake conditions, I hope we can move from people saying “I’m worried about it listening to me all the time” to “when I am speaking to my assistant, I expect to be listened to.”

For years, Amazon Echo devices have used their blue light ring to indicate directional listening; the Apple iPhone X introduced Face ID in 2017; and in 2019 the Google Pixel 4 shipped with Motion Sense (built on Project Soli) to indicate directional interaction.

Amazon Echo, Apple Face ID, Google Pixel 4 Motion Sense, Google Project Soli

By the end of 2020, I anticipate voice design moving away from explicit wake words (removing the need for saying “Alexa” or “Hey Google”) to implicit wake conditions, such as recognizing when a user is looking at or turns to face a device housing the assistant.

Taking cues from human interaction, designers could implement implicit wake conditions in many ways, including the following (a rough sketch of how such signals might be fused appears after the list):

  • Gaze detection
  • Face recognition
  • Hand tracking
  • Gestures, microgestures
  • Bone conduction via wearables
  • Body position detection

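As promised above, here is a rough sketch of how a device might fuse several of these signals into a single wake decision. The signal names, weights, and threshold are hypothetical; a real implementation would need careful tuning against false wakes and privacy expectations:

```python
# Hypothetical sketch of fusing implicit wake signals into a wake decision.
# Sensor inputs, weights, and the threshold are invented for illustration.

from dataclasses import dataclass

@dataclass
class WakeSignals:
    gaze_on_device: float   # 0-1 confidence the user is looking at the device
    known_face: float       # 0-1 confidence a registered face is present
    wake_gesture: float     # 0-1 confidence of a deliberate gesture (e.g. raised hand)
    facing_device: float    # 0-1 confidence the user's body is oriented toward the device

def should_wake(signals: WakeSignals, threshold: float = 0.6) -> bool:
    # Weighted fusion: deliberate cues (gaze, gesture) count more than
    # passive presence cues (a known face merely being in the room).
    score = (
        0.4 * signals.gaze_on_device
        + 0.3 * signals.wake_gesture
        + 0.2 * signals.facing_device
        + 0.1 * signals.known_face
    )
    return score >= threshold

if __name__ == "__main__":
    # User turns toward the speaker and looks at it: wake without a wake word.
    print(should_wake(WakeSignals(gaze_on_device=0.9, known_face=0.8,
                                  wake_gesture=0.0, facing_device=0.9)))  # True
    # User walks past the device without engaging: stay asleep.
    print(should_wake(WakeSignals(gaze_on_device=0.1, known_face=0.8,
                                  wake_gesture=0.0, facing_device=0.2)))  # False
```
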
One challenge that will continue to plague conversation designers is determining when a conversation is over, since conclusion cues are not necessarily the inverse of wake conditions. For example, someone may make eye contact with another person, or with a device, to implicitly start a conversation, but that doesn’t mean they won’t turn away over the course of the interaction while still wanting to keep chatting. Google’s Motion Sense shows the most promise to alleviate this pain point, since at full capability it reportedly can recognize the body cues that typically start and end interactions, such as two people making eye contact. How do humans decide when a conversation is over? Even between two people, there is a distinct possibility of awkwardness at the end of a conversation.
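To illustrate, here is a hypothetical sketch of a conversation-end heuristic that treats an explicit verbal closing, or sustained silence combined with sustained turning away, as conclusion cues, while deliberately ignoring a brief glance away. The phrases and thresholds are invented for illustration:

```python
# Hypothetical sketch of deciding whether a conversation has ended.
# Cues and thresholds are invented; note that a brief glance away is
# deliberately NOT treated as a conclusion cue, since ending a
# conversation is not simply the inverse of starting one.

CLOSING_PHRASES = ("thanks, that's all", "goodbye", "we're done", "stop")

def conversation_over(last_utterance: str,
                      seconds_of_silence: float,
                      seconds_facing_away: float) -> bool:
    # An explicit verbal closing beats every other signal.
    if any(phrase in last_utterance.lower() for phrase in CLOSING_PHRASES):
        return True

    # A long silence while also turned away suggests disengagement.
    if seconds_of_silence > 10 and seconds_facing_away > 10:
        return True

    # Turning away alone is not enough: people look away mid-conversation
    # all the time while still expecting to be heard.
    return False

if __name__ == "__main__":
    print(conversation_over("thanks, that's all", 1.0, 0.0))         # True
    print(conversation_over("and what about tomorrow?", 3.0, 12.0))  # False
```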

Implicit wake conditions will provide unique opportunities for developers and designers to imagine voice-only and voice-first experiences that feel unobtrusive, intuitive, and intimate, since they will be based on normative human actions.

Thank you to Dominic Smith for editing.
