Overview

1 Introduction to voice first

Voice-first computing moved from decades of niche promise to mainstream reality with products like Amazon Echo, turning natural conversation into a primary way to use technology. These platforms are defined by voice as the main input and by openness to third-party extensions that bring the web’s breadth to voice through skills and actions. The ecosystem is led by Amazon, Google, and Microsoft, is expanding quickly into homes, and increasingly supports multimodal experiences that pair speech with displays while keeping voice at the center.

Designing for voice requires applying the rules of good conversation: a helpful tone and personality appropriate to context, concise responses, smart follow-up questions, and graceful recovery when users stumble. Unlike rigid, menu-driven IVR systems, voice-first aims for flexible, goal-oriented dialogs that reduce user effort. This shifts responsibility from users to designers and developers—discoverability must be built in, inputs handled naturally despite variation, and outputs pruned to the most relevant answer—forcing those from visual-first backgrounds to rethink interaction models around conversation.

Under the hood, voice experiences follow a clear pipeline: a wake word triggers local listening, speech is streamed to the cloud, converted to text, and interpreted by NLU to match developer-defined intents and extract variable details via slots, guided by sample utterances. Fulfillment code—often running on serverless platforms—executes business logic, calls APIs, and returns a response that the assistant renders using text-to-speech, enhanced with SSML for pacing, emphasis, and other prosody controls. Put together, intents, slots, utterances, handlers, and SSML form the toolkit for building natural, reliable, and scalable voice applications.
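The fulfillment step in this pipeline can be sketched in a few lines. The following is a minimal illustration, not any platform's real SDK: the handler names, the intent `GetHoroscopeIntent`, the slot `Sign`, and the request shape are all invented for this example.

```python
# Minimal sketch of fulfillment: the platform's NLU has already matched an
# intent and filled its slots; our code only maps that request to a reply.
# All names here are illustrative, not a real SDK's API.

def get_horoscope(slots):
    sign = slots.get("Sign", "your sign")
    # SSML lets us control pacing before text-to-speech renders the reply.
    return ("<speak>Today looks bright for "
            f"{sign}<break time='300ms'/> expect good news.</speak>")

def fallback(slots):
    return "<speak>Sorry, I didn't catch that. Could you rephrase?</speak>"

def handle_request(request):
    """Dispatch an NLU result ({'intent': ..., 'slots': {...}}) to a handler."""
    handlers = {"GetHoroscopeIntent": get_horoscope}
    handler = handlers.get(request["intent"], fallback)
    return handler(request["slots"])
```

In a real skill this dispatch is provided by the platform SDK; the point is only that an intent behaves like a function name, slots like its arguments, and the return value is speech markup.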

Figure 1.1. Web flow compared to voice flow
Figure 1.2. Alexa relies on natural language understanding to answer the user’s question.
Figure 1.3. The overall user goal (the intent), the intent-specific variable information (the slot), and how it’s invoked (the utterance)

Summary

  • In voice-first applications, focus on taking the burden of completing an action off the user.
  • Data in a conversation flows back and forth between partners to complete an action.
  • Building a voice application involves reliably directing this data between systems.
  • Begin to think of requests in terms of intents, slots, and utterances.
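The vocabulary in the last bullet is essentially data a developer declares to the platform. A hypothetical interaction model, written here as a Python dict purely for illustration (real platforms use their own JSON schemas), might look like:

```python
# Hypothetical interaction model: intents behave like functions, slots like
# their arguments, and sample utterances train the NLU's mapping from speech
# to intent. The names and structure here are invented for illustration.
interaction_model = {
    "intents": [
        {
            "name": "TurnOffLightsIntent",
            "slots": [{"name": "Room", "type": "ROOM_LIST"}],
            "samples": [
                "turn off the lights in the {Room}",
                "switch off the {Room} lights",
            ],
        }
    ],
    "types": [
        {"name": "ROOM_LIST",
         "values": ["kitchen", "bedroom", "living room"]}
    ],
}
```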

FAQ

What does “voice first” mean?
Voice-first platforms are interacted with primarily through voice and are open to extension by third-party developers. They bring the web to voice by allowing built-in functions plus developer-created apps (skills/actions).
How are voice-first platforms different from old IVR phone trees?
Classical IVR systems are rigid decision trees with a tiny set of choices. Voice-first platforms support natural, open-ended conversations and can invoke web-backed skills/actions, making the options practically limitless and more conversational.
Who are the main voice-first platforms, and why isn’t Apple listed?
The primary platforms discussed are Amazon, Google, and Microsoft because they’re open to third-party developers. Although Siri and HomePod are popular, Apple has not opened them to third-party development.
Is “voice first” the same as “voice only”?
No. Voice first prioritizes voice but can be multimodal. Devices like Echo Show and Google Home with Chromecast combine voice with displays, expanding interaction options while keeping voice as the main modality.
What makes designing for voice different from web or mobile UI?
Voice design relies on natural conversation. You must give the right amount of information, ask clarifying follow-ups, handle unexpected input gracefully, and set an appropriate personality—all without visual scaffolding. Responsibility shifts from users to developers to make options discoverable, inputs natural, and results concise.
What are intents, sample utterances, and slots?
Intents represent what the user wants to do (like functions). Sample utterances are example phrases that train the NLU to map speech to an intent. Slots are variable pieces of information (like arguments) extracted from the utterance (for example, a Room slot in “turn off the lights in the kitchen”).
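As a toy illustration of the relationship between utterances, intents, and slots, imagine matching recognized text against literal patterns. Real NLU generalizes statistically from the samples rather than matching them verbatim; everything below is invented for illustration only.

```python
import re

# Toy utterance matcher. Real NLU uses statistical models to generalize
# beyond the sample utterances; this literal regex version only shows how
# an utterance maps to an intent plus extracted slot values.
SAMPLES = {
    "TurnOffLightsIntent": [r"turn off the lights in the (?P<Room>\w+)"],
}

def match(text):
    """Return {'intent': ..., 'slots': {...}} for recognized text, or None."""
    for intent, patterns in SAMPLES.items():
        for pattern in patterns:
            m = re.fullmatch(pattern, text)
            if m:
                return {"intent": intent, "slots": m.groupdict()}
    return None
```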
How does a voice command flow from speech to response?
The device wakes on a local wake word, streams audio to the platform, converts speech to text, uses NLU to infer intent and slots, forwards the request to fulfillment (your code), receives a textual response, converts it to speech, and plays it back to the user.
What is a wake word and how is it handled?
A wake word/phrase (for example, “Alexa,” “Hey Google”) is recognized locally. Devices keep a short audio buffer and continuously listen for the wake word so they can quickly start streaming speech once invoked, helping responsiveness and privacy.
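The short audio buffer mentioned here can be sketched as a ring buffer that always holds the most recent frames, so audio captured just before wake-word detection is not lost. The class, frame format, and buffer size below are arbitrary assumptions for illustration.

```python
from collections import deque

# A ring buffer keeps only the N most recent audio frames; older frames are
# dropped automatically. On wake-word detection, the device can stream the
# buffered frames plus whatever follows. Sizes and frames are illustrative.
class AudioRingBuffer:
    def __init__(self, max_frames):
        self.frames = deque(maxlen=max_frames)

    def push(self, frame):
        self.frames.append(frame)  # evicts the oldest frame when full

    def drain(self):
        """Return the buffered frames and reset, e.g. on wake-word detection."""
        out = list(self.frames)
        self.frames.clear()
        return out
```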
Why is converting speech to text difficult?
Speech varies by accent, voice, and context; background noise and similar-sounding phrases add ambiguity. Systems break audio into phonemes, compare to known patterns, and rely on statistical models and large training sets to infer the most likely words.
What is fulfillment, and how does the device decide what to say back?
Fulfillment is your code that handles requests (often on serverless platforms like AWS Lambda or Google Cloud Functions). A handler mapped to an intent processes slots, performs logic, and returns a response. Speech is typically returned using SSML, which lets you control prosody (rate, volume, pronunciation) before it’s synthesized and spoken to the user.
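A small helper shows the kind of SSML a fulfillment handler might return. The `<speak>`, `<prosody>`, and `<break>` tags are standard SSML; the helper function itself is invented for this sketch.

```python
# Sketch of building an SSML reply. <speak>, <prosody>, and <break> are
# standard SSML elements; this helper is a made-up convenience, not an SDK.
def ssml_response(text, rate="medium", pause_ms=0):
    """Wrap text in SSML, optionally adjusting speech rate and adding a pause."""
    body = f"<prosody rate='{rate}'>{text}</prosody>"
    if pause_ms:
        body += f"<break time='{pause_ms}ms'/>"
    return f"<speak>{body}</speak>"

# e.g. ssml_response("Your order has shipped.", rate="slow", pause_ms=500)
```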
