Voice-First Development
Designing, Developing, and Deploying Conversational Interfaces
Ann Thymé-Gobbel, Ph.D. and Charles R. Jankowski Jr., Ph.D.
  • MEAP began November 2018
  • Publication in Spring 2020 (estimated)
  • ISBN 9781617295461
  • 350 pages (estimated)
  • printed in black & white

"A very good book on the unique challenges you face when trying to build programs using voice technology."

William Wade
Voice-commanded applications are everywhere, running on smart speakers like the Amazon Echo and Google Home, digital assistants like Apple’s Siri, speech-based automotive chatbots, and even novelties like the Alexa-enabled Big Mouth Billy Bass. In Voice-First Development, authors Ann Thymé-Gobbel and Charles Jankowski draw on more than three decades of experience in voice-related development and research to bring you up to speed on a host of voice-controlled applications. This engaging guide focuses on end-to-end voice app development, concrete best practices, and how to avoid common pitfalls. With practical instruction, real-world examples, and lots of code samples, this book is perfect for developers ready to create fully functioning voice solutions that users will love!
Table of Contents

Part 1: Voice-First Foundations

1 Voice-first development components

1.1 Voice-first, voice-only, and conversational everything

1.2 Introduction to voice technology components

1.2.1 Speech to text

1.2.2 Natural language understanding

1.2.3 Dialog management

1.2.4 Natural language generation

1.2.5 Text to speech

1.3 Meet the phases of voice-first development

1.3.1 Plan

1.3.2 Design

1.3.3 Build

1.3.4 Test

1.3.5 Deploy & Assess

1.3.6 Iterate

1.4 Hope is not a strategy—but to plan & execute is

1.5 What next?

1.6 Summary

2 Keeping voice in mind

2.1 Why voice is different

2.2 Hands-on: A pre-coding thought experiment

2.3 Voice dialog and its participants

2.3.1 Human voice

2.3.2 Computer voice

2.3.3 Human-computer voice dialog

2.4 What next?

2.5 Summary

3 Running a voice-first application – and noticing issues

3.1 Hands-on: Preparing the restaurant finder

3.2 Say hello to the Amazon and Google voice platforms

3.3 Hands-on: Running the Alexa restaurant finder skill

3.3.1 Basic setup

3.3.2 Adding an intent

3.3.3 Adding the endpoint

3.3.4 Connecting the skill and the code

3.3.5 Testing the skill

3.4 Hands-on: Running the Google restaurant finder action

3.4.1 Before you begin

3.4.2 Basic setup

3.4.3 Starting to build the interaction

3.4.4 Doing something

3.4.5 What the user says

3.4.6 What the application says

3.4.7 Connecting Dialogflow to Actions on Google

3.4.8 Testing the app

3.4.9 Saving the voice interaction

3.5 Why we’ll be favoring Google Assistant

3.6 Google’s voice development ecosystem

3.7 The pros and cons of relying on tools

3.8 Hands-on: Making changes

3.8.1 Adding more phrases

3.8.2 Eating something else

3.8.3 Asking for something more specific

3.9 What’s next?

3.10 Summary

Part 2: Planning Voice-First Interactions

4 Defining your vision: Building What, How, and Why for Whom

4.1 Functional requirements: What are you building?

4.1.1 General functionality

4.1.2 Detailed functionality

4.1.3 Supporting functionality

4.2 Non-functional business requirements: Why are you building it?

4.2.1 General purpose and goal

4.2.2 Underlying service and existing automation

4.2.3 Branding and terminology

4.2.4 Data needs

4.2.5 Access and availability

4.3 Non-functional user requirements: Who will use it and what do they want?

4.3.1 Demographics and characteristics

4.3.2 User engagement patterns

4.3.3 User mental models and domain knowledge

4.3.4 User environment and state of mind

4.4 Non-functional system requirements: How will you build it?

4.4.1 Recognizer, parser, and interpreter

4.4.2 External data sources

4.4.3 Data storage and data access

4.4.4 Other system concerns

4.5 Summary

5 From discovery to UX and UI: Tools of the voice design trade

5.1 Where to find early user data on any budget

5.1.1 Online research and crowd sourcing options

5.1.2 Dialog participant observation

5.1.3 Focus groups, interviews, and surveys

5.2 How discovery results feed into VUI design decisions

5.2.1 Dialog manager graph, yes; hierarchical decision tree, no

5.3 Capturing and documenting VUI design

5.3.1 Dialog flows

5.3.2 Sample dialogs

5.3.3 Detailed design specifications

5.3.4 VUI design documentation approaches

5.4 Prototyping and testing your assumptions

5.4.1 Early voice UX and prototyping approaches

5.5 What’s next?

5.6 Summary

Part 3: Building Voice-First Interactions

6 Applying human 'rules of dialog' to reach voice-first dialog resolution

6.1 Dialog acts, games and turns – and Grice

6.2 Question answering

6.3 Action requests

6.4 Task completion requests

6.5 Fully specified requests

6.5.1 Single slot requests

6.5.2 Multi-slot requests

6.6 Determining dialog acts based on feature discovery

6.7 Dialog completion

6.7.1 Responding to ‘goodbye’ and ‘thanks’

6.8 What’s next?

6.9 Summary

7 Resolving incomplete requests through disambiguation

7.1 Incomplete requests

7.1.1 Reaching completeness through dialog management

7.2 Ambiguous requests

7.3 Disambiguation methods

7.3.1 Logic-based assumptions

7.3.2 Yes/No questions

7.3.3 A/B sets

7.3.4 Static lists

7.3.5 Dynamic lists

7.3.6 Open sets

7.3.7 Menus

7.4 Testing on the device to find and solve issues

7.4.1 Two big lessons

7.5 Webhooks 1: Toward code independence

7.5.1 Fulfillment and webhooks

7.5.2 Webhook overview

7.5.3 Webhook in depth

7.5.4 Contexts, context parameters and follow-up intents

7.6 What’s next?

7.7 Summary

8 Conveying reassurance with confidence and confirmation

8.1 Conveying reassurance and shared certainty

8.1.1 Setting expectations with your implications

8.2 Webhooks 2

8.2.1 Dialogflow system architecture

8.2.2 The webhook request

8.2.3 The webhook response

8.2.4 Implementing the webhook

8.3 Confirmation methods

8.3.1 Non-verbal confirmation

8.3.2 Generic acknowledgment

8.3.3 Implicit confirmation

8.3.4 Explicit confirmation

8.4 Confirmation placement – confirming slots versus intents

8.5 Disconfirmation: dealing with “no”

8.6 Additional reassurance techniques and pitfalls

8.6.1 System pronunciation

8.6.2 Backchannels

8.6.3 Discourse markers

8.6.4 VUI architecture

8.7 Choosing the right confirmation method

8.8 Summary

9 Helping users succeed through consistency

9.1 Universals

9.1.1 Providing clarification and additional information

9.1.2 Providing a do-over

9.1.3 Providing an exit

9.1.4 Coding universals

9.2 Navigation

9.2.1 Landmarks

9.2.2 Non-verbal audio

9.2.3 Content playback navigation

9.2.4 List navigation

9.3 Consistency

9.3.1 Working with built-in global intents

9.3.2 Consistency across VUIs and frameworks

9.4 What’s next?

9.5 Summary

10 Creating robust coverage for speech-to-text resolution

10.1 Recognition is speech-to-text resolution

10.2 Inside the STT box

10.3 Recognition engines

10.4 Grammar concepts

10.4.1 Coverage

10.4.2 Recognition space

10.4.3 Static or dynamic, large or small

10.4.4 End-pointing

10.4.5 Multiple hypotheses

10.5 Types of grammars

10.5.1 Rule-based grammars

10.5.2 Statistical models

10.5.3 Hot words

10.5.4 Wake words and invocation names

10.6 Working with grammars

10.6.1 Writing regular expressions

10.7 How to succeed with grammars

10.7.1 Bootstrap

10.7.2 Normalize punctuation and spellings

10.7.3 Handle unusual pronunciations

10.7.4 Use dictionaries and domain knowledge

10.7.5 Understand the strengths and limitations of STT

10.8 Limitations on grammar creation and use

10.9 Summary

11 Ensuring shared understanding through parsing and intent resolution

12 Using accuracy strategies to avoid misunderstandings

13 Using error strategies to recover from miscommunication

14 Using world knowledge to improve interpretation and experience

15 Incorporating personalization and customization for broader user appeal

16 Using context and proactive behavior for smarter dialogs

17 Using speaker identity for privacy and security

18 Meeting user expectations through persona and voice

19 Addressing limitations through modality, location, and eco systems

Part 4: Verifying and Deploying Voice-First Interactions

20 Finding and understanding the data that tells you what’s working

21 How users tell you what to improve

22 Voice-first discovery revisited

Appendixes:

Appendix A: Future directions and other in-depth topics

Appendix B: Checklists

Appendix C: Documentation templates and samples

Appendix D: References and sources

Appendix E: Terminology

About the Technology

New platforms and tools make voice apps easier to create than ever before. The unfortunate downside is a flood of sub-par apps that leave users frustrated by easily avoidable bugs, design flaws, and installation glitches. To build voice apps you need intermediate-level skills in a language like Python or JavaScript, along with a solid command of how voice-to-machine interactions work. Being voice-first means leveraging knowledge about other modes, like chat, while incorporating voice-specific knowledge into the process. Like any other application style, voice-centric software requires a proven strategy of planning, designing, building, testing, deploying, and assessing until you get it right.
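To make that voice-to-machine round trip concrete, here is a minimal sketch (not taken from the book) of a Dialogflow-style fulfillment webhook in Python. It assumes Flask is available, and the findRestaurant intent name and cuisine parameter are hypothetical placeholders: the platform posts JSON after speech-to-text and intent resolution, and the webhook returns the text the assistant should speak.

# Minimal sketch of a Dialogflow ES-style fulfillment webhook (illustrative only).
# The intent name "findRestaurant" and the "cuisine" parameter are hypothetical.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    req = request.get_json(silent=True) or {}
    query = req.get("queryResult", {})
    intent = query.get("intent", {}).get("displayName", "")
    params = query.get("parameters", {})

    if intent == "findRestaurant":
        cuisine = params.get("cuisine") or "nearby"
        reply = f"Okay, looking for {cuisine} restaurants near you."
    else:
        reply = "Sorry, I didn't catch that. What kind of food are you in the mood for?"

    # Dialogflow ES reads the spoken response from fulfillmentText.
    return jsonify({"fulfillmentText": reply})

if __name__ == "__main__":
    app.run(port=5000)

In a real deployment this endpoint would sit behind the fulfillment URL configured in the console, and the response text would in turn be rendered by text-to-speech on the device.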

About the book

Voice-First Development is your personal roadmap to developing successful voice applications. In this insightful guide, you’ll build a solid foundation in modern voice technologies and get your feet wet writing your first speech interaction apps. As you progress, you’ll devise an effective plan for balancing business and product requirements, technology dependencies, and user needs. Through interesting and practical examples, you’ll be immersed in design-informed development, with code and techniques that address the distinctive characteristics of voice-first interactions. Finally, you'll ensure your apps succeed by applying Ann and Charles’s techniques for testing and debugging. This practical tutorial delivers just-in-time, actionable steps and tips for making great voice apps, no matter the scope, topic, or users!

What's inside

  • Planning, building, verifying, and deploying voice apps
  • Applying human rules of dialog
  • Accuracy strategies for avoiding misunderstandings
  • Using world knowledge to improve user experiences
  • Error strategies for recovering from miscommunications
  • Using context for smarter dialogs
  • Pitfalls and how to avoid them
  • Real-world examples and code samples in JavaScript and Python

About the reader

For developers with intermediate JavaScript or Python skills.

About the authors

Ann Thymé-Gobbel and Charles Jankowski have worked in speech recognition and natural language understanding for over 30 years. Ann is currently the Voice UI/UX Design Leader at Sound United. She holds a Ph.D. in cognitive science and linguistics from UC San Diego. Charles is currently Director of NLP Applications at CloudMinds Technologies. He holds S.B., S.M., and Ph.D. degrees from MIT. Together, Ann and Charles created a multi-modal conversational natural language interface to assist acute and chronic care patients.

Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.

  • MEAP combo: $49.99 (pBook + eBook + liveBook)
  • MEAP eBook: $39.99 (PDF + ePub + Kindle + liveBook)