Voice UI Systems
Designing, Developing, and Deploying Conversational Interfaces
Ann Thymé-Gobbel, Ph.D. and Charles R. Jankowski Jr., Ph.D.
  • MEAP began November 2018
  • Publication in Summer 2021 (estimated)
  • ISBN 9781617295461
  • 513 pages (estimated)
  • printed in black & white

A very good book on the unique challenges you face when trying to build programs using voice technology.

William Wade
Voice-commanded applications are everywhere, running on smart speakers like the Amazon Echo and Google Home, digital assistants like Apple’s Siri, speech-based automotive chatbots, and even novelties like the Alexa-enabled Big Mouth Billy Bass. In Voice UI Systems, authors Ann Thymé-Gobbel and Charles Jankowski draw on more than three decades of experience in voice-related development and research to bring you up to speed on a host of voice-controlled applications. This engaging guide focuses on end-to-end voice app development, concrete best practices, and how to avoid common pitfalls. With practical instruction, real-world examples, and lots of code samples, this book is perfect for developers ready to create fully functioning voice solutions that users will love!

About the Technology

New platforms and tools make voice apps easier to create than ever before. The unfortunate downside is a flood of sub-par apps that leave users frustrated by easily avoidable bugs, design flaws, and installation glitches. To build voice apps you need intermediate-level skills in a language like Python or JavaScript, along with a solid command of how voice-to-machine interactions work. Being voice-first means leveraging knowledge about other modes, like chat, while incorporating voice-specific knowledge into the process. Like any other application style, voice-centric software requires a proven strategy of planning, designing, building, testing, deploying, and assessing until you get it right.
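To give a feel for the kind of code involved, here is a minimal sketch of webhook-style fulfillment logic of the sort platforms like Dialogflow use, written as a pure Python function so it runs without a web framework. The intent name ("FindRestaurant") and parameter name ("cuisine") are hypothetical illustrations, not examples taken from the book.

```python
# Sketch of Dialogflow-style webhook fulfillment logic as a pure
# function. Intent and parameter names are hypothetical examples.

def handle_webhook(request_json):
    """Map a webhook request payload to a spoken fulfillment response."""
    query = request_json.get("queryResult", {})
    intent = query.get("intent", {}).get("displayName", "")
    params = query.get("parameters", {})

    if intent == "FindRestaurant":
        cuisine = params.get("cuisine") or "any"
        text = f"Looking for {cuisine} restaurants near you."
    else:
        # Fall back gracefully when no handler matches the intent.
        text = "Sorry, I can't help with that yet."

    # Dialogflow expects the spoken reply under "fulfillmentText".
    return {"fulfillmentText": text}
```

In a deployed app this function would sit behind an HTTPS endpoint that the voice platform calls on each user turn; the book's later chapters on webhooks cover that wiring in depth.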

About the book

Voice UI Systems is your personal roadmap to developing successful voice applications. In this insightful guide, you’ll get a solid foundation in modern voice technologies and get your feet wet writing your first speech interaction apps. As you progress, you’ll devise an effective plan for balancing business and product requirements, technology dependencies, and user needs. Through interesting and practical examples, you’ll be immersed in design-informed development with code and techniques that address the various characteristics of voice-first interactions. Finally, you'll ensure your apps succeed by applying Ann and Charles’s techniques for testing and debugging. This practical tutorial delivers just-in-time, actionable steps and tips for making great voice apps, no matter the scope, topic, or users!
Table of Contents

Part 1: Voice-First Foundations

1 Voice-first development components

1.1 Voice-first, voice-only, and conversational everything

1.2 Introduction to voice technology components

1.2.1 Speech to text

1.2.2 Natural language understanding

1.2.3 Dialog management

1.2.4 Natural language generation

1.2.5 Text to speech

1.3 Meet the phases of voice-first development

1.3.1 Plan

1.3.2 Design

1.3.3 Build

1.3.4 Test

1.3.5 Deploy & Assess

1.3.6 Iterate

1.4 Hope is not a strategy—but to plan & execute is

1.5 What’s next?

1.6 Summary

2 Keeping voice in mind

2.1 Why voice is different

2.2 Hands-on: A pre-coding thought experiment

2.3 Voice dialog and its participants

2.3.1 Human spoken language

2.3.2 Voice system speech and understanding

2.3.3 Human-computer voice dialog

2.4 What’s next?

2.5 Summary

3 Running a voice-first application – and noticing issues

3.1 Hands-on: Preparing the restaurant finder

3.2 Say hello to voice platforms

3.3 Hands-on: A Google restaurant finder action

3.3.1 Basic setup

3.3.2 Specifying a first intent

3.3.3 Doing something

3.3.4 What the user says

3.3.5 What the VUI says

3.3.6 Connecting Dialogflow to Actions on Google

3.3.7 Testing the app

3.3.8 Saving the voice interaction

3.4 Why we’re using Actions on Google and Assistant

3.5 Google’s voice development ecosystem

3.6 The pros and cons of relying on tools

3.7 Hands-on: Making changes

3.7.1 Adding more phrases

3.7.2 Eating something else

3.7.3 Asking for something more specific

3.8 What’s next?

3.9 Summary

Part 2: Planning Voice-First Interactions

4 Defining your vision: Building What, How, and Why for Whom

4.1 Functional requirements: What are you building?

4.1.1 General functionality

4.1.2 Detailed functionality

4.1.3 Supporting functionality

4.2 Non-functional business requirements: Why are you building it?

4.2.1 General purpose and goal

4.2.2 Underlying service and existing automation

4.2.3 Branding and terminology

4.2.4 Data needs

4.2.5 Access and availability

4.3 Non-functional user requirements: Who will use it and what do they want?

4.3.1 Demographics and characteristics

4.3.2 User engagement patterns

4.3.3 User mental models and domain knowledge

4.3.4 User environment and state of mind

4.4 Non-functional system requirements: How will you build it?

4.4.1 Recognizer, parser, and interpreter

4.4.2 External data sources

4.4.3 Data storage and data access

4.4.4 Other system concerns

4.5 Summary

5 From discovery to UX and UI: Tools of the voice design trade

5.1 Where to find early user data on any budget

5.1.1 Online research and crowd sourcing options

5.1.2 Dialog participant observation

5.1.3 Focus groups, interviews, and surveys

5.2 How discovery results feed into VUI design decisions

5.2.1 Dialog manager graph, yes; hierarchical decision tree, no

5.3 Capturing and documenting VUI design

5.3.1 Dialog flows

5.3.2 Sample dialogs

5.3.3 Detailed design specifications

5.3.4 VUI design documentation approaches

5.4 Prototyping and testing your assumptions

5.4.1 Early voice UX and prototyping approaches

5.5 What’s next?

5.6 Summary

Part 3: Building Voice-First Interactions

6 Applying human 'rules of dialog' to reach conversation resolution

6.1 Dialog acts, games and turns – and Grice

6.2 Question answering

6.3 Action requests

6.4 Task completion requests

6.5 Fully specified requests

6.5.1 Single slot requests

6.5.2 Multi-slot requests

6.6 Determining dialog acts based on feature discovery

6.7 Dialog completion

6.7.1 Responding to ‘goodbye’ and ‘thanks’

6.8 What’s next?

6.9 Summary

7 Resolving incomplete requests through disambiguation

7.1 Incomplete requests

7.1.1 Reaching completeness through dialog management

7.2 Ambiguous requests

7.3 Disambiguation methods

7.3.1 Logic-based assumptions

7.3.2 Yes/No questions

7.3.3 A/B sets

7.3.4 Static lists

7.3.5 Dynamic lists

7.3.6 Open sets

7.3.7 Menus

7.4 Testing on the device to find and solve issues

7.4.1 Two big lessons

7.5 Webhooks 1: Toward code independence

7.5.1 Fulfillment and webhooks

7.5.2 Webhook overview

7.5.3 Webhook in depth

7.5.4 Contexts, context parameters and follow-up intents

7.6 What’s next?

7.7 Summary

8 Conveying reassurance with confidence and confirmation

8.1 Conveying reassurance and shared certainty

8.1.1 Setting expectations with your implications

8.2 Webhooks 2

8.2.1 Dialogflow system architecture

8.2.2 The webhook request

8.2.3 The webhook response

8.2.4 Implementing the webhook

8.3 Confirmation methods

8.3.1 Non-verbal confirmation

8.3.2 Generic acknowledgment

8.3.3 Implicit confirmation

8.3.4 Explicit confirmation

8.4 Confirmation placement – confirming slots versus intents

8.5 Disconfirmation: dealing with “no”

8.6 Additional reassurance techniques and pitfalls

8.6.1 System pronunciation

8.6.2 Backchannels

8.6.3 Discourse markers

8.6.4 VUI architecture

8.7 Choosing the right reassurance method

8.8 Summary

9 Helping users succeed through consistency

9.1 Universals

9.1.1 Providing clarification and additional information

9.1.2 Providing a do-over

9.1.3 Providing an exit

9.1.4 Coding universals

9.2 Navigation

9.2.1 Landmarks

9.2.2 Non-verbal audio

9.2.3 Content playback navigation

9.2.4 List navigation

9.3 Consistency, variation and randomization

9.3.1 Working with built-in global intents

9.3.2 Consistency and standards across VUIs and frameworks

9.4 What’s next?

9.5 Summary

10 Creating robust coverage for speech-to-text resolution

10.1 Recognition is speech-to-text interpretation

10.2 Inside the STT box

10.3 Recognition engines

10.4 Grammar concepts

10.4.1 Coverage

10.4.2 Recognition space

10.4.3 Static or dynamic, large or small

10.4.4 End-pointing

10.4.5 Multiple hypotheses

10.5 Types of grammars

10.5.1 Rule-based grammars

10.5.2 Statistical models

10.5.3 Hot words

10.5.4 Wake words and invocation names

10.6 Working with grammars

10.6.1 Writing rule-based regular expressions

10.7 How to succeed with grammars

10.7.1 Bootstrap

10.7.2 Normalize punctuation and spellings

10.7.3 Handle unusual pronunciations

10.7.4 Use domain knowledge

10.7.5 Understand the strengths and limitations of STT

10.8 A simple example

10.8.1 Sample phrases in Dialogflow

10.8.2 Regular expressions in the webhook

10.9 Limitations on grammar creation and use

10.10 What’s next?

10.11 Summary

11 Reaching understanding through parsing and intent resolution

11.1 From words to meaning

11.1.1 NLP

11.1.2 NLU

11.2 Parsing

11.3 Machine learning and NLU

11.4 Ontologies, knowledge bases and content databases

11.5 Intents

11.5.1 Intent tagging and tagging guides

11.5.2 Middle layers: semantic tags versus system endpoints

11.6 Putting it all together

11.6.1 Matching wide or narrow

11.6.2 Multiple grammars, multiple passes

11.7 A simple example

11.7.1 The Stanford Parser revisited

11.7.2 Determining intent

11.7.3 Machine learning and using knowledge

11.8 What’s next?

11.9 Summary

12 Applying accuracy strategies to avoid misunderstanding

12.1 Accuracy robustness concepts

12.2 Accuracy robustness strategies

12.2.1 Examples

12.2.2 Providing help

12.2.3 Just-in-time information

12.2.4 Hidden options and “none of those”

12.2.5 Recognize-and-reject

12.2.6 One-step correction

12.2.7 Tutorials

12.2.8 Spelling

12.2.9 Narrowing recognition space

12.3 Advanced techniques

12.3.1 Multi-tiered behavior and confidence scores

12.3.2 N-best and skip lists

12.3.3 Probabilities

12.3.4 Contextual latency

12.4 What’s next?

12.5 Summary

13 Choosing strategies to recover from miscommunication

13.1 Recovery from what?

13.1.1 Recognition, intent, or fulfillment errors

13.2 Recovery strategies

13.2.1 Meaningful contextual prompts

13.2.2 Escalating prompts

13.2.3 Tapered prompts

13.2.4 Rapid reprompt

13.2.5 Backoff strategies

13.3 When to stop trying

13.3.1 Max error counts

13.3.2 Transfers

13.4 Choosing recovery strategy

13.5 What’s next?

13.6 Summary

14 Using context and data to create smarter conversations

14.1 Why there’s no conversation without context

14.2 Reading and writing data

14.2.1 External accounts and services

14.2.2 External data from a system perspective

14.3 Persistence within and across conversations

14.4 Context-aware and context-dependent dialogs

14.4.1 Discourse markers and acknowledgments

14.4.2 Anaphora resolution

14.4.3 Follow-up dialogs and linked requests

14.4.4 Proactive behaviors

14.4.5 Topic, domain and world knowledge

14.4.6 Geo location-based behavior

14.4.7 Proximity and relevance

14.4.8 Number and type of devices

14.4.9 Time and day

14.4.10 User identity, preferences and account types

14.4.11 User utterance wording

14.4.12 System conditions

14.5 Tracking context in modular and multiturn dialogs

14.5.1 Fulfillment

14.6 What’s next?

14.7 Summary

15 Creating secure personalized experiences

15.1 The importance of knowing who’s talking

15.2 Individualized targeted behaviors

15.2.1 Concepts in personalization and customization

15.2.2 Implementing individualized experiences

15.3 Authorized secure access

15.3.1 Approaches to identification and authentication

15.3.2 Implementing secure gated access

15.4 Privacy and security concerns

15.5 System persona

15.5.1 Defining and implementing a system persona

15.5.2 How persona affects dialogs

15.6 System voice audio

15.6.1 TTS or voice talent, generated or recorded

15.6.2 Finding and working with voice talents

15.6.3 One or several voices

15.6.4 Prompt management

15.7 Emotion and style

15.8 Voice for specific user groups

15.9 What’s next?

15.10 Summary

Part 4: Verifying and Deploying Voice-First Interactions

16 Testing and measuring performance in voice systems

16.1 Testing voice system performance

16.1.1 Recognition testing

16.1.2 Dialog traversal: functional end-to-end testing

16.1.3 Wake-word and speech detection testing

16.1.4 Additional system integration testing

16.2 Testing usability and task completion

16.2.1 Voice usability testing concepts

16.2.2 Wizard of Oz studies

16.3 Tracking and measuring performance

16.3.1 Recognition performance metrics

16.3.2 Task completion metrics

16.3.3 User satisfaction metrics

16.4 What’s next?

16.5 Summary

17 Tuning and deploying voice systems

17.1 Tuning: what is it and why do you do it?

17.1.1 Why recognition accuracy isn’t enough

17.1.2 Analyzing causes of poor system performance

17.2 Tuning types and approaches

17.2.1 Log-based versus transcription-based tuning

17.2.2 Coverage tuning

17.2.3 Recognition accuracy tuning

17.2.4 Finding and using recognition accuracy data

17.2.5 Task completion tuning

17.2.6 Dialog tuning

17.2.7 How to prioritize your tuning efforts

17.3 Mapping observations to the right remedy

17.3.1 Reporting and using tuning results

17.4 How to maximize deployment success

17.4.1 Know when to tune

17.4.2 Understand tuning complexities to avoid pitfalls

17.5 What’s next?

17.6 Summary

What's inside

  • Planning, building, verifying, and deploying voice apps
  • Applying human rules of dialog
  • Accuracy strategies for avoiding misunderstandings
  • Using world knowledge to improve user experiences
  • Error strategies for recovering from miscommunications
  • Using context for smarter dialogs
  • Pitfalls and how to avoid them
  • Real-world examples and code samples in JavaScript and Python

About the reader

For developers with intermediate JavaScript or Python skills.

About the authors

Ann Thymé-Gobbel and Charles Jankowski have worked in speech recognition and natural language understanding for over 30 years. Ann is currently the Voice UI/UX Design Leader at Sound United. She holds a Ph.D. in cognitive science and linguistics from UC San Diego. Charles is currently Director of NLP Applications at CloudMinds Technologies. He holds S.B., S.M., and Ph.D. degrees from M.I.T. Together Ann and Charles created a multi-modal conversational natural language interface to assist acute and chronic care patients.
