5 Improving weak understanding for traditional AI
This chapter presents a practical, iterative approach to strengthening the intent understanding of traditional, classification-based conversational AI. It begins by establishing a clear measurement framework—beyond simple accuracy—using recall, precision, and F1 to diagnose false positives and false negatives at the intent level. With a representative blind test set (or k-fold as an early proxy), teams identify where the model is confused, visualize misclassifications with a confusion matrix, and prioritize fixes by business impact and volume. The guidance emphasizes incremental changes, testing after each update, and aligning training data distribution with real user demand captured in production logs.
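To make the measurement framework concrete, here is a minimal sketch (not from the chapter) that scores a blind test set with scikit-learn; the intent names and predictions are hypothetical placeholders.

```python
# Minimal sketch: per-intent precision, recall, and F1 on a blind test set.
# Intent names and predictions are hypothetical placeholders.
from sklearn.metrics import classification_report

# Actual intents from the blind test set, and the classifier's predictions.
y_true = ["#Reset_Password", "#Reset_Password", "#Request_Agent",
          "#Store_Hours", "#Store_Hours"]
y_pred = ["#Reset_Password", "#Request_Agent", "#Request_Agent",
          "#Store_Hours", "#Store_Hours"]

# Reports precision, recall, and F1 per intent, plus overall accuracy,
# which makes false positives and false negatives visible per intent.
print(classification_report(y_true, y_pred, zero_division=0))
```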
Through a stepwise improvement plan, the chapter demonstrates several high-yield tactics. It shows how adding real-user examples can lift recall for underperforming intents (e.g., a login issue intent), how pruning redundant or unrepresentative examples can boost precision (e.g., chitchat about the assistant), and how targeted data augmentation can raise an intent’s F1 when both recall and precision matter. When heavy overlap creates persistent confusion, merging related intents and using entity detection to route to the correct downstream answer can simplify the space and improve accuracy. Iterating across versions in this manner moves a working solution from ~76% to ~92% overall blind-test accuracy, while cautioning against overfitting small datasets and reminding teams to keep performance grounded in fresh, representative logs.
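As an illustration of the merge-and-route tactic, the sketch below (with hypothetical intent, entity, and answer names) shows how entity detection can select the correct downstream answer for a merged intent.

```python
# Sketch: two heavily overlapping cancellation intents merged into one,
# with a detected entity routing to the correct static answer.
# Intent, entity, and answer text are hypothetical.
ANSWERS = {
    ("#Cancel_Account", "license"): "To cancel a license, open ...",
    ("#Cancel_Account", "registration"): "To cancel a registration, open ...",
}

def route(intent: str, entities: dict) -> str:
    target = entities.get("cancel_target")
    answer = ANSWERS.get((intent, target))
    if answer is None:
        # No entity detected: ask a clarifying question rather than guess.
        return "Would you like to cancel a license or a registration?"
    return answer

print(route("#Cancel_Account", {"cancel_target": "license"}))
```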
Beyond fixes to existing scope, the chapter outlines how to expand coverage by clustering unmatched utterances into new intents (e.g., canceling licenses/registrations or addressing a data breach) and deciding when to stop adding intents with a long-tail analysis that separates high-value topics from low-frequency outliers. It closes by showing how to enrich traditional intent-driven responses with a light layer of generative text: an LLM can prepend a brief, empathetic, situation-aware greeting and summary before delivering a consistent, static answer. This hybrid pattern improves user experience without sacrificing compliance or answer stability, and pairs well with a continuous, data-driven improvement loop anchored in blind testing and careful prioritization.
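One simple way to cluster unmatched utterances is sketched below; TF-IDF with k-means stands in for whatever embedding and clustering method your platform provides, and the utterances and cluster count are hypothetical.

```python
# Sketch: cluster unmatched production utterances to surface candidate
# new intents. TF-IDF + k-means stands in for embedding-based clustering.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

unmatched = [
    "how do I cancel my license",
    "please cancel my registration",
    "I think my data was part of the breach",
    "was my account affected by the data breach",
]

vectors = TfidfVectorizer().fit_transform(unmatched)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Review each cluster by hand; a coherent, high-volume cluster is a
# candidate for a new intent.
for cluster, text in sorted(zip(labels, unmatched)):
    print(cluster, text)
```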
In a 2x2 confusion matrix, the four possible outcomes (true positive, false positive, false negative, and true negative) are derived by comparing the predicted intent to the actual intent.
The highlighted columns are used for calculating precision and recall for #Request_Agent.
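For reference, these are the standard definitions behind the calculation: for a single intent such as #Request_Agent, with TP, FP, and FN read from the confusion matrix,

```latex
% Per-intent metrics from confusion-matrix counts.
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}
              {\mathrm{precision} + \mathrm{recall}}
```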
A solid diagonal line shows that each predicted intent (represented by a single letter) matched the actual intent.
This model had nine correct predictions, but wrongly predicted intent G when the actual intent was E.
A comparison of training examples to the utterances in our representative blind test set shows a volume disparity for many of the most popular intents (ranked by representative blind utterances, on the left) and for several of the least popular intents (on the right).
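A quick way to quantify this disparity is to compare each intent's share of the training data with its share of blind-test traffic, as in this sketch (all counts are hypothetical):

```python
# Sketch: compare each intent's share of training examples to its share
# of blind-test utterances; large gaps suggest realigning training data
# with real demand. All counts are hypothetical.
train = {"reset_password": 20, "request_agent": 60, "store_hours": 25}
test = {"reset_password": 180, "request_agent": 40, "store_hours": 15}

train_total, test_total = sum(train.values()), sum(test.values())
for intent in sorted(test, key=test.get, reverse=True):
    train_share = train.get(intent, 0) / train_total
    test_share = test[intent] / test_total
    print(f"{intent:16s} train={train_share:6.1%} test={test_share:6.1%}")
```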
Confusion matrix after V4 update. The density of shading represents the volume of questions predicted for a given intent. If a classifier test had a perfect accuracy score, you would see a solid black diagonal line running from the upper left corner to the lower right corner. The shaded squares that stray from this diagonal line mark the areas of confusion within your model.
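A shaded matrix like this can be rendered directly from predictions, for example with scikit-learn (labels and data below are hypothetical):

```python
# Sketch: render a shaded confusion matrix from blind-test predictions.
# Intent labels and predictions are hypothetical.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = ["A", "A", "B", "B", "C", "C", "E"]
y_pred = ["A", "A", "B", "C", "C", "C", "G"]

# With a grayscale colormap, darker off-diagonal cells show confusion.
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, cmap="Greys")
plt.show()
```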
Comparison of baseline (V1) confusion matrix to V8 update.
Example of a long-tail chart. The terms we use to describe the volume distribution of our available training data are “short head” and “long tail.” These terms describe the shape the data takes when rendered on a bar chart. The heavier-volume intents are on the left (the short head), and as the volume decreases for each intent, the data takes on the appearance of a long tail falling off to the right.
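A long-tail chart is straightforward to produce: sort intents by volume and plot a bar chart, as in this sketch (intent names and counts are hypothetical).

```python
# Sketch: plot intent volume in descending order to expose the short head
# and long tail. Intent names and counts are hypothetical.
import matplotlib.pyplot as plt

volumes = {"reset_password": 420, "request_agent": 310, "store_hours": 95,
           "cancel_license": 40, "data_breach": 12, "misc_other": 3}

intents = sorted(volumes, key=volumes.get, reverse=True)
plt.bar(intents, [volumes[i] for i in intents])
plt.xticks(rotation=45, ha="right")
plt.ylabel("Utterance volume")
plt.tight_layout()
plt.show()
```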
In a traditional (classification-based) dialogue pattern, an intent is identified, and the dialogue is configured to give a static or minimally personalized answer.
The solution identifies the correct intent using traditional AI, then prepends generated text to the static output response configured for that intent. The generated greeting and summary convey to the user that the bot understands their goal and the particulars of their situation.
LLMs can be called within traditional dialogue patterns to greet a user and summarize their problem before delivering a pre-defined or static answer.
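A minimal sketch of this hybrid pattern follows; `call_llm` is a hypothetical stand-in for your LLM client, and the intent and answer text are illustrative.

```python
# Sketch of the hybrid pattern: an LLM writes a short, empathetic,
# situation-aware greeting, then the vetted static answer is appended
# unchanged. `call_llm` is a hypothetical stand-in for an LLM client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM client of choice")

STATIC_ANSWERS = {  # pre-approved, compliance-reviewed answers
    "#Reset_Password": "To reset your password, visit the account portal ...",
}

def respond(intent: str, user_utterance: str) -> str:
    greeting = call_llm(
        "In one or two sentences, empathetically acknowledge this request "
        f"without answering it: {user_utterance!r}"
    )
    # The static answer is delivered verbatim, so compliance and answer
    # stability are preserved.
    return f"{greeting}\n\n{STATIC_ANSWERS[intent]}"
```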
Summary
- A classifier’s performance can be measured in terms of accuracy, recall, precision, and F1-score. These measurements reflect the types of errors a classifier may be committing.
- The performance metrics produced by your testing will inform your next steps for improving the classifier. Higher-volume intents with low performance are a good place to start.
- Iterative test/train cycles will show you the effects of your changes.
- A chatbot can use additional strategies, such as disambiguation, clarifying questions, and entity detection, to overcome confusion or to route answers for merged intents.
- A chatbot with a strong classifier can deliver more business value by getting the right answer on the first try and deflecting work that would otherwise be handled by a human agent. You should plan to monitor and retrain your solution throughout the life of the bot.
- Generative AI can supplement a traditional AI solution by infusing static chatbot responses with personalization and empathy, which enhances the perception of understanding.