Table of contents

Part 1 Framework for improving conversational AI

1 What makes conversational AI work?

1.1 Introduction to conversational AI

1.1.1 Why use conversational AI?

1.1.2 How does conversational AI work?

1.1.3 How you build conversational AI

1.2 Introduction to generative AI in conversational AI

1.2.1 What is generative AI

1.2.2 Generative AI guardrails

1.2.3 Effectively using generative AI in conversational AI

1.3 Introducing continuous improvement in conversational AI

1.3.1 Why continuously improve

1.3.2 The continuous improvement cycle

1.3.3 Communicating continuous improvement to stakeholders

1.4 Follow along

Summary

2 Building a conversational AI

2.1 Building an FAQ bot

2.1.1 FAQ bot foundations

2.1.2 Static question and answering

2.1.3 Dynamic question and answering

2.2 Routing agents and process-oriented bots

2.2.1 Routing agents

2.2.2 Transitioning from a routing agent to a process-oriented bot

2.3 Responding to the user with generative AI

2.3.1 Integrating with an LLM

2.3.2 Routing requests to an LLM

Summary

3 Planning for improvement

3.1 Knowing when you need to improve

3.2 Your cross-functional team

3.3 Driving to the same goal

3.3.1 Revisit business goals

3.3.2 Effectiveness

3.3.3 Coverage

3.4 Identifying and resolving problems

3.4.1 Finding problems

3.4.2 Group review

3.4.3 Determining acceptance criteria

3.5 Developing and delivering fixes

3.5.1 Sprint planning

3.5.2 Measure again

Summary

Part 2 Pattern: AI doesn’t understand

4 Understanding what your users really want

4.1 Fundamentals of understanding

4.1.1 The impact of weak understanding

4.1.2 What causes weak understanding?

4.1.3 How do we achieve understanding with traditional conversational AI?

4.1.4 How do we achieve understanding with generative AI?

4.2 How is understanding measured?

4.2.1 Measuring understanding for traditional (classification-based) AI

4.2.2 Measuring understanding for generative AI

4.2.3 Measuring understanding with direct user feedback

4.3 Assessing where you are today

4.3.1 Assessing your traditional (classification-based) AI solution

4.3.2 Assessing your generative AI solution

4.4 Obtaining and preparing test data from logs

4.4.1 Obtaining production logs

4.4.2 Guidelines for identifying candidate test utterances

4.4.3 Preparing and scrubbing data for use in iterative improvements

4.4.4 The annotation process

4.5 What does the data tell us?

4.5.1 Interpreting annotated logs for traditional (classification-based) AI

4.5.2 Interpreting annotated logs for generative AI

4.5.3 The case for iterative improvement

Summary

5 Improving weak understanding for traditional AI

5.1 Building your improvement plan

5.1.1 Identify problematic patterns in misunderstood utterances

5.1.2 Incremental improvements

5.1.3 Where to start: Identifying the biggest problems

5.2 Solving “wrong intent matched”

5.2.1 Improve recall for one intent

5.2.2 Improve precision for one intent

5.2.3 Improve the F1 score for one intent

5.2.4 Improve precision and recall for multiple intents

5.3 Solving “no intent matched”

5.3.1 Clustering utterances for new intents

5.3.2 When to stop adding intents

5.4 Supplementing traditional AI with generative content

5.4.1 Combining traditional and generative AI for an intent

5.4.2 Prompting to convey understanding

Summary

6 Enhancing responses with retrieval-augmented generation

6.1 Beyond intents: The role of search in conversational AI

6.1.1 Using search in conversational AI

6.1.2 Benefits of traditional search

6.1.3 Drawbacks of traditional search

6.2 Beyond search: Generating answers with RAG

6.2.1 Using RAG in conversational AI

6.2.2 Benefits of RAG

6.2.3 Combining RAG with other generative AI use cases

6.2.4 Comparing intents, search, and RAG approaches

6.3 How is RAG implemented?

6.3.1 High-level implementation

6.3.2 Preparing your document repository for RAG

6.4 Additional considerations of RAG implementations

6.4.1 Can’t we just use an LLM directly?

6.4.2 Keeping answers current and relevant with RAG

6.4.3 How easy is it to set up the ingestion pipeline?

6.4.4 Handling latency

6.4.5 When to use a fallback mechanism and when to search

6.5 Evaluating and analyzing RAG performance

6.5.1 Indexing metrics

6.5.2 Retrieval metrics

6.5.3 Generation metrics

6.5.4 Comparing efficiency of indexing and embedding solutions for RAG

Summary

7 Augmenting intent data with generative AI

7.1 Getting started

7.1.1 Why do it: Pros and cons

7.1.2 What you need

7.1.3 How to use the augmented data

7.2 Hardening your existing intents

7.2.1 Get creative with synonyms

7.2.2 Generate new grammatical variations

7.2.3 Build strong intents from LLM output

7.2.4 Creating even more examples with templates

7.3 Getting more creative

7.3.1 Brainstorm additional intents

7.3.2 Check for confusion

Summary

Part 3 Pattern: AI is too complex

8 Streamlining complex flows

8.1 The pain of complexity

8.1.1 Complexity’s effect on the end user

8.1.2 Complexity’s effect on business metrics

8.1.3 The incremental cost and benefit of reducing complexity for the user

8.2 Simplifying and streamlining the user journey

8.2.1 Spotting complex dialogue flows

8.2.2 Using what is known about the user

8.2.3 Aligning with the user’s mental model

8.2.4 Allowing flexibility in the expected user responses

8.2.5 Supporting self-service task flows with API/backend processes

Summary

9 Harnessing context for an adaptive virtual assistant experience

9.1 Importance of context in virtual assistant performance

9.1.1 How context influences user interactions

9.1.2 What is contextual information?

9.2 Understanding modality

9.2.1 Comparing modalities

9.2.2 Importance of modality in designing virtual assistant flows

9.2.3 Examples of how modality affects user experience

9.2.4 Voice bot design considerations

9.3 Enhancing context awareness and improving the overall user experience with RAG

9.3.1 Designing adaptive flows with RAG

9.3.2 Strategies for retrieving and generating contextually relevant responses

9.3.3 Maintaining and updating adaptive flows

Summary

10 Reducing complexity with generative AI

10.1 AI-assisted process flows at build time

10.1.1 Generating dialogue flows with generative AI

10.1.2 Improving dialogue flow with generative AI

10.2 AI-assisted process flows at run time

10.2.1 Executing dialogue flows with generative AI

10.2.2 Using LLM for a search process

10.3 AI-assisted flows at test time

10.3.1 Setting up generative AI to be the user

10.3.2 Setting up the conversational test

Summary

Part 4 Pattern: Reduce friction

11 Reducing opt-outs

11.1 What drives opt-out behavior?

11.1.1 Immediate opt-out drivers

11.1.2 Motivations for later opt-outs

11.1.3 Gathering data on opt-out behavior

11.2 Reducing immediate opt-outs

11.2.1 Start with a great experience: Greetings and introductions

11.2.2 Convey capabilities and set expectations

11.2.3 Incentivize self-service

11.2.4 Allow the user to opt in

11.3 Reducing other opt-outs

11.3.1 Try hard to understand

11.3.2 Try hard to be understood

11.3.3 Be flexible and accommodating

11.3.4 Convey progress

11.3.5 Anticipate additional user needs

11.3.6 Don’t be rude

11.4 Opt-out retention

11.4.1 Start right away by collecting opt-out data

11.4.2 Implementing an opt-out retention flow

11.5 Improving dialogue with generative AI

11.5.1 Improving error messages with generative AI

11.5.2 Improving greeting messages with generative AI

11.6 Sometimes it’s okay to escalate

Summary

12 Conversational summarization for smooth handoff

12.1 Intro to summarization

12.1.1 Why summarization is needed

12.1.2 Elements of effective summaries

12.2 Preparing your chatbot for summarization

12.2.1 Using out-of-the-box elements

12.2.2 Instrumenting your chatbot for transcripts

12.2.3 Instrumenting your chatbot (for data points)

12.3 Improving summaries with generative AI

12.3.1 Generating a text summary of a transcript with summarizing prompts

12.3.2 Generating a structured summary of a transcript with extractive prompts

Summary

Overview

3 Planning for improvement

Planning for improvement starts with defining what “success” means for your conversational AI, because expectations differ by bot type: Q&A assistants prioritize fast, accurate answers with minimal back-and-forth; process bots focus on guiding users to complete tasks efficiently; and routing agents aim to hand users off to the right destination seamlessly. As user needs, business goals, and external conditions shift, assistants must evolve in step. The chapter illustrates this with MediWorld’s PharmaBot, which moved from a narrow information bot to a process-oriented assistant for vaccine scheduling. That shift highlights how organizations must align around continuous, data-driven improvement to meet rising expectations and reduce opt-outs and escalations.

Effective improvement requires a cross-functional team (conversational analysts, developers, data analysts, customer support, product, legal/compliance, IT, and governance) aligned on business outcomes and the metrics that prove them, from revenue impact (conversion rate, average order value or AOV, customer lifetime value or CLV) to cost reduction (containment, first-contact resolution or FCR, average handle time or AHT). The chapter urges teams to go beyond simple KPIs like containment by classifying granular conversation outcomes (e.g., automated resolution, intentional transfer, abandonment, escalations, bot-not-wanted) and linking them to design milestones to pinpoint friction. It also covers improving coverage (training data, intent clarity, retrieval/generative patterns), complementing inferred satisfaction with direct signals (thumbs up/down, NPS), and accounting for security and privacy as bots integrate deeper into transactional flows.

The improvement lifecycle is made operational through structured practices: instrument and analyze outcomes, spot trends, and drill down by last step; blend qualitative feedback with quantitative timing to diagnose issues; triage by frequency, expected return, and implementation effort; and prioritize with cost–benefit and impact–effort framing. Teams then outline high-level fixes, define clear acceptance criteria, and deliver in sprints with visible ownership and status. Crucially, they measure again against baselines to verify that changes move target metrics and ROI, iterating as needed. Governance, documentation, and regular stakeholder communication keep the roadmap aligned with policy, ethics, and evolving business needs.

PharmaBot efficiently detected informational intents from user queries.

The team identified areas for improvement using their diverse skills, setting the stage for an effective improvement plan.

PharmaBot started as a simple Q&A bot. Many question varieties got the same answer.

Q&A got more complex by detecting entities in user utterances, leading to more specific answers within a common intent.

Some question types do not have a single static answer but require a full process flow to satisfy.

PharmaBot's basic analytics dashboard shows a usage summary but cannot give insight into what users like (or don’t like) about the bot.

Basic daily dashboard showing a simple business metric: containment. This metric is tracked daily but still does not give deep insights into bot performance.

The simplest outcome definition. This does not give insight into how to improve the bot.

Breaking down why conversations are not contained gives more insight into bot performance and shows you where your bot needs improvement.

Enhancing the summary dashboard with a success rate. Not all contained calls are successful; not all transferred calls are failures.

Conversation Outcomes and Outcome Details aggregated over a set time period. This provides much greater insights into bot performance than a binary “contained or not contained” model.

High-level design of PharmaBot with milestones for significant parts of the conversation. “Schedule appointment” and “Help with anything else?” are both marked as successful paths.

When the outcome model and conversation design are overlaid, insights become apparent. Here we zoom in on the "Escalated by User" outcome, broken down by last step before the escalation.

A dashboard that breaks down an outcome by the last step taken, observed over time.

Triage of issues describing insights and recommendations related to each issue, which contributes to its priority.

Assessment of the cost of users reaching call center agents after abandoning their chatbot conversations in frustration.

An impact-effort matrix visualizes the relationship between the effort required and potential impact of a proposed change.

A sample prioritized fix list.

A prioritized table, including development sprints. Further columns, including UAT times and expected deployment dates, may be added.

Tracking conversation outcomes against deployment of changes.

Summary

- The continuous improvement cycle for conversational systems is an ongoing, iterative process.
- All improvements should drive toward the pre-defined business goals and user satisfaction.
- Meticulous metric definition, the right choice of monitoring tools, and a commitment to best practices are key.
- Use the metrics that are relevant to your bot rather than those that are easiest to measure.
- Detailed conversation outcomes allow targeting a specific set of conversations for improvement.
- Several factors can determine an issue's priority, such as its frequency of occurrence, the expected improvement and complexity of a fix, and the team's capacity.
- Regression testing and analysis of the results are critical to confirm that improvements have actually occurred.

FAQ

How do I know when my conversational AI needs improvement?

Watch for signals in your KPIs and user behavior: low containment, high fallback intent usage, frequent agent escalations, immediate opt-outs, declining engagement, or unmet business goals. Start improvements as soon as you see recurring issues and set regular review cycles so changes stay aligned with evolving user needs and organizational objectives.

Who should be on the cross-functional improvement team?

- Support/Maintenance: data analyst/engineer, chatbot developer or conversational analyst, QA tester, project manager, and other SMEs (e.g., security).
- Business stakeholders: executive leadership, customer service, chatbot product manager, IT, operations, legal and compliance.
- Governance: corporate ethics/compliance focal and a governing executive team.
Together, these groups align technical work with business value, policy, and responsible AI standards.

How do we align on what “success” means for the bot?

Revisit the original business goals and match them to bot type: Q&A bots should be fast and accurate; process-oriented bots should complete tasks efficiently (e.g., bookings); routing agents should direct users to the right destination. Translate goals into measurable metrics like accuracy, user satisfaction, coverage, conversion, AOV/CLV, AHT, FCR, and ROI.

What’s the difference between containment and true success?

Containment means the bot handled the interaction without a human, but it doesn’t guarantee a good outcome. Define outcomes more precisely:
- Success: Automated Resolution; Intentional Transfer (a planned handoff per business rules).
- Failure: Abandoned; Escalated (by user or bot).
- Chatbot Not Wanted: Immediate Disconnect; Immediate Escalation.
This breakdown gives credit for “good transfers” and avoids counting “bad containments” as wins.
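The outcome breakdown above can be sketched in a few lines of Python to show how containment and success diverge. This is a minimal illustration, not the book's implementation: the outcome labels follow the list above, while the `CONTAINED` set (which outcomes count as "no human involved") and the sample data are assumptions made for the example.

```python
from collections import Counter

# Outcome detail -> outcome category, following the breakdown above.
OUTCOME_CATEGORY = {
    "Automated Resolution": "Success",
    "Intentional Transfer": "Success",
    "Abandoned": "Failure",
    "Escalated by User": "Failure",
    "Escalated by Bot": "Failure",
    "Immediate Disconnect": "Chatbot Not Wanted",
    "Immediate Escalation": "Chatbot Not Wanted",
}

# Assumption: a conversation is "contained" when it ends without a human agent.
CONTAINED = {"Automated Resolution", "Abandoned", "Immediate Disconnect"}


def summarize(outcome_details):
    """Return containment rate, success rate, and counts per category."""
    total = len(outcome_details)
    by_category = Counter(OUTCOME_CATEGORY[d] for d in outcome_details)
    contained = sum(1 for d in outcome_details if d in CONTAINED)
    return {
        "containment_rate": contained / total,
        "success_rate": by_category["Success"] / total,
        "by_category": dict(by_category),
    }


# Hypothetical sample of ten conversations.
sample = (
    ["Automated Resolution"] * 3
    + ["Intentional Transfer"] * 2
    + ["Abandoned"] * 3
    + ["Escalated by User"]
    + ["Immediate Escalation"]
)
report = summarize(sample)
print(report)
```

In this made-up sample, containment looks better than success because abandoned conversations count as contained, while the two intentional transfers count as successes even though they were not contained.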

How can we measure effectiveness beyond basic dashboards?

Go beyond counts and confidence scores by building a detailed outcome model and instrumenting milestones in your conversation design. Track outcomes by “last milestone” to see where journeys succeed or fail, break down success/failure reasons, and trend these over time to guide prioritization and stakeholder buy-in.
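As a rough sketch of tracking outcomes by last milestone, the snippet below groups conversations by outcome and counts the final milestone each one reached. The record layout and the milestone names (`greeting`, `choose_service`, `collect_date`, `schedule_appointment`) are hypothetical; a real bot would read them from its own logs.

```python
from collections import Counter, defaultdict

# Hypothetical log records: an outcome plus the ordered milestones reached.
conversations = [
    {"outcome": "Escalated by User",
     "milestones": ["greeting", "choose_service", "collect_date"]},
    {"outcome": "Escalated by User",
     "milestones": ["greeting", "choose_service", "collect_date"]},
    {"outcome": "Escalated by User", "milestones": ["greeting"]},
    {"outcome": "Automated Resolution",
     "milestones": ["greeting", "choose_service", "collect_date",
                    "schedule_appointment"]},
    {"outcome": "Abandoned", "milestones": []},
]


def outcomes_by_last_milestone(records):
    """Map each outcome to a Counter of the last milestone reached."""
    table = defaultdict(Counter)
    for rec in records:
        last = rec["milestones"][-1] if rec["milestones"] else "(none)"
        table[rec["outcome"]][last] += 1
    return dict(table)


breakdown = outcomes_by_last_milestone(conversations)
for outcome, counts in breakdown.items():
    print(outcome, counts.most_common())
```

Sorting each outcome's counts with `most_common()` surfaces the step where most failing journeys end, which is a reasonable first place to look for friction.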

What causes low coverage and how can we improve it?

Low coverage often stems from inadequate or imbalanced training data and overlapping/confused intents. Improve by analyzing real utterances, refining and rebalancing training data, restructuring intents, adding retrieval-augmented generation (RAG) for long-tail questions, and generating synthetic training/testing data where appropriate.
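One simple way to spot imbalanced training data is to count examples per intent and flag intents that fall well below the typical count. The sketch below assumes training data is available as `(utterance, intent)` pairs; the intent names and the `min_ratio` cutoff of half the median are illustrative choices, not values from the chapter.

```python
from collections import Counter
from statistics import median


def flag_thin_intents(training_examples, min_ratio=0.5):
    """Return intents whose example count is below min_ratio * median count."""
    counts = Counter(intent for _, intent in training_examples)
    typical = median(counts.values())
    return sorted(i for i, n in counts.items() if n < min_ratio * typical)


# Hypothetical training set: 10, 8, and 2 examples for three intents.
training = (
    [(f"hours question {i}", "store_hours") for i in range(10)]
    + [(f"vaccine question {i}", "vaccine_info") for i in range(8)]
    + [(f"refill question {i}", "refill_status") for i in range(2)]
)
thin = flag_thin_intents(training)
print(thin)
```

A flagged intent is only a candidate for more data; whether it actually hurts understanding still has to be checked against real utterances.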

How do we find and prioritize issues effectively?

Use both quantitative and qualitative methods:
- Quantitative: trend failed and “bot not wanted” outcomes, drill down by last step, time per step, and backend latency; estimate financial impact (e.g., escalations).
- Qualitative: user feedback, call-center insights, transcript reviews.
When triaging, weigh each issue's frequency, impact, fix effort, dependencies, and expected return. Use an impact–effort matrix and maintain a prioritized fix list with IDs.
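A prioritized fix list can be approximated with a small scoring sketch. Everything specific here is an assumption for illustration: the 1 to 5 scales, the threshold of 3, the quadrant names, and the score formula (frequency times impact divided by effort) are one possible convention, not the chapter's method.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    issue_id: str
    weekly_conversations: int  # how often the issue occurs
    impact: int                # team estimate, 1 (low) to 5 (high)
    effort: int                # team estimate, 1 (low) to 5 (high)


def quadrant(issue, threshold=3):
    """Place an issue in an impact-effort quadrant (names are illustrative)."""
    high_impact = issue.impact >= threshold
    high_effort = issue.effort >= threshold
    if high_impact and not high_effort:
        return "quick win"
    if high_impact and high_effort:
        return "major project"
    if not high_impact and not high_effort:
        return "fill-in"
    return "deprioritize"


def priority_score(issue):
    """Hypothetical score: frequent, high-impact, low-effort issues rank first."""
    return issue.weekly_conversations * issue.impact / issue.effort


issues = [
    Issue("FIX-1", weekly_conversations=400, impact=4, effort=2),
    Issue("FIX-2", weekly_conversations=300, impact=2, effort=4),
    Issue("FIX-3", weekly_conversations=250, impact=5, effort=5),
]
ranked = sorted(issues, key=priority_score, reverse=True)
for issue in ranked:
    print(issue.issue_id, quadrant(issue), priority_score(issue))
```

The score only orders the list; dependencies and team capacity still need a human decision before anything is scheduled.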

What are good acceptance criteria for a fix?

Write clear, testable statements that define how the bot must behave in specific scenarios, covering current and new behaviors. Validate in development and confirm in production by comparing against baseline metrics. Example: when asked to choose “vaccines or testing,” the bot correctly routes “vaccines,” “testing,” and handles ambiguous “yes” with confirmation.
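Acceptance criteria like the “vaccines or testing” example translate naturally into a table of test cases. The sketch below uses a stand-in `route_choice` function with simple keyword matching; it is hypothetical and only exists so the cases have something to run against. In practice the cases would call your bot's real routing logic.

```python
def route_choice(utterance):
    """Hypothetical handler for the 'vaccines or testing?' prompt."""
    text = utterance.strip().lower()
    wants_vaccine = "vaccin" in text
    wants_testing = "test" in text
    if wants_vaccine and not wants_testing:
        return "vaccines"
    if wants_testing and not wants_vaccine:
        return "testing"
    # Ambiguous or unrecognized replies (e.g. "yes") trigger a confirmation.
    return "confirm"


# Each acceptance criterion: (user reply, expected route).
ACCEPTANCE_CASES = [
    ("vaccines", "vaccines"),
    ("Testing", "testing"),
    ("I need a vaccine", "vaccines"),
    ("yes", "confirm"),
]


def run_acceptance(cases):
    """Return the cases that fail as (reply, expected, actual) tuples."""
    failures = []
    for reply, expected in cases:
        actual = route_choice(reply)
        if actual != expected:
            failures.append((reply, expected, actual))
    return failures


failures = run_acceptance(ACCEPTANCE_CASES)
print("all criteria met" if not failures else failures)
```

Keeping the cases in a plain list makes it easy to add new behaviors alongside existing ones, so the same run doubles as a regression check.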

How should we plan delivery of improvements?

Adopt sprint-based delivery (1–4 weeks). Plan by team capacity and velocity, track progress with a board (issue, status, sprint, UAT, deployment date), and keep stakeholders informed. After deployment, re-measure targeted metrics and outcomes to verify the expected impact and inform the next iteration.
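Re-measuring after deployment can be as simple as comparing a targeted outcome rate against its baseline. The sketch below is a minimal illustration with made-up numbers; the target of a 3-point drop in escalations is an assumed acceptance threshold, and it does not test whether the change is statistically significant.

```python
def rate(count, total):
    """Share of conversations with the targeted outcome."""
    return count / total if total else 0.0


def compare_to_baseline(baseline, current, target_drop_points):
    """Compare (count, total) pairs and report the change in percentage points."""
    base = rate(*baseline)
    now = rate(*current)
    change_points = (now - base) * 100
    return {
        "baseline_pct": round(base * 100, 1),
        "current_pct": round(now * 100, 1),
        "change_points": round(change_points, 1),
        "met_target": change_points <= -target_drop_points,
    }


# Hypothetical "Escalated by User" counts before and after a fix shipped.
result = compare_to_baseline(
    baseline=(180, 1200),   # 180 escalations in 1,200 conversations
    current=(120, 1250),    # 120 escalations in 1,250 conversations
    target_drop_points=3,
)
print(result)
```

Comparing rates rather than raw counts keeps the result meaningful when traffic volume changes between the two periods.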

How can we track customer satisfaction reliably?

Combine indirect and direct measures:
- Indirect: infer satisfaction from outcome categories (e.g., automated resolution vs. abandoned/escalated).
- Direct: lightweight CSAT (thumbs up/down or score), brief surveys, NPS, and periodic transcript reviews.
Expect lower response rates and bias; use multiple signals for a balanced view.
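The direct measures above reduce to short calculations. The sketch below uses the standard NPS definition (percentage of promoters scoring 9 or 10 minus percentage of detractors scoring 0 to 6) and treats thumbs-up share as a lightweight CSAT; the survey responses are invented for the example.

```python
def nps(scores):
    """Net Promoter Score from 0-10 ratings: % promoters minus % detractors."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def thumbs_csat(votes):
    """Share of thumbs-up votes, where True means thumbs up."""
    if not votes:
        raise ValueError("no votes")
    return sum(votes) / len(votes)


# Hypothetical survey responses.
ratings = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
votes = [True, True, False, True]

print(nps(ratings))        # 5 promoters, 3 detractors out of 10 responses
print(thumbs_csat(votes))  # 3 of 4 votes are thumbs up
```

Because only a fraction of users respond, it helps to report the number of responses next to each score.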
