Overview

5 Improving weak understanding for traditional AI

This chapter presents a practical, iterative approach to strengthening the intent understanding of traditional, classification-based conversational AI. It begins by establishing a clear measurement framework—beyond simple accuracy—using recall, precision, and F1 to diagnose false positives and false negatives at the intent level. With a representative blind test set (or k-fold as an early proxy), teams identify where the model is confused, visualize misclassifications with a confusion matrix, and prioritize fixes by business impact and volume. The guidance emphasizes incremental changes, testing after each update, and aligning training data distribution with real user demand captured in production logs.

Through a stepwise improvement plan, the chapter demonstrates several high-yield tactics. It shows how adding real-user examples can lift recall for underperforming intents (e.g., a login issue intent), how pruning redundant or unrepresentative examples can boost precision (e.g., chitchat about the assistant), and how targeted data augmentation can raise an intent’s F1 when both recall and precision matter. When heavy overlap creates persistent confusion, merging related intents and using entity detection to route to the correct downstream answer can simplify the space and improve accuracy. Iterating across versions in this manner moves a working solution from ~76% to ~92% overall blind-test accuracy, while cautioning against overfitting small datasets and reminding teams to keep performance grounded in fresh, representative logs.

Beyond fixes to existing scope, the chapter outlines how to expand coverage by clustering unmatched utterances into new intents (e.g., canceling licenses/registrations or addressing a data breach) and deciding when to stop adding intents with a long-tail analysis that separates high-value topics from low-frequency outliers. It closes by showing how to enrich traditional intent-driven responses with a light layer of generative text: an LLM can prepend a brief, empathetic, situation-aware greeting and summary before delivering a consistent, static answer. This hybrid pattern improves user experience without sacrificing compliance or answer stability, and pairs well with a continuous, data-driven improvement loop anchored in blind testing and careful prioritization.

In a 2x2 confusion matrix, the possible outcomes are derived by comparing the predicted intent to the actual intent.
The highlighted columns are used for calculating precision and recall for #Request_Agent.
A solid diagonal line shows that each predicted intent (represented by a single letter) matched the actual intent.
This model had nine correct predictions, but wrongly predicted intent G when the actual intent was E.
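The outcomes in a confusion matrix map directly onto per-intent metrics. A minimal Python sketch, using illustrative labels that echo the E/G confusion above (the label lists are made-up examples, not the chapter's data):

```python
def intent_metrics(actual, predicted, intent):
    """Compute precision, recall, and F1 for a single intent from paired labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == intent and p == intent)
    fp = sum(1 for a, p in zip(actual, predicted) if a != intent and p == intent)
    fn = sum(1 for a, p in zip(actual, predicted) if a == intent and p != intent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative labels: intent "E" is sometimes predicted as "G".
actual    = ["E", "E", "E", "G", "G", "F"]
predicted = ["E", "E", "G", "G", "G", "F"]
precision_e, recall_e, f1_e = intent_metrics(actual, predicted, "E")
```

Here the one E-as-G misclassification leaves precision for E perfect but drops its recall, which is exactly the distinction that overall accuracy hides.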
A comparison of training examples to the utterances in our representative blind test set shows a disparity in volume for many of the most popular intents (on the left), as well as for several of the least popular intents (on the right).
Confusion matrix after V4 update. The density of shading represents the volume of questions predicted for a given intent. If a classifier test had a perfect accuracy score, you would see a solid black diagonal line running from the upper left corner to the lower right corner. The shaded squares that stray from this diagonal line mark the areas of confusion within your model.
Comparison of baseline (V1) confusion matrix to V8 update.
Example of a long-tail chart. The terms we use to describe the volume distribution of our available training data are “short head” and “long tail.” These terms describe the visual representation of rendering our data on a bar chart. The heavier-volume intents are on the left (the short head), and as the volume decreases for each intent, the data has the appearance of a long tail falling off to the right.
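One simple way to separate the short head from the long tail is a cumulative-volume cutoff. In this sketch the 80% share and the intent names are illustrative assumptions, not a rule from the chapter; the actual threshold should be negotiated with the business:

```python
def split_head_tail(intent_counts, head_share=0.8):
    """Split intents into a short head and long tail by cumulative volume share."""
    ranked = sorted(intent_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(intent_counts.values())
    head, tail, running = [], [], 0
    for intent, count in ranked:
        if running < head_share * total:
            head.append(intent)
        else:
            tail.append(intent)
        running += count
    return head, tail

# Hypothetical production volumes per intent.
counts = {"reset_password": 400, "billing": 250, "login_issue": 200,
          "cancel_license": 60, "data_breach": 50, "fax_number": 5}
head, tail = split_head_tail(counts)
```

Intents landing in `tail` are candidates for fallbacks, escalation, or search rather than dedicated training data.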
In a traditional (classification-based) dialogue pattern, an intent is identified, and the dialogue is configured to give a static or minimally personalized answer.
The solution identifies the correct intent using traditional AI, then prepends generated text to the static output response configured for that intent. The generated greeting and summary convey to the user that the bot understands their goal and the particular details of their situation.
LLMs can be called within traditional dialogue patterns to greet a user and summarize their problem before delivering a pre-defined or static answer.

Summary

  • A classifier’s performance can be measured in terms of accuracy, recall, precision, and F1-score. These measurements reflect the types of errors a classifier may be committing.
  • The performance metrics produced by your testing will inform your next steps towards classifier performance improvement. Higher volume intents with low performance are a good place to start.
  • Iterative test/train cycles will show you the effects of your changes.
  • A chatbot can use additional strategies, such as disambiguation, clarifying questions, and entity detection, to overcome confusion or route answers for merged intents.
  • A chatbot with a strong classifier can deliver more business value by delivering the right answers on the first try and deflecting work that would otherwise be handled by a human agent. You should plan to monitor and re-train your solution throughout the life of the bot.
  • Generative AI can supplement a traditional AI solution by infusing static chatbot responses with personalization and empathy, which enhances the perception of understanding.

FAQ

How do I establish a reliable baseline for my classifier?
Use a representative blind test set built from production logs so intent volumes mirror real usage. Record per-intent recall, precision, and F1, plus overall accuracy. If logs aren’t available, k-fold cross-validation is acceptable for a pilot, but expect it to be optimistic compared to blind test results.
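For the k-fold proxy, the splitting itself needs no ML framework. A minimal sketch using round-robin fold assignment (real platforms usually shuffle and stratify by intent, which this deliberately omits):

```python
def k_fold_splits(examples, k=5):
    """Yield (train, test) partitions for k-fold cross-validation."""
    folds = [examples[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Each example would be an (utterance, intent) pair in practice.
data = list(range(10))
splits = list(k_fold_splits(data, k=5))
```

Each of the k runs trains on k-1 folds and tests on the held-out fold; averaging the per-fold metrics gives the pilot baseline.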
Which metric should I optimize: recall, precision, or F1?
Choose based on business cost: optimize recall when false negatives are costly, precision when false positives are costly, and F1 when both matter. In practice, teams often track F1 per intent and use business impact to set priorities.
Why isn’t accuracy enough, and why ignore true negatives?
Overall accuracy hides where and how the model fails. With many intents, true negatives dominate counts and add little insight per intent. Recall and precision expose false negatives and false positives, and F1 balances both for a clearer picture.
How do I use a confusion matrix to find problems?
Look for off-diagonal cells—those show which intents are being confused. Dense off-diagonal cells indicate systematic misclassification. Prioritize the highest-volume, lowest-F1 intents and the heaviest confusion pairs for targeted fixes.
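The heaviest confusion pairs can be found directly from paired labels. A sketch with illustrative intent names:

```python
from collections import Counter

def confusion_pairs(actual, predicted):
    """Count (actual, predicted) pairs; off-diagonal entries are confusions."""
    matrix = Counter(zip(actual, predicted))
    confusions = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
    return matrix, confusions

# Hypothetical test results for two overlapping intents.
actual    = ["pay", "pay", "refund", "refund", "refund", "pay"]
predicted = ["pay", "refund", "refund", "pay", "pay", "pay"]
matrix, confusions = confusion_pairs(actual, predicted)
worst = max(confusions, key=confusions.get)  # heaviest off-diagonal cell
```

Sorting `confusions` by count gives the same prioritized list you would read off the shaded off-diagonal squares in the matrix.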
How can I raise recall for a weak intent?
Add diverse, real phrasing from logs that reflects what users say (synonyms, colloquialisms, error states). Cover key variants and signals (e.g., “locked out,” “password reset,” “security code”). Make small, incremental changes and retest to confirm recall improves without harming precision.
How do I improve precision when an intent is over-selected?
Prune redundant or overly generic examples, rebalance volumes so the intent isn’t disproportionately represented, and remove examples unsupported by logs. If your platform supports it, add hard negatives. Retest because changes in one intent can impact others.
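Pruning redundant examples can be approximated with a simple token-overlap check. The Jaccard similarity threshold below is an illustrative assumption; tune it against your own data:

```python
def prune_near_duplicates(examples, threshold=0.8):
    """Keep only examples whose token-set Jaccard similarity to every
    already-kept example stays below the threshold."""
    kept = []
    for ex in examples:
        tokens = set(ex.lower().split())
        is_dup = any(
            len(tokens & set(k.lower().split())) / len(tokens | set(k.lower().split()))
            >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(ex)
    return kept

examples = [
    "how do I reset my password",
    "how do i reset my password",  # near-duplicate, adds no new signal
    "where is my bill",
]
pruned = prune_near_duplicates(examples)
```

This only catches lexical redundancy; examples that are generic or unsupported by logs still need human review.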
What if multiple intents overlap heavily?
Consider merging them into a broader intent and use entities (or slots) to route to the correct answer variant. Update dialogues and tests accordingly. This often reduces confusion and boosts precision/recall for the combined topic.
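A merged intent with entity-based routing might be sketched as follows. The intent, entity patterns, answer text, and fallback prompt are all illustrative:

```python
import re

# Illustrative entity patterns for a merged "cancel_item" intent.
ENTITY_PATTERNS = {
    "license": re.compile(r"\blicen[cs]e\b", re.IGNORECASE),
    "registration": re.compile(r"\bregistration\b", re.IGNORECASE),
}

ANSWERS = {
    "license": "Here is how to cancel a license: ...",
    "registration": "Here is how to cancel a registration: ...",
}

def route_merged_intent(utterance, fallback="Which would you like to cancel?"):
    """Pick the answer variant for the merged intent based on the detected entity."""
    for entity, pattern in ENTITY_PATTERNS.items():
        if pattern.search(utterance):
            return ANSWERS[entity]
    return fallback  # disambiguate with a clarifying question when no entity matches
```

The fallback doubles as the disambiguation strategy mentioned in the summary: when no entity is detected, the bot asks rather than guesses.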
How should I prioritize which intents to fix first?
Start with high-volume intents that have low F1. Also prioritize intents with high business impact even if volume is low (e.g., costly escalations). Use your representative blind test to align priorities with real user demand.
How do I add new intents for “no intent matched,” and when should I stop?
Cluster unmatched utterances from logs by strong signal words, create a small, balanced training set, and put the rest into your blind test. Use a long-tail analysis with the business to set a minimum data threshold for new intents. Handle tail topics with fallbacks, escalation, search, or RAG/LLM responses.
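Clustering unmatched utterances by signal words can start as simply as this. The candidate intents and signal words are illustrative; a real pass would also handle stemming and multi-word signals:

```python
from collections import defaultdict

# Illustrative signal words for candidate new intents.
SIGNALS = {
    "cancel_license": {"cancel", "license", "registration"},
    "data_breach": {"breach", "hacked", "compromised"},
}

def cluster_unmatched(utterances):
    """Group unmatched utterances under the first candidate intent whose
    signal words they contain; the rest stay unclustered for review."""
    clusters = defaultdict(list)
    for utterance in utterances:
        words = set(utterance.lower().split())
        for intent, signals in SIGNALS.items():
            if words & signals:
                clusters[intent].append(utterance)
                break
        else:
            clusters["unclustered"].append(utterance)
    return clusters

logs = [
    "I want to cancel my license",
    "we got hacked yesterday",
    "hello there",
]
clusters = cluster_unmatched(logs)
```

The `unclustered` bucket feeds the long-tail analysis: topics that never accumulate enough volume stay in fallback handling rather than becoming intents.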
How can LLMs enhance an intent-driven bot without replacing it?
Keep authoritative, static answers, and prepend a generated greeting and empathetic summary via an LLM API call. Constrain the prompt: greet (optionally by name), restate the user’s problem, don’t ask for more info or make promises. This adds warmth and personalization while maintaining controlled content.
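A sketch of this hybrid pattern with the LLM call stubbed out: a real implementation would pass `build_greeting_prompt`'s output to your provider's API, and the prompt constraints mirror the ones described above. All names here are illustrative:

```python
def build_greeting_prompt(user_name, user_message):
    """Constrained prompt: greet, restate the problem, no questions or promises."""
    name_part = f" by name ({user_name})" if user_name else ""
    return (
        f"Greet the user{name_part} and briefly restate their problem "
        f"in one empathetic sentence. Do not ask for more information "
        f"or make promises.\n"
        f"User message: {user_message}"
    )

def hybrid_response(user_name, user_message, static_answer, generate):
    """Prepend a generated greeting/summary to the intent's static answer."""
    greeting = generate(build_greeting_prompt(user_name, user_message))
    return f"{greeting}\n\n{static_answer}"

# Stub standing in for a real LLM call, so the pattern is testable offline.
stub = lambda prompt: "Hi Ana! It sounds like you're locked out of your account."
reply = hybrid_response("Ana", "I can't log in", "To reset your password, visit ...", stub)
```

Because the static answer is concatenated verbatim, the authoritative content stays stable no matter what the generator produces.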
