Overview

1 AI Reliability: Building LLMs for the Real World

AI has entered a new phase where large language models reason, plan, and act, powering tangible gains across industries from law and customer service to software development and enterprise automation. Yet most pilots still fail to deliver ROI because systems that impress in demos often break in production—hallucinations, brittle tool use, weak evaluations, and operational blind spots are common pitfalls. This chapter sets the stage for moving beyond flashy prototypes to durable, real-world systems and outlines a hands-on path to implement LLMs that remain accurate, reliable, and ethical over time.

A core theme is understanding and mitigating hallucinations—plausible but false outputs produced by probabilistic next-token generation over imperfect, contradictory training data. The chapter illustrates how convincingly fabricated content can surface in high-stakes domains (e.g., legal research) and defines reliability more broadly: truthfulness and grounding, consistency, graceful failure, fairness, efficiency, safe action-taking, and sustained quality as models and data evolve. Crucially, AI reliability adds semantic correctness to traditional software concerns like uptime and latency, since systems can be operationally healthy yet confidently wrong.

To operationalize reliability, the chapter introduces a three-layer framework and a practical toolbox. Layer 1 (reliable outputs) covers prompt engineering, RAG, embeddings/vector search, and fine-tuning to reduce hallucinations and ground responses. Layer 2 (reliable agents) addresses agent architectures, safe tool integration via MCP, and multi-agent coordination for complex workflows. Layer 3 (reliable operations) focuses on evaluation (including LLM-as-judge and red teaming), deployment, monitoring, and responsible AI practices around bias, privacy, and governance. The guiding principle is to start simple, add complexity deliberately, and measure from day one—an approach made urgent by real-world stakes, regulatory requirements, and the need to earn and keep user trust.

Figures in this chapter:
  • GPT model performance on AIME 2025 mathematics problems
  • Global AI agents market forecast by region, 2018–2030
  • Why hallucinations happen, from query to fabricated output
  • The AI Reliability Framework: three layers from outputs to operations
  • The Reliability Toolbox decision tree: choosing the right technique for your AI system

Summary

  • LLMs have immense potential to transform industries. Their applications span content creation, customer service, healthcare, and more.
  • Agentic AI systems that take real-world actions introduce new categories of risk requiring sophisticated reliability engineering.
  • The three-layer framework organizes reliability: reliable outputs (Ch 2-5), reliable agents (Ch 6-8), reliable operations (Ch 9-11).
  • Curbing hallucination risks is key to keeping outputs honest and grounded in facts.
  • Performance optimization ensures LLMs meet the speed, responsiveness, and quality demands of real-world applications.
  • Multi-agent systems require coordination protocols, error handling, and monitoring to prevent cascading failures.
  • The reliability toolbox includes prompt engineering, RAG, embeddings, fine-tuning, agent frameworks, tool integration, and evaluation. Start simple and add complexity as needed, but always build in evaluation and monitoring.
  • This book covers promising solutions to these challenges, enabling LLMs to be harnessed safely for groundbreaking innovations across healthcare, science, education, entertainment, and more, while building vital public trust.

FAQ

What does “reliable AI” mean in this chapter?

A reliable AI system produces factual, grounded outputs; behaves consistently; fails gracefully (admits uncertainty rather than guessing); operates fairly without harmful bias; meets latency and cost targets; takes safe actions within defined bounds when granted agency; and maintains quality as models, data, and usage evolve. Beyond uptime and error rates, it emphasizes semantic correctness.

Why is reliability especially urgent now?

Reasoning-capable LLMs (e.g., GPT, Claude, Gemini “thinking” models) can plan, act, and coordinate complex tasks—raising the stakes of mistakes. While capabilities have surged (near-perfect scores on AIME-style problems, production-quality code, autonomous browsing), many pilots still fail to deliver ROI due to hallucinations, brittle tools, and weak evaluation. Reliability turns impressive demos into dependable systems.

What real-world benefits and risks do LLMs bring to industry?
  • Law: JPMorgan’s COIN automates ~360,000 hours of contract review; risk: fabricated citations can cause legal damage.
  • Customer service: Klarna’s assistant replaces work of ~700 agents, 2.3x faster in 35 languages, ~$40M/year savings; risk: incorrect policy info without guardrails.
  • Development: GitHub Copilot boosts speed ~55%; risk: buggy/insecure suggestions require review.
  • Enterprise AI: Salesforce Einstein delivers ~1B predictions/day with a 25–35% revenue lift; risk: biased outputs in high-stakes contexts.
What is a hallucination, and why is it dangerous?

A hallucination is a confident but incorrect or unfaithful output. Unlike obvious software errors, hallucinations look plausible. Example: fabricated legal cases with realistic names and citations that fooled professionals—leading to sanctions and reputational harm.

Why do LLMs hallucinate?
  • They generate text probabilistically to sound plausible, not to verify truth.
  • Training data contains errors, gaps, and contradictions.
  • No built-in fact-checking against ground truth at generation time.
  • They learn the style of authoritative content (e.g., legal citations) without knowing whether specific facts are true, so out-of-scope or conflicted queries can yield convincing fabrications (see the toy sketch after this list).
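
To make the mechanism concrete, here is a toy sketch: a tiny bigram model learns only the style of legal citations from an invented corpus, so sampling the most plausible next word can recombine fragments into citations that look authoritative but never appeared in the data. All case names, years, and the corpus itself are fabricated for illustration.

```python
# Toy illustration of next-token generation with no notion of truth.
import random
from collections import defaultdict

CORPUS = [
    "the court held in Smith v. Jones , 1998 , that the contract was void",
    "the court held in Brown v. Davis , 2004 , that the claim was barred",
    "the court held in Lee v. Clark , 2011 , that the appeal was denied",
]

# "Training": count which word follows which.
transitions = defaultdict(list)
for sentence in CORPUS:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        transitions[current].append(nxt)

def generate(prefix=("the", "court", "held", "in"), max_len=14):
    """Sample a plausible continuation; nothing here checks facts."""
    out = list(prefix)
    while len(out) < max_len and transitions.get(out[-1]):
        out.append(random.choice(transitions[out[-1]]))
    return " ".join(out)

if __name__ == "__main__":
    random.seed(3)
    # Output is likely to mix parties and years (e.g. "Smith v. Davis , 2011"):
    # stylistically perfect, factually fabricated.
    print(generate())
```

The same dynamic, at vastly larger scale, is what lets an LLM produce a confident citation it has never actually seen.
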
What is the AI Reliability Framework introduced in this chapter?
  • Layer 1: Reliable outputs (Ch. 2–5) — get accurate, grounded answers.
  • Layer 2: Reliable agents (Ch. 6–8) — ensure safe, auditable tool use and actions.
  • Layer 3: Reliable operations (Ch. 9–11) — evaluate, deploy, monitor, and keep systems reliable over time.
How do I get accurate, grounded outputs from an LLM?
  • Prompt engineering: structured prompts, Chain-of-Thought, few-shot examples, and guardrails.
  • RAG: retrieve relevant documents/data and require citations to reduce hallucinations (a minimal grounding sketch follows this list).
  • Embeddings and vector search: enable semantic retrieval when keywords fail.
  • Fine-tuning (e.g., LoRA): bake domain expertise into the model when RAG and prompting aren’t enough.
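
The techniques above share one grounding pattern: retrieve relevant context, constrain the model to it, and require citations. The sketch below illustrates that pattern with a hypothetical in-memory store and a naive keyword retriever standing in for embeddings and vector search; a real pipeline would embed documents, store them in a vector database, retrieve by semantic similarity, and send the resulting prompt to your LLM client of choice.

```python
# Minimal grounding sketch; document store and helper names are illustrative.
DOCUMENTS = {
    "refund-policy": "Refunds are available within 30 days of purchase with a receipt.",
    "shipping-policy": "Standard shipping takes 5-7 business days within the US.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by naive keyword overlap (a stand-in for vector search)."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(text.lower().split())), doc_id, text)
        for doc_id, text in DOCUMENTS.items()
    ]
    scored.sort(reverse=True)
    return [(doc_id, text) for _, doc_id, text in scored[:k]]

def build_grounded_prompt(query: str) -> str:
    """Structured prompt: answer only from retrieved context, cite sources,
    and admit uncertainty instead of guessing (a simple guardrail)."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query))
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite the source id in brackets for every claim.\n"
        "If the context does not contain the answer, reply: 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

if __name__ == "__main__":
    print(build_grounded_prompt("How long do refunds take?"))
```

Printing the prompt rather than calling a model keeps the sketch self-contained; in practice you would pass it to your client and check that answers cite the retrieved sources.
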
How do I build agents that act reliably and safely?
  • Architectures: use ReAct (reason-act loops), planning, and memory for auditable decisions (see the loop sketch after this list).
  • Tool integration: connect via Model Context Protocol (MCP) with permissions and boundaries.
  • Multi-agent patterns: coordinate specialists with supervisor/oversight to avoid cascading failures.
  • Guardrails: constrain actions, require confirmations, and log reasoning for review.
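
Here is a minimal sketch of how these pieces can fit together, assuming hypothetical `llm_decide` and `run_tool` stand-ins for the model call and MCP-connected tools: a ReAct-style loop with a tool whitelist, human confirmation before side-effecting actions, a step budget, and a reasoning log for audit.

```python
# ReAct-style loop with guardrails; MCP plumbing and the real model are omitted.
from dataclasses import dataclass, field

ALLOWED_TOOLS = {"search_orders", "issue_refund"}
NEEDS_CONFIRMATION = {"issue_refund"}  # side-effecting actions need sign-off

@dataclass
class AgentRun:
    goal: str
    max_steps: int = 5
    log: list = field(default_factory=list)

def llm_decide(goal, log):
    """Stand-in for the model call: returns a thought plus either a tool
    request or a final answer. A real agent would prompt an LLM here."""
    if not log:
        return {"thought": "Look up the order first.",
                "tool": "search_orders", "args": {"order_id": "A123"}}
    return {"thought": "Order found and refund qualifies.",
            "final": "Refund request is eligible; awaiting approval."}

def run_tool(name, args):
    """Stand-in for an MCP-connected tool call."""
    return f"{name} returned a result for {args}"

def run_agent(run: AgentRun) -> str:
    for _ in range(run.max_steps):
        step = llm_decide(run.goal, run.log)
        run.log.append(f"THOUGHT: {step['thought']}")
        if "final" in step:
            return step["final"]
        tool = step["tool"]
        if tool not in ALLOWED_TOOLS:
            return f"Blocked: '{tool}' is not an allowed tool."
        if tool in NEEDS_CONFIRMATION and input(f"Run {tool}? [y/N] ").lower() != "y":
            return "Action cancelled by reviewer."
        run.log.append(f"OBSERVATION: {run_tool(tool, step['args'])}")
    return "Stopped: step budget exhausted."  # fail gracefully, don't guess

if __name__ == "__main__":
    run = AgentRun(goal="Check refund eligibility for order A123")
    print(run_agent(run))
    print("\n".join(run.log))
```

The key design choice is that every step either ends cleanly, asks a human, or is logged; nothing irreversible happens silently.
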
How should I evaluate, deploy, and monitor AI systems for reliability?
  • Evaluation: reference metrics (ROUGE, BLEU), LLM-as-judge, red teaming, and human reviews (a minimal evaluation sketch follows this list).
  • Observability: use platforms (e.g., Arize, Phoenix) to analyze real-world behavior.
  • Operations: detect drift, alert on quality degradation, manage safe model updates, and close feedback loops.
  • Track not just uptime/latency, but semantic quality, hallucination rates, and user satisfaction.
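
Below is a minimal sketch of an offline evaluation loop. The test cases and the `judge_response` helper are hypothetical; it approximates LLM-as-judge with token overlap against a reference answer, whereas a production setup would prompt a strong model as the judge, add reference metrics, and report results (including hallucination rate) to an observability platform alongside latency and cost.

```python
# Minimal evaluation sketch tracking semantic quality, not just uptime.
TEST_CASES = [
    {"question": "What is the refund window?",
     "reference": "30 days with a receipt",
     "answer": "Refunds are accepted within 30 days with a receipt."},
    {"question": "Do you ship to Mars?",
     "reference": "I don't know",
     "answer": "Yes, Mars shipping takes 3 days."},  # a confident fabrication
]

def judge_response(reference: str, answer: str) -> dict:
    """Score faithfulness in [0, 1] and flag likely hallucinations."""
    ref_terms = set(reference.lower().split())
    ans_terms = set(answer.lower().split())
    score = len(ref_terms & ans_terms) / max(len(ref_terms), 1)
    return {"faithfulness": round(score, 2), "hallucinated": score < 0.3}

def run_eval() -> None:
    hallucinated = 0
    for case in TEST_CASES:
        verdict = judge_response(case["reference"], case["answer"])
        hallucinated += verdict["hallucinated"]
        print(f"{case['question']!r} -> {verdict}")
    # Report aggregate semantic quality over time.
    print(f"Hallucination rate: {hallucinated / len(TEST_CASES):.0%}")

if __name__ == "__main__":
    run_eval()
```
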
What’s in the “reliability toolbox,” and when should I use each tool?
  • Prompt engineering — always start here; fastest gains.
  • RAG — when outputs must be grounded in documents/data.
  • Vector search — when keyword search isn’t enough.
  • Fine-tuning — when deep domain expertise is required.
  • Agent frameworks — for multi-step, decision-heavy tasks.
  • Tool integration (MCP) — when the AI must act or access live data.
  • Multi-agent systems — for complex workflows needing specialized skills.
  • Evaluation and monitoring — always; you can’t improve what you don’t measure (the sketch after this list encodes this decision order).
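
As a rough illustration of that decision order, the sketch below maps requirement flags to an ordered list of techniques, simplest first; the flags and recommendations are illustrative, not exhaustive.

```python
# Illustrative "start simple" decision order for the reliability toolbox.
def recommend_techniques(needs_grounding=False, needs_actions=False,
                         deep_domain_expertise=False, multi_step=False,
                         complex_workflow=False):
    """Return an ordered list of techniques to try, simplest first."""
    plan = ["prompt engineering"]              # always start here
    if needs_grounding:
        plan += ["RAG", "vector search"]       # ground answers in documents/data
    if deep_domain_expertise:
        plan.append("fine-tuning")             # when prompting and RAG fall short
    if multi_step:
        plan.append("agent framework")         # decision-heavy, multi-step tasks
    if needs_actions:
        plan.append("tool integration (MCP)")  # act or access live data
    if complex_workflow:
        plan.append("multi-agent system")      # specialized skills, coordination
    plan.append("evaluation and monitoring")   # always; measure from day one
    return plan

if __name__ == "__main__":
    print(recommend_techniques(needs_grounding=True, multi_step=True))
```
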
