AI Agents in Action, Second Edition

Intelligent workflows with LLMs, MCP, A2A, and more

Micheal Lanham

MEAP began November 2025
Last updated March 2026
Publication in Summer 2026 (estimated)

ISBN 9781633434530
325 pages (estimated)

printed in black & white

available in Complex Chinese

Table of contents

1 The rise of AI agents

1.1 Defining agents and agentic thinking

1.1.1 Understanding agent/assistant and LLM patterns

1.1.2 Thinking like agents

1.1.3 Agents act with tools

1.2 Introducing the Model Context Protocol (MCP)

1.3 Understanding the functional layers of an agent

1.3.1 The Agent Persona

1.3.2 Agent Actions & Tools

1.3.3 Agent Reasoning & Planning

1.3.4 Agent Knowledge & Memory

1.3.5 Agent Evaluation & Feedback

1.4 Advancing onto multi-agent systems

1.4.1 The agent-flow assembly line

1.4.2 Agent orchestrations (hub-and-spoke)

1.4.3 Agent collaboration (teams of agents)

1.5 Summary

2 Core components: Large Language Models, prompting, and agents

2.1 Understanding Large Language Models

2.1.1 LLMs: Probabilistic Token Machines

2.1.2 What is a token?

2.1.3 Tuning Temperature, Top P, and more

2.2 Controlling LLMs with prompt engineering (Agent Persona)

2.2.1 Applying core prompt techniques

2.2.2 Thinking like an LLM

2.2.3 Avoiding common prompt pitfalls

2.3 Building an agent with OpenAI Agents

2.3.1 Building a minimal agent

2.3.2 Setting the Agent Model and other parameters

2.3.3 Controlling inputs and typed outputs

2.3.4 Tracing agents

2.4 Enhancing agents through tool integration

2.4.1 Providing agents with tools

2.4.2 Tracing agentic tool use

2.5 Exercises

2.6 Summary

3 Actions with Model Context Protocol for AI agents

3.1 Understanding MCP fundamentals for agent development

3.1.1 The standardization problem MCP solves

3.1.2 MCP architecture: Clients, servers, and services

3.1.3 Core components: Tools, resources, and prompts

3.1.4 MCP deployment patterns for agents

3.1.5 MCP powers the functional agent layers

3.2 Getting started with MCP Servers

3.2.1 Coding up an MCP Server for Claude

3.2.2 Using the MCP inspector

3.2.3 Understanding MCP transport types

3.2.4 From desktop to agents: the key differences

3.3 Actioning MCP servers for Agents

3.3.1 Actioning local MCP servers over STDIO with agents

3.3.2 Actioning local MCP servers over SSE with agents

3.3.3 Connecting to the standard MCP servers

3.4 Building MCP servers for agents

3.4.1 Converting tools to an MCP server

3.4.2 Consuming MCP servers locally or remotely

3.5 Exercises

3.6 Summary

4 Architecting and building multi-agent systems

4.1 Architecting multi-agent systems

4.1.1 Decision-making for agent systems

4.1.2 Communicating with shared-memory, message-passing, and MCP

4.1.3 Channeling multi-agent coordination strategies

4.2 Balancing agents with agentic flows

4.2.1 Transforming agents to agent flows

4.2.2 Building an Agent-to-Agent flow

4.2.3 Agency and decision making in agent flows

4.3 Understanding handoffs in agent flows

4.3.1 Agent-to-agent flow with handoffs

4.3.2 Visualizing agent flows

4.3.3 Monitoring the handoff

4.4 Validating agent flows with guardrails

4.4.1 Implementing input and output guardrails

4.4.2 Using agents as guardrails

4.4.3 Adding guardrails to pass off agent flows

4.5 Exercises

4.6 Summary

5 Agent Reasoning and Planning

5.1 Understanding LLM Reasoning and Planning

5.1.1 Chain of Thought Reasoning

5.1.2 ReAct Paradigm (Reasoning + Acting + Observing)

5.1.3 Planning with LLMs

5.2 Instructing agents to reason and plan

5.2.1 Applying CoT to an Agent

5.2.2 Implementing ReAct with Agents

5.3 Advanced reasoning with agents

5.3.1 Tree of Thought

5.3.2 Reflexion

5.3.3 Selecting the right pattern for your agents

5.4 Utilizing the Sequential Thinking MCP Server

5.4.1 Unchaining the Sequential Thinking Server

5.4.2 Revisiting time travel problems with Sequential Thinking

5.4.3 Advanced reasoning with sequential thinking

5.5 Exercises

5.6 Summary

6 Working with memory and knowledge RAG for agents

6.1 Understanding retrieval in AI applications

6.1.1 The basics of retrieval augmented generation (RAG)

6.1.2 Delving into semantic search and document indexing

6.1.3 Applying vector similarity search

6.2 Vector databases and similarity search

6.2.1 Demystifying document embeddings

6.2.2 Querying document embeddings from Chroma

6.3 Building practical RAG knowledge agents

6.3.1 Everything begins with search and relevance

6.3.2 Building a vector search RAG agent

6.3.3 Building a hybrid search RAG agent

6.4 Adding memory to agents with MCP

6.4.1 Understanding memory form and agent function

6.4.2 Implementing a graph database for memory using MCP

6.4.3 Creating hybrid memory systems with MCP

6.4.4 Semantic augmented memory and applications to semantic, episodic, and procedural memory

6.4.5 Uncluttering memory with compression and forgetting

6.5 Exercises

6.6 Summary

7 Building robust agents with evaluation and feedback

7.1 Introducing agent evaluation and feedback

7.2 Implementing test-driven agent development

7.2.1 Exploring TDAD in practice

7.2.2 Coding and testing the RAG agent

7.2.3 Refactoring the agent

7.2.4 Extending evaluation with an agent evaluator

7.3 Employing grounding, critic, and evaluation agents

7.3.1 Reviewing the grounding agent

7.3.2 Grounding the RAG agent

7.3.3 Implementing grounding agents as guardrails

7.3.4 Understanding the role of rubrics in evaluation

7.3.5 Building a rubric critic agent

7.4 Phoenix for evaluation and feedback

7.4.1 Connecting to Phoenix

7.4.2 Adding metadata and session tracking

7.4.3 Experimenting with evaluators

7.4.4 Providing feedback with Annotations

7.5 Exercises

7.6 Summary

8 Deploying agents and agentic systems

8.1 Strategies for consuming agents

8.1.1 Embedding real-time voice agents into web applications

8.1.2 Hosting agents through an API

8.1.3 Consuming an agent web service in a web application

8.2 Dockerizing agent systems

8.2.1 Containerizing an agent microservice

8.2.2 Orchestrating agentic systems with Docker Compose

8.2.3 Externalizing local agent microservices

8.3 Considering advanced deployment strategies

8.3.1 Choosing a runtime: edge, API, or event-driven

8.3.2 The three “wires” of communication

8.3.3 Practical multi-agent topologies that adapt well

8.3.4 State, memory, and idempotency

8.3.5 Release engineering for agents (prompts, tools, models)

8.3.6 Observability matters

8.3.7 Reliability patterns: timeouts, fallbacks, and budgets

8.3.8 Cost control and model routing

8.4 Security, safety, and governance in production

8.4.1 A quick threat model for agentic systems

8.4.2 Identity and access—for people, services, and agents

8.4.3 Secrets and configuration management

8.4.4 Tool safety: sandboxing and egress control

8.4.5 Prompt-injection and data-exfiltration defenses

8.4.6 Safety and policy enforcement

8.5 Exercises

8.6 Summary

9 Engaging GPT Assistants

10 Exploring collaborative agent systems

11 Troubleshooting

Overview

7 Building robust agents with evaluation and feedback

This chapter emphasizes that robust, safe, and debuggable agentic systems depend on rigorous evaluation and feedback woven throughout the lifecycle—not just at production hardening time. It maps internal learning loops and external evaluators, highlighting complementary methods such as red-team and benchmark testing, human feedback, and agent-based evaluators (grounding and critic agents). Results are best funneled into an evaluation/feedback store that powers reporting, alerts, and continuous improvement, often via a dedicated feedback agent. The recommended discipline is to adopt test-driven agent development from the outset so quality gates shape prompts, tools, and policies as the system evolves.

Test-driven agent development follows a Think → Test → Prompt → Refactor loop: define goals and benchmarks (including counterexamples), expect early failure, and make minimal, targeted updates for consistency and latency. Because LLMs are variable, each test should be run multiple times and refactoring kept lean; when benchmarks conflict, teams can allow selective failures, split responsibilities across specialized agents, or rebuild prompts around the hardest goals. A practical walk-through shows a RAG agent maturing from string-match checks to an evaluator agent with typed pass/fail plus feedback, then to a grounding agent that blocks ungrounded output and can serve as a guardrail with tripwire-triggered exceptions. The pattern extends to multi-agent flows where critics guide regeneration and prevent drift, ensuring outputs remain correct, contextualized, and policy-aligned.
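The advice to run each test multiple times can be sketched as a small harness. This is a minimal illustration, not the book's code: `pass_rate` and `flaky_agent` are hypothetical names, and the agent is a seeded random stub standing in for a variable LLM-backed agent.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], question: str,
              check: Callable[[str], bool], runs: int = 5) -> float:
    """Run one benchmark several times and return the fraction of passing runs."""
    passes = sum(1 for _ in range(runs) if check(agent(question)))
    return passes / runs


# Hypothetical stand-in for an LLM-backed agent: a seeded RNG makes it
# answer correctly most of the time, mimicking run-to-run variability.
rng = random.Random(7)


def flaky_agent(question: str) -> str:
    return "Paris" if rng.random() < 0.8 else "Lyon"


rate = pass_rate(flaky_agent, "What is the capital of France?",
                 check=lambda answer: "Paris" in answer, runs=20)
print(f"pass rate over 20 runs: {rate:.0%}")
```

Reporting a rate rather than a single pass or fail makes consistency part of the benchmark, so a prompt change that passes once but fails one run in five shows up immediately.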

For complex, mixed-generation tasks, rubric-based evaluation provides structured, explainable judgments beyond simple accuracy, enabling critic agents to score outputs (for example, images) across criteria and thresholds that govern iterative improvement loops. Strong observability underpins this process: comprehensive logging and tracing capture agent, tool, and LLM spans for diagnosis and analysis. Phoenix augments this with OpenTelemetry-based tracing, session and metadata tagging, datasets built from real traces, experiment runners, evaluators, and annotations to systematize triage and iteration. The chapter’s throughline is clear: combine TDAD, grounding and critic/evaluation agents, and trace-driven operations to close the loop on quality—never promoting agents without concrete evaluation and feedback mechanisms in place.
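The threshold-governed improvement loop described above might look like the following sketch. All names (`refine_until_pass`, `stub_generate`, `stub_critique`) are hypothetical, the 0.8 threshold and three-attempt limit are assumed values, and the generator and critic are plain functions standing in for agents.

```python
from typing import Callable, Optional, Tuple


def refine_until_pass(generate: Callable[[Optional[str]], str],
                      critique: Callable[[str], Tuple[float, str]],
                      threshold: float = 0.8,
                      max_attempts: int = 3) -> Tuple[str, float, int]:
    """Loop generate -> critique until the score meets the threshold
    or the attempt limit is reached; return the best output seen."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback = None
    best_output, best_score = "", float("-inf")
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)          # feedback is None on the first pass
        score, feedback = critique(output)   # critic returns a score and feedback
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break
    return best_output, best_score, attempt


# Hypothetical stubs standing in for a generator agent and a critic agent.
def stub_generate(feedback: Optional[str]) -> str:
    if feedback is None:
        return "short draft"
    return "short draft, now with cited sources"


def stub_critique(output: str) -> Tuple[float, str]:
    if "sources" in output:
        return 0.9, "meets the rubric"
    return 0.4, "add sources"


output, score, attempts = refine_until_pass(stub_generate, stub_critique)
print(output, score, attempts)
```

The attempt limit keeps the loop bounded when the critic never approves, and returning the best-scoring output gives the caller something to inspect even on failure.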

External agent evaluation and feedback systems for a single user-facing agent. In the figure, a generative agent may be evaluated by a critic agent for general output feedback or by a grounding agent for context-specific feedback and evaluation.

The human process of developing agents using TDAD: first think of goals and benchmarks, then test, develop the prompt, refactor, and repeat the process.

Architecture of a simple RAG Knowledge Agent. The goal of the agent is to answer questions from a benchmark by searching its knowledge base and returning answers. A simple evaluator LLM evaluates each answer against the expected and wrong answers from the benchmark test. The accuracy of the RAG agent can be determined by the percentage of benchmark questions answered correctly.

Three forms of agents for evaluating agent output: grounding, critic, and evaluation. The grounding agent is specific to evaluating RAG agent output, the critic agent is designed to critique generated content, and the evaluation agent is used for more complex output.

RAG knowledge agent coupled with a grounding agent. Both agents are provided with the same context. The RAG agent generates an answer from the context, and the grounding agent verifies that the answer is grounded by reviewing the same context.

Arize Phoenix deployment patterns and use case methods of tracing, tracking and evaluating agents.

Arize Phoenix projects dashboard showing an agent workflow, agent and LLM activity.

Viewing the metadata column and filtering traces by an attribute.

The process to add spans to a dataset for later experimentation and evaluation.

Viewing datasets, then clicking Run Experiment to see generated code that you can copy into a Python file to run an experiment or evaluation.

Creating a new annotation in Phoenix.

Applying an annotation to an LLM response within the agent workflow, agent, and LLM response spans.

Summary

Robust agents require evaluation and feedback baked in from the start—tie the Learn step of Sense-Plan-Act-Learn to external evaluators, a feedback store, and (optionally) a monitoring/alerting feedback agent.
Use a mix of methods: red-team testing for safety breaks, benchmark testing for goal attainment, human feedback (thumbs/comments) with verification to reduce bias, and automated evaluators (grounding/critic/evaluation agents) where appropriate.
Test-Driven Agent Development (TDAD) flips the process: define benchmarks (incl. wrong examples), expect initial failures, run each test multiple times for consistency, then make the minimum prompt/tool/model changes needed to pass—retest and refactor iteratively.
When benchmarks conflict, decide explicitly: accept partial pass targets, split into specialized agents (plus triage), or refactor the agent’s instructions from first principles.
Prefer tool guidance in the tool’s name/docstring over prompt clutter—rename tools and write clear docstrings; keep prompts lean and role-focused.
Add evaluation agents to score complex, natural-language outputs with typed results (e.g., pass/fail + feedback), avoiding brittle string matching and enabling programmatic gating.
Grounding agents are essential for RAG—verify answers are supported by provided context/citations; on failure, regenerate, block with a static response, or pass grounded feedback forward.
Guardrails operationalize grounding—wrap agents with output guardrails that trip on ungrounded answers and surface feedback/flags for policy (retry, block, or annotate).
Rubrics turn subjective quality into criteria and scales—use them with critic agents for high-variance generation (images, reports, graphs); loop generation ↔ critique until a passing score or attempt limit.
Because LLMs are stochastic, benchmark for stability—not just accuracy—by rerunning tests and watching variance, latency, and token use.
Observability is non-negotiable: OpenAI Traces show every tool call and LLM turn; Arize Phoenix (OpenTelemetry) adds projects, sessions, metrics, costs, experiments, and evaluators.
Instrument Phoenix cleanly—set the collector endpoint, replace default processors, register a tracer, and wrap runs with named traces; attach sessions and metadata for filtering and cohort analysis.
Build datasets from spans, run experiments/evaluators, and use annotations to capture structured human feedback—feed these signals back into prompts, tools, and policies.
Production rule: never promote agents without active evaluation/feedback loops—benchmarks, grounding/guardrails, critics with rubrics, tracing, and monitored KPIs (latency, cost, quality) working together.

FAQ

Why are evaluation and feedback essential for building robust agents?

Evaluation and feedback harden agent systems by catching failures early and continuously improving behavior. Beyond an agent’s internal Learn step in the Sense-Plan-Act-Learn loop, you should add external loops such as benchmark and red-team testing, grounding and critic agents, and human review. Centralizing results in an evaluation/feedback database enables trend analysis, regression detection, and alerting. A dedicated feedback agent can monitor this store to generate reports, dashboards, and notifications.
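A centralized evaluation/feedback store with a simple regression check could be sketched as follows. The schema, the function names, and the 0.05 tolerance are all assumptions for illustration; the in-memory SQLite database stands in for whatever store a real system would use.

```python
import sqlite3

# In-memory database standing in for a shared evaluation/feedback store.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE evaluations (
    run_id TEXT, benchmark TEXT, source TEXT, passed INTEGER, feedback TEXT)""")


def record(run_id: str, benchmark: str, source: str,
           passed: bool, feedback: str = "") -> None:
    """Store one result from a test, evaluator agent, or human reviewer."""
    conn.execute("INSERT INTO evaluations VALUES (?, ?, ?, ?, ?)",
                 (run_id, benchmark, source, int(passed), feedback))


def run_pass_rate(run_id: str) -> float:
    """Fraction of passing results for a run (0.0 if the run has no rows)."""
    row = conn.execute("SELECT AVG(passed) FROM evaluations WHERE run_id = ?",
                       (run_id,)).fetchone()
    return row[0] if row[0] is not None else 0.0


def regressed(baseline_run: str, candidate_run: str,
              tolerance: float = 0.05) -> bool:
    """Flag a regression when the candidate falls below the baseline by more than the tolerance."""
    return run_pass_rate(candidate_run) < run_pass_rate(baseline_run) - tolerance


# Hypothetical results for two prompt versions.
for name in ("b1", "b2", "b3", "b4"):
    record("v1", name, "benchmark", True)
for name, ok in (("b1", True), ("b2", False), ("b3", True), ("b4", False)):
    record("v2", name, "benchmark", ok, "" if ok else "answer not grounded")

print(run_pass_rate("v1"), run_pass_rate("v2"), regressed("v1", "v2"))
```

A feedback agent or scheduled job could call a check like `regressed` after each run and raise an alert when it returns true.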

What is Test-Driven Agent Development (TDAD) and how does it differ from traditional TDD?

TDAD adapts TDD to agents by cycling through Think → Test → Prompt → Refactor, with benchmarks driving minimal prompt/tool changes. Unlike classic TDD, agent tests must run multiple times because LLM outputs are variable; consistency across runs is part of the bar. You iteratively refine the smallest set of instructions and tools needed to pass benchmarks, regularly re-running earlier tests to avoid regressions.

How should I design effective benchmarks for agents?

Start with clear goals and create input–output benchmarks that include both expected answers and representative wrong answers. Use deterministic checks when possible (e.g., structured output, exact matches) and introduce evaluator agents for fuzzy natural-language scoring, classification, or rubric-based judgments. Run each test multiple times to gauge stability, and record pass rates to spot drift and variability.
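As a rough sketch of an input-output benchmark that carries both an expected answer and representative wrong answers, consider the following. The `Benchmark` class and `deterministic_check` function are hypothetical names, and substring matching is only one possible deterministic check.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Benchmark:
    question: str
    expected: str
    wrong: List[str] = field(default_factory=list)  # representative wrong answers


def deterministic_check(case: Benchmark, answer: str) -> bool:
    """Pass only if the expected answer appears and no known wrong answer does."""
    text = answer.lower()
    if any(w.lower() in text for w in case.wrong):
        return False
    return case.expected.lower() in text


case = Benchmark(
    question="What year did Apollo 11 land on the Moon?",
    expected="1969",
    wrong=["1968", "1970"],
)

print(deterministic_check(case, "Apollo 11 landed in 1969."))    # True
print(deterministic_check(case, "It landed in 1968 or 1969."))   # False
```

Listing wrong answers lets the check reject hedged responses that mention the right value alongside a wrong one. Answers too fuzzy for this kind of matching are where an evaluator agent takes over.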

When should I use grounding, critic, or evaluation agents?

Use a grounding agent for retrieval-augmented generation (RAG) to verify answers are supported by the provided context and block or regenerate ungrounded outputs. Use a critic agent to assess purely generated content against a rubric, providing targeted feedback and optional retry limits. Use an evaluation agent for more general or complex outputs (mixing retrieval and generation), returning typed pass/fail plus feedback to guide improvement.
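Typed pass/fail plus feedback can be illustrated without any agent framework. In this sketch, `EvaluationResult` and `parse_evaluation` are hypothetical names, and the raw JSON string stands in for whatever an evaluation agent actually returns.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationResult:
    passed: bool
    feedback: str


def parse_evaluation(raw: str) -> EvaluationResult:
    """Validate an evaluator's JSON reply and convert it into a typed result."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("passed"), bool):
        raise ValueError("evaluator must return an object with a boolean 'passed' field")
    return EvaluationResult(passed=data["passed"],
                            feedback=str(data.get("feedback", "")))


# Hypothetical evaluator reply.
raw_reply = '{"passed": false, "feedback": "Answer omits the refund window."}'
result = parse_evaluation(raw_reply)

if not result.passed:
    # Programmatic gating: the feedback can be passed back for regeneration.
    print("blocked:", result.feedback)
```

Because the result is typed, downstream code can gate on `result.passed` directly instead of matching strings in free-form evaluator text.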

What are best practices for prompt and tool design under TDAD?

Keep tool-usage directions and output formatting out of the agent’s instruction prompt; instead, rely on tool metadata/descriptions and typed outputs to guide behavior. Make the smallest possible changes to prompts and tools to pass failing benchmarks, then retest all prior benchmarks to prevent regressions. Prefer clarifying tool names/descriptions over adding prompt instructions, and pin formats via output types rather than prose rules.

How do I implement a grounding agent as a guardrail?

Provide the same retrieval context to both the answering agent and the grounding agent, then verify the answer is supported by that context. Use an output guardrail to trip on ungrounded answers, optionally returning a static fallback or triggering controlled regeneration with feedback. You can also set policies like “allow up to three retries, then block” and include grounding feedback in final outputs for transparency.
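The "allow up to three retries, then block" policy might be sketched like this. It is a framework-free illustration under stated assumptions: `GroundingTripwire`, `guarded_answer`, and both stub agents are hypothetical, and the substring test is a toy stand-in for an LLM-based grounding agent.

```python
from typing import Callable, Optional, Tuple


class GroundingTripwire(Exception):
    """Raised when an answer is still ungrounded after all retries."""


def guarded_answer(answer_agent: Callable[[str, str, Optional[str]], str],
                   grounding_agent: Callable[[str, str], Tuple[bool, str]],
                   question: str, context: str, max_retries: int = 3) -> str:
    """Give both agents the same context; retry with grounding feedback, then trip."""
    feedback = None
    for _ in range(max_retries + 1):  # first attempt plus retries
        answer = answer_agent(question, context, feedback)
        grounded, feedback = grounding_agent(answer, context)
        if grounded:
            return answer
    raise GroundingTripwire(feedback)


CONTEXT = "Refunds are accepted within 30 days of purchase."
QUESTION = "How long do I have to request a refund?"
FALLBACK = "I could not find a supported answer."


# Hypothetical stubs: the first answer is ungrounded, the retry is grounded.
def stub_answer_agent(question: str, context: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "Refunds are accepted within 60 days."
    return "Refunds are accepted within 30 days of purchase."


def stub_grounding_agent(answer: str, context: str) -> Tuple[bool, str]:
    if answer in context:  # toy check; a real grounding agent would use an LLM
        return True, ""
    return False, "claim not found in context"


try:
    final = guarded_answer(stub_answer_agent, stub_grounding_agent, QUESTION, CONTEXT)
except GroundingTripwire:
    final = FALLBACK  # static response when the guardrail trips
print(final)
```

Raising an exception keeps the policy decision (retry, block, or annotate) with the caller rather than burying it inside the answering agent.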

How should I handle benchmark conflicts and persistent failures?

Decide whether all benchmarks must pass; sometimes a realistic target is a high pass rate (e.g., 90%), with remaining failures serving as negative examples. If failures cluster by type, consider adding a specialized agent or triage pattern—but weigh the added complexity. Otherwise, refactor the agent from first principles with a focus on the most troublesome benchmarks, always making minimal, measurable changes.
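Checking a pass-rate target and looking for clustered failures takes only a few lines. The result tuples, category labels, and function names below are invented for illustration; the 90% target mirrors the example in the answer above.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical results: (benchmark id, category, passed)
results: List[Tuple[str, str, bool]] = [
    ("b01", "lookup", True), ("b02", "lookup", True), ("b03", "lookup", True),
    ("b04", "summary", True), ("b05", "summary", True), ("b06", "summary", True),
    ("b07", "date-math", False), ("b08", "date-math", False),
    ("b09", "date-math", True), ("b10", "lookup", True),
]


def meets_target(results: List[Tuple[str, str, bool]], target: float = 0.9) -> bool:
    """True when the overall pass rate reaches the agreed target."""
    passed = sum(1 for _, _, ok in results if ok)
    return passed / len(results) >= target


def failure_clusters(results: List[Tuple[str, str, bool]]) -> Counter:
    """Count failures per category to reveal whether they cluster by type."""
    return Counter(category for _, category, ok in results if not ok)


print(meets_target(results))                      # False: 8 of 10 passed
print(failure_clusters(results).most_common(1))   # [('date-math', 2)]
```

When every failure lands in one category, as here, that is the signal to weigh a specialized agent or triage step against the added complexity.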

How do rubrics improve evaluation of complex outputs?

Rubrics translate goals into explicit criteria, scales, and descriptions, enabling consistent, explainable scoring beyond simple right/wrong checks. They guide critic or evaluation agents to provide granular feedback by dimension (e.g., relevance, style, tone) and compute aggregate or weighted scores. Maintain logs and scores for later analysis, iterate on criteria for your use case, and use rubric-aligned feedback to drive targeted refinements.
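A weighted rubric score can be computed as in the sketch below. The criteria names come from the example dimensions above, while the weights, the 1-to-5 scale, and the `weighted_score` helper are assumptions for illustration.

```python
from typing import Dict

# Assumed weights per criterion; they need not sum to 1 because the
# score is normalized by the total weight.
RUBRIC_WEIGHTS: Dict[str, float] = {"relevance": 0.5, "style": 0.3, "tone": 0.2}


def weighted_score(scores: Dict[str, int], weights: Dict[str, float],
                   scale_max: int = 5) -> float:
    """Combine per-criterion scores (1..scale_max) into a 0..1 weighted score.

    Raises KeyError if the critic omitted a criterion from its scores.
    """
    total_weight = sum(weights.values())
    weighted = sum(weights[c] * scores[c] / scale_max for c in weights)
    return weighted / total_weight


# Hypothetical scores returned by a critic agent for one output.
critic_scores = {"relevance": 4, "style": 3, "tone": 5}
overall = weighted_score(critic_scores, RUBRIC_WEIGHTS)
print(f"overall: {overall:.2f}")  # overall: 0.78
```

Keeping the per-criterion scores alongside the aggregate preserves the explainability that rubrics are meant to provide: a low overall score can be traced to the dimension that caused it.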

What role should human feedback play, and how do I mitigate its pitfalls?

Any user-facing agent should capture simple feedback (e.g., thumbs up/down) and comments to surface edge cases, policy issues, and tonal problems. Store human feedback alongside test and evaluator results for cross-analysis, but remember humans are biased and can be wrong—verify before acting. Use a feedback agent to watch for patterns and trigger alerts when quality or safety issues trend negatively.

How does Phoenix help with tracing, evaluation, and feedback loops?

Phoenix adds deep, trace-level observability across agents, tools, and LLM spans, with latency, token, and cost insights. It integrates via OpenTelemetry, supports sessions and metadata for filtering, and lets you build datasets from real traces to run evaluators and experiments. Annotations enable structured human feedback on spans, which you can later analyze or convert into evaluation datasets—closing the loop from production signals back to prompts, tools, and policies.
