Overview

1 Why your AI projects need a platform

AI prototypes are easy to demo but hard to scale. This chapter argues that the path from “it works on my laptop” to reliable, cost-effective, and safe production systems demands platform thinking from day one. The model call is just a tiny part of the work—the real effort lies in the surrounding infrastructure for performance, governance, integration, and operations. Treating these needs as shared, reusable capabilities—not bespoke code inside each app—turns quick wins into durable systems rather than escalating maintenance burdens.

Through the story of a seemingly simple chatbot, the chapter catalogs the predictable failure modes: unpredictable latency, brittle scalability, opaque costs, unsafe outputs, ungrounded answers, and the inability to experiment systematically. Left unaddressed, these issues compound into AI sprawl—multiple teams reinventing similar components with inconsistent security, logging, and data practices, creating organization-wide technical debt. Building on the “hidden technical debt” insight, the chapter shows how GenAI further widens the infrastructure gap: beyond model calls, teams now need real-time knowledge management and retrieval, session memory, tool orchestration, advanced observability, safety and compliance layers, and multi-model routing—making the model itself only a small fraction of the total system.

To resolve this, the chapter presents a service-oriented platform blueprint that centralizes common capabilities. Data and Session Services provide context-aware intelligence; a Workflow Service enables robust multi-step orchestration; a Tool Service manages secure integrations; a Guardrails Service enforces safety and compliance; Observability and Evaluation Services deliver visibility and systematic improvement; a Model Service abstracts providers to prevent lock-in; and an Infrastructure Layer with an API Gateway supplies scalable, secure foundations. A walkthrough of a retail assistant illustrates how these services coordinate seamlessly to deliver accurate, safe, and cost-efficient responses. The payoff is faster delivery, consistent governance, lower costs, and resilient scaling—and the rest of the book details how to build each component in practice.

Figure: The Hidden Infrastructure Iceberg in Traditional ML Systems. The actual machine learning code (the dark center box) represents only 5-10% of system complexity, while the surrounding components for data collection, feature extraction, monitoring, and serving constitute 90-95% of the engineering effort. This foundational insight, from the 2015 NeurIPS paper "Hidden Technical Debt in Machine Learning Systems," explains why ML systems are so complex to put into production.

Figure: The GenAI Infrastructure Explosion. The model API call now represents just 2% of total system complexity, even less than in traditional ML. Modern GenAI applications require vastly expanded infrastructure: knowledge management systems, safety and compliance layers, session management for conversations, tool integration platforms, advanced observability, and multi-provider model orchestration. This architectural shift explains why simple prototypes become complex production systems.

Figure: AI Platform Architecture. The platform design addresses the core requirements through specialized services: the Data Service and Session Service provide context-aware intelligence; the Workflow Service enables multi-step orchestration; the Tool Service handles dynamic integrations; the Guardrails Service ensures safety and compliance; the Observability and Evaluation Services provide monitoring and experimentation; the Model Service abstracts provider differences; and the Infrastructure Layer with its API Gateway supplies scalable foundations. Each service operates independently while sharing common patterns, so teams need not rebuild foundational components.

Figure: The Conversational Request Flow. Maria's simple question passes through multiple platform services before a response returns. The Workflow Service orchestrates the interaction, coordinating calls to the Session Service to retrieve conversation history, the Guardrails Service to validate both input and output, the Data Service to search organizational knowledge, and the Model Service to generate a contextually appropriate response. Throughout this sequence the Observability Service tracks costs and performance, showing how platform services turn a simple interaction into a production-ready AI application.
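
To make the flow concrete, here is a minimal, self-contained Python sketch of that coordination. Every class, method, and message below is an illustrative stub, not the platform's actual API; each stands in for the corresponding service described above.

```python
# A minimal sketch of the request flow described above. All names here are
# illustrative placeholders; each stub stands in for a real platform service.

class SessionService:
    def __init__(self):
        self._histories = {}

    def get_history(self, session_id):
        return self._histories.setdefault(session_id, [])

    def append(self, session_id, role, text):
        self.get_history(session_id).append({"role": role, "text": text})


class GuardrailsService:
    BLOCKED = ("ssn", "credit card number")  # assumed policy terms

    def validate(self, text):
        # Real guardrails run moderation, PII detection, and policy checks;
        # this stub only scans for a few blocked phrases.
        return not any(term in text.lower() for term in self.BLOCKED)


class DataService:
    def search(self, query):
        # Stand-in for semantic search over organizational knowledge.
        return [f"(knowledge snippet relevant to: {query})"]


class ModelService:
    def generate(self, prompt):
        # Stand-in for a call to whichever provider the platform routes to.
        return f"Answer grounded in: {prompt[:60]}..."


class WorkflowService:
    """Coordinates one conversational turn across the platform services."""

    def __init__(self):
        self.sessions = SessionService()
        self.guardrails = GuardrailsService()
        self.data = DataService()
        self.model = ModelService()

    def handle_turn(self, session_id, user_message):
        if not self.guardrails.validate(user_message):   # input safety check
            return "Sorry, I can't help with that request."
        history = self.sessions.get_history(session_id)  # conversation state
        context = self.data.search(user_message)         # knowledge grounding
        prompt = f"History: {history}\nContext: {context}\nUser: {user_message}"
        answer = self.model.generate(prompt)             # model call
        if not self.guardrails.validate(answer):         # output safety check
            return "That response was blocked by policy."
        self.sessions.append(session_id, "user", user_message)
        self.sessions.append(session_id, "assistant", answer)
        return answer


print(WorkflowService().handle_turn("maria-1", "What is the return policy?"))
```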

Summary

  • AI prototypes consistently fail in production because they lack the infrastructure that real-world usage demands.
  • The model API call represents only 2% of total system complexity in modern GenAI applications.
  • The remaining 98% consists of knowledge management systems, session management, safety and compliance layers, tool integration platforms, observability systems, model management, and infrastructure layers.
  • AI sprawl emerges when teams build disconnected AI solutions independently, each implementing its own version of session management, cost tracking, safety controls, and observability. This duplication makes each new AI feature progressively harder to build as teams navigate a maze of incompatible one-off implementations.
  • Context-aware intelligence rests on two complementary capabilities. The Data Service handles document ingestion, vector embedding generation, and semantic search. The Session Service manages conversational state by tracking user interactions, preferences, and conversation history across multiple exchanges.
  • Multi-step orchestration through the Workflow Service coordinates complex AI processes that mix sequential and parallel operations, where later steps depend on earlier results and any step can fail.
  • Model abstraction through the Model Service provides a unified interface that works seamlessly with any AI provider—GPT-4, Claude, local Llama models, or future providers—without requiring application code changes. The service handles provider-specific API differences, response formatting variations, error handling strategies, intelligent routing based on task requirements, automatic fallback, and cost optimization.
  • Dynamic tool integration via the Tool Service provides registration and discovery mechanisms for external APIs and services, handling authentication patterns, rate limiting, error recovery, and result caching automatically (see the sketch after this list).
  • Safety and compliance enforcement through the Guardrails Service ensures every input and output passes through configurable safety filters automatically, rather than relying solely on model behavior that users can manipulate. This prevents scenarios where AI assistants provide unauthorized advice or expose sensitive data, regardless of prompt engineering attempts.
  • The Observability Service handles operational monitoring by tracking costs per request, measuring performance across services, implementing distributed tracing that follows requests through multiple components, and collecting system metrics that identify bottlenecks before they impact users.
  • The Evaluation Service focuses on AI-specific assessment through systematic experiments, A/B testing infrastructure, prompt versioning, and quality measurement that enables data-driven optimization rather than guesswork.
  • Scalable infrastructure foundations provided by the Infrastructure Layer and API Gateway handle configuration management for secrets and environment settings, service discovery so components can find each other, resource allocation that scales individual services based on load, and deployment automation that enables reliable updates without downtime.
  • Service-oriented architecture enables independent scaling where high-demand features don't impact other applications, clear boundaries where teams can work on different services without coordination, and shared infrastructure patterns that prevent duplication.
  • Each service operates independently with well-defined API contracts, allowing the Model Service to scale separately from the Session Service, the Data Service to use different storage technology than the Workflow Service, and teams to deploy updates to individual services without affecting the entire platform.
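
As referenced in the tool-integration point above, here is a small, hypothetical sketch of the registration-and-discovery pattern a Tool Service provides. The registry, decorator, and order_status tool are invented for illustration; a real service would also handle authentication, rate limiting, and error recovery.

```python
# A sketch of Tool Service registration, discovery, and result caching.
# All names are hypothetical; this is not the book's concrete API.

class ToolService:
    def __init__(self):
        self._tools = {}
        self._cache = {}

    def register(self, name, description):
        """Decorator that registers a callable as a discoverable tool."""
        def wrap(fn):
            self._tools[name] = {"fn": fn, "description": description}
            return fn
        return wrap

    def discover(self):
        """Let agents and workflows list the tools available to them."""
        return {name: meta["description"] for name, meta in self._tools.items()}

    def call(self, name, **kwargs):
        key = (name, tuple(sorted(kwargs.items())))
        if key in self._cache:                 # naive result caching
            return self._cache[key]
        result = self._tools[name]["fn"](**kwargs)
        self._cache[key] = result
        return result


tools = ToolService()

@tools.register("order_status", "Look up the status of an order by id")
def order_status(order_id):
    # A real tool would call an external API with auth, retries, and limits.
    return {"order_id": order_id, "status": "shipped"}


print(tools.discover())
print(tools.call("order_status", order_id="A123"))
```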

FAQ

Why do AI prototypes that look great in demos often fail in production?
Because prototypes ignore the hard parts that surface under real load: unpredictable latency, lack of concurrency and rate limits, missing monitoring, no cost controls, and weak safety. What feels like a simple API call becomes a distributed system that needs observability, budget guardrails, compliance, data grounding, and systematic iteration, none of which are present in quick demos.

What are the biggest gaps between an AI prototype and a production system?
Production systems must deliver predictable response times, scale to concurrent users, track and control costs per request, enforce security and safety, integrate with live knowledge, and support controlled experimentation. These requirements demand shared infrastructure, including monitoring, queueing, rate limiting, audit logs, retrieval, and A/B testing, that prototypes typically skip.

What is the “infrastructure iceberg,” and why is the model only about 2% of the work?
The model call is a tiny black box within a much larger system. The bulk of engineering lies in knowledge ingestion and search, session memory, safety and compliance, tool integration, observability, model routing, and scalable deployment. In modern GenAI, this surrounding infrastructure accounts for roughly 98% of complexity.

What is AI sprawl, and how does it create technical debt?
AI sprawl is the proliferation of disconnected AI solutions across teams, each reinventing session stores, token trackers, logging, and safety in different ways. It fragments knowledge, introduces security inconsistencies, complicates audits, and slows delivery as every new feature navigates a maze of one-off decisions and hidden assumptions.

Which core capabilities do reliable AI applications need?
They need context-aware intelligence (live knowledge and session memory), multi-step orchestration, dynamic tool integration, end-to-end safety and compliance, deep observability and evaluation, model abstraction to avoid lock-in, and a scalable infrastructure foundation for configuration, deployment, and autoscaling.

How does a platform help with performance, scaling, and cost control?
A platform provides request queueing, rate limiting, and autoscaling for consistent latency; real-time monitoring to detect bottlenecks; and per-request cost attribution, spending limits, and alerts to keep budgets in check. These shared services prevent outages during traffic spikes and stop surprise invoices.

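As a rough illustration of two of these controls, the sketch below implements a token-bucket rate limiter and a simple spending guard. The rate, capacity, budget, and per-token price are invented numbers, not recommendations.

```python
# Hypothetical platform-side controls: a token-bucket rate limiter plus a
# per-request cost guard. All constants are made up for illustration.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec          # how fast permits refill
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller should queue or reject


class CostGuard:
    def __init__(self, monthly_budget_usd):
        self.budget = monthly_budget_usd
        self.spent = 0.0

    def charge(self, tokens_used, price_per_1k_tokens=0.002):  # assumed price
        cost = tokens_used / 1000 * price_per_1k_tokens
        if self.spent + cost > self.budget:
            raise RuntimeError("monthly AI budget exceeded")
        self.spent += cost
        return cost


limiter = TokenBucket(rate_per_sec=5, capacity=10)
guard = CostGuard(monthly_budget_usd=500)

if limiter.allow():
    print(f"request cost: ${guard.charge(tokens_used=1200):.4f}")
```
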
How do Data and Session Services enable context-aware intelligence?
The Data Service ingests documents, creates embeddings, and runs semantic search to ground responses in current organizational knowledge. The Session Service maintains multi-turn conversation history and user state, so answers reflect prior context and preferences. Together, they produce accurate, personalized, and up-to-date responses.

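The retrieval half of this pairing can be sketched in a few lines. The example below substitutes a hashed bag-of-words vector for a learned embedding model so it runs with no external dependencies; real Data Services use embedding models and a vector database, but the ingest-embed-search shape is the same.

```python
# Toy semantic search: hashed bag-of-words vectors stand in for learned
# embeddings so the example is self-contained. Collisions are possible at
# this tiny dimensionality; real systems use embedding models instead.
import hashlib
import math

DIM = 64

def embed(text):
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]       # unit-normalize for cosine

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = [
    "You can return items within 30 days of purchase.",
    "Shipping is free on orders over fifty dollars.",
]
index = [(doc, embed(doc)) for doc in docs]   # ingestion + embedding

query = embed("how do i return an item")
best = max(index, key=lambda pair: cosine(query, pair[1]))
print(best[0])  # prints the returns document, the closest match
```
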
How does the platform enforce safety and compliance beyond model defaults?
A Guardrails Service validates every input and output with configurable policies: content moderation, PII detection, policy enforcement, bias monitoring, and audit logging. Safety becomes an always-on platform capability, preventing risky behavior (like unauthorized financial advice) regardless of prompt tricks.

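A toy version of such a filter pipeline might look like the following. The regex patterns and blocked-topic policy are invented examples; production guardrails layer ML moderation models, PII detectors, and audit logging on top of rules like these.

```python
# Illustrative always-on input/output filters. Patterns and policies here
# are assumptions, not a complete or production-grade rule set.
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
BLOCKED_TOPICS = ("guaranteed investment returns",)  # assumed policy

def check(text):
    """Return a list of violations; an empty list means the text passed."""
    findings = [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
    findings += [t for t in BLOCKED_TOPICS if t in text.lower()]
    return findings

# The same check runs on user input and on model output before delivery.
for sample in ["My SSN is 123-45-6789", "How do returns work?"]:
    issues = check(sample)
    print("BLOCKED" if issues else "ALLOWED", "-", sample, issues)
```
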
How do observability and evaluation services support continuous improvement?
Observability provides distributed tracing, latency and error metrics, token and cost tracking, and alerts, giving teams real-time visibility into behavior and spend. Evaluation adds A/B testing, prompt and model versioning, and quality measurement so changes are data-driven rather than guesswork.

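As a rough sketch of per-request tracking, the decorator below records latency and token usage under a trace id. The metric fields and the tokens key in the stubbed response are assumptions made for illustration.

```python
# Hypothetical per-request observability: a decorator that records latency
# and token usage for each traced service call.
import functools
import time
import uuid

METRICS = []  # a real platform would ship these to a metrics backend

def traced(service_name):
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            trace_id = kwargs.pop("trace_id", str(uuid.uuid4()))
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            METRICS.append({
                "trace_id": trace_id,
                "service": service_name,
                "latency_ms": (time.perf_counter() - start) * 1000,
                "tokens": result.get("tokens", 0) if isinstance(result, dict) else 0,
            })
            return result
        return wrapper
    return deco

@traced("model-service")
def generate(prompt):
    time.sleep(0.01)  # stand-in for a real model call
    return {"text": f"reply to: {prompt}", "tokens": 42}

generate("hello")
print(METRICS[-1])
```
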
How does model abstraction reduce vendor lock-in and optimize outcomes?
A Model Service exposes a consistent interface across providers and local models, handles provider quirks, supports streaming and fallbacks, and routes requests based on task, cost, or latency. Teams can adopt new models or switch for savings and performance without rewriting application logic.

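A minimal sketch of that facade, with stub providers in place of real SDK clients, might look like this. The provider classes, names, and behavior are invented; the point is the single complete() interface and the automatic fallback loop.

```python
# A sketch of a provider-agnostic Model Service facade with fallback.
# Provider classes are stubs, not real SDK clients.

class ProviderError(Exception):
    pass

class HostedModelStub:
    name = "hosted-model"
    def complete(self, prompt):
        raise ProviderError("simulated provider outage")  # force fallback

class LocalLlamaStub:
    name = "local-llama"
    def complete(self, prompt):
        return f"[{self.name}] response to: {prompt[:40]}..."

class ModelService:
    def __init__(self, providers):
        self.providers = providers  # ordered by routing preference

    def complete(self, prompt):
        for provider in self.providers:
            try:
                return provider.complete(prompt)
            except ProviderError:
                continue  # automatic fallback to the next provider
        raise RuntimeError("all providers failed")

svc = ModelService([HostedModelStub(), LocalLlamaStub()])
print(svc.complete("Summarize our return policy"))
```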
