Overview

1 Why your AI projects need a platform

AI prototypes are easy to demo but hard to scale. This chapter argues that the path from “it works on my laptop” to reliable, cost-effective, and safe production systems demands platform thinking from day one. The model call is just a tiny part of the work—the real effort lies in the surrounding infrastructure for performance, governance, integration, and operations. Treating these needs as shared, reusable capabilities—not bespoke code inside each app—turns quick wins into durable systems rather than escalating maintenance burdens.

Through the story of a seemingly simple chatbot, the chapter catalogs the predictable failure modes: unpredictable latency, brittle scalability, opaque costs, unsafe outputs, ungrounded answers, and the inability to experiment systematically. Left unaddressed, these issues compound into AI sprawl—multiple teams reinventing similar components with inconsistent security, logging, and data practices, creating organization-wide technical debt. Building on the “hidden technical debt” insight, the chapter shows how GenAI further widens the infrastructure gap: beyond model calls, teams now need real-time knowledge management and retrieval, session memory, tool orchestration, advanced observability, safety and compliance layers, and multi-model routing—making the model itself only a small fraction of the total system.

To resolve this, the chapter presents a service-oriented platform blueprint that centralizes common capabilities. Data and Session Services provide context-aware intelligence; a Workflow Service enables robust multi-step orchestration; a Tool Service manages secure integrations; a Guardrails Service enforces safety and compliance; Observability and Evaluation Services deliver visibility and systematic improvement; a Model Service abstracts providers to prevent lock-in; and an Infrastructure Layer with an API Gateway supplies scalable, secure foundations. A walkthrough of a retail assistant illustrates how these services coordinate seamlessly to deliver accurate, safe, and cost-efficient responses. The payoff is faster delivery, consistent governance, lower costs, and resilient scaling—and the rest of the book details how to build each component in practice.

Figure: The Hidden Infrastructure Iceberg in Traditional ML Systems. The actual machine learning code (the dark center box) represents only 5-10% of system complexity, while the surrounding components for data collection, feature extraction, monitoring, and serving constitute 90-95% of the engineering effort. This foundational insight, from the 2015 NeurIPS paper "Hidden Technical Debt in Machine Learning Systems," explains why ML systems are so complex to put into production.

Figure: The GenAI Infrastructure Explosion. The model API call now represents just 2% of total system complexity, even less than in traditional ML. Modern GenAI applications require vastly expanded infrastructure: knowledge management systems, safety and compliance layers, session management for conversations, tool integration platforms, advanced observability, and multi-provider model orchestration. This architectural shift explains why simple prototypes become complex production systems.

Figure: AI Platform Architecture. The platform design addresses the core requirements through specialized services: the Data Service and Session Service provide context-aware intelligence; the Workflow Service enables multi-step orchestration; the Tool Service handles dynamic integrations; the Guardrails Service ensures safety and compliance; the Observability and Evaluation Services provide monitoring and experimentation; the Model Service abstracts provider differences; and the Infrastructure Layer with its API Gateway supplies scalable foundations. Each service operates independently while sharing common patterns, so teams need not rebuild foundational components.

Figure: The Conversational Request Flow. Maria's simple question passes through multiple platform services before a response returns. The Workflow Service orchestrates the interaction, coordinating calls to the Session Service to retrieve conversation history, the Guardrails Service to validate both input and output, the Data Service to search organizational knowledge, and the Model Service to generate a contextually appropriate response. Throughout this sequence the Observability Service tracks costs and performance, showing how platform services turn a simple interaction into a production-ready AI application.
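
To make the flow concrete, here is a minimal, self-contained Python sketch of that coordination. Every class, method, and message below is an illustrative stub, not the platform's actual API; each stands in for the corresponding service described above.

```python
# A minimal sketch of the request flow described above. All names here are
# illustrative placeholders; each stub stands in for a real platform service.

class SessionService:
    def __init__(self):
        self._histories = {}

    def get_history(self, session_id):
        return self._histories.setdefault(session_id, [])

    def append(self, session_id, role, text):
        self.get_history(session_id).append({"role": role, "text": text})


class GuardrailsService:
    BLOCKED = ("ssn", "credit card number")  # assumed policy terms

    def validate(self, text):
        # Real guardrails run moderation, PII detection, and policy checks;
        # this stub only scans for a few blocked phrases.
        return not any(term in text.lower() for term in self.BLOCKED)


class DataService:
    def search(self, query):
        # Stand-in for semantic search over organizational knowledge.
        return [f"(knowledge snippet relevant to: {query})"]


class ModelService:
    def generate(self, prompt):
        # Stand-in for a call to whichever provider the platform routes to.
        return f"Answer grounded in: {prompt[:60]}..."


class WorkflowService:
    """Coordinates one conversational turn across the platform services."""

    def __init__(self):
        self.sessions = SessionService()
        self.guardrails = GuardrailsService()
        self.data = DataService()
        self.model = ModelService()

    def handle_turn(self, session_id, user_message):
        if not self.guardrails.validate(user_message):   # input safety check
            return "Sorry, I can't help with that request."
        history = self.sessions.get_history(session_id)  # conversation state
        context = self.data.search(user_message)         # knowledge grounding
        prompt = f"History: {history}\nContext: {context}\nUser: {user_message}"
        answer = self.model.generate(prompt)             # model call
        if not self.guardrails.validate(answer):         # output safety check
            return "That response was blocked by policy."
        self.sessions.append(session_id, "user", user_message)
        self.sessions.append(session_id, "assistant", answer)
        return answer


print(WorkflowService().handle_turn("maria-1", "What is the return policy?"))
```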

Summary

  • AI prototypes consistently fail in production because they lack the infrastructure that real-world usage demands.
  • The model API call represents only 2% of total system complexity in modern GenAI applications.
  • The remaining 98% consists of knowledge management systems, session management, safety and compliance layers, tool integration platforms, observability systems, model management, and infrastructure layers.
  • AI sprawl emerges when teams build disconnected AI solutions independently, each implementing its own version of session management, cost tracking, safety controls, and observability. This duplication makes each new AI feature progressively harder to build as teams navigate a maze of incompatible one-off implementations.
  • Context-aware intelligence rests on two complementary capabilities. The Data Service handles document ingestion, vector embedding generation, and semantic search. The Session Service manages conversational state by tracking user interactions, preferences, and conversation history across multiple exchanges.
  • Multi-step orchestration through the Workflow Service coordinates complex AI processes that mix sequential and parallel operations, where later steps depend on earlier results and any step can fail.
  • Model abstraction through the Model Service provides a unified interface that works seamlessly with any AI provider—GPT-4, Claude, local Llama models, or future providers—without requiring application code changes. The service handles provider-specific API differences, response formatting variations, error handling strategies, intelligent routing based on task requirements, automatic fallback, and cost optimization.
  • Dynamic tool integration via the Tool Service provides registration and discovery mechanisms for external APIs and services, handling authentication patterns, rate limiting, error recovery, and result caching automatically (see the sketch after this list).
  • Safety and compliance enforcement through the Guardrails Service ensures every input and output passes through configurable safety filters automatically, rather than relying solely on model behavior that users can manipulate. This prevents scenarios where AI assistants provide unauthorized advice or expose sensitive data, regardless of prompt engineering attempts.
  • The Observability Service handles operational monitoring by tracking costs per request, measuring performance across services, implementing distributed tracing that follows requests through multiple components, and collecting system metrics that identify bottlenecks before they impact users.
  • The Evaluation Service focuses on AI-specific assessment through systematic experiments, A/B testing infrastructure, prompt versioning, and quality measurement that enables data-driven optimization rather than guesswork.
  • Scalable infrastructure foundations provided by the Infrastructure Layer and API Gateway handle configuration management for secrets and environment settings, service discovery so components can find each other, resource allocation that scales individual services based on load, and deployment automation that enables reliable updates without downtime.
  • Service-oriented architecture enables independent scaling where high-demand features don't impact other applications, clear boundaries where teams can work on different services without coordination, and shared infrastructure patterns that prevent duplication.
  • Each service operates independently with well-defined API contracts, allowing the Model Service to scale separately from the Session Service, the Data Service to use different storage technology than the Workflow Service, and teams to deploy updates to individual services without affecting the entire platform.
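
As referenced in the tool-integration point above, here is a small, hypothetical sketch of the registration-and-discovery pattern a Tool Service provides. The registry, decorator, and order_status tool are invented for illustration; a real service would also handle authentication, rate limiting, and error recovery.

```python
# A sketch of Tool Service registration, discovery, and result caching.
# All names are hypothetical; this is not the book's concrete API.

class ToolService:
    def __init__(self):
        self._tools = {}
        self._cache = {}

    def register(self, name, description):
        """Decorator that registers a callable as a discoverable tool."""
        def wrap(fn):
            self._tools[name] = {"fn": fn, "description": description}
            return fn
        return wrap

    def discover(self):
        """Let agents and workflows list the tools available to them."""
        return {name: meta["description"] for name, meta in self._tools.items()}

    def call(self, name, **kwargs):
        key = (name, tuple(sorted(kwargs.items())))
        if key in self._cache:                 # naive result caching
            return self._cache[key]
        result = self._tools[name]["fn"](**kwargs)
        self._cache[key] = result
        return result


tools = ToolService()

@tools.register("order_status", "Look up the status of an order by id")
def order_status(order_id):
    # A real tool would call an external API with auth, retries, and limits.
    return {"order_id": order_id, "status": "shipped"}


print(tools.discover())
print(tools.call("order_status", order_id="A123"))
```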

FAQ

Why do AI prototypes that look great in demos often fail in production?
Because prototypes ignore the hard parts that surface under real load: unpredictable latency, lack of concurrency and rate limits, missing monitoring, no cost controls, and weak safety. What feels like a simple API call becomes a distributed system that needs observability, budget guardrails, compliance, data grounding, and systematic iteration, none of which are present in quick demos.

What are the biggest gaps between an AI prototype and a production system?
Production systems must deliver predictable response times, scale to concurrent users, track and control costs per request, enforce security and safety, integrate with live knowledge, and support controlled experimentation. These requirements demand shared infrastructure, including monitoring, queueing, rate limiting, audit logs, retrieval, and A/B testing, that prototypes typically skip.

What is the “infrastructure iceberg,” and why is the model only about 2% of the work?
The model call is a tiny black box within a much larger system. The bulk of engineering lies in knowledge ingestion and search, session memory, safety and compliance, tool integration, observability, model routing, and scalable deployment. In modern GenAI, this surrounding infrastructure accounts for roughly 98% of complexity.

What is AI sprawl, and how does it create technical debt?
AI sprawl is the proliferation of disconnected AI solutions across teams, each reinventing session stores, token trackers, logging, and safety in different ways. It fragments knowledge, introduces security inconsistencies, complicates audits, and slows delivery as every new feature navigates a maze of one-off decisions and hidden assumptions.

Which core capabilities do reliable AI applications need?
They need context-aware intelligence (live knowledge and session memory), multi-step orchestration, dynamic tool integration, end-to-end safety and compliance, deep observability and evaluation, model abstraction to avoid lock-in, and a scalable infrastructure foundation for configuration, deployment, and autoscaling.

How does a platform help with performance, scaling, and cost control?
A platform provides request queueing, rate limiting, and autoscaling for consistent latency; real-time monitoring to detect bottlenecks; and per-request cost attribution, spending limits, and alerts to keep budgets in check. These shared services prevent outages during traffic spikes and stop surprise invoices.

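As a rough illustration of two of these controls, the sketch below implements a token-bucket rate limiter and a simple spending guard. The rate, capacity, budget, and per-token price are invented numbers, not recommendations.

```python
# Hypothetical platform-side controls: a token-bucket rate limiter plus a
# per-request cost guard. All constants are made up for illustration.
import time

class TokenBucket:
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec          # how fast permits refill
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller should queue or reject


class CostGuard:
    def __init__(self, monthly_budget_usd):
        self.budget = monthly_budget_usd
        self.spent = 0.0

    def charge(self, tokens_used, price_per_1k_tokens=0.002):  # assumed price
        cost = tokens_used / 1000 * price_per_1k_tokens
        if self.spent + cost > self.budget:
            raise RuntimeError("monthly AI budget exceeded")
        self.spent += cost
        return cost


limiter = TokenBucket(rate_per_sec=5, capacity=10)
guard = CostGuard(monthly_budget_usd=500)

if limiter.allow():
    print(f"request cost: ${guard.charge(tokens_used=1200):.4f}")
```
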
How do Data and Session Services enable context-aware intelligence?
The Data Service ingests documents, creates embeddings, and runs semantic search to ground responses in current organizational knowledge. The Session Service maintains multi-turn conversation history and user state, so answers reflect prior context and preferences. Together, they produce accurate, personalized, and up-to-date responses.

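The retrieval half of this pairing can be sketched in a few lines. The example below substitutes a hashed bag-of-words vector for a learned embedding model so it runs with no external dependencies; real Data Services use embedding models and a vector database, but the ingest-embed-search shape is the same.

```python
# Toy semantic search: hashed bag-of-words vectors stand in for learned
# embeddings so the example is self-contained. Collisions are possible at
# this tiny dimensionality; real systems use embedding models instead.
import hashlib
import math

DIM = 64

def embed(text):
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]       # unit-normalize for cosine

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = [
    "You can return items within 30 days of purchase.",
    "Shipping is free on orders over fifty dollars.",
]
index = [(doc, embed(doc)) for doc in docs]   # ingestion + embedding

query = embed("how do i return an item")
best = max(index, key=lambda pair: cosine(query, pair[1]))
print(best[0])  # prints the returns document, the closest match
```
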
How does the platform enforce safety and compliance beyond model defaults?
A Guardrails Service validates every input and output with configurable policies: content moderation, PII detection, policy enforcement, bias monitoring, and audit logging. Safety becomes an always-on platform capability, preventing risky behavior (like unauthorized financial advice) regardless of prompt tricks.

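A toy version of such a filter pipeline might look like the following. The regex patterns and blocked-topic policy are invented examples; production guardrails layer ML moderation models, PII detectors, and audit logging on top of rules like these.

```python
# Illustrative always-on input/output filters. Patterns and policies here
# are assumptions, not a complete or production-grade rule set.
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
BLOCKED_TOPICS = ("guaranteed investment returns",)  # assumed policy

def check(text):
    """Return a list of violations; an empty list means the text passed."""
    findings = [name for name, pat in PII_PATTERNS.items() if pat.search(text)]
    findings += [t for t in BLOCKED_TOPICS if t in text.lower()]
    return findings

# The same check runs on user input and on model output before delivery.
for sample in ["My SSN is 123-45-6789", "How do returns work?"]:
    issues = check(sample)
    print("BLOCKED" if issues else "ALLOWED", "-", sample, issues)
```
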
How do observability and evaluation services support continuous improvement?
Observability provides distributed tracing, latency and error metrics, token and cost tracking, and alerts, giving teams real-time visibility into behavior and spend. Evaluation adds A/B testing, prompt and model versioning, and quality measurement so changes are data-driven rather than guesswork.

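As a rough sketch of per-request tracking, the decorator below records latency and token usage under a trace id. The metric fields and the tokens key in the stubbed response are assumptions made for illustration.

```python
# Hypothetical per-request observability: a decorator that records latency
# and token usage for each traced service call.
import functools
import time
import uuid

METRICS = []  # a real platform would ship these to a metrics backend

def traced(service_name):
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            trace_id = kwargs.pop("trace_id", str(uuid.uuid4()))
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            METRICS.append({
                "trace_id": trace_id,
                "service": service_name,
                "latency_ms": (time.perf_counter() - start) * 1000,
                "tokens": result.get("tokens", 0) if isinstance(result, dict) else 0,
            })
            return result
        return wrapper
    return deco

@traced("model-service")
def generate(prompt):
    time.sleep(0.01)  # stand-in for a real model call
    return {"text": f"reply to: {prompt}", "tokens": 42}

generate("hello")
print(METRICS[-1])
```
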
How does model abstraction reduce vendor lock-in and optimize outcomes?
A Model Service exposes a consistent interface across providers and local models, handles provider quirks, supports streaming and fallbacks, and routes requests based on task, cost, or latency. Teams can adopt new models or switch for savings and performance without rewriting application logic.

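A minimal sketch of that facade, with stub providers in place of real SDK clients, might look like this. The provider classes, names, and behavior are invented; the point is the single complete() interface and the automatic fallback loop.

```python
# A sketch of a provider-agnostic Model Service facade with fallback.
# Provider classes are stubs, not real SDK clients.

class ProviderError(Exception):
    pass

class HostedModelStub:
    name = "hosted-model"
    def complete(self, prompt):
        raise ProviderError("simulated provider outage")  # force fallback

class LocalLlamaStub:
    name = "local-llama"
    def complete(self, prompt):
        return f"[{self.name}] response to: {prompt[:40]}..."

class ModelService:
    def __init__(self, providers):
        self.providers = providers  # ordered by routing preference

    def complete(self, prompt):
        for provider in self.providers:
            try:
                return provider.complete(prompt)
            except ProviderError:
                continue  # automatic fallback to the next provider
        raise RuntimeError("all providers failed")

svc = ModelService([HostedModelStub(), LocalLlamaStub()])
print(svc.complete("Summarize our return policy"))
```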
