13 Production LLM system design
Moving LLM applications from prototype to production requires new engineering disciplines that account for probabilistic outputs, evolving prompts, and heightened safety needs. The chapter frames production LLM design around three imperatives: treat prompts and policies as first-class, versioned code; evaluate quality semantically rather than by exact matches; and implement multi-layered safety and governance so reliability, compliance, and costs remain predictable under real workloads. The goal is to operationalize LLM systems with the same rigor as any mission-critical platform—measuring, controlling, and iterating on behavior to deliver consistent, safe value at scale.
Practically, this means managing prompts as critical infrastructure: centralized versioning, labels, rollbacks, configuration coupling, A/B testing, and runtime resolution (e.g., via LangFuse) let teams update behavior without redeployments and tie outcomes to specific prompt versions. Testing shifts to semantic evaluation with frameworks such as DeepEval and LLM-as-judge metrics such as G-Eval, scoring responses across multiple dimensions (correctness, relevance, completeness, and adherence to security boundaries), supported by helpers that bridge async application code and synchronous test runners. Security is addressed through adversarial testing (e.g., with PromptFoo) and layered guardrails: input sanitization and PII detection, in-process controls, and output validation with configurable failure strategies, augmented by model-level safety settings and hallucination checks. Together these form a governance loop that detects, prevents, and continuously regression-tests against vulnerabilities.
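To make runtime prompt resolution concrete, a minimal sketch with the Langfuse Python SDK might look like the following; the prompt name, label, and template variable are illustrative assumptions, not the chapter's exact setup.

```python
from langfuse import Langfuse

# Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST.
langfuse = Langfuse()

# Resolve the prompt at runtime: whichever version currently carries the
# "production" label is used, so behavior changes without a redeployment.
prompt = langfuse.get_prompt("dakkabot-system", label="production")

# Fill in template variables and record the resolved version for attribution.
system_message = prompt.compile(product_name="DataKrypt")
print(prompt.version, system_message[:80])
```

Similarly, a semantic-quality check with DeepEval's G-Eval metric could be sketched as below; the criteria string, test inputs, and threshold are placeholders, and the judge model needs its own credentials (by default an OpenAI key).

```python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# LLM-as-judge metric: scores semantic correctness instead of exact string matches.
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output factually consistent with the expected output?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.7,
)

# In a real test this answer would come from the assistant under test.
bot_answer = "Rotate keys from the admin console; the old key stays valid for 24 hours."

test_case = LLMTestCase(
    input="How do I rotate my DataKrypt API keys?",
    actual_output=bot_answer,
    expected_output="API keys are rotated from the admin console; old keys remain valid for 24 hours.",
)
assert_test(test_case, [correctness])
```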
Cost discipline is woven into system design: understand token-based economics, match models to tasks, and use tiered routing so premium models serve only truly complex queries. Reduce token spend with prompt optimization and cache strategically—reuse stable context and answers via context and semantic caching—while handling freshness and threshold trade-offs. Finally, monitor costs and quality together (e.g., with LangFuse tracing and estimates) to spot expensive outliers, set budgets and alerts, and verify that spend maps to measurable impact. The chapter closes by positioning LLMOps as an evolution of MLOps: the engineering fundamentals remain, but success now depends on rigorous prompt management, probabilistic evaluation, layered safety, and continuous, cost-aware operations.
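As a framework-agnostic sketch of the routing and caching ideas, the code below pairs a similarity-threshold semantic cache with a simple complexity heuristic for tiered model selection; the embedding function, model names, threshold, and keywords are all assumptions chosen for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    """Reuse a previous answer when a new query is semantically close to a cached one."""

    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed_fn = embed_fn    # any text -> vector function (e.g., an embeddings API); assumed provided
        self.threshold = threshold  # higher = fewer wrong reuses, but a lower hit rate
        self._entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str) -> str | None:
        query_vec = self.embed_fn(query)
        for cached_vec, cached_answer in self._entries:
            if cosine_similarity(query_vec, cached_vec) >= self.threshold:
                return cached_answer
        return None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self.embed_fn(query), answer))

def choose_model(query: str) -> str:
    """Tiered routing: reserve the premium model for queries that look genuinely complex."""
    looks_complex = len(query) > 400 or any(
        keyword in query.lower() for keyword in ("compare", "architecture", "trade-off", "migrate")
    )
    return "premium-model" if looks_complex else "budget-model"
```

In production the keyword heuristic would typically give way to a small classifier, and cache entries would carry TTLs to manage the freshness and threshold trade-offs the chapter describes.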
The chapter's figures illustrate these practices:
- A prompt injection attempt trying to make DakkaBot act like a pirate and bypass its documentation assistant role.
- Another prompt injection attempt trying to make DakkaBot compare DataKrypt with competitors instead of staying within its boundaries.
- A PromptFoo test generation report showing 48 adversarial prompts created across vulnerability categories and attack strategies.
- The PromptFoo web dashboard displaying test results with vulnerability assessments and pass/fail status for different attack types.
- Detailed PromptFoo evaluation results highlighting the specific test cases where the bot failed.
- A multi-layered safety guardrails architecture diagram showing the input guards, LLM processing, and output guards workflow.
- The Guardrails AI Hub website interface for API key generation and validator marketplace access.
- Guardrails AI validator categories, including content safety, PII protection, and business logic validators.
- A PII detection example showing an email address automatically redacted to an <EMAIL_ADDRESS> placeholder.
- A Guardrails validation failure message showing competitor mention detection and redirection advice.
Summary
- Non-deterministic outputs demand a paradigm shift in evaluation. Traditional assertion-based testing fails when the same input produces multiple valid responses. Success depends on building evaluation frameworks that assess semantic quality, factual accuracy, and safety rather than exact matches.
- Prompt engineering emerges as a critical discipline that bridges natural language and software engineering. Treating prompts as code—with version control, testing frameworks, and systematic optimization—separates successful LLM applications from fragile prototypes.
- Production LLM systems require multi-layered safety approaches that traditional ML doesn't face. Input sanitization, output validation, and continuous monitoring for harmful content become essential operational concerns, not just model performance metrics (a minimal sketch of this layered flow follows the list).
- Cost optimization requires understanding LLM economics, where you pay per token rather than per server. Implement tiered model selection that routes simple queries to cheaper models, semantic caching for frequently asked questions, and prompt optimization to reduce token consumption.
- Adversarial testing becomes mandatory for production systems. Use tools like PromptFoo to systematically probe for vulnerabilities including prompt injection, jailbreaking, and scope violations before deployment.
- Production deployment requires evolved infrastructure strategies including auto-scaling that handles variable token loads, comprehensive monitoring covering quality metrics and cost tracking, and automated alerting for degradation patterns.
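As referenced from the safety bullet above, here is a rough, plain-Python sketch of the input guard, LLM call, output guard flow; the regexes, blocked phrases, competitor list, and failure strategy are simplified placeholders rather than the chapter's Guardrails AI configuration.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
INJECTION_MARKERS = ("ignore previous instructions", "act as", "system prompt")
COMPETITOR_NAMES = ("acme vault", "cipherly")  # hypothetical competitor list

def input_guard(user_message: str) -> str:
    """Layer 1: sanitize input -- redact PII and reject obvious injection attempts."""
    lowered = user_message.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        raise ValueError("Blocked: possible prompt injection")
    return EMAIL_RE.sub("<EMAIL_ADDRESS>", user_message)

def output_guard(answer: str, on_fail: str = "filter") -> str:
    """Layer 3: validate output -- here, keep the bot from discussing competitors."""
    if any(name in answer.lower() for name in COMPETITOR_NAMES):
        if on_fail == "exception":
            raise ValueError("Blocked: competitor mention in output")
        return "I can only help with DataKrypt documentation questions."
    return answer

def answer_safely(user_message: str, call_llm) -> str:
    """Input guard -> LLM call -> output guard, mirroring the layered architecture."""
    safe_input = input_guard(user_message)
    raw_answer = call_llm(safe_input)  # call_llm: any prompt -> completion function
    return output_guard(raw_answer)
```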