Overview

13 Production LLM system design

Moving LLM applications from prototype to production requires new engineering disciplines that account for probabilistic outputs, evolving prompts, and heightened safety needs. The chapter frames production LLM design around three imperatives: treat prompts and policies as first-class, versioned code; evaluate quality semantically rather than by exact matches; and implement multi-layered safety and governance so reliability, compliance, and costs remain predictable under real workloads. The goal is to operationalize LLM systems with the same rigor as any mission-critical platform—measuring, controlling, and iterating on behavior to deliver consistent, safe value at scale.

Practically, this means managing prompts as critical infrastructure with centralized versioning, labels, rollbacks, configuration coupling, A/B testing, and runtime resolution (e.g., via LangFuse) so teams can update behavior without redeployments and tie outcomes to specific prompt versions. Testing shifts to semantic evaluation using tools like DeepEval and G-Eval (LLM-as-judge) across multiple dimensions—correctness, relevance, completeness, and security boundary adherence—supported by helpers that integrate async systems with synchronous test runners. Security is addressed through adversarial testing (e.g., PromptFoo) and layered guardrails: input sanitization and PII detection, in-process controls, and output validation with configurable failure strategies, augmented by model-level safety settings and hallucination checks—altogether forming a governance loop that detects vulnerabilities, prevents their recurrence, and continuously regression-tests against them.

Cost discipline is woven into system design: understand token-based economics, match models to tasks, and use tiered routing so premium models serve only truly complex queries. Reduce token spend with prompt optimization and cache strategically—reuse stable context and answers via context and semantic caching—while handling freshness and threshold trade-offs. Finally, monitor costs and quality together (e.g., with LangFuse tracing and estimates) to spot expensive outliers, set budgets and alerts, and verify that spend maps to measurable impact. The chapter closes by positioning LLMOps as an evolution of MLOps: the engineering fundamentals remain, but success now depends on rigorous prompt management, probabilistic evaluation, layered safety, and continuous, cost-aware operations.
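
As a rough illustration of that tiered routing, the sketch below picks a model per query using assumed Gemini model names and a deliberately naive keyword heuristic; production routers typically rely on a cheap classifier model or embedding similarity rather than keyword rules.

# Rough sketch of tiered model routing; model names and the heuristic are
# illustrative, not the chapter's exact routing logic.
MODEL_TIERS = {
    "simple":   "gemini-1.5-flash-8b",  # cheapest: lookups, FAQs, routing
    "standard": "gemini-1.5-flash",     # balanced default
    "complex":  "gemini-1.5-pro",       # premium: multi-step reasoning, long context
}

def pick_model(query: str) -> str:
    q = query.lower()
    if len(q.split()) > 60 or any(k in q for k in ("compare", "architecture", "debug")):
        return MODEL_TIERS["complex"]
    if any(k in q for k in ("how do i", "configure", "error")):
        return MODEL_TIERS["standard"]
    return MODEL_TIERS["simple"]
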

Figures in this chapter:

  • Prompt injection attempt trying to make DakkaBot act like a pirate and bypass its documentation-assistant role.
  • Another prompt injection example attempting to make DakkaBot compare DataKrypt with competitors instead of staying within its boundaries.
  • PromptFoo test-generation report showing 48 adversarial prompts created across vulnerability categories and attack strategies.
  • PromptFoo web dashboard displaying test results with vulnerability assessments and pass/fail status for different attack types.
  • Detailed PromptFoo evaluation results showing specific test cases where the bot failed.
  • Multi-layered safety guardrails architecture diagram showing the input guards, LLM processing, and output guards workflow.
  • Guardrails AI Hub website interface for API key generation and validator marketplace access.
  • Guardrails AI validator categories, including content safety, PII protection, and business-logic validators.
  • PII detection example showing an email address being automatically redacted to an <EMAIL_ADDRESS> placeholder.
  • Guardrails validation-failure message showing competitor-mention detection and redirection advice.

Summary

  • Non-deterministic outputs demand a paradigm shift in evaluation. Traditional assertion-based testing fails when the same input produces multiple valid responses. Success depends on building evaluation frameworks that assess semantic quality, factual accuracy, and safety rather than exact matches.
  • Prompt engineering emerges as a critical discipline that bridges natural language and software engineering. Treating prompts as code—with version control, testing frameworks, and systematic optimization—separates successful LLM applications from fragile prototypes.
  • Production LLM systems require multi-layered safety approaches that traditional ML doesn't face. Input sanitization, output validation, and continuous monitoring for harmful content become essential operational concerns, not just model performance metrics.
  • Cost optimization requires understanding LLM economics where you pay per token, not per server. Implement tiered model selection that routes simple queries to cheaper models, semantic caching for frequently asked questions, and prompt optimization to reduce token consumption.
  • Adversarial testing becomes mandatory for production systems. Use tools like PromptFoo to systematically probe for vulnerabilities including prompt injection, jailbreaking, and scope violations before deployment.
  • Production deployment requires evolved infrastructure strategies including auto-scaling that handles variable token loads, comprehensive monitoring covering quality metrics and cost tracking, and automated alerting for degradation patterns.

FAQ

Why should prompts be treated as critical infrastructure in production LLM systems?
Because core business logic increasingly lives in prompts. Treating them like code enables version control, rollbacks, A/B testing, and audit trails. Centralized management (for example with LangFuse) decouples prompts from deployments, lets non-engineers iterate safely, and ensures each prompt carries its own model/config so behavior stays consistent across environments.
How do I migrate from hardcoded prompts to managed prompts with LangFuse?
Create prompts in LangFuse with a unique name, type (chat/text), templated content, and labels (production/staging). Store model settings (temperature, max tokens) in the prompt’s config. In your app, fetch the active prompt by name+label at runtime, apply its config to the LLM, compile template variables (e.g., {{context}}, {{query}}), and record prompt name/version in trace metadata. Use boundary prompts as graceful fallbacks on errors.
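
A minimal sketch of that runtime resolution with the LangFuse Python SDK follows; the prompt name "dakkabot-rag", the default model settings, and the template variables are illustrative, and credentials are assumed to come from the standard LANGFUSE_* environment variables.

# Sketch: resolve the production-labeled prompt at runtime and carry its
# config and version into the request. Names and defaults are illustrative.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

def build_prompt(user_query: str, context: str):
    # Fetch the version currently labeled "production"; promoting a new
    # version in LangFuse changes behavior without a redeploy.
    prompt = langfuse.get_prompt("dakkabot-rag", label="production")

    # Model settings travel with the prompt version.
    model = prompt.config.get("model", "gemini-1.5-flash")
    temperature = prompt.config.get("temperature", 0.2)

    # Fill the {{context}} and {{query}} template variables.
    messages = prompt.compile(context=context, query=user_query)

    # Record which prompt produced this answer so traces map back to versions.
    metadata = {"prompt_name": prompt.name, "prompt_version": prompt.version}
    return messages, model, temperature, metadata
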
What makes a production-ready system prompt?
Define five elements explicitly: persona/role, constraints and safety rules, output format/style, task context and objectives (for example, “use only retrieved documentation”), and operational knowledge (user prefs, environment, session state). Include uncertainty handling (“say I don’t know when unsure”), source citation rules, and confidence indicators. This prevents hallucinations, scope creep, and inconsistent formatting.
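
For illustration only, a skeleton of such a prompt for the chapter's DakkaBot example might look like the following; the wording, placeholders ({{context}}, {{user_plan}}, and so on), and policies are stand-ins rather than the book's exact prompt.

# Illustrative skeleton of a production system prompt covering the five elements.
SYSTEM_PROMPT = """\
You are DakkaBot, a documentation assistant for DataKrypt products.

Constraints and safety rules:
- Answer only from the retrieved documentation provided in {{context}}.
- Do not discuss competitors, pricing, or anything outside the documentation.
- If the documentation does not cover the question, say "I don't know" and
  point the user to support.

Output format and style:
- Respond concisely: short answer first, then numbered steps.
- Cite the source document title for every factual claim.
- End with a confidence indicator (high / medium / low).

Operational knowledge:
- User plan: {{user_plan}} | Environment: {{environment}} | Session: {{session_id}}
"""
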
How do I test non-deterministic LLM outputs effectively?
Use semantic evaluation instead of string matching. Tools like DeepEval and G‑Eval (LLM-as-judge) score responses on criteria such as correctness, relevance, completeness, tone, and safety. Define clear, domain-specific criteria with thresholds, run multiple metrics per test, and maintain a golden dataset for regressions. Larger evaluator models improve reliability at higher cost/latency.
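
A small sketch with DeepEval's GEval metric follows; the criteria text, threshold, and sample inputs are illustrative, and the judge model is whatever DeepEval is configured to use.

# Sketch: LLM-as-judge correctness check with DeepEval's GEval metric.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "The answer must be factually consistent with the retrieved "
        "documentation and must not invent features or settings."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.7,  # fail the test below this score
)

test_case = LLMTestCase(
    input="How do I rotate my DataKrypt API key?",
    actual_output="Go to Settings > API Keys and click Rotate.",
    retrieval_context=["API keys can be rotated under Settings > API Keys."],
)

correctness.measure(test_case)            # scores 0-1 and records a judge rationale
print(correctness.score, correctness.reason)
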
How can I implement safety guardrails for production?
Adopt a multi-layer approach: input guards (PII detection, competitor mentions, injection screening), in-processing controls (model safety filters such as Gemini SafetySettings), and output validation (policy compliance, leakage checks). With Guardrails AI, pick appropriate on-fail actions: EXCEPTION/REFRAIN for hard stops, FIX/FILTER for sanitization, or REASK to regenerate. Always provide user-friendly fallbacks when blocking.
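
The sketch below shows one such layer with Guardrails AI, assuming the DetectPII validator has been installed from the Guardrails Hub; the entity list and on_fail choice are illustrative.

# Sketch: PII redaction layer with Guardrails AI (DetectPII from the Hub).
from guardrails import Guard
from guardrails.hub import DetectPII

guard = Guard().use(
    DetectPII,
    pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"],
    on_fail="fix",   # redact in place; use "exception" for a hard stop instead
)

outcome = guard.validate("Contact me at jane.doe@example.com for the logs.")
print(outcome.validated_output)  # email replaced with an <EMAIL_ADDRESS> placeholder
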
What is adversarial testing and how do I set it up for a RAG app?
Adversarial testing probes prompt injection, jailbreaks, excessive agency, hallucinations, and harmful content before production. With PromptFoo, configure a provider that calls your app, enable red-team plugins and strategies (basic/jailbreak/composite), generate tests, run them, and review the dashboard. Convert failures into regression tests and update system prompts/validators to close gaps.
How should I handle out-of-scope questions and runtime errors?
Use a boundary prompt that acknowledges limits, redirects users to the right channels, and preserves a professional UX. In code, catch exceptions and compile the boundary prompt with contextual variables (for example, topic and contact channel). This avoids cryptic errors and maintains trust even when retrieval or generation fails.
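
A minimal sketch of that fallback pattern, reusing the LangFuse client from the earlier example; answer_from_docs, llm, and the "dakkabot-boundary" prompt name are hypothetical.

# Sketch: graceful degradation via a boundary prompt instead of a raw error.
def handle_query(user_query: str) -> str:
    try:
        return answer_from_docs(user_query)          # normal RAG path
    except Exception:
        boundary = langfuse.get_prompt("dakkabot-boundary", label="production")
        messages = boundary.compile(
            topic=user_query,
            contact_channel="support@datakrypt.example",
        )
        return llm.generate(messages)                # polite redirect, no stack trace
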
What are practical cost-optimization strategies for production LLMs?
Combine: tiered model routing (cheap model for routing/simple queries, balanced for standard, premium for complex), caching (context reuse and semantic caching with a vector store like RedisVL), prompt token reduction (concise, non-redundant instructions), and observability (track tokens, cost per query, and expensive traces). Use budgets/alerts and LangFuse cost analytics to guide changes.
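
As one concrete piece of that, the sketch below shows semantic caching with RedisVL's SemanticCache, assuming a reachable Redis Stack instance and a recent RedisVL release; the distance threshold, TTL, and call_llm helper are illustrative.

# Sketch: semantic cache lookup before calling the LLM.
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="dakkabot-cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,   # stricter threshold = fewer but safer hits
    ttl=3600,                 # expire entries after an hour to limit staleness
)

def cached_answer(query: str) -> str:
    if hits := cache.check(prompt=query):
        return hits[0]["response"]             # reuse a semantically similar answer
    response = call_llm(query)                 # hypothetical LLM call
    cache.store(prompt=query, response=response)
    return response
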
When should I prefer managed endpoints over self-hosting open-source models?
Below roughly 100k monthly queries, pay-per-token managed endpoints (open or proprietary) are usually cheaper and simpler. Reserved compute (self-hosting) becomes attractive at high scale (often 500k+ monthly queries) where GPU costs amortize. Consider performance SLAs, privacy/compliance, and ops overhead. Many platforms now offer managed open-source endpoints that remove infra burden while preserving cost control.
What observability and governance signals matter most in production?
Trace prompt name/version, model and parameters, token usage, latency, cost per request, safety guard outcomes, evaluation scores (correctness, relevance, completeness), routing decisions/accuracy, cache hit rates, and fallback usage. Use labels for environment governance (production/staging), maintain audit trails for prompt edits, and enforce policy via validators with clear on-fail strategies.
