Overview

13 Production LLM system design

Moving LLM applications from prototype to production requires new engineering disciplines that account for probabilistic outputs, evolving prompts, and heightened safety needs. The chapter frames production LLM design around three imperatives: treat prompts and policies as first-class, versioned code; evaluate quality semantically rather than by exact matches; and implement multi-layered safety and governance so reliability, compliance, and costs remain predictable under real workloads. The goal is to operationalize LLM systems with the same rigor as any mission-critical platform—measuring, controlling, and iterating on behavior to deliver consistent, safe value at scale.

Practically, this means managing prompts as critical infrastructure with centralized versioning, labels, rollbacks, configuration coupling, A/B testing, and runtime resolution (e.g., via LangFuse) so teams can update behavior without redeployments and tie outcomes to specific prompt versions. Testing shifts to semantic evaluation using tools like DeepEval and G-Eval (LLM-as-judge) across multiple dimensions—correctness, relevance, completeness, and security boundary adherence—supported by helpers that integrate async systems with synchronous test runners. Security is addressed through adversarial testing (e.g., PromptFoo) and layered guardrails: input sanitization and PII detection, in-process controls, and output validation with configurable failure strategies, augmented by model-level safety settings and hallucination checks—altogether forming a governance loop that detects vulnerabilities, prevents their recurrence, and continuously regression-tests against them.

Cost discipline is woven into system design: understand token-based economics, match models to tasks, and use tiered routing so premium models serve only truly complex queries. Reduce token spend with prompt optimization and cache strategically—reuse stable context and answers via context and semantic caching—while handling freshness and threshold trade-offs. Finally, monitor costs and quality together (e.g., with LangFuse tracing and estimates) to spot expensive outliers, set budgets and alerts, and verify that spend maps to measurable impact. The chapter closes by positioning LLMOps as an evolution of MLOps: the engineering fundamentals remain, but success now depends on rigorous prompt management, probabilistic evaluation, layered safety, and continuous, cost-aware operations.
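
As a rough illustration of that tiered routing, the sketch below picks a model per query using assumed Gemini model names and a deliberately naive keyword heuristic; production routers typically rely on a cheap classifier model or embedding similarity rather than keyword rules.

# Rough sketch of tiered model routing; model names and the heuristic are
# illustrative, not the chapter's exact routing logic.
MODEL_TIERS = {
    "simple":   "gemini-1.5-flash-8b",  # cheapest: lookups, FAQs, routing
    "standard": "gemini-1.5-flash",     # balanced default
    "complex":  "gemini-1.5-pro",       # premium: multi-step reasoning, long context
}

def pick_model(query: str) -> str:
    q = query.lower()
    if len(q.split()) > 60 or any(k in q for k in ("compare", "architecture", "debug")):
        return MODEL_TIERS["complex"]
    if any(k in q for k in ("how do i", "configure", "error")):
        return MODEL_TIERS["standard"]
    return MODEL_TIERS["simple"]
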

Figures in this chapter:

  • Prompt injection attempt trying to make DakkaBot act like a pirate and bypass its documentation-assistant role.
  • Another prompt injection example attempting to make DakkaBot compare DataKrypt with competitors instead of staying within its boundaries.
  • PromptFoo test-generation report showing 48 adversarial prompts created across vulnerability categories and attack strategies.
  • PromptFoo web dashboard displaying test results with vulnerability assessments and pass/fail status for different attack types.
  • Detailed PromptFoo evaluation results showing specific test cases where the bot failed.
  • Multi-layered safety guardrails architecture diagram showing the input guards, LLM processing, and output guards workflow.
  • Guardrails AI Hub website interface for API key generation and validator marketplace access.
  • Guardrails AI validator categories, including content safety, PII protection, and business-logic validators.
  • PII detection example showing an email address being automatically redacted to an <EMAIL_ADDRESS> placeholder.
  • Guardrails validation-failure message showing competitor-mention detection and redirection advice.

Summary

  • Non-deterministic outputs demand a paradigm shift in evaluation. Traditional assertion-based testing fails when the same input produces multiple valid responses. Success depends on building evaluation frameworks that assess semantic quality, factual accuracy, and safety rather than exact matches.
  • Prompt engineering emerges as a critical discipline that bridges natural language and software engineering. Treating prompts as code—with version control, testing frameworks, and systematic optimization—separates successful LLM applications from fragile prototypes.
  • Production LLM systems require multi-layered safety approaches that traditional ML doesn't face. Input sanitization, output validation, and continuous monitoring for harmful content become essential operational concerns, not just model performance metrics.
  • Cost optimization requires understanding LLM economics where you pay per token, not per server. Implement tiered model selection that routes simple queries to cheaper models, semantic caching for frequently asked questions, and prompt optimization to reduce token consumption.
  • Adversarial testing becomes mandatory for production systems. Use tools like PromptFoo to systematically probe for vulnerabilities including prompt injection, jailbreaking, and scope violations before deployment.
  • Production deployment requires evolved infrastructure strategies including auto-scaling that handles variable token loads, comprehensive monitoring covering quality metrics and cost tracking, and automated alerting for degradation patterns.

FAQ

Why should prompts be treated as critical infrastructure in production LLM systems?
Because core business logic increasingly lives in prompts. Treating them like code enables version control, rollbacks, A/B testing, and audit trails. Centralized management (for example with LangFuse) decouples prompts from deployments, lets non-engineers iterate safely, and ensures each prompt carries its own model/config so behavior stays consistent across environments.
How do I migrate from hardcoded prompts to managed prompts with LangFuse?
Create prompts in LangFuse with a unique name, type (chat/text), templated content, and labels (production/staging). Store model settings (temperature, max tokens) in the prompt’s config. In your app, fetch the active prompt by name+label at runtime, apply its config to the LLM, compile template variables (e.g., {{context}}, {{query}}), and record prompt name/version in trace metadata. Use boundary prompts as graceful fallbacks on errors.
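
A minimal sketch of that runtime resolution with the LangFuse Python SDK follows; the prompt name "dakkabot-rag", the default model settings, and the template variables are illustrative, and credentials are assumed to come from the standard LANGFUSE_* environment variables.

# Sketch: resolve the production-labeled prompt at runtime and carry its
# config and version into the request. Names and defaults are illustrative.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

def build_prompt(user_query: str, context: str):
    # Fetch the version currently labeled "production"; promoting a new
    # version in LangFuse changes behavior without a redeploy.
    prompt = langfuse.get_prompt("dakkabot-rag", label="production")

    # Model settings travel with the prompt version.
    model = prompt.config.get("model", "gemini-1.5-flash")
    temperature = prompt.config.get("temperature", 0.2)

    # Fill the {{context}} and {{query}} template variables.
    messages = prompt.compile(context=context, query=user_query)

    # Record which prompt produced this answer so traces map back to versions.
    metadata = {"prompt_name": prompt.name, "prompt_version": prompt.version}
    return messages, model, temperature, metadata
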
What makes a production-ready system prompt?
Define five elements explicitly: persona/role, constraints and safety rules, output format/style, task context and objectives (for example, “use only retrieved documentation”), and operational knowledge (user prefs, environment, session state). Include uncertainty handling (“say I don’t know when unsure”), source citation rules, and confidence indicators. This prevents hallucinations, scope creep, and inconsistent formatting.
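
For illustration only, a skeleton of such a prompt for the chapter's DakkaBot example might look like the following; the wording, placeholders ({{context}}, {{user_plan}}, and so on), and policies are stand-ins rather than the book's exact prompt.

# Illustrative skeleton of a production system prompt covering the five elements.
SYSTEM_PROMPT = """\
You are DakkaBot, a documentation assistant for DataKrypt products.

Constraints and safety rules:
- Answer only from the retrieved documentation provided in {{context}}.
- Do not discuss competitors, pricing, or anything outside the documentation.
- If the documentation does not cover the question, say "I don't know" and
  point the user to support.

Output format and style:
- Respond concisely: short answer first, then numbered steps.
- Cite the source document title for every factual claim.
- End with a confidence indicator (high / medium / low).

Operational knowledge:
- User plan: {{user_plan}} | Environment: {{environment}} | Session: {{session_id}}
"""
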
How do I test non-deterministic LLM outputs effectively?
Use semantic evaluation instead of string matching. Tools like DeepEval and G‑Eval (LLM-as-judge) score responses on criteria such as correctness, relevance, completeness, tone, and safety. Define clear, domain-specific criteria with thresholds, run multiple metrics per test, and maintain a golden dataset for regressions. Larger evaluator models improve reliability at higher cost/latency.
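
A small sketch with DeepEval's GEval metric follows; the criteria text, threshold, and sample inputs are illustrative, and the judge model is whatever DeepEval is configured to use.

# Sketch: LLM-as-judge correctness check with DeepEval's GEval metric.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "The answer must be factually consistent with the retrieved "
        "documentation and must not invent features or settings."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.7,  # fail the test below this score
)

test_case = LLMTestCase(
    input="How do I rotate my DataKrypt API key?",
    actual_output="Go to Settings > API Keys and click Rotate.",
    retrieval_context=["API keys can be rotated under Settings > API Keys."],
)

correctness.measure(test_case)            # scores 0-1 and records a judge rationale
print(correctness.score, correctness.reason)
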
How can I implement safety guardrails for production?
Adopt a multi-layer approach: input guards (PII detection, competitor mentions, injection screening), in-processing controls (model safety filters such as Gemini SafetySettings), and output validation (policy compliance, leakage checks). With Guardrails AI, pick appropriate on-fail actions: EXCEPTION/REFRAIN for hard stops, FIX/FILTER for sanitization, or REASK to regenerate. Always provide user-friendly fallbacks when blocking.
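
The sketch below shows one such layer with Guardrails AI, assuming the DetectPII validator has been installed from the Guardrails Hub; the entity list and on_fail choice are illustrative.

# Sketch: PII redaction layer with Guardrails AI (DetectPII from the Hub).
from guardrails import Guard
from guardrails.hub import DetectPII

guard = Guard().use(
    DetectPII,
    pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"],
    on_fail="fix",   # redact in place; use "exception" for a hard stop instead
)

outcome = guard.validate("Contact me at jane.doe@example.com for the logs.")
print(outcome.validated_output)  # email replaced with an <EMAIL_ADDRESS> placeholder
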
What is adversarial testing and how do I set it up for a RAG app?
Adversarial testing probes prompt injection, jailbreaks, excessive agency, hallucinations, and harmful content before production. With PromptFoo, configure a provider that calls your app, enable red-team plugins and strategies (basic/jailbreak/composite), generate tests, run them, and review the dashboard. Convert failures into regression tests and update system prompts/validators to close gaps.
How should I handle out-of-scope questions and runtime errors?
Use a boundary prompt that acknowledges limits, redirects users to the right channels, and preserves a professional UX. In code, catch exceptions and compile the boundary prompt with contextual variables (for example, topic and contact channel). This avoids cryptic errors and maintains trust even when retrieval or generation fails.
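
A minimal sketch of that fallback pattern, reusing the LangFuse client from the earlier example; answer_from_docs, llm, and the "dakkabot-boundary" prompt name are hypothetical.

# Sketch: graceful degradation via a boundary prompt instead of a raw error.
def handle_query(user_query: str) -> str:
    try:
        return answer_from_docs(user_query)          # normal RAG path
    except Exception:
        boundary = langfuse.get_prompt("dakkabot-boundary", label="production")
        messages = boundary.compile(
            topic=user_query,
            contact_channel="support@datakrypt.example",
        )
        return llm.generate(messages)                # polite redirect, no stack trace
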
What are practical cost-optimization strategies for production LLMs?
Combine: tiered model routing (cheap model for routing/simple queries, balanced for standard, premium for complex), caching (context reuse and semantic caching with a vector store like RedisVL), prompt token reduction (concise, non-redundant instructions), and observability (track tokens, cost per query, and expensive traces). Use budgets/alerts and LangFuse cost analytics to guide changes.
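
As one concrete piece of that, the sketch below shows semantic caching with RedisVL's SemanticCache, assuming a reachable Redis Stack instance and a recent RedisVL release; the distance threshold, TTL, and call_llm helper are illustrative.

# Sketch: semantic cache lookup before calling the LLM.
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="dakkabot-cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,   # stricter threshold = fewer but safer hits
    ttl=3600,                 # expire entries after an hour to limit staleness
)

def cached_answer(query: str) -> str:
    if hits := cache.check(prompt=query):
        return hits[0]["response"]             # reuse a semantically similar answer
    response = call_llm(query)                 # hypothetical LLM call
    cache.store(prompt=query, response=response)
    return response
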
When should I prefer managed endpoints over self-hosting open-source models?
Below roughly 100k monthly queries, pay-per-token managed endpoints (open or proprietary) are usually cheaper and simpler. Reserved compute (self-hosting) becomes attractive at high scale (often 500k+ monthly queries) where GPU costs amortize. Consider performance SLAs, privacy/compliance, and ops overhead. Many platforms now offer managed open-source endpoints that remove infra burden while preserving cost control.
What observability and governance signals matter most in production?
Trace prompt name/version, model and parameters, token usage, latency, cost per request, safety guard outcomes, evaluation scores (correctness, relevance, completeness), routing decisions/accuracy, cache hit rates, and fallback usage. Use labels for environment governance (production/staging), maintain audit trails for prompt edits, and enforce policy via validators with clear on-fail strategies.
