Overview

1 Getting started with MLOps and ML engineering

This chapter sets the stage for building production-grade machine learning systems by bridging the gap between model development and real-world operations. It introduces MLOps as the discipline that makes ML reliable, scalable, and auditable in practice, emphasizing that most failures stem from system and process issues rather than model complexity. With a hands-on, iterative approach, the book aims to turn readers—whether data scientists, software engineers, or aspiring ML engineers—into confident practitioners who can take projects from conception to deployment and continuous improvement.

Central to the chapter is the end-to-end ML life cycle. It begins with problem formulation and data work (collection, labeling, and versioning), then proceeds through training, evaluation, and business-oriented validation—steps that are inherently iterative and best orchestrated via automated pipelines. The chapter then distinguishes the dev/staging/production phase, where full automation, CI-triggered workflows, and deployment practices (APIs, containers, cloud, versioned rollouts) take over. Robust monitoring spans both system and ML-specific signals—throughput, errors, drift, and business KPIs—feeding into policies for model retraining and safe promotion. Reproducibility, reliability, and automation are highlighted as non-negotiables throughout.

The chapter also maps the skills and platform foundations needed for MLOps: solid software engineering, practical ML know-how, data engineering awareness, and strong automation practices, with Kubernetes as a useful backbone. It introduces an incremental ML platform built around Kubeflow (notebooks and pipelines) and complemented by components such as a feature store, a model registry, container registries, and CI/CD—while noting that tool choices are contextual and “build vs buy” is a pragmatic decision. Finally, it previews three projects that anchor these concepts: an OCR system, a tabular movie recommender (showcasing feature stores, testing, and drift detection), and a RAG-based documentation assistant that extends the same MLOps foundation to LLMOps with vector search, guardrails, and cost-aware operations.

Figures

The experimentation phase of the ML life cycle
The dev/staging/production phase of the ML life cycle
MLOps is a mix of different skill sets
The mental map of an ML setup, detailing the project flow from planning to deployment and the tools typically involved in the process
Traditional MLOps (right) extended with LLMOps components (left) for production LLM systems. Chapters 12-13 explore these extensions in detail.
An automated pipeline being executed in Kubeflow.
Feature Stores take in transformed data (features) as input, and have facilities to store, catalog, and serve features.
The model registry captures metadata, parameters, artifacts, and the ML model and in turn exposes a model endpoint.
Model deployment consists of the container registry, CI/CD, and automation working in concert to deploy ML services.

Summary

  • The Machine Learning (ML) life cycle provides a framework for confidently taking ML projects from idea to production. While iterative in nature, understanding each phase helps you navigate the complexities of ML development.
  • Building reliable ML systems requires a combination of skills spanning software engineering, MLOps, and data science. Rather than trying to master everything at once, focus on understanding how these skills work together to create robust ML systems.
  • A well-designed ML platform forms the foundation for confidently developing and deploying ML services. We'll use tools like Kubeflow Pipelines for automation, MLflow for model management, and Feast for feature management, learning how to integrate them effectively for production use.
  • We'll apply these concepts by building two different types of ML systems: an OCR system and a movie recommender. Through these projects, you'll gain hands-on experience with both image and tabular data, building confidence in handling diverse ML challenges.
  • Traditional MLOps principles extend naturally to Large Language Models through LLMOps - adding components for document processing, retrieval systems, and specialized monitoring. Understanding this evolution prepares you for the modern ML landscape.
  • The first step is to identify the problem the ML model is going to solve, followed by collecting and preparing the data to train and evaluate the model. Data versioning enables reproducibility, and model training is automated using a pipeline.
  • The ML life cycle serves as our guide throughout the book, helping us understand not just how to build models, but how to create reliable, production-ready ML systems that deliver real business value.

FAQ

What is the ML life cycle and why is it so iterative?
The ML life cycle spans problem formulation, data collection and preparation, data versioning, model training, evaluation, validation, deployment, monitoring, and retraining. Unlike traditional software, ML progresses through repeated experiments and feedback loops—poor evaluation may send you back to adjust data prep or model choices. Even after deployment, monitoring often triggers new training runs, making iteration the norm, not the exception.
How do the experimentation and dev/staging/production phases differ?
Experimentation emphasizes rapid iteration and learning: assembling datasets, training variants, and evaluating results via an orchestrated (semi-automated) pipeline. The dev/staging/production phase hardens this work with full automation, CI/CD triggers, deployment as a service (often a REST API), versioning, performance and drift monitoring, and clear rollback paths—while also accounting for ethics, security, scalability, and reliability.
How do I decide whether to use ML or a simpler heuristic?
Start by asking if ML is necessary at all. Simple heuristics can be accurate and cheaper to build and operate. Choose ML when data is complex or high-dimensional and patterns are non-linear or too intricate for rules. If you opt for ML, define a crisp problem statement, success criteria, and the business value (e.g., OCR extracting ID numbers; fraud detection minimizing false positives while catching bad transactions).
What are best practices for data collection, labeling, and splitting?
Identify reliable data sources early; use synthetic data if real data is scarce or sensitive. Invest in accurate labeling (e.g., bounding boxes and text for OCR), leveraging domain experts where needed. Organize labeled data into training, validation, and test sets to reduce leakage and fairly assess generalization. Plan for scale and quality checks because data work is often the bottleneck.
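The splitting advice above can be sketched in a few lines. This is a minimal illustration, not a recommended production utility: it assumes independent samples (time series or grouped data need different strategies), and the function name and fractions are ours.

```python
import random

def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, then carve off validation and test sets.

    Splitting before any preprocessing helps avoid leakage: statistics such as
    normalization constants should be fit on the training split only.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed keeps splits reproducible
    n = len(samples)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = samples[:n_test]
    val = samples[n_test:n_test + n_val]
    train = samples[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
```

The fixed seed matters for the reproducibility theme of this chapter: rerunning the pipeline yields byte-identical splits.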
Why is data versioning critical and how does it differ from code versioning?
Both data and code change model behavior. Data versioning ensures reproducibility—so a model trained today can be recreated tomorrow with the same inputs and parameters. Unlike code (well-served by Git), data comes in diverse formats and sizes, making versioning harder and tooling less standardized. Still, tracking dataset snapshots, schemas, and lineage is essential to trust results and support audits.
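One minimal form of snapshot tracking is content-addressing: hash the serialized records and store the digest with the training run. The sketch below assumes JSON-serializable records; dedicated tools (DVC, lakeFS, and similar) handle large files, schemas, and lineage far more thoroughly.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Return a stable SHA-256 digest over a dataset snapshot.

    Logging this fingerprint next to training parameters lets you later verify
    that a rerun consumed exactly the same data.
    """
    h = hashlib.sha256()
    for rec in records:
        # sort_keys ensures logically equal records hash identically
        h.update(json.dumps(rec, sort_keys=True).encode())
    return h.hexdigest()

v1 = dataset_fingerprint([{"id": 1, "label": "cat"}])
v2 = dataset_fingerprint([{"id": 1, "label": "dog"}])  # any change shifts the digest
```

Comparing digests across runs is a cheap audit: matching fingerprints mean matching inputs.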
What is pipeline orchestration and why automate early?
Orchestration connects life-cycle steps—prep, training, evaluation, validation, deployment—into a reproducible pipeline. Automating early reduces manual error, makes experiments repeatable, speeds iteration, and lays the groundwork for CI/CD in production. Tools like Kubeflow Pipelines represent each step as a component with defined inputs/outputs, enabling reliable, scalable workflows.
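The component idea can be illustrated without any orchestrator at all: named steps with defined inputs and outputs, chained in order. This is a toy stand-in, not the Kubeflow Pipelines API; real orchestrators run each component containerized and in isolation.

```python
def run_pipeline(steps, initial):
    """Run named steps in order, feeding each step's output to the next.

    Keeping a per-step log mirrors what orchestrators record: which component
    ran and what artifact it produced.
    """
    artifact, log = initial, []
    for name, fn in steps:
        artifact = fn(artifact)
        log.append((name, artifact))
    return artifact, log

# Hypothetical three-step pipeline: prep -> train -> evaluate
steps = [
    ("prep",     lambda data: [x / 10 for x in data]),   # toy scaling
    ("train",    lambda feats: {"weight": sum(feats)}),  # toy "model"
    ("evaluate", lambda model: model["weight"] > 0),     # pass/fail gate
]
ok, log = run_pipeline(steps, [1, 2, 3])
```

The evaluation gate at the end is the hook where business-oriented validation would block promotion of a weak model.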
What does production-grade model deployment involve?
Commonly, you expose inference via a REST API, containerize with Docker, and deploy to a platform like Kubernetes. Production-readiness requires load testing, autoscaling for spiky traffic, observability, strict versioning of model and service artifacts, and safe rollout/rollback strategies. CI/CD should build, push, and deploy images automatically on code or model changes.
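A safe rollout strategy is often a canary: a small slice of traffic goes to the candidate version while monitoring watches its error rate. The routing decision itself is tiny; the version tags and fraction below are illustrative, and in Kubernetes this is usually done by the ingress or service mesh rather than application code.

```python
import random

def choose_version(canary_fraction, rng=random.random):
    """Pick which model version serves a request during a canary rollout.

    If monitored metrics for the canary stay healthy, the fraction is raised
    gradually toward full promotion; if not, it drops back to zero (rollback).
    """
    return "model:v2-canary" if rng() < canary_fraction else "model:v1-stable"

serving = choose_version(0.10)  # roughly 10% of requests hit the canary
```

The key property is that rollback is a configuration change (set the fraction to zero), not a redeploy.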
What should I monitor once a model is in production?
Track system performance (e.g., latency, throughput, error rates) and ML-specific signals: input data drift, model drift, and critical business KPIs (e.g., churn, approvals). Monitoring closes the loop—degradation in these metrics can trigger alerts and automated retraining pipelines to restore performance and value.
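One widely used input-drift score is the Population Stability Index (PSI), which compares the binned distribution of live inputs against a training-time reference. The sketch below is a simplified single-feature version; the common rule of thumb treating PSI above 0.2 as significant drift is a heuristic, not a universal threshold.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample ('expected') and live data ('actual').

    Bins are fixed from the reference distribution; PSI sums
    (p_i - q_i) * ln(p_i / q_i) over per-bin proportions.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)  # clip above the range
            counts[max(i, 0)] += 1                    # clip below the range
        # small epsilon keeps log() defined for empty bins
        return [(c / len(values)) or 1e-6 for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

ref = [i / 100 for i in range(100)]
same = population_stability_index(ref, ref)  # identical data -> zero drift
```

In a monitoring loop, this score would be computed per feature on a sliding window and exported as a metric for alerting.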
When should I retrain a model and how do I automate it?
Retraining cadence depends on the use case. You can schedule it (e.g., monthly) or trigger it based on thresholds—such as drift scores, KPI drops, or label feedback. Automate end to end: artifact tracking for reproducibility, pipeline runs for training/validation, and CI/CD for safe, versioned deployments—so new models can promote from staging to production with minimal friction.
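Combining the scheduled and triggered policies above amounts to a small decision function. The thresholds here are illustrative defaults, not recommendations; in practice they are tuned from monitoring history.

```python
def should_retrain(drift_score, kpi_drop, days_since_training,
                   drift_threshold=0.2, kpi_threshold=0.05, max_age_days=30):
    """Decide whether to kick off an automated retraining pipeline.

    Returns a (decision, reason) pair so the trigger is auditable — the reason
    can be logged alongside the pipeline run it starts.
    """
    if days_since_training >= max_age_days:
        return True, "scheduled: model older than max age"
    if drift_score > drift_threshold:
        return True, "triggered: input drift above threshold"
    if kpi_drop > kpi_threshold:
        return True, "triggered: business KPI degraded"
    return False, "healthy: no retraining needed"
```

Wired to a scheduler or alerting hook, a True result would launch the same training pipeline used during experimentation, followed by validation gates before promotion.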
What core skills and platform components do I need, and should I build or buy?
Key skills: strong software engineering (debugging, performance), working knowledge of ML frameworks, data engineering for data quality and pipelines, and automation/DevOps (CI/CD, containers, Kubernetes). Core platform components include pipeline orchestration (e.g., Kubeflow Pipelines), a Feature Store (to share features and prevent training-serving skew), a Model Registry (to track runs and promote versions), deployment via containers/Kubernetes, and monitoring. There’s no one-size-fits-all platform—evaluate “build vs buy” based on constraints, but assembling one yourself at least once deepens understanding and lets you tailor to your needs; the same foundations extend naturally to LLMOps with additions like vector databases, guardrails, and cost controls.
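The training-serving-skew argument for a feature store can be made concrete with a toy in-memory version. This is not Feast's API — class and method names are ours — but it shows the core contract: one definition of each feature, reused by both the batch (training) and point-lookup (serving) paths so they see identical values.

```python
class MiniFeatureStore:
    """Toy in-memory feature store: one feature definition, two access paths."""

    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def ingest(self, entity_id, features):
        """Register computed feature values for an entity."""
        for name, value in features.items():
            self._features[(entity_id, name)] = value

    def get_online(self, entity_id, names):
        """Point lookup at serving time."""
        return {n: self._features[(entity_id, n)] for n in names}

    def get_training_frame(self, entity_ids, names):
        """Batch retrieval for assembling a training set."""
        return [self.get_online(e, names) for e in entity_ids]

store = MiniFeatureStore()
store.ingest("user_1", {"avg_rating": 4.2, "n_reviews": 17})
```

Because `get_training_frame` is built on the same lookup as `get_online`, a model trained from the batch path sees exactly the values it will receive at inference time — the skew a production store is designed to prevent, at scale and with point-in-time correctness.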
