Machine Learning Platform Engineering

Build an internal developer platform for ML and AI systems

Benjamin Tan Wei Hao, Shanoop Padmanabhan, and Varun Mallya

February 2026
ISBN 9781633437333
504 pages

printed in black & white

available in Korean, Russian

table of contents

Part 1 Laying the MLOps foundation

1 Getting started with MLOps and ML engineering

1.1 The ML life cycle

1.1.1 Experimentation phase

1.1.2 Development/staging/production phase

1.2 Skills needed for MLOps

1.2.1 Required skills for ML engineers

1.2.2 Prerequisites

1.3 Building an ML platform

1.3.1 Build vs. buy

1.3.2 Looking ahead: From MLOps to LLMOps

1.3.3 Tools used in this book

1.4 Building ML systems

1.4.1 Introducing the ML projects

1.4.2 ML projects

2 What is MLOps?

2.1 The iterative MLOps life cycle

2.1.1 Data collection

2.1.2 Exploratory Data Analysis

2.1.3 Modeling and training

2.1.4 Model evaluation

2.1.5 Deployment

2.1.6 Monitoring

2.1.7 Maintenance, updates, and review

2.2 Why is robust MLOps important?

2.3 Role of MLOps in a mature organization

2.4 DevOps vs. MLOps

2.5 Levels of MLOps maturity

2.5.1 Level 0: Basic

2.5.2 Level 1: Intermediate

2.5.3 Level 2: Advanced

3 Building applications on Kubernetes

3.1 Containers and tooling

3.2 Docker

3.2.1 Write the application code

3.2.2 Write a Dockerfile

3.2.3 Building and pushing a Docker image

3.3 Kubernetes

3.3.1 Kubernetes architecture overview

3.3.2 Kubectl

3.3.3 Kubernetes objects

3.3.4 Networking and services

3.3.5 Other objects

3.3.6 Helm charts

3.3.7 Conclusion

3.4 Continuous integration and deployment

3.4.1 GitLab CI

3.4.2 Argo CD

3.5 Prometheus and Grafana

Part 2 Building core ML platform capabilities

4 Designing reliable ML systems

4.1 MLflow for experiment tracking

4.1.1 Data exploration

4.1.2 MLflow tracking

4.1.3 MLflow model registry

4.2 Feast as a feature store

4.2.1 Registering features

4.2.2 Retrieving features

4.2.3 Feature server

4.2.4 Using the Feast UI

5 Orchestrating ML pipelines

5.1 Kubeflow Pipelines: Task orchestrator

5.1.1 Kubeflow components

5.1.2 Income classifier pipeline

6 Productionizing ML models

6.1 BentoML as a deployment platform

6.1.1 Building a Bento

6.1.2 Building and pushing the Bento

6.1.3 Deploying a Bento

6.2 Evidently for data drift monitoring

6.2.1 Data drift detection report and dashboard

6.2.2 Data drift detection Kubeflow pipeline component

6.2.3 Data drift detection for a model deployed as an API

Part 3 Applying MLOps in practice

7 Data analysis and preparation

7.1 Data analysis

7.1.1 Launching a notebook server in Kubeflow

7.1.2 Workspace and data volumes

7.1.3 Configurations and affinity/tolerations

7.1.4 Customizing the menu

7.1.5 Creating a custom Kubeflow notebook image

7.2 Data passing

7.2.1 Scenario 1: Passing simple values to downstream components

7.2.2 Scenario 2: Passing paths for larger data

7.2.3 Overview of KFP v2 artifact types

7.3 Data preparation in action

7.3.1 Data preparation: Object detection

7.3.2 Data preparation: Movie recommender

8 Model training and validation: Part 1

8.1 Training an object detection model

8.1.1 Training YOLO on a custom dataset

8.1.2 Training the model

8.1.3 Container components for system dependencies

8.1.4 Creating the validation component

8.1.5 Creating the pipeline

8.1.6 Executing the pipeline

8.1.7 Validating model artifacts

9 Model training and validation: Part 2

9.1 Storing data with PersistentVolumeClaim

9.1.1 Refactoring the pipeline with a PVC

9.1.2 Efficient dataset management

9.1.3 Creating a VolumeOp

9.1.4 Download Op using PVC

9.1.5 Splitting the dataset directly

9.1.6 Simplifying model training

9.1.7 Simplifying model validation

9.2 Tracking training with TensorBoard

9.2.1 Launching a new TensorBoard

9.2.2 Exploring YOLOv8’s default graphs

9.3 Movie recommender project

9.3.1 Reading data from MinIO and quality assurance

9.3.2 Model training component

9.3.3 Metrics for evaluation

9.3.4 Experiment tracking with MLflow

9.3.5 Model registry with MLflow

9.3.6 Creating a pipeline from components

9.3.7 Local inference in a notebook

10 Model inference and serving

10.1 Model deployment is hard

10.2 BentoML: Simplifying model deployment

10.3 A whirlwind tour of BentoML

10.3.1 BentoML Service and Runners

10.4 Executing a BentoML Service locally

10.4.1 Loading a model with BentoML Runner

10.5 Building Bentos: Packaging your service for deployment

10.5.1 Bento tags: Versioning and managing your Bentos

10.6 BentoML and MLflow inference

10.7 Using only MLflow to create an inference service

10.8 KServe: An alternative to BentoML

11 Monitoring and explainability

11.1 Monitoring

11.1.1 Basic monitoring

11.1.2 Custom metrics

11.1.3 Logging

11.1.4 Alerting

11.2 Data drift detection

11.2.1 Object detection

11.2.2 Movie recommender

11.3 Explainability

11.3.1 Object detection

11.3.2 Movie recommendation

Part 4 Extending MLOps for large language models

12 Designing LLM-powered systems

12.1 LLMOps: New challenges, familiar principles

12.1.1 What makes LLM applications different

12.1.2 Extending our ML platform for LLMs

12.1.3 Essential tools for LLM applications

12.2 Building DataKrypt’s DakkaBot: A simple RAG architecture

12.2.1 What you’ll build

12.2.2 Beyond single API calls: Designing for composability

12.2.3 Google’s Gemini LLM and embeddings

12.2.4 The retrieval component

12.2.5 The augmentation component

12.2.6 The generation component

12.3 Giving DakkaBot a UI

12.4 Observability for LLM applications

12.4.1 Set up Langfuse via Docker

12.4.2 Integrating Langfuse with DakkaBot

12.4.3 Enhanced observability in DakkaBotCore

12.4.4 Beyond traditional metrics

13 Production LLM system design

13.1 Prompt engineering: Code for the generative AI era

13.1.1 Treating prompts as critical infrastructure

13.1.2 Langfuse prompt management for DakkaBot

13.1.3 Langfuse prompt management for production

13.2 Testing LLM applications

13.2.1 Evaluation framework for LLM responses

13.2.2 Safety and adversarial testing

13.3 Governance and safety in production

13.3.1 Implementing safety guardrails

13.4 Cost optimization strategies

13.4.1 Understanding LLM economics

13.4.2 Model selection strategy

13.4.3 Caching strategies

13.4.4 Prompt optimization for efficiency

13.4.5 Production cost monitoring

13.4.6 From traditional ML to LLMOps

Appendices

Appendix A: Installation and setup

A.1 Local installation of command-line tools (Mac and Linux)

A.1.1 The yq YAML processor

A.1.2 Kustomize

A.1.3 Kubectl

A.1.4 K8s distribution

A.1.5 K3s installation

A.1.6 MicroK8s installation

A.1.7 Argo CD

A.1.8 Kubeflow

A.1.9 Cloud provider K8s setup

A.1.10 MLflow setup

A.2 Deploy MLflow

A.2.1 Redis online store setup

A.2.2 BentoML and Yatai setup

A.2.3 Evidently UI setup

Appendix B: Basics of YAML

B.1 Basic YAML files

B.1.1 Comments

B.1.2 Scalar values

B.1.3 Lists

B.1.4 Nested structures (maps)

B.1.5 Quoted strings

B.1.6 Multiline strings

B.1.7 Data types in YAML

B.2 Aliases and anchors

B.2.1 References (merging and reusing data)

B.2.2 Complex data types

B.2.3 Custom data types

B.2.4 Block style vs. flow style

B.2.5 Key sorting and case sensitivity

B.2.6 Best practices

Overview

11 Monitoring and explainability

Putting models into production is only the beginning; keeping them reliable demands comprehensive monitoring, alerting, and interpretability. This chapter outlines an end-to-end approach to observability for ML services, combining basic operational monitoring with ML-specific checks like data drift, and pairing them with explainability techniques to understand model behavior. Using an object detection service and a movie recommender as running examples, it frames monitoring in two parts—service health and data drift—and shows how explainability builds trust, aids debugging, and supports regulatory and stakeholder needs.

For operational monitoring, the chapter demonstrates how BentoML’s built-in metrics can be scraped by Prometheus and visualized in Grafana, providing insights into uptime, latency, throughput, resource usage, and error rates against SLAs. It shows how to add custom business-aware metrics (for example, a prediction confidence histogram and request counters) and why logs complement metrics for root-cause analysis. Logs are centralized with Loki to correlate events across services, unifying metrics and logs in one place. Alerting is then layered on using Prometheus alert rules (including up and absent checks) and Alertmanager to route notifications by severity to channels like email or PagerDuty, enabling timely, actionable incident response.

On the ML side, the chapter focuses on detecting and responding to data drift and making model decisions transparent. For computer vision, it uses Deepchecks to compare distributions of image properties (such as brightness and contrast) between training and production data, and recommends storing inference images and predictions to compute drift periodically. For recommendations, it tracks shifts in user and item latent factors and detects feature-level drift over time. Explainability techniques close the loop: EigenCAM heatmaps verify that the object detector attends to the right regions, while an explainable matrix factorization approach surfaces neighborhood-based rationales for recommendations. Together, these practices establish a robust feedback system—monitor, detect, explain, and act—to maintain performance, reliability, and trust in production ML systems.

The mental map, where we are now focusing on model monitoring (8)

Searching for BentoML in Prometheus service discovery

Verifying if BentoML metrics are being scraped by Prometheus

Importing BentoML dashboard in Grafana

BentoML basic monitoring dashboard.

Custom metrics can be seen at the /metrics endpoint

Visualizing the confidence score custom metric in Grafana as a gauge

Visualizing the ranked movie counter custom metric in Grafana as a line chart

Using Loki as a log aggregation system in Grafana

Alerts generated by the Prometheus service are sent to Alertmanager, which routes them to various channels such as Slack, Email, or PagerDuty

When the alerts are green, they have not been triggered yet

When the alerts are yellow, the rule that triggers the alert is active and in a pending state

When the alerts are red, the alert has been triggered

An alert email that states the alert label and predefined description

Multiple alerts that have been triggered and routed to Gmail by Alertmanager

Before and after adjusting the brightness of the image. We have reduced the brightness of the original training image.

Difference in the data distribution of brightness between the train and test datasets. We can see that the distribution of the test dataset has more variance than that of the train dataset.

No difference in data distribution for aspect ratio and area

Differences in data distribution of item latent factors between training and test datasets

EigenCAM heatmap visualizing the region of the image that contributes most to the model’s decision making

Summary

Monitoring ML applications is crucial for maintaining service reliability and performance. Basic monitoring involves tracking resource utilization and request metrics, which can be visualized using pre-built dashboards like those provided by BentoML.
Custom metrics allow for tracking application-specific details, such as confidence scores in object detection or ranked movie counts in recommender systems. These custom metrics can be integrated into monitoring dashboards for better insights.
Logging provides valuable context and detailed information for debugging and troubleshooting. Centralizing logs using tools like Loki enhances observability and facilitates efficient log analysis.
Alerting is essential for proactive incident management. Setting up alert rules based on monitored metrics and logs, and using Alertmanager for routing notifications, ensures timely responses to critical issues.
Data drift monitoring is important for maintaining model accuracy. Deepchecks provides tools to detect drift in various data types, including images and embeddings. Regularly monitoring for drift helps prevent model performance degradation.
Model explainability is crucial for building trust and understanding AI decisions. Techniques like EigenCAM for object detection and model-based approaches for recommender systems provide insights into how models make predictions. Explainability enhances transparency and accountability in ML systems.

FAQ

What are the two main components of ML model monitoring covered in this chapter?

Basic monitoring and data drift monitoring. Basic monitoring tracks operational health and SLAs (uptime, latency, error rates, CPU/memory). Data drift monitoring checks whether incoming data distributions and their relationship to targets still match training-time distributions, helping detect when models may degrade and need retraining or feature updates.

How do I enable and verify Prometheus scraping for BentoML deployments?

BentoML services expose metrics at /metrics. Ensure Prometheus is scraping them via a PodMonitor for Yatai Bento deployments. Steps: 1) Port-forward Prometheus: kubectl port-forward svc/prometheus-server -n prometheus 9090:80. 2) In the Prometheus UI, check Status → Service Discovery and confirm a PodMonitor for yatai/bento-deployment. 3) If missing, apply: kubectl apply -f https://raw.githubusercontent.com/bentoml/yatai/main/scripts/monitoring/bentodeployment-podmonitor.yaml. 4) Verify metrics on the Graph tab by querying for bento.

How can I get a ready-made Grafana dashboard for basic monitoring of BentoML services?

Use the prebuilt BentoML dashboard. Steps: 1) Download dashboard JSON: curl -L https://raw.githubusercontent.com/bentoml/yatai/main/scripts/monitoring/bentodeployment-dashboard.json -o /tmp/bentodeployment-dashboard.json. 2) Port-forward Grafana: kubectl port-forward svc/grafana -n grafana 8001:80. 3) In Grafana (http://localhost:8001), go to Dashboards → New → Import, paste the JSON, and load. The dashboard shows RPS, success rate per endpoint, in-flight requests, CPU/memory, etc.

How do I add a custom Prometheus metric (e.g., confidence score histogram) to a BentoML service and visualize p90?

1) Install prometheus-client (pip install prometheus-client). 2) Define a Histogram in code (e.g., confidence_score with buckets 0.1–1.0) and call observe() with the prediction confidence in your inference path. 3) Serve and hit the endpoint; confirm the new metric appears at /metrics. 4) In Grafana, visualize the 90th percentile with PromQL: histogram_quantile(0.9, sum(rate(confidence_score_bucket[5m])) by (le)).
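
To make step 4 more concrete, the following is a minimal sketch of how a 90th percentile can be estimated from cumulative histogram buckets by linear interpolation, which is the idea behind PromQL's histogram_quantile. It uses only the Python standard library; the function name and the bucket counts are hypothetical, and in Prometheus the inputs would be per-bucket rates rather than raw counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total),
    the same shape as the `le` series of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Nothing to interpolate: fall back to the last finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for confidence_score_bucket{le="..."}.
confidence_buckets = [
    (0.5, 10),
    (0.7, 30),
    (0.9, 80),
    (1.0, 100),
    (math.inf, 100),
]
p90 = histogram_quantile(0.9, confidence_buckets)
print(round(p90, 3))
```

Because the estimate is interpolated inside a bucket, its precision depends on the bucket boundaries chosen when the histogram is defined, so narrower buckets around the confidence range of interest give a tighter p90.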

Which operational metrics should I track for API-based ML services?

Two categories: 1) Resource utilization: CPU, memory, container/pod scaling limits to avoid saturation. 2) Request tracking: response latency, throughput (RPS), success/error rates (non-200 status codes), in-progress requests. These support SLA adherence and early anomaly detection.

Why do I need logs in addition to metrics, and how can I centralize them with Loki?

Metrics show “what” and “how much,” while logs give the “why” via error messages and context. For BentoML, configure the bentoml logger through Python's standard logging module, with a formatter and the DEBUG level, to capture helpful events and errors. To centralize, install Loki (helm install loki grafana/loki --namespace loki --create-namespace), add Loki as a Grafana data source, and query logs alongside metrics for unified observability.
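
As a rough illustration of the logger configuration described above, the sketch below attaches a formatter to the bentoml logger and sets it to DEBUG using only the standard logging module. The format string, the in-memory stream, and the sample message are assumptions made for the example; in a real service the handler would typically write to stdout or stderr so the cluster's log collector can pick the lines up.

```python
import io
import logging

# Look up the logger by name; "bentoml" is the logger named in the text above.
bentoml_logger = logging.getLogger("bentoml")
bentoml_logger.setLevel(logging.DEBUG)

# In-memory sink used here only so the output is easy to inspect.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(
    # Illustrative format: timestamp, level, logger name, message.
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)
bentoml_logger.addHandler(handler)

# Hypothetical event emitted from an inference path.
bentoml_logger.debug("prediction served with confidence=%.2f", 0.87)
print(log_stream.getvalue().strip())
```

Keeping the level and logger name in every line makes it easier to filter the same events later in a log query.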

How do I set up uptime alerts with Prometheus and route them with Alertmanager (e.g., to Gmail)?

1) Write alert rules using up (equals 0) and absent() for missing targets, with for: 5m and severity labels. 2) Add rules under serverFiles in the Prometheus Helm values and upgrade Helm. 3) Alert states: Inactive → Pending (condition met, waiting for “for” duration) → Firing. 4) Configure Alertmanager email routing with Gmail SMTP; create an app password (2FA required) and set smtp_* in alertmanager.config, specify a receiver with email_configs, and optional routing by severity. Upgrade Helm; verify alerts and emails are received.
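
The alert states in step 3 can be sketched as a small state machine. This is a simplified model for illustration, not Prometheus code: the AlertRule class and the once-per-minute samples of up are hypothetical, and the rule is treated as firing once the condition has held for the whole for: duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Tiny model of an alert rule with a `for` duration (in seconds)."""

    for_seconds: float
    active_since: Optional[float] = None

    def evaluate(self, condition_met: bool, now: float) -> str:
        if not condition_met:
            # Condition cleared: reset the timer and go back to inactive.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


rule = AlertRule(for_seconds=300)  # corresponds to `for: 5m`
# Hypothetical samples of `up`, one per minute: 1 = target up, 0 = target down.
up_samples = [1, 0, 0, 0, 0, 0, 0, 1]
states = [rule.evaluate(up == 0, now=60 * i) for i, up in enumerate(up_samples)]
print(states)
```

The pending state is what keeps a brief scrape failure from paging anyone: the alert only fires if the target stays down for the full five minutes.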

How can I detect data drift for image-based models using Deepchecks?

Use Deepchecks Vision checks. Workflow: 1) Assemble a training image set and create a modified “production-like” set (e.g., change brightness with PIL ImageEnhance). 2) Wrap both as Deepchecks VisionData (e.g., via a custom torchvision dataset). 3) Run ImagePropertyDrift().run(train, test) to compute drift scores for properties like brightness, contrast, aspect ratio. 4) Review the HTML report and scores; significant shifts (e.g., brightness) indicate drift requiring data/augmentation or retraining actions.
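
To give a feel for what a drift score on an image property measures, here is a minimal standard-library sketch that compares two sets of brightness values with a two-sample Kolmogorov-Smirnov statistic. This is a stand-in chosen for simplicity and is not necessarily the measure Deepchecks reports; the brightness values are hypothetical, and the darkening factor mimics the brightness change in step 1.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    max_gap = 0.0
    for value in ref + cur:
        # Fraction of each sample that is <= value.
        ref_cdf = bisect_right(ref, value) / len(ref)
        cur_cdf = bisect_right(cur, value) / len(cur)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


# Hypothetical mean brightness per image, scaled to the range 0-1.
train_brightness = [0.52, 0.55, 0.58, 0.60, 0.61, 0.63, 0.65, 0.68]
# Simulated production images that are half as bright.
darkened = [b * 0.5 for b in train_brightness]

same_score = ks_statistic(train_brightness, train_brightness)
drift_score = ks_statistic(train_brightness, darkened)
print(same_score, drift_score)
```

A score near 0 means the two distributions overlap closely, while a score near 1 means they barely overlap. A property such as aspect ratio, which darkening does not change, would stay near 0.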

How do I detect drift in a recommender system’s behavior over time?

Track latent factor distributions. Steps: 1) Create a drifted ratings dataset (e.g., increase ratings for selected items). 2) Retrain the model on both baseline and drifted data. 3) Extract user/item embeddings from each model. 4) Convert embeddings to DataFrames and wrap in Deepchecks tabular Dataset. 5) Use FeatureDrift to compare factor distributions per item/user; significant drift flags changes in item popularity or user preferences.

What explainability methods are demonstrated, and how do I use them?

Two approaches: 1) Post hoc for CV with EigenCAM on YOLOv8 to generate heatmaps highlighting regions driving detections. Identify target layers, initialize EigenCAM (task='od'), run on images, and visualize overlays to ensure the model focuses on intended regions (e.g., the ID card). 2) Model-based for recommendations with Explainable Matrix Factorization (EMF). Train EMF, generate recommendations, then compute explanations from neighborhoods of similar users/items; outputs like {rating: count_of_similar_users} help justify each recommendation to stakeholders.
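
The neighborhood-based explanation in the second approach can be sketched as a simple count over similar users. This is an illustrative reconstruction of the {rating: count_of_similar_users} output shape, not the API of any EMF library; the function name, user IDs, item IDs, and ratings are all hypothetical.

```python
from collections import Counter


def explain_recommendation(item_id, similar_users, ratings):
    """Count how the similar users who rated `item_id` scored it."""
    neighbor_ratings = (
        ratings[user][item_id]
        for user in similar_users
        if item_id in ratings.get(user, {})
    )
    # Sort by rating so the explanation reads from lowest to highest score.
    return dict(sorted(Counter(neighbor_ratings).items()))


# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "u1": {"m10": 5, "m11": 3},
    "u2": {"m10": 4},
    "u3": {"m10": 5},
    "u4": {"m11": 2},
}
# Hypothetical neighborhood of the user receiving the recommendation.
similar_users = ["u1", "u2", "u3", "u4"]

explanation = explain_recommendation("m10", similar_users, ratings)
print(explanation)
```

An explanation like this can be read as "two similar users rated this movie 5 and one rated it 4," which is the kind of rationale that can be shown to stakeholders next to the recommendation.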
