Overview

1 Getting started with MLOps and ML engineering

This chapter introduces the practice of taking machine learning beyond notebooks into reliable, scalable production systems. It frames MLOps as the discipline that closes the gap between experimentation and real-world operation, emphasizing that most failures stem not from model quality but from system reliability, data complexities, and process gaps. The authors adopt a hands-on path that builds confidence through practical patterns, automation, and reproducibility, guiding readers from core concepts to a working platform and concrete projects. The goal is to help practitioners of varied backgrounds—data scientists, software engineers, and ML engineers—navigate the full journey with clarity and momentum.

The ML life cycle is presented as iterative and orchestration-driven: start by validating that ML is the right tool, then progress through problem formulation, data collection and preparation, data versioning, training, evaluation, and business-facing validation. These steps are assembled into automated pipelines for consistency and speed. The transition to dev/staging/production introduces end-to-end automation with CI triggers, robust deployment practices (containerization, scaling, rollback strategies), and comprehensive monitoring across system performance, data/model drift, and business impact. Retraining is treated as a recurring, automated process, activated by schedules or performance thresholds to maintain model quality over time.
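The stage-by-stage flow above can be sketched as plain functions chained into a single pipeline run. This is a toy sketch only: the function names, the stand-in "model," and the validation gate are illustrative, not a Kubeflow API.

```python
def collect_data():
    # Stand-in for data collection and preparation: in practice this pulls
    # from a warehouse, labels records, and splits into train/test.
    return {"train": [1, 2, 3], "test": [4]}

def train(data):
    # Stand-in training job; the "model" here is just the mean of the
    # training samples.
    samples = data["train"]
    return sum(samples) / len(samples)

def evaluate(model, data):
    # Stand-in metric: mean absolute error against the held-out test split.
    return sum(abs(model - x) for x in data["test"]) / len(data["test"])

def run_pipeline(threshold=5.0):
    # Chain the stages; the final check mimics a business-facing
    # validation gate before any deployment step.
    data = collect_data()
    model = train(data)
    error = evaluate(model, data)
    return {"model": model, "error": error, "approved": error < threshold}
```

In a real setup each function becomes a pipeline component with versioned inputs and outputs, and an orchestrator (such as Kubeflow Pipelines) runs them in order and records every artifact.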

To execute this effectively, the chapter outlines the essential skill set: strong software engineering fundamentals, practical ML proficiency with common frameworks, data engineering literacy, and a bias toward automation—plus just enough Kubernetes to be productive. Readers are guided to incrementally build a modern ML platform centered on Kubeflow and Kubeflow Pipelines, then extend it with capabilities such as a feature store (to reduce training-serving skew), a model registry (for lineage and promotion), and CI/CD-driven model deployment. Tool choices are pragmatic and adaptable (build vs buy is context-dependent), and the lessons generalize to LLMOps. The chapter culminates with a roadmap of three projects—an OCR system, a tabular movie recommender, and a RAG-based documentation assistant—that anchor the concepts in realistic architectures and workflows.

The experimentation phase of the ML life cycle
The dev/staging/production phase of the ML life cycle
MLOps is a mix of different skill sets
The mental map of an ML setup, detailing the project flow from planning to deployment and the tools typically involved in the process
Traditional MLOps (right) extended with LLMOps components (left) for production LLM systems. Chapters 12-13 explore these extensions in detail.
An automated pipeline being executed in Kubeflow
Feature Stores take in transformed data (features) as input and have facilities to store, catalog, and serve features
The model registry captures metadata, parameters, artifacts, and the ML model and in turn exposes a model endpoint.
Model deployment consists of the container registry, CI/CD, and automation working in concert to deploy ML services.

Summary

  • The Machine Learning (ML) life cycle provides a framework for confidently taking ML projects from idea to production. While iterative in nature, understanding each phase helps you navigate the complexities of ML development.
  • Building reliable ML systems requires a combination of skills spanning software engineering, MLOps, and data science. Rather than trying to master everything at once, focus on understanding how these skills work together to create robust ML systems.
  • A well-designed ML platform forms the foundation for confidently developing and deploying ML services. We'll use tools like Kubeflow Pipelines for automation, MLflow for model management, and Feast for feature management - learning how to integrate them effectively for production use.
  • We'll apply these concepts by building two different types of ML systems: an OCR system and a movie recommender. Through these projects, you'll gain hands-on experience with both image and tabular data, building confidence in handling diverse ML challenges.
  • Traditional MLOps principles extend naturally to Large Language Models through LLMOps - adding components for document processing, retrieval systems, and specialized monitoring. Understanding this evolution prepares you for the modern ML landscape.
  • The first step is to identify the problem the ML model is going to solve, followed by collecting and preparing the data to train and evaluate the model. Data versioning enables reproducibility, and model training is automated using a pipeline.
  • The ML life cycle serves as our guide throughout the book, helping us understand not just how to build models, but how to create reliable, production-ready ML systems that deliver real business value.

FAQ

What does the ML life cycle look like in practice?
The chapter frames the life cycle as a repeatable set of stages: problem formulation; data collection and preparation (including labeling and splitting into train/validation/test); data versioning; model training; model evaluation; and model validation. In the dev/staging/production phase, the pipeline is fully automated, adding model deployment, monitoring, and retraining. The process is inherently iterative with frequent loops back to earlier steps when issues are found.
How does the experimentation phase differ from the dev/staging/production phase?
Experimentation focuses on rapid, iterative learning with partially automated, orchestrated pipelines. Moving to dev/staging/production shifts the emphasis to reliability, security, scalability, and full automation. CI or programmatic triggers run the entire pipeline end to end, culminating in deployment and ongoing monitoring. Roles, resources, and governance typically become more formal at this stage.
How do I decide whether to use ML or simple heuristics?
Start by asking if ML is needed at all. Prefer simple heuristics when they solve the problem adequately. Choose ML when the data is complex or high-dimensional, relationships are non-linear, or heuristics break down. Make the decision collaboratively (product, business, and engineering), define the problem precisely, and agree on success criteria.
What are best practices for data collection, labeling, and splitting?
Identify reliable data sources early and consider synthetic data if real data is scarce. Label carefully with clear guidelines and domain expertise (e.g., bounding boxes and text transcription for OCR; expert review for fraud). Organize labeled data into training, validation, and test sets to support unbiased evaluation and iteration.
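A reproducible way to produce those three splits is a seeded shuffle, so the same records always land in the same split. A minimal sketch; the fractions and the helper name are illustrative choices, not prescribed by the chapter.

```python
import random

def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    # Deterministic shuffle: the same seed yields the same splits on
    # every run, which keeps evaluation comparable across experiments.
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }
```

For grouped or time-ordered data you would split by group or by time instead of shuffling, to avoid leaking future or related records into training.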
Why is data versioning critical and how is it different from code versioning?
Both code and data changes can alter model behavior. Versioning data alongside code ensures reproducibility—being able to recreate results from the same inputs. Unlike code, data varies in form and scale and lacks a universal “Git,” so teams must capture dataset snapshots, schemas, labels, and lineage tied to model runs, parameters, and artifacts.
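One lightweight way to capture such a snapshot is to fingerprint the data's content and record that ID alongside the run. This is a sketch of the idea only; dedicated tools (DVC, lakeFS, and the like) handle storage, schemas, and lineage far more completely, and the record shape below is hypothetical.

```python
import hashlib
import json

def dataset_fingerprint(records):
    # Hash a canonical serialization so identical data always yields the
    # same version ID, regardless of key order.
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def log_run(records, params):
    # Tie the data snapshot ID to the run's parameters so results can be
    # recreated from the same inputs (hypothetical metadata record).
    return {"data_version": dataset_fingerprint(records), "params": params}
```

Because the fingerprint changes whenever the data changes, a model run's metadata now says exactly which dataset produced it.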
What’s the difference between model evaluation and model validation?
Evaluation measures model performance on held-out data using metrics such as precision, recall, or AUC to estimate generalization. Validation confirms the model behaves as expected for the business, often involving stakeholders outside the modeling team to perform sanity checks and ensure it meets real-world requirements.
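To make the evaluation side concrete, precision and recall reduce to counting prediction outcomes for the positive class. A from-scratch sketch (in practice you would use a library such as scikit-learn):

```python
def precision_recall(y_true, y_pred):
    # Outcome counts for the positive class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Validation then asks a different question entirely: whether those numbers, and the model's behavior on concrete cases, actually satisfy the stakeholders who will rely on it.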
What does a safe, production-ready deployment involve?
Expose the model via a service (often a REST API), then containerize (e.g., Docker) and deploy to a scalable platform (e.g., Kubernetes on a cloud provider). Perform load testing, enable autoscaling for spiky traffic, version every deployment, and prepare rollback strategies. Involving data scientists often improves reliability and iteration speed.
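The rollback strategy can be expressed as a simple gate over the new version's health signals. A minimal sketch, assuming an error-rate and latency check against illustrative thresholds (the numbers and the function name are not from the chapter):

```python
def deployment_decision(error_rate, latency_p95_ms,
                        max_error=0.01, max_latency_ms=300):
    # Canary-style gate: if the newly deployed version breaches either
    # SLO, roll back to the previous version; otherwise promote it.
    if error_rate > max_error or latency_p95_ms > max_latency_ms:
        return "rollback"
    return "promote"
```

In practice this check runs automatically against metrics collected during a canary or staged rollout, and the platform (e.g., Kubernetes) performs the actual traffic switch.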
What should be monitored in production, and when should I retrain?
Track system health (e.g., requests per second, latency, error rates), ML-specific signals (data drift, model drift), and business KPIs (e.g., churn, conversion). Retrain on a schedule or when thresholds are breached (e.g., metric degradation). Automate as much as possible so new models can be trained, validated, and redeployed with minimal manual steps.
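A threshold-based retraining trigger can be as small as a function over the recent metric history. A sketch with an illustrative consecutive-window rule to avoid firing on a single noisy dip (the threshold and window are assumptions, not values from the chapter):

```python
def should_retrain(metric_history, threshold=0.85, window=3):
    # Trigger retraining only when the monitored metric stays below the
    # threshold for `window` consecutive evaluations, so one noisy
    # measurement does not kick off a full pipeline run.
    if len(metric_history) < window:
        return False
    return all(m < threshold for m in metric_history[-window:])
```

Wired into a scheduler, this is the "performance threshold" trigger: when it returns true, the automated pipeline retrains, re-validates, and redeploys the model.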
What is a Feature Store and how does it help?
A Feature Store centralizes curated features with three core parts: storage, a registry/catalog, and a serving layer (online and offline). It enables feature sharing and reuse across teams, supports both training and inference, and reduces training–serving skew by ensuring consistent transformations in both contexts.
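The skew-reduction idea is that one registered transformation feeds both the offline (training) and online (serving) paths. A toy stand-in below illustrates the three parts; it is not Feast's actual API, and all names are hypothetical.

```python
def compute_features(raw):
    # Single source of truth for the feature logic, used by BOTH the
    # training path and the serving path, so the two cannot diverge.
    return {
        "amount_scaled": raw["amount"] / 100.0,
        "is_weekend": raw["day"] in ("sat", "sun"),
    }

class TinyFeatureStore:
    # Toy stand-in for a feature store's storage + catalog + serving.
    def __init__(self):
        self.catalog = {}   # feature-view name -> transformation
        self.online = {}    # entity id -> latest feature values

    def register(self, name, transform):
        # Catalog: record the transformation under a feature-view name.
        self.catalog[name] = transform

    def materialize(self, name, entity_id, raw):
        # Offline path: compute features for training and also push the
        # same values to the online store for low-latency serving.
        features = self.catalog[name](raw)
        self.online[entity_id] = features
        return features

    def get_online(self, entity_id):
        # Online path: serve exactly what the registered logic produced.
        return self.online[entity_id]
```

Because training rows and serving lookups both come from `compute_features`, the transformation applied at inference time is identical to the one the model was trained on.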
Should I build or buy an ML platform, and why does this book use Kubeflow?
There’s no one-size-fits-all. Managed platforms (e.g., SageMaker, Vertex AI) can be great, but building at least once teaches how components fit together and where to customize. The book incrementally builds on Kubeflow (Jupyter, Pipelines) and augments it with tools like Feature Stores and Model Registries—an approach that builds deep understanding and adapts to evolving needs.
