Overview

11 Monitoring and explainability

Putting models into production is only the beginning; keeping them reliable demands comprehensive monitoring, alerting, and interpretability. This chapter outlines an end-to-end approach to observability for ML services, combining basic operational monitoring with ML-specific checks like data drift, and pairing them with explainability techniques to understand model behavior. Using an object detection service and a movie recommender as running examples, it frames monitoring in two parts—service health and data drift—and shows how explainability builds trust, aids debugging, and supports regulatory and stakeholder needs.

For operational monitoring, the chapter demonstrates how BentoML’s built-in metrics can be scraped by Prometheus and visualized in Grafana, providing insights into uptime, latency, throughput, resource usage, and error rates against SLAs. It shows how to add custom business-aware metrics (for example, a prediction confidence histogram and request counters) and why logs complement metrics for root-cause analysis. Logs are centralized with Loki to correlate events across services, unifying metrics and logs in one place. Alerting is then layered on using Prometheus alert rules (including up and absent checks) and Alertmanager to route notifications by severity to channels like email or PagerDuty, enabling timely, actionable incident response.

On the ML side, the chapter focuses on detecting and responding to data drift and making model decisions transparent. For computer vision, it uses Deepchecks to compare distributions of image properties (such as brightness and contrast) between training and production data, and recommends storing inference images and predictions to compute drift periodically. For recommendations, it tracks shifts in user and item latent factors and detects feature-level drift over time. Explainability techniques close the loop: EigenCAM heatmaps verify that the object detector attends to the right regions, while an explainable matrix factorization approach surfaces neighborhood-based rationales for recommendations. Together, these practices establish a robust feedback system—monitor, detect, explain, and act—to maintain performance, reliability, and trust in production ML systems.

The mental map, where we are now focusing on model monitoring (8)
Searching for BentoML in Prometheus service discovery
Verifying if BentoML metrics are being scraped by Prometheus
Importing BentoML dashboard in Grafana
BentoML basic monitoring dashboard
Custom metrics can be seen at the /metrics endpoint
Visualizing the confidence score custom metric in Grafana as a gauge
Visualizing the ranked movie counter custom metric in Grafana as a line chart
Using Loki as the log aggregation system in Grafana
Alerts generated by the Prometheus service are sent to Alertmanager, which routes them to various channels such as Slack, Email, or PagerDuty
When the alerts are green, they have not been triggered yet
When the alerts are yellow, the rule that triggers the alert is active and in a pending state
When the alerts are red, the alert has been triggered
An alert email that states the alert label and pre-defined description
Multiple alerts which have been triggered and routed to Gmail by Alertmanager
Before and after adjusting the brightness of the image. We have reduced the brightness of the original training image.
Difference in data distribution of brightness between the train and test datasets. We can see the distribution of the test dataset has more variance than that of the train dataset.
No difference in data distribution for aspect ratio and area
Differences in data distribution of item latent factors between training and test datasets
EigenCAM heatmap visualizing the region of the image that contributes most to the model’s decision making

Summary

  • Monitoring ML applications is crucial for maintaining service reliability and performance. Basic monitoring involves tracking resource utilization and request metrics, which can be visualized using pre-built dashboards like those provided by BentoML.
  • Custom metrics allow for tracking application-specific details, such as confidence scores in object detection or ranked movie counts in recommender systems. These custom metrics can be integrated into monitoring dashboards for better insights.
  • Logging provides valuable context and detailed information for debugging and troubleshooting. Centralizing logs using tools like Loki enhances observability and facilitates efficient log analysis.
  • Alerting is essential for proactive incident management. Setting up alert rules based on monitored metrics and logs, and using Alertmanager for routing notifications, ensures timely responses to critical issues.
  • Data drift monitoring is important for maintaining model accuracy. Deepchecks provides tools to detect drift in various data types, including images and embeddings. Regularly monitoring for drift helps prevent model performance degradation.
  • Model explainability is crucial for building trust and understanding AI decisions. Techniques like EigenCAM for object detection and model-based approaches for recommender systems provide insights into how models make predictions. Explainability enhances transparency and accountability in ML systems.

FAQ

What are the two main components of ML model monitoring covered in this chapter?
Basic monitoring and data drift monitoring. Basic monitoring tracks operational health and SLAs (uptime, latency, error rates, CPU/memory). Data drift monitoring checks whether incoming data distributions and their relationship to targets still match training-time distributions, helping detect when models may degrade and need retraining or feature updates.
How do I enable and verify Prometheus scraping for BentoML deployments?
BentoML services expose metrics at /metrics. Ensure Prometheus is scraping them via a PodMonitor for Yatai Bento deployments. Steps:
  1) Port-forward Prometheus: kubectl port-forward svc/prometheus-server -n prometheus 9090:80
  2) In the Prometheus UI, check Status → Service Discovery and confirm a PodMonitor for yatai/bento-deployment.
  3) If missing, apply: kubectl apply -f https://raw.githubusercontent.com/bentoml/yatai/main/scripts/monitoring/bentodeployment-podmonitor.yaml
  4) Verify metrics on the Graph tab by querying for bento.
How can I get a ready-made Grafana dashboard for basic monitoring of BentoML services?
Use the prebuilt BentoML dashboard. Steps:
  1) Download the dashboard JSON: curl -L https://raw.githubusercontent.com/bentoml/yatai/main/scripts/monitoring/bentodeployment-dashboard.json -o /tmp/bentodeployment-dashboard.json
  2) Port-forward Grafana: kubectl port-forward svc/grafana -n grafana 8001:80
  3) In Grafana (http://localhost:8001), go to Dashboards → New → Import, paste the JSON, and load.
The dashboard shows RPS, success rate per endpoint, in-flight requests, CPU/memory, and more.
How do I add a custom Prometheus metric (e.g., confidence score histogram) to a BentoML service and visualize p90?
  1) Install prometheus-client (pip install prometheus-client).
  2) Define a Histogram in code (e.g., confidence_score with buckets 0.1–1.0) and call observe() with the prediction confidence in your inference path.
  3) Serve and hit the endpoint; confirm the new metric appears at /metrics.
  4) In Grafana, visualize the 90th percentile with PromQL: histogram_quantile(0.9, sum(rate(confidence_score_bucket[5m])) by (le))
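To make the p90 query above less opaque, here is a minimal standard-library sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general idea behind histogram_quantile. The function name, the bucket counts, and the assumption that the lowest bucket starts at 0 are illustrative and not taken from the chapter; Prometheus itself works on per-second rates rather than raw counts, but the interpolation is the same.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes the lowest bucket starts at 0, which suits confidence scores.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for a confidence_score histogram.
confidence_buckets = [(0.5, 10), (0.7, 30), (0.9, 80), (1.0, 100), (math.inf, 100)]
p90 = estimate_quantile(0.9, confidence_buckets)
print(f"estimated p90 confidence: {p90:.2f}")
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the buckets are spaced around the values you care about.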
Which operational metrics should I track for API-based ML services?
Two categories:
  1) Resource utilization: CPU, memory, and container/pod scaling limits, to avoid saturation.
  2) Request tracking: response latency, throughput (RPS), success/error rates (non-2xx status codes), and in-progress requests.
These support SLA adherence and early anomaly detection.
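As a rough illustration of the request-tracking metrics listed above, the following sketch derives throughput, error rate, and a nearest-rank p95 latency from a batch of request records. The record layout, function name, and sample values are hypothetical; in the chapter's setup these numbers come from Prometheus rather than from hand-rolled code.

```python
def summarize_requests(requests, window_seconds):
    """Summarize (status_code, latency_seconds) records seen in one time window."""
    total = len(requests)
    if total == 0:
        return {"rps": 0.0, "error_rate": 0.0, "p95_latency": None}
    # Treat anything outside the 2xx range as an error.
    errors = sum(1 for status, _ in requests if not 200 <= status < 300)
    latencies = sorted(latency for _, latency in requests)
    # Nearest-rank percentile, computed with integer arithmetic.
    index = (95 * total + 99) // 100 - 1
    return {
        "rps": total / window_seconds,
        "error_rate": errors / total,
        "p95_latency": latencies[index],
    }


# Hypothetical 10-second window: 18 fast successes and 2 slow failures.
sample = [(200, 0.05)] * 18 + [(500, 0.80), (503, 1.20)]
summary = summarize_requests(sample, window_seconds=10)
print(summary)
```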
Why do I need logs in addition to metrics, and how can I centralize them with Loki?
Metrics show “what” and “how much,” while logs give the “why” via error messages and context. For BentoML, configure the standard Python logger for the bentoml logger with a formatter and DEBUG level to capture helpful events and errors. To centralize, install Loki (helm install loki grafana/loki --namespace loki --create-namespace), add Loki as a Grafana data source, and query logs alongside metrics for unified observability.
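The logger configuration described above can be sketched with the standard logging module alone. The format string, the helper name, and the in-memory stream are assumptions made for illustration; a real service would more likely log to stdout so that the cluster's log collector can ship lines to Loki.

```python
import io
import logging


def configure_bentoml_logging(stream=None, level=logging.DEBUG):
    """Attach a formatted handler to the 'bentoml' logger at the given level."""
    logger = logging.getLogger("bentoml")
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)  # stream=None falls back to stderr
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Capture output in memory so the example is easy to inspect.
buffer = io.StringIO()
logger = configure_bentoml_logging(stream=buffer)
logger.debug("received image of size %dx%d", 640, 480)
logger.error("inference failed: %s", "empty input")
log_output = buffer.getvalue()
print(log_output)
```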
How do I set up uptime alerts with Prometheus and route them with Alertmanager (e.g., to Gmail)?
  1) Write alert rules using up == 0 for targets that are down and absent() for missing targets, with for: 5m and severity labels.
  2) Add the rules under serverFiles in the Prometheus Helm values and upgrade Helm.
  3) Alert states: Inactive → Pending (condition met, waiting for the “for” duration) → Firing.
  4) Configure Alertmanager email routing with Gmail SMTP: create an app password (2FA required), set smtp_* in alertmanager.config, specify a receiver with email_configs, and optionally route by severity. Upgrade Helm, then verify that alerts fire and emails are received.
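The Inactive → Pending → Firing progression is easier to follow with a small simulation. The sketch below assumes a rule is evaluated at fixed timestamps and that an alert fires once its condition has held continuously for at least the "for" duration; the function name and sample timeline are hypothetical and only mimic the behavior described above.

```python
def alert_states(samples, for_seconds):
    """Map (timestamp_seconds, condition_true) evaluations to alert states."""
    states = []
    active_since = None  # when the condition first became true
    for timestamp, condition in samples:
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical timeline: the target is down (up == 0) from t=60s until t=420s,
# with the rule evaluated every 60 seconds and for: 5m (300 seconds).
samples = [(t, 60 <= t < 420) for t in range(0, 481, 60)]
states = alert_states(samples, for_seconds=300)
for (timestamp, _), state in zip(samples, states):
    print(f"t={timestamp:>3}s {state}")
```

The pending phase is what keeps a brief blip from paging anyone: a condition that clears before the "for" duration elapses drops straight back to inactive.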
How can I detect data drift for image-based models using Deepchecks?
Use Deepchecks Vision checks. Workflow:
  1) Assemble a training image set and create a modified “production-like” set (e.g., change brightness with PIL ImageEnhance).
  2) Wrap both as Deepchecks VisionData (e.g., via a custom torchvision dataset).
  3) Run ImagePropertyDrift().run(train, test) to compute drift scores for properties like brightness, contrast, and aspect ratio.
  4) Review the HTML report and scores; significant shifts (e.g., in brightness) indicate drift that calls for data/augmentation or retraining actions.
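To show what a property-drift comparison measures, here is a dependency-free sketch that darkens a few tiny synthetic images and compares the brightness distributions. It is a swapped-in illustration, not Deepchecks' implementation: the luma weights, the helper names, and the use of a 1-D Wasserstein distance on equal-sized samples are all assumptions chosen for simplicity.

```python
def mean_brightness(image):
    """Mean luma in [0, 1] for an image given as rows of (r, g, b) pixels."""
    pixels = [pixel for row in image for pixel in row]
    luma = [(0.299 * r + 0.587 * g + 0.114 * b) / 255 for r, g, b in pixels]
    return sum(luma) / len(luma)


def scale_brightness(image, factor):
    """Return a copy with every channel scaled by factor, clipped to 255."""
    return [
        [tuple(min(255, round(channel * factor)) for channel in pixel) for pixel in row]
        for row in image
    ]


def wasserstein_1d(a, b):
    """1-D Wasserstein distance between two equal-sized samples."""
    if len(a) != len(b):
        raise ValueError("this simplified version needs equal-sized samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Synthetic 1x2 gray images standing in for the training set.
train_images = [[[(g, g, g), (g, g, g)]] for g in (100, 120, 140, 160)]
# "Production-like" set: the same images at half brightness.
prod_images = [scale_brightness(image, 0.5) for image in train_images]

train_brightness = [mean_brightness(image) for image in train_images]
prod_brightness = [mean_brightness(image) for image in prod_images]
drift_score = wasserstein_1d(train_brightness, prod_brightness)
print(f"brightness drift score: {drift_score:.3f}")
```

A score near zero means the two brightness distributions overlap; how large a score should trigger action is a threshold you would tune for your own data.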
How do I detect drift in a recommender system’s behavior over time?
Track latent factor distributions. Steps:
  1) Create a drifted ratings dataset (e.g., increase ratings for selected items).
  2) Retrain the model on both baseline and drifted data.
  3) Extract user/item embeddings from each model.
  4) Convert embeddings to DataFrames and wrap them in a Deepchecks tabular Dataset.
  5) Use FeatureDrift to compare factor distributions per item/user; significant drift flags changes in item popularity or user preferences.
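The per-factor comparison in the last step can be pictured with a small sketch that computes a two-sample Kolmogorov-Smirnov statistic for each latent factor column. This is an illustrative stand-in written with the standard library, not the Deepchecks FeatureDrift check, and the embedding values and function names are made up for the example.

```python
import bisect


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect.bisect_right(a, x) / len(a) - bisect.bisect_right(b, x) / len(b))
        for x in points
    )


def factor_drift(baseline, drifted):
    """KS statistic per latent factor, given embeddings as lists of rows."""
    n_factors = len(baseline[0])
    return [
        ks_statistic([row[k] for row in baseline], [row[k] for row in drifted])
        for k in range(n_factors)
    ]


# Hypothetical item embeddings with two latent factors each.
baseline_items = [[0.1, 0.5], [0.2, 0.6], [0.3, 0.7], [0.4, 0.8]]
# After retraining on drifted ratings, only the second factor has shifted.
drifted_items = [[0.1, 1.5], [0.2, 1.6], [0.3, 1.7], [0.4, 1.8]]

scores = factor_drift(baseline_items, drifted_items)
print(scores)
```

A statistic of 0 means the two samples have identical empirical distributions for that factor, while 1 means they do not overlap at all.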
What explainability methods are demonstrated, and how do I use them?
Two approaches:
  1) Post hoc for CV with EigenCAM on YOLOv8 to generate heatmaps highlighting regions driving detections. Identify target layers, initialize EigenCAM (task='od'), run on images, and visualize overlays to ensure the model focuses on intended regions (e.g., the ID card).
  2) Model-based for recommendations with Explainable Matrix Factorization (EMF). Train EMF, generate recommendations, then compute explanations from neighborhoods of similar users/items; outputs like {rating: count_of_similar_users} help justify each recommendation to stakeholders.
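The {rating: count_of_similar_users} output mentioned above can be illustrated with a short sketch that tallies how a user's nearest neighbors rated a recommended item. The data, the user and item names, and the function name are hypothetical, and the neighbor list is given directly rather than computed; this shows only the shape of a neighborhood-based explanation, not the EMF training procedure.

```python
from collections import Counter


def explain_recommendation(item, neighbors, ratings):
    """Count how many similar users gave each rating to the recommended item."""
    counts = Counter(
        ratings[user][item]
        for user in neighbors
        if item in ratings.get(user, {})
    )
    return dict(sorted(counts.items()))


# Hypothetical ratings by users assumed to be similar to the target user.
ratings = {
    "u1": {"Heat": 5, "Alien": 4},
    "u2": {"Heat": 5},
    "u3": {"Heat": 4},
    "u4": {"Alien": 3},
}
neighbors = ["u1", "u2", "u3", "u4"]

explanation = explain_recommendation("Heat", neighbors, ratings)
print(explanation)  # {4: 1, 5: 2}
```

Read as a sentence, the result says that two similar users rated the item 5 and one rated it 4, which is the kind of rationale a stakeholder can check against their own intuition.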
