Practical FinOps

Managing cloud cost, visibility, and accountability

Mohamed Labouardy

MEAP began June 2025
Last updated October 2025
Publication in April 2026 (estimated)

ISBN 9781633435964
375 pages (estimated)

Included with a Manning Online subscription

printed in black & white



Table of contents

1 Introduction to FinOps

1.1 Cloud Cost Struggles: Real Stories

1.1.1 A tech start-up backfires on backers

1.1.2 Expanding platform, expanding cloud costs

1.1.3 Outdated financial practices lead to financial headaches

1.2 The Cloud Revolution

1.2.1 Amazon S3: Unlimited Storage

1.2.2 Amazon EC2: On-Demand Computing

1.3 The Hidden Costs of Cloud

1.4 What’s FinOps?

1.5 FinOps vs. Cloud Cost Management

1.6 Embracing the FinOps Journey

1.6.1 The Cultural Shift Toward FinOps

1.6.2 Overcoming Common Challenges

1.6.3 When Not to Adopt FinOps

1.7 Summary

2 Exploring the FinOps Framework

2.1 Core Principles of FinOps

2.1.1 Collaboration Across Teams

2.1.2 Transparency

2.1.3 Optimization

2.1.4 Accountability

2.1.5 Agility

2.2 Key Personas in FinOps

2.3 The Lifecycle Phases of FinOps

2.3.1 Inform Phase

2.3.2 Optimize Phase

2.3.3 Operate Phase

2.4 Assessing Maturity in FinOps Practices

2.4.1 Pre-Crawl Stage: Awareness and Discovery

2.4.2 Crawl Stage: Early Adoption

2.4.3 Walk Stage: Intermediate Maturity

2.4.4 Run Stage: Full Maturity

2.5 Domains & Capabilities in Depth

2.5.1 Understand Cloud Usage and Cost

2.5.2 Quantify Business Value

2.5.3 Optimize Cloud Usage and Cost

2.5.4 Manage the FinOps Practice

2.6 Summary

3 Building a Cloud Asset Inventory

3.1 Cloud Resources Management

3.1.1 Building an Inventory Using a Bash Script

3.2 Managing Your AWS Inventory

3.2.1 AWS Config

3.2.2 AWS Resource Explorer

3.2.3 AWS Resource Groups

3.3 Building a Multi-Cloud Asset Inventory

3.3.1 CloudQuery

3.4 Summary

4 Tags Management

4.1 Why Tagging?

4.1.1 Tagging in Practice: An Enterprise Example

4.1.2 How Tagging Works Across Cloud Providers

4.1.3 Cloud Native Tools for Tagging Cloud Resources

4.1.4 Third-Party and Open-Source Tagging Tools

4.2 Tagging Best Practices

4.2.1 Develop a Tagging Strategy

4.2.2 Use a Standardized Tagging Convention

4.2.3 Implement Tag Enforcement and Auditing

4.3 Maintaining Effective Tag Hygiene

4.3.1 Regular Tag Audits

4.3.2 Automate Tagging Processes

4.3.3 Remove Outdated Resources

4.3.4 Keep Tagging Schemes Flexible

4.3.5 Use Tag-Based Access Control

4.3.6 Implement a Tag Validation Service

4.3.7 Implement Automated Tag Correction

4.3.8 Implement Tag Version Control

4.4 Summary

5 Mastering Cost Allocation Techniques

5.1 Exploring Cost Allocation Models

5.1.1 Direct Allocation

5.1.2 Shared Cost Allocation

5.1.3 Fixed Allocation

5.2 Leveraging AWS Cost Explorer for Cost Allocation

5.3 Advanced Cost Allocation with AWS Cost and Usage Report

5.3.1 Setting up AWS Cost and Usage Report

5.3.2 Understanding CUR Data Structure

5.3.3 Querying Cost and Usage Reports using Amazon Athena

5.3.4 Building FinOps Dashboards with AWS QuickSight

5.3.5 Exploring Cloud Intelligence Dashboards

5.4 FOCUS

5.5 Summary

6 Forecasting and Budgeting

6.1 Mastering Forecasting

6.1.1 Using AWS Cost Explorer for Forecasting

6.1.2 Using AWS CUR and SageMaker for Forecasting

6.2 Setting up and Managing Budgets

6.2.1 Create AWS Budgets

6.2.2 Estimating Cost for New Projects

6.3 Cost Anomaly Detection

6.4 Summary

7 Compute Optimization Strategies

7.1 AWS Compute Services and Pricing Models

7.1.1 AWS Compute Services

7.1.2 Pricing Models

7.2 Assessing Compute Usage

7.2.1 Understanding Current Usage and Costs

7.2.2 Identifying Wasted Compute Resources

7.3 Implementing Automation for Cost Optimization

7.3.1 Scheduling On/Off Times for Instances

7.3.2 Implementing Spot Instances for CI/CD Pipelines

7.3.3 Lambda Memory Tuning

7.3.4 Rightsizing & Autoscaling Instances

7.3.5 Optimizing Kubernetes Compute Costs

7.3.6 The KPI and Modernization Dashboard

7.3.7 Managing Multi-Cloud Compute Resources

7.4 Summary

8 Storage Optimization Strategies

8.1 Understanding Storage Costs

8.1.1 Overview of Storage Options

8.1.2 Storage Pricing Models

8.1.3 Web Application Storage Cost Breakdown

8.2 Assessing Storage Usage and Cost

8.2.1 Understanding Storage Cost Breakdown

8.2.2 CUDOS Dashboards for Storage

8.2.3 Storage Usage Monitoring

8.3 Best Practices for Storage Optimization

8.3.1 Regular Audits and Cleanup

8.3.2 Automating Lifecycle Management

8.3.3 Enforcing Tagging

8.3.4 Rightsizing Storage Volumes

8.3.5 Setting Budgets for Storage Costs

8.4 Summary

9 Network Optimization Strategies

9.1 Understanding Network Charges

9.1.1 Key Types of AWS Network Costs

9.2 Monitoring and Reporting AWS Network Costs

9.2.1 Getting Visibility with AWS CUR + Athena

9.2.2 Using CloudWatch to Track Network Metrics

9.2.3 Using CUDOS to Visualize Network Charges

9.3 Best Practices for Network Optimization

9.3.1 Use VPC Endpoints Instead of NAT Gateways

9.3.2 Compress Data Before Transfer

9.3.3 Enable Caching with CloudFront

9.3.4 Avoid Cross-AZ Traffic for Internal Services

9.3.5 Use VPC Flow Logs to Trace and Attribute Internal Traffic

9.3.6 Optimize Load Balancer Design

9.3.7 Tune DNS TTLs to Reduce Lookup Overhead

9.3.8 Use Direct Connect for High-Volume Hybrid Traffic

9.3.9 Use CloudWatch to Alert on Traffic Anomalies

9.4 Optimize Network Costs in Multi-Cloud Architectures

9.4.1 Route Through Peering or Direct Connections

9.4.2 Place Dependent Workloads in the Same Region

9.4.3 Compress and Batch Transfers

9.4.4 Use Object Storage as a Transfer Bridge

9.4.5 Track Cross-Cloud Traffic in CUR or Third-Party Tools

9.4.6 Forecast and Model Network Costs

9.5 Summary

10 Cloud Governance

10.1 What Is Cloud Governance?

10.1.1 Governance Components and Their FinOps Impact

10.1.2 Cloud Governance in Practice

10.2 Cloud Center of Excellence (CCoE)

10.2.1 Structure and Roles

10.2.2 KPIs and Measuring Success

10.2.3 Scaling Governance with Enablement

10.3 Accountability and Ownership in Cloud Governance

10.3.1 Role-Based Ownership

10.3.2 Showback and Chargeback Models

10.3.3 Key Accountability Metrics

10.3.4 Governance Roles and RACI

10.4 Bootstrapping Governance in a FinOps-Enabled Organization

10.5 Summary

11 AI-Powered FinOps

11.1 Leveraging LLMs in FinOps Workflows

11.1.1 What is a Large Language Model?

11.1.2 From Natural Language to AWS Config Queries

11.1.3 Using LLMs to Analyze AWS CUR Data

11.2 Building a CUR Chatbot with LLMs

11.3 Agentic FinOps

11.3.1 MCP Server for Cost Explorer

11.3.2 MCP Server For Cost Estimation

11.4 Risks, Limits, and Human-in-the-Loop FinOps

11.4.1 Hallucinations and Fabricated Data

11.4.2 Context Awareness and Operational Boundaries

11.4.3 Token Limits and Dataset Size

11.4.4 Data Privacy and Control

11.5 Summary

Overview

7 Compute Optimization Strategies

This chapter advances the FinOps Optimize phase by focusing on how to align compute supply with real demand to cut waste while protecting performance. Because compute typically dominates cloud spend, the guidance concentrates on practical, high‑impact levers in AWS (applicable across clouds): choosing the right service and pricing constructs, continuously assessing actual usage, and automating optimization so savings persist without manual effort.

It first grounds readers in the major AWS compute options—virtual machines (EC2), containers (ECS/EKS/Fargate), serverless (Lambda), and edge/hybrid (Outposts)—then maps workloads to pricing models such as On‑Demand, Spot, Reserved Instances, Savings Plans, and serverless pay‑for‑use. To expose savings, it emphasizes a disciplined assessment loop: maintain an inventory, use tags, surface idle/orphaned assets, and analyze spend and utilization with Cost Explorer and CUR/Athena (including views for Graviton adoption and Spot savings). Visualization with Cloud Intelligence Dashboards (e.g., CUDOS, CID, KPI) helps non‑technical stakeholders act. Waste‑finding tools—Trusted Advisor (low‑utilization EC2, idle LBs/EIPs, RI/SP guidance, Lambda tuning), Compute Optimizer (rightsizing), and CloudWatch (plus simple scripts)—provide concrete, prioritized recommendations.

The chapter then turns to automation and advanced optimization: schedule non‑prod resources off hours; shift fault‑tolerant CI/CD to Spot (self‑hosted GitHub Actions on EKS with Karpenter/ARC or Jenkins with EC2 Fleet); tune Lambda memory with power‑tuning workflows; and implement rightsizing and autoscaling with ASGs. For Kubernetes, it prescribes cluster and pod autoscaling, Karpenter for flexible provisioning, mixed On‑Demand/Spot node groups with taints/tolerations and termination handling, minimizing cross‑AZ/egress and log verbosity, right‑sizing requests/limits, adopting Graviton, enforcing quotas, cleaning idle objects, and pausing dev clusters. Finally, it outlines multi‑cloud practices—standardized tagging, centralized cost monitoring, IaC for consistency, cost‑aware workload placement, and provider‑native autoscaling—and closes with KPI tracking to institutionalize continuous, measurable compute cost optimization.

AWS compute services are split into servers, containers, serverless, and hybrid options

An example of running EC2 instances behind a dynamic autoscaling group

Running Docker-based containers on a Kubernetes cluster powered by EKS

Resizing uploaded images with a serverless workflow based on SQS, Lambda, and S3

Running EC2 instances within an on-premises infrastructure

Available pricing models on AWS

Using Resource Explorer to list active EC2 instances

Tracking EC2 cost by instance type

AWS Lambda cost and usage breakdown

EC2 compute unit cost and normalized hours by purchase option widget

Breakdown of Spot Savings by platform, instance type, region and AZ

EC2 instances usage time percentage and cost

Most expensive Fargate clusters

Most expensive Lambda functions

Trusted Advisor works by scanning AWS accounts looking for security, cost and compliance recommendations

Potential monthly savings identified by Trusted Advisor

Shows a list of active EC2 instances with low CPU usage, along with estimated monthly savings

Trusted Advisor highlights unattached Elastic IP addresses that can be removed to reduce recurring charges

Estimated savings with a one-year Reserved Instance term

Highlights Lambda functions with over-provisioned memory, along with estimated cost savings

How to filter cost recommendations using tags, providing team-specific insights

Compute Optimizer main dashboard with overview of monthly savings

EC2 instance details page with recommended options

Recommendations for current Lambda functions

CloudWatch CPU utilization metrics displayed for EC2 instances

Analyzing CPU utilization for a selected EC2 instance

CloudWatch dashboard tracking CPU and network metrics for EC2 instances

Billing summary tab

Compute summary tab

Savings Plans usage vs. unused cost for each compute resource

Potential Graviton monthly savings

Cost savings opportunities identified by Trusted Advisor

EC2 instances with low CPU and network utilization

Potential compute savings grouped by compute service

Daily cost anomalies with their impact vs expected spend

Workflow of the EC2 instance scheduler based on AWS Lambda and cron jobs

EC2InstanceScheduler function will be triggered by an EventBridge rule

Running self-hosted GitHub runners on EKS

Building a Jenkins cluster with workers based on Spot Instances

Invoking the step function

Step function execution steps

Optimal memory configuration and corresponding average cost per execution

Insufficient resources triggering an autoscaling event via Cluster Autoscaler

Visual representation of a deployment scaling from 2 pods to 10 pods during traffic spikes using HPA, with metrics driving the scaling decisions

Setting KPI goals to track your cost saving opportunities

Summary

Explored AWS compute options, including EC2, ECS, EKS, Lambda, and Outposts, categorizing them into servers, containers, serverless, and edge computing, and discussed their practical use cases.
Compared On-Demand, Spot Instances, Reserved Instances, Savings Plans, and Serverless pricing models, with examples of when to use each based on workload characteristics and cost considerations.
Used AWS Cost Explorer, CUR, and Amazon Athena queries to analyze compute spending, and explored the CUDOS dashboard for visibility into compute unit cost and utilization trends.
Leveraged AWS Trusted Advisor and Compute Optimizer to detect underutilized instances and inefficiencies, analyzed CloudWatch metrics to track low CPU utilization and network activity, and implemented an automated detection system for idle EC2 instances using a Go-based script.
Scheduled EC2 instance shutdown/startup using Amazon EventBridge and Lambda, integrated Spot Instances into CI/CD pipelines for GitHub Actions and Jenkins workers, and utilized KPI dashboards to track cost-saving initiatives and measure optimization impacts.
Implemented Cluster Autoscaler and Karpenter for automatic cluster scaling, configured Horizontal Pod Autoscaler (HPA) for dynamic pod scaling, leveraged Spot Instances in EKS for batch jobs, and reduced inter-AZ and inter-region data transfer costs by configuring AWS endpoints and optimizing pod placement.
Identified and cleaned up unused Persistent Volume Claims (PVCs), ConfigMaps, and Secrets, scheduled non-essential EKS clusters to shut down during off-peak hours, and implemented KEDA for event-driven autoscaling based on external AWS metrics like SQS queue depth.
Standardized tagging across AWS, GCP, and Azure for better cost tracking, centralized cost visibility with tools like CloudQuery and AWS Cost Explorer, used Terraform and OpenTofu for multi-cloud infrastructure consistency, and leveraged GCP Preemptible VMs, Azure Spot Virtual Machines, and AWS Lambda based on workload needs.
Used Kubecost, AWS CUR, and Athena queries to track per-namespace and per-service compute costs, and leveraged Amazon QuickSight dashboards and CUDOS widgets to visualize cost breakdowns by service and instance type.

FAQ

What AWS compute options does the chapter cover, and when should I use each?

• Amazon EC2 (servers): Full OS/control for web apps, databases, and legacy/enterprise software; scale via Auto Scaling.
• Containers: Amazon ECS (simple AWS-native orchestration) and Amazon EKS (managed Kubernetes for portability/hybrid).
• Serverless: AWS Lambda for event-driven, variable-demand workloads; pay only for execution.
• Edge/Hybrid: AWS Outposts for low-latency, data residency, and on-prem integration with AWS APIs.

How do AWS compute pricing models compare, and which workloads fit each?

• On-Demand: No commitment; great for short-term workloads, tests, and sudden spikes.
• Spot Instances: Up to ~90% discount with interruption risk; ideal for batch, analytics, ML training, CI/CD.
• Reserved Instances (RIs): Commit for a 1-year or 3-year term to specific instance families; suited to steady-state, predictable workloads.
• Savings Plans: Commit to an hourly spend with flexibility across instance types/regions; good for evolving fleets.
• Serverless: Pay per request and GB-second; well suited to intermittent or event-driven usage.
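The commitment trade-off above can be sketched as a break-even calculation. This is only an illustration: the hourly rates are hypothetical placeholders rather than AWS prices, and the function names are invented for this example. It assumes a commitment is billed for every hour of the term, while On-Demand is billed only for hours actually run.

```python
def breakeven_utilization(on_demand_hourly: float, committed_hourly: float) -> float:
    """Fraction of hours a workload must run for a commitment to beat On-Demand.

    Assumption: the committed rate is paid for every hour of the term,
    used or not, while On-Demand is paid only for hours actually run.
    """
    if on_demand_hourly <= 0 or committed_hourly < 0:
        raise ValueError("rates must be positive")
    return committed_hourly / on_demand_hourly


def cheaper_option(on_demand_hourly: float, committed_hourly: float,
                   expected_utilization: float) -> str:
    """Pick the cheaper model for an expected utilization between 0 and 1."""
    threshold = breakeven_utilization(on_demand_hourly, committed_hourly)
    return "commitment" if expected_utilization > threshold else "on-demand"


# Hypothetical rates: $0.10/hour On-Demand vs. $0.06/hour committed.
print(cheaper_option(0.10, 0.06, expected_utilization=0.9))
print(cheaper_option(0.10, 0.06, expected_utilization=0.3))
```

With these placeholder rates, the commitment only pays off when the workload runs more than roughly 60% of the time, which is why steady-state workloads are the usual fit.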

How do I baseline current compute usage and costs and quickly uncover waste?

Start with an inventory (across accounts/regions) to find untagged/orphaned assets. Use Cost Explorer (Group By service, instance type, tags) and CUR + Athena for deep dives (e.g., cost by instance type, purchase option mix, RI coverage, Lambda cost breakdown, Spot vs On-Demand savings, Graviton usage). Common waste: stopped EC2 with attached EBS, idle load balancers, unassociated Elastic IPs, and low-utilization EC2 (CPU ≤10% and network ≤5 MB over multiple days). Validate with CloudWatch (CPU/Network; add memory via CloudWatch Agent). CUDOS dashboards surface EC2 unit costs, Spot savings, Fargate top spenders, and top Lambda costs.
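The low-utilization rule mentioned above (CPU ≤10% and network ≤5 MB over multiple days) can be expressed as a small filter. This is a minimal sketch, not the chapter's script: the `DailyMetrics` type is hypothetical, the metrics are assumed to have been fetched already (for example from CloudWatch), and the `min_days=4` default is an assumption standing in for "multiple days".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DailyMetrics:
    avg_cpu_percent: float  # average CPU utilization for the day
    network_mb: float       # total network I/O for the day, in MB


def is_low_utilization(days: List[DailyMetrics],
                       cpu_threshold: float = 10.0,
                       network_threshold_mb: float = 5.0,
                       min_days: int = 4) -> bool:
    """Flag an instance when enough observed days fall at or below both thresholds.

    min_days is an illustrative default, not a value taken from the chapter.
    """
    quiet_days = sum(
        1 for d in days
        if d.avg_cpu_percent <= cpu_threshold and d.network_mb <= network_threshold_mb
    )
    return quiet_days >= min_days


# Hypothetical samples: one idle instance, one busy instance.
idle = [DailyMetrics(3.0, 1.2) for _ in range(5)]
busy = [DailyMetrics(45.0, 900.0) for _ in range(14)]
print(is_low_utilization(idle), is_low_utilization(busy))
```

A flagged instance is a candidate for review rather than automatic termination; memory pressure, for instance, is not visible in these two metrics.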

When should I use AWS Trusted Advisor vs. AWS Compute Optimizer?

Trusted Advisor (paid via Business/Enterprise Support) gives best-practice checks across cost, security, reliability, and performance, including idle/low-utilization EC2, idle ELB/RDS, unassociated EIPs, RI purchase recommendations, and Lambda over-provisioning. Compute Optimizer (no extra service fee beyond CloudWatch) analyzes 14 days of metrics, recommends rightsizing for EC2 and EBS, suggests Lambda memory settings, and flags idle resources. Both support filtering by tags for team/cost-center targeting; Trusted Advisor is broader, while Compute Optimizer goes deeper on resource sizing.

How can I automate on/off schedules for EC2 to save in dev/test?

Use EventBridge cron rules to trigger a Lambda function that starts and stops instances by tag (for example, Environment=Development) around business hours. This typically saves 60–70% of instance runtime cost for non-24/7 environments and enforces consistent, low-effort hygiene aligned with FinOps governance.
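The 60–70% figure follows from simple arithmetic on running hours. The sketch below assumes a weekday schedule of 08:00 to 20:00; the hours and the two schedule expressions are illustrative assumptions, not values from the chapter. It counts instance hours only, since attached EBS volumes keep accruing charges while an instance is stopped.

```python
HOURS_PER_WEEK = 24 * 7

# Illustrative EventBridge schedule expressions (evaluated in UTC); the
# specific hours are assumptions chosen for this example.
START_RULE = "cron(0 8 ? * MON-FRI *)"   # start at 08:00, Monday to Friday
STOP_RULE = "cron(0 20 ? * MON-FRI *)"   # stop at 20:00, Monday to Friday


def schedule_savings(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on instance hours avoided by running only on a schedule."""
    if not (0 <= hours_per_day <= 24 and 0 <= days_per_week <= 7):
        raise ValueError("schedule is out of range")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 12 hours a day, 5 days a week: 60 of 168 hours running.
print(f"{schedule_savings(12, 5):.0%}")
```

Shortening the window to 10 hours a day raises the avoided share to about 70%, which is where the upper end of the range comes from.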

What is a safe way to use Spot Instances for CI/CD pipelines?

Use self-hosted runners on Spot-backed capacity:
• GitHub Actions on EKS with Karpenter (for fast, cost-aware node provisioning) and Actions Runner Controller (ARC) for autoscaled runners.
• Jenkins with the EC2 Fleet plugin targeting a Spot ASG.
Build and test jobs generally tolerate interruption, but design them with retries and checkpointing. Expect large savings (often 70–90%) while maintaining throughput.
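The retry-and-checkpoint idea can be sketched in a few lines. Everything here is hypothetical: `SpotInterruption`, `run_with_retries`, and the step names are invented for illustration, and the checkpoint lives in memory, whereas a real pipeline would need to persist it outside the runner (for example in a cache or artifact store) to survive a reclaimed node.

```python
class SpotInterruption(Exception):
    """Raised by a (simulated) step when its Spot node is reclaimed."""


def run_with_retries(steps, run_step, max_attempts=3):
    """Run steps in order; after an interruption, resume from the first unfinished step.

    Returns the number of attempts that were needed.
    """
    completed = 0  # checkpoint: how many steps have finished so far
    for attempt in range(1, max_attempts + 1):
        try:
            for step in steps[completed:]:
                run_step(step)
                completed += 1
            return attempt
        except SpotInterruption:
            continue  # retry from the checkpoint on fresh capacity
    raise RuntimeError(f"pipeline still failing after {max_attempts} attempts")


# Simulate a pipeline whose "test" step is interrupted once.
interrupted = {"test": False}
finished = []


def fake_step(step):
    if step == "test" and not interrupted["test"]:
        interrupted["test"] = True
        raise SpotInterruption()
    finished.append(step)


attempts = run_with_retries(["checkout", "build", "test", "package"], fake_step)
print(attempts, finished)
```

The point of the checkpoint is that the second attempt does not repeat the checkout and build steps, so an interruption costs only the work of the step that was cut short.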

What are practical steps to rightsize and autoscale EC2?

• Rightsize with Compute Optimizer or CloudWatch (CPU/memory/network trends) and switch to smaller or newer-generation/Graviton types where viable.
• Apply changes via console/CLI/IaC and validate via blue/green or ALB-weighted shifts.
• Autoscale with ASGs: define scale-out/in thresholds (e.g., CPU >70% add capacity; <30% remove), tie to ALB target groups, and consider Predictive Scaling for known patterns.
Combine rightsizing for baseline with autoscaling for peaks.
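The threshold example above (add capacity above 70% CPU, remove below 30%) can be modeled as a simple step decision. This is a simplified sketch of the idea, not how Auto Scaling evaluates policies internally; the group bounds and the step of one instance are assumptions.

```python
def desired_capacity(current: int, avg_cpu: float,
                     min_size: int = 1, max_size: int = 10,
                     scale_out_above: float = 70.0,
                     scale_in_below: float = 30.0,
                     step: int = 1) -> int:
    """Return the next capacity for a group given average CPU utilization."""
    if avg_cpu > scale_out_above:
        target = current + step      # scale out under load
    elif avg_cpu < scale_in_below:
        target = current - step      # scale in when mostly idle
    else:
        target = current             # inside the band: leave capacity alone
    # Never leave the configured group bounds.
    return max(min_size, min(max_size, target))


for cpu in (85.0, 50.0, 10.0):
    print(cpu, desired_capacity(current=2, avg_cpu=cpu))
```

The gap between the two thresholds matters: if they sit too close together, the group can oscillate between adding and removing instances.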

How can I reduce Kubernetes (EKS) compute costs without hurting reliability?

• Infrastructure scaling: Cluster Autoscaler or Karpenter to add/remove nodes dynamically.
• Workload scaling: HPA (replicas) and VPA (requests/limits) for pods.
• Spot nodes: Mix On-Demand (critical) and Spot (fault-tolerant) with taints/tolerations and the node termination handler to drain before reclaim.
• Right-size pods/limits (use tools like Goldilocks, kube-resource-report) and enforce namespace quotas.
• Cut data transfer/logging costs: prefer in-AZ traffic, VPC endpoints/PrivateLink, right-size log verbosity.
• Consider Graviton and mixed-arch clusters (arm64/amd64) for better price-performance.
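For the HPA item, the replica calculation documented by Kubernetes is a ratio of the observed metric to its target, rounded up. The sketch below reproduces that formula with the default 10% tolerance; the replica bounds and metric values are illustrative assumptions, and real HPA behavior also involves stabilization windows and pod readiness that are omitted here.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int = 1,
                         max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """desired = ceil(current_replicas * current_metric / target_metric)."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Two pods at 250% of requested CPU against a 50% target.
print(hpa_desired_replicas(2, current_metric=250.0, target_metric=50.0))
```

Because the ratio is computed against resource requests, oversized requests make pods look idle and suppress scaling, which is one reason right-sizing requests and autoscaling go together.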

How do I measure progress and communicate optimization outcomes?

Use Cloud Intelligence Dashboards (CUDOS/CID/KPI):
• EC2 unit costs and normalized hours by purchase option.
• Spot savings details by platform/instance/region/AZ.
• Savings Plans/RI trackers (usage, unused cost, expirations).
• Top Fargate/Lambda spenders.
• KPI dashboard to set goals (e.g., on-demand EC2 ≤10% spend) and track green/red status over time.
Share dashboards widely to drive accountability across engineering and finance.
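A KPI such as "on-demand EC2 ≤10% of spend" reduces to a share calculation and a threshold check. The sketch below is an assumption about how such a check could be computed by hand, not how the KPI dashboard implements it; the purchase-option keys and dollar amounts are hypothetical.

```python
def on_demand_share(spend_by_purchase_option: dict) -> float:
    """Share of total EC2 spend that was billed at On-Demand rates."""
    total = sum(spend_by_purchase_option.values())
    if total == 0:
        return 0.0
    return spend_by_purchase_option.get("OnDemand", 0.0) / total


def kpi_status(share: float, goal: float = 0.10) -> str:
    """Green when the share is at or under the goal, red otherwise."""
    return "green" if share <= goal else "red"


# Hypothetical monthly EC2 spend in USD, grouped by purchase option.
spend = {"OnDemand": 800.0, "SavingsPlan": 6000.0, "Spot": 1200.0, "Reserved": 2000.0}
share = on_demand_share(spend)
print(f"{share:.0%} {kpi_status(share)}")
```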

What strategies help optimize compute in multi-cloud environments?

• Standardize tagging (Environment, Project, Owner, CostCenter) across providers for clean allocation.
• Centralize cost and usage: export each provider's billing data (such as the AWS CUR) to a common data lake and query it with Athena/BigQuery.
• Place workloads where they’re most cost-effective: e.g., preemptible/Spot for fault-tolerant batch, serverless for event-driven.
• Use Terraform/OpenTofu to enforce consistent sizing, autoscaling, and pricing choices.
• Leverage each cloud’s autoscaler (ASG, VMSS, GCE Autoscaler) tuned to regional demand and data-transfer realities.
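Standardized tagging is easier to enforce when the required keys are checked in code. The sketch below uses the four keys named above; the normalization rule (ignore case and separators so that a lowercase label such as `cost_center` maps to `CostCenter`) is an assumption made for this example, and the function names are hypothetical.

```python
REQUIRED_TAGS = ("Environment", "Project", "Owner", "CostCenter")

# Lookup from a lowercase, separator-free key to its canonical spelling.
_CANONICAL = {name.lower(): name for name in REQUIRED_TAGS}


def normalize_tags(raw: dict) -> dict:
    """Map provider-specific key styles onto the canonical tag names."""
    normalized = {}
    for key, value in raw.items():
        lookup = "".join(ch for ch in key.lower() if ch.isalnum())
        normalized[_CANONICAL.get(lookup, key)] = value
    return normalized


def missing_tags(resource_tags: dict) -> list:
    """Return the required tags that are absent or blank on a resource."""
    return [k for k in REQUIRED_TAGS if not resource_tags.get(k, "").strip()]


# Hypothetical labels as they might appear on a resource with lowercase keys.
labels = {"environment": "prod", "cost_center": "cc-42", "team": "data"}
tags = normalize_tags(labels)
print(tags, missing_tags(tags))
```

Running a check like this against the asset inventory gives a per-provider list of resources that cannot yet be allocated to an owner or cost center.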
