1 Facing the Efficiency Wall

The chapter explains why efficiency has become a first-class concern in modern ML inference. As model sizes and context lengths have grown, the limiting factors have shifted from arithmetic throughput to memory movement and power. Traditional remedies—better hardware, architectural tweaks, or batching—no longer keep latency low or utilization high. The core message is that bytes, not FLOPs, dominate the cost of serving large transformers, and that quantization—reducing weights and activations from floating point to lower-bit integers—directly addresses this reality without changing model semantics. The book targets practitioners deploying models in production and sets out to provide the judgment needed to navigate the new trade-offs.

Through energy and bandwidth reasoning and a running 7B-parameter example, the chapter makes the “efficiency wall” concrete: fetching data from memory costs orders of magnitude more energy than arithmetic, and each generated token pulls gigabytes through the hierarchy, especially as KV caches scale with context length. Even at modest serving rates, steady-state power and cost accumulate rapidly, while GPU utilization appears low and latency plateaus—symptoms of a memory-bound system. The key mental anchor is that bit precision is primarily an energy decision: every extra bit inflates memory footprint, traffic, and energy per token, long before compute becomes the limit.

Given this diagnosis, the chapter positions quantization as the practical lever because it linearly shrinks the dominant term—bytes moved per token—while composing with existing optimizations. Alternatives like sparse/conditional computation, attention kernel improvements, smarter batching, or distillation help in specific ways but do not systematically reduce memory traffic across the whole model. Quantization often preserves quality at 8- and 4-bit because neural networks are redundant and tolerant to bounded, structured noise; the real work lies in choosing what to quantize (weights, activations, KV cache), when (PTQ, QAT, low-bit adaptation), and to which bit-widths. The chapter closes by framing the floating-point-to-integer transition as a shift from adaptive expressiveness to fixed, efficient grids—ultimately seeking sufficient precision at minimum energy.

Fetching data from HBM costs roughly 1,700× more energy than an INT8 multiply-add. This gap explains why inference systems are bottlenecked by memory, not compute.

Memory traffic per token as context length grows. Model weights (filled) stay constant, but KV cache (hatched) scales linearly with context. At 128K tokens, total memory traffic reaches 78 GB per token, about 5.5× the traffic at 512 tokens.

Floating point concentrates precision near zero (top), leaving large values sparsely represented. Integers use uniform spacing across your chosen range (bottom). Quantization is the act of deciding where to place that fixed grid.

Summary

Modern LLM inference is constrained by memory bandwidth and power consumption, not raw compute—a 7B parameter model moves roughly 15 GB through memory for every token generated, consuming nearly 2 joules of energy per token.
Quantization reduces precision (typically from 16-bit to 8-bit or 4-bit integers), directly cutting memory footprint, bandwidth, and energy consumption in proportion to the bit reduction.
Neural networks tolerate quantization because they encode directions and correlations rather than exact values—their inherent redundancy absorbs the bounded approximation error that lower precision introduces.
The core trade-off in quantization is range versus resolution: integers force you to choose a fixed grid where every number must fit, unlike floating point which auto-scales at the cost of hardware complexity.
Quantization decisions span what to quantize (weights, activations, KV cache), when to quantize (post-training or during training), how aggressively to quantize (8-bit, 4-bit, mixed), and where the model runs (GPU, CPU, edge).

FAQ

What is the “efficiency wall” in modern LLM inference?

Large models and long contexts have shifted inference bottlenecks from arithmetic to memory movement and power. GPUs sit underutilized, latency plateaus, and costs rise because moving bytes (weights, KV cache) dominates over math.

Why is data movement more expensive than compute?

On modern hardware, a single off‑chip DRAM/HBM access costs orders of magnitude more energy than a multiply‑add. Think hundreds to a thousand picojoules to fetch a few bytes versus well under a picojoule for an INT8 MAC. This structural gap makes memory traffic the binding constraint.

How does quantization directly reduce inference cost?

Quantization stores weights/activations in fewer bits (e.g., 16 → 8 → 4). Cutting precision linearly cuts bytes moved per token—shrinking memory footprint, bandwidth demand, and energy—without changing model architecture or computation graphs.
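The linear relationship between bit-width and bytes moved can be checked with a few lines of arithmetic. The sketch below is illustrative only: `weight_bytes` is a hypothetical helper, and it ignores the small overhead of per-group scales and zero-points that real quantized formats carry.

```python
PARAMS = 7e9  # parameter count of the running 7B example


def weight_bytes(num_params: float, bits: int) -> float:
    """Bytes needed to store (and read once) all weights at a given bit-width.

    Simplification: ignores quantization metadata such as scales.
    """
    return num_params * bits / 8


for bits in (16, 8, 4):
    gb = weight_bytes(PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: {gb:.1f} GB read per generated token")
```

Halving the bit-width halves the weight traffic, which is the sense in which quantization shrinks the dominant term linearly.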

Will quantization hurt model quality?

At moderate bit-widths, degradation is typically small because neural networks are redundant and tolerant to bounded noise. Empirically, INT8 is often near FP16, and well-done INT4 (e.g., GPTQ on Llama 2 7B) shows minimal perplexity impact; below 4 bits, method choice matters more.

Where does per‑token memory traffic come from in a 7B transformer?

Model weights and the KV cache dominate; activations are negligible by comparison. For a 7B model at FP16 and ~2k context: ~14 GB (weights) + ~1.0 GB (KV cache reads/writes) ≈ 15 GB moved per generated token.
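One way to arrive at figures of this size is sketched below. The model dimensions (32 layers, hidden size 4096, standard multi-head attention) are assumptions chosen to resemble a typical 7B transformer; they are not stated in the text, and the function names are hypothetical.

```python
def kv_bytes_per_context_token(layers: int, hidden: int, bytes_per_value: int) -> int:
    """Bytes of KV cache held per token of context: one K and one V vector per layer."""
    return 2 * layers * hidden * bytes_per_value


def traffic_per_generated_token(params: float, context: int,
                                layers: int = 32, hidden: int = 4096,
                                bytes_per_value: int = 2):
    """Return (weight_bytes, kv_bytes) read to generate one token.

    Assumes every weight and the full KV cache are read once per token,
    and that activations are negligible. Dimensions are assumed, not sourced.
    """
    weights = params * bytes_per_value
    kv = kv_bytes_per_context_token(layers, hidden, bytes_per_value) * context
    return weights, kv


weights, kv = traffic_per_generated_token(7e9, context=2048)
print(f"weights: {weights / 1e9:.1f} GB, KV cache: {kv / 1e9:.2f} GB, "
      f"total: {(weights + kv) / 1e9:.1f} GB")
```

Under these assumptions the weights contribute 14 GB and the KV cache roughly 1 GB, in line with the estimate above.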

What does that memory traffic mean for power and cost?

Using ~500 pJ per 4 bytes from HBM, ~15 GB per token is about 1.9 joules. At 1,000 tokens/s per replica, that’s ~1.9 kW steady draw—scaling linearly with replicas and becoming an infrastructure‑level energy concern.
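The arithmetic behind these numbers is short enough to write out. The sketch below uses the ~500 pJ per 4 bytes figure from the text; the function names are hypothetical, and the model ignores compute energy, cache hits, and batching.

```python
PJ_PER_4_BYTES = 500.0  # approximate HBM access energy quoted in the text


def energy_joules(bytes_moved: float) -> float:
    """Energy to move `bytes_moved` bytes from HBM, in joules."""
    return bytes_moved / 4 * PJ_PER_4_BYTES * 1e-12


def power_watts(bytes_per_token: float, tokens_per_second: float) -> float:
    """Steady-state power from memory traffic alone (joules/token * tokens/s)."""
    return energy_joules(bytes_per_token) * tokens_per_second


per_token = energy_joules(15e9)        # ~15 GB moved per token
draw = power_watts(15e9, 1_000)        # 1,000 tokens/s per replica
print(f"{per_token:.2f} J per token, {draw / 1000:.2f} kW per replica")
```

Because energy is proportional to bytes moved, halving the bytes per token (for example, FP16 to INT8 weights) roughly halves this estimate.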

How do longer context lengths change the picture?

Weights are fixed, but the KV cache scales linearly with context and quickly dominates traffic. As context grows (e.g., from 512 to 128k tokens), total bytes moved per token can multiply several‑fold, driving up latency and energy.
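The scaling can be illustrated with the same back-of-the-envelope model. The sketch below assumes FP16 values and a 7B model with 32 layers and hidden size 4096 (assumed dimensions, not taken from the text). Exact totals depend on the model configuration and on whether sizes are counted in GB or GiB, so the output should be read as the same order of magnitude as the figures quoted in this chapter, not as a reproduction of them.

```python
WEIGHT_BYTES = 7e9 * 2                # 7B parameters at FP16
KV_BYTES_PER_TOKEN = 2 * 32 * 4096 * 2  # K and V, 32 layers, hidden 4096, FP16 (assumed)


def total_traffic(context: int) -> float:
    """Bytes read per generated token: constant weights plus context-sized KV cache."""
    return WEIGHT_BYTES + KV_BYTES_PER_TOKEN * context


baseline = total_traffic(512)
for context in (512, 2048, 8192, 32768, 131072):
    total = total_traffic(context)
    kv_share = 1 - WEIGHT_BYTES / total
    print(f"context {context:>6}: {total / 1e9:6.1f} GB per token, "
          f"KV share {kv_share:5.1%}, {total / baseline:.1f}x the 512-token case")
```

At short contexts the weights dominate; at 128K tokens the KV cache accounts for most of the traffic under these assumptions.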

Why don’t faster GPUs (more FLOPs) fix latency or utilization?

Hardware advances favor compute density; energy per byte moved improves slowly. You can add FLOPs, but if memory bandwidth and power remain the limit, utilization stays low and latency flattens. Cost tracks bytes, not FLOPs.

How does integer math differ from floating point, and why does it help?

Integers live on a fixed grid with uniform spacing and simpler hardware paths—no exponent alignment or normalization—yielding lower energy per op and higher throughput. Floating point maximizes expressiveness; integers maximize efficiency.
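To make the fixed grid concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is one common scheme chosen for illustration, not necessarily the method the book develops; the function names are hypothetical.

```python
def quantize_symmetric(values, bits=8):
    """Map floats onto a uniform signed-integer grid centred on zero.

    The scale is chosen so the largest magnitude lands on the largest
    representable integer (127 for 8 bits).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats: each grid step is `scale` wide."""
    return [qi * scale for qi in q]


values = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale = quantize_symmetric(values)
restored = dequantize(q, scale)
worst = max(abs(v - r) for v, r in zip(values, restored))
print("grid step:", scale)
print("integers:", q)
print("worst-case error:", worst)
```

The rounding error of any value is bounded by half a grid step, which is the bounded approximation error the chapter refers to. Choosing the scale is the range-versus-resolution decision: a wider range means a coarser step.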

What other optimizations exist, and why is quantization central?

MoE/conditional compute and sparsity can help but require architectural changes. FlashAttention, fused kernels, and better scheduling optimize how data moves, not how much. Batching/speculative decoding amortize costs but don’t reduce per‑token bytes. Distillation creates a different model. Only quantization systematically reduces bits moved per token across the whole model.
