This chapter lays the foundation for building low‑latency applications by unifying core concepts, practical techniques, and mental models that are often scattered across folklore and blog posts. It introduces latency as a first-class performance concern, explains how to reason about it across the entire stack, and frames why it matters to product outcomes, real-time guarantees, and efficiency. The discussion previews key distinctions and trade-offs—especially against throughput and bandwidth—and sets expectations that optimizing latency frequently requires balancing physics, architecture, and workload behavior, including energy considerations.
Latency is defined as the time between a cause and its observed effect, a definition that scales from everyday interactions to complex distributed systems. Concrete examples—from the perceptible delay in smart or even some LED lighting to end‑to‑end web requests—illustrate how latency compounds across clients, networks, services, and rendering, and how variance directly shapes user experience. The chapter highlights that latency exists at every layer, down to packet handling from NIC queues through the kernel to userspace, and must be measured in units of time that span nanoseconds to milliseconds and beyond. Physical limits, such as the speed of light, bound what is achievable and make strategies like co‑location and careful system design essential when targets are tight.
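To make the point about units concrete, here is a minimal C sketch, not taken from the chapter, that measures the latency of a single operation with a monotonic clock; do_work() is a hypothetical stand-in for whatever cause-to-effect span you want to time.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the operation whose latency we measure. */
static void do_work(void)
{
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 }; /* ~1 ms of "work" */
    nanosleep(&pause, NULL);
}

int main(void)
{
    struct timespec start, end;

    /* CLOCK_MONOTONIC is unaffected by wall-clock adjustments, which makes
       it suitable for measuring elapsed time. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    do_work();                              /* the cause */
    clock_gettime(CLOCK_MONOTONIC, &end);   /* the observed effect */

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("latency: %lld ns (%.3f ms)\n", ns, ns / 1e6);
    return 0;
}

The same pattern scales from nanosecond-level measurements of a cache access to millisecond-level measurements of a network round trip; only the resolution you care about changes.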
Latency matters for three primary reasons: it drives user satisfaction and business outcomes, it underpins real‑time guarantees (hard vs. soft deadlines), and it often equates to efficiency in an era where free speedups from hardware have plateaued. The chapter distinguishes latency from bandwidth and throughput, emphasizing that while capacity can often be scaled, poor latency is stubborn—and sometimes must be traded against higher throughput, as with pipelining analogies. Finally, it surfaces the tension between latency and energy: techniques like busy polling can reduce response time yet raise power usage, though under steady, high‑frequency traffic they may also improve total energy efficiency. The overarching message is to optimize with clear goals, realistic constraints, and workload‑aware trade‑offs.
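As a rough sketch of the busy-polling trade-off mentioned above (and not code from the book), the following C program has a producer thread set a shared flag while the main thread waits for it: the spinning wait reacts almost immediately but keeps a CPU core fully busy, while the sleep-and-recheck wait lets the core idle at the cost of up to one sleep interval of extra latency. The flag, the 100 microsecond sleep, and the 5 ms producer delay are illustrative assumptions.

#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Shared flag set by a producer thread (a stand-in for "data has arrived"). */
static atomic_bool ready;

/* Busy polling: lowest reaction latency, but the waiting core runs at full power. */
void wait_busy_poll(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ; /* spin */
}

/* Sleep and recheck: the core can idle between checks, saving energy,
   but each check may add up to the sleep interval (100 us here) of latency. */
void wait_with_sleep(void)
{
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 100000 };
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        nanosleep(&pause, NULL);
}

void *producer(void *arg)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 5000000 }; /* 5 ms */
    (void)arg;
    nanosleep(&delay, NULL);
    atomic_store_explicit(&ready, true, memory_order_release);
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec start, end;

    pthread_create(&t, NULL, producer, NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    wait_busy_poll();   /* swap in wait_with_sleep() to compare the two strategies */
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("waited %lld ns for the flag\n", ns);
    return 0;
}

Which strategy wins on energy depends on the workload: under a steady stream of events the spinning core is rarely idle anyway, while for sporadic events the sleeping variant avoids burning power on empty loops.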
Length of a nanosecond. Source: https://americanhistory.si.edu/collections/search/object/nmah_692464
Processing without pipelining. We first perform step W (washing) fully and only then perform step D (drying). Step W takes 30 minutes and step D takes 60 minutes, so the two steps together take 90 minutes. Therefore, we say that the latency to wash and dry a load of laundry is 90 minutes and the throughput is 1/90 loads of laundry per minute.
Processing with pipelining. We perform step W (washing) in full, but as soon as it completes, we start washing the next load while drying the previous one in parallel. Because the dryer is the bottleneck, a washed load has to wait for it, so once the pipeline is full the latency of a single load rises to 120 minutes, worse than without pipelining. However, pipelining increases throughput to 1/60 loads of laundry per minute, which means we complete four loads of laundry in the same time it takes the non-pipelined approach to complete three.
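The schedule behind these numbers can be replayed with a short program. The following C sketch is an illustration rather than code from the chapter: it models a 30-minute washer and a 60-minute dryer, assumes a washed load stays in the washer until the dryer is free, and prints per-load latency and completion times for the pipelined schedule next to the serial baseline.

#include <stdio.h>

#define WASH  30   /* minutes for step W */
#define DRY   60   /* minutes for step D */
#define LOADS 4

static int max(int a, int b) { return a > b ? a : b; }

int main(void)
{
    int wash_free = 0, dry_free = 0;
    int i;

    /* Serial baseline: each load washes and dries before the next one starts. */
    printf("serial:    latency %d min/load, 3 loads finished at %d min\n",
           WASH + DRY, 3 * (WASH + DRY));

    /* Pipelined: the dryer is the bottleneck, so a washed load sits in the
       washer until the dryer frees up, which also delays the next wash. */
    for (i = 1; i <= LOADS; i++) {
        int wash_start = wash_free;
        int dry_start  = max(wash_start + WASH, dry_free);
        int dry_end    = dry_start + DRY;

        wash_free = dry_start;   /* washer is blocked until the load moves on */
        dry_free  = dry_end;
        printf("pipelined: load %d finished at %3d min (latency %d min)\n",
               i, dry_end, dry_end - wash_start);
    }
    return 0;
}

Running it shows per-load latencies of 90, 120, 120, and 120 minutes, with the fourth pipelined load finishing at 270 minutes, exactly when the serial schedule finishes its third.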
Summary
Latency is the time delay between a cause and its observed effect.
Latency is measured in units of time.
You need to understand the latency constants of your system when designing for low latency.
Latency matters because people expect a real-time experience.
When optimizing for latency, there are sometimes throughput and energy efficiency trade-offs.
FAQ
What is latency?
Latency is the time delay between a cause and its observed effect. In systems terms, it is the time between initiating an action (such as sending a request) and when its outcome becomes observable (like receiving a response). What you measure as "cause" and "effect" depends on context, so the latency span can differ across use cases.

How is latency measured?
Latency is measured in units of time. Typical scales range from nanoseconds for CPU caches and DRAM, to microseconds for SSD/NVMe access, and milliseconds for network round trips. As rough anchors: an L1 cache hit is ~1 ns, DRAM ~100 ns, NVMe ~10 μs, SSD ~100 μs, and a packet from New York to London ~60 ms.

How does this book use the terms latency and response time?
Some sources define response time as service time plus wait time and reserve "latency" for the waiting part. This book uses a more general definition: latency is the time delay between a cause and an effect. Practically, service time is the request processing latency, wait time covers network and queuing latency, and response time is the total request latency observed by the user.

What are intuitive examples of latency in everyday systems?
Turning on lights illustrates latency as the delay between pressing a switch (cause) and light emission (effect); smart bulbs add network hops that make this delay visible, and even some LEDs can take up to ~2 seconds. On the web, end-to-end latency spans DNS lookup, the TCP/HTTP exchange, server processing, downstream service calls (such as databases), and client-side rendering; these stages compound into the overall delay.

Why does latency matter for user experience and business outcomes?
Humans perceive actions under ~100 ms as instant; ~1 s feels fast but noticeable; beyond ~10 s feels slow without progress feedback. Empirically, higher latency reduces engagement and conversions; studies have reported measurable drops in purchases and interactions from even modest slowdowns. In competitive markets with low switching costs, optimizing latency is a clear UX and business advantage.

What's the difference between hard and soft real-time requirements?
Hard real-time systems must meet strict deadlines; missing one is a system failure with potentially catastrophic consequences (e.g., pacemakers, safety-critical sensors). Soft real-time systems tolerate occasional deadline misses with degraded quality rather than failure (e.g., audio/video streaming). This book focuses on broadly applicable low-latency techniques, touching on real-time methods where relevant.

How do latency, throughput, and bandwidth differ?
Latency is how long a single operation takes from start to observable finish. Bandwidth is the maximum capacity of a channel (how much data could be moved per unit time). Throughput is the realized rate of successful work (how much data or how many requests actually flow). Bandwidth sets an upper bound for throughput, but throughput and latency can vary independently, and the book emphasizes throughput over bandwidth when discussing data rates.

Can improving throughput hurt latency? (The laundry/pipelining trade-off)
Yes. In a serial "wash then dry" flow, latency is 30 + 60 = 90 minutes per load, with a throughput of 1/90 loads per minute (about 0.67 loads per hour). Pipelining, which overlaps washing and drying, increases throughput to 1/60 loads per minute (1 load per hour) but raises per-load latency to ~120 minutes, illustrating the latency–throughput trade-off.

What fundamental limits constrain latency?
The speed of light sets a hard lower bound on how fast information can travel, and practical media like optical fiber slow it further. These physical constraints mean some latencies cannot be optimized away, and they motivate techniques like co-location to reduce distance and hops.

How does optimizing for latency interact with energy efficiency?
Lower latency can conflict with energy goals. Techniques like busy polling reduce scheduling delays but consume constant power; sleep–wake strategies save energy but add wake-up latency. Depending on traffic patterns, busy polling may even be more energy-efficient at high request rates, while sporadic workloads favor sleeping. Choose based on your latency targets and energy budget.
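To see why co-location matters, here is a back-of-the-envelope C sketch of that speed-of-light bound; the roughly 5,600 km New York to London distance and the roughly two-thirds-of-c propagation speed in optical fiber are assumptions for illustration, not figures from the chapter.

#include <stdio.h>

int main(void)
{
    /* Assumed great-circle distance between New York and London. */
    const double distance_km = 5600.0;
    /* Speed of light in vacuum, and roughly two thirds of that in optical fiber. */
    const double c_km_per_s     = 299792.458;
    const double fiber_km_per_s = c_km_per_s * 2.0 / 3.0;

    double rtt_vacuum_ms = 2.0 * distance_km / c_km_per_s * 1000.0;
    double rtt_fiber_ms  = 2.0 * distance_km / fiber_km_per_s * 1000.0;

    printf("round-trip lower bound in vacuum: %.1f ms\n", rtt_vacuum_ms);
    printf("round-trip lower bound in fiber:  %.1f ms\n", rtt_fiber_ms);
    return 0;
}

It reports a round-trip floor of roughly 37 ms in vacuum and 56 ms in fiber; routing, queuing, and processing only add to that, which is why no amount of software tuning makes a transatlantic request feel local.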