Overview

10 Asynchronous processing

This chapter explains how asynchronous processing hides latency when further reductions aren’t feasible. It contrasts synchronous, blocking workflows with async designs that initiate work without waiting for results, improving perceived responsiveness by overlapping I/O and computation. The chapter frames async as complementary to prior optimization techniques: instead of only shrinking absolute latency, it keeps systems responsive while slow operations proceed, and it introduces the core building blocks you need to make that practical at scale.

The event loop is presented as the heart of async systems: a dispatcher that polls OS multiplexing interfaces (such as epoll, kqueue, io_uring, or IOCP), processes readiness events, and runs scheduled tasks without blocking. Around this core, the chapter surveys async I/O techniques to hide or amortize latency—multiplexing many connections per thread, batching requests, hedging duplicates against tail spikes (with idempotency caveats), buffered I/O, and memory mapping. It then covers deferring non-critical work via scheduling, priority queues (with anti-starvation), and work stealing for load balance. Resource management is treated as essential to low latency: thread pools versus thread-per-core runtimes, careful memory buffer pooling, and connection pooling combined with asynchronous database queries to increase parallelism while controlling overhead.

The chapter also addresses the hard parts: complexity, race conditions, and resource blow-ups if concurrency is left unchecked. It advocates backpressure to regulate producers—TCP window-based throttling, bounded buffering, and last-resort dropping or rate limiting—to keep service latency predictable. Robust error handling is required for partial failures, retries with exponential backoff, and safe idempotent operations, plus timeouts and cancellation with thorough cleanup. Finally, it emphasizes observability for async systems: distributed tracing with propagated context, and metrics that capture concurrency, queue depths, error categories, retry behavior, and resource utilization, along with latency decomposition across wait, queue, and processing stages. The takeaway is that async processing can dramatically improve perceived latency, but only with disciplined flow control, resilient error handling, and strong visibility.

Synchronous vs. asynchronous processing. With synchronous processing (at the top of the diagram), a task runs to completion before the next one starts, so the total time to run tasks A, B, and C is the sum of the individual task times. In contrast, with asynchronous processing (at the bottom of the diagram), all the tasks start at the same time, so the total time equals the time of the slowest task. Asynchronous processing reduces latency when the tasks can execute in parallel.
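The sum-versus-max effect is easy to demonstrate with `asyncio`. The sketch below simulates three I/O-bound tasks with `asyncio.sleep` (a stand-in for real network calls; the task names and durations are made up for illustration):

```python
import asyncio
import time

async def task(name: str, seconds: float) -> str:
    # Simulate an I/O-bound task such as a network call.
    await asyncio.sleep(seconds)
    return name

async def run_sequentially() -> float:
    start = time.monotonic()
    for name, secs in [("A", 0.3), ("B", 0.2), ("C", 0.1)]:
        await task(name, secs)          # wait for each task in turn
    return time.monotonic() - start     # ~0.6s: sum of all task times

async def run_concurrently() -> float:
    start = time.monotonic()
    await asyncio.gather(               # start all three together
        task("A", 0.3), task("B", 0.2), task("C", 0.1))
    return time.monotonic() - start     # ~0.3s: time of the slowest task

sequential = asyncio.run(run_sequentially())
concurrent = asyncio.run(run_concurrently())
```

Nothing runs on extra threads here; the speedup comes purely from overlapping the waits.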
The event loop breaks work down into individual tasks that execute when an event happens. In this example, the event loop processes three tasks (accept connection, process request, and send response) as part of handling a request arriving from the network. Each task runs when an event, such as a socket becoming readable, occurs.
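The three stages from the figure can be modeled with a toy dispatcher. This is a deliberately simplified sketch: a real event loop would poll an OS interface such as epoll or kqueue rather than draining an in-memory queue, and the event names here are hypothetical:

```python
from collections import deque

class EventLoop:
    """Toy event loop: handlers registered per event type are
    dispatched one at a time as events arrive."""

    def __init__(self):
        self.handlers = {}
        self.events = deque()

    def on(self, event, handler):
        self.handlers[event] = handler

    def emit(self, event, payload=None):
        self.events.append((event, payload))

    def run(self):
        while self.events:
            event, payload = self.events.popleft()
            self.handlers[event](payload)   # run the task for this event

loop = EventLoop()
log = []

# Tasks for the three request-processing stages from the figure.
# Each task does its work, then emits the event that triggers the next stage.
loop.on("accept", lambda _: (log.append("accepted"), loop.emit("readable")))
loop.on("readable", lambda _: (log.append("processed"), loop.emit("writable")))
loop.on("writable", lambda _: log.append("responded"))

loop.emit("accept")   # a connection arrives from the network
loop.run()
```

The key property is that no handler ever blocks waiting for the next event; each stage finishes and returns control to the loop.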
Request hedging is a latency-hiding technique where the client sends two or more copies of the same request. The client uses whichever response arrives first and ignores the rest. Hedging helps when, for example, the network path for some requests and responses is slower than for others, or messages get queued somewhere along the path.
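A minimal hedging sketch with `asyncio`, assuming an idempotent, hypothetical `fetch` call whose latency varies per replica (the replica names and latency range are illustrative):

```python
import asyncio
import random

async def fetch(replica: str) -> str:
    # Hypothetical backend call with variable latency.
    await asyncio.sleep(random.uniform(0.01, 0.2))
    return f"response from {replica}"

async def hedged_fetch() -> str:
    # Send the same (idempotent!) request to two replicas and keep
    # whichever answer arrives first.
    tasks = [asyncio.create_task(fetch(r)) for r in ("replica-1", "replica-2")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                 # the late response is ignored
    return done.pop().result()

result = asyncio.run(hedged_fetch())
```

Cancelling the straggler matters: without it, hedging doubles the load on the backend for no benefit once a winner exists.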
Backpressure controls the flow of work from producers to a consumer to avoid overwhelming the consumer. Clients push work to a server, which buffers it; the service pulls work from the buffer. The service also signals buffer capacity limits back to the clients so that they know when to slow down and avoid overwhelming the system.
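A bounded queue gives you this signaling for free. In the sketch below (a toy single-process model, not a networked server), `asyncio.Queue(maxsize=3)` suspends the producer whenever the buffer is full, which is exactly the backpressure signal:

```python
import asyncio

async def producer(queue: asyncio.Queue, n: int) -> None:
    for i in range(n):
        # put() suspends when the queue is full: the bounded buffer
        # itself tells the producer to slow down (backpressure).
        await queue.put(i)

async def consumer(queue: asyncio.Queue, out: list) -> None:
    while True:
        item = await queue.get()      # pull work from the buffer
        out.append(item)
        queue.task_done()

async def main() -> list:
    queue = asyncio.Queue(maxsize=3)  # bounded, never unbounded
    out: list = []
    worker = asyncio.create_task(consumer(queue, out))
    await producer(queue, 10)
    await queue.join()                # wait until all buffered work is done
    worker.cancel()
    return out

processed = asyncio.run(main())
```

An unbounded queue would accept everything and hide the overload until memory runs out; the bound turns overload into a visible slowdown at the producer.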

Summary

  • In synchronous processing, tasks run one after another, each waiting for the previous task to complete before starting. In contrast, asynchronous processing is primarily about structuring your application so that tasks can start independently, which addresses the problem of some tasks taking a long time to complete.
  • The event loop is a fundamental concept in asynchronous processing: a dispatcher at the core of the system polls for events, such as data arriving from the network, and reacts to them.
  • Although asynchronous processing can improve performance and reduce latency, it has downsides too: resource management and error handling are often more complex.
  • I/O multiplexing is a fundamental OS primitive enabling the event loop approach. It lets the event loop efficiently monitor thousands of event sources so that the application can react to events as they happen.
  • Asynchronous processing enables various efficient latency-hiding techniques, such as request hedging, deferred work, and more.
  • Managing concurrency with backpressure is critical in asynchronous systems to avoid overwhelming the system.
  • Asynchronous processing requires special attention to error handling. For example, handling partial failures and recovering from them can be tricky. Timeouts and cancellation are also essential for dealing with asynchronous task errors.
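The timeout-and-cancellation pattern from the last point can be sketched with `asyncio.wait_for`. The operation and its cleanup step are hypothetical; the point is that cancellation gives the task a chance to release resources before it is torn down:

```python
import asyncio

cleanup_log: list = []

async def slow_operation() -> str:
    try:
        await asyncio.sleep(10)       # simulate an operation that hangs
        return "done"
    except asyncio.CancelledError:
        # Cancellation runs our cleanup before the task is torn down.
        cleanup_log.append("released resources")
        raise                         # always re-raise after cleanup

async def main() -> str:
    try:
        # Bound the wait; on timeout, wait_for cancels the task for us.
        return await asyncio.wait_for(slow_operation(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"

outcome = asyncio.run(main())
```

Re-raising `CancelledError` after cleanup is important: swallowing it makes the task uncancellable and can hang shutdown.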

FAQ

What is asynchronous processing and how does it reduce perceived latency compared to synchronous processing?
Asynchronous processing lets tasks progress independently without blocking for results. Instead of completing A, then B, then C in sequence (the sum of all latencies), an async system starts them together and completes when the slowest finishes (the maximum latency), reducing perceived wait time. This hides unavoidable I/O delays by overlapping work and reacting to events when they’re ready, keeping the system responsive.
When does asynchronous processing help, and when can it make things slower?
Async helps when tasks are independent and can overlap, such as multiple network/database calls. It reduces idle time waiting on I/O. It can hurt when tasks are inherently sequential or tightly dependent; the coordination overhead (scheduling, state machines, callbacks/futures) may add latency without any parallelism to exploit.
What is an event loop and how does it enable efficient async I/O?
An event loop is a dispatcher that multiplexes many I/O sources (sockets, files, timers) on one thread. It repeatedly: 1) polls for events, 2) processes events, 3) runs scheduled tasks, then repeats. Using OS interfaces (e.g., io_uring, epoll, kqueue, IOCP), it registers interest and wakes only when sources are ready, avoiding blocking and enabling a single thread to manage thousands of connections efficiently.
How does an event-driven server differ from a synchronous server?
Key differences: 1) Non-blocking operations: rather than calling blocking recv()/send(), it registers interest in readiness and only acts when readable/writable. 2) Resource efficiency: one event loop thread can handle many connections by polling multiple sources at once. 3) Structure: work is split into small tasks (accept, read/process, write) triggered by readiness events instead of a linear request lifecycle per thread.
What are the main challenges of building asynchronous systems?
Common pitfalls include: 1) Complexity: managing dependencies, race conditions, and non-linear control flow. 2) Resource management: many concurrent tasks can exhaust memory/handles; throttling is needed. 3) Debuggability: non-deterministic ordering and limited stack traces complicate diagnosis. 4) Error handling: coordinating partial failures and deciding how one task’s error affects others.
What is I/O multiplexing and why is it central to async servers?
I/O multiplexing (epoll, kqueue, io_uring, IOCP) lets one thread monitor many connections and act only when specific events occur (readable, writable, timers). It removes the need for a thread-per-connection, cutting context switches and improving scalability for high-connection workloads like web servers and real-time systems.
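Python's `selectors` module wraps these OS interfaces behind one API, which makes the idea easy to demonstrate. The sketch below uses a `socketpair` as a toy stand-in for a server watching many client connections: the selector wakes only when a registered socket is actually readable:

```python
import selectors
import socket

# One selector can monitor many sockets; here we register just one
# end of a socket pair and wait for it to become readable.
sel = selectors.DefaultSelector()      # picks epoll/kqueue/etc. per OS
left, right = socket.socketpair()
left.setblocking(False)
right.setblocking(False)

sel.register(right, selectors.EVENT_READ)

left.sendall(b"ping")                  # make `right` readable

received = b""
for key, _events in sel.select(timeout=1):
    received = key.fileobj.recv(1024)  # act only on the ready socket

sel.close()
left.close()
right.close()
```

A real server would keep looping over `sel.select()` with thousands of registered connections; the cost of monitoring stays roughly constant per ready event rather than per connection.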
How do request batching and buffered I/O hide latency?
  • Request batching: send multiple operations in a single network round-trip to amortize RTT; useful when responses can be processed asynchronously or are not latency-critical. Tune the batch size or send responses individually to avoid response latency spikes.
  • Buffered I/O: accumulate reads/writes in memory and issue larger system calls to reduce syscall overhead. It can be paired with readahead to fetch data before it’s needed.
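The amortization effect of batching can be shown with a small sketch. `BatchingClient` is a hypothetical client whose (simulated) round-trip counter stands in for real network calls:

```python
class BatchingClient:
    """Hypothetical client that amortizes round-trips by sending many
    operations per call instead of one call per operation."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.pending: list = []
        self.round_trips = 0

    def send(self, op) -> None:
        self.pending.append(op)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            # One (simulated) network round-trip for the whole batch.
            self.round_trips += 1
            self.pending.clear()

client = BatchingClient(batch_size=8)
for i in range(20):
    client.send(i)
client.flush()                        # flush the final partial batch

# 20 operations cost 3 round-trips instead of 20.
```

The trade-off named above shows up in the batch size: larger batches amortize more but delay the first operations in each batch.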
What is request hedging and what trade-offs does it involve?
Hedging sends duplicate requests and uses the first response, masking latency variance from networks or third-party services. Trade-offs: higher load on the service (which can worsen latency), the need for idempotent operations, and more complex error policies (e.g., accept the first response vs. wait for specific success conditions). Use selectively where variability is high and capacity permits.
How do deferring work, task scheduling, and priority queues improve user-perceived latency?
  • Deferring work: run only user-visible, time-critical tasks immediately; postpone non-critical tasks (e.g., analytics, reconciliation) to reduce foreground latency.
  • Scheduling: balance immediate vs. deferred execution to preserve resources.
  • Priority queues: assign importance levels, prevent starvation via aging, and adjust priorities dynamically (e.g., off-peak boosts for batch tasks) to keep critical paths fast.
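Aging can be sketched on top of `heapq`. In this toy model (one possible aging scheme, not a production scheduler), every pop boosts the priority of everything still waiting, so a low-priority batch task eventually beats a steady stream of fresh high-priority arrivals instead of starving:

```python
import heapq
import itertools

class AgingPriorityQueue:
    """Toy priority queue with aging: each pop boosts everything
    still waiting by one step, so old tasks are never starved."""

    def __init__(self):
        self.heap: list = []
        self.counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, priority: int, task: str) -> None:
        # Lower number = higher priority.
        heapq.heappush(self.heap, [priority, next(self.counter), task])

    def pop(self) -> str:
        _priority, _, task = heapq.heappop(self.heap)
        # Age: boost every task left behind. A uniform decrement
        # preserves heap order, so no re-heapify is needed.
        for entry in self.heap:
            entry[0] -= 1
        return task

pq = AgingPriorityQueue()
pq.push(5, "batch-report")            # low priority (higher number)
order = []
for i in range(8):
    pq.push(1, f"user-{i}")           # a steady stream of urgent tasks
    order.append(pq.pop())
```

Without aging, `batch-report` would never run while urgent tasks keep arriving; with it, the report runs after a bounded wait (here, on the fifth pop).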
How do backpressure, buffering, and rate limiting keep async systems stable?
  • Backpressure: consumers signal producers to slow down (e.g., honoring the TCP receive window or delaying reads until capacity exists).
  • Buffering: use bounded queues to absorb bursts; size carefully (large buffers tolerate spikes but add latency; small buffers reduce latency but may drop work). Distinguish this from buffered I/O, which reduces syscall/I/O overhead.
  • Dropping/rate limiting: as a last resort, reject or limit requests per client/time window to prevent overload and protect tail latency.
