Table of contents

1 Thinking in distributed systems: Models, mindsets, and mechanics

1.1 Software engineering and mental models

1.1.1 Mental models: The foundation of reasoning

1.1.2 Correct mental models

1.1.3 Complete mental models

1.2 Mental model of software systems

1.3 Different types of models

1.3.1 Different models describing the same aspects

1.3.2 Different models describing different aspects of a system

1.4 Thinking about distributed systems

1.4.1 Correctness

1.4.2 Scalability and reliability

1.4.3 Responsiveness

1.5 Two big ideas

1.5.1 Systems of systems

1.5.2 Global view vs. local view

1.6 Distributed Systems Incorporated

1.7 Navigating complexity

1.7.1 Simple yet complex

1.7.2 Emergent behavior

1.7.3 Changing perspective

1.7.4 Think globally; act locally

1.8 Thinking above the code

2 System models, order, and time

2.1 System models

2.1.1 Theory and practice

2.1.2 Synchronous distributed systems

2.1.3 Asynchronous distributed systems

2.1.4 Partially synchronous systems

2.1.5 Component and network behavior

2.1.6 Realistic system models

2.2 Order and time

2.2.1 The happened-before relationship

2.2.2 Time and clocks

2.2.3 Physical time and physical clocks

2.2.4 Logical time and logical clocks

2.2.5 Physical clocks vs. logical clocks

3 Failure tolerance

3.1 In theory

3.2 Types of failure tolerance

3.2.1 Masking failure tolerance

3.2.2 Nonmasking failure tolerance

3.2.3 Fail-safe failure tolerance

3.2.4 None of the above

3.3 In practice

3.3.1 System model

3.3.2 Failure handling

3.3.3 Failure classification

3.3.4 Failure detection

3.3.5 Failure mitigation

3.3.6 Putting everything together

4 Message delivery and processing

4.1 Exchanging messages

4.2 The uncertainty principle of message delivery and processing

4.2.1 Before sending the request

4.2.2 After sending the request and before receiving a response

4.2.3 After receiving a response

4.3 Silence and chatter

4.4 Exactly-once processing semantics

4.5 Idempotence

4.6 Case study: Charging a credit card

5 Transactions

5.1 Abstractions

5.2 The magic of transactions

5.2.1 Concurrency

5.2.2 Failure

5.3 The model of transactions

5.3.1 Correctness

5.3.2 Serializability

5.3.3 Completeness

5.3.4 Application-level abort

5.3.5 Platform-level abort

6 Distributed transactions

6.1 Atomic commitment: From a single RM to multiple RMs

6.1.1 Transaction on a single RM

6.1.2 Transaction on multiple RMs

6.1.3 Blocking and nonblocking

6.2 The essence of distributed transactions

6.3 Two-Phase Commit protocol

6.3.1 In the absence of failure

6.3.2 In the presence of failure

6.3.3 Improvement

7 Partitioning

7.1 Encyclopedias and volumes

7.2 Thinking in partitions

7.3 The mechanics of partitioning and balancing

7.4 (Re)partitioning

7.4.1 Types of partitioning

7.4.2 Data item to partition assignment strategies

7.5 Common item-based assignment strategies

7.5.1 Range partitioning

7.5.2 Hash partitioning

7.6 Repartitioning

7.6.1 Range partitioning

7.6.2 Hash partitioning

7.7 Consistent hashing

7.8 (Re)balancing and overpartitioning

8 Replication

8.1 Redundancy

8.2 Thinking about replication and consistency

8.3 Replication

8.4 The mechanics of replication

8.4.1 System model

8.4.2 Replication lag

8.4.3 Synchronous vs. asynchronous replication

8.4.4 State-based vs. log-based replication

8.4.5 Single-leader, multileader, and leaderless systems

9 Consistency

9.1 Consistency models

9.1.1 Common consistency models

9.1.2 Virtues and limitations

9.2 Linearizability

9.2.1 Queue and stack

9.2.2 Formal definition of linearizability

9.3 Eventual consistency

9.3.1 The shopping cart

9.3.2 Variants of eventual consistency

9.3.3 Implementation

9.4 Consistency, availability, and partition tolerance

9.4.1 History

9.4.2 Conjecture vs. theorem

9.4.3 CAP theorem

10 Distributed consensus

10.1 The challenge of reaching agreement

10.2 System model

10.3 State machine replication

10.4 The origin—and irony—of consensus

10.5 Implementing consensus

10.5.1 Leader-based consensus

10.5.2 Quorum-based consensus

10.5.3 Combining leader and quorum

10.6 Raft

10.6.1 The log

10.6.2 Terms

10.6.3 Leader Election protocol

10.6.4 Log Replication protocol

10.6.5 State machine safety

10.7 Raft puzzles

10.7.1 Puzzle 1

10.7.2 Puzzle 2

10.7.3 Puzzle 3

11 Durable executions

11.1 The pitfalls of partial executions

11.2 System model

11.2.1 Process definition

11.2.2 Process execution

11.3 The concept of failure-transparent recovery

11.4 Strategies of failure-transparent recovery

11.4.1 Restart

11.4.2 Resume

11.5 Implementation of failure-transparent recovery

11.5.1 Application-level implementation: Sagas

11.5.2 Platform-level implementation: Durable execution

12 Cloud and services

12.1 From proactive to reactive

12.2 Cloud computing

12.3 Cloud-native computing

12.4 Serverless computing

12.4.1 Traditional

12.4.2 Serverless

12.4.3 Cold path vs. hot path

12.5 Service

12.5.1 Global view vs. local view

12.5.2 Example recommendation service

12.6 Final thoughts

Overview

2 System models, order, and time

Distributed system design starts with an explicit system model: a set of assumptions about components, the network, and timing. The chapter contrasts theoretical and practical models, then frames synchrony as an assumption about time: synchronous systems have precise or bounded timing guarantees, while asynchronous systems range from having no notion of time (no timeouts) to a weak one (local, unsynchronized clocks that permit timeouts). Reality sits in between as partial synchrony: systems behave synchronously most of the time but occasionally act asynchronously. Within this model, components may crash and stop, pause and resume (omission), or crash and recover with possible state loss; networks may reorder, drop, or duplicate messages. A pragmatic baseline emerges: partially synchronous systems, components subject to crash-stop/omission/crash-recovery, and unreliable networks, with Byzantine behavior typically out of scope.

Because collaboration depends on noncommutative actions, correctness hinges on establishing a consistent order of events. The chapter illustrates how differing message arrival orders create race conditions—situations with multiple possible executions, some correct and some not—and shows how introducing a coordinator that assigns progressively increasing tags (sequence numbers) enables receivers to detect gaps and restore a single coherent order. This framing elevates ordering from an implementation detail to a first-class concern that underpins safety and liveness in protocols.
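The race-condition idea above can be made concrete with a minimal sketch. It assumes two hypothetical, noncommutative actions on an integer balance (`add_100` and `double`, neither taken from the chapter) and enumerates every order in which they could be applied. If the orders lead to different results, and only one matches the intended result, the system has a race condition in the sense described here.

```python
from itertools import permutations

# Hypothetical noncommutative actions: applying them in a different
# order yields a different final state.
def add_100(balance: int) -> int:
    return balance + 100

def double(balance: int) -> int:
    return balance * 2

def apply_in_order(actions, initial: int) -> int:
    """Apply the actions one after another, starting from `initial`."""
    state = initial
    for action in actions:
        state = action(state)
    return state

def possible_outcomes(actions, initial: int) -> set[int]:
    """Final states over every possible arrival order of the actions."""
    return {apply_in_order(order, initial) for order in permutations(actions)}

# Assumption for illustration: the sender intended add_100 first, then double.
intended = apply_in_order([add_100, double], 100)
outcomes = possible_outcomes([add_100, double], 100)

# Race condition: several executions are possible, some correct, some not.
has_race = intended in outcomes and any(o != intended for o in outcomes)
print(sorted(outcomes), has_race)
```

With commutative actions (for example, two additions) every order gives the same result, so `has_race` would be false and no ordering mechanism would be needed.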

To reason about order, the chapter builds on Lamport’s happened-before relation, a partial order capturing intra-component sequencing, message-send/receive causality, and transitivity. It uses this lens to distinguish concurrency (no causal ordering) from parallelism (temporal overlap) and then ties ordering to clocks through the requirement of clock consistency. Physical clocks timestamp real time but suffer skew and drift, prompting synchronization and the careful use of time-of-day versus monotonic clocks. Logical clocks, such as Lamport and vector clocks, assign timestamps that respect causality across components; even when not used explicitly, many systems embed logical time (for example, per-partition offsets or per-key sequence numbers). In practice, distributed systems combine physical time for measuring durations and triggering timeouts with logical time to enforce consistent cross-component ordering.

System models

From synchronous to asynchronous system models

Component failures

Crash-stop failure

Omission failure

Crash-recovery failure

Byzantine failure

Message reordering

Message duplication

Message loss

Proposers and acceptors

Proposers, acceptors, and a coordinator

Happened-before, intra-component

Happened-before, inter-component

Happened-before, transitively

Clock skew

Clock drift

Lamport clocks

Summary

- System models encode assumptions about components, network, and timing behavior; different system models affect algorithm correctness.
- Synchronous systems have strict timing guarantees, while asynchronous systems operate with no timing guarantees. Partially synchronous systems combine the properties of both synchronous and asynchronous systems, operating synchronously most of the time but tolerating asynchronous behavior occasionally.
- Component failures include crash-stop, omission, crash-recovery, and Byzantine; network failures include message reordering, duplication, and loss.
- Order of events is crucial for correctness; logical clocks, such as Lamport clocks, are used to establish event order and capture causality.

FAQ

What is a system model, and why is it crucial in distributed systems?

It is the set of assumptions about components, the network, and timing (e.g., failures, message behavior, clock properties). Correctness depends on these assumptions—an algorithm proven correct under one model may be incorrect under another.

How do synchronous, asynchronous, and partially synchronous models differ?

- Synchronous: strong notion of physical time; clocks are perfect or have known bounds; processing and communication have bounded delays.
- Asynchronous (no notion of time): no clocks; arbitrary delays; timeouts are impossible.
- Asynchronous (weak notion of time): unsynchronized local clocks; arbitrary delays; timeouts are allowed.
- Partially synchronous: systems behave synchronously most of the time and asynchronously sometimes; common in practice.
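To illustrate what "timeouts are allowed" buys in the weak-notion-of-time and partially synchronous models, here is a minimal sketch of a timeout check driven by a local, unsynchronized clock. The class `HeartbeatMonitor`, its methods, and the numeric clock readings are hypothetical names and values chosen for illustration. The sketch also shows the catch: because delays are unbounded in the asynchronous phases, a timeout can only raise a suspicion, which a late message may later refute.

```python
class HeartbeatMonitor:
    """Suspects a peer if no heartbeat arrived within `timeout` seconds.

    All `now` values are readings of the monitor's own local clock;
    no synchronized clock is assumed.
    """

    def __init__(self, timeout: float, now: float = 0.0):
        self.timeout = timeout
        self.last_heartbeat = now

    def heartbeat(self, now: float) -> None:
        """Record that a heartbeat message arrived at local time `now`."""
        self.last_heartbeat = now

    def suspects(self, now: float) -> bool:
        """True if the peer has been silent for longer than the timeout."""
        return now - self.last_heartbeat > self.timeout


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat(now=1.0)

within_bound = monitor.suspects(now=2.0)   # silence of 1s: not suspected
after_silence = monitor.suspects(now=5.0)  # silence of 4s: suspected

# The peer was only slow: its delayed heartbeat arrives and clears the suspicion.
monitor.heartbeat(now=6.0)
after_late_heartbeat = monitor.suspects(now=6.5)
```

In a model with no notion of time at all, `suspects` could not be written, since there would be no `now` to compare against.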

Do timing assumptions imply anything about failures or message loss?

No. Timing (synchronous/asynchronous) is independent of failure assumptions. Component failures and network faults are separate dimensions of the system model.

What component failure models are commonly considered?

- Crash-stop: the component halts forever (ceases to exist).
- Omission: the component pauses for an arbitrary time, then resumes with state intact (takes a break).
- Crash-recovery: the component pauses and then resumes but may lose volatile state (memory loss).
- Byzantine: arbitrary/malicious behavior (anything may happen). Many practical systems ignore Byzantine faults.

What network faults should I plan for, and what is an unreliable network?

- Message reordering: delivery order may differ from send order.
- Message duplication: receivers may see duplicates.
- Message loss: messages may never arrive.

“Unreliable network” means any of the above can occur. Typically, the network is assumed not to lie: delivered messages were actually sent by some component.
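The three network faults can be simulated in a few lines, which is handy for testing a protocol against them. This is a sketch under stated assumptions: the function `unreliable_deliver` and the probabilities `p_loss` and `p_dup` are hypothetical, and each message is duplicated at most once to keep the example short.

```python
import random

def unreliable_deliver(sent, rng, p_loss=0.2, p_dup=0.2):
    """Return the messages a receiver might observe for the given sends.

    Models message loss, duplication, and reordering, but never invents
    a message: the network is assumed not to lie.
    """
    delivered = []
    for message in sent:
        if rng.random() < p_loss:
            continue                   # message loss: never arrives
        delivered.append(message)
        if rng.random() < p_dup:
            delivered.append(message)  # message duplication: arrives twice
    rng.shuffle(delivered)             # message reordering
    return delivered


sent = [f"m{i}" for i in range(10)]
delivered = unreliable_deliver(sent, random.Random(42))
print(delivered)
```

Passing a seeded `random.Random` makes a failing run reproducible, which matters when a bug only appears under one particular interleaving.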

What is the happened-before relationship, and why is it a partial order?

It orders events by causality: (1) within a component, earlier events precede later ones; (2) a send event precedes its corresponding receive event; and (3) it is transitive. Some event pairs are unrelated (concurrent), so the order is partial, not total.
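The three rules translate directly into a graph: events are nodes, rules 1 and 2 contribute edges, and rule 3 (transitivity) becomes reachability. The sketch below assumes a hypothetical representation in which an event is a `(component, index)` pair and a message is a `(send_event, receive_event)` pair; none of these names come from the chapter.

```python
from collections import defaultdict

Event = tuple[str, int]  # (component name, position within that component)

def build_edges(event_counts: dict[str, int],
                messages: list[tuple[Event, Event]]):
    """Direct happened-before edges from rules 1 and 2."""
    edges = defaultdict(set)
    # Rule 1: within a component, each event precedes the next one.
    for component, count in event_counts.items():
        for i in range(count - 1):
            edges[(component, i)].add((component, i + 1))
    # Rule 2: a send event precedes its corresponding receive event.
    for send, receive in messages:
        edges[send].add(receive)
    return edges

def happened_before(edges, a: Event, b: Event) -> bool:
    """Rule 3: a happened before b if b is reachable from a."""
    stack, seen = [a], set()
    while stack:
        current = stack.pop()
        for successor in edges.get(current, ()):
            if successor == b:
                return True
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return False

def concurrent(edges, a: Event, b: Event) -> bool:
    """Two distinct events are concurrent if neither happened before the other."""
    return (a != b
            and not happened_before(edges, a, b)
            and not happened_before(edges, b, a))


# Components A and B each have three events; A's second event sends a
# message that B receives as its second event.
edges = build_edges({"A": 3, "B": 3}, [(("A", 1), ("B", 1))])

print(happened_before(edges, ("A", 0), ("B", 2)))  # ordered via the message
print(concurrent(edges, ("A", 2), ("B", 1)))       # no causal path either way
```

The second query is what makes the order partial: both events exist and both are ordered relative to the send, but nothing relates them to each other.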

How does this chapter define a race condition, and how can coordination resolve it?

- Definition: a system has a race condition if multiple possible executions exist where some are correct and some are incorrect.
- Fix: introduce coordination (e.g., a coordinator assigns increasing tags/sequence numbers) so recipients can reorder or delay processing to enforce a consistent global order.
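The coordination fix can be sketched as a coordinator that hands out increasing sequence numbers and a receiver that buffers out-of-order messages until the gap is filled. The class names `Coordinator` and `OrderedReceiver` are hypothetical, and dropping duplicates is an extra assumption on top of what the chapter describes.

```python
import itertools

class Coordinator:
    """Assigns progressively increasing tags (sequence numbers), starting at 1."""

    def __init__(self):
        self._sequence = itertools.count(1)

    def tag(self, payload):
        return (next(self._sequence), payload)


class OrderedReceiver:
    """Processes payloads strictly in tag order, whatever the arrival order."""

    def __init__(self):
        self.expected = 1    # next tag that may be processed
        self.pending = {}    # tag -> payload, waiting for a gap to close
        self.processed = []  # payloads in the order they were processed

    def receive(self, tagged) -> None:
        tag, payload = tagged
        if tag < self.expected or tag in self.pending:
            return  # duplicate of something already seen: ignore
        self.pending[tag] = payload
        # A missing `expected` tag is a detected gap: delay processing.
        while self.expected in self.pending:
            self.processed.append(self.pending.pop(self.expected))
            self.expected += 1


coordinator = Coordinator()
a, b, c = (coordinator.tag(p) for p in ["a", "b", "c"])

receiver = OrderedReceiver()
for message in [c, a, a, b]:  # reordered, with one duplicate
    receiver.receive(message)

print(receiver.processed)
```

Every receiver that runs this logic processes the same payloads in the same order, which is the consistent global order the fix is after. The price is that a lost message stalls everything behind it until it is retransmitted.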

What’s the difference between concurrency and parallelism here?

- Concurrency: neither operation’s end happens before the other’s start; defined by logical time/causality (happened-before).
- Parallelism: operations overlap in real time; determined by physical clocks and physical time.

What is clock consistency, and what challenges do physical clocks pose?

Clock consistency requires: if event A happened before event B, then timestamp(A) < timestamp(B). Physical clocks suffer from skew (offset) and drift (rate differences). Mitigations include clock sync (e.g., NTP). Use time-of-day clocks for wall time (may jump backward) and monotonic clocks for durations (never go backward, but only comparable within one machine).
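In Python, the two kinds of physical clock map onto `time.time()` (time of day) and `time.monotonic()` (monotonic). The sketch below measures a duration with the monotonic clock and takes a wall-clock timestamp separately; the helper name `timed` is hypothetical.

```python
import time

def timed(fn):
    """Run `fn` and return (result, elapsed seconds).

    Uses the monotonic clock, which cannot go backward, so the measured
    duration is never negative even if the time-of-day clock is adjusted
    while `fn` runs.
    """
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(lambda: sum(range(100_000)))

# Time-of-day clock: seconds since the epoch, suitable for wall time.
# It may be adjusted (even backward), so avoid subtracting two readings
# to measure a duration.
wall_clock_timestamp = time.time()

print(f"took {elapsed:.6f}s, finished at wall time {wall_clock_timestamp:.0f}")
```

A `time.monotonic()` reading has an undefined reference point, so it is meaningless on another machine; only differences between readings taken in the same process are useful.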

What are logical clocks, and where do we see them in practice?

- Lamport clocks: per-component counters incremented on events; on receive, set to max(local, received)+1 to preserve causal order.
- Vector clocks: extend Lamport clocks to detect concurrency.
- Practical analogs: Kafka partition offsets (ordered within a partition) and etcd per-key sequence/revision numbers—both act like logical timestamps that are not comparable across partitions/keys.
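The update rules for both kinds of logical clock fit in a short sketch. The class and function names (`LamportClock`, `VectorClock`, `vc_before`, `vc_concurrent`) are hypothetical, and the vector clock follows the common textbook formulation (element-wise maximum on receive, then increment the owner's entry), which is an assumption rather than something this page spells out.

```python
class LamportClock:
    """Per-component counter that respects causality."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or send: increment and return the timestamp."""
        self.time += 1
        return self.time

    def receive(self, received: int) -> int:
        """Receive: jump past both the local and the received timestamp."""
        self.time = max(self.time, received) + 1
        return self.time


class VectorClock:
    """One counter per component, so concurrency becomes detectable."""

    def __init__(self, owner: str, components: list[str]):
        self.owner = owner
        self.clock = {c: 0 for c in components}

    def tick(self) -> dict[str, int]:
        self.clock[self.owner] += 1
        return dict(self.clock)

    def receive(self, received: dict[str, int]) -> dict[str, int]:
        for component, t in received.items():
            self.clock[component] = max(self.clock[component], t)
        self.clock[self.owner] += 1
        return dict(self.clock)


def vc_before(x: dict[str, int], y: dict[str, int]) -> bool:
    """x happened before y: no entry larger, at least one strictly smaller."""
    return all(x[c] <= y[c] for c in x) and any(x[c] < y[c] for c in x)

def vc_concurrent(x: dict[str, int], y: dict[str, int]) -> bool:
    return x != y and not vc_before(x, y) and not vc_before(y, x)


# Lamport: the receive timestamp is always larger than the send timestamp.
la, lb = LamportClock(), LamportClock()
la.tick()                             # local event on A
send_ts = la.tick()                   # A sends a message
lb.tick()                             # unrelated local event on B
recv_ts = lb.receive(send_ts)         # B receives the message

# Vector: the same scenario, but concurrency can now be detected.
va, vb = VectorClock("A", ["A", "B"]), VectorClock("B", ["A", "B"])
v_send = va.tick()                    # A sends
v_local = vb.tick()                   # unrelated local event on B
v_recv = vb.receive(v_send)           # B receives

print(send_ts, recv_ts, vc_before(v_send, v_recv), vc_concurrent(v_send, v_local))
```

A Lamport timestamp is a single integer, so any two timestamps compare as smaller or larger even when the events are causally unrelated. The vector form keeps enough information to answer "neither", at the cost of one counter per component.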
