Think Distributed Systems

Dominik Tornow

August 2025
ISBN 9781633436176
192 pages


printed in black & white

available in Russian, Simplified Chinese


table of contents

1 Thinking in distributed systems: Models, mindsets, and mechanics

1.1 Software engineering and mental models

1.1.1 Mental models: The foundation of reasoning

1.1.2 Correct mental models

1.1.3 Complete mental models

1.2 Mental model of software systems

1.3 Different types of models

1.3.1 Different models describing the same aspects

1.3.2 Different models describing different aspects of a system

1.4 Thinking about distributed systems

1.4.1 Correctness

1.4.2 Scalability and reliability

1.4.3 Responsiveness

1.5 Two big ideas

1.5.1 Systems of systems

1.5.2 Global view vs. local view

1.6 Distributed Systems Incorporated

1.7 Navigating complexity

1.7.1 Simple yet complex

1.7.2 Emergent behavior

1.7.3 Changing perspective

1.7.4 Think globally; act locally

1.8 Thinking above the code

2 System models, order, and time

2.1 System models

2.1.1 Theory and practice

2.1.2 Synchronous distributed systems

2.1.3 Asynchronous distributed systems

2.1.4 Partially synchronous systems

2.1.5 Component and network behavior

2.1.6 Realistic system models

2.2 Order and time

2.2.1 The happened-before relationship

2.2.2 Time and clocks

2.2.3 Physical time and physical clocks

2.2.4 Logical time and logical clocks

2.2.5 Physical clocks vs. logical clocks

3 Failure tolerance

3.1 In theory

3.2 Types of failure tolerance

3.2.1 Masking failure tolerance

3.2.2 Nonmasking failure tolerance

3.2.3 Fail-safe failure tolerance

3.2.4 None of the above

3.3 In practice

3.3.1 System model

3.3.2 Failure handling

3.3.3 Failure classification

3.3.4 Failure detection

3.3.5 Failure mitigation

3.3.6 Putting everything together

4 Message delivery and processing

4.1 Exchanging messages

4.2 The uncertainty principle of message delivery and processing

4.2.1 Before sending the request

4.2.2 After sending the request and before receiving a response

4.2.3 After receiving a response

4.3 Silence and chatter

4.4 Exactly-once processing semantics

4.5 Idempotence

4.6 Case study: Charging a credit card

5 Transactions

5.1 Abstractions

5.2 The magic of transactions

5.2.1 Concurrency

5.2.2 Failure

5.3 The model of transactions

5.3.1 Correctness

5.3.2 Serializability

5.3.3 Completeness

5.3.4 Application-level abort

5.3.5 Platform-level abort

6 Distributed transactions

6.1 Atomic commitment: From a single RM to multiple RMs

6.1.1 Transaction on a single RM

6.1.2 Transaction on multiple RMs

6.1.3 Blocking and nonblocking

6.2 The essence of distributed transactions

6.3 Two-Phase Commit protocol

6.3.1 In the absence of failure

6.3.2 In the presence of failure

6.3.3 Improvement

7 Partitioning

7.1 Encyclopedias and volumes

7.2 Thinking in partitions

7.3 The mechanics of partitioning and balancing

7.4 (Re)partitioning

7.4.1 Types of partitioning

7.4.2 Data item to partition assignment strategies

7.5 Common item-based assignment strategies

7.5.1 Range partitioning

7.5.2 Hash partitioning

7.6 Repartitioning

7.6.1 Range partitioning

7.6.2 Hash partitioning

7.7 Consistent hashing

7.8 (Re)balancing and overpartitioning

8 Replication

8.1 Redundancy

8.2 Thinking about replication and consistency

8.3 Replication

8.4 The mechanics of replication

8.4.1 System model

8.4.2 Replication lag

8.4.3 Synchronous vs. asynchronous replication

8.4.4 State-based vs. log-based replication

8.4.5 Single-leader, multileader, and leaderless systems

9 Consistency

9.1 Consistency models

9.1.1 Common consistency models

9.1.2 Virtues and limitations

9.2 Linearizability

9.2.1 Queue and stack

9.2.2 Formal definition of linearizability

9.3 Eventual consistency

9.3.1 The shopping cart

9.3.2 Variants of eventual consistency

9.3.3 Implementation

9.4 Consistency, availability, and partition tolerance

9.4.1 History

9.4.2 Conjecture vs. theorem

9.4.3 CAP theorem

10 Distributed consensus

10.1 The challenge of reaching agreement

10.2 System model

10.3 State machine replication

10.4 The origin—and irony—of consensus

10.5 Implementing consensus

10.5.1 Leader-based consensus

10.5.2 Quorum-based consensus

10.5.3 Combining leader and quorum

10.6 Raft

10.6.1 The log

10.6.2 Terms

10.6.3 Leader Election protocol

10.6.4 Log Replication protocol

10.6.5 State machine safety

10.7 Raft puzzles

10.7.1 Puzzle 1

10.7.2 Puzzle 2

10.7.3 Puzzle 3

11 Durable executions

11.1 The pitfalls of partial executions

11.2 System model

11.2.1 Process definition

11.2.2 Process execution

11.3 The concept of failure-transparent recovery

11.4 Strategies of failure-transparent recovery

11.4.1 Restart

11.4.2 Resume

11.5 Implementation of failure-transparent recovery

11.5.1 Application-level implementation: Sagas

11.5.2 Platform-level implementation: Durable execution

12 Cloud and services

12.1 From proactive to reactive

12.2 Cloud computing

12.3 Cloud-native computing

12.4 Serverless computing

12.4.1 Traditional

12.4.2 Serverless

12.4.3 Cold path vs. hot path

12.5 Service

12.5.1 Global view vs. local view

12.5.2 Example recommendation service

12.6 Final thoughts

Overview

3 Failure tolerance

This chapter presents a clear, practical way to think about failure in distributed systems with the goal of ensuring failure tolerance—keeping system behavior well-defined even when things go wrong. It unifies terminology by simply using “failure,” and frames the discussion in two complementary parts: theory (what failures are and how they can be reasoned about) and practice (how to detect and mitigate them in real systems). A central theme is that end-to-end correctness demands complete processes: in the presence of failures, outcomes should be observably equivalent either to full success or to no effect at all.

The theoretical core defines failure as an unwanted but possible transition that moves a system from a legal (good) state to an illegal (bad) state, with intolerable states excluded from tolerance by definition. Systems evolve via normal and failure transitions, and recovery is the sequence that returns from illegal to legal states. Correctness is captured by safety (nothing bad happens) and liveness (something good eventually happens), leading to a taxonomy of failure tolerance: masking (both safety and liveness), non-masking (liveness only), fail-safe (safety only), and none. Practical limits (e.g., trade-offs captured by impossibility results) mean full masking is often infeasible. The chapter also links failure detection to maintaining safety (halt dangerous actions) and mitigation to restoring liveness (resume progress). It broadens “failure detectors” beyond crash suspicion to any predicate that witnesses an illegal state, while noting that timeout-based detectors cannot be both complete and accurate in asynchronous, unreliable networks.
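The taxonomy above can be read as a simple lookup on which of the two correctness properties still holds under failure. The following is a minimal sketch of that reading; the names `Tolerance` and `classify` are illustrative assumptions, not identifiers from the book.

```python
from enum import Enum


class Tolerance(Enum):
    """The four classes of failure tolerance named in the overview."""
    MASKING = "masking"
    NON_MASKING = "non-masking"
    FAIL_SAFE = "fail-safe"
    NONE = "none"


def classify(safety: bool, liveness: bool) -> Tolerance:
    """Map the properties that hold under failure to a tolerance class."""
    if safety and liveness:
        return Tolerance.MASKING      # failure is transparent
    if liveness:
        return Tolerance.NON_MASKING  # progress continues, mistakes possible
    if safety:
        return Tolerance.FAIL_SAFE    # no mistakes, but progress may halt
    return Tolerance.NONE


print(classify(safety=True, liveness=False).value)  # fail-safe
```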

The practical part adopts a service-orchestration model in which a consumer executes a multi-step process against providers and aims to avoid partial application. Mitigation follows two axes. Spatially, per the end-to-end argument, failures are handled at the lowest layer able to do so correctly and completely: application-level failures (e.g., business rule violations) are addressed with backward recovery and compensation; platform-level failures (e.g., connectivity) are handled with forward recovery such as retries. Temporally, failures are classified as transient (quick auto-repair), intermittent (auto-repair after a delay, with an elevated likelihood of recurring in the short term), or permanent (manual repair required), which guides tactics such as immediate retry, backoff-based retries, or suspending until fixed and then resuming. An ideal strategy first attempts platform mitigation (immediate retry, then backoff, then suspend-and-resume), escalates to application-level compensation if needed, and, if compensation itself fails, escalates to human operators. Throughout, it preserves safety, restores liveness when possible, and ensures effects like “charge exactly once” in workflows such as e-commerce checkout.

System states and state transitions

An illustration of the states and state transitions defined in Listing 3.1

Service orchestration

A process as a sequence of steps

E-commerce process

Failure handling

Failure classification

Thinking in layers

Spatial classification

Application-level versus platform-level failure

Temporal dimension

Transient failure

Intermittent failure

Permanent failure

Failure mitigation

Outline of failure-handling strategy in an orchestration scenario

Summary

Failure tolerance is the goal of failure handling.
Failure handling involves two key steps: failure detection and failure mitigation.
Failures can be classified across two dimensions: spatial and temporal.
Spatially, failures are classified as application-level or platform-level.
Temporally, failures are classified as transient, intermittent, or permanent.
Different failure tolerance strategies, such as masking, non-masking, and fail-safe, address the safety and liveness of the system.
Failure detection and mitigation strategies vary based on the classification of the failure and the desired class of failure tolerance.

FAQ

What is “failure” and what is “failure tolerance” in this chapter?

Failure is an unwanted but possible state transition that moves a system from a good (legal) state to a bad (illegal) state. Failure tolerance is the system’s ability to keep behaving in a well-defined way even when it is in a bad state.

How are system states and transitions modeled?

The model uses three states—legal (good), illegal (bad), and intolerable (“everything is lost”). Transitions are of two kinds: normal transitions (intended behavior) and failure transitions (unwanted). Moving from an illegal state back to a legal state is failure recovery.
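As a rough illustration of this model, the sketch below labels a single step between two states. It is not the book's Listing 3.1. The `State` enum, the `label` function, and the choice to treat any step that touches the intolerable state as out of scope are assumptions made for this example.

```python
from enum import Enum


class State(Enum):
    LEGAL = "legal"              # good
    ILLEGAL = "illegal"          # bad, but still tolerable
    INTOLERABLE = "intolerable"  # "everything is lost"


def label(src: State, dst: State) -> str:
    """Label one step of the system (hypothetical labels for illustration)."""
    if State.INTOLERABLE in (src, dst):
        return "out of scope"    # assumption: excluded from tolerance
    if src is State.LEGAL and dst is State.LEGAL:
        return "normal"          # intended behavior
    if src is State.LEGAL and dst is State.ILLEGAL:
        return "failure"         # unwanted but possible
    if src is State.ILLEGAL and dst is State.LEGAL:
        return "recovery"        # back to a good state
    return "still illegal"       # ILLEGAL -> ILLEGAL: recovery not finished


trace = [State.LEGAL, State.ILLEGAL, State.ILLEGAL, State.LEGAL]
print([label(a, b) for a, b in zip(trace, trace[1:])])
# ['failure', 'still illegal', 'recovery']
```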

What are safety and liveness, and why do they matter for failure tolerance?

Safety means nothing bad ever happens; liveness means something good eventually happens. In the absence of failures, both should hold. Under failure, different approaches trade off safety and liveness to deliver different types of failure tolerance.

What are masking, non-masking, and fail-safe failure tolerance?

- Masking: guarantees both safety and liveness under failure (failure is transparent). Often costly or impossible (e.g., CAP trade-offs).
- Non-masking: guarantees liveness but not safety (system keeps making progress, may make mistakes temporarily; e.g., a queue delivers out of order during failure).
- Fail-safe: guarantees safety but not liveness (system avoids mistakes by halting progress; e.g., a queue stops delivering to preserve order).
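The queue examples in the list above can be made concrete with a small sketch. It assumes messages carry sequence numbers starting at 1 and that, during a failure, message 2 is delayed; both function names are hypothetical.

```python
def deliver_fail_safe(arrived: list[int]) -> list[int]:
    """Deliver only the in-order prefix and halt at the first gap.

    Order (safety) is preserved, but delivery stops (no liveness)
    until the missing message shows up.
    """
    available = set(arrived)
    delivered = []
    expected = 1
    while expected in available:
        delivered.append(expected)
        expected += 1
    return delivered


def deliver_non_masking(arrived: list[int]) -> list[int]:
    """Deliver messages as they arrive, even if out of order.

    Delivery keeps going (liveness), but order (safety) may be
    violated while the failure lasts.
    """
    return list(arrived)


# During the failure, message 2 is delayed: only 1 and 3 have arrived.
print(deliver_fail_safe([1, 3]))       # [1]
# Message 2 arrives late; the non-masking queue has already delivered 3.
print(deliver_non_masking([1, 3, 2]))  # [1, 3, 2]
```

A masking queue would have to deliver `[1, 2, 3]` without stalling, which is exactly what the delayed message makes impossible in this scenario.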

How does the end-to-end argument guide where to handle failures?

Handle failures in the lowest layer (from the top down) that can correctly and completely detect and mitigate them. If the lowest adequate layer is the application, it’s an application-level failure (e.g., InsufficientFunds). If the platform can fully handle it, it’s platform-level (e.g., CouldNotConnect via retries).

What’s the difference between transient, intermittent, and permanent failures?

- Transient: “come and go,” auto-repair quickly; a second failure is no more likely than usual (retry soon).
- Intermittent: “linger,” auto-repair after some delay; a second failure is more likely in the short term (retry with backoff).
- Permanent: persist until fixed; a second failure is certain without manual intervention (suspend, repair, then resume).
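One way to turn the three temporal classes into a retry ladder is sketched below: an immediate retry (treating the failure as transient), then retries with exponential backoff (treating it as intermittent), then suspension (treating it as permanent). The function name, the `Suspended` exception, the use of `ConnectionError` as the platform-level failure, and the default delays are all assumptions chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Suspended(Exception):
    """Hypothetical signal: retries exhausted, wait for manual repair."""


def run_with_retries(
    step: Callable[[], T],
    backoff_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    # First attempt plus one immediate retry: assume a transient failure.
    for _ in range(2):
        try:
            return step()
        except ConnectionError:
            pass
    # Retries with exponential backoff: assume an intermittent failure.
    for attempt in range(backoff_attempts):
        sleep(base_delay * 2 ** attempt)
        try:
            return step()
        except ConnectionError:
            pass
    # Nothing worked: assume a permanent failure and suspend.
    raise Suspended("step suspended until the underlying cause is repaired")
```

The `sleep` parameter is injectable so that the schedule can be inspected without waiting, for example by passing `delays.append` for a list `delays`.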

How should failure handling be structured in practice?

Two steps: failure detection (recognize the failure) and failure mitigation (recover). In service orchestration, aim for complete execution of a process; on failure, end in either a forward-recovered outcome (equivalent to success) or a backward-recovered outcome (equivalent to no-op).

What is a failure detector, and why can’t it be both complete and accurate in asynchronous systems?

A failure detector is a “witness” (predicate) that a failure occurred, often implemented with timeouts over heartbeats or requests (push-based or pull-based). In partially synchronous or asynchronous systems with unreliable networks, messages can be delayed or lost, so a timeout cannot distinguish a failed component from a slow one, and a detector cannot be both complete (no misses) and accurate (no false suspicions) at the same time.
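A minimal sketch of a timeout-based detector over heartbeats shows where the inaccuracy comes from. The class and method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TimeoutDetector:
    """Suspects a node when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of the last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def suspects(self, node: str, now: float) -> bool:
        last = self.last_seen.get(node)
        return last is None or now - last > self.timeout


detector = TimeoutDetector(timeout=2.0)
detector.heartbeat("a", now=0.0)
print(detector.suspects("a", now=1.0))  # False
print(detector.suspects("a", now=3.0))  # True: crashed, or just delayed?
detector.heartbeat("a", now=3.5)        # the delayed heartbeat arrives
print(detector.suspects("a", now=4.0))  # False: the suspicion was false
```

Raising the timeout reduces false suspicions but delays the detection of real crashes; no fixed value removes both problems when message delay is unbounded.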

When should I use backward vs. forward recovery?

- Backward recovery (compensation): move the system back to an initial legal state; typical at the application layer (e.g., refund a charge if checkout fails). Doesn’t require fixing underlying causes.
- Forward recovery (retries): move toward the intended final legal state; typical at the platform layer (e.g., immediate retry, backoff, suspend-and-resume). Requires addressing the underlying cause.
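Backward recovery can be sketched as running each step together with a compensating action and undoing the completed steps in reverse order when a later step fails. The helper below is a simplified illustration under that assumption; `run_process`, `ApplicationFailure`, and the checkout steps are hypothetical names.

```python
class ApplicationFailure(Exception):
    """Hypothetical application-level failure, e.g. a business rule violation."""


def run_process(steps):
    """Run (action, compensation) pairs; compensate in reverse on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except ApplicationFailure:
            for undo in reversed(completed):
                undo()
            return "compensated"  # observably equivalent to no effect
        completed.append(compensation)
    return "completed"            # observably equivalent to full success


ledger = {"charged": 0}


def charge():
    ledger["charged"] += 50


def refund():
    ledger["charged"] -= 50


def ship():
    raise ApplicationFailure("out of stock")


def cancel_shipment():
    pass


outcome = run_process([(charge, refund), (ship, cancel_shipment)])
print(outcome, ledger)  # compensated {'charged': 0}
```

The sketch assumes that compensations themselves succeed; a failing compensation is the escalation case discussed in the next question.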

What if mitigation fails repeatedly or compensation itself fails?

Escalate. If platform-level mitigation exhausts options, surface the failure to the application. If application-level compensation fails, escalate to human operators for manual resolution. Some situations may be intolerable and outside the system’s scope to handle.
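The escalation path can be sketched as a small decision ladder: retry while the failure is platform-level, fall back to compensation, and hand over to an operator if compensation fails too. All names here (`handle`, `Outcome`, the two exception types, `max_attempts`) are assumptions for illustration, and the suspend-and-resume stage is omitted for brevity.

```python
from enum import Enum


class PlatformFailure(Exception):
    """Hypothetical platform-level failure, e.g. a lost connection."""


class ApplicationFailure(Exception):
    """Hypothetical application-level failure, e.g. insufficient funds."""


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    COMPENSATED = "compensated"
    ESCALATED = "escalated to operator"


def handle(step, compensate, max_attempts: int = 3) -> Outcome:
    # Platform-level mitigation: forward recovery via retries.
    for _ in range(max_attempts):
        try:
            step()
            return Outcome.SUCCEEDED
        except PlatformFailure:
            continue  # try again
        except ApplicationFailure:
            break     # retrying cannot help; go to compensation
    # Application-level mitigation: backward recovery via compensation.
    try:
        compensate()
        return Outcome.COMPENSATED
    except Exception:
        # Compensation failed: only a human operator can resolve this.
        return Outcome.ESCALATED
```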

