Table of contents

1 Thinking in distributed systems: Models, mindsets, and mechanics

1.1 Software engineering and mental models

1.1.1 Mental models: The foundation of reasoning

1.1.2 Correct mental models

1.1.3 Complete mental models

1.2 Mental model of software systems

1.3 Different types of models

1.3.1 Different models describing the same aspects

1.3.2 Different models describing different aspects of a system

1.4 Thinking about distributed systems

1.4.1 Correctness

1.4.2 Scalability and reliability

1.4.3 Responsiveness

1.5 Two big ideas

1.5.1 Systems of systems

1.5.2 Global view vs. local view

1.6 Distributed Systems Incorporated

1.7 Navigating complexity

1.7.1 Simple yet complex

1.7.2 Emergent behavior

1.7.3 Changing perspective

1.7.4 Think globally; act locally

1.8 Thinking above the code

2 System models, order, and time

2.1 System models

2.1.1 Theory and practice

2.1.2 Synchronous distributed systems

2.1.3 Asynchronous distributed systems

2.1.4 Partially synchronous systems

2.1.5 Component and network behavior

2.1.6 Realistic system models

2.2 Order and time

2.2.1 The happened-before relationship

2.2.2 Time and clocks

2.2.3 Physical time and physical clocks

2.2.4 Logical time and logical clocks

2.2.5 Physical clocks vs. logical clocks

3 Failure tolerance

3.1 In theory

3.2 Types of failure tolerance

3.2.1 Masking failure tolerance

3.2.2 Nonmasking failure tolerance

3.2.3 Fail-safe failure tolerance

3.2.4 None of the above

3.3 In practice

3.3.1 System model

3.3.2 Failure handling

3.3.3 Failure classification

3.3.4 Failure detection

3.3.5 Failure mitigation

3.3.6 Putting everything together

4 Message delivery and processing

4.1 Exchanging messages

4.2 The uncertainty principle of message delivery and processing

4.2.1 Before sending the request

4.2.2 After sending the request and before receiving a response

4.2.3 After receiving a response

4.3 Silence and chatter

4.4 Exactly-once processing semantics

4.5 Idempotence

4.6 Case study: Charging a credit card

5 Transactions

5.1 Abstractions

5.2 The magic of transactions

5.2.1 Concurrency

5.2.2 Failure

5.3 The model of transactions

5.3.1 Correctness

5.3.2 Serializability

5.3.3 Completeness

5.3.4 Application-level abort

5.3.5 Platform-level abort

6 Distributed transactions

6.1 Atomic commitment: From a single RM to multiple RMs

6.1.1 Transaction on a single RM

6.1.2 Transaction on multiple RMs

6.1.3 Blocking and nonblocking

6.2 The essence of distributed transactions

6.3 Two-Phase Commit protocol

6.3.1 In the absence of failure

6.3.2 In the presence of failure

6.3.3 Improvement

7 Partitioning

7.1 Encyclopedias and volumes

7.2 Thinking in partitions

7.3 The mechanics of partitioning and balancing

7.4 (Re)partitioning

7.4.1 Types of partitioning

7.4.2 Data item to partition assignment strategies

7.5 Common item-based assignment strategies

7.5.1 Range partitioning

7.5.2 Hash partitioning

7.6 Repartitioning

7.6.1 Range partitioning

7.6.2 Hash partitioning

7.7 Consistent hashing

7.8 (Re)balancing and overpartitioning

8 Replication

8.1 Redundancy

8.2 Thinking about replication and consistency

8.3 Replication

8.4 The mechanics of replication

8.4.1 System model

8.4.2 Replication lag

8.4.3 Synchronous vs. asynchronous replication

8.4.4 State-based vs. log-based replication

8.4.5 Single-leader, multileader, and leaderless systems

9 Consistency

9.1 Consistency models

9.1.1 Common consistency models

9.1.2 Virtues and limitations

9.2 Linearizability

9.2.1 Queue and stack

9.2.2 Formal definition of linearizability

9.3 Eventual consistency

9.3.1 The shopping cart

9.3.2 Variants of eventual consistency

9.3.3 Implementation

9.4 Consistency, availability, and partition tolerance

9.4.1 History

9.4.2 Conjecture vs. theorem

9.4.3 CAP theorem

10 Distributed consensus

10.1 The challenge of reaching agreement

10.2 System model

10.3 State machine replication

10.4 The origin—and irony—of consensus

10.5 Implementing consensus

10.5.1 Leader-based consensus

10.5.2 Quorum-based consensus

10.5.3 Combining leader and quorum

10.6 Raft

10.6.1 The log

10.6.2 Terms

10.6.3 Leader Election protocol

10.6.4 Log Replication protocol

10.6.5 State machine safety

10.7 Raft puzzles

10.7.1 Puzzle 1

10.7.2 Puzzle 2

10.7.3 Puzzle 3

11 Durable executions

11.1 The pitfalls of partial executions

11.2 System model

11.2.1 Process definition

11.2.2 Process execution

11.3 The concept of failure-transparent recovery

11.4 Strategies of failure-transparent recovery

11.4.1 Restart

11.4.2 Resume

11.5 Implementation of failure-transparent recovery

11.5.1 Application-level implementation: Sagas

11.5.2 Platform-level implementation: Durable execution

12 Cloud and services

12.1 From proactive to reactive

12.2 Cloud computing

12.3 Cloud-native computing

12.4 Serverless computing

12.4.1 Traditional

12.4.2 Serverless

12.4.3 Cold path vs. hot path

12.5 Service

12.5.1 Global view vs. local view

12.5.2 Example recommendation service

12.6 Final thoughts

Overview

12 Cloud and services

This chapter traces the shift from static, manually provisioned infrastructure to dynamic, elastic cloud environments and argues for precise, shared definitions of cloud, cloud-native, serverless, and services. It presents elasticity—acquiring and releasing resources on demand—as the foundation for scalability and reliability, contrasting earlier proactive, predictive operations with reactive, self-regulating systems. The aim is to provide concise mental models that align with industry practice and improve communication.

Cloud computing is defined by a clear separation between resource consumers and providers and by on-demand acquisition and release of resources. A cloud-native application is scalable and reliable by construction, leveraging platform primitives such as supervisors, autoscalers, and load balancers; lifting and shifting alone does not make an application cloud-native, and specific technologies like containers are not defining characteristics. Serverless computing acquires resources reactively after an event arrives, requiring the system to infer resources per request, and leads to cold and hot paths depending on whether resource acquisition lies on the critical path of request processing.

Reframing microservices as services, the chapter defines a service as a contract between a component and the rest of the system, emphasizing a logical view over the physical set of implementing components. A recommendation service example shows how a stable contract can be realized by a dynamic, redundant assemblage of processes, data stores, load balancers, and autoscalers that scale and fail independently while preserving the same external behavior. The chapter closes by stressing that shared, accurate mental models reduce misalignment and improve engineering and collaboration, encouraging readers to keep probing concepts until they are truly understood.

Elasticity in terms of scalability and reliability

Non-cloud application versus cloud application

Non-cloud-native application versus cloud-native application

Minimal model of computation to reason about serverless computing

Order of events in traditional computing; acquiring resources happens proactively.

Order of events in serverless computing; acquiring resources happens reactively.

Cold path versus hot path

Global point of view: C1 and C2

From C1’s point of view, there is only C1 and the rest of the system.

Initial model of the recommendation service

Redundant implementation of the recommendation service

Primary and backup

Loadbalancer and Autoscaler

Redundant Loadbalancers and Autoscalers

Final model of the Recommendation Service

Thinking in distributed systems aims to maximize the intersection of our mental models with reality and each other.

Summary

Elasticity refers to a system's ability to ensure scalability and reliability by dynamically acquiring or releasing resources to match demand.
Cloud computing divides the world into resource consumers and providers, allowing consumers to acquire and release virtually unlimited resources on demand through public or private cloud platforms.
Cloud computing fundamentally transformed computing, replacing static, long-lived system topologies with dynamic, on-demand system topologies.
A cloud application is any application hosted on a cloud platform, while a cloud-native application is a cloud application that is elastic by construction.
Traditional systems acquire the resources needed to process an event proactively, before receiving the event.
Serverless systems acquire the resources needed to process an event reactively, after receiving the event.
Services are contracts between components and the system, focusing on logical interactions that remain constant instead of physical interactions that may change over time.

FAQ

What shift does the chapter describe from proactive to reactive system design?

Systems moved from static topologies with a few long‑lived, well‑known resources to dynamic topologies composed of many short‑lived resources (including short‑lived network addresses). Instead of predicting load and failures proactively, engineers increasingly observe and react so that systems can self‑regulate.

What is elasticity and why does it matter?

Elasticity is a system’s ability to achieve scalability and reliability by acquiring and releasing resources to match demand. It’s the bedrock of modern cloud environments and enables reactive, self‑adjusting systems.

How does the chapter define cloud computing?

Cloud computing separates resource consumers from resource providers (the cloud platform). Consumers can acquire and release virtually unlimited resources (compute, storage, etc.) on demand. This division and on‑demand resource control are the core primitives of cloud computing.

What is the difference between public and private cloud platforms? Can you give examples?

A public cloud offers resources to consumers outside the provider’s own organization (e.g., AWS, Google Cloud, Microsoft Azure). A private cloud offers resources only to consumers within the provider’s organization (e.g., company‑run OpenStack, Cloud Foundry, or Kubernetes instances).

What makes an application cloud‑native?

A cloud‑native application is a cloud‑hosted application that is scalable and reliable by construction, explicitly leveraging the cloud’s ability to acquire and release resources on demand. These qualities are assured by design and implementation, not merely requested as requirements.

Do you need containers and microservices to be cloud‑native?

No. While cloud‑native applications often use containers, microservices, and declarative orchestration, these are not defining requirements. Cloud‑native is defined by being scalable and reliable by construction; technology and architectural choices may vary.

Which platform building blocks support “by construction” scalability and reliability?

Supervisors (replace failed resources), scalers (add or remove resources to match demand), and load balancers (distribute requests). In Kubernetes, these map to Deployments (supervisor), HorizontalPodAutoscaler (scaler), and Service (load balancer).
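The three building blocks can be sketched in a few lines of Python. This is a minimal illustration, not the book's code or any platform's API: the `Pool` class, the integer worker ids, the `capacity_per_worker` parameter, and the round-robin policy are all assumptions made for the example.

```python
import itertools
import math
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical pool of interchangeable workers: id -> healthy flag."""
    workers: dict[int, bool] = field(default_factory=dict)
    next_id: int = 0

    def acquire(self) -> int:
        """Acquire a new, healthy worker and return its id."""
        wid = self.next_id
        self.next_id += 1
        self.workers[wid] = True
        return wid

    def release(self, wid: int) -> None:
        """Release a worker back to the platform."""
        del self.workers[wid]


def supervise(pool: Pool) -> None:
    """Supervisor: replace every failed worker with a fresh one."""
    failed = [wid for wid, healthy in pool.workers.items() if not healthy]
    for wid in failed:
        pool.release(wid)
        pool.acquire()


def scale(pool: Pool, load: int, capacity_per_worker: int) -> None:
    """Scaler: add or remove workers until capacity matches the load."""
    desired = max(1, math.ceil(load / capacity_per_worker))
    while len(pool.workers) < desired:
        pool.acquire()
    while len(pool.workers) > desired:
        pool.release(next(iter(pool.workers)))


def balance(pool: Pool, request_ids: list[int]) -> dict[int, int]:
    """Load balancer: assign requests round-robin to the current workers."""
    if not pool.workers:
        raise RuntimeError("no workers available")
    targets = itertools.cycle(sorted(pool.workers))
    return {request: next(targets) for request in request_ids}


pool = Pool()
scale(pool, load=250, capacity_per_worker=100)  # acquires three workers
pool.workers[1] = False                         # simulate a failure
supervise(pool)                                 # worker 1 is replaced
assignment = balance(pool, [10, 11, 12, 13])
```

Each function only observes the pool and reacts to what it sees, which is the sense in which scalability and reliability are obtained "by construction" rather than by predicting load or failures ahead of time.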

What is serverless computing in this chapter’s model?

In serverless environments, the system may acquire the necessary resources after an event enters the system (reactively), rather than a priori. The system must determine the required resources from the event itself, subject to the constraint that an event cannot be processed before needed resources are acquired.

What are the cold path and hot path in serverless?

The cold path occurs when on‑demand resource acquisition falls between receiving and processing an event, so the request incurs a “cold start.” The hot path occurs when the resources are already available, so processing follows receipt without an acquisition step in between.
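The difference between the two paths shows up in the order of events. The sketch below is a toy model under stated assumptions: `ServerlessRuntime`, its `warm` set, and the event type name `"resize"` are hypothetical and stand in for whatever a real platform uses to track acquired resources.

```python
from dataclasses import dataclass, field


@dataclass
class ServerlessRuntime:
    """Hypothetical runtime that acquires resources per event type on demand."""
    warm: set[str] = field(default_factory=set)    # event types with resources
    trace: list[str] = field(default_factory=list)  # observed order of events

    def handle(self, event_type: str) -> str:
        """Process one event and report which path it took."""
        self.trace.append(f"receive {event_type}")
        if event_type in self.warm:
            path = "hot"
        else:
            # Acquisition sits on the critical path: this is the cold start.
            self.trace.append(f"acquire {event_type}")
            self.warm.add(event_type)
            path = "cold"
        self.trace.append(f"process {event_type}")
        return path


runtime = ServerlessRuntime()
first = runtime.handle("resize")   # no resources yet: cold path
second = runtime.handle("resize")  # resources already acquired: hot path
```

In the recorded trace, the first request reads receive, acquire, process, while the second reads receive, process. The model also respects the constraint that an event is never processed before its resources are acquired.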

How does the chapter define a service, and what does the Recommendation Service example illustrate?

From a local view, a service is a contract between a component and the rest of the system, not the set of components that implement it. The Recommendation Service example shows that the implementation can be a dynamic, redundant collection of components (primary/backup groups, smart load balancers, autoscalers, and shared infrastructure), while the service’s contract (return recommendations) remains constant.
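One way to make "service as contract" concrete is a structural interface with interchangeable implementations. The following is a sketch only: the method name `recommend`, the classes, and the use of `ConnectionError` to signal a failed primary are assumptions for illustration and do not come from the chapter.

```python
from typing import Protocol


class RecommendationService(Protocol):
    """The contract: given a user, return recommendations."""

    def recommend(self, user_id: str) -> list[str]: ...


class SingleProcess:
    """Simplest implementation: one component answers directly."""

    def recommend(self, user_id: str) -> list[str]:
        return [f"item-for-{user_id}"]


class FailedProcess:
    """Stand-in for a component that has crashed."""

    def recommend(self, user_id: str) -> list[str]:
        raise ConnectionError("primary unavailable")


class PrimaryBackup:
    """Redundant implementation: fall back to the backup if the primary fails."""

    def __init__(self, primary: RecommendationService,
                 backup: RecommendationService) -> None:
        self.primary = primary
        self.backup = backup

    def recommend(self, user_id: str) -> list[str]:
        try:
            return self.primary.recommend(user_id)
        except ConnectionError:
            return self.backup.recommend(user_id)


def render_homepage(service: RecommendationService, user_id: str) -> str:
    """A client that depends only on the contract, not on the implementation."""
    return ", ".join(service.recommend(user_id))


simple = render_homepage(SingleProcess(), "u1")
redundant = render_homepage(PrimaryBackup(FailedProcess(), SingleProcess()), "u1")
```

The client sees the same result in both cases, even though the second implementation contains a failed component. The physical makeup changed while the logical contract stayed the same.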
