Chaos Engineering
Site reliability through controlled disruption
Mikolaj Pawlikowski
  • MEAP began April 2020
  • Publication in Spring 2021 (estimated)
  • ISBN 9781617297755
  • 365 pages (estimated)
  • printed in black & white

This book is like the Swiss Army knife of chaos engineering. If you need to “break” your app, this book will have a tool that will help you with that.

Alessandro Campeis
Auto engineers test the safety of a car by intentionally crashing it and carefully observing the results. Chaos engineering applies the same principles to software systems. In Chaos Engineering: Site reliability through controlled disruption, you’ll learn to run your applications and infrastructure through a series of tests that simulate real-life failures. You’ll maximize the benefits of chaos engineering by learning to think like a chaos engineer, and how to design the proper experiments to ensure the reliability of your software. With examples that cover a whole spectrum of software, you’ll be ready to run an intensive testing regime on anything from a simple WordPress site to a massive distributed system running on Kubernetes.

About the Technology

Rather than just looking for code bugs and errors, chaos engineering sees how your software responds to calamity, including partial infrastructure outages, hardware failure, and other major pitfalls that can befall a production system. By observing a system in distress or under attack, chaos engineering ensures the reliability and resiliency of your software—especially for hard-to-test distributed systems with lots of moving parts and little scope for downtime.

About the book

In Chaos Engineering: Site reliability through controlled disruption, you’ll learn to design and execute controlled failure experiments that reveal the hidden problems in your software. Using a toolbox of open source tools, you’ll inject system-shaking failures at every level, from your Docker containers to your Kubernetes deployment to the UI. You’ll learn Linux monitoring for observing system metrics and evaluating your results, and even how to apply chaos engineering to make your human teams more reliable and more resilient when handling failures. Best of all, the tools and examples come with a downloadable Linux VM image, letting you experiment without risk to your own systems.
Table of Contents

1 Into the world of chaos engineering

1.1 What is chaos engineering?

1.2 Motivations for chaos engineering

1.2.1 Risk, cost and service-level indicators, objectives, and agreements (SL{I,O,A})

1.2.2 Testing a system as a whole

1.2.3 Emergent properties

1.3 Four steps to chaos engineering

1.3.1 Observability

1.3.2 Steady state

1.3.3 Hypothesis for our experiment

1.3.4 Run the experiment and prove (or refute) your hypothesis

1.4 What chaos engineering is not

1.5 A taste of chaos engineering

1.5.1 FizzBuzz as a service

1.5.2 A long, dark night

1.5.3 Post-mortem

1.5.4 Chaos engineering in a nutshell

1.6 Summary

PART 1 Chaos Engineering Fundamentals

2 First cup of chaos & blast radius

2.1 Setup - working with the code in this book

2.2 Scenario

2.3 Linux forensics 101

2.3.1 Exit codes

2.3.2 Killing processes

2.3.3 Out Of Memory (OOM) Killer

2.4 The first chaos experiment

2.4.1 Visibility

2.4.2 Steady state

2.4.3 Hypothesis

2.4.4 Run the experiment

2.5 Blast radius

2.6 Digging deeper

2.6.1 Saving the world

2.7 Summary

3 Observability

3.1 The app is slow

3.2 The USE method

3.3 Resources

3.3.1 System overview

3.4 Block IO

3.5 Networking

3.6 RAM

3.7 CPU

3.8 OS

3.9 Application

3.10 Automation - using time series

3.11 Further reading

3.12 Summary

4 Database trouble & testing in production

4.1 We’re doing WordPress

4.2.1 Experiment 1: slow disks

4.2.2 Experiment 2: slow connection

4.3 Testing in production

4.4 Summary

PART 2 Chaos Engineering in Action

5 Poking Docker

5.1 My (dockerized) app is slow!

5.1.1 Architecture

5.2 The brief history of Docker

5.2.1 Emulation, simulation, and virtualization

5.2.2 Virtual machines and containers

5.2.3 Linux containers and Docker

5.3 Peeking under the Docker’s hood

5.3.1 Uprooting processes with chroot

5.3.2 Implementing a simple container(-ish) part 1 - using chroot

5.3.3 Experiment 1: can one container prevent another one from writing to disk?

5.3.4 Isolating processes with Linux namespaces

5.3.5 Docker and namespaces

5.3.6 Experiment 2: killing processes in a different pid namespace

5.3.7 Implementing a simple container(-ish) part 2 - namespaces

5.3.8 Limiting resource use of a process with cgroups

5.3.9 Experiment 3: Using all the CPU I can find!

5.3.10 Experiment 4: Using too much RAM

5.3.11 Implementing a simple container(-ish) part 3 - cgroups

5.3.12 Docker and networking

5.3.13 Capabilities and seccomp

5.3.14 Docker demystified

5.4 Fixing my (dockerized) app being slow

5.4.1 Booting up Meower

5.4.2 Why is the app slow?

5.4.3 Experiment 5: Network slowness for containers with Pumba

5.5 Other parts of the puzzle

5.5.1 Docker daemon restarts

5.5.2 Storage for image layers

5.5.3 Advanced networking

5.5.4 Security

5.6 Summary

6 Who you gonna call? Syscall-busters!

6.1 Scenario - congratulations on your promotion!

6.1.1 System X: if everyone is using it, but no one maintains it, is it abandonware?

6.2 A brief refresher on syscalls

6.2.1 Finding out about syscalls

6.2.2 Standard C library and glibc

6.3 How to observe a process’ syscalls?

6.3.1 strace and sleep

6.3.2 strace and System X

6.3.3 strace’s problem - overhead

6.3.4 BPF

6.3.5 Other options

6.4 Blocking syscalls for fun and profit part 1 - strace

6.4.1 Experiment 1: breaking the close syscall

6.4.2 Experiment 1: steady state

6.4.3 Experiment 1: implementation

6.4.4 Experiment 1: analysis

6.4.5 Experiment 2: breaking the write syscall

6.4.6 Experiment 2: steady state

6.4.7 Experiment 2: implementation

6.5 Blocking syscalls for fun and profit part 2 - seccomp

6.5.1 Seccomp the easy way - Docker

6.5.2 Seccomp the hard way - libseccomp

6.6 Summary

7 Injecting failure into the JVM

7.1 Scenario

7.1.1 FizzBuzzEnterpriseEdition

7.2 Chaos engineering and Java

7.2.1 Experiment 1 - idea

7.2.2 Experiment 1 - plan

7.2.3 Brief introduction to JVM bytecode

7.2.4 Experiment 1 - implementation

7.3 Existing tools

7.3.1 Byteman

7.3.2 Byte-monkey

7.3.3 Chaos Monkey for Spring Boot

7.4 Further reading

7.5 Summary

8 Application-level fault-injection

8.1 Scenario

8.1.1 Implementation details - before chaos

8.2 Experiment 1 - Redis latency

8.2.1 Experiment 1 plan

8.2.2 Experiment 1 steady state

8.2.3 Experiment 1 implementation

8.2.4 Experiment 1 execution

8.2.5 Experiment 1 discussion

8.3 Experiment 2 - failing requests

8.3.1 Experiment 2 plan

8.3.2 Experiment 2 implementation

8.3.3 Experiment 2 execution

8.4 Application versus infrastructure

8.5 Summary

9 There’s a monkey in my browser!

9.1 Scenario

9.1.1 Pgweb

9.1.2 Pgweb - implementation details

9.2 Experiment 1: adding latency

9.2.1 Experiment 1 - plan

9.2.2 Experiment 1 - steady state

9.2.3 Experiment 1 - implementation

9.2.4 Experiment 1 - run!

9.3 Experiment 2: adding failure

9.3.1 Experiment 2 - implementation

9.3.2 Experiment 2 - run

9.4 Other good-to-know topics

9.4.1 Fetch API

9.4.2 Throttling

9.5 Summary

PART 3 Chaos Engineering beyond machines

10 Chaos in Kubernetes

10.1 Porting things onto Kubernetes

10.1.1 High-profile project documentation

10.2 What’s Kubernetes (in 7 minutes)?

10.2.1 The very brief history of Kubernetes

10.2.2 What can Kubernetes do for you?

10.3 Setting up a Kubernetes cluster

10.3.1 Starting a cluster

10.4 Testing out software running on Kubernetes

10.4.1 Running the ICANT Project

10.4.2 Experiment 1: kill 50% of pods

10.4.3 Party trick: killing pods in style

10.4.4 Experiment 2: network slowness

10.5 Summary

11 Automating Kubernetes experiments

11.1 Automating chaos with PowerfulSeal

11.1.1 What’s PowerfulSeal?

11.1.2 PowerfulSeal - installation

11.1.3 Experiment 1b: kill 50% of pods

11.1.4 Experiment 2b: network slowness

11.2 Ongoing testing & Service Level Objectives (SLOs)

11.2.1 Experiment 3: verify pods are ready within (n) seconds of being created

11.3 Cloud layer

11.3.1 Cloud provider APIs, availability zones

11.3.2 Experiment 4: taking VMs down

11.4 Summary

12 Under the hood of Kubernetes

12.1 Anatomy of a Kubernetes cluster and how to break it

12.1.1 Control plane

12.1.2 Kubelet and pause container

12.1.3 Kubernetes, Docker, and container runtimes

12.1.4 Kubernetes networking

12.2 Summary of key components

12.3 Summary

13 Chaos engineering (for) people

13.1 Chaos engineering mindset

13.1.1 Failure is not a maybe: it will happen

13.1.2 Failing early vs failing late

13.2 Getting the buy-in

13.2.1 Management

13.2.2 Team members

13.2.3 Game days

13.3 Teams as distributed systems

13.3.1 Finding knowledge single points of failure: “Staycation”

13.3.2 Misinformation and trust within the team: “Liar, liar”

13.3.3 Bottlenecks in the team: “life in the slow lane”

13.3.4 Testing your processes: “inside job”

13.4 Summary

Appendixes

Appendix A: Installing chaos engineering tools

A.1 Prerequisites

A.2 Installing the Linux tools

A.2.1 Pumba

A.2.2 Python 3.7 with DTrace option

A.2.3 Pgweb

A.2.4 Pip dependencies

A.2.5 Example data to look at for pgweb

A.3 WordPress configuration

A.4 Checking out the source code for this book

A.5 Installing Minikube (Kubernetes)

Appendix B: Answers to exercises

B.1 Chapter 2

B.2 Chapter 3

B.3 Chapter 4

B.4 Chapter 5

B.5 Chapter 6

B.6 Chapter 7

B.7 Chapter 8

B.8 Chapter 9

B.9 Chapter 10

B.10 Chapter 11

B.11 Chapter 12

What's inside

  • Design, run and analyze Chaos Engineering experiments
  • See how applications react to database connection latency
  • Experiment with Docker container isolation
  • Test software running on Kubernetes and the platform itself
  • Inject failure into software running in the JVM

About the reader

For developers with basic knowledge of scripting and Linux.

About the author

Mikolaj Pawlikowski has been practicing chaos engineering for four years, beginning with a large distributed Kubernetes-based microservices platform at Bloomberg. He is the creator of the Kubernetes Chaos Engineering tool PowerfulSeal, and the networking visibility tool Goldpinger. He is an active member of the Chaos Engineering community and speaks at numerous conferences.
