Chaos Engineering
Crash test your applications
Mikolaj Pawlikowski
  • MEAP began April 2020
  • Publication in Early 2021 (estimated)
  • ISBN 9781617297755
  • 325 pages (estimated)
  • printed in black & white

This book is like the Swiss Army knife of chaos engineering. If you need to “break” your app, this book will have a tool that will help you with that.

Alessandro Campeis
Auto engineers test the safety of a car by intentionally crashing it and carefully observing the results. Chaos engineering applies the same principles to software systems. In Chaos Engineering: Crash test your applications, you’ll learn to run your applications and infrastructure through a series of tests that simulate real-life failures. You’ll maximize the benefits of chaos engineering by learning to think like a chaos engineer, and how to design the proper experiments to ensure the reliability of your software. With examples that cover a whole spectrum of software, you’ll be ready to run an intensive testing regime on anything from a simple WordPress site to a massive distributed system running on Kubernetes.

About the Technology

Rather than just looking for code bugs and errors, chaos engineering sees how your software responds to calamity, including partial infrastructure outages, hardware failure, and other major pitfalls that can befall a production system. By observing a system in distress or under attack, chaos engineering ensures the reliability and resiliency of your software—especially for hard-to-test distributed systems with lots of moving parts and little scope for downtime.

About the book

In Chaos Engineering: Crash test your applications, you’ll learn to design and execute controlled failure experiments that reveal the hidden problems in your software. Using a toolbox of open source tools, you’ll inject system-shaking failures at every level, from your Docker containers to your Kubernetes deployment to the UI. You’ll learn Linux monitoring for observing system metrics and evaluating your results, and even how to apply chaos engineering to make your human teams more reliable and more resilient when handling failures. Best of all, the tools and examples come with a downloadable Linux VM image, so you can experiment without risk to your own systems.
Table of Contents

PART 1 Chaos Engineering Fundamentals

1 Into the world of chaos engineering

1.1 What is chaos engineering?

1.2 Motivations for chaos engineering

1.2.1 Risk, cost and service-level indicators, objectives, and agreements (SL{I,O,A})

1.2.2 Testing a system as a whole

1.2.3 Emergent properties

1.3 Four steps to chaos engineering

1.3.1 Observability

1.3.2 Steady state

1.3.3 Hypothesis for our experiment

1.3.4 Run the experiment and prove (or refute) your hypothesis

1.4 What chaos engineering is not

1.5 A taste of chaos engineering

1.5.1 FizzBuzz as a service

1.5.2 A long, dark night

1.5.3 Post-mortem

1.5.4 Chaos engineering in a nutshell

1.6 Summary

2 First cup of chaos & blast radius

2.1 Setup - working with the code in this book

2.2 Scenario

2.3 Linux forensics 101

2.3.1 Exit codes

2.3.2 Killing processes

2.3.3 Out Of Memory (OOM) Killer

2.4 The first chaos experiment

2.4.1 Visibility

2.4.2 Steady state

2.4.3 Hypothesis

2.4.4 Run the experiment

2.5 Blast radius

2.6 Digging deeper

2.6.1 Saving the world

2.7 Summary

3 Observability

3.1 The app is slow

3.2 The USE method

3.3 Resources

3.3.1 System overview (uptime)

3.3.2 Block IO (df, iostat, biotop)

3.3.3 Networking (sar, tcptop)

3.3.4 RAM (free, top, vmstat, oomkill)

3.3.5 CPU (top; mpstat -P ALL 1; My dog ate my CPU - how do I fix it?)

3.3.6 OS (opensnoop, execsnoop)

3.3.7 Other tools

3.4 Application

3.4.1 cProfile

3.4.2 BCC and Python

3.5 Automation - using time series

3.5.1 Prometheus & Grafana

3.6 Further reading

3.7 Summary

4 Database trouble & testing in production

4.1 We’re doing WordPress

4.2.1 Experiment 1: slow disks

4.2.2 Experiment 2: slow connection

4.3 Testing in production

4.4 Summary

PART 2 Chaos Engineering in Action

5 Poking Docker

5.1 My (dockerized) app is slow!

5.1.1 Architecture

5.2 The brief history of Docker

5.2.1 Emulation, simulation, and virtualization

5.2.2 Virtual machines and containers

5.2.3 Linux containers and Docker

5.3 Peeking under the Docker’s hood

5.3.1 Uprooting processes with chroot

5.3.2 Implementing a simple container(-ish) part 1 - using chroot

5.3.3 Experiment 1: can one container prevent another one from writing to disk?

5.3.4 Isolating processes with Linux namespaces

5.3.5 Docker and namespaces

5.3.6 Experiment 2: killing processes in a different pid namespace

5.3.7 Implementing a simple container(-ish) part 2 - namespaces

5.3.8 Limiting resource use of a process with cgroups

5.3.9 Experiment 3: Using all the CPU I can find!

5.3.10 Experiment 4: Using too much RAM

5.3.11 Implementing a simple container(-ish) part 3 - cgroups

5.3.12 Docker and networking

5.3.13 Capabilities and seccomp

5.3.14 Docker demystified

5.4 Fixing my (dockerized) app being slow

5.4.1 Booting up Meower

5.4.2 Why is the app slow?

5.4.3 Experiment 5: Network slowness for containers with Pumba (Pumba - Docker chaos engineering tool; Chaos experiment implementation)

5.5 Other parts of the puzzle

5.5.1 Docker daemon restarts

5.5.2 Storage for image layers

5.5.3 Advanced networking

5.5.4 Security

5.6 Summary

6 Who you gonna call? Syscall-busters!

6.1 Scenario - congratulations on your promotion!

6.1.1 System X: if everyone is using it, but no one maintains it, is it abandonware?

6.2 A brief refresher on syscalls

6.2.1 Finding out about syscalls

6.2.2 Standard C library and glibc

6.3 How to observe a process’ syscalls?

6.3.1 strace and sleep

6.3.2 strace and System X

6.3.3 strace’s problem - overhead

6.3.4 BPF (BPF and BCC)

6.3.5 Other options (SystemTap, ftrace)

6.4 Blocking syscalls for fun and profit part 1 - strace

6.4.1 Experiment 1: breaking the close syscall

6.4.2 Experiment 1: steady state

6.4.3 Experiment 1: implementation

6.4.4 Experiment 1: analysis

6.4.5 Experiment 2: breaking the write syscall

6.4.6 Experiment 2: steady state

6.4.7 Experiment 2: implementation

6.5 Blocking syscalls for fun and profit part 2 - seccomp

6.5.1 Seccomp the easy way - Docker

6.5.2 Seccomp the hard way - libseccomp

6.6 Summary

7 Injecting failure into the JVM

8 Application-level fault injection

9 There’s a monkey in my browser!

10 Chaos in Kubernetes

PART 3 Chaos Engineering beyond machines

11 Chaos engineering (for) people


Appendix A: Installing chaos engineering tools

What's inside

  • Design, run and analyze Chaos Engineering experiments
  • See how applications react to database connection latency
  • Experiment with Docker container isolation
  • Test software running on Kubernetes and the platform itself
  • Inject failure into software running in the JVM

About the reader

For developers with basic knowledge of scripting and Linux.

About the author

Mikolaj Pawlikowski has been practicing chaos engineering for four years, beginning with a large distributed Kubernetes-based microservices platform at Bloomberg. He is the creator of the Kubernetes Chaos Engineering tool PowerfulSeal, and the networking visibility tool Goldpinger. He is an active member of the Chaos Engineering community and speaks at numerous conferences.
