6 Performance
This chapter frames Kafka performance as a balance of throughput, latency, and resource efficiency, then explains how Kafka’s design choices serve those goals. Kafka relies on a simple, append-only log and predictable sequential I/O, leans on the operating system’s page cache, and, when possible, benefits from zero-copy transfers. Brokers focus on moving bytes while clients do most of the heavy lifting, and the platform avoids a one-size-fits-all approach by letting you tune guarantees and performance per topic, producer, consumer, and broker. The overarching message is to optimize deliberately: understand the workload, measure, and then adjust configurations in ways that respect both reliability and cost.
Performance starts with topics and partitions: partitioning enables parallelism and load balancing across brokers and within consumer groups, yet ordering is guaranteed only per partition and is strongest when enable.idempotence=true. Choosing partition counts is a trade-off—too few limits parallelism, too many strains CPU, memory, file handles, and operational complexity (with KRaft easing previous ZooKeeper limits). A practical approach is to size for consumer bottlenecks and begin with a sensible default (often around a dozen), increasing only when needed and mindful of costs. Because partitions can be increased but not decreased, and adding them changes key-to-partition mapping, preserving order often requires creating a new topic and migrating data and consumers; proposals like shared groups (queues) may relax the partitions-to-consumers constraint at the expense of strict ordering.
On the producer side, batching and compression are the key levers: larger batch.size and a modest linger.ms can dramatically lift throughput and even reduce latency, while compression.type (commonly zstd or lz4) cuts network and storage use as entire batches remain compressed end-to-end; relaxing acks trades reliability for speed. Producer tests with kafka-producer-perf-test.sh help quantify gains. Broker tuning emphasizes system-level settings (file descriptors, virtual memory, swappiness), right-sizing network and I/O threads, and capacity planning for storage, replication (often factor 3), and network bandwidth. Consumers are fetch-based and self-paced; fetch.min.bytes and fetch.max.wait.ms shape their throughput/latency profile, and scaling typically comes from consumer groups, which also manage offsets. Consumer performance can exceed producer rates in isolation, but real bottlenecks are often downstream processing, so always validate with production-like tests using kafka-consumer-perf-test.sh and end-to-end workloads.
We can configure performance in Kafka in many places: in the topic configuration, on the brokers, and in all clients, that is, producers and consumers.

In Kafka, we partition topics for load balancing and parallelization, and we distribute the partitions across different brokers to spread the load. Producers decide which partition a message goes to, based on the message key or, for messages without a key, in a round-robin fashion. On the consumer side, we use consumer groups to process the partitions in parallel. Different consumer groups are isolated from each other.

After changing the number of partitions, the partition number for a particular message key usually changes. In this example, messages with the key circle are now produced to partition 2 instead of partition 1, and messages with the key square are now produced to partition 1 instead of partition 0.
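The reason for this remapping is that the default partitioner derives the partition for a keyed message from a hash of the key modulo the partition count (the real producer uses a murmur2 hash of the serialized key bytes). The sketch below illustrates the principle with a plain Java hash code as a stand-in, so the concrete partition numbers will not match the figures above; what matters is that the mapping shifts as soon as the partition count changes.

```java
import java.util.List;

// Simplified illustration of key-based partition selection.
// Kafka's default partitioner hashes the serialized key with murmur2;
// here a plain hashCode stands in to show the principle.
public class PartitionMappingDemo {
    static int partitionFor(String key, int partitionCount) {
        // Mask the sign bit so the hash is non-negative, then take the modulo.
        return (key.hashCode() & 0x7fffffff) % partitionCount;
    }

    public static void main(String[] args) {
        for (String key : List.of("circle", "square", "triangle")) {
            System.out.printf("%-8s -> partition %d (of 2), partition %d (of 3)%n",
                    key, partitionFor(key, 2), partitionFor(key, 3));
        }
    }
}
```

Running the sketch shows that several keys land on different partitions once the count changes from two to three, which is exactly the ordering hazard described above.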

Message 7 with the key triangle was previously stored at offset 3 of partition 0. After creating a new topic with three partitions and migrating the data, this message is now stored at offset 1 of partition 1.

The producer batches messages per topic and partition. A batch is sent as soon as it is full or the linger time has elapsed.

When we turn on compression, the producer compresses entire batches. These compressed batches are not only sent over the network but are also stored on the brokers in compressed form. Only the consumer eventually unpacks a batch to extract the individual messages.
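A minimal producer sketch, assuming a local broker at localhost:9092 and a hypothetical topic named events, shows how the batching and compression settings from the last two paragraphs fit together; the concrete values are starting points to measure against, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch of a producer tuned for throughput via batching and compression.
// Broker address, topic name, and the concrete values are assumptions.
public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Collect up to 1 MB per partition before sending ...
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1_048_576);
        // ... but wait at most 10 ms for a batch to fill up.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Compress whole batches; they stay compressed on the broker.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "circle", "payload"));
        }
    }
}
```

Whether larger batches actually help depends on the traffic pattern, which is why the chapter recommends verifying the effect with kafka-producer-perf-test.sh.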

Summary
- High throughput does not imply low latency, but both can be equally important.
- Partitioning distributes the load and therefore increases performance.
- Partitioning strategy involves identifying performance bottlenecks in consumers or Kafka and adjusting partitions accordingly.
- Consider balancing partition counts to manage client RAM usage and operational complexity.
- Start with a default of 12 partitions, scaling up as needed for high throughput, while considering operational and cost implications.
- The number of partitions can never be decreased.
- Increasing the number of partitions can lead to consuming messages in the wrong order.
- A consumer group distributes load across its members.
- Batching can increase the bandwidth but also the latency.
- Batching can be configured with batch.size and linger.ms.
- Producers can compress batches to reduce the required bandwidth but this might increase latency.
- Using acks=all slightly reduces producer performance; the same goes for idempotence.
- Brokers do not decompress batches; this is the task of the consumer.
- In most cases, brokers do not require any further fine-tuning.
- Brokers open file descriptors for every partition.
- Kafka heavily depends on the operating system, necessitating specific OS-related optimizations to maximize its performance.
- Consumer performance depends mostly on the number of consumers in a consumer group but can also be tuned by setting fetch.max.wait.ms and fetch.min.bytes.
FAQ
What does “performance” mean in Kafka: bandwidth, latency, or something else?
Performance spans three dimensions: bandwidth (bytes per second), latency (end-to-end responsiveness), and resource friendliness (CPU, RAM, disk, and cost/energy efficiency). While bandwidth is important, users typically care more about latency; optimizing should balance all three.
How do partitions enable scaling, and how many should I create?
- Partitions allow parallelism: producers distribute load, brokers spread partitions, and consumer groups process partitions in parallel while preserving order per partition.
- Start by locating the bottleneck, often the consumer. Size partitions to support the required number of parallel consumers. Example: if a consumer needs ~100 ms per message (≈10 msg/s) and you need 100 msg/s peak, use at least 10 consumers and thus 10 partitions (a small calculation sketch follows this answer).
- Prefer easily divisible counts and a small buffer above the current consumer count for fault tolerance. A practical default is 12 partitions; double if you need more parallelism (e.g., 12 → 24 → 48).
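The arithmetic behind that example is simple enough to keep as a back-of-the-envelope helper; the processing time and peak rate below are the assumed figures from the example, not measured values.

```java
// Back-of-the-envelope partition sizing using the figures assumed above.
public class PartitionSizing {
    public static void main(String[] args) {
        double processingMsPerMessage = 100.0;                     // assumed consumer cost per message
        double perConsumerRate = 1000.0 / processingMsPerMessage;  // ≈10 msg/s per consumer
        double peakRate = 100.0;                                   // required messages per second
        int minConsumers = (int) Math.ceil(peakRate / perConsumerRate);
        System.out.println("At least " + minConsumers + " consumers, hence >= "
                + minConsumers + " partitions (12 is a practical default).");
    }
}
```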
What are the risks of having too many partitions?
- Higher client resource usage (RAM, file handles) and broker load (CPU, memory, descriptors).
- Longer recovery and operational complexity, especially in older ZooKeeper-based clusters (leader movements during outages could take time).
- General guidance: up to ~4,000 partitions per broker and ~200,000 per ZooKeeper-based cluster; KRaft can handle more. Also consider cloud pricing models that charge per partition.
Can I change the number of partitions later?
- You can only increase the number of partitions of an existing topic, never decrease it.
- Increasing partitions can disrupt key-based ordering because partition selection is key-hash modulo partition count; after a change, keys may map to different partitions. Expect temporary imbalance and ordering discontinuity until old data ages out per retention.
- If strict ordering must be preserved, create a new topic with the desired partition count and migrate: switch producers, let consumers drain, or copy with Kafka Streams when retaining data indefinitely. Plan offset translation and duplicate handling (e.g., include an event ID for idempotent consumption).
How is message ordering guaranteed, and what role do consumer groups play?
- Ordering is guaranteed only within a single partition. Across partitions there is no order guarantee.
- Multiple producers writing to the same topic do not guarantee inter-producer ordering; to preserve order, messages with the same key must go to the same partition, and the producer should have enable.idempotence=true (with acks=all/-1); see the producer sketch after this answer.
- Consumer groups scale processing by assigning entire partitions to consumers, preserving partition-level order.
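A producer sketch for the ordering-sensitive case might look as follows; the broker address, the topic name orders, and the key order-42 are made-up placeholders. The key keeps all events of one entity in one partition, and idempotence plus acks=all prevents retries from reordering or duplicating writes within that partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Producer sketch for per-key ordering: same key -> same partition,
// idempotence and acks=all so retries cannot reorder or duplicate writes.
public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All updates for "order-42" share the same key and therefore
            // land in the same partition, in send order.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            producer.send(new ProducerRecord<>("orders", "order-42", "paid"));
        }
    }
}
```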
Which producer settings most affect throughput and latency?
- batch.size (bytes) and linger.ms (ms) control batching. Larger batches and a small wait often improve throughput and may even help latency if batches were too small. Common guidance: batch.size up to 1 MB and linger.ms around 10 ms; adjust based on traffic and monitor results.
- compression.type: none, gzip, snappy, lz4, zstd. zstd or lz4 typically balance ratio and CPU well. Kafka compresses whole batches end-to-end (producer → broker disk → replicas) and only decompresses at the consumer, reducing network and storage load. Mixed compression across producers is allowed but standardizing per topic is preferable.
- acks: 0/1 vs all (-1) trades reliability for marginal throughput; usually set by reliability needs rather than performance alone.
How do I benchmark producer performance effectively?
- Use kafka-producer-perf-test.sh to explore settings (message size, batch.size, linger.ms, compression, etc.). Be cautious: it can generate many GB of data.
- Run multiple repetitions and interpret averages/percentiles; initial runs may be slower due to JVM warmup and buffer allocations.
- Treat it as a microbenchmark. For end-to-end tests with real workloads, complement it with tools like Apache JMeter and production-like data.
How can I tune consumer performance, and what should I avoid?
- Key settings: fetch.min.bytes (how much data to accumulate before responding) and fetch.max.wait.ms (max wait before responding). Raising them increases throughput at the cost of latency; see the consumer sketch after this answer.
- Never set fetch.min.bytes or fetch.max.wait.ms to 0; this can overload brokers with fetch storms (potentially tens of thousands of requests/sec).
- Use kafka-consumer-perf-test.sh with a config file to measure impact. Remember consumers are fetch-based; to scale processing and manage offsets, use consumer groups.
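The consumer sketch below raises fetch.min.bytes while keeping fetch.max.wait.ms bounded, trading a little latency for fewer, larger fetches; the broker address, group id, topic, and concrete values are assumptions to adapt to your workload.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Consumer sketch trading a little latency for larger fetches.
// Broker address, group id, topic, and values are assumptions.
public class BatchingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Let the broker accumulate at least 64 KB per fetch ...
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65_536);
        // ... but respond after at most 500 ms even if less data is available.
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition %d, offset %d: %s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Each additional consumer started with the same group.id takes over some of the partitions, which is the primary way to scale consumption.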
Which broker and OS-level optimizations matter most?
- Kafka relies on the OS page cache: it acknowledges writes once they are in memory and relies on replication for durability, avoiding a forced fsync on every write for performance.
- Zero-copy networking moves data efficiently between the page cache and network sockets, but it does not apply when TLS is enabled.
- System tuning: raise file descriptor limits, tune virtual memory and swappiness, and adjust networking. For heavy TLS, consider increasing num.network.threads (e.g., from 3 to 6). Align num.io.threads with the number of disks and consider raising queued.max.requests for large client counts. Defaults are good, but monitor and tune in production.
How do I choose the number of brokers and size the cluster?
- Start with three brokers for small deployments to support a replication factor of 3 and tolerate maintenance plus an additional failure.
- Size for storage (retention × daily volume × replication), throughput, network capacity, and machine specs. Example: 1 TB/day retained for 7 days = 7 TB raw; with RF=3 that's 21 TB of replicated storage. On 2 TB nodes, you'd need roughly 11 brokers just for capacity, plus headroom.
- Balance many small vs few large brokers: more brokers reduce blast radius but increase management overhead. In environments like Kubernetes you may not control OS-level tunables; monitor to stay ahead of limits. Follow the rule: first understand, then measure, then optimize.