Overview

4 Kafka as a distributed log

This chapter introduces Kafka through the lens of logs: ordered, append-only sequences of events that answer the question “what happened?” It explains core log properties—temporal ordering, append-at-end writes, immutability—and how offsets make large logs navigable while enabling consumers to track progress. Kafka elevates the log to a first-class storage and transport abstraction, using topics as logs and offsets to coordinate reading, but it cautions against treating Kafka as a query system or key-value store. Instead, Kafka acts as a central data backbone, where event streams are shared reliably and different systems materialize the forms they need (databases for queries, caches for fast lookups, search engines for discoverability).

To scale and remain resilient, Kafka is presented as a distributed log. Topics are partitioned so processing can be parallelized, with ordering guaranteed per partition and preserved for records sharing the same key. Without keys, producers spread records across partitions (historically round-robin; newer clients use a sticky partitioner that fills batches per partition before switching). Consumer groups enable horizontal consumption by assigning each partition to exactly one consumer instance within a group while storing per-group offsets for continuity. Reliability is delivered through replication: each partition has a leader and followers (replicas), with in-sync replicas (ISR) ready to take over on failure. Replication is log-based and efficient, and leaders are distributed across brokers to balance load and maintain throughput.

The chapter also outlines Kafka’s building blocks and their roles at scale. A coordination cluster manages cluster metadata, broker membership, partition leadership, and failover; Kafka now recommends KRaft for this role, replacing the operationally heavier ZooKeeper in most cases. Brokers store and serve data, while clients—producers, consumers, Kafka Streams, and Kafka Connect—write, read, process, and integrate data with external systems. In corporate environments, Kafka becomes a data hub: Connect links databases and other systems, Streams enables real-time processing, schema registries standardize data formats across teams, MirrorMaker 2 supports multi-datacenter mirroring, and robust operations (monitoring, automation, governance) turn Kafka into a dependable streaming platform for near–real-time, data-driven decisions.

A log is a sequential list where we add elements at the end and read them from a specific position (offset). For example, we read from offset 0, then choose to read from offset 4, and so on.
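This append-and-read-from-offset behavior can be sketched in a few lines of Python. This is a toy in-memory log to illustrate the idea, not Kafka's actual implementation:

```python
class Log:
    """A toy append-only log: write at the end, read from any offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append at the end and return the offset assigned to the record."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Read all records from the given offset to the end."""
        return self._records[offset:]


log = Log()
for event in ["order-placed", "order-paid", "order-shipped"]:
    log.append(event)

print(log.read(0))  # all events from the beginning
print(log.read(2))  # only events from offset 2 onward
```

Note that reading never removes anything: two consumers can read the same log from different offsets without interfering with each other.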
A log is an ideal data structure for exchanging data between systems. Typically, we do not work directly with the data in the log but store it in a format best suited to our particular use case. For example, we can use relational databases to perform complex queries over our data. If we want to access prepared data quickly, we can use an in-memory key-value store such as Redis. If we want to provide a search function over the data in the log, we can use a search engine.
Scaling vertically means adding more resources to a single instance. Scaling horizontally means adding more instances to a system.
Log A holds all the data for coffee pods and log B holds all the data for cola.
Every odd message was produced to partition 0 and every even message was produced to partition 1.
Messages with the same key (here, the form) were produced to the same partition.
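The routing rule behind this can be sketched as a deterministic hash of the key. Kafka's Java client hashes keys with murmur2; the sketch below uses CRC-32 purely as a stand-in to illustrate the `hash(key) % numPartitions` idea:

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Deterministically map a message key to a partition.

    Simplified stand-in for Kafka's default partitioner, which uses
    murmur2 rather than CRC-32.
    """
    return zlib.crc32(key) % num_partitions


# The same key always lands in the same partition:
assert choose_partition(b"coffee", 2) == choose_partition(b"coffee", 2)
print(choose_partition(b"coffee", 2), choose_partition(b"cola", 2))
```

Because the mapping depends only on the key and the partition count, all messages for one key preserve their order within that one partition.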
If we have only one consumer that needs to read data from all partitions, it may not be able to keep up, and the data will not be processed in a timely manner.
Consumer groups allow us to split the processing of multiple partitions between different instances of the same service. Often several consumer groups, not just one, consume the data from a topic. Consumer groups are isolated from each other and do not influence each other.
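The core assignment rule can be sketched as follows. Kafka's real assignors (range, round-robin, sticky) are more involved; this toy version only shows the essential invariant that each partition is owned by exactly one consumer within a group:

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions to the consumers of one group.

    Sketch only: each partition goes to exactly one consumer instance,
    so the group processes partitions in parallel without overlap.
    """
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# -> {'c1': [0, 2], 'c2': [1, 3]}
```

A second group running the same logic would get its own independent assignment and its own offsets, which is why groups do not influence each other.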
Consumers and producers communicate exclusively with the leader (with rare exceptions). Followers are only there to continuously replicate new messages from the leader. If the leader fails, one of the followers takes over.
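The failover step can be sketched as picking a replacement from the in-sync replicas. This is a simplified illustration of the controller's decision, not Kafka's actual election code:

```python
def elect_leader(replicas, isr, failed):
    """Pick a new leader for a partition after the current leader fails.

    Sketch only: the controller prefers an in-sync replica (ISR),
    because only those are guaranteed to hold all acknowledged messages.
    """
    for broker in replicas:
        if broker in isr and broker not in failed:
            return broker
    return None  # no eligible replica -> partition unavailable


# Replicas live on brokers 1, 2, 3; broker 1 (the leader) fails:
print(elect_leader([1, 2, 3], isr={1, 2, 3}, failed={1}))  # -> 2
```

If no in-sync replica survives, the partition becomes unavailable rather than risking data loss (unless unclean leader election is explicitly enabled).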
A typical Kafka environment consists of the Kafka cluster itself and the clients that write data to and read data from Kafka. Before the KRaft-based coordination cluster existed, a ZooKeeper ensemble was used for coordination. Without ZooKeeper, brokers can either take over the coordination role themselves or delegate it to a standalone coordination cluster.
Kafka uses either a KRaft-based or a ZooKeeper-based coordination cluster. Both should consist of an odd number of nodes (usually 3 or 5) and use a consensus protocol.
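The arithmetic behind the odd-number recommendation is worth spelling out: consensus requires a strict majority (quorum), and an even-sized cluster tolerates no more failures than the next-smaller odd one.

```python
def quorum(n: int) -> int:
    """Minimum number of nodes that must agree: a strict majority."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Failures the cluster can survive while still reaching quorum."""
    return n - quorum(n)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes: quorum 2, tolerates 1 failure(s)
# 4 nodes: quorum 3, tolerates 1 failure(s)
# 5 nodes: quorum 3, tolerates 2 failure(s)
```

A fourth node adds cost and coordination overhead without improving fault tolerance over three nodes, which is why 3 or 5 are the common choices.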
Kafka alone is usually not enough. The Kafka ecosystem offers numerous components to integrate Kafka into our enterprise landscape and thus build a streaming platform.

Summary

  • A log is a sequential list where we add elements at the end and read them from a specific position.
  • Kafka is a distributed log; the data of a topic is distributed across several partitions on several brokers.
  • Offsets define the position of a message inside a partition.
  • Kafka is used to exchange data between systems; it does not replace databases, key-value stores, or search engines.
  • Partitions are used to scale topics horizontally and to parallelize processing.
  • Producers use partitioners to decide which partition to produce to.
  • Messages with the same key end up in the same partition.
  • Consumer groups are used to scale consumers and let them share the workload; within a group, each partition is consumed by exactly one consumer.
  • Replication ensures reliability by duplicating partitions across multiple brokers within a Kafka cluster.
  • There is always one leader replica per partition, which is responsible for coordinating the partition.
  • Kafka consists of a coordination cluster, brokers, and clients.
  • The coordination cluster is responsible for orchestrating the Kafka cluster, in other words for managing brokers.
  • Brokers form the actual Kafka cluster; they are responsible for receiving, storing, and making messages available for retrieval.
  • Clients are responsible for producing or consuming messages; they connect to brokers.
  • There are various frameworks and tools to integrate Kafka easily into an existing corporate infrastructure.

FAQ

How does Kafka model data as a log?
Kafka stores records in append-only, ordered logs. New messages are written at the end, and consumers read from a specific position (offset) forward. This simple structure enables high throughput, durability, and straightforward replication.

What are offsets and how do consumers track their progress?
An offset is the position of a record within a partition, assigned by the broker when the record is written. Consumers remember the next offset to read; Kafka can persist these positions in the __consumer_offsets topic so consumers can resume from where they left off after restarts.

Why is immutability important in Kafka logs?
Messages, once written, are not changed in place. This immutability simplifies replication, preserves ordering, and allows consistent replay to reconstruct state at any point in time. Retention policies govern how long immutable data is kept.

Should I use Kafka like a database or key-value store?
No. Kafka addresses records by offset, not by key lookup or ad hoc queries. Use Kafka as a central data hub and materialize data into systems optimized for your access patterns (for example, relational databases for analytics, Redis for fast key lookups, Elasticsearch for search).

What are partitions and how do keys affect message ordering?
Topics are split into partitions for parallelism and scalability. Kafka guarantees ordering only within a single partition. If you need ordering for related messages, produce them with the same key so the partitioner routes them to the same partition.

What happens if I change the number of partitions or mix client libraries?
Partition selection typically uses hash(key) % numPartitions; changing the partition count can reshuffle keys and break ordering guarantees across re-partitioned data. Also, Java and librdkafka use different default partitioners—set librdkafka producers to murmur2_random to align with Java clients and ensure consistent routing.
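The reshuffling effect is easy to demonstrate. The sketch below uses CRC-32 as a simplified stand-in for Kafka's murmur2-based default partitioner; the point is only that `hash(key) % numPartitions` changes for some keys when the partition count changes:

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's murmur2-based default partitioner.
    return zlib.crc32(key) % num_partitions


keys = [b"coffee", b"cola", b"water", b"juice"]
before = {k: choose_partition(k, 3) for k in keys}
after = {k: choose_partition(k, 4) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition after resizing")
```

Messages written before the resize stay where they are, so a key's old and new messages can end up in different partitions, and per-key ordering across the resize is no longer guaranteed.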
How do consumer groups provide horizontal scalability?
Consumers that share the same group.id form a consumer group. Within a group, each partition is consumed by at most one consumer instance, allowing parallel processing while preserving per-partition order. Groups are isolated from each other, and offsets are tracked per group.

How does replication work and what do Leader, Follower, and ISR mean?
Each partition has one Leader that handles reads and writes, and Followers that replicate data from the Leader. In-Sync Replicas (ISR) are replicas caught up with the Leader. If the Leader fails, an ISR (or eligible replica) is elected Leader and clients automatically switch over.

What is the coordination cluster (KRaft vs. ZooKeeper) and why use an odd number of nodes?
The coordination cluster manages metadata, broker membership, controller elections, and partition leadership. Modern Kafka uses KRaft (Kafka Raft) instead of ZooKeeper, reducing operational complexity and improving performance. Use an odd number of nodes (commonly 3 or 5) to maintain quorum and tolerate failures.

How is Kafka used in enterprises and which ecosystem tools matter?
Kafka typically serves as a central data hub. Kafka Connect integrates external systems; Kafka Streams (and alternatives like Flink) process streams; a schema registry manages data formats; MirrorMaker 2 supports multi-datacenter mirroring. Production setups also require monitoring, automation, and compliance controls.
