Grokking Streaming Systems
Josh Fischer and Ning Wang
  • MEAP began June 2020
  • Publication in Early 2021 (estimated)
  • ISBN 9781617297304
  • 375 pages (estimated)
  • printed in black & white

Provides a lot of useful information that I'd wish to have had when I first started working with Event Based systems.

Sebastián Palma
Every action by a user or a system process generates valuable data for your application or organization. Streaming systems capture and process these events, turning disconnected bits into coherent, useful information sources. Grokking Streaming Systems is a simple guide to the complex concepts you need to start building your own streaming systems. In this friendly, framework-agnostic tutorial, you’ll learn how to handle real-time events and how to design and implement a system that’s a perfect fit for your needs. Each new idea is carefully explained with diagrams, clear examples, and fun dialogue between perplexed personalities!

About the Technology

Many modern organizations rely on real-time event data to ensure top performance. In its raw state, most event data is unfiltered and hard to analyze. Streaming systems address this problem by processing event data so it can be used to create alerts, analysis dashboards, and automated responses, and to trigger other actions within a system. From delivering live financial information to monitoring for signs of a DDoS attack and blocking rogue agents, streaming systems provide a big boost to the health, security, and flexibility of your applications.

About the book

Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Because the book is tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose. You’ll start with the key concepts and then work your way through increasingly complex examples, including tracking a real-time count of IoT sensor events and detecting fraudulent credit card transactions in real time. You can even experiment with your own streaming system by downloading the custom-built, super-simplified streaming framework designed for this book. By the time you’re done, you’ll be able to assess the capabilities of streaming frameworks and solve common challenges that arise when building streaming systems.

Table of Contents

Part 1: Getting started with Streaming

1 Why Stream?

1.1 What is stream processing?

1.1.1 Examples of events

1.1.2 Streaming system and examples

1.1.3 Streaming system and “real-time”

1.2 How a streaming system works

1.2.1 Comparison of four typical computer systems

1.2.2 Applications

1.2.3 Inside an application

1.2.4 Backend Services

1.2.5 Inside a backend service

1.2.6 Batch processing systems

1.2.7 Inside a batch processing system

1.2.8 Stream processing systems

1.2.9 Inside a stream processing system

1.2.10 The advantages of multi-stage architecture

1.2.11 The multi-stage architecture in batch and stream processing systems

1.2.12 Compare the systems

1.3 A model stream processing system

1.3.1 Building systems with stream processing engine and API

1.4 Summary

1.4.1 Exercises

2 Streaming systems in action

2.1 Streaming Systems Move Faster

2.1.1 Tim wants a state-of-the-art toll booth system

2.2 It started as HTTP requests and it failed

2.3 Tracy ponders about streaming systems

2.4 Comparing services and streaming

2.5 How a streaming system could fit

2.6 Queues: A foundational concept

2.7 Data transfer via queues

2.8 Our streaming framework (the start of it)

2.9 The Streamwork overview

2.10 Zooming in on the stream work engine

2.11 Core Streaming Concepts

2.12 More details of the concepts

2.13 The streaming job execution flow

2.14 Your first streaming job

2.14.1 Your First Streaming Job: Create your event class

2.14.2 Your First Streaming Job: The data source

2.14.3 Your First Streaming Job: The data source (cont.)

2.14.4 Your First Streaming Job: The operator (cont.)

2.14.5 Your First Streaming Job: Assembling the job

2.15 Executing the job

2.16 Inspecting the job execution

2.17 Look inside the engine

2.17.1 Look inside the engine: Source executors

2.17.2 Look inside the engine: Operator executors

2.17.3 Look inside the engine: Job starter

2.17.4 Keep events moving

2.17.5 The life of a data element

2.18 Reviewing Streaming Concepts

2.19 Summary

2.19.1 Exercises

3 Parallelization & data grouping

3.1 The sensor is emitting more events

3.2 Even in streaming real-time is hard

3.2.1 Increasing lanes caused the job to fall behind

3.3 New concepts

3.3.1 Parallelization is an important concept

3.3.2 Why it’s important

3.3.3 New Concepts: Data parallelism

3.3.4 New Concepts: Data Execution Independence

3.3.5 New Concepts: Task parallelism

3.3.6 Data parallelism vs task parallelism

3.4 Parallelism and concurrency

3.4.1 Is there a difference?

3.5 Parallelizing the job

3.6 Parallelizing components

3.7 Parallelizing sources

3.8 Viewing Job Output

3.9 Parallelizing operators

3.9.1 Running the new job

3.10 Events and instances

3.11 Event ordering

3.12 Event Grouping

3.12.1 Shuffle Grouping

3.12.2 Shuffle Grouping: Under the hood

3.12.3 Fields Grouping

3.12.4 Fields Grouping: Under the hood

3.13 Event grouping execution

3.13.1 Look inside the engine: Event dispatcher

3.14 Applying fields grouping in your job

3.14.1 Fields grouping output

3.14.2 Comparing grouping behaviors

3.15 Summary

3.15.1 Exercises

4 Stream Graph

4.1 A credit card fraud detection system

4.1.1 More about the credit card fraud detection system

4.1.2 The fraud detection business

4.1.3 Streaming isn’t always a straight-line

4.1.4 Zoom into the system

4.2 The fraud detection job in detail

4.3 New concepts

4.3.1 Upstream and downstream components

4.3.2 Stream fan-out and fan-in

4.3.3 Graph, Directed Graph, and DAG (Directed Acyclic Graph)

4.3.4 DAG in stream processing systems

4.3.5 All new concepts in one page

4.4 Stream fan-out to the analyzers

4.4.1 Look inside the engine

4.4.2 There is a problem: efficiency

4.5 Stream fan-out with different streams

4.5.1 Look inside the engine again

4.5.2 The communication between the components via channels

4.5.3 Multiple channels

4.6 Stream fan-in to the score aggregator

4.6.1 Stream fan-in in the engine

4.6.2 A brief introduction of another stream fan-in: join

4.7 Look at the whole system

4.8 Graph and streaming jobs

4.8.1 The example systems

4.9 Summary

4.9.1 Exercises

5 Delivery Semantics

5.1 The latency requirement of the fraud detection system

5.2 Revisit the fraud detection job

5.3 About accuracy

5.4 Partial result

5.5 A new streaming job to monitor system usage

5.6 The new system usage job

5.7 The requirements of the new system usage job

5.8 New concepts: (number of) times delivered and times processed

5.9 New concept: delivery semantics

5.10 Choosing the right semantics

5.11 At-most-once

5.12 The fraud detection job

5.12.1 The goods

5.12.2 The bads

5.12.3 The hope

5.13 At-least-once

5.14 At least once with acknowledging

5.15 Track events

5.16 Handle event processing failures

5.17 Track early out events

5.18 Acknowledging code in components

5.19 New concept: checkpointing

5.20 New concept: state

5.21 Checkpointing in the system usage job for at-least-once semantic

5.22 Checkpointing and state manipulation functions

5.23 State handling code in the transaction source component

5.24 Exactly-once or effectively-once?

5.25 Bonus concept: idempotent operation

5.26 Exactly-once, finally

5.27 State handling code in the system usage analyzer component

5.28 Comparing the delivery semantics again

5.29 Exercises

6 Streaming systems review & a glimpse ahead

Part 2: Stepping up

7 Windowed computations

8 The join operation

9 Backpressure

10 Stateful computation

11 Advanced Streaming system wrap up

Part 3: Going above and beyond

12 Programming interfaces

13 Integrating with external data systems


Appendix A: Troubleshooting code examples

Appendix B: Definitions

Appendix C: What’s next

What's inside

  • Implement and troubleshoot streaming systems
  • Design streaming systems for complex functionalities
  • Assess parallelization requirements
  • Spot networking bottlenecks and resolve backpressure
  • Group data for high-performance systems
  • Handle delayed events in real-time systems

About the reader

For readers interested in data processing. Examples in Java.

About the authors

Josh Fischer and Ning Wang are both Apache Committers and members of the project management committee for the Apache Heron distributed stream processing engine. Josh is a software engineer at Scotcro and has worked with moving large datasets in real time for organizations such as 1904labs and Bayer. Ning is a software engineer at Amplitude building real-time data pipelines. He was a key contributor to Apache Heron on Twitter’s Real-time Compute team.

Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
print book $35.99 $59.99 pBook + eBook + liveBook
Additional shipping charges may apply

eBook $33.59 $47.99 3 formats + liveBook
