Storm Applied
Strategies for real-time event processing
Sean T. Allen, Matthew Jankowski, and Peter Pathirana
Foreword by Andrew Montalenti
  • March 2015
  • ISBN 9781617291890
  • 280 pages
  • printed in black & white

Will no doubt become the definitive practitioner’s guide for Storm users.

From the Foreword by Andrew Montalenti,

Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams. This immediately useful book starts by building a solid foundation of Storm essentials so that you learn how to think about designing Storm solutions the right way from day one. But it quickly dives into real-world case studies that will bring the novice up to speed with productionizing Storm.

Table of Contents show full

foreword

preface

acknowledgments

about this book

about the cover illustration

1. Introducing Storm

1.1. What is big data?

1.1.1. The four Vs of big data

1.1.2. Big data tools

1.2. How Storm fits into the big data picture

1.2.1. Storm vs. the usual suspects

1.3. Why you’d want to use Storm

1.4. Summary

2. Core Storm concepts

2.1. Problem definition: GitHub commit count dashboard

2.1.1. Data: starting and ending points

2.1.2. Breaking down the problem

2.2. Basic Storm concepts

2.2.1. Topology

2.2.2. Tuple

2.2.3. Stream

2.2.4. Spout

2.2.5. Bolt

2.2.6. Stream grouping

2.3. Implementing a GitHub commit count dashboard in Storm

2.3.1. Setting up a Storm project

2.3.2. Implementing the spout

2.3.3. Implementing the bolts

2.3.4. Wiring everything together to form the topology

2.4. Summary

3. Topology design

3.1. Approaching topology design

3.2. Problem definition: a social heat map

3.2.1. Formation of a conceptual solution

3.3. Precepts for mapping the solution to Storm

3.3.1. Consider the requirements imposed by the data stream

3.3.2. Represent data points as tuples

3.3.3. Steps for determining the topology composition

3.4. Initial implementation of the design

3.4.1. Spout: read data from a source

3.4.2. Bolt: connect to an external service

3.4.3. Bolt: collect data in-memory

3.4.4. Bolt: persisting to a data store

3.4.5. Defining stream groupings between the components

3.4.6. Building a topology for running in local cluster mode

3.5. Scaling the topology

3.5.1. Understanding parallelism in Storm

3.5.2. Adjusting the topology to address bottlenecks inherent within design

3.5.3. Adjusting the topology to address bottlenecks inherent within a data stream

3.6. Topology design paradigms

3.6.1. Design by breakdown into functional components

3.6.2. Design by breakdown into components at points of repartition

3.6.3. Simplest functional components vs. lowest number of repartitions

3.7. Summary

4. Creating robust topologies

4.1. Requirements for reliability

4.1.1. Pieces of the puzzle for supporting reliability

4.2. Problem definition: a credit card authorization system

4.2.1. A conceptual solution with retry characteristics

4.2.2. Defining the data points

4.2.3. Mapping the solution to Storm with retry characteristics

4.3. Basic implementation of the bolts

4.3.1. The AuthorizeCreditCard implementation

4.3.2. The ProcessedOrderNotification implementation

4.4. Guaranteed message processing

4.4.1. Tuple states: fully processed vs. failed

4.4.2. Anchoring, acking, and failing tuples in our bolts

4.4.3. A spout’s role in guaranteed message processing

4.5. Replay semantics

4.5.1. Degrees of reliability in Storm

4.5.2. Examining exactly once processing in a Storm topology

4.5.3. Examining the reliability guarantees in our topology

4.6. Summary

5. Moving from local to remote topologies

5.1. The Storm cluster

5.1.1. The anatomy of a worker node

5.1.2. Presenting a worker node within the context of the credit card authorization topology

5.2. Fail-fast philosophy for fault tolerance within a Storm cluster

5.3. Installing a Storm cluster

5.3.1. Setting up a Zookeeper cluster

5.3.2. Installing the required Storm dependencies to master and worker nodes

5.3.3. Installing Storm to master and worker nodes

5.3.4. Configuring the master and worker nodes via storm.yaml

5.3.5. Launching Nimbus and Supervisors under supervision

5.4. Getting your topology to run on a Storm cluster

5.4.1. Revisiting how to put together the topology components

5.4.2. Running topologies in local mode

5.4.3. Running topologies on a remote Storm cluster

5.4.4. Deploying a topology to a remote Storm cluster

5.5. The Storm UI and its role in the Storm cluster

5.5.1. Storm UI: the Storm cluster summary

5.5.2. Storm UI: individual Topology summary

5.5.3. Storm UI: individual spout/bolt summary

5.6. Summary

6. Tuning in Storm

6.1. Problem definition: Daily Deals! reborn

6.1.1. Formation of a conceptual solution

6.1.2. Mapping the solution to Storm concepts

6.2. Initial implementation

6.2.1. Spout: read from a data source

6.2.3. Bolt: look up details for each sale

6.3. Tuning: I wanna go fast

6.3.1. The Storm UI: your go-to tool for tuning

6.3.2. Establishing a baseline set of performance numbers

6.3.3. Identifying bottlenecks

6.3.4. Spouts: controlling the rate data flows into a topology

6.4. Latency: when external systems take their time

6.4.1. Simulating latency in your topology

6.4.2. Extrinsic and intrinsic reasons for latency

6.5. Storm’s metrics-collecting API

6.5.1. Using Storm’s built-in CountMetric

6.5.2. Setting up a metrics consumer

6.5.3. Creating a custom SuccessRateMetric

6.5.4. Creating a custom MultiSuccessRateMetric

6.6. Summary

7. Resource contention

7.1. Changing the number of worker processes running on a worker node

7.1.1. Problem

7.1.2. Solution

7.1.3. Discussion

7.2. Changing the amount of memory allocated to worker processes (JVMs)

7.2.1. Problem

7.2.2. Solution

7.2.3. Discussion

7.3. Figuring out which worker nodes/processes a topology is executing on

7.3.1. Problem

7.3.2. Solution

7.3.3. Discussion

7.4. Contention for worker processes in a Storm cluster

7.4.1. Problem

7.4.2. Solution

7.4.3. Discussion

7.5. Memory contention within a worker process (JVM)

7.5.1. Problem

7.5.2. Solution

7.5.3. Discussion

7.6. Memory contention on a worker node

7.6.1. Problem

7.6.2. Solution

7.6.3. Discussion

7.7. Worker node CPU contention

7.7.1. Problem

7.7.2. Solution

7.7.3. Discussion

7.8. Worker node I/O contention

7.8.1. Network/socket I/O contention

7.8.2. Disk I/O contention

7.9. Summary

8. Storm internals

8.1. The commit count topology revisited

8.1.1. Reviewing the topology design

8.1.2. Thinking of the topology as running on a remote Storm cluster

8.1.3. How data flows between the spout and bolts in the cluster

8.2. Diving into the details of an executor

8.2.1. Executor details for the commit feed listener spout

8.2.2. Transferring tuples between two executors on the same JVM

8.2.3. Executor details for the email extractor bolt

8.2.4. Transferring tuples between two executors on different JVMs

8.2.5. Executor details for the email counter bolt

8.3. Routing and tasks

8.4. Knowing when Storm’s internal queues overflow

8.4.1. The various types of internal queues and how they might overflow

8.4.2. Using Storm’s debug logs to diagnose buffer overflowing

8.5. Addressing internal Storm buffers overflowing

8.5.1. Adjust the production-to-consumption ratio

8.5.2. Increase the size of the buffer for all topologies

8.5.3. Increase the size of the buffer for a given topology

8.5.4. Max spout pending

8.6. Tweaking buffer sizes for performance gain

8.7. Summary

9. Trident

9.1. What is Trident?

9.1.1. The different types of Trident operations

9.1.2. Trident streams as a series of batches

9.2. Kafka and its role with Trident

9.2.1. Breaking down Kafka’s design

9.2.2. Kafka’s alignment with Trident

9.3. Problem definition: Internet radio

9.3.1. Defining the data points

9.3.2. Breaking down the problem into a series of steps

9.4. Implementing the internet radio design as a Trident topology

9.4.1. Implementing the spout with a Trident Kafka spout

9.4.2. Deserializing the play log and creating separate streams for each of the fields

9.4.3. Calculating and persisting the counts for artist, title, and tag

9.5. Accessing the persisted counts through DRPC

9.5.1. Creating a DRPC stream

9.5.2. Applying a DRPC state query to a stream

9.5.3. Making DRPC calls with a DRPC client

9.6. Mapping Trident operations to Storm primitives

9.7. Scaling a Trident topology

9.7.1. Partitions for parallelism

9.7.2. Partitions in Trident streams

9.8. Summary

afterword

index

About the Technology

It's hard to make sense out of data when it's coming at you fast. Like Hadoop, Storm processes large amounts of data but it does it reliably and in real time, guaranteeing that every message will be processed. Storm allows you to scale with your data as it grows, making it an excellent platform to solve your big data problems.

About the book

Storm Applied is an example-driven guide to processing and analyzing real-time data streams. This immediately useful book starts by teaching you how to design Storm solutions the right way. Then, it quickly dives into real-world case studies that show you how to scale a high-throughput stream processor, ensure smooth operation within a production cluster, and more. Along the way, you'll learn to use Trident for stateful stream processing, along with other tools from the Storm ecosystem.

What's inside

  • Mapping real problems to Storm components
  • Performance tuning and scaling
  • Practical troubleshooting and debugging
  • Exactly-once processing with Trident

About the reader

This book moves through the basics quickly. While prior experience with Storm is not assumed, some experience with big data and real-time systems is helpful.

About the authors

Sean Allen, Matthew Jankowski, and Peter Pathirana lead the development team for a high-volume, search-intensive commercial web application at TheLadders.


combo $49.99 pBook + eBook
eBook $39.99 pdf + ePub + kindle

FREE domestic shipping on three or more pBooks

The book’s practical approach to Storm will save you a lot of hassle and a lot of time.

Tanguy Leroux, Elasticsearch

Great introduction to distributed computing with lots of real-world examples.

Shay Elkin, Tangent Logic

Go beyond the MapReduce way of thinking to solve big data problems.

Muthusamy Manigandan, OzoneMedia