Streaming Data
Understanding the real-time pipeline
Andrew G. Psaltis
  • MEAP began July 2014
  • Publication in December 2016 (estimated)
  • ISBN 9781617292286
  • 300 pages (estimated)
  • printed in black & white

Streaming Data introduces the concepts and requirements of streaming and real-time data systems. Through this book you will develop a foundation to understand the challenges and solutions of building in-the-moment data systems before committing to specific technologies. Using copious diagrams, this book systematically builds up the blueprint for an in-the-moment system concept by concept. Although code may occasionally appear in examples, this book focuses on the big ideas of streaming and real-time data systems rather than the implementation details.

Many of the technologies discussed in the book, including Spark, Storm, Kafka, Impala, and RabbitMQ, are covered individually in other books. As you read, you'll get a clear picture of how these technologies work individually and together, gain insight into how to choose the right technologies, and discover how to fuse them together to architect a robust system.

Table of Contents


1. Introducing Streaming Data

1.1. What is a real-time system

1.2. Differences between real-time and streaming systems

1.3. The architectural blueprint

1.4. Security for streaming systems

1.5. How do we scale?

1.6. Summary

2. Getting data from clients: data ingestion

2.1. Common interaction patterns

2.1.1. Request/Response

2.1.2. Publish/Subscribe

2.1.3. Request/Acknowledge

2.1.4. One-Way

2.1.5. Stream

2.2. Scaling the interaction patterns

2.2.1. Request/Response Optional

2.2.2. Scaling the Stream Pattern

2.3. Fault-Tolerance

2.3.1. Receiver-Based Message Logging (RBML)

2.3.2. Sender-Based Message Logging (SBML)

2.3.3. Hybrid Message Logging (HML)

2.4. A dose of reality

2.5. Summary

3. Transporting the data from collection tier: decoupling the data pipeline

3.1. Do we really need a message queuing tier?

3.2. Core concepts

3.2.1. The Producer, The Broker, and the Consumer

3.2.2. Isolating Producers from Consumers

3.2.3. Durable Messaging

3.3. Message Delivery Semantics

3.4. Security

3.5. Fault tolerance

3.6. Applying the core concepts to business problems

3.7. Summary

4. Analyzing streaming data

4.1. Understanding in-flight data analysis

4.2. Distributed Stream Processing Architecture

4.3. Key Features of Stream-Processing Frameworks

4.4. Summary

5. Algorithms for data analysis

5.1. Accepting constraints and relaxing

5.2. Thinking about time

5.2.1. Sliding Window

5.2.2. Tumbling Window

5.3. Summarization techniques

5.3.1. Random Sampling

5.3.2. Counting Distinct Elements

5.3.3. Frequency

5.3.4. Membership

5.4. Summary

6. Storing the analyzed or collected data

6.1. When you need long-term storage

6.2. Keeping it In-Memory

6.2.1. Embedded In-Memory / Flash Optimized

6.2.2. Caching system

6.2.3. In-Memory Database (IMDB) and In-Memory Data Grid (IMDG)

6.3. Use case exercises

6.3.1. In-Session Personalization

6.3.2. Next Generation Energy Company

6.4. Summary

7. Making the data available

7.1. Communications Patterns

7.1.1. Data Sync

7.1.2. Remote Method Invocation (RMI) / Remote Procedure Call (RPC)

7.1.3. Simple Messaging

7.1.4. Publish-Subscribe

7.2. Protocols to use to send data to the client

7.2.1. Webhooks

7.2.2. HTTP Long Polling

7.2.3. Server-Sent Events

7.2.4. WebSockets

7.3. Filtering the stream

7.3.1. Where to filter

7.3.2. Static vs. Dynamic Filtering

7.4. Use Case: Building a Streaming API

7.5. Summary

8. Consumer device capabilities and limitations accessing the data

8.1. The core concepts

8.1.1. Reading fast enough

8.1.2. Maintaining state

8.1.3. Mitigating data loss

8.1.4. Exactly Once Processing

8.2. Introducing the Web Client

8.2.1. Integrating with the Streaming API Service

8.3. The move towards a query language

8.4. Summary


9. Building an In The Moment recommendation engine

10. Building an IoT — a tweeting San Francisco parking garage

About the Technology

There's a big difference between sipping a glass of water and drinking directly from the hydrant. In the same way, applications built to deal with streaming data present fundamentally different challenges than those that work with stored data. For example, live location data paired with a social media profile might allow a vendor to recommend a product or service to a user at just the right instant, and the split-second reaction of a pacemaker or anti-lock brakes can save lives. Emerging techniques and technologies that enable you to take immediate action on streaming data make it possible to design and build in-the-moment decision systems, dynamic reporting dashboards, live recommendation systems, and other real-time applications.

What's inside

  • Understand and architect a complete system for collecting and analyzing data in real time
  • Harness the "internet of things" by handling live data from billions of devices
  • Use the specific functions of each tier of an in-the-moment system to solve real business problems
  • Combine emerging technologies like Spark, Storm, Kafka, RabbitMQ, and WebSockets
  • Integrate and extend the Lambda architecture into a complete system

About the reader

No experience with streaming or real-time data systems required. Perfect for developers or architects, this book is also written to be accessible to technical managers and business decision makers.

About the author

Andrew Psaltis is a software engineer and architect focused full time on building massively scalable real-time analytics systems using Spark, Kafka, Storm, Hadoop, and WebSockets.

Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.
  • MEAP combo $49.99 pBook + eBook
  • MEAP eBook $39.99 pdf + ePub + kindle
