Big Data
Principles and best practices of scalable realtime data systems
Nathan Marz and James Warren
  • April 2015
  • ISBN 9781617290343
  • 328 pages
  • printed in black & white

Transcends individual tools or platforms. Required reading for anyone working with big data systems.

Jonathan Esterhazy, Groupon

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.

Table of Contents



About this book

1. A new paradigm for Big Data

1.1. How this book is structured

1.2. Scaling with a traditional database

1.2.1. Scaling with a queue

1.2.2. Scaling by sharding the database

1.2.3. Fault-tolerance issues begin

1.2.4. Corruption issues

1.2.5. What went wrong?

1.2.6. How will Big Data techniques help?

1.3. NoSQL is not a panacea

1.4. First principles

1.5. Desired properties of a Big Data system

1.5.1. Robustness and fault tolerance

1.5.2. Low latency reads and updates

1.5.3. Scalability

1.5.4. Generalization

1.5.5. Extensibility

1.5.6. Ad hoc queries

1.5.7. Minimal maintenance

1.5.8. Debuggability

1.6. The problems with fully incremental architectures

1.6.1. Operational complexity

1.6.2. Extreme complexity of achieving eventual consistency

1.6.3. Lack of human-fault tolerance

1.6.4. Fully incremental solution vs. Lambda Architecture solution

1.7. Lambda Architecture

1.7.1. Batch layer

1.7.2. Serving layer

1.7.3. Batch and serving layers satisfy almost all properties

1.7.4. Speed layer

1.8. Recent trends in technology

1.8.1. CPUs aren’t getting faster

1.8.2. Elastic clouds

1.8.3. Vibrant open source ecosystem for Big Data

1.9. Example application: SuperWebAnalytics.com

1.10. Summary

Part 1 Batch layer

2. Data model for Big Data

2.1. The properties of data

2.1.1. Data is raw

2.1.2. Data is immutable

2.1.3. Data is eternally true

2.2. The fact-based model for representing data

2.2.1. Example facts and their properties

2.2.2. Benefits of the fact-based model

2.3. Graph schemas

2.3.1. Elements of a graph schema

2.3.2. The need for an enforceable schema

2.4. A complete data model for SuperWebAnalytics.com

2.5. Summary

3. Data model for Big Data: Illustration

3.1. Why a serialization framework?

3.2. Apache Thrift

3.2.1. Nodes

3.2.2. Edges

3.2.3. Properties

3.2.4. Tying everything together into data objects

3.2.5. Evolving your schema

3.3. Limitations of serialization frameworks

3.4. Summary

4. Data storage on the batch layer

4.1. Storage requirements for the master dataset

4.2. Choosing a storage solution for the batch layer

4.2.1. Using a key/value store for the master dataset

4.2.2. Distributed filesystems

4.3. How distributed filesystems work

4.4. Storing a master dataset with a distributed filesystem

4.5. Vertical partitioning

4.6. Low-level nature of distributed filesystems

4.7. Storing the master dataset on a distributed filesystem

4.8. Summary

5. Data storage on the batch layer: Illustration

5.1. Using the Hadoop Distributed File System

5.1.1. The small-files problem

5.1.2. Towards a higher-level abstraction

5.2. Data storage in the batch layer with Pail

5.2.1. Basic Pail operations

5.2.2. Serializing objects into pails

5.2.3. Batch operations using Pail

5.2.4. Vertical partitioning with Pail

5.2.5. Pail file formats and compression

5.2.6. Summarizing the benefits of Pail

5.3. Storing the master dataset for SuperWebAnalytics.com

5.3.1. A structured pail for Thrift objects

5.3.2. A basic pail for SuperWebAnalytics.com

5.3.3. A split pail to vertically partition the dataset

5.4. Summary

6. Batch layer

6.1. Motivating examples

6.1.1. Number of pageviews over time

6.1.2. Gender inference

6.1.3. Influence score

6.2. Computing on the batch layer

6.3. Recomputation algorithms vs. incremental algorithms

6.3.1. Performance

6.3.2. Human-fault tolerance

6.3.3. Generality of the algorithms

6.3.4. Choosing a style of algorithm

6.4. Scalability in the batch layer

6.5. MapReduce: a paradigm for Big Data computing

6.5.1. Scalability

6.5.2. Fault-tolerance

6.5.3. Generality of MapReduce

6.6. Low-level nature of MapReduce

6.6.1. Multistep computations are unnatural

6.6.2. Joins are very complicated to implement manually

6.6.3. Logical and physical execution tightly coupled

6.7. Pipe diagrams: a higher-level way of thinking about batch computation

6.7.1. Concepts of pipe diagrams

6.7.2. Executing pipe diagrams via MapReduce

6.7.3. Combiner aggregators

6.7.4. Pipe diagram examples

6.8. Summary

7. Batch layer: Illustration

7.1. An illustrative example

7.2. Common pitfalls of data-processing tools

7.2.1. Custom languages

7.2.2. Poorly composable abstractions

7.3. An introduction to JCascalog

7.3.1. The JCascalog data model

7.3.2. The structure of a JCascalog query

7.3.3. Querying multiple datasets

7.3.4. Grouping and aggregators

7.3.5. Stepping through an example query

7.3.6. Custom predicate operations

7.4. Composition

7.4.1. Combining subqueries

7.4.2. Dynamically created subqueries

7.4.3. Predicate macros

7.4.4. Dynamically created predicate macros

7.5. Summary

8. An example batch layer: Architecture and algorithms

8.1. Design of the batch layer

8.1.1. Supported queries

8.1.2. Batch views

8.2. Workflow overview

8.3. Ingesting new data

8.4. URL normalization

8.5. User-identifier normalization

8.6. Deduplicate pageviews

8.7. Computing batch views

8.7.1. Pageviews over time

8.7.2. Unique visitors over time

8.7.3. Bounce-rate analysis

8.8. Summary

9. An example batch layer: Implementation

9.1. Starting point

9.2. Preparing the workflow

9.3. Ingesting new data

9.4. URL normalization

9.5. User-identifier normalization

9.6. Deduplicate pageviews

9.7. Computing batch views

9.7.1. Pageviews over time

9.7.2. Uniques over time

9.7.3. Bounce-rate analysis

9.8. Summary

Part 2 Serving layer

10. Serving layer

10.1. Performance metrics for the serving layer

10.2. The serving layer solution to the normalization/denormalization problem

10.3. Requirements for a serving layer database

10.4. Designing a serving layer for SuperWebAnalytics.com

10.4.1. Pageviews over time

10.4.2. Uniques over time

10.4.3. Bounce-rate analysis

10.5. Contrasting with a fully incremental solution

10.5.1. Fully incremental solution to uniques over time

10.5.2. Comparing to the Lambda Architecture solution

10.6. Summary

11. Serving layer: Illustration

11.1. Basics of ElephantDB

11.1.1. View creation in ElephantDB

11.1.2. View serving in ElephantDB

11.1.3. Using ElephantDB

11.2. Building the serving layer for SuperWebAnalytics.com

11.2.1. Pageviews over time

11.2.2. Uniques over time

11.2.3. Bounce-rate analysis

11.3. Summary

Part 3 Speed layer

12. Realtime views

12.1. Computing realtime views

12.2. Storing realtime views

12.2.1. Eventual accuracy

12.2.2. Amount of state stored in the speed layer

12.3. Challenges of incremental computation

12.3.1. Validity of the CAP theorem

12.3.2. The complex interaction between the CAP theorem and incremental algorithms

12.4. Asynchronous versus synchronous updates

12.5. Expiring realtime views

12.6. Summary

13. Realtime views: Illustration

13.1. Cassandra’s data model

13.2. Using Cassandra

13.2.1. Advanced Cassandra

13.3. Summary

14. Queuing and stream processing

14.1. Queuing

14.1.1. Single-consumer queue servers

14.1.2. Multi-consumer queues

14.2. Stream processing

14.2.1. Queues and workers

14.2.2. Queues-and-workers pitfalls

14.3. Higher-level, one-at-a-time stream processing

14.3.1. Storm model

14.3.2. Guaranteeing message processing

14.4. SuperWebAnalytics.com speed layer

14.4.1. Topology structure

14.5. Summary

15. Queuing and stream processing: Illustration

15.1. Defining topologies with Apache Storm

15.2. Apache Storm clusters and deployment

15.3. Guaranteeing message processing

15.4. Implementing the uniques-over-time speed layer

15.5. Summary

16. Micro-batch stream processing

16.1. Achieving exactly-once semantics

16.1.1. Strongly ordered processing

16.1.2. Micro-batch stream processing

16.1.3. Micro-batch processing topologies

16.2. Core concepts of micro-batch stream processing

16.3. Extending pipe diagrams for micro-batch processing

16.4. Finishing the speed layer for SuperWebAnalytics.com

16.4.1. Pageviews over time

16.4.2. Bounce-rate analysis

16.5. Another look at the bounce-rate-analysis example

16.6. Summary

17. Micro-batch stream processing: Illustration

17.1. Using Trident

17.2. Finishing the SuperWebAnalytics.com speed layer

17.2.1. Pageviews over time

17.2.2. Bounce-rate analysis

17.3. Fully fault-tolerant, in-memory, micro-batch processing

17.4. Summary

18. Lambda Architecture in depth

18.1. Defining data systems

18.2. Batch and serving layers

18.2.1. Incremental batch processing

18.2.2. Measuring and optimizing batch layer resource usage

18.3. Speed layer

18.4. Query layer

18.5. Summary


About the book

Web-scale applications like social networks, real-time analytics, or e-commerce sites deal with a lot of data, whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size or speed. Fortunately, scale and simplicity are not mutually exclusive.

Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases.
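To make the layered approach concrete, here is a minimal sketch of the query-time merge at the heart of the Lambda Architecture, using the book's pageviews-over-time example. The class and method names (PageviewQuery, loadBatchCount, recordRecentPageview) are hypothetical illustrations, not APIs from the book or from Hadoop, Storm, or any particular database.

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the Lambda Architecture query pattern: a query is answered by
// combining a precomputed batch view with a small, incrementally updated
// realtime view maintained by the speed layer.
public class PageviewQuery {

    // Batch view: recomputed from the entire master dataset (e.g. by a
    // MapReduce job). Complete and simple to reason about, but hours stale.
    private final Map<String, Long> batchView = new HashMap<>();

    // Realtime view: updated incrementally as pageviews arrive, covering only
    // the data the most recent batch run hasn't absorbed yet.
    private final Map<String, Long> realtimeView = new HashMap<>();

    public void loadBatchCount(String url, long count) {
        batchView.put(url, count);
    }

    public void recordRecentPageview(String url) {
        realtimeView.merge(url, 1L, Long::sum);
    }

    // Queries merge the two views; once a new batch run catches up, the
    // corresponding realtime state can simply be discarded.
    public long pageviews(String url) {
        return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
    }

    public static void main(String[] args) {
        PageviewQuery query = new PageviewQuery();
        query.loadBatchCount("http://example.com/", 1_000_000L); // from the last batch run
        query.recordRecentPageview("http://example.com/");       // arrived since that run
        System.out.println(query.pageviews("http://example.com/")); // prints 1000001
    }
}
```

The design choice this illustrates is the one the book keeps returning to: the batch layer stays simple and human-fault tolerant because it recomputes from immutable raw data, while the speed layer only has to compensate for the last few hours of data.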

What's inside

  • Introduction to big data systems
  • Real-time processing of web-scale data
  • Tools like Hadoop, Cassandra, and Storm
  • Extensions to traditional database skills

About the reader

This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.

About the author

Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning and scientific computing.

combo $49.99 pBook + eBook
eBook $39.99 pdf + ePub + kindle

FREE domestic shipping on three or more pBooks

A comprehensive, example-driven tour of the Lambda Architecture with its originator as your guide.

Mark Fisher, Pivotal

Contains wisdom that can only be gathered after tackling many big data projects. A must-read.

Pere Ferrera Bertran, Datasalt

The de facto guide to streamlining your data pipeline in batch and near-real time.

Alex Holmes, Author of "Hadoop in Practice"