Spark in Action
Petar Zečević and Marko Bonaći
  • November 2016
  • ISBN 9781617292606
  • 472 pages
  • printed in black & white

Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide.

Jonathan Sharley, Pandora Media

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.

Table of Contents

Part 1: First Steps

1. Introduction to Apache Spark

1.1. What is Spark?

1.1.1. The Spark revolution

1.1.2. MapReduce's shortcomings

1.1.3. What Spark brings to the table

1.2. Spark components

1.2.1. Spark Core

1.2.2. Spark SQL

1.2.3. Spark Streaming

1.2.4. Spark MLlib

1.2.5. Spark GraphX

1.3. Spark program flow

1.4. Spark ecosystem

1.5. Setting up the spark-in-action virtual machine

1.5.1. Downloading and starting the virtual machine

1.5.2. Stopping the virtual machine

1.6. Summary

2. Spark fundamentals

2.1. Using the spark-in-action virtual machine

2.1.1. Cloning the Spark in Action GitHub repository

2.1.2. Finding Java

2.1.3. Using the virtual machine’s Hadoop installation

2.1.4. Examining the virtual machine’s Spark installation

2.2. Using Spark shell and writing your first Spark program

2.2.1. Starting Spark shell

2.2.2. The first Spark code example

2.2.3. The notion of Resilient Distributed Dataset

2.3. Basic RDD actions and transformations

2.3.1. Using the map transformation

2.3.2. Using distinct and flatMap transformations

2.3.3. Obtaining an RDD’s elements with sample, take, and takeSample operations

2.4. Double RDD functions

2.4.1. Basic statistics with double RDD functions

2.4.2. Visualizing data distribution with histogram

2.4.3. Approximate sum and mean

2.5. Summary

3. Writing Spark applications

3.1. Generating a new Spark project in Eclipse

3.2. Developing the application

3.2.1. Preparing the GitHub Archive data set

3.2.2. Loading JSON

3.2.3. Running the application from Eclipse

3.2.4. Aggregating the data

3.2.5. Excluding non-employees

3.2.6. Broadcast variables

3.2.7. Using the whole data set

3.3. Submitting the application

3.3.1. Building the "uberjar"

3.3.2. Adapting the application

3.3.3. Using spark-submit

3.4. Summary

4. The Spark API in depth

4.1. Working with pair RDDs

4.1.1. Creating pair RDDs

4.1.2. Basic pair RDD functions

4.2. Understanding data partitioning and reducing data shuffling

4.2.1. Using Spark's data partitioners

4.2.2. Understanding and avoiding unnecessary shuffling

4.2.3. Repartitioning RDDs

4.2.4. Mapping data inside partitions

4.3. Joining, sorting, and grouping data

4.3.1. Joining data

4.3.2. Sorting data

4.3.3. Grouping data

4.4. Understanding RDD dependencies

4.4.1. RDD dependencies and Spark execution

4.4.2. Spark stages and tasks

4.4.3. Saving RDD lineage with checkpointing

4.5. Using accumulators and broadcast variables for communicating with Spark executors

4.5.1. Obtaining data from executors with accumulators

4.5.2. Sending data to executors using broadcast variables

4.6. Summary

Part 2: Meet the Spark Family

5. Sparkling queries with Spark SQL

5.1. Working with DataFrames

5.1.1. Creating DataFrames from RDDs

5.1.2. DataFrame API basics

5.1.3. Using SQL functions to perform calculations on the data

5.1.4. Working with missing values

5.1.5. Converting DataFrames to RDDs

5.1.6. Grouping and joining data

5.1.7. Performing joins

5.2. Beyond DataFrames: introducing DataSets

5.3. Using SQL commands

5.3.1. Table catalog and Hive metastore

5.3.2. Executing SQL queries

5.3.3. Connecting to Spark SQL through Thrift server

5.4. Saving and loading DataFrame data

5.4.1. Built-in data sources

5.4.2. Saving data

5.4.3. Loading data

5.5. Catalyst optimizer

5.6. Performance improvements with Tungsten

5.7. Summary

6. Ingesting data with Spark Streaming

6.1. Writing Spark Streaming applications

6.1.1. Introducing the example application

6.1.2. Creating a streaming context

6.1.3. Creating a discretized stream

6.1.4. Using discretized streams

6.1.5. Saving the results to a file

6.1.6. Starting and stopping the streaming computation

6.1.7. Saving computation state over time

6.1.8. Using window operations for time-limited calculations

6.1.9. Examining the other built-in input streams

6.2. Using external data sources

6.2.1. Setting up Kafka

6.2.2. Changing the streaming application to use Kafka

6.3. Performance of Spark Streaming jobs

6.3.1. Obtaining good performance

6.3.2. Achieving fault-tolerance

6.4. Structured Streaming

6.4.1. Creating a streaming DataFrame

6.4.2. Outputting streaming data

6.4.3. Examining streaming executions

6.4.4. Future direction of structured streaming

6.5. Summary

7. Getting smart with MLlib

7.1. Introduction to machine learning

7.1.1. Definition of machine learning

7.1.2. Classification of machine learning algorithms

7.1.3. Machine learning with Spark

7.2. Linear algebra in Spark

7.2.1. Local vector and matrix implementations

7.2.2. Distributed matrices

7.3. Linear regression

7.3.1. About linear regression

7.3.2. Simple linear regression

7.3.3. Expanding the model to multiple linear regression

7.4. Analyzing and preparing the data

7.4.1. Analyzing data distribution

7.4.2. Analyzing column cosine similarities

7.4.3. Computing the covariance matrix

7.4.4. Transforming to labeled points

7.4.5. Splitting the data

7.4.6. Feature scaling and mean normalization

7.5. Fitting and using a linear regression model

7.5.1. Predicting the target values

7.5.2. Evaluating the model's performance

7.5.3. Interpreting the model parameters

7.5.4. Loading and saving the model

7.6. Tweaking the algorithm

7.6.1. Finding the right step size and the number of iterations

7.6.2. Adding higher order polynomials

7.6.3. Bias-variance tradeoff and model complexity

7.6.4. Plotting residual plots

7.6.5. Avoiding overfitting by using regularization

7.6.6. K-fold cross-validation

7.7. Optimizing linear regression

7.7.1. Mini-batch stochastic gradient descent

7.7.2. LBFGS optimizer

7.8. Summary

8. ML: classification and clustering

8.1. Spark ML library

8.1.1. Estimators, transformers and evaluators

8.1.2. ML parameters

8.1.3. ML pipelines

8.2. Logistic regression

8.2.1. Binary logistic regression model

8.2.2. Preparing data for using logistic regression in Spark

8.2.3. Training the model

8.2.4. Evaluating classification models

8.2.5. Performing k-fold cross-validation

8.2.6. Multiclass logistic regression

8.3. Decision trees and random forest

8.3.1. Decision tree

8.3.2. Random forest

8.4. Using k-means clustering

8.4.1. K-means clustering

8.5. Summary

9. Connecting the dots with GraphX

9.1. Graph processing with Spark

9.1.1. Constructing graphs using GraphX API

9.1.2. Transforming graphs

9.2. Graph algorithms

9.2.1. Presentation of the data set

9.2.2. Shortest paths algorithm

9.2.3. PageRank

9.2.4. Connected components

9.2.5. Strongly connected components

9.3. Implementing A* search algorithm

9.3.1. Understanding the A* algorithm

9.3.2. Implementing the A* algorithm

9.3.3. Testing the implementation

9.4. Summary

Part 3: Spark Ops

10. Running Spark

10.1. An overview of Spark’s runtime architecture

10.1.1. Spark runtime components

10.1.2. Spark cluster types

10.2. Job and resource scheduling

10.2.1. Cluster resource scheduling

10.2.2. Spark job scheduling

10.2.3. Data locality considerations

10.2.4. Spark memory scheduling

10.3. Configuring Spark

10.3.1. Spark configuration file

10.3.2. Command-line parameters

10.3.3. System environment variables

10.3.4. Setting configuration programmatically

10.3.5. The master parameter

10.3.6. Viewing all configured parameters

10.4. Spark Web UI

10.4.1. Jobs page

10.4.2. Stages page

10.4.3. Storage page

10.4.4. Environment page

10.4.5. Executors page

10.5. Running Spark on the local machine

10.5.1. Local mode

10.5.2. Local cluster mode

10.6. Summary

11. Running on a Spark standalone cluster

11.1. Spark standalone cluster components

11.2. Starting the standalone cluster

11.2.1. Starting the cluster with shell scripts

11.2.2. Starting the cluster manually

11.2.3. Viewing Spark processes

11.2.4. Standalone master high availability and recovery

11.3. Standalone cluster Web UI

11.4. Running applications in a standalone cluster

11.4.1. Location of the driver

11.4.2. Specifying the number of executors

11.4.3. Specifying extra classpath entries and files

11.4.4. Killing applications

11.4.5. Application automatic restart

11.5. Spark History Server and event logging

11.6. Running on Amazon EC2

11.6.1. Prerequisites

11.6.2. Creating an EC2 standalone cluster

11.6.3. Using the EC2 cluster

11.6.4. Destroying the cluster

11.7. Summary

12. Running on YARN and Mesos

12.1. Running Spark on YARN

12.1.1. YARN architecture

12.1.2. Installing, configuring, and starting YARN

12.1.3. Resource scheduling inside YARN

12.1.4. Submitting Spark applications to YARN

12.1.5. Configuring Spark on YARN

12.1.6. Configuring resources for Spark jobs

12.1.7. YARN UI

12.1.8. Finding logs on YARN

12.1.9. Security considerations

12.1.10. Dynamic resource allocation

12.2. Running Spark on Mesos

12.2.1. Mesos architecture

12.2.2. Installing and configuring Mesos

12.2.3. Mesos Web UI

12.2.4. Mesos resource scheduling

12.2.5. Submitting Spark applications to Mesos

12.2.6. Running Spark with Docker

12.3. Summary

Part 4: Bringing it Together

13. Case study: Real-time dashboard

13.1. Understanding the use case

13.1.1. The overall picture

13.1.2. Understanding the application's components

13.2. Running the application

13.2.1. Starting the application inside the spark-in-action virtual machine

13.2.2. Starting the application manually

13.3. Understanding the source code

13.3.1. The KafkaLogsSimulator project

13.3.2. The StreamingLogAnalyzer project

13.3.3. The WebStatsDashboard project

13.3.4. Building the projects

13.4. Summary

14. Deep learning on Spark with H2O

14.1. What is deep learning?

14.2. Using H2O with Spark

14.2.1. What is H2O?

14.2.2. Starting Sparkling Water on Spark

14.2.3. Starting the H2O cluster

14.2.4. Accessing Flow UI

14.3. Performing regression with H2O’s deep learning

14.3.1. Loading data into an H2O frame

14.3.2. Building and evaluating a deep learning model using Flow UI

14.3.3. Building and evaluating a deep learning model using Sparkling Water API

14.4. Performing classification with H2O’s deep learning

14.4.1. Loading and splitting the data

14.4.2. Building the model through Flow UI

14.4.3. Building the model with Sparkling Water API

14.4.4. Stopping the H2O cluster

14.5. Summary


Appendix A: Installing Apache Spark

A.1. Prerequisites: Installing JDK

A.1.1. Setting the JAVA_HOME environment variable

A.2. Downloading and configuring Spark

A.3. Spark shell

Appendix B: Understanding MapReduce

Appendix C: A primer on linear algebra

About the Technology

Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. Spark 2 also adds improved programming APIs, better performance, and countless other upgrades.

About the book

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code.

What's inside

  • Updated for Spark 2.0
  • Real-life case studies
  • Spark DevOps with Docker
  • Examples in Scala, with Java and Python versions available online

About the reader

Written for experienced programmers with some background in big data or machine learning.

About the authors

Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community.

combo $49.99 pBook + eBook
eBook $39.99 pdf + ePub + kindle

FREE domestic shipping on three or more pBooks

Must-have! Speed up your learning of Spark as a distributed computing framework.

Robert Ormandi, Yahoo!

An ambitiously comprehensive overview of Spark and its diverse ecosystem.

Jonathan Miller, Optensity

An easy-to-follow, step-by-step guide.

Gaurav Bhardwaj, 3Pillar Global