Spark in Action
Petar Zečević and Marko Bonaći
  • MEAP began April 2015
  • Publication in November 2016 (estimated)
  • ISBN 9781617292606
  • 450 pages (estimated)
  • printed in black & white

Working with big data can be complex and challenging, in part because of the multiple analysis frameworks and tools required. Apache Spark is a big data processing framework perfect for analyzing near-real-time streams and discovering historical patterns in batched data sets. But Spark goes much further than other frameworks. By including machine learning and graph processing capabilities, it makes many specialized data processing platforms obsolete. Spark's unified framework and programming model significantly lower the initial infrastructure investment, and Spark's core abstractions are intuitive for most Scala, Java, and Python developers.

Spark in Action teaches you to use Spark for stream and batch data processing. It starts with an introduction to the Spark architecture and ecosystem, followed by a taste of Spark's command-line interface. You then discover the most fundamental concepts and abstractions of Spark, particularly Resilient Distributed Datasets (RDDs) and the basic data transformations that RDDs provide. The first part of the book also introduces you to writing Spark applications using the core APIs. Next, you learn about the different Spark components: how to work with structured data using Spark SQL, how to process near-real-time data with Spark Streaming, how to apply machine learning algorithms with Spark MLlib, and how to apply graph algorithms on graph-shaped data using Spark GraphX. You also get a clear introduction to Spark clustering.
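
To give a flavor of those core abstractions, here is a minimal sketch (not taken from the book) of the kind of basic RDD transformations covered in part 1. It assumes a running spark-shell session, where sc (the SparkContext) is predefined, and a hypothetical local text file named licenses.txt:

    // Assumes the spark-shell, where sc (SparkContext) is already available,
    // and a hypothetical local file named licenses.txt.
    val lines = sc.textFile("licenses.txt")        // RDD[String], one element per line

    // flatMap splits lines into words, map normalizes case, and distinct
    // removes duplicates -- all of these are lazy transformations.
    val words = lines.flatMap(line => line.split(" "))
                     .map(word => word.toLowerCase)
                     .distinct()

    // count() is an action: it triggers the actual computation.
    println("Unique words: " + words.count())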

Table of Contents

Part 1: First Steps

1. Introduction to Apache Spark

1.1. What is Spark?

1.1.1. The Spark revolution

1.1.2. MapReduce's shortcomings

1.1.3. What Spark brings to the table

1.2. Spark components

1.2.1. Spark Core

1.2.2. Spark SQL

1.2.3. Spark Streaming

1.2.4. Spark MLlib

1.2.5. Spark GraphX

1.3. Spark program flow

1.4. Spark ecosystem

1.5. Setting up the spark-in-action virtual machine

1.5.1. Downloading and starting the virtual machine

1.5.2. Stopping the virtual machine

1.6. Summary

2. Spark fundamentals

2.1. Using the spark-in-action virtual machine

2.1.1. Cloning the Spark in Action GitHub repository

2.1.2. Finding Java

2.1.3. Using the virtual machine’s Hadoop installation

2.1.4. Examining the virtual machine’s Spark installation

2.2. Using Spark shell and writing your first Spark program

2.2.1. Starting Spark shell

2.2.2. The first Spark code example

2.2.3. The notion of Resilient Distributed Dataset

2.3. Basic RDD actions and transformations

2.3.1. Using the map transformation

2.3.2. Using distinct and flatMap transformations

2.3.3. Obtaining RDD’s elements with sample, take and takeSample operations

2.4. Double RDD functions

2.4.1. Basic statistics with double RDD functions

2.4.2. Visualizing data distribution with histogram

2.4.3. Approximate sum and mean

2.5. Summary

3. Writing Spark applications

3.1. Generating a new Spark project in Eclipse

3.2. Developing the application

3.2.1. Preparing the GitHub Archive data set

3.2.2. Loading JSON

3.2.3. Running the application from Eclipse

3.2.4. Aggregating the data

3.2.5. Excluding non-employees

3.2.6. Broadcast variables

3.2.7. Using the whole data set

3.3. Submitting the application

3.3.1. Building the "uberjar"

3.3.2. Adapting the application

3.3.3. Using spark-submit

3.4. Summary

4. The Spark API in depth

4.1. Working with pair RDDs

4.1.1. Creating pair RDDs

4.1.2. Basic pair RDD functions

4.2. Understanding data partitioning and reducing data shuffling

4.2.1. Using Spark's data partitioners

4.2.2. Understanding and avoiding unnecessary shuffling

4.2.3. Repartitioning RDDs

4.2.4. Mapping data inside partitions

4.3. Joining, sorting, and grouping data

4.3.1. Joining data

4.3.2. Sorting data

4.3.3. Grouping data

4.4. Understanding RDD dependencies

4.4.1. RDD dependencies and Spark execution

4.4.2. Spark stages and tasks

4.4.3. Saving RDD lineage with checkpointing

4.5. Using accumulators and broadcast variables for communicating with Spark executors

4.5.1. Obtaining data from executors with accumulators

4.5.2. Sending data to executors using broadcast variables

4.6. Summary

Part 2: Meet the Spark Family

5. Sparkling queries with Spark SQL

5.1. Working with DataFrames

5.1.1. Creating DataFrames from RDDs

5.1.2. DataFrame API basics

5.1.3. Using SQL functions to perform calculations on the data

5.1.4. Working with missing values

5.1.5. Converting DataFrames to RDDs

5.1.6. Grouping and joining data

5.1.7. Performing joins

5.2. Using SQL commands

5.2.1. Table catalog and Hive metastore

5.2.2. Executing SQL queries

5.2.3. Connecting to Spark SQL through Thrift server

5.3. Saving and loading DataFrame data

5.3.1. Built-in data sources

5.3.2. Saving data

5.3.3. Loading data

5.4. Catalyst optimizer

5.5. Performance improvements with Tungsten

5.6. Beyond DataFrames: introducing DataSets

5.7. Summary

6. Ingesting data with Spark Streaming

6.1. Writing Spark streaming applications

6.1.1. Introducing the example application

6.1.2. Creating a streaming context

6.1.3. Creating a discretized stream

6.1.4. Using discretized streams

6.1.5. Saving the results to a file

6.1.6. Starting and stopping the streaming computation

6.1.7. Saving computation state over time

6.1.8. Using window operations for time-limited calculations

6.1.9. Examining the other built-in input streams

6.2. Using external data sources

6.2.1. Setting up Kafka

6.2.2. Changing the streaming application to use Kafka

6.3. Performance of Spark Streaming jobs

6.3.1. Obtaining good performance

6.4. Achieving fault-tolerance

6.5. Summary

7. Getting smart with MLlib

7.1. Introduction to machine learning

7.1.1. Definition of machine learning

7.1.2. Classification of machine learning algorithms

7.1.3. Machine learning with Spark

7.2. Linear algebra in Spark

7.2.1. Local vector and matrix implementations

7.2.2. Distributed matrices

7.3. Linear regression

7.3.1. About linear regression

7.3.2. Simple linear regression

7.3.3. Expanding the model to multiple linear regression

7.4. Analyzing and preparing the data

7.4.1. Analyzing data distribution

7.4.2. Analyzing column cosine similarities

7.4.3. Computing the covariance matrix

7.4.4. Transforming to labeled points

7.4.5. Splitting the data

7.4.6. Feature scaling and mean normalization

7.5. Fitting and using a linear regression model

7.5.1. Predicting the target values

7.5.2. Evaluating the model's performance

7.5.3. Interpreting the model parameters

7.5.4. Loading and saving the model

7.6. Tweaking the algorithm

7.6.1. Finding the right step size and the number of iterations

7.6.2. Adding higher order polynomials

7.6.3. Bias-variance tradeoff and model complexity

7.6.4. Plotting residual plots

7.6.5. Avoiding overfitting by using regularization

7.6.6. K-fold cross-validation

7.7. Optimizing linear regression

7.7.1. Mini-batch stochastic gradient descent

7.7.2. LBFGS optimizer

7.8. Summary

8. ML: classification and clustering

8.1. Spark ML library

8.1.1. Estimators, transformers and evaluators

8.1.2. ML parameters

8.1.3. ML pipelines

8.2. Logistic regression

8.2.1. Binary logistic regression model

8.2.2. Preparing data for using logistic regression in Spark

8.2.3. Training the model

8.2.4. Evaluating classification models

8.2.5. Performing k-fold cross-validation

8.2.6. Multiclass logistic regression

8.3. Decision trees and random forest

8.3.1. Decision Tree

8.3.2. Random forest

8.4. Using k-means clustering

8.4.1. K-means clustering

8.5. Summary

9. Connecting the dots with GraphX

9.1. Graph processing with Spark

9.1.1. Constructing graphs using GraphX API

9.1.2. Transforming graphs

9.2. Graph algorithms

9.2.1. Presentation of the data set

9.2.2. Shortest paths algorithm

9.2.3. Page rank

9.2.4. Connected components

9.2.5. Strongly connected components

9.3. Implementing A* search algorithm

9.3.1. Understanding the A* algorithm

9.3.2. Implementing the A* algorithm

9.3.3. Testing the implementation

9.4. Summary

Part 3: Spark Ops

10. Running Spark

10.1. An overview of Spark’s runtime architecture

10.1.1. Spark runtime components

10.1.2. Spark cluster types

10.2. Job and resource scheduling

10.2.1. Cluster resource scheduling

10.2.2. Spark job scheduling

10.2.3. Data locality considerations

10.2.4. Spark memory scheduling

10.3. Configuring Spark

10.3.1. Spark configuration file

10.3.2. Command-line parameters

10.3.3. System environment variables

10.3.4. Setting configuration programmatically

10.3.5. The master parameter

10.3.6. Viewing all configured parameters

10.4. Spark Web UI

10.4.1. Jobs page

10.4.2. Stages page

10.4.3. Storage page

10.4.4. Environment page

10.5. Running Spark on the local machine

10.5.1. Local mode

10.5.2. Local cluster mode

10.6. Summary

11. Running on a Spark standalone cluster

11.1. Spark standalone cluster components

11.2. Starting the standalone cluster

11.2.1. Starting the cluster with shell scripts

11.2.2. Starting the cluster manually

11.2.3. Viewing Spark processes

11.2.4. Standalone master high availability and recovery

11.3. Standalone cluster Web UI

11.4. Running applications in a standalone cluster

11.4.1. Location of the driver

11.4.2. Specifying the number of executors

11.4.3. Specifying extra classpath entries and files

11.4.4. Killing applications

11.4.5. Application automatic restart

11.5. Spark History Server and event logging

11.6. Running on Amazon EC2

11.6.1. Prerequisites

11.6.2. Creating an EC2 standalone cluster

11.6.3. Using the EC2 cluster

11.6.4. Destroying the cluster

11.7. Summary

12. Running on YARN and Mesos

12.1. Running Spark on YARN

12.1.1. YARN architecture

12.1.2. Installing, configuring, and starting YARN

12.1.3. Resource scheduling inside YARN

12.1.4. Submitting Spark applications to YARN

12.1.5. Configuring Spark on YARN

12.1.6. Configuring resources for Spark jobs

12.1.7. YARN UI

12.1.8. Finding logs on YARN

12.1.9. Security considerations

12.1.10. Dynamic resource allocation

12.2. Running Spark on Mesos

12.2.1. Mesos architecture

12.2.2. Installing and configuring Mesos

12.2.3. Mesos Web UI

12.2.4. Mesos resource scheduling

12.2.5. Submitting Spark applications to Mesos

12.2.6. Running Spark with Docker

12.3. Summary

Part 4: Bringing it Together

13. Case study: Real-time dashboard

13.1. Understanding the use case

13.1.1. The overall picture

13.1.2. Understanding the application's components

13.2. Running the application

13.2.1. Starting the application inside the spark-in-action virtual machine

13.2.2. Starting the application manually

13.3. Understanding the source code

13.3.1. The KafkaLogsSimulator project

13.3.2. The StreamingLogAnalyzer project

13.3.3. The WebStatsDashboard project

13.3.4. Building the projects

13.4. Summary

14. Deep learning on Spark with H2O

14.1. What is deep learning?

14.2. Using H2O with Spark

14.2.1. What is H2O?

14.2.2. Starting Sparkling Water on Spark

14.2.3. Starting the H2O cluster

14.2.4. Accessing Flow UI

14.3. Performing regression with H2O’s deep learning

14.3.1. Loading data into an H2O frame

14.3.2. Building and evaluating a deep learning model using Flow UI

14.3.3. Building and evaluating a deep learning model using Sparkling Water API

14.4. Performing classification with H2O’s deep learning

14.4.1. Loading and splitting the data

14.4.2. Building the model through Flow UI

14.4.3. Building the model with Sparkling Water API

14.4.4. Stopping the H2O cluster

14.5. Summary


Appendix A: Installing Apache Spark

A.1. Prerequisites: Installing JDK

A.1.1. Setting the JAVA_HOME environment variable

A.1.2. Downloading and configuring Spark

A.1.3. Spark shell

Appendix B: Understanding MapReduce

Appendix C: A primer on linear algebra

What's inside

  • Spark code in Java, Scala and Python
  • Spark installation and configuration guided tour
  • Scaffolding a new Eclipse Spark Scala project (zero-configuration)
  • How to process structured data with Spark SQL
  • How to ingest data with Spark Streaming
  • Mastering graph computation with GraphX
  • Two real-life case studies
  • Configuring, monitoring and tuning Spark
  • Spark DevOps with Docker

About the reader

Readers should be familiar with Java, Scala, or Python. No knowledge of Spark or streaming operations is assumed, but some acquaintance with machine learning is helpful.

About the authors

Petar Zečević is the CTO of SV Group. Over the last 14 years he has worked on various projects as a Java developer, team leader, consultant, and software specialist. He is the founder and, together with Marko, the organizer of the popular Spark@Zg meetup group.

Marko Bonaći has worked with Java for 13 years. He works at Sematext as a Spark developer and consultant. Before that, he was the team lead for SV Group's IBM Enterprise Content Management team.

Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
  • MEAP combo $49.99 pBook + eBook
  • MEAP eBook $39.99 pdf + ePub + kindle

FREE domestic shipping on three or more pBooks