Spark in Action, Second Edition
Covers Apache Spark 3 with Examples in Java, Python, and Scala
Jean-Georges Perrin
Foreword by Rob Thomas
  • May 2020
  • ISBN 9781617295522
  • 576 pages
  • printed in black & white

This book reveals the tools and secrets you need to drive innovation in your company or community.

Rob Thomas, IBM
The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with features including real-time computation, lazy evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.

About the Technology

Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.

About the book

Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.
Table of Contents

Part 1. The theory crippled by awesome examples

1 So, what is Spark, anyway?

1.1 The big picture: What Spark is and what it does

1.1.1 What is Spark?

1.1.2 The four pillars of mana

1.2 How can you use Spark?

1.3 What can I do with Spark?

1.3.1 Spark predicts restaurant quality at NC eateries

1.3.2 Spark allows fast data transfer for Lumeris

1.3.3 Spark analyzes equipment logs for CERN

1.3.4 Other use cases

1.4 Why you will love the dataframe

1.4.1 The dataframe from a Java perspective

1.4.2 The dataframe from an RDBMS perspective

1.4.3 A graphical representation of the dataframe

1.5 Your first example

1.5.2 Downloading the code

1.5.3 Running your first application

1.5.4 Your first code


2 Architecture and flow

2.1 Building your mental model

2.2 Using Java code to build your mental model

2.3 Walking through your application

2.3.1 Connecting to a master

2.3.2 Loading, or ingesting, the CSV file

2.3.3 Transforming your data

2.3.4 Saving the work done in your dataframe to a database


3 The majestic role of the dataframe

3.1 The essential role of the dataframe in Spark

3.1.1 Organization of a dataframe

3.1.2 Immutability is not a swear word

3.2 Using dataframes through examples

3.2.1 A dataframe after a simple CSV ingestion

3.2.2 Data is stored in partitions

3.2.3 Digging in the schema

3.2.4 A dataframe after a JSON ingestion

3.2.5 Combining two dataframes

3.3 The dataframe is a Dataset<Row>

3.3.1 Reusing your POJOs

3.3.2 Creating a dataset of strings

3.3.3 Converting back and forth

3.4 Dataframe’s ancestor: the RDD

3.5 Summary

4 Fundamentally lazy

4.1 A real-life example of efficient laziness

4.2 A Spark example of efficient laziness

4.2.1 Looking at the results of transformations and actions

4.2.2 The transformation process, step by step

4.2.3 The code behind the transformation/action process

4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms

4.2.5 The mystery behind the timing of actions

4.3 Comparing to RDBMS and traditional applications

4.3.1 Working with the teen birth rates dataset

4.3.2 Analyzing differences between a traditional app and a Spark app

4.4 Spark is amazing for data-focused applications

4.5 Catalyst is your app catalyzer


5 Building a simple app for deployment

5.1 An ingestion-less example

5.1.1 Calculating π

5.1.2 The code to approximate π

5.1.3 What are lambda functions in Java?

5.1.4 Approximating π by using lambda functions

5.2 Interacting with Spark

5.2.1 Local mode

5.2.2 Cluster mode

5.2.3 Interactive mode in Scala and Python
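
Sections 5.1.1–5.1.4 build a π approximation out of Java lambda functions before distributing it with Spark. The underlying Monte Carlo technique can be sketched in plain Java (no Spark required); the class and method names below are illustrative, not the book's actual listings:

```java
// A minimal sketch of Monte Carlo pi approximation using Java lambdas,
// the technique chapter 5 parallelizes with Spark. Names are hypothetical.
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class PiApproximation {

  // Throw `darts` random points at the unit square; the fraction landing
  // inside the quarter circle (x^2 + y^2 <= 1) approximates pi/4.
  public static double approximatePi(int darts) {
    long inside = IntStream.range(0, darts)
        .filter(i -> {                 // lambda: is this dart inside the arc?
          double x = ThreadLocalRandom.current().nextDouble();
          double y = ThreadLocalRandom.current().nextDouble();
          return x * x + y * y <= 1.0;
        })
        .count();
    return 4.0 * inside / darts;
  }

  public static void main(String[] args) {
    System.out.println("pi is roughly " + approximatePi(1_000_000));
  }
}
```

With a million darts the estimate typically lands within a few thousandths of π; Spark's contribution, covered in the chapter, is spreading those darts across a cluster.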


6 Deploying your simple app

6.1 Beyond the example: the role of the components

6.1.1 Quick overview of the components and their interaction

6.1.2 Some of the fine print of the Spark architecture

6.1.3 Going further

6.2 Building a cluster

6.2.1 Building a cluster that works for you

6.2.2 Setting up the environment

6.3 Building your application to run on the cluster

6.3.1 Building your application’s uber JAR

6.3.2 Building your application using Git and Maven

6.4 Running your application on the cluster

6.4.1 Submitting the uber JAR

6.4.2 Running the application

6.4.3 Analyzing the Spark user interface


Part 2. Ingestion

7 Ingestion from files

7.1 Common behaviors of parsers

7.2 Complex ingestion from CSV

7.2.1 Desired output

7.2.2 Code

7.3 Ingesting a CSV with a known schema

7.3.1 Desired output

7.3.2 Code

7.4 Ingesting a JSON file

7.4.1 Desired output

7.4.2 Code

7.5 Ingesting a multiline JSON file

7.5.1 Desired output

7.5.2 Code

7.6 Ingesting an XML file

7.6.1 Desired output

7.6.2 Code

7.7 Ingesting a text file

7.7.1 Desired output

7.7.2 Code

7.8 File formats for Big Data

7.8.1 The problem with traditional file formats

7.8.2 Avro is a schema-based serialization format

7.8.3 ORC is a columnar storage format

7.8.4 Parquet is also a columnar storage format

7.8.5 Comparing Avro, ORC, and Parquet

7.9 Ingesting Avro, ORC, and Parquet files

7.9.1 Ingesting Avro

7.9.2 Ingesting ORC

7.9.3 Ingesting Parquet

7.9.4 Ingesting Avro, ORC, or Parquet reference table


8 Ingestion from databases

8.1 Ingestion from relational databases

8.1.1 Database connection checklist

8.1.2 Understanding the data used in the examples

8.1.3 Desired output

8.1.4 Code

8.1.5 Alternative code

8.2 The role of the dialect

8.2.1 What is a dialect anyway?

8.2.2 JDBC dialects provided with Spark

8.2.3 Building your own dialect

8.3 Advanced queries and ingestion

8.3.1 Filtering using a where clause

8.3.2 Joining data in the database

8.3.3 Ingestion and partitioning

8.3.4 Summary of advanced features

8.4 Ingestion from Elasticsearch

8.4.1 Data flow

8.4.2 The New York restaurants dataset digested by Spark

8.4.3 Code to ingest the restaurant dataset from Elasticsearch


9 Advanced ingestion: finding data sources and building your own

9.1 What is a data source?

9.2 Benefits of a direct connection to a data source

9.2.1 Temporary files

9.2.2 Data quality scripts

9.2.3 Get data on demand

9.3 Finding data sources at Spark Packages

9.4 Building your own data source

9.4.1 Scope of the example project

9.4.2 Your data source API and options

9.5 Behind the scenes: Building the data source itself

9.6 Using the register file and the advertiser class

9.7 Understanding the relationship between the data and schema

9.7.1 The data source builds the relation

9.7.2 Inside the relation

9.8 Building the schema from a JavaBean

9.9 Building the dataframe is magic with the utilities

9.10 The other classes


10 Ingestion through structured streaming

10.1 What’s streaming?

10.2 Creating your first stream

10.2.1 Generating a file stream

10.2.2 Consuming the records

10.2.3 Getting records, not lines

10.3 Ingesting data from network streams

10.4 Dealing with multiple streams

10.5 Differentiating discretized and structured streaming


Part 3. Transforming your data

11 Working with SQL

11.1 Working with Spark SQL

11.2 The difference between local and global views

11.3 Mixing the dataframe API and Spark SQL

11.4 Don’t DELETE it!

11.5 Going further with SQL


12 Transforming your data

12.1 What is data transformation?

12.2 Process and example of record-level transformation

12.2.1 Data discovery to understand the complexity

12.2.2 Data mapping to draw the process

12.2.3 Writing the transformation code

12.2.4 Reviewing your data transformation to ensure a quality process

12.2.5 What about sorting?

12.2.6 Wrapping up your first Spark transformation

12.3 Joining datasets

12.3.1 A closer look at the datasets to join

12.3.2 Building the list of higher education institutions per county

12.3.3 Performing the joins

12.4 Performing more transformations


13 Transforming entire documents

13.1 Transforming entire documents and their structure

13.1.1 Flattening your JSON document

13.1.2 Building nested documents for transfer and storage

13.2 The magic behind static functions

13.3 Performing more transformations


14 Extending transformations with user-defined functions

14.1 Extending Apache Spark

14.2 Registering and calling a UDF

14.2.1 Registering the UDF with Spark

14.2.2 Using the UDF with the dataframe API

14.2.3 Manipulating UDFs with SQL

14.2.4 Implementing the UDF

14.2.5 Writing the service itself

14.3 Using UDFs to ensure a high level of data quality

14.4 Considering UDFs’ constraints


15 Aggregating your data

15.1 Aggregating data with Spark

15.1.1 A quick reminder on aggregations

15.1.2 Performing basic aggregations with Spark

15.2 Performing aggregations with live data

15.2.1 Preparing your dataset

15.2.2 Aggregating data to better understand the schools

15.3 Building custom aggregations with UDAF


Part 4. Going Further

16 Cache and checkpoint: Enhancing Spark’s performances

16.1 Caching and checkpointing can increase performance

16.1.1 The usefulness of Spark caching

16.1.2 The subtle effectiveness of Spark checkpointing

16.1.3 Using cache and checkpoint

16.2 Caching in action

16.3 Going further in performance optimization


17 Exporting data and building full data pipelines

17.1 Exporting data

17.1.1 Building a pipeline with NASA datasets

17.1.2 Transforming columns to datetime

17.1.3 Transforming the confidence percentage to confidence level

17.1.4 Exporting the data

17.1.5 Exporting the data: what really happened?

17.2 Delta Lake: Enjoying a database close to your system

17.2.1 Understanding why a database is needed

17.2.2 Using Delta Lake in your data pipeline

17.2.3 Consuming data from Delta Lake

17.3 Accessing cloud storage services from Spark


18 Exploring deployment constraints: Understanding the ecosystem

18.1 Managing resources with YARN, Mesos, and Kubernetes

18.1.1 The built-in standalone mode manages resources

18.1.2 YARN manages resources in a Hadoop environment

18.1.3 Mesos is a standalone resource manager

18.1.4 Kubernetes orchestrates containers

18.1.5 Choosing the right resource manager

18.2 Sharing files with Spark

18.2.1 Accessing the data contained in files

18.2.2 Sharing files through distributed file systems

18.2.3 Accessing files on shared drives or file servers

18.2.4 Using file sharing services to distribute files

18.2.5 Other options for accessing files in Spark

18.2.6 Hybrid solution for sharing files with Spark

18.3 Making sure your Spark application is secure

18.3.1 Securing the network components of your infrastructure

18.3.2 Securing Spark’s disk usage



Appendix A: Installing Eclipse

A.1 Downloading Eclipse

A.2 Running Eclipse for the first time

Appendix B: Installing Maven

B.1 Installation on Windows

B.2 Installation on macOS

B.3 Installation on Ubuntu

B.4 Installation on RHEL / Amazon EMR

B.5 Manual installation on Linux and other UNIX-like OSes

Appendix C: Installing Git

C.1 Installing Git on Windows

C.2 Installing Git on macOS

C.3 Installing Git on Ubuntu

C.4 Installing Git on RHEL / AWS EMR

C.5 Other tools to consider

Appendix D: Downloading the code and getting started with Eclipse

D.1 Downloading the source code from the command line

D.2 Getting started in Eclipse

Appendix E: A history of enterprise data

E.1 The enterprise problem

E.2 The solution is—hmmm, was—the data warehouse

E.3 The ephemeral data lake

E.4 Lightning-fast cluster computing

E.5 Java rules, but we’re okay with Python

Appendix F: Getting help with relational databases

F.1 IBM Informix

F.1.1 Installing Informix on macOS

F.1.2 Installing Informix on Windows

F.2 MariaDB

F.2.1 Installing MariaDB on macOS

F.2.2 Installing MariaDB on Windows

F.3 MySQL (Oracle)

F.3.1 Installing MySQL on macOS

F.3.2 Installing MySQL on Windows

F.3.3 Loading the Sakila database

F.4 PostgreSQL

F.4.1 Installing PostgreSQL on macOS and Windows

F.4.2 Installing PostgreSQL on Linux

F.4.3 GUI clients for PostgreSQL

Appendix G: Static functions ease your transformations

G.1 Functions per category

G.1.2 Aggregate functions

G.1.3 Arithmetical functions

G.1.4 Array manipulation functions

G.1.5 Binary operations

G.1.6 Byte functions

G.1.7 Comparison functions

G.1.8 Compute function

G.1.9 Conditional operations

G.1.10 Conversion functions

G.1.11 Data shape functions

G.1.12 Date and time functions

G.1.13 Digest functions

G.1.14 Encoding functions

G.1.15 Formatting functions

G.1.16 JSON functions

G.1.17 List functions

G.1.18 Map functions

G.1.19 Mathematical functions

G.1.20 Navigation functions

G.1.21 Parsing functions

G.1.22 Partition functions

G.1.23 Rounding functions

G.1.24 Sorting functions

G.1.25 Statistical functions

G.1.26 Streaming functions

G.1.27 String functions

G.1.28 Technical functions

G.1.29 Trigonometry functions

G.1.30 UDF helpers

G.1.31 Validation functions

G.1.32 Deprecated functions

G.2 Function appearance per version of Spark

G.2.1 Functions in Spark v3.0.0

G.2.2 Functions in Spark v2.4.0

G.2.3 Functions in Spark v2.3.0

G.2.4 Functions in Spark v2.2.0

G.2.5 Functions in Spark v2.1.0

G.2.6 Functions in Spark v2.0.0

G.2.7 Functions in Spark v1.6.0

G.2.8 Functions in Spark v1.5.0

G.2.9 Functions in Spark v1.4.0

G.2.10 Functions in Spark v1.3.0

Appendix H: Maven quick cheat sheet

H.1 Source of packages

H.2 Useful commands

H.3 Typical Maven life cycle

H.4 Useful configuration

H.4.1 Built-in properties

H.4.2 Building an uber JAR

H.4.3 Including the source code

H.4.4 Executing from Maven
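
The uber JAR build that H.4.2 covers is conventionally done with the Maven Shade plugin. A minimal sketch of that pom.xml fragment, as an assumption about the approach rather than the book's exact listing (plugin version included for illustration):

```xml
<!-- Assumed sketch: bundle all dependencies into one "uber" JAR
     when `mvn package` runs, via the Maven Shade plugin. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

The resulting single JAR is what chapter 6 submits to the cluster with spark-submit.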

Appendix I: Reference for transformations and actions

I.1 Transformations

I.2 Actions

Appendix J: Enough Scala

J.1 What is Scala?

J.2 Scala to Java conversion

J.2.1 General conversions

J.2.2 Maps: Conversion from Scala to Java

Appendix K: Installing Spark in production and a few tips

K.1 Installation

K.1.1 Installing Spark on Windows

K.1.2 Installing Spark on macOS

K.1.3 Installing Spark on Ubuntu

K.1.4 Installing Spark on AWS EMR

K.2 Understanding the installation

K.3 Configuration

K.3.1 Properties syntax

K.3.2 Application configuration

K.3.3 Runtime configuration

K.3.4 Other configuration points

Appendix L: Reference for ingestion

L.1 Spark datatypes

L.2 Options for CSV ingestion

L.3 Options for JSON ingestion

L.4 Options for XML ingestion

L.5 Methods for building a full dialect

L.6 Options for ingesting and writing data from/to a database

L.7 Options for ingesting and writing data from/to Elasticsearch

Appendix M: Reference for joins

M.1 Setting up the decorum

M.2 Performing an inner join

M.3 Performing an outer join

M.4 Performing a left, or left-outer, join

M.5 Performing a right, or right-outer, join

M.6 Performing a left-semi join

M.7 Performing a left-anti join

M.8 Performing a cross-join

Appendix N: Installing Elasticsearch and sample data

N.1 Installing the software

N.1.1 All platforms

N.1.2 macOS with Homebrew

N.2 Installing the NYC restaurant dataset

N.3 Understanding Elasticsearch terminology

N.4 Working with useful commands

N.4.1 Get the server status

N.4.2 Display the structure

N.4.3 Count documents

Appendix O: Generating streaming data

O.1 Need for generating streaming data

O.2 A simple stream

O.3 Joined data

O.4 Types of fields

Appendix P: Reference for streaming

P.1 Output mode

P.2 Sinks

P.3 Sinks, output modes, and options

P.4 Examples of using the various sinks

P.4.1 Output in a file

P.4.2 Output to a Kafka topic

P.4.3 Processing streamed records through foreach

P.4.4 Output in memory and processing from memory

Appendix Q: Reference for exporting data

Q.1 Specifying the way to save data

Q.2 Spark export formats

Q.3 Options for the main formats

Q.3.1 Exporting as CSV

Q.3.2 Exporting as JSON

Q.3.3 Exporting as Parquet

Q.3.4 Exporting as ORC

Q.3.5 Exporting as XML

Q.3.6 Exporting as text

Q.4 Exporting data to datastores

Q.4.1 Exporting data to a database via JDBC

Q.4.2 Exporting data to Elasticsearch

Q.4.3 Exporting data to Delta Lake

Appendix R: Finding help when you’re stuck

R.1 Small annoyances here and there

R.1.1 Service sparkDriver failed after 16 retries…

R.1.2 Requirement failed

R.1.3 Class cast exception

R.1.4 Corrupt record in ingestion

R.1.5 Cannot find winutils.exe

R.2 Help in the outside world

R.2.1 User mailing list

R.2.2 Stack Overflow

What's inside

  • Writing Spark applications in Java
  • Spark application architecture
  • Ingestion through files, databases, streaming, and Elasticsearch
  • Querying distributed datasets with Spark SQL

About the reader

This book does not assume previous experience with Spark, Scala, or Hadoop.

About the author

Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years.

print book $59.99 (pBook + eBook + liveBook)
eBook $47.99 (3 formats + liveBook)