Spark with Java
Jean Georges Perrin
  • MEAP began March 2018
  • Publication in Fall 2019 (estimated)
  • ISBN 9781617295522
  • 375 pages (estimated)
  • printed in black & white

I would say that this is the best book on Spark I've read.

Kelvin Johnson
Spark radically simplifies tasks like reporting, analytics, and machine learning on distributed data. Java engineers can use Spark's powerful and flexible Java APIs to stream, consolidate, filter, and transform distributed data without learning a new programming language.
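To give a sense of what that looks like in practice, here is a minimal sketch of a self-contained Spark application written only in Java. It assumes Spark's standard Java API running in local mode; the file name restaurants.csv and the city column are hypothetical, for illustration only, not examples from the book.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FirstSparkApp {
    public static void main(String[] args) {
        // Start a Spark session on the local machine, using all cores
        SparkSession spark = SparkSession.builder()
                .appName("First Spark app")
                .master("local[*]")
                .getOrCreate();

        // Ingest a CSV file into a dataframe (a Dataset<Row> in Java)
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("restaurants.csv");    // hypothetical file

        // Filter and transform the distributed data without leaving Java
        df.filter(df.col("city").equalTo("Durham"))
          .show(5);

        spark.stop();
    }
}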
Table of Contents

Part 1: Some Theory with Exciting Examples

1 So, what is Spark, anyway?

1.1 The big picture: what Spark is and what it does

1.1.1 What is Spark?

1.1.2 How can you use Spark?

1.1.3 Spark in a data processing scenario

1.1.4 The four pillars of mana

1.1.5 Spark in a data science scenario

1.2 What can I do with Spark?

1.2.1 Spark predicts restaurant quality at NC Eatery

1.2.2 Spark allows fast data transfer for Lumeris

1.2.3 Spark analyzes equipment logs for CERN

1.2.4 Other use cases

1.3 Why you will love the dataframe

1.3.1 The dataframe from a Java perspective

1.3.2 The dataframe from an RDBMS perspective

1.3.3 A graphical representation of the dataframe

1.4 Your first example

1.4.2 Downloading the code

1.4.3 Running your first application

1.4.4 Your first code

1.5 What will you learn in this book?

1.6 Summary

2 Architecture and flow

2.1 Building your mental model

2.2 Java code to build your mental model

2.3 Walking through your application

2.3.1 Connecting to a master

2.3.2 Loading or ingesting the CSV file

2.3.3 Transforming your data

2.3.4 Saving the work done in our dataframe to a database

2.4 Summary

3 The majestic role of the dataframe

3.1 The essential role of the dataframe in Spark

3.1.1 Organization of a dataframe

3.1.2 Immutability is not a swear word

3.2 Using dataframes through examples

3.2.1 A dataframe after a simple CSV ingestion

3.2.2 Data is stored in partitions

3.2.3 Digging in the schema

3.2.4 A dataframe after a JSON ingestion

3.2.5 Combining two dataframes

3.3 The dataframe is a Dataset<Row>

3.3.1 Reuse your POJOs

3.3.2 Create a dataset of strings

3.3.3 Converting back and forth

3.4 Dataframe’s ancestor: the RDD

3.5 Summary

4 Fundamentally lazy

4.1 A real-life example of efficient laziness

4.2 A Spark example of efficient laziness

4.2.1 Looking at the results of transformations and actions

4.2.2 The transformation process, step by step

4.2.3 The code behind the transformation/action process

4.2.4 The mystery behind the creation of 7 million datapoints in 182ms

4.2.5 The mystery behind the timing of actions

4.3 Comparing to RDBMS and traditional applications

4.3.1 Working with the teen birth rates dataset

4.3.2 Analyzing the differences between a traditional app and a Spark app

4.4 Spark is amazing for data-focused applications

4.5 Catalyst is your app catalyzer

4.6 Summary

5 Full flow: from scratch to deployment

6 Spark is Big Data

Part 2: Ingestion

7 Ingestion from files

7.1 Common behaviors of parsers

7.2 Complex ingestion from CSV

7.2.1 Desired output

7.2.2 Code

7.3 Ingesting a CSV with a known schema

7.3.1 Desired output

7.3.2 Code

7.4 Ingesting a JSON file

7.4.1 Desired output

7.4.2 Code

7.5 Ingesting a multiline JSON file

7.5.1 Desired output

7.5.2 Code

7.6 Ingesting an XML file

7.6.1 Desired output

7.6.2 Code

7.7 Ingesting a text file

7.7.1 Desired output

7.7.2 Code

7.8 Summary

8 Ingestion from databases

8.1 Ingestion from relational databases

8.1.1 Database connection checklist

8.1.2 Understanding the data used in the examples

8.1.3 Desired output

8.1.4 Code

8.1.5 Alternative code

8.2 The role of the dialect

8.2.1 What is a dialect anyway?

8.2.2 JDBC dialects provided with Spark

8.2.3 Building your own dialect

8.3 Advanced queries and ingestion

8.3.1 Filtering using a where clause

8.3.2 Joining data in the database

8.3.3 Ingestion and partitioning

8.3.4 Summary of advanced features

8.4 Ingestion from Elasticsearch

8.4.1 Data flow

8.4.2 The New York restaurants dataset digested by Spark

8.4.3 Code to ingest the restaurant dataset from Elasticsearch

8.5 Summary

9 Advanced ingestion: finding data sources and building your own

9.1 What is a data source?

9.2 Benefits of a direct connection to a data source

9.2.1 Temporary files

9.2.2 Data quality scripts

9.2.3 Get data on demand

9.3 Finding data sources at Spark Packages

9.4 Build your own data source

9.4.1 Scope of the example project

9.4.2 Your data source API and options

9.5 Behind the scenes: building the data source itself

9.6 The register file and the advertiser class

9.7 The relation between the data and schema

9.7.1 The data source builds the relation

9.7.2 Inside the relation

9.8 Building the schema from a JavaBean

9.9 Building the dataframe is magic with the utilities

9.10 The other classes

9.11 Summary

10 Ingestion through structured streaming

Part 3: Transformation

11 Working with Spark SQL

12 Working with data

13 Aggregate your data

14 Avoid mistakes: cache and checkpoint your data

15 Interfacing with Python

16 User Defined Functions (UDF)

Part 4: Going Further

17 Advanced topics

18 A primer to ML with no math

19 Exporting data

20 Exploring the deployment constraints

Appendixes

Appendix A: Installing Eclipse

A.1 Downloading Eclipse

A.2 Running Eclipse for the first time

Appendix B: Installing Maven

B.1 Installation on Windows

B.2 Installation on macOS

B.3 Installation on Ubuntu

B.4 Installation on RHEL / Amazon EMR

B.5 Manual installation on Linux and other Unix-like OS

Appendix C: Installing Git

C.1 Installing Git on Windows

C.2 Installing Git on macOS

C.3 Installing Git on Ubuntu

C.4 Installing Git on RHEL / AWS EMR

C.5 Other tools to consider

Appendix D: Downloading the code and getting started with Eclipse

D.1 Downloading the source code from the command line

D.2 Getting started in Eclipse

Appendix E: Installing Elasticsearch and sample data

E.1 Software installation

E.1.1 All platforms

E.1.2 macOS with Homebrew

E.2 Installing the NYC restaurant dataset

E.3 Elasticsearch vocabulary

E.4 Useful commands

E.4.1 Get the server status

E.4.2 Display the structure

E.4.3 Count documents

Appendix F: Maven quick cheat sheet

F.1 Source of packages

F.2 Useful commands

F.3 Useful configuration

F.3.1 Built-in properties

F.3.2 Building an uber jar

F.3.3 Including the source code

F.3.4 Executing from Maven

Appendix G: Getting help with relational databases

G.1 Informix (IBM)

G.1.1 Installing Informix on macOS

G.1.2 Installing Informix on Windows

G.2 MariaDB

G.2.1 Installing MariaDB on macOS

G.2.2 Installing MariaDB on Windows

G.3 MySQL (Oracle)

G.3.1 Installing MySQL on macOS

G.3.2 Installing MySQL on Windows

G.3.3 Loading the Sakila database

G.4 PostgreSQL

G.4.1 Installing PostgreSQL on macOS and Windows

G.4.2 Installing PostgreSQL on Linux

G.4.3 GUI clients for PostgreSQL

Appendix H: A history of enterprise data

H.1 The enterprise problem

H.2 The solution is, hmmm, was the data warehouse

H.3 The ephemeral data lake

H.4 Lightning fast cluster computing

H.5 Java rules, but we’re ok with Python

Appendix I: Reference for ingestion

I.1 Spark datatypes

I.2 Options for CSV ingestion

I.3 Options for JSON ingestion

I.4 Options for XML ingestion

I.5 Methods to implement to build a full dialect

I.6 Options for ingesting and writing data from/to a database

I.7 Options for ingesting and writing data from/to Elasticsearch

Appendix J: A reference for joins

Appendix K: Static functions ease your transformations

Appendix P: Installing Spark in production and a few tips

P.1 Installation

P.1.1 Installing Spark on Windows

P.1.2 Installing Spark on macOS

P.1.3 Installing Spark on Ubuntu

P.1.4 Installing Spark on AWS EMR

P.2 Understanding the installation

P.3 Configuration

P.3.1 Properties syntax

P.3.2 Application configuration

P.3.3 Runtime configuration

P.3.4 Other configuration points

Appendix S: Enough (of) Scala

S.1 What is Scala?

S.2 Scala to Java conversion

S.2.1 Maps: conversion from Scala to Java

Appendix T: Reference for transformations and actions

Appendix Z: Finding help when you’re stuck

Z.1 Small annoyances here and there

Z.1.1 Service 'sparkDriver' failed after 16 retries…

Z.1.2 Corrupt record in ingestion

Z.2 Help in the outside world

Z.2.1 User mailing list

Z.2.2 Stack Overflow

About the Technology

When you're doing analytics on big data systems, it can be a challenge to efficiently query, stream, filter, and consolidate data distributed across a cluster, network, or cloud system. Built specifically to operate efficiently over large distributed datasets, the Spark data processing engine makes handling that data much easier. Spark's Java APIs provide an easy-to-use interface, near-limitless scalability, and the performance you've dreamed about, all with the Java programming skills you already have.

About the book

Spark with Java teaches you how to manage distributed data using Spark's Java APIs. Taking a practical, hands-on approach, this book starts by building a basic Spark data analytics pipeline. As you work through the carefully selected examples, you'll master Spark SQL, the dataframe API, and techniques for ingesting data from a variety of standard and non-standard sources. You'll also investigate interesting Spark use cases, like interactive reporting, machine learning pipelines, and even monitoring players in online games.
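As a taste of the Spark SQL and dataframe material, the following sketch shows the general pattern: ingest a file into a dataframe, register it as a view, and query it with plain SQL from Java. The file events.json, the view name, and the type column are assumptions made for illustration; they are not taken from the book's examples.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Spark SQL sketch")
                .master("local[*]")
                .getOrCreate();

        // Ingest a JSON file into a dataframe (hypothetical file name)
        Dataset<Row> df = spark.read().json("events.json");

        // Expose the dataframe to Spark SQL as a temporary view
        df.createOrReplaceTempView("events");

        // Query the distributed data with plain SQL, straight from Java
        Dataset<Row> result = spark.sql(
                "SELECT type, COUNT(*) AS n FROM events GROUP BY type");
        result.show();

        spark.stop();
    }
}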

What's inside

  • Working with the Spark Java APIs
  • Ingestion through files, databases, and streaming
  • Querying and working with distributed datasets with Spark SQL
  • Caching and checkpointing your data (see the sketch after this list)
  • Interfacing with data scientists using Python
  • Applied machine learning, without the math overhead
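The sketch below illustrates the caching and checkpointing item, under assumed names: large-dataset.csv, the city column, and the checkpoint directory are hypothetical, and the code is a minimal sketch rather than an example from the book.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheCheckpointSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Cache and checkpoint sketch")
                .master("local[*]")
                .getOrCreate();

        // A checkpoint directory must be set before checkpoint() is called
        spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoint");

        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("large-dataset.csv");   // hypothetical file

        df.cache();                          // keep the dataframe in memory
        long rows = df.count();              // first action fills the cache
        long cities = df.select("city").distinct().count(); // reuses it

        // checkpoint() persists the data and truncates the lineage,
        // which protects long pipelines from costly recomputation
        Dataset<Row> stable = df.checkpoint();
        System.out.println(rows + " rows, " + cities + " cities, "
                + stable.count() + " checkpointed rows");

        spark.stop();
    }
}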

About the reader

Written for Java engineers and architects interested in using Spark. No experience with Scala, functional programming, Spark, or Hadoop is required.

About the author

Jean Georges Perrin is an experienced consultant and entrepreneur passionate about software engineering and all things data. He was recognized as France's first IBM Champion, an honor he has now held for ten consecutive years.

Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.

MEAP combo $49.99 pBook + eBook + liveBook
MEAP eBook $39.99 pdf + ePub + kindle + liveBook

FREE domestic shipping on three or more pBooks

One of the simplest, yet most powerful, introductions and deep dives you can ever have into an Apache library!

Igor Franca

A great book for beginners and prospective experts.

Markus Breuer