Spark with Java
Jean Georges Perrin
  • MEAP began March 2018
  • Publication in Early 2019 (estimated)
  • ISBN 9781617295522
  • 375 pages (estimated)
  • printed in black & white
Spark radically simplifies tasks like reporting, analytics, and machine learning on distributed data. Java engineers can use Spark's powerful and flexible Java APIs to stream, consolidate, filter, and transform distributed data without learning a new programming language.
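As a flavor of what those Java APIs look like, here is a minimal sketch of a Spark pipeline that ingests a CSV file into a dataframe, filters it, and displays the result. The file name `restaurants.csv` and the `city` column are illustrative assumptions, not from the book; the example assumes Spark's `spark-sql` dependency is on the classpath and runs against a local master.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FirstPipelineSketch {
  public static void main(String[] args) {
    // Connect to a local Spark master (no cluster required)
    SparkSession spark = SparkSession.builder()
        .appName("CSV to dataframe")
        .master("local")
        .getOrCreate();

    // Ingest a CSV file into a dataframe; file name is hypothetical
    Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("restaurants.csv");

    // Transform: filter on a (hypothetical) column and show a few rows
    df.filter(df.col("city").equalTo("Durham"))
      .show(5);

    spark.stop();
  }
}
```

No Scala and no new language: the whole pipeline is plain Java against `SparkSession` and `Dataset<Row>`.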
Table of Contents

Part 1: Some Theory with Exciting Examples

1. So, what is Spark, anyway?

1.1. The big picture: what Spark is and what it does

1.1.1. What is Spark?

1.1.2. How can you use Spark?

1.1.3. Spark in a data processing scenario

1.1.4. The four pillars of mana

1.1.5. Spark in a data science scenario

1.2. What can I do with Spark?

1.2.1. Spark predicts restaurant quality at NC Eatery

1.2.2. Spark allows fast data transfer for Lumeris

1.2.3. Spark analyzes equipment logs for CERN

1.2.4. Other use cases

1.3. Why you will love the dataframe

1.3.1. The dataframe from a Java perspective

1.3.2. The dataframe from an RDBMS perspective

1.3.3. A graphical representation of the dataframe

1.4. Your first example

1.4.2. Downloading the code

1.4.3. Running your first application

1.4.4. Your first code

1.5. What will you learn in this book?

1.6. Summary

2. Architecture and flow

2.1. Building your mental model

2.2. Java code to build your mental model

2.3. Walking through your application

2.3.1. Connecting to a master

2.3.2. Loading or ingesting the CSV file

2.3.3. Transforming your data

2.3.4. Saving the work done in our dataframe to a database

2.4. Summary

3. The majestic role of the dataframe

4. Fundamentally lazy

5. Full flow

6. Spark is Big Data

Part 2: Spark Applied

7. Ingestion from files

7.1. Common behaviors of parsers

7.2. Complex ingestion from CSV

7.2.1. Desired output

7.2.2. Code

7.3. Ingesting a CSV with a known schema

7.3.1. Desired output

7.3.2. Code

7.4. Ingesting a JSON file

7.4.1. Desired output

7.4.2. Code

7.5. Ingesting a multiline JSON file

7.5.1. Desired output

7.5.2. Code

7.6. Ingesting an XML file

7.6.1. Desired output

7.6.2. Code

7.7. Ingesting a text file

7.7.1. Desired output

7.7.2. Code

7.8. Summary

8. Ingestion from databases

9. Advanced ingestion

10. Ingestion through structured streaming

11. Working with Spark SQL

12. Working with data

13. Aggregate your data

14. Avoid mistakes: cache and checkpoint your data

15. Interfacing with Python

16. User Defined Functions (UDF)

17. Advanced topics

18. A primer to ML with no math

19. Exporting data

Part 3: Going Further

20. Exploring the deployment constraints

21. Tips & tricks


Appendix A: Installing Eclipse

A.1. Downloading Eclipse

A.2. Running Eclipse for the first time

Appendix B: Installing Maven

Appendix C: Downloading the code

Appendix D: Downloading the code and getting started with Eclipse

D.1. Downloading the source code from the command line

D.2. Getting started in Eclipse

Appendix E: Installing Elasticsearch and sample data

Appendix F: Maven quick cheat sheet

Appendix G: Finding help when you’re stuck

Appendix H: A history of enterprise data

H.1. The enterprise problem

H.2. The solution is, hmmm, was the data warehouse

H.3. The ephemeral data lake

H.4. Lightning fast cluster computing

H.5. Java rules, but we’re ok with Python

Appendix I: Reference for ingestion

I.1. Spark datatypes

I.2. Options for CSV ingestion

I.3. Options for JSON ingestion

I.4. Options for XML ingestion

I.5. Methods to implement to build a full dialect

I.6. Options for ingesting and writing data from/to a database

I.7. Options for ingesting and writing data from/to Elasticsearch

About the Technology

When you're doing analytics on big data systems, it can be a challenge to efficiently query, stream, filter, and consolidate data distributed across a cluster, network, or cloud system. Built especially for operating efficiently over large distributed datasets, the Spark data processing engine makes handling that data so much easier! Spark's Java APIs provide an easy-to-use interface, near-limitless upgrade potential, and the performance you've dreamed about, all using the Java programming skills you already have!

About the book

Spark with Java teaches you how to manage distributed data using Spark's Java APIs. Taking a practical, hands-on approach, this book starts by building a basic Spark data analytics pipeline. As you work through the carefully-selected examples, you'll master SparkSQL, the dataframe API, and techniques for ingesting data from a variety of standard and non-standard sources. You'll also investigate interesting Spark use cases, like interactive reporting, machine learning pipelines, and even monitoring players in online games.

What's inside

  • Working with the Spark Java APIs
  • Ingestion through files, databases, and streaming
  • Querying and working with distributed datasets with Spark SQL
  • Caching and checkpointing your data
  • Interfacing with data scientists using Python
  • Applied machine learning, without the math overhead
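To illustrate the Spark SQL bullet above, here is a hedged sketch of querying a dataframe with plain SQL from Java: the dataframe is registered as a temporary view and then queried with an ordinary SELECT statement. The file name `data.csv` and the column name `amount` are hypothetical placeholders; the example assumes the `spark-sql` dependency is available.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Spark SQL sketch")
        .master("local")
        .getOrCreate();

    // Ingest a CSV file, letting Spark infer column types
    Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("data.csv");

    // Register the dataframe as a temporary view, then query it with SQL
    df.createOrReplaceTempView("records");
    Dataset<Row> top = spark.sql(
        "SELECT * FROM records ORDER BY amount DESC LIMIT 10");
    top.show();

    spark.stop();
  }
}
```

The same query could equally be expressed with the dataframe API (`df.orderBy(...)`, `df.limit(...)`); the temporary-view route simply lets SQL-literate readers reuse what they already know.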

About the reader

Written for Java engineers and architects interested in using Spark. No experience with Scala, functional programming, Spark, or Hadoop is required.

About the author

Jean Georges Perrin is an experienced consultant and entrepreneur passionate about software engineering and all things data. He was recognized as the first IBM Champion in France, an honor he has held for ten consecutive years.

Manning Early Access Program (MEAP) Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.

MEAP combo $49.99 pBook + eBook + liveBook

MEAP eBook $39.99 pdf + ePub + kindle + liveBook

FREE domestic shipping on three or more pBooks