PySpark in Action
Python data analysis at scale
Jonathan Rioux
  • MEAP began November 2019
  • Publication in Spring 2021 (estimated)
  • ISBN 9781617297205
  • 425 pages (estimated)
  • printed in black & white

A great and gentle introduction to Spark.

Javier Collado Cabeza

When it comes to data analytics, it pays to think big. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task. PySpark in Action is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build lightning-fast pipelines for reporting, machine learning, and other data-centric tasks. No previous knowledge of Spark is required.

About the technology

The Spark data processing engine is an amazing analytics factory: raw data comes in, and insight comes out. Thanks to its ability to handle massive amounts of data distributed across a cluster, Spark has been adopted as standard by organizations both big and small. PySpark, which wraps the core Spark engine with a Python-based API, puts Spark-based data pipelines in the hands of programmers and data scientists working with the Python programming language. PySpark simplifies Spark’s steep learning curve, and provides a seamless bridge between Spark and an ecosystem of Python-based data science tools.
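
To give a taste of that Python-facing API, here is a minimal, illustrative sketch in the spirit of the word-count program the book's early chapters build. The application name, file path, and column name are hypothetical placeholders, not examples from the book itself:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# The SparkSession is the entry point to the PySpark data frame API.
spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()

# Read a delimited file into a data frame (path and options are placeholders).
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)

# A typical transformation chain: keep one column, filter, group, count, sort.
(df.select("word")
   .where(F.col("word") != "")
   .groupBy("word")
   .count()
   .orderBy("count", ascending=False)
   .show(10))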

About the book

PySpark in Action is a carefully engineered tutorial that helps you use PySpark to deliver your data-driven applications at any scale. This clear and hands-on guide shows you how to enlarge your processing capabilities across multiple machines with data from any source, ranging from Hadoop-based clusters to Excel worksheets. You’ll learn how to break down big analysis tasks into manageable chunks and how to choose and use the best PySpark data abstraction for your unique needs. By the time you’re done, you’ll be able to write and run incredibly fast PySpark programs that are scalable, efficient to operate, and easy to debug.
Table of Contents

Part 1: Walk

1 Introduction

1.1 What is PySpark?

1.1.1 You saw it coming: What is Spark?

1.1.2 PySpark = Spark + Python

1.1.3 Why PySpark?

1.1.4 Your very own factory: how PySpark works

1.1.5 Some physical planning with the cluster manager

1.1.6 A factory made efficient through a lazy manager

1.2 What will you learn in this book?

1.3 What do I need to get started?

1.4 Summary

2 Your first data program in PySpark

2.1 Setting up the pyspark shell

2.1.1 The SparkSession entry-point

2.1.2 Configuring how chatty Spark is: the log level

2.2 Mapping our program

2.3 Reading and ingesting data into a data frame

2.4 Exploring data in the DataFrame structure

2.4.1 Peeking under the hood: the show() method

2.5 Moving from a sentence to a list of words

2.5.1 Selecting specific columns using select()

2.5.2 Transforming columns: splitting a string into a list of words

2.5.3 Renaming columns: alias and withColumnRenamed

2.6 Reshaping your data: exploding a list into rows

2.7 Working with words: changing case and removing punctuation

2.8 Filtering rows

2.9 Summary

2.10 Exercises

Exercise 2.1

Exercise 2.2

Exercise 2.3

Exercise 2.4

Exercise 2.5

3 Submitting and scaling your first PySpark program

3.1 Grouping records: Counting word frequencies

3.2 Ordering the results on the screen using orderBy

3.3 Writing data from a data frame

3.4 Putting it all together: counting

3.4.1 Simplifying your dependencies with PySpark’s import conventions

3.4.2 Simplifying our program via method chaining

3.5 Your first non-interactive program: using spark-submit

3.5.1 Creating your own SparkSession

3.6 Using spark-submit to launch your program in batch mode

3.7 What didn’t happen in this chapter

3.8 Scaling up our word frequency program

3.9 Summary

3.10 Exercises

3.10.1 Exercise 3.1

3.10.2 Exercise 3.2

3.10.3 Exercise 3.3

3.10.4 Exercise 3.4

4 Analyzing tabular data with pyspark.sql

4.1 What is tabular data?

4.1.1 How does PySpark represent tabular data?

4.2 PySpark for analyzing and processing tabular data

4.3 Reading delimited data in PySpark

4.3.1 Customizing the SparkReader object to read CSV data files

4.3.2 Exploring the shape of our data universe

4.4 The basics of data manipulation: diagnosing our centre table

4.4.1 Knowing what we want: selecting columns

4.4.2 Keeping what we need: deleting columns

4.4.3 Creating what’s not there: new columns with withColumn()

4.4.4 Tidying our data frame: renaming and re-ordering columns

4.4.5 Summarizing your data frame: describe() and summary()

4.5 Summary

5 The data frame through a new lens: joining and grouping

5.1 From many to one: joining data

5.1.1 What’s what in the world of joins

5.1.2 Knowing our left from our right

5.1.3 The rules to a successful join: the predicates

5.1.4 How do you do it: the join method

5.1.5 Naming conventions in the joining world

5.2 Summarizing the data via groupby and GroupedData

5.2.1 A simple groupby blueprint

5.2.2 A column is a column: using agg with custom column definitions

5.3 Taking care of null values: drop and fill

5.3.1 Dropping it like it’s hot

5.3.2 Filling values to our heart’s content

5.4 What was our question again: our end-to-end program

5.5 Summary

5.6 Exercises

5.6.1 Exercise 5.1

5.6.2 Exercise 5.2

5.6.3 Exercise 5.3

Part 2: Jog

6 Multi-dimensional data frames: using PySpark with JSON data

6.1 Reading JSON data: getting ready for the schemapocalypse

6.1.1 Starting small: JSON data as Python dictionary

6.1.2 Going bigger: reading JSON data in PySpark

6.2 Breaking the second dimension with complex data types

6.2.1 When you have more than one value: the array

6.2.2 The map type: keys and values within a column

6.3 The struct: nesting columns within columns

6.3.1 Navigating structs as if they were nested columns

6.4 Building and using the data frame schema

6.4.1 Using Spark types as the base blocks of a schema

6.4.2 Reading a JSON document with a strict schema in place

6.4.3 Going full circle: specifying your schemas in JSON

6.5 Putting it all together: reducing duplicate data with complex data types

6.6 Summary

6.7 Exercises

7 Bilingual PySpark: blending Python and SQL code

7.1 Banking on what we know: pyspark.sql vs plain SQL

7.2 Using SQL queries on a data frame

7.2.1 Promoting a data frame to a Spark table

7.2.2 Using the Spark catalog

7.3 SQL and PySpark

7.4 Using SQL-like syntax within data frame methods

7.4.1 Select and where

7.4.2 Group and order by

7.4.3 Having

7.4.4 Create tables/views

7.4.5 Union and join

7.4.6 Subqueries and common table expressions

7.4.7 A quick summary of PySpark vs. SQL syntax

7.5 Simplifying our code: blending SQL and Python together

7.5.1 Reading our data

7.5.2 Using SQL-style expressions in PySpark

7.6 Conclusion

7.7 Summary

7.8 Exercises

7.8.1 Exercise 7.1

7.8.2 Exercise 7.2

7.8.3 Exercise 7.3

7.8.4 Exercise 7.4

8 Extending PySpark with user-defined functions

8.1 PySpark, freestyle: the resilient distributed dataset

8.1.1 Manipulating data the RDD way: map, filter and reduce

8.2 Using Python to extend PySpark via user-defined functions

8.2.1 It all starts with plain Python: using typed Python functions

8.2.2 From Python function to UDF: two approaches

8.3 Big data is just a lot of small data: using pandas UDF

8.3.1 Setting our environment: connectors and libraries

8.3.2 Preparing our data

8.3.3 Scalar UDF

8.3.4 Grouped map UDF

8.3.5 Grouped aggregate UDF

8.3.6 Going local to troubleshoot pandas UDF

8.4 Summary

8.5 Exercises

8.5.1 Exercise 8.1

8.5.2 Exercise 8.2

8.5.3 Exercise 8.3

8.5.4 Exercise 8.4

8.5.5 Exercise 8.5

9 Faster PySpark: understanding Spark’s query planning

Part 3: Run

10 A foray into machine learning: logistic regression with PySpark

10.1 Reading, exploring and preparing our machine learning data set

10.1.1 Exploring our data and getting our first feature columns

10.1.2 Addressing data mishaps and building our first feature set

10.1.3 Getting our data set ready for assembly: null imputation and casting

10.2 Feature engineering and selection

10.2.1 Weeding out the rare binary occurrence columns

10.2.2 Creating custom features

10.2.3 Removing highly correlated features

10.2.4 Scaling our features

10.2.5 Assembling the final data set with the Vector column type

10.3 Training and evaluating our model

10.3.1 Assessing model accuracy with the Evaluator object

10.3.2 Getting the biggest drivers from our model: extracting the coefficients

10.4 Summary

11 Robust machine learning with ML Pipelines

12 PySpark and unstructured data

13 PySpark for graphs: GraphFrames

14 Testing PySpark code

15 Even faster PySpark: identify and solve bottlenecks

16 Going full circle: structuring end-to-end PySpark code

Appendixes

Appendix A: Exercise solutions

Chapter 4

Exercise 4.1

Exercise 4.2

Exercise 4.3

Exercise 4.4

Chapter 5

Exercise 5.1

Exercise 5.2

Exercise 5.3

Chapter 8

Exercise 8.1

Exercise 8.2

Appendix B: Installing PySpark locally

B.1 Windows

B.1.1 Install Java

B.1.2 Install 7-zip

B.1.3 Download and install Apache Spark

B.1.4 Install Python

B.1.5 Launching an IPython REPL and starting PySpark

B.1.6 (Optional) Install and run Jupyter to use Jupyter notebook

B.2 macOS

B.2.1 Install Homebrew

B.2.2 Install Java and Spark

B.2.3 Install Anaconda/Python

B.2.4 Launching an IPython REPL and starting PySpark

B.2.5 (Optional) Install and run Jupyter to use Jupyter notebook

B.3 GNU/Linux and WSL

B.3.1 Install Java

B.3.2 Installing Spark

B.3.3 Install Python 3 and IPython

B.3.4 Launch PySpark with IPython

B.3.5 (Optional) Install and run Jupyter to use Jupyter notebook

Appendix C: Using PySpark with a cloud provider

Appendix D: Python essentials

Appendix E: PySpark data types

Appendix F: Efficiently using PySpark’s API documentation

What's inside

  • Packaging your PySpark code
  • Managing your data as it scales across multiple machines
  • Rewriting Pandas, R, and SAS jobs in PySpark
  • Troubleshooting common data pipeline problems
  • Creating reliable long-running jobs

About the reader

Written for intermediate data scientists and data engineers comfortable with Python.

About the author

As a data scientist for an engineering consultancy, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it's in bookstores.
print book: $29.99 (regular price $49.99), pBook + eBook + liveBook

eBook: $24.99 (regular price $39.99), 3 formats + liveBook

