PySpark in Action
Python data analysis at scale
Jonathan Rioux
  • MEAP began November 2019
  • Publication in Fall 2020 (estimated)
  • ISBN 9781617297205
  • 425 pages (estimated)
  • printed in black & white

A great and gentle introduction to Spark.

Javier Collado Cabeza
When it comes to data analytics, it pays to think big. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task. PySpark in Action is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build lightning-fast pipelines for reporting, machine learning, and other data-centric tasks. No previous knowledge of Spark is required.
Table of Contents

Part 1: Walk

1 Introduction

1.1 What is PySpark?

1.1.1 You saw it coming: What is Spark?

1.1.2 PySpark = Spark + Python

1.1.3 Why PySpark?

1.1.4 Your very own factory: how PySpark works

1.1.5 Some physical planning with the cluster manager

1.1.6 A factory made efficient through a lazy manager

1.2 What will you learn in this book?

1.3 What do I need to get started?

1.4 Summary

2 Your first data program in PySpark

2.1 Setting up the pyspark shell

2.1.1 The SparkSession entry-point

2.1.2 Configuring how chatty Spark is: the log level

2.2 Mapping our program

2.3 Reading and ingesting data into a data frame

2.4 Exploring data in the DataFrame structure

2.4.1 Peeking under the hood: the show() method

2.5 Moving from a sentence to a list of words

2.5.1 Selecting specific columns using select()

2.5.2 Transforming columns: splitting a string into a list of words

2.5.3 Renaming columns: alias and withColumnRenamed

2.6 Reshaping your data: exploding a list into rows

2.7 Working with words: changing case and removing punctuation

2.8 Filtering rows

2.9 Summary

2.10 Exercises

3 Submitting and scaling your first PySpark program

3.1 Grouping records: Counting word frequencies

3.2 Ordering the results on the screen using orderBy

3.3 Writing data from a data frame

3.4 Putting it all together: counting

3.4.1 Simplifying your dependencies with PySpark’s import conventions

3.4.2 Simplifying our program via method chaining

3.5 Your first non-interactive program: using spark-submit

3.5.1 Creating your own SparkSession

3.6 Using spark-submit to launch your program in batch mode

3.7 What didn’t happen in this chapter

3.8 Scaling up our word frequency program

3.9 Summary

3.10 Exercises

4 Analyzing data using pyspark.sql

Part 2: Jog

5 It happens to the best of us: cleaning messy data

6 Making sense of your data: types, structure and semantics

6.1 Open sesame: what does your data tell you?

6.2 The first step in understanding our data: PySpark’s scalar types

6.2.1 String and bytes

6.2.2 The numerical tower(s): integer values

6.2.3 The numerical tower(s): double, floats and decimals

6.2.4 Date and timestamp

6.2.5 Null and boolean

6.3 PySpark’s complex types

6.3.1 Complex types: the array

6.3.2 Complex types: the map

6.4 Structure and type: The dual-nature of the struct

6.4.1 A data frame is an ordered collection of columns

6.4.2 The second dimension: just enough about the row

6.4.3 Casting your way to sanity

6.4.4 Defaulting values with fillna

6.5 Summary

6.6 Exercises

7 Faster big data processing: a primer

8 Extending PySpark’s capacities with user defined functions

Part 3: Run

9 A foray into machine learning: logistic regression with PySpark

10 Simplifying your experiments with machine learning pipelines

11 Machine learning for unstructured data

12 PySpark for graphs: GraphFrames

13 Testing PySpark code

14 Going full circle: structuring end-to-end PySpark code


Appendix A: Installing PySpark locally

A.1 Preliminary steps

A.1.1 Windows (plain): Install 7-zip

A.1.2 Windows (WSL): Install WSL

A.1.3 macOS: Install Homebrew

A.2 Step 1: Install Java

A.2.1 Windows (plain)

A.2.2 macOS

A.2.3 GNU/Linux Ubuntu and WSL

A.3 Step 2: Installing Spark

A.3.1 Windows (plain), GNU/Linux (including WSL)

A.3.2 macOS

A.4 Step 3: Install Python 3 and IPython

A.4.1 Windows, macOS

A.4.2 GNU/Linux Ubuntu and WSL

A.5 Step 4: Launch PySpark with IPython

A.5.1 Windows (plain)

A.5.2 macOS

A.5.3 GNU/Linux Ubuntu and WSL

Appendix B: Using PySpark with a cloud provider

Appendix C: Python essentials

Appendix D: Using PySpark with a notebook UI

Appendix E: Efficiently using PySpark’s API documentation

About the Technology

The Spark data processing engine is an amazing analytics factory: raw data comes in, and insight comes out. Thanks to its ability to handle massive amounts of data distributed across a cluster, Spark has been adopted as standard by organizations both big and small. PySpark, which wraps the core Spark engine with a Python-based API, puts Spark-based data pipelines in the hands of programmers and data scientists working with the Python programming language. PySpark simplifies Spark’s steep learning curve, and provides a seamless bridge between Spark and an ecosystem of Python-based data science tools.

About the book

PySpark in Action is a carefully engineered tutorial that helps you use PySpark to deliver your data-driven applications at any scale. This clear and hands-on guide shows you how to enlarge your processing capabilities across multiple machines with data from any source, ranging from Hadoop-based clusters to Excel worksheets. You’ll learn how to break down big analysis tasks into manageable chunks and how to choose and use the best PySpark data abstraction for your unique needs. By the time you’re done, you’ll be able to write and run incredibly fast PySpark programs that are scalable, efficient to operate, and easy to debug.

What's inside

  • Packaging your PySpark code
  • Managing your data as it scales across multiple machines
  • Rewriting Pandas, R, and SAS jobs in PySpark
  • Troubleshooting common data pipeline problems
  • Creating reliable long-running jobs

About the reader

Written for intermediate data scientists and data engineers comfortable with Python.

About the author

As a data scientist for an engineering consultancy, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it’s ready, and receive the pBook long before it’s in bookstores.
MEAP combo $49.99 pBook + eBook + liveBook
MEAP eBook $39.99 pdf + ePub + kindle + liveBook