Manning Early Access Program (MEAP): Read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.

Data Analysis with Python and PySpark

Jonathan Rioux
  • MEAP began November 2019
  • Publication in Early 2022 (estimated)
  • ISBN 9781617297205
  • 425 pages (estimated)
  • printed in black & white

Print book: Receive a print copy shipped to your door, plus the eBook in Kindle, ePub, and PDF formats, plus liveBook, our enhanced eBook format accessible from any web browser. Price: $35.99 (regularly $59.99; you save $24, or 40%).
Additional shipping charges may apply
FREE domestic shipping on orders of three or more print books

eBook: Our eBooks come in Kindle, ePub, and DRM-free PDF formats, plus liveBook, our enhanced eBook format accessible from any web browser. Price: $28.79 (regularly $47.99; you save $19, or 40%).

A great and gentle introduction to Spark.

Javier Collado Cabeza
When it comes to data analytics, it pays to think big. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task. Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build lightning-fast pipelines for reporting, machine learning, and other data-centric tasks. No previous knowledge of Spark is required.

about the technology

The Spark data processing engine is an amazing analytics factory: raw data comes in, and insight comes out. Thanks to its ability to handle massive amounts of data distributed across a cluster, Spark has been adopted as a standard by organizations both big and small. PySpark, which wraps the core Spark engine with a Python-based API, puts Spark-based data pipelines in the hands of programmers and data scientists working with the Python programming language. PySpark eases Spark's steep learning curve and provides a seamless bridge between Spark and an ecosystem of Python-based data science tools.

about the book

Data Analysis with Python and PySpark is a carefully engineered tutorial that helps you use PySpark to deliver your data-driven applications at any scale. This clear and hands-on guide shows you how to scale your processing across multiple machines with data from any source, ranging from Hadoop-based clusters to Excel worksheets. You'll learn how to break down big analysis tasks into manageable chunks and how to choose and use the best PySpark data abstraction for your unique needs. By the time you're done, you'll be able to write and run fast PySpark programs that are scalable, efficient to operate, and easy to debug.

what's inside

  • Packaging your PySpark code
  • Managing your data as it scales across multiple machines
  • Rewriting pandas, R, and SAS jobs in PySpark
  • Troubleshooting common data pipeline problems
  • Creating reliable long-running jobs

about the reader

Written for intermediate data scientists and data engineers comfortable with Python.

about the author

As a data scientist for an engineering consultancy, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.

A phenomenal introduction to PySpark from the ground up.

Anonymous Reviewer

A great book to get you started with PySpark!

Jeremy Loscheider

Takes you on an example-focused tour of building PySpark data structures from the data you provide and processing them at speed.

Alex Lucas