Data Science with Python and Dask
If you’re doing data analysis with Pandas, NumPy, or Scikit-Learn, you know about THE WALL. At some point, you need to introduce parallelism into your system to handle larger-scale data or analytics tasks. The problem with THE WALL is that it can require you to rewrite your code, redesign your system, or start over with an unfamiliar technology like Spark or Flink.
“A must read... Forget PySpark and start using Dask!”
Dask is a native parallel analytics tool designed to integrate seamlessly with the libraries you’re already using, including Pandas, NumPy, and Scikit-Learn. With Dask you can crunch and work with huge datasets, just using the tools you already use. And Data Science with Python and Dask is your guide to using Dask for your data projects without changing the way you work!
Table of Contents
Part 1: The Building Blocks of Scalable Computing
1 Why Scalable Computing Matters
1.1 Why Dask?
1.2 Cooking with DAGs
1.3 Scale out, concurrency, and recovery
1.3.1 Scale Up vs. Scale Out
1.3.2 Concurrency and resource management
1.3.3 Recovering from failures
1.4 Introducing the companion dataset
2 Introducing Dask
2.1 Hello Dask!
2.2 Visualizing DAGs
2.3 Lazy Computations
2.3.1 Data Locality
Part 2: Working with Large Structured Datasets
3 Introducing Dask DataFrames
3.1 Why Use DataFrames?
3.2 Dask and Pandas
3.2.1 Managing DataFrame Partitioning
3.2.2 What is the Shuffle?
3.3 Limitations of Dask DataFrames
4 Loading Data into DataFrames
4.1 Reading Data from Text Files
4.1.1 Using Dask Datatypes
4.1.2 Creating Schemas for Dask DataFrames
4.2 Reading Data from Relational Databases
4.3 Reading Data from HDFS and S3
4.4 Reading Data in Parquet Format
5 Cleaning and Transforming DataFrames
5.1 Working with Indexes and Axes
5.2 Dealing with Missing Values
5.3 Recoding Data
5.4 Elementwise Operations
5.5 Filtering and Reindexing DataFrames
5.6 Joining and Concatenating DataFrames
5.7 Writing Data to Text Files and Parquet Files
5.7.1 Writing to Delimited Text Files
5.7.2 Writing to Parquet Files
6 Summarizing and Analyzing DataFrames
6.1 Descriptive Statistics
6.2 Built-In Aggregate Functions
6.3 Custom Aggregate Functions
6.4 Rolling (Window) Functions
7 Visualizing DataFrames with Seaborn
7.1 The Prepare-Reduce-Collect-Plot Pattern
7.2 Visualizing Continuous Relationships with pairplot and regplot
7.3 Visualizing Categorical Relationships with violinplot
7.4 Visualizing Two Categorical Relationships with heatmap
8 Visualizing Location Data with Datashader
8.1 What Is Datashader and How Does It Work?
8.2 Plotting Location Data as an Interactive Heatmap
Part 3: Extending and Deploying Dask
9 Working with Bags and Arrays
9.1 Reading and Parsing Unstructured Data with Bags
9.2 Transforming, Filtering, and Folding Elements
9.3 Building Arrays and DataFrames from Bags
9.4 Using Bags for Parallel Text Analysis with NLTK
10 Machine Learning with Dask-ML
10.1 Building Linear Models with Dask-ML
10.2 Evaluating and Tuning Dask-ML Models
10.3 Persisting Dask-ML Models
11 Scaling and Deploying Dask
11.1 Building a Dask Cluster on Amazon AWS with Docker
11.1.1 Create a Security Key
11.1.2 Create the ECS Cluster
11.1.3 Configure the Cluster’s Networking
11.1.4 Create a Shared Data Drive in Elastic File System
11.1.5 Allocate Space for Docker Images in Elastic Container Repository
11.1.6 Build and Deploy Images for Scheduler, Worker, and Notebook
11.1.7 Connect to the Cluster
11.2 Running and Monitoring Dask Jobs on a Cluster
11.3 Cleaning Up the Dask Cluster on AWS
Appendix A: Software Installation
A.1 Install Additional Packages with Anaconda
A.2 Install Packages without Anaconda
About the Technology
Dask is a self-contained, easily extensible library designed to query, stream, filter, and consolidate huge datasets. Dask simplifies data parallelism, so you spend less time on low-level system management and more time exploring your data. Dask has built-in schedulers—subsystems that distribute computing tasks—that create parallelism on a single machine using threads or processes. More advanced schedulers allow you to fully exploit clusters or HPC systems.
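As a rough illustration of the single-machine schedulers mentioned above (the array shape and chunk size here are arbitrary choices, not examples from the book):

```python
import dask.array as da

# A 1000x1000 array of ones, split into 250x250 chunks
x = da.ones((1000, 1000), chunks=(250, 250))
total = x.sum()  # lazy: nothing has been computed yet

# Pick a single-machine scheduler explicitly: a thread pool, or the
# single-threaded "synchronous" scheduler (handy for debugging)
print(total.compute(scheduler="threads"))      # 1000000.0
print(total.compute(scheduler="synchronous"))  # 1000000.0
```

Swapping in a distributed scheduler later changes where the tasks run, not how the code is written.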
Large datasets tend to be distributed, non-uniform, and prone to change. Dask simplifies the process of ingesting, filtering, and transforming data, reducing or eliminating the need for a heavyweight framework like Spark. You can start and finish even massive data projects in Python.
About the book
Data Science with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data. You’ll begin with an introduction to the Dask framework, concentrating on how Dask natively scales commonly used Python libraries like NumPy and Pandas. With a particular focus on data analysis, you’ll immediately start exploring the huge amount of data found in the NYC 2013-2017 Parking Ticket database. You’ll be introduced to Dask DataFrames and learn helpful code patterns to streamline your analysis. You’ll also dig into visualization with Seaborn and learn to build machine learning models using Dask-ML.
As you work through Dask’s features, you’ll learn how to prepare and analyze the dataset to discover trends and patterns in NYC’s parking enforcement operations. How do the time of year and weather affect issued citations? Is the number of citations rising or falling? You’ll find out, and you’ll figure out how to discover similar trends in your own data! Along the way, you’ll look deeper into Dask Arrays and Bags, use Datashader to build interactive location-based visualizations, and learn to implement your own algorithms using custom task graphs. Finally, you’ll learn how to scale your Dask apps and build your very own Dask cluster using AWS and Docker.
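A custom task graph of the kind mentioned above can be expressed with `dask.delayed`; the three functions below are hypothetical stand-ins for real pipeline steps:

```python
from dask import delayed

@delayed
def load(n):
    # stand-in for reading raw data
    return list(range(n))

@delayed
def clean(xs):
    # stand-in for filtering out bad records
    return [x for x in xs if x % 2 == 0]

@delayed
def summarize(xs):
    return sum(xs)

# Composing the calls builds a DAG; nothing runs until .compute()
graph = summarize(clean(load(10)))
print(graph.compute())  # 0 + 2 + 4 + 6 + 8 = 20
```

Each decorated call becomes a node in the graph, and Dask’s scheduler decides how to execute the nodes in parallel.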
- Working with large structured datasets
- Writing your own DataFrames
- Cleaning and visualizing your DataFrames
- Machine learning with Dask-ML
- Working with Bags and Arrays
- Building distributed apps with Dask Distributed
- Packaging and deploying your Dask apps
About the reader
Written for data engineers and scientists with experience using Python. Knowledge of the PyData stack (Pandas, NumPy, and Scikit-learn) will be helpful. No experience with low-level parallelism is required.
About the author
Jesse Daniel has five years of experience writing applications in Python, including three years working with the PyData stack (Pandas, NumPy, SciPy, Scikit-Learn). Jesse joined the faculty of the University of Denver in 2016 as an adjunct professor of business information and analytics, where he currently teaches a Python for Data Science course.