Python has become a required skill for data science, and it’s easy to see why. It’s powerful, easy to learn, and supported by libraries like pandas, NumPy, and scikit-learn that help you slice, scrub, munge, and wrangle your data. Even with a great language and fantastic tools, though, there’s plenty to learn!
Exploring Data with Python is a collection of chapters from three Manning books, hand-picked by Naomi Ceder, the chair of the Python Software Foundation. This free eBook helps you build a foundation in data science with practical Python tips and techniques for working and aspiring data scientists. In it, you’ll get a clear introduction to the data science process. Then, you’ll practice using Python for processing, cleaning, and exploring interesting datasets. Finally, you’ll get a practical demonstration of modeling and prediction with classification and regression. When you finish, you’ll have a good overview of Python in data science and a well-lit path to continue your learning.
Introduction
Part 1: The data science process
2 The data science process
2.1 Overview of the data science process
2.1.1 Don’t be a slave to the process
2.2 Step 1: Defining research goals and creating a project charter
2.2.1 Spend time understanding the goals and context of your research
2.2.2 Create a project charter
2.3 Step 2: Retrieving data
2.3.1 Start with data stored within the company
2.3.2 Don’t be afraid to shop around
2.3.3 Do data quality checks now to prevent problems later
2.4 Step 3: Cleansing, integrating, and transforming data
2.4.1 Cleansing data
2.4.2 Correct errors as early as possible
2.4.3 Combining data from different data sources
2.4.4 Transforming data
2.5 Step 4: Exploratory data analysis
2.6 Step 5: Build the models
2.6.1 Model and variable selection
2.6.2 Model execution
2.6.3 Model diagnostics and model comparison
2.7 Step 6: Presenting findings and building applications on top of them
2.8 Summary
Part 2: Processing data files
21 Processing data files
21.1 Welcome to ETL
21.2 Reading text files
21.2.1 Text encoding: ASCII, Unicode, and others
21.2.2 Unstructured text
21.2.3 Delimited flat files
21.2.4 The csv module
21.2.5 Reading a csv file as a list of dictionaries
21.3 Excel files
21.4 Data cleaning
21.4.1 Cleaning
21.4.2 Sorting
21.4.3 Data cleaning issues and pitfalls
21.5 Writing data files
21.5.1 CSV and other delimited files
21.5.2 Writing Excel files
21.5.3 Packaging data files
Summary
Part 3: Exploring data
24 Exploring data
24.1 Python tools for data exploration
24.1.1 Python’s advantages for exploring data
24.1.2 Python can be better than a spreadsheet
24.2 Jupyter notebook
24.2.1 Starting a kernel
24.2.2 Executing code in a cell
24.3 Python and pandas
24.3.1 Why you might want to use pandas
24.3.2 Installing pandas
24.3.3 Data frames
24.4 Data cleaning
24.4.1 Loading and saving data with pandas
24.4.2 Data cleaning with a data frame
24.5 Data aggregation and manipulation
24.5.1 Merging data frames
24.5.2 Selecting data
24.5.3 Grouping and aggregation
24.6 Plotting data
24.7 Why you might not want to use pandas
Summary
Part 4: Modeling and prediction
3 Modeling and prediction
3.1 Basic machine-learning modeling
3.1.1 Finding the relationship between input and target
3.1.2 The purpose of finding a good model
3.1.3 Types of modeling methods
3.1.4 Supervised versus unsupervised learning
3.2 Classification: predicting into buckets
3.2.1 Building a classifier and making predictions
3.2.2 Classifying complex, nonlinear data
3.2.3 Classifying with multiple classes
3.3 Regression: predicting numerical values
3.3.1 Building a regressor and making predictions
3.3.2 Performing regression on complex, nonlinear data
3.4 Summary
3.5 Terms from this chapter
Index
Symbols
Numerics
What's inside
- "The data science process" from Introducing Data Science by Davy Cielen, Arno D. B. Meysman, and Mohamed Ali
- "Processing data files" from The Quick Python Book, Third Edition by Naomi Ceder
- "Exploring data" from The Quick Python Book, Third Edition by Naomi Ceder
- "Modeling and prediction" from Real-World Machine Learning by Henrik Brink, Joseph W. Richards, and Mark Fetherolf