Statistics Every Programmer Needs

Practical Python implementations and quantitative methods

Gary Sutton

July 2025
ISBN 9781633436053
448 pages

printed in black & white

Table of contents

1 Laying the groundwork

1.1 Stats and quant

1.1.1 Understanding the basics

1.1.2 Why they matter

1.1.3 The broader effect

1.1.4 Diving deeper: Core concepts

1.2 Why Python?

1.2.1 Rich ecosystem

1.2.2 Ease of learning

1.2.3 Online support and community

1.2.4 Industry adoption

1.2.5 Versatility

1.3 Python IDEs

1.3.1 IDLE: A starting point

1.3.2 PyCharm: A professional tool

1.3.3 Other popular IDEs

1.4 Benefits and learning approach

1.4.1 From statistical measures to real-world application

1.4.2 Expanding beyond traditional techniques

1.4.3 A balanced approach to theory and practice

1.5 How this book works

1.5.1 Foundational learning with exploration and practice

1.5.2 Using Python for precision and efficiency

1.5.3 Adaptable learning for diverse skill levels

1.6 What this book does not cover

2 Exploring probability and counting

2.1 Basic probabilities

2.1.1 Probability types

2.1.2 Converting and measuring probabilities

2.2 Counting rules

2.2.1 Multiplication rule

2.2.2 Addition rule

2.2.3 Combinations and permutations

2.3 Continuous random variables

2.3.1 Examples

2.3.2 Probability density function

2.3.3 Cumulative distribution function

2.4 Discrete random variables

2.4.1 Examples

2.4.2 Probability mass function

2.4.3 Cumulative distribution function

3 Exploring probability distributions and conditional probabilities

3.1 Probability distributions

3.1.1 Normal distribution

3.1.2 Binomial distribution

3.1.3 Discrete uniform distribution

3.1.4 Poisson distribution

3.2 Probability problems

3.2.1 Complement rule for probability

3.2.2 Quick reference guide

3.2.3 Applied probability: Examples and solutions

3.3 Conditional probabilities

3.3.1 Examples

3.3.2 Conditional probabilities and independence

3.3.3 Intuitive approach to conditional probability

3.3.4 Formulaic approach to conditional probability

4 Fitting a linear regression

4.1 Primer on linear regression

4.1.1 Linear equation

4.1.2 Goodness of fit

4.1.3 Conditions for best fit

4.2 Simple linear regression

4.2.1 Importing and exploring the data

4.2.2 Fitting the model

4.2.3 Interpreting and evaluating the results

4.2.4 Testing model assumptions

5 Fitting a logistic regression

5.1 Logistic regression vs. linear regression

5.2 Multiple logistic regression

5.2.1 Importing and exploring the data

5.2.2 Fitting the model

5.2.3 Interpreting and evaluating the results

5.2.4 Calculating and evaluating classification metrics

6 Fitting a decision tree and a random forest

6.1 Understanding decision trees and random forests

6.2 Importing, wrangling, and exploring the data

6.2.1 Understanding the data

6.2.2 Wrangling the data

6.2.3 Exploring the data

6.3 Fitting a decision tree

6.3.1 Splitting the data

6.3.2 Fitting the model

6.3.3 Predicting responses

6.3.4 Evaluating the model

6.3.5 Plotting the decision tree

6.3.6 Interpreting and understanding decision trees

6.3.7 Advantages and disadvantages of decision trees

6.4 Fitting a random forest

6.4.1 Fitting the model

6.4.2 Predicting responses

6.4.3 Evaluating the model

6.4.4 Feature importance

6.4.5 Extracting random trees

7 Fitting time series models

7.1 Distinguishing forecasts from predictions

7.2 Importing and plotting the data

7.2.1 Fetching financial data

7.2.2 Understanding the data

7.2.3 Plotting the data

7.3 Fitting an ARIMA model

7.3.1 Autoregression (AR) component

7.3.2 Integration (I) component

7.3.3 Moving average (MA) component

7.3.4 Combining ARIMA components

7.3.5 Stationarity

7.3.6 Differencing

7.3.7 Stationarity and differencing applied

7.3.8 AR and MA components

7.3.9 Fitting the model

7.3.10 Evaluating model fit

7.3.11 Forecasting

7.4 Fitting exponential smoothing models

7.4.1 Model structure

7.4.2 Applicability

7.4.3 Mathematical properties

7.4.4 Types of exponential smoothing models

7.4.5 Choosing between ARIMA and exponential smoothing

7.4.6 SES and DES models

7.4.7 Holt–Winters model

8 Transforming data into decisions with linear programming

8.1 Problem formulation

8.1.1 The scenario

8.1.2 The challenge

8.1.3 The approach

8.1.4 Feature summaries

8.2 Developing the linear optimization framework

8.2.1 Explanation of linear equations and inequalities

8.2.2 Data definition

8.2.3 Objective function

8.2.4 Constraints

8.2.5 Decision variable bounds

8.2.6 Solving the linear programming problem

8.2.7 Result evaluation

9 Running Monte Carlo simulations

9.1 Applications and benefits of Monte Carlo simulations

9.2 Step-by-step process

9.3 Hands-on approach

9.3.1 Establishing a probability distribution (step 1)

9.3.2 Computing a cumulative probability distribution (step 2)

9.3.3 Establishing an interval of random numbers for each variable (step 3)

9.3.4 Generating random numbers (step 4)

9.3.5 Simulating a series of trials (step 5)

9.3.6 Analyzing the results (step 6)

9.4 Automating simulations on discrete data

9.4.1 Plotting and analyzing the results

9.5 Automating simulations on continuous data

9.5.1 Predicting stock prices with Monte Carlo simulations

9.5.2 Analyzing historical data (step 1)

9.5.3 Calculating log returns (step 2)

9.5.4 Computing statistical parameters (step 3)

9.5.5 Generating random daily returns (step 4)

9.5.6 Simulating prices (step 5)

9.5.7 Simulating multiple trials (step 6)

9.5.8 Analyzing the results (step 7)

10 Building and plotting a decision tree

10.1 Decision-making without probabilities

10.1.1 Maximax method

10.1.2 Maximin method

10.1.3 Minimax Regret method

10.1.4 Expected Value method

10.2 Decision trees

10.2.1 Creating the schema

10.2.2 Plotting the tree

11 Predicting future states with Markov analysis

11.1 Understanding the mechanics of Markov analysis

11.2 States and state probabilities

11.2.1 Understanding the vector of state probabilities for multistate systems

11.2.2 Matrix of transition probabilities

11.3 Equilibrium conditions

11.3.1 Predicting equilibrium conditions programmatically

11.4 Absorbing states

11.4.1 Obtaining the fundamental matrix

11.4.2 Predicting absorbing states

11.4.3 Predicting absorbing states programmatically

12 Examining and testing naturally occurring number sequences

12.1 Benford’s law explained

12.2 Naturally occurring number sequences

12.3 Uniform and random distributions

12.3.1 Uniform distribution

12.3.2 Random distribution

12.3.3 Plotted distributions

12.4 Examples

12.4.1 Street addresses

12.4.2 World population figures

12.4.3 Payment amounts

12.5 Validating Benford’s law

12.5.1 Chi-square test

12.5.2 Mean absolute deviation

12.5.3 Distortion factor and z-statistic

12.5.4 Mantissa statistics

13 Managing projects

13.1 Creating a work breakdown structure

13.2 Estimating activity times with PERT

13.3 Finding the critical path

13.3.1 Earliest times

13.3.2 Latest times

13.3.3 Slack

13.3.4 Finding the critical path programmatically

13.4 Estimating the probability of project completion

13.5 Crashing the project

14 Visualizing quality control

14.1 Quality control measures

14.1.1 Upper control limit and lower control limit

14.1.2 Mean and center line

14.1.3 Standard deviation

14.1.4 Range

14.1.5 Sample size

14.1.6 Proportion defective

14.1.7 Number of defective items

14.1.8 Number of defects

14.1.9 Defects per unit

14.1.10 Moving range

14.1.11 z-score

14.1.12 Process capability indices

14.2 Control charts for attributes

14.2.1 p-charts

14.2.2 np-charts

14.2.3 c-charts

14.2.4 g-charts

14.3 Control charts for variables

14.3.1 x-bar charts

14.3.2 r-charts

14.3.3 s-charts

14.3.4 I-MR charts

14.3.5 EWMA charts

Overview

6 Fitting a decision tree and a random forest

This chapter introduces tree-based models for classification and regression, focusing on how to fit, interpret, and evaluate decision trees and random forests. It frames the ideas with a realistic prediction task: whether an NFL team converts a fourth down. The workflow emphasizes practical data preparation and exploration—filtering to relevant plays, handling missing values, transforming score differential to reflect the offense’s perspective, encoding categories, and deriving the binary target (CONVERT) from yards gained versus yards to go. Exploratory analysis highlights which features are most informative, notably that shorter “to-go” distances strongly correlate with success, play type aligns with the yardage required, and quarter and score context offer secondary signals. Throughout, the chapter balances model building with interpretation, evaluation, and the trade-offs between simplicity, accuracy, and robustness.

The decision tree section explains recursive partitioning, split criteria (Gini impurity or entropy), stopping rules, pruning, and how to read node attributes. It shows a complete pipeline: selecting features (quarter, to-go, venue, score differential, play type), creating train/test splits, fitting a depth-limited classifier, making predictions, and assessing performance. The fitted tree achieves about 61% accuracy overall, with markedly better performance on successful conversions than failures, illustrating class-dependent behavior. Interpretation ties back to the data: the root split is at TO_GO ≤ 3.5, reflecting that short distances dominate outcomes; subsequent splits involve PLAY_TYPE, SCORE_DIFFERENTIAL, and venue, refining predictions while reducing impurity. The chapter also walks through the math of Gini impurity—how to compute it for categorical and numeric features, weight impurities across child nodes, and choose thresholds by scanning midpoints—making clear why TO_GO emerges as the most discriminative variable.

The random forest section builds on this foundation, describing how bagging (bootstrap sampling) and random feature selection at each split reduce variance and mitigate overfitting relative to a single tree. Using a modest forest (50 shallow trees), the model edges up to about 62% accuracy overall, again performing substantially better on successful conversions than failures. Feature importance ranks TO_GO and PLAY_TYPE as dominant, with quarter and score adding smaller but meaningful contributions; sampling individual trees from the ensemble reveals diverse structures and split choices. The chapter closes by contrasting advantages and drawbacks: decision trees are transparent and easy to deploy but can overfit and be unstable; random forests trade some interpretability for robustness and improved generalization. Together, these methods give programmers practical, statistically grounded tools for interpretable modeling and stronger predictive performance.

A grouped bar chart that displays the CONVERT class label counts by QUARTER. Teams were more successful than not converting fourth down attempts in the first three quarters, but less successful in the fourth quarter.

Paired histograms that display the distributions of TO_GO grouped by the CONVERT class labels. When teams succeeded on their fourth down attempts, they usually needed fewer than 3 yards to convert, and almost always fewer than 5 yards. When teams failed to convert on fourth down, they often needed to gain more than 5 yards, and sometimes up to 25 yards or more.

Paired histograms that display the distributions of SCORE_DIFFERENTIAL grouped by the CONVERT class labels. Regardless of the CONVERT class label, the values are approximately normally distributed about the mean.

Counts of fourth down conversion attempts categorized by the OFFENSIVE_TEAM_VENUE and CONVERT class labels. Teams playing at home were more successful than visiting teams in converting fourth down attempts.

Counts of fourth down conversion attempts categorized by the PLAY_TYPE and CONVERT class labels. Teams that ran the ball on fourth down were much more successful in converting those attempts compared to other teams that passed the ball instead. This doesn’t necessarily mean that running is typically a better strategy than passing; rather, it might simply reflect the yards required for a conversion.

A decision tree demonstrating the home loan approval process based on the applicant’s age and income. The tree is interpreted from top to bottom, starting at the root node and progressing to the leaf nodes. The left branch is followed when conditions are true, and the right branch is followed when conditions are false.

A plotted decision tree that represents the model’s decision-making process, showing how features are used to split the data and make predictions. This tree was pruned during construction, so it contains fewer splits, and uses fewer of the less significant features, than an unpruned tree would.

A close look at the root node attributes from the plotted decision tree. The root node and the internal nodes all contain these same attributes; leaf nodes have these same attributes, too, minus the condition statement at the top.

A close look at the very top of our decision tree—the root node and the first level of internal nodes. The root node has arrows pointing away from it, while internal nodes have arrows pointing toward and away from them.

A closer look at the left subtree—two levels of internal nodes (A) and, at the bottom, four leaf nodes (B). When PLAY_TYPE is 0, thereby indicating a running play, the decision tree predicts a successful fourth down conversion attempt (a); it doesn’t matter if the team on offense is the road or home team. Alternatively, when PLAY_TYPE is 1, indicating a passing play, the decision tree then evaluates SCORE_DIFFERENTIAL before making a prediction. If the team on offense is trailing by 22 points or more, the tree predicts a failed fourth down conversion attempt (b); but if the team on offense is trailing by fewer than 22 points, the decision tree predicts a successful conversion (c).

The right branch of the decision tree uses internal nodes TO_GO (A) and SCORE_DIFFERENTIAL (B) to establish a classification pathway, with the initial TO_GO split playing a crucial role in distinguishing the target class, followed by subsequent refinements that confirm the predicted class labels (a and b).

A feature importance plot generated from the RandomForestClassifier. It displays the relative importance of each feature to the final predicted class labels; the importances sum to 1, so the bars, if stacked, would form a single bar of height 1. The features are sorted, from left to right, in descending order of relative importance, with TO_GO and PLAY_TYPE accounting for approximately 80% of the model’s predictive power.

The first of two random trees from a random forest model containing 50 trees. Notice that QUARTER is at the root node; thus, based on this one random split of the data, QUARTER, which didn’t factor into our decision tree model, is the most significant variable in this subset for predicting the final class labels.

The second of two random trees from the same random forest model. The features and splits to construct one tree versus another can, and will, vary significantly.

Summary

- A decision tree is a supervised machine learning model used for classification and regression tasks. It splits the data into subsets based on feature values, creating a tree structure with decision nodes and leaf nodes representing predictions. Decision trees are easy to interpret and visualize.
- Training a decision tree involves splitting the data based on features to minimize node impurities. It uses the training set for learning and the test set for evaluation, ensuring unbiased accuracy and generalization assessment.
- Evaluating a decision tree includes computing overall accuracy by comparing predicted and actual labels in the test set. A confusion matrix provides a detailed performance breakdown, showing true/false positives and negatives for each class, helping to identify areas for improvement.
- Building a decision tree involves recursive splitting based on significant features to maximize class separation, continuing until the subsets are pure or meet stopping criteria. Interpretation follows the path from root to leaves, ensuring consistent model construction and understanding.
- A random forest is a supervised model for classification and regression, consisting of multiple decision trees trained on different data and feature subsets. This ensemble method typically enhances accuracy and reduces overfitting by aggregating predictions from multiple trees, providing robust performance.
- Evaluating a random forest involves steps similar to a decision tree, including computing overall accuracy and using a confusion matrix for granular performance insights. The aggregated results from multiple trees offer a more robust evaluation and reliable performance across classes.
- Feature importance in random forests measures each feature’s contribution to predictive power by assessing impurity reduction across all trees. This provides a comprehensive view of influential features, helping to identify and prioritize significant variables for accurate predictions.
- While this chapter focuses on decision trees and random forests, other tree-based methods, notably gradient boosting and implementations of it such as XGBoost, are also popular for tackling complex classification and regression problems. These models build on the strengths of decision trees, using boosting to improve accuracy and handle challenging data patterns, offering further options for analysis beyond what we covered here.

FAQ

What is a decision tree and how does it make predictions?

A decision tree recursively splits the data on feature thresholds that best separate the classes. Each internal node holds a test (for example, TO_GO ≤ 3.5), branches represent outcomes of that test, and each leaf holds a predicted class. Splits are chosen to minimize impurity (commonly Gini impurity or entropy), and growth stops via criteria like max depth or minimum samples per leaf. Predictions follow the path from root to a leaf.

What is a random forest and why does it often perform better than a single tree?

A random forest is an ensemble of many decision trees trained on bootstrap samples (bagging) with random feature subsets considered at each split. For classification, it predicts by majority vote across trees. This randomness reduces correlation between trees, lowers variance, mitigates overfitting, and typically improves robustness and accuracy compared to a single tree.
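The two ingredients named above, bootstrap sampling and majority voting, can be sketched with the standard library alone. This is a minimal illustration rather than the book's code: the helper names `bootstrap_sample` and `majority_vote` are hypothetical, and the five votes are made up.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bagging sample)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the most common class label among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)  # some rows repeat, some are left out

# Hypothetical votes from five trees for a single play (1 = converted)
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes; the two approaches usually agree, but not always.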

How was the target variable (CONVERT) created for the NFL fourth-down example?

The target CONVERT is a derived binary variable: if YARDS_GAINED is less than TO_GO, CONVERT = 0 (failed conversion); otherwise CONVERT = 1 (successful conversion). This transforms play-by-play data into a classification problem.
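As a rough sketch, the rule can be applied to a pandas DataFrame in one vectorized step. The column names come from the text; the four rows are illustrative values, not data from the book.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not taken from the book's data set
plays = pd.DataFrame({
    "YARDS_GAINED": [2, 0, 7, 1],
    "TO_GO":        [1, 3, 7, 4],
})

# CONVERT = 0 when the gain falls short of the distance needed, else 1
plays["CONVERT"] = np.where(plays["YARDS_GAINED"] < plays["TO_GO"], 0, 1)
```

Gaining exactly the required yardage counts as a success, which is why the comparison is a strict less-than.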

What key data-wrangling steps were needed before fitting the models?

- Filter plays to fourth downs with Run or Pass only
- Replace NaN values in YARDS_GAINED with 0
- Transform SCORE_DIFFERENTIAL to the offense’s perspective by flipping the sign when the offense was the Road team
- Map categorical strings to integers (OFFENSIVE_TEAM_VENUE and PLAY_TYPE)
- Select only needed columns at import with usecols to save memory
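The steps listed above might look like the following in pandas. This is a sketch under stated assumptions: the DOWN column name, the "Home" label, the venue integer codes, and the in-memory CSV are all hypothetical stand-ins. Only the Run = 0 and Pass = 1 encoding is taken from the figure captions.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the play-by-play file (illustrative values)
csv_text = """DOWN,PLAY_TYPE,YARDS_GAINED,TO_GO,SCORE_DIFFERENTIAL,OFFENSIVE_TEAM_VENUE,EXTRA
4,Run,2,1,7,Home,x
4,Pass,,5,7,Road,x
4,Punt,0,9,-3,Home,x
3,Pass,8,6,0,Road,x
"""

cols = ["DOWN", "PLAY_TYPE", "YARDS_GAINED", "TO_GO",
        "SCORE_DIFFERENTIAL", "OFFENSIVE_TEAM_VENUE"]
df = pd.read_csv(io.StringIO(csv_text), usecols=cols)  # skip unused columns

# Keep fourth-down runs and passes only
df = df[(df["DOWN"] == 4) & (df["PLAY_TYPE"].isin(["Run", "Pass"]))].copy()

# Missing yardage is treated as no gain
df["YARDS_GAINED"] = df["YARDS_GAINED"].fillna(0)

# Flip the sign for road offenses so the value is from the offense's view
road = df["OFFENSIVE_TEAM_VENUE"] == "Road"
df.loc[road, "SCORE_DIFFERENTIAL"] = -df.loc[road, "SCORE_DIFFERENTIAL"]

# Encode categories as integers (venue codes are an assumption)
df["PLAY_TYPE"] = df["PLAY_TYPE"].map({"Run": 0, "Pass": 1})
df["OFFENSIVE_TEAM_VENUE"] = df["OFFENSIVE_TEAM_VENUE"].map({"Road": 0, "Home": 1})
```

The sign flip has to run before the venue column is mapped to integers, because the mask compares against the original string label.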

Why did TO_GO become the root split in the decision tree?

The root split is the candidate split, over all features and thresholds, that yields the lowest weighted average Gini impurity across its two child nodes. Evaluating candidate splits showed TO_GO ≤ 3.5 produced the largest impurity reduction across the training data, making it the most informative feature at the root.
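To make the selection rule concrete, here is a small pure-Python sketch of scanning midpoints for a numeric feature and scoring each by weighted Gini impurity. The function names are hypothetical, and the six data points are contrived so that the best threshold lands at 3.5; they are not the book's data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def weighted_gini(values, labels, threshold):
    """Weighted average impurity of the two children of `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Scan midpoints between sorted unique values; return the best one."""
    uniq = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    return min(midpoints, key=lambda t: weighted_gini(values, labels, t))

# Illustrative data only: short distances convert, long ones fail
to_go   = [1, 2, 3, 4, 6, 10]
convert = [1, 1, 1, 0, 0, 0]
threshold = best_threshold(to_go, convert)
```

A full tree builder would repeat this scan for every feature and pick the feature and threshold pair with the lowest score, then recurse into each child node.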

Gini impurity vs. entropy: which did we use and what’s the difference?

The chapter uses Gini impurity (criterion="gini") for speed and simplicity. Both Gini and entropy measure node impurity and often produce similar trees. Entropy (the basis of information gain) can be more sensitive to class probability changes; Gini typically computes faster because it avoids logarithms.
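For a two-class node, both measures depend only on the share p of one class, so they are easy to compare side by side. The sketch below uses the standard textbook definitions, with hypothetical function names.

```python
import math

def gini_binary(p):
    """Gini impurity for a binary node with positive-class share p."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def entropy_binary(p):
    """Shannon entropy (bits) for a binary node with positive-class share p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={p:.1f}  gini={gini_binary(p):.3f}  entropy={entropy_binary(p):.3f}")
```

Both measures are zero for a pure node and peak at p = 0.5, where Gini reaches 0.5 and entropy reaches 1 bit.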

How were the models trained and evaluated in scikit-learn?

- Split features and target into train/test sets (70/30) with a fixed random_state for reproducibility
- Fit a DecisionTreeClassifier (gini, max_depth=3) and then a RandomForestClassifier (n_estimators=50, gini, max_depth=3)
- Predict on X_test and compute accuracy and confusion matrices to assess overall and per-class performance
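The workflow above can be sketched end to end with scikit-learn. The hyperparameters match those listed; everything else is an assumption for illustration: the data is synthetic, only two features are used, and random_state=0 is an arbitrary choice rather than the book's value.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: success is more likely on short distances
rng = np.random.default_rng(0)
n = 1000
to_go = rng.integers(1, 16, size=n)
play_type = rng.integers(0, 2, size=n)
X = np.column_stack([to_go, play_type])
y = (rng.random(n) < 1.0 / (1.0 + 0.3 * to_go)).astype(int)

# 70/30 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, criterion="gini",
                                max_depth=3, random_state=0)

for name, model in [("tree", tree), ("forest", forest)]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))  # rows: true, columns: predicted
```

Because the data here is synthetic, the printed accuracies say nothing about the fourth-down results reported in the chapter.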

What were the observed results for accuracy and class-wise performance?

The decision tree achieved about 61% accuracy; the random forest about 62%. Both models performed much better on CONVERT=1 than on CONVERT=0. For example, the tree had roughly 51% accuracy when the true label was 0 and about 72% when it was 1; the forest had about 51% (label 0) and 75% (label 1).
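Per-class figures like these can be read off a confusion matrix by dividing each diagonal count by its row total, assuming "accuracy when the true label was 0" means the share of true-0 plays predicted as 0. The counts below are made up for illustration and are not the book's results.

```python
# Rows are true labels, columns are predicted labels (scikit-learn's layout).
# Counts are made up for illustration; they are not the book's results.
cm = [[50, 50],   # true 0: 50 predicted 0, 50 predicted 1
      [25, 75]]   # true 1: 25 predicted 0, 75 predicted 1

total = sum(sum(row) for row in cm)
overall_accuracy = (cm[0][0] + cm[1][1]) / total

# Share of each true class that was predicted correctly (per-class recall)
per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
```

With these counts the overall accuracy is 0.625 while the per-class values are 0.5 and 0.75, showing how one headline number can hide uneven performance across classes.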

Which features mattered most in the random forest and how is feature importance interpreted?

Feature importance scores showed TO_GO and PLAY_TYPE contributed the majority of predictive power (about 80% together). Higher importance means a feature more frequently and effectively reduces impurity across the ensemble’s splits. QUARTER, while less influential overall, still showed signal in some trees.
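In scikit-learn, impurity-based importances are exposed through the fitted model's feature_importances_ attribute, and individual trees through estimators_. The sketch below uses synthetic data in which only TO_GO drives the target, so the ranking is contrived and does not reproduce the book's figures.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
names = ["TO_GO", "PLAY_TYPE", "QUARTER"]
X = np.column_stack([
    rng.integers(1, 16, size=n),   # TO_GO
    rng.integers(0, 2, size=n),    # PLAY_TYPE
    rng.integers(1, 5, size=n),    # QUARTER
])
# Synthetic target driven by TO_GO only
y = (X[:, 0] <= 3).astype(int)

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)

# Importances are normalized, so they sum to 1 across features
importances = forest.feature_importances_
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

# An individual tree can be pulled from the ensemble for inspection
first_tree = forest.estimators_[0]
```

Because each tree sees a different bootstrap sample and a random subset of features at each split, a feature with low overall importance can still appear near the top of an individual tree.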

What are key pros and cons of decision trees versus random forests?

- Decision trees: easy to interpret and visualize, handle numeric and categorical data, capture non-linearities; but can overfit, be unstable to small data changes, and may underperform alone.
- Random forests: reduce overfitting and improve stability/accuracy via ensembling and randomness; but are less interpretable than a single tree and require more computation.
