Statistics Every Programmer Needs

Practical Python implementations and quantitative methods

Gary Sutton

July 2025
ISBN 9781633436053
448 pages

printed in black & white

table of contents

1 Laying the groundwork

1.1 Stats and quant

1.1.1 Understanding the basics

1.1.2 Why they matter

1.1.3 The broader effect

1.1.4 Diving deeper: Core concepts

1.2 Why Python?

1.2.1 Rich ecosystem

1.2.2 Ease of learning

1.2.3 Online support and community

1.2.4 Industry adoption

1.2.5 Versatility

1.3 Python IDEs

1.3.1 IDLE: A starting point

1.3.2 PyCharm: A professional tool

1.3.3 Other popular IDEs

1.4 Benefits and learning approach

1.4.1 From statistical measures to real-world application

1.4.2 Expanding beyond traditional techniques

1.4.3 A balanced approach to theory and practice

1.5 How this book works

1.5.1 Foundational learning with exploration and practice

1.5.2 Using Python for precision and efficiency

1.5.3 Adaptable learning for diverse skill levels

1.6 What this book does not cover

2 Exploring probability and counting

2.1 Basic probabilities

2.1.1 Probability types

2.1.2 Converting and measuring probabilities

2.2 Counting rules

2.2.1 Multiplication rule

2.2.2 Addition rule

2.2.3 Combinations and permutations

2.3 Continuous random variables

2.3.1 Examples

2.3.2 Probability density function

2.3.3 Cumulative distribution function

2.4 Discrete random variables

2.4.1 Examples

2.4.2 Probability mass function

2.4.3 Cumulative distribution function

3 Exploring probability distributions and conditional probabilities

3.1 Probability distributions

3.1.1 Normal distribution

3.1.2 Binomial distribution

3.1.3 Discrete uniform distribution

3.1.4 Poisson distribution

3.2 Probability problems

3.2.1 Complement rule for probability

3.2.2 Quick reference guide

3.2.3 Applied probability: Examples and solutions

3.3 Conditional probabilities

3.3.1 Examples

3.3.2 Conditional probabilities and independence

3.3.3 Intuitive approach to conditional probability

3.3.4 Formulaic approach to conditional probability

4 Fitting a linear regression

4.1 Primer on linear regression

4.1.1 Linear equation

4.1.2 Goodness of fit

4.1.3 Conditions for best fit

4.2 Simple linear regression

4.2.1 Importing and exploring the data

4.2.2 Fitting the model

4.2.3 Interpreting and evaluating the results

4.2.4 Testing model assumptions

5 Fitting a logistic regression

5.1 Logistic regression vs. linear regression

5.2 Multiple logistic regression

5.2.1 Importing and exploring the data

5.2.2 Fitting the model

5.2.3 Interpreting and evaluating the results

5.2.4 Calculating and evaluating classification metrics

6 Fitting a decision tree and a random forest

6.1 Understanding decision trees and random forests

6.2 Importing, wrangling, and exploring the data

6.2.1 Understanding the data

6.2.2 Wrangling the data

6.2.3 Exploring the data

6.3 Fitting a decision tree

6.3.1 Splitting the data

6.3.2 Fitting the model

6.3.3 Predicting responses

6.3.4 Evaluating the model

6.3.5 Plotting the decision tree

6.3.6 Interpreting and understanding decision trees

6.3.7 Advantages and disadvantages of decision trees

6.4 Fitting a random forest

6.4.1 Fitting the model

6.4.2 Predicting responses

6.4.3 Evaluating the model

6.4.4 Feature importance

6.4.5 Extracting random trees

7 Fitting time series models

7.1 Distinguishing forecasts from predictions

7.2 Importing and plotting the data

7.2.1 Fetching financial data

7.2.2 Understanding the data

7.2.3 Plotting the data

7.3 Fitting an ARIMA model

7.3.1 Autoregression (AR) component

7.3.2 Integration (I) component

7.3.3 Moving average (MA) component

7.3.4 Combining ARIMA components

7.3.5 Stationarity

7.3.6 Differencing

7.3.7 Stationarity and differencing applied

7.3.8 AR and MA components

7.3.9 Fitting the model

7.3.10 Evaluating model fit

7.3.11 Forecasting

7.4 Fitting exponential smoothing models

7.4.1 Model structure

7.4.2 Applicability

7.4.3 Mathematical properties

7.4.4 Types of exponential smoothing models

7.4.5 Choosing between ARIMA and exponential smoothing

7.4.6 SES and DES models

7.4.7 Holt–Winters model

8 Transforming data into decisions with linear programming

8.1 Problem formulation

8.1.1 The scenario

8.1.2 The challenge

8.1.3 The approach

8.1.4 Feature summaries

8.2 Developing the linear optimization framework

8.2.1 Explanation of linear equations and inequalities

8.2.2 Data definition

8.2.3 Objective function

8.2.4 Constraints

8.2.5 Decision variable bounds

8.2.6 Solving the linear programming problem

8.2.7 Result evaluation

9 Running Monte Carlo simulations

9.1 Applications and benefits of Monte Carlo simulations

9.2 Step-by-step process

9.3 Hands-on approach

9.3.1 Establishing a probability distribution (step 1)

9.3.2 Computing a cumulative probability distribution (step 2)

9.3.3 Establishing an interval of random numbers for each variable (step 3)

9.3.4 Generating random numbers (step 4)

9.3.5 Simulating a series of trials (step 5)

9.3.6 Analyzing the results (step 6)

9.4 Automating simulations on discrete data

9.4.1 Plotting and analyzing the results

9.5 Automating simulations on continuous data

9.5.1 Predicting stock prices with Monte Carlo simulations

9.5.2 Analyzing historical data (step 1)

9.5.3 Calculating log returns (step 2)

9.5.4 Computing statistical parameters (step 3)

9.5.5 Generating random daily returns (step 4)

9.5.6 Simulating prices (step 5)

9.5.7 Simulating multiple trials (step 6)

9.5.8 Analyzing the results (step 7)

10 Building and plotting a decision tree

10.1 Decision-making without probabilities

10.1.1 Maximax method

10.1.2 Maximin method

10.1.3 Minimax Regret method

10.1.4 Expected Value method

10.2 Decision trees

10.2.1 Creating the schema

10.2.2 Plotting the tree

11 Predicting future states with Markov analysis

11.1 Understanding the mechanics of Markov analysis

11.2 States and state probabilities

11.2.1 Understanding the vector of state probabilities for multistate systems

11.2.2 Matrix of transition probabilities

11.3 Equilibrium conditions

11.3.1 Predicting equilibrium conditions programmatically

11.4 Absorbing states

11.4.1 Obtaining the fundamental matrix

11.4.2 Predicting absorbing states

11.4.3 Predicting absorbing states programmatically

12 Examining and testing naturally occurring number sequences

12.1 Benford’s law explained

12.2 Naturally occurring number sequences

12.3 Uniform and random distributions

12.3.1 Uniform distribution

12.3.2 Random distribution

12.3.3 Plotted distributions

12.4 Examples

12.4.1 Street addresses

12.4.2 World population figures

12.4.3 Payment amounts

12.5 Validating Benford’s law

12.5.1 Chi-square test

12.5.2 Mean absolute deviation

12.5.3 Distortion factor and z-statistic

12.5.4 Mantissa statistics

13 Managing projects

13.1 Creating a work breakdown structure

13.2 Estimating activity times with PERT

13.3 Finding the critical path

13.3.1 Earliest times

13.3.2 Latest times

13.3.3 Slack

13.3.4 Finding the critical path programmatically

13.4 Estimating the probability of project completion

13.5 Crashing the project

14 Visualizing quality control

14.1 Quality control measures

14.1.1 Upper control limit and lower control limit

14.1.2 Mean and center line

14.1.3 Standard deviation

14.1.4 Range

14.1.5 Sample size

14.1.6 Proportion defective

14.1.7 Number of defective items

14.1.8 Number of defects

14.1.9 Defects per unit

14.1.10 Moving range

14.1.11 z-score

14.1.12 Process capability indices

14.2 Control charts for attributes

14.2.1 p-charts

14.2.2 np-charts

14.2.3 c-charts

14.2.4 g-charts

14.3 Control charts for variables

14.3.1 x-bar charts

14.3.2 r-charts

14.3.3 s-charts

14.3.4 I-MR charts

14.3.5 EWMA charts

Overview

9 Running Monte Carlo simulations

Monte Carlo simulations use random sampling to approximate the behavior of complex systems when analytical solutions are impractical. The chapter introduces the core idea—simulate many possible outcomes by drawing from specified probability distributions—and contrasts this approach with deterministic models that provide single-point forecasts. It emphasizes handling both discrete and continuous random variables, the importance of defining appropriate distributions, and the value of summarizing results with probabilities and statistics to quantify uncertainty, support scenario analysis, and improve decision-making under risk.

Through a hands-on discrete example, the chapter walks step by step from defining a Poisson distribution for employee absenteeism to computing cumulative probabilities, mapping outcomes to random-number intervals, generating random samples, running trials, and interpreting results. A small, manual run (10 trials) illustrates mechanics and the pitfalls of small samples; automation then scales to 500 simulations, normalizes probabilities, and produces stable outcome frequencies that inform staffing decisions (e.g., choosing between cost efficiency and higher service levels). The discussion highlights reproducibility with random seeds, the role of expected value versus simulated variability, and why larger trial counts reduce the influence of anomalies while still reflecting rare events according to their probabilities.
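The first two steps of that walkthrough (defining the Poisson probabilities and accumulating them) can be sketched with the standard library alone. This is a minimal illustration, not the book's listing: the rate of 2 and the cutoff at k = 8 come from the text, while the helper name `poisson_pmf` is hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average rate is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0                    # rate parameter used in the absenteeism example
outcomes = list(range(9))    # k = 0..8; outcomes beyond 8 are ignored

pmf = [poisson_pmf(k, lam) for k in outcomes]

# A running total of the PMF gives the cumulative probabilities.
cdf = []
running = 0.0
for p in pmf:
    running += p
    cdf.append(running)

for k, p, c in zip(outcomes, pmf, cdf):
    print(f"k={k}  P(X=k)={p:.4f}  P(X<=k)={c:.4f}")
```

The final cumulative value stays slightly below 1 because the tail beyond k = 8 has been cut off.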

For continuous data, the chapter applies Monte Carlo methods to forecast stock prices: gather historical prices, compute daily log returns, estimate mean and volatility, draw random returns from a normal model, and iteratively compound from the last observed price to generate many price paths over a chosen horizon. Analysis of simulated paths covers directionality (ending above/below start), summary statistics across all simulated prices, and practical insights such as clustering around central tendencies, widening uncertainty over time, and the inclusion of extreme scenarios for risk assessment. The chapter concludes that Monte Carlo methods complement theory with realistic variability, offer flexible scenario exploration, and provide a robust foundation for data-driven choices in uncertain environments.
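The parameter-estimation part of that workflow can be sketched as follows. The closing prices below are invented purely so the snippet runs; they are not GM data, and the variable names are assumptions rather than the book's.

```python
import math
import statistics

# Hypothetical closing prices, invented for illustration only.
closes = [36.10, 35.85, 36.40, 36.95, 36.70, 37.20, 36.90, 37.45]

# Log return for day t: ln(P_t / P_{t-1}).
log_returns = [math.log(today / yesterday)
               for yesterday, today in zip(closes, closes[1:])]

mu = statistics.mean(log_returns)       # average daily log return
sigma = statistics.stdev(log_returns)   # sample standard deviation (volatility)

print(f"mu={mu:.5f}  sigma={sigma:.5f}")
```

One convenient property of log returns is that they add up: the sum of the daily values equals the log of the last price divided by the first.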

A Poisson distribution with a rate parameter (𝜆) of 2. At lower rate parameters, the distribution is right-skewed, indicating that smaller outcomes are more likely while larger outcomes become increasingly rare.

A Poisson distribution and its cumulative distribution, with the bars representing the Poisson probabilities and the line representing the cumulative probabilities. The bars correspond to the primary y-axis on the left, while the line with circular markers corresponds to the secondary y-axis on the right.

Probability distribution from 500 Monte Carlo simulations closely resembling a theoretical Poisson distribution with a rate parameter of 2. The labels atop the bars represent the percentage of trials that resulted in each value of the random variable between 0 and 7. No trials resulted in a 𝑘 value of 8, consistent with the low theoretical probability of such an outcome.

Density plot of the log returns for GM stock between July and December 2023. The distribution is approximately normal, centered around a mean close to zero, with most log returns clustered near zero but with some larger positive and negative values in the tails.

500 Monte Carlo simulations of GM’s closing stock price for January 2024. Each line corresponds to a single simulation, depicting a potential stock price trajectory based on the historical mean and volatility of log returns.

Summary

Monte Carlo simulations provide a powerful tool for modeling and analyzing complex systems with inherent uncertainty, offering insights into the range of possible outcomes and their probabilities. This chapter demonstrated how to apply Monte Carlo simulations to both discrete and continuous data, highlighting the differences in approach and the specific steps involved in each case.

For discrete data, such as employee absenteeism, simulations relied on historical frequencies and discrete probability distributions to generate potential scenarios. For continuous data, such as stock price movements, the process involved calculating key statistical parameters like the mean and standard deviation to model the variability of outcomes.

The simulations illustrated the importance of capturing inherent randomness and variability. Discrete simulations provided clarity in scenarios with distinct outcomes, while continuous simulations accounted for the fluidity of real-world processes. The chapter also emphasized the role of visual tools, such as density plots and trajectory graphs, in interpreting simulation results and understanding the implications of variability.

Running a large number of simulations was highlighted as essential for achieving robust results, smoothing out anomalies, and providing reliable insights. By generating a range of potential future outcomes, Monte Carlo simulations empower decision-makers to assess risks, plan for contingencies, and optimize strategies in uncertain environments.

The hands-on approach demonstrated in this chapter underscored the versatility of Monte Carlo methods in addressing both discrete and continuous uncertainties. Looking ahead, the next chapter builds on these methods, exploring decision trees to derive expected values and assess alternatives, further enhancing our capacity for data-driven decision-making.

FAQ

What is a Monte Carlo simulation and when should I use it?

Monte Carlo simulations use random sampling to approximate the distribution of possible outcomes for a system. They are most useful when analytical solutions are difficult or impossible due to complexity, nonlinearity, many interacting variables, or uncertainty. Typical applications include finance (risk and pricing), engineering (reliability), project management (timelines/budgets), and the natural sciences (stochastic processes).

How do discrete and continuous random variables differ in Monte Carlo simulations?

- Discrete: Outcomes take distinct values (for example, number of absentees). You define a probability mass function, convert it to a CDF, and map random numbers to outcome intervals.
- Continuous: Outcomes span ranges (for example, stock prices). You sample directly from continuous distributions (for example, normal for log returns) and propagate values through a model (for example, compounding returns) to build paths.

What are the core steps to run a Monte Carlo simulation?

1) Define probability distributions for uncertain inputs.
2) Compute the cumulative distribution function (CDF).
3) Map outcome probabilities to random-number intervals (discrete) or set up continuous sampling.
4) Generate random numbers (set a seed for reproducibility if needed).
5) Run many trials to sample outcomes.
6) Aggregate and analyze results (summary stats, visuals, percentiles, scenario/risk metrics).

Why compute a CDF and how is it used to map random numbers to outcomes?

The CDF accumulates probabilities up to each outcome, creating contiguous probability ranges that cover 0–1 (or 0–100%). By drawing a random number, you select the unique interval it falls into, which determines the outcome. This makes sampling consistent with the specified probabilities and simplifies implementation.
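As a rough sketch of that lookup, the snippet below builds the cumulative edges for a small made-up distribution and uses a binary search to find the interval containing each random number. The three outcomes, their probabilities, and the function name `draw` are all illustrative assumptions, not taken from the chapter.

```python
import bisect
import random

# Illustrative three-outcome distribution (not from the chapter).
outcomes = ["low", "medium", "high"]
probs = [0.5, 0.3, 0.2]

# Cumulative probabilities: the upper edge of each outcome's interval.
cdf = []
running = 0.0
for p in probs:
    running += p
    cdf.append(running)
cdf[-1] = 1.0  # guard against floating-point drift in the final edge

def draw(u: float) -> str:
    """Map a uniform number u in [0, 1) to the outcome whose interval contains it."""
    return outcomes[bisect.bisect_right(cdf, u)]

rng = random.Random(42)  # seed value is arbitrary
sample = [draw(rng.random()) for _ in range(10)]
print(sample)
```

Here "low" owns [0, 0.5), "medium" owns [0.5, 0.8), and "high" owns [0.8, 1), so each outcome is selected with a frequency that tends toward its probability as the number of draws grows.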

How are random-number intervals assigned to discrete outcomes (for example, Poisson absenteeism)?

You allocate spans of two-digit random numbers (for example, 00–99) proportional to each outcome’s probability. Higher-probability outcomes receive larger ranges; rare outcomes can receive a single number. If needed, the last range can wrap around at the end so that every number is used. This guarantees alignment between draws and the target distribution.
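One way to picture the allocation is the sketch below. It assumes Poisson probabilities for a rate of 2 rounded to whole percentages, which is a simplification chosen here and may differ from the table in the book; under that rounding, outcomes of 7 or more fall to 0% and receive no interval. The names `percent_by_outcome` and `lookup` are hypothetical.

```python
import random

# Poisson(lambda=2) probabilities rounded to whole percentages (assumption).
# Outcomes of 7 or more round to 0% and therefore get no interval here.
percent_by_outcome = {0: 14, 1: 27, 2: 27, 3: 18, 4: 9, 5: 4, 6: 1}
assert sum(percent_by_outcome.values()) == 100

# Assign consecutive two-digit ranges: (outcome, first, last), inclusive.
intervals = []
start = 0
for outcome, width in percent_by_outcome.items():
    intervals.append((outcome, start, start + width - 1))
    start += width

def lookup(number: int) -> int:
    """Return the outcome whose interval contains a random number in 00-99."""
    for outcome, first, last in intervals:
        if first <= number <= last:
            return outcome
    raise ValueError("number must be between 0 and 99")

for outcome, first, last in intervals:
    print(f"{outcome} absences: {first:02d}-{last:02d}")

rng = random.Random(1)  # seed value is arbitrary
draws = [rng.randrange(100) for _ in range(10)]
print([(n, lookup(n)) for n in draws])
```

With these rounded values, zero absences owns 00 to 13, the two most likely outcomes each own 27 numbers, and six absences owns only the number 99.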

How many trials should I run, and why set a random seed?

More trials reduce sampling noise and stabilize estimates (law of large numbers), but beyond a point they yield diminishing returns, because sampling error shrinks only in proportion to the square root of the number of trials. Hundreds to thousands often suffice for many problems. A random seed ensures reproducibility, letting you regenerate the same sequence for validation, debugging, and comparison.
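Both points can be checked with a small experiment. The sketch below swaps in a fair six-sided die (true mean 3.5) as a stand-in system, which is an illustrative choice and not an example from the chapter; `simulate_mean` is a hypothetical helper.

```python
import random
import statistics

def simulate_mean(trials: int, seed: int) -> float:
    """Average of `trials` rolls of a fair six-sided die (true mean is 3.5)."""
    rng = random.Random(seed)   # private generator, seeded for repeatability
    return float(statistics.mean(rng.randint(1, 6) for _ in range(trials)))

# Same seed and trial count: the estimate is reproduced exactly.
assert simulate_mean(500, seed=7) == simulate_mean(500, seed=7)

# More trials: estimates from different seeds tend to cluster closer together.
for trials in (10, 100, 1_000, 10_000):
    estimates = [simulate_mean(trials, seed) for seed in range(20)]
    spread = statistics.stdev(estimates)
    print(f"{trials:>6} trials: spread across seeds = {spread:.3f}")
```

Each tenfold increase in trials should shrink the spread by a factor of roughly three, which is the square-root behavior behind the diminishing returns.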

In the call-center example, why can the simulated staffing need differ from λ and from the expected value?

The Poisson rate (λ=2) and the expected value (~1.9976) summarize the distribution, but single or small sets of simulations sample its variability. For 10 trials, the mean outcome was 2.4; with more trials it would approach the expected value. Decisions (for example, overstaff by 2 vs 3) depend on risk tolerance: simulate to understand tail risks, not just the average.
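The tail risk behind the "2 versus 3" choice can also be read directly from the theoretical distribution, which gives a benchmark to compare simulated frequencies against. The sketch below is such a check under the stated Poisson model with a rate of 2; the helper `poisson_cdf` and the phrase "cover" are naming assumptions.

```python
import math

def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for a Poisson random variable with rate lam."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

lam = 2.0
for extra_staff in (2, 3):
    shortfall = 1.0 - poisson_cdf(extra_staff, lam)   # P(absences > extra_staff)
    print(f"overstaff by {extra_staff}: P(absences exceed cover) = {shortfall:.3f}")
```

Under this model, absences exceed two extra staff about 32% of the time and exceed three extra staff about 14% of the time, which is the trade-off between cost and service level that the average alone hides.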

When automating discrete simulations, why normalize probabilities and what happens to rare outcomes?

If you truncate a distribution’s tail (for example, ignore k>8 with tiny probabilities), the remaining probabilities won’t sum to 1. Normalize so the sampling weights are valid. Rare outcomes may appear infrequently or not at all in finite runs (for example, no k=8 in 500 trials), which is consistent with their very low probability.
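A minimal sketch of the normalization and a 500-trial run, using only the standard library, might look like this. The seed is arbitrary and the variable names are assumptions; the results will not match the book's figure, which was produced with its own code and seed.

```python
import math
import random
from collections import Counter

lam = 2.0
outcomes = list(range(9))   # keep k = 0..8 and drop the tail beyond 8

raw = [math.exp(-lam) * lam**k / math.factorial(k) for k in outcomes]
total = sum(raw)                      # slightly below 1 because of the cut tail
weights = [p / total for p in raw]    # normalized so the weights sum to 1

rng = random.Random(2024)             # seed value is arbitrary
draws = rng.choices(outcomes, weights=weights, k=500)

counts = Counter(draws)
for k in outcomes:
    share = counts.get(k, 0) / len(draws)
    print(f"k={k}: simulated {share:6.1%}   theoretical {weights[k]:6.1%}")
```

`random.choices` accepts relative weights, so the explicit division is shown mainly for clarity; samplers that take true probabilities, such as NumPy's `Generator.choice` with its `p` argument, reject inputs that do not sum to 1.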

How do you simulate continuous outcomes like stock prices with Monte Carlo?

- Compute daily log returns from historical prices.
- Estimate mean (mu) and standard deviation (sigma) of log returns.
- For each trial and day, sample a return from a normal distribution with mean mu and standard deviation sigma.
- Compound prices: P(t+1) = P(t) × exp(r_t).
- Repeat across many trials to form a distribution of paths and ending prices. Note the assumptions (approximate normality, constant volatility) and the fact that uncertainty widens over longer horizons.
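The steps above can be sketched as a short function. The starting price, mu, sigma, and horizon below are hypothetical values chosen only to make the snippet runnable, and `simulate_paths` is an illustrative name rather than the book's function.

```python
import math
import random

def simulate_paths(last_price, mu, sigma, days, trials, seed=None):
    """Return `trials` price paths, each a list of `days` simulated closes."""
    rng = random.Random(seed)
    paths = []
    for _ in range(trials):
        price = last_price
        path = []
        for _ in range(days):
            r = rng.gauss(mu, sigma)   # daily log return, mean mu, std dev sigma
            price *= math.exp(r)       # P(t+1) = P(t) * exp(r_t)
            path.append(price)
        paths.append(path)
    return paths

# Hypothetical inputs, chosen only to make the sketch runnable.
paths = simulate_paths(last_price=36.0, mu=0.0005, sigma=0.02,
                       days=21, trials=500, seed=11)
ending = [path[-1] for path in paths]
print(f"lowest end price {min(ending):.2f}, highest {max(ending):.2f}")
```

Because each step multiplies by a positive factor, simulated prices can never go negative, and the gap between the lowest and highest paths tends to grow as the horizon lengthens.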

How should I interpret and report Monte Carlo results to support decisions?

Summarize with histograms/CDFs, mean/median, standard deviation, ranges, and percentiles (for example, 5th/95th). Compare starting vs ending values, quantify probabilities of key events, and present scenario implications (best/likely/worst cases). Use the distribution—not a single point estimate—to align actions with risk tolerance and objectives.
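A compact version of such a report could look like the sketch below. The ending values here are synthetic stand-ins generated on the spot, not simulation output from the chapter, and the dictionary keys are arbitrary labels.

```python
import random
import statistics

# Stand-in for simulated ending values; real input would come from your own trials.
rng = random.Random(3)  # seed value is arbitrary
start_value = 100.0
ending_values = [start_value * rng.lognormvariate(0.0, 0.1) for _ in range(1_000)]

cuts = statistics.quantiles(ending_values, n=20)   # 19 cut points: 5%, 10%, ..., 95%
p5, p95 = cuts[0], cuts[-1]

summary = {
    "mean": statistics.mean(ending_values),
    "median": statistics.median(ending_values),
    "stdev": statistics.stdev(ending_values),
    "p5": p5,
    "p95": p95,
    "P(end > start)": sum(v > start_value for v in ending_values) / len(ending_values),
}
for name, value in summary.items():
    print(f"{name:>15}: {value:.3f}")
```

Reporting the 5th and 95th percentiles alongside the mean gives a decision-maker a plausible range, and the last entry turns the distribution into the probability of a specific event.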
