Overview

4 Fitting a linear regression

This chapter presents linear regression as a core supervised learning technique for predicting numeric outcomes and answering practical questions programmers care about: whether a meaningful relationship exists between predictors and a response, its direction and strength, how to quantify expected changes, which variables matter in multiple regression, how well the model fits, and whether assumptions are satisfied. It distinguishes simple from multiple regression, frames the model as a line of best fit through data, and motivates its enduring value across domains such as marketing, housing, admissions, and retail forecasting. The treatment combines conceptual grounding with an end‑to‑end, hands‑on workflow to build, interpret, and validate a model.

On the theory side, the chapter covers the linear and multiple regression equations, ordinary least squares estimation, and residuals, then explains goodness‑of‑fit using R-squared while cautioning about its limits: it never decreases as predictors are added, does not imply causality, and can mask model misspecification or overfitting. It clarifies positive, negative, and neutral relationships and emphasizes the conditions for a good fit, notably the influence of non-normality and outliers. Practical remedies are discussed, including data transformations (log, reciprocal, square root, modest power) and outlier handling (removal or winsorization), with the warning that such steps alter the modeling scale, may hide nonlinearity better addressed by other methods, and should be justified and documented.
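Those remedies are straightforward to try in Python. The sketch below is a minimal illustration on synthetic data; the sample values and the 5% winsorizing limits are assumptions for demonstration, not the chapter's own example.

import numpy as np
from scipy.stats.mstats import winsorize

# Synthetic, roughly normal sample with one extreme value (illustrative only).
rng = np.random.default_rng(42)
x = np.append(rng.normal(200, 10, 50), 320.0)

x_log   = np.log(x)        # log transform (values must be strictly positive)
x_sqrt  = np.sqrt(x)       # milder square-root transform
x_recip = 1.0 / x          # reciprocal transform

# Winsorization: cap the lowest and highest 5% of values instead of dropping them.
x_wins = winsorize(x, limits=[0.05, 0.05])
print(x.max(), x_wins.max())   # the extreme value is pulled back toward the bulk of the data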

On the practice side, readers import and explore a small race‑timing dataset with pandas, compute descriptive statistics, test variable normality via Shapiro–Wilk, and screen for outliers using a three‑standard‑deviation rule. They fit an OLS model with statsmodels, interpret coefficients (intercept and slope) and predictions, and connect output back to sums of squares (SST, SSR, SSE) to derive R-squared. Model significance is evaluated with the F‑statistic and its p‑value, and individual predictors via t‑tests. Assumptions are then diagnosed: linearity (residuals plot), independence (Durbin–Watson), homoscedasticity (Breusch–Pagan), and residual normality (Q–Q plot and Jarque–Bera). The chapter closes by noting multicollinearity concerns in multiple regression and reinforcing a careful, transparent workflow to balance simplicity, interpretability, and generalization.
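As a preview of that hands-on workflow, the first step is simply loading and summarizing the data with pandas. The file name below is a placeholder; the chapter's data frame is called mds and includes stage1 and stage2 columns.

import pandas as pd

# Placeholder file name; the chapter works with a race-timing data frame called mds.
mds = pd.read_csv("race_times.csv")

print(mds.head())        # peek at the first rows
print(mds.describe())    # descriptive statistics for every numeric column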

A scatter plot that displays 8 data points, their respective x and y coordinates, and a regression line. The data points represent the observed data (for instance, when x equals 202, y equals 288, which has been highlighted and magnified). The regression line ties back to a simple linear regression fit on the data, in which a response variable called y, plotted on the y-axis, was regressed against a predictor called x, plotted on the x-axis. The line is drawn from the linear equation at the top of the plot, which comes from the model output; in other words, it traces the model's predictions of y for each value of x. R-squared is one of several metrics contained in the model output; it represents the percentage of variance in the response variable that can be explained by changes in the predictor.
A scatter plot that shows a negative relationship between variables. The relationship is negative because the variables move in opposite directions—as the independent variable x increases, the dependent variable y decreases.
A scatter plot that shows a neutral relationship between variables. The relationship is neutral because changes in the independent variable x appear to have almost no effect on the dependent variable y.
A scatter plot that displays 7 data points rather than 8. The one presumed outlier was removed and another regression was fit, resulting in a coefficient of determination now equal to 0.95, which was previously equal to 0.88. By removing one outlier, we’ve made a strong model even stronger.
A scatter plot that displays the observed stage1 and stage2 values from the mds data frame, their respective x and y coordinates, and a regression line that represents the predictions for stage2 from a simple linear regression where stage2 was regressed against stage1. The regression line is drawn by applying the linear equation at the top of the plot. R-squared is one of several metrics contained in the model output; it represents the percentage of variance in stage2 that can be explained by changes in stage1.
A snippet from a typical F-table where the selected significance level equals 5%. The critical value is located at the intersection of the numerator degrees of freedom (the predictor count, equal to 1) and the denominator degrees of freedom (the observation count minus the number of predictors, minus 1, here equal to 19). Because the F-statistic exceeds the critical value, the model explains a statistically significant amount of the variance in the response variable.
A residuals plot that displays the fitted values from a linear regression on the x-axis and the residuals from the same model on the y-axis. The linearity assumption is supported when the residuals show no obvious pattern or trend.
A Q-Q plot that displays the distribution of the residuals (points) versus a perfectly normal distribution (dashed line). Linear regression assumes that the residuals are normally distributed; however, our Q-Q plot might suggest otherwise.

Summary

  • Linear regression is a statistical method used to model the relationship between a numeric dependent variable and one or more independent variables by fitting a linear equation to observed data. It assumes a linear relationship between the independent and dependent variables, with the goal of estimating the parameters of the linear equation to minimize the overall difference between observed and predicted values. Linear regression is commonly used to predict and understand the relationship between variables across multiple domains, including economics, finance, and the social sciences.
  • Goodness of fit should be evaluated with multiple measures. R-squared, the coefficient of determination, is a number between 0 and 1 that specifies the proportion of variance in the dependent variable explained by the regression; it may be the most meaningful single measure, but it is hardly the only one that matters. Overall model significance is assessed with the F-statistic and its p-value, and the significance (or insignificance) of individual predictors is judged from the p-values of their coefficients.
  • Testing the model assumptions (linearity between variables, independence, homoscedasticity, and normality of the residuals) underpins the reliability and validity of regression results and therefore facilitates interpretation and decision-making.
  • Applying best practices along the way, like testing the predictors for normality and handling outliers thoughtfully, contributes considerably to achieving the best possible fit and improves the odds of clean results in post-regression tests.

FAQ

When should I use linear regression, and when should I avoid it?
Use it when you want to model or predict a numeric (continuous) response from one or more predictors and the relationship is approximately linear. Do not use linear regression for classification problems; consider logistic regression or other classifiers instead.
What is the basic linear regression equation and how do I interpret the coefficients?
The simple model is ŷ = β₀ + β₁x. β₀ (intercept) is the predicted value when x = 0. β₁ (slope) is the expected change in the response for a one-unit increase in x. Residuals are the differences between observed and predicted values.
How do I fit a simple linear regression in Python with statsmodels?
Follow these steps (a runnable sketch appears after the list):
- Define y (response) and x (predictor)
- Add an intercept with sm.add_constant(x)
- Fit the model: lm = sm.OLS(y, x).fit()
- Review results: print(lm.summary())
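Put together, the steps above look like the following. The data here are synthetic (loosely echoing the highlighted point in the first figure), so the exact numbers are illustrative.

import pandas as pd
import statsmodels.api as sm

# Synthetic example data (illustrative values only).
df = pd.DataFrame({
    "x": [202, 205, 210, 198, 215, 220, 199, 207],
    "y": [288, 290, 295, 280, 301, 305, 283, 292],
})

X = sm.add_constant(df["x"])      # add the intercept column
lm = sm.OLS(df["y"], X).fit()     # ordinary least squares fit

print(lm.params)                  # β0 (const) and β1 (x)
print(lm.summary())               # full results table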
What does R-squared tell me, and what are its limitations?
R-squared is the proportion of variance in the response explained by the predictor(s), ranging from 0 to 1. Limitations: it does not show direction or causality, it can increase just by adding predictors (even weak ones), and a high value does not guarantee linearity or good generalization. Use Adjusted R-squared to penalize unnecessary predictors.
How do I interpret p-values in the output (P>|t| and Prob(F-statistic))?
- P>|t| tests each coefficient: a small p-value (e.g., < 0.05) suggests that predictor’s effect is statistically significant.
- Prob(F-statistic) tests the model as a whole: a small p-value indicates the regression explains a statistically significant portion of the response variance.
What assumptions does linear regression make, and how can I test them?
Check each assumption as follows (a combined diagnostic sketch follows this list):
- Linearity: check a residuals vs fitted plot; no pattern suggests linearity holds.
- Independence of residuals: use the Durbin–Watson statistic; values near 2 suggest independence.
- Homoscedasticity (constant variance): use the Breusch–Pagan test; p-value > 0.05 suggests constant variance.
- Normality of residuals: inspect a Q–Q plot and confirm with the Jarque–Bera test.
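Continuing from the fitting sketch above (it assumes the fitted result is named lm), one reasonable way to run all four checks with statsmodels and matplotlib is shown below; the specific helper functions are a choice, not the only option.

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

# Assumes `lm` is a fitted statsmodels OLS result, as in the earlier sketch.
resid  = lm.resid
fitted = lm.fittedvalues

# Linearity: residuals vs. fitted values should show no obvious pattern.
plt.scatter(fitted, resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()

# Independence: a Durbin–Watson statistic near 2 suggests uncorrelated residuals.
print("Durbin–Watson:", durbin_watson(resid))

# Homoscedasticity: a Breusch–Pagan p-value > 0.05 suggests constant variance.
bp_stat, bp_p, _, _ = het_breuschpagan(resid, lm.model.exog)
print("Breusch–Pagan p-value:", bp_p)

# Normality of residuals: Q–Q plot plus the Jarque–Bera test.
sm.qqplot(resid, line="45", fit=True)
plt.show()
jb_stat, jb_p, _, _ = jarque_bera(resid)
print("Jarque–Bera p-value:", jb_p)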
Should I test variables for normality before modeling? How?
While the normality assumption applies to the residuals, checking variable distributions helps. Use the Shapiro–Wilk test on predictors/response; a p-value > 0.05 suggests the data are not significantly different from normal. Normal predictors often lead to more normal residuals. For example:
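A minimal check with scipy (the data frame and column name are illustrative, following the loading sketch earlier):

from scipy.stats import shapiro

# `mds["stage1"]` stands in for any numeric pandas Series or array you want to check.
stat, p = shapiro(mds["stage1"])
print(f"Shapiro–Wilk: W = {stat:.3f}, p = {p:.3f}")   # p > 0.05: no evidence against normality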
How do outliers affect linear regression, and how can I handle them?
Outliers can unduly influence the fitted line and reduce model reliability. Detect them via scatter plots or rules like flagging values beyond ±3 standard deviations, as in the sketch below. Possible treatments: transform the data (log, sqrt, reciprocal), winsorize extreme values, or remove outliers with caution (risk of overfitting or bias). Always document your choices.
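A simple way to apply the ±3 standard deviation rule with pandas (the data frame and column name are illustrative; whether to drop the flagged rows is a judgment call, as noted above):

import numpy as np

# Flag observations more than three standard deviations from the column mean.
z = (mds["stage1"] - mds["stage1"].mean()) / mds["stage1"].std()
print(mds[np.abs(z) > 3])               # inspect the flagged rows before acting on them

# One possible treatment: drop them, keeping a record of what was removed and why.
mds_clean = mds[np.abs(z) <= 3].copy()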
What does the F-statistic measure, and how is it used?
The F-statistic evaluates whether the model explains significantly more variance than a model with only an intercept. You interpret it via its p-value (Prob(F-statistic)); a small value (e.g., < 0.05) indicates the model is statistically significant overall. Both values can be read straight off a fitted statsmodels result, as sketched below.
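With a fitted statsmodels result (the `lm` object from the earlier sketch), the overall test and the per-coefficient p-values are attributes of the result object:

# Overall model test.
print("F-statistic:", lm.fvalue)
print("Prob (F-statistic):", lm.f_pvalue)

# Per-coefficient P>|t| values.
print(lm.pvalues)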
What are SST, SSR, and SSE, and how do they relate to R-squared?
- SST (total sum of squares): total variability in the response around its mean.
- SSR (regression sum of squares): variability explained by the model.
- SSE (error sum of squares): unexplained variability (the sum of squared residuals).
They relate as SST = SSR + SSE, and R-squared = SSR / SST = 1 − SSE / SST; the short sketch below verifies this numerically.
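As a quick numeric check, the identities can be computed directly from a fitted model (again using the `lm` object from the earlier sketch):

import numpy as np

# Decompose the variability of the response for the fitted model `lm`.
y     = lm.model.endog                    # observed response values
y_hat = np.asarray(lm.fittedvalues)       # predicted values

sst = np.sum((y - y.mean()) ** 2)         # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)     # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)            # error (residual) sum of squares

print(sst, ssr + sse)                     # SST = SSR + SSE (up to rounding)
print(1 - sse / sst, lm.rsquared)         # both equal R-squared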
