Overview

5 The birth of information theory: Shannon and the mathematics of uncertainty

The chapter recounts how Claude Shannon reframed uncertainty into a measurable resource and, in doing so, founded information theory. Building on earlier statistical ideas yet striking in a new direction, Shannon treated a message as a probabilistic sequence of symbols and defined information as the reduction of uncertainty. By introducing entropy—quantified in bits—and insisting that communication engineering concerns transmission rather than meaning, he supplied a universal yardstick that made “information” concrete, comparable, and optimizable across any channel.

Armed with this definition, Shannon articulated a complete end‑to‑end model—source, encoder, channel with noise, decoder, receiver—and turned the central trade‑offs of communication into mathematics. Source coding tied the limits of lossless compression to a source’s entropy; noisy‑channel coding showed that carefully added redundancy can make errors arbitrarily rare below a channel’s capacity; and capacity itself set the ceiling for reliable information rate. These results, extended through practical signaling and modulation schemes, solved urgent mid‑century engineering problems and still underwrite today’s digital infrastructure—from telephony and storage to satellite links, Wi‑Fi, and streaming.

The chapter also traces Shannon’s unintended legacy far beyond wires and radio. Entropy and mutual information became core tools in statistics, data science, and AI: decision trees and random forests split on information gain; feature selection ranks predictors by mutual information; clustering is assessed with entropy‑based measures such as purity and normalized mutual information; and neural networks learn by minimizing cross‑entropy, while representation learning compresses inputs to preserve high‑entropy structure and discard redundancy. In all these settings, the bit remains the unit and entropy the common currency for quantifying, managing, and ultimately learning from uncertainty.

Entropy quantifies the uncertainty of a message in bits. A fair coin resolves 1 bit of uncertainty, while a biased coin (say, a 90/10 split) conveys less (about 0.47 bits) because the outcome is more predictable. A fair die roll (about 2.58 bits) and a draw from a 52-card deck (about 5.7 bits) carry progressively higher entropy as the number of equally likely outcomes increases. As entropy rises, so do uncertainty, surprise, information gain, and channel capacity requirements. When entropy is low, redundancy allows greater opportunities for compression.
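
These figures follow directly from the entropy formula H = -Σ p·log2(p). A minimal sketch in plain Python (no external libraries) that reproduces each of the numbers above:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero p."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin    -> 1.0 bit
print(entropy([0.9, 0.1]))    # 90/10 coin   -> ~0.47 bits
print(entropy([1/6] * 6))     # fair die     -> ~2.58 bits
print(entropy([1/52] * 52))   # 52-card draw -> ~5.70 bits
```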
An illustration of representation learning as compression. On the left is the original raw object, containing many pixels and redundancies. In representation learning, an encoder compresses this object into a compact form, shown here on the right. The compressed version looks blurrier, but it preserves the high-entropy elements—the unpredictable parts that carry real information—while discarding low-entropy redundancies. This compressed representation is the “message” that can be transmitted or stored efficiently. Although not depicted here, a decoder can reconstruct an image that closely resembles the original, demonstrating that redundancy can be discarded without losing essential content.
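
A learned autoencoder is beyond the scope of a summary, but the encode-compress-decode idea can be sketched with truncated SVD standing in for the encoder. Everything here (the synthetic image, the dimensions, the component count k) is an illustrative assumption, not material from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "image": low-rank structure (the informative part) plus noise (redundancy).
structure = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
image = structure + 0.05 * rng.normal(size=(64, 64))

# "Encoder": keep only the top-k singular components -> compact representation.
k = 4
U, s, Vt = np.linalg.svd(image, full_matrices=False)
code = U[:, :k] * s[:k]          # 64*k numbers (plus k basis rows) instead of 64*64

# "Decoder": reconstruct an approximation from the compact code.
reconstruction = code @ Vt[:k, :]

original_size = image.size
compressed_size = code.size + Vt[:k, :].size
error = np.linalg.norm(image - reconstruction) / np.linalg.norm(image)
print(f"{original_size} values -> {compressed_size} stored, relative error {error:.3f}")
```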

Summary

  • Claude Shannon redefined information not as meaning but as the resolution of uncertainty. A predictable message like “the sun will rise tomorrow” carries almost no information, while an unpredictable one like a coin flip resolves genuine doubt. This shift allowed information to be quantified in probabilistic terms rather than treated as a vague concept.
  • Borrowing from physics, Shannon formalized entropy as the average uncertainty in a source of messages. Measured in bits, entropy provides a precise yardstick: one bit for a fair coin flip, more for larger sets of equally likely outcomes. With entropy, engineers could assign numbers to unpredictability and compare the information content of different sources.
  • Low-entropy sequences, such as repeated letters, are highly redundant and can be compressed without loss. High-entropy sequences, by contrast, resist compression because every symbol is unpredictable. Shannon also showed that redundancy, when deliberately added through coding, enables error correction—balancing efficiency with reliability in noisy channels.
  • Shannon demonstrated that every channel has a maximum capacity, defined by its bandwidth and noise level, beyond which reliable transmission is impossible. By comparing entropy against channel capacity, engineers could design systems that transmit as much information as possible with minimal error. This principle still governs today’s telephony, Wi-Fi, satellite links, and digital media.
  • In the 1940s, telegraphy, telephony, and radio faced urgent problems of efficiency and reliability. By making uncertainty measurable and showing how entropy governs redundancy, coding, and channel limits, Shannon gave engineers the tools to tame noise, compress signals, and design communication systems that approach theoretical limits of performance.
  • Entropy quickly escaped engineering. In data science, it became the foundation of decision trees and random forests, where splits are chosen to maximize information gain. Mutual information extended this idea, guiding feature selection and quantifying relationships between variables. Entropy also became central to clustering, neural networks, and representation learning.
  • By treating information as measurable uncertainty and entropy as its unit, Shannon gave science and engineering a new common currency. His work solved practical problems of mid-20th-century communication while also laying the groundwork for the information age. Today, bits and entropy remain the backbone of both digital technology and modern artificial intelligence.

FAQ

What did Shannon redefine “information” to mean in this chapter?
He defined information as the reduction of uncertainty. By treating a message as one outcome selected from many possible outcomes with known probabilities, Shannon made information measurable. The average uncertainty of a source is quantified by entropy, measured in bits.

What is entropy, and why is it measured in bits?
Entropy is the average uncertainty of a source—the expected “surprise” before a message arrives. Measuring with base-2 logarithms makes the natural unit the bit, the resolution of one yes/no choice. Examples: a fair coin resolves 1 bit; a biased 90/10 coin conveys about 0.47 bits; a fair die about 2.58 bits; drawing a card from a 52-card deck about 5.7 bits.

Why does Shannon’s theory ignore meaning (semantics)?
Shannon’s framework is about transmission, not interpretation. It treats all messages uniformly and focuses on how uncertain we are before a message and how efficiently a channel can resolve that uncertainty. By stripping away meaning, information becomes a quantifiable commodity in bits, applicable to any message.
What is redundancy and why is it a “double-edged sword”?
Redundancy is predictable, repeated structure. It enables compression (strip predictable parts to save capacity) but also enables error correction (add structured redundancy so receivers can detect and fix errors). AAAAAA is highly compressible (low entropy); a string like W7dX8$ isn’t (high entropy). Shannon showed both compression and coding are governed by entropy.
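
The asymmetry is easy to observe with an off-the-shelf compressor; a quick sketch using Python's zlib (the byte strings are illustrative stand-ins for the examples above):

```python
import os
import zlib

low_entropy = b"A" * 1000        # highly redundant, like AAAAAA
high_entropy = os.urandom(1000)  # near-random bytes, like W7dX8$

print(len(zlib.compress(low_entropy)))   # ~17 bytes: redundancy squeezed out
print(len(zlib.compress(high_entropy)))  # ~1010 bytes: nothing to squeeze
```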
What is channel capacity and how does it relate to entropy?
Channel capacity is the maximum reliable information rate a channel can support, set by bandwidth and noise. Compare source entropy (bits needed) to capacity (bits per second available): if entropy exceeds capacity, you must compress or slow the rate; if you transmit below capacity with proper coding, reliable communication is achievable.
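
For the classic case of a bandlimited channel with Gaussian noise, this ceiling is the Shannon–Hartley capacity C = B log2(1 + S/N). A small sketch (the 3 kHz bandwidth and 30 dB SNR are made-up illustrative values):

```python
from math import log2

def capacity_bps(bandwidth_hz, snr_db):
    """Shannon-Hartley capacity: bits/second for a Gaussian channel."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * log2(1 + snr_linear)

# A telephone-like channel: 3 kHz of bandwidth at 30 dB signal-to-noise.
print(capacity_bps(3_000, 30))   # ~29,902 bits/second, the reliable-rate ceiling
```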
What is the noisy channel coding theorem and why was it surprising?
It states that for any noisy channel, if the transmission rate is below capacity, there exist codes that make error probability arbitrarily small. Noise doesn’t impose an absolute reliability floor—structured redundancy can beat it. This insight underpins practical codes like Hamming, Reed–Solomon, LDPC, and Turbo codes.
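
Shannon's proof concerns far better codes than this, but even the crudest structured redundancy shows errors being driven down rather than hitting a floor. A sketch of a three-fold repetition code over a simulated binary symmetric channel (all parameters illustrative):

```python
import random

def transmit(bit, flip_prob):
    """Binary symmetric channel: flip the bit with probability flip_prob."""
    return bit ^ (random.random() < flip_prob)

def send_with_repetition(bit, flip_prob, copies=3):
    """Repeat the bit over the channel, then decode by majority vote."""
    received = [transmit(bit, flip_prob) for _ in range(copies)]
    return sum(received) > copies // 2

random.seed(0)
trials, p = 100_000, 0.1
raw_errors = sum(transmit(1, p) != 1 for _ in range(trials))
coded_errors = sum(send_with_repetition(1, p) != 1 for _ in range(trials))
print(raw_errors / trials)     # ~0.10 error rate with no coding
print(coded_errors / trials)   # ~0.028 (= 3p^2(1-p) + p^3): already much lower
```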
What does the source coding (compression) theorem say?
It sets a hard limit: the average length of any lossless code cannot be less than the source’s entropy. Predictable (low-entropy) sources compress well; near-random (high-entropy) sources don’t. This explains why ZIP can’t shrink already compressed or random data and why lossy codecs trade redundancy and perceptual irrelevance for efficiency.
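
The bound is easiest to see for a source with power-of-two probabilities, where an optimal prefix code meets the entropy floor exactly. A sketch (the four-symbol source is an illustrative assumption):

```python
from math import log2

# Source: four symbols with dyadic (power-of-two) probabilities.
probs   = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
# An optimal prefix code for this source (Huffman-style lengths 1, 2, 3, 3).
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}

entropy = -sum(p * log2(p) for p in probs.values())
avg_len = sum(probs[s] * lengths[s] for s in probs)

print(entropy)   # 1.75 bits/symbol: the floor for any lossless code
print(avg_len)   # 1.75 bits/symbol: this code achieves the floor
# A naive fixed-length code would spend log2(4) = 2 bits per symbol.
```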
How do decision trees and random forests use entropy?
At each split, trees measure node entropy and choose the feature/threshold that maximizes information gain (entropy reduction). The Gini index is an alternative impurity measure with similar behavior. Feature importance in trees/forests aggregates these entropy (or Gini) reductions: variables that consistently cut uncertainty most rank highest.
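
A hand-rolled sketch of that computation: entropy of the parent node minus the weighted entropy of the children. The tiny label lists are hypothetical:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * log2(k / n) for k in counts.values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left/right children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node: 5 positives, 5 negatives; the split isolates most positives.
parent = ["+"] * 5 + ["-"] * 5
left, right = ["+", "+", "+", "+"], ["+"] + ["-"] * 5
print(information_gain(parent, left, right))  # ~0.61 bits of uncertainty removed
```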
What is mutual information and how is it used beyond communications?
Mutual information measures how much knowing one variable reduces uncertainty about another. It guides feature selection (keep predictors that cut target uncertainty) and evaluates/optimizes clustering (e.g., purity, cluster entropy, normalized mutual information) by asking how much cluster assignments reveal about true categories.
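
Computed from a joint probability table, mutual information is I(X;Y) = Σ p(x,y) log2(p(x,y) / (p(x)p(y))). A NumPy sketch with hypothetical joint distributions:

```python
import numpy as np

def mutual_information(joint):
    """MI in bits from a joint probability table over two discrete variables."""
    px = joint.sum(axis=1, keepdims=True)     # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)     # marginal of Y (columns)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

# Hypothetical joint distributions of a binary feature X and binary target Y.
dependent   = np.array([[0.4, 0.1],
                        [0.1, 0.4]])   # knowing X says a lot about Y
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]]) # knowing X says nothing about Y

print(mutual_information(dependent))    # ~0.278 bits
print(mutual_information(independent))  # 0.0 bits
```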
How does entropy appear in deep learning and representation learning?
Neural networks are trained by minimizing cross-entropy between predicted and true distributions—directly reducing uncertainty in predictions. Representation learning (e.g., autoencoders, embeddings) compresses data by preserving high-entropy, informative structure and discarding low-entropy redundancy, echoing Shannon’s balance between efficiency and fidelity.
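
A minimal sketch of the cross-entropy computation itself (in bits, using base-2 logs; frameworks typically use natural logs, which differ only by a constant factor). The three predicted distributions are hypothetical:

```python
import numpy as np

def cross_entropy(true_dist, predicted):
    """Cross-entropy in bits: -sum(p_true * log2(p_predicted))."""
    return -float(np.sum(true_dist * np.log2(predicted)))

# One-hot target: the true class is the second of three.
target = np.array([0.0, 1.0, 0.0])

confident_right = np.array([0.05, 0.90, 0.05])
uncertain       = np.array([0.30, 0.40, 0.30])
confident_wrong = np.array([0.90, 0.05, 0.05])

print(cross_entropy(target, confident_right))  # ~0.15 bits: low loss
print(cross_entropy(target, uncertain))        # ~1.32 bits
print(cross_entropy(target, confident_wrong))  # ~4.32 bits: heavily penalized
```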
