1 Introduction to Bayesian statistics: Representing our knowledge and uncertainty with probabilities
Bayesian statistics is introduced as a practical language for reasoning and decision-making under uncertainty. Instead of giving single, brittle answers, it represents what we know—and how sure we are—using random variables and probability distributions, then updates those beliefs as new evidence arrives. A weather-forecasting example motivates why probabilities matter: a simple yes/no prediction ignores inevitable error and users’ differing risk tolerances, while probabilistic forecasts enable better choices and inspire more trust. The chapter illustrates how different random variables (binary, categorical, continuous) provide varying levels of granularity, and how summaries like expected value help turn beliefs into actionable insights.
The Bayesian workflow centers on three components: a prior that encodes initial beliefs, data as observed evidence, and a posterior that combines the two to yield updated beliefs via conditional probabilities. This belief-updating mirrors everyday reasoning (e.g., revising rain odds after seeing dark clouds). The chapter contrasts this with frequentist statistics, which treats probability as long-run frequency from repeated trials and typically regards parameters as fixed. Frequentist methods are often computationally lighter and need less diagnostic work, while Bayesian methods are flexible, transparent about assumptions, and can leverage prior knowledge—especially valuable with limited data or high-stakes decisions. Although Bayesian inference is sometimes called “subjective,” the chapter frames that subjectivity as a strength that, together with sufficient data, often converges to robust conclusions.
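To make the prior, data, posterior loop concrete, here is a minimal sketch of a Bayes-rule update in plain Python. The specific numbers (a 20% prior for rain and the chances of seeing dark clouds with and without rain) are invented for illustration; only the structure mirrors the chapter's rain-and-clouds example.

```python
# A minimal Bayes-rule update: posterior is proportional to prior x likelihood.
# All numbers below are invented for illustration.

prior_rain = 0.2                  # Pr(rain) before looking outside
p_clouds_given_rain = 0.9         # Pr(dark clouds | rain)
p_clouds_given_no_rain = 0.3      # Pr(dark clouds | no rain)

# Total probability of seeing dark clouds (the evidence).
p_clouds = (p_clouds_given_rain * prior_rain
            + p_clouds_given_no_rain * (1 - prior_rain))

# Bayes' rule: Pr(rain | clouds) = Pr(clouds | rain) * Pr(rain) / Pr(clouds)
posterior_rain = p_clouds_given_rain * prior_rain / p_clouds
print(f"Pr(rain | dark clouds) = {posterior_rain:.2f}")  # ~0.43, up from 0.2
```

Seeing dark clouds roughly doubles the belief in rain here, which is exactly the everyday belief-updating the chapter describes.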
As a modern showcase, the chapter connects Bayesian thinking to large language models. Next-word prediction is inherently probabilistic: models score many plausible continuations conditioned on context and training data. While LLMs are not fully Bayesian—computing exhaustive posteriors over all possible words is intractable—they approximate by focusing on the most likely candidates, then sample multiple outputs to support user feedback and iterative refinement. This probabilistic machinery mitigates overconfident errors, aligns with real-world ambiguity, and underscores why Bayesian ideas permeate today’s AI systems. The chapter closes by positioning the book as an intuitive, visually guided path to building, interpreting, and applying Bayesian models to real-world problems.
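As a rough sketch of what "scoring many plausible continuations" looks like, the toy code below turns made-up logit scores into a probability distribution with softmax, keeps only the top-k candidates, and samples from them. This is not how any particular LLM is implemented; the words, scores, and value of k are all invented for illustration.

```python
import math
import random

# Toy next-word prediction: convert raw scores (logits) into probabilities
# with softmax, keep the k most likely candidates, and sample.
logits = {"sunny": 2.0, "rainy": 1.5, "cloudy": 1.0, "purple": -3.0}

def softmax(scores):
    exp = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exp.values())
    return {w: e / total for w, e in exp.items()}

probs = softmax(logits)

# Approximate the full posterior over the vocabulary by keeping
# only the k most likely words (the intractability workaround).
k = 3
top_k = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])

words = list(top_k)
weights = list(top_k.values())
print(random.choices(words, weights=weights, k=3))  # e.g. ['sunny', 'rainy', 'sunny']
```

Sampling several outputs instead of returning the single top word is what makes iterative refinement from user feedback possible.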
Figure: An illustration of how machine learning models without probabilistic reasoning capabilities are susceptible to noise and can overconfidently make wrong predictions.
Figure: An example categorical distribution for rainfall rate.
Summary
We need probability to model phenomena in the real world whose outcomes we haven’t observed.
With Bayesian probability, we use probability to represent our personal belief about an unknown quantity, which we model using a random variable.
From a Bayesian belief about a quantity of interest, we can compute summaries, such as the expected value, that represent our knowledge and uncertainty about that quantity (see the sketch after this list).
There are three main components to a Bayesian model: the prior distribution, the data, and the posterior distribution. The posterior results from combining the first two and is what we ultimately want out of a Bayesian model.
Bayesian probability is useful when we want to incorporate prior knowledge into a model, when data is limited, and for decision-making under uncertainty.
A different interpretation of probability, frequentism, views probability as the long-run frequency of an event across infinitely many repeated trials, which limits its use for one-off questions that cannot naturally be framed as repeats.
Large language models, which power popular AI chat systems, apply this kind of probabilistic reasoning (Bayesian in spirit, though not fully Bayesian) to predict the next word in a sentence.
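As a concrete companion to the summary point about computing quantities from a belief, here is a minimal sketch that takes a categorical rainfall distribution (the same made-up numbers as in the FAQ example below) and derives the expected value and the probability of any rain.

```python
# From a belief (a categorical distribution over rainfall rates),
# compute summaries such as the expected value and a tail probability.
# The binned rates and probabilities are invented for illustration.
rainfall = {0.0: 0.8, 0.01: 0.1, 0.02: 0.1}   # inches/hour -> probability

expected = sum(rate * p for rate, p in rainfall.items())
p_any_rain = sum(p for rate, p in rainfall.items() if rate > 0)

print(f"expected rainfall: {expected:.3f} in/hr")  # 0.003
print(f"Pr(any rain):      {p_any_rain:.1f}")      # 0.2
```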
FAQ
What is Bayesian statistics and why should I use it?
Bayesian statistics is a framework for reasoning under uncertainty. It represents what you currently believe about an unknown quantity with a probability distribution, then updates that belief when new evidence arrives. This lets you make predictions and decisions that reflect both your data and your uncertainty, which is useful in everyday choices (like bringing an umbrella) and in domains like medical diagnostics, recommendation systems, and AI.

Why are probabilistic predictions better than simple yes/no answers?
Yes/no outputs hide uncertainty. Probabilistic predictions expose how confident we should be, enabling decisions that match different risk tolerances. A weather app that says "30% chance of rain" lets a short-distance runner skip an umbrella, while someone outdoors all day might bring one. More granular probabilities (for specific rainfall amounts) carry even more useful information.

What is a random variable, and what types are common in practice?
A random variable represents an unknown outcome in a probabilistic experiment. Common types include binary (two outcomes, like rain vs. no rain), categorical (one of several discrete categories, such as binned rainfall amounts), and continuous (any value in a range, like the exact rainfall rate). Choosing between them balances granularity and practicality.

What is a probability distribution and what do its parameters mean?
A probability distribution assigns likelihoods to the possible values of a random variable, subject to the axioms that probabilities are nonnegative and sum to 1. Distributions are governed by parameters that control their behavior. For example, a Bernoulli distribution has parameter p: the probability of "success" (e.g., rain = 1) is p, and "failure" (no rain = 0) is 1 − p.

How do priors, data, and posteriors work in a Bayesian model?
The prior encodes your initial belief about an unknown quantity before seeing data. The data are the evidence you observe. The posterior is your updated belief after combining prior and data, expressed as a conditional probability (often written as Pr(X | D)). For example, you might start with a low prior for rain, observe dark clouds (data), and update to a higher posterior probability of rain.

How does Bayesian probability differ from frequentist probability?
Bayesian probability quantifies a degree of belief about unknowns and updates it with evidence. Frequentist probability defines probability as a long-run frequency across repeated trials. A Bayesian can assign a probability to "rain today" based on current evidence; a frequentist reframes the question in terms of frequencies over many similar days. With abundant data, both approaches can yield similar answers, though via different interpretations.

When should I use Bayesian methods versus frequentist methods?
Use Bayesian methods when you want to incorporate domain knowledge as priors, when data are limited, or when you need customized decision analysis. Use frequentist methods when data are abundant, when prior knowledge isn't needed, or when established tools like classical hypothesis tests and A/B tests suit the objective.

What is the expected value and what does it tell me?
The expected value is the probability-weighted average of a random variable's possible values. It summarizes the variable's central tendency but is not necessarily the value you'll observe. Example: if rainfall is 0 inches per hour with probability 0.8, 0.01 with probability 0.1, and 0.02 with probability 0.1, the expected rainfall is 0 × 0.8 + 0.01 × 0.1 + 0.02 × 0.1 = 0.003 inches per hour.

Why do machine learning models benefit from probabilistic reasoning?
Deterministic models can be overconfident and brittle to noise (e.g., small input perturbations causing confidently wrong predictions). Probabilistic reasoning helps quantify uncertainty, reducing overconfidence, which is crucial in high-stakes settings like healthcare. Note that classifier scores normalized by softmax are convenient but aren't guaranteed to be well-calibrated probabilities.

How do Bayesian ideas show up in large language models (LLMs), and why aren't LLMs fully Bayesian?
LLMs are next-word predictors that model conditional probabilities of words given context and training data. This is "Bayesian in spirit" because it reasons with uncertainty over many plausible next words. Fully computing a posterior over the entire vocabulary is too expensive, so LLMs approximate by focusing on the most likely candidates and sampling. Keeping multiple likely candidates also enables user feedback to refine the model after training.
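As a small sketch connecting the Bernoulli distribution and the frequentist reading of probability discussed above, the simulation below draws many Bernoulli(p) outcomes and checks that the long-run frequency of 1s approaches p. The value p = 0.3 and the number of draws are arbitrary choices for illustration.

```python
import random

# A Bernoulli(p) variable from the FAQ: 1 ("rain") with probability p,
# 0 ("no rain") with probability 1 - p. Simulating many draws shows the
# frequentist reading: the long-run frequency of 1s approaches p.
p = 0.3
random.seed(0)
draws = [1 if random.random() < p else 0 for _ in range(100_000)]
print(sum(draws) / len(draws))  # ~0.3
```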