Overview

1 Introduction to Bayesian statistics: Representing our knowledge and uncertainty with probabilities

Bayesian statistics is introduced as a practical language for reasoning and decision-making under uncertainty. The chapter motivates why probability is indispensable by contrasting a yes/no weather app with probabilistic forecasts and richer distributions over rainfall amounts, showing how random variables (binary, categorical, continuous) let us model the world at different levels of granularity. Probability not only supports personalized, risk-aware decisions but also curbs overconfident mistakes in predictive systems, making it central across data science, healthcare, recommendation engines, and modern AI, including large language models. The book promises an intuitive, visual path that builds from basics without heavy prerequisites.

The Bayesian viewpoint treats probabilities as degrees of belief about unknown quantities, represented with probability distributions whose parameters encode how likely different outcomes are. Through concrete examples (a Bernoulli belief for “rain today” and a categorical belief over rainfall amounts), the chapter shows how to summarize beliefs (for example, via expected value) and compute event probabilities. The core Bayesian workflow is to begin with a prior, observe data, and update to a posterior—explicitly conditioning on evidence to refine beliefs—mirroring everyday reasoning (e.g., revising odds of rain after seeing dark clouds). Crucially, Bayesian models return full distributions, not just point estimates, enabling calibrated decisions that reflect uncertainty.

Contrasted with this belief-centric view, frequentist statistics defines probability as long-run frequency under repeated trials, treating parameters as fixed and uncertainty as arising from data collection. The chapter highlights trade-offs: frequentist methods are often computationally lighter and require fewer modeling choices, while Bayesian methods are flexible, transparent about assumptions, and especially powerful when data are scarce or domain knowledge matters—albeit “subjective” through priors, which can be a feature. As a contemporary showcase, large language models are framed as next-word predictors using conditional probabilities over vocabularies; while not fully Bayesian for computational reasons, their probabilistic generation supports multiple plausible outputs and user feedback for refinement. The chapter closes by outlining the book’s roadmap from foundational concepts to scalable inference, specialized models, and decision-making under uncertainty.

Figure: An illustration of machine learning models without probabilistic reasoning capabilities being susceptible to noise and overconfidently making the wrong predictions.
Figure: An example categorical distribution for rainfall rate.

Summary

  • We need probability to model phenomena in the real world whose outcomes we haven’t observed.
  • With Bayesian probability, we use probability to represent our personal belief about an unknown quantity, which we model using a random variable.
  • From a Bayesian belief about a quantity of interest, we can compute summaries, such as the expected value, that represent our knowledge and uncertainty about that quantity.
  • There are three main components to a Bayesian model: the prior distribution, the data, and the posterior distribution. The last component is the result of combining the first two and what we want out of a Bayesian model.
  • Bayesian probability is useful when we want to incorporate prior knowledge into a model, when data is limited, and for decision-making under uncertainty.
  • A different interpretation of probability, frequentism, views probability as the long-run frequency of an event under infinite repeats, which limits its application to scenarios that can be framed as repeatable trials.
  • Large language models, which power popular chat artificial intelligence systems, apply conditional probabilities, a core ingredient of Bayesian reasoning, to predict the next word in a sentence.

FAQ

Why do we need probability to make decisions under uncertainty?
Because the world is noisy and predictions are never perfectly certain. Probability lets us express degrees of belief about unknown outcomes and make decisions that reflect risk and context. For example, a weather app that says “30% chance of rain” helps different users act differently—someone on a short run may skip an umbrella, while someone outdoors all day might bring one.
What’s the difference between non‑probabilistic and probabilistic predictions?
Non‑probabilistic predictions output a single label (for example, “rain” or “sun”). Probabilistic predictions attach numbers to possibilities (for example, “30% rain”), capturing uncertainty. You can go further and model a whole distribution over amounts (for example, probabilities for 0, 0.01, 0.02 inches/hour), giving finer granularity for decisions.
What is a random variable, and how can it be binary, categorical, or continuous?
A random variable represents an unknown quantity whose value comes from a probabilistic process. Binary variables take two values (for example, rain vs. no rain). Categorical variables take one of several discrete options (for example, binned rainfall rates). Continuous variables range over real numbers (for example, any non‑negative rainfall rate). The choice sets the granularity of your reasoning.
What is a probability distribution, and what does the Bernoulli distribution represent?
A probability distribution assigns likelihoods to the possible values of a random variable, obeying non‑negativity and summing to 1. The Bernoulli distribution models a single two‑outcome trial (success=1, failure=0) with parameter p: Pr(1)=p and Pr(0)=1−p. In the rain example, p could be your believed chance of rain today.
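The Bernoulli probability mass function above can be written in a few lines of code. This is a minimal sketch; the value p = 0.3 is a hypothetical belief, not a number from the chapter.

```python
def bernoulli_pmf(x, p):
    """Pr(X = x) for a Bernoulli random variable with parameter p."""
    if x == 1:
        return p          # Pr(1) = p
    if x == 0:
        return 1 - p      # Pr(0) = 1 - p
    raise ValueError("A Bernoulli variable only takes the values 0 and 1.")

p_rain = 0.3  # hypothetical believed chance of rain today
print(bernoulli_pmf(1, p_rain))  # → 0.3
print(bernoulli_pmf(0, p_rain))
```

Because the two probabilities must sum to 1, a single parameter p fully specifies the belief.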
How does Bayesian updating work (prior, data, posterior)?
Bayesian modeling starts with a prior distribution (initial belief), incorporates data (evidence), and produces a posterior distribution (updated belief), written as Pr(X | D). For example: start with a low rain prior (dry season), observe dark clouds (data), and update your belief upward that it will rain (posterior).
What is expected value and why is it a weighted average?
The expected value summarizes a distribution as the probability‑weighted average of its possible values. If your rainfall belief is 80% at 0, 10% at 0.01, and 10% at 0.02 inches/hour, the expected value is 0×0.8 + 0.01×0.1 + 0.02×0.1 = 0.003. Weighting by probabilities ensures more likely outcomes influence the average more than unlikely ones.
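The worked arithmetic above is just a probability-weighted sum, which can be checked directly:

```python
# Categorical belief over rainfall rates (inches/hour), from the example
values = [0.0, 0.01, 0.02]
probs = [0.8, 0.1, 0.1]

assert abs(sum(probs) - 1.0) < 1e-12  # a valid distribution sums to 1

# Expected value: each value weighted by its probability
expected_value = sum(v * p for v, p in zip(values, probs))
print(expected_value)  # ≈ 0.003 (up to floating-point rounding)
```

Swapping in different probabilities shows the weighting at work: shifting mass toward 0.02 pulls the expected rainfall upward.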
How do Bayesian and frequentist interpretations of probability differ?
Bayesian: probability quantifies belief about unknown quantities; uncertainty comes from incomplete knowledge; models combine prior + data → posterior; priors make results potentially subjective but transparent and powerful with limited data. Frequentist: probability is long‑run frequency under infinite repeats; the parameter is fixed, randomness comes from data; methods are often computationally lighter and widely taught; with abundant data, both approaches often agree.
Why do machine learning models benefit from probabilistic reasoning?
Deterministic predictors can be overconfident, especially with noisy or adversarial inputs. Probabilistic reasoning lets models express calibrated uncertainty, reduce overconfidence in wrong predictions, and support risk‑aware decisions—crucial in high‑stakes settings like healthcare. Also, common classifier outputs (for example, softmax scores) aren’t necessarily true probabilities unless the model is calibrated.
How do large language models (LLMs) connect to Bayesian ideas?
LLMs perform next‑word prediction using conditional probabilities like Pr(next word | context, D). This probabilistic framing supports generating multiple plausible candidates and adapting via feedback. Although LLMs aren’t fully Bayesian (exact posteriors over all words are computationally prohibitive), they use Bayesian‑inspired conditioning and approximations to be efficient in practice.
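Next-word prediction can be sketched at toy scale: a model assigns a score (logit) to each word in its vocabulary, and a softmax converts those scores into the conditional distribution Pr(next word | context). The three-word vocabulary and scores below are made up for illustration.

```python
import math

vocab = ["rain", "sun", "clouds"]
logits = [2.0, 1.0, 0.5]  # hypothetical model scores given some context

# Softmax: exponentiate and normalize so the values form a distribution
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Sampling from this distribution, rather than always taking the single highest-scoring word, is what lets a model produce multiple plausible continuations of the same prompt.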
Who is this book for, and what will I learn?
It’s for data scientists, analysts, and AI practitioners who make decisions under uncertainty. You’ll learn: Part 1—core Bayesian concepts (prior, data, posterior); Part 2—model evaluation, hierarchical/mixture models, Monte Carlo, variational inference; Part 3—specialized models (Kalman filters, dynamic Bayesian networks, Bayesian neural networks); Part 4—Bayesian decision theory for principled, risk‑aware choices.
