Overview

6 Tabular Reinforcement Learning

This chapter positions temporal-difference (TD) learning as the load‑bearing wall of reinforcement learning and pairs it with generalized policy iteration as the field’s core. It traces the path from Markov chains and Bellman’s dynamic programming to model‑free TD methods that learn directly from experience, emphasizing online, incremental updates driven by prediction error. The TD error—the gap between what happened and what was expected—serves as a practical, biologically plausible learning signal that lets agents improve without full knowledge of transition dynamics, aligning closely with how businesses and organisms adapt under uncertainty.

Building on this foundation, the chapter explains how sampling replaces explicit expectations and why action values Q(s,a) become central in model‑free control. It clarifies exploration–exploitation and the bootstrap nature of TD, then contrasts on‑policy and off‑policy learning through SARSA and Q‑learning. SARSA updates from the action the agent actually takes, yielding conservative, risk‑aware behavior; Q‑learning learns toward the greedy target regardless of behavior, often converging faster but with optimistic bias. In tabular settings, knowledge is stored in lookup tables (Q‑tables), making learning transparent and direct. A simple restaurant scheduling example demonstrates TD’s convergence to dynamic‑programming optima purely from experience.

The chapter extends one‑step TD to multi‑step credit assignment with TD(λ) and eligibility traces, showing how λ trades bias for variance and how traces efficiently propagate credit backward in time. It distinguishes accumulating versus replacing traces, and explains on‑policy SARSA(λ) versus Watkins’s off‑policy Q(λ), which resets traces after non‑greedy actions to preserve off‑policy correctness. A realistic gas‑station inventory case study ties everything together: tabular agents learn profitable, robust ordering policies under stochastic demand and delivery delays, revealing practical trade‑offs—e.g., SARSA’s stability and realistic risk handling, Q‑learning’s efficiency and lower stockouts, and situations where added algorithmic complexity (eligibility traces) may not outperform simpler baselines.

Figures

  • Like cooking, we use feedback from error to update parameters step by step.
  • The formula shows how each update moves the old value toward a better one.
  • Convergence behavior of the temporal-difference learning algorithm.
  • Illustration of the problem of aligning within the TD framework.
  • SARSA algorithm flowchart.
  • Q-learning algorithm flowchart.
  • Comparison of TD(0), Monte Carlo, and TD(λ) methods.
  • Update mechanism of temporal-difference learning with eligibility traces.
  • Integrating eligibility traces into the SARSA framework: SARSA(λ).
  • Integrating eligibility traces with Q-learning using Watkins's Q(λ) method.
  • Tabular reinforcement learning curves and performance analysis.

Summary

  • In model-based learning, the agent has access to the full dynamics of the environment—the transition probabilities P(s'|s,a) and reward function R(s,a). Dynamic Programming is a model-based approach.
  • In model-free learning, the agent does not know these dynamics and must learn directly from experience tuples (s, a, r, s') gathered through trial and error. Temporal difference learning, Q-learning, and SARSA are all model-free.
  • In offline learning, the agent is trained on a fixed, pre-collected batch of data and does not learn during interaction.
  • In online learning, the agent updates its knowledge continuously during interaction with the environment, learning step-by-step as new experience arrives. All the temporal difference methods discussed in this chapter are online learners.
  • In on-policy learning, the agent learns to improve the same policy it is using to make decisions. It learns from its own "lived experience." SARSA is on-policy.
  • In off-policy learning, the agent can learn about a different policy (the target policy, typically the optimal one) than the one it is using to generate behavior (the behavior policy, typically an exploratory one). Q-learning is off-policy.
  • Temporal-Difference learning is one of the two pillars of reinforcement learning, enabling agents to learn from raw experience without a model of the environment.
  • The core idea of temporal difference learning is to update your predictions based on the error between your current guess and a slightly better guess one step into the future.
  • This process of updating estimates using other estimates is called bootstrapping, and it allows learning to happen online, after every single step.
  • The TD error (δ) is the learning signal, quantifying the "surprise" or difference between what you expected and what you observed.
  • Temporal difference learning is the bridge between Dynamic Programming (which needs a full model) and Monte Carlo methods (which must wait for an episode to end).
  • For an agent to make decisions in a model-free world, it must learn action values (Q(s,a)), not just state values (V(s)).
  • Tabular methods represent the value function as a lookup table or spreadsheet, with a distinct entry for every state-action pair.
  • Q-learning is the quintessential off-policy algorithm; it learns about the optimal greedy policy while behaving according to a different, more exploratory policy.
  • Because Q-learning uses the max Q-value for its update, it is considered optimistic and often converges faster to the optimal policy.
  • SARSA is the quintessential on-policy algorithm; it learns about the same policy it is using to make decisions, including all its exploratory actions.
  • Because SARSA's updates use the Q-value of the actual next action taken, it is considered more realistic or conservative, learning a policy that is robust in the face of its own exploration.
  • The credit assignment problem asks how to distribute credit for a reward among the many actions that may have led to it, especially when rewards are delayed.
  • Eligibility traces, as used in TD(λ), address this by creating a short-term memory that allows the TD error to be propagated backward to all recently visited state-action pairs.
  • The λ parameter controls how far back credit is assigned, providing a tuneable spectrum between one-step TD (λ=0) and Monte Carlo (λ=1).
  • SARSA(λ) is a highly effective on-policy algorithm that combines the stability of SARSA with the learning speed of eligibility traces.
  • Watkins's Q(λ) adapts traces for off-policy learning by resetting the traces to zero whenever a non-greedy (exploratory) action is taken.
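The ideas in this summary can be tied together in a minimal tabular Q-learning agent. The sketch below uses a toy four-state chain environment of my own devising (the environment, state layout, and hyperparameters are illustrative, not from the chapter): the Q-table is a plain dictionary, behavior is ε-greedy, and the update bootstraps toward the greedy target.

```python
import random

# Toy deterministic chain: states 0..3, actions 0 (left) / 1 (right).
# Reaching state 3 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 4, (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def step(s, a):
    """Illustrative dynamics: move one cell left or right along the chain."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    done = s2 == N_STATES - 1
    return s2, reward, done

# Tabular Q: one entry per (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # off-policy target: greedy value of the next state
        target = r + (0.0 if done else GAMMA * max(Q[(s2, x)] for x in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# The learned greedy policy should prefer moving right in every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
```

Despite behaving exploratorily, the agent's table converges toward the optimal "always move right" policy, which is exactly the on-policy/off-policy separation the summary describes.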

FAQ

What is temporal-difference (TD) learning and why is it central to reinforcement learning?
TD learning updates value estimates using the next prediction as a target rather than waiting for final outcomes. It is model-free, online, and bootstraps from partial experience, making it practical and scalable. This error-driven process underpins many RL algorithms and bridges ideas from dynamic programming, machine learning, and neuroscience.
How does TD learning differ from the Bellman equation used in dynamic programming?
The Bellman equation computes expectations using known transition probabilities and the policy (a full model). TD learning replaces those expectations with single sampled transitions, requiring no explicit model. Over many samples, TD's one-sample updates converge to the same expected values by the law of large numbers.
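The contrast can be written out explicitly: the dynamic-programming backup averages over the known model, while the TD(0) backup substitutes a single sampled transition (notation follows the TD-error formulas used elsewhere on this page).

```latex
% Bellman expectation backup (requires the model P and R):
V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
      \bigl[ R(s, a) + \gamma V(s') \bigr]

% TD(0) backup (one sampled transition s \to s' with reward r):
V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr]
```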
What is the TD error and how is it used to update values?
The TD error measures surprise between what happened and what was predicted. For state values: δ = r + γ V(s′) − V(s). For action values: δ = r + γ Q(s′, a′) − Q(s, a). Updates move the estimate toward the target, e.g., V(s) ← V(s) + αδ, improving predictions incrementally at each step.
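As a sketch, the state-value update above translates directly into code (the function and parameter names here are illustrative, not from the chapter):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) backup: move V[s] toward the bootstrapped target r + gamma * V[s']."""
    target = r if terminal else r + gamma * V[s_next]
    delta = target - V[s]      # TD error: surprise between target and estimate
    V[s] += alpha * delta      # V(s) <- V(s) + alpha * delta
    return delta

V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", r=1.0, s_next="B")  # target = 1.0 + 0.9 * 0.5 = 1.45
# delta is 1.45 and V["A"] moves from 0.0 to 0.145
```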
What does “tabular” mean in reinforcement learning?
Tabular methods store one value per state (V) or per state–action pair (Q) in a lookup table. They are simple, exact for the visited entries, and easy to implement, but they scale poorly as the number of states/actions grows, motivating function approximation and deep RL.
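In Python, a tabular Q-function can be as simple as a dictionary that defaults to zero; the state and action names below are invented for illustration (loosely echoing the inventory example) and are not the chapter's code:

```python
from collections import defaultdict

Q = defaultdict(float)  # one entry per (state, action) pair, default 0.0

Q[("low_stock", "order")] = 2.5   # illustrative learned values
Q[("low_stock", "wait")] = -1.0

def greedy_action(Q, state, actions):
    """Look up each action's entry in the table and pick the best."""
    return max(actions, key=lambda a: Q[(state, a)])

best = greedy_action(Q, "low_stock", ["order", "wait"])  # -> "order"
```

Unvisited pairs simply read as 0.0, which is why the table is exact for visited entries but says nothing useful about states the agent has never seen.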
What’s the difference between on-policy and off-policy learning?
On-policy methods learn about the same policy that generates experience (behavior = target). Off-policy methods learn about a different, typically greedy, target policy while behaving with an exploratory policy. This separation allows broad exploration without sacrificing convergence to optimal behavior.
How do SARSA and Q-learning differ, and when might I prefer each?
SARSA (on-policy) uses the value of the action actually chosen next: target = r + γ Q(s′, a′). It learns conservatively and reflects exploration-induced risk. Q-learning (off-policy) uses the maximum next-action value: target = r + γ max_a′ Q(s′, a′), learning the optimal policy faster but with optimistic assumptions that can be risky or biased under noise. Prefer SARSA when safety and stability under exploration matter; prefer Q-learning for faster convergence to high performance and easy reuse of diverse data.
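The only difference between the two updates is the target, which a minimal side-by-side sketch makes concrete (state and action names are invented for illustration):

```python
GAMMA = 0.9

def sarsa_target(Q, r, s_next, a_next):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return r + GAMMA * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions):
    """Off-policy: bootstrap from the greedy (max-valued) next action."""
    return r + GAMMA * max(Q[(s_next, a)] for a in actions)

Q = {("s1", "explore"): 0.0, ("s1", "safe"): 2.0}
# Suppose the exploratory behavior policy happens to choose "explore" next:
t_sarsa = sarsa_target(Q, r=1.0, s_next="s1", a_next="explore")                   # 1.0
t_qlearn = q_learning_target(Q, r=1.0, s_next="s1", actions=["explore", "safe"])  # 2.8
```

SARSA's target reflects the exploratory choice actually made (value 1.0), while Q-learning optimistically assumes the best action will be taken (value 2.8), which is exactly the conservative-versus-optimistic contrast described above.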
What are eligibility traces and TD(λ), and why use them?
Eligibility traces assign decaying credit to recently visited states or state–action pairs, blending frequency and recency. TD(λ) combines one-step bootstrapping with multi-step returns, trading bias and variance via λ. Accumulating vs replacing traces differ in how repeat visits are counted; Watkins’s Q(λ) resets traces after non-greedy actions to preserve off-policy correctness.
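One step of an on-policy SARSA(λ) update with accumulating traces might look like the sketch below (a replacing trace would set e[(s, a)] = 1.0 instead of incrementing; all names and constants here are illustrative assumptions):

```python
ALPHA, GAMMA, LAMBDA = 0.1, 0.9, 0.8

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next):
    """Broadcast one TD error to every recently visited (state, action) pair."""
    delta = r + GAMMA * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    e[(s, a)] = e.get((s, a), 0.0) + 1.0   # accumulating trace for the current pair
    for key in list(e):
        Q[key] = Q.get(key, 0.0) + ALPHA * delta * e[key]
        e[key] *= GAMMA * LAMBDA           # every trace decays each step
    return delta

Q, e = {}, {}
# Two consecutive steps of a trajectory: the second step's TD error also
# reaches the first pair through its decayed trace.
sarsa_lambda_step(Q, e, "s0", "a0", 0.0, "s1", "a1")
sarsa_lambda_step(Q, e, "s1", "a1", 1.0, "s2", "a2")
```

After the second step, ("s1", "a1") receives the full update while ("s0", "a0") receives the same TD error scaled by its decayed trace γλ = 0.72, which is how credit propagates backward in time.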
How does the restaurant table scheduling example demonstrate TD convergence?
Without using known arrival probabilities, TD learning repeatedly samples evening sequences and updates values from the TD error. The learned value function approaches the analytically optimal values computed by dynamic programming, illustrating that TD can discover optimal returns from experience alone.
How is exploration–exploitation handled in tabular control methods?
Agents typically use ε-greedy policies: with probability ε they explore (random action), and with 1−ε they exploit (greedy action). Decaying ε over episodes shifts behavior from broad exploration early to mostly exploitation later, balancing learning speed, coverage, and stability.
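The ε-greedy rule with decay is only a few lines; the decay schedule and constants below are illustrative choices, not the chapter's settings:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Multiplicative decay: broad exploration early, mostly exploitation later,
# with a floor so the agent never stops exploring entirely.
EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.05, 0.99
epsilon = EPS_START
for episode in range(300):
    epsilon = max(EPS_MIN, epsilon * EPS_DECAY)
```

Keeping a small floor (EPS_MIN) is a common safeguard in non-stationary or stochastic environments, where a purely greedy policy can lock in on a stale estimate.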
What did the gas-station case study reveal about algorithm choice in practice?
All tabular methods improved profits over a random baseline, but trade-offs emerged: SARSA achieved the highest average reward with moderate variance (stable in deployment), Q-learning produced the fewest stockouts (safer service levels), and eligibility-trace variants did not outperform base methods under short, dense-reward episodes. The study highlights aligning algorithm behavior with business goals like stability, service levels, and risk tolerance.
