Overview

6 Scheduling with tabular reinforcement learning

This chapter presents temporal-difference (TD) learning as the practical, load-bearing core of reinforcement learning (RL) for business decision-making. Building on generalized policy iteration, it motivates TD through neuroscience’s prediction-error signals and the machine-learning analogy of loss-driven updates. The chapter explains how TD learns online from experience using a one-step bootstrapped target—reward plus the discounted next estimate—sidestepping explicit models of dynamics and making Bellman’s expectations implicit via sampling. Through a simple restaurant scheduling problem—choosing to seat small parties now or wait for larger parties—it demonstrates that TD converges to the optimal values purely from experience, illustrating why error-driven, incremental updates are both theoretically sound and operationally suited to business settings that rely on rolling feedback rather than full-information labels.
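The one-step bootstrapped update described above can be sketched in a few lines of Python. The two-state chain below is a toy environment of our own, not the book's example:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r + gamma * V[s_next]   # reward plus discounted next estimate
    delta = target - V[s]            # TD error: the "surprise"
    V[s] += alpha * delta
    return delta

# Toy two-state chain: state 0 steps to terminal state 1 with reward 1,
# so the true value of state 0 is 1.0. No model of dynamics is needed.
V = {0: 0.0, 1: 0.0}                 # terminal value stays 0
for _ in range(200):
    td0_update(V, s=0, r=1.0, s_next=1)
```

After a few hundred updates `V[0]` has converged to the true value, purely from sampled experience.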

Moving from evaluation to control, the text shifts to action values and the exploration–exploitation tradeoff, introducing on-policy versus off-policy learning. SARSA (on-policy) updates with the value of the action the agent will actually take next, yielding a cautious, behavior-faithful learner; Q-learning (off-policy) instead targets the best next action, learning the optimal policy regardless of exploratory behavior and often converging faster, though with a tendency toward optimistic estimates. The chapter then extends TD to multi-step credit assignment with TD(λ) and eligibility traces, connecting forward- and backward-views to blend the speed of TD(0) with the accuracy of Monte Carlo. It outlines SARSA(λ) for on-policy credit propagation and Watkins’s Q(λ) to preserve off-policy correctness by cutting traces after non-greedy moves, framing eligibility traces as an efficient, theory-grounded solution to the credit assignment problem.
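The on-policy versus off-policy distinction lives entirely in the update target. A minimal sketch, using a hypothetical two-state table whose action names merely echo the restaurant example:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: the value of the action the agent will actually take next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy target: the best next action, regardless of behavior."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Hypothetical table to show the differing targets.
Q = {(s, a): 0.0 for s in (0, 1) for a in ("seat", "wait")}
Q[(1, "wait")] = 2.0
sarsa_update(Q, 0, "seat", 1.0, 1, "seat")                 # target = 1 + 0.9 * 0
q_learning_update(Q, 0, "wait", 1.0, 1, ["seat", "wait"])  # target = 1 + 0.9 * 2
```

Given identical experience, Q-learning's max-based target pulls its estimate higher, which is the optimism the chapter notes.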

To ground the methods in a realistic setting, the chapter formulates gas station fuel purchasing as a Markov decision process: compact, tabular states capture current inventory, pending supply, and time to the next delivery; actions discretize daily order quantities; rewards encode profit from sales, stockout penalties, holding costs, and ordering costs; and uncertainty enters via stochastic demand and delivery lead times. Implementations of Random, SARSA, Q-learning, and their λ-variants are trained over episodic simulations. The empirical takeaways are managerial as much as technical: both TD control methods learn robust, adaptive policies from data alone; SARSA tends to deliver higher average profit with moderate variance (policy realism), while Q-learning often achieves fewer stockouts at slightly lower profit (greedy safety stock). Eligibility traces offer situational benefits, but in short-horizon, dense-reward problems may add complexity without clear gains. The chapter closes by emphasizing fit-to-purpose algorithm selection—aligning learning behavior with business risk tolerance and operational constraints—while showcasing how tabular TD methods provide a strong, interpretable foundation before scaling to function approximation and deep RL.
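A compact tabular state like the one described can be flattened into a single Q-table index. The bin counts and variable names below are illustrative assumptions, not the book's implementation:

```python
# Illustrative discretization: inventory level, pending supply, and days until
# the next delivery, each binned. These sizes are assumptions for the sketch.
N_INV_BINS, N_PENDING_BINS, N_LEAD_DAYS = 11, 5, 4

def encode_state(inventory_bin, pending_bin, days_to_delivery):
    """Flatten (inventory, pending supply, time to delivery) into one table row."""
    assert 0 <= inventory_bin < N_INV_BINS
    assert 0 <= pending_bin < N_PENDING_BINS
    assert 0 <= days_to_delivery < N_LEAD_DAYS
    return (inventory_bin * N_PENDING_BINS + pending_bin) * N_LEAD_DAYS + days_to_delivery

n_states = N_INV_BINS * N_PENDING_BINS * N_LEAD_DAYS  # 220 rows under these bins
```

Keeping the joint state space this small is what makes every state-action pair visitable often enough for reliable tabular learning.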

Figures

  • Like cooking, we use feedback from error to update parameters step by step.
  • How the update formula moves the old value toward a better one.
  • Convergence behavior of the temporal-difference learning algorithm.
  • Illustration of the problem of alignment within the TD framework.
  • SARSA algorithm flowchart.
  • Q-learning algorithm flowchart.
  • Comparison of TD(0), Monte Carlo, and TD(λ) methods.
  • Update mechanism of temporal-difference learning with eligibility traces.
  • Integrating eligibility traces into the SARSA framework: SARSA(λ).
  • Integrating eligibility traces with Q-learning using Watkins's Q(λ) method.
  • Tabular reinforcement learning curves and performance analysis.

Summary

  • In model-based learning, the agent has access to the full dynamics of the environment—the transition probabilities P(s'|s,a) and reward function R(s,a). Dynamic Programming is a model-based approach.
  • In model-free learning, the agent does not know these dynamics and must learn directly from experience tuples (s, a, r, s') gathered through trial and error. Temporal difference learning, Q-learning, and SARSA are all model-free.
  • In offline learning, the agent is trained on a fixed, pre-collected batch of data and does not learn during interaction.
  • In online learning, the agent updates its knowledge continuously during interaction with the environment, learning step-by-step as new experience arrives. All the temporal difference methods discussed in this chapter are online learners.
  • In on-policy learning, the agent learns to improve the same policy it is using to make decisions. It learns from its own "lived experience." SARSA is on-policy.
  • In off-policy learning, the agent can learn about a different policy (the target policy, typically the optimal one) than the one it is using to generate behavior (the behavior policy, typically an exploratory one). Q-learning is off-policy.
  • Temporal-Difference learning is one of the two pillars of reinforcement learning, enabling agents to learn from raw experience without a model of the environment.
  • The core idea of temporal difference learning is to update your predictions based on the error between your current guess and a slightly better guess one step into the future.
  • This process of updating estimates using other estimates is called bootstrapping, and it allows learning to happen online, after every single step.
  • The TD error (δ) is the learning signal, quantifying the "surprise" or difference between what you expected and what you observed.
  • Temporal difference learning is the bridge between Dynamic Programming (which needs a full model) and Monte Carlo methods (which must wait for an episode to end).
  • For an agent to make decisions in a model-free world, it must learn action values (Q(s,a)), not just state values (V(s)).
  • Tabular methods represent the value function as a lookup table or spreadsheet, with a distinct entry for every state-action pair.
  • Q-learning is the quintessential off-policy algorithm; it learns about the optimal greedy policy while behaving according to a different, more exploratory policy.
  • Because Q-learning uses the max Q-value for its update, it is considered optimistic and often converges faster to the optimal policy.
  • SARSA is the quintessential on-policy algorithm; it learns about the same policy it is using to make decisions, including all its exploratory actions.
  • Because SARSA's updates use the Q-value of the actual next action taken, it is considered more realistic or conservative, learning a policy that is robust in the face of its own exploration.
  • The credit assignment problem asks how to distribute credit for a reward among the many actions that may have led to it, especially when rewards are delayed.
  • Eligibility traces (TD(λ)) are the solution, creating a short-term memory that allows the TD error to be propagated backward to all recently visited state-action pairs.
  • The λ parameter controls how far back credit is assigned, providing a tuneable spectrum between one-step TD (λ=0) and Monte Carlo (λ=1).
  • SARSA(λ) is a highly effective on-policy algorithm that combines the stability of SARSA with the learning speed of eligibility traces.
  • Watkins's Q(λ) adapts traces for off-policy learning by resetting the traces to zero whenever a non-greedy (exploratory) action is taken.
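The last few bullets can be combined into a single update step. This is an illustrative formulation of Watkins's Q(λ) with accumulating traces; texts differ slightly on exactly where the greedy check is applied, so treat it as a sketch rather than the book's code:

```python
def watkins_q_lambda_step(Q, E, s, a, r, s_next, actions,
                          alpha=0.1, gamma=0.9, lam=0.8):
    """One Watkins's Q(lambda) step with accumulating eligibility traces."""
    greedy = Q[(s, a)] == max(Q[(s, b)] for b in actions)   # was the move greedy?
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    E[(s, a)] = E.get((s, a), 0.0) + 1.0                    # accumulate trace on visit
    for sa in list(E):
        Q[sa] += alpha * delta * E[sa]                      # propagate TD error backward
        E[sa] = gamma * lam * E[sa] if greedy else 0.0      # cut traces after exploration
```

Decaying the traces by γλ spreads credit over recent pairs, while zeroing them after a non-greedy move keeps the off-policy target honest.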

FAQ

What is temporal-difference (TD) learning and why is it central to this chapter?
TD learning updates value estimates from incomplete experience by comparing a prediction to a one-step-ahead, bootstrapped target. It enables online, model-free learning and underpins practical control methods like SARSA and Q-learning, making it a load-bearing concept for real-world scheduling problems.

How is TD learning different from the Bellman equation in dynamic programming?
The Bellman equation expresses values as expectations over all actions and transitions using known dynamics. TD learning uses sampled experience: an action is already taken and one next state is observed. Over many samples, TD’s one-sample estimates converge to the same expectation without ever needing explicit transition probabilities or π(a|s).

What is the TD error and how does the value update work?
The TD error δ measures surprise: δ = r + γ V(s′) − V(s) (or with action-values, δ = r + γ Q(s′,a′) − Q(s,a)). The update moves the old estimate toward the target: V(s) ← V(s) + αδ (and similarly for Q). Positive δ raises the estimate; negative δ lowers it.

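Plugging illustrative numbers (ours, not the chapter's) into those formulas:

```python
gamma, alpha = 0.9, 0.1
V_s, V_next, r = 10.0, 12.0, 2.0        # illustrative current estimate, next estimate, reward
delta = r + gamma * V_next - V_s        # 2 + 10.8 - 10 = 2.8, a positive surprise
V_s = V_s + alpha * delta               # estimate rises toward the target: 10.28
```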
What does the restaurant table scheduling example illustrate?
A single premium table can seat a small party now ($50) or wait for a larger one ($120 with 30% chance per hour) over a 4-hour horizon. Analytically, seat immediately with 1 hour left; wait earlier. TD learning, without knowing the probabilities, converges to the same optimal values through repeated experience.

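The analytic answer can be checked by backward induction. This sketch assumes the simple model implied by the example: each hour a large party either arrives (ending the episode at $120) or does not, and seating the small party ends the episode at $50:

```python
SMALL, LARGE, P_LARGE, HORIZON = 50.0, 120.0, 0.3, 4   # values from the example

V = {0: 0.0}            # no hours left: the table earns nothing more
policy = {}
for h in range(1, HORIZON + 1):
    # Waiting: with prob 0.3 a large party arrives this hour; otherwise we
    # face the same decision with one hour less.
    wait = P_LARGE * LARGE + (1 - P_LARGE) * V[h - 1]
    V[h] = max(SMALL, wait)
    policy[h] = "seat" if SMALL >= wait else "wait"
```

With one hour left, waiting is worth only 0.3 × 120 = $36, so seating for $50 wins; with two or more hours left, waiting dominates, matching the analytic policy quoted above.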
What is the difference between on-policy and off-policy learning?
On-policy methods learn about the policy they behave with; the samples and updates align (e.g., SARSA uses Q(s′,a′) for the action it will actually take). Off-policy methods behave under one policy but learn about another (e.g., Q-learning uses max_a Q(s′,a)), targeting the greedy policy regardless of the action taken.

How do SARSA and Q-learning differ in their targets and behavior?
SARSA (on-policy) updates with the value of the action it will truly take next, making it conservative and reflective of actual exploration. Q-learning (off-policy) updates with the maximum next action value, learning the optimal greedy policy faster but with potential overoptimism and maximization bias.

What does “tabular” mean in this context, and when is it suitable?
Tabular methods store one value per state or state–action pair in a lookup table (a Q-table). They’re suitable when the state–action space is small enough to be visited frequently, enabling reliable learning without function approximation.

How does TD(λ) with eligibility traces improve credit assignment?
TD(λ) blends TD(0) and Monte Carlo by propagating credit backward with exponentially decaying eligibility traces. Recently and frequently visited states (or state–action pairs) receive more credit. Accumulating traces add on each visit; replacing traces reset to 1. Watkins’s Q(λ) resets traces after non-greedy actions to preserve off-policy learning.

How is exploration handled in the chapter’s implementations?
Using ε-greedy. With probability ε the agent explores by picking a random action; otherwise it exploits the best-known action. ε decays over episodes to shift from broad exploration early to focused exploitation later.

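An ε-greedy selector with decay can be sketched as follows; the decay rate and floor are illustrative assumptions, since this summary does not give the chapter's exact schedule:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps explore randomly; otherwise exploit the best-known action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Typical multiplicative decay toward a small floor (illustrative rates).
eps, eps_min, decay = 1.0, 0.01, 0.995
for episode in range(1000):
    eps = max(eps_min, eps * decay)
```

The floor `eps_min` keeps a sliver of exploration alive so the agent can still notice if the environment drifts.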
What were the key takeaways from the gas station fuel scheduling case?Tabular control learned profitable, risk-aware policies. SARSA achieved the highest average reward (~$3,886) with moderate variance; Q-learning earned slightly less (~$3,811) but with the fewest stockouts (~2.96), reflecting a higher safety stock. Eligibility-trace variants underperformed here, likely due to short episodes and dense rewards where multi-step credit had limited payoff.
