Overview

6 Scheduling with tabular reinforcement learning

This chapter presents temporal-difference (TD) learning as the practical, load-bearing core of reinforcement learning (RL) for business decision-making. Building on generalized policy iteration, it motivates TD through neuroscience’s prediction-error signals and the machine-learning analogy of loss-driven updates. The chapter explains how TD learns online from experience using a one-step bootstrapped target—reward plus the discounted next estimate—sidestepping explicit models of dynamics and making Bellman’s expectations implicit via sampling. Through a simple restaurant scheduling problem—choosing to seat small parties now or wait for larger parties—it demonstrates that TD converges to the optimal values purely from experience, illustrating why error-driven, incremental updates are both theoretically sound and operationally suited to business settings that rely on rolling feedback rather than full-information labels.
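The one-step bootstrapped update described above can be sketched in a few lines of Python. The two-state chain below is a toy environment of our own, not the book's example:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r + gamma * V[s_next]   # reward plus discounted next estimate
    delta = target - V[s]            # TD error: the "surprise"
    V[s] += alpha * delta
    return delta

# Toy two-state chain: state 0 steps to terminal state 1 with reward 1,
# so the true value of state 0 is 1.0. No model of dynamics is needed.
V = {0: 0.0, 1: 0.0}                 # terminal value stays 0
for _ in range(200):
    td0_update(V, s=0, r=1.0, s_next=1)
```

After a few hundred updates `V[0]` has converged to the true value, purely from sampled experience.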

Moving from evaluation to control, the text shifts to action values and the exploration–exploitation tradeoff, introducing on-policy versus off-policy learning. SARSA (on-policy) updates with the value of the action the agent will actually take next, yielding a cautious, behavior-faithful learner; Q-learning (off-policy) instead targets the best next action, learning the optimal policy regardless of exploratory behavior and often converging faster, though with a tendency toward optimistic estimates. The chapter then extends TD to multi-step credit assignment with TD(λ) and eligibility traces, connecting forward- and backward-views to blend the speed of TD(0) with the accuracy of Monte Carlo. It outlines SARSA(λ) for on-policy credit propagation and Watkins’s Q(λ) to preserve off-policy correctness by cutting traces after non-greedy moves, framing eligibility traces as an efficient, theory-grounded solution to the credit assignment problem.
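The on-policy versus off-policy distinction lives entirely in the update target. A minimal sketch, using a hypothetical two-state table whose action names merely echo the restaurant example:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: the value of the action the agent will actually take next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy target: the best next action, regardless of behavior."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Hypothetical table to show the differing targets.
Q = {(s, a): 0.0 for s in (0, 1) for a in ("seat", "wait")}
Q[(1, "wait")] = 2.0
sarsa_update(Q, 0, "seat", 1.0, 1, "seat")                 # target = 1 + 0.9 * 0
q_learning_update(Q, 0, "wait", 1.0, 1, ["seat", "wait"])  # target = 1 + 0.9 * 2
```

Given identical experience, Q-learning's max-based target pulls its estimate higher, which is the optimism the chapter notes.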

To ground the methods in a realistic setting, the chapter formulates gas station fuel purchasing as a Markov decision process: compact, tabular states capture current inventory, pending supply, and time to the next delivery; actions discretize daily order quantities; rewards encode profit from sales, stockout penalties, holding costs, and ordering costs; and uncertainty enters via stochastic demand and delivery lead times. Implementations of Random, SARSA, Q-learning, and their λ-variants are trained over episodic simulations. The empirical takeaways are managerial as much as technical: both TD control methods learn robust, adaptive policies from data alone; SARSA tends to deliver higher average profit with moderate variance (policy realism), while Q-learning often achieves fewer stockouts at slightly lower profit (greedy safety stock). Eligibility traces offer situational benefits, but in short-horizon, dense-reward problems may add complexity without clear gains. The chapter closes by emphasizing fit-to-purpose algorithm selection—aligning learning behavior with business risk tolerance and operational constraints—while showcasing how tabular TD methods provide a strong, interpretable foundation before scaling to function approximation and deep RL.
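A compact tabular state like the one described can be flattened into a single Q-table index. The bin counts and variable names below are illustrative assumptions, not the book's implementation:

```python
# Illustrative discretization: inventory level, pending supply, and days until
# the next delivery, each binned. These sizes are assumptions for the sketch.
N_INV_BINS, N_PENDING_BINS, N_LEAD_DAYS = 11, 5, 4

def encode_state(inventory_bin, pending_bin, days_to_delivery):
    """Flatten (inventory, pending supply, time to delivery) into one table row."""
    assert 0 <= inventory_bin < N_INV_BINS
    assert 0 <= pending_bin < N_PENDING_BINS
    assert 0 <= days_to_delivery < N_LEAD_DAYS
    return (inventory_bin * N_PENDING_BINS + pending_bin) * N_LEAD_DAYS + days_to_delivery

n_states = N_INV_BINS * N_PENDING_BINS * N_LEAD_DAYS  # 220 rows under these bins
```

Keeping the joint state space this small is what makes every state-action pair visitable often enough for reliable tabular learning.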

Figures

  • Like cooking, we use feedback from error to update parameters step by step.
  • How the update formula moves the old value toward a better one.
  • Convergence behavior of the temporal-difference learning algorithm.
  • Illustration of the problem of alignment within the TD framework.
  • SARSA algorithm flowchart.
  • Q-learning algorithm flowchart.
  • Comparison of TD(0), Monte Carlo, and TD(λ) methods.
  • Update mechanism of temporal-difference learning with eligibility traces.
  • Integrating eligibility traces into the SARSA framework: SARSA(λ).
  • Integrating eligibility traces with Q-learning using Watkins's Q(λ) method.
  • Tabular reinforcement learning curves and performance analysis.

Summary

  • In model-based learning, the agent has access to the full dynamics of the environment—the transition probabilities P(s'|s,a) and reward function R(s,a). Dynamic Programming is a model-based approach.
  • In model-free learning, the agent does not know these dynamics and must learn directly from experience tuples (s, a, r, s') gathered through trial and error. Temporal difference learning, Q-learning, and SARSA are all model-free.
  • In offline learning, the agent is trained on a fixed, pre-collected batch of data and does not learn during interaction.
  • In online learning, the agent updates its knowledge continuously during interaction with the environment, learning step-by-step as new experience arrives. All the temporal difference methods discussed in this chapter are online learners.
  • In on-policy learning, the agent learns to improve the same policy it is using to make decisions. It learns from its own "lived experience." SARSA is on-policy.
  • In off-policy learning, the agent can learn about a different policy (the target policy, typically the optimal one) than the one it is using to generate behavior (the behavior policy, typically an exploratory one). Q-learning is off-policy.
  • Temporal-Difference learning is one of the two pillars of reinforcement learning, enabling agents to learn from raw experience without a model of the environment.
  • The core idea of temporal difference learning is to update your predictions based on the error between your current guess and a slightly better guess one step into the future.
  • This process of updating estimates using other estimates is called bootstrapping, and it allows learning to happen online, after every single step.
  • The TD error (δ) is the learning signal, quantifying the "surprise" or difference between what you expected and what you observed.
  • Temporal difference learning is the bridge between Dynamic Programming (which needs a full model) and Monte Carlo methods (which must wait for an episode to end).
  • For an agent to make decisions in a model-free world, it must learn action values (Q(s,a)), not just state values (V(s)).
  • Tabular methods represent the value function as a lookup table or spreadsheet, with a distinct entry for every state-action pair.
  • Q-learning is the quintessential off-policy algorithm; it learns about the optimal greedy policy while behaving according to a different, more exploratory policy.
  • Because Q-learning uses the max Q-value for its update, it is considered optimistic and often converges faster to the optimal policy.
  • SARSA is the quintessential on-policy algorithm; it learns about the same policy it is using to make decisions, including all its exploratory actions.
  • Because SARSA's updates use the Q-value of the actual next action taken, it is considered more realistic or conservative, learning a policy that is robust in the face of its own exploration.
  • The credit assignment problem asks how to distribute credit for a reward among the many actions that may have led to it, especially when rewards are delayed.
  • Eligibility traces (TD(λ)) are the solution, creating a short-term memory that allows the TD error to be propagated backward to all recently visited state-action pairs.
  • The λ parameter controls how far back credit is assigned, providing a tuneable spectrum between one-step TD (λ=0) and Monte Carlo (λ=1).
  • SARSA(λ) is a highly effective on-policy algorithm that combines the stability of SARSA with the learning speed of eligibility traces.
  • Watkins's Q(λ) adapts traces for off-policy learning by resetting the traces to zero whenever a non-greedy (exploratory) action is taken.
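The last few bullets can be combined into a single update step. This is an illustrative formulation of Watkins's Q(λ) with accumulating traces; texts differ slightly on exactly where the greedy check is applied, so treat it as a sketch rather than the book's code:

```python
def watkins_q_lambda_step(Q, E, s, a, r, s_next, actions,
                          alpha=0.1, gamma=0.9, lam=0.8):
    """One Watkins's Q(lambda) step with accumulating eligibility traces."""
    greedy = Q[(s, a)] == max(Q[(s, b)] for b in actions)   # was the move greedy?
    delta = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    E[(s, a)] = E.get((s, a), 0.0) + 1.0                    # accumulate trace on visit
    for sa in list(E):
        Q[sa] += alpha * delta * E[sa]                      # propagate TD error backward
        E[sa] = gamma * lam * E[sa] if greedy else 0.0      # cut traces after exploration
```

Decaying the traces by γλ spreads credit over recent pairs, while zeroing them after a non-greedy move keeps the off-policy target honest.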

FAQ

What is temporal-difference (TD) learning and why is it central to this chapter?
TD learning updates value estimates from incomplete experience by comparing a prediction to a one-step-ahead, bootstrapped target. It enables online, model-free learning and underpins practical control methods like SARSA and Q-learning, making it a load-bearing concept for real-world scheduling problems.

How is TD learning different from the Bellman equation in dynamic programming?
The Bellman equation expresses values as expectations over all actions and transitions using known dynamics. TD learning uses sampled experience: an action is already taken and one next state is observed. Over many samples, TD’s one-sample estimates converge to the same expectation without ever needing explicit transition probabilities or π(a|s).

What is the TD error and how does the value update work?
The TD error δ measures surprise: δ = r + γ V(s′) − V(s) (or with action-values, δ = r + γ Q(s′,a′) − Q(s,a)). The update moves the old estimate toward the target: V(s) ← V(s) + αδ (and similarly for Q). Positive δ raises the estimate; negative δ lowers it.

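Plugging illustrative numbers (ours, not the chapter's) into those formulas:

```python
gamma, alpha = 0.9, 0.1
V_s, V_next, r = 10.0, 12.0, 2.0        # illustrative current estimate, next estimate, reward
delta = r + gamma * V_next - V_s        # 2 + 10.8 - 10 = 2.8, a positive surprise
V_s = V_s + alpha * delta               # estimate rises toward the target: 10.28
```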
What does the restaurant table scheduling example illustrate?
A single premium table can seat a small party now ($50) or wait for a larger one ($120 with 30% chance per hour) over a 4-hour horizon. Analytically, seat immediately with 1 hour left; wait earlier. TD learning, without knowing the probabilities, converges to the same optimal values through repeated experience.

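The analytic answer can be checked by backward induction. This sketch assumes the simple model implied by the example: each hour a large party either arrives (ending the episode at $120) or does not, and seating the small party ends the episode at $50:

```python
SMALL, LARGE, P_LARGE, HORIZON = 50.0, 120.0, 0.3, 4   # values from the example

V = {0: 0.0}            # no hours left: the table earns nothing more
policy = {}
for h in range(1, HORIZON + 1):
    # Waiting: with prob 0.3 a large party arrives this hour; otherwise we
    # face the same decision with one hour less.
    wait = P_LARGE * LARGE + (1 - P_LARGE) * V[h - 1]
    V[h] = max(SMALL, wait)
    policy[h] = "seat" if SMALL >= wait else "wait"
```

With one hour left, waiting is worth only 0.3 × 120 = $36, so seating for $50 wins; with two or more hours left, waiting dominates, matching the analytic policy quoted above.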
What is the difference between on-policy and off-policy learning?
On-policy methods learn about the policy they behave with; the samples and updates align (e.g., SARSA uses Q(s′,a′) for the action it will actually take). Off-policy methods behave under one policy but learn about another (e.g., Q-learning uses max_a Q(s′,a)), targeting the greedy policy regardless of the action taken.

How do SARSA and Q-learning differ in their targets and behavior?
SARSA (on-policy) updates with the value of the action it will truly take next, making it conservative and reflective of actual exploration. Q-learning (off-policy) updates with the maximum next action value, learning the optimal greedy policy faster but with potential overoptimism and maximization bias.

What does “tabular” mean in this context, and when is it suitable?
Tabular methods store one value per state or state–action pair in a lookup table (a Q-table). They’re suitable when the state–action space is small enough to be visited frequently, enabling reliable learning without function approximation.

How does TD(λ) with eligibility traces improve credit assignment?
TD(λ) blends TD(0) and Monte Carlo by propagating credit backward with exponentially decaying eligibility traces. Recently and frequently visited states (or state–action pairs) receive more credit. Accumulating traces add on each visit; replacing traces reset to 1. Watkins’s Q(λ) resets traces after non-greedy actions to preserve off-policy learning.

How is exploration handled in the chapter’s implementations?
Using ε-greedy. With probability ε the agent explores by picking a random action; otherwise it exploits the best-known action. ε decays over episodes to shift from broad exploration early to focused exploitation later.

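An ε-greedy selector with decay can be sketched as follows; the decay rate and floor are illustrative assumptions, since this summary does not give the chapter's exact schedule:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps explore randomly; otherwise exploit the best-known action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Typical multiplicative decay toward a small floor (illustrative rates).
eps, eps_min, decay = 1.0, 0.01, 0.995
for episode in range(1000):
    eps = max(eps_min, eps * decay)
```

The floor `eps_min` keeps a sliver of exploration alive so the agent can still notice if the environment drifts.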
What were the key takeaways from the gas station fuel scheduling case?Tabular control learned profitable, risk-aware policies. SARSA achieved the highest average reward (~$3,886) with moderate variance; Q-learning earned slightly less (~$3,811) but with the fewest stockouts (~2.96), reflecting a higher safety stock. Eligibility-trace variants underperformed here, likely due to short episodes and dense rewards where multi-step credit had limited payoff.
