Overview

6 Reinforcement Learning

This chapter surveys how reinforcement learning is used to align language models with human preferences in RLHF. It starts from the practical training loop: the current policy generates completions, a reward model scores them, and a KL penalty to a frozen reference constrains drift while the optimizer updates the policy. While deep RL spans value-based and policy-gradient families (often combined in actor–critic setups), the algorithms that popularized RLHF are policy-gradient methods run on on-policy data. Despite the mathematical sophistication, the authors stress that outcomes hinge heavily on data quality and reward design.

The chapter builds the policy-gradient objective from first principles (returns, the log-derivative trick, value functions) and shows why baselines and the advantage A(s,a) reduce variance without introducing bias. It then instantiates the theory in widely used algorithms. REINFORCE applies Monte Carlo estimates and simple baselines; RLOO lowers variance by using a leave-one-out per-prompt baseline. PPO adds importance sampling to reuse batches and a clipped surrogate to bound step size, typically with token-level credit assignment and a learned value function that enables advantage estimation via GAE. GRPO removes the value network and computes group-relative sequence-level advantages (with common variants that adjust normalization), while PPO-like clipping is retained and the KL term is often added directly to the loss. For long sequences and large models, the chapter motivates alternatives to token-level importance ratios: GSPO moves ratios to the sequence level (a length-normalized geometric mean), and CISPO clips the IS weights directly, with stop-gradient, so every token still receives signal. Each choice reflects a bias–variance and stability trade-off when rewards are sequence-level.
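The baseline idea at the core of this build-up can be sketched numerically. The snippet below (a toy sketch, assuming the simple batch-mean baseline that one common REINFORCE variant uses) shows how per-sample advantages are formed before weighting the log-probabilities:

```python
import numpy as np

def reinforce_advantages(rewards):
    """Batch-mean baseline: A_i = r_i - mean(r).

    Subtracting a baseline that does not depend on the sampled action
    leaves the policy gradient unbiased but lowers its variance.
    """
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def reinforce_loss(seq_logprobs, rewards):
    # Gradient ascent on E[A * log pi]: minimize the negated mean,
    # so higher-than-average completions get their log-prob pushed up.
    adv = reinforce_advantages(rewards)
    return -(adv * np.asarray(seq_logprobs, dtype=float)).mean()
```

By construction the batch-mean advantages sum to zero, so roughly half the completions in a batch are pushed up and half down.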

Turning to practice, the authors detail choices that strongly affect stability and throughput: whether KL is folded into the reward or added to the loss, how to aggregate per-token losses (per-sequence, per-token, or fixed-length normalization), value-network initialization and targets, and reward/advantage normalization or whitening. They contrast bandit-style (sequence-level) and MDP-style (token-level) framings, note how per-token KL can coexist with sequence-level advantages, and discuss asynchronous actor–learner systems that keep GPUs busy while tolerating slight off-policy drift (plus techniques like sequence packing). They also explain “double regularization”: in typical LLM setups with one gradient step per batch, PPO clipping often does little and the KL penalty dominates; with one-step updates, PPO/GRPO reduce to near-vanilla policy gradient. Finally, the chapter highlights reasoning-oriented refinements, such as asymmetric clipping, dynamic sampling, and value-aware variants (e.g., DAPO, VAPO), and emphasizes that most methods share the same backbone; careful choices of advantage estimation, importance-sampling granularity, loss aggregation, and systems design usually determine stability, sample efficiency, and final quality.
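The loss-aggregation choice is easy to see on toy numbers. Below is a minimal sketch (the `aggregate_loss` helper and its mode names are illustrative, not from the book's code) of the three conventions named above, applied to variable-length sequences:

```python
import numpy as np

def aggregate_loss(token_losses, mode, max_len=None):
    """token_losses: list of 1-D arrays, one per sequence (variable length)."""
    if mode == "per_sequence":
        # Mean within each sequence, then mean across sequences:
        # every sequence contributes equally, regardless of length.
        return float(np.mean([t.mean() for t in token_losses]))
    if mode == "per_token":
        # Mean over all tokens pooled together:
        # longer sequences contribute more tokens, hence more weight.
        return float(np.concatenate(token_losses).mean())
    if mode == "fixed_length":
        # Sum per sequence, divide by a constant max length:
        # short sequences are implicitly down-weighted.
        return float(np.mean([t.sum() / max_len for t in token_losses]))
    raise ValueError(f"unknown mode: {mode}")
```

With one 2-token sequence of per-token loss 1.0 and one 1-token sequence of loss 4.0, the three modes give 2.5, 2.0, and (with `max_len=4`) 0.75: the same batch, three different gradients.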

Overview of the RLHF training loop. A prompt from the dataset is passed to the tuned policy, which generates a completion. The reward model scores this completion, while the frozen initial model computes log probabilities on the same text to calculate a KL penalty that prevents excessive drift. The combined reward signal then drives a reinforcement learning update to the policy parameters.
Basic REINFORCE architecture for language models. The shaped reward combines the reward model score with a KL penalty from the reference model. We build on this structure throughout the chapter.
REINFORCE Leave-One-Out (RLOO) architecture. Multiple completions per prompt provide a leave-one-out baseline for advantage estimation without learning a value function.
PPO architecture. A learned value function enables Generalized Advantage Estimation (GAE) for per-token advantages, used with a clipped surrogate objective.
Visualization of the different regions of the PPO objective for a hypothetical advantage. The “trust region” is the region where the probability ratio lies within \(1\pm\varepsilon\).
Value function training uses on-policy rollouts to compute targets. The model predicts \(V_t\) at each token, which is trained via MSE against the target return \(\hat{V}_t\). The advantage \(A_t = \hat{V}_t - V_t\) then weights the policy gradient update.
GRPO architecture. Advantages are normalized relative to the group mean and standard deviation. The KL penalty is applied directly in the loss rather than shaping the reward.
A comparison of the generation and update phases for synchronous versus asynchronous RL training, following Noukhovitch et al. 2024.
An example distributed RL system, where two queues are managed to pass data to the learner and actor GPUs, which can both be synchronized with a distributed computing library such as Ray. Olmo Team 2025, license CC-BY.

Summary

  • The RL stage of RLHF closes the loop: the policy generates completions, the reward model scores them, and policy gradient algorithms update the model to produce higher-reward outputs. This is where the preference signal captured by the reward model is translated into model behavior.
  • Policy gradient algorithms derive a gradient of expected reward with respect to model parameters, enabling direct optimization of the policy. Practical algorithms then reduce the variance of this gradient estimate through baselines – reference values that measure how good an action is relative to what’s typical – and advantage functions.
  • The chapter covers several algorithms, each with different trade-offs and focuses:
    • REINFORCE / RLOO: The simplest approaches – REINFORCE uses Monte Carlo return estimates with a batch-level baseline, while RLOO uses a leave-one-out baseline across multiple completions per prompt. Neither requires a learned value function.
    • PPO: Uses importance sampling with clipped surrogate objectives and a learned value function for per-token advantage estimation via GAE. Dominated early RLHF but requires more memory and implementation complexity.
    • GRPO: Replaces the learned value function of PPO with group-relative advantage normalization across multiple completions per prompt. Popular for reasoning tasks after DeepSeek R1’s release.
    • GSPO / CISPO: Address numerical instability in per-token importance sampling ratios, particularly for large MoE models. GSPO computes a single importance ratio per sequence rather than per token, while CISPO clips the importance weights themselves rather than the objective, ensuring every token still receives a gradient signal.
  • Implementation details such as loss aggregation (per-token vs. per-sequence), clipping strategies, value network initialization, and handling truncated generations all affect stability and final model quality.
  • Asynchronous training – where generation and gradient updates run on separate systems that may hold different model weights – is crucial to modern RL infrastructure for LLMs and introduces distribution mismatch that importance sampling (and other techniques) must correct for.
  • Generalized Advantage Estimation (GAE) reduces variance in PPO’s policy gradient updates by blending multi-step temporal difference (TD) residuals, though newer value-function-free methods like GRPO and RLOO are making it less central to modern LLMs.
  • The same policy gradient algorithms covered here form the backbone of reasoning training (chapter 7), where they are applied at larger scale with verifiable rewards rather than learned reward models.
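The group-relative advantage that distinguishes GRPO in the list above can be sketched in a few lines (assuming the common mean/std-normalized variant; as noted, some variants adjust or drop the normalization):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # Normalize each completion's reward against its own prompt group:
    # A_i = (r_i - mean(group)) / (std(group) + eps).
    # No learned value function is needed -- the group is the baseline.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Because the statistics are computed per prompt group, a reward of 2.0 can yield a positive advantage for one prompt and a negative one for another, which is exactly the point: completions compete only against their peers.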

FAQ

How does the RLHF training loop use reinforcement learning?

The policy (the fine-tuned model) samples responses to prompts, a reward model scores each response, and a KL penalty to a frozen reference model discourages drift. These scores are combined into a shaped reward. A policy-gradient optimizer then updates the policy on this on-policy data, iterating over many batches/epochs.
Why are policy-gradient methods favored over value-based methods for language models?

Policy-gradient methods (e.g., REINFORCE, PPO, GRPO) optimize a parameterized policy directly using recent on-policy samples, matching the LM generation workflow. Actor-critic variants reduce variance by using value estimates. Value-based, replay-heavy methods (like DQN) are less natural for sequence generation and large action spaces and are rarely used in RLHF for LMs.
What is an advantage and why do baselines matter in policy gradients?

The advantage measures how much better an action is than the average at a state. Subtracting a baseline (e.g., a value estimate, batch mean, or a leave-one-out mean) from returns reduces gradient variance without changing the expected gradient. In practice, advantage estimates stabilize learning, especially with sparse or noisy rewards and stochastic generation.
How do REINFORCE and RLOO differ?

REINFORCE estimates the gradient from sampled returns, often with a baseline to reduce variance. RLOO (REINFORCE Leave-One-Out) sets the baseline for a completion to the average reward of the other completions from the same prompt, so the advantage becomes “this sample vs. its peers.” RLOO needs multiple completions per prompt and achieves low-variance, sequence-level credit assignment without a value network.
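The leave-one-out baseline is a one-liner in vectorized form; a minimal sketch (for k completions of a single prompt) is:

```python
import numpy as np

def rloo_advantages(rewards):
    # Baseline for completion i = mean reward of the OTHER k-1
    # completions from the same prompt (leave-one-out), so each
    # sample is scored against its peers without a value network.
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    loo_mean = (r.sum() - r) / (k - 1)
    return r - loo_mean
```

Note that this equals the batch-mean advantage scaled by k/(k-1), so for large groups the two baselines nearly coincide, but the leave-one-out form keeps each sample's baseline independent of its own reward.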
What problem does PPO solve and how does clipping help?

PPO reuses data collected under an older policy via importance ratios between the current and old policies. Large ratios can cause destructive updates, so PPO clips the ratio within a trust region to prevent overly big steps. For LMs, PPO is computed per token and typically pairs with a learned value function to compute per-token advantages. Within the clip range it behaves like standard policy gradients; outside, gradients are limited. With only one gradient step per batch, clipping often has no effect.
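The clipped surrogate itself is short; here is a numpy sketch of the standard pessimistic-minimum form (per-token, with a default epsilon of 0.2 as an illustrative choice):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    # Importance ratio between current and data-collecting policy.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic minimum: caps the payoff from pushing the ratio
    # beyond [1-eps, 1+eps] in the direction the advantage favors.
    return np.minimum(unclipped, clipped)
```

For a token with ratio 1.5 and positive advantage the objective is capped at 1.2; for ratio 0.5 and negative advantage the minimum keeps the more pessimistic clipped value, so in both cases the incentive to move further outside the trust region vanishes.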
What is the role of the value function and GAE in PPO?

The value function serves as a learned baseline that predicts the future return from each token, enabling token-level credit assignment. Generalized Advantage Estimation (GAE) blends multi-step advantages to balance bias and variance via a lambda parameter. Together, they significantly stabilize and improve learning compared with raw Monte Carlo returns.
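The GAE recursion can be sketched directly from the TD residuals (a minimal backward-pass sketch, assuming the episode terminates after the last token):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);
    # A_t = sum_{l>=0} (gamma * lam)^l * delta_{t+l}, computed backward.
    T = len(rewards)
    adv = np.zeros(T)
    next_value = 0.0  # V after the terminal token is defined as 0
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
        next_value = values[t]
    return adv
```

Setting lam=1 recovers Monte Carlo advantages (return-to-go minus value, high variance); lam=0 recovers the one-step TD residual (low variance, more bias from the value estimate).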
How do GRPO, GSPO, and CISPO differ from PPO and from each other?

GRPO drops the value network and computes a group-relative advantage across multiple completions of the same prompt; it typically applies the KL penalty as a separate loss term. GSPO keeps GRPO’s idea but moves importance sampling to the sequence level (length-normalized geometric mean ratio), avoiding “token dropping” and instability from per-token ratios. CISPO clips the importance weights themselves (with stop-gradient) instead of clipping the surrogate objective, preserving nonzero gradients for all tokens while controlling variance.
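The two ratio treatments can be sketched side by side (a toy sketch for one sequence; in a real implementation the CISPO weight would be wrapped in a stop-gradient/detach so only the log-prob term is differentiated, and the clip bounds here are illustrative):

```python
import numpy as np

def gspo_sequence_ratio(logp_new, logp_old):
    # One ratio per sequence: the length-normalized geometric mean
    # of per-token ratios, exp((1/T) * sum_t (logp_new - logp_old)).
    return float(np.exp(np.mean(np.asarray(logp_new) - np.asarray(logp_old))))

def cispo_token_weights(logp_new, logp_old, eps_low=0.2, eps_high=0.2):
    # Clip the per-token IS weights themselves (treated as constants
    # via stop-gradient in practice), so every token keeps a nonzero
    # gradient through its own log-probability.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.clip(ratio, 1 - eps_low, 1 + eps_high)
```

Note how a sequence whose per-token ratios are 2.0 and 0.5 has a GSPO ratio of exactly 1: large opposite-signed per-token deviations cancel in the geometric mean, which is part of why the sequence-level form is more stable when rewards are sequence-level.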
How is KL regularization applied in practice?

Two conventions exist: fold KL into the reward (reward shaping) or add it as a separate loss term. PPO and many REINFORCE-style implementations often subtract KL from the reward; GRPO commonly adds KL directly to the loss. Reasoning-focused RL with verifiable rewards sometimes reduces or removes KL, though it remains a key stability tool in RLHF.
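Either convention needs a per-token estimate of the KL to the reference model. A common sketch (using the k1/k3 estimator naming from John Schulman's "Approximating KL Divergence" note; assumed here, not taken from the book's code) is:

```python
import numpy as np

def kl_estimators(logp_policy, logp_ref):
    """Per-token Monte Carlo estimators of KL(policy || ref),
    evaluated on tokens sampled from the policy."""
    log_r = np.asarray(logp_ref) - np.asarray(logp_policy)  # log(ref/policy)
    k1 = -log_r                      # unbiased, but can go negative per token
    k3 = np.exp(log_r) - 1 - log_r   # lower variance, always >= 0
    return k1, k3
```

The k1 values can then be subtracted from per-token rewards (reward shaping) or the k3 values averaged into a separate loss term, matching the two conventions above.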
Which implementation details most affect stability in RLHF?

Commonly impactful choices include loss aggregation (per-sequence, per-token, or fixed-length normalization), handling EOS and truncation (score only at EOS; penalize over-long outputs), reward/advantage normalization or whitening, value-network initialization (e.g., from the reward model), KL estimators and controllers, and where to place the KL term (reward vs. loss). These choices can dominate outcomes as much as the algorithm choice.
Why is asynchronicity common, and how is off-policy drift addressed?

Modern systems separate “actors” (generation) and “learners” (updates) to keep GPUs busy, especially with long reasoning traces. This introduces slight off-policy drift as weights lag between components or across multiple gradient steps. Importance sampling (ratios) corrects for the mismatch, and techniques like sequence-level packing improve throughput. Fully asynchronous/off-policy extensions are being actively explored for large-scale training.
