Overview

6 Reinforcement Learning

This chapter surveys how reinforcement learning is used to align language models with human preferences in RLHF. It starts from the practical training loop: the current policy generates completions, a reward model scores them, and a KL penalty to a frozen reference constrains drift while the optimizer updates the policy. While deep RL spans value-based and policy-gradient families (often combined in actor–critic setups), the algorithms that popularized RLHF are policy-gradient methods run on on-policy data. Despite the mathematical sophistication, the authors stress that outcomes hinge heavily on data quality and reward design.

The chapter builds the policy-gradient objective from first principles (returns, the log-derivative trick, value functions) and shows why baselines and the advantage A(s,a) reduce variance without introducing bias. It then instantiates the theory in widely used algorithms. REINFORCE applies Monte Carlo estimates and simple baselines; RLOO lowers variance by using a leave-one-out per-prompt baseline. PPO adds importance sampling to reuse batches and a clipped surrogate to bound step size, typically with token-level credit assignment and a learned value function that enables advantage estimation via GAE. GRPO removes the value network and computes group-relative sequence-level advantages (with common variants that adjust normalization), while PPO-like clipping is retained and the KL term is often added directly to the loss. For long sequences and large models, the chapter motivates alternatives to token-level importance ratios: GSPO moves ratios to the sequence level (a length-normalized geometric mean), and CISPO clips the IS weights directly, with stop-gradient, so every token still receives signal. Each choice reflects a bias–variance and stability trade-off when rewards are sequence-level.
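The baseline idea at the core of this build-up can be sketched numerically. The snippet below (a toy sketch, assuming the simple batch-mean baseline that one common REINFORCE variant uses) shows how per-sample advantages are formed before weighting the log-probabilities:

```python
import numpy as np

def reinforce_advantages(rewards):
    """Batch-mean baseline: A_i = r_i - mean(r).

    Subtracting a baseline that does not depend on the sampled action
    leaves the policy gradient unbiased but lowers its variance.
    """
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def reinforce_loss(seq_logprobs, rewards):
    # Gradient ascent on E[A * log pi]: minimize the negated mean,
    # so higher-than-average completions get their log-prob pushed up.
    adv = reinforce_advantages(rewards)
    return -(adv * np.asarray(seq_logprobs, dtype=float)).mean()
```

By construction the batch-mean advantages sum to zero, so roughly half the completions in a batch are pushed up and half down.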

Turning to practice, the authors detail choices that strongly affect stability and throughput: whether KL is folded into the reward or added to the loss, how to aggregate per-token losses (per-sequence, per-token, or fixed-length normalization), value-network initialization and targets, and reward/advantage normalization or whitening. They contrast bandit-style (sequence-level) and MDP-style (token-level) framings, note how per-token KL can coexist with sequence-level advantages, and discuss asynchronous actor–learner systems that keep GPUs busy while tolerating slight off-policy drift (plus techniques like sequence packing). They also explain “double regularization”: in typical LLM setups with one gradient step per batch, PPO clipping often does little and the KL penalty dominates; with one-step updates, PPO/GRPO reduce to near-vanilla policy gradient. Finally, the chapter highlights reasoning-oriented refinements, such as asymmetric clipping, dynamic sampling, and value-aware variants (e.g., DAPO, VAPO), and emphasizes that most methods share the same backbone; careful choices of advantage estimation, importance-sampling granularity, loss aggregation, and systems design usually determine stability, sample efficiency, and final quality.
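The loss-aggregation choice is easy to see on toy numbers. Below is a minimal sketch (the `aggregate_loss` helper and its mode names are illustrative, not from the book's code) of the three conventions named above, applied to variable-length sequences:

```python
import numpy as np

def aggregate_loss(token_losses, mode, max_len=None):
    """token_losses: list of 1-D arrays, one per sequence (variable length)."""
    if mode == "per_sequence":
        # Mean within each sequence, then mean across sequences:
        # every sequence contributes equally, regardless of length.
        return float(np.mean([t.mean() for t in token_losses]))
    if mode == "per_token":
        # Mean over all tokens pooled together:
        # longer sequences contribute more tokens, hence more weight.
        return float(np.concatenate(token_losses).mean())
    if mode == "fixed_length":
        # Sum per sequence, divide by a constant max length:
        # short sequences are implicitly down-weighted.
        return float(np.mean([t.sum() / max_len for t in token_losses]))
    raise ValueError(f"unknown mode: {mode}")
```

With one 2-token sequence of per-token loss 1.0 and one 1-token sequence of loss 4.0, the three modes give 2.5, 2.0, and (with `max_len=4`) 0.75: the same batch, three different gradients.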

Overview of the RLHF training loop. A prompt from the dataset is passed to the tuned policy, which generates a completion. The reward model scores this completion, while the frozen initial model computes log probabilities on the same text to calculate a KL penalty that prevents excessive drift. The combined reward signal then drives a reinforcement learning update to the policy parameters.
Basic REINFORCE architecture for language models. The shaped reward combines the reward model score with a KL penalty from the reference model. We build on this structure throughout the chapter.
REINFORCE Leave-One-Out (RLOO) architecture. Multiple completions per prompt provide a leave-one-out baseline for advantage estimation without learning a value function.
PPO architecture. A learned value function enables Generalized Advantage Estimation (GAE) for per-token advantages, used with a clipped surrogate objective.
Visualization of the different regions of the PPO objective for a hypothetical advantage. The “trust region” is the region where the probability ratio lies within \(1\pm\varepsilon\).
Value function training uses on-policy rollouts to compute targets. The model predicts \(V_t\) at each token, which is trained via MSE against the target return \(\hat{V}_t\). The advantage \(A_t = \hat{V}_t - V_t\) then weights the policy gradient update.
GRPO architecture. Advantages are normalized relative to the group mean and standard deviation. The KL penalty is applied directly in the loss rather than shaping the reward.
A comparison of the generation and update phases for synchronous versus asynchronous RL training, following Noukhovitch et al. 2024.
An example distributed RL system, where two queues are managed to pass data to the learner and actor GPUs, which can both be synchronized with a distributed computing library such as Ray. Olmo Team 2025, license CC-BY.

Summary

  • The RL stage of RLHF closes the loop: the policy generates completions, the reward model scores them, and policy gradient algorithms update the model to produce higher-reward outputs. This is where the preference signal captured by the reward model is translated into model behavior.
  • Policy gradient algorithms derive a gradient of expected reward with respect to model parameters, enabling direct optimization of the policy. Practical algorithms then reduce the variance of this gradient estimate through baselines – reference values that measure how good an action is relative to what’s typical – and advantage functions.
  • The chapter covers several algorithms, each with different trade-offs and focuses:
    • REINFORCE / RLOO: The simplest approaches – REINFORCE uses Monte Carlo return estimates with a batch-level baseline, while RLOO uses a leave-one-out baseline across multiple completions per prompt. Neither requires a learned value function.
    • PPO: Uses importance sampling with clipped surrogate objectives and a learned value function for per-token advantage estimation via GAE. Dominated early RLHF but requires more memory and implementation complexity.
    • GRPO: Replaces the learned value function of PPO with group-relative advantage normalization across multiple completions per prompt. Popular for reasoning tasks after DeepSeek R1’s release.
    • GSPO / CISPO: Address numerical instability in per-token importance sampling ratios, particularly for large MoE models. GSPO computes a single importance ratio per sequence rather than per token, while CISPO clips the importance weights themselves rather than the objective, ensuring every token still receives a gradient signal.
  • Implementation details such as loss aggregation (per-token vs. per-sequence), clipping strategies, value network initialization, and handling truncated generations all affect stability and final model quality.
  • Asynchronous training – where generation and gradient updates run on separate systems that may hold different model weights – is crucial to modern RL infrastructure for LLMs and introduces distribution mismatch that importance sampling (and other techniques) must correct for.
  • Generalized Advantage Estimation (GAE) reduces variance in PPO’s policy gradient updates by blending multi-step temporal difference (TD) residuals, though newer value-function-free methods like GRPO and RLOO are making it less central to modern LLMs.
  • The same policy gradient algorithms covered here form the backbone of reasoning training (chapter 7), where they are applied at larger scale with verifiable rewards rather than learned reward models.
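The group-relative advantage that distinguishes GRPO in the list above can be sketched in a few lines (assuming the common mean/std-normalized variant; as noted, some variants adjust or drop the normalization):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # Normalize each completion's reward against its own prompt group:
    # A_i = (r_i - mean(group)) / (std(group) + eps).
    # No learned value function is needed -- the group is the baseline.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Because the statistics are computed per prompt group, a reward of 2.0 can yield a positive advantage for one prompt and a negative one for another, which is exactly the point: completions compete only against their peers.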

FAQ

How does the RLHF training loop use reinforcement learning?

The policy (the fine-tuned model) samples responses to prompts, a reward model scores each response, and a KL penalty to a frozen reference model discourages drift. These scores are combined into a shaped reward. A policy-gradient optimizer then updates the policy on this on-policy data, iterating over many batches/epochs.
Why are policy-gradient methods favored over value-based methods for language models?

Policy-gradient methods (e.g., REINFORCE, PPO, GRPO) optimize a parameterized policy directly using recent on-policy samples, matching the LM generation workflow. Actor-critic variants reduce variance by using value estimates. Value-based, replay-heavy methods (like DQN) are less natural for sequence generation and large action spaces and are rarely used in RLHF for LMs.
What is an advantage and why do baselines matter in policy gradients?

The advantage measures how much better an action is than the average at a state. Subtracting a baseline (e.g., a value estimate, batch mean, or a leave-one-out mean) from returns reduces gradient variance without changing the expected gradient. In practice, advantage estimates stabilize learning, especially with sparse or noisy rewards and stochastic generation.
How do REINFORCE and RLOO differ?

REINFORCE estimates the gradient from sampled returns, often with a baseline to reduce variance. RLOO (REINFORCE Leave-One-Out) sets the baseline for a completion to the average reward of the other completions from the same prompt, so the advantage becomes “this sample vs. its peers.” RLOO needs multiple completions per prompt and achieves low-variance, sequence-level credit assignment without a value network.
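The leave-one-out baseline is a one-liner in vectorized form; a minimal sketch (for k completions of a single prompt) is:

```python
import numpy as np

def rloo_advantages(rewards):
    # Baseline for completion i = mean reward of the OTHER k-1
    # completions from the same prompt (leave-one-out), so each
    # sample is scored against its peers without a value network.
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    loo_mean = (r.sum() - r) / (k - 1)
    return r - loo_mean
```

Note that this equals the batch-mean advantage scaled by k/(k-1), so for large groups the two baselines nearly coincide, but the leave-one-out form keeps each sample's baseline independent of its own reward.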
What problem does PPO solve and how does clipping help?

PPO reuses data collected under an older policy via importance ratios between the current and old policies. Large ratios can cause destructive updates, so PPO clips the ratio within a trust region to prevent overly big steps. For LMs, PPO is computed per token and typically pairs with a learned value function to compute per-token advantages. Within the clip range it behaves like standard policy gradients; outside, gradients are limited. With only one gradient step per batch, clipping often has no effect.
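The clipped surrogate itself is short; here is a numpy sketch of the standard pessimistic-minimum form (per-token, with a default epsilon of 0.2 as an illustrative choice):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    # Importance ratio between current and data-collecting policy.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic minimum: caps the payoff from pushing the ratio
    # beyond [1-eps, 1+eps] in the direction the advantage favors.
    return np.minimum(unclipped, clipped)
```

For a token with ratio 1.5 and positive advantage the objective is capped at 1.2; for ratio 0.5 and negative advantage the minimum keeps the more pessimistic clipped value, so in both cases the incentive to move further outside the trust region vanishes.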
What is the role of the value function and GAE in PPO?

The value function serves as a learned baseline that predicts the future return from each token, enabling token-level credit assignment. Generalized Advantage Estimation (GAE) blends multi-step advantages to balance bias and variance via a lambda parameter. Together, they significantly stabilize and improve learning compared with raw Monte Carlo returns.
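The GAE recursion can be sketched directly from the TD residuals (a minimal backward-pass sketch, assuming the episode terminates after the last token):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);
    # A_t = sum_{l>=0} (gamma * lam)^l * delta_{t+l}, computed backward.
    T = len(rewards)
    adv = np.zeros(T)
    next_value = 0.0  # V after the terminal token is defined as 0
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
        next_value = values[t]
    return adv
```

Setting lam=1 recovers Monte Carlo advantages (return-to-go minus value, high variance); lam=0 recovers the one-step TD residual (low variance, more bias from the value estimate).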
How do GRPO, GSPO, and CISPO differ from PPO and from each other?

GRPO drops the value network and computes a group-relative advantage across multiple completions of the same prompt; it typically applies the KL penalty as a separate loss term. GSPO keeps GRPO’s idea but moves importance sampling to the sequence level (length-normalized geometric mean ratio), avoiding “token dropping” and instability from per-token ratios. CISPO clips the importance weights themselves (with stop-gradient) instead of clipping the surrogate objective, preserving nonzero gradients for all tokens while controlling variance.
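The two ratio treatments can be sketched side by side (a toy sketch for one sequence; in a real implementation the CISPO weight would be wrapped in a stop-gradient/detach so only the log-prob term is differentiated, and the clip bounds here are illustrative):

```python
import numpy as np

def gspo_sequence_ratio(logp_new, logp_old):
    # One ratio per sequence: the length-normalized geometric mean
    # of per-token ratios, exp((1/T) * sum_t (logp_new - logp_old)).
    return float(np.exp(np.mean(np.asarray(logp_new) - np.asarray(logp_old))))

def cispo_token_weights(logp_new, logp_old, eps_low=0.2, eps_high=0.2):
    # Clip the per-token IS weights themselves (treated as constants
    # via stop-gradient in practice), so every token keeps a nonzero
    # gradient through its own log-probability.
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.clip(ratio, 1 - eps_low, 1 + eps_high)
```

Note how a sequence whose per-token ratios are 2.0 and 0.5 has a GSPO ratio of exactly 1: large opposite-signed per-token deviations cancel in the geometric mean, which is part of why the sequence-level form is more stable when rewards are sequence-level.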
How is KL regularization applied in practice?

Two conventions exist: fold KL into the reward (reward shaping) or add it as a separate loss term. PPO and many REINFORCE-style implementations often subtract KL from the reward; GRPO commonly adds KL directly to the loss. Reasoning-focused RL with verifiable rewards sometimes reduces or removes KL, though it remains a key stability tool in RLHF.
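Either convention needs a per-token estimate of the KL to the reference model. A common sketch (using the k1/k3 estimator naming from John Schulman's "Approximating KL Divergence" note; assumed here, not taken from the book's code) is:

```python
import numpy as np

def kl_estimators(logp_policy, logp_ref):
    """Per-token Monte Carlo estimators of KL(policy || ref),
    evaluated on tokens sampled from the policy."""
    log_r = np.asarray(logp_ref) - np.asarray(logp_policy)  # log(ref/policy)
    k1 = -log_r                      # unbiased, but can go negative per token
    k3 = np.exp(log_r) - 1 - log_r   # lower variance, always >= 0
    return k1, k3
```

The k1 values can then be subtracted from per-token rewards (reward shaping) or the k3 values averaged into a separate loss term, matching the two conventions above.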
Which implementation details most affect stability in RLHF?

Commonly impactful choices include loss aggregation (per-sequence, per-token, or fixed-length normalization), handling EOS and truncation (score only at EOS; penalize over-long outputs), reward/advantage normalization or whitening, value-network initialization (e.g., from the reward model), KL estimators and controllers, and where to place the KL term (reward vs. loss). These choices can dominate outcomes as much as the algorithm choice.
Why is asynchronicity common, and how is off-policy drift addressed?

Modern systems separate “actors” (generation) and “learners” (updates) to keep GPUs busy, especially with long reasoning traces. This introduces slight off-policy drift as weights lag between components or across multiple gradient steps. Importance sampling (ratios) corrects for the mismatch, and techniques like sequence-level packing improve throughput. Fully asynchronous/off-policy extensions are being actively explored for large-scale training.
