The RLHF Book

Reinforcement learning from human feedback, alignment, and post-training LLMs

Nathan Lambert

MEAP began November 2025
Last updated February 2026
Publication in Summer 2026 (estimated)

ISBN 9781633434301
225 pages (estimated)


printed in black & white

available in Korean

table of contents

1 Introduction

1.1 What Does RLHF Do?

1.2 An Intuition for Post-Training

1.3 How We Got Here

1.4 Future of RLHF

2 A Tiny History of RLHF

2.1 Origins to 2018: RL on Preferences

2.2 2019 to 2022: RL from Human Preferences on Language Models

2.3 2023 to Present: ChatGPT Era

3 Training Overview

3.1 Problem Formulation

3.1.1 A Simple Example: The Thermostat

3.1.2 Classic RL Example: CartPole

3.1.3 Manipulating the Standard RL Setup

3.1.4 Fine-tuning and Regularization

3.1.5 Optimization Tools

3.2 Canonical Training Recipes

3.2.1 InstructGPT

3.2.2 Tülu 3

3.2.3 DeepSeek R1

4 Instruction Fine-tuning

4.1 Chat templates and the structure of instructions

4.2 Best practices of instruction tuning

4.3 Implementation

5 Reward Modeling

5.1 Training Reward Models

5.2 Architecture

5.3 Implementation Example

5.4 Variants

5.4.1 Preference Margin Loss

5.4.2 Balancing Multiple Comparisons Per Prompt

5.4.3 K-wise Loss Function

5.5 Outcome Reward Models

5.6 Process Reward Models

5.7 Reward Models vs. Outcome RMs vs. Process RMs vs. Value Functions

5.7.1 Inference Differences

5.8 Generative Reward Modeling

5.9 Further Reading

6 Reinforcement Learning

6.1 Policy Gradient Algorithms

6.1.1 Vanilla Policy Gradient

6.1.2 REINFORCE

6.1.3 REINFORCE Leave One Out (RLOO)

6.1.4 Proximal Policy Optimization (PPO)

6.1.5 Group Relative Policy Optimization (GRPO)

6.1.6 Group Sequence Policy Optimization (GSPO)

6.1.7 Clipped Importance Sampling Policy Optimization (CISPO)

6.1.8 Comparing Algorithms

6.2 Implementation

6.2.1 Policy Gradient Basics

6.2.2 Loss Aggregation

6.2.3 Asynchronicity

6.2.4 Proximal Policy Optimization

6.2.5 Group Relative Policy Optimization

6.3 Auxiliary Topics

6.3.1 Generalized Advantage Estimation (GAE)

6.3.2 Double Regularization

6.3.3 Further Reading

7 Reasoning & Inference-Time Scaling

7.1 The Origins of New Reasoning Models

7.1.1 Why Does RL Work Now?

7.1.2 RL Training vs. Inference-time Scaling

7.1.3 The Future (Beyond Reasoning) of RLVR

7.2 Understanding Reasoning Training Methods

7.2.1 Reasoning Research Pre OpenAI’s o1 or DeepSeek R1

7.2.2 Early Reasoning Models

7.2.3 Common Practices in Training Reasoning Models

7.3 Looking Ahead

8 Direct Alignment Algorithms

8.1 Direct Preference Optimization (DPO)

8.1.1 How DPO Works

8.1.2 DPO Derivation

8.2 Numerical Concerns, Weaknesses, and Alternatives

8.3 Implementation Considerations

8.4 DAAs with Synthetic Preference Data

8.5 DAAs vs. RL: Online vs. Offline Data

9 Rejection Sampling

9.1 Training Process

9.1.1 Generating Completions

9.1.2 Scoring Completions

9.1.3 Fine-tuning

9.2 Implementation Details

9.3 Related: Best-of-N Sampling

10 The Nature of Preferences

10.1 The Origins of RLHF and Preferences

10.2 Specifying objectives: from logic of utility to reward functions

10.3 Tools for optimizing utility

10.4 Complexity of optimizing preferences

11 Preference Data

11.1 Why We Need Preference Data

11.2 Collecting Preference Data

11.2.1 Interface

11.2.2 Rankings vs. Ratings

11.2.3 Multi-turn Data

11.2.4 Structured Preference Data

11.2.5 Sourcing and Contracts

11.3 Bias: Things to Watch Out For in Data Collection

11.4 Open Questions in RLHF Preference Data

12 Synthetic Data

12.1 Distillation

12.2 AI Feedback

12.2.1 Balancing AI and Human Feedback Data

12.2.2 Specific LLMs for Judgement

12.3 Constitutional AI

12.3.1 Further Reading on CAI

12.4 Rubrics: AI Feedback for Training

13 Tool Use & Function Calling

13.1 Interweaving Tool Calls in Generation

13.2 Multi-step Tool Reasoning

13.3 Model Context Protocol (MCP)

13.4 Implementation

14 Over Optimization

14.1 Qualitative Over-optimization

14.1.1 Managing Proxy Objectives

14.1.2 Over-refusal and “Too Much RLHF”

14.2 Quantitative over-optimization

14.3 Misalignment and the Role of RLHF

15 Regularization

15.1 KL Divergences in RL Optimization

15.1.1 Reference Model to Generations

15.1.2 Implementation Example

15.2 Pretraining Gradients

15.3 Other Regularization

16 Evaluation

16.1 Prompting Formatting: From Few-shot to Zero-shot to CoT

16.2 Why Many External Evaluation Comparisons are Unreliable

16.3 How Labs Actually use Evaluations Internally to Improve Models

16.4 Contamination

16.5 Tooling

17 Product, UX, and Model Character

17.1 Character Training

17.2 Model Specifications

17.3 Product Cycles, UX, and RLHF

Appendix

Appendix A: Definitions

A.1 Language Modeling Overview

A.2 ML Definitions

A.3 NLP Definitions

A.4 RL Definitions

A.5 RLHF Only Definitions

A.6 Extended Glossary

Appendix B: Beyond ‘Just Style’

Appendix C: Practical Issues

C.1 Compute Costs of Post-Training

C.2 Evaluation Variance

C.3 Managing Training Performance Variance

C.4 Identifying Bad Training Jobs

Overview

7 Reasoning & Inference-Time Scaling

Reasoning-focused language models and inference-time scaling have driven major performance gains from late 2024 through 2025 by training models to think more before answering and by spending more compute at inference. Building on the classic “cake” view of learning—self-supervision as the cake, instruction tuning as the icing, and reinforcement learning as the cherry—this wave confirmed that scaled reinforcement learning can reliably elicit stronger reasoning, coding, and math abilities. A central development is Reinforcement Learning with Verifiable Rewards (RLVR), which complements RLHF: instead of scoring subjective qualities with a learned reward model, RLVR leans on objective checks (e.g., correctness of a math answer or passing unit tests) to shape behavior. Models like o1 and DeepSeek R1 popularized this approach and demonstrated that allocating more test-time compute—longer chains of thought or multiple sampled solutions—translates into markedly better results.

Operationally, RLVR follows a simple but powerful loop: sample multiple answers, verify them with deterministic checks, reinforce the successful trajectories, and repeat—often revisiting the same problems many times to consolidate rare, latent behaviors into robust skills. Verification functions make reward modeling optional and reduce over-optimization risks in domains with clear signals. This training both encourages and benefits from inference-time scaling: models learn to generate longer reasoning traces when useful and to leverage multiple candidate solutions with answer selection. Improved stability, better tooling, and stronger base models have made large-scale RL runs practical, while distillation and instruction tuning on the outputs of RL-trained models propagate reasoning behaviors to smaller or faster variants without fully reproducing the original RL cost.

Practically, a shared playbook has emerged: filter data by difficulty so the model sees problems it solves inconsistently (to create learning signal), schedule curricula or online filtering during training, remove or relax constraints like KL penalties and clipping to enable exploration, adopt asynchronous or off-policy updates for throughput, add light rewards for formatting and language consistency, manage length with penalties and “thinking budget” curricula, normalize losses to avoid bias, and scale test-time compute with parallel rollouts and answer selection. Empirical lessons include that text-only reasoning phases can improve multimodal performance and that system prompts can toggle reasoning depth. The field is moving rapidly toward open documentation of full training lifecycles and early scaling laws for RL in reasoning. Reinforcement learning has shifted from a decorative “cherry on top” to a load-bearing component of modern post-training, and while today’s RLVR techniques are not final, they form the foundation for the next generation of reasoning-capable models.

Figure: RLVR in the form of an RL feedback loop. Instead of a reward model, a verification function is used.

Summary

- Reasoning models use reinforcement learning with verifiable rewards (RLVR) to dramatically improve performance on math, code, science, and other tasks where the correctness of an answer can be assessed. Unlike RLHF, which requires a learned reward model, RLVR uses verification functions – such as answer matching or unit tests – that return definitive correctness signals.
- The core training loop applies the policy gradient algorithms from chapter 6 in a remarkably simple way: sample multiple answers to questions, take gradient steps toward the correct ones, and repeat.
- Inference-time scaling – using more computation during generation to get better responses – is closely linked to reasoning training. RL-trained models learn to produce longer reasoning chains that are strongly correlated with improved performance, a meaningful shift from the superficial length bias seen in early RLHF.
- Common practices for reasoning training include difficulty filtering (training only on problems the model solves 20-80% of the time), removing or relaxing KL penalties to allow greater exploration, format and language consistency rewards, and progressive length scheduling to combat overthinking.
- The reasoning model landscape evolved rapidly from OpenAI’s o1 and DeepSeek R1 through dozens of open-weight models – including Qwen 3, Phi-4 Reasoning, Llama-Nemotron, and OLMo 3 Think – with the field converging on recipes that combine instruction tuning, RLHF, and large-scale RLVR in carefully sequenced stages. Substantial changes to these recipes are expected to continue.

FAQ

What are “reasoning models,” and what changed around 2024–2025?

Reasoning models are language models trained to think extensively before answering (often generating internal reasoning tokens) and to leverage more compute at inference. Around 2024–2025, large-scale reinforcement learning with verifiable rewards (RLVR), together with RLHF and long-context models, enabled a major jump in performance on math, code, and science problems. This shifted post-training priorities industry-wide toward scaling RL and inference-time compute.

How does RLVR differ from RLHF?

- RLHF: Uses a learned reward model to score subjective qualities like clarity, helpfulness, and safety; there is no single correct answer.
- RLVR: Uses domain verification functions (e.g., exact answer checks or unit tests) that provide definitive rewards (often binary or partial credit). The reward model is optional because correctness can be directly verified.

How are answers “verified” in RLVR for math and code?

- Math: Extract the final answer from the model’s output (e.g., via an answer marker like “The answer is:” or special tokens) and compare to the known correct value to return a 1/0 reward (or partial credit variants).
- Code: Run unit tests on the generated code; if all pass, reward = 1; otherwise 0 (with optional partial credit by tests passed). No learned reward model is required.
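The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function names, the exact-string comparison, and the `(args, expected)` test format are assumptions made for the example, and only the "The answer is:" marker comes from the text. Real verifiers typically normalize answers and run untrusted code in a sandbox.

```python
from typing import Optional, Sequence, Tuple

ANSWER_MARKER = "The answer is:"  # marker from the text; parsing rules below are assumptions


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last answer marker, or None if no answer is found."""
    idx = completion.rfind(ANSWER_MARKER)
    if idx == -1:
        return None
    tail = completion[idx + len(ANSWER_MARKER):].strip()
    if not tail:
        return None
    # Keep only the first line and drop a trailing period.
    return tail.splitlines()[0].strip().rstrip(".").strip()


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference string, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


def code_reward(
    source: str,
    func_name: str,
    tests: Sequence[Tuple[tuple, object]],
    partial_credit: bool = False,
) -> float:
    """Run (args, expected) tests against a function defined in `source`.

    WARNING: exec() on model output is unsafe outside a sandbox; it is used
    here only to keep the sketch self-contained.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that fails to load earns no reward
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    if not tests:
        return 0.0
    if partial_credit:
        return passed / len(tests)
    return 1.0 if passed == len(tests) else 0.0
```

With `partial_credit=True` the code reward becomes the fraction of tests passed, which corresponds to the optional partial-credit variant mentioned above.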

What does the RLVR training loop look like, and why does it generalize?

1. Sample multiple answers to many questions.
2. Take gradient steps toward the answers that verify as correct.
3. Repeat, revisiting the same data many times.

Despite its simplicity, this procedure helps the model discover and reinforce behaviors correlated with correctness, and gains on training tasks often transfer to new questions and related domains.
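To make the loop concrete, here is a toy sketch that replaces the language model with a softmax distribution over a handful of candidate answers to a single question. Everything here is an assumption for illustration (the function names, the mean-reward baseline, the hyperparameters); it is a plain REINFORCE-style update, not any specific algorithm from the book, but it shows how sampling, verifying, and reinforcing shifts probability toward the answers that verify as correct.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rlvr_toy(
    candidates: Sequence[str],
    correct: str,
    steps: int = 200,
    samples_per_step: int = 8,
    lr: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """Train a toy categorical policy with verified 0/1 rewards.

    Returns the final probability assigned to each candidate answer.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # start from a uniform policy
    for _ in range(steps):  # revisit the same question many times
        probs = softmax(logits)
        # 1) Sample multiple answers from the current policy.
        picks = rng.choices(range(len(candidates)), weights=probs, k=samples_per_step)
        # 2) Verify each sample with a deterministic check.
        rewards = [1.0 if candidates[i] == correct else 0.0 for i in picks]
        # Subtract the mean reward so all-right or all-wrong batches give no update.
        baseline = sum(rewards) / len(rewards)
        # 3) Policy gradient step: d log p(i) / d logit_j = 1[i == j] - p_j.
        grad = [0.0] * len(logits)
        for i, r in zip(picks, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                grad[j] += advantage * ((1.0 if j == i else 0.0) - probs[j])
        logits = [l + lr * g / samples_per_step for l, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    final = rlvr_toy(["3", "4", "5"], correct="4")
    print([round(p, 3) for p in final])
```

Note that a batch in which every sample is correct, or every sample is wrong, produces zero advantage and therefore no update, which is one way to see why problems the model solves inconsistently carry the most learning signal.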

What is inference-time scaling, and how do reasoning models use it?

Inference-time scaling means spending more compute during generation (e.g., longer reasoning chains or multiple sampled rollouts) to improve accuracy. Reasoning models often produce more tokens per response and can aggregate multiple rollouts (e.g., majority vote or learned selector) to boost performance. The key is a strong correlation between extra tokens/compute and downstream gains, not just longer outputs.
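As a small sketch of the parallel form of inference-time scaling, the function below aggregates several sampled rollouts by majority vote over their extracted final answers. The function name and the convention that `None` marks a rollout with no parseable answer are assumptions for the example; a learned selector would replace the vote count with a scoring model.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer, or None if no rollout produced one."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Five hypothetical rollouts for the same question; one failed to produce an answer.
    rollouts = ["4", "5", "4", None, "4"]
    print(majority_vote(rollouts))
```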

Why is RL “working now” for language models?

- Improved stability and infrastructure for long RL runs.
- Mature, accessible tooling (e.g., TRL, Open Instruct, veRL, OpenRLHF).
- Stronger pretrained bases reached a capability threshold that makes reasoning RL viable (observed circa 2024+).
- Broad practitioner experience reducing brittleness (loss spikes, crashes) and reproducibility issues.

What common practices improve RL training for reasoning?

- Offline difficulty filtering (focus on problems the base solves ~20–80% of the time).
- Per-batch online filtering/curriculums to schedule difficulty over training.
- Removing the KL penalty to allow broader exploration when rewards are reliable.
- Relaxed/two-sided clipping variants to encourage exploration and reduce spurious signals.
- Off-policy or asynchronous updates to keep GPUs busy on variable-length traces.
- Small format rewards (e.g., enforce think/answer structure) and language consistency rewards for multilingual stability.
- Length penalties or progressive context extension to curb overthinking and stabilize training.
- Batch-level loss/advantage normalization to avoid group-length biases.
- Parallel test-time compute (majority vote or learned selector over multiple rollouts).
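The first practice in the list, offline difficulty filtering, can be sketched as follows. The 20–80% band comes from the text; the function names and the data layout (a mapping from a problem ID to the verified pass/fail results of several sampled attempts) are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of sampled attempts that verified as correct."""
    if not results:
        raise ValueError("need at least one sampled attempt per problem")
    return sum(results) / len(results)


def filter_by_difficulty(
    attempts: Dict[str, Sequence[bool]],
    low: float = 0.2,
    high: float = 0.8,
) -> List[str]:
    """Keep problems whose pass rate falls inside [low, high].

    Problems that are always solved or never solved are dropped because
    they give little signal when rewards are compared within a group.
    """
    return [pid for pid, results in attempts.items() if low <= pass_rate(results) <= high]


if __name__ == "__main__":
    # Hypothetical verification results from four samples per problem.
    attempts = {
        "too_easy": [True, True, True, True],
        "useful": [True, False, True, False],
        "too_hard": [False, False, False, False],
    }
    print(filter_by_difficulty(attempts))
```

The same function could be applied per batch during training to approximate the online filtering described in the list, with pass rates recomputed from the current policy's samples.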

What are “thinking tokens,” and can users control reasoning length?

Thinking tokens are the model’s internal reasoning traces (often bracketed, then followed by a concise answer). For hard tasks, thousands of tokens may be generated. Training often rewards consistent formatting and, in many systems, users can toggle or budget reasoning effort via system prompts or length-controlled training. Some models also distill long-reasoning behavior into smaller models.
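A format reward of the kind described here can be sketched as a simple pattern check. The `<think>...</think>` tags and the 0.1 bonus are assumptions chosen for the example, since the text only says that reasoning is "often bracketed"; actual systems differ in their delimiters and reward sizes.

```python
import re
from typing import Optional, Tuple

# Assumed layout: a non-empty <think> block followed by a non-empty answer.
THINK_PATTERN = re.compile(
    r"\A\s*<think>(?P<thinking>.+?)</think>\s*(?P<answer>\S.*)\Z",
    re.DOTALL,
)


def split_reasoning(completion: str) -> Optional[Tuple[str, str]]:
    """Return (thinking, answer) if the completion follows the assumed layout."""
    match = THINK_PATTERN.match(completion)
    if match is None:
        return None
    return match.group("thinking").strip(), match.group("answer").strip()


def format_reward(completion: str, bonus: float = 0.1) -> float:
    """Small bonus for well-formed think/answer structure, 0.0 otherwise."""
    return bonus if THINK_PATTERN.match(completion) else 0.0


if __name__ == "__main__":
    sample = "<think>2 + 2 = 4</think>The answer is: 4"
    print(split_reasoning(sample), format_reward(sample))
```

Keeping the bonus small relative to the correctness reward reflects the "small format rewards" practice: the structure is encouraged without letting it outweigh getting the answer right.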

How does RLVR compare to standard instruction tuning for developers?

- Instruction tuning (often with parameter-efficient methods like LoRA) matches model outputs to provided completions in 1–2 epochs, primarily shaping behavior and style.
- RLVR optimizes for correctness using verification signals, running hundreds or thousands of epochs over smaller, curated datasets to turn sparse, fragile skills into robust, repeatable behaviors.

Where can I find code, tools, and reference models to get started?

Functional code for the chapter’s algorithms is linked at https://rlhfbook.com/code. Widely used open tools include TRL, Open Instruct, veRL, and OpenRLHF. Landmark reasoning systems like OpenAI’s o1 and DeepSeek R1 popularized the approach, and many open-weight successors document practices you can replicate or adapt.

