Overview

1 Introduction

Reinforcement Learning from Human Feedback (RLHF) is presented as the key technique that turned large language models from next-token predictors into helpful, safe, and engaging assistants. Born from the need to solve hard-to-specify objectives where human preferences are subtle and context dependent, RLHF rose to prominence with ChatGPT and now sits inside a broader post-training stack alongside instruction/supervised fine-tuning and reinforcement learning with verifiable rewards. Its central contribution is shaping model behavior—style, tone, format, and reliability—so outputs align with what people actually want, while also improving generalization across domains compared with instruction tuning alone.

The canonical RLHF pipeline proceeds in three stages: first, instruction tuning teaches models to follow prompts in a question-answering format; second, human preference data trains a reward model that scores responses; third, reinforcement learning optimizes the policy to produce higher-scoring outputs. Unlike per-token learning in instruction tuning, RLHF updates at the response level using contrastive signals that highlight both preferable and undesirable behaviors. This flexibility delivers strong gains but introduces challenges: reward models are proxy objectives prone to over-optimization, regularization is essential, length bias can appear, costs are higher than simple fine-tuning, and success depends on starting from a capable base model and interleaving stages within a full post-training recipe.
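Stated compactly, the third stage is usually written as maximizing the reward model's score while staying close to a reference policy. The notation below is the standard form from the literature rather than anything specific to this chapter: π_θ is the policy being trained, π_ref the frozen SFT model, r_φ the reward model, and β the regularization strength.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
```

The KL term is the regularization the paragraph above calls essential: without it, the policy drifts toward whatever the proxy reward over-rates.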

The chapter traces the field’s evolution from early instruction-tuning recipes (e.g., Alpaca-era) and open skepticism about RLHF, through the widespread adoption of preference-tuning methods like Direct Preference Optimization, to today’s scaled, multi-stage post-training in frontier systems that increasingly emphasize reasoning and RL with verifiable rewards. It argues for an elicitation view of post-training—most capability is learned in pretraining, and post-training extracts and organizes it—while noting that modern RL-based methods now shape not just style but complex behaviors. Looking ahead, RLHF remains the foundation and bridge to richer RL-driven training, with the enduring goal of mapping human values and objectives into models, and the book aims to provide the concepts, trade-offs, and practical tools needed to do so.

[Figure] A rendition of the early, three-stage RLHF process: first training via supervised fine-tuning (SFT, chapter 4), building a reward model (RM, chapter 5), and then optimizing with reinforcement learning (RL, chapter 6).

Summary

  • RLHF incorporates human preferences into AI systems to solve problems that are hard-to-specify programmatically, and became widely known through ChatGPT’s breakout, which made the capabilities of language models more approachable.
  • The basic RLHF pipeline has three steps: instruction fine-tuning to teach the model to follow the question-answering format, training a reward model on human preferences, and optimizing the model with RL against that reward.
  • RLHF is known to primarily change the style, tone, and format of model responses – making them more helpful, warm, and engaging. But it’s not “just style transfer”: RLHF also improves benchmark performance, though over-optimization (e.g., excessive length or chattiness) can harm capabilities in other domains.
  • The elicitation theory of post-training suggests that base models contain latent potential, and post-training’s job is to extract and cultivate that intelligence into useful behaviors.
  • RLHF is one component of modern post-training, alongside instruction fine-tuning (IFT/SFT) and reinforcement learning with verifiable rewards (RLVR), used together in an intertwined manner to craft particular training recipes.

FAQ

What is Reinforcement Learning from Human Feedback (RLHF) and what problems does it solve?
RLHF is a technique that incorporates human preference information into AI systems to tackle hard-to-specify objectives. It is especially useful when explicit task rewards are unavailable or user preferences are subtle or difficult to express, making it relevant across domains where humans interact directly with models.
Why did RLHF become prominent after the release of ChatGPT?
RLHF was central to making large language models feel helpful, harmless, and engaging in real-world use. ChatGPT’s success showcased how aligning models to human preferences could transform base models from pure next-token predictors into general-purpose assistants, catalyzing rapid investment in post-training methods.
What are the three main steps in the canonical RLHF pipeline?
- Supervised fine-tuning (SFT) to teach instruction-following and basic formatting
- Collecting human preference data to train a reward model that scores responses
- Optimizing the model with RL by sampling completions, scoring them with the reward model, and updating the policy to make better responses more likely (a toy sketch of this step follows the list)
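To make the third step concrete, here is a toy, self-contained sketch (not code from the book): the "policy" is a categorical distribution over four canned responses and the "reward model" is a hard-coded scorer, but the REINFORCE-style update has the same shape as the real thing: sample, score, and make higher-scoring outputs more likely.

```python
import torch

responses = ["curt reply", "helpful answer", "rambling text", "off-topic joke"]
toy_rewards = torch.tensor([0.2, 1.0, 0.4, 0.0])          # stand-in for reward-model scores

logits = torch.zeros(len(responses), requires_grad=True)  # toy "policy" parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample((16,))                           # sample a batch of "completions"
    rewards = toy_rewards[sampled]                         # score them with the "reward model"
    advantage = rewards - rewards.mean()                   # simple baseline to reduce variance
    loss = -(dist.log_prob(sampled) * advantage).mean()    # REINFORCE-style policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # most probability mass ends up on "helpful answer"
```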
How does RLHF change a model’s behavior compared to a base or SFT-only model?
Base models continue text in an open-ended way, often mimicking internet artifacts. SFT adds reliable Q&A behavior and formatting. RLHF further shifts responses to be concise, reliable, warm, and engaging by optimizing at the response level using comparative (contrastive) signals about what is better or worse, rather than just next-token prediction.
Where does RLHF fit within modern post-training?
Post-training commonly includes three optimization types: (1) Instruction/Supervised Fine-tuning (IFT/SFT) to learn features and formats, (2) Preference Fine-tuning (PreFT), where RLHF dominates, to align style and subtle human preferences, and (3) Reinforcement Learning with Verifiable Rewards (RLVR) to boost performance on verifiable tasks. RLHF is the core of the second stage.
What is a reward model, and how is human feedback used in RLHF?
A reward model maps a prompt and candidate response to a scalar score reflecting human preferences. It is trained on datasets of pairwise (or comparative) judgments across diverse prompts and labelers. During RL, this model scores sampled completions so the policy can be updated toward higher-scoring behaviors.
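As a concrete illustration, here is a minimal sketch of the standard pairwise loss (not the book's code): the reward model is trained so that the chosen response in each pair outscores the rejected one. `score_chosen` and `score_rejected` stand in for the model's scalar outputs on a batch of preference pairs.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected),
    # minimized when the chosen response is scored above the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Made-up scores for a batch of three preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.8, 1.1])
print(pairwise_reward_loss(chosen, rejected))  # smaller when the preferred ranking is respected
```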
Why can RL-based post-training generalize better than instruction tuning alone?
RL-based methods optimize entire responses against a quality signal (human preferences or verifiable rewards), not just per-token likelihood. This response-level, comparative training yields broader generalization across domains and tasks than SFT alone.
What are the main challenges and costs of doing RLHF?
- Training a robust reward model is nontrivial and application-dependent
- Over-optimization risk due to proxy rewards, requiring careful regularization and controls (see the sketch after this list)
- Practical issues like length bias can emerge
- RLHF is more expensive in data, compute, and time, and it benefits from a strong starting model
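One common form of that regularization is to fold a per-token KL penalty against a frozen reference model into the reward before the policy update. The sketch below is illustrative only: `policy_logprobs` and `ref_logprobs` would come from the trained policy and the SFT reference model, and the tensors here are fabricated placeholders.

```python
import torch

def kl_shaped_reward(reward: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     beta: float = 0.05) -> torch.Tensor:
    # Per-token penalty log pi_theta(y_t | x) - log pi_ref(y_t | x), a simple
    # sample-based estimate of divergence from the reference model.
    kl_per_token = policy_logprobs - ref_logprobs
    # Subtract the summed penalty from the sequence-level reward-model score.
    return reward - beta * kl_per_token.sum(dim=-1)

# Placeholder values: reward-model scores for 2 responses, 8 tokens each
reward = torch.tensor([1.3, 0.4])
policy_lp = torch.randn(2, 8) * 0.1 - 2.0
ref_lp = policy_lp - 0.05
print(kl_shaped_reward(reward, policy_lp, ref_lp))
```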
What is the “Elicitation Theory” of post-training?
It posits that base models already contain broad knowledge and capabilities, and post-training primarily elicits and amplifies valuable behaviors—like upgrading a car’s performance around a fixed chassis. Significant gains can be achieved by shaping outputs from next-token prediction into high-quality conversational behavior, and the compute spent on post-training is growing for frontier models.
How has the field evolved, and what’s the outlook for RLHF?
After early instruction-tuned models (e.g., Alpaca, Vicuna), skepticism about RLHF gave way to strong results from groups that invested in it. Direct Preference Optimization (DPO) simplified preference learning and drove open-recipe progress, while closed labs built multi-stage post-training pipelines. Newer directions like RLVR and reasoning-focused RL are accelerating, with RLHF remaining the core of preference fine-tuning and a bridge to more advanced RL-based methods.
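For reference, the published DPO objective replaces the separate reward model and RL loop with a single classification-style loss on preference pairs (notation from the DPO paper, not this chapter): y_w is the chosen response, y_l the rejected one, π_ref a frozen reference model, and β a temperature on the implicit reward.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```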
