1 Overview
Reinforcement Learning from Human Feedback (RLHF) integrates human preferences into AI systems to solve hard-to-specify objectives, especially in settings where users’ needs are nuanced or implicit. It rose to prominence by helping transform large language models from raw next-token predictors into conversational assistants that respond clearly, helpfully, and with an appropriate tone and format. This chapter explains why RLHF matters, how it changes models’ behavior—shaping style, structure, and reliability—and situates it within the broader “post-training” toolkit that turns pretrained models into practical, general-purpose systems.
The RLHF pipeline typically proceeds in three stages: train an instruction-following model, collect human preference data to fit a reward model, and then optimize the policy with reinforcement learning using that proxy reward. In modern post-training, this sits alongside instruction/supervised fine-tuning (to establish formatting and instruction-following behavior) and reinforcement learning with verifiable rewards (to push performance in domains with clear correctness signals). RLHF primarily drives preference fine-tuning: it optimizes whole responses, encourages better behaviors while discouraging worse ones, and often generalizes across domains more effectively than instruction tuning alone. This flexibility comes with costs and pitfalls—designing robust reward models, guarding against over-optimization and length bias, and higher compute and data expense—and it works best when built on a strong base model.
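Concretely, the final optimization stage is usually written as a KL-regularized objective. The formulation below is the standard one from the RLHF literature (the notation is a convention, not specific to this book): r_phi is the learned reward model, pi_theta the policy being trained, pi_ref a frozen reference model (typically the SFT checkpoint), and beta a penalty strength that keeps the policy from drifting too far from the reference.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```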
RLHF has powered progress across tasks like control, summarization, instruction following, web question answering, and safety alignment, and the field’s open ecosystem has evolved through waves: early synthetic instruction-tuning, skepticism about the need for RLHF, the rise of direct preference optimization to simplify preference learning, and recognition that leading results rely on multi-stage, large-scale post-training. An intuitive lens is elicitation: much of a model’s latent capability from pretraining can be unlocked by post-training, analogous to refining a race car around a fixed chassis. The book’s aim is to distill trade-offs and best practices from fast-moving research, provide hands-on starting points, and build the intuition needed to apply RLHF within full post-training workflows as newer methods—like verifiable-reward RL and advanced reasoning training—continue to mature.
A rendition of the early three-stage RLHF process: first, training via supervised fine-tuning (SFT, Chapter 4); then building a reward model (RM, Chapter 5); and finally optimizing with reinforcement learning (RL, Chapter 6).
Future of RLHF
With the investment in language modeling, many variations on the traditional RLHF methods have emerged, and "RLHF" has colloquially become shorthand for several overlapping approaches. RLHF is one member of the broader family of preference fine-tuning (PreFT) techniques, which also includes Direct Alignment Algorithms (see Chapter 8): the class of methods downstream of DPO that solve the preference-learning problem by taking gradient steps directly on preference data rather than learning an intermediate reward model. RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training that happens after the large-scale autoregressive training on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning, along with the implementation details needed to set up a model for RLHF training.
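To make that distinction concrete, here is a minimal sketch of a DPO-style loss, assuming the caller has already computed summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model; the names, shapes, and beta value are illustrative stand-ins, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-response log-probabilities.
    """
    # Log-ratios of policy to reference for the preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the chosen log-ratio above the rejected one (the implicit reward margin).
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage: 4 preference pairs with random stand-in log-probabilities.
policy_chosen = torch.randn(4, requires_grad=True)
loss = dpo_loss(policy_chosen, torch.randn(4), torch.randn(4), torch.randn(4))
loss.backward()
```

The key design point is that the gradient flows directly through the policy's log-probabilities on preference pairs; no separate reward model is trained or queried.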
As more successes of fine-tuning language models with RL emerge, such as OpenAI's o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may shine more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF is that it offers a lens for studying the grand problems facing modern forms of AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to serve as a foundation for decades of research and lessons on these problems.
FAQ
What is RLHF and why did it become important?
Reinforcement Learning from Human Feedback (RLHF) incorporates human preference signals into AI training to handle hard-to-specify objectives—common in real user interactions where preferences are nuanced or tacit. It moved from early control problems to mainstream prominence with ChatGPT, helping transform LLMs from raw next-token predictors into helpful, user-facing assistants.
What is the basic three-step RLHF pipeline?
- Train an instruction-following base model via supervised fine-tuning (SFT/IFT).
- Collect human preference data to train a reward model (RM), as sketched just after this list.
- Optimize the policy with reinforcement learning, sampling responses and scoring them with the RM.
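A minimal sketch of the pairwise (Bradley-Terry) loss commonly used in the reward-modeling step, assuming the reward model has already produced a scalar score for each prompt-response pair; the function and variable names are illustrative, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the chosen response's score above the rejected one's."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: scalar scores for 4 preference pairs, e.g. the outputs of a
# value head on top of a language model (hypothetical setup).
chosen = torch.randn(4, requires_grad=True)
rejected = torch.randn(4)
loss = reward_model_loss(chosen, rejected)
loss.backward()
```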
How does RLHF fit into modern post-training?
Post-training typically combines three methods: (1) Instruction/Supervised Fine-tuning to teach format and basic instruction following, (2) Preference Fine-tuning—dominated by RLHF—to align style and subtle human preferences, and (3) RL with Verifiable Rewards (RLVR) to boost performance on verifiable tasks. RLHF is the core of preference fine-tuning and remains more established than RLVR.
What does RLHF actually change in model behavior?
RLHF pushes models toward responses that are more helpful, concise, warm, and appropriately formatted for users. Instead of free-form continuation, models learn to give direct, user-aligned answers, improving generalization across domains and matching the tone and structure users prefer.
How is RLHF different from instruction fine-tuning (SFT/IFT)?
SFT learns per-token likelihood on supervised targets—great for format and basic capabilities. RLHF optimizes at the response level using preference signals (contrastive objectives), teaching what makes one answer better than another and which behaviors to avoid, leading to stronger cross-domain generalization.
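As a concrete illustration of the per-token SFT objective, here is a minimal sketch that computes cross-entropy only over the response tokens while masking out the prompt; the shapes, masking convention, and names are illustrative assumptions rather than this book's exact recipe.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Token-level cross-entropy on the response only (prompt tokens are masked out).

    logits: (seq_len, vocab_size) predictions for the next token at each position.
    labels: (seq_len,) token ids of the full prompt + response sequence.
    """
    # Shift so position t predicts token t+1, the standard causal-LM setup.
    shift_logits = logits[:-1]
    shift_labels = labels[1:].clone()
    # Ignore loss on the prompt portion; only supervise the response tokens.
    shift_labels[: prompt_len - 1] = -100  # -100 is skipped via ignore_index
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy usage: random "model output" over a 12-token sequence and a 100-token vocab,
# where the first 5 tokens are the prompt.
logits = torch.randn(12, 100, requires_grad=True)
labels = torch.randint(0, 100, (12,))
loss = sft_loss(logits, labels, prompt_len=5)
loss.backward()
```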
What are the main challenges and costs of RLHF?
- Reward modeling is hard: preferences are noisy and application-dependent.
- Over-optimization risk: proxy rewards can drive undesired behaviors without regularization (see the KL-penalty sketch after this list).
- Biases (e.g., length bias) can appear if unmanaged.
- It requires a strong starting model and is costly in data, compute, and engineering effort—more than SFT.
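One common guard against over-optimization in PPO-style RLHF implementations is to subtract a per-token KL penalty against the frozen reference model from the reward signal; below is a minimal sketch under assumed tensor shapes and an illustrative beta value, not a definitive implementation.

```python
import torch

def kl_shaped_rewards(rm_score: torch.Tensor,
                      policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """Combine a sequence-level reward-model score with a per-token KL penalty.

    rm_score: (batch,) scalar score from the reward model for each response.
    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the sampled tokens
    under the current policy and the frozen reference model.
    """
    # Per-token estimate of KL(policy || reference) along the sampled trajectory.
    kl = policy_logprobs - ref_logprobs
    # Penalize every token, then add the RM score on the final token of each response.
    rewards = -beta * kl
    rewards[:, -1] += rm_score
    return rewards

# Toy usage: batch of 2 responses, 6 sampled tokens each.
rewards = kl_shaped_rewards(torch.randn(2), torch.randn(2, 6), torch.randn(2, 6))
print(rewards.shape)  # torch.Size([2, 6])
```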