Overview

10 The Nature of Preferences

Reinforcement learning from human feedback centers on modeling human preferences in domains where explicit reward design is hard. The chapter argues that “better” is often irreducibly subjective—illustrated by judging two poems—so human judgments are used as indirect reward signals to align models with what people tend to prefer. Because preferences are psychologically, socially, and philosophically complex, RLHF sits at the intersection of philosophy, psychology, economics and decision theory, optimal control and reinforcement learning, and modern deep learning. In practice, today’s systems prioritize empirical alignment on concrete tasks and style, while research continues on pluralistic alignment across populations and personalization to individuals.

Historically, RLHF draws on ideas that link preferences, rewards, and costs to a quantitative notion of utility under uncertainty. Modern reinforcement learning inherits tools from optimal control—Bellman equations, reward-to-go, discounting, and the Markov decision process—and from operant conditioning’s notion of reward as a signal of desirability. With temporal-difference learning and Q-learning, these methods achieved notable success in games and control. However, their guarantees assume stable, well-specified rewards and closed environments. When RLHF compresses diverse, multimodal human judgments into a single scalar reward model, it departs from those assumptions, and related strands like inverse reinforcement learning remain underused in large-scale language settings.
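
For readers who want the formal objects behind those terms, the standard definitions (textbook RL notation, not reproduced from the chapter) are the discounted return, the Bellman optimality equation, and the Q-learning temporal-difference update:

```latex
% Discounted return (reward-to-go) with discount factor 0 \le \gamma < 1:
G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}

% Bellman optimality equation for the action-value function of an MDP:
Q^*(s,a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right]

% Q-learning temporal-difference update with step size \alpha:
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t,a_t) \right]
```

The guarantees mentioned above (e.g., convergence of tabular Q-learning) rest on the reward and transition dynamics being fixed; a learned, drifting reward model breaks that premise.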

The chapter details why optimizing preferences is inherently harder than optimizing fixed rewards. Human preferences drift over time, depend on context and presentation, and are embedded in social relations. While the Von Neumann–Morgenstern utility theorem licenses utility-based modeling, its assumptions rarely hold cleanly in open-ended, partially observed language tasks; research in human-computer interaction shows that interface framing can alter expressed choices, and social choice theory proves that no aggregation rule satisfies all fairness desiderata at once. Assumptions that enable interpersonal utility comparisons invite principal–agent framings but create tensions with corrigibility. Practically, RLHF must contend with nontransitive or incomparable judgments, proxy feedback signals, choice-set and ordering effects, and low inter-rater agreement. The upshot is that RLHF will never be a fully solved problem; it is a useful approximation that demands careful dataset engineering, explicit acknowledgment of uncertainty and pluralism, and evaluation tailored to real-world use.

Figure: Timeline of the integration of various subfields into the modern version of RLHF. Direct links denote continuous development of specific technologies; arrows indicate motivations and conceptual links.

Summary

  • RLHF sits at the intersection of philosophy, economics, psychology, reinforcement learning, and deep learning – each bringing its own assumptions about what preferences are and how they can be optimized.
  • Reinforcement learning was designed for domains with stable, deterministic reward functions, but human preferences are noisy, context-dependent, temporally shifting, and not always transitive – a fundamental mismatch that shapes the limitations of RLHF.
  • The Von Neumann–Morgenstern utility theorem provides theoretical license for modeling preferences as scalar functions, but its assumptions (transitivity, comparability, stability) are routinely violated in practice. Impossibility theorems in social choice theory further show that no single aggregation method over preferences can satisfy all fairness criteria simultaneously.
  • These challenges explain why RLHF will never be fully “solved,” but they do not prevent it from being useful. In practice, RLHF operates on more tractable problems of style and performance rather than attempting to resolve the full complexity of human values.
  • The practical mechanics of collecting and structuring preference data in light of RLHF’s complex motivations are covered in Chapter 11.

FAQ

What is RLHF and why are human preferences central to it?
Reinforcement learning from human feedback (RLHF) trains models using human judgments when explicit reward functions are hard to specify. Early work called it “reinforcement learning from human preferences” because human preferences supply the comparisons, ratings, and other signals that reward models learn to predict, which then guide policy optimization.

Why is “Which poem is better?” a good example for RLHF?
Unlike factual questions with a single correct answer, evaluating creative outputs (like poems) is subjective and context-dependent. This lack of ground truth motivates using human feedback as an indirect reward signal to align models with what people tend to prefer.

Why will RLHF never be a fully solved problem?
Human preferences are plural, dynamic, and context-sensitive. Aggregating them raises conflicts from social choice theory (e.g., not all fairness criteria can be met simultaneously), introduces measurement bias, and depends on presentation and timing effects, so any solution is inherently approximate and contingent.
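
A classic illustration of those aggregation conflicts is the Condorcet paradox: each annotator ranks options transitively, yet pairwise majority vote over the group is cyclic. The sketch below uses hypothetical annotators and responses, not data from the chapter:

```python
from itertools import combinations

# Three hypothetical annotators, each with a perfectly transitive ranking
# of responses A, B, and C (leftmost = most preferred).
annotator_rankings = [
    ["A", "B", "C"],  # annotator 1: A > B > C
    ["B", "C", "A"],  # annotator 2: B > C > A
    ["C", "A", "B"],  # annotator 3: C > A > B
]

def majority_prefers(x, y, rankings):
    """True if a strict majority of rankings place x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y, annotator_rankings) else (y, x)
    print(f"majority prefers {winner} over {loser}")

# Prints: A over B, C over A, B over C -- a cycle. No single scalar
# reward function can represent this aggregate preference.
```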
Which assumptions from classic RL conflict with real human preferences?
  • Stationary, deterministic reward functions vs. non-stationary, context-shifting human preferences
  • Markovian state assumptions vs. rich histories and partial observability in language tasks
  • Single scalar utility maximization vs. multiple, sometimes incompatible human values
  • Clear optimality notions vs. ambiguity and disagreement among annotators
How do “preference,” “reward,” and “cost” relate in RLHF?
They are different formalisms for “relative goodness.” Economics and decision theory motivate preferences; control and RL operationalize goals via rewards (maximize) or costs (minimize). RLHF compresses messy, multi-criteria human judgments into a scalar reward model that a policy can optimize.
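
The chapter does not prescribe a particular loss for that compression, but a common formulation fits the scalar reward model to pairwise preferences via the Bradley–Terry model. A minimal numpy sketch, with hypothetical scores for illustration:

```python
import numpy as np

def bradley_terry_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Negative log-likelihood that each chosen response beats its rejected pair.

    Under the Bradley-Terry model, P(chosen > rejected) equals
    sigmoid(r_chosen - r_rejected), where r_* are scalar reward-model scores.
    """
    logits = r_chosen - r_rejected
    # -log(sigmoid(x)) in numerically stable form: log(1 + exp(-x))
    return float(np.mean(np.logaddexp(0.0, -logits)))

# Hypothetical reward-model scores for three preference pairs
r_chosen = np.array([1.2, 0.4, 2.0])
r_rejected = np.array([0.3, 0.9, -0.5])
print(bradley_terry_loss(r_chosen, r_rejected))  # lower means a better fit
```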
What’s the difference between empirical alignment and value alignment in RLHF?
Empirical alignment optimizes for observable performance on tasks (e.g., helpfulness, harmlessness) using human feedback. Value alignment aims to reflect deeper, often contested human values across people and contexts. RLHF in practice emphasizes empirical alignment, while research continues on pluralistic alignment and personalization.

What role do Bellman equations and MDPs play, and what are their limits here?
Modern RL algorithms rely on Bellman recursions within Markov decision processes to estimate reward-to-go and improve policies with theoretical guarantees. Open-ended language settings violate many MDP assumptions (stationarity, full observability), so these guarantees rarely carry over cleanly to RLHF on LLMs.

How does the Von Neumann–Morgenstern (VNM) utility theorem relate to RLHF?
VNM shows that, under certain axioms, preferences can be represented as expected utility, licensing scalar reward modeling. In practice, its assumptions are strained: preferences shift over time, can be intransitive, and depend on framing and interface design, especially in partially observed, high-dimensional language tasks.
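
Stated compactly in standard textbook form (paraphrased, not quoted from the chapter): if preferences over lotteries satisfy completeness, transitivity, continuity, and independence, then there exists a utility function u such that

```latex
L_1 \succeq L_2 \iff \mathbb{E}[u(L_1)] \ge \mathbb{E}[u(L_2)],
\qquad \text{where } \mathbb{E}[u(L)] = \sum_i p_i \, u(o_i)
```

for a lottery L yielding outcome o_i with probability p_i. Intransitive or framing-dependent judgments violate the hypotheses, so no such u is guaranteed to exist for real human feedback.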
How do social choice results affect preference aggregation for RLHF?
Arrow-style impossibility theorems imply no aggregation method can satisfy all desirable criteria at once. Workarounds (e.g., interpersonal utility comparisons, principal–agent framing, multi-principal settings) help but can clash with goals like corrigibility. This makes global “one-size-fits-all” alignment inherently limited.
What data and measurement pitfalls should RLHF practitioners watch for?
  • Low inter-annotator agreement masking genuine pluralism (see the measurement sketch after this list)
  • Presentation and UI effects that alter expressed preferences
  • Preference drift during sequential labeling
  • Overreliance on proxy signals (e.g., dwell time) that entangle deployment and data collection
  • Choice set design (binary vs. multiple options) and timing/context effects
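
Agreement, at least, can be measured. A common statistic is Cohen’s kappa, which corrects raw agreement for chance; below is a minimal sketch with hypothetical labels from two annotators (illustrative data only):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e
    is the agreement expected if each annotator labeled independently at
    random according to their own marginal label frequencies.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pairwise-preference labels from two annotators
# ("A" = first response preferred, "B" = second response preferred)
ann1 = ["A", "A", "B", "A", "B", "B", "A", "B"]
ann2 = ["A", "B", "B", "A", "A", "B", "A", "A"]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")  # 0.25, well below the raw 62.5% agreement
```

A low kappa can signal noisy labeling, but, as the list above notes, it can also reflect genuine pluralism rather than annotator error.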
