The RLHF Book

Reinforcement learning from human feedback, alignment, and post-training LLMs

Nathan Lambert

MEAP began November 2025
Last updated February 2026
Publication in Summer 2026 (estimated)

ISBN 9781633434301
225 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Korean

table of contents

1 Introduction

1.1 What Does RLHF Do?

1.2 An Intuition for Post-Training

1.3 How We Got Here

1.4 Future of RLHF

2 A Tiny History of RLHF

2.1 Origins to 2018: RL on Preferences

2.2 2019 to 2022: RL from Human Preferences on Language Models

2.3 2023 to Present: ChatGPT Era

3 Training Overview

3.1 Problem Formulation

3.1.1 A Simple Example: The Thermostat

3.1.2 Classic RL Example: CartPole

3.1.3 Manipulating the Standard RL Setup

3.1.4 Fine-tuning and Regularization

3.1.5 Optimization Tools

3.2 Canonical Training Recipes

3.2.1 InstructGPT

3.2.2 Tülu 3

3.2.3 DeepSeek R1

4 Instruction Fine-tuning

4.1 Chat templates and the structure of instructions

4.2 Best practices of instruction tuning

4.3 Implementation

5 Reward Modeling

5.1 Training Reward Models

5.2 Architecture

5.3 Implementation Example

5.4 Variants

5.4.1 Preference Margin Loss

5.4.2 Balancing Multiple Comparisons Per Prompt

5.4.3 K-wise Loss Function

5.5 Outcome Reward Models

5.6 Process Reward Models

5.7 Reward Models vs. Outcome RMs vs. Process RMs vs. Value Functions

5.7.1 Inference Differences

5.8 Generative Reward Modeling

5.9 Further Reading

6 Reinforcement Learning

6.1 Policy Gradient Algorithms

6.1.1 Vanilla Policy Gradient

6.1.2 REINFORCE

6.1.3 REINFORCE Leave One Out (RLOO)

6.1.4 Proximal Policy Optimization (PPO)

6.1.5 Group Relative Policy Optimization (GRPO)

6.1.6 Group Sequence Policy Optimization (GSPO)

6.1.7 Clipped Importance Sampling Policy Optimization (CISPO)

6.1.8 Comparing Algorithms

6.2 Implementation

6.2.1 Policy Gradient Basics

6.2.2 Loss Aggregation

6.2.3 Asynchronicity

6.2.4 Proximal Policy Optimization

6.2.5 Group Relative Policy Optimization

6.3 Auxiliary Topics

6.3.1 Generalized Advantage Estimation (GAE)

6.3.2 Double Regularization

6.3.3 Further Reading

7 Reasoning & Inference-Time Scaling

7.1 The Origins of New Reasoning Models

7.1.1 Why Does RL Work Now?

7.1.2 RL Training vs. Inference-time Scaling

7.1.3 The Future (Beyond Reasoning) of RLVR

7.2 Understanding Reasoning Training Methods

7.2.1 Reasoning Research Pre OpenAI’s o1 or DeepSeek R1

7.2.2 Early Reasoning Models

7.2.3 Common Practices in Training Reasoning Models

7.3 Looking Ahead

8 Direct Alignment Algorithms

8.1 Direct Preference Optimization (DPO)

8.1.1 How DPO Works

8.1.2 DPO Derivation

8.2 Numerical Concerns, Weaknesses, and Alternatives

8.3 Implementation Considerations

8.4 DAAs with Synthetic Preference Data

8.5 DAAs vs. RL: Online vs. Offline Data

9 Rejection Sampling

9.1 Training Process

9.1.1 Generating Completions

9.1.2 Scoring Completions

9.1.3 Fine-tuning

9.2 Implementation Details

9.3 Related: Best-of-N Sampling

10 The Nature of Preferences

10.1 The Origins of RLHF and Preferences

10.2 Specifying objectives: from logic of utility to reward functions

10.3 Tools for optimizing utility

10.4 Complexity of optimizing preferences

11 Preference Data

11.1 Why We Need Preference Data

11.2 Collecting Preference Data

11.2.1 Interface

11.2.2 Rankings vs. Ratings

11.2.3 Multi-turn Data

11.2.4 Structured Preference Data

11.2.5 Sourcing and Contracts

11.3 Bias: Things to Watch Out For in Data Collection

11.4 Open Questions in RLHF Preference Data

12 Synthetic Data

12.1 Distillation

12.2 AI Feedback

12.2.1 Balancing AI and Human Feedback Data

12.2.2 Specific LLMs for Judgement

12.3 Constitutional AI

12.3.1 Further Reading on CAI

12.4 Rubrics: AI Feedback for Training

13 Tool Use & Function Calling

13.1 Interweaving Tool Calls in Generation

13.2 Multi-step Tool Reasoning

13.3 Model Context Protocol (MCP)

13.4 Implementation

14 Over Optimization

14.1 Qualitative Over-optimization

14.1.1 Managing Proxy Objectives

14.1.2 Over-refusal and “Too Much RLHF”

14.2 Quantitative over-optimization

14.3 Misalignment and the Role of RLHF

15 Regularization

15.1 KL Divergences in RL Optimization

15.1.1 Reference Model to Generations

15.1.2 Implementation Example

15.2 Pretraining Gradients

15.3 Other Regularization

16 Evaluation

16.1 Prompting Formatting: From Few-shot to Zero-shot to CoT

16.2 Why Many External Evaluation Comparisons are Unreliable

16.3 How Labs Actually use Evaluations Internally to Improve Models

16.4 Contamination

16.5 Tooling

17 Product, UX, and Model Character

17.1 Character Training

17.2 Model Specifications

17.3 Product Cycles, UX, and RLHF

Appendix

Appendix A: Definitions

A.1 Language Modeling Overview

A.2 ML Definitions

A.3 NLP Definitions

A.4 RL Definitions

A.5 RLHF Only Definitions

A.6 Extended Glossary

Appendix B: Beyond ‘Just Style’

Appendix C: Practical Issues

C.1 Compute Costs of Post-Training

C.2 Evaluation Variance

C.3 Managing Training Performance Variance

C.4 Identifying Bad Training Jobs

Overview

12 Synthetic Data

This chapter explains how synthetic data has moved from a curiosity to a cornerstone of modern post-training. As models improved from early RLHF-era systems to GPT-4-class capabilities, they became reliable enough to generate prompts, completions, preferences, and filters at scale—often surpassing human annotators for many instruction-following tasks. While concerns like model collapse exist, the chapter argues that diverse teachers, deduplication, quality filters, and mixing in human data keep training regimes healthy. The upshot is a broader “post-training” paradigm in which synthetic data accelerates iteration, expands coverage, and enables large, longer-response datasets (from Alpaca-scale to Tülu 3 and OpenThoughts 3) that power today’s top models, even as humans remain crucial for ground truth, capability frontiers, and evaluation design.

The chapter then surveys the main ways synthetic data is used. Distillation—training smaller models on outputs from stronger “teacher” models—serves both as a general data engine (for instructions, preferences, and verification) and for targeted skill transfer (e.g., math and code). AI feedback (RLAIF) leverages LLM-as-a-judge to generate scalable preference labels and evaluations at a fraction of human cost, with practical tradeoffs: human data is high-noise/low-bias, while synthetic preference data is low-noise/high-bias. In practice, hybrid pipelines route hard or sensitive cases to humans and rely on AI for volume. The chapter also discusses judge calibration, including inconsistencies and self-preference bias, and notes that while specialized critic or judge models exist, leading general models—combined with techniques like repeated sampling or tournament ranking—tend to suffice for most workflows.

Finally, the chapter details Constitutional AI and rubric-based rewards as influential frameworks for AI-generated supervision. Constitutional AI uses a written set of principles to guide synthetic critiques of instruction data and to produce principle-grounded preference pairs for RLHF, seeding much of today’s RLAIF practice. Rubrics extend this idea by turning task-specific criteria—hard rules and graded principles—into near-verifiable signals for RL, enabling meaningful gains in domains like scientific reasoning and factuality. Although these methods now underpin many state-of-the-art post-training pipelines, the chapter emphasizes that synthetic data complements rather than replaces human input: people remain essential for setting objectives, creating benchmarks, and handling the toughest, most novel problems, while synthetic pipelines provide scalable, fast, and increasingly reliable supervision everywhere else.

Traditional knowledge distillation trains a smaller student model to match the soft probability distribution of a larger teacher model using KL divergence loss. Both models process the same input simultaneously, and temperature scaling ($\tau > 1$) softens the distributions to reveal more information about class relationships.
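The loss described above can be sketched for a single example as follows. This is a minimal illustration using only the standard library, not the book's implementation: the function names `log_softmax` and `distillation_kl` are hypothetical, and the `tau**2` scaling is a common convention from the distillation literature that the text above does not state.

```python
import math

def log_softmax(logits, tau=1.0):
    """Log-probabilities of softmax(logits / tau), computed stably."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def distillation_kl(teacher_logits, student_logits, tau=2.0):
    """KL(teacher || student) between temperature-softened distributions.

    Assumption: the result is multiplied by tau**2, a common convention
    that keeps gradient magnitudes comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, tau)  # teacher (target)
    log_q = log_softmax(student_logits, tau)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return tau * tau * kl

teacher = [4.0, 1.0, 0.5]
student = [0.5, 1.0, 4.0]
loss = distillation_kl(teacher, student, tau=2.0)
```

Raising `tau` flattens both distributions, so the student also receives signal about how the teacher ranks the non-argmax classes.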

Synthetic data generation in LLM post-training: prompts are passed through a strong model to generate completions, which are paired to create a training dataset. This dataset is then used to fine-tune smaller models via standard supervised learning. More complex pipelines may involve multiple models editing completions, generating preference pairs, or filtering for quality.
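A minimal sketch of the simplest version of this pipeline is shown below. The names `build_sft_dataset`, `toy_teacher`, and `keep` are hypothetical; in a real setup the teacher would be a call to a strong model and the filter would be a quality check, neither of which is specified here.

```python
from typing import Callable, Dict, List

def build_sft_dataset(
    prompts: List[str],
    teacher: Callable[[str], str],
    keep: Callable[[str, str], bool] = lambda p, c: bool(c.strip()),
) -> List[Dict[str, str]]:
    """Pair each prompt with a teacher completion, dropping filtered pairs."""
    dataset = []
    for prompt in prompts:
        completion = teacher(prompt)   # stand-in for a strong-model call
        if keep(prompt, completion):   # stand-in for a quality filter
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

def toy_teacher(prompt: str) -> str:
    """Placeholder teacher: returns an empty string for some prompts."""
    return "" if "skip" in prompt else f"Answer to: {prompt}"

sft_data = build_sft_dataset(["What is RLHF?", "skip me"], toy_teacher)
```

The resulting list of prompt and completion pairs is what a standard supervised fine-tuning loop for the smaller model would consume.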

Summary

Synthetic data has become essential to modern post-training – language models are used to generate training prompts, write completions, provide preference labels, and filter data at every stage of the pipeline.
Distillation, using outputs from a stronger model to train a smaller one, is the most common form of synthetic data usage and has largely replaced human-written completions for instruction tuning, though human data remains important at capability frontiers and for establishing ground truth answers.
AI feedback can approximate human preference labels at a fraction of the cost; when it is used within the RLHF pipeline, the approach is called RLAIF. A useful rule of thumb: human data is high-noise and low-bias, while synthetic preference data is low-noise and high-bias – making AI feedback easier to start with but prone to systematic second-order effects.
Constitutional AI (CAI) was the earliest large-scale use of synthetic data for RLHF, using a set of written principles to guide both instruction data revision and synthetic preference data generation.
Rubric-based rewards extend RL training to domains without clearly verifiable answers by using LLM judges to score completions against per-prompt evaluation criteria, enabling RL-style training on tasks like scientific reasoning, creative writing, or model personality.

FAQ

What is “synthetic data” in RLHF/post-training, and why has it become so important?

Synthetic data is model-generated training material (prompts, completions, critiques, and preference labels) used to fine-tune other models. With GPT-4–class systems, models became reliable enough to match or surpass human annotators on many instruction-following tasks, making synthetic data cheaper, faster to iterate on, and central to modern post-training.

Where does synthetic data plug into the training pipeline?

Across the pipeline: generating or expanding prompts, producing high-quality completions for supervised fine-tuning, creating preference data (AI feedback/RLAIF), filtering/rewriting outputs, and serving as judges for evaluation. Many state-of-the-art post-training recipes rely on synthetic data at multiple stages.

Does synthetic data cause “model collapse,” and how can we avoid it?

Collapse risks arise when repeatedly training on unfiltered, repetitive outputs from a single model. Practical mitigations include mixing in human data, using diverse teacher models, deduplicating aggressively, and enforcing strong quality filters. With these safeguards, large-scale synthetic data has not shown catastrophic collapse in modern pipelines.
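Two of these mitigations, deduplication and quality filtering, can be sketched in a few lines. This is an illustrative simplification: it removes only exact duplicates after normalization and uses a word-count threshold as a stand-in quality filter, whereas production pipelines typically use near-duplicate detection and learned or rule-based quality scores. The names `normalize`, `dedup_and_filter`, and `min_words` are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

def dedup_and_filter(samples, min_words=3):
    """Keep the first copy of each normalized sample that is long enough."""
    seen = set()
    kept = []
    for sample in samples:
        key = normalize(sample)
        if len(key.split()) < min_words:  # toy quality filter
            continue
        if key in seen:                   # exact duplicate after normalization
            continue
        seen.add(key)
        kept.append(sample)
    return kept

samples = [
    "The cat sat on the mat.",
    "the cat  sat on the mat",
    "Ok.",
    "A different, longer sentence.",
]
cleaned = dedup_and_filter(samples)
```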

What is distillation, and how is it used for LLMs?

Distillation is training a smaller “student” on outputs from a stronger “teacher.” In practice it’s used (1) as a data engine to produce instruction, preference, and verification data, and (2) to transfer specific skills (e.g., math, coding) into a smaller model. Success hinges on curated prompts, diverse sampling, and strict filtering of teacher outputs.

How should I balance AI feedback vs. human feedback?

AI feedback is dramatically cheaper and scales fast, but tends to be lower-noise and higher-bias; human data is higher-noise and lower-bias, offering a strong, reliable signal when curated. A common strategy is hybrid: route hard or high-stakes items to humans while letting AI judges handle the bulk of routine labeling and large-scale evaluation.
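One way to express the hybrid strategy is a simple routing rule. The sketch below assumes each item carries a `high_stakes` flag and a `judge_agreement` score in [0, 1] (for example, the fraction of repeated AI-judge samples that agree); both field names and the 0.8 threshold are illustrative assumptions, not values from the chapter.

```python
def route(items, agreement_threshold=0.8):
    """Split item ids into a human-labeling queue and an AI-labeling queue."""
    to_human, to_ai = [], []
    for item in items:
        # Send high-stakes items and items the AI judge is unsure about to humans.
        if item["high_stakes"] or item["judge_agreement"] < agreement_threshold:
            to_human.append(item["id"])
        else:
            to_ai.append(item["id"])
    return to_human, to_ai

items = [
    {"id": "a", "high_stakes": False, "judge_agreement": 1.0},
    {"id": "b", "high_stakes": False, "judge_agreement": 0.6},
    {"id": "c", "high_stakes": True, "judge_agreement": 1.0},
]
human_queue, ai_queue = route(items)
```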

What is Constitutional AI (CAI), and how does it relate to RLAIF?

CAI uses a written “constitution” of principles to (1) critique and revise instruction data to align outputs and (2) generate principle-guided preference labels that drive RLHF—i.e., RLAIF. It inaugurated large-scale use of AI feedback for alignment and inspired broader judge- and principle-based training methods.

Are general-purpose LLM judges reliable, or should I train a specialized judge?

LLM judges can show biases (e.g., self-preference) and calibration issues, but frontier models are extensively trained for evaluation and usually suffice. You can reduce bias via repeated sampling, self-refinement, or tournament ranking. Train a dedicated judge mainly when your task is highly specialized or involves private, domain-specific criteria.
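Repeated sampling can be sketched as a majority vote over several judge calls. The `judge` callable below is a hypothetical stand-in for an LLM-as-a-judge call that returns `"A"` or `"B"`; the returned agreement rate is an added convenience, not something the chapter prescribes.

```python
from collections import Counter

def majority_preference(judge, prompt, response_a, response_b, n_samples=5):
    """Query the judge n_samples times and return (winner, agreement rate).

    Use an odd n_samples so a two-way vote cannot tie.
    """
    votes = Counter(
        judge(prompt, response_a, response_b) for _ in range(n_samples)
    )
    winner, count = votes.most_common(1)[0]
    return winner, count / n_samples

# Toy judge that replays a fixed sequence of noisy verdicts.
verdicts = iter(["A", "B", "A", "A", "B"])
def toy_judge(prompt, a, b):
    return next(verdicts)

label, agreement = majority_preference(toy_judge, "Which is better?", "x", "y")
```

A low agreement rate is a cheap signal that the comparison is ambiguous for the judge and may deserve a closer look.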

What are “rubrics,” and how do they enable RL when answers aren’t verifiable?

Rubrics are structured, near-verifiable criteria (e.g., Hard Rules vs. Principles) that let an LLM judge score candidate responses. Those scores serve as rewards for RL, enabling improvement in domains like scientific reasoning and factuality where exact ground truth is hard to check automatically.

How do I design and scale rubric-based training cost-effectively?

Start with a domain-level base rubric, then have an LLM assign per-prompt rubric scores during training. Use concise, atomic criteria with explicit weights and priority tags to guide learning. This reduces per-prompt rubric generation costs while preserving consistent, actionable feedback signals for RL.
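A rough sketch of turning weighted rubric criteria into a scalar reward follows. The `Criterion` class, the `hard_rule` flag, and the choice to zero out the reward when a hard rule fails are illustrative assumptions; the per-criterion scores are assumed to come from an LLM judge and to lie in [0, 1].

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Criterion:
    text: str
    weight: float
    hard_rule: bool = False  # assumption: hard rules gate the whole reward

def rubric_reward(criteria: List[Criterion], judge_scores: List[float]) -> float:
    """Weighted average of per-criterion judge scores, gated by hard rules."""
    if len(criteria) != len(judge_scores):
        raise ValueError("need exactly one judge score per criterion")
    for criterion, score in zip(criteria, judge_scores):
        if criterion.hard_rule and score < 1.0:
            return 0.0  # a failed hard rule overrides everything else
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * s for c, s in zip(criteria, judge_scores))
    return weighted / total_weight

rubric = [
    Criterion("Cites no fabricated sources", weight=2.0, hard_rule=True),
    Criterion("States the key assumption", weight=1.0),
    Criterion("Explains the result concisely", weight=1.0),
]
reward = rubric_reward(rubric, [1.0, 0.5, 1.0])
```

Keeping each criterion atomic makes the judge's per-criterion score easier to interpret and keeps the weights meaningful.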

How has synthetic dataset scale evolved, and what should I start with?

Datasets have grown from tens of thousands of short examples to millions of prompts with much longer responses, yielding orders-of-magnitude more tokens. For newcomers, begin with small, simple instruction-following sets to iterate quickly, then scale to larger, domain-focused synthetic corpora for peak performance.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$23.99 (regular price $47.99)

you save $24.00 (50%)

include audio: $12.49 (regular price $24.99)
