Instruction fine-tuning (also called supervised fine-tuning) emerged as a pragmatic bridge from next-token prediction to reliable instruction-following, turning general pretrained language models into assistants that respond in an instruction–response format. Building on prompting and in-context learning, the field shifted toward a unified “text-to-text” framing and large collections of instruction–response examples, which made broad task generalization far more dependable. Today, instruction fine-tuning is the standard first step of post-training and the essential foundation for RLHF: it establishes consistent question–answer behavior and the conversational structure models need in order to collect preferences and optimize with reinforcement learning. Central to this structure are chat templates that serialize conversations into tokens with explicit roles—system, user, and assistant—using special markers so models can reliably parse context, handle multi-turn dialogues, and continue generation from the assistant role.
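As a concrete illustration, the serialization a chat template performs can be sketched in a few lines of Python. This is a minimal sketch assuming ChatML-style markers; the `<|im_start|>`/`<|im_end|>` strings are illustrative, and real tokenizers ship their own templates with their own special tokens.

```python
# Minimal sketch of ChatML-style chat-template serialization.
# Marker strings are illustrative, not those of any particular model.
def serialize_chat(messages, add_generation_prompt=True):
    """Serialize role-tagged messages into a single prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # End with an empty assistant turn so the model continues
        # generating from the assistant role.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(serialize_chat(conversation))
```

Note the trailing assistant marker with no content: generation then proceeds until the model emits its end-of-message token.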
Effective instruction tuning hinges on data quality and distribution match to downstream use. While early systems achieved strong results with relatively small human-written sets, the trend has moved toward large-scale synthetic datasets that improve robustness across tasks. In practice, around a million well-targeted prompts can produce models that are excellent bases for RLHF, with diminishing returns beyond that. The model primarily learns from completions, so high-quality responses matter most; focused datasets can suffice for narrower chat alignment, and parameter-efficient approaches (such as QLoRA, which pairs low-rank adapters with a quantized base model) make the process accessible. Because later post-training stages can correct some noise, optimizing the overall pipeline is typically more impactful than over-optimizing any single stage.
Although the loss matches pretraining’s autoregressive objective, several implementation choices differ. Instruction-tuning jobs generally run with substantially smaller batch sizes than pretraining, reflecting shorter training runs and token budgets. Prompt masking is used so the model learns to predict assistant outputs rather than user queries, and multi-turn data can be handled either by training only on the final assistant turn or by masking all user turns while training on every assistant turn; long conversations are often unrolled into shorter examples. In the open ecosystem, chat templates are commonly implemented in tokenizers to ensure consistent tokenization of roles and turns, sometimes extending to tool-use markers. Together, these practices yield stable, instruction-following models that are ready for preference collection and RLHF optimization.
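Prompt masking is usually implemented at the label level: the targets for non-assistant tokens are replaced with a sentinel that the loss skips. The sketch below assumes a PyTorch-style cross-entropy with `ignore_index=-100`; the token ids and mask are made up for illustration.

```python
# Sketch of prompt masking for SFT: loss is computed only on assistant
# tokens. -100 is the conventional ignore_index of PyTorch-style
# cross-entropy losses (an assumption here; frameworks may differ).
IGNORE_INDEX = -100

def build_labels(token_ids, assistant_mask):
    """Copy token ids as labels, masking out non-assistant positions."""
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(token_ids, assistant_mask)
    ]

tokens = [101, 7, 8, 9, 102, 21, 22, 23]  # serialized prompt + reply ids
mask = [False, False, False, False, False, True, True, True]
print(build_labels(tokens, mask))
# -> [-100, -100, -100, -100, -100, 21, 22, 23]
```

With labels built this way, the user query and role markers still appear in the model's context, but gradients flow only through the assistant's response tokens.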
Summary
- Instruction fine-tuning (IFT/SFT) teaches pretrained language models to respond in an instruction–response format, and is the foundation that all later post-training stages – from preference data collection to RLHF optimization – depend on.
- Chat templates define how user queries, system prompts, and assistant responses are formatted into token sequences using special tokens, and are the standard interface between users and instruction-tuned models.
- Implementation details include smaller batch sizes than pretraining, prompt masking so the model learns responses rather than queries, and multi-turn masking strategies that control which assistant turns are trained on.
FAQ
What is instruction fine-tuning (IFT/SFT) and how is it different from prompting or pretraining?
Instruction fine-tuning adapts a pretrained language model to follow instructions by training on instruction–response pairs, using the same autoregressive loss as pretraining but focusing on responses. Unlike prompting/in-context learning (which relies on zero- or few-shot generalization), IFT explicitly teaches the instruction–response format, making behavior more reliable across tasks.

Why is instruction fine-tuning the foundation before RLHF?
IFT equips the model to understand and adhere to the instruction–response chat format. This baseline capability is necessary for collecting preference data and running RLHF; without it, later post-training stages (preference modeling and online optimization) are hard or impossible to perform effectively.

What is a chat template and why does it matter?
A chat template is the serialization scheme that converts role-tagged messages into a single token sequence for the model. It inserts special tokens (e.g., BOS/EOS and role markers), enforces role alternation, and can append an assistant-start tag to cue generation. Consistent templating underpins all post-training stages, including IFT and RLHF.

What roles exist in chat templates and how are they used?
There are three standard roles: system (first-turn, hidden instructions and context for the assistant), user (queries from the human), and assistant (model replies). The system message can set behavior or context, while user and assistant alternate throughout the conversation.

How does the model know when to start generating the assistant’s reply?
Templates often end the serialized prompt with an assistant-start marker and no content (and may set add_generation_prompt). This signals the model to continue generation from the assistant role until it emits the end-of-sequence/end-of-message token.

How are multi-turn conversations handled during training?
Conversations are serialized as alternating user/assistant turns. Two common masking choices are used for the loss: (1) final-turn only (train only on the last assistant reply, mask all prior context), or (2) mask user turns only (train on every assistant turn). Long dialogues can be “unrolled” into multiple examples either way.

What are best practices for instruction-tuning datasets?
- Prioritize high-quality completions (the model learns primarily from responses).
- Use prompts close to downstream tasks.
- Around 1M prompts typically suffice for strong post-training/RLHF; more helps with diminishing returns.
- Later stages can recover from some noise—optimize the full pipeline, not just one step.
- Small, focused sets can work for narrow chat alignment; large synthetic sets now dominate many tasks.
- Efficient methods like parameter quantization (e.g., QLoRA) make IFT widely accessible.

How does instruction tuning differ operationally from pretraining?
- Much smaller batch sizes (e.g., post-training uses far fewer sequences per step than pretraining), so fewer devices are used concurrently.
- Prompt masking: loss is applied to assistant responses, not user prompts.
- Multi-turn masking as above.
- Same autoregressive loss as pretraining but with different data, masking, and sequence handling.

How are chat templates implemented in practice?
In the open ecosystem, a Jinja-based template is commonly stored with the tokenizer and applied via apply_chat_template. Many templates derive from ChatML, and variants exist (e.g., Zephyr, Tülu). Providers may also use hierarchical instruction systems and add tokens for tool use.

What shifted the field toward instruction fine-tuning?
The move from bespoke task heads to a unified text-to-text framing (e.g., T5, FLAN, T0, Natural Instructions) plus the evidence from scaling and in-context learning showed broad generalization was possible, but substantially more reliable when models were explicitly trained on instruction–response data. This convergence sparked widespread adoption of IFT.
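The two multi-turn masking strategies described above can be sketched as a dataset-preparation step. This is a simplified illustration: the `unroll` helper and the sample dialogue are made up, and real pipelines operate on token ids with label masks rather than on strings.

```python
# Sketch of unrolling a multi-turn conversation into training examples
# under the two masking strategies: train on every assistant turn, or
# train only on the final assistant reply.

def unroll(conversation, strategy="all_assistant_turns"):
    """Emit (context, target) pairs; only `target` receives loss."""
    examples = []
    for i, msg in enumerate(conversation):
        if msg["role"] != "assistant":
            continue
        # All earlier turns are context; this assistant turn is the target.
        examples.append((conversation[:i], msg["content"]))
    if strategy == "final_turn_only":
        # Keep only the last assistant reply; everything prior is context.
        return examples[-1:]
    return examples  # one example per assistant turn

dialogue = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the tensor cross the graph?"},
]
print(len(unroll(dialogue)))                     # -> 2
print(len(unroll(dialogue, "final_turn_only")))  # -> 1
```

Either way, user turns end up as context only, matching the prompt-masking rule that loss is applied solely to assistant tokens.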