1 Introduction
Reinforcement learning from human feedback (RLHF) incorporates human preferences into AI systems to solve tasks that are hard to specify, and it rose to prominence with ChatGPT and the rapid progress of large language models. The core pipeline couples an instruction-following base model with a reward model learned from human preferences, then optimizes the model's responses with reinforcement learning. In modern practice, RLHF sits within a broader post-training stack alongside instruction/supervised finetuning and newer reinforcement finetuning approaches, together shaping models to be more useful and general across domains. Conceptually, RLHF emphasizes response-level alignment and stylistic behavior – teaching what better and worse answers look like – leading to stronger cross-task generalization than instruction finetuning alone.
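To make the reward-modeling step above slightly more concrete, the pairwise loss most commonly used trains the reward model to score a human-preferred response above the rejected one under a Bradley-Terry model. The sketch below is illustrative only: the function name and example scores are hypothetical, not drawn from any particular library, and the real pipelines discussed later operate on model logits in batches.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one,
    under a Bradley-Terry model of pairwise human preferences."""
    # P(chosen preferred over rejected) = sigmoid(score_chosen - score_rejected)
    margin = score_chosen - score_rejected
    prob_chosen = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(prob_chosen)

# Hypothetical scalar scores a reward model might assign to one preference pair.
print(pairwise_preference_loss(1.3, 0.4))  # ~0.34: ranking agrees with the human label
print(pairwise_preference_loss(0.2, 1.1))  # ~1.24: ranking disagrees, so the loss is larger
```

Minimizing this loss over a dataset of preference pairs is what turns human judgments into the scalar reward that the RL stage then optimizes.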
RLHF reframes model behavior from raw next‑token continuation toward concise, user‑appropriate answers by optimizing entire responses and leveraging contrastive signals, including negative feedback. This flexibility brings challenges: training robust reward models, preventing over‑optimization against proxy rewards, managing biases such as length preference, and maintaining control through regularization. Effective RLHF depends on a strong starting model and is more costly in data, compute, and engineering than simple instruction tuning, but it is often essential for high‑performance assistants. A useful intuition is the elicitation view of post‑training: much of the gain comes from extracting and amplifying valuable behaviors already latent in pretrained models, especially as scale increases.
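The regularization just mentioned is most commonly a KL penalty that keeps the optimized policy close to its starting point. As a rough sketch of the standard setup (the notation here is a common convention, developed properly later in the book), the RL stage approximately optimizes:

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
$$

where $\pi_\theta$ is the policy being trained, $r_\phi$ is the learned reward model, $\pi_{\mathrm{ref}}$ is the reference model from before RL training, and $\beta$ trades reward maximization against drifting too far from the reference, one guard against over-optimizing the proxy reward.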
Practices and perceptions have evolved quickly: early open‑source instruction‑tuned models sparked enthusiasm, skepticism about RLHF followed, and then preference‑tuning methods like Direct Preference Optimization catalyzed a resurgence. Meanwhile, large‑scale post‑training in industry has grown into complex, multi‑stage pipelines that combine instruction tuning, RLHF, prompt and data design, and more. Current innovation focuses on reinforcement finetuning, reasoning‑oriented training, tool use, and strategic use of synthetic data and evaluation. This book aims to provide practical guidance for the canonical RLHF workflow—covering preference data, reward modeling, optimization and regularization tools, and advanced topics—targeted at readers with foundational ML knowledge who want a concise, actionable reference.
Figure: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.
Future of RLHF
With the heavy investment in language modeling, many variations on the traditional RLHF methods have emerged, and RLHF has colloquially become an umbrella term for multiple overlapping approaches. More precisely, RLHF is one of a family of preference fine-tuning (PreFT) techniques that also includes Direct Alignment Algorithms (see Chapter 12). It is also the tool most associated with rapid progress in "post-training" of language models, which encompasses all training after the large-scale autoregressive pretraining on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning, along with the implementation details needed to set up a model for RLHF training.
As more successes of fine-tuning language models with RL emerge, such as OpenAI's o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may shine more intensely on the RL portion of RLHF in the near future – as a way to maximize performance on valuable tasks – the core of RLHF is that it offers a lens for studying the grand problems facing modern forms of AI: How do we map the complexities of human values and objectives into the systems we use on a regular basis? This book hopes to be the foundation of decades of research and lessons on these problems.