1 Overview
Reinforcement Learning from Human Feedback (RLHF) integrates human preferences into AI systems to solve hard-to-specify objectives, especially in settings where users’ needs are nuanced or implicit. It rose to prominence by helping transform large language models from raw next-token predictors into conversational assistants that respond clearly, helpfully, and with an appropriate tone and format. This chapter explains why RLHF matters, how it changes models’ behavior—shaping style, structure, and reliability—and situates it within the broader “post-training” toolkit that turns pretrained models into practical, general-purpose systems.
The RLHF pipeline typically proceeds in three stages: train an instruction-following model, collect human preference data to fit a reward model, and then optimize the policy with reinforcement learning using that proxy reward. In modern post-training, this sits alongside instruction/supervised fine-tuning (to establish formatting and instruction-following behavior) and reinforcement learning with verifiable rewards (to push performance in domains with clear correctness signals). RLHF primarily drives preference fine-tuning: it optimizes whole responses, encourages better behaviors while discouraging worse ones, and often generalizes across domains more effectively than instruction tuning alone. This flexibility comes with costs and pitfalls—designing robust reward models, guarding against over-optimization and length bias, and higher compute and data expense—and it works best when built on a strong base model.
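Concretely, the final optimization stage is usually written as a KL-regularized objective. The formulation below is the standard one from the RLHF literature (the notation is a convention, not specific to this book): r_phi is the learned reward model, pi_theta the policy being trained, pi_ref a frozen reference model (typically the SFT checkpoint), and beta a penalty strength that keeps the policy from drifting too far from the reference.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```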
RLHF has powered progress across tasks like control, summarization, instruction following, web question answering, and safety alignment, and the field’s open ecosystem has evolved through waves: early synthetic instruction-tuning, skepticism about the need for RLHF, the rise of direct preference optimization to simplify preference learning, and recognition that leading results rely on multi-stage, large-scale post-training. An intuitive lens is elicitation: much of a model’s latent capability from pretraining can be unlocked by post-training, analogous to refining a race car around a fixed chassis. The book’s aim is to distill trade-offs and best practices from fast-moving research, provide hands-on starting points, and build the intuition needed to apply RLHF within full post-training workflows as newer methods—like verifiable-reward RL and advanced reasoning training—continue to mature.
A rendition of the early three-stage RLHF process: first, training via supervised fine-tuning (SFT, Chapter 4); then building a reward model (RM, Chapter 5); and finally optimizing with reinforcement learning (RL, Chapter 6).
Future of RLHF
With the investment in language modeling, many variations on the traditional RLHF methods have emerged, and "RLHF" has colloquially become shorthand for several overlapping approaches. RLHF is one member of the broader family of preference fine-tuning (PreFT) techniques, which also includes Direct Alignment Algorithms (see Chapter 8): the class of methods downstream of DPO that solve the preference-learning problem by taking gradient steps directly on preference data rather than learning an intermediate reward model. RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training that happens after the large-scale autoregressive training on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning, along with the implementation details needed to set up a model for RLHF training.
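To make that distinction concrete, here is a minimal sketch of a DPO-style loss, assuming the caller has already computed summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model; the names, shapes, and beta value are illustrative stand-ins, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-response log-probabilities.
    """
    # Log-ratios of policy to reference for the preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the chosen log-ratio above the rejected one (the implicit reward margin).
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage: 4 preference pairs with random stand-in log-probabilities.
policy_chosen = torch.randn(4, requires_grad=True)
loss = dpo_loss(policy_chosen, torch.randn(4), torch.randn(4), torch.randn(4))
loss.backward()
```

The key design point is that the gradient flows directly through the policy's log-probabilities on preference pairs; no separate reward model is trained or queried.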
As more successes of fine-tuning language models with RL emerge, such as OpenAI's o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may shine more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF is that it offers a lens for studying the grand problems facing modern forms of AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to serve as a foundation for decades of research and lessons on these problems.
FAQ
What is RLHF and why did it become important?
Reinforcement Learning from Human Feedback (RLHF) incorporates human preference signals into AI training to handle hard-to-specify objectives—common in real user interactions where preferences are nuanced or tacit. It moved from early control problems to mainstream prominence with ChatGPT, helping transform LLMs from raw next-token predictors into helpful, user-facing assistants.
What is the basic three-step RLHF pipeline?
- Train an instruction-following base model via supervised fine-tuning (SFT/IFT).
- Collect human preference data to train a reward model (RM), as sketched just after this list.
- Optimize the policy with reinforcement learning, sampling responses and scoring them with the RM.
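A minimal sketch of the pairwise (Bradley-Terry) loss commonly used in the reward-modeling step, assuming the reward model has already produced a scalar score for each prompt-response pair; the function and variable names are illustrative, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the chosen response's score above the rejected one's."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: scalar scores for 4 preference pairs, e.g. the outputs of a
# value head on top of a language model (hypothetical setup).
chosen = torch.randn(4, requires_grad=True)
rejected = torch.randn(4)
loss = reward_model_loss(chosen, rejected)
loss.backward()
```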
How does RLHF fit into modern post-training?
Post-training typically combines three methods: (1) Instruction/Supervised Fine-tuning to teach format and basic instruction following, (2) Preference Fine-tuning—dominated by RLHF—to align style and subtle human preferences, and (3) RL with Verifiable Rewards (RLVR) to boost performance on verifiable tasks. RLHF is the core of preference fine-tuning and remains more established than RLVR.
What does RLHF actually change in model behavior?
RLHF pushes models toward responses that are more helpful, concise, warm, and appropriately formatted for users. Instead of free-form continuation, models learn to give direct, user-aligned answers, improving generalization across domains and matching the tone and structure users prefer.
How is RLHF different from instruction fine-tuning (SFT/IFT)?
SFT learns per-token likelihood on supervised targets—great for format and basic capabilities. RLHF optimizes at the response level using preference signals (contrastive objectives), teaching what makes one answer better than another and which behaviors to avoid, leading to stronger cross-domain generalization.
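As a concrete illustration of the per-token SFT objective, here is a minimal sketch that computes cross-entropy only over the response tokens while masking out the prompt; the shapes, masking convention, and names are illustrative assumptions rather than this book's exact recipe.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Token-level cross-entropy on the response only (prompt tokens are masked out).

    logits: (seq_len, vocab_size) predictions for the next token at each position.
    labels: (seq_len,) token ids of the full prompt + response sequence.
    """
    # Shift so position t predicts token t+1, the standard causal-LM setup.
    shift_logits = logits[:-1]
    shift_labels = labels[1:].clone()
    # Ignore loss on the prompt portion; only supervise the response tokens.
    shift_labels[: prompt_len - 1] = -100  # -100 is skipped via ignore_index
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy usage: random "model output" over a 12-token sequence and a 100-token vocab,
# where the first 5 tokens are the prompt.
logits = torch.randn(12, 100, requires_grad=True)
labels = torch.randint(0, 100, (12,))
loss = sft_loss(logits, labels, prompt_len=5)
loss.backward()
```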
What are the main challenges and costs of RLHF?
- Reward modeling is hard: preferences are noisy and application-dependent.
- Over-optimization risk: proxy rewards can drive undesired behaviors without regularization (see the KL-penalty sketch after this list).
- Biases (e.g., length bias) can appear if unmanaged.
- It requires a strong starting model and is costly in data, compute, and engineering effort—more than SFT.
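One common guard against over-optimization in PPO-style RLHF implementations is to subtract a per-token KL penalty against the frozen reference model from the reward signal; below is a minimal sketch under assumed tensor shapes and an illustrative beta value, not a definitive implementation.

```python
import torch

def kl_shaped_rewards(rm_score: torch.Tensor,
                      policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      beta: float = 0.05) -> torch.Tensor:
    """Combine a sequence-level reward-model score with a per-token KL penalty.

    rm_score: (batch,) scalar score from the reward model for each response.
    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the sampled tokens
    under the current policy and the frozen reference model.
    """
    # Per-token estimate of KL(policy || reference) along the sampled trajectory.
    kl = policy_logprobs - ref_logprobs
    # Penalize every token, then add the RM score on the final token of each response.
    rewards = -beta * kl
    rewards[:, -1] += rm_score
    return rewards

# Toy usage: batch of 2 responses, 6 sampled tokens each.
rewards = kl_shaped_rewards(torch.randn(2), torch.randn(2, 6), torch.randn(2, 6))
print(rewards.shape)  # torch.Size([2, 6])
```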