1 Introduction
Reinforcement Learning from Human Feedback (RLHF) integrates human judgments into AI systems to solve objectives that are hard to specify directly. It rose to prominence with conversational LLMs and is now a core part of “post-training,” the set of methods that transform a pretrained model into one that is helpful, safe, and responsive. The canonical RLHF pipeline proceeds by first training a model to follow instructions, then collecting human preference data to train a reward model, and finally optimizing the policy with reinforcement learning guided by that reward. Within post-training, instruction/supervised finetuning teaches format and basic behaviors, preference finetuning aligns outputs to human tastes and norms, and reinforcement finetuning targets verifiable, capability-heavy tasks; this book focuses on preference finetuning while situating it in the broader toolkit.
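To make the ordering of those stages concrete, the sketch below outlines the data flow of the canonical pipeline in Python. It is a schematic only: every function name is a hypothetical placeholder for an entire training stage, not a runnable recipe or a specific library's API.

```python
# Schematic outline of the three-stage RLHF pipeline described above, showing only the
# ordering and data flow. All names here are hypothetical placeholders.

def instruction_finetune(pretrained_model, demonstrations):
    """Stage 1: supervised finetuning on prompt-response demonstrations (per-token loss)."""
    ...

def train_reward_model(sft_model, preference_pairs):
    """Stage 2: fit a reward model on human comparisons between candidate responses."""
    ...

def rl_optimize(policy, reward_model, prompts):
    """Stage 3: optimize the policy with RL against the reward model, typically with
    regularization that keeps it close to the stage-1 model."""
    ...

def rlhf_pipeline(pretrained_model, demonstrations, preference_pairs, prompts):
    sft_model = instruction_finetune(pretrained_model, demonstrations)
    reward_model = train_reward_model(sft_model, preference_pairs)
    return rl_optimize(sft_model, reward_model, prompts)
```

Each stage consumes the model produced by the previous one, which is part of why the quality of the pretrained and instruction-tuned base matters so much for the later steps.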
RLHF changes model behavior from raw “internet-style continuation” to concise, user-oriented answers by emphasizing style, format, and holistic response quality. Unlike instruction finetuning’s per-token next-word objective, RLHF optimizes at the response level using contrastive feedback that indicates which completions are better or worse, improving generalization across domains and embedding subtle behavioral preferences. This flexibility introduces challenges: reward models are imperfect proxies with evolving best practices, training can over-optimize without careful regularization, and practical issues like length bias, data cost, and compute make RLHF more expensive than simple instruction tuning. In practice, RLHF works best when built on strong pretrained and instruction-tuned bases, as one component of a deliberate post-training strategy.
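The difference in objectives can be seen in a few lines of PyTorch. The snippet below contrasts the per-token cross-entropy loss of instruction finetuning with a response-level pairwise (Bradley-Terry style) loss of the kind used to train reward models; the tensors are random stand-ins for model outputs, and the shapes and names are illustrative assumptions rather than any particular implementation.

```python
# Minimal sketch of the contrast above: instruction finetuning minimizes a per-token
# next-word loss, while the preference signal in RLHF is response-level and contrastive.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 4, 16, 100

# Instruction finetuning: next-token cross-entropy at every position of the response.
logits = torch.randn(batch, seq_len, vocab)           # per-token logits from the model
targets = torch.randint(0, vocab, (batch, seq_len))   # reference next tokens
sft_loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# Reward modeling: one scalar score per whole response, trained contrastively on pairs
# where annotators marked one completion as better than the other.
reward_chosen = torch.randn(batch)     # r(x, y_chosen): score of the preferred response
reward_rejected = torch.randn(batch)   # r(x, y_rejected): score of the rejected response
rm_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()  # Bradley-Terry style

print(f"token-level SFT loss: {sft_loss.item():.3f}, "
      f"response-level preference loss: {rm_loss.item():.3f}")
```

The pairwise loss only asks whether whole responses are ranked correctly, which is what lets preference data encode holistic judgments about style and quality rather than exact token sequences.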
The field has progressed from early demonstrations in control, summarization, and instruction following to successive waves in the open community, initial skepticism about RLHF, and a resurgence led by direct preference optimization and improved recipes that made preference tuning table stakes. Large-scale, closed efforts now run multi-stage post-training programs that combine data curation, instruction tuning, RLHF, prompt design, and evaluation at scale, while open recipes have consolidated around a few influential datasets. Today, post-training is a multi-objective, iterative process, with rapid innovation in reinforcement finetuning, reasoning training, AI feedback, and tooling. This book offers practical guidance on the core RLHF workflow—preference data, reward modeling, regularization, and direct alignment methods—alongside key intuitions, trade-offs, and open questions such as over-optimization and the role of style in shaping model behavior and product experience.
Figure: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.
Future of RLHF
With the wave of investment in language modeling, many variations on the traditional RLHF methods have emerged, and “RLHF” has colloquially become an umbrella term for multiple overlapping approaches. RLHF is a subset of preference fine-tuning (PreFT) techniques, a family that also includes Direct Alignment Algorithms (see Chapter 12). RLHF is also the tool most associated with rapid progress in “post-training” of language models, which encompasses all training applied after the large-scale autoregressive training on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning, along with the implementation details needed to set a model up for RLHF training.
As more successes of fine-tuning language models with RL emerge, such as OpenAI’s o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may fall more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF is that it provides a lens for studying the grand problems facing modern AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to serve as a foundation for decades of research and lessons on these problems.