Overview

1 Introduction

Reinforcement Learning from Human Feedback (RLHF) integrates human judgments into AI systems to solve objectives that are hard to specify directly. It rose to prominence with conversational LLMs and is now a core part of “post-training,” the set of methods that transform a pretrained model into one that is helpful, safe, and responsive. The canonical RLHF pipeline proceeds by first training a model to follow instructions, then collecting human preference data to train a reward model, and finally optimizing the policy with reinforcement learning guided by that reward. Within post-training, instruction/supervised finetuning teaches format and basic behaviors, preference finetuning aligns outputs to human tastes and norms, and reinforcement finetuning targets verifiable, capability-heavy tasks; this book focuses on preference finetuning while situating it in the broader toolkit.
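To make the data flow concrete, the sketch below outlines the three stages as plain Python functions. Every name in it is a hypothetical placeholder used only to show how the stages hand off to one another; it is a structural sketch, not a real training loop or library API.

def supervised_finetune(base_model, instruction_data):
    # Stage 1: per-token supervised finetuning on prompt/response pairs
    # teaches the pretrained model to follow instructions and formats.
    return base_model  # placeholder for an instruction-tuned policy

def train_reward_model(policy, preference_data):
    # Stage 2: fit a reward model on human comparisons of responses
    # (chosen vs. rejected) to the same prompt.
    return lambda prompt, response: 0.0  # placeholder scalar scorer

def rl_optimize(policy, reward_model, prompts):
    # Stage 3: sample responses, score them with the reward model, and
    # update the policy with an RL algorithm (commonly with a KL penalty
    # against the stage-1 model to limit over-optimization).
    return policy

policy = supervised_finetune("pretrained-base", instruction_data=[])
scorer = train_reward_model(policy, preference_data=[])
policy = rl_optimize(policy, scorer, prompts=[])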

RLHF changes model behavior from raw “internet-style continuation” to concise, user-oriented answers by emphasizing style, format, and holistic response quality. Unlike instruction finetuning’s per-token next-word objective, RLHF optimizes at the response level using contrastive feedback that indicates which completions are better or worse, improving generalization across domains and embedding subtle behavioral preferences. This flexibility introduces challenges: reward models are imperfect proxies with evolving best practices, training can over-optimize without careful regularization, and practical issues like length bias, data cost, and compute make RLHF more expensive than simple instruction tuning. In practice, RLHF works best when built on strong pretrained and instruction-tuned bases, as one component of a deliberate post-training strategy.
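One way to see the contrast with per-token supervision is the standard pairwise (Bradley-Terry style) loss used to train reward models from preference comparisons. The minimal PyTorch sketch below uses toy scalar scores purely for illustration; the reward model that would produce these scores is assumed.

import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen, r_rejected):
    # Contrastive, response-level signal: push the score of the preferred
    # response above the rejected one. No individual token is supervised.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scalar rewards for a batch of three preference pairs (illustrative values).
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(pairwise_preference_loss(r_chosen, r_rejected))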

The field has progressed from early demonstrations in control, summarization, and instruction following to successive waves in the open community, initial skepticism about RLHF, and a resurgence led by direct preference optimization and improved recipes that made preference tuning table stakes. Large-scale, closed efforts now run multi-stage post-training programs that combine data curation, instruction tuning, RLHF, prompt design, and evaluation at scale, while open recipes have consolidated around a few influential datasets. Today, post-training is a multi-objective, iterative process, with rapid innovation in reinforcement finetuning, reasoning training, AI feedback, and tooling. This book offers practical guidance on the core RLHF workflow—preference data, reward modeling, regularization, and direct alignment methods—alongside key intuitions, trade-offs, and open questions such as over-optimization and the role of style in shaping model behavior and product experience.

Figure: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.

Future of RLHF

With the heavy investment in language modeling, many variations on the traditional RLHF methods have emerged, and RLHF has colloquially become shorthand for multiple overlapping approaches. More precisely, RLHF is one of a family of preference finetuning (PreFT) techniques that also includes Direct Alignment Algorithms (see Chapter 12). RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training that follows large-scale autoregressive pretraining on primarily web data. This textbook is a broad overview of RLHF and its immediate neighbors, such as instruction tuning, along with the implementation details needed to set up a model for RLHF training.

As more successes of fine-tuning language models with RL emerge, such as OpenAI's o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, while the spotlight may fall more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF is that it offers a lens for studying the grand problems facing modern forms of AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to be a foundation for decades of research and lessons on these problems.

FAQ

What is RLHF and why did it become prominent?
Reinforcement Learning from Human Feedback (RLHF) is a technique to incorporate human preferences into AI systems, originally used for hard-to-specify objectives. It became widely known through ChatGPT and the rapid progress of large language models (LLMs).

What are the three core steps in a canonical RLHF pipeline?
(1) Train an instruction-following base model via supervised finetuning. (2) Collect human preference data and train a reward model. (3) Optimize the language model with an RL method using the reward model to score sampled responses.

How does RLHF differ from instruction finetuning (SFT/IFT)?
Instruction finetuning applies per-token supervision to learn format and features. RLHF operates at the response level using preference signals, teaching what better and worse answers look like via contrastive objectives, and tends to generalize better across domains.

Where does RLHF fit within post-training?
Post-training is a broader toolkit to make models useful, typically including: Instruction/Supervised Finetuning (format and features), Preference Finetuning (style and subtle human preferences), and Reinforcement Finetuning (boosts on verifiable tasks). RLHF is a key component of post-training, especially the preference finetuning stage.

What does RLHF actually do to model behavior?
It shapes style and behavior: models become more concise, helpful, and aligned with human expectations. Compared to a pretrained model that merely continues text, an RLHF-trained model answers directly and clearly, reflecting learned preference-driven style.

What are the main challenges and pitfalls of RLHF?
Training robust reward models is hard and domain-dependent. RL optimization can overfit to proxy rewards, requiring careful regularization. RLHF is costlier than instruction finetuning (compute, data, and time) and can introduce issues like length bias; strong starting models help.

Why is post-training so impactful if the base model is unchanged?
Via the elicitation interpretation: post-training extracts and amplifies latent capabilities learned during pretraining, akin to an F1 team improving performance through season-long tuning. Scaling base models enables larger, faster gains from post-training.

What is the Superficial Alignment Hypothesis and how does this book view it?
It claims most capabilities come from pretraining and alignment mainly teaches style, so few examples can suffice. The book argues style matters but alignment is not merely superficial: the choice of data and RL-style training can alter behaviors and improve challenging capabilities.

How have open and closed post-training practices evolved?
After early instruction-tuned models (e.g., Alpaca era) and skepticism of RLHF, preference tuning surged with methods like Direct Preference Optimization (DPO) once best practices emerged. Open recipes eventually plateaued, while closed labs advanced multi-stage, large-scale post-training pipelines.

What is the scope and intended audience of this book?
The book focuses on practical, canonical RLHF (especially preference finetuning), covering recurring techniques, trade-offs, and pitfalls rather than exhaustive history. It targets readers with entry-level ML/RL/LM background and aims to provide the minimum needed to implement or explore RLHF.
