Overview

1 Introduction

Reinforcement learning from human feedback (RLHF) incorporates human preferences into AI systems to solve tasks that are hard to specify, and it rose to prominence with ChatGPT and the rapid progress of large language models. The core pipeline couples an instruction-following base model with a reward model learned from human preferences, then optimizes the model’s responses with reinforcement learning. In modern practice, RLHF sits within a broader post-training stack alongside instruction/supervised finetuning and newer reinforcement finetuning approaches, together shaping models to be more useful and general across domains. Conceptually, RLHF emphasizes response-level alignment and style: it teaches the model what better and worse answers look like, which leads to stronger cross-task generalization than instruction finetuning alone.

RLHF reframes model behavior from raw next‑token continuation toward concise, user‑appropriate answers by optimizing entire responses and leveraging contrastive signals, including negative feedback. This flexibility brings challenges: training robust reward models, preventing over‑optimization against proxy rewards, managing biases such as length preference, and maintaining control through regularization. Effective RLHF depends on a strong starting model and is more costly in data, compute, and engineering than simple instruction tuning, but it is often essential for high‑performance assistants. A useful intuition is the elicitation view of post‑training: much of the gain comes from extracting and amplifying valuable behaviors already latent in pretrained models, especially as scale increases.

Practices and perceptions have evolved quickly: early open‑source instruction‑tuned models sparked enthusiasm, skepticism about RLHF followed, and then preference‑tuning methods like Direct Preference Optimization catalyzed a resurgence. Meanwhile, large‑scale post‑training in industry has grown into complex, multi‑stage pipelines that combine instruction tuning, RLHF, prompt and data design, and more. Current innovation focuses on reinforcement finetuning, reasoning‑oriented training, tool use, and strategic use of synthetic data and evaluation. This book aims to provide practical guidance for the canonical RLHF workflow—covering preference data, reward modeling, optimization and regularization tools, and advanced topics—targeted at readers with foundational ML knowledge who want a concise, actionable reference.

Figure: A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.

Future of RLHF

With the heavy investment in language modeling, many variations on the traditional RLHF methods have emerged, and RLHF has colloquially become synonymous with several overlapping approaches. RLHF is a subset of preference fine-tuning (PreFT) techniques, which also include Direct Alignment Algorithms (see Chapter 12). RLHF is the tool most associated with rapid progress in "post-training" of language models, which encompasses all training after the large-scale autoregressive training on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning, along with the implementation details needed to set up a model for RLHF training.

As more successes of fine-tuning language models with RL emerge, such as OpenAI’s o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models. At the same time, even if the spotlight falls more intensely on the RL portion of RLHF in the near future, as a way to maximize performance on valuable tasks, the core of RLHF remains a lens for studying the grand problems facing modern forms of AI. How do we map the complexities of human values and objectives into systems we use on a regular basis? This book hopes to be the foundation of decades of research and lessons on these problems.

FAQ

What is RLHF and why did it emerge?
Reinforcement Learning from Human Feedback (RLHF) is a technique to incorporate human preferences into AI systems, created to tackle problems that are hard to specify directly. It began in traditional RL domains and became widely known with ChatGPT and the rise of large language and foundation models.

What are the three core steps in the RLHF pipeline?
1) Train a language model to follow instructions (SFT/IFT). 2) Collect human preference data and train a reward model. 3) Optimize the language model with an RL-style objective, using the reward model to rate sampled outputs.
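
As a quick structural sketch of how those three stages hand artifacts to one another, the Python-style outline below may help. Every name in it (load_base_model, train_sft, train_reward_model, rl_finetune) is a hypothetical placeholder, not a real library API.

```python
# Structural sketch only: each helper below is a hypothetical placeholder,
# standing in for whatever training framework you actually use.

def rlhf_pipeline(instruction_data, preference_data, rl_prompts):
    base = load_base_model("pretrained-lm")  # large autoregressive base model

    # Step 1: instruction/supervised finetuning (SFT/IFT) on prompt-response pairs.
    sft_model = train_sft(base, instruction_data)

    # Step 2: reward model fit on human preference comparisons
    # (prompt, chosen response, rejected response).
    reward_model = train_reward_model(sft_model, preference_data)

    # Step 3: RL-style optimization; sample responses to prompts, score them
    # with the reward model, and update the policy (usually with a KL penalty
    # back toward the SFT model to limit over-optimization).
    policy = rl_finetune(sft_model, reward_model, rl_prompts)
    return policy
```
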
How does RLHF fit into modern post-training?
RLHF is one component of post-training, which includes three optimization methods: Instruction/Supervised Finetuning (IFT/SFT), Preference Finetuning (PreFT), and Reinforcement Finetuning (RFT). This book focuses on Preference Finetuning and the broader RLHF toolkit that catalyzed modern post-training.

What does RLHF actually change about model behavior?
RLHF shapes response-level behavior and style, teaching models what better and worse answers look like (including negative feedback). Compared to a raw pretrained model that continues text and metadata, an RLHF-trained model responds concisely and usefully to user queries, and tends to generalize better across domains.

How does RLHF differ from instruction finetuning (SFT)?
SFT makes per-token updates to predict the next token given familiar contexts, emphasizing learned features and formatting. RLHF tunes at the response level using relative preferences (contrastive training), telling the model which kinds of answers are preferable and which to avoid, rather than prescribing exact target text.
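
To make the per-token versus response-level distinction concrete, here is a small, self-contained PyTorch sketch (toy tensors and placeholder reward scores, not the book's code) contrasting the SFT cross-entropy loss with a Bradley-Terry-style preference loss of the kind used for reward modeling.

```python
import torch
import torch.nn.functional as F

# Toy illustration only: random logits and placeholder reward scores
# stand in for a real model's outputs.

vocab_size, seq_len = 50, 8
logits = torch.randn(1, seq_len, vocab_size)           # policy logits for one response
targets = torch.randint(0, vocab_size, (1, seq_len))   # tokens of the exact target text

# SFT: per-token cross-entropy toward a single reference response.
sft_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# Preference training: response-level and contrastive. Given scalar reward
# scores for a preferred ("chosen") and dispreferred ("rejected") response,
# the Bradley-Terry loss pushes the chosen score above the rejected one.
r_chosen = torch.tensor(1.3)    # placeholder reward for the preferred response
r_rejected = torch.tensor(0.2)  # placeholder reward for the dispreferred response
preference_loss = -F.logsigmoid(r_chosen - r_rejected)

print(f"SFT loss: {sft_loss.item():.3f}  preference loss: {preference_loss.item():.3f}")
```
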
What are the main challenges and costs of RLHF?
RLHF requires training a reward model, and best practices depend on the application. The optimization can overfit proxy rewards (over-optimization), demanding careful regularization and a strong starting model. It is more expensive than SFT in data, compute, and time, and can introduce issues like length bias.
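
The usual guard against over-optimization is a penalty that keeps the optimized policy close to the starting (reference) model. In standard notation, which may differ from the book's, the KL-regularized objective looks like:

$$
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[\, r_\phi(x, y) \,\right]
\;-\; \beta \, \mathcal{D}_{\mathrm{KL}}\!\left(\, \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\right)
$$

Here $\pi_{\mathrm{ref}}$ is typically the SFT model and $\beta$ trades reward against drift away from it.
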
What is the “elicitation interpretation” of post-training?
Post-training is viewed as extracting and amplifying valuable behaviors already latent in the base model, much like an F1 team improving performance through aerodynamics and systems once the chassis and engine are set. Much of the user-perceived gain comes from post-training, and larger base models enable faster, broader improvements.

What is the Superficial Alignment Hypothesis, and what is the critique?
The hypothesis says pretraining learns capabilities and alignment mainly teaches style, implying that small datasets can suffice. The chapter argues this is incomplete: while style matters, post-training can drive non-superficial behavioral and reasoning gains when data and objectives change, going well beyond mere “vibes.”

How have open post-training methods evolved since ChatGPT?
Early open efforts (Alpaca, Vicuna, Koala, Dolly) used limited human data plus synthetic instruction tuning, and skepticism about RLHF followed. The DPO paper later sparked breakthroughs (e.g., Zephyr-Beta, Tülu 2) once suitable learning rates were found. Open recipes then hit resource limits (with UltraFeedback as a staple dataset), while closed labs advanced multi-stage post-training; Tülu 3 offered a comprehensive open foundation.

What is the scope of this book and who should read it?
The book covers canonical RLHF implementations, recurring techniques, trade-offs, and pitfalls rather than exhaustive history. It includes problem setup, optimization tools (reward modeling, regularization, SFT, sampling, RL, direct alignment), advanced topics (constitutional AI, reasoning/RFT, tools, synthetic data, evaluation), and open questions. It targets readers with entry-level ML/RL/LLM background and serves as a concise, web-first practical reference.
