Overview

12 Synthetic Data

This chapter explains how synthetic data has moved from a curiosity to a cornerstone of modern post-training. As models improved from early RLHF-era systems to GPT-4-class capabilities, they became reliable enough to generate prompts, completions, preferences, and filters at scale—often surpassing human annotators for many instruction-following tasks. While concerns like model collapse exist, the chapter argues that diverse teachers, deduplication, quality filters, and mixing in human data keep training regimes healthy. The upshot is a broader “post-training” paradigm in which synthetic data accelerates iteration, expands coverage, and enables large, longer-response datasets (from Alpaca-scale to Tülu 3 and OpenThoughts 3) that power today’s top models, even as humans remain crucial for ground truth, capability frontiers, and evaluation design.

The chapter then surveys the main ways synthetic data is used. Distillation—training smaller models on outputs from stronger “teacher” models—serves both as a general data engine (for instructions, preferences, and verification) and for targeted skill transfer (e.g., math and code). AI feedback (RLAIF) leverages LLM-as-a-judge to generate scalable preference labels and evaluations at a fraction of human cost, with practical tradeoffs: human data is high-noise/low-bias, while synthetic preference data is low-noise/high-bias. In practice, hybrid pipelines route hard or sensitive cases to humans and rely on AI for volume. The chapter also discusses judge calibration, including inconsistencies and self-preference bias, and notes that while specialized critic or judge models exist, leading general models—combined with techniques like repeated sampling or tournament ranking—tend to suffice for most workflows.

Finally, the chapter details Constitutional AI and rubric-based rewards as influential frameworks for AI-generated supervision. Constitutional AI uses a written set of principles to guide synthetic critiques of instruction data and to produce principle-grounded preference pairs for RLHF, seeding much of today’s RLAIF practice. Rubrics extend this idea by turning task-specific criteria—hard rules and graded principles—into near-verifiable signals for RL, enabling meaningful gains in domains like scientific reasoning and factuality. Although these methods now underpin many state-of-the-art post-training pipelines, the chapter emphasizes that synthetic data complements rather than replaces human input: people remain essential for setting objectives, creating benchmarks, and handling the toughest, most novel problems, while synthetic pipelines provide scalable, fast, and increasingly reliable supervision everywhere else.

Traditional knowledge distillation trains a smaller student model to match the soft probability distribution of a larger teacher model using KL divergence loss. Both models process the same input simultaneously, and temperature scaling (\(\tau > 1\)) softens the distributions to reveal more information about class relationships.
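The temperature-scaled objective described above can be sketched in plain Python. This is a minimal illustration, not a training loop: the function names are our own, and the \(\tau^2\) scaling factor is a common convention from the original knowledge-distillation formulation rather than something stated in the text.

```python
import math

def softmax(logits, tau=1.0):
    # Temperature tau > 1 softens the distribution, revealing the
    # relative probabilities of non-argmax classes.
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, tau=2.0):
    # KL(teacher || student) on temperature-softened distributions.
    # The tau**2 factor keeps gradient magnitudes comparable across
    # temperatures (a standard convention in distillation).
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return tau ** 2 * kl
```

When the student matches the teacher exactly the loss is zero; any mismatch in the softened distributions produces a positive penalty.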
Figure: Synthetic data generation in LLM post-training. Prompts are passed through a strong model to generate completions, which are paired to create a training dataset. This dataset is then used to fine-tune smaller models via standard supervised learning. More complex pipelines may involve multiple models editing completions, generating preference pairs, or filtering for quality.
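The generate-then-filter pipeline in the figure can be sketched as a simple loop. Here `teacher` and `quality_filter` are toy stand-ins for a real teacher-model API call and a real filtering heuristic or judge.

```python
def build_sft_dataset(prompts, teacher, quality_filter):
    """Pair each prompt with a teacher completion, keeping only
    examples that pass the quality filter."""
    dataset = []
    for prompt in prompts:
        completion = teacher(prompt)            # one call to the strong model
        if quality_filter(prompt, completion):  # e.g. length, dedup, judge check
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# Toy stand-ins: a real pipeline would query a teacher model and apply
# heuristic or LLM-judge filters.
toy_teacher = lambda p: p.upper()
min_length = lambda p, c: len(c) >= 3

data = build_sft_dataset(["hi", "hello there"], toy_teacher, min_length)
# "hi" -> "HI" is filtered out as too short; one example remains
```

More complex pipelines, as the caption notes, would add further stages to this loop, such as editing completions with a second model or converting pairs of completions into preference data.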

Summary

  • Synthetic data has become essential to modern post-training – language models are used to generate training prompts, write completions, provide preference labels, and filter data at every stage of the pipeline.
  • Distillation, using outputs from a stronger model to train a smaller one, is the most common form of synthetic data usage and has largely replaced human-written completions for instruction tuning, though human data remains important at capability frontiers and for establishing ground truth answers.
  • AI feedback can approximate human preference labels at a fraction of the cost within the RLHF pipeline, an approach denoted RLAIF. A useful rule of thumb: human data is high-noise and low-bias, while synthetic preference data is low-noise and high-bias – making AI feedback easier to start with but prone to systematic second-order effects.
  • Constitutional AI (CAI) was the earliest large-scale use of synthetic data for RLHF, using a set of written principles to guide both instruction data revision and synthetic preference data generation.
  • Rubric-based rewards extend RL training to domains without clearly verifiable answers by using LLM judges to score completions against per-prompt evaluation criteria, enabling RL-style training on tasks like scientific reasoning, creative writing, or model personality.
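The rubric idea from the last bullet can be sketched roughly as hard rules acting as gates and weighted principles as graded criteria. The predicates below are toy stand-ins for what would be LLM-judge calls in practice, and the weighting scheme is our own illustration.

```python
def rubric_reward(response, hard_rules, principles):
    # Hard rules act as gates: any violation zeroes the reward.
    if not all(rule(response) for rule in hard_rules):
        return 0.0
    # Principles are graded criteria in [0, 1], combined by weight.
    total_weight = sum(w for w, _ in principles)
    score = sum(w * criterion(response) for w, criterion in principles)
    return score / total_weight

# Toy rubric: one hard rule, two weighted principles.
hard = [lambda r: len(r) > 0]
prin = [
    (2.0, lambda r: 1.0 if "because" in r else 0.0),  # gives a reason
    (1.0, lambda r: min(len(r) / 50, 1.0)),           # sufficient detail
]
```

The returned score can then be used directly as an RL reward signal for prompts without a single verifiable answer.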

FAQ

What is “synthetic data” in RLHF/post-training, and why has it become so important?
Synthetic data is model-generated training material—prompts, completions, critiques, and preference labels—used to fine-tune other models. With GPT-4–class systems, models became reliable enough to outproduce humans on many tasks, making synthetic data cheaper, faster to iterate on, and central to modern post-training.

Where does synthetic data plug into the training pipeline?
Across the pipeline: generating or expanding prompts, producing high-quality completions for supervised fine-tuning, creating preference data (AI feedback/RLAIF), filtering/rewriting outputs, and serving as judges for evaluation. Many state-of-the-art post-training recipes rely on synthetic data at multiple stages.

Does synthetic data cause “model collapse,” and how can we avoid it?
Collapse risks arise when repeatedly training on unfiltered, repetitive outputs from a single model. Practical mitigations include mixing in human data, using diverse teacher models, deduplicating aggressively, and enforcing strong quality filters. With these safeguards, large-scale synthetic data has not shown catastrophic collapse in modern pipelines.

What is distillation, and how is it used for LLMs?
Distillation is training a smaller “student” on outputs from a stronger “teacher.” In practice it’s used (1) as a data engine to produce instruction, preference, and verification data, and (2) to transfer specific skills (e.g., math, coding) into a smaller model. Success hinges on curated prompts, diverse sampling, and strict filtering of teacher outputs.

How should I balance AI feedback vs. human feedback?
AI feedback is dramatically cheaper and scales fast, but tends to be lower-noise and higher-bias; human data is higher-noise and lower-bias, offering a strong, reliable signal when curated. A common strategy is hybrid: route hard or high-stakes items to humans while letting AI judges handle the bulk of routine labeling and large-scale evaluation.

What is Constitutional AI (CAI), and how does it relate to RLAIF?
CAI uses a written “constitution” of principles to (1) critique and revise instruction data to align outputs and (2) generate principle-guided preference labels that drive RLHF—i.e., RLAIF. It inaugurated large-scale use of AI feedback for alignment and inspired broader judge- and principle-based training methods.

Are general-purpose LLM judges reliable, or should I train a specialized judge?
LLM judges can show biases (e.g., self-preference) and calibration issues, but frontier models are extensively trained for evaluation and usually suffice. You can reduce bias via repeated sampling, self-refinement, or tournament ranking. Train a dedicated judge mainly when your task is highly specialized or involves private, domain-specific criteria.
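The repeated-sampling mitigation mentioned above can be sketched as a majority vote over several judge calls. The `judge` callable here is a hypothetical stand-in for an LLM-as-a-judge API that returns a verdict per comparison.

```python
from collections import Counter

def majority_judge(judge, prompt, answer_a, answer_b, n_samples=5):
    # Repeated sampling: query the (possibly stochastic) judge several
    # times and take the majority verdict to reduce per-call noise.
    votes = Counter(judge(prompt, answer_a, answer_b) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

Tournament ranking extends the same idea to more than two candidates by aggregating pairwise verdicts in a bracket or round-robin.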
What are “rubrics,” and how do they enable RL when answers aren’t verifiable?
Rubrics are structured, near-verifiable criteria (e.g., Hard Rules vs. Principles) that let an LLM judge score candidate responses. Those scores serve as rewards for RL, enabling improvement in domains like scientific reasoning and factuality where exact ground truth is hard to check automatically.

How do I design and scale rubric-based training cost-effectively?
Start with a domain-level base rubric, then have an LLM assign per-prompt rubric scores during training. Use concise, atomic criteria with explicit weights and priority tags to guide learning. This reduces per-prompt rubric generation costs while preserving consistent, actionable feedback signals for RL.

How has synthetic dataset scale evolved, and what should I start with?
Datasets have grown from tens of thousands of short examples to millions of prompts with much longer responses, yielding orders-of-magnitude more tokens. For newcomers, begin with small, simple instruction-following sets to iterate quickly, then scale to larger, domain-focused synthetic corpora for peak performance.
