Table of contents

1 Understanding reasoning models

1.1 Defining reasoning in the context of LLMs

1.2 Understanding the standard LLM training pipeline

1.3 Improving LLM reasoning with training and inference techniques

1.4 Pattern matching vs. logical reasoning

1.5 Simulating reasoning without explicit rules

1.6 Why build reasoning models from scratch?

1.7 A road map to building reasoning models from scratch

2 Generating text with a pretrained LLM

2.1 Introducing LLMs for text generation

2.2 Setting up the coding environment

2.3 Understanding hardware needs and recommendations

2.4 Preparing input texts for LLMs

2.5 Loading pretrained models

2.6 Understanding the sequential LLM text generation process

2.7 Coding a minimal text generation function

2.8 Faster inference via KV caching

2.9 Faster inference via PyTorch model compilation

3 Evaluating reasoning models

3.1 Building a math verifier

3.2 Loading a pretrained model to generate text

3.3 Implementing a wrapper for easier text generation

3.4 Extracting the final answer box

3.5 Normalizing the extracted answer

3.6 Verifying mathematical equivalence

3.7 Grading answers

3.8 Loading the evaluation dataset

3.9 Evaluating the model

4 Improving reasoning with inference-time scaling

4.1 Introduction to inference-time scaling

4.2 Loading a pretrained model

4.3 Generating better responses with chain-of-thought prompting

4.4 Controlling output diversity with temperature scaling

4.4.1 Understanding the process of selecting the next token

4.4.2 Rescaling token scores (logits) via a temperature parameter

4.4.3 Sampling the next token from a probability distribution

4.4.4 Adding temperature scaling to the text generation function

4.5 Balancing diversity and coherence with top-p sampling

4.5.1 Selecting a subset of top-p tokens

4.5.2 Adding a top-p filter to the text generation function

4.6 Improving response accuracy with self-consistency

5 Inference-time scaling via self-refinement

5.1 Scoring and iteratively improving model responses

5.2 Loading a pretrained model

5.3 Scoring LLM responses with a rule-based score

5.4 Understanding token probability scores

5.5 From token probability scores to log probabilities

5.6 Scoring model confidence with log probabilities

5.7 Self-refinement through iterative feedback

5.8 Coding the self-refinement loop

6 Training reasoning models with reinforcement learning

6.1 Introduction to RL for LLMs

6.1.1 The original RL pipeline with human feedback (RLHF)

6.1.2 From human feedback to verifiable rewards

6.2 RLVR using GRPO

6.2.1 High-level GRPO intuition via a chef analogy

6.2.2 The high-level GRPO procedure

6.3 Loading a pretrained model

6.4 Loading a MATH training subset

6.5 Sampling rollouts

6.6 Calculating rewards

6.7 Preparing learning signals from rollouts via advantages

6.8 Scoring rollouts with sequence log probabilities

6.9 From advantages to policy updates via the GRPO loss

6.10 Putting everything together in a single GRPO function

6.11 Implementing the GRPO training loop

6.12 Loading and evaluating saved model checkpoints

7 Improving GRPO for reinforcement learning

7.1 Improving GRPO

7.2 Tracking GRPO performance metrics

7.2.1 Executing a GRPO training run

7.2.2 Inspecting the GRPO training run

7.3 Tracking more advanced GRPO performance metrics

7.3.1 Advantage tracking

7.3.2 Entropy tracking

7.3.3 Plotting additional GRPO metrics

7.4 Stabilizing sequence-level GRPO using clipped policy ratios

7.4.1 Computing clipped policy ratios

7.4.2 Training with clipped policy ratios

7.5 Controlling how much the model changes with a KL term

7.5.1 Implementing the KL loss term

7.5.2 Training with a KL loss term

7.6 Adding an explicit format reward

7.6.1 Using <think> tokens

7.6.2 Training a model to emit <think> tokens

7.6.3 More GRPO modifications, tips, and tricks

8 Distilling reasoning models for efficient reasoning

8.1 Introducing model distillation for reasoning tasks

8.2 Generating a dataset for reasoning distillation

8.3 Loading the MATH training dataset for distillation

8.4 Building training examples

8.4.1 Loading and understanding the tokenizer

8.4.2 Formatting and tokenizing the dataset

8.4.3 Filtering and splitting the dataset

8.5 Loading a pretrained model

8.6 Computing the training and validation losses

8.7 Implementing the training loop for distillation

8.8 Evaluating the distilled model

8.9 Future directions for reasoning models

8.10 Conclusions

8.10.1 What’s next

8.10.2 Staying up to date in a fast-moving field

Appendices

Appendix A: References and further reading

A.1 Chapter 1: Understanding reasoning models

A.1.1 References

A.1.2 Further reading

A.2 Chapter 2: Generating text with a pretrained LLM

A.2.1 References

A.2.2 Further reading

A.3 Chapter 3: Evaluating reasoning models

A.3.1 References

A.3.2 Further reading

A.4 Chapter 4: Improving reasoning with inference-time scaling

A.4.1 References

A.4.2 Further reading

A.5 Chapter 5: Inference-time scaling via self-refinement

A.5.1 References

A.5.2 Further reading

A.6 Chapter 6: Training reasoning models with reinforcement learning

A.6.1 References

A.6.2 Further reading

A.7 Chapter 7: Improving GRPO for reinforcement learning

A.7.1 References

A.7.2 Further reading

A.8 Chapter 8: Distilling reasoning models for efficient reasoning

A.8.1 References

A.8.2 Further reading

A.9 Appendix F: Common approaches to LLM evaluation

A.9.1 References

A.9.2 Further reading

Appendix B: Exercise solutions

B.1 Chapter 2

B.2 Chapter 3

B.3 Chapter 4

B.4 Chapter 5

B.5 Chapter 6

B.6 Chapter 7

B.7 Chapter 8

Appendix C: Qwen3 LLM source code

C.1 RMSNorm

C.2 Feedforward module

C.3 Rotary position embeddings

C.4 Grouped query attention

C.5 Transformer block

C.6 Main model code

C.7 KV cache

C.8 Tokenizer

C.9 Using the model

Appendix D: Using larger LLMs

D.1 Larger dense Qwen3 configurations

D.2 Downloading larger checkpoints overview

D.3 Loading a larger base model

D.4 Loading a larger reasoning variant

D.5 Practical recommendations

Appendix E: Batching and throughput-oriented execution

E.1 Why batching helps

E.2 Running batched generation

E.3 Padding and attention masks

E.4 Chapter 3: Batched MATH-500 evaluation

E.5 Chapter 4: Batched self-consistency sampling

E.6 Chapter 6: Batched GRPO rollouts

E.7 Chapter 8: Batched distillation

E.8 Single-sequence vs. batch generation

Appendix F: Common approaches to model evaluation

F.1 Understanding the main evaluation methods for LLMs

F.2 Evaluating answer-choice accuracy

F.2.1 Loading the model

F.2.2 Checking the generated answer letter

F.3 Using verifiers to check answers

F.4 Comparing models using preferences and leaderboards

F.5 Judging responses with other LLMs

F.5.1 Implementing the LLM-as-a-judge approach in Ollama

F.5.2 Evaluating responses with a grading rubric

Appendix G: Building a chat interface

G.1 Installing Chainlit

G.2 Running the code as a script

G.3 Downloading the scripts

G.4 The regular single-turn script

G.5 Running the single-turn script

G.6 The multi-turn interface

G.6.1 What multi-turn means

G.6.2 The multi-turn variant

G.6.3 How the multi-turn script uses history

G.6.4 Recommendations

Overview

6 Training reasoning models with reinforcement learning

This chapter explains how reinforcement learning can improve an LLM’s reasoning ability as a training-time scaling method. It contrasts this with inference-time scaling, noting that both approaches can be combined: a model can first be trained to reason better and then use more compute during generation for further gains. The central idea is that pre-training teaches a model broad knowledge through next-token prediction, while reinforcement learning shapes how the model uses that knowledge by optimizing whole outputs, such as whether an answer is correct.

The chapter compares reinforcement learning with human feedback and reinforcement learning with verifiable rewards. Human-feedback methods rely on human preference rankings to train a separate reward model, which then scores new model outputs during training. Verifiable-reward methods simplify this pipeline by replacing the learned reward model with deterministic checks, such as verifying whether a math answer matches the ground truth in the required format. This makes training cheaper, more reproducible, and easier to scale, although it is mainly applicable to domains where correctness can be automatically checked, such as math and code.

The main implementation focus is training a small reasoning model with verifiable rewards using group relative policy optimization (GRPO). The process begins by loading a pretrained model and a non-overlapping MATH training subset, then generating multiple sampled answers for each prompt. Each answer receives a binary correctness reward, these rewards are converted into group-relative advantages, and the model computes sequence-level log-probabilities for the generated responses. The GRPO loss reinforces responses that performed better than other rollouts for the same prompt and discourages worse ones, after which a standard PyTorch training loop updates the model, logs metrics, saves checkpoints, and evaluates progress. Even a short training run can substantially improve reasoning performance (MATH-500 accuracy rises from 15% to 47%), though the chapter notes that the simplified GRPO version can become unstable over longer runs and that later improvements add regularization and stability techniques.
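The quantities described above can be written compactly. The notation here is an assumption chosen for illustration (a prompt q, G rollouts o_1, ..., o_G with rewards r_i, a policy π_θ, and a small constant ε that avoids division by zero), and averaging the loss over the group is likewise an assumption rather than a statement of the chapter's exact formula:

```latex
\begin{aligned}
A_i &= \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G) + \epsilon}, \\
\log \pi_\theta(o_i \mid q) &= \sum_{t=1}^{|o_i|} \log \pi_\theta\left(o_{i,t} \mid q,\, o_{i,<t}\right), \\
\mathcal{L}(\theta) &= -\frac{1}{G} \sum_{i=1}^{G} A_i \, \log \pi_\theta(o_i \mid q).
\end{aligned}
```

Minimizing this loss raises the log-probability of rollouts with positive advantages and lowers it for rollouts with negative advantages. The simplified version shown here has no KL term.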

A mental model of the topics covered in this book. This chapter focuses on techniques that improve reasoning with additional training (stage 4). Specifically, this chapter covers reinforcement learning.

Conceptual comparison of inference-time scaling and training-time scaling. Increasing compute at inference improves accuracy by spending more resources per answer generation, while increasing compute during training improves accuracy by investing more resources upfront.

Common training stages for LLMs. The ordering of the reasoning training and preference tuning stages can vary, and some pipelines interleave reasoning and preference tuning rather than strictly sequencing them.

Roadmap of this chapter. After a brief introduction to reinforcement learning (RL) for LLMs in this section, we discuss the difference between two RL stages, RLHF and RLVR, in the next section. Then, we focus on implementing RLVR using the GRPO algorithm in the remainder of this chapter.

Two-stage overview of reinforcement learning with human feedback (RLHF). First, a reward model is trained on human-ranked responses to assign a preference score to each. Second, the LLM is updated using these reward scores within an RL objective to encourage preferred responses and discourage less desirable ones.

Overview of reinforcement learning with verifiable rewards (RLVR). The LLM generates a response that is evaluated by a deterministic verifier, for example the math verifier from chapter 3, which assigns a correctness label used as a reward signal within an RL objective to update the model.

After introducing the two main reinforcement learning approaches for LLMs, RLHF and RLVR, the remaining sections focus on implementing RLVR using the GRPO algorithm, from dataset loading to implementing the full training loop.

High-level overview of the GRPO algorithm for RLVR using a chef analogy. Multiple rollouts are generated and scored, relative advantages are computed, and a policy gradient objective with a KL-based regularization (loss) term is used to update the model parameters.

Step-by-step GRPO update for RLVR. (1) A prompt is sampled and multiple rollouts are generated. (2) Each rollout is scored using a verifiable reward. (3) Group-relative advantages are computed from these rewards. (4) The log-probability of each rollout under the current model is calculated. (5) Advantages and log-probabilities are combined to form the policy gradient loss. (6) A KL regularization term against a reference model is added, and the resulting total loss is used to update the model parameters.

In stage 5, we load the pre-trained model (this section) and dataset (next section) that we will use for the model training.

Structure and split of the MATH dataset. The full dataset contains about 12,500 problems. Of these, 500 form the test set (MATH-500), which we used for model evaluation in chapter 3, and a non-overlapping set of 12,000 problems is used for training in this chapter.

After outlining the RLVR method and GRPO algorithm, the following sections implement the individual GRPO stages that we need to train the LLM via verifiable rewards on the MATH dataset.

Step-by-step GRPO update for RLVR (without KL loss term). We begin by prompting the LLM to generate the different rollouts.

The second stage in the GRPO pipeline computes the rewards for each rollout the LLM generated in the previous section.

The third GRPO stage computes the advantage values from the answer (rollout) correctness rewards.

The fourth stage in the GRPO pipeline computes log-probabilities for each rollout, a calculation related to the logprob scorer we developed in the previous chapter.

The fifth stage in the GRPO pipeline computes the policy gradient loss that we use to update the model. Stage number 6, the model weight update, will be implemented as part of the training loop later.

The complete GRPO workflow: (1) multiple rollouts are generated for a prompt, (2) each rollout is assigned a correctness reward, (3) the rewards are converted into group-relative advantages, and (4) the advantages are combined with the rollouts' log-probabilities to (5) compute the policy gradient loss. In step (6), the loss gradients will be computed and used to update the model in the next section.

After implementing the individual GRPO stages, we now implement the surrounding training loop to update the model weights.

Outline of the training loop. The overall structure follows a standard deep learning training loop. The key difference lies in how the loss is computed: instead of a standard supervised objective, the loss is obtained via the GRPO stages (stage 4).

The final step of this chapter discusses how we can load the saved model checkpoints and evaluate them.

Summary

Reinforcement learning (RL) can be used to train LLMs on human preference labels and verifiable rewards.
RL is typically applied as post-training on top of a pre-trained base model, and it can be inserted at different stages of an LLM pipeline, including reasoning training and preference tuning.
RL with human feedback (RLHF) optimizes for human preferences via a two-stage setup: train a reward model from ranked responses, then use reward scores to update the LLM.
RL with verifiable rewards (RLVR) simplifies RLHF by replacing learned reward models with deterministic, automatically computed verifiers (for example, math answer checking).
We focused on RLVR for math reasoning.
We used GRPO as the policy optimization algorithm that turns verifier rewards into parameter updates; because GRPO directly optimizes the model using sequence-level rewards without requiring a separate value model, it is particularly convenient.
GRPO is a more resource-friendly alternative to other RL algorithms for LLMs because it avoids training a separate value model and instead derives learning signals from comparisons within a group of sampled rollouts.
A "rollout" refers to a full model answer (completion) for a prompt; rewards, advantages, and log-probabilities are computed from the rollout in later steps.
Rewards are computed from a verifier that only grants a reward if the final answer is both correct and extractable in a required format like "\boxed{}".
Raw rewards are transformed into advantages by normalizing each rollout reward relative to the group mean and standard deviation.
GRPO also relies on sequence-level log-probabilities, which are computed by summing token log-probabilities over the generated answer tokens.
Sequence log-probabilities, together with the advantages, form the core policy-gradient objective in GRPO.
The full GRPO loss computation is combined into a single function that performs rollout sampling, reward computation, advantage calculation, log-prob computation, and policy-gradient loss calculation.
The surrounding training loop is a standard deep learning loop, with the key difference being that the loss comes from GRPO rather than conventional classification losses.
Training is resource-intensive because each step requires generating multiple, potentially long rollouts, but even short GRPO runs can increase MATH-500 accuracy from 15% to 47%.
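To make the summary point about sequence-level log-probabilities concrete, here is a minimal pure-Python sketch. The function names and the input layout are hypothetical: `step_logits[t]` is assumed to hold the scores the model used to predict `token_ids[t]`, so any shifting between positions and targets is assumed to have happened already. The chapter's own implementation works on PyTorch tensors and may differ in detail.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_logprob(step_logits, token_ids, prompt_len):
    """Sum token log-probabilities over the generated answer tokens only.

    Assumption: step_logits[t] are the scores used to predict token_ids[t].
    Prompt tokens (positions < prompt_len) are skipped.
    """
    total = 0.0
    for t in range(prompt_len, len(token_ids)):
        total += log_softmax(step_logits[t])[token_ids[t]]
    return total


# Toy example: vocabulary of 4 tokens, 2 prompt tokens, 3 answer tokens.
# Uniform scores give each answer token a probability of 1/4.
uniform = [0.0, 0.0, 0.0, 0.0]
logits = [uniform] * 5
tokens = [3, 1, 0, 2, 2]
seq_lp = sequence_logprob(logits, tokens, prompt_len=2)
print(seq_lp)
```

Because only the three answer tokens are counted, the result equals three times log(1/4).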

FAQ

What is the difference between inference-time scaling and training-time scaling for reasoning models?

Inference-time scaling improves accuracy by spending more computation each time the model generates an answer, for example by sampling or searching more during generation. Training-time scaling improves accuracy by investing additional computation during training so the model becomes better before inference. Chapter 6 focuses on training-time scaling through reinforcement learning, although inference-time and training-time scaling can also be combined.

How does reinforcement learning fit into the LLM training pipeline?

For LLMs, reinforcement learning is usually applied as a post-training stage on top of a pretrained model, often after instruction fine-tuning. Two common RL stages are reasoning training and preference tuning. Reasoning-focused RL can also be applied directly to a pretrained base model, as demonstrated by DeepSeek-R1-Zero, which makes it easier to attribute improvements specifically to reasoning training.

What is reinforcement learning with human feedback (RLHF)?

RLHF is a training approach that uses human preference labels to shape model behavior. Human annotators rank multiple model responses to the same prompt, and those rankings are used to train a reward model. The reward model then scores new outputs, and the LLM is fine-tuned with reinforcement learning to produce responses that receive higher preference scores.

What is reinforcement learning with verifiable rewards (RLVR)?

RLVR replaces the learned reward model used in RLHF with deterministic verifiers. For example, in math tasks, a verifier checks whether the model’s final answer matches the ground truth and assigns a reward such as 1 for correct and 0 for incorrect. This removes the need for human annotation or a separate reward model, but it requires domains where reliable verification is possible, such as math or code.

Why is RLVR simpler than RLHF?

RLHF typically requires two stages: training a reward model from human preferences, then using that reward model to train the target LLM. RLVR collapses this into a single loop: the model generates responses, a deterministic verifier assigns rewards, and those rewards are used directly to update the model. This makes RLVR cheaper, more reproducible, and easier to scale when reliable verification signals are available.

What is GRPO, and why is it used in this chapter?

GRPO stands for group relative policy optimization. It is a policy optimization algorithm used to update the LLM’s weights from RLVR rewards. Unlike PPO, GRPO does not require a separate value model. Instead, it samples multiple responses for the same prompt and compares their rewards relative to each other, which makes it more resource-friendly for reasoning-model training.

What does “group relative” mean in GRPO?

The “group relative” part means that the model generates multiple rollouts, or complete responses, for the same prompt. Each rollout receives a reward, and those rewards are converted into advantages by comparing each reward to the group mean and standard deviation. This tells the model which responses were better or worse than the other responses generated for the same prompt.
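A small sketch of this normalization, using only the Python standard library. The function name and the `eps` constant are assumptions for illustration, and the sketch uses the population standard deviation; whether the chapter uses the population or sample variant is not stated here.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population standard deviation plus a small eps to
    avoid dividing by zero when all rewards are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt: two correct (1.0) and two incorrect (0.0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

In this group the mean reward is 0.5 and the standard deviation is 0.5, so correct rollouts receive an advantage close to +1 and incorrect ones close to -1.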

How are rewards computed for math problems in the chapter’s RLVR setup?

Rewards are computed with a math verifier. The verifier extracts the model’s final answer and compares it with the ground-truth answer. In the chapter’s implementation, an answer receives 1.0 only if it is correct and uses the required \boxed{} format. Otherwise, it receives 0.0. This encourages the model to produce both correct and properly formatted answers.
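The reward rule can be sketched as follows. This is a simplified stand-in, not the chapter's verifier: the function names are hypothetical, and the comparison is a plain string match after removing whitespace, whereas the verifier from chapter 3 checks mathematical equivalence.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat the answer as not extractable


def normalize(answer):
    """Assumption: whitespace-insensitive string comparison only."""
    return "".join(answer.split())


def binary_reward(response, ground_truth):
    """1.0 only if a boxed answer exists and matches the ground truth."""
    answer = extract_last_boxed(response)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0


print(binary_reward(r"So the result is \boxed{\frac{1}{2}}.", r"\frac{1}{2}"))
print(binary_reward("So the result is 1/2.", r"\frac{1}{2}"))
```

The second call returns 0.0 even though the response may be mathematically right, because no boxed answer can be extracted. That is the behavior the format requirement is meant to enforce.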

Why does GRPO convert rewards into advantages?

Rewards indicate how well individual rollouts performed, while advantages indicate how each rollout performed relative to the other rollouts for the same prompt. Positive advantages increase the likelihood of the actions that produced a rollout, negative advantages decrease their likelihood, and near-zero advantages contribute little to the update. If all rollouts receive the same reward, the advantages become zero and GRPO produces little or no learning signal.
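The zero-signal case is easy to see in a short sketch. All names below are hypothetical, the sequence log-probabilities are made-up numbers, and the loss is the simplified policy gradient form (negative mean of advantage times sequence log-probability) without clipping or a KL term.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages (population std is an assumption)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def policy_gradient_loss(advantages, seq_logprobs):
    """Simplified loss: negative mean of advantage * sequence log-prob."""
    terms = [a * lp for a, lp in zip(advantages, seq_logprobs)]
    return -sum(terms) / len(terms)


# Made-up sequence log-probabilities for four rollouts of one prompt.
seq_logprobs = [-12.0, -15.0, -9.0, -20.0]

# Mixed rewards: the group contains better and worse rollouts.
mixed_loss = policy_gradient_loss(
    group_advantages([1.0, 0.0, 0.0, 1.0]), seq_logprobs
)

# Identical rewards: every advantage is zero, so the loss carries no signal.
flat_loss = policy_gradient_loss(
    group_advantages([0.0, 0.0, 0.0, 0.0]), seq_logprobs
)

print(mixed_loss, flat_loss)
```

When every rollout is wrong (or every rollout is right), each advantage is zero, so every term in the loss is multiplied by zero and that prompt contributes nothing to the gradient.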

How does the GRPO training loop update the model?

The training loop generates multiple rollouts for a prompt, scores them with verifiable rewards, converts rewards into group-relative advantages, computes sequence-level log-probabilities for the rollouts, and combines advantages with log-probabilities into a policy gradient loss. PyTorch then performs a backward pass, optionally clips gradients for stability, and updates the model weights with an optimizer such as AdamW. The chapter’s simplified implementation omits the KL regularization term, which is added in the next chapter.
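The following toy sketch walks through the same steps on a deliberately tiny problem. It is not the chapter's implementation. The "model" is just three logits over three candidate answers, each rollout is a single sampled answer, and the gradient is written out by hand instead of using PyTorch's backward pass. Plain gradient descent stands in for AdamW, and gradient clipping is omitted. All names and hyperparameters are assumptions.

```python
import math
import random
import statistics


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grpo_step(logits, correct_idx, rng, group_size=8, lr=0.5, eps=1e-4):
    """One toy GRPO update; returns the updated logits."""
    probs = softmax(logits)
    # (1) Sample a group of rollouts (here: single-token answers).
    rollouts = rng.choices(range(len(logits)), weights=probs, k=group_size)
    # (2) Verifiable reward: 1.0 if the answer is correct, else 0.0.
    rewards = [1.0 if a == correct_idx else 0.0 for a in rollouts]
    # (3) Group-relative advantages.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + eps) for r in rewards]
    # (4) + (5) Gradient of loss = -mean(A_i * log p(a_i)) w.r.t. the logits.
    # For a softmax policy: d log p(a) / d logit_j = 1[j == a] - p_j.
    grad = [0.0] * len(logits)
    for a, adv in zip(rollouts, advantages):
        for j in range(len(logits)):
            indicator = 1.0 if j == a else 0.0
            grad[j] -= adv * (indicator - probs[j]) / group_size
    # (6) Plain gradient-descent update of the "model" parameters.
    return [w - lr * g for w, g in zip(logits, grad)]


rng = random.Random(0)
logits = [0.0, 0.0, 0.0]  # uniform policy over three candidate answers
correct_idx = 2
p_before = softmax(logits)[correct_idx]
for _ in range(50):
    logits = grpo_step(logits, correct_idx, rng)
p_after = softmax(logits)[correct_idx]
print(f"p(correct): {p_before:.3f} -> {p_after:.3f}")
```

Steps in which all sampled answers receive the same reward leave the logits unchanged, which mirrors the missing learning signal for uniform groups. A real run replaces the three logits with an LLM, the single-token answers with long generated rollouts, and the manual gradient with `loss.backward()` and an optimizer step.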
