Table of contents

1 Understanding reasoning models

1.1 Defining reasoning in the context of LLMs

1.2 Understanding the standard LLM training pipeline

1.3 Improving LLM reasoning with training and inference techniques

1.4 Pattern matching versus logical reasoning

1.5 Simulating reasoning without explicit rules

1.6 Why build reasoning models from scratch?

1.7 A roadmap to reasoning models from scratch

1.8 Summary

2 Generating text with a pre-trained LLM

2.1 Introduction to LLMs for text generation

2.2 Setting up the coding environment

2.3 Understanding hardware needs and recommendations

2.4 Preparing input texts for LLMs

2.5 Loading pre-trained models

2.6 Understanding the sequential LLM text generation process

2.7 Coding a minimal text generation function

2.8 Faster inference via KV caching

2.9 Faster inference via PyTorch model compilation

2.10 Summary

3 Evaluating reasoning models

3.1 Building a math verifier

3.2 Loading a pre-trained model to generate text

3.3 Implementing a wrapper for easier text generation

3.4 Extracting the final answer box

3.5 Normalizing the extracted answer

3.6 Verifying mathematical equivalence

3.7 Grading answers

3.8 Loading the evaluation dataset

3.9 Evaluating the model

3.10 Summary

4 Improving reasoning with inference-time scaling

4.1 Introduction to inference-time scaling

4.2 Loading a pre-trained model

4.3 Generating better responses with chain-of-thought prompting

4.4 Controlling output diversity with temperature scaling

4.4.1 Understanding the process of selecting the next token

4.4.2 Rescaling token scores (logits) via a temperature parameter

4.4.3 Sampling the next token from a probability distribution

4.4.4 Adding temperature scaling to the text generation function

4.5 Balancing diversity and coherence with top-p sampling

4.5.1 Selecting a subset of top-p tokens

4.5.2 Adding a top-p filter to the text generation function

4.6 Improving response accuracy with self-consistency

4.7 Summary

5 Inference-time scaling via self-refinement

5.1 Scoring and iteratively improving model responses

5.2 Loading a pre-trained model

5.3 Scoring LLM responses with a rule-based score

5.4 Understanding token probability scores

5.5 From token probability scores to log-probabilities

5.6 Scoring model confidence with log-probabilities

5.7 Self-refinement through iterative feedback

5.8 Coding the self-refinement loop

5.9 Summary

6 Training reasoning models with reinforcement learning

6.1 Introduction to reinforcement learning for LLMs

6.1.1 The original reinforcement learning pipeline with human feedback (RLHF)

6.1.2 From human feedback to verifiable rewards (RLVR)

6.2 Reinforcement learning with verifiable rewards walkthrough using GRPO

6.2.1 High-level GRPO intuition via a chef analogy

6.2.2 The high-level GRPO procedure

6.3 Loading a pre-trained model

6.4 Loading a MATH training subset

6.5 Sampling rollouts

6.6 Calculating rewards

6.7 Preparing learning signals from rollouts via advantages

6.8 Scoring rollouts with sequence log-probabilities

6.9 From advantages to policy updates via the GRPO loss

6.10 Putting everything together in a single GRPO function

6.11 Implementing the GRPO training loop

6.12 Loading and evaluating saved model checkpoints

6.13 Summary

7 Improving GRPO for reinforcement learning

7.1 Improving GRPO

7.2 Tracking GRPO performance metrics

7.2.1 Executing a GRPO training run

7.2.2 Inspecting the GRPO training run

7.3 Tracking more advanced GRPO performance metrics

7.3.1 Advantage tracking

7.3.2 Entropy tracking

7.3.3 Plotting additional GRPO metrics

7.4 Stabilizing sequence-level GRPO using clipped policy ratios

7.4.1 Computing clipped policy ratios

7.4.2 Training with clipped policy ratios

7.5 Controlling how much the model changes with a KL term

7.5.1 Implementing the KL loss term

7.5.2 Training with a KL loss term

7.6 Adding an explicit format reward

7.6.1 Using <think> tokens

7.6.2 Training a model to emit <think> tokens

7.6.3 More GRPO modifications, tips, and tricks

7.7 Summary

8 Distilling reasoning models for efficient reasoning

8.1 Introduction to model distillation for reasoning tasks

8.2 Generating a dataset for reasoning distillation

8.3 Loading the MATH training dataset for distillation

8.4 Building training examples

8.4.1 Loading and understanding the tokenizer

8.4.2 Formatting and tokenizing the dataset

8.4.3 Filtering and splitting the dataset

8.5 Loading a pre-trained model

8.6 Computing the training and validation losses

8.7 Implementing the training loop for distillation

8.8 Evaluating the distilled model

8.9 Future directions for reasoning models

8.10 Conclusions

8.10.1 What’s next

8.10.2 Staying up to date in a fast-moving field

8.11 Summary

Appendices

Appendix A: References and further reading

A.1 Chapter 1: Understanding reasoning models

A.1.1 References

A.1.2 Further Reading

A.2 Chapter 2: Generating text with a pre-trained LLM

A.2.1 References

A.2.2 Further Reading

A.3 Chapter 3: Evaluating reasoning models

A.3.1 References

A.3.2 Further Reading

A.4 Chapter 4: Improving reasoning with inference-time scaling

A.4.1 References

A.4.2 Further Reading

A.5 Chapter 5: Inference-time scaling via self-refinement

A.5.1 References

A.5.2 Further Reading

A.6 Chapter 6: Training reasoning models with reinforcement learning

A.6.1 References

A.6.2 Further Reading

A.7 Chapter 7: Improving GRPO for reinforcement learning

A.7.1 References

A.7.2 Further Reading

A.8 Chapter 8: Distilling reasoning models for efficient reasoning

A.8.1 References

A.8.2 Further Reading

A.9 Appendix F: Common approaches to LLM evaluation

A.9.1 References

A.9.2 Further Reading

Appendix B: Exercise solutions

B.1 Chapter 2

B.2 Chapter 3

B.3 Chapter 4

B.4 Chapter 5

B.5 Chapter 6

B.6 Chapter 7

B.7 Chapter 8

Appendix C: Qwen3 LLM source code

C.1 Root mean square layer normalization (RMSNorm)

C.2 Feed forward module

C.3 Rotary position embeddings (RoPE)

C.4 Grouped query attention (GQA)

C.5 Transformer block

C.6 Main model code

C.7 KV cache

C.8 Tokenizer

C.9 Using the model

Appendix D: Using larger LLMs

D.1 Larger dense Qwen3 configurations

D.2 Downloading larger checkpoints overview

D.3 Loading a larger base model

D.4 Loading a larger reasoning variant

D.5 Practical recommendations

Appendix E: Batching and throughput-oriented execution

E.1 Why batching helps

E.2 Running batched generation

E.3 Padding and attention masks

E.4 Chapter 3: batched MATH-500 evaluation

E.5 Chapter 4: batched self-consistency sampling

E.6 Chapter 6: batched GRPO rollouts

E.7 Chapter 8: batched distillation

E.8 Single-sequence versus batch generation

Appendix F: Common approaches to model evaluation

F.1 Understanding the main evaluation methods for LLMs

F.2 Evaluating answer-choice accuracy

F.2.1 Loading the model

F.2.2 Checking the generated answer letter

F.3 Using verifiers to check answers

F.4 Comparing models using preferences and leaderboards

F.5 Judging responses with other LLMs

F.5.1 Implementing an LLM-as-a-judge approach in Ollama

Appendix G: Building a chat interface

G.1 Installing Chainlit

G.2 Running the code as a script

G.3 Downloading the scripts

G.4 The regular single-turn script

G.5 Running the single-turn script

G.6 The multi-turn interface

G.6.1 What multi-turn means

G.6.2 The multi-turn variant

G.6.3 How the multi-turn script uses history

G.6.4 Recommendations

Overview

Appendix C. Qwen3 LLM source code

This appendix presents the concise, readable source code for the Qwen3 model used throughout the book, clarifying that “from scratch” refers to the reasoning methods rather than building an LLM end to end. The implementation mirrors a GPT-2–style, decoder-only transformer while adopting modern architectural updates common in contemporary LLMs. Instead of a step-by-step deep dive, the appendix offers a guided overview that connects design choices to code, so readers can see how the major components fit together and reuse the model via the reasoning_from_scratch package.

Key updates over a classic GPT-2 include RMSNorm in place of LayerNorm, which is cheaper to compute while keeping training stable, and a SwiGLU (SiLU-activated GLU) feed-forward module that improves expressivity while often using fewer parameters than the standard two-layer MLP. Positional information is injected with rotary position embeddings (RoPE), implemented in a clear “two-halves” style that rotates query and key vectors. Attention uses grouped query attention (GQA), which shares keys and values across groups of query heads to reduce parameters and KV-cache bandwidth, typically with little or no loss in quality; Qwen3 additionally applies optional QK normalization (RMSNorm on queries/keys) to further stabilize optimization. KV-cache support is integrated so that generation can be accelerated in streaming or incremental decoding scenarios.

The transformer block combines RMSNorm, RoPE-applied masked GQA, and the SwiGLU feed-forward module with residual connections, and is stacked repeatedly (28 times in the 0.6B variant). The Qwen3Model wraps these blocks with token embeddings, a final RMSNorm, and an output projection. It also precomputes RoPE buffers and manages cache-aware causal masks, while a small KVCache utility holds per-layer keys and values during generation. A flexible configuration system supports multiple sizes; the 0.6B setup uses 1024-dimensional embeddings, 16 attention heads with a head dimension of 128, 28 layers, a 3072-wide intermediate layer, 8 KV groups, a large vocabulary, and a long context window. The tokenizer reimplementation mirrors the official behavior, handling numerous special and chat tokens and a hybrid “thinking” mode with intentionally nuanced prompt rules, enabling consistent formatting for both standard and reasoning-style interactions.
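
To make the 0.6B numbers more concrete, the following minimal sketch derives the attention projection widths from the values quoted above. The dictionary key names are hypothetical and may not match the book's configuration; the vocabulary size and context length are omitted because the text does not state them.

```python
# Hypothetical key names; the values are the ones quoted in the overview text.
cfg = {
    "emb_dim": 1024,     # embedding width
    "n_heads": 16,       # query heads
    "head_dim": 128,     # width of each head
    "n_layers": 28,      # stacked transformer blocks
    "hidden_dim": 3072,  # feed-forward intermediate width
    "n_kv_groups": 8,    # key/value groups shared by the query heads
}

# Queries get one projection slice per head; keys and values get one per KV group.
q_width = cfg["n_heads"] * cfg["head_dim"]
kv_width = cfg["n_kv_groups"] * cfg["head_dim"]
heads_per_group = cfg["n_heads"] // cfg["n_kv_groups"]

print(q_width, kv_width, heads_per_group)  # 2048 1024 2
```

Under these values, the concatenated query heads (2048) are wider than the embedding (1024), so the attention output projection has to map back down to the embedding width, and each key/value pair is shared by two query heads.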

Figure C.1 Architectural comparison between Qwen3 and GPT-2. Both models process text through embedding layers and stacked transformer blocks, but they differ in certain design choices.

Figure C.2 Comparison of LayerNorm (used in GPT-2) and RMSNorm (used in Qwen3). LayerNorm (left) normalizes activations so that their average value (mean) is exactly zero and their spread (variance) is exactly one. RMSNorm (right) instead scales activations based on their root mean square, which does not enforce zero mean or unit variance, but still keeps the mean and variance within a reasonable range for stable training.

Figure C.3 In GPT-2 (top), the feed forward module consists of two fully connected (linear) layers separated by a non-linear activation function. In Qwen3 (bottom), this module is a gated linear unit (GLU) variant, which adds a third linear layer (linear layer 3) and multiplies the output of this linear layer 3 elementwise with the activated output of linear layer 1.

Figure C.4 Different activation functions that can be used in a feed forward module (neural network). GELU and SiLU (Swish) offer smooth alternatives to ReLU, which has a sharp kink at input zero.

Figure C.5 A comparison between MHA and GQA. Here, the group size is 2, where a key and value pair is shared among 2 queries.

Figure C.6 The structure of the transformer block in Qwen3. Each block includes RMSNorm, RoPE, masked grouped-query attention, and a feed-forward module, and is repeated 28 times in the 0.6B-parameter model.

Figure C.7 Architecture of the Qwen3 0.6B model. The model consists of a token embedding layer followed by 28 transformer blocks, each containing RMSNorm, RoPE, QKNorm, masked grouped-query attention with 16 heads, and a feed-forward module with an intermediate size of 3,072.

C.9 Using the model

Let's now instantiate and use the model to confirm that the code works by reusing the text generation approach from chapter 2.

First, we instantiate the model using the pre-trained model weights:

The output shows the structure of the instantiated model, which should match the values we used in the configuration file in listing C.7:

Next, we re-use the text generation functions from chapter 2 to generate text:

Since we used the same prompt as in chapter 2, the generated text matches the generated text from chapter 2 exactly:

While the main chapters use the 0.6-billion-parameter variant of Qwen3 to lower the resource requirements for this book, interested readers can find more information on how to use the larger models in appendix D.

FAQ

What does “from scratch” mean in this book’s context?

It refers to building reasoning techniques and training/evaluation utilities from the ground up, not implementing a full LLM from raw primitives. The full “implement an LLM from scratch” topic is covered in the author’s separate Build a Large Language Model (From Scratch) book.

How is Qwen3 similar to and different from GPT-2?

Both are decoder-only transformer architectures with token embeddings, stacked transformer blocks, and a final projection. Qwen3 adopts modern choices absent in GPT-2: RMSNorm instead of LayerNorm, a SwiGLU feed-forward module, rotary position embeddings (RoPE), grouped query attention (GQA), and optional QKNorm on queries/keys.

Why does Qwen3 use RMSNorm instead of LayerNorm?

RMSNorm rescales activations using their root mean square without mean-centering. It is slightly cheaper to compute, omits the bias (shift) term by default, and needs only one cross-feature reduction (the mean of squares) instead of the two that LayerNorm requires (mean and variance). This reduces GPU communication and improves training efficiency while maintaining stability.
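
The difference can be illustrated with a dependency-free sketch. The appendix itself uses PyTorch; the function names, the `eps` default, and the plain-list representation below are assumptions made for illustration only.

```python
import math


def layer_norm(x, eps=1e-6):
    # Two reductions over the features: the mean, then the variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, weight=None, eps=1e-6):
    # One reduction over the features: the mean of squares. No mean-centering, no bias.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        weight = [1.0] * len(x)  # learnable scale, initialized to ones
    return [v * scale * w for v, w in zip(x, weight)]


x = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)

print(sum(ln_out) / len(ln_out))    # approximately zero mean
print(sum(rms_out) / len(rms_out))  # mean is not forced to zero
```

With this input, the LayerNorm output is centered at zero, whereas the RMSNorm output keeps a positive mean but has a root mean square of roughly one.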

How does the SwiGLU feed-forward module work, and why can it use fewer parameters?

SwiGLU replaces the standard 2-layer MLP with three linear layers and a gated interaction: SiLU(fc1(x)) * fc2(x), followed by fc3. In practice, fc1 and fc2 are each typically chosen to be about half the width of a standard MLP’s expansion layer, so the total parameter count is often lower, while the multiplicative gate increases expressivity and performance.
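
As a rough illustration of the gating formula, the sketch below implements SiLU(fc1(x)) * fc2(x) followed by fc3 on plain Python lists. The helper names and the bias-free layers are assumptions; the appendix code uses PyTorch modules instead. The parameter comparison at the end assumes a GPT-2-style 4x expansion and the half-width choice described above, which is not the same as the 0.6B setting with its 3072-wide intermediate layer.

```python
import math


def silu(z):
    # SiLU (Swish): z * sigmoid(z)
    return z / (1.0 + math.exp(-z))


def linear(x, weight_rows):
    # Bias-free linear layer; each row of weight_rows produces one output feature.
    return [sum(w * v for w, v in zip(row, x)) for row in weight_rows]


def swiglu_ffn(x, w_fc1, w_fc2, w_fc3):
    gate = [silu(z) for z in linear(x, w_fc1)]   # activated branch
    up = linear(x, w_fc2)                        # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise gating
    return linear(hidden, w_fc3)                 # project back to the input width


x = [1.0, -1.0]
w_fc1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_fc2 = [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]]
w_fc3 = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
out = swiglu_ffn(x, w_fc1, w_fc2, w_fc3)
print(out)

# Parameter comparison under the assumptions named in the lead-in.
d = 1024
standard_mlp_params = 2 * d * (4 * d)  # two layers, 4x expansion
swiglu_params = 3 * d * (2 * d)        # three layers, each half as wide
print(standard_mlp_params, swiglu_params)  # 8388608 6291456
```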

What are Rotary Position Embeddings (RoPE), and how are they implemented here?

RoPE encodes position by rotating attention queries and keys with position-dependent cos/sin phases, avoiding added positional vectors. The appendix implements the split-halves variant (cosine half + sine half), which is equivalent to the interleaved even/odd style used in some repos.
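
The split-halves rotation can be sketched for a single vector as follows. This is a simplified, dependency-free illustration rather than the appendix code; the function names and the `base` default are assumptions, and a real implementation would precompute the cos/sin values for all positions.

```python
import math


def rope_angles(head_dim, position, base=10_000.0):
    # One rotation angle per pair of features; `base` is an assumed default.
    half = head_dim // 2
    return [position / (base ** (2 * i / head_dim)) for i in range(half)]


def apply_rope(vec, position, base=10_000.0):
    # Split-halves variant: feature i is paired with feature i + head_dim // 2.
    half = len(vec) // 2
    angles = rope_angles(len(vec), position, base)
    first, second = vec[:half], vec[half:]
    rot_first = [a * math.cos(t) - b * math.sin(t)
                 for a, b, t in zip(first, second, angles)]
    rot_second = [b * math.cos(t) + a * math.sin(t)
                  for a, b, t in zip(first, second, angles)]
    return rot_first + rot_second


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (here 2), not on absolute positions.
score_near = dot(apply_rope(q, 5), apply_rope(k, 3))
score_far = dot(apply_rope(q, 12), apply_rope(k, 10))
print(abs(score_near - score_far) < 1e-9)  # True
```

Because each pair of features is only rotated, vector lengths are preserved, and no positional vectors need to be added to the token embeddings.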

What is Grouped Query Attention (GQA), and why is it used?

GQA shares key/value projections across groups of attention heads, reducing parameters and KV-cache bandwidth during inference. Empirically it matches standard multi-head attention quality. Qwen3 also supports QKNorm (RMSNorm on queries/keys) to improve stability.
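
The sharing scheme can be sketched for a single decoding step, where the current token's query heads attend over all stored keys and values. The function names and the nested-list layout are illustrative assumptions; causal masking is not needed here because the single query sits at the last position.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_single_query(queries, keys, values):
    """queries: [n_heads][head_dim] for the current token.
    keys, values: [n_kv_groups][seq_len][head_dim]."""
    n_heads, n_kv_groups = len(queries), len(keys)
    assert n_heads % n_kv_groups == 0
    group_size = n_heads // n_kv_groups  # query heads per shared key/value pair
    outputs = []
    for head_idx, q in enumerate(queries):
        group = head_idx // group_size   # consecutive heads share one group
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[group]]
        weights = softmax(scores)
        head_dim = len(values[group][0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values[group]))
                        for d in range(head_dim)])
    return outputs


# 4 query heads, 2 KV groups, 2 stored tokens, head_dim 2.
queries = [[1.0, 0.0]] * 4
keys = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
values = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
heads_out = gqa_single_query(queries, keys, values)
print(heads_out[0] == heads_out[1], heads_out[0] == heads_out[2])  # True False
```

Heads 0 and 1 read the same keys and values, so identical queries give identical outputs, while head 2 belongs to the second group and sees different keys. Only two key/value sets have to be stored for the four query heads.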

What is the KV cache, and how does it speed up generation?

The KV cache stores past keys/values per layer so the model doesn’t recompute attention over prior tokens at each decoding step. During generation, new keys/values are appended to the cache, the causal mask is sliced to the active window, and only the latest tokens are processed, yielding significant speedups.
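
The bookkeeping behind this can be sketched with a small per-layer container. The class and method names below are hypothetical and are not the KVCache utility from the appendix; strings stand in for the key and value tensors.

```python
class SimpleKVCache:
    """Hypothetical per-layer cache; strings stand in for key/value tensors."""

    def __init__(self, n_layers):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]

    def append(self, layer_idx, new_keys, new_values):
        # Store the new entries and return everything seen so far for this layer.
        self.keys[layer_idx].extend(new_keys)
        self.values[layer_idx].extend(new_values)
        return self.keys[layer_idx], self.values[layer_idx]

    def seq_len(self, layer_idx=0):
        return len(self.keys[layer_idx])

    def reset(self):
        for layer_keys, layer_values in zip(self.keys, self.values):
            layer_keys.clear()
            layer_values.clear()


cache = SimpleKVCache(n_layers=2)
tokens_processed = []

# Prefill: the whole 3-token prompt is processed once.
prompt = ["t0", "t1", "t2"]
for layer in range(2):
    cache.append(layer, [f"k_{t}" for t in prompt], [f"v_{t}" for t in prompt])
tokens_processed.append(len(prompt))

# Decoding: each step processes only the newest token and reuses the cached entries.
for new_token in ["t3", "t4"]:
    for layer in range(2):
        all_keys, all_values = cache.append(layer, [f"k_{new_token}"], [f"v_{new_token}"])
    tokens_processed.append(1)

print(tokens_processed, cache.seq_len())  # [3, 1, 1] 5
```

Without the cache, the two decoding steps would have to reprocess 4 and 5 tokens, respectively, instead of 1 token each.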

What does a Qwen3 transformer block contain?

Each block applies RMSNorm → masked attention (with RoPE and GQA) → residual add, then RMSNorm → SwiGLU feed-forward → residual add. In the 0.6B model, this block is repeated 28 times.
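
The ordering of the two sub-layers can be written down as a short skeleton. The function below takes the four components as arguments, which is an assumption made to keep the sketch self-contained; the stand-in components at the bottom are placeholders rather than real layers.

```python
def transformer_block(x, norm1, attention, norm2, feed_forward):
    # Sub-layer 1: normalize, attend, then add the result back to the input.
    x = [a + b for a, b in zip(x, attention(norm1(x)))]
    # Sub-layer 2: normalize, apply the feed-forward module, then add again.
    x = [a + b for a, b in zip(x, feed_forward(norm2(x)))]
    return x


def identity(h):
    return h


def zeros(h):
    return [0.0] * len(h)


# Placeholder components that only demonstrate the data flow.
x = [1.0, 2.0]
out = transformer_block(
    x,
    norm1=identity,
    attention=lambda h: [2.0 * v for v in h],
    norm2=identity,
    feed_forward=lambda h: [v + 1.0 for v in h],
)
print(out)  # [7.0, 13.0]

# Stacking 28 blocks whose sub-layers output zeros leaves the input unchanged,
# because the residual connections carry it straight through.
h = x
for _ in range(28):
    h = transformer_block(h, identity, zeros, identity, zeros)
print(h == x)  # True
```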

What are the main components of the Qwen3Model forward pass?

The model embeds tokens, builds a causal mask (with special handling when a KV cache is present), applies precomputed RoPE cos/sin buffers inside attention, passes through the stack of transformer blocks, applies a final RMSNorm, and projects to vocabulary logits. With caching, it tracks the current position and updates per-layer cached K/V tensors.
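
The cache-aware mask handling can be illustrated with a small helper. The function name and the convention that True means "masked out" are assumptions for this sketch; the rows correspond to the newly processed tokens and the columns to all key positions, including cached ones.

```python
def causal_mask(num_new_tokens, pos_start):
    """Rows: new tokens at positions pos_start, pos_start + 1, ...
    Columns: all key positions from 0 up to the last new token.
    True marks a future position that must not be attended to."""
    total = pos_start + num_new_tokens
    return [[key_pos > pos_start + row for key_pos in range(total)]
            for row in range(num_new_tokens)]


# Without a cache, a 3-token prompt gets the usual upper-triangular mask.
full_mask = causal_mask(num_new_tokens=3, pos_start=0)
for mask_row in full_mask:
    print(mask_row)

# With 3 cached tokens, one new token may attend to every stored position.
step_mask = causal_mask(num_new_tokens=1, pos_start=3)
print(step_mask)  # [[False, False, False, False]]
```

The single-row mask for the decoding step is the same as the row at index 3 of a full 4-token causal mask, which is why slicing a precomputed mask at the current position is sufficient.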

How does the Qwen3 tokenizer differ between base and reasoning modes?

It supports many special tokens and a chat template. The effective end-of-sequence token differs between base and chat/reasoning variants. A noteworthy quirk: when add_thinking=True, no explicit “…” block is inserted; when add_thinking=False, that block is added. This mirrors the hybrid behavior of the official Qwen3 0.6B reasoning-capable model.
