Overview

4 Improving reasoning with inference-time scaling

This chapter shows how to boost a model’s reasoning at inference time—without retraining—by spending more compute when generating answers. It introduces two practical techniques: prompting the model to write out its reasoning (chain-of-thought) and sampling multiple responses to select a consensus answer (self-consistency). The author extends a flexible text-generation wrapper so different decoding strategies can be swapped in, then demonstrates that even a simple base model that initially answers a math problem incorrectly can be steered toward the correct result by eliciting step-by-step reasoning. The core trade-off is emphasized throughout: higher accuracy usually comes from generating more tokens and/or more samples, increasing latency and cost, and not every task or model benefits uniformly (overthinking and model-specific behavior can appear).

The chapter builds the sampling tools needed for diverse but coherent candidates. It first reviews next-token selection (logits → probabilities → token) and adds temperature scaling to control diversity: lower temperature sharpens the distribution (more deterministic), higher temperature flattens it (more exploratory). To keep exploration sensible, it implements top-p (nucleus) sampling, which filters out low-probability tokens by keeping only the smallest set whose cumulative probability mass reaches a threshold p, then renormalizes before sampling. These mechanisms are integrated into the streaming generation function, along with guidance on practical settings (e.g., moderate temperatures and typical top-p cutoffs) to balance variety with relevance.
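
To make the temperature mechanics concrete, here is a minimal sketch of temperature scaling in isolation; the toy logits and temperature values are illustrative, not the chapter's actual code:

import torch

# Toy next-token logits for a 5-token vocabulary (illustrative values).
logits = torch.tensor([2.0, 1.0, 0.5, -0.5, -1.0])

for temperature in (0.5, 1.0, 2.0):
    # Temperature scaling: divide the logits by T before applying softmax.
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}:", [round(p, 3) for p in probs.tolist()])

# Lower T sharpens the distribution (more deterministic);
# higher T flattens it (more exploratory).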

With these pieces in place, the chapter implements self-consistency: generate several answers with temperature and top-p, extract the final boxed result from each, then take a majority vote. On MATH-500, chain-of-thought alone can lift the base model dramatically (from around 15% to about 41% accuracy), whereas temperature + top-p by itself yields modest gains; adding self-consistency raises the base model further into the low 30% range, and combining chain-of-thought with self-consistency can surpass 50%—at a substantial runtime cost that scales with the number and length of samples. A reasoning-tuned model also benefits (roughly 48% to 55% with voting). The chapter closes by noting that voting works best when short, extractable answers are available, and previews the next chapter’s more general inference-time method: iterative self-refinement.

A mental model of the topics covered in this book. This chapter focuses on techniques that improve reasoning without additional training (stage 3). In particular, it extends the text-generation function and implements a voting-based method to improve answer accuracy. The next chapter then introduces an inference-time scaling approach where the model iteratively refines its own answers.
Comparing inference-time scaling (this chapter) and training-time scaling (after chapter 5). Both improve accuracy by using more compute, but inference-time scaling does this on the fly, without changing the model's weight parameters. The plots are inspired by OpenAI's article introducing their first reasoning model (https://openai.com/index/learning-to-reason-with-llms/).
Overview of three inference-time methods to improve reasoning covered in this book. The first modifies the prompt to encourage step-by-step reasoning, and the second samples multiple answers and selects the most frequent one. Both are discussed in this chapter. The third method, in which the model iteratively refines its own response, is introduced in the next chapter.
The first inference-time method, chain-of-thought prompting, modifies the prompt to encourage the model to explain its reasoning step by step before producing a final answer.
The second inference-time method, self-consistency sampling, generates multiple answers and selects the most frequent one. This method relies on temperature scaling, covered in this section, which influences how the model samples its next token.
How an LLM generates the next token. As in the other process diagrams in this book, the flow runs from bottom to top. The model converts the input into token IDs, computes scores for all possible next tokens, and selects the one with the highest score as the next output.
Example of next-token logits for a 100-token slice of a language model's much larger vocabulary. Each bar represents one possible token's score within this slice, with "Berlin" having the highest logit value and being selected as the next token.
In this section, we implement the core part of temperature scaling (step 3.2), which adjusts the next-token scores. This allows us to control how confidently the model selects its next token in later steps.
The effect of temperature scaling on logits. Lower temperatures make the distribution sharper, while higher temperatures flatten it. (Please note that this visualization is shown as a line plot for readability, though a bar plot would more accurately represent the discrete vocabulary scores.)
Overview of the sampling process for generating tokens. In this section, we focus on steps 3.3 and 3.4, where the next-token scores are converted into a probability distribution, and the next token is sampled based on that distribution.
Token probabilities obtained by applying the softmax function to the rescaled logits. The token with the highest probability (corresponding to " Berlin", but with the label omitted for code simplicity) is selected as the next output.
Overview of the top-p filtering process. The filter keeps only the highest-probability tokens by sorting them, applying a cumulative cutoff, selecting the top-p subset, and renormalizing the result.
Example of token probabilities before top-p filtering. The distribution includes many low-probability tokens, which will later be truncated by applying a cumulative probability threshold.
Visualization of sorted token probabilities and their cumulative sum. This step prepares for top-p filtering by showing how probabilities accumulate when ordered from highest to lowest, which helps determine where to set the cutoff threshold.
Top-p (nucleus) filtering. Tokens are sorted by probability, and the smallest subset whose cumulative probability exceeds the threshold (p = 0.8) is kept for sampling.
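
The filtering steps just described can be sketched on a toy distribution as follows; the probabilities and the p = 0.8 threshold here are illustrative:

import torch

# Toy next-token probabilities (already softmaxed; illustrative values).
probs = torch.tensor([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
top_p = 0.8

# 1. Sort the probabilities from highest to lowest.
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
# 2. Compute the cumulative sum: [0.40, 0.65, 0.80, 0.90, 0.96, 1.00].
cumulative = torch.cumsum(sorted_probs, dim=-1)
# 3. Keep the smallest set whose cumulative mass reaches p
#    (masking on the cumulative mass *before* each token keeps
#    the token that crosses the threshold).
keep = cumulative - sorted_probs < top_p
filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
# 4. Renormalize so the kept probabilities sum to 1.
filtered = filtered / filtered.sum()
# 5. Sample a token index from the filtered distribution.
next_idx = sorted_idx[torch.multinomial(filtered, num_samples=1)]
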
Integrating top-p filtering with temperature scaling. After rescaling the next-token scores, top-p filtering is applied between steps 3.3 and 3.4 to limit sampling to the most probable tokens.
The self-consistency sampling method generates multiple responses from the LLM and selects the most frequent answer, improving accuracy through majority voting across the sampled responses.
The three main steps for implementing self-consistency sampling. First, we generate multiple answers for the same prompt, using a temperature greater than zero and top-p filtering to induce diversity. Second, we extract the final boxed answer from each generated solution. Third, we select the most frequently extracted answer as the final prediction, as shown in the sketch below.
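
A compact sketch of these three steps, assuming a generate(prompt, temperature, top_p) function like the chapter's wrapper (the helper names here are illustrative, not the book's exact code):

import re
from collections import Counter

def extract_boxed(text):
    # Return the last \boxed{...} content in a solution, or None.
    # (Handles only simple, non-nested boxed answers.)
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def self_consistency(prompt, generate, num_samples=10,
                     temperature=0.8, top_p=0.9):
    # Step 1: sample multiple diverse solutions.
    solutions = [generate(prompt, temperature=temperature, top_p=top_p)
                 for _ in range(num_samples)]
    # Step 2: extract the final boxed answer from each solution.
    answers = [a for a in map(extract_boxed, solutions) if a is not None]
    # Step 3: majority vote over the extracted answers.
    return Counter(answers).most_common(1)[0][0] if answers else None
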
Summary of this chapter's focus on inference-time techniques. Here, the text generation function was extended with a voting-based method to improve answer accuracy. The next chapter introduces self-refinement, in which the model iteratively improves its responses.

Summary

  • Reasoning abilities and answer accuracy can be improved without retraining the model by increasing compute at inference time (inference-time scaling).
  • This chapter focuses on two such techniques: chain-of-thought prompting and self-consistency; a third method, self-refinement, which was briefly described, will be covered in the next chapter.
  • A flexible text generation wrapper (generate_text_stream_concat_flex) lets different sampling strategies be plugged in without changing the surrounding code.
  • Next tokens are produced from logits, which are converted into probabilities via softmax before a token is selected or sampled.
  • Temperature scaling rescales the logits to control the diversity of the generated text.
  • Top-p (nucleus) sampling filters out low-probability tokens to reduce the chance of generating nonsensical answers.
  • Chain-of-thought prompting (like "Explain step by step." or similar) often yields more accurate answers by encouraging the model to write out intermediate reasoning, though it increases the number of generated tokens and thus increases the runtime cost.
  • Self-consistency sampling generates multiple answers, extracts the final boxed result from each, and selects the most frequent answer via majority vote to improve the answer accuracy.
  • Experiments on the MATH-500 dataset show that combining chain-of-thought prompting with self-consistency can substantially boost accuracy compared to the baseline without sampling, at the cost of much longer runtimes.
  • The central trade-off of inference-time scaling is higher accuracy in exchange for more compute.

FAQ

What is inference-time scaling, and how does it differ from training-time scaling?

Inference-time scaling (also called test-time compute scaling) improves answer quality by spending more computation while the model is generating a response, without changing its weights. Examples include prompting the model to “think” longer, sampling multiple answers, and voting among them. Training-time scaling, in contrast, expends more compute during training to improve the model’s parameters. In practice, strong systems combine both: heavy training-time compute and targeted inference-time compute.

Which inference-time techniques does this chapter cover?

This chapter focuses on two practical methods you can implement from scratch:
  • Chain-of-thought prompting: modify the prompt to elicit step-by-step explanations before the final answer.
  • Self-consistency sampling: generate multiple responses (using temperature and top-p to create diversity) and choose the most frequent final answer by majority vote.
A third method, iterative self-refinement (the model revises its own answer in multiple steps), is introduced in the next chapter.

How does chain-of-thought prompting help, and when can it hurt?

Asking the model to “Explain step by step” often boosts accuracy because:
  • Writing intermediate steps gives the model opportunities to self-correct.
  • Many training examples include worked solutions, so this matches learned patterns.
Trade-offs and caveats:
  • It increases latency and cost by generating more tokens.
  • On simple questions, it can occasionally degrade performance (“overthinking”).
  • Reasoning-tuned models may already produce explanations and gain less from an explicit chain-of-thought cue.

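In code, the technique is just a small prompt modification; a sketch (the instruction wording and the generate function are assumptions, not the chapter's exact cue):

question = "What is 15% of 240?"

# Chain-of-thought cue appended to the plain question.
cot_prompt = question + "\nExplain step by step, then state the final answer."

# response = generate(cot_prompt)  # any text-generation function
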
What is temperature in text generation, and how should I set it?

Temperature rescales next-token logits before sampling:
  • Lower than 1.0: sharper distribution, more deterministic, less diverse.
  • Around 1.0: baseline behavior.
  • Higher than 1.0: flatter distribution, more diverse but potentially less reliable.
Typical guidance:
  • 0.0 is greedy decoding (no sampling).
  • 0.3–0.8 adds modest diversity for reasoning tasks.
  • Very high values are best for creative tasks and broad exploration, not precise reasoning.

What is top-p (nucleus) sampling, and how is it different from top-k?

Top-p keeps the smallest set of highest-probability tokens whose cumulative probability mass reaches a threshold p (for example, 0.8 or 0.9), renormalizes them, and samples from that subset. This filters out low-confidence tokens that can derail coherence. Top-k instead keeps a fixed number k of the most likely tokens. In short:
  • Top-p: variable subset size based on cumulative mass.
  • Top-k: fixed subset size by rank.
Top-p has become popular because it adapts to the shape of the distribution.

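A minimal sketch contrasting the two filters on a 1-D probability vector (illustrative helper functions, not a library API):

import torch

def top_k_filter(probs, k):
    # Keep the k most likely tokens, zero out the rest, renormalize.
    values, indices = torch.topk(probs, k)
    filtered = torch.zeros_like(probs)
    filtered[indices] = values
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative mass reaches p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < p
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()
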
How do I integrate temperature scaling and top-p into the generation loop?

The pipeline is:
  • Get next-token logits from the model.
  • Scale logits by temperature.
  • Convert to probabilities via softmax.
  • Apply a top-p filter to drop low-mass tokens and renormalize.
  • Sample the next token with multinomial sampling.
This is a drop-in replacement for greedy decoding inside your generation loop. The chapter provides flexible wrappers so you can swap generation functions and settings without changing surrounding code.

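Putting the pipeline together for one decoding step; a sketch that assumes a model returning logits of shape (batch, seq_len, vocab_size), not the chapter's exact generate_text_stream_concat_flex code:

import torch

def sample_next_token(model, token_ids, temperature=0.8, top_p=0.9):
    with torch.no_grad():
        logits = model(token_ids)[:, -1, :]  # logits for the last position
    if temperature == 0.0:
        return torch.argmax(logits, dim=-1, keepdim=True)  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature + softmax
    # Top-p filter: zero out tokens outside the nucleus, then renormalize.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p
    sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)  # sample the next token
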
What is self-consistency sampling, and how is it implemented?

Self-consistency is majority voting across multiple sampled solutions:
1) Generate multiple answers for the same prompt with temperature > 0 and top-p enabled to induce diversity.
2) Extract each answer’s final result (for example, the boxed number).
3) Choose the most frequent final answer as the prediction.
Notes:
  • You can parallelize sampling across devices to reduce wall-clock time.
  • Implement tie-breaking and consider early stopping once a majority emerges.

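One way to implement the early-stopping note: after each sample, check whether the leading answer can still be overtaken by the remaining draws (a sketch; generate and extract_boxed are assumed helpers, not the chapter's code):

from collections import Counter

def self_consistency_early_stop(prompt, generate, extract_boxed,
                                num_samples=10, **sampling_kwargs):
    counts = Counter()
    for i in range(1, num_samples + 1):
        answer = extract_boxed(generate(prompt, **sampling_kwargs))
        if answer is not None:
            counts[answer] += 1
            leader, votes = counts.most_common(1)[0]
            runner_up = max((v for a, v in counts.items() if a != leader),
                            default=0)
            # Stop once the leader cannot be overtaken by remaining samples.
            if votes > runner_up + (num_samples - i):
                return leader
    # Fall back to a plain majority vote (first-seen answer wins ties).
    return counts.most_common(1)[0][0] if counts else None
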
How should I choose num_samples, temperature, and top-p for self-consistency?

Practical starting points:
  • Temperature: about 0.5–0.9.
  • Top-p: about 0.7–0.9.
  • num_samples: 3–10 (diminishing returns beyond a small handful, with rising cost).
Tips:
  • If the sampled answers look nearly identical, slightly increase temperature or top-p.
  • If answers get nonsensical, reduce temperature first.
  • Consider early stopping once >50% of samples agree.
  • Run samples in parallel to shorten latency (total compute still increases).

Why does the chapter sometimes recommend running on CPU instead of GPU/MPS?

The sampling routines (for example, multinomial draws) and some low-level differences can yield slightly different outputs across devices, and certain PyTorch versions may show instability when drawing many samples on GPU/MPS. Running on CPU improves reproducibility for the examples, aligns with the book’s reported results, and avoids device-specific quirks.

What accuracy gains did these methods show on MATH-500, and what are the trade-offs?

Representative results from the chapter’s experiments:
  • Baseline (base model, greedy): ~15.2%.
  • Chain-of-thought on base model: ~40.6%.
  • Temperature + top-p alone on base model: ~17.8% (diversity control, not a standalone booster).
  • Self-consistency on base (n=10, with temp+top-p): ~31.6%.
  • CoT + temp/top-p + self-consistency on base (n=10): ~52.0%.
  • Reasoning model (greedy): ~48.2%; with self-consistency (n=3): ~55.2%.
Key trade-off: higher accuracy comes from generating longer and/or multiple responses, which increases inference compute, latency, and cost.
