Build a Reasoning Model (From Scratch)

Sebastian Raschka

MEAP began August 2025
Last updated November 2025
Publication in Summer 2026 (estimated)

ISBN 9781633434677
375 pages (estimated)

Included with a Manning Online subscription

printed in color

Table of contents

1 Understanding reasoning models

1.1 Defining reasoning in the context of LLMs

1.2 Understanding the standard LLM training pipeline

1.3 Improving LLM reasoning with training and inference techniques

1.4 Modeling language through pattern matching

1.5 Simulating reasoning without explicit rules

1.6 Why build reasoning models from scratch?

1.7 A roadmap to reasoning models from scratch

1.8 Summary

2 Generating text with a pre-trained LLM

2.1 Introduction to LLMs for text generation

2.2 Setting up the coding environment

2.3 Understanding hardware needs and recommendations

2.4 Preparing input texts for LLMs

2.5 Loading pre-trained models

2.6 Understanding the sequential LLM text generation process

2.7 Coding a minimal text generation function

2.8 Faster inference via KV caching

2.9 Faster inference via PyTorch model compilation

2.10 Summary

3 Evaluating reasoning models

3.1 Building a math verifier

3.2 Loading a pre-trained model to generate text

3.3 Implementing a wrapper for easier text generation

3.4 Extracting the final answer box

3.5 Normalizing the extracted answer

3.6 Verifying mathematical equivalence

3.7 Grading answers

3.8 Loading the evaluation dataset

3.9 Evaluating the model

3.10 Summary

4 Improving reasoning with inference-time scaling

4.1 Introduction to inference-time scaling

4.2 Loading a pre-trained model

4.3 Generating better responses with chain-of-thought prompting

4.4 Controlling output diversity with temperature scaling

4.4.1 Understanding the process of selecting the next token

4.4.2 Rescaling token scores (logits) via a temperature parameter

4.4.3 Sampling the next token from a probability distribution

4.4.4 Adding temperature scaling to the text generation function

4.5 Balancing diversity and coherence with top-p sampling

4.5.1 Selecting a subset of top-p tokens

4.5.2 Adding a top-p filter to the text generation function

4.6 Improving response accuracy with self-consistency

4.7 Summary

5 Training reasoning models with reinforcement learning

6 Distilling reasoning models for efficient reasoning

7 Improving the reasoning pipeline and future research directions

Appendices

Appendix A: References and further reading

A.1 Chapter 1

A.1.1 References

A.1.2 Further Reading

A.2 Chapter 2

A.2.1 References

A.2.2 Further Reading

A.3 Chapter 3

A.3.1 References

A.3.2 Further Reading

A.4 Chapter 4

A.4.1 References

A.4.2 Further Reading

A.5 Appendix F

A.5.1 References

A.5.2 Further Reading

Appendix B: Exercise solutions

B.1 Chapter 2

B.2 Chapter 3

B.3 Chapter 4

Appendix C: Qwen3 LLM source code

C.1 Root mean square layer normalization (RMSNorm)

C.2 Feed forward module

C.3 Rotary position embeddings (RoPE)

C.4 Grouped query attention (GQA)

C.5 Transformer block

C.6 Main model code

C.7 KV cache

C.8 Tokenizer

C.9 Using the model

Appendix D: Using larger LLMs

Appendix E: Batched inference

Appendix F: Common approaches to model evaluation

F.1 Understanding the main evaluation methods for LLMs

F.2 Evaluating answer-choice accuracy

F.2.1 Loading the model

F.2.2 Checking the generated answer letter

F.3 Using verifiers to check answers

F.4 Comparing models using preferences and leaderboards

F.5 Judging responses with other LLMs

F.5.1 Implementing an LLM-as-a-judge approach in Ollama

Overview

1 Understanding reasoning models

This chapter introduces reasoning as the next stage in large language models and defines it pragmatically: an LLM makes its intermediate steps explicit before producing an answer, a process often referred to as chain-of-thought. While such traces can look human-like, the model is still generating tokens probabilistically and is not performing guaranteed, rule-based logic. The book takes a hands-on, code-first approach aimed at practitioners, starting from an existing pre-trained model and incrementally adding reasoning capabilities to show how the methods work in practice.

To ground the discussion, the chapter reviews the standard LLM pipeline of pre-training (next-token prediction on massive text corpora) followed by post-training via instruction tuning and preference tuning. It then outlines three families of techniques that enhance reasoning on top of this baseline: inference-time compute scaling (e.g., prompting and sampling strategies that “think” longer without changing weights), reinforcement learning (updating weights using rewards from verifiable tasks or environments, distinct from human-preference RLHF), and distillation (supervised fine-tuning on high-quality, step-by-step data generated by stronger models). The chapter contrasts statistical pattern matching with logical reasoning, showing how LLMs can simulate reasoning through learned correlations yet remain vulnerable in novel or highly complex multi-step scenarios.

Finally, the chapter motivates building reasoning models from scratch by highlighting recent momentum in the field and the practical trade-offs involved. Reasoning can improve performance on challenging tasks in math, coding, and puzzles, but it is not always desirable: it tends to increase verbosity, incur higher compute costs due to longer outputs and multiple inference calls, and overthink simple requests. The roadmap for the book is to start with a capable base model, establish evaluation, apply inference-time techniques, and then implement training-based methods, equipping readers to design, prototype, and assess reasoning-enhanced LLMs with a clear understanding of costs and benefits.

A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.

Overview of a typical LLM training pipeline. The process begins with an initial model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.

Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.

Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-time compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initial model training, pre-training, and post-training with instruction and preference tuning).

Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "Penguin can fly." This conclusion conflicts with the established fact "Penguin cannot fly," which results in a contradiction.
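For contrast with an LLM, the penguin example can be written as an explicit rule check. The sketch below is only an illustration of rule-based contradiction detection; the tiny knowledge base and the names `infer_can_fly` and `find_contradictions` are hypothetical and do not come from the book.

```python
# Hypothetical mini knowledge base; all names are illustrative assumptions.
facts = {("penguin", "is_bird")}            # premise: a penguin is a bird
known = {("penguin", "can_fly"): False}     # established fact: penguins cannot fly


def infer_can_fly(entity, facts):
    """Rule 'all birds can fly': True for birds, None (unknown) otherwise."""
    return True if (entity, "is_bird") in facts else None


def find_contradictions(facts, known):
    """Return entities whose inferred 'can_fly' value conflicts with a known fact."""
    conflicts = []
    for (entity, prop), value in known.items():
        inferred = infer_can_fly(entity, facts)
        if prop == "can_fly" and inferred is not None and inferred != value:
            conflicts.append(entity)
    return conflicts


contradictions = find_contradictions(facts, known)
print(contradictions)  # ['penguin']
```

A rule-based system flags the conflict deterministically every time, which is the behavior an LLM does not guarantee.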

An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.

Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
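The decoding loop described in this caption can be sketched in a few lines. This is a minimal sketch, assuming a toy lookup table in place of a real model: `NEXT_TOKEN` and `toy_next_token` are hypothetical stand-ins, and a real LLM would instead score every vocabulary token given the full sequence.

```python
# Toy stand-in for an LLM (an assumption for illustration): maps the last
# token to the next one. A real model conditions on the whole sequence.
NEXT_TOKEN = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": "<eos>",
}


def toy_next_token(tokens):
    """Hypothetical model call: receives the full sequence, returns one token."""
    return NEXT_TOKEN.get(tokens[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens=10, eos="<eos>"):
    """Generate tokens one at a time until eos or the token budget is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)  # predict from the sequence so far
        if next_token == eos:
            break
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens


output = generate(["The", "capital"])
print(" ".join(output))  # The capital of France is Paris
```

The same loop applies whether the generated tokens form a short answer or a long chain of intermediate reasoning steps; only the number of iterations changes.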

A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as the base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced via the reasoning methods in stages 3 and 4.

Summary

- Conventional LLM training occurs in several stages:
  - Pre-training, where the model learns language patterns from vast amounts of text.
  - Instruction fine-tuning, which improves the model's responses to user prompts.
  - Preference tuning, which aligns model outputs with human preferences.
- Reasoning methods are applied on top of a conventional LLM.
- Reasoning in LLMs involves systematically solving multi-step tasks using intermediate steps (chain-of-thought).
- Reasoning in LLMs differs from rule-based reasoning and likely also works differently from human reasoning; the current consensus is that it relies on statistical pattern matching.
- Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
- Improving reasoning in LLMs can be achieved through:
  - Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
  - Reinforcement learning, training models explicitly with reward signals.
  - Supervised fine-tuning and distillation, using examples from stronger reasoning models.
- Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.

FAQ

What does “reasoning” mean in the context of LLMs?

In this chapter, reasoning means an LLM shows the intermediate steps it used to reach an answer before giving the final response. Making these steps explicit often improves accuracy on complex tasks such as coding, logic puzzles, and multi-step math.

What is Chain-of-Thought (CoT)?

Chain-of-Thought is a prompting and generation style where the model produces step-by-step intermediate explanations on the way to an answer. It makes the model’s process more explicit and can improve performance, but it does not imply the model “thinks” like a human.

How is LLM “reasoning” different from human or symbolic logical reasoning?

LLM reasoning is probabilistic and autoregressive—predicting one token at a time from learned patterns—so its steps can be convincing yet not guaranteed logically sound. Symbolic or human reasoning can apply explicit rules and track contradictions deterministically. LLMs generally do not.

What are the standard stages of the LLM training pipeline?

Two main stages: (1) Pre-training on massive unlabeled text via next-token prediction to learn general language patterns; (2) Post-training, which includes supervised fine-tuning (instruction tuning) to follow tasks/instructions, and preference tuning (often via RLHF) to align outputs with human preferences. A chatbot experience typically adds an orchestration layer (system prompt, conversation history), beyond the model itself.

What’s the difference between instruction tuning and preference tuning (RLHF)?

- Instruction tuning (SFT) teaches models to follow task instructions using labeled examples.
- Preference tuning (often RLHF) optimizes the model to produce outputs humans prefer, using human feedback (rankings/ratings) as a reward signal to refine style, helpfulness, and safety.

What are the main approaches to improving LLM reasoning after base training?

Three broad categories:
- Inference-time compute scaling: improve performance at inference without changing weights (e.g., CoT prompting, multiple-sample generation, verifier-guided selection).
- Reinforcement learning (RL): update weights using rewards tied to task success or verifiable correctness (e.g., math/coding).
- Distillation: transfer reasoning behaviors from a stronger model to a smaller one via supervised fine-tuning on high-quality, teacher-generated data.
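As a rough illustration of the first category, multiple-sample generation is often combined with a majority vote over the final answers (self-consistency, which the table of contents lists for chapter 4). The sketch below assumes the final answers have already been extracted from each sampled response; `majority_vote` and the sample data are hypothetical, not the book's code.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled responses
sampled_answers = ["42", "42", "41", "42", "40"]
best, share = majority_vote(sampled_answers)
print(best, share)  # 42 0.6
```

No weights change here; the extra accuracy, if any, is paid for with additional inference calls. In a tie, `Counter.most_common` returns the answer that was seen first.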

How does RL for reasoning differ from RLHF used in preference tuning?

Both use RL, but they differ in rewards. RLHF uses human judgments to align with human preferences. RL for reasoning typically uses automated, objective signals (e.g., correctness checks, verifiers, environment rewards) to directly optimize task success and reasoning quality.
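To make the idea of an automated, objective signal concrete, here is a minimal sketch of a correctness-based reward. It assumes the model is asked to put its final answer inside `\boxed{...}` and that a reference answer is available; `extract_boxed` and `correctness_reward` are hypothetical helpers, not the book's implementation.

```python
import re


def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    # Simplification: does not handle nested braces such as \boxed{\frac{1}{2}}
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def correctness_reward(response, reference):
    """Hypothetical reward: 1.0 if the boxed answer equals the reference, else 0.0."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


print(correctness_reward(r"2 + 2 = 4, so \boxed{4}", "4"))  # 1.0
print(correctness_reward("The answer is 4", "4"))           # 0.0
```

Exact string comparison is the simplest possible check; it would score mathematically equivalent forms such as `1/2` and `0.5` as a mismatch, so a practical verifier needs normalization and equivalence checking on top of it.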

How do pattern matching and logical reasoning differ in LLMs?

LLMs mostly perform pattern matching: they continue text based on statistical associations from training. They don’t explicitly track contradictions or apply formal rules. For example, the “penguin” scenario may be answered correctly if the model has seen many examples linking penguins to “cannot fly,” but this is still pattern-based rather than explicit rule checking.

When should I use a reasoning model, and what are the trade-offs?

Use reasoning models for complex, multi-step tasks (advanced math, coding, puzzles). For simpler tasks (summarization, translation, factual Q&A), conventional models may be faster and cheaper. Trade-offs include higher cost and latency due to: (1) longer outputs with intermediate steps (more tokens, more forward passes), and (2) multi-call workflows (sampling, tools, verifiers) that multiply token usage. Reasoning models can also “overthink” easy tasks.
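The two cost drivers multiply, which a back-of-the-envelope calculation makes visible. The numbers below are illustrative assumptions rather than measurements, and `total_generated_tokens` is a hypothetical helper that counts only generated tokens.

```python
def total_generated_tokens(tokens_per_response, num_calls=1):
    """Generated tokens across all calls; each one requires a forward pass."""
    return tokens_per_response * num_calls


# Illustrative numbers only, not measurements
direct = total_generated_tokens(tokens_per_response=50)                 # short direct answer
with_steps = total_generated_tokens(tokens_per_response=500)            # answer with intermediate steps
sampled = total_generated_tokens(tokens_per_response=500, num_calls=8)  # 8 sampled responses

print(direct, with_steps, sampled)  # 50 500 4000
print(sampled / direct)             # 80.0
```

Under these assumptions, a tenfold increase in response length combined with eight samples leads to 80 times as many generated tokens as the direct answer.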

What roadmap does the book follow to build reasoning models from scratch?

Four stages: (1) Start from a conventional instruction-tuned LLM; (2) Establish evaluation methods for reasoning; (3) Apply inference-time techniques to boost reasoning behavior; (4) Train with methods like RL and distillation to develop dedicated reasoning models. The book takes a hands-on, code-first approach to implement these steps from scratch.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $28.79

you save $19.20 (40%)

include audio $24.99 $14.99
