Overview

1 Understanding reasoning models

This chapter introduces reasoning as it is used in large language models: making intermediate steps explicit before giving an answer, often called chain-of-thought. It clarifies that, unlike deterministic symbolic systems, LLM “reasoning” is probabilistic next-token prediction that can look convincing without guaranteeing logical soundness. Framing the book’s goal, the author emphasizes a hands-on, code-first approach that starts from a pre-trained LLM and adds reasoning capabilities from scratch so practitioners can understand how these methods work in practice and where they help or fall short.

To ground the discussion, the chapter reviews the conventional LLM pipeline: massive pre-training for next-token prediction, followed by post-training via supervised instruction tuning and preference tuning (e.g., RLHF), with chat behavior layered on top. It then outlines three major routes to stronger reasoning: inference-time compute scaling (e.g., step-by-step prompting and sampling strategies that spend more compute at inference), reinforcement learning that updates weights using verifiable rewards for task success, and distillation that transfers reasoning behaviors from stronger models into smaller ones through high-quality supervision. These techniques are positioned as extensions to the standard pipeline that specifically target complex, multi-step tasks in coding, math, and logic.

The chapter contrasts pattern matching with logical reasoning using examples: LLMs often answer correctly by drawing on statistical associations (e.g., “Berlin” for a capital, or common corrections of “penguins can’t fly”) rather than explicit rule application, which works well in familiar contexts but can fail on novel or intricate problems. It argues for building reasoning systems from scratch to understand trade-offs, especially since reasoning models can be more verbose, costlier, and sometimes prone to overthinking, so they should be applied selectively where complexity warrants it. The roadmap ahead is to load a capable base model, establish evaluation to track gains, and then incrementally add reasoning via inference techniques and targeted training, culminating in practical, testable improvements.

A simplified illustration of how a conventional, non-reasoning LLM might respond to a question with a short answer.
A simplified illustration of how a reasoning LLM might tackle a multi-step reasoning task using a chain-of-thought. Rather than just recalling a fact, the model combines several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.
Overview of a typical LLM training pipeline. The process begins with an initial model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.
Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.
Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initial model training, pre-training, and post-training with instruction and preference tuning), but reasoning techniques can also be applied to the pre-trained base model.
Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "A penguin can fly." This conclusion conflicts with the established fact that penguins cannot fly, which results in a contradiction.
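The contradiction in this figure can be reproduced mechanically with a toy rule-based system. The sketch below (plain Python, with facts and rules invented purely for illustration) derives the conflicting conclusion by forward chaining, showing how deterministic inference surfaces an inconsistency that an LLM would only resolve statistically.

```python
# Toy rule-based inference reproducing the caption's contradiction.
# Facts and rules are explicit, so the clash is detected deterministically,
# unlike an LLM, which smooths over it via learned associations.

facts = {"penguin_is_bird", "penguin_cannot_fly"}
rules = [
    # (premises, conclusion): "All birds can fly" applied to "A penguin is a bird"
    ({"penguin_is_bird"}, "penguin_can_fly"),
]

def forward_chain(facts, rules):
    """Repeatedly apply rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

derived = forward_chain(facts, rules)
contradiction = "penguin_can_fly" in derived and "penguin_cannot_fly" in derived
print(contradiction)  # True
```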
An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
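The decoding loop described in this caption can be sketched with a stand-in model. Here `next_token` is a hypothetical placeholder for a real LLM's forward pass plus sampling; a hard-coded lookup table keeps the example self-contained and runnable.

```python
# Minimal sketch of autoregressive, token-by-token decoding.
# next_token is a toy stand-in for a real model: a real LLM would score
# every vocabulary token given the sequence so far and sample (or argmax)
# the next one; this version just follows a hard-coded lookup.

def next_token(sequence):
    transitions = {
        ("The",): "capital",
        ("The", "capital"): "of",
        ("The", "capital", "of"): "Germany",
        ("The", "capital", "of", "Germany"): "is",
        ("The", "capital", "of", "Germany", "is"): "Berlin",
    }
    return transitions.get(tuple(sequence), "<eos>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # predict the next token from the full sequence
        if tok == "<eos>":         # stop at end-of-sequence
            break
        tokens.append(tok)         # append and feed back in on the next step
    return tokens

print(generate(["The"]))
# ['The', 'capital', 'of', 'Germany', 'is', 'Berlin']
```

The same loop underlies both standard and reasoning-focused models; reasoning models simply spend many of these steps on intermediate chain-of-thought tokens before the final answer.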
A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced via the reasoning methods in stages 3 and 4.

Summary

  • Conventional LLM training occurs in several stages:
    • Pre-training, where the model learns language patterns from vast amounts of text.
    • Instruction fine-tuning, which improves the model's responses to user prompts.
    • Preference tuning, which aligns model outputs with human preferences.
  • Reasoning methods are applied on top of a conventional LLM.
  • Reasoning in LLMs refers to improving a model so that it explicitly generates intermediate steps (chain-of-thought) before producing a final answer, which often increases accuracy on multi-step tasks.
  • Reasoning in LLMs differs from rule-based reasoning, and it likely also works differently from human reasoning; the current consensus is that reasoning in LLMs relies on statistical pattern matching.
  • Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
  • Improving reasoning in LLMs can be achieved through:
    • Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
    • Reinforcement learning, training models explicitly with reward signals.
    • Supervised fine-tuning and distillation, using examples from stronger reasoning models.
  • Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.
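As a concrete illustration of the inference-time compute scaling bullet above, one popular technique is self-consistency: sample several chains of thought and take a majority vote over the final answers. In the sketch below, `sample_answer` is a hypothetical stand-in for a temperature-sampled LLM call; it cycles through canned outputs so the example runs on its own.

```python
# Sketch of self-consistency: sample n chain-of-thought answers and
# majority-vote. sample_answer is a placeholder for an LLM call with
# temperature > 0; here it cycles through canned final answers.

from collections import Counter
import itertools

_canned = itertools.cycle(["14", "14", "15", "14", "13"])

def sample_answer(prompt):
    # Placeholder: a real system would sample a full chain of thought
    # from the LLM and extract the final answer from it.
    return next(_canned)

def self_consistency(prompt, n_samples=5):
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples  # (majority answer, agreement rate)

ans, agreement = self_consistency("What is 8 + 6?")
print(ans, agreement)  # 14 0.6
```

Note that the vote is over final answers only; the sampled reasoning chains may differ while still converging on the same conclusion.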

FAQ

What does “reasoning” mean in the context of LLMs?
In this book, reasoning means an LLM explains the intermediate steps it takes before giving a final answer. Making these steps explicit (often called chain-of-thought) often improves accuracy on complex tasks like coding, logic puzzles, and multi-step math.

Do LLMs actually think like humans?
No. The terms “reasoning” and “thinking” are used in the LLM community, but current models generate tokens probabilistically from patterns in data. Their outputs can look human-like, yet the underlying mechanism differs substantially from human or rule-based reasoning.

What is chain-of-thought (CoT) and why does it help?
CoT is the model’s step-by-step explanation of how it arrives at an answer. By externalizing intermediate steps, the model is guided to “work through” multi-step problems, which often boosts correctness and makes the process more transparent (these steps may or may not be shown to users).

How does LLM “reasoning” differ from symbolic or deterministic logic?
Symbolic systems follow fixed, rule-based procedures that guarantee consistency. LLMs generate text autoregressively, so their “reasoning steps” are probabilistic and not guaranteed to be logically sound, even if they read as convincing explanations.

What are the standard stages of LLM training before adding reasoning methods?
Two main stages: pre-training (next-token prediction on massive text to learn language) and post-training, which includes instruction fine-tuning (SFT) to follow tasks and preference tuning (often RLHF) to align style and behavior. A separate chat layer typically handles multi-turn interaction.

Which approaches can improve LLM reasoning after conventional training?
Three broad categories: (1) inference-time compute scaling (improves performance at inference without changing weights), (2) reinforcement learning for reasoning (updates weights using rewards tied to task success), and (3) distillation (transferring reasoning patterns from stronger models into smaller ones).

What is inference-time compute scaling, and when should I use it?
It trades extra inference compute for better performance without retraining the model. Techniques include CoT prompting, sampling multiple candidates, and verifier-guided selection. These are useful for quickly boosting fixed models, at the cost of higher latency and token usage.

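Verifier-guided selection, mentioned in the answer above, can be sketched as best-of-n: draw several candidate solutions and keep the one the verifier scores highest. Both `generate_candidate` and `verifier_score` below are hypothetical stand-ins (for an LLM sampler and an answer checker, respectively), with a deterministic toy quality signal so the example runs on its own.

```python
# Sketch of verifier-guided best-of-n selection.
# generate_candidate stands in for sampling a full solution from an LLM;
# verifier_score stands in for a checker that scores each candidate
# (e.g., by running the code or re-deriving the math).

def generate_candidate(prompt, seed):
    # Toy stand-in: returns (solution text, fake quality signal).
    return f"solution-{seed}", (seed * 37) % 10

def verifier_score(candidate):
    text, quality = candidate
    return quality  # a real verifier would inspect the text itself

def best_of_n(prompt, n=8):
    candidates = [generate_candidate(prompt, s) for s in range(n)]
    return max(candidates, key=verifier_score)  # keep the top-scoring one

print(best_of_n("Solve: 2 + 2")[0])  # solution-7
```

This spends n times the generation compute per query, which is exactly the latency/token-cost trade-off the answer above describes.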
How does reinforcement learning for reasoning differ from RLHF?
Both use RL, but they differ in their reward signals. RL for reasoning typically uses automated, verifiable rewards (e.g., correctness in math or coding), while RLHF relies on human judgments to shape preferred behavior. The former targets task success; the latter targets alignment with human preferences.

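A verifiable reward of the kind described above can be sketched in a few lines. The `####` marker used to locate the final answer is an assumed convention for this illustration, not a fixed standard.

```python
# Sketch of a verifiable reward for RL-style reasoning training:
# reward 1.0 if the model's final answer matches the ground truth, else 0.0.
# Assumption (for illustration only): the final answer follows a "####" marker.

def extract_final_answer(completion):
    # Take whatever follows the last "####" marker and strip whitespace.
    return completion.rsplit("####", 1)[-1].strip()

def verifiable_reward(completion, ground_truth):
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

print(verifiable_reward("step 1 ... step 2 ... #### 42", "42"))  # 1.0
print(verifiable_reward("step 1 ... step 2 ... #### 41", "42"))  # 0.0
```

Because the reward is computed automatically from a known answer rather than from human ratings, it scales to large training runs, which is the key practical difference from RLHF rewards.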
What is distillation in the context of reasoning models?
Distillation fine-tunes a smaller “student” model on high-quality data (often generated by a larger “teacher” model) to transfer reasoning behaviors. In LLM practice this is commonly supervised fine-tuning on teacher-produced outputs, which differs from classic knowledge distillation, where the student also matches the teacher’s logits.

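A rough sketch of how teacher outputs might be packaged as SFT examples for a student follows; the `<think>...</think>` tagging convention used to delimit the chain of thought is an assumption for illustration, not a fixed standard.

```python
# Sketch of distillation data preparation: turn teacher-generated
# (prompt, chain-of-thought, final answer) triples into SFT examples
# that a student model can be fine-tuned on.

def to_sft_example(prompt, teacher_cot, teacher_answer):
    # Assumed convention: chain of thought wrapped in <think> tags,
    # followed by the final answer on its own line.
    target = f"<think>{teacher_cot}</think>\n{teacher_answer}"
    return {"prompt": prompt, "completion": target}

example = to_sft_example(
    "What is 12 * 12?",
    "12 * 12 = 12 * 10 + 12 * 2 = 120 + 24 = 144.",
    "144",
)
print(example["completion"])
```

Fine-tuning the student on many such examples teaches it to emit the teacher's reasoning style, which is SFT on outputs only, as distinct from classic logit-matching distillation.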
Why build reasoning models from scratch, and when should I avoid using them?
Building from scratch clarifies how methods work and the cost–quality trade-offs involved. Reasoning models can be slower and pricier due to longer outputs and multi-call workflows, and they may “overthink” simple tasks; use them for complex reasoning (math, coding, puzzles) and prefer lighter models for summarization, translation, or straightforward Q&A.
