Overview

1 Understanding reasoning models

Reasoning in large language models is framed as the ability to show intermediate steps on the way to an answer, often called chain-of-thought. This step-by-step articulation tends to boost accuracy on complex tasks such as coding, logic puzzles, and multi-step math, and is central to making more agentic AI practical. Unlike deterministic, rule-based systems, LLMs generate these steps probabilistically via next-token prediction, so their “reasoning” can be persuasive without guaranteed logical soundness. The chapter adopts this practical, hands-on focus, clarifying that terms like reasoning and thinking are used in an engineering sense rather than to imply human-like cognition.
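A minimal sketch of the simplest way to elicit such intermediate steps, chain-of-thought prompting. Note that `query_llm` is a hypothetical placeholder for whatever model or API you use, not a real function:

```python
def build_cot_prompt(question: str) -> str:
    # Appending an explicit instruction to show intermediate steps
    # is the simplest form of chain-of-thought prompting.
    return f"{question}\nLet's think step by step."


def query_llm(prompt: str) -> str:
    # Placeholder: substitute a call to your model or API of choice.
    raise NotImplementedError


prompt = build_cot_prompt(
    "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
)
# The returned completion would then contain intermediate steps
# (e.g., "45 minutes is 0.75 hours; 60 / 0.75 = 80 km/h") before the answer.
```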

The chapter revisits the standard LLM pipeline—massive pre-training for next-token prediction followed by post-training through instruction tuning and preference tuning—and positions reasoning enhancements as additional layers on top. It groups the main approaches into three categories: inference-time compute scaling (spending more thinking time or samples at inference without changing weights), reinforcement learning with verifiable rewards to update weights (distinct from RLHF used for preference alignment), and distillation that transfers behaviors from stronger teachers to smaller students. It contrasts pattern matching with logical reasoning, noting that LLMs often simulate reasoning by leveraging statistical associations rather than explicit rule-checking, which works well in familiar contexts but can falter on novel or deeply compositional problems.
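One way to make the first family concrete is self-consistency: an inference-time compute scaling technique that samples several chain-of-thought completions and majority-votes on the final answer, with no weight updates. The sketch below assumes a `sample_fn` callable that returns one final answer per call (a stand-in for a real sampling API):

```python
from collections import Counter


def self_consistency(sample_fn, question, n_samples=5):
    """Draw several answers and return the most common one.
    Spends more inference compute without changing model weights."""
    finals = [sample_fn(question) for _ in range(n_samples)]
    answer, _count = Counter(finals).most_common(1)[0]
    return answer


# Deterministic toy sampler standing in for a real model.
samples = iter(["12", "13", "12", "12", "11"])
print(self_consistency(lambda q: next(samples), "2 + 10 = ?", n_samples=5))
# → 12  (majority vote over the five sampled answers)
```

In practice, `sample_fn` would call the model with a nonzero temperature so that repeated draws explore different reasoning paths.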

The motivation for building reasoning models from scratch is both strategic and practical: the field is rapidly shifting toward models that know when and how long to think, yet reasoning is not universally beneficial. These systems can be costlier due to longer outputs and multi-sample workflows, and may overthink simple tasks, so choosing the right tool matters. Implementing methods end to end helps practitioners understand these trade-offs. The roadmap starts from a conventional, post-trained base model, establishes evaluation to track gains, and then adds inference-time techniques followed by training-based methods, equipping readers to design, prototype, and assess modern reasoning approaches.

A simplified illustration of how a conventional, non-reasoning LLM might respond to a question with a short answer.
A simplified illustration of how a reasoning LLM might tackle a multi-step reasoning task using a chain-of-thought. Rather than just recalling a fact, the model combines several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.
Overview of a typical LLM training pipeline. The process begins with a model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.
Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.
Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-time compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (model initialization, pre-training, and post-training with instruction and preference tuning), but reasoning techniques can also be applied directly to the pre-trained base model.
Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "A penguin can fly." This conclusion conflicts with the established fact "A penguin cannot fly," which results in a contradiction.
An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
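The token-by-token loop described in this caption can be sketched with a toy lookup table standing in for a real model's forward pass (all names here are illustrative, not taken from the book's code):

```python
# Toy "model": maps the sequence generated so far to the next token.
TOY_MODEL = {
    ("The",): "cat",
    ("The", "cat"): "sat",
    ("The", "cat", "sat"): "<eos>",
}


def next_token(tokens):
    # A real LLM would run a forward pass over the full sequence
    # and sample (or argmax) from the resulting logits.
    return TOY_MODEL.get(tuple(tokens), "<eos>")


def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)  # append, then feed back in as input
    return tokens


print(generate(["The"]))  # → ['The', 'cat', 'sat']
```

Both standard and reasoning-focused models use this same loop; reasoning models simply spend more of it on intermediate steps.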
A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as the base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced via the reasoning methods in stages 3 and 4.

Summary

  • Conventional LLM training occurs in several stages:
    • Pre-training, where the model learns language patterns from vast amounts of text.
    • Instruction fine-tuning, which improves the model's responses to user prompts.
    • Preference tuning, which aligns model outputs with human preferences.
  • Reasoning methods are applied on top of a conventional LLM.
  • Reasoning in LLMs refers to improving a model so that it explicitly generates intermediate steps (chain-of-thought) before producing a final answer, which often increases accuracy on multi-step tasks.
  • Reasoning in LLMs differs from rule-based reasoning and also likely works differently from human reasoning; the current consensus is that reasoning in LLMs relies on statistical pattern matching.
  • Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
  • Improving reasoning in LLMs can be achieved through:
    • Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
    • Reinforcement learning, training models explicitly with reward signals.
    • Supervised fine-tuning and distillation, using examples from stronger reasoning models.
  • Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.

FAQ

What does “reasoning” mean for LLMs in this book?
Reasoning means the model shows how it arrived at an answer by producing intermediate steps before the final response. Making these steps explicit (often called chain-of-thought) frequently improves accuracy on complex tasks like math, coding, and logic puzzles.

Does chain-of-thought mean LLMs truly “think” like humans?
No. LLMs generate tokens probabilistically based on patterns learned from data. Their “reasoning” can look human-like, but it is not the same as human, rule-based, or world-model-driven reasoning.

How does LLM reasoning differ from traditional logic or theorem provers?
Rule-based systems follow deterministic, verifiable steps that guarantee consistency. LLMs produce reasoning autoregressively and probabilistically, so intermediate steps may be plausible but are not guaranteed to be logically sound.

What are the standard stages of LLM training before adding reasoning methods?
Two main stages: (1) pre-training on massive text for next-token prediction (language capability); (2) post-training with instruction fine-tuning (SFT) to follow prompts and preference tuning (e.g., RLHF) to align style and tone. A chat interface is an additional orchestration layer, not a training stage.

What are the main approaches to improve LLM reasoning after standard training?
Three families: (1) inference-time compute scaling (e.g., CoT prompting, multiple sampling) without changing weights; (2) reinforcement learning using verifiable rewards to update weights; (3) distillation, where smaller models learn from reasoning data generated by stronger models. Note: RL for reasoning differs from the RLHF used for preference tuning.

How is pattern matching different from logical reasoning for LLMs?
Pattern matching predicts continuations based on statistical associations (e.g., “The capital of Germany is … Berlin”). Logical reasoning involves applying rules and tracking contradictions. LLMs often answer correctly via learned associations but can fail on novel or intricate multi-step problems.
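For contrast, explicit rule-based inference can be sketched in a few lines. The deduction and contradiction check below mirror the penguin example from earlier in the chapter; the tuple-based fact representation is purely an illustrative assumption:

```python
facts = {("bird", "penguin"), ("cannot_fly", "penguin")}
# "All birds can fly": if X is a bird, infer X can fly.
rules = [(("bird",), "can_fly")]


def forward_chain(facts, rules):
    # Apply each rule to every matching fact; deterministic and checkable,
    # unlike an LLM's statistical next-token prediction.
    derived = set(facts)
    for premises, conclusion in rules:
        for pred, subj in list(derived):
            if pred in premises:
                derived.add((conclusion, subj))
    return derived


derived = forward_chain(facts, rules)
contradictions = {
    subj
    for pred, subj in derived
    if pred == "can_fly" and ("cannot_fly", subj) in derived
}
print(contradictions)  # → {'penguin'}
```

An LLM has no such explicit contradiction-tracking step; any appearance of it emerges from learned statistical associations.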
If non-reasoning LLMs can handle some logic questions, why build dedicated reasoning models?
Conventional LLMs can simulate reasoning when training data covers similar patterns, but they struggle with novel scenarios and deeper multi-step dependencies. Reasoning-focused methods make performance more reliable and robust across complex tasks.

When should I use a reasoning model versus a conventional model?
Use reasoning models for complex, multi-step tasks (advanced math, coding, puzzles, agentic workflows). Prefer conventional models for simpler tasks (summarization, translation, factual Q&A) to avoid extra cost, verbosity, and potential “overthinking.”

Why are reasoning models more expensive to use?
They typically produce longer outputs (more tokens, hence more forward passes) and often require multiple runs per query (e.g., sampling several solutions, tool calls, verification), multiplying total compute and cost.
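A back-of-the-envelope calculation makes the multiplier concrete; all numbers below are illustrative assumptions, not real pricing:

```python
def query_cost(output_tokens, samples, price_per_1k_tokens):
    # Cost grows linearly in output length and in the number of samples.
    return output_tokens * samples * price_per_1k_tokens / 1000


# Hypothetical figures: a short direct answer vs. long chains-of-thought
# sampled eight times, at the same per-token price.
conventional = query_cost(output_tokens=200, samples=1, price_per_1k_tokens=0.01)
reasoning = query_cost(output_tokens=2_000, samples=8, price_per_1k_tokens=0.01)

print(f"conventional: ${conventional:.4f}")              # → $0.0020
print(f"reasoning:    ${reasoning:.4f}")                 # → $0.1600
print(f"multiplier:   {reasoning / conventional:.0f}x")  # → 80x
```

Under these assumed numbers, the reasoning workflow costs 80 times more per query, which is why routing simple tasks to a conventional model matters.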
What is the roadmap in this book for building reasoning models from scratch?
Start with a conventional LLM (pre-trained + instruction-tuned), establish evaluation methods, apply inference-time techniques to boost reasoning, then add training-based methods (e.g., RL, distillation). The book implements these components step by step, hands-on.
