Overview

1 Understanding reasoning models

This chapter introduces reasoning models as the next stage in large language model development, where the goal is to move beyond short pattern-based answers toward responses that include intermediate problem-solving steps. In this practical engineering sense, “reasoning” means that an LLM spends tokens on step-by-step work before producing a final answer, whether those steps are visible to users or hidden internally. These reasoning behaviors are especially useful for complex tasks such as math, coding, logical puzzles, planning, tool use, and agent workflows, but they do not mean that LLMs reason like humans or follow deterministic logical rules.

The chapter reviews the standard LLM training pipeline to explain where reasoning methods fit. Conventional LLMs are first pre-trained on massive text datasets through next-token prediction, which gives them fluency and broad capabilities through learned statistical patterns. They are then post-trained with instruction tuning and preference tuning so they can better follow user requests and produce responses aligned with human preferences. Reasoning improvements are usually added after these stages through three broad approaches: inference-time compute scaling, which improves performance without changing model weights; reinforcement learning, which trains models using reward signals such as verified correctness; and distillation, which transfers reasoning behavior from stronger models into smaller or more efficient ones.

A central distinction in the chapter is the difference between pattern matching and logical reasoning. LLMs can appear to reason because they have learned many reasoning-like patterns from data, but they do not explicitly apply formal rules or guarantee logical consistency like symbolic systems. This means they may handle familiar contradictions well while struggling with novel or highly complex reasoning tasks. The chapter argues that building reasoning methods from scratch is valuable because it reveals how these systems work, how to evaluate them, and when their added cost and latency are justified. It also lays out the book’s roadmap: start with a conventional LLM, build evaluation tools, improve reasoning at inference time, and then use training methods to develop a more dedicated reasoning model.

A simplified illustration of how a conventional, non-reasoning LLM might respond to a question with a short answer.
A simplified illustration of how a reasoning LLM might tackle a multi-step reasoning task using a chain-of-thought. Rather than just recalling a fact, the model combines several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.
Overview of a typical LLM training pipeline. The process begins with an initial model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.
Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.
Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initial model training, pre-training, and post-training with instruction and preference tuning), but reasoning techniques can also be applied to the pre-trained base model.
Contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "Penguin can fly." This conclusion conflicts with the established fact "Penguin cannot fly," which results in a contradiction.
An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
A high-level roadmap of what we build in this book. We start with a conventional LLM, add evaluation methods so that we can measure progress, and then explore two broad families of reasoning improvements, namely, inference techniques and training techniques.
A detailed roadmap of the chapter-level substeps. After loading the base model, we cover benchmark-based and judgment-based evaluation, then inference-time methods such as advanced text generation and voting plus self-refinement, and finally training-time methods based on reinforcement learning and distillation.

Summary

  • Conventional LLM training occurs in several stages:
    • Pre-training, where the model learns language patterns from vast amounts of text.
    • Instruction fine-tuning, which improves the model's responses to user prompts.
    • Preference tuning, which aligns model outputs with human preferences.
  • Reasoning methods are applied on top of a conventional LLM.
  • Reasoning in LLMs refers to improving a model so that it explicitly generates intermediate steps (chain-of-thought) before producing a final answer, which often increases accuracy on multi-step tasks.
  • Reasoning in LLMs is different from rule-based reasoning and it also likely works differently than human reasoning; currently, the common consensus is that reasoning in LLMs relies on statistical pattern matching.
  • Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
  • Improving reasoning in LLMs can be achieved through:
    • Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
    • Reinforcement learning, training models explicitly with reward signals.
    • Supervised fine-tuning and distillation, using examples from stronger reasoning models.
  • Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.

FAQ

What does “reasoning” mean for an LLM in this chapter?In this chapter, reasoning is defined in a practical engineering sense: an LLM generates intermediate steps before producing a final answer. These steps may be shown to the user as a step-by-step explanation or hidden inside special tags such as <think>...</think>. The goal is to encourage the model to spend tokens on problem solving instead of jumping directly to a conclusion.
What is a reasoning model?A reasoning model is an LLM that has been improved, through training or prompting techniques, to produce intermediate reasoning steps. This often improves accuracy on complex tasks such as coding problems, logical puzzles, multi-step math problems, and agent workflows that require planning, tool use, or recovery from mistakes.
What is chain-of-thought reasoning?Chain-of-thought, or CoT, refers to the generation of intermediate reasoning steps before the final answer. Engineers often describe this as the model “thinking” through a problem step by step, but the chapter emphasizes that this does not mean LLMs think or reason in the same way humans do.
How does LLM reasoning differ from deterministic logical reasoning?Traditional symbolic logic engines or theorem provers follow strict rule-based steps that can guarantee consistency and correctness when the rules and inputs are valid. LLMs, by contrast, generate text autoregressively, predicting one token at a time based on statistical patterns in training data. As a result, an LLM’s reasoning steps may look convincing but are not guaranteed to be logically sound.
What are the main stages in the standard LLM training pipeline?Conventional LLM training is usually divided into pre-training and post-training. During pre-training, the model learns to predict the next token from massive text datasets. During post-training, techniques such as supervised fine-tuning, also called instruction tuning, and preference tuning help the model follow user instructions and produce responses that better match human preferences.
What is the difference between pre-training and post-training?Pre-training gives the model general language capabilities by training it on huge amounts of text using next-token prediction. Post-training refines the base model so it becomes more useful in practice. Instruction tuning improves the model’s ability to follow tasks such as summarization, translation, and question answering, while preference tuning shapes tone, style, and user-centered behavior.
What are the three broad approaches to improving LLM reasoning?The chapter introduces three major approaches: inference-time compute scaling, reinforcement learning, and distillation. Inference-time compute scaling improves reasoning without changing model weights, often by using chain-of-thought, sampling, voting, or self-refinement. Reinforcement learning updates model weights using reward signals such as correct math or coding answers. Distillation transfers reasoning behavior from a larger, stronger model into a smaller or more efficient model.
How is reinforcement learning for reasoning different from RLHF?Both use reinforcement learning, but they differ in how rewards are obtained. RLHF, used in preference tuning, relies on human judgments or rankings to guide models toward human-preferred behavior. Reinforcement learning for reasoning typically uses automated verifiers or environment-based rewards, such as whether a math answer or code solution is correct.
How does pattern matching differ from logical reasoning in LLMs?Pattern matching means the model produces text based on statistical associations learned during training. For example, answering “Berlin” to “The capital of Germany is…” reflects a strong learned association rather than formal deduction. Logical reasoning, in contrast, involves systematically deriving conclusions from premises using rules, tracking implications, and detecting contradictions.
Why build reasoning models from scratch?Building reasoning models and methods from scratch helps reveal how the techniques actually work and what trade-offs they introduce. Reasoning models can be powerful for complex tasks, but they are often more expensive, slower, more verbose, and sometimes prone to overthinking. Implementing the methods directly helps engineers understand when reasoning machinery is useful and when a standard LLM is sufficient.

pro $24.99 per month

  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose one free eBook per month to keep
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime

lite $19.99 per month

  • access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more


choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Build a Reasoning Model (From Scratch) ebook for free
choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Build a Reasoning Model (From Scratch) ebook for free
choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • Build a Reasoning Model (From Scratch) ebook for free