Overview

1 Understanding reasoning models

This chapter introduces reasoning as the next stage in large language models and defines it pragmatically: an LLM makes its intermediate steps explicit before producing an answer, a behavior often referred to as chain-of-thought. While such traces can look human-like, the model is still generating tokens probabilistically and is not performing guaranteed, rule-based logic. The book takes a hands-on, code-first approach aimed at practitioners, starting from an existing pre-trained model and incrementally adding reasoning capabilities to understand how the methods work in practice.

To ground the discussion, the chapter reviews the standard LLM pipeline of pre-training (next-token prediction on massive text corpora) followed by post-training via instruction tuning and preference tuning. It then outlines three families of techniques that enhance reasoning on top of this baseline: inference-time compute scaling (e.g., prompting and sampling strategies that “think” longer without changing weights), reinforcement learning (updating weights using rewards from verifiable tasks or environments, distinct from human-preference RLHF), and distillation (supervised fine-tuning on high-quality, step-by-step data generated by stronger models). The chapter contrasts statistical pattern matching with logical reasoning, showing how LLMs can simulate reasoning through learned correlations yet remain vulnerable in novel or highly complex multi-step scenarios.
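
As a preview of the inference-time compute scaling idea, the minimal Python sketch below samples several chain-of-thought completions and majority-votes the final answer (self-consistency style). The `generate` function is a toy stand-in for a real model call, and the prompt suffix is just one common phrasing; neither is taken from the book's code.

```python
import random
from collections import Counter

def generate(prompt, temperature=0.8):
    # Toy stand-in for a real LLM call (hypothetical, for illustration only):
    # a real implementation would sample a chain of thought and return its final answer.
    return random.choice(["72", "72", "72", "68"])  # noisy but mostly-correct answers

def self_consistency(prompt, num_samples=5):
    # Inference-time compute scaling: sample several reasoning paths and
    # majority-vote on the final answer, without changing any model weights.
    cot_prompt = prompt + "\nLet's think step by step."
    answers = [generate(cot_prompt) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 8 * 9?"))  # most likely prints "72"
```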

Finally, the chapter motivates building reasoning models from scratch by highlighting recent momentum in the field and the practical trade-offs involved. Reasoning can improve performance on challenging tasks in math, coding, and puzzles, but it is not always desirable: it tends to increase verbosity, incur higher compute due to longer outputs and multiple inference calls, and can overthink simple requests. The roadmap for the book is to start with a capable base model, establish evaluation, apply inference-time techniques, and then implement training-based methods, equipping readers to design, prototype, and assess reasoning-enhanced LLMs with a clear understanding of costs and benefits.

Figure captions

  • A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate steps may or may not be shown to the user, depending on the implementation.
  • Overview of a typical LLM training pipeline. The process begins with a model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.
  • Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instruction. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response with a friendly tone and engaging language, which makes the answer more relatable and user-centered.
  • Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-time compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initializing the model, pre-training, and post-training with instruction and preference tuning).
  • Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "A penguin can fly." This conclusion conflicts with the established fact "A penguin cannot fly," which results in a contradiction.
  • An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
  • Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models (a minimal decoding sketch follows this list).
  • A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as the base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced via the reasoning methods in stages 3 and 4.
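
To make the token-by-token generation caption concrete, below is a minimal greedy-decoding sketch. The `model` and `tokenizer` objects are hypothetical placeholders (a callable returning next-token logits and a tokenizer with `encode`, `decode`, and `eos_token_id`), not the book's actual classes.

```python
def greedy_decode(model, tokenizer, prompt, max_new_tokens=50):
    # Hypothetical interfaces: model(token_ids) returns one logit per vocabulary
    # token for the next position; tokenizer provides encode/decode/eos_token_id.
    token_ids = tokenizer.encode(prompt)          # the full sequence generated so far
    for _ in range(max_new_tokens):
        logits = model(token_ids)                 # scores for every possible next token
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        token_ids.append(next_id)                 # append and feed back as input
        if next_id == tokenizer.eos_token_id:     # stop at the end-of-sequence token
            break
    return tokenizer.decode(token_ids)
```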

Summary

  • Conventional LLM training occurs in several stages:
    • Pre-training, where the model learns language patterns from vast amounts of text.
    • Instruction fine-tuning, which improves the model's responses to user prompts.
    • Preference tuning, which aligns model outputs with human preferences.
  • Reasoning methods are applied on top of a conventional LLM.
  • Reasoning in LLMs involves systematically solving multi-step tasks using intermediate steps (chain-of-thought).
  • Reasoning in LLMs differs from rule-based reasoning and likely works differently from human reasoning; the current consensus is that it relies on statistical pattern matching.
  • Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
  • Improving reasoning in LLMs can be achieved through:
    • Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
    • Reinforcement learning, training models explicitly with reward signals.
    • Supervised fine-tuning and distillation, using examples from stronger reasoning models (see the data-formatting sketch after this list).
  • Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.
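
To make the distillation bullet above concrete, here is a minimal sketch of how teacher-generated, step-by-step answers might be turned into supervised fine-tuning examples for a smaller student model. The field names, the example trace, and the prompt template are assumptions for illustration, not the book's actual data format.

```python
# Hypothetical teacher-generated reasoning traces (question, step-by-step answer).
teacher_traces = [
    {
        "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
        "reasoning": "Speed = distance / time = 60 / 1.5 = 40.",
        "answer": "40 km/h",
    },
]

def to_sft_example(trace):
    # Turn one teacher trace into a prompt/target pair for supervised fine-tuning.
    prompt = f"Question: {trace['question']}\nAnswer step by step."
    target = f"{trace['reasoning']}\nFinal answer: {trace['answer']}"
    return {"prompt": prompt, "target": target}

sft_dataset = [to_sft_example(t) for t in teacher_traces]
print(sft_dataset[0]["target"])
```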

FAQ

What does "reasoning" mean in the context of LLMs?
In this chapter, reasoning means an LLM shows the intermediate steps it used to reach an answer before giving the final response. Making these steps explicit often improves accuracy on complex tasks such as coding, logic puzzles, and multi-step math.
What is chain-of-thought (CoT)?
Chain-of-thought is a prompting and generation style where the model produces step-by-step intermediate explanations on the way to an answer. It makes the model's process more explicit and can improve performance, but it does not imply the model "thinks" like a human.
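
As a small illustration, a few-shot chain-of-thought prompt embeds a worked example so the model imitates the step-by-step style; the wording below is an assumption for illustration, not the book's exact prompt.

```python
# A few-shot chain-of-thought prompt: the worked example nudges the model
# to produce intermediate steps before its final answer.
cot_prompt = (
    "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "A: 12 pens is 4 groups of 3 pens. 4 * $2 = $8. The answer is $8.\n"
    "\n"
    "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
    "A:"  # the model is expected to continue with step-by-step reasoning
)
print(cot_prompt)
```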
How is LLM "reasoning" different from human or symbolic logical reasoning?
LLM reasoning is probabilistic and autoregressive: the model predicts one token at a time from learned patterns, so its steps can be convincing yet not guaranteed to be logically sound. Symbolic (and human) reasoning can apply explicit rules and track contradictions deterministically; LLMs generally do not.
What are the standard stages of the LLM training pipeline?
There are two main stages: (1) pre-training on massive unlabeled text via next-token prediction to learn general language patterns, and (2) post-training, which includes supervised fine-tuning (instruction tuning) to follow tasks and instructions, and preference tuning (often via RLHF) to align outputs with human preferences. A chatbot experience typically adds an orchestration layer (system prompt, conversation history) beyond the model itself.
What's the difference between instruction tuning and preference tuning (RLHF)?
- Instruction tuning (SFT) teaches the model to follow task instructions using labeled examples.
- Preference tuning (often RLHF) optimizes the model to produce outputs humans prefer, using human feedback (rankings or ratings) as a reward signal to refine style, helpfulness, and safety.
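
A rough sketch of how the two kinds of training data differ; the field names and example texts are assumptions for illustration, not a specific library's schema.

```python
# Instruction tuning (SFT): a prompt plus the reference response to imitate.
sft_example = {
    "instruction": "Summarize the relationship between sleep and health in one sentence.",
    "response": "Getting enough sleep supports memory, mood, and the immune system.",
}

# Preference tuning (e.g., RLHF): two candidate responses, one marked as preferred.
preference_example = {
    "prompt": "Summarize the relationship between sleep and health in one sentence.",
    "chosen": "Good sleep supports memory, mood, and immune health, so it's worth prioritizing!",
    "rejected": "Sleep is a state of reduced consciousness observed in mammals.",
}
```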
What are the main approaches to improving LLM reasoning after base training?
There are three broad categories:
- Inference-time compute scaling: improve performance at inference without changing weights (e.g., CoT prompting, multiple-sample generation, verifier-guided selection).
- Reinforcement learning (RL): update weights using rewards tied to task success or verifiable correctness (e.g., math/coding).
- Distillation: transfer reasoning behaviors from a stronger model to a smaller one via supervised fine-tuning on high-quality, teacher-generated data.
How does RL for reasoning differ from RLHF used in preference tuning?
Both use RL, but they differ in rewards. RLHF uses human judgments to align with human preferences. RL for reasoning typically uses automated, objective signals (e.g., correctness checks, verifiers, environment rewards) to directly optimize task success and reasoning quality.
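
A minimal sketch of the kind of automated, verifiable reward signal described above, assuming a simple exact-match check; real reward designs are usually more involved.

```python
def math_reward(model_answer: str, ground_truth: str) -> float:
    # Verifiable reward: 1.0 if the model's final answer matches the known
    # correct answer, else 0.0. No human judgment is involved, unlike RLHF.
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

print(math_reward("40 km/h", "40 km/h"))  # 1.0
print(math_reward("45 km/h", "40 km/h"))  # 0.0
```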
How do pattern matching and logical reasoning differ in LLMs?
LLMs mostly perform pattern matching: they continue text based on statistical associations from training. They don't explicitly track contradictions or apply formal rules. For example, the "penguin" scenario may be answered correctly if the model has seen many examples linking penguins to "cannot fly," but this is still pattern-based rather than explicit rule checking.
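
For contrast, here is a tiny hand-written, rule-based check that tracks the penguin contradiction explicitly; this is an illustration of symbolic rule application, not something an LLM does internally.

```python
# Explicit, rule-based reasoning: apply a rule deterministically, then check
# whether the derived facts contradict the stated facts.
facts = {("penguin", "is_a", "bird"), ("penguin", "cannot", "fly")}

derived = set(facts)
for subject, relation, obj in facts:
    if relation == "is_a" and obj == "bird":        # rule: all birds can fly
        derived.add((subject, "can", "fly"))

contradiction = any(
    (s, "cannot", o) in derived for (s, rel, o) in derived if rel == "can"
)
print(contradiction)  # True: the premises about penguins are inconsistent
```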
When should I use a reasoning model, and what are the trade-offs?
Use reasoning models for complex, multi-step tasks (advanced math, coding, puzzles). For simpler tasks (summarization, translation, factual Q&A), conventional models may be faster and cheaper. Trade-offs include higher cost and latency due to (1) longer outputs with intermediate steps (more tokens, more forward passes) and (2) multi-call workflows (sampling, tools, verifiers) that multiply token usage. Reasoning models can also "overthink" easy tasks.
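
A back-of-the-envelope illustration of why reasoning workflows cost more; the token counts below are made-up numbers for illustration, not measurements.

```python
# Hypothetical token counts, for illustration only.
short_answer_tokens = 50        # direct answer, no intermediate steps
reasoning_answer_tokens = 600   # answer that includes intermediate steps
num_samples = 5                 # e.g., self-consistency with 5 sampled answers

direct_cost = short_answer_tokens
reasoning_cost = reasoning_answer_tokens * num_samples
print(reasoning_cost / direct_cost)  # 60.0: roughly 60x more generated tokens
```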
What roadmap does the book follow to build reasoning models from scratch?
Four stages: (1) start from a conventional instruction-tuned LLM; (2) establish evaluation methods for reasoning; (3) apply inference-time techniques to boost reasoning behavior; and (4) train with methods like RL and distillation to develop dedicated reasoning models. The book takes a hands-on, code-first approach to implement these steps from scratch.
