Overview

1 Understanding reasoning models

This chapter introduces reasoning as the next stage in large language models, framing it as the production of explicit intermediate steps—often called chain-of-thought—to solve complex tasks in math, logic, and programming. It contrasts probabilistic, token-by-token generation in LLMs with deterministic, rule-based systems, emphasizing that today’s models often simulate reasoning through learned statistical patterns rather than formal inference. The chapter sets expectations for a practical, code-first journey: starting from a capable, pre-trained-and-tuned base model, then adding, analyzing, and evaluating reasoning behaviors in a way that is useful to engineers, researchers, and developers.

To ground the discussion, the chapter reviews the standard LLM pipeline—pre-training for next-token prediction at scale, followed by post-training via instruction tuning and preference tuning—and explains why such models excel at fluent generation and pattern completion yet can falter on tasks requiring explicit logical consistency. It differentiates pattern matching from logical reasoning with examples, notes how LLMs may answer correctly by leveraging frequent training patterns, and clarifies that human-like explanations do not imply human-like mechanisms. Dedicated reasoning models extend these capabilities, but even conventional LLMs exhibit a spectrum of reasoning-like behavior that can be amplified by the right methods.

Finally, the chapter outlines three complementary approaches to improving reasoning: inference-time compute scaling (e.g., chain-of-thought prompting and diverse sampling without changing weights), reinforcement learning with verifiable rewards for problem solving (distinct from RLHF used for preference alignment), and supervised fine-tuning/distillation from stronger teachers. It highlights practical trade-offs—longer outputs, multiple inference calls, higher costs, and the need to decide when to “think” more—and argues that building reasoning components from scratch reveals these strengths and limitations clearly. The chapter closes with a roadmap for evaluating, enhancing, and training reasoning behaviors across subsequent stages of the book.
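
To make the first of these approaches concrete, here is a minimal Python sketch of inference-time compute scaling via chain-of-thought prompting and diverse sampling with a majority vote (often called self-consistency). The `generate` and `extract_final_answer` helpers are hypothetical stand-ins, not code from the book; any LLM call that returns a sampled completion could take their place.

```python
import random
from collections import Counter

def generate(prompt, temperature=0.8):
    # Hypothetical stand-in for an LLM call that returns one sampled
    # chain-of-thought completion; canned strings keep the sketch runnable.
    return random.choice([
        "Step 1: 3 * 4 = 12. Step 2: 12 + 5 = 17. Answer: 17",
        "Step 1: add 5 to 3 * 4. Step 2: that is 17. Answer: 17",
        "Step 1: 3 + 4 = 7. Step 2: 7 * 5 = 35. Answer: 35",  # a flawed reasoning path
    ])

def extract_final_answer(completion):
    # Assumes the completion ends with a line such as "Answer: 17".
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistency(question, num_samples=5):
    # Chain-of-thought prompting: ask for intermediate steps before the answer.
    prompt = f"{question}\nThink step by step, then give the final line as 'Answer: ...'."
    # Diverse sampling: draw several reasoning paths, then majority-vote the answers.
    answers = [extract_final_answer(generate(prompt)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 3 * 4 + 5?"))
```

Because only the prompt and the sampling procedure change, the model's weights stay untouched, which is what distinguishes this family of methods from the training-based approaches.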

A simplified illustration of how an LLM might tackle a multi-step reasoning task. Rather than just recalling a fact, the model needs to combine several intermediate reasoning steps to arrive at the correct conclusion. The intermediate reasoning steps may or may not be shown to the user, depending on the implementation.
Overview of a typical LLM training pipeline. The process begins with an initial model initialized with random weights, followed by pre-training on large-scale text data to learn language patterns by predicting the next token. Post-training then refines the model through instruction fine-tuning and preference fine-tuning, which enables the LLM to follow human instructions better and align with human preferences.

In the pre-training stage, LLMs are trained on massive amounts (many terabytes) of unlabeled text, which includes books, websites, research articles, and many other sources. The pre-training objective for the LLM is to learn to predict the next word (or token) in these texts.
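
As a rough illustration of this pre-training objective, the following PyTorch sketch computes the next-token prediction loss on a toy sequence. The embedding layer and linear head are placeholders for a full transformer; the shift-by-one targets and the cross-entropy loss are the parts that carry over to real pre-training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 100, 32
embedding = nn.Embedding(vocab_size, embed_dim)   # toy stand-in for a full LLM
lm_head = nn.Linear(embed_dim, vocab_size)        # maps hidden states to token logits

token_ids = torch.randint(0, vocab_size, (1, 8))       # one tokenized text of length 8
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # target = the next token at each position

logits = lm_head(embedding(inputs))               # (batch, seq_len - 1, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # gradients for one pre-training step
print(loss.item())
```
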
Example responses from a language model at different training stages. The prompt asks for a summary of the relationship between sleep and health. The pre-trained LLM produces a relevant but unfocused answer without directly following the instructions. The instruction-tuned LLM generates a concise and accurate summary aligned with the prompt. The preference-tuned LLM further improves the response by using a friendly tone and engaging language, which makes the answer more relatable and user-centered.
Illustration of how contradictory premises lead to a logical inconsistency. From "All birds can fly" and "A penguin is a bird," we infer "Penguin can fly." This conclusion conflicts with the established fact "Penguin cannot fly," which results in a contradiction.
An illustrative example of how a language model (GPT-4o in ChatGPT) appears to "reason" about a contradictory premise.
Three approaches commonly used to improve reasoning capabilities in LLMs. These methods (inference-time compute scaling, reinforcement learning, and distillation) are typically applied after the conventional training stages (initial model creation, pre-training, and post-training with instruction and preference tuning).
Token-by-token generation in an LLM. At each step, the LLM takes the full sequence generated so far and predicts the next token, which may represent a word, subword, or punctuation mark depending on the tokenizer. The newly generated token is appended to the sequence and used as input for the next step. This iterative decoding process is used in both standard language models and reasoning-focused models.
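
The loop below is a minimal PyTorch sketch of this decoding process, using greedy selection of the most likely token. The `toy_model` is a random stand-in; a trained LLM (reasoning-focused or not) would be dropped in at the same place and would simply produce longer sequences when it emits intermediate reasoning steps.

```python
import torch

def greedy_decode(model, input_ids, max_new_tokens=20, eos_id=None):
    # `model` is assumed to map (1, seq_len) token IDs to (1, seq_len, vocab_size) logits.
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                # forward pass over the full sequence so far
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=1)       # append it and feed the longer sequence back in
        if eos_id is not None and next_id.item() == eos_id:
            break                                                # stop at the end-of-sequence token
    return input_ids

# Toy usage with a random "model" standing in for a trained LLM:
vocab_size = 50
toy_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab_size)
print(greedy_decode(toy_model, torch.tensor([[1, 2, 3]])))
```
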
A mental model of the main reasoning model development stages covered in this book. We start with a conventional LLM as the base model (stage 1). In stage 2, we cover evaluation strategies to track the reasoning improvements introduced by the reasoning methods in stages 3 and 4.

Summary

  • Conventional LLM training occurs in several stages:
    • Pre-training, where the model learns language patterns from vast amounts of text.
    • Instruction fine-tuning, which improves the model's responses to user prompts.
    • Preference tuning, which aligns model outputs with human preferences.
  • Reasoning methods are applied on top of a conventional LLM.
  • Reasoning in LLMs involves systematically solving multi-step tasks using intermediate steps (chain-of-thought).
  • Reasoning in LLMs differs from rule-based reasoning and likely also works differently from human reasoning; the current consensus is that reasoning in LLMs relies on statistical pattern matching.
  • Pattern matching in LLMs relies purely on statistical associations learned from data, which enables fluent text generation but lacks explicit logical inference.
  • Improving reasoning in LLMs can be achieved through:
    • Inference-time compute scaling, enhancing reasoning without retraining (e.g., chain-of-thought prompting).
    • Reinforcement learning, training models explicitly with reward signals.
    • Supervised fine-tuning and distillation, using examples from stronger reasoning models (see the sketch after this list).
  • Building reasoning models from scratch provides practical insights into LLM capabilities, limitations, and computational trade-offs.
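
As a rough sketch of the distillation idea, the snippet below assembles a small supervised fine-tuning dataset from a stronger teacher model. The `teacher_generate` function is a hypothetical placeholder (here returning a canned solution so the sketch runs end to end); in practice it would be a call to a stronger reasoning model, and the resulting records would serve as targets for the usual next-token training loss.

```python
import json

def teacher_generate(question):
    # Hypothetical stand-in for a stronger reasoning model; a canned
    # chain-of-thought solution keeps the sketch self-contained.
    return "Step 1: 2 + 3 = 5.\nStep 2: 5 * 4 = 20.\nAnswer: 20"

questions = ["What is (2 + 3) * 4?"]

# Each record pairs a prompt with the teacher's reasoning-rich completion.
with open("distillation_data.jsonl", "w") as f:
    for q in questions:
        record = {"prompt": q, "completion": teacher_generate(q)}
        f.write(json.dumps(record) + "\n")
```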

FAQ

What does “reasoning” mean for LLMs in this book?
Reasoning is defined as a model’s ability to generate intermediate steps before giving a final answer, often called chain-of-thought (CoT). These steps are produced autoregressively and can look convincing, but they are not guaranteed to be logically sound like rule-based systems.

How does reasoning differ from pattern matching?
Pattern matching is next-token prediction based on statistical associations learned during pre-training. Reasoning involves deriving conclusions through intermediate steps and structured inference. Conventional LLMs often simulate reasoning by leveraging familiar patterns rather than explicitly applying formal logic.

What are the standard stages of training an LLM?
Two main stages: pre-training (learn general language via next-token prediction on massive text corpora) and post-training (instruction tuning to follow prompts; preference tuning—often via RLHF—to align outputs with human preferences). A chat interface is an additional orchestration layer, not part of training.

How do LLM reasoning processes compare to human reasoning?
LLMs generate tokens probabilistically from learned patterns; humans can apply explicit rules, manipulate concepts, and reason over an internal world model. LLM outputs can appear human-like, but the underlying mechanisms differ substantially.

What’s the difference between closed‑world and open‑world reasoning?
Closed‑world reasoning uses only the premises in the prompt to derive conclusions. Open‑world reasoning also considers external knowledge, which can reveal contradictions and require revising or clarifying premises. Conventional LLMs don’t explicitly track contradictions; they respond based on learned text distributions.

If LLMs can “appear” to reason, why build dedicated reasoning models?
Pattern-based simulation works in familiar contexts but often fails on novel, complex, or multi-step problems requiring structured reasoning. Dedicated reasoning methods improve robustness, consistency, and performance on such tasks.

What are the main approaches to improve LLM reasoning after conventional training?
Three broad approaches: 1) Inference-time compute scaling (spend more compute at inference—e.g., CoT and diverse sampling—to boost quality without changing weights), 2) Reinforcement learning (update weights using reward signals tied to task success or verifiable correctness), 3) Supervised fine-tuning and distillation (train on high-quality, reasoning-rich data, often generated by stronger models).

How does reinforcement learning for reasoning differ from RLHF?
Both use RL, but the reward sources differ. RL for reasoning typically uses objective, automated signals (e.g., correctness in math or code), while RLHF uses human judgments to align outputs with preferences. RL changes model weights; inference-time scaling does not.
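
The contrast can be made concrete with a minimal sketch of a verifiable reward for math problems. The answer-extraction convention here is an assumption for illustration; the key point is that the reward comes from an automatic check against a known solution, whereas RLHF would instead score outputs with a reward model trained on human preference data.

```python
def extract_final_answer(completion):
    # Assumes the model ends its chain of thought with a line like "Answer: 20".
    return completion.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion, ground_truth):
    # 1.0 if the final answer matches the known solution, else 0.0.
    # No human judgment involved, unlike the preference-based rewards in RLHF.
    return 1.0 if extract_final_answer(completion) == str(ground_truth) else 0.0

print(verifiable_reward("Step 1: 2 + 3 = 5. Step 2: 5 * 4 = 20. Answer: 20", 20))  # 1.0
print(verifiable_reward("Step 1: 2 * 3 = 6. Step 2: 6 * 4 = 24. Answer: 24", 20))  # 0.0
```
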
Why are reasoning models more expensive to use?
Two main reasons: 1) Longer outputs due to intermediate steps mean more tokens and more forward passes, 2) Many reasoning workflows use multiple inference calls (sampling candidates, tools, verifiers), multiplying total token processing and cost. They can also be more verbose and sometimes “overthink.”
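
A back-of-the-envelope calculation (with made-up token counts and prices) shows how these two factors multiply:

```python
# Illustrative numbers only; real token counts and prices vary widely by model and task.
standard_answer_tokens = 200      # direct answer, no intermediate steps
reasoning_answer_tokens = 1_500   # answer plus chain-of-thought steps
num_samples = 8                   # e.g., self-consistency samples several candidates
price_per_1k_tokens = 0.002       # hypothetical output price in dollars

standard_cost = standard_answer_tokens / 1_000 * price_per_1k_tokens
reasoning_cost = reasoning_answer_tokens * num_samples / 1_000 * price_per_1k_tokens
print(f"standard: ${standard_cost:.4f}  reasoning: ${reasoning_cost:.4f}  "
      f"ratio: {reasoning_cost / standard_cost:.0f}x")
```
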
What is the roadmap in this book for building reasoning models from scratch?
You start with a conventional instruction-tuned base model, learn evaluation methods to measure reasoning, apply inference-time techniques to improve behavior, and then implement training methods (e.g., RL and distillation) to develop full reasoning models—all via a hands-on, code-first approach.
