Overview

4 Improving reasoning with inference-time scaling

This chapter shows how to boost a model’s reasoning at inference time—without retraining—by spending more compute when generating answers. It introduces two practical techniques: prompting the model to write out its reasoning (chain-of-thought) and sampling multiple responses to select a consensus answer (self-consistency). The author extends a flexible text-generation wrapper so different decoding strategies can be swapped in, then demonstrates that even a simple base model that initially answers a math problem incorrectly can be steered toward the correct result by eliciting step-by-step reasoning. The core trade-off is emphasized throughout: higher accuracy usually comes from generating more tokens and/or more samples, increasing latency and cost, and not every task or model benefits uniformly (overthinking and model-specific behavior can appear).

The chapter builds the sampling tools needed for diverse but coherent candidates. It first reviews next-token selection (logits → probabilities → token) and adds temperature scaling to control diversity: lower temperature sharpens the distribution (more deterministic), higher temperature flattens it (more exploratory). To keep exploration sensible, it implements top-p (nucleus) sampling, which filters out low-probability tokens by keeping only the smallest set whose cumulative probability mass reaches a threshold p, then renormalizes before sampling. These mechanisms are integrated into the streaming generation function, along with guidance on practical settings (e.g., moderate temperatures and typical top-p cutoffs) to balance variety with relevance.
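
To make the temperature mechanics concrete, here is a minimal sketch of temperature scaling in isolation; the toy logits and temperature values are illustrative, not the chapter's actual code:

import torch

# Toy next-token logits for a 5-token vocabulary (illustrative values).
logits = torch.tensor([2.0, 1.0, 0.5, -0.5, -1.0])

for temperature in (0.5, 1.0, 2.0):
    # Temperature scaling: divide the logits by T before applying softmax.
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}:", [round(p, 3) for p in probs.tolist()])

# Lower T sharpens the distribution (more deterministic);
# higher T flattens it (more exploratory).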

With these pieces in place, the chapter implements self-consistency: generate several answers with temperature and top-p, extract the final boxed result from each, then take a majority vote. On MATH-500, chain-of-thought alone can lift the base model dramatically (from around 15% to about 41% accuracy), whereas temperature + top-p by itself yields modest gains; adding self-consistency raises the base model further into the low 30% range, and combining chain-of-thought with self-consistency can surpass 50%—at a substantial runtime cost that scales with the number and length of samples. A reasoning-tuned model also benefits (roughly 48% to 55% with voting). The chapter closes by noting that voting works best when short, extractable answers are available, and previews the next chapter’s more general inference-time method: iterative self-refinement.

A mental model of the topics covered in this book. This chapter focuses on techniques that improve reasoning without additional training (stage 3). In particular, it extends the text-generation function and implements a voting-based method to improve answer accuracy. The next chapter then introduces an inference-time scaling approach where the model iteratively refines its own answers.
Comparing inference-time scaling (this chapter) and training-time scaling (after chapter 5). Both improve accuracy by using more compute, but inference-time scaling does this on the fly, without changing the model's weight parameters. The plots are inspired by OpenAI's article introducing their first reasoning model (https://openai.com/index/learning-to-reason-with-llms/).
Overview of three inference-time methods to improve reasoning covered in this book. The first modifies the prompt to encourage step-by-step reasoning, and the second samples multiple answers and selects the most frequent one. Both are discussed in this chapter. The third method, in which the model iteratively refines its own response, is introduced in the next chapter.
The first inference-time method, chain-of-thought prompting, modifies the prompt to encourage the model to explain its reasoning step by step before producing a final answer.
The second inference-time method, self-consistency sampling, generates multiple answers and selects the most frequent one. This method relies on temperature scaling, covered in this section, which influences how the model samples its next token.
How an LLM generates the next token. As in the other process diagrams in this book, the flow runs from bottom to top. The model converts the input into token IDs, computes scores for all possible next tokens, and selects the one with the highest score as the next output.
Example of next-token logits for a 100-token slice of a language model's much larger vocabulary. Each bar represents one possible token's score within this slice, with "Berlin" having the highest logit value and being selected as the next token.
In this section, we implement the core part of temperature scaling (step 3.2), which adjusts the next-token scores. This allows us to control how confidently the model selects its next token in later steps.
The effect of temperature scaling on logits. Lower temperatures make the distribution sharper, while higher temperatures flatten it. (Please note that this visualization is shown as a line plot for readability, though a bar plot would more accurately represent the discrete vocabulary scores.)
Overview of the sampling process for generating tokens. In this section, we focus on steps 3.3 and 3.4, where the next-token scores are converted into a probability distribution, and the next token is sampled based on that distribution.
Token probabilities obtained by applying the softmax function to the rescaled logits. The token with the highest probability (corresponding to " Berlin", but with the label omitted for code simplicity) is selected as the next output.
Overview of the top-p filtering process. The filter keeps only the highest-probability tokens by sorting them, applying a cumulative cutoff, selecting the top-p subset, and renormalizing the result.
Example of token probabilities before top-p filtering. The distribution includes many low-probability tokens, which will later be truncated by applying a cumulative probability threshold.
Visualization of sorted token probabilities and their cumulative sum. This step prepares for top-p filtering by showing how probabilities accumulate when ordered from highest to lowest, which helps determine where to set the cutoff threshold.
Top-p (nucleus) filtering. Tokens are sorted by probability, and the smallest subset whose cumulative probability exceeds the threshold (p = 0.8) is kept for sampling.
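
The filtering steps just described can be sketched on a toy distribution as follows; the probabilities and the p = 0.8 threshold here are illustrative:

import torch

# Toy next-token probabilities (already softmaxed; illustrative values).
probs = torch.tensor([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
top_p = 0.8

# 1. Sort the probabilities from highest to lowest.
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
# 2. Compute the cumulative sum: [0.40, 0.65, 0.80, 0.90, 0.96, 1.00].
cumulative = torch.cumsum(sorted_probs, dim=-1)
# 3. Keep the smallest set whose cumulative mass reaches p
#    (masking on the cumulative mass *before* each token keeps
#    the token that crosses the threshold).
keep = cumulative - sorted_probs < top_p
filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
# 4. Renormalize so the kept probabilities sum to 1.
filtered = filtered / filtered.sum()
# 5. Sample a token index from the filtered distribution.
next_idx = sorted_idx[torch.multinomial(filtered, num_samples=1)]
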
Integrating top-p filtering with temperature scaling. After rescaling the next-token scores, top-p filtering is applied between steps 3.3 and 3.4 to limit sampling to the most probable tokens.
The self-consistency sampling method generates multiple responses from the LLM and selects the most frequent answer, improving accuracy through majority voting across the sampled responses.
The three main steps for implementing self-consistency sampling. First, we generate multiple answers for the same prompt, using a temperature greater than zero and top-p filtering to induce diversity. Second, we extract the final boxed answer from each generated solution. Third, we select the most frequently extracted answer as the final prediction, as shown in the sketch below.
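
A compact sketch of these three steps, assuming a generate(prompt, temperature, top_p) function like the chapter's wrapper (the helper names here are illustrative, not the book's exact code):

import re
from collections import Counter

def extract_boxed(text):
    # Return the last \boxed{...} content in a solution, or None.
    # (Handles only simple, non-nested boxed answers.)
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def self_consistency(prompt, generate, num_samples=10,
                     temperature=0.8, top_p=0.9):
    # Step 1: sample multiple diverse solutions.
    solutions = [generate(prompt, temperature=temperature, top_p=top_p)
                 for _ in range(num_samples)]
    # Step 2: extract the final boxed answer from each solution.
    answers = [a for a in map(extract_boxed, solutions) if a is not None]
    # Step 3: majority vote over the extracted answers.
    return Counter(answers).most_common(1)[0][0] if answers else None
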
Summary of this chapter's focus on inference-time techniques. Here, the text generation function was extended with a voting-based method to improve answer accuracy. The next chapter introduces self-refinement, in which the model iteratively improves its responses.

Summary

  • Reasoning abilities and answer accuracy can be improved without retraining the model by increasing compute at inference time (inference-time scaling).
  • This chapter focuses on two such techniques: chain-of-thought prompting and self-consistency; a third method, self-refinement, which was briefly described, will be covered in the next chapter.
  • A flexible text generation wrapper (generate_text_stream_concat_flex) lets different sampling strategies be plugged in without changing the surrounding code.
  • Next tokens are produced from logits, which are converted into probabilities via softmax before a token is selected or sampled.
  • Temperature scaling rescales the logits to control the diversity of the generated text.
  • Top-p (nucleus) sampling filters out low-probability tokens to reduce the chance of generating nonsensical answers.
  • Chain-of-thought prompting (like "Explain step by step." or similar) often yields more accurate answers by encouraging the model to write out intermediate reasoning, though it increases the number of generated tokens and thus increases the runtime cost.
  • Self-consistency sampling generates multiple answers, extracts the final boxed result from each, and selects the most frequent answer via majority vote to improve the answer accuracy.
  • Experiments on the MATH-500 dataset show that combining chain-of-thought prompting with self-consistency can substantially boost accuracy compared to the baseline without sampling, at the cost of much longer runtimes.
  • The central trade-off of inference-time scaling is higher accuracy in exchange for more compute.

FAQ

What is inference-time scaling, and how does it differ from training-time scaling?

Inference-time scaling (also called test-time compute scaling) improves answer quality by spending more computation while the model is generating a response, without changing its weights. Examples include prompting the model to “think” longer, sampling multiple answers, and voting among them. Training-time scaling, in contrast, expends more compute during training to improve the model’s parameters. In practice, strong systems combine both: heavy training-time compute and targeted inference-time compute.

Which inference-time techniques does this chapter cover?

This chapter focuses on two practical methods you can implement from scratch:
  • Chain-of-thought prompting: modify the prompt to elicit step-by-step explanations before the final answer.
  • Self-consistency sampling: generate multiple responses (using temperature and top-p to create diversity) and choose the most frequent final answer by majority vote.
A third method, iterative self-refinement (the model revises its own answer in multiple steps), is introduced in the next chapter.

How does chain-of-thought prompting help, and when can it hurt?

Asking the model to “Explain step by step” often boosts accuracy because:
  • Writing intermediate steps gives the model opportunities to self-correct.
  • Many training examples include worked solutions, so this matches learned patterns.
Trade-offs and caveats:
  • It increases latency and cost by generating more tokens.
  • On simple questions, it can occasionally degrade performance (“overthinking”).
  • Reasoning-tuned models may already produce explanations and gain less from an explicit chain-of-thought cue.

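In code, the technique is just a small prompt modification; a sketch (the instruction wording and the generate function are assumptions, not the chapter's exact cue):

question = "What is 15% of 240?"

# Chain-of-thought cue appended to the plain question.
cot_prompt = question + "\nExplain step by step, then state the final answer."

# response = generate(cot_prompt)  # any text-generation function
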
What is temperature in text generation, and how should I set it?

Temperature rescales next-token logits before sampling:
  • Lower than 1.0: sharper distribution, more deterministic, less diverse.
  • Around 1.0: baseline behavior.
  • Higher than 1.0: flatter distribution, more diverse but potentially less reliable.
Typical guidance:
  • 0.0 is greedy decoding (no sampling).
  • 0.3–0.8 adds modest diversity for reasoning tasks.
  • Very high values are best for creative tasks and broad exploration, not precise reasoning.

What is top-p (nucleus) sampling, and how is it different from top-k?

Top-p keeps the smallest set of highest-probability tokens whose cumulative probability mass reaches a threshold p (for example, 0.8 or 0.9), renormalizes them, and samples from that subset. This filters out low-confidence tokens that can derail coherence. Top-k instead keeps a fixed number k of the most likely tokens. In short:
  • Top-p: variable subset size based on cumulative mass.
  • Top-k: fixed subset size by rank.
Top-p has become popular because it adapts to the shape of the distribution.

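A minimal sketch contrasting the two filters on a 1-D probability vector (illustrative helper functions, not a library API):

import torch

def top_k_filter(probs, k):
    # Keep the k most likely tokens, zero out the rest, renormalize.
    values, indices = torch.topk(probs, k)
    filtered = torch.zeros_like(probs)
    filtered[indices] = values
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative mass reaches p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < p
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()
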
How do I integrate temperature scaling and top-p into the generation loop?

The pipeline is:
  • Get next-token logits from the model.
  • Scale logits by temperature.
  • Convert to probabilities via softmax.
  • Apply a top-p filter to drop low-mass tokens and renormalize.
  • Sample the next token with multinomial sampling.
This is a drop-in replacement for greedy decoding inside your generation loop. The chapter provides flexible wrappers so you can swap generation functions and settings without changing surrounding code.

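Putting the pipeline together for one decoding step; a sketch that assumes a model returning logits of shape (batch, seq_len, vocab_size), not the chapter's exact generate_text_stream_concat_flex code:

import torch

def sample_next_token(model, token_ids, temperature=0.8, top_p=0.9):
    with torch.no_grad():
        logits = model(token_ids)[:, -1, :]  # logits for the last position
    if temperature == 0.0:
        return torch.argmax(logits, dim=-1, keepdim=True)  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature + softmax
    # Top-p filter: zero out tokens outside the nucleus, then renormalize.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p
    sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)  # sample the next token
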
What is self-consistency sampling, and how is it implemented?

Self-consistency is majority voting across multiple sampled solutions:
1) Generate multiple answers for the same prompt with temperature > 0 and top-p enabled to induce diversity.
2) Extract each answer’s final result (for example, the boxed number).
3) Choose the most frequent final answer as the prediction.
Notes:
  • You can parallelize sampling across devices to reduce wall-clock time.
  • Implement tie-breaking and consider early stopping once a majority emerges.

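One way to implement the early-stopping note: after each sample, check whether the leading answer can still be overtaken by the remaining draws (a sketch; generate and extract_boxed are assumed helpers, not the chapter's code):

from collections import Counter

def self_consistency_early_stop(prompt, generate, extract_boxed,
                                num_samples=10, **sampling_kwargs):
    counts = Counter()
    for i in range(1, num_samples + 1):
        answer = extract_boxed(generate(prompt, **sampling_kwargs))
        if answer is not None:
            counts[answer] += 1
            leader, votes = counts.most_common(1)[0]
            runner_up = max((v for a, v in counts.items() if a != leader),
                            default=0)
            # Stop once the leader cannot be overtaken by remaining samples.
            if votes > runner_up + (num_samples - i):
                return leader
    # Fall back to a plain majority vote (first-seen answer wins ties).
    return counts.most_common(1)[0][0] if counts else None
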
How should I choose num_samples, temperature, and top-p for self-consistency?

Practical starting points:
  • Temperature: about 0.5–0.9.
  • Top-p: about 0.7–0.9.
  • num_samples: 3–10 (diminishing returns beyond a small handful, with rising cost).
Tips:
  • If the sampled answers look nearly identical, slightly increase temperature or top-p.
  • If answers get nonsensical, reduce temperature first.
  • Consider early stopping once >50% of samples agree.
  • Run samples in parallel to shorten latency (total compute still increases).

Why does the chapter sometimes recommend running on CPU instead of GPU/MPS?

The sampling routines (for example, multinomial draws) and some low-level differences can yield slightly different outputs across devices, and certain PyTorch versions may show instability when drawing many samples on GPU/MPS. Running on CPU improves reproducibility for the examples, aligns with the book’s reported results, and avoids device-specific quirks.

What accuracy gains did these methods show on MATH-500, and what are the trade-offs?

Representative results from the chapter’s experiments:
  • Baseline (base model, greedy): ~15.2%.
  • Chain-of-thought on base model: ~40.6%.
  • Temperature + top-p alone on base model: ~17.8% (diversity control, not a standalone booster).
  • Self-consistency on base (n=10, with temp+top-p): ~31.6%.
  • CoT + temp/top-p + self-consistency on base (n=10): ~52.0%.
  • Reasoning model (greedy): ~48.2%; with self-consistency (n=3): ~55.2%.
Key trade-off: higher accuracy comes from generating longer and/or multiple responses, which increases inference compute, latency, and cost.
