Overview

4 How Does GPT Learn?

This chapter demystifies how large language models, like GPT, are trained and clarifies why calling this process “learning” can be misleading. Rather than acquiring knowledge like humans, these models are optimized mechanically through mathematics: a loss function measures how poorly the model performs, and gradient descent repeatedly tweaks billions of parameters to reduce that loss. Training proceeds in tiny steps, often using stochastic gradient descent and optimizers like Adam, which trade extra memory for faster, more reliable progress. The emphasis is on computability and efficiency, not understanding or reasoning.
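To make the mechanics concrete, here is a minimal sketch (my toy example, not the book's code) of gradient descent on a single parameter: a loss function measures how poorly the current value performs, and the gradient tells us which direction to nudge it.

```python
# Toy illustration: gradient descent on one parameter w,
# minimizing the loss L(w) = (w - 3)^2.
# The gradient dL/dw = 2 * (w - 3) points "uphill", so we step the other way.

def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0      # initial parameter value
lr = 0.1     # learning rate: the size of each tiny step
for step in range(100):
    w -= lr * gradient(w)   # nudge w in the direction that lowers the loss

print(round(w, 4))  # w has converged very close to 3.0, where the loss is minimal
```

Real training does exactly this, except over billions of parameters at once and millions or billions of update steps.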

A good loss must be specific, computable, and smooth; when direct objectives (such as accuracy) are not smooth, proxy losses like cross-entropy are used. GPT is trained to mimic human text via next-token prediction, receiving reward when its outputs resemble its training data. This creates an incentive mismatch: the model becomes better at producing plausible text, not necessarily truthful or logically consistent text, and can absorb biases or errors present in internet-scale data. Despite this, contextual pattern-matching across vast corpora enables strikingly coherent outputs that can appear like reasoning, even though the process is fundamentally predictive rather than deliberative.
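A hedged sketch of the objective described above (the vocabulary and probabilities are invented for illustration): the cross-entropy loss for next-token prediction is just the negative log of the probability the model assigned to the token that actually came next.

```python
# Sketch of the next-token training objective over a tiny, made-up vocabulary.
import math

vocab = ["cat", "dog", "sat", "mat"]
predicted_probs = {"cat": 0.1, "dog": 0.1, "sat": 0.7, "mat": 0.1}
actual_next_token = "sat"

# The loss is low when the model puts high probability on the observed token.
# Note what the loss does NOT measure: whether the text is true or logical,
# only whether it resembles the training data.
loss = -math.log(predicted_probs[actual_next_token])
print(round(loss, 3))  # ≈ 0.357
```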

These training dynamics explain common failure modes and practical guardrails. GPT struggles with novel or sparsely represented tasks, can misidentify what task is being asked, and cannot plan or pre-commit to hidden states, leading to brittle behavior in puzzles or multi-step games. Effective use focuses on well-scoped, familiar tasks; adding retrieval or citations, constraining inputs and prompts, and supplying structured exemplars can improve reliability. While scaling transformers generally improves performance (“bigger is better”), real-world deployment must balance accuracy against cost, latency, memory, and device constraints, guiding careful product design around where LLMs succeed and where they predictably fail.

Figure: Investment returns are not easy to predict partly because they are not smooth. Image modified under a Creative Commons license from Forsyth, J. A., & Mongrut, S. (2022). Does duration of competitive advantage drive long-term returns in the stock market? Revista Contabilidade & Finanças, 33(89), 329–342. https://doi.org/10.1590/1808-057x202113660

Figure: Examples of a smooth function on the left and two non-smooth functions on the right. The center example is mostly smooth, but there is a region where it is not smooth because the function has no value there. On the right, the function is not smooth anywhere due to the hard change in value.

Figure: Inputs and labels (the known correct answer for each input) are used to tweak the neural network during gradient descent. A network is made of parameters that are altered a small amount each time gradient descent is applied. By applying gradient descent millions or billions of times, we eventually transform the network into something useful.

Figure: The big picture of gradient descent applied to a single-parameter problem. The curve shows the value of the loss function for each parameter value, and the ball’s location shows the loss for the current parameter value. The goal is to find the parameter value corresponding to the global minimum, the ideal solution with the least loss.

Figure: The gradient descent algorithm takes steps to adjust parameters toward the outcome with the least loss. Unfortunately, the algorithm gets stuck in a local minimum, a region of the graph that is not optimal because other parameter values correspond to lower loss.

Figure: GPT sees this sentence nine times, each time learning from the prediction of a single word at the end of each of the nine sequences.

Figure: Context can help you make decent predictions about the next word. Moving from left to right, additional text that might occur in a sentence is added. The images in each thought bubble show how the added context eliminates different predictions.

Figure: While predicting the next token is powerful, it doesn’t imbue the network with reasoning or logic abilities. If we ask ChatGPT something absurd and untrue, it happily explains how it happens.

Figure: GPT fails to solve two modified versions of a classic logic puzzle. This is due to how LLMs are trained: content frequently occurring in the same general form (e.g., a famous logic puzzle) leads the model to regurgitate the frequent answer, even when the content has been modified in important ways that are obvious to a person.

Figure: The dialogue agent doesn’t commit to a specific object at the start of the game.

Summary

  • Deep learning needs a loss/reward function that specifically quantifies how badly an algorithm performs at making predictions.
  • This loss/reward function should be designed to correlate with the overarching goal of what we want the algorithm to achieve in real life.
  • Gradient descent repeatedly uses a loss/reward function to make small alterations to the network’s parameters.
  • GPT-like models are trained to mimic human text by predicting the next token. This task is sufficiently specific to train a model to perform it, but it does not perfectly correlate with high-level objectives like “reasoning”.
  • GPT will perform best on tasks similar to common and repetitive tasks observed in its training data but will fail when the task is sufficiently novel.

FAQ

Does GPT “learn” like a human?
Not really. In machine learning, “learning” means repeatedly adjusting numeric parameters to better match data, guided by math. It’s a mechanical, highly formulaic optimization process, not human-style understanding, reasoning, or education. Models improve at a defined objective but do not acquire human-like cognition.
What is a loss function, and what makes a good one?
A loss function is a single numeric score that quantifies how poorly a model performed on an example. A good loss is:
  • Specific: tightly correlated with the behavior you want.
  • Computable: practical to evaluate with available data and resources.
  • Smooth: changes continuously with small input/parameter changes so gradients make sense and optimization can progress.
Why don’t we train directly on “accuracy”?
Accuracy is not smooth; it jumps in integer steps (correct/incorrect), so tiny parameter tweaks don’t yield meaningful gradient signals. Instead, training uses a smooth proxy objective such as cross-entropy, which correlates with accuracy but provides workable gradients for optimization.
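A small numerical illustration of this point (my example, with made-up probabilities): sweep the model's probability for the correct class and compare the two objectives.

```python
# Why accuracy gives no gradient signal but cross-entropy does.
import math

for p in [0.45, 0.49, 0.51, 0.55]:
    accuracy = 1 if p > 0.5 else 0   # jumps from 0 to 1 exactly at p = 0.5
    cross_entropy = -math.log(p)     # decreases smoothly as p grows
    print(p, accuracy, round(cross_entropy, 3))
```

Accuracy is flat on either side of 0.5 (zero gradient everywhere except the jump), while cross-entropy rewards every tiny improvement in p, which is what gradient descent needs.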
How does gradient descent train a neural network?
Given a loss, gradients indicate how to nudge each parameter to reduce that loss. Gradient descent applies many tiny updates, billions of times for large models, to gradually lower error. It’s greedy and local, so it can get stuck in suboptimal “valleys” (local minima) and has no guarantee of finding the best solution, yet it works remarkably well in practice.
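The local-minimum failure mode can be demonstrated on a toy one-dimensional loss (an assumption of mine, not a function from the text): where gradient descent ends up depends entirely on where it starts.

```python
# Gradient descent getting stuck in a local minimum.
# The loss f(w) = w^4 - 3w^2 - w has a shallow local minimum near w ≈ -1.13
# and a deeper global minimum near w ≈ 1.30.

def grad(w):
    return 4 * w**3 - 6 * w - 1   # derivative of w^4 - 3w^2 - w

def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * grad(w)         # greedy local steps only
    return w

print(round(descend(-2.0), 2))    # starts on the left: stuck near -1.13
print(round(descend(+2.0), 2))    # starts on the right: finds ~1.30, the lower valley
```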
What are SGD and Adam, and why are they used?
  • SGD (stochastic gradient descent) uses small random batches instead of the full dataset, making training feasible and fast while still moving in good directions on average.
  • Adam augments SGD with momentum-like estimators to speed progress and sometimes skip shallow minima.
Trade-off: Adam increases memory use (roughly 3× vs. plain SGD during training), which matters at LLM scale.
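A back-of-envelope sketch of that memory trade-off (the model size and precision are my illustrative assumptions): Adam keeps two extra running estimates per parameter, so optimizer state roughly triples training-time parameter memory.

```python
# Rough Adam memory arithmetic for a hypothetical 7B-parameter model in fp32.
params = 7e9              # number of parameters (assumed)
bytes_per_value = 4       # 32-bit floats (assumed)

weights_gb = params * bytes_per_value / 1e9
adam_state_gb = 2 * weights_gb   # one momentum + one variance estimate per parameter

print(weights_gb)                 # 28.0 GB for the weights alone
print(weights_gb + adam_state_gb) # 84.0 GB total, i.e., ~3x plain SGD
```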
What is GPT actually optimizing during pretraining?
Next-token prediction. The model is shown a sequence of tokens and trained to predict the next one. Minimizing cross-entropy here encourages outputs that look like human-written text. Crucially, this objective rewards plausibility, not truth, logic, or fidelity to external facts, creating potential misalignment with user goals.
Why can GPT reproduce biases or falsehoods?
GPT is trained on large internet corpora that contain myths, errors, stereotypes, and social biases. Because the training objective rewards matching the data distribution, the model can preferentially reproduce frequent but incorrect or harmful patterns. It may even “spiral,” compounding earlier mistakes (e.g., buggy code leading to more bugs) when those patterns are common in the data.
Why does GPT struggle with novel or obscure tasks?
LLMs excel when a task resembles patterns seen in training data. As tasks become rarer, more specialized, or structurally unusual (e.g., obscure languages/APIs or subtly modified puzzles), the model tends to extrapolate from nearby patterns and fill gaps with plausible but wrong details. Instruction tuning helps, but unusual tasks still trip models up.
Can GPT plan or commit to hidden information across turns?
Not by itself. GPT generates each continuation from the current context; it doesn’t natively set and maintain hidden state or commitments (e.g., pre-choosing an object for “20 Questions”). Without external tools or memory, it answers each step independently and only later aligns its output to prior text.
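One common workaround is to hold the commitment outside the model. The sketch below is a hypothetical pattern of my own (the helper names are invented, and no real LLM API is called): the secret is chosen and stored in external state up front, so every turn is answered against the same committed object.

```python
# Illustrative pattern: keep the hidden "20 Questions" state outside the model.
import random

SECRET_OBJECTS = ["bicycle", "teapot", "violin"]

def start_game(seed=None):
    rng = random.Random(seed)
    return {"secret": rng.choice(SECRET_OBJECTS)}   # external, committed state

def answer(state, question):
    # In a real system, a prompt containing the committed secret would be
    # sent to the LLM here; this stub just shows the state never drifts.
    return f"Question about '{state['secret']}': {question}"

game = start_game(seed=0)
print(answer(game, "Is it bigger than a breadbox?"))
print(answer(game, "Does it have wheels?"))  # same secret on every turn
```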
Is “bigger is better” true for LLMs, and what are the trade-offs?
Larger transformer models trained on more data generally perform better and scale efficiently compared to older architectures. However, deployment costs rise: more memory, compute, latency, and energy; potential need for server offloading and always-on connectivity; and hardware constraints for edge or embedded use. Practical design often balances accuracy gains against speed, cost, and user-experience limits.
