How Large Language Models Work

Edward Raff, Drew Farris and Stella Biderman for Booz Allen Hamilton

June 2025
ISBN 9781633437081
200 pages

Included with a Manning Online subscription

printed in black & white

available in Simplified Chinese

table of contents

1 Big picture: What are LLMs?

1.1 Generative AI in context

1.2 What you will learn

1.3 Introducing how LLMs work

1.4 What is intelligence, anyway?

1.5 How humans and machines represent language differently

1.6 Generative Pretrained Transformers and friends

1.7 Why LLMs perform so well

1.8 LLMs in action: The good, bad, and scary

2 Tokenizers: How large language models see the world

2.1 Tokens as numeric representations

2.2 Language models see only tokens

2.2.1 The tokenization process

2.2.2 Controlling vocabulary size in tokenization

2.2.3 Tokenization in detail

2.2.4 The risks of tokenization

2.3 Tokenization and LLM capabilities

2.3.1 LLMs are bad at word games

2.3.2 LLMs are challenged by mathematics

2.3.3 LLMs and language equity

2.4 Check your understanding

2.5 Tokenization in context

3 Transformers: How inputs become outputs

3.1 The transformer model

3.1.1 Layers of the transformer model

3.2 Exploring the transformer architecture in detail

3.2.1 Embedding layers

3.2.2 Transformer layers

3.2.3 Unembedding layers

3.3 The tradeoff between creativity and topical responses

3.4 Transformers in context

4 How LLMs learn

4.1 Gradient descent

4.1.1 What is a loss function?

4.1.2 What is gradient descent?

4.2 LLMs learn to mimic human text

4.2.1 LLM reward functions

4.3 LLMs and novel tasks

4.3.1 Failing to identify the correct task

4.3.2 LLMs cannot plan

4.4 If LLMs cannot extrapolate well, can I use them?

4.5 Is bigger better?

5 How do we constrain the behavior of LLMs?

5.1 Why do we want to constrain behavior?

5.1.1 Base models are not very usable

5.1.2 Not all model outputs are desirable

5.1.3 Some cases require specific formatting

5.2 Fine-tuning: The primary method of changing behavior

5.2.1 Supervised fine-tuning

5.2.2 Reinforcement learning from human feedback

5.2.3 Fine-tuning: The big picture

5.3 The mechanics of RLHF

5.3.1 Beginning with a naive RLHF

5.3.2 The quality reward model

5.3.3 The similar-but-different RLHF objective

5.4 Other factors in customizing LLM behavior

5.4.1 Altering training data

5.4.2 Altering base model training

5.4.3 Altering the outputs

5.5 Integrating LLMs into larger workflows

5.5.1 Customizing LLMs with retrieval augmented generation

5.5.2 General-purpose LLM programming

6 Beyond natural language processing

6.1 LLMs for software development

6.1.1 Improving LLMs to work with code

6.1.2 Validating code generated by LLMs

6.1.3 Improving code via formatting

6.2 LLMs for formal mathematics

6.2.1 Sanitized input

6.2.2 Helping LLMs understand numbers

6.2.3 Math LLMs also use tools

6.3 Transformers and computer vision

6.3.1 Converting images to patches and back

6.3.2 Multimodal models using images and text

6.3.3 Applicability of prior lessons

7 Misconceptions, limits, and eminent abilities of LLMs

7.1 Human rate of learning vs. LLMs

7.1.1 The limitations on self-improvement

7.1.2 Few-shot learning

7.2 Efficiency of work: A 10-watt human brain vs. a 2000-watt computer

7.2.1 Power

7.2.2 Latency, scalability, and availability

7.2.3 Refinement

7.3 Language models are not models of the world

7.4 Computational limits: Hard problems are still hard

7.4.1 Using fuzzy algorithms for fuzzy problems

7.4.2 When close enough is good enough for hard problems

8 Designing solutions with large language models

8.1 Just make a chatbot?

8.2 Automation bias

8.2.1 Changing the process

8.2.2 When things are too risky for autonomous LLMs

8.3 Using more than LLMs to reduce risk

8.3.1 Combining LLM embeddings with other tools

8.3.2 Designing a solution that uses embeddings

8.4 Technology presentation matters

8.4.1 How can you be transparent?

8.4.2 Aligning incentives with users

8.4.3 Incorporating feedback cycles

9 Ethics of building and using LLMs

9.1 Why did we build LLMs at all?

9.1.1 The pros and cons of LLMs doing everything

9.1.2 Do we want to automate all human work?

9.2 Do LLMs pose an existential risk?

9.2.1 Self-improvement and the iterative S-curve

9.2.2 The alignment problem

9.3 The ethics of data sourcing and reuse

9.3.1 What is fair use?

9.3.2 The challenges associated with compensating content creators

9.3.3 The limitations of public domain data

9.4 Ethical concerns with LLM outputs

9.4.1 Licensing implications for LLM output

9.4.2 Do LLM outputs poison the well?

9.5 Other explorations in LLM ethics

References

Overview

4 How does GPT Learn?

This chapter demystifies how large language models, like GPT, are trained and clarifies why calling this process “learning” can be misleading. Rather than acquiring knowledge like humans, these models are optimized mechanically through mathematics: a loss function measures how poorly the model performs, and gradient descent repeatedly tweaks billions of parameters to reduce that loss. Training proceeds in tiny steps, often using stochastic gradient descent and optimizers like Adam, which trade extra memory for faster, more reliable progress. The emphasis is on computability and efficiency, not understanding or reasoning.

A good loss must be specific, computable, and smooth; when direct objectives (such as accuracy) are not smooth, proxy losses like cross-entropy are used. GPT is trained to mimic human text via next-token prediction, receiving reward when its outputs resemble its training data. This creates an incentive mismatch: the model becomes better at producing plausible text, not necessarily truthful or logically consistent text, and can absorb biases or errors present in internet-scale data. Despite this, contextual pattern-matching across vast corpora enables strikingly coherent outputs that can appear like reasoning, even though the process is fundamentally predictive rather than deliberative.

These training dynamics explain common failure modes and practical guardrails. GPT struggles with novel or sparsely represented tasks, can misidentify what task is being asked, and cannot plan or pre-commit to hidden states, leading to brittle behavior in puzzles or multi-step games. Effective use focuses on well-scoped, familiar tasks; adding retrieval or citations, constraining inputs and prompts, and supplying structured exemplars can improve reliability. While scaling transformers generally improves performance (“bigger is better”), real-world deployment must balance accuracy against cost, latency, memory, and device constraints, guiding careful product design around where LLMs succeed and where they predictably fail.

Investment returns are not easy to predict, partly because they are not smooth. Image modified under Creative Commons license from Forsyth, J. A., & Mongrut, S. (2022). Does duration of competitive advantage drive long-term returns in the stock market? Revista Contabilidade & Finanças, 33(89), 329–342. https://doi.org/10.1590/1808-057x202113660.

Examples of a smooth function (left) and two non-smooth functions (center and right). The center example is mostly smooth, but there is a region where it is not smooth because the function has no value. On the right, the function is not smooth anywhere due to the hard changes in value.

Inputs and labels (the known correct answer for each input) are used to tweak the neural network during gradient descent. A network is made of parameters that are altered a small amount each time gradient descent is applied. We eventually transform the network into something useful by applying gradient descent millions or billions of times.

This shows the big picture of gradient descent applied to a single-parameter problem. The curve illustrates the value of the loss function for a given parameter value. The ball’s location shows the loss for the current parameter value. The goal is to find the parameter value corresponding to the global minimum, which represents the ideal solution with the least loss.

This figure shows the gradient descent algorithm taking steps to adjust parameters to find the optimal outcome with the least loss. Unfortunately, the algorithm gets stuck in a local minimum, an area of the graph that is not optimal because other parameter values correspond to areas with a lower loss.

GPT sees this sentence nine times, each time learning from the prediction of a single word at the end of each of the nine sequences.

Context can help you make decent predictions about the next word. As you move from left to right, more of the text that might occur in a sentence is added. The images in the thought bubble for each sentence show how the added context eliminates different predictions.

While predicting the next token is powerful, it doesn’t imbue the network with reasoning or logic abilities. If we ask ChatGPT about something absurd and untrue, it happily explains how it happens.

GPT fails to solve two modified versions of a classic logic puzzle. This is due to how LLMs are trained. Content frequently occurring in the same general form (e.g., a famous logic puzzle) leads the model to regurgitate the frequent answer. This can happen even when the content is modified in important ways that are obvious to a person.

The dialogue agent doesn’t commit to a specific object at the start of the game.

Summary

Deep learning needs a loss/reward function that specifically quantifies how “badly” an algorithm performs at making predictions.
This loss/reward function should be designed to correlate with the overarching goal of what we want the algorithm to achieve in real life.
Gradient descent involves incrementally using a loss/reward function to alter the network’s parameters.
GPT-like models are trained to mimic human text by predicting the next token. This task is sufficiently specific to train a model to perform it, but it does not perfectly correlate with high-level objectives like “reasoning”.
GPT will perform best on tasks similar to common and repetitive tasks observed in its training data but will fail when the task is sufficiently novel.

FAQ

Does GPT “learn” like a human?

Not really. In machine learning, “learning” means repeatedly adjusting numeric parameters to better match data, guided by math. It’s a mechanical, highly formulaic optimization process, not human-style understanding, reasoning, or education. Models improve at a defined objective but do not acquire human-like cognition.

What is a loss function, and what makes a good one?

A loss function is a single numeric score that quantifies how poorly a model performed on an example. A good loss is:
- Specific: tightly correlated with the behavior you want.
- Computable: practical to evaluate with available data and resources.
- Smooth: changes continuously with small input/parameter changes so gradients make sense and optimization can progress.

Why don’t we train directly on “accuracy”?

Accuracy is not smooth; it jumps in integer steps (correct/incorrect), so tiny parameter tweaks don’t yield meaningful gradient signals. Instead, training uses a smooth proxy objective such as cross-entropy, which correlates with accuracy but provides workable gradients for optimization.
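
The difference is easy to see on a single example. The sketch below is a minimal illustration, not the book's code: the three-class scores and the helper names (`softmax`, `accuracy`, `cross_entropy`) are assumptions made up for this example. A small nudge toward the correct class leaves accuracy unchanged but lowers the cross-entropy, which is the signal gradient descent needs.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy(logits, label):
    """1.0 if the highest-scoring class is the label, else 0.0."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return 1.0 if predicted == label else 0.0

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

# Hypothetical scores for one example whose correct class is 0.
label = 0
before = [1.0, 2.0, 0.5]  # the model prefers class 1, which is wrong
after = [1.1, 2.0, 0.5]   # a tiny nudge toward the correct class

# Accuracy does not move, so it gives no hint that the nudge helped.
print(accuracy(before, label), accuracy(after, label))
# Cross-entropy drops slightly, so it does register the improvement.
print(cross_entropy(before, label), cross_entropy(after, label))
```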

How does gradient descent train a neural network?

Given a loss, gradients indicate how to nudge each parameter to reduce that loss. Gradient descent applies many tiny updates, billions of times for large models, to gradually lower error. It’s greedy and local—so it can get stuck in suboptimal “valleys” (local minima) and has no guarantee of finding the best solution, yet works remarkably well in practice.
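
A one-parameter sketch makes the local-minimum problem concrete. The loss curve `x**4 - 3*x**2 + x`, the learning rate, and the starting points are all assumptions chosen for illustration; a real network has billions of parameters and a loss computed from data.

```python
def loss(x):
    """A toy loss with a shallow valley near x = 1.13 and a deeper one near x = -1.30."""
    return x**4 - 3 * x**2 + x

def gradient(x):
    """Derivative of the toy loss with respect to the single parameter x."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=1000):
    """Repeatedly take a small step in the direction that lowers the loss."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

stuck = gradient_descent(start=2.0)   # rolls into the nearby shallow valley
best = gradient_descent(start=-2.0)   # happens to start near the deeper valley

print(f"start= 2.0 -> x={stuck:.2f}, loss={loss(stuck):.2f}")
print(f"start=-2.0 -> x={best:.2f}, loss={loss(best):.2f}")
```

Both runs follow the same rule, yet only the starting point decides which valley each one ends in. That is the sense in which gradient descent is greedy and local.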

What are SGD and Adam, and why are they used?

- SGD (stochastic gradient descent) uses small random batches instead of the full dataset, making training feasible and fast while still moving in good directions on average.
- Adam augments SGD with momentum-like estimators to speed progress and sometimes skip shallow minima.
Trade-off: Adam increases memory use (roughly 3× vs plain SGD during training), which matters for LLM scale.
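
Where the extra memory comes from can be seen in a bare-bones version of the two update rules. This is a simplified sketch under assumptions: the toy loss `sum(p**2)`, the hyperparameter values, and the `state` dictionary layout are illustrative, and counting stored numbers ignores gradients and activations. Adam keeps two running averages per parameter, so it stores about three numbers where plain SGD stores one.

```python
import math

def sgd_step(params, grads, lr=0.1):
    """Plain SGD: move each parameter against its gradient. No extra state."""
    return [p - lr * g for p, g in zip(params, grads)]

def adam_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `state` holds two extra numbers per parameter."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Running average of the gradient (momentum-like term).
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        # Running average of the squared gradient (per-parameter scale).
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][i] / (1 - beta1**t)
        v_hat = state["v"][i] / (1 - beta2**t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

params = [1.0, -2.0, 0.5]
grads = [2 * p for p in params]  # gradient of the toy loss sum(p**2)
state = {"t": 0, "m": [0.0] * len(params), "v": [0.0] * len(params)}

sgd_params = sgd_step(params, grads)
adam_params = adam_step(params, grads, state)

# Numbers that must be kept between steps for each method.
sgd_floats = len(params)
adam_floats = len(params) + len(state["m"]) + len(state["v"])
print(adam_floats / sgd_floats)  # 3.0
```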

What is GPT actually optimizing during pretraining?

Next-token prediction. The model is shown a sequence of tokens and trained to predict the next one. Minimizing cross-entropy here encourages outputs that look like human-written text. Crucially, this objective rewards plausibility, not truth, logic, or fidelity to external facts—creating potential misalignment with user goals.
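
The way one sentence turns into many training examples can be sketched in a few lines. The sentence and the helper `next_token_examples` are hypothetical, and splitting on whitespace is only a stand-in for a real tokenizer, which would produce sub-word tokens.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs, one for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Hypothetical ten-word sentence, so it yields nine prediction targets.
sentence = "the quick brown fox jumps over the lazy sleeping dog"
tokens = sentence.split()  # whitespace split stands in for a real tokenizer

examples = next_token_examples(tokens)
for context, target in examples:
    print(" ".join(context), "->", target)
```

Each pair asks the same question: given everything so far, what comes next? Nothing in the pairs records whether the sentence is true, only what was written.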

Why can GPT reproduce biases or falsehoods?

GPT is trained on large internet corpora that contain myths, errors, stereotypes, and social biases. Because the training objective rewards matching the data distribution, the model can preferentially reproduce frequent but incorrect or harmful patterns. It may even “spiral,” compounding earlier mistakes (e.g., buggy code leading to more bugs) when those patterns are common in the data.
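
A toy counting model shows the mechanism in miniature. Everything here is an assumption for illustration: the four-line corpus is invented, and counting continuations is a crude stand-in for a neural network. The incentive is the same, though: the most frequent continuation wins, whether or not it is true.

```python
from collections import Counter, defaultdict

def train_next_word_counts(corpus):
    """Count which word follows each context (all preceding words in the line)."""
    counts = defaultdict(Counter)
    for line in corpus:
        words = line.split()
        for i in range(1, len(words)):
            context = " ".join(words[:i])
            counts[context][words[i]] += 1
    return counts

def predict(counts, context):
    """Return the continuation seen most often after this context."""
    return counts[context].most_common(1)[0][0]

# Hypothetical corpus: a popular myth appears three times, the correction once.
corpus = [
    "we use ten percent of our brains",
    "we use ten percent of our brains",
    "we use ten percent of our brains",
    "we use all of our brains",
]

counts = train_next_word_counts(corpus)
print(predict(counts, "we use"))  # ten
```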

Why does GPT struggle with novel or obscure tasks?

LLMs excel when a task resembles patterns seen in training data. As tasks become rarer, more specialized, or structurally unusual (e.g., obscure languages/APIs or subtly modified puzzles), the model tends to extrapolate from nearby patterns and fill gaps with plausible but wrong details. Instruction tuning helps, but unusual tasks still trip models up.

Can GPT plan or commit to hidden information across turns?

Not by itself. GPT generates each continuation from the current context; it doesn’t natively set and maintain hidden state or commitments (e.g., pre-choosing an object for “20 Questions”). Without external tools or memory, it answers each step independently and only later aligns its output to prior text.

Is “bigger is better” for LLMs, and what are the trade-offs?

Larger transformer models trained on more data generally perform better and scale efficiently compared to older architectures. However, deployment costs rise: more memory, compute, latency, and energy; potential need for server offloading and always-on connectivity; and hardware constraints for edge or embedded use. Practical design often balances accuracy gains with speed, cost, and user experience limits.
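
A back-of-the-envelope calculation shows why memory becomes a deployment constraint. This is a rough sketch under stated assumptions: the model sizes are hypothetical, and the estimate covers only the stored weights, not activations, caches, or any training state.

```python
def parameter_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# Hypothetical model sizes, shown at 16-bit (2 bytes) and 32-bit (4 bytes) precision.
for name, size in [("1B", 1_000_000_000), ("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    half = parameter_memory_gb(size, 2)
    full = parameter_memory_gb(size, 4)
    print(f"{name}: {half:.0f} GB at 16-bit, {full:.0f} GB at 32-bit")
```

Under these assumptions, a 7-billion-parameter model needs about 14 GB for its weights alone at 16-bit precision. That is more memory than many phones and laptops have, which is one reason larger models are often served from remote hardware.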
