How Large Language Models Work

Edward Raff, Drew Farris and Stella Biderman for Booz Allen Hamilton

June 2025
ISBN 9781633437081
200 pages

Included with a Manning Online subscription

printed in black & white

available in Simplified Chinese

table of contents

1 Big picture: What are LLMs?

1.1 Generative AI in context

1.2 What you will learn

1.3 Introducing how LLMs work

1.4 What is intelligence, anyway?

1.5 How humans and machines represent language differently

1.6 Generative Pretrained Transformers and friends

1.7 Why LLMs perform so well

1.8 LLMs in action: The good, bad, and scary

2 Tokenizers: How large language models see the world

2.1 Tokens as numeric representations

2.2 Language models see only tokens

2.2.1 The tokenization process

2.2.2 Controlling vocabulary size in tokenization

2.2.3 Tokenization in detail

2.2.4 The risks of tokenization

2.3 Tokenization and LLM capabilities

2.3.1 LLMs are bad at word games

2.3.2 LLMs are challenged by mathematics

2.3.3 LLMs and language equity

2.4 Check your understanding

2.5 Tokenization in context

3 Transformers: How inputs become outputs

3.1 The transformer model

3.1.1 Layers of the transformer model

3.2 Exploring the transformer architecture in detail

3.2.1 Embedding layers

3.2.2 Transformer layers

3.2.3 Unembedding layers

3.3 The tradeoff between creativity and topical responses

3.4 Transformers in context

4 How LLMs learn

4.1 Gradient descent

4.1.1 What is a loss function?

4.1.2 What is gradient descent?

4.2 LLMs learn to mimic human text

4.2.1 LLM reward functions

4.3 LLMs and novel tasks

4.3.1 Failing to identify the correct task

4.3.2 LLMs cannot plan

4.4 If LLMs cannot extrapolate well, can I use them?

4.5 Is bigger better?

5 How do we constrain the behavior of LLMs?

5.1 Why do we want to constrain behavior?

5.1.1 Base models are not very usable

5.1.2 Not all model outputs are desirable

5.1.3 Some cases require specific formatting

5.2 Fine-tuning: The primary method of changing behavior

5.2.1 Supervised fine-tuning

5.2.2 Reinforcement learning from human feedback

5.2.3 Fine-tuning: The big picture

5.3 The mechanics of RLHF

5.3.1 Beginning with a naive RLHF

5.3.2 The quality reward model

5.3.3 The similar-but-different RLHF objective

5.4 Other factors in customizing LLM behavior

5.4.1 Altering training data

5.4.2 Altering base model training

5.4.3 Altering the outputs

5.5 Integrating LLMs into larger workflows

5.5.1 Customizing LLMs with retrieval augmented generation

5.5.2 General-purpose LLM programming

6 Beyond natural language processing

6.1 LLMs for software development

6.1.1 Improving LLMs to work with code

6.1.2 Validating code generated by LLMs

6.1.3 Improving code via formatting

6.2 LLMs for formal mathematics

6.2.1 Sanitized input

6.2.2 Helping LLMs understand numbers

6.2.3 Math LLMs also use tools

6.3 Transformers and computer vision

6.3.1 Converting images to patches and back

6.3.2 Multimodal models using images and text

6.3.3 Applicability of prior lessons

7 Misconceptions, limits, and eminent abilities of LLMs

7.1 Human rate of learning vs. LLMs

7.1.1 The limitations on self-improvement

7.1.2 Few-shot learning

7.2 Efficiency of work: A 10-watt human brain vs. a 2000-watt computer

7.2.1 Power

7.2.2 Latency, scalability, and availability

7.2.3 Refinement

7.3 Language models are not models of the world

7.4 Computational limits: Hard problems are still hard

7.4.1 Using fuzzy algorithms for fuzzy problems

7.4.2 When close enough is good enough for hard problems

8 Designing solutions with large language models

8.1 Just make a chatbot?

8.2 Automation bias

8.2.1 Changing the process

8.2.2 When things are too risky for autonomous LLMs

8.3 Using more than LLMs to reduce risk

8.3.1 Combining LLM embeddings with other tools

8.3.2 Designing a solution that uses embeddings

8.4 Technology presentation matters

8.4.1 How can you be transparent?

8.4.2 Aligning incentives with users

8.4.3 Incorporating feedback cycles

9 Ethics of building and using LLMs

9.1 Why did we build LLMs at all?

9.1.1 The pros and cons of LLMs doing everything

9.1.2 Do we want to automate all human work?

9.2 Do LLMs pose an existential risk?

9.2.1 Self-improvement and the iterative S-curve

9.2.2 The alignment problem

9.3 The ethics of data sourcing and reuse

9.3.1 What is fair use?

9.3.2 The challenges associated with compensating content creators

9.3.3 The limitations of public domain data

9.4 Ethical concerns with LLM outputs

9.4.1 Licensing implications for LLM output

9.4.2 Do LLM outputs poison the well?

9.5 Other explorations in LLM ethics

References

Overview

3 Transformers: How Inputs Become Outputs

Large language models treat text as tokens and generate outputs by repeatedly converting those tokens into numbers, transforming them, and mapping the results back to new tokens. This cyclical, statistical process is unlike human language use but is highly effective for predicting what comes next in a sequence. Modern models implement this with Transformers, architectures built to predict tokens rather than “understand” language in a human sense, and the chapter constructs a practical mental model for how inputs flow through the system to become coherent text.

The pipeline begins with an embedding layer that turns tokens into high-dimensional vectors capturing meaning, augmented with positional information so order matters. Stacked transformer layers then apply attention, using queries, keys, and values to highlight the most relevant context and refine representations across many steps. Finally, an output (unembedding) layer converts internal vectors into a probability distribution over tokens; sampling selects the next token, making generation autoregressive—each new choice depends on the prior ones—until an end-of-sequence signal stops the loop. This probabilistic decoding explains why the same prompt can yield different valid continuations.

Transformers come in three main styles: encoder-only models that produce rich representations for tasks, decoder-only models that continue text by next-token prediction, and encoder-decoder models that map one passage to another, excelling at tasks like translation and summarization. Embedding spaces enable useful semantic relationships but bring trade-offs: more dimensions increase expressivity and cost, and patterns in data can imprint unwanted biases. During decoding, controls like temperature adjust the balance between conservative, on-topic responses and creative, surprising ones. Together, these components and choices explain how LLMs convert tokenized inputs into fluent outputs.

The basic components of the transformer model, consisting of the embedding layer, multiple transformer layers, and the output layer

The full process going from a document to a transformer output

If you use just one number to represent a token, you quickly encounter problems where similar and dissimilar words cannot be made to “fit” each other. Here, trying to represent simple synonym and antonym relationships quickly becomes nonsensical, even with just a handful of words.

Adding another dimension to our token representation allows us to represent a more diverse arrangement of semantic relationships. Here, we see how different dimensions can capture relationships for multiple meanings of the same word.

The relationships between embeddings are useful enough to make the embeddings a “semantic space”. Words with similar meanings are near each other, and a single transformation can be applied to multiple words to yield a similar result. In this instance, the transformation finds the female version of a word.

Because Transformers do not understand that their inputs have a specific order, all possible re-organizations of the tokens “look” identical to the algorithm. This is problematic because word order can change the word’s context or, if done randomly, become gibberish.

Transformer layers do not understand that the inputs have a specific order. This information about the order of tokens is endowed via a second “positional” encoding. The position embeddings work the same way as token embeddings and are added together. This provides the information the model needs on the order of tokens.

An example of how queries, keys, and values work inside a transformer compared to a Python dictionary. When a Python dictionary matches queries to keys, it needs an exact match to find the value, or it will return nothing. A transformer always returns something based on the most similar matches between queries and keys.
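The contrast between the two kinds of lookup can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function names `softmax` and `fuzzy_lookup`, the 2-D keys and values, and the plain dot-product similarity are all assumptions chosen to keep the example small.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuzzy_lookup(query, keys, values):
    """Blend all values, weighting each by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Exact-match lookup: a key that is not present yields nothing.
exact = {"cat": "feline", "dog": "canine"}
missing = exact.get("kitten")  # None, because "kitten" is not a key

# Fuzzy lookup with made-up 2-D keys and values (illustrative only).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [0.9, 0.1]  # close to the first key, but not identical
blended = fuzzy_lookup(query, keys, values)
```

The dictionary returns `None` for the near miss, while `fuzzy_lookup` always returns a blend that leans toward the value whose key is most similar to the query.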

The next token in a sentence is predicted by using the current token as the “query” and calculating matches with the preceding words as the “keys”. The individual values themselves do not need to exist in the semantic space; the output of the attention mechanism produces something similar to one of the tokens in the vocabulary.

Producing output from LLMs involves converting from documents to tokens and then using the model to produce output. We loop through this process to both consume text and generate human-readable output.

We demonstrate text generation by starting with the phrase "I love to eat" and showing some possible completions. Foods such as barbeque and sushi have high probabilities, while a car and the number "42" have low probabilities. Weighted random selection chooses the word "tacos". The generation loop stops when the EoS token appears.
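Weighted random selection of this kind can be sketched with Python's standard library. The probabilities below are invented for illustration (the figure's actual values are not given here), and `sample_token` is a hypothetical helper name.

```python
import random

# Made-up probabilities for the token after "I love to eat"; they sum to 1.
next_token_probs = {
    "barbeque": 0.30,
    "sushi": 0.30,
    "tacos": 0.25,
    "pizza": 0.13,
    "car": 0.01,
    "42": 0.01,
}

def sample_token(probs, rng):
    """Pick one token at random, in proportion to its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [sample_token(next_token_probs, rng) for _ in range(1000)]
counts = {token: draws.count(token) for token in next_token_probs}
```

Over many draws, the foods dominate `counts`, but low-probability tokens such as "car" can still appear occasionally, which is why the same prompt can produce different continuations.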

Summary

While GPTs use tokens as their "basic unit of semantic meaning," they’re mathematically represented within the model as embedding vectors rather than as strings. These embedding vectors can capture relationships about nearness, dissimilarity, antonyms, and other linguistic descriptive properties.
Position and word order do not come naturally to transformers and are obtained via another vector representing the relative position. The model can represent word order by adding the position and token embedding vectors.
Transformer layers act as a kind of fuzzy dictionary, returning approximate answers to approximate matches. This fuzzy process is called attention and uses the terms Query, Key, and Value as analogous to the key and value in a Python dictionary.
GPTs are examples of decoder-only transformers, but encoder-only transformers and encoder-decoder transformers also exist. GPTs are best at generating text, but other types can be better at other tasks.
GPTs are autoregressive, meaning they work recursively. All previously generated tokens are fed into the model at each step to get the next token. Simply put, autoregressive models “predict the next thing using the previous things”.
The output of any transformer isn’t tokens; instead, the output is a probability for how likely every token is. Selecting a specific token is called Unembedding or sampling and includes some randomness.
The strength of this randomness can be controlled, resulting in more or less realistic output, more creative or unique output, or more consistent output. Most LLMs have a default threshold for randomness that is “reasonable looking,” but you may want to change it for different uses.

FAQ

What are the main stages in an LLM’s text generation pipeline?

The process typically follows seven steps: (1) tokenize the input text; (2) map tokens to embedding vectors; (3) add positional embeddings to encode order; (4) pass the sequence through many transformer layers; (5) project the result back to token scores (unembedding/output layer); (6) sample one token from the scored vocabulary; (7) detokenize to text and repeat autoregressively until an end-of-sequence token or a length limit is reached.
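The outer loop of these steps can be sketched as follows. This is a toy under stated assumptions: the six-word vocabulary is invented, and `toy_scores` is a hypothetical stand-in for steps 2 through 5, so it shows only how tokenizing, sampling, stopping, and detokenizing fit around a model.

```python
import math
import random

VOCAB = ["<eos>", "I", "love", "to", "eat", "tacos"]
TOKEN_ID = {token: i for i, token in enumerate(VOCAB)}
EOS_ID = TOKEN_ID["<eos>"]

def tokenize(text):
    """Step 1: map whitespace-separated words to integer token IDs."""
    return [TOKEN_ID[word] for word in text.split()]

def detokenize(ids):
    """Step 7: map token IDs back to text."""
    return " ".join(VOCAB[i] for i in ids)

def toy_scores(ids):
    """Hypothetical stand-in for steps 2-5. A real model would embed the
    tokens, add positions, run transformer layers, and unembed. This one
    just gives a high score to the vocabulary entry after the last token."""
    scores = [0.0] * len(VOCAB)
    scores[(ids[-1] + 1) % len(VOCAB)] = 5.0
    return scores

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):        # length limit
        scores = toy_scores(ids)
        weights = [math.exp(s) for s in scores]  # unnormalized probabilities
        # Step 6: sample one token ID in proportion to its weight.
        next_id = rng.choices(range(len(VOCAB)), weights=weights, k=1)[0]
        if next_id == EOS_ID:              # end-of-sequence stops the loop
            break
        ids.append(next_id)                # feed the choice back in
    return detokenize(ids)

text = generate("I love to eat")
```

Every pass through the loop sees all tokens chosen so far, which is the autoregressive part; the loop ends on the end-of-sequence token or when `max_new_tokens` is reached.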

Why do LLMs convert tokens into vectors instead of using token IDs directly?

Neural networks operate on continuous numbers they can adjust during learning. A token ID is a fixed identifier that carries no graded meaning. Embedding layers map each token to a high-dimensional vector that can capture nuanced relationships (similarity, antonymy, multiple senses) that the model can manipulate.

Why isn’t a single number enough to represent a word?

One dimension cannot simultaneously encode many relationships (synonyms, antonyms, multiple meanings). Using one number leads to contradictions (e.g., making antonyms mere negations wrongly ties unrelated words together). Multiple dimensions allow the model to place words in positions that reflect many relationships at once.

What is an embedding or “semantic” space, and what useful properties does it have?

It’s a high-dimensional space where each token is represented by a vector. Useful properties include: similar words being close together; consistent vector offsets that capture relations (e.g., male→female), and the ability to compose many relations simultaneously. Note that these spaces reflect training data and can encode societal biases.
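The idea of a consistent offset can be illustrated with hand-made vectors. The 2-D coordinates below are assumptions picked so the arithmetic is exact; learned embeddings have many more dimensions and show such relationships only approximately.

```python
import math

# Hand-made embeddings: first number ~ "royalty", second ~ "gender".
EMBEDDINGS = {
    "man": (0.0, -1.0),
    "woman": (0.0, 1.0),
    "king": (1.0, -1.0),
    "queen": (1.0, 1.0),
    "prince": (0.5, -1.0),
    "princess": (0.5, 1.0),
}

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def nearest(vector, exclude=()):
    """Return the word whose embedding is closest to the given vector."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return min(candidates, key=lambda w: math.dist(EMBEDDINGS[w], vector))

# One offset, learned from a single pair, applied to other words.
male_to_female = subtract(EMBEDDINGS["woman"], EMBEDDINGS["man"])
from_king = nearest(add(EMBEDDINGS["king"], male_to_female), exclude=("king",))
from_prince = nearest(add(EMBEDDINGS["prince"], male_to_female), exclude=("prince",))
```

In this toy space the same offset moves "king" to "queen" and "prince" to "princess", which is the kind of reusable transformation the answer describes.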

Why do Transformers need positional information and how is it added?

Transformer layers don’t inherently know the order of tokens; without position, a sentence and its jumbled version look identical. Models add a positional vector (based on the token’s index) to the token’s meaning vector; the sum encodes both identity and position so the model can use order-sensitive patterns.
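A small sketch can show why the sum matters. The token and position vectors below are invented, and comparing sorted lists is only a simple stand-in for "the model sees an unordered collection of inputs".

```python
# Made-up 2-D token and position embeddings (illustrative values only).
TOKEN_EMB = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
POSITION_EMB = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]  # one vector per index

def embed(tokens, use_positions):
    """Look up each token's vector and optionally add its position vector."""
    rows = []
    for index, token in enumerate(tokens):
        vec = TOKEN_EMB[token]
        if use_positions:
            vec = [t + p for t, p in zip(vec, POSITION_EMB[index])]
        rows.append(vec)
    return rows

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Treat the inputs as an unordered collection by sorting the vectors.
same_without = sorted(embed(a, False)) == sorted(embed(b, False))
same_with = sorted(embed(a, True)) == sorted(embed(b, True))
```

Without positions, the two sentences contribute exactly the same set of vectors (`same_without` is `True`); once position vectors are added, the sets differ (`same_with` is `False`), so order becomes visible to later layers.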

What happens inside a transformer layer (Query, Key, Value and attention)?

Each token is mapped to a Query (what it’s looking for), a Key (what it offers to other tokens), and a Value (the content to retrieve). Attention scores how well the current Query matches each Key, weights the Values accordingly, and aggregates them. This “fuzzy lookup” lets the model focus on the most relevant context when forming the next representation.
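The score, weight, and aggregate steps are commonly written as scaled dot-product attention. This page does not state a formula, so the notation here is an assumption: Q, K, and V are matrices with one row per token, and d_k is the dimension of each key vector.

```latex
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The product of Q with the transpose of K scores every Query against every Key, the softmax turns each row of scores into weights that sum to 1, and multiplying by V produces the weighted blend of Values.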

What are encoder-only, decoder-only, and encoder–decoder models used for?

- Encoder-only: build rich representations of input text for tasks like classification and retrieval (e.g., BERT, RoBERTa).
- Decoder-only: generate continuations by predicting the next token (e.g., GPT, Gemini).
- Encoder–decoder: transform one sequence into a corresponding output sequence, excelling at tasks like translation and summarization (e.g., T5, Google Translate’s core models).

How are vectors converted back into tokens, and how does generation stop?

The output/unembedding layer scores every vocabulary token against the current vector. The model samples one token, appends it to the output, and feeds it back in to predict the next token (an autoregressive loop). Generation ends when a special end-of-sequence token appears or a maximum length is reached.
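The scoring step can be sketched as a dot product between the current vector and one row per vocabulary token, followed by a softmax. The three-word vocabulary, the `UNEMBED` rows, and the `hidden` vector are all invented for illustration.

```python
import math

VOCAB = ["tacos", "sushi", "car"]
# Hypothetical unembedding matrix: one 2-D row per vocabulary token.
UNEMBED = [[2.0, 0.0], [1.5, 0.5], [-1.0, 1.0]]

def token_probabilities(hidden):
    """Score every vocabulary token against the vector, then normalize."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in UNEMBED]
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {token: e / total for token, e in zip(VOCAB, exps)}

probs = token_probabilities([1.0, 0.2])
```

The result is a probability for every token in the vocabulary, not a single token; a separate sampling step then picks one.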

How does the model choose the next token, and why is randomness involved?

The model converts token scores into probabilities and samples one token according to those probabilities. Randomness is essential because many continuations can be valid; always picking the single highest-probability token would produce repetitive, unnatural text. Sampling balances plausibility with variability and can be tuned (e.g., via nucleus sampling).

What does “temperature” control, and when should you raise or lower it?

Temperature rescales token probabilities before sampling. Lower temperature emphasizes high-probability tokens for focused, reliable outputs; higher temperature boosts lower-probability tokens for more diverse, creative outputs. Typical defaults are around 0.7–0.8; lower for factual or precise tasks, higher for brainstorming or creative writing.
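One common way to apply temperature is to divide the token scores by it before the softmax, which has the same effect as rescaling and renormalizing the probabilities. The sketch below assumes that approach; the scores and the function name `softmax_with_temperature` are illustrative.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(scores, 0.5)     # sharper distribution
neutral = softmax_with_temperature(scores, 1.0)  # unchanged softmax
hot = softmax_with_temperature(scores, 2.0)      # flatter distribution
```

The top token's probability is highest in `cold` and lowest in `hot`, so sampling at low temperature is more predictable and sampling at high temperature is more varied.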

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$54.99 $34.64

you save $20.35 (37%)

include audio $19.99 $12.59
