How Large Language Models Work

Edward Raff, Drew Farris and Stella Biderman for Booz Allen Hamilton

June 2025
ISBN 9781633437081
200 pages

printed in black & white

available in Simplified Chinese

table of contents

1 Big picture: What are LLMs?

1.1 Generative AI in context

1.2 What you will learn

1.3 Introducing how LLMs work

1.4 What is intelligence, anyway?

1.5 How humans and machines represent language differently

1.6 Generative Pretrained Transformers and friends

1.7 Why LLMs perform so well

1.8 LLMs in action: The good, bad, and scary

2 Tokenizers: How large language models see the world

2.1 Tokens as numeric representations

2.2 Language models see only tokens

2.2.1 The tokenization process

2.2.2 Controlling vocabulary size in tokenization

2.2.3 Tokenization in detail

2.2.4 The risks of tokenization

2.3 Tokenization and LLM capabilities

2.3.1 LLMs are bad at word games

2.3.2 LLMs are challenged by mathematics

2.3.3 LLMs and language equity

2.4 Check your understanding

2.5 Tokenization in context

3 Transformers: How inputs become outputs

3.1 The transformer model

3.1.1 Layers of the transformer model

3.2 Exploring the transformer architecture in detail

3.2.1 Embedding layers

3.2.2 Transformer layers

3.2.3 Unembedding layers

3.3 The tradeoff between creativity and topical responses

3.4 Transformers in context

4 How LLMs learn

4.1 Gradient descent

4.1.1 What is a loss function?

4.1.2 What is gradient descent?

4.2 LLMs learn to mimic human text

4.2.1 LLM reward functions

4.3 LLMs and novel tasks

4.3.1 Failing to identify the correct task

4.3.2 LLMs cannot plan

4.4 If LLMs cannot extrapolate well, can I use them?

4.5 Is bigger better?

5 How do we constrain the behavior of LLMs?

5.1 Why do we want to constrain behavior?

5.1.1 Base models are not very usable

5.1.2 Not all model outputs are desirable

5.1.3 Some cases require specific formatting

5.2 Fine-tuning: The primary method of changing behavior

5.2.1 Supervised fine-tuning

5.2.2 Reinforcement learning from human feedback

5.2.3 Fine-tuning: The big picture

5.3 The mechanics of RLHF

5.3.1 Beginning with a naive RLHF

5.3.2 The quality reward model

5.3.3 The similar-but-different RLHF objective

5.4 Other factors in customizing LLM behavior

5.4.1 Altering training data

5.4.2 Altering base model training

5.4.3 Altering the outputs

5.5 Integrating LLMs into larger workflows

5.5.1 Customizing LLMs with retrieval augmented generation

5.5.2 General-purpose LLM programming

6 Beyond natural language processing

6.1 LLMs for software development

6.1.1 Improving LLMs to work with code

6.1.2 Validating code generated by LLMs

6.1.3 Improving code via formatting

6.2 LLMs for formal mathematics

6.2.1 Sanitized input

6.2.2 Helping LLMs understand numbers

6.2.3 Math LLMs also use tools

6.3 Transformers and computer vision

6.3.1 Converting images to patches and back

6.3.2 Multimodal models using images and text

6.3.3 Applicability of prior lessons

7 Misconceptions, limits, and eminent abilities of LLMs

7.1 Human rate of learning vs. LLMs

7.1.1 The limitations on self-improvement

7.1.2 Few-shot learning

7.2 Efficiency of work: A 10-watt human brain vs. a 2000-watt computer

7.2.1 Power

7.2.2 Latency, scalability, and availability

7.2.3 Refinement

7.3 Language models are not models of the world

7.4 Computational limits: Hard problems are still hard

7.4.1 Using fuzzy algorithms for fuzzy problems

7.4.2 When close enough is good enough for hard problems

8 Designing solutions with large language models

8.1 Just make a chatbot?

8.2 Automation bias

8.2.1 Changing the process

8.2.2 When things are too risky for autonomous LLMs

8.3 Using more than LLMs to reduce risk

8.3.1 Combining LLM embeddings with other tools

8.3.2 Designing a solution that uses embeddings

8.4 Technology presentation matters

8.4.1 How can you be transparent?

8.4.2 Aligning incentives with users

8.4.3 Incorporating feedback cycles

9 Ethics of building and using LLMs

9.1 Why did we build LLMs at all?

9.1.1 The pros and cons of LLMs doing everything

9.1.2 Do we want to automate all human work?

9.2 Do LLMs pose an existential risk?

9.2.1 Self-improvement and the iterative S-curve

9.2.2 The alignment problem

9.3 The ethics of data sourcing and reuse

9.3.1 What is fair use?

9.3.2 The challenges associated with compensating content creators

9.3.3 The limitations of public domain data

9.4 Ethical concerns with LLM outputs

9.4.1 Licensing implications for LLM output

9.4.2 Do LLM outputs poison the well?

9.5 Other explorations in LLM ethics

References

Overview

2 Tokenizers: How large language models see the world

Large language models cannot work directly with raw text; they rely on tokenization to convert sentences into numeric sequences the model can process. Tokens are typically sub-words that balance the granularity of letters with the semantic coherence of words, allowing models to generalize to unfamiliar terms while keeping vocabularies manageable. Because models “see” only token IDs—without inherent relationships like capitalization or morphology—tokenization becomes the core feature engineering step for LLMs, shaping both model size and the kinds of patterns it can learn.

The tokenization pipeline involves four steps: receiving text, normalizing it, segmenting it into tokens, and mapping each token to a unique identifier while building the vocabulary. Normalization (e.g., lowercasing, punctuation handling) can shrink vocabularies and reduce ambiguity from typos, but it may also erase meaning (such as proper nouns) and reduce a model’s ability to correct errors. Modern systems therefore use minimal normalization and data-driven segmentation, most commonly via Byte-Pair Encoding, which learns frequent sub-word units to optimize coverage and efficiency. Vocabulary design includes tradeoffs in accuracy, speed, and memory, handling special tokens, and guarding against pitfalls like inconsistent tokenization, homoglyphs, and other encoding quirks that can create security and reliability issues.
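The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any production tokenizer works: the function names (`normalize`, `segment`, `encode`) are hypothetical, normalization is deliberately aggressive, and segmentation is plain whitespace splitting rather than a learned sub-word method.

```python
import re


def normalize(text: str) -> str:
    """Step 2: lowercase and strip punctuation (an illustrative, lossy choice)."""
    return re.sub(r"[^\w\s]", "", text.lower())


def segment(text: str) -> list[str]:
    """Step 3: split on whitespace; real tokenizers learn sub-word segments."""
    return text.split()


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Steps 1-4: receive text, normalize, segment, and map tokens to IDs.

    New tokens are added to `vocab` as they are first seen.
    """
    ids = []
    for token in segment(normalize(text)):
        if token not in vocab:
            vocab[token] = len(vocab)
        ids.append(vocab[token])
    return ids


vocab: dict[str, int] = {}
print(encode("Hello, world! Hello again.", vocab))  # [0, 1, 0, 2]
print(vocab)  # {'hello': 0, 'world': 1, 'again': 2}
```

Note how the normalization step makes "Hello," and "Hello" share one ID, which shrinks the vocabulary but also discards the capitalization and punctuation the model might have used.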

Tokenization choices have direct consequences for model capabilities and equity. Tasks that depend on exact character sequences—word games, rare or misspelled drug names, poetry constraints—are challenging when models lack letter-level visibility. Math improves when digits are tokenized individually, illustrating how tailored token design can boost reasoning performance. Across languages, tokenization efficiency varies, affecting latency and cost: languages underrepresented in tokenizer training often require more tokens for the same content, increasing usage and potentially amplifying economic inequities. In practice, tokenization determines what an LLM can represent, how well it learns, and how fairly and efficiently it operates.

To understand text, LLMs must break text into tokens. Each unique token has a numeric identifier associated with it.

Generically, tokenization involves processing input to produce numeric identifiers for tokens.

The normalization process commonly involves changing text to remove upper-case characters and punctuation.

The segmentation process breaks normalized text into words or tokens so that each can be processed independently.

A simplified Byte-Pair Encoding algorithm for creating tokens: First, find the most frequent pair of characters, here “ng”. Next, replace all instances of “ng” with a placeholder token “T”, and add “ng” to the vocabulary. Repeat the process until no common byte pairs remain.
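The merge loop described in this caption can be sketched as follows. It is a simplified illustration under stated assumptions: the helper names and the toy corpus are hypothetical, merges never cross word boundaries, the merged string itself stands in for the placeholder token, and "no common pairs remain" is interpreted as "no pair occurs at least `min_count` times".

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return (pair, count) or None."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0] if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges, min_count=2):
    """Learn up to `num_merges` merges; stop early when no pair is common enough."""
    words = [list(word) for word in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        top = most_frequent_pair(words)
        if top is None or top[1] < min_count:
            break
        pair = top[0]
        merges.append(pair[0] + pair[1])  # the new sub-word joins the vocabulary
        words = [merge_pair(word, pair) for word in words]
    return merges, words


merges, words = learn_bpe(["sing", "long", "bang", "ring"], num_merges=10)
print(merges)  # ['ng', 'ing']
print(words)   # [['s', 'ing'], ['l', 'o', 'ng'], ['b', 'a', 'ng'], ['r', 'ing']]
```

In this toy corpus "ng" is the most frequent pair, so it is merged first; the next round then finds "i" followed by "ng" often enough to create "ing".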

Tokenizing two different sentences.

The tokenization approach means that GPT cannot really “see” single characters or word lengths. If you ask questions that require character-level identification and change them in a unique and unusual way, GPT starts to fail. The correct middle character is “a”, but GPT insists that it is the letter “e”. What GPT sees is three tokens, representing P, ine, and apple, respectively.

A comparison of how two LLMs learn to perform arithmetic computations over time. Time is shown on the x-axis. The upper curve is a typical BPE tokenizer, while the lower curve is the same tokenizer modified to use tokens that represent individual digits. The y-axis describes the ability of the LLM to perform accurately, where a smaller number means fewer errors. The bottom line is that LLMs that use digit-level tokenization can learn how to do math better and faster.
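One simple way to approximate digit-level tokenization, without changing the tokenizer itself, is to normalize numbers so that every digit stands alone. The sketch below is an assumption about how such a preprocessing step might look; the function names `space_digits` and `digit_tokens` are hypothetical and do not come from the book.

```python
import re


def space_digits(text: str) -> str:
    """Insert a space between adjacent digits so each digit is segmented alone."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)


def digit_tokens(text: str) -> list[str]:
    """Segment text so every digit is its own token; other runs stay together."""
    return re.findall(r"\d|[^\d\s]+", text)


print(space_digits("12345 + 678 = 13023"))
# 1 2 3 4 5 + 6 7 8 = 1 3 0 2 3
print(digit_tokens("12345 + 678"))
# ['1', '2', '3', '4', '5', '+', '6', '7', '8']
```

With this representation, the model sees the same ten digit tokens in every number instead of an arbitrary mix of multi-digit sub-words.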

Summary

Tokenization is the fundamental process that Large Language Models use to understand text by converting sentences into tokens.
Tokens are the smallest units of information in text that represent content. Sometimes, they correspond to full words, but often, they represent pieces of words or sub-words.
Tokenization involves normalizing text into a standard representation, which may involve converting characters to lowercase or translating the byte encoding of Unicode characters so that visibly identical characters employ the same encoding.
Tokenization also involves segmentation, which is breaking up text into words or sub-words. Algorithms like Byte-Pair Encoding (BPE) provide a mechanism to automatically learn how to efficiently segment text based on the statistical occurrence of combinations of letters in a training data set.
The result of building a tokenizer is known as a vocabulary, which is the unique collection of word and sub-word tokens that a tokenizer can use to represent text that it has processed.
The size of a tokenizer’s vocabulary affects the LLM’s ability to accurately represent data and the storage and computational resources required to understand and predict text.
Inside the LLM, tokens are represented using numbers. As a result, the model has no built-in understanding of relationships between tokens, such as prefixes and suffixes, or the fact that two tokens share a similar set of letters.
To support specific domains of knowledge, tokenizers trained automatically may be augmented to provide tokens that are important to their application.
Tokenizers that do not understand individual letters or digits will have issues with arithmetic operations or simple word games.

FAQ

Why do large language models tokenize text into numbers?

Neural networks operate on numbers, not raw text. Tokenization converts text into a sequence of discrete tokens, each mapped to a unique numeric identifier, so models can learn statistical relationships between them. Tokens are the basic units LLMs “see” and manipulate during training and inference.

What are sub-word tokens, and why not use only letters or whole words?

Sub-word tokens sit between letters and words, capturing meaningful chunks like prefixes, roots, and suffixes (for example, “schoolhouse” → “school”, “house”; “thoughtful” → “thought”, “ful”). This balances efficiency and flexibility: frequent words can be single tokens, while rare or new words can be composed from known parts. It mirrors how humans often parse unfamiliar terms by breaking them into recognizable components.

What are the main steps of the tokenization process?

The pipeline typically has four steps: (1) receive a text string, (2) normalize it (optional transformations like lowercasing or removing punctuation), (3) segment it into tokens, and (4) map tokens to unique numeric IDs. The first and last steps are straightforward necessities; the key design choices live in normalization and segmentation. Those choices shape vocabulary size, model capability, and robustness.

How does vocabulary size affect performance, capability, and deployability?

A larger vocabulary captures more nuance and reduces ambiguity, often improving capability. However, it increases storage, memory use, and compute cost, potentially slowing the model and complicating deployment. Building vocabularies is a trade-off between expressiveness and resource constraints; even publicly available models can require many gigabytes just to store token vocabularies.
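A rough back-of-the-envelope calculation shows why vocabulary size affects storage. Each token in the vocabulary needs its own embedding vector, so the embedding table grows linearly with the vocabulary. The vocabulary sizes, the embedding width of 4096, and the 2 bytes per value below are illustrative assumptions, not figures from the book or from any specific model.

```python
def embedding_table_bytes(vocab_size: int, embedding_dim: int,
                          bytes_per_value: int = 2) -> int:
    """Storage for one embedding table: one vector of `embedding_dim` values per token."""
    return vocab_size * embedding_dim * bytes_per_value


# Hypothetical vocabulary sizes with an assumed 4096-wide embedding in 16-bit precision.
for vocab_size in (32_000, 128_000, 256_000):
    size = embedding_table_bytes(vocab_size, embedding_dim=4096)
    print(f"{vocab_size:>7} tokens -> {size / 1e9:.2f} GB")
# prints:
#   32000 tokens -> 0.26 GB
#  128000 tokens -> 1.05 GB
#  256000 tokens -> 2.10 GB
```

A model that keeps a separate output (unembedding) table of the same shape would need roughly twice this amount.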

What is normalization, and when does lowercasing or stripping punctuation help or hurt?

Normalization reduces variability by transforming text (for example, lowercasing, removing punctuation), shrinking the vocabulary and simplifying learning. This can unify typos and casing variants, but it can also erase meaningful distinctions (for example, “Bill” the name versus “bill” the invoice). Modern LLMs often keep case and punctuation to preserve nuance and enable capabilities like typo correction, accepting a larger vocabulary as a trade-off.

Why isn’t whitespace-based segmentation enough?

Splitting on spaces is simple but leads to issues: punctuation-bound tokens like “hello,” differ from “hello”, inflating the vocabulary and ambiguity. Handwritten rules struggle with edge cases and languages that lack spaces (for example, Chinese). Data-driven sub-word methods better capture consistent, reusable token pieces across languages and contexts.
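A short experiment makes both problems visible. The example strings are illustrative, and `whitespace_tokens` is a hypothetical helper that simply wraps Python's `str.split`.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Naive segmentation: split on runs of whitespace only."""
    return text.split()


english = whitespace_tokens("hello, world. Say hello")
print(english)            # ['hello,', 'world.', 'Say', 'hello']
print(len(set(english)))  # 4: 'hello,' and 'hello' count as different tokens

# Chinese is written without spaces, so the whole sentence becomes one token.
chinese = whitespace_tokens("我喜欢学习语言")
print(len(chinese))       # 1
```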

How does Byte-Pair Encoding (BPE) build token vocabularies?

BPE starts from characters and repeatedly merges the most frequent adjacent pairs into new tokens, creating common sub-words efficiently. It typically uses minimal normalization and a learned segmenter; tokenizer authors then often add domain-specific and special tokens (for unknowns, system prompts, or multimodal boundaries). Alternatives like WordPiece and SentencePiece exist with different counting and whitespace handling, but BPE is widely used.

Can a longer string use fewer tokens? Why does tokenizer choice matter?

Yes. If a longer string matches a frequent sub-word (for example, “running”), it may map to a single token, whereas a shorter variant (“runnin”) might split into multiple tokens. Different tokenizer versions can segment the same text differently, affecting costs, model behavior, and evaluation consistency—important considerations for testing and production.
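The effect can be reproduced with a toy vocabulary. The sketch below uses greedy longest-match segmentation, which is a simplification and not how BPE applies its learned merges; the vocabulary contents and the function name `greedy_tokenize` are assumptions chosen for illustration.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Repeatedly take the longest vocabulary entry that matches at the current position."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                tokens.append(text[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(text[start])
            start += 1
    return tokens


# Hypothetical vocabulary in which the frequent word "running" has its own token.
vocab = {"running", "run", "n", "in"}
print(greedy_tokenize("running", vocab))  # ['running']
print(greedy_tokenize("runnin", vocab))   # ['run', 'n', 'in']
```

The seven-letter word costs one token, while the six-letter variant costs three, because only the former appears in the vocabulary as a whole.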

What are homoglyphs, and why are they risky?

Homoglyphs are visually identical characters with different byte encodings (for example, Latin “H” vs. Cyrillic “Н”). Invisible characters such as the zero-width space pose a related risk. Tokenizers treat such characters as distinct symbols, which can inflate token counts, alter meaning, or enable adversarial inputs. Many systems mitigate this with normalization that removes or canonicalizes suspicious characters before tokenization.
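Python's standard `unicodedata` module is enough to see the problem and one possible mitigation. This is a sketch under stated assumptions: `describe` and `strip_invisible` are hypothetical helpers, and the cleanup shown handles only invisible format characters and compatibility variants. Cross-script lookalikes such as Cyrillic “Н” are not changed by Unicode normalization and would need a separate confusables mapping.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return the official Unicode name of each character."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in text]


def strip_invisible(text: str) -> str:
    """Drop format characters (Unicode category Cf), such as the zero-width space."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


latin, cyrillic = "H", "\u041d"  # the two look the same in most fonts
print(latin == cyrillic)           # False
print(describe(latin + cyrillic))
# ['LATIN CAPITAL LETTER H', 'CYRILLIC CAPITAL LETTER EN']

# A zero-width space hidden inside a word is removed by the filter.
print(strip_invisible("pass\u200bword") == "password")   # True

# NFKC folds compatibility variants, such as the fullwidth "H", into the plain letter.
print(unicodedata.normalize("NFKC", "\uff28") == "H")    # True
```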

How does tokenization shape strengths and weaknesses in word games, rare terms, and math?

Because models operate on sub-words, they don’t directly track letters or exact word lengths, making letter-by-letter puzzles and rhyme schemes challenging. Rare or misspelled domain terms can tokenize inconsistently (for example, drug names), increasing confusion risk; adding domain tokens or improving input hygiene can help. For math, digit-level tokens often improve arithmetic by letting models reason over individual digits; some systems normalize numbers (for example, spacing digits) to encourage this behavior.
