How Large Language Models Work

Edward Raff, Drew Farris and Stella Biderman for Booz Allen Hamilton

June 2025
ISBN 9781633437081
200 pages

Included with a Manning Online subscription

printed in black & white

available in Simplified Chinese

table of contents

1 Big picture: What are LLMs?

1.1 Generative AI in context

1.2 What you will learn

1.3 Introducing how LLMs work

1.4 What is intelligence, anyway?

1.5 How humans and machines represent language differently

1.6 Generative Pretrained Transformers and friends

1.7 Why LLMs perform so well

1.8 LLMs in action: The good, bad, and scary

2 Tokenizers: How large language models see the world

2.1 Tokens as numeric representations

2.2 Language models see only tokens

2.2.1 The tokenization process

2.2.2 Controlling vocabulary size in tokenization

2.2.3 Tokenization in detail

2.2.4 The risks of tokenization

2.3 Tokenization and LLM capabilities

2.3.1 LLMs are bad at word games

2.3.2 LLMs are challenged by mathematics

2.3.3 LLMs and language equity

2.4 Check your understanding

2.5 Tokenization in context

3 Transformers: How inputs become outputs

3.1 The transformer model

3.1.1 Layers of the transformer model

3.2 Exploring the transformer architecture in detail

3.2.1 Embedding layers

3.2.2 Transformer layers

3.2.3 Unembedding layers

3.3 The tradeoff between creativity and topical responses

3.4 Transformers in context

4 How LLMs learn

4.1 Gradient descent

4.1.1 What is a loss function?

4.1.2 What is gradient descent?

4.2 LLMs learn to mimic human text

4.2.1 LLM reward functions

4.3 LLMs and novel tasks

4.3.1 Failing to identify the correct task

4.3.2 LLMs cannot plan

4.4 If LLMs cannot extrapolate well, can I use them?

4.5 Is bigger better?

5 How do we constrain the behavior of LLMs?

5.1 Why do we want to constrain behavior?

5.1.1 Base models are not very usable

5.1.2 Not all model outputs are desirable

5.1.3 Some cases require specific formatting

5.2 Fine-tuning: The primary method of changing behavior

5.2.1 Supervised fine-tuning

5.2.2 Reinforcement learning from human feedback

5.2.3 Fine-tuning: The big picture

5.3 The mechanics of RLHF

5.3.1 Beginning with a naive RLHF

5.3.2 The quality reward model

5.3.3 The similar-but-different RLHF objective

5.4 Other factors in customizing LLM behavior

5.4.1 Altering training data

5.4.2 Altering base model training

5.4.3 Altering the outputs

5.5 Integrating LLMs into larger workflows

5.5.1 Customizing LLMs with retrieval augmented generation

5.5.2 General-purpose LLM programming

6 Beyond natural language processing

6.1 LLMs for software development

6.1.1 Improving LLMs to work with code

6.1.2 Validating code generated by LLMs

6.1.3 Improving code via formatting

6.2 LLMs for formal mathematics

6.2.1 Sanitized input

6.2.2 Helping LLMs understand numbers

6.2.3 Math LLMs also use tools

6.3 Transformers and computer vision

6.3.1 Converting images to patches and back

6.3.2 Multimodal models using images and text

6.3.3 Applicability of prior lessons

7 Misconceptions, limits, and eminent abilities of LLMs

7.1 Human rate of learning vs. LLMs

7.1.1 The limitations on self-improvement

7.1.2 Few-shot learning

7.2 Efficiency of work: A 10-watt human brain vs. a 2000-watt computer

7.2.1 Power

7.2.2 Latency, scalability, and availability

7.2.3 Refinement

7.3 Language models are not models of the world

7.4 Computational limits: Hard problems are still hard

7.4.1 Using fuzzy algorithms for fuzzy problems

7.4.2 When close enough is good enough for hard problems

8 Designing solutions with large language models

8.1 Just make a chatbot?

8.2 Automation bias

8.2.1 Changing the process

8.2.2 When things are too risky for autonomous LLMs

8.3 Using more than LLMs to reduce risk

8.3.1 Combining LLM embeddings with other tools

8.3.2 Designing a solution that uses embeddings

8.4 Technology presentation matters

8.4.1 How can you be transparent?

8.4.2 Aligning incentives with users

8.4.3 Incorporating feedback cycles

9 Ethics of building and using LLMs

9.1 Why did we build LLMs at all?

9.1.1 The pros and cons of LLMs doing everything

9.1.2 Do we want to automate all human work?

9.2 Do LLMs pose an existential risk?

9.2.1 Self-improvement and the iterative S-curve

9.2.2 The alignment problem

9.3 The ethics of data sourcing and reuse

9.3.1 What is fair use?

9.3.2 The challenges associated with compensating content creators

9.3.3 The limitations of public domain data

9.4 Ethical concerns with LLM outputs

9.4.1 Licensing implications for LLM output

9.4.2 Do LLM outputs poison the well?

9.5 Other explorations in LLM ethics

References

Overview

7 Misconceptions, Limits, and Eminent Abilities of LLMs

This chapter separates hype from reality about large language models. It counters misconceptions that LLMs continually self-improve, possess humanlike intelligence, or will soon solve every problem, and instead frames them as static predictors whose strengths come from scale, speed, and availability. The narrative centers on three themes: how LLMs and humans learn in fundamentally different ways; why “thinking” is a misleading metaphor and why producing intermediate reasoning often improves results; and how computational complexity places real limits on what LLMs can do, guiding practitioners on when to use or avoid them.

Human learning is interactive, sample-efficient, and staged, whereas LLMs learn by next-token prediction over massive corpora, absorbing vocabulary and patterns all at once. This yields clear tradeoffs: breadth, low marginal cost, and rapid deployment versus brittleness to novelty, vulnerability in adversarial settings, and costly, uncertain improvement cycles. In-context (“few-shot”) prompting can steer behavior without altering weights, but it is not true learning and shows diminishing returns; material gains typically require better prompts, fine-tuning, and fresh external information. Closed self-improvement loops with model-generated data degrade performance absent new signal, and even tool-augmented approaches plateau and incur economic constraints. Operationally, LLMs bring latency, scalability, and availability advantages, but power costs and data drift demand monitoring, logging, and human-in-the-loop refinement.

On cognition, the chapter argues LLMs compute rather than think: they cannot silently plan and must emit intermediate tokens to “reason,” which helps mainly by increasing computation or aligning with pedagogical patterns in their training. Formal limits back this up: transformer inference grows roughly quadratically with context, and even with long intermediate steps, LLMs align with polynomial-time capabilities, not the NP class and beyond. As a result, LLMs are best for fuzzy, high-volume language tasks—summarization, drafting, translation, retrieval-augmented answers—where approximate results suffice, and are a poor fit for exact, adversarial, or safety-critical problems without robust guardrails, complementary algorithms, and human oversight.

A summary of the strengths and weaknesses of LLMs relative to humans performing the same task. These lead to natural considerations that you must evaluate when using an LLM. From these, we can draw broad recommendations for successful LLM use.

Concerns that LLMs will self-improve require the belief that LLMs won’t follow the normal “sigmoid” or “S-curve” of diminishing returns that describes the development of almost all other technologies. For infinite self-improvement to happen, we must believe that constraints such as power, data, or computational capacity are always solvable and that, somehow, humans would not otherwise solve them for areas outside of LLMs. Constraints such as these are why we can describe most technology development using S-curves, where progress slows as more constraints take effect. In other words, we’ll eventually reach a state where we can’t just build a bigger computer.
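
The S-curve argument above can be made concrete with a small sketch. It assumes a plain logistic function as a stand-in for "capability versus invested effort"; the parameter names (`ceiling`, `rate`, `midpoint`) and the numbers are illustrative assumptions, not values from the book. The point is only that equal steps of extra effort buy less and less once the curve passes its midpoint.

```python
import math


def s_curve(effort: float, ceiling: float = 1.0, rate: float = 1.0,
            midpoint: float = 0.0) -> float:
    """Logistic ("sigmoid") curve: a toy model of capability vs. effort.

    `ceiling` stands for the hard limit imposed by constraints such as
    power, data, or compute. All parameters are illustrative assumptions.
    """
    return ceiling / (1.0 + math.exp(-rate * (effort - midpoint)))


def marginal_gain(effort: float, step: float = 1.0) -> float:
    """Extra capability gained by investing one more `step` of effort."""
    return s_curve(effort + step) - s_curve(effort)


if __name__ == "__main__":
    # Gains grow before the midpoint and shrink after it.
    for effort in range(-4, 6, 2):
        print(f"effort={effort:+d}  capability={s_curve(effort):.3f}  "
              f"gain from next step={marginal_gain(effort):.3f}")
```

Under these assumptions, the gain from each additional step rises until the midpoint and then falls toward zero, which is the diminishing-returns shape the paragraph describes.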

Moore’s law is a common example of boundless growth, but it is misleading. Transistor counts keep doubling, but frequency, power, single-threaded performance, and total computing do not. So, the total system performance has not continued to double approximately every two years. Other similar factors will constrain LLM performance and impact capability over time. Used under CC4.0 license from https://github.com/karlrupp/microprocessor-trend-data.

Prompts with examples of how you want the LLM to produce output are called “few-shot” prompts because the model has not seen any examples of this specific behavior in its training data. In your prompt, you can include examples of input and output, similar to the input-output pairs used for RLHF/SFT. This prompting style encourages the model to produce the desired output by providing examples of what the desired output should look like. Because LLMs train on such a large amount of unlabeled data, k-shot examples are an effective way to get better results with minimal effort.
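
As a rough illustration of the few-shot idea described above, the sketch below assembles a prompt from a handful of input/output pairs. The function name `build_few_shot_prompt`, the `Input:`/`Output:` labels, and the sentiment examples are all hypothetical choices made for this sketch; real models and APIs may expect a different layout.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: Sequence[Tuple[str, str]],
                          query: str) -> str:
    """Assemble a k-shot prompt: instruction, k worked examples, new input.

    The "Input:"/"Output:" labels are an assumed convention for this
    sketch, not a format required by any particular model.
    """
    lines = [instruction.strip(), ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")  # blank line between examples
    # End with the new input and an open "Output:" for the model to fill in.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = build_few_shot_prompt(
        "Classify the sentiment of each review as positive or negative.",
        [("great movie", "positive"), ("terrible plot", "negative")],
        "loved every minute",
    )
    print(demo)
```

Nothing about the model's weights changes here: the examples only occupy part of the context, which is why this approach is cheap to try before committing to fine-tuning.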

The expensive hardware that makes LLMs work leads to several tradeoffs. For example, the “startup” cost of using LLMs is often high, and they do not “adapt” independently. This lack of independent adaptation leads to many natural weaknesses where a human would outperform an LLM. Some weaknesses, such as the fact that a model doesn’t change without training, can be considered strengths. You don’t get repeatable processes that are easy to scale if each new LLM instance behaves differently and unpredictably.

The context and reason why someone is wearing or doing something unusual may be in the realm of something that an LLM properly recognizes and for which it produces an appropriate response. However, it might not be possible for an LLM to reach that appropriate response without producing some intermediate text. For a math problem, this intermediate text could be useful, but the intermediate text may not always be appropriate or desirable for a user to see.

A Venn diagram of how computational complexity classes (assuming $P \neq NP$, a minor point for the nerds) relate to each other. The top arrows give examples of the kind of problem that a new complexity class lets you solve. The bottom arrows show where LLMs land in terms of their complexity.

Summary

The biggest advantage LLMs have over humans is the scale they achieve. LLMs can run at low cost, 24/7, and be resized to meet demand with far less effort than training up or reducing a human workforce.
Humans are better at handling highly novel situations, which is important if the people interacting with the LLM might be adversaries (e.g., trying to commit fraud).
We know LLMs work well at problems similar to what they have seen before in their training data, making them useful for repetitive work.
Prompt engineering is likely the most effective starting point to “teach” LLMs something new unless you can dedicate large amounts of effort and money to data collection and fine-tuning.
LLMs cannot self-improve and are inefficient for solving algorithmic problems requiring a specific correct answer. They work best on “fuzzy” problems where there is some range of satisfying outputs, and some amount of error is acceptable.

FAQ

Are LLMs continually learning from every conversation?

No. Once trained, an LLM’s parameters are static. Interactions do not update the model unless developers retrain or fine‑tune it with new data. Prompting can change outputs in the moment (in‑context learning), but it does not change what the model knows.

How do LLMs learn differently from humans?

Humans learn efficiently through interactive, incremental experience and can generalize from relatively little data. LLMs learn by predicting the next token over massive corpora, ingesting vast vocabularies at once. They gain breadth and scale, but lack humans’ sample‑efficient, adaptive learning.

Why is calling LLM behavior “thinking” misleading?

LLMs compute and emit tokens; they don’t separate internal thought from output. Producing more intermediate text gives them more computation, which can help on hard problems, but this is not the same as human planning or having a world model.

Can LLMs self‑improve by training on their own outputs?

Not reliably. By information theory, model‑generated data contains no new information beyond the original training distribution, so iterative self‑training tends to degrade quality. Real improvement requires new external information, tools, or human‑curated data—and still faces diminishing returns and cost constraints.
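
A toy sketch can illustrate why a closed loop adds no new signal. It is not a language model and not the book's argument: it simply treats a "model" as a table of token counts, draws a finite sample from it, and refits the table to that sample, generation after generation. The function name `resample_generations` and the token counts are hypothetical. Under these assumptions, a token that drops out of one generation can never reappear, so the vocabulary the "model" covers can only stay the same or shrink.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def resample_generations(counts: Dict[str, int], sample_size: int,
                         generations: int,
                         seed: int = 0) -> Tuple[List[int], Counter]:
    """Repeatedly refit a token-frequency table to samples of itself.

    Returns the number of distinct tokens after each generation
    (starting with the original table) and the final table.
    """
    rng = random.Random(seed)
    current = Counter({tok: c for tok, c in counts.items() if c > 0})
    support_sizes = [len(current)]
    for _ in range(generations):
        tokens = list(current.keys())
        weights = [current[tok] for tok in tokens]
        # "Generate" data from the current table, then "train" on it.
        samples = rng.choices(tokens, weights=weights, k=sample_size)
        current = Counter(samples)
        support_sizes.append(len(current))
    return support_sizes, current


if __name__ == "__main__":
    # Hypothetical corpus: a few common tokens and several rare ones.
    corpus = {"the": 500, "cat": 50, "sat": 40, "quokka": 2,
              "zeugma": 1, "syzygy": 1}
    sizes, final = resample_generations(corpus, sample_size=100,
                                        generations=20)
    print("distinct tokens per generation:", sizes)
    print("surviving tokens:", sorted(final))
```

The only way to recover a lost token in this setup is to bring in data from outside the loop, which mirrors the answer's point about needing new external information.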

When is few‑shot (in‑context) learning useful, and when should I fine‑tune?

Few‑shot prompting adds a handful of examples to the prompt to steer behavior without changing weights. It’s fast and often effective when you have little labeled data, but exhibits diminishing returns. If performance is still lacking, consider supervised fine‑tuning or RLHF.

Why do intermediate steps (chain‑of‑thought) often help?

Asking for step‑by‑step reasoning increases the amount of computation the model performs, which can improve accuracy. However, it can still miss steps or reason incorrectly, and the verbose reasoning may be undesirable to show users. Hidden or tool‑aided reasoning can mitigate this.

What advantages make LLMs attractive for latency and scale‑sensitive applications?

LLMs offer rapid response, 24/7 availability, and easy horizontal scaling to large numbers of concurrent tasks. They provide broad competence across domains at low marginal cost per use, making them well‑suited to high‑volume, time‑sensitive workloads.

What are key limitations and risks of LLMs in practice?

They are brittle to novel or adversarial inputs, cannot autonomously adapt, and may repeatedly fail without guidance. Training and refinement are costly, and power demands can be significant. Guardrails, monitoring, and human‑in‑the‑loop review are often necessary.

What do computational limits imply about what LLMs can solve?

Transformer inference scales roughly quadratically with input length, and expressivity depends on how many intermediate tokens the model emits. LLMs can approximate problems in class P with enough steps, but cannot efficiently solve NP‑hard problems exactly. They shine on “fuzzy” tasks where exact correctness isn’t required.
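
To make "roughly quadratically" concrete, here is a back-of-the-envelope sketch. It assumes standard full self-attention, where each of the n tokens in the context is compared against every token, and it deliberately ignores constant factors, the number of layers, feed-forward cost, and optimizations such as caching. The function names are illustrative.

```python
def attention_pair_count(context_length: int) -> int:
    """Token-to-token comparisons in one full self-attention pass (n * n).

    Simplifying assumption: every token attends to every token; constant
    factors and all other costs are ignored.
    """
    if context_length < 0:
        raise ValueError("context_length must be non-negative")
    return context_length * context_length


def relative_cost(new_length: int, old_length: int) -> float:
    """How many times more attention work `new_length` needs than `old_length`."""
    return attention_pair_count(new_length) / attention_pair_count(old_length)


if __name__ == "__main__":
    for n in (1_000, 2_000, 10_000):
        print(f"context={n:>6} tokens  pairs={attention_pair_count(n):>12,}  "
              f"cost vs. 1,000 tokens: {relative_cost(n, 1_000):.0f}x")
```

Under this simplification, doubling the context quadruples the attention work, and a tenfold longer context costs about a hundred times more. This is one reason long intermediate reasoning is not free.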

When are LLMs the right fit versus the wrong fit?

Use LLMs for repetitive, mildly varying, and fuzzy tasks—summarization, drafting, translation, style edits—where “close enough” is acceptable and humans or tools can refine outputs. Avoid them for zero‑tolerance, adversarial, or highly novel settings where exact, verifiable solutions are required.
