Overview

3 Transformers: How Inputs Become Outputs

Large language models treat text as tokens and generate outputs by repeatedly converting those tokens into numbers, transforming them, and mapping the results back to new tokens. This cyclical, statistical process is unlike human language use but is highly effective for predicting what comes next in a sequence. Modern models implement this with Transformers, architectures built to predict tokens rather than “understand” language in a human sense, and the chapter constructs a practical mental model for how inputs flow through the system to become coherent text.

The pipeline begins with an embedding layer that turns tokens into high-dimensional vectors capturing meaning, augmented with positional information so order matters. Stacked transformer layers then apply attention, using queries, keys, and values to highlight the most relevant context and refine representations across many steps. Finally, an output (unembedding) layer converts internal vectors into a probability distribution over tokens; sampling selects the next token, making generation autoregressive—each new choice depends on the prior ones—until an end-of-sequence signal stops the loop. This probabilistic decoding explains why the same prompt can yield different valid continuations.

Transformers come in three main styles: encoder-only models that produce rich representations for tasks, decoder-only models that continue text by next-token prediction, and encoder-decoder models that map one passage to another, excelling at tasks like translation and summarization. Embedding spaces enable useful semantic relationships but bring trade-offs: more dimensions increase expressivity and cost, and patterns in data can imprint unwanted biases. During decoding, controls like temperature adjust the balance between conservative, on-topic responses and creative, surprising ones. Together, these components and choices explain how LLMs convert tokenized inputs into fluent outputs.

The basic components of the transformer model, consisting of the embedding layer, multiple transformer layers, and the output layer
figure
The full process going from a document to a transformer output
figure
If you use just one number to represent a token, you quickly encounter problems where similar/dissimilar words can not be made to “fit” each other. Here we see how trying to represent simple synonym/antonym relationships quickly becomes nonsensical even with just a handful of words.
figure
Adding another dimension to our token representation allows us to represent a more diverse arrangement of semantic relationships. Here, we see how different dimensions can capture relationships for multiple meanings of the same word
figure
Demonstration of how the relationships are sufficiently useful makes the embeddings a “semantic space”. Words with similar meanings are near each other, but a single transformation of a word can be applied to multiple words to yield a similar result. In this instance, a transformation for finding the female version of a word.
figure
Because Transformers do not understand that their inputs have a specific order, all possible re-organizations of the tokens “look” identical to the algorithm. This is problematic because word order can change the word’s context or, if done randomly, become gibberish.
figure
Transformer layers do not understand that the inputs have a specific order. This information about the order of tokens is endowed via a second “positional” encoding. The position embeddings work the same way as token embeddings and are added together. This provides the information the model needs on the order of tokens.
figure
An example of how queries, keys, and values work inside a transformer compared to a Python dictionary. When a Python dictionary matches queries to keys, it needs an exact match to find the value, or it will return nothing. A transformer always returns something based on the most similar matches between queries and keys.
figure
The next token in a sentence is predicted by using the current token as the “query” and calculating matches with the preceding words as the “keys”. The individual values themselves do not need to exist in the semantic space; the output of the attention mechanism produces something similar to one of the tokens in the vocabulary.
figure
Producing output from LLMs involves converting from documents to tokens and then using the model to produce output. We loop through this process to both consume text and generate human-readable output
figure
We demonstrate text generation by starting with the phrase "I love to eat" and then showing some possible completions that are foods, such as barbeque and sushi, have high probabilities, while a car and the number ’42’ have low probabilities. Weighted random selection chooses the word ’tacos’. The generation loop is stopped when the EoS token appears.
figure

Summary

  • While GPTs use tokens as their "basic unit of semantic meaning," they’re mathematically represented within the model as embedding vectors rather than as strings. These embedding vectors can capture relationships about nearness, dissimilarity, antonyms, and other linguistic descriptive properties.
  • Position and word order do not come naturally to transformers and are obtained via another vector representing the relative position. The model can represent word order by adding the position and token embedding vectors.
  • Transformer layers act as a kind of fuzzy dictionary, returning approximate answers to approximate matches. This fuzzy process is called attention and uses the terms Query, Key, and Value as analogous to the key and value in a Python dictionary.
  • GPTs are examples of decoder-only transformers, but encoder-only transformers and encoder-decoder transformers also exist. GPTs are best at generating text, but other types can be better at other tasks.
  • GPTs are autoregressive, meaning they work recursively. All previously generated tokens are fed into the model at each step to get the next token. Simply put, autoregressive models “predict the next thing using the previous things”.
  • The output of any transformer isn’t tokens; instead, the output is a probability for how likely every token is. Selecting a specific token is called Unembedding or sampling and includes some randomness.
  • The strength of this randomness can be controlled, resulting in more or less realistic output, more creative or unique output, or more consistent output. Most LLMs have a default threshold for randomness that is “reasonable looking,” but you may want to change it for different uses.

FAQ

What are the main stages in an LLM’s text generation pipeline?The process typically follows seven steps: (1) tokenize the input text; (2) map tokens to embedding vectors; (3) add positional embeddings to encode order; (4) pass the sequence through many transformer layers; (5) project the result back to token scores (unembedding/output layer); (6) sample one token from the scored vocabulary; (7) detokenize to text and repeat autoregressively until an end-of-sequence token or a length limit is reached.
Why do LLMs convert tokens into vectors instead of using token IDs directly?Neural networks operate on continuous numbers they can adjust during learning. A token ID is a fixed identifier that carries no graded meaning. Embedding layers map each token to a high-dimensional vector that can capture nuanced relationships (similarity, antonymy, multiple senses) that the model can manipulate.
Why isn’t a single number enough to represent a word?One dimension cannot simultaneously encode many relationships (synonyms, antonyms, multiple meanings). Using one number leads to contradictions (e.g., making antonyms mere negations wrongly ties unrelated words together). Multiple dimensions allow the model to place words in positions that reflect many relationships at once.
What is an embedding or “semantic” space, and what useful properties does it have?It’s a high-dimensional space where each token is represented by a vector. Useful properties include: similar words being close together; consistent vector offsets that capture relations (e.g., male→female), and the ability to compose many relations simultaneously. Note that these spaces reflect training data and can encode societal biases.
Why do Transformers need positional information and how is it added?Transformer layers don’t inherently know the order of tokens; without position, a sentence and its jumbled version look identical. Models add a positional vector (based on the token’s index) to the token’s meaning vector; the sum encodes both identity and position so the model can use order-sensitive patterns.
What happens inside a transformer layer (Query, Key, Value and attention)?Each token is mapped to a Query (what it’s looking for), a set of Keys (what others offer), and corresponding Values (the content to retrieve). Attention scores how well the current Query matches each Key, weights the Values accordingly, and aggregates them. This “fuzzy lookup” lets the model focus on the most relevant context when forming the next representation.
What are encoder-only, decoder-only, and encoder–decoder models used for?- Encoder-only: build rich representations of input text for tasks like classification and retrieval (e.g., BERT, RoBERTa). - Decoder-only: generate continuations by predicting the next token (e.g., GPT, Gemini). - Encoder–decoder: transform one sequence into a corresponding output sequence, excelling at tasks like translation and summarization (e.g., T5, Google Translate’s core models).
How are vectors converted back into tokens, and how does generation stop?The output/unembedding layer scores every vocabulary token against the current vector. The model samples one token, appends it to the output, and feeds it back in to predict the next token (an autoregressive loop). Generation ends when a special end-of-sequence token appears or a maximum length is reached.
How does the model choose the next token, and why is randomness involved?The model converts token scores into probabilities and samples one token according to those probabilities. Randomness is essential because many continuations can be valid; always picking the single highest-probability token would produce repetitive, unnatural text. Sampling balances plausibility with variability and can be tuned (e.g., via nucleus sampling).
What does “temperature” control, and when should you raise or lower it?Temperature rescales token probabilities before sampling. Lower temperature emphasizes high-probability tokens for focused, reliable outputs; higher temperature boosts lower-probability tokens for more diverse, creative outputs. Typical defaults are around 0.7–0.8; lower for factual or precise tasks, higher for brainstorming or creative writing.

pro $24.99 per month

  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose one free eBook per month to keep
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime

lite $19.99 per month

  • access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more


choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • How Large Language Models Work ebook for free
choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • How Large Language Models Work ebook for free
choose your plan

team

monthly
annual
$49.99
$499.99
only $41.67 per month
  • five seats for your team
  • access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
  • choose another free product every time you renew
  • choose twelve free products per year
  • exclusive 50% discount on all purchases
  • renews monthly, pause or cancel renewal anytime
  • renews annually, pause or cancel renewal anytime
  • How Large Language Models Work ebook for free