Overview

15 Language models and the Transformer

This chapter moves from basic text preprocessing to models that generate and transform language. It begins with the language modeling paradigm—predicting the next token given prior tokens—and shows how an autoregressive loop turns next-token predictions into open-ended text. Building on this, it frames machine translation as sequence-to-sequence learning with an encoder-decoder design and a decoding loop that feeds previously generated tokens back into the model. Along the way, it highlights the limitations of RNN-based approaches for long-range dependencies and fixed-size state bottlenecks, motivating a shift toward attention-based architectures.

The core of the chapter is the Transformer, which replaces recurrence with attention. It introduces dot-product attention with the query–key–value formulation, softmax weighting, scaling, and multi-head parallelism to capture diverse relationships. Self-attention enables tokens to contextualize one another; residual connections, layer normalization, and two-layer feed-forward blocks provide depth and nonlinearity. A practical caveat is that attention is order-agnostic, so positional embeddings are added to token embeddings to encode sequence order. Implemented as stacked encoder and decoder blocks—with causal masking in the decoder and cross-attention to the encoder—the Transformer achieves better translation quality than a GRU baseline while training faster on accelerators due to parallelism.

The chapter then demonstrates the modern workflow of leveraging large pretrained Transformers (e.g., BERT/RoBERTa) trained with masked language modeling and subword tokenization, and fine-tuning them for downstream tasks such as IMDb sentiment classification, achieving higher accuracy with minimal task-specific training. It closes by explaining why Transformers work so well: attention iteratively shapes semantically continuous and interpolative embedding spaces, echoing word2vec’s principles but at far greater scale and expressivity—storing not just facts but “vector programs” that can be recombined at inference time. This power comes with trade-offs (data hunger, potential hallucinations), and the field continues to evolve with improvements to attention, normalization, and positional encoding, as well as alternatives for very long sequences.

Sequence-to-sequence learning: the source sequence is processed by the encoder and is then sent to the decoder. The decoder looks at the target sequence so far and predicts the target sequence offset by one step in the future. During inference, we generate one target token at a time and feed it back into the decoder.
A sequence-to-sequence RNN: an RNN encoder is used to produce a vector that encodes the entire source sequence, which is used as the initial state for an RNN decoder.
The general concept of “attention” in deep learning: input features get assigned “attention scores,” which can be used to inform the next representation of the input.
Attention assigns a relevance score to each vector in a source for each vector in a target sequence.
When both target and source are sequences, the attention scores form a 2D matrix. Each row shows the attention scores for the word we are trying to predict, in green.
Retrieving images from a database: the query is compared to a set of keys, and the match scores are used to rank values (images).
Multi-headed attention allows each target word to attend to different parts of the source sequence in separate partitions of the eventual output vector.
A visual representation of the computations for both TransformerEncoder and TransformerDecoder blocks.

Chapter summary

  • A language model is a model that learns a specific probability distribution – p(token|past tokens).
    • Language models have broad applications, but the most important is that you can generate text by calling them in a loop – where the output token at one time step becomes the input token in the next.
    • A masked language model learns a related probability distribution p(tokens|surrounding tokens), and can be helpful for classifying text and individual tokens.
    • A sequence-to-sequence language model learns to predict the next token given both past tokens in a target sequence and an entirely separate, fixed, source sequence. Sequence-to-sequence models are useful for problems like translation and question answering.
    • A sequence-to-sequence model usually has two separate components. An encoder computes a representation of the source sequence, and a decoder takes this representation as input and predicts the next token in a target sequence based on past tokens.
  • Attention is a mechanism that allows a model to pull information from anywhere in a sequence selectively based on the context of the token currently being processed.
    • Attention avoids the problems RNNs have with long-range dependencies in text.
    • Attention works by taking the dot-product of two vectors to compute an attention score. The scores are normalized with a softmax and used to weight a sum of vectors, so vectors near the query in the embedding space contribute most to the result.
  • The Transformer is a sequence modeling architecture that uses attention as the only mechanism to pass information across a sequence.
    • The Transformer works by stacking blocks of alternating attention and two-layer feed-forward networks.
    • The Transformer can scale to many parameters and lots of training data while still improving accuracy in the language modeling problem.
    • Unlike RNNs, the Transformer involves no sequence-length loops at training time, making the model much easier to train in parallel across many machines.
    • A Transformer encoder uses bidirectional attention to build a rich representation of a sequence.
    • A Transformer decoder uses causal attention to predict the next word in a language model setup.

FAQ

What is a language model, and why predict one token at a time?
A language model estimates p(next token | past tokens). Predicting one token at a time keeps the output space tractable: with a 20,000-word vocabulary the model outputs 20,000 probabilities per step, and by repeating this step we can generate long sequences. Directly classifying whole sequences is intractable because the number of possible sequences grows exponentially with length.
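
The tractability argument can be checked with a little arithmetic. The sketch below takes the 20,000-word vocabulary from the answer above; the sequence length of 10 is an assumption chosen only for illustration.

```python
vocab_size = 20_000  # vocabulary size quoted in the answer above
seq_len = 10         # assumed sequence length, for illustration only

# Number of probabilities the model outputs to predict a single next token.
per_step_outputs = vocab_size

# Total probabilities produced when generating seq_len tokens one at a time.
autoregressive_outputs = vocab_size * seq_len

# Number of classes needed to score every possible sequence of that length
# directly: it grows exponentially with the sequence length.
whole_sequence_outputs = vocab_size ** seq_len

print(per_step_outputs)
print(autoregressive_outputs)
print(whole_sequence_outputs)
```

Under these assumptions, token-by-token prediction needs a few hundred thousand outputs in total, while a direct classifier over whole sequences would need a number of classes with more than forty digits.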
How do you generate text from a trained model, and why can’t a bidirectional RNN be used for it?
Generation is autoregressive: feed a prompt, get a distribution over the next token, select one (e.g., argmax), append it to the input, and repeat while carrying the RNN/Transformer state. A bidirectional RNN “cheats” during training by using future tokens to predict the present one, which breaks causal generation where the future is unknown.
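
The loop described above can be sketched without any deep learning library. In this minimal example the "model" is a hypothetical lookup table (`NEXT_TOKEN_PROBS`) that only looks at the last token, and the `[end]` marker is an assumed stop token; a real language model would condition on the whole prefix.

```python
# Hypothetical toy "model": next-token probabilities keyed by the previous
# token. It stands in for a trained network and is not from the chapter.
NEXT_TOKEN_PROBS = {
    "to": {"be": 0.7, "not": 0.2, "or": 0.1},
    "be": {"or": 0.6, "to": 0.3, "[end]": 0.1},
    "or": {"not": 0.8, "to": 0.2},
    "not": {"to": 0.9, "[end]": 0.1},
}


def predict_next(tokens):
    """Return a distribution over the next token given the tokens so far."""
    return NEXT_TOKEN_PROBS.get(tokens[-1], {"[end]": 1.0})


def generate(prompt, max_new_tokens=5):
    """Greedy autoregressive generation: predict, select, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        next_token = max(probs, key=probs.get)  # greedy (argmax) selection
        if next_token == "[end]":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


print(generate(["to"]))
# → ['to', 'be', 'or', 'not', 'to', 'be']
```

Swapping the argmax for random sampling from `probs` would give more varied output at the cost of determinism.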
How was the Shakespeare character-level model built and trained?
The chapter uses a character-level tokenizer (a vocabulary of around 67 characters), splits the text into fixed-length sequences, and trains a GRU-based model made of an Embedding layer, a GRU(return_sequences=True) layer, Dropout, and a Dense softmax over the vocabulary. The model is trained with sparse categorical crossentropy across all time steps, reaches about 70% next-character accuracy after ~20 epochs, and can generate Shakespeare-like text from a prompt.
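
The preprocessing side of that setup can be sketched in plain Python. This is not the chapter's code: the one-line corpus, the `make_windows` helper, and the sequence length of 8 are all assumptions made for illustration.

```python
# Stand-in corpus; the chapter trains on a much larger Shakespeare text.
text = "To be, or not to be, that is the question."

# Character-level vocabulary: one integer id per distinct character.
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
ids = [char_to_id[ch] for ch in text]


def make_windows(ids, sequence_length):
    """Split ids into (input, target) pairs; target is input shifted by one."""
    pairs = []
    # Each chunk needs sequence_length + 1 ids to yield an input and a target.
    for start in range(0, len(ids) - sequence_length, sequence_length):
        chunk = ids[start : start + sequence_length + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_windows(ids, sequence_length=8)
first_input, first_target = pairs[0]
print(first_input)
print(first_target)
```

Every position in a window contributes one next-character prediction, which is why the GRU returns its full output sequence and the loss is computed across all time steps.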
What is sequence-to-sequence (seq2seq) learning for translation, and how do training and inference differ?
Seq2seq uses an encoder to summarize the source sentence and a decoder (trained as a language model) to generate the target, conditioning on previous target tokens and the encoder output. During training (“teacher forcing”), the decoder sees the true previous tokens; during inference, it must generate tokens one by one from scratch. Padding positions are masked (via sample weights) so they don’t affect loss/metrics.
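
A minimal sketch of how one training example for the decoder might be laid out under teacher forcing. The special-token ids (`PAD`, `START`, `END`) and the helper name are hypothetical; only the offset-by-one structure and the padding weights come from the description above.

```python
PAD, START, END = 0, 1, 2  # hypothetical special-token ids


def make_decoder_example(target_ids, max_len):
    """Build decoder inputs, labels, and loss weights for one target sentence."""
    full = [START] + list(target_ids) + [END]
    inputs = full[:-1]  # what the decoder sees: the true previous tokens
    labels = full[1:]   # what it must predict: the same tokens, one step ahead
    pad = max_len - len(inputs)
    if pad < 0:
        raise ValueError("target sentence is longer than max_len")
    inputs = inputs + [PAD] * pad
    labels = labels + [PAD] * pad
    # Weight 0.0 removes padded positions from the loss and metrics.
    weights = [0.0 if label == PAD else 1.0 for label in labels]
    return inputs, labels, weights


inputs, labels, weights = make_decoder_example([7, 8, 9], max_len=6)
print(inputs)   # → [1, 7, 8, 9, 0, 0]
print(labels)   # → [7, 8, 9, 2, 0, 0]
print(weights)  # → [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

At inference time there are no labels to shift: the decoder starts from the start token alone and its own predictions are appended to the input one at a time.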
What is attention, especially dot-product attention, and what are queries, keys, and values?
Attention scores how relevant each source element is to a target element and takes a softmax-weighted sum of source representations. In dot-product attention, scores are computed as q·k between “queries” (from the target) and “keys” (from the source), and the weighted sum is taken over “values” (often the same as keys). This lets the model pull information from any position in the sequence based on context.
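
The computation for a single query can be written out in a few lines of plain Python. This is a sketch with made-up two-dimensional vectors, not the chapter's implementation; the helper names are assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Dot-product attention for one query over a list of keys and values."""
    scores = [dot(query, k) for k in keys]  # one relevance score per key
    weights = softmax(scores)
    dim = len(values[0])
    # Softmax-weighted sum of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([2.0, 0.0], keys, values)
print(weights)
print(output)
```

Here the first and third keys have the same dot product with the query, so they receive equal weight, while the second key, which is orthogonal to the query, contributes the least.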
Why use multi-head attention and why scale the dot product?
Scaling by 1/√d stabilizes gradients because raw dot products grow with vector dimension. Multiple heads learn different alignment patterns (e.g., syntax vs. entities) in parallel, avoiding “washing out” when combining many tokens with a single softmax; head outputs are concatenated and projected to form a richer representation.
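
Both ideas can be sketched together. The example below is a simplification under stated assumptions: heads are formed by slicing each vector into equal parts, and the learned query/key/value and output projections of a real multi-head layer are left out. The function names are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))  # the 1/sqrt(d) factor
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
        )
    return outputs


def split_heads(vectors, num_heads):
    """Slice each vector into num_heads equal parts; one list per head."""
    head_dim = len(vectors[0]) // num_heads
    return [
        [v[h * head_dim : (h + 1) * head_dim] for v in vectors]
        for h in range(num_heads)
    ]


def multi_head_attention(queries, keys, values, num_heads):
    # A real layer applies learned projections before and after this step;
    # they are omitted here to keep the sketch short.
    head_outputs = [
        scaled_attention(q, k, v)
        for q, k, v in zip(
            split_heads(queries, num_heads),
            split_heads(keys, num_heads),
            split_heads(values, num_heads),
        )
    ]
    # Concatenate the per-head outputs back into one vector per query.
    return [
        [x for head in head_outputs for x in head[t]]
        for t in range(len(queries))
    ]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(tokens, tokens, tokens, num_heads=2)
print(out)
```

Each head runs its own softmax over its own slice, so a strong match in one head does not drown out a different pattern picked up by another.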
How do Transformer encoder and decoder blocks work, and what masks are needed?
The encoder stacks self-attention and a position-wise feed-forward network, each with residual connections and LayerNormalization. The decoder has two attention layers: self-attention (with a causal mask so each position can only attend to earlier positions) and cross-attention over the encoder outputs (with an attention mask to ignore source padding). LayerNormalization is used instead of BatchNormalization for sequence data.
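
A causal mask is easy to build by hand. The sketch below assumes the common convention that a position may attend to itself as well as to earlier positions; the helper names are hypothetical and the scores are made up.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; masked positions get weight 0.

    Assumes at least one position is allowed.
    """
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
for row in mask:
    print(row)
# → [True, False, False]
# → [True, True, False]
# → [True, True, True]

# Attention weights for position 1: the "future" position 2 is ignored
# even though it has the highest raw score.
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])
print(weights)
```

A padding mask works the same way, except that the disallowed columns are the padded source positions rather than the future ones.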
Why do Transformers need positional embeddings, and which kind were used?
Attention alone is permutation-invariant; without positions, the model is blind to word order. The chapter uses learned positional embeddings added to token embeddings (one vector per position up to a max length). Adding them boosts translation accuracy and makes the Transformer truly sequence-aware.
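
A minimal sketch of the idea, assuming two lookup tables: one indexed by token id and one indexed by position. In a trained model both tables are learned; here they are filled with random numbers as a stand-in, and all sizes are hypothetical.

```python
import random

random.seed(0)
vocab_size, max_length, dim = 10, 6, 4  # hypothetical sizes


def make_table(rows, cols):
    """Random stand-in for a learned embedding table."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


token_table = make_table(vocab_size, dim)     # one vector per token id
position_table = make_table(max_length, dim)  # one vector per position


def embed(token_ids):
    """Add the position vector to the token vector at every step."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


embedded = embed([3, 5, 3])
# Token 3 appears at positions 0 and 2; its two embeddings now differ,
# which gives attention a way to tell the occurrences apart.
print(embedded[0])
print(embedded[2])
```

Because the position table has one row per position, sequences longer than `max_length` cannot be embedded with this scheme.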
How do Transformers compare to RNNs for sequence modeling?
Transformers handle long-range dependencies better and train faster on accelerators because they avoid recurrent loops and compute attention in parallel. They also scale well with data and model size. RNN seq2seq models compress the source into a single state and struggle with very long sequences; attention-based Transformers overcome these limits.
How do you use a pretrained Transformer (e.g., BERT/RoBERTa) for classification?
Load a matching tokenizer and backbone (e.g., via KerasHub from_preset). Tokenize and pack sequences with the expected special tokens and padding mask. Feed token IDs and padding mask to the backbone to get contextual embeddings, pool (often using the first token’s representation), attach a small classification head (Dense layers), and fine-tune with a low learning rate. RoBERTa is pretrained with masked language modeling and uses a subword tokenizer and position embeddings.
