Table of contents

1 What is deep learning?

1.1 Artificial intelligence, machine learning, and deep learning

1.2 Artificial intelligence

1.3 Machine learning

1.4 Learning rules and representations from data

1.5 The “deep” in “deep learning”

1.6 Understanding how deep learning works, in three figures

1.7 What makes deep learning different

1.8 The age of generative AI

1.9 What deep learning has achieved so far

1.10 Beware of the short-term hype

1.11 Summer can turn to winter

1.12 The promise of AI

2 The mathematical building blocks of neural networks

2.1 A first look at a neural network

2.2 Data representations for neural networks

2.2.1 Scalars (rank-0 tensors)

2.2.2 Vectors (rank-1 tensors)

2.2.3 Matrices (rank-2 tensors)

2.2.4 Rank-3 tensors and higher-rank tensors

2.2.5 Key attributes

2.2.6 Manipulating tensors in NumPy

2.2.7 The notion of data batches

2.2.8 Real-world examples of data tensors

2.3 The gears of neural networks: Tensor operations

2.3.1 Element-wise operations

2.3.2 Broadcasting

2.3.3 Tensor product

2.3.4 Tensor reshaping

2.3.5 Geometric interpretation of tensor operations

2.3.6 A geometric interpretation of deep learning

2.4 The engine of neural networks: Gradient-based optimization

2.4.1 What’s a derivative?

2.4.2 Derivative of a tensor operation: The gradient

2.4.3 Stochastic gradient descent

2.4.4 Chaining derivatives: The Backpropagation algorithm

2.5 Looking back at our first example

2.5.1 Reimplementing our first example from scratch

2.5.2 Running one training step

2.5.3 The full training loop

2.5.4 Evaluating the model

2.6 Summary

3 Introduction to TensorFlow, PyTorch, JAX, and Keras

3.1 A brief history of deep learning frameworks

3.2 How these frameworks relate to each other

3.3 Introduction to TensorFlow

3.3.1 First steps with TensorFlow

3.3.2 An end-to-end example: A linear classifier in pure TensorFlow

3.3.3 What makes the TensorFlow approach unique

3.4 Introduction to PyTorch

3.4.1 First steps with PyTorch

3.4.2 An end-to-end example: A linear classifier in pure PyTorch

3.4.3 What makes the PyTorch approach unique

3.5 Introduction to JAX

3.5.1 First steps with JAX

3.5.2 Tensors in JAX

3.5.3 Random number generation in JAX

3.5.4 An end-to-end example: A linear classifier in pure JAX

3.5.5 What makes the JAX approach unique

3.6 Introduction to Keras

3.6.1 First steps with Keras

3.6.2 Layers: The building blocks of deep learning

3.6.3 From layers to models

3.6.4 The “compile” step: Configuring the learning process

3.6.5 Picking a loss function

3.6.6 Understanding the fit method

3.6.7 Monitoring loss and metrics on validation data

3.6.8 Inference: Using a model after training

4 Classification and regression

4.1 Classifying movie reviews: A binary classification example

4.1.1 The IMDb dataset

4.1.2 Preparing the data

4.1.3 Building your model

4.1.4 Validating your approach

4.1.5 Using a trained model to generate predictions on new data

4.1.6 Further experiments

4.1.7 Wrapping up

4.2 Classifying newswires: A multiclass classification example

4.2.1 The Reuters dataset

4.2.2 Preparing the data

4.2.3 Building your model

4.2.4 Validating your approach

4.2.5 Generating predictions on new data

4.2.6 A different way to handle the labels and the loss

4.2.7 The importance of having sufficiently large intermediate layers

4.2.8 Further experiments

4.2.9 Wrapping up

4.3 Predicting house prices: A regression example

4.3.1 The California Housing Price dataset

4.3.2 Preparing the data

4.3.3 Building your model

4.3.4 Validating your approach using K-fold validation

4.3.5 Generating predictions on new data

4.3.6 Wrapping up

5 Fundamentals of machine learning

5.1 Generalization: The goal of machine learning

5.1.1 Underfitting and overfitting

5.1.2 The nature of generalization in deep learning

5.2 Evaluating machine-learning models

5.2.1 Training, validation, and test sets

5.2.2 Beating a common-sense baseline

5.2.3 Things to keep in mind about model evaluation

5.3 Improving model fit

5.3.1 Tuning key gradient descent parameters

5.3.2 Using better architecture priors

5.3.3 Increasing model capacity

5.4 Improving generalization

5.4.1 Dataset curation

5.4.2 Feature engineering

5.4.3 Using early stopping

5.4.4 Regularizing your model

6 The universal workflow of machine learning

6.1 Defining the task

6.1.1 Framing the problem

6.1.2 Collecting a dataset

6.1.3 Understanding your data

6.1.4 Choosing a measure of success

6.2 Developing a model

6.2.1 Preparing the data

6.2.2 Choosing an evaluation protocol

6.2.3 Beating a baseline

6.2.4 Scaling up: Developing a model that overfits

6.2.5 Regularizing and tuning your model

6.3 Deploying your model

6.3.1 Explaining your work to stakeholders and setting expectations

6.3.2 Shipping an inference model

6.3.3 Monitoring your model in the wild

6.3.4 Maintaining your model

7 A deep dive on Keras

7.1 A spectrum of workflows

7.2 Different ways to build Keras models

7.2.1 The Sequential model

7.2.2 The Functional API

7.2.3 Subclassing the Model class

7.2.4 Mixing and matching different components

7.2.5 Remember: Use the right tool for the job

7.3 Using built-in training and evaluation loops

7.3.1 Writing your own metrics

7.3.2 Using callbacks

7.3.3 Writing your own callbacks

7.3.4 Monitoring and visualization with TensorBoard

7.4 Writing your own training and evaluation loops

7.4.1 Training vs. inference

7.4.2 Writing custom training step functions

7.4.3 Low-level usage of metrics

7.4.4 Using fit() with a custom training loop

7.4.5 Handling metrics in a custom train_step()

8 Image classification

8.1 Introduction to ConvNets

8.1.1 The convolution operation

8.1.2 The max-pooling operation

8.2 Training a ConvNet from scratch on a small dataset

8.2.1 The relevance of deep learning for small-data problems

8.2.2 Downloading the data

8.2.3 Building your model

8.2.4 Data preprocessing

8.2.5 Using data augmentation

8.3 Using a pretrained model

8.3.1 Feature extraction with a pretrained model

8.3.2 Fine-tuning a pretrained model

9 ConvNet architecture patterns

9.1 Modularity, hierarchy, and reuse

9.2 Residual connections

9.3 Batch normalization

9.4 Depthwise separable convolutions

9.5 Putting it together: A mini Xception-like model

9.6 Beyond convolution: Vision Transformers

10 Interpreting what ConvNets learn

10.1 Visualizing intermediate activations

10.2 Visualizing ConvNet filters

10.2.1 Gradient ascent in TensorFlow

10.2.2 Gradient ascent in PyTorch

10.2.3 Gradient ascent in JAX

10.2.4 The filter visualization loop

10.3 Visualizing heatmaps of class activation

10.3.1 Getting the gradient of the top class: TensorFlow version

10.3.2 Getting the gradient of the top class: PyTorch version

10.3.3 Getting the gradient of the top class: JAX version

10.3.4 Displaying the class activation heatmap

10.4 Visualizing the latent space of a ConvNet

11 Image segmentation

11.1 Computer vision tasks

11.1.1 Types of image segmentation

11.2 Training a segmentation model from scratch

11.2.1 Downloading a segmentation dataset

11.2.2 Building and training the segmentation model

11.3 Using a pretrained segmentation model

11.3.1 Downloading the Segment Anything Model

11.3.2 How Segment Anything works

11.3.3 Preparing a test image

11.3.4 Prompting the model with a target point

11.3.5 Prompting the model with a target box

12 Object detection

12.1 Single-stage vs. two-stage object detectors

12.1.1 Two-stage R-CNN detectors

12.1.2 Single-stage detectors

12.2 Training a YOLO model from scratch

12.2.1 Downloading the COCO dataset

12.2.2 Creating a YOLO model

12.2.3 Readying the COCO data for the YOLO model

12.2.4 Training the YOLO model

12.3 Using a pretrained RetinaNet detector

13 Timeseries forecasting

13.1 Different kinds of timeseries tasks

13.2 A temperature forecasting example

13.2.1 Preparing the data

13.2.2 A commonsense, non-machine-learning baseline

13.2.3 Let’s try a basic machine learning model

13.2.4 Let’s try a 1D convolutional model

13.3 Recurrent neural networks

13.3.1 Understanding recurrent neural networks

13.3.2 A recurrent layer in Keras

13.3.3 Getting the most out of recurrent neural networks

13.3.4 Using recurrent dropout to fight overfitting

13.3.5 Stacking recurrent layers

13.3.6 Using bidirectional RNNs

13.4 Going even further

14 Text classification

14.1 A brief history of natural language processing

14.2 Preparing text data

14.2.1 Character and word tokenization

14.2.2 Subword tokenization

14.3 Sets vs. sequences

14.3.1 Loading the IMDb classification dataset

14.4 Set models

14.4.1 Training a bag-of-words model

14.4.2 Training a bigram model

14.5 Sequence models

14.5.1 Training a recurrent model

14.5.2 Understanding word embeddings

14.5.3 Using a word embedding

14.5.4 Pretraining a word embedding

14.5.5 Using the pretrained embedding for classification

15 Language models and the Transformer

15.1 The language model

15.1.1 Training a Shakespeare language model

15.1.2 Generating Shakespeare

15.2 Sequence-to-sequence learning

15.2.1 English-to-Spanish translation

15.2.2 Sequence-to-sequence learning with RNNs

15.3 The Transformer architecture

15.3.1 Dot-product attention

15.3.2 Transformer encoder block

15.3.3 Transformer decoder block

15.3.4 Sequence-to-sequence learning with a Transformer

15.3.5 Embedding positional information

15.4 Classification with a pretrained Transformer

15.4.1 Pretraining a Transformer encoder

15.4.2 Loading a pretrained Transformer

15.4.3 Preprocessing IMDb movie reviews

15.4.4 Fine-tuning a pretrained Transformer

15.5 What makes the Transformer effective?

16 Text generation

16.1 A brief history of sequence generation

16.2 Training a mini-GPT

16.2.1 Building the model

16.2.2 Pretraining the model

16.2.3 Generative decoding

16.2.4 Sampling strategies

16.3 Using a pretrained LLM

16.3.1 Text generation with the Gemma model

16.3.2 Instruction fine-tuning

16.3.3 Low-Rank Adaptation (LoRA)

16.4 Going further with LLMs

16.4.1 Reinforcement Learning with Human Feedback (RLHF)

16.4.2 Multimodal LLMs

16.4.3 Retrieval Augmented Generation (RAG)

16.4.4 “Reasoning” models

16.5 Where are LLMs heading next?

17 Image generation

17.1 Deep learning for image generation

17.1.1 Sampling from latent spaces of images

17.1.2 Variational autoencoders

17.1.3 Implementing a VAE with Keras

17.2 Diffusion models

17.2.1 The Oxford Flowers dataset

17.2.2 A U-Net denoising autoencoder

17.2.3 The concepts of diffusion time and diffusion schedule

17.2.4 The training process

17.2.5 The generation process

17.2.6 Visualizing results with a custom callback

17.2.7 It’s go time!

17.3 Text-to-image models

17.3.1 Exploring the latent space of a text-to-image model

18 Best practices for the real world

18.1 Getting the most out of your models

18.1.1 Hyperparameter optimization

18.1.2 Model ensembling

18.2 Scaling up model training with multiple devices

18.2.1 Multi-GPU training

18.2.2 Distributed training in practice

18.2.3 TPU training

18.3 Speeding up training and inference with lower-precision computation

18.3.1 Understanding floating-point precision

18.3.2 Float16 inference

18.3.3 Mixed-precision training

18.3.4 Using loss scaling with mixed precision

18.3.5 Beyond mixed precision: float8 training

18.3.6 Faster inference with quantization

19 The future of AI

19.1 The limitations of deep learning

19.1.1 Deep learning models struggle to adapt to novelty

19.1.2 Deep learning models are highly sensitive to phrasing and other distractors

19.1.3 Deep learning models struggle to learn generalizable programs

19.1.4 The risk of anthropomorphizing machine-learning models

19.2 Scale isn’t all you need

19.2.1 Automatons vs. intelligent agents

19.2.2 Local generalization vs. extreme generalization

19.2.3 The purpose of intelligence

19.2.4 Climbing the spectrum of generalization

19.3 How to build intelligence

19.3.1 The kaleidoscope hypothesis

19.3.2 The essence of intelligence: Abstraction acquisition and recombination

19.3.3 The importance of setting the right target

19.3.4 A new target: On-the-fly adaptation

19.3.5 ARC Prize

19.3.6 The test-time adaptation era

19.3.7 ARC-AGI 2

19.4 The missing ingredients: Search and symbols

19.4.1 The two poles of abstraction

19.4.2 Cognition as a combination of both kinds of abstraction

19.4.3 Why deep learning isn’t a complete answer to abstraction generation

19.4.4 An alternative approach to AI: Program synthesis

19.4.5 Blending deep learning and program synthesis

19.4.6 Modular component recombination and lifelong learning

19.4.7 The long-term vision

20 Conclusions

20.1 Key concepts in review

20.1.1 Various approaches to artificial intelligence

20.1.2 What makes deep learning special within the field of machine learning

20.1.3 How to think about deep learning

20.1.4 Key enabling technologies

20.1.5 The universal machine learning workflow

20.1.6 Key network architectures

20.2 Limitations of deep learning

20.3 What might lie ahead

20.4 Staying up to date in a fast-moving field

20.4.1 Practice on real-world problems using Kaggle

20.4.2 Read about the latest developments on arXiv

20.4.3 Explore the Keras ecosystem

20.5 Final words

Overview

15 Language models and the Transformer

This chapter moves from basic text preprocessing to models that generate and transform language. It begins with the language modeling paradigm—predicting the next token given prior tokens—and shows how an autoregressive loop turns next-token predictions into open-ended text. Building on this, it frames machine translation as sequence-to-sequence learning with an encoder-decoder design and a decoding loop that feeds previously generated tokens back into the model. Along the way, it highlights the limitations of RNN-based approaches for long-range dependencies and fixed-size state bottlenecks, motivating a shift toward attention-based architectures.

The core of the chapter is the Transformer, which replaces recurrence with attention. It introduces dot-product attention with the query–key–value formulation, softmax weighting, scaling, and multi-head parallelism to capture diverse relationships. Self-attention enables tokens to contextualize one another; residual connections, layer normalization, and two-layer feed-forward blocks provide depth and nonlinearity. A practical caveat is that attention is order-agnostic, so positional embeddings are added to token embeddings to encode sequence order. Implemented as stacked encoder and decoder blocks—with causal masking in the decoder and cross-attention to the encoder—the Transformer achieves better translation quality than a GRU baseline while training faster on accelerators due to parallelism.

The chapter then demonstrates the modern workflow of leveraging large pretrained Transformers (e.g., BERT/RoBERTa) trained with masked language modeling and subword tokenization, and fine-tuning them for downstream tasks such as IMDb sentiment classification, achieving higher accuracy with minimal task-specific training. It closes by explaining why Transformers work so well: attention iteratively shapes semantically continuous and interpolative embedding spaces, echoing word2vec’s principles but at far greater scale and expressivity—storing not just facts but “vector programs” that can be recombined at inference time. This power comes with trade-offs (data hunger, potential hallucinations), and the field continues to evolve with improvements to attention, normalization, and positional encoding, as well as alternatives for very long sequences.

Sequence-to-sequence learning: the source sequence is processed by the encoder and is then sent to the decoder. The decoder looks at the target sequence so far and predicts the target sequence offset by one step in the future. During inference, we generate one target token at a time and feed it back into the decoder.

A sequence-to-sequence RNN: an RNN encoder is used to produce a vector that encodes the entire source sequence, which is used as the initial state for an RNN decoder.

The general concept of “attention” in deep learning: input features get assigned “attention scores,” which can be used to inform the next representation of the input.

Attention assigns a relevance score to each vector in a source for each vector in a target sequence.

When both target and source are sequences, attention scores are a 2D matrix. Each row shows the attention scores for the word we are trying to predict, in green.

Retrieving images from a database: the query is compared to a set of keys, and the match scores are used to rank values (images).

Multi-headed attention allows each target word to attend to different parts of the source sequence in separate partitions of the eventual output vector.

A visual representation of the computations for both TransformerEncoder and TransformerDecoder blocks.

Chapter summary

- A language model is a model that learns a specific probability distribution: p(token|past tokens).
  - Language models have broad applications, but the most important is that you can generate text by calling them in a loop, where the output token at one time step becomes the input token in the next.
  - A masked language model learns a related probability distribution, p(tokens|surrounding tokens), and can be helpful for classifying text and individual tokens.
  - A sequence-to-sequence language model learns to predict the next token given both past tokens in a target sequence and an entirely separate, fixed, source sequence. Sequence-to-sequence models are useful for problems like translation and question answering.
  - A sequence-to-sequence model usually has two separate components. An encoder computes a representation of the source sequence, and a decoder takes this representation as input and predicts the next token in a target sequence based on past tokens.
- Attention is a mechanism that allows a model to pull information from anywhere in a sequence selectively based on the context of the token currently being processed.
  - Attention avoids the problems RNNs have with long-range dependencies in text.
  - Attention works by taking the dot-product of two vectors to compute an attention score. Vectors near each other in an embedding space will be summed together in the attention mechanism.
- The Transformer is a sequence modeling architecture that uses attention as the only mechanism to pass information across a sequence.
  - The Transformer works by stacking blocks of alternating attention and two-layer feed-forward networks.
  - The Transformer can scale to many parameters and lots of training data while still improving accuracy in the language modeling problem.
  - Unlike RNNs, the Transformer involves no sequence-length loops at training time, making the model much easier to train in parallel across many machines.
  - A Transformer encoder uses bidirectional attention to build a rich representation of a sequence.
  - A Transformer decoder uses causal attention to predict the next word in a language model setup.

FAQ

What is a language model, and why predict one token at a time?

A language model estimates p(next token | past tokens). Predicting one token at a time keeps the output space tractable: with a 20,000-word vocabulary the model outputs 20,000 probabilities per step, and by repeating this step we can generate long sequences. Directly classifying whole sequences is intractable because the number of possible sequences grows exponentially with length.
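To make the arithmetic concrete, here is a small sketch in plain Python. The vocabulary size is the one quoted in the answer above; the helper name `num_possible_sequences` is made up for illustration and is not from the chapter.

```python
VOCAB_SIZE = 20_000  # vocabulary size quoted in the answer above


def num_possible_sequences(length: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Count the distinct token sequences of exactly `length` tokens."""
    return vocab_size ** length


# A single next-token prediction always has VOCAB_SIZE possible outcomes,
# no matter how long the text generated so far is.
outputs_per_step = VOCAB_SIZE

# Treating each whole sequence as one class grows exponentially with length.
for length in (1, 2, 5, 10):
    print(f"length {length}: {num_possible_sequences(length):,} possible sequences")
```

Even at length 2, a whole-sequence classifier would need 400 million output classes, while the per-step model keeps 20,000 outputs at every step.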

How do you generate text from a trained model, and why can’t a bidirectional RNN be used for it?

Generation is autoregressive: feed a prompt, get a distribution over the next token, select one (e.g., argmax), append it to the input, and repeat while carrying the RNN/Transformer state. A bidirectional RNN “cheats” during training by using future tokens to predict the present one, which breaks causal generation where the future is unknown.
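The loop described above can be sketched in a few lines. This is only an illustration: `NEXT_TOKEN_PROBS` and `predict_next` are hypothetical stand-ins for a trained model, and the toy "model" looks only at the last token rather than the full history.

```python
# Hypothetical toy "model": maps the last token to a distribution over next tokens.
NEXT_TOKEN_PROBS = {
    "to": {"be": 0.7, "not": 0.2, "or": 0.1},
    "be": {"or": 0.6, "to": 0.3, "<end>": 0.1},
    "or": {"not": 0.8, "to": 0.2},
    "not": {"to": 0.9, "<end>": 0.1},
}


def predict_next(tokens):
    """Stand-in for a trained language model: returns p(next token | past tokens)."""
    return NEXT_TOKEN_PROBS.get(tokens[-1], {"<end>": 1.0})


def generate(prompt, max_new_tokens=5):
    """Greedy autoregressive decoding: predict, pick the argmax, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        next_token = max(probs, key=probs.get)  # argmax over the distribution
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


generated = generate(["to"])
print(" ".join(generated))  # to be or not to be
```

Picking the argmax is only one selection rule; sampling from `probs` instead would give more varied text at the cost of determinism.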

How was the Shakespeare character-level model built and trained?

The chapter uses a character-level tokenizer (around 67 characters), splits the text into fixed-length sequences, and trains a GRU-based model consisting of an Embedding layer, a GRU(return_sequences=True) layer, Dropout, and a Dense softmax over the vocabulary. The model is trained with sparse categorical crossentropy across all time steps, reaches about 70% next-character accuracy after ~20 epochs, and can generate Shakespeare-like text from a prompt.

What is sequence-to-sequence (seq2seq) learning for translation, and how do training and inference differ?

Seq2seq uses an encoder to summarize the source sentence and a decoder (trained as a language model) to generate the target, conditioning on previous target tokens and the encoder output. During training (“teacher forcing”), the decoder sees the true previous tokens; during inference, it must generate tokens one by one from scratch. Padding positions are masked (via sample weights) so they don’t affect loss/metrics.
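As a rough sketch of how teacher-forcing data might be prepared, the function below builds the decoder input, the labels shifted one step into the future, and per-position sample weights that zero out padding. The token ids (`PAD`, `START`, `END`) and the helper name `make_decoder_example` are assumptions for illustration, not the chapter's code.

```python
PAD, START, END = 0, 1, 2  # hypothetical special-token ids


def make_decoder_example(target_ids, max_len):
    """Build (decoder_input, labels, sample_weights) for one target sequence.

    `target_ids` is expected to include the start and end tokens.
    """
    if len(target_ids) > max_len + 1:
        raise ValueError("target sequence is too long for max_len")
    padded = target_ids + [PAD] * (max_len + 1 - len(target_ids))
    decoder_input = padded[:-1]  # true previous tokens, as seen during training
    labels = padded[1:]          # the same sequence offset by one step
    # Weight 0 on padding positions so they do not affect the loss or metrics.
    sample_weights = [0 if token == PAD else 1 for token in labels]
    return decoder_input, labels, sample_weights


decoder_input, labels, sample_weights = make_decoder_example([START, 5, 6, 7, END], max_len=6)
print(decoder_input)   # [1, 5, 6, 7, 2, 0]
print(labels)          # [5, 6, 7, 2, 0, 0]
print(sample_weights)  # [1, 1, 1, 1, 0, 0]
```

At inference time there are no true previous tokens to feed in, so the decoder input starts with the start token alone and grows by one generated token per step.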

What is attention, especially dot-product attention, and what are queries, keys, and values?

Attention scores how relevant each source element is to a target element and takes a softmax-weighted sum of source representations. In dot-product attention, scores are computed as q·k between “queries” (from the target) and “keys” (from the source), and the weighted sum is taken over “values” (often the same as keys). This lets the model pull information from any position in the sequence based on context.
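A minimal NumPy sketch of unscaled dot-product attention follows. The function names and the tiny hand-picked inputs are illustrative assumptions; a real layer would also apply learned projections to produce the queries, keys, and values.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dot_product_attention(query, key, value):
    """query: (target_len, dim), key: (source_len, dim), value: (source_len, value_dim)."""
    scores = query @ key.T              # q.k for every target/source pair
    weights = softmax(scores, axis=-1)  # each row sums to 1 over the source
    return weights @ value, weights     # softmax-weighted sum of the values


# Three source positions with well-separated keys and simple values.
key = np.eye(3) * 10.0
value = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# The first query matches key 0, the second matches key 2.
query = np.array([[10.0, 0.0, 0.0], [0.0, 0.0, 10.0]])

output, weights = dot_product_attention(query, key, value)
print(np.round(weights, 3))
print(np.round(output, 3))
```

Because each query lines up with exactly one key here, the weights are nearly one-hot and each output row is almost a copy of the matching value row. With less clear-cut scores, the output would be a blend of several values.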

Why use multi-head attention and why scale the dot product?

Scaling by 1/√d stabilizes gradients because raw dot products grow with vector dimension. Multiple heads learn different alignment patterns (e.g., syntax vs. entities) in parallel, avoiding “washing out” when combining many tokens with a single softmax; head outputs are concatenated and projected to form a richer representation.
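Both ideas can be sketched in NumPy under some simplifying assumptions: the projection matrices in `params` are random stand-ins for learned weights, heads are formed by slicing the feature dimension, and the names are illustrative rather than taken from the chapter.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(target, source, params, num_heads):
    """Scaled dot-product attention computed separately per head, then merged."""
    q = target @ params["w_q"]
    k = source @ params["w_k"]
    v = source @ params["w_v"]
    dim = q.shape[-1]
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        cols = slice(h * head_dim, (h + 1) * head_dim)
        scores = q[:, cols] @ k[:, cols].T / np.sqrt(head_dim)  # scale by 1/sqrt(d)
        head_outputs.append(softmax(scores, axis=-1) @ v[:, cols])
    concat = np.concatenate(head_outputs, axis=-1)  # concatenate the heads
    return concat @ params["w_o"]                   # final output projection


rng = np.random.default_rng(0)
dim, num_heads = 8, 2
# Random matrices standing in for learned projection weights (an assumption).
params = {name: rng.normal(size=(dim, dim)) for name in ("w_q", "w_k", "w_v", "w_o")}
target = rng.normal(size=(3, dim))
source = rng.normal(size=(5, dim))
output = multi_head_attention(target, source, params, num_heads)
print(output.shape)  # (3, 8)

# Why scale: dot products of unit-variance vectors have a spread of about sqrt(d).
d = 256
a = rng.normal(size=(5_000, d))
b = rng.normal(size=(5_000, d))
raw = (a * b).sum(axis=-1)
raw_std = raw.std()
scaled_std = (raw / np.sqrt(d)).std()
print(round(float(raw_std), 2), round(float(scaled_std), 2))
```

The last two numbers should come out close to 16 and 1: without scaling, scores at d = 256 are spread widely enough to push the softmax toward near one-hot outputs, where gradients become very small.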

How do Transformer encoder and decoder blocks work, and what masks are needed?

The encoder stacks self-attention and a position-wise feed-forward network, each with residual connections and LayerNormalization. The decoder uses two attention layers: self-attention over the target (with a causal mask so each position can only attend to itself and earlier positions) and cross-attention over the encoder outputs (with an attention mask to ignore source padding). LayerNormalization is used instead of BatchNormalization for sequence data.
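The two masks can be illustrated with a short NumPy sketch. The helper names and the convention that `True` means "may attend" are assumptions made for this example; uniform zero scores are used so the effect of each mask is easy to read off.

```python
import numpy as np


def causal_mask(seq_len):
    """True where row position i may attend to column position j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def masked_softmax(scores, mask):
    """Softmax over the last axis, giving zero weight to masked-out positions."""
    scores = np.where(mask, scores, -1e9)  # large negative score -> ~0 probability
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


# Decoder self-attention: 4 target positions, uniform scores, causal mask.
self_weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))
print(self_weights)

# Cross-attention: 4 target positions over 3 source positions, where the
# last source position is padding and must be ignored by every target.
source_is_real = np.array([True, True, False])
cross_mask = np.broadcast_to(source_is_real, (4, 3))
cross_weights = masked_softmax(np.zeros((4, 3)), cross_mask)
print(cross_weights)
```

In the self-attention weights, row i spreads its attention evenly over positions 0 through i and gives nothing to later positions. In the cross-attention weights, every row splits evenly between the two real source tokens.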

Why do Transformers need positional embeddings, and which kind were used?

Attention alone is permutation-invariant; without positions, the model is blind to word order. The chapter uses learned positional embeddings added to token embeddings (one vector per position up to a max length). Adding them boosts translation accuracy and makes the Transformer truly sequence-aware.
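A small NumPy sketch of adding positional embeddings to token embeddings is shown below. The two tables are random stand-ins for learned embedding weights, and the sizes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, MAX_LEN, DIM = 10, 6, 4

# Random tables standing in for learned embeddings (an assumption).
token_table = rng.normal(size=(VOCAB_SIZE, DIM))   # one vector per token id
position_table = rng.normal(size=(MAX_LEN, DIM))   # one vector per position


def embed(token_ids, use_positions=True):
    """Look up token vectors and optionally add the vector for each position."""
    token_ids = np.asarray(token_ids)
    x = token_table[token_ids]
    if use_positions:
        x = x + position_table[np.arange(len(token_ids))]
    return x


sentence = [3, 7, 1]
reversed_sentence = [1, 7, 3]  # same tokens, opposite order

# Without positions, reversing the sentence only reorders the same vectors.
same_without_positions = np.allclose(
    embed(sentence, use_positions=False)[::-1],
    embed(reversed_sentence, use_positions=False),
)
# With positions, each token's vector depends on where it appears.
same_with_positions = np.allclose(embed(sentence)[::-1], embed(reversed_sentence))
print(same_without_positions, same_with_positions)  # True False
```

Since attention only ever sees this collection of vectors, the position term is what lets it tell "3 7 1" apart from "1 7 3".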

How do Transformers compare to RNNs for sequence modeling?

Transformers handle long-range dependencies better and train faster on accelerators because they avoid recurrent loops and compute attention in parallel. They also scale well with data and model size. RNN seq2seq models compress the source into a single state and struggle with very long sequences; attention-based Transformers overcome these limits.

How do you use a pretrained Transformer (e.g., BERT/RoBERTa) for classification?

Load a matching tokenizer and backbone (e.g., via KerasHub from_preset). Tokenize and pack sequences with the expected special tokens and padding mask. Feed token IDs and padding mask to the backbone to get contextual embeddings, pool (often using the first token’s representation), attach a small classification head (Dense layers), and fine-tune with a low learning rate. RoBERTa is pretrained with masked language modeling and uses a subword tokenizer and position embeddings.
