Deep Learning with Python, Third Edition

François Chollet and Matthew Watson

September 2025
ISBN 9781633436589
648 pages

printed in color

available in Korean, Russian, Simplified Chinese

Table of contents

1 What is deep learning?

1.1 Artificial intelligence, machine learning, and deep learning

1.2 Artificial intelligence

1.3 Machine learning

1.4 Learning rules and representations from data

1.5 The “deep” in “deep learning”

1.6 Understanding how deep learning works, in three figures

1.7 What makes deep learning different

1.8 The age of generative AI

1.9 What deep learning has achieved so far

1.10 Beware of the short-term hype

1.11 Summer can turn to winter

1.12 The promise of AI

2 The mathematical building blocks of neural networks

2.1 A first look at a neural network

2.2 Data representations for neural networks

2.2.1 Scalars (rank-0 tensors)

2.2.2 Vectors (rank-1 tensors)

2.2.3 Matrices (rank-2 tensors)

2.2.4 Rank-3 tensors and higher-rank tensors

2.2.5 Key attributes

2.2.6 Manipulating tensors in NumPy

2.2.7 The notion of data batches

2.2.8 Real-world examples of data tensors

2.3 The gears of neural networks: Tensor operations

2.3.1 Element-wise operations

2.3.2 Broadcasting

2.3.3 Tensor product

2.3.4 Tensor reshaping

2.3.5 Geometric interpretation of tensor operations

2.3.6 A geometric interpretation of deep learning

2.4 The engine of neural networks: Gradient-based optimization

2.4.1 What’s a derivative?

2.4.2 Derivative of a tensor operation: The gradient

2.4.3 Stochastic gradient descent

2.4.4 Chaining derivatives: The Backpropagation algorithm

2.5 Looking back at our first example

2.5.1 Reimplementing our first example from scratch

2.5.2 Running one training step

2.5.3 The full training loop

2.5.4 Evaluating the model

2.6 Summary

3 Introduction to TensorFlow, PyTorch, JAX, and Keras

3.1 A brief history of deep learning frameworks

3.2 How these frameworks relate to each other

3.3 Introduction to TensorFlow

3.3.1 First steps with TensorFlow

3.3.2 An end-to-end example: A linear classifier in pure TensorFlow

3.3.3 What makes the TensorFlow approach unique

3.4 Introduction to PyTorch

3.4.1 First steps with PyTorch

3.4.2 An end-to-end example: A linear classifier in pure PyTorch

3.4.3 What makes the PyTorch approach unique

3.5 Introduction to JAX

3.5.1 First steps with JAX

3.5.2 Tensors in JAX

3.5.3 Random number generation in JAX

3.5.4 An end-to-end example: A linear classifier in pure JAX

3.5.5 What makes the JAX approach unique

3.6 Introduction to Keras

3.6.1 First steps with Keras

3.6.2 Layers: The building blocks of deep learning

3.6.3 From layers to models

3.6.4 The “compile” step: Configuring the learning process

3.6.5 Picking a loss function

3.6.6 Understanding the fit method

3.6.7 Monitoring loss and metrics on validation data

3.6.8 Inference: Using a model after training

4 Classification and regression

4.1 Classifying movie reviews: A binary classification example

4.1.1 The IMDb dataset

4.1.2 Preparing the data

4.1.3 Building your model

4.1.4 Validating your approach

4.1.5 Using a trained model to generate predictions on new data

4.1.6 Further experiments

4.1.7 Wrapping up

4.2 Classifying newswires: A multiclass classification example

4.2.1 The Reuters dataset

4.2.2 Preparing the data

4.2.3 Building your model

4.2.4 Validating your approach

4.2.5 Generating predictions on new data

4.2.6 A different way to handle the labels and the loss

4.2.7 The importance of having sufficiently large intermediate layers

4.2.8 Further experiments

4.2.9 Wrapping up

4.3 Predicting house prices: A regression example

4.3.1 The California Housing Price dataset

4.3.2 Preparing the data

4.3.3 Building your model

4.3.4 Validating your approach using K-fold validation

4.3.5 Generating predictions on new data

4.3.6 Wrapping up

5 Fundamentals of machine learning

5.1 Generalization: The goal of machine learning

5.1.1 Underfitting and overfitting

5.1.2 The nature of generalization in deep learning

5.2 Evaluating machine-learning models

5.2.1 Training, validation, and test sets

5.2.2 Beating a common-sense baseline

5.2.3 Things to keep in mind about model evaluation

5.3 Improving model fit

5.3.1 Tuning key gradient descent parameters

5.3.2 Using better architecture priors

5.3.3 Increasing model capacity

5.4 Improving generalization

5.4.1 Dataset curation

5.4.2 Feature engineering

5.4.3 Using early stopping

5.4.4 Regularizing your model

6 The universal workflow of machine learning

6.1 Defining the task

6.1.1 Framing the problem

6.1.2 Collecting a dataset

6.1.3 Understanding your data

6.1.4 Choosing a measure of success

6.2 Developing a model

6.2.1 Preparing the data

6.2.2 Choosing an evaluation protocol

6.2.3 Beating a baseline

6.2.4 Scaling up: Developing a model that overfits

6.2.5 Regularizing and tuning your model

6.3 Deploying your model

6.3.1 Explaining your work to stakeholders and setting expectations

6.3.2 Shipping an inference model

6.3.3 Monitoring your model in the wild

6.3.4 Maintaining your model

7 A deep dive on Keras

7.1 A spectrum of workflows

7.2 Different ways to build Keras models

7.2.1 The Sequential model

7.2.2 The Functional API

7.2.3 Subclassing the Model class

7.2.4 Mixing and matching different components

7.2.5 Remember: Use the right tool for the job

7.3 Using built-in training and evaluation loops

7.3.1 Writing your own metrics

7.3.2 Using callbacks

7.3.3 Writing your own callbacks

7.3.4 Monitoring and visualization with TensorBoard

7.4 Writing your own training and evaluation loops

7.4.1 Training vs. inference

7.4.2 Writing custom training step functions

7.4.3 Low-level usage of metrics

7.4.4 Using fit() with a custom training loop

7.4.5 Handling metrics in a custom train_step()

8 Image classification

8.1 Introduction to ConvNets

8.1.1 The convolution operation

8.1.2 The max-pooling operation

8.2 Training a ConvNet from scratch on a small dataset

8.2.1 The relevance of deep learning for small-data problems

8.2.2 Downloading the data

8.2.3 Building your model

8.2.4 Data preprocessing

8.2.5 Using data augmentation

8.3 Using a pretrained model

8.3.1 Feature extraction with a pretrained model

8.3.2 Fine-tuning a pretrained model

9 ConvNet architecture patterns

9.1 Modularity, hierarchy, and reuse

9.2 Residual connections

9.3 Batch normalization

9.4 Depthwise separable convolutions

9.5 Putting it together: A mini Xception-like model

9.6 Beyond convolution: Vision Transformers

10 Interpreting what ConvNets learn

10.1 Visualizing intermediate activations

10.2 Visualizing ConvNet filters

10.2.1 Gradient ascent in TensorFlow

10.2.2 Gradient ascent in PyTorch

10.2.3 Gradient ascent in JAX

10.2.4 The filter visualization loop

10.3 Visualizing heatmaps of class activation

10.3.1 Getting the gradient of the top class: TensorFlow version

10.3.2 Getting the gradient of the top class: PyTorch version

10.3.3 Getting the gradient of the top class: JAX version

10.3.4 Displaying the class activation heatmap

10.4 Visualizing the latent space of a ConvNet

11 Image segmentation

11.1 Computer vision tasks

11.1.1 Types of image segmentation

11.2 Training a segmentation model from scratch

11.2.1 Downloading a segmentation dataset

11.2.2 Building and training the segmentation model

11.3 Using a pretrained segmentation model

11.3.1 Downloading the Segment Anything Model

11.3.2 How Segment Anything works

11.3.3 Preparing a test image

11.3.4 Prompting the model with a target point

11.3.5 Prompting the model with a target box

12 Object detection

12.1 Single-stage vs. two-stage object detectors

12.1.1 Two-stage R-CNN detectors

12.1.2 Single-stage detectors

12.2 Training a YOLO model from scratch

12.2.1 Downloading the COCO dataset

12.2.2 Creating a YOLO model

12.2.3 Readying the COCO data for the YOLO model

12.2.4 Training the YOLO model

12.3 Using a pretrained RetinaNet detector

13 Timeseries forecasting

13.1 Different kinds of timeseries tasks

13.2 A temperature forecasting example

13.2.1 Preparing the data

13.2.2 A commonsense, non-machine-learning baseline

13.2.3 Let’s try a basic machine learning model

13.2.4 Let’s try a 1D convolutional model

13.3 Recurrent neural networks

13.3.1 Understanding recurrent neural networks

13.3.2 A recurrent layer in Keras

13.3.3 Getting the most out of recurrent neural networks

13.3.4 Using recurrent dropout to fight overfitting

13.3.5 Stacking recurrent layers

13.3.6 Using bidirectional RNNs

13.4 Going even further

14 Text classification

14.1 A brief history of natural language processing

14.2 Preparing text data

14.2.1 Character and word tokenization

14.2.2 Subword tokenization

14.3 Sets vs. sequences

14.3.1 Loading the IMDb classification dataset

14.4 Set models

14.4.1 Training a bag-of-words model

14.4.2 Training a bigram model

14.5 Sequence models

14.5.1 Training a recurrent model

14.5.2 Understanding word embeddings

14.5.3 Using a word embedding

14.5.4 Pretraining a word embedding

14.5.5 Using the pretrained embedding for classification

15 Language models and the Transformer

15.1 The language model

15.1.1 Training a Shakespeare language model

15.1.2 Generating Shakespeare

15.2 Sequence-to-sequence learning

15.2.1 English-to-Spanish translation

15.2.2 Sequence-to-sequence learning with RNNs

15.3 The Transformer architecture

15.3.1 Dot-product attention

15.3.2 Transformer encoder block

15.3.3 Transformer decoder block

15.3.4 Sequence-to-sequence learning with a Transformer

15.3.5 Embedding positional information

15.4 Classification with a pretrained Transformer

15.4.1 Pretraining a Transformer encoder

15.4.2 Loading a pretrained Transformer

15.4.3 Preprocessing IMDb movie reviews

15.4.4 Fine-tuning a pretrained Transformer

15.5 What makes the Transformer effective?

16 Text generation

16.1 A brief history of sequence generation

16.2 Training a mini-GPT

16.2.1 Building the model

16.2.2 Pretraining the model

16.2.3 Generative decoding

16.2.4 Sampling strategies

16.3 Using a pretrained LLM

16.3.1 Text generation with the Gemma model

16.3.2 Instruction fine-tuning

16.3.3 Low-Rank Adaptation (LoRA)

16.4 Going further with LLMs

16.4.1 Reinforcement Learning with Human Feedback (RLHF)

16.4.2 Multimodal LLMs

16.4.3 Retrieval Augmented Generation (RAG)

16.4.4 “Reasoning” models

16.5 Where are LLMs heading next?

17 Image generation

17.1 Deep learning for image generation

17.1.1 Sampling from latent spaces of images

17.1.2 Variational autoencoders

17.1.3 Implementing a VAE with Keras

17.2 Diffusion models

17.2.1 The Oxford Flowers dataset

17.2.2 A U-Net denoising autoencoder

17.2.3 The concepts of diffusion time and diffusion schedule

17.2.4 The training process

17.2.5 The generation process

17.2.6 Visualizing results with a custom callback

17.2.7 It’s go time!

17.3 Text-to-image models

17.3.1 Exploring the latent space of a text-to-image model

18 Best practices for the real world

18.1 Getting the most out of your models

18.1.1 Hyperparameter optimization

18.1.2 Model ensembling

18.2 Scaling up model training with multiple devices

18.2.1 Multi-GPU training

18.2.2 Distributed training in practice

18.2.3 TPU training

18.3 Speeding up training and inference with lower-precision computation

18.3.1 Understanding floating-point precision

18.3.2 Float16 inference

18.3.3 Mixed-precision training

18.3.4 Using loss scaling with mixed precision

18.3.5 Beyond mixed precision: float8 training

18.3.6 Faster inference with quantization

19 The future of AI

19.1 The limitations of deep learning

19.1.1 Deep learning models struggle to adapt to novelty

19.1.2 Deep learning models are highly sensitive to phrasing and other distractors

19.1.3 Deep learning models struggle to learn generalizable programs

19.1.4 The risk of anthropomorphizing machine-learning models

19.2 Scale isn’t all you need

19.2.1 Automatons vs. intelligent agents

19.2.2 Local generalization vs. extreme generalization

19.2.3 The purpose of intelligence

19.2.4 Climbing the spectrum of generalization

19.3 How to build intelligence

19.3.1 The kaleidoscope hypothesis

19.3.2 The essence of intelligence: Abstraction acquisition and recombination

19.3.3 The importance of setting the right target

19.3.4 A new target: On-the-fly adaptation

19.3.5 ARC Prize

19.3.6 The test-time adaptation era

19.3.7 ARC-AGI 2

19.4 The missing ingredients: Search and symbols

19.4.1 The two poles of abstraction

19.4.2 Cognition as a combination of both kinds of abstraction

19.4.3 Why deep learning isn’t a complete answer to abstraction generation

19.4.4 An alternative approach to AI: Program synthesis

19.4.5 Blending deep learning and program synthesis

19.4.6 Modular component recombination and lifelong learning

19.4.7 The long-term vision

20 Conclusions

20.1 Key concepts in review

20.1.1 Various approaches to artificial intelligence

20.1.2 What makes deep learning special within the field of machine learning

20.1.3 How to think about deep learning

20.1.4 Key enabling technologies

20.1.5 The universal machine learning workflow

20.1.6 Key network architectures

20.2 Limitations of deep learning

20.3 What might lie ahead

20.4 Staying up to date in a fast-moving field

20.4.1 Practice on real-world problems using Kaggle

20.4.2 Read about the latest developments on arXiv

20.4.3 Explore the Keras ecosystem

20.5 Final words

Overview

5 Fundamentals of machine learning

This chapter builds a practical and conceptual foundation for machine learning by centering on the tension between optimization and generalization. It explains how models learn to reduce training loss yet must be evaluated by how well they perform on unseen data, and it frames overfitting as the universal failure mode when optimization goes too far. The material emphasizes reliable evaluation and disciplined experimentation as the backbone of effective model development, setting expectations for how to read learning curves, recognize underfitting versus overfitting, and steer training toward robust, real-world performance.

It then explores why overfitting happens and what enables generalization. Overfitting is amplified by noisy or mislabeled examples, inherently ambiguous targets, and rare features that invite spurious correlations. The chapter introduces the manifold hypothesis: natural data occupies low-dimensional, structured subspaces within high-dimensional input spaces, which lets deep networks generalize primarily by interpolating between training samples along these learned manifolds. Because deep models are smooth, highly expressive function approximators, they can memorize; but with good data and gradual training they also learn manifold structure well enough to interpolate. Consequently, data quality and coverage are paramount—dense, informative sampling of the input space is the most powerful lever for generalization; when data is limited, regularization is used to constrain memorization.

On measurement and practice, the chapter formalizes evaluation protocols: separate training, validation, and test splits; avoid information leaks; and use hold-out, K-fold, or iterated K-fold validation when data is scarce. It recommends establishing common‑sense baselines and guarding against pitfalls such as unrepresentative splits, temporal leakage, and duplicate samples across splits. To improve model fit, start by getting optimization to work (tune learning rate and batch size), select architectures with the right inductive biases for the modality, and increase capacity until overfitting is possible. To improve generalization, prioritize dataset curation and feature engineering, apply early stopping, and regularize via capacity control, weight penalties (L1/L2), and dropout. The overarching message: measure carefully, fit confidently, and regularize deliberately so that interpolation on well-structured data yields reliable performance.

Canonical overfitting behavior

Some pretty weird MNIST training samples

Mislabeled MNIST training samples

Dealing with outliers: robust fit vs. overfitting

Robust fit vs. overfitting given an ambiguous area of the feature space

Effect of noise channels on validation accuracy

Different MNIST digits gradually morphing into one another, showing that the space of handwritten digits forms a “manifold”. This image was generated using code from Chapter 17.

Difference between linear interpolation and interpolation on the latent manifold. Every point on the latent manifold of digits is a valid digit, but the average of two digits usually isn’t.

Uncrumpling a complicated manifold of data

Going from a random model to an overfit model, and achieving a robust fit as an intermediate state

A dense sampling of the input space is necessary in order to learn a model capable of accurate generalization.

Simple hold-out validation split

Three-fold validation

Effect of insufficient model capacity on loss curves

Validation loss for a model with appropriate capacity

Effect of excessive model capacity on validation loss

Feature engineering for reading the time on a clock

Original model vs. smaller model on IMDb review classification

Original model vs. much larger model on IMDb review classification

Effect of L2 weight regularization on validation loss

Dropout applied to an activation matrix at training time, with rescaling happening during training. At test time, the activation matrix is unchanged.

Effect of dropout on validation loss

Chapter summary

The purpose of a machine learning model is to generalize: to perform accurately on never-seen-before inputs. It’s harder than it seems.
A deep neural network achieves generalization by learning a parametric model that can successfully interpolate between training samples – such a model can be said to have learned the latent manifold of the training data. This is why deep learning models can only make sense of inputs that are very close to what they’ve seen during training.
The fundamental problem in machine learning is the tension between optimization and generalization: to attain generalization, you must first achieve a good fit to the training data, but improving your model’s fit to the training data will inevitably start hurting generalization after a while. Every single deep learning best practice deals with managing this tension.
The ability of deep learning models to generalize comes from the fact that they manage to learn to approximate the latent manifold of their data, and can thus make sense of new inputs via interpolation.
It’s essential to be able to accurately evaluate the generalization power of your model while you’re developing it. You have at your disposal an array of evaluation methods, from simple hold-out validation, to K-fold cross-validation and iterated K-fold cross-validation with shuffling. Remember to always keep a completely separate test set for final model evaluation, since information leaks from your validation data to your model may have occurred.
When you start working on a model, your goal is first to achieve a model that has some generalization power and that can overfit. Best practices to do this include tuning your learning rate and batch size, leveraging better architecture priors, increasing model capacity, or simply training longer.
As your model starts overfitting, your goal switches to improving generalization through model regularization. You can reduce your model’s capacity, add dropout or weight regularization, and use early stopping. And naturally, a larger or better dataset is always the number one way to help a model generalize.

FAQ

What is the tension between optimization and generalization in machine learning?

Optimization is the process of minimizing training loss by fitting the model to the training data; generalization is how well the trained model performs on new, unseen data. If you optimize “too well,” the model starts to memorize specifics of the training set (overfitting), causing validation/test performance to degrade. The goal is good generalization, but you can only directly control optimization, so you must manage training to avoid overfitting.

How can I tell whether my model is underfitting or overfitting?

Early in training, both training and validation losses improve together; the model is underfit and still learning useful patterns. After a point, validation loss plateaus and then rises while training loss keeps dropping—this is overfitting. If you can’t get validation metrics to budge or you can’t make the model overfit at all, you’re likely underfitting due to insufficient capacity or poor priors.
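As a rough illustration of reading these curves, the sketch below finds the epoch where validation loss bottoms out and the epoch where a patience rule would stop training. The function name `best_epoch_with_patience`, the `patience` parameter, and the loss values are assumptions made up for this example; they are not taken from the book.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Epochs are 0-indexed. Training is treated as stopped once the
    validation loss has failed to improve for `patience` epochs.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Still generalizing better: remember this epoch.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: overfitting has set in.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up curves, for illustration only.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.17]
val_losses = [0.95, 0.70, 0.58, 0.55, 0.57, 0.61, 0.68]

# Training loss keeps dropping at every epoch...
train_still_improving = all(b < a for a, b in zip(train_losses, train_losses[1:]))
# ...while validation loss is lowest at epoch 3 and then rises.
best, stopped = best_epoch_with_patience(val_losses, patience=2)
```

In this toy run the best epoch is 3 and the patience rule stops at epoch 5, which matches the pattern described above: training loss still falls while validation loss has already turned upward.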

What data issues commonly cause overfitting?

Noisy or mislabeled samples, inherently ambiguous regions of the feature space, and rare features that induce spurious correlations all encourage memorization. For example, adding random noise features can reduce validation accuracy via spurious patterns the model learns. Mitigations include better data curation, label proofreading, and feature selection to remove uninformative inputs.

What is the manifold hypothesis and why does it matter for deep learning?

The manifold hypothesis states that natural data (images, speech, text, etc.) lies on low-dimensional, structured subspaces within the high-dimensional input space. Deep networks are smooth, differentiable mappings that can approximate these latent manifolds during training. This structure enables models to generalize by interpolating between nearby training points on the manifold.

How does interpolation explain generalization, and what are its limits?

Within a learned approximation of the data manifold, the model can infer new cases by relating them to nearby training samples—interpolating along the manifold (not simple linear averaging in pixel/input space). This yields local generalization. However, it doesn’t handle extreme novelty; humans rely on additional mechanisms (reasoning, abstraction, priors) beyond interpolation.
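A toy analogy may help here, under the assumption that the "manifold" is simply the unit circle in a 2D input space and the latent coordinate is the angle. This is only a geometric sketch, not a model of real image data: averaging two points in input space leaves the circle, while averaging their latent coordinates stays on it.

```python
import math


def on_circle(angle):
    """Map a latent coordinate (an angle) to a point on the unit circle."""
    return (math.cos(angle), math.sin(angle))


def norm(point):
    """Distance from the origin; exactly 1.0 for points on the circle."""
    return math.hypot(point[0], point[1])


angle_a, angle_b = 0.0, math.pi / 2
a, b = on_circle(angle_a), on_circle(angle_b)

# Linear interpolation in input space: the plain midpoint of the two points.
linear_mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

# Interpolation along the manifold: average the latent coordinates instead.
manifold_mid = on_circle((angle_a + angle_b) / 2)

linear_norm = norm(linear_mid)      # about 0.707, off the circle
manifold_norm = norm(manifold_mid)  # 1.0 up to rounding, still on the circle
```

The linear midpoint plays the role of the "average of two digits" that is usually not a valid digit, whereas the manifold midpoint remains a valid point of the data space.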

Why is training data quality and coverage paramount?

Deep learning is essentially curve fitting; it works best when the training set densely covers the input manifold, especially near decision boundaries. More and better data simplifies the manifold the model must learn and improves generalization. When more data isn’t available, you must constrain the model (regularization) or improve the information content of inputs (feature engineering).

How should I split my data, and why isn’t a train/test split enough?

Use three sets: train, validation, and test. You tune hyperparameters using validation performance, which leaks information into the model; the test set must remain untouched for a final, unbiased estimate. Beware pitfalls: ensure splits are representative, avoid temporal leakage in time series, and eliminate duplicates between splits.
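One possible way to build such a split is sketched below. The helper name `three_way_split`, its default fractions, and the sample data are hypothetical choices for this example. It removes exact duplicates before splitting so that the same sample cannot appear in two sets, and it shuffles the data, which is only appropriate when samples are not time-ordered.

```python
import random


def three_way_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Split hashable samples into (train, val, test) lists without overlap."""
    # Drop exact duplicates first; dict.fromkeys keeps the first occurrence
    # of each sample and preserves order.
    unique = list(dict.fromkeys(samples))

    # Shuffle so each split is representative of the whole dataset.
    # Assumption: the data has no time order. For time series, split
    # chronologically instead, so the test data comes after the training data.
    random.Random(seed).shuffle(unique)

    n_test = int(len(unique) * test_fraction)
    n_val = int(len(unique) * val_fraction)
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    return train, val, test


# 120 made-up samples, 20 of which are repeats of earlier ones.
data = [f"sample-{i % 100}" for i in range(120)]
train, val, test = three_way_split(data)
```

Hyperparameters would then be tuned against `val` only, with `test` evaluated once at the very end.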

Which evaluation protocols help when data is limited?

- Simple hold-out validation: fast but can be high-variance with small datasets.
- K-fold cross-validation: average performance across K folds for a more reliable estimate.
- Iterated K-fold with shuffling: repeat K-fold multiple times (more precise but computationally expensive).

Even with these, keep a distinct test set for the final check.
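The K-fold idea can be sketched in a few lines of plain Python. The names `k_fold_indices` and `cross_validate` are assumptions for this example, and `train_and_score` is a hypothetical callback standing in for "train a fresh model on the training indices and return its validation score".

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    if k < 2 or k > num_samples:
        raise ValueError("k must be between 2 and num_samples")
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder so every sample is validated once.
        end = num_samples if fold == k - 1 else start + fold_size
        val_idx = list(range(start, end))
        train_idx = list(range(0, start)) + list(range(end, num_samples))
        yield train_idx, val_idx


def cross_validate(num_samples, k, train_and_score):
    """Average the score returned by `train_and_score` across the k folds."""
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(num_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))

# Placeholder scorer for illustration only: it returns the validation
# fraction of each fold instead of training a real model.
mean_score = cross_validate(10, 3, lambda tr, va: len(va) / (len(tr) + len(va)))
```

Shuffling the sample order before each pass and repeating the whole procedure several times would give the iterated variant, at a proportionally higher compute cost.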

Training is stuck or unstable—what should I tune first?

Start with gradient descent settings: learning rate and batch size. Too-high learning rates overshoot and stall; too-low rates make progress look flat. Larger batches reduce gradient noise. If issues persist, review optimizer choice and initialization, and ensure the architecture matches the data modality (appropriate priors).
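The effect of the learning rate can be seen on a deliberately tiny problem. The sketch below runs plain gradient descent on f(x) = x², whose gradient is 2x; the function name `gradient_descent` and the three learning rates are assumptions chosen for illustration, and real loss surfaces are far less forgiving than this one.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final |x|."""
    x = start
    for _ in range(steps):
        grad = 2.0 * x            # derivative of x**2
        x -= learning_rate * grad
    return abs(x)


# Each step multiplies x by (1 - 2 * learning_rate), so:
too_low = gradient_descent(0.001)    # factor 0.998: barely moves from 1.0
reasonable = gradient_descent(0.4)   # factor 0.2: converges quickly to 0
too_high = gradient_descent(1.1)     # factor -1.2: overshoots and grows
```

After 20 steps the too-low rate is still above 0.9, the reasonable rate is essentially at the minimum, and the too-high rate ends up farther from the minimum than where it started.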

What are the most effective ways to improve generalization and reduce overfitting?

- Get more or better-curated data; perform feature selection/engineering.
- Use early stopping to capture the best validation epoch.
- Reduce capacity (fewer/lower-width layers) to limit memorization.
- Add weight regularization (L1/L2) for smaller models.
- Add dropout, especially for large models.

Choose architectures with priors that match your data (e.g., convnets for images, sequence models for text/time series).
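Two of the regularizers listed above are simple enough to sketch without a framework. The code below is a minimal, assumption-laden illustration rather than the Keras implementation: `dropout` zeroes a random subset of activations and rescales the survivors during training (leaving inputs unchanged at test time), and `l2_penalty` computes the extra loss term that an L2 weight penalty adds. The function names, the `rng` argument, and the numbers are made up for this example.

```python
import random


def dropout(activations, rate, training, rng):
    """Dropout with rescaling at training time; identity at test time."""
    if not training:
        return list(activations)
    keep_prob = 1.0 - rate
    # Each unit is dropped with probability `rate`; kept units are scaled up
    # by 1 / keep_prob so the expected sum of activations is unchanged.
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]


def l2_penalty(weights, coefficient):
    """Extra loss term: coefficient times the sum of squared weights."""
    return coefficient * sum(w * w for w in weights)


rng = random.Random(0)
activations = [0.3, 0.2, 1.5, 0.0, 0.6, 1.1, 0.9, 0.4]

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)

# 0.002 * (0.25 + 1.0 + 4.0) = 0.0105, added to the training loss.
penalty = l2_penalty([0.5, -1.0, 2.0], coefficient=0.002)
```

With a rate of 0.5, every training-time output is either zero or twice the original activation, while the test-time output equals the input.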
