Table of contents

Part 1. Introduction to generative AI

1 What is generative AI and why PyTorch?

1.1 Introducing generative AI and PyTorch

1.1.1 What is generative AI?

1.1.2 The Python programming language

1.1.3 Using PyTorch as our AI framework

1.2 GANs

1.2.1 A high-level overview of GANs

1.2.2 An illustrating example: Generating anime faces

1.2.3 Why should you care about GANs?

1.3 Transformers

1.3.1 The attention mechanism

1.3.2 The Transformer architecture

1.3.3 Multimodal Transformers and pretrained LLMs

1.4 Why build generative models from scratch?

Summary

2 Deep learning with PyTorch

2.1 Data types in PyTorch

2.1.1 Creating PyTorch tensors

2.1.2 Index and slice PyTorch tensors

2.1.3 PyTorch tensor shapes

2.1.4 Mathematical operations on PyTorch tensors

2.2 An end-to-end deep learning project with PyTorch

2.2.1 Deep learning in PyTorch: A high-level overview

2.2.2 Preprocessing data

2.3 Binary classification

2.3.1 Creating batches

2.3.2 Building and training a binary classification model

2.3.3 Testing the binary classification model

2.4 Multicategory classification

2.4.1 Validation set and early stopping

2.4.2 Building and training a multicategory classification model

Summary

3 Generative adversarial networks: Shape and number generation

3.1 Steps involved in training GANs

3.2 Preparing training data

3.2.1 A training dataset that forms an exponential growth curve

3.2.2 Preparing the training dataset

3.3 Creating GANs

3.3.1 The discriminator network

3.3.2 The generator network

3.3.3 Loss functions, optimizers, and early stopping

3.4 Training and using GANs for shape generation

3.4.1 The training of GANs

3.4.2 Saving and using the trained generator

3.5 Generating numbers with patterns

3.5.1 What are one-hot variables?

3.5.2 GANs to generate numbers with patterns

3.5.3 Training the GANs to generate numbers with patterns

3.5.4 Saving and using the trained model

Summary

Part 2. Image generation

4 Image generation with generative adversarial networks

4.1 GANs to generate grayscale images of clothing items

4.1.1 Training samples and the discriminator

4.1.2 A generator to create grayscale images

4.1.3 Training GANs to generate images of clothing items

4.2 Convolutional layers

4.2.1 How do convolutional operations work?

4.2.2 How do stride and padding affect convolutional operations?

4.3 Transposed convolution and batch normalization

4.3.1 How do transposed convolutional layers work?

4.3.2 Batch normalization

4.4 Color images of anime faces

4.4.1 Downloading anime faces

4.4.2 Channels-first color images in PyTorch

4.5 Deep convolutional GAN

4.5.1 Building a DCGAN

4.5.2 Training and using DCGAN

Summary

5 Selecting characteristics in generated images

5.1 The eyeglasses dataset

5.1.1 Downloading the eyeglasses dataset

5.1.2 Visualizing images in the eyeglasses dataset

5.2 cGAN and Wasserstein distance

5.2.1 WGAN with gradient penalty

5.2.2 cGANs

5.3 Create a cGAN

5.3.1 A critic in cGAN

5.3.2 A generator in cGAN

5.3.3 Weight initialization and the gradient penalty function

5.4 Training the cGAN

5.4.1 Adding labels to inputs

5.4.2 Training the cGAN

5.5 Selecting characteristics in generated images

5.5.1 Selecting images with or without eyeglasses

5.5.2 Vector arithmetic in latent space

5.5.3 Selecting two characteristics simultaneously

Summary

6 CycleGAN: Converting blond hair to black hair

6.1 CycleGAN and cycle consistency loss

6.1.1 What is CycleGAN?

6.1.2 Cycle consistency loss

6.2 The celebrity faces dataset

6.2.1 Downloading the celebrity faces dataset

6.2.2 Process the black and blond hair image data

6.3 Building a CycleGAN model

6.3.1 Creating two discriminators

6.3.2 Creating two generators

6.4 Using CycleGAN to translate between black and blond hair

6.4.1 Training a CycleGAN to translate between black and blond hair

6.4.2 Round-trip conversions of black hair images and blond hair images

Summary

7 Image generation with variational autoencoders

7.1 An overview of AEs

7.1.1 What is an AE?

7.1.2 Steps in building and training an AE

7.2 Building and training an AE to generate digits

7.2.1 Gathering handwritten digits

7.2.2 Building and training an AE

7.2.3 Saving and using the trained AE

7.3 What are VAEs?

7.3.1 Differences between AEs and VAEs

7.3.2 The blueprint to train a VAE to generate human face images

7.4 A VAE to generate human face images

7.4.1 Building a VAE

7.4.2 Training the VAE

7.4.3 Generating images with the trained VAE

7.4.4 Encoding arithmetic with the trained VAE

Summary

Part 3. Natural language processing and Transformers

8 Text generation with recurrent neural networks

8.1 Introduction to RNNs

8.1.1 Challenges in generating text

8.1.2 How do RNNs work?

8.1.3 Steps in training a LSTM model

8.2 Fundamentals of NLP

8.2.1 Different tokenization methods

8.2.2 Word embedding

8.3 Preparing data to train the LSTM model

8.3.1 Downloading and cleaning up the text

8.3.2 Creating batches of training data

8.4 Building and training the LSTM model

8.4.1 Building an LSTM model

8.4.2 Training the LSTM model

8.5 Generating text with the trained LSTM model

8.5.1 Generating text by predicting the next token

8.5.2 Temperature and top-K sampling in text generation

Summary

9 A line-by-line implementation of attention and Transformer

9.1 Introduction to attention and Transformer

9.1.1 The attention mechanism

9.1.2 The Transformer architecture

9.1.3 Different types of Transformers

9.2 Building an encoder

9.2.1 The attention mechanism

9.2.2 Creating an encoder

9.3 Building an encoder-decoder Transformer

9.3.1 Creating a decoder layer

9.3.2 Creating an encoder-decoder Transformer

9.4 Putting all the pieces together

9.4.1 Defining a generator

9.4.2 Creating a model to translate between two languages

Summary

10 Training a Transformer to translate English to French

10.1 Subword tokenization

10.1.1 Tokenizing English and French phrases

10.1.2 Sequence padding and batch creation

10.2 Word embedding and positional encoding

10.2.1 Word embedding

10.2.2 Positional encoding

10.3 Training the Transformer for English-to-French translation

10.3.1 Loss function and the optimizer

10.3.2 The training loop

10.4 Translating English to French with the trained model

Summary

11 Building a generative pretrained Transformer from scratch

11.1 GPT-2 architecture and causal self-attention

11.1.1 The architecture of GPT-2

11.1.2 Word embedding and positional encoding in GPT-2

11.1.3 Causal self-attention in GPT-2

11.2 Building GPT-2XL from scratch

11.2.1 BPE tokenization

11.2.2 The Gaussian error linear unit activation function

11.2.3 Causal self-attention

11.2.4 Constructing the GPT-2XL model

11.3 Loading up pretrained weights and generating text

11.3.1 Loading up pretrained parameters in GPT-2XL

11.3.2 Defining a generate() function to produce text

11.3.3 Text generation with GPT-2XL

Summary

12 Training a Transformer to generate text

12.1 Building and training a GPT from scratch

12.1.1 The architecture of a GPT to generate text

12.1.2 The training process of the GPT model to generate text

12.2 Tokenizing text of Hemingway novels

12.2.1 Tokenizing the text

12.2.2 Creating batches for training

12.3 Building a GPT to generate text

12.3.1 Model hyperparameters

12.3.2 Modeling the causal self-attention mechanism

12.3.3 Building the GPT model

12.4 Training the GPT model to generate text

12.4.1 Training the GPT model

12.4.2 A function to generate text

12.4.3 Text generation with different versions of the trained model

Summary

Part 4. Applications and new developments

13 Music generation with MuseGAN

13.1 Digital music representation

13.1.1 Musical notes, octave, and pitch

13.1.2 An introduction to multitrack music

13.1.3 Digitally represent music: Piano rolls

13.2 A blueprint for music generation

13.2.1 Constructing music with chords, style, melody, and groove

13.2.2 A blueprint to train a MuseGAN

13.3 Preparing the training data for MuseGAN

13.3.1 Downloading the training data

13.3.2 Converting multidimensional objects to music pieces

13.4 Building a MuseGAN

13.4.1 A critic in MuseGAN

13.4.2 A generator in MuseGAN

13.4.3 Optimizers and the loss function

13.5 Training the MuseGAN to generate music

13.5.1 Training the MuseGAN

13.5.2 Generating music with the trained MuseGAN

Summary

14 Building and training a music Transformer

14.1 Introduction to the music Transformer

14.1.1 Performance-based music representation

14.1.2 The music Transformer architecture

14.1.3 Training the music Transformer

14.2 Tokenizing music pieces

14.2.1 Downloading training data

14.2.2 Tokenizing MIDI files

14.2.3 Preparing the training data

14.3 Building a GPT to generate music

14.3.1 Hyperparameters in the music Transformer

14.3.2 Building a music Transformer

14.4 Training and using the music Transformer

14.4.1 Training the music Transformer

14.4.2 Music generation with the trained Transformer

Summary

15 Diffusion models and text-to-image Transformers

15.1 Introduction to denoising diffusion models

15.1.1 The forward diffusion process

15.1.2 Using the U-Net model to denoise images

15.1.3 A blueprint to train the denoising U-Net model

15.2 Preparing the training data

15.2.1 Flower images as the training data

15.2.2 Visualizing the forward diffusion process

15.3 Building a denoising U-Net model

15.3.1 The attention mechanism in the denoising U-Net model

15.3.2 The denoising U-Net model

15.4 Training and using the denoising U-Net model

15.4.1 Training the denoising U-Net model

15.4.2 Using the trained model to generate flower images

15.5 Text-to-image Transformers

15.5.1 CLIP: A multimodal Transformer

15.5.2 Text-to-image generation with DALL-E 2

Summary

16 Pretrained large language models and the LangChain library

16.1 Content generation with the OpenAI API

16.1.1 Text generation tasks with OpenAI API

16.1.2 Code generation with OpenAI API

16.1.3 Image generation with OpenAI DALL-E 2

16.1.4 Speech generation with OpenAI API

16.2 Introduction to LangChain

16.2.1 The need for the LangChain library

16.2.2 Using the OpenAI API in LangChain

16.2.3 Zero-shot, one-shot, and few-shot prompting

16.3 A zero-shot know-it-all agent in LangChain

16.3.1 Applying for a Wolfram Alpha API Key

16.3.2 Creating an agent in LangChain

16.3.3 Adding tools by using OpenAI GPTs

16.3.4 Adding tools to generate code and images

16.4 Limitations and ethical concerns of LLMs

16.4.1 Limitations of LLMs

16.4.2 Ethical concerns for LLMs

Summary

Appendixes

Appendix A: Installing Python, Jupyter Notebook, and PyTorch

A.1 Installing Python and setting up a virtual environment

A.1.1 Installing Anaconda

A.1.2 Setting up a Python virtual environment

A.1.3 Installing Jupyter Notebook

A.2 Installing PyTorch

A.2.1 Installing PyTorch without CUDA

A.2.2 Installing PyTorch with CUDA

Appendix B: Minimally qualified readers and deep learning basics

B.1 Deep learning and deep neural networks

B.1.1 Anatomy of a neural network

B.1.2 Different types of layers in neural networks

B.1.3 Activation Functions

B.2 Training a deep neural network

B.2.1 The training process

B.2.2 Loss functions

B.2.3 Optimizers

Overview

1 What Is Generative AI and Why PyTorch?

Generative AI has rapidly moved from research to mainstream since late 2022, reshaping workflows and creative processes across industries. This chapter sets the stage by clarifying what generative AI is, how it differs from non-generative (discriminative) systems, and why its ability to synthesize new text, images, audio, and more is so disruptive. It frames the core questions behind the technology’s mechanism and impact, then positions Python and PyTorch as the practical foundation for learning and experimentation throughout the book.

The chapter introduces the main families of models you will build: Generative Adversarial Networks (GANs) and Transformers, along with variational autoencoders and diffusion models. GANs pair a generator and a discriminator in a competitive loop to learn data distributions and produce convincing new samples, enabling tasks from image synthesis to style and attribute translation. Transformers, powered by the self-attention mechanism, handle sequences efficiently, capture long-range dependencies, and scale via parallel training—properties that underpin large language models and modern multimodal systems. Diffusion models and text-to-image pipelines illustrate how iterative refinement and conditioning unlock high-quality visual generation.

To make these ideas concrete, the book adopts a build-from-scratch approach using Python and PyTorch. PyTorch’s dynamic computation graph, clear syntax, GPU acceleration, and rich ecosystem make it well suited to rapid prototyping, transfer learning, and integration with familiar scientific libraries. By implementing models end to end, readers develop intuition for architectures, learn to control outputs (for example, selecting attributes through latent variables), and gain the skills to adapt or fine-tune pre-trained models for downstream tasks. This deeper understanding not only improves practical effectiveness but also equips readers to evaluate the capabilities and risks of generative AI with greater rigor and responsibility.

A comparison of generative models versus discriminative models. A discriminative model (top half of the figure) takes data as inputs and produces probabilities of different labels, which we denote by Prob(dog) and Prob(cat). In contrast, a generative model (bottom half) acquires an in-depth understanding of the defining characteristics of these images to synthesize new images representing dogs and cats.

Generative Adversarial Networks (GANs) architecture and its components. A GAN pairs two networks: a generative model (left) that captures the underlying data distribution, and a discriminative model (center) that estimates the probability that a given sample comes from the authentic training dataset ("real") rather than from the generative model ("fake").

Examples from the Anime faces training dataset.

Anime face images generated by the trained DCGAN generator.

Changing hair color with CycleGAN. If we feed images with blond hair (first row) to a trained CycleGAN model, the model converts blond hair to black hair in these images (second row). The same trained model can also convert black hair (third row) to blond hair (bottom row).

The Transformer architecture. The encoder in the Transformer (left side of the diagram) learns the meaning of the input sequence (e.g., the English phrase “How are you?”) and converts it into an abstract representation that captures its meaning before passing it to the decoder (right side of the diagram). The decoder constructs the output (e.g., the French translation of the English phrase) by predicting one word at a time, based on previous words in the sequence and the abstract representation from the encoder.

The diffusion model adds more and more noise to the images and learns to reconstruct them. The left column contains four original flower images. Moving to the right, noise is added at each step until, in the rightmost column, the four images are pure noise. We then use these noisy images to train a diffusion-based model that progressively removes noise to generate new data samples.

Image generated by DALL-E 2 with text prompt “an astronaut in a space suit riding a unicorn”.

Summary

Generative AI is a technology that can produce diverse forms of new content, including text, images, code, music, audio, and video.
Discriminative models specialize in assigning labels while generative models generate new instances of data.
PyTorch, with its dynamic computational graphs and support for GPU training, is well suited for deep learning and generative modeling.
GANs are a type of generative model consisting of two neural networks: a generator and a discriminator. The goal of the generator is to create data samples realistic enough to maximize the chance that the discriminator thinks they are real. The goal of the discriminator is to correctly distinguish fake samples from real ones.
Transformers are deep neural networks that use the attention mechanism to identify long-range dependencies among elements in a sequence. The original Transformer has an encoder and a decoder. When it is used for English-to-French translation, for example, the encoder converts the English sentence into an abstract representation and passes it to the decoder. The decoder generates the French translation one word at a time, based on the encoder’s output and the previously generated words.

FAQ

What is generative AI, and how does it differ from discriminative AI?

Generative AI learns data distributions to create new content (text, images, audio, etc.). Discriminative AI focuses on labeling or classifying existing data. Statistically, discriminative models estimate P(Y|X), while generative models aim to learn the joint distribution P(X, Y) (or P(X)) and sample new X from it.
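To make the distinction between P(Y|X) and P(X, Y) concrete, here is a minimal sketch in plain Python. The joint probability table and the feature names are invented for illustration and do not come from the book. The same table answers a discriminative question (which label is likely for this input?) and a generative one (what new examples can be sampled?).

```python
import random

# Hypothetical joint distribution P(X, Y) over one feature X and a label Y.
# The numbers are made up for illustration and sum to 1.
joint = {
    ("pointy_ears", "cat"): 0.40,
    ("pointy_ears", "dog"): 0.10,
    ("floppy_ears", "cat"): 0.05,
    ("floppy_ears", "dog"): 0.45,
}

def conditional_label_probs(x):
    """Discriminative view: P(Y | X = x), obtained by normalizing the joint."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

def sample_pairs(n, seed=0):
    """Generative view: draw new (x, y) pairs from the joint P(X, Y)."""
    rng = random.Random(seed)
    pairs = list(joint)
    weights = [joint[pair] for pair in pairs]
    return rng.choices(pairs, weights=weights, k=n)

probs = conditional_label_probs("pointy_ears")  # label probabilities for one input
samples = sample_pairs(5)                       # five newly generated examples
print(probs)
print(samples)
```

With this table, `probs` assigns 0.8 to "cat" and 0.2 to "dog" for a pointy-eared input, while `samples` contains freshly drawn feature-label pairs. Real generative models learn the distribution from data instead of reading it from a table.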

Why does this book use Python and PyTorch for generative AI?

Python offers readable syntax, cross-platform support, and a vast ecosystem. PyTorch complements it with a flexible, Pythonic API, dynamic computational graphs that simplify experimentation and debugging, strong GPU acceleration, and an active community—ideal for fast-moving generative AI work.

What is a dynamic computational graph in PyTorch, and why does it matter?

A dynamic computational graph is built and modified on the fly as your code runs. This makes model architectures easier to vary, experiments faster to iterate, and bugs simpler to diagnose—key advantages when building and training custom generative models.
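As a rough illustration of "built on the fly", the sketch below uses an ordinary Python `while` loop whose number of iterations depends on the data. The function name `forward` and the threshold of 10.0 are arbitrary choices for this example, not code from the book. PyTorch records whichever operations actually ran, so gradients flow back through exactly that path.

```python
import torch

def forward(x, w):
    """Double the activation until its norm reaches 10; the loop count is data dependent."""
    h = x * w
    steps = 0
    while h.norm() < 10.0:  # plain Python control flow, evaluated at run time
        h = h * 2.0
        steps += 1
    return h.sum(), steps

w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([0.5, 0.5])

loss, steps = forward(x, w)
loss.backward()  # autograd walks the graph that was recorded during this call

print(steps)   # number of doublings that happened for this input
print(w.grad)  # gradient of the loss with respect to w
```

For this input the loop runs four times, so the loss equals 16 * (x * w).sum() and `w.grad` is `[8.0, 8.0]`. A different input could take a different number of iterations without any change to the code.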

How do Generative Adversarial Networks (GANs) work at a high level?

GANs pit two networks against each other: a generator that synthesizes data and a discriminator that distinguishes real from fake. Through iterative training in this zero-sum game, the generator learns to produce outputs that the discriminator cannot reliably tell apart from real samples.
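The tug-of-war can be sketched with the binary cross-entropy objectives commonly used for GANs. This is an illustrative plain-Python version that assumes the widespread formulation in which the generator is scored against the "real" label. The function names and the example probabilities are hypothetical, and the book's training code may differ.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    eps = 1e-12  # guards against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants D(real) close to 1 and D(fake) close to 0."""
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_loss(d_fake):
    """The generator wants the discriminator to output 1 on generated samples."""
    return bce(d_fake, 1)

# Early in training: the discriminator easily spots the fake sample.
early = (discriminator_loss(0.9, 0.1), generator_loss(0.1))
# Near equilibrium: the discriminator can only guess, so it outputs 0.5 everywhere.
late = (discriminator_loss(0.5, 0.5), generator_loss(0.5))

print(early)
print(late)
```

As the generator improves, its loss falls while the discriminator's loss rises toward 2 ln 2, the value it reaches when it can no longer tell real from fake.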

What role does the latent vector Z play in GANs?

The latent vector Z is the generator’s input “task description.” Sampling different Z values produces diverse outputs and lets you explore or control characteristics in generated content. Later extensions (e.g., conditional GANs) further steer specific attributes.
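The following sketch shows what "sampling different Z values" and "exploring" the latent space can look like in code. It uses plain Python, and `toy_generator` is a hypothetical stand-in for a trained generator network, included only so the example runs end to end.

```python
import random

def sample_latent(dim, rng):
    """Draw a latent vector Z from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def interpolate(z_start, z_end, alpha):
    """Linearly blend two latent vectors; alpha=0 gives z_start, alpha=1 gives z_end."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_start, z_end)]

def toy_generator(z):
    """Hypothetical placeholder for a trained generator: a fixed deterministic map."""
    return [2.0 * zi + 1.0 for zi in z]

rng = random.Random(42)
z_a = sample_latent(4, rng)
z_b = sample_latent(4, rng)

# Five latent vectors on the straight line from z_a to z_b.
path = [interpolate(z_a, z_b, step / 4) for step in range(5)]
outputs = [toy_generator(z) for z in path]
print(outputs)
```

With a real generator, walking along such a path typically produces outputs that change gradually from one sample to the other, which is one way to inspect what the latent dimensions control.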

What practical applications do GANs have beyond image synthesis?

GANs support image-to-image translation (e.g., changing hair color with CycleGAN), data augmentation, style transfer, and even music generation. They can also reduce production costs by generating realistic previews (such as customized product images) before physical manufacturing.

Why did Transformers overtake RNNs/LSTMs for sequence tasks?

Transformers use self-attention to capture long-range dependencies and process tokens in parallel, enabling much faster training on large datasets. This scalability and context modeling outperform sequential RNN processing for many tasks.
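A bare-bones version of scaled dot-product self-attention is sketched below in plain Python. It is a simplification: the token vectors serve directly as queries, keys, and values, whereas a real Transformer first applies learned projections, and the three toy token vectors are invented for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:  # each query is independent, so this loop can run in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token embeddings; no learned projection matrices in this sketch.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to every other token in a single step regardless of distance, and no query depends on the result of another. That independence is what allows the computation to be batched into matrix multiplications instead of being processed one position at a time as in an RNN.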

What are the main Transformer variants and their typical use cases?

- Encoder-only (e.g., BERT): understanding tasks such as classification and named entity recognition.
- Decoder-only (e.g., GPT-2/ChatGPT): text generation and language modeling.
- Encoder-decoder: sequence-to-sequence and multimodal tasks like translation, speech recognition, or text-to-image.

How do diffusion models relate to text-to-image systems like DALL·E?

Diffusion models learn to remove noise step by step to generate high-quality images. Text-to-image systems condition this process on prompts and often pair Transformer components with diffusion-style iterative refinement to align outputs with the textual description.
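Before a model can learn to remove noise, training examples with known amounts of noise are needed. The sketch below shows one common way to produce them in plain Python, assuming a linear noise schedule with values often seen in DDPM-style models. The schedule, the step count of 1000, and the four-value "image" are assumptions for illustration, not the book's settings.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances that grow linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def signal_fractions(betas):
    """Cumulative product of (1 - beta_t): the share of signal variance left at step t."""
    fractions, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        fractions.append(running)
    return fractions

def add_noise(pixels, t, fractions, rng):
    """Jump directly to step t by mixing the clean pixels with Gaussian noise."""
    keep = math.sqrt(fractions[t])
    spread = math.sqrt(1.0 - fractions[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [keep * p + spread * n for p, n in zip(pixels, noise)]
    return noisy, noise

rng = random.Random(0)
fractions = signal_fractions(linear_betas(1000))
pixels = [0.2, -0.5, 0.9, 0.0]  # stand-in for a tiny normalized image

slightly_noisy, _ = add_noise(pixels, 0, fractions, rng)
almost_pure_noise, target_noise = add_noise(pixels, 999, fractions, rng)
print(fractions[0], fractions[-1])
```

At the first step almost all of the signal survives, while at the last step almost none does. In a typical setup, a denoising network is trained on pairs such as the noisy pixels and `target_noise`, so that it can undo the noise one step at a time when generating.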

Why build generative models from scratch instead of only using pre-trained ones?

Implementing models yourself deepens understanding, improves troubleshooting, and enables precise control (e.g., attribute steering in GANs). It also equips you to adapt or fine-tune pre-trained LLMs for downstream tasks and to evaluate benefits and risks of AI more responsibly.

