Reverse Diffusion is the generative core of diffusion models: starting from pure noise, the model removes noise step by step to reveal coherent images that reflect the training data distribution. The chapter contrasts this with Forward Diffusion, explains the intuition (progressively uncovering structure), and details how the model learns to predict the noise present at each timestep so it can be subtracted reliably. A carefully chosen noise schedule guides the process, and a small stochastic term is retained during sampling to promote diversity and avoid mode collapse. Training teaches the network to estimate the added noise at random timesteps using a simple mean squared error objective; inference chains these predictions from high to low timesteps to synthesize images.
The chapter centers noise prediction on a U-Net backbone whose encoder-decoder design with skip connections captures both global context and fine detail, which is crucial for accurate denoising. Timestep conditioning is introduced via sinusoidal embeddings injected into the network so the model adapts its strategy across noise levels, focusing on broad structure early and fine details later. This conditioning improves accuracy and efficiency, enables flexibility in sampling (such as step skipping), and ties together the denoising and stochastic components governed by the beta/alpha schedules and their cumulative products.
To make the ideas concrete, the chapter implements a DDPM in PyTorch on MNIST: defining the beta schedule and closed-form terms, constructing a time-conditioned U-Net, applying forward and reverse diffusion, training with uniform timestep sampling and MSE on noise, and sampling by iteratively denoising from Gaussian noise, followed by qualitative evaluation (with mentions of common quantitative metrics). It closes by comparing DDPMs with VAEs and GANs: VAEs train stably but can blur; GANs generate sharp outputs quickly but may suffer instability and mode collapse; DDPMs offer stable training and high-quality, diverse samples at the cost of slower iterative generation—an increasingly favorable trade-off that powers modern text-to-image systems.
Reverse Diffusion
U-Net network
MNIST images generated by our DDPM
Summary
Forward Diffusion Process:
Gradually adds noise to data samples over a series of steps
Transforms structured data into unstructured noise
Defined by a carefully chosen noise schedule
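In the standard DDPM formulation (with αt = 1 − βt and α̅t = ∏s≤t αs), this corruption has a closed form, which is what makes training efficient:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})
```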
Reverse Diffusion Process:
Learns to gradually remove noise from noisy data
Generates new data by iteratively denoising random noise
Requires a learned model to predict and remove noise at each step
U-Net Architecture:
Crucial for effective noise prediction in DDPMs
Features a symmetric encoder-decoder structure with skip connections
Allows the network to capture both fine-grained details and broader context
Timestep Conditioning:
Enables the model to adapt its denoising behavior based on the current diffusion step
Implemented through sinusoidal position encodings
Crucial for the model’s ability to handle different noise levels
DDPM Training:
Involves forward diffusion to create noisy samples and reverse diffusion to denoise
Uses a simple mean squared error loss between predicted and actual noise
Generally more stable than adversarial training (as in GANs)
DDPM Sampling (Generation):
Starts from pure noise and iteratively applies the reverse diffusion process
More computationally intensive than sampling from VAEs or GANs
Produces high-quality and diverse samples
Comparison with VAEs and GANs:
DDPMs often produce higher quality and more diverse samples than VAEs
DDPMs can match or exceed GAN quality while offering more stable training
DDPMs are slower in generating samples compared to both VAEs and GANs
FAQ
What is Reverse Diffusion, and how does it relate to Forward Diffusion?
Reverse Diffusion inverts Forward Diffusion. While Forward Diffusion adds Gaussian noise to a clean sample x0 over T steps to produce xT (pure noise), Reverse Diffusion iteratively removes that noise: starting from xT and using a learned model to predict and subtract the noise at each timestep to recover x0. This is how diffusion models generate new images from random noise.

How does the Reverse Diffusion update (Equation 5.2) work?
Equation 5.2 updates xt to xt−1 with two parts:
- Denoising term: uses the model’s noise prediction εθ(xt, t) and step-dependent scalings (αt, α̅t) to subtract the estimated noise proportionally to how much was added in the forward process.
- Stochastic term: adds σtz (with z ~ N(0, I)) to maintain diversity and correct statistical properties.
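Written out in the standard DDPM form, the update reads as follows (a reconstruction from the quantities named above; the chapter's exact notation in Equation 5.2 may differ slightly):

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, \mathbf{I})
```

A common choice is σt = √βt, with the z term dropped at the final step.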
Together, these refine structure while preserving probabilistic sampling.

Why does the Reverse Diffusion step add noise (the stochastic component)?
The controlled noise σtz is crucial to:
- Avoid mode collapse by allowing multiple denoising paths.
- Improve generalization by preventing overfitting to a single deterministic path.
- Increase realism and diversity by preserving the probabilistic nature of the process.
σt typically decreases as t→0, so late steps are less noisy.

What exactly does the model learn during training, and what loss is used?
During training, the model learns to predict the noise ε added at any timestep t. We:
- Sample x0, pick t ~ Uniform(1, T), generate xt via the closed-form forward noising equation.
- Train the network to predict ε from (xt, t).
- Use mean squared error loss L = E[||ε − εθ(xt, t)||²].
Accurate noise prediction enables effective reverse denoising.
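A minimal PyTorch sketch of this training step (the names `model`, `T`, and the precomputed `sqrt_alpha_bar` / `sqrt_one_minus_alpha_bar` buffers are assumptions standing in for the chapter's actual variables):

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, sqrt_alpha_bar, sqrt_one_minus_alpha_bar, T, optimizer):
    """One DDPM training step: noise x0 at random timesteps, predict the noise, MSE."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)      # t ~ Uniform per batch item
    eps = torch.randn_like(x0)                           # the noise the model must predict
    # Closed-form forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    xt = (sqrt_alpha_bar[t].view(b, 1, 1, 1) * x0
          + sqrt_one_minus_alpha_bar[t].view(b, 1, 1, 1) * eps)
    loss = F.mse_loss(model(xt, t), eps)                 # simple MSE on the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```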

How does inference (sampling) generate new images from noise?
Sampling proceeds by:
- Initializing xT ~ N(0, I).
- For t = T, T−1, …, 1: predict εθ(xt, t) and apply Equation 5.2 to obtain xt−1.
- Output x0 as the generated image.
This iterative process transforms noise into a coherent sample from the learned data distribution.
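A sketch of the corresponding sampling loop, assuming precomputed `betas`, `alphas`, and `alpha_bar` tensors and the same noise-predicting `model` (this is the standard DDPM sampler, not necessarily the chapter's exact code):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alpha_bar, device="cpu"):
    """Generate images by iteratively denoising pure Gaussian noise."""
    x = torch.randn(shape, device=device)                # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):                # t = T-1, ..., 0 (0-indexed)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                          # predicted noise
        coef = (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()       # denoising term (Equation 5.2)
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # stochastic term
        else:
            x = mean                                     # no noise on the final step
    return x.clamp(-1, 1)                                # match the [-1, 1] data range
```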

Why is a U-Net used for noise prediction in diffusion models?
U-Net is effective because:
- Encoder–decoder with skip connections captures multi-scale context (global structure) and preserves fine details.
- Skip connections restore spatial precision lost during downsampling.
- Its symmetric “U” design naturally supports image-to-image mapping tasks like denoising.
These properties make it well-suited to predict ε at varying noise levels.
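For intuition, a deliberately tiny sketch of the encoder-skip-decoder pattern with a timestep-embedding hook (channel sizes, depth, and the `t_emb` interface are illustrative assumptions; the chapter's U-Net is larger):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net for 1-channel images: one downsampling stage, one skip connection."""
    def __init__(self, ch=32, t_dim=32):
        super().__init__()
        self.enc = nn.Conv2d(1, ch, 3, padding=1)                  # fine-grained features
        self.down = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)  # coarse/global context
        self.t_proj = nn.Linear(t_dim, ch * 2)                     # timestep conditioning
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)
        self.dec = nn.Conv2d(ch * 2, 1, 3, padding=1)              # after skip concatenation

    def forward(self, x, t_emb):
        h1 = torch.relu(self.enc(x))
        h2 = torch.relu(self.down(h1))
        h2 = h2 + self.t_proj(t_emb)[:, :, None, None]             # inject noise-level info
        u = torch.relu(self.up(h2))
        return self.dec(torch.cat([u, h1], dim=1))                 # skip restores detail
```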

What is timestep conditioning/embedding, and how is it implemented?
Timestep conditioning informs the network “how noisy” the input is. Implementation:
- Encode t into a high-dimensional vector using sinusoidal embeddings.
- Inject the embedding into the U-Net (e.g., added as channels or injected at multiple layers).
Benefits:
- Specializes denoising behavior per noise level.
- Improves learning efficiency and quality.
- Enables flexibility (e.g., step skipping or adaptive schedules during inference).
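A common transformer-style implementation of the embedding itself (the frequency base 10000 and the sin/cos split are conventional choices, assumed here; `dim` is taken to be even):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps t of shape [B] to sinusoidal vectors of shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]   # [B, half], one frequency per column
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```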

How does the noise schedule βt influence training and sampling?
The schedule βt (and the derived αt, α̅t) controls how quickly noise is added in forward diffusion and thus how denoising scales are set in reverse:
- Typically increases over time to ensure gradual corruption.
- Affects stability, sample quality, and the difficulty of predictions at each step.
- Precomputing the α̅ products enables closed-form xt sampling and efficient training.
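For example, a linear schedule with its precomputed terms (the 1e-4 to 0.02 range follows the original DDPM paper; a minimal sketch):

```python
import torch

T = 1000                                            # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)               # linear beta schedule
alphas = 1.0 - betas                                # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alphas, dim=0)            # cumulative product (alpha-bar)
sqrt_alpha_bar = alpha_bar.sqrt()                   # scales x0 in closed-form noising
sqrt_one_minus_alpha_bar = (1 - alpha_bar).sqrt()   # scales the noise eps
```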

How do DDPMs compare with VAEs and GANs?
Trade-offs:
- Quality/diversity: DDPMs produce sharp, diverse samples; often rival or surpass GANs on benchmarks; VAEs can be blurrier but cover modes well.
- Speed: VAEs and GANs sample in a single pass; DDPMs require many iterative steps (slower).
- Stability: DDPMs train stably with a simple MSE objective; GANs can be unstable; VAEs are generally stable.

What are key practical tips for implementing a DDPM in PyTorch?
Tips:
- Normalize data to [-1, 1] and clamp outputs accordingly.
- Sample timesteps t uniformly per batch item during training.
- Precompute α, α̅, and related terms for efficiency and numerical stability.
- Use a GPU if available; keep shapes consistent when broadcasting timestep-dependent scalars (see the helper sketched after this list).
- Inject timestep embeddings throughout the U-Net.
- Expect slower sampling; consider fewer steps or acceleration methods if needed.
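On the broadcasting tip, a small helper many implementations define (the name `extract` is a common convention, not necessarily the chapter's):

```python
def extract(values, t, x_shape):
    """Gather per-timestep scalars values[t] and reshape them to broadcast over images."""
    out = values.gather(0, t)                                  # one scalar per batch item
    return out.view(t.shape[0], *([1] * (len(x_shape) - 1)))   # e.g., [B, 1, 1, 1]
```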