4 Diffusion Models: Forward Diffusion
Diffusion models are a class of generative models that create images by moving between order and randomness through many small steps. Unlike VAEs and GANs that map directly from latent codes to images, diffusion models rely on a two-phase mechanism: a Forward Diffusion phase that gradually corrupts data with noise, and a Reverse Diffusion phase that learns to remove it. This chapter builds intuition for that process—drawing on ideas like Brownian motion and probabilistic views of images—emphasizing that coherent, high-quality images occupy a tiny manifold within an astronomically large space of possible pixel grids. By understanding how structure is lost in a controlled way, we set up the learning task of reconstructing it.
The Forward Diffusion process incrementally adds Gaussian noise over T steps, transforming an original sample x0 into xT that is nearly isotropic Gaussian noise. It is a Markov process with a predefined variance schedule βt that controls how much noise is added at each step; schedules are typically small at the start and larger later (e.g., linear or cosine), enabling the model to learn fine details at low noise and broader structure at high noise. This intentional corruption simplifies the intractable data distribution into a tractable one while remaining reversible in principle. Mathematically, the process is designed so the distribution approaches Gaussian as t increases, and a closed-form expression allows direct sampling of any intermediate xt from x0 without iterating through all prior steps, improving efficiency. Visualized in 1D, Forward Diffusion smooths complex, multimodal densities into a single-peaked Gaussian, illustrating how structure is progressively washed out.
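In equations, this is the standard DDPM construction (the symbols αt and ᾱt are introduced here for convenience; they are derived from the chapter's βt schedule):

```latex
% Single-step transition kernel, with variance schedule \beta_t:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Closed-form marginal: sample any x_t directly from x_0,
% where \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)

% Equivalently, by reparameterization with \epsilon \sim \mathcal{N}(0,\mathbf{I}):
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
```

The last line is the closed-form shortcut just described: one draw of ε yields xt for any t without iterating through the intervening steps.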
Practically, Forward Diffusion supplies the training signals for learning the reverse mapping: the model is trained to predict and remove the noise added at each step, enabling it to trace a path from high-entropy noise back to low-entropy, coherent images. This design distills coherence from chaos and underpins the strong sample quality, diversity, and training stability observed with DDPMs. With the foundations, terminology, and motivation established here—why Gaussian noise, how βt shapes learning, and how closed-form skipping aids computation—the stage is set for the next chapter’s detailed treatment of Reverse Diffusion and a hands-on construction of denoising diffusion models from scratch.
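As a concrete sketch of that training signal, the snippet below implements the simplified noise-prediction objective in PyTorch. The network `model` (taking a noisy batch and its timesteps) is a placeholder for whatever denoiser the next chapter builds, and the linear schedule endpoints are illustrative values, not ones fixed by this chapter.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # illustrative linear schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative products, i.e. alpha-bar_t

def training_step(model, x0):
    """One DDPM training step: jump x0 to a random timestep, predict the added noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                         # a random timestep per sample
    a_bar = alpha_bars[t].view(b, 1, 1, 1)                # broadcast over image dims
    eps = torch.randn_like(x0)                            # the noise the model must predict
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # closed-form forward jump
    eps_pred = model(x_t, t)                              # predicted noise
    return F.mse_loss(eps_pred, eps)                      # simplified DDPM objective
```

Minimizing this loss over random timesteps is what lets the trained model walk the reverse path from xT back to x0.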
[1] Example of Forward and Reverse diffusion.
Illustration of the Forward Diffusion process.
Illustration of the noise addition at a single step t during the Forward Diffusion phase, transforming image xt into xt+1 by adding the noise εt.
Illustration of noise removal via Reverse Diffusion. Given a noisy image xt, the Reverse Diffusion model learns to predict εt-1. By subtracting this noise, the model aims to recover the previous image state, xt-1.
Conceptual visualization of the universe of all possible images.
Conceptual illustration of Diffusion in the pixel space. Diffusion Models learn the complex distribution of real images by first adding noise to these images and then learning to reverse this process. Through training, they effectively map the journey from disorder (random noise) to order (structured, recognizable imagery).
Using Diffusion to synthesize new images.
Isotropic Gaussian distribution in two dimensions. Characteristic of an isotropic Gaussian, the distribution exhibits uniform variance in every direction, manifesting as a perfectly symmetrical shape around its mean: data points spread out from the center with equal probability in all planar directions.
[2] Probability density function of 1D data. This figure shows a Probability Density Function (PDF) for 1D data, with the x-axis representing the data coordinates and the y-axis indicating probability density. A complex, multimodal (having multiple peaks) distribution exemplifies how certain x-coordinates are more likely to host data points, based on the peak heights. That is, the higher the PDF curve denoted as q(x0), the more likely it is that a randomly sampled datapoint would have an x-coordinate near the peak.
[3] Evolution of 1D data distribution via Forward Diffusion. This figure illustrates the transformation of the input data's probability density, q(x0), into a Standard Normal Distribution, q(xT), via Forward Diffusion: the original distribution's characteristics are progressively smoothed through gradual Gaussian noise addition until they align with those of a Gaussian.
Summary
- Diffusion models: A class of generative models that gradually transform simple distributions (e.g., Gaussian noise) into complex data distributions (e.g., images) through a process of iterative denoising, effectively learning to reverse the diffusion process. Diffusion models operate through a two-phase process: Forward Diffusion to introduce noise, and Reverse Diffusion to remove noise and reconstruct images.
- Forward Diffusion: The Forward Diffusion process systematically adds noise to the original data over a series of steps, moving it towards a distribution that is easier to model and sample from, typically Gaussian noise. This process is designed to be reversible, setting the stage for the Reverse Diffusion process where the model learns to denoise or reconstruct the original data from noise.
- Gaussian noise: Gaussian noise, or normally distributed noise, is the type of noise typically added during the Forward Diffusion process, characterized by its bell-shaped probability distribution. Gaussian noise is favored because of its well-understood properties and mathematical tractability, making it ideal for the controlled degradation of data.
- Variance schedule: The variance schedule is a predefined sequence of values that dictate the amount of noise added at each step of the Forward Diffusion process. It ensures that the noise addition is both gradual and controlled, preventing the data from becoming too corrupted too quickly. This plays a key role in the reversibility of the diffusion process, allowing the model to accurately learn how to reverse the noise addition during the Reverse Diffusion phase.
FAQ
What is the Forward Diffusion process in diffusion models?
The Forward Diffusion process gradually adds noise to an original image x0 over T timesteps, producing increasingly noisy versions xt until the data becomes indistinguishable from random (Gaussian) noise at xT. This controlled corruption defines a path the model later learns to reverse.
Why do we add noise during Forward Diffusion?
Adding noise reduces the complexity of the original data distribution, transforming it into a simple, tractable Gaussian distribution. This simplification makes sampling easier and sets up a reversible process the model can learn to invert during generation.
How do the Forward and Reverse Diffusion phases differ?
- Forward Diffusion: incrementally corrupts data by adding noise, moving from structure to noise (x0 → xT).
- Reverse Diffusion: learns to remove noise step by step, moving from noise to structure (xT → x0) to reconstruct or synthesize coherent images (the two kernels are written out below).
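Written in the standard DDPM form, the contrast is (μθ is the learned mean the next chapter derives; σt is shown here as a fixed per-step variance, one common choice):

```latex
\text{Forward (fixed):}\quad
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

\text{Reverse (learned):}\quad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\right)
```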
What does it mean that the Forward Diffusion process is Markovian?
It means each noisy state xt depends only on the immediately preceding state xt−1 (and the current noise), not on earlier steps. This property simplifies analysis and is key to designing a reversible denoising process.
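In symbols, the Markov property lets the whole forward trajectory factor into a product of single-step transitions (standard DDPM notation):

```latex
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})
```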
Why is Gaussian (especially isotropic Gaussian) noise chosen?
Gaussian noise is used because of its symmetry, mathematical tractability, and the Central Limit Theorem. An isotropic Gaussian has equal variance in all directions, yielding a simple, well-understood target distribution for the end of the forward process.
How is noise added at each timestep?
At each step t, a small, scheduled amount of Gaussian noise ε is added to the previous sample xt−1. The addition is scaled by the variance schedule βt so that corruption is gradual, small at first and larger later, preventing the data from becoming noise too quickly.
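A minimal sketch of that single step in PyTorch (names are illustrative; `beta_t` is one entry of the schedule):

```python
import torch

def forward_step(x_prev, beta_t):
    """One forward-diffusion step: x_{t-1} -> x_t with scheduled Gaussian noise."""
    eps = torch.randn_like(x_prev)  # fresh isotropic Gaussian noise
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps
```

The sqrt(1 − βt) scaling on x_prev keeps the overall variance from blowing up as noise accumulates.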
What is the variance schedule βt and why does it matter?
βt controls how much noise is added at each step. It typically starts near zero and increases over time, and its shape significantly affects training dynamics and sample quality. Common choices (sketched in code after this list):
- Linear schedule: variance increases at a constant rate.
- Cosine schedule: variance increases nonlinearly, with smoother changes at the beginning and end.
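Here is a small sketch of both schedules. The linear endpoints (1e-4 to 0.02) are values commonly used in the DDPM literature, and the cosine variant follows the usual recipe of defining ᾱt with a squared cosine and deriving βt from consecutive ratios; both are assumptions for illustration, not values fixed by this chapter.

```python
import math
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear schedule: beta_t increases at a constant rate."""
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule: define alpha-bar_t with a squared cosine, derive beta_t from it."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]                        # normalize so alpha_bar starts at 1
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]  # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return betas.clamp(max=0.999).float()       # clip to avoid a degenerate final step
```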