Image Generation Models

GANs, diffusion models, and transformers

Vladimir Bok

MEAP began September 2024
Last updated July 2025
This book is in development

ISBN 9781633437449
350 pages (estimated)

printed in black & white

available in Russian, Simplified Chinese

table of contents

Part 1: Foundations of Generative AI in Computer Vision

1 Generative AI in Computer Vision

1.1 Introduction: From Chisels to Pixels

1.2 AI, Generative AI, and Computer Vision

1.2.1 Artificial Intelligence (AI)

1.2.2 Computer Vision

1.2.3 Generative AI

1.2.4 The Intersection: Generative AI in Computer Vision

1.3 Practical Applications of Generative AI in Computer Vision

1.3.1 Digital Face Re-Aging in Video Production

1.3.2 Simulation Environments for Self-Driving Cars

1.3.3 Data Augmentation for Medical Imaging

1.4 The Evolution of Generative AI for Image Synthesis

1.4.1 Early Foundations (1950s-1980s)

1.4.2 Emergence of Neural Networks (1980s-2000s)

1.4.3 Breakthroughs in Deep Learning (2000s-2010s)

1.4.4 The GAN Revolution (2014-2020)

1.4.5 New Horizons (2020-Present)

1.5 Taxonomy for Generative AI in Computer Vision

1.5.1 Level of Control

1.6 Model Architecture

1.6.1 Autoencoders

1.6.2 Adversarial Networks

1.6.3 Diffusion Models

1.6.4 Transformers

1.6.5 Choosing the Right Architecture

1.7 Conclusion

1.8 Summary

2 Variational Autoencoders (VAEs)

2.1 Introduction to Autoencoders

2.1.1 The Latent Space

2.1.2 Applications of Autoencoders

2.1.3 Autoencoder Architecture and Training

2.2 Implementing an Autoencoder in PyTorch

2.2.1 Step 1: Import Necessary Libraries

2.2.2 Step 2: Prepare the Dataset

2.2.3 Step 3: Implement the Autoencoder Model

2.2.4 Step 4: Define the Loss Function and Optimizer

2.2.5 Step 5: Train the Autoencoder

2.2.6 Step 6: Model Evaluation

2.2.7 Conclusion and Next Steps

2.3 Variational Autoencoders (VAEs)

2.3.1 VAE Model Architecture: Concepts and Components

2.3.2 The Loss Function and VAE Training

2.4 Implementing VAE in PyTorch

2.4.1 Step 1: Import Necessary Libraries

2.4.2 Step 2: Prepare the Dataset

2.4.3 Step 3: Implement the Encoder

2.4.4 Step 4: Implement the Decoder

2.4.5 Step 5: Implement the VAE Model

2.4.6 Step 6: Define the VAE Loss Function

2.4.7 Step 7: Initialize the Model and Optimizer

2.4.8 Step 8: Train the VAE

2.4.9 Step 9: Model Evaluation

2.4.10 Latent Space Interpolation with VAE

2.5 β-VAE and Latent Space Disentanglement

2.5.1 Applications and Implications of β-VAEs

2.6 Conclusion

2.7 Summary

3 Generative Adversarial Networks (GANs)

3.1 Introduction to GANs

3.1.1 Adversarial Training

3.2 Core Concepts and Theoretical Foundations of GANs

3.2.1 Inspiration from Game Theory

3.2.2 GAN Value Function

3.2.3 Minimax Game: How the Value Function Works in GAN Training

3.2.4 Visualizing GAN Model Architecture

3.2.5 Reaching Equilibrium in GAN Training

3.3 Training Challenges in GANs

3.3.1 Mode Collapse

3.3.2 Vanishing Gradient

3.3.3 Non-Convergence

3.4 Wasserstein GANs: A Solution to Training Challenges

3.4.1 The Intuition Behind WGANs

3.4.2 Understanding Earth Mover’s (Wasserstein) Distance in WGANs

3.4.3 Replacing the Discriminator with a Critic

3.4.4 The WGAN Value Function

3.4.5 Visualizing WGAN Model Architecture

3.5 Implementing a WGAN in PyTorch

3.5.1 Adopting Best Practices from DCGANs

3.5.2 Step 1: Import Necessary Libraries

3.5.3 Step 2: Enable GPU Training

3.5.4 Step 3: Load the Dataset

3.5.5 Step 4: Implement the Generator

3.5.6 Step 5: Implement the Critic

3.5.7 Step 6: Initialize the Models and Optimizers

3.5.8 Step 7: Train the Model

3.5.9 Step 8: Evaluate the Model

3.5.10 Exploring the WGAN Latent Space

3.6 Conclusion

3.7 Summary

Part 2: Advanced Generative Models and Techniques

4 Diffusion Models: Forward Diffusion

4.1 Introduction to Diffusion Models

4.1.1 Forward and Reverse Diffusion Phases

4.2 Foundations and Core Concepts

4.2.1 Forward Diffusion

4.2.2 Reverse Diffusion

4.3 Images, Probability Distributions, and the Visual World

4.3.1 The Infinite Possibilities of Pixel Arrangements

4.3.2 Diffusion Models and the Quest for Coherence

4.4 Forward Diffusion In-Depth

4.4.1 Reducing Data Distribution Complexity by Adding Noise

4.5 Mathematical Foundations of the Forward Diffusion Process

4.5.1 Forward Diffusion Process

4.5.2 Closed-Form Formula for Skipping Steps in Forward Diffusion

4.6 Visualizing Forward Diffusion in 1D

4.6.1 Probability Density of an Example 1D Dataset

4.6.2 Applying Forward Diffusion to the 1D Distribution

4.7 Conclusion

4.8 Summary

5 Diffusion Models: Reverse Diffusion

5.1 Understanding Reverse Diffusion

5.2 The Mathematics of Reverse Diffusion

5.2.1 Forward Diffusion: A Brief Recap

5.2.2 The Reverse Diffusion Equation

5.2.3 Reverse Diffusion: Training vs. Inference

5.3 U-Net Architecture for Denoising

5.3.1 U-Net: Structure and Function

5.3.2 Comparing U-Net to Autoencoders

5.3.3 Adapting U-Net for Diffusion Models

5.4 Step-by-Step Implementation of a Denoising Diffusion Probabilistic Model (DDPM)

5.4.1 Step 1: Import Necessary Libraries

5.4.2 Step 2: Enable GPU Training

5.4.3 Step 3: Prepare the Dataset

5.4.4 Step 4: Implement U-Net Model for Denoising

5.4.5 Step 5: Implement DDPM

5.4.6 Step 6: Train the Model

5.4.7 Step 7: Model Evaluation

5.5 Comparing DDPMs with VAEs and GANs

5.5.1 Generation Process and Computational Requirements

5.5.2 Sample Quality and Diversity

5.5.3 Training Stability and Ease of Use

5.5.4 Latent Space and Interpolation

5.5.5 Theoretical Grounding and Flexibility

5.5.6 Comparison Summary

5.6 Conclusion

5.7 Summary

6 Evaluation and Metrics for Generative Models

6.1 Introduction to Model Evaluation in Generative AI

6.1.1 Importance of Evaluation in Generative Models

6.1.2 Challenges in Evaluating Generative Models for Image Synthesis

6.1.3 Overview of Evaluation Approaches

6.2 Qualitative Evaluation Methods

6.2.1 Visual Inspection

6.2.2 User Studies

6.2.3 Summary of Qualitative Evaluation Methods

6.3 Quantitative Evaluation Metrics

6.3.1 Inception Score (IS)

6.3.2 Fréchet Inception Distance (FID)

6.3.3 Kernel Inception Distance (KID)

6.3.4 Precision and Recall for Distributions

6.3.5 Summary of Quantitative Evaluation Metrics

6.4 Model-Specific Evaluation Techniques

6.4.1 VAE-specific Metrics

6.4.2 GAN-specific Metrics

6.4.3 Diffusion Model-specific Metrics

6.5 Task-Specific Evaluation Metrics

6.5.1 Case Study 1: Super-Resolution for Remote Sensing Imagery

6.5.2 Case Study 2: Medical Image Synthesis for Data Augmentation

6.6 Challenges and Limitations in Evaluation

6.6.1 Bias in Evaluation Metrics

6.6.2 Computational Complexity and Scalability

6.6.3 Lack of Ground Truth in Generative Tasks

6.6.4 Domain Specificity and Generalization

6.7 Conclusion

6.8 Summary

7 Conditional Image Generation

7.1 Why Conditional Generation?

7.1.1 Limitations of Unconditional Models

7.1.2 Applications Where Conditioning is Essential

7.2 Types of Conditioning Information

7.2.1 Conditioning on Class Labels

7.2.2 Conditioning on Images

7.2.3 Conditioning on Text or Other Modalities

7.3 Conditional Variational Autoencoders (cVAEs)

7.3.1 Quick Recap of Variational Autoencoders

7.3.2 Incorporating Conditioning into VAEs

7.4 Implementation of cVAE for Class-Conditioned Image Generation

7.4.1 Step 1: Import Necessary Libraries

7.4.2 Step 2: Prepare the Dataset

7.4.3 Step 3: Define the cVAE Encoder and Decoder

7.4.4 Step 4: Implement cVAE Functions

7.4.5 Step 5: Train the cVAE

7.4.6 Step 6: Evaluate the cVAE

7.5 Evaluation and Metrics for Conditional Image Generation

7.5.1 Conditioning Accuracy Metrics

7.5.2 Human Evaluation

7.6 Conclusion

7.7 Summary

8 Hybrid Architectures and Latent Diffusion Models

8.1 Understanding the Path to Hybrid Models

8.1.1 The Generative AI Trilemma

8.1.2 A Comparative Look at Generative Model Architectures

8.2 The Rise of Hybrid Models: Latent Diffusion as a Case Study

8.2.1 Latent Diffusion: Diffusion in the Latent Space

8.2.2 The Best of Both Worlds: Benefits of Latent Diffusion

8.2.3 The Drawbacks of Latent Diffusion

8.2.4 Balancing Trade-Offs in Modern Generative AI

8.3 Implementing a Latent Diffusion Model

8.3.1 Step 1: Importing Necessary Libraries

8.3.2 Step 2: Setting the Device

8.3.3 Step 3: Building the VAE Architecture – Encoder and Decoder Networks

8.3.4 Step 4: Assembling the Complete VAE

8.3.5 Step 5: VAE Training Setup

8.3.6 Step 6: VAE Training Pipeline

8.3.7 Step 7: Building the Diffusion Components: U-Net Architecture with Residual Blocks

8.3.8 Step 8: Implementing the Diffusion Process: The DDPM Class

8.3.9 Step 9: Tying it All Together – The Complete Latent Diffusion Model

8.3.10 Step 10: Training the Latent Diffusion Model

8.3.11 Step 11: Putting It All Together – Training Pipeline Initialization

8.3.12 Step 12: Generating Samples

8.3.13 Model Output Analysis

8.4 Conclusion

8.5 Summary

Part 3: Multimodal Generative AI and Real-World Impact

9 Bridging Language and Vision with Transformers

9.1 Introduction to Multimodal Modeling and Transformers

9.1.1 Understanding Transformers

9.1.2 The Transformer Architecture: An Overview

9.2 Inside the Transformer: Key Components and Mechanisms

9.2.1 Word Embeddings

9.2.2 Positional Encodings

9.2.3 Understanding Attention Mechanisms

9.2.4 Types of Attention Mechanisms

9.2.5 Multi-Head Attention: Multiple Perspectives on the Same Text

9.3 The Complete Transformer Architecture: Putting It All Together

9.3.1 Input Processing

9.3.2 The Encoder Stack

9.3.3 The Decoder Stack

9.3.4 Final Output Layer

9.3.5 Auto-Regressive Generation

9.3.6 Evolution of Transformer Models

9.4 From NLP to Vision: The Vision Transformer (ViT)

9.4.1 Key Differences from Traditional Transformers

9.4.2 The ViT Model Architecture

9.4.3 Comparing ViTs and CNNs

9.5 CLIP: Bridging Vision and Language

9.5.1 Connecting Words and Images

9.5.2 The CLIP Approach: Learning from Internet-Scale Data

9.5.3 CLIP Architecture

9.5.4 CLIP Training Process

9.5.5 The Power of CLIP’s Representations

9.5.6 From CLIP to Text-to-Image Generation

9.6 Conclusion

9.7 Summary

10 Text-to-Image Generation with Stable Diffusion

11 Video Generation

12 Real-World Applications and Case Studies

Overview

2 Variational Autoencoders (VAEs)

This chapter introduces autoencoders as self-supervised neural networks that learn compact, meaningful representations by reconstructing their inputs. It explains the encoder–decoder architecture, the notion of a latent space, and why compressing through a bottleneck yields salient features useful for tasks such as dimensionality reduction, denoising, anomaly detection, and feature learning. While traditional autoencoders excel at reconstruction, the chapter motivates the need for truly generative models that can sample new data, setting the stage for Variational Autoencoders (VAEs).

VAEs extend autoencoders by making the latent space probabilistic: the encoder predicts the parameters of a distribution (typically the mean and variance of a Gaussian) for each input, and the decoder reconstructs data from latent samples. Training balances two objectives: a reconstruction loss for fidelity and a regularization term that aligns the latent distributions with a simple prior. The reparameterization trick makes this objective trainable by letting gradients flow through the sampling step. The chapter walks through a practical PyTorch implementation on MNIST, demonstrates evaluation via reconstructions and random sampling, and shows how latent space interpolation produces smooth, semantically coherent transitions that reflect a continuous, well-structured representation.

Building on this foundation, the chapter presents β-VAE, which introduces a hyperparameter to weight the regularization term and promote disentangled latent factors. Increasing β typically trades some reconstruction accuracy for more interpretable, factorized representations where individual latent dimensions control distinct attributes, enabling controlled generation and analysis. The discussion underscores the broader implications: from reliable compression and generation to human-interpretable representations, charting a progression from basic autoencoders to VAEs and β-VAEs as powerful tools for image generation and representation learning.

Autoencoder model architecture

Comparison of original MNIST digits (top row) with their autoencoder reconstructions (bottom row)

Normal (Gaussian) distribution. The figure shows the bell-shaped curve of a normal distribution, denoted N(μ, σ), where μ is the mean of the distribution and σ is the standard deviation, a measure of how spread out the values are around the mean.

VAE model architecture

Randomly generated MNIST digits created by VAE

Latent space interpolation involves selecting two distinct points in the latent space, which represent different latent encodings, and creating a series of intermediate points between them. By feeding interpolated latent vectors into the VAE’s decoder, we can reconstruct the data corresponding to each point. This allows us to observe how the data transitions as we move from one latent representation to another.

Disentangled latent variables in β-VAEs

Summary

Autoencoders: Neural network architectures used to learn efficient codings of the input data; autoencoders have an encoder-decoder structure where the encoder compresses the input and the decoder reconstructs it, aiming to match the original.
Latent space: The hidden, compact representation into which autoencoders compress the input data. It captures the essential features learned from the data, which are crucial for reconstructing inputs or generating new data instances.
Variational autoencoders (VAEs): An advanced type of autoencoder that learns the distribution of the data in the latent space. Unlike traditional autoencoders, VAEs are generative models that can generate new instances that resemble the input data by sampling from the learned distribution in the latent space.
Reparameterization trick: A method used in VAEs to enable backpropagation through random sampling by decoupling the randomness of the sampling operation from the model’s parameters.
Beta-VAEs (β-VAEs): An extension of the standard VAE that introduces a hyperparameter β (beta) to control the trade-off between accurate reconstruction and adherence of the latent distribution to the prior, often leading to improved disentanglement of features in the latent space.

FAQ

What problem do Variational Autoencoders (VAEs) solve that traditional autoencoders do not?

Traditional autoencoders learn a deterministic mapping to a single latent vector and focus on reconstruction. VAEs learn a probabilistic latent space by outputting a distribution (typically Gaussian) per input and sampling from it. This enables them to generate new, diverse data by sampling z from the prior (usually N(0,1)) and decoding it, making VAEs true generative models.

How does the VAE encoder differ from a standard autoencoder encoder?

Instead of producing a single latent vector, the VAE encoder outputs the parameters of a distribution over the latent variables, typically the mean (mu) and log-variance (logvar) of a Gaussian. This models q(z|x), allowing sampling of many plausible latent codes for a given input.

What does the VAE decoder model?

The decoder models the conditional probability p(x|z). Given a latent sample z drawn from the encoder’s distribution (or from the prior), it generates a reconstruction x’ that should resemble data drawn from the training distribution.

What is the reparameterization trick and why is it necessary?

Backpropagation cannot pass through a raw sampling operation. The reparameterization trick makes sampling differentiable by expressing z as z = mu + sigma * epsilon, where epsilon is sampled from a standard normal distribution. This isolates randomness from the learnable parameters, enabling gradient-based training.
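The formula above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the chapter's PyTorch code; the function name `reparameterize`, the list-based inputs, and the example values are assumptions made for this sketch. It assumes the encoder outputs a log-variance per latent dimension, so sigma is recovered as exp(0.5 * logvar).

```python
import math
import random

def reparameterize(mu, logvar, rng=random):
    """Return z = mu + sigma * eps for each latent dimension.

    mu and logvar are lists of floats (hypothetical encoder outputs).
    """
    z = []
    for m, lv in zip(mu, logvar):
        sigma = math.exp(0.5 * lv)   # logvar = log(sigma^2), so sigma = exp(logvar / 2)
        eps = rng.gauss(0.0, 1.0)    # all randomness lives in eps, not in mu or sigma
        z.append(m + sigma * eps)    # deterministic in mu and sigma once eps is drawn
    return z

# Illustrative values only: a 2-dimensional latent code.
rng = random.Random(0)
mu = [0.5, -1.0]
logvar = [0.0, -2.0]
z = reparameterize(mu, logvar, rng)
print(z)
```

Because z is an ordinary arithmetic expression in mu and sigma once epsilon has been drawn, an automatic differentiation framework can compute gradients with respect to both.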

What are the components of the VAE loss function?

The total loss combines: (1) Reconstruction loss (e.g., BCE for binary-like data or MSE for continuous data) to ensure output resembles input, and (2) a regularization term (KL divergence) that encourages the learned latent distribution q(z|x) to be close to a simple prior, typically N(0,1). Total Loss = Reconstruction Loss + KL Divergence.
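Written out, the two terms take the following form. This is the standard formulation, assuming a diagonal Gaussian encoder distribution with mean μ and variances σ² over d latent dimensions and a standard normal prior; the symbols d and j are introduced here for illustration and do not come from the text above.

```latex
\[
\mathcal{L}(x) =
\underbrace{-\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction loss}}
+ \underbrace{D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)}_{\text{regularization}}
\]

\[
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big)
= -\frac{1}{2}\sum_{j=1}^{d}\big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big)
\]
```

The KL term is zero exactly when μ = 0 and σ = 1 in every dimension, that is, when the encoder's distribution matches the prior.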

Why is a standard normal prior (N(0,1)) used for the latent space?

It regularizes the latent space, promoting structure, continuity, and completeness (nearby z decode to similar outputs). It is also mathematically convenient: the KL divergence between two Gaussians has a closed form, which simplifies training and improves stability.

When should I use Binary Cross-Entropy (BCE) versus Mean Squared Error (MSE) for reconstruction?

Use BCE when outputs are in [0,1] and can be treated as Bernoulli-like (e.g., normalized MNIST with a final sigmoid). Use MSE for continuous-valued targets (e.g., natural images modeled as real intensities). Match the loss to your data’s distributional assumptions.
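To make the difference concrete, the sketch below computes both losses by hand on a tiny made-up "image" of three pixels. It uses plain Python rather than PyTorch, sums over pixels, and the function names `bce_loss` and `mse_loss` as well as the pixel values are assumptions for illustration only.

```python
import math

def bce_loss(target, pred, eps=1e-7):
    """Binary cross-entropy summed over pixels; target and pred lie in [0, 1]."""
    total = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total

def mse_loss(target, pred):
    """Squared error summed over pixels."""
    return sum((t - p) ** 2 for t, p in zip(target, pred))

# Made-up pixel values for illustration.
target = [0.0, 1.0, 1.0]
good = [0.1, 0.9, 0.8]
bad = [0.9, 0.1, 0.2]

print("BCE good/bad:", bce_loss(target, good), bce_loss(target, bad))
print("MSE good/bad:", mse_loss(target, good), mse_loss(target, bad))
```

Both losses rank the good reconstruction above the bad one, but BCE grows without bound as a prediction becomes confidently wrong, whereas the squared error for a single pixel in [0, 1] never exceeds 1.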

What is latent space interpolation and what does it show?

Pick two latent vectors, linearly interpolate between them, and decode the intermediate points. The resulting sequence should show smooth transitions in the generated outputs, demonstrating that the latent space is continuous and well-structured.
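The interpolation step itself is simple arithmetic, as this minimal sketch suggests. It is written in plain Python with latent vectors as lists; the names `lerp` and `interpolate` and the example vectors are hypothetical, and the decoding step is left out because it depends on a trained model.

```python
def lerp(z_start, z_end, t):
    """Linear interpolation between two latent vectors for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]

def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced latent vectors from z_start to z_end (steps >= 2)."""
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]

# Made-up 2-dimensional latent codes for illustration.
z_a = [0.0, 2.0]
z_b = [4.0, -2.0]
path = interpolate(z_a, z_b, steps=5)
for z in path:
    print(z)
```

In practice each vector in `path` would be passed through the trained decoder, and the decoded images viewed side by side to check that the transition is smooth.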

What is β-VAE and how does the β parameter affect results?

β-VAE scales the KL term in the loss: Loss = Reconstruction + β × KL. β = 1 recovers a standard VAE. β > 1 emphasizes disentanglement and latent regularization (often more interpretable factors) at the cost of reconstruction fidelity. β < 1 favors sharper reconstructions but can reduce disentanglement.
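A small numeric sketch shows how β changes the balance. It assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL term has a closed form; the function names, the placeholder reconstruction loss of 10.0, and the latent values are illustrative assumptions, and plain Python stands in for PyTorch.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_loss, mu, logvar, beta=1.0):
    """Loss = Reconstruction + beta * KL; beta = 1 gives the standard VAE loss."""
    return recon_loss + beta * kl_to_standard_normal(mu, logvar)

# Illustrative numbers: a precomputed reconstruction loss and a 1-D latent code.
recon = 10.0
mu, logvar = [1.0], [0.0]

print(kl_to_standard_normal(mu, logvar))          # 0.5
print(beta_vae_loss(recon, mu, logvar, beta=1.0)) # 10.5
print(beta_vae_loss(recon, mu, logvar, beta=4.0)) # 12.0
```

With β = 4 the same deviation from the prior costs four times as much, so the optimizer is pushed harder toward the prior, which is the mechanism behind the trade-off described above.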

How do you evaluate a trained VAE?

Assess (1) reconstruction quality by comparing inputs to reconstructions, and (2) generative quality by sampling z from N(0,1) and decoding to produce new images. Good VAEs reconstruct well and generate diverse, plausible samples that reflect the training distribution.
