6 Evaluation and Metrics for Generative Models
This chapter surveys how to evaluate image generative models, arguing for a multifaceted strategy that blends qualitative judgment with quantitative, reproducible metrics. It motivates evaluation as essential for comparing models, steering research, choosing architectures for applications, and revealing biases or failure modes. Because generative tasks prize both realism and diversity—and often lack a single ground truth—the chapter emphasizes the inherent subjectivity of perception, the many trade-offs between fidelity and coverage, and the need to align measurements with human expectations and practical goals.
The text contrasts qualitative methods—visual inspection and user studies (including pairwise comparisons and mean-opinion scoring)—with widely used quantitative metrics. It explains Inception Score for quality and diversity, Fréchet Inception Distance and Kernel Inception Distance for distribution matching in feature space, and precision/recall for distributions to separately diagnose fidelity versus coverage. It then introduces model-specific probes: for VAEs, latent traversals and disentanglement measures like the Mutual Information Gap; for GANs, tests for mode collapse such as the Birthday Paradox heuristic; and for diffusion models, analyses of intermediate timesteps to understand progressive refinement. The chapter recommends combining complementary metrics to obtain a more reliable picture.
Recognizing that utility depends on context, the chapter highlights task-specific evaluations: in remote sensing super-resolution, downstream land-cover classification and object detection; in medical imaging synthesis, performance gains in diagnostic models and expert clinical review. It also details persistent challenges—bias from reliance on pre-trained feature extractors, computational cost and scalability, absence of definitive ground truth, domain mismatch, and gaps between benchmark scores and real-world robustness. The overarching guidance is to pair multiple quantitative metrics with rigorous human-centered and domain-aware assessments, and to advance toward evaluation frameworks that better reflect human perception, fairness, safety, and application-specific success criteria.
Figure: Classifier prediction for a single generated image (low entropy) versus the average classifier prediction across all images (high entropy).
Note: ImageNet is a large-scale dataset of over 14 million labeled images spanning more than 20,000 categories, widely used for training and benchmarking computer vision models; many applications focus on its 1,000-class subset.
Figure: Conceptual 2D illustration of real and generated image feature spaces.
Figure: Latent traversal illustrated using a Beta-VAE model trained on face images [13].
Figure: Example of super-resolution applied to a satellite image [16].
Summary
- Evaluation of generative models in computer vision requires a multi-faceted approach, combining qualitative methods (such as visual inspection and user studies) with quantitative metrics (like FID, IS, and KID) to assess both image quality and diversity.
- Model-specific evaluation techniques are crucial for addressing the unique characteristics of different generative architectures. For instance, VAEs benefit from latent space analysis, GANs from mode collapse detection, and Diffusion Models from assessing the quality of the learned noise prediction.
- Task-specific metrics are essential for evaluating generative models in real-world applications. Case studies in medical imaging and urban planning demonstrate how tailored metrics can provide more relevant assessments of model performance in specific domains.
- The choice of evaluation metrics can significantly influence research directions and model development. It is crucial to understand the strengths and limitations of each metric and select appropriate combinations for comprehensive evaluation.
- Challenges in evaluation include biases in pre-trained networks used for feature extraction, computational complexity of metrics for large-scale evaluations, and the fundamental difficulty of defining “correctness” in generative tasks.
- As generative models advance, there is a growing need for evaluation techniques that can assess ethical implications, including potential biases and the risk of generating misleading content.
FAQ
Why is evaluation critical for generative image models?
Evaluation enables objective comparison across models and architectures, guides research and optimization, supports model selection for specific applications, and helps detect biases and failure modes. As models are deployed in domains like healthcare and urban planning, reliable evaluation is essential for safety, fairness, and trust.
What are the main challenges in evaluating generative models?
- No single “ground truth” output for many generative tasks
- Human perception is subjective and hard to encode in formulas
- Bias from using pre-trained feature extractors (e.g., ImageNet Inception-v3)
- Computational cost and scalability for large, high-res datasets
- Domain specificity: metrics can misalign with clinical or creative goals
How do qualitative methods differ and when should I use them?
- Visual inspection: quick, intuitive, good for spotting artifacts and coherence issues; but subjective and not scalable.
- User studies: structured human evaluation (e.g., pairwise “visual Turing tests” or Mean Opinion Score); they provide statistically analyzable feedback but require careful design, time, and cost. Use user studies when human perception or domain expertise (e.g., radiologists) is critical.
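To illustrate the quantitative side of a user study, here is a minimal sketch that aggregates Mean Opinion Scores with a 95% confidence interval; the 1–5 rating scale and the ratings themselves are hypothetical placeholders.

```python
import numpy as np

def mean_opinion_score(ratings):
    """Aggregate raters' scores (e.g., on a 1-5 scale) into a MOS with a 95% normal-approximation CI."""
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    sem = r.std(ddof=1) / np.sqrt(len(r))  # standard error of the mean
    half = 1.96 * sem                      # half-width of a 95% two-sided interval
    return mos, (mos - half, mos + half)

# Hypothetical ratings from 20 raters for one batch of generated images
ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4, 5, 4, 3, 4, 4, 5, 4, 3, 4, 4]
mos, (lo, hi) = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```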
What is the Inception Score (IS) and what are its limitations?
IS uses a pre-trained classifier (typically Inception-v3) to reward images that produce confident predictions (quality) and a diverse class distribution across samples (diversity). It averages the KL divergence between p(y|x) and p(y), then exponentiates. Limitations: it does not compare to real data, is sensitive to the choice/biases of the classifier, and may be inappropriate outside ImageNet-like domains.
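A minimal numerical sketch of this computation, assuming a matrix of softmax outputs p(y|x) has already been obtained from a classifier such as Inception-v3 (random placeholder data is used below); reported scores typically also average over several splits of the sample.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) array of classifier softmax outputs p(y|x) for N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y) over the sample
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.sum(axis=1).mean()))              # exponentiated mean KL

# Placeholder predictions standing in for Inception-v3 outputs on generated images
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1000), size=5000)
print(f"IS = {inception_score(probs):.2f}")
```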
What is Fréchet Inception Distance (FID) and why is it popular?
FID compares feature statistics (mean and covariance) of real vs generated images using an Inception-v3 feature space, modeling both as multivariate Gaussians and computing their Fréchet distance. Lower is better. It correlates well with human judgments and is sensitive to mode collapse. Limitations: Gaussian assumption, reliance on Inception features (domain bias), and need for large sample sizes.
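A sketch of the Fréchet distance between the two Gaussian fits, assuming `real_feats` and `gen_feats` are (N, D) arrays of Inception-v3 pool features (random placeholders here; real embeddings are typically 2048-dimensional).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    """FID between Gaussian fits to real and generated feature sets of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Placeholder features standing in for Inception-v3 embeddings
rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(2000, 64))
gen_feats = rng.normal(0.1, 1.1, size=(2000, 64))
print(f"FID = {frechet_distance(real_feats, gen_feats):.3f}")
```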
How does Kernel Inception Distance (KID) differ from FID?
KID measures the squared Maximum Mean Discrepancy (MMD) between real and generated feature embeddings using a polynomial kernel on Inception features. It provides an unbiased estimate and is more reliable with smaller sample sizes. Trade-offs: the choice of kernel affects results, and it can be computationally intensive.
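A sketch of the unbiased squared-MMD estimate with the commonly used cubic polynomial kernel k(x, y) = (xᵀy / d + 1)³, again on placeholder feature arrays; production implementations usually average the estimate over several random subsets.

```python
import numpy as np

def polynomial_kernel(x, y):
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3          # cubic polynomial kernel on feature vectors

def kid(real_feats, gen_feats):
    """Unbiased squared MMD between feature sets of shape (N, D) and (M, D)."""
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_gg = polynomial_kernel(gen_feats, gen_feats)
    k_rg = polynomial_kernel(real_feats, gen_feats)
    n, m = len(real_feats), len(gen_feats)
    # drop diagonal terms for the unbiased estimator
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (n * (n - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (m * (m - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(1000, 64))   # placeholder Inception features
gen_feats = rng.normal(0.1, 1.1, size=(1000, 64))
print(f"KID = {kid(real_feats, gen_feats):.4f}")
```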
What do precision and recall for distributions tell us?
- Precision (fidelity): the fraction of generated samples that are close to real samples in feature space (how realistic they are).
- Recall (diversity): the fraction of the real data manifold covered by generated samples (how much variety is captured).
They reveal failure modes (e.g., high precision/low recall suggests mode collapse; high recall/low precision suggests diverse but low-quality outputs).
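One common way to estimate these quantities, following the k-nearest-neighbour manifold construction of Kynkäänniemi et al., is sketched below on placeholder feature arrays: a generated sample counts toward precision if it falls within the k-NN radius of at least one real sample, and vice versa for recall.

```python
import numpy as np

def knn_radii(feats, k=3):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]          # column 0 is the point itself (distance 0)

def precision_recall(real_feats, gen_feats, k=3):
    real_r = knn_radii(real_feats, k)
    gen_r = knn_radii(gen_feats, k)
    d = np.linalg.norm(gen_feats[:, None, :] - real_feats[None, :, :], axis=-1)
    precision = float((d <= real_r[None, :]).any(axis=1).mean())  # generated samples inside the real manifold
    recall = float((d.T <= gen_r[None, :]).any(axis=1).mean())    # real samples inside the generated manifold
    return precision, recall

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(500, 16))   # placeholder feature embeddings
gen_feats = rng.normal(0.2, 0.8, size=(500, 16))
p, r = precision_recall(real_feats, gen_feats)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```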
Are there model-specific evaluation techniques for VAEs, GANs, and Diffusion Models?
- VAEs: latent traversals to inspect interpretability and smoothness (a minimal traversal sketch appears after this list); disentanglement metrics like the Mutual Information Gap (MIG).
- GANs: the Birthday Paradox Test to estimate diversity and detect mode collapse via duplicate/near-duplicate rates.
- Diffusion Models: analyze intermediate denoising steps qualitatively and quantitatively to understand progressive refinement.
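For the VAE case, a minimal latent-traversal sketch is shown below: one latent dimension is swept across a range while the others stay fixed, and the decoded images are inspected for smooth, interpretable change. The `decode` function here is a placeholder standing in for a trained VAE decoder.

```python
import numpy as np

def decode(z):
    """Placeholder for a trained VAE decoder mapping a latent vector to an image array."""
    return np.tanh(np.outer(z, z))             # stands in for decoder(z) -> H x W image

def latent_traversal(base_z, dim, values):
    """Decode copies of base_z with one latent dimension swept over the given values."""
    frames = []
    for v in values:
        z = base_z.copy()
        z[dim] = v
        frames.append(decode(z))
    return np.stack(frames)                     # (len(values), H, W) stack for visual inspection

base_z = np.zeros(16)                           # e.g., the prior mean of a 16-D latent space
frames = latent_traversal(base_z, dim=3, values=np.linspace(-3, 3, 7))
print(frames.shape)                             # inspect the frames for smoothness / disentanglement
```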
How should I evaluate models for task-specific applications?
Use domain- and task-aligned metrics in addition to generic ones. Examples:
- Remote sensing super-resolution: beyond PSNR/SSIM, evaluate downstream land-cover classification or object detection performance (see the sketch after this list).
- Medical image synthesis: measure impact on diagnostic accuracy when augmenting training data and obtain expert (e.g., radiologist) assessments of anatomical plausibility and clinical utility.
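A hedged sketch of the remote-sensing case: alongside a pixel metric such as PSNR, the super-resolved tiles are scored by how well a downstream land-cover classifier performs on them. The `classify` function and the image arrays below are placeholders, not an actual model or dataset.

```python
import numpy as np

def psnr(reference, estimate, max_val=1.0):
    """Peak signal-to-noise ratio between a reference image and its super-resolved estimate."""
    mse = np.mean((reference - estimate) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def classify(image):
    """Placeholder for a pre-trained land-cover classifier; returns a class index."""
    return int(image.mean() > 0.5)

def downstream_accuracy(images, labels):
    """Fraction of images the downstream classifier labels correctly."""
    preds = [classify(img) for img in images]
    return float(np.mean([p == y for p, y in zip(preds, labels)]))

rng = np.random.default_rng(0)
hr_images = rng.random((10, 64, 64))                                          # placeholder ground-truth tiles
sr_images = np.clip(hr_images + rng.normal(0, 0.05, hr_images.shape), 0, 1)   # placeholder super-resolved output
labels = [int(img.mean() > 0.5) for img in hr_images]

print(f"mean PSNR = {np.mean([psnr(h, s) for h, s in zip(hr_images, sr_images)]):.1f} dB")
print(f"downstream accuracy on SR tiles = {downstream_accuracy(sr_images, labels):.2f}")
```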
What are best practices and common pitfalls when using these metrics?
- Combine qualitative and quantitative methods; don’t rely on a single score.
- Report both fidelity and diversity (e.g., FID/KID with precision–recall).
- Use domain-appropriate feature extractors when possible to reduce bias.
- Ensure adequate sample size; document seeds, protocols, and confidence intervals for reproducibility (a minimal bootstrap sketch follows this list).
- Test robustness (distribution shifts, common corruptions) and monitor computational costs.
- Be wary of optimizing to a metric at the expense of real-world utility or fairness.
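To make the confidence-interval point concrete, the sketch below bootstraps a generic scalar metric over resampled generated sets; `toy_metric` and the feature arrays are placeholders, and the same pattern applies to FID, KID, or precision–recall.

```python
import numpy as np

def bootstrap_ci(metric_fn, real_feats, gen_feats, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a metric computed on (real, generated) feature sets."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(gen_feats), size=len(gen_feats))   # resample the generated set
        scores.append(metric_fn(real_feats, gen_feats[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(scores)), (float(lo), float(hi))

# Placeholder metric: distance between set means (stands in for FID/KID in this sketch)
def toy_metric(real, gen):
    return float(np.linalg.norm(real.mean(axis=0) - gen.mean(axis=0)))

rng = np.random.default_rng(1)
real_feats = rng.normal(0.0, 1.0, size=(1000, 32))
gen_feats = rng.normal(0.1, 1.0, size=(1000, 32))
mean, (lo, hi) = bootstrap_ci(toy_metric, real_feats, gen_feats)
print(f"metric = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```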