Image Generation Models

GANs, diffusion models, and transformers

Vladimir Bok

MEAP began September 2024
Last updated July 2025
Publication in January 2026 (estimated)

ISBN 9781633437449
350 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Russian, Simplified Chinese


table of contents

Part 1: Foundations of Generative AI in Computer Vision

1 Generative AI in Computer Vision

1.1 Introduction: From Chisels to Pixels

1.2 AI, Generative AI, and Computer Vision

1.2.1 Artificial Intelligence (AI)

1.2.2 Computer Vision

1.2.3 Generative AI

1.2.4 The Intersection: Generative AI in Computer Vision

1.3 Practical Applications of Generative AI in Computer Vision

1.3.1 Digital Face Re-Aging in Video Production

1.3.2 Simulation Environments for Self-Driving Cars

1.3.3 Data Augmentation for Medical Imaging

1.4 The Evolution of Generative AI for Image Synthesis

1.4.1 Early Foundations (1950s-1980s)

1.4.2 Emergence of Neural Networks (1980s-2000s)

1.4.3 Breakthroughs in Deep Learning (2000s-2010s)

1.4.4 The GAN Revolution (2014-2020)

1.4.5 New Horizons (2020-Present)

1.5 Taxonomy for Generative AI in Computer Vision

1.5.1 Level of Control

1.6 Model Architecture

1.6.1 Autoencoders

1.6.2 Adversarial Networks

1.6.3 Diffusion Models

1.6.4 Transformers

1.6.5 Choosing the Right Architecture

1.7 Conclusion

1.8 Summary

2 Variational Autoencoders (VAEs)

2.1 Introduction to Autoencoders

2.1.1 The Latent Space

2.1.2 Applications of Autoencoders

2.1.3 Autoencoder Architecture and Training

2.2 Implementing an Autoencoder in PyTorch

2.2.1 Step 1: Import Necessary Libraries

2.2.2 Step 2: Prepare the Dataset

2.2.3 Step 3: Implement the Autoencoder Model

2.2.4 Step 4: Define the Loss Function and Optimizer

2.2.5 Step 5: Train the Autoencoder

2.2.6 Step 6: Model Evaluation

2.2.7 Conclusion and Next Steps

2.3 Variational Autoencoders (VAEs)

2.3.1 VAE Model Architecture: Concepts and Components

2.3.2 The Loss Function and VAE Training

2.4 Implementing VAE in PyTorch

2.4.1 Step 1: Import Necessary Libraries

2.4.2 Step 2: Prepare the Dataset

2.4.3 Step 3: Implement the Encoder

2.4.4 Step 4: Implement the Decoder

2.4.5 Step 5: Implement the VAE Model

2.4.6 Step 6: Define the VAE Loss Function

2.4.7 Step 7: Initialize the Model and Optimizer

2.4.8 Step 8: Train the VAE

2.4.9 Step 9: Model Evaluation

2.4.10 Latent Space Interpolation with VAE

2.5 β-VAE and Latent Space Disentanglement

2.5.1 Applications and Implications of β-VAEs

2.6 Conclusion

2.7 Summary

3 Generative Adversarial Networks (GANs)

3.1 Introduction to GANs

3.1.1 Adversarial Training

3.2 Core Concepts and Theoretical Foundations of GANs

3.2.1 Inspiration from Game Theory

3.2.2 GAN Value Function

3.2.3 Minimax Game: How the Value Function Works in GAN Training

3.2.4 Visualizing GAN Model Architecture

3.2.5 Reaching Equilibrium in GAN Training

3.3 Training Challenges in GANs

3.3.1 Mode Collapse

3.3.2 Vanishing Gradient

3.3.3 Non-Convergence

3.4 Wasserstein GANs: A Solution to Training Challenges

3.4.1 The Intuition Behind WGANs

3.4.2 Understanding Earth Mover’s (Wasserstein) Distance in WGANs

3.4.3 Replacing the Discriminator with a Critic

3.4.4 The WGAN Value Function

3.4.5 Visualizing WGAN Model Architecture

3.5 Implementing a WGAN in PyTorch

3.5.1 Adopting Best Practices from DCGANs

3.5.2 Step 1: Import Necessary Libraries

3.5.3 Step 2: Enable GPU Training

3.5.4 Step 3: Load the Dataset

3.5.5 Step 4: Implement the Generator

3.5.6 Step 5: Implement the Critic

3.5.7 Step 6: Initialize the Models and Optimizers

3.5.8 Step 7: Train the Model

3.5.9 Step 8: Evaluate the Model

3.5.10 Exploring the WGAN Latent Space

3.6 Conclusion

3.7 Summary

Part 2: Advanced Generative Models and Techniques

4 Diffusion Models: Forward Diffusion

4.1 Introduction to Diffusion Models

4.1.1 Forward and Reverse Diffusion Phases

4.2 Foundations and Core Concepts

4.2.1 Forward Diffusion

4.2.2 Reverse Diffusion

4.3 Images, Probability Distributions, and the Visual World

4.3.1 The Infinite Possibilities of Pixel Arrangements

4.3.2 Diffusion Models and the Quest for Coherence

4.4 Forward Diffusion In-Depth

4.4.1 Reducing Data Distribution Complexity by Adding Noise

4.5 Mathematical Foundations of the Forward Diffusion Process

4.5.1 Forward Diffusion Process

4.5.2 Closed-Form Formula for Skipping Steps in Forward Diffusion

4.6 Visualizing Forward Diffusion in 1D

4.6.1 Probability Density of an Example 1D Dataset

4.6.2 Applying Forward Diffusion to the 1D Distribution

4.7 Conclusion

4.8 Summary

5 Diffusion Models: Reverse Diffusion

5.1 Understanding Reverse Diffusion

5.2 The Mathematics of Reverse Diffusion

5.2.1 Forward Diffusion: A Brief Recap

5.2.2 The Reverse Diffusion Equation

5.2.3 Reverse Diffusion: Training vs. Inference

5.3 U-Net Architecture for Denoising

5.3.1 U-Net: Structure and Function

5.3.2 Comparing U-Net to Autoencoders

5.3.3 Adapting U-Net for Diffusion Models

5.4 Step-by-Step Implementation of a Denoising Diffusion Probabilistic Model (DDPM)

5.4.1 Step 1: Import Necessary Libraries

5.4.2 Step 2: Enable GPU Training

5.4.3 Step 3: Prepare the Dataset

5.4.4 Step 4: Implement U-Net Model for Denoising

5.4.5 Step 5: Implement DDPM

5.4.6 Step 6: Train the Model

5.4.7 Step 7: Model Evaluation

5.5 Comparing DDPMs with VAEs and GANs

5.5.1 Generation Process and Computational Requirements

5.5.2 Sample Quality and Diversity

5.5.3 Training Stability and Ease of Use

5.5.4 Latent Space and Interpolation

5.5.5 Theoretical Grounding and Flexibility

5.5.6 Comparison Summary

5.6 Conclusion

5.7 Summary

6 Evaluation and Metrics for Generative Models

6.1 Introduction to Model Evaluation in Generative AI

6.1.1 Importance of Evaluation in Generative Models

6.1.2 Challenges in Evaluating Generative Models for Image Synthesis

6.1.3 Overview of Evaluation Approaches

6.2 Qualitative Evaluation Methods

6.2.1 Visual Inspection

6.2.2 User Studies

6.2.3 Summary of Qualitative Evaluation Methods

6.3 Quantitative Evaluation Metrics

6.3.1 Inception Score (IS)

6.3.2 Fréchet Inception Distance (FID)

6.3.3 Kernel Inception Distance (KID)

6.3.4 Precision and Recall for Distributions

6.3.5 Summary of Quantitative Evaluation Metrics

6.4 Model-Specific Evaluation Techniques

6.4.1 VAE-specific Metrics

6.4.2 GAN-specific Metrics

6.4.3 Diffusion Model-specific Metrics

6.5 Task-Specific Evaluation Metrics

6.5.1 Case Study 1: Super-Resolution for Remote Sensing Imagery

6.5.2 Case Study 2: Medical Image Synthesis for Data Augmentation

6.6 Challenges and Limitations in Evaluation

6.6.1 Bias in Evaluation Metrics

6.6.2 Computational Complexity and Scalability

6.6.3 Lack of Ground Truth in Generative Tasks

6.6.4 Domain Specificity and Generalization

6.7 Conclusion

6.8 Summary

7 Conditional Image Generation

7.1 Why Conditional Generation?

7.1.1 Limitations of Unconditional Models

7.1.2 Applications Where Conditioning is Essential

7.2 Types of Conditioning Information

7.2.1 Conditioning on Class Labels

7.2.2 Conditioning on Images

7.2.3 Conditioning on Text or Other Modalities

7.3 Conditional Variational Autoencoders (cVAEs)

7.3.1 Quick Recap of Variational Autoencoders

7.3.2 Incorporating Conditioning into VAEs

7.4 Implementation of cVAE for Class-Conditioned Image Generation

7.4.1 Step 1: Import Necessary Libraries

7.4.2 Step 2: Prepare the Dataset

7.4.3 Step 3: Define the cVAE Encoder and Decoder

7.4.4 Step 4: Implement cVAE Functions

7.4.5 Step 5: Train the cVAE

7.4.6 Step 6: Evaluate the cVAE

7.5 Evaluation and Metrics for Conditional Image Generation

7.5.1 Conditioning Accuracy Metrics

7.5.2 Human Evaluation

7.6 Conclusion

7.7 Summary

8 Hybrid Architectures and Latent Diffusion Models

8.1 Understanding the Path to Hybrid Models

8.1.1 The Generative AI Trilemma

8.1.2 A Comparative Look at Generative Model Architectures

8.2 The Rise of Hybrid Models: Latent Diffusion as a Case Study

8.2.1 Latent Diffusion: Diffusion in the Latent Space

8.2.2 The Best of Both Worlds: Benefits of Latent Diffusion

8.2.3 The Drawbacks of Latent Diffusion

8.2.4 Balancing Trade-Offs in Modern Generative AI

8.3 Implementing a Latent Diffusion Model

8.3.1 Step 1: Importing Necessary Libraries

8.3.2 Step 2: Setting the Device

8.3.3 Step 3: Building the VAE Architecture – Encoder and Decoder Networks

8.3.4 Step 4: Assembling the Complete VAE

8.3.5 Step 5: VAE Training Setup

8.3.6 Step 6: VAE Training Pipeline

8.3.7 Step 7: Building the Diffusion Components: U-Net Architecture with Residual Blocks

8.3.8 Step 8: Implementing the Diffusion Process: The DDPM Class

8.3.9 Step 9: Tying it All Together – The Complete Latent Diffusion Model

8.3.10 Step 10: Training the Latent Diffusion Model

8.3.11 Step 11: Putting It All Together – Training Pipeline Initialization

8.3.12 Step 12: Generating Samples

8.3.13 Model Output Analysis

8.4 Conclusion

8.5 Summary

Part 3: Multimodal Generative AI and Real-World Impact

9 Bridging Language and Vision with Transformers

9.1 Introduction to Multimodal Modeling and Transformers

9.1.1 Understanding Transformers

9.1.2 The Transformer Architecture: An Overview

9.2 Inside the Transformer: Key Components and Mechanisms

9.2.1 Word Embeddings

9.2.2 Positional Encodings

9.2.3 Understanding Attention Mechanisms

9.2.4 Types of Attention Mechanisms

9.2.5 Multi-Head Attention: Multiple Perspectives on the Same Text

9.3 The Complete Transformer Architecture: Putting It All Together

9.3.1 Input Processing

9.3.2 The Encoder Stack

9.3.3 The Decoder Stack

9.3.4 Final Output Layer

9.3.5 Auto-Regressive Generation

9.3.6 Evolution of Transformer Models

9.4 From NLP to Vision: The Vision Transformer (ViT)

9.4.1 Key Differences from Traditional Transformers

9.4.2 The ViT Model Architecture

9.4.3 Comparing ViTs and CNNs

9.5 CLIP: Bridging Vision and Language

9.5.1 Connecting Words and Images

9.5.2 The CLIP Approach: Learning from Internet-Scale Data

9.5.3 CLIP Architecture

9.5.4 CLIP Training Process

9.5.5 The Power of CLIP’s Representations

9.5.6 From CLIP to Text-to-Image Generation

9.6 Conclusion

9.7 Summary

10 Text-to-Image Generation with Stable Diffusion

11 Video Generation

12 Real-World Applications and Case Studies

Overview

1 Generative AI in Computer Vision

This chapter introduces Generative AI in the context of Computer Vision and situates it within the broader field of Artificial Intelligence. It clarifies how generative models learn data distributions to synthesize new visual content, in contrast to discriminative models that classify or detect. At the intersection of AI, Computer Vision, and Generative AI, systems now interpret, manipulate, and create images with increasing control—from random generation to conditional and text-guided synthesis—powering capabilities such as image-to-image translation, super-resolution, style transfer, and text-to-image generation. The chapter sets a roadmap for the field, highlighting rapid progress, cross-industry impact, and the shifting boundaries of creativity and authorship.

Concrete applications demonstrate this impact at scale. In film and television, digital face re-aging has moved from labor-intensive visual effects to generative pipelines that deliver temporal consistency, identity preservation, fine-grained artistic control, and major efficiency gains. For autonomous vehicles, photorealistic, diverse simulation environments enable safe, repeatable training and validation across rare and hazardous scenarios, accelerating development. In healthcare, synthetic medical imagery augments scarce datasets, preserves privacy, balances classes, and can ease annotation burdens—improving robustness and performance of diagnostic models.

Historically, the field evolved from early computer graphics experiments and foundational neural network research to deep learning breakthroughs, including VAEs and the GAN family (e.g., DCGAN, StyleGAN, BigGAN), followed by the rise of diffusion models and Transformer-based approaches that connect vision and language (e.g., ViT and CLIP) and enable powerful text-to-image systems and latent diffusion. The chapter proposes a taxonomy spanning levels of control (random, conditional, text-driven) and core architectures (autoencoders, adversarial networks, diffusion models, transformers), outlining their respective strengths, trade-offs, and selection criteria. It concludes with a forward-looking view that pairs technical momentum with ethical responsibility around copyright, misuse, and the role of human creators.

A Venn diagram illustrating the relationship between Artificial Intelligence, Computer Vision, and Generative AI

The rapid improvement in synthetic image quality illustrated using AI-generated human faces. In less than 5 years, Generative AI progressed from blurry, low-resolution images to photorealistic, high-resolution faces.

Illustration of random image generation process

Illustration of conditional image generation process

Illustration of text-to-image generation process

High-level autoencoder model architecture

High-level Generative Adversarial Network (GAN) architecture

High-level architecture of a diffusion model

High-level transformer architecture

Summary

Artificial Intelligence (AI): Systems designed to perform tasks requiring human-like cognition.
Computer Vision: A subset of AI focused on enabling machines to interpret and understand visual data.
Generative AI: AI models that create new content based on learned patterns. Its intersection with Computer Vision enables the creation and manipulation of visual content.

Model Architectures

Autoencoders: Learn compressed data representations through encoding and decoding processes.
GANs (Generative Adversarial Networks): Use a two-network structure (generator and discriminator) in a competitive process to create realistic images.
Diffusion Models: Learn to reverse a gradual noising process, so that random noise can be transformed into coherent images through iterative denoising.
Transformers: Use self-attention mechanisms to efficiently capture global dependencies in data.
Vision Transformers: Apply transformer principles to image patches for sophisticated visual processing.
CLIP (Contrastive Language–Image Pretraining): Combines text and image understanding for text-aligned image creation.
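
To make the "noise addition" half of the diffusion bullet above concrete, here is a minimal sketch of the forward (noising) process. It assumes the standard DDPM formulation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with a linear beta schedule; the function names, the schedule endpoints, and the four-value "image" are illustrative choices, not code from the book.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, evenly spaced (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bars[t]
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [0.5, -0.25, 1.0, 0.0]                    # stand-in for normalized pixel values
x_early = add_noise(x0, 10, alpha_bars, rng)   # still close to x0
x_late = add_noise(x0, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

As alpha_bar_t shrinks toward zero, the original signal fades and the sample approaches pure Gaussian noise. A trained model would then run this process in reverse, removing a little noise at each step.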

FAQ

What is Generative AI in Computer Vision?

It is the intersection of AI, Computer Vision, and generative modeling focused on creating or manipulating visual content. It uses AI techniques on visual data to synthesize new, realistic images and videos, not just interpret them.

How do generative models differ from discriminative models?

Discriminative models learn to distinguish or classify inputs, while generative models learn the data distribution and can sample from it to create new, plausible content (images, text, audio, video, 3D).
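
The distinction can be illustrated with a toy one-dimensional example. This is only a sketch under simplifying assumptions: the "dataset" is made-up numbers drawn from two Gaussians, the discriminative model is a single threshold, and the generative model is a fitted Gaussian. Real image models are far larger, but the contrast between labeling inputs and sampling new data is the same.

```python
import random
import statistics

random.seed(0)

# Toy 1D dataset: two classes drawn from different Gaussians (illustrative values).
class_a = [random.gauss(-2.0, 0.5) for _ in range(500)]
class_b = [random.gauss(3.0, 1.0) for _ in range(500)]

# Discriminative view: learn only a decision boundary between the classes.
threshold = (statistics.mean(class_a) + statistics.mean(class_b)) / 2


def classify(x):
    """Label an input; this says nothing about how the data is distributed."""
    return "a" if x < threshold else "b"


# Generative view: estimate the data distribution, then sample from it.
def fit_gaussian(samples):
    """Estimate the mean and standard deviation of the samples."""
    return statistics.mean(samples), statistics.stdev(samples)


def generate(params, n):
    """Draw n new points from the fitted Gaussian."""
    mu, sigma = params
    return [random.gauss(mu, sigma) for _ in range(n)]


params_b = fit_gaussian(class_b)
new_points = generate(params_b, 1000)  # new, plausible "class b" data
```

The classifier can only answer "which class is this?", while the fitted distribution can produce points that were never in the training set.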

What practical applications are highlighted in this chapter?

Core applications include image-to-image translation, super-resolution, style transfer, and text-to-image generation. Real-world use spans entertainment, autonomous driving, healthcare, research, and creative industries.

How does digital face re-aging work, and what is FRAN?

Face re-aging uses generative models to modify age-related features while preserving identity. Disney’s Face Re-Aging Network (FRAN) provides temporal consistency across frames, identity preservation, fine-grained artistic control, and major efficiency gains versus manual VFX.

How does Generative AI support autonomous driving development?

Photorealistic simulated worlds let developers test rare or hazardous scenarios safely and at scale. NVIDIA Drive Sim uses generative techniques to produce diverse environments, weather, assets, and edge cases for training and validation.

How is Generative AI used for medical imaging data augmentation?

It generates synthetic scans to address data scarcity, preserve privacy, balance class distributions, and reduce annotation effort. Studies show synthetic data can improve model performance, e.g., in tumor segmentation from MRI.

How has generative image synthesis evolved?

It progressed from early algorithmic art and fractals, through neural network foundations (CNNs, backpropagation), to deep learning breakthroughs (VAEs, GANs). From 2014 onward, GANs rapidly improved realism; since 2020, diffusion models and transformer-based text-to-image systems have set new standards for quality and control.

What are the levels of control in image generation?

There are three levels: random generation (unconditional sampling from noise), conditional generation (guided by labels, attributes, or images), and text-to-image generation (guided by natural language via text encoders).
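
One way to picture these levels is by what is fed to the generator. The sketch below builds only the generator inputs and assumes conditioning by simple concatenation, which is one common mechanism among several (text-to-image systems often use cross-attention instead). The dimensions, function names, and the three-value text embedding are hypothetical placeholders; a real system would obtain the embedding from a text encoder.

```python
import random

LATENT_DIM = 8     # assumed size of the noise vector
NUM_CLASSES = 10   # assumed number of class labels


def sample_noise(rng, dim=LATENT_DIM):
    """Draw a latent vector from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def one_hot(label, num_classes=NUM_CLASSES):
    """Encode an integer class label as a one-hot vector."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def random_input(rng):
    """Random generation: the generator sees noise only."""
    return sample_noise(rng)


def conditional_input(rng, label):
    """Conditional generation: noise plus an encoded class label."""
    return sample_noise(rng) + one_hot(label)


def text_guided_input(rng, text_embedding):
    """Text-to-image: noise plus an embedding produced by a text encoder."""
    return sample_noise(rng) + list(text_embedding)


rng = random.Random(0)
z = random_input(rng)
z_cond = conditional_input(rng, label=3)
z_text = text_guided_input(rng, text_embedding=[0.1, -0.4, 0.7])  # placeholder values
```

In each case the noise supplies variety, while the extra values steer what gets generated.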

What are the main model architectures and their trade-offs?

Autoencoders/VAEs: stable and good for latent representations and interpolation, but can be less sharp. GANs: produce sharp, photorealistic images but can be unstable and mode-collapse-prone. Diffusion models: high quality and stable, with iterative inference cost. Transformers (e.g., ViT, CLIP): capture long-range dependencies and enable strong text-image alignment, but are compute-intensive.

What ethical considerations does the chapter raise?

Responsible use is vital. Key issues include copyright and data use, risks of misinformation and deepfakes, and impacts on artists and creative professions. The chapter stresses thoughtful governance and ethical deployment.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $28.79

you save $19.20 (40%)
