1 A tale of two models: transformers and diffusions
Generative AI models learn from large datasets to create new content, with text-to-image systems standing out for translating natural language into vivid, high-fidelity visuals. These models are multimodal: they take text as input and produce images, bridging language and vision in ways that power applications across design, marketing, education, healthcare, and entertainment. By contrasting unimodal and multimodal approaches, the chapter shows how advances in natural language processing and computer vision converge to make prompt-driven image creation practical and impactful in real-world workflows.
The chapter centers on two complementary families of techniques. Transformer-based generators treat images like sequences: a VQGAN encoder converts images into discrete codebook tokens, a language model (such as BART) turns text into tokens, and training aligns the two so that, at inference, the tokens predicted from a text prompt can be decoded by the VQGAN decoder into an image—an idea popularized by DALL·E-style systems. Diffusion models take the opposite route: they learn to reverse a noise-adding process, progressively denoising random noise into an image conditioned on a prompt. To make this efficient, latent diffusion performs denoising in a compact latent space and then upscales via a VAE decoder, often guided by a CLIP model to better match text semantics; Stable Diffusion is a leading open implementation of this pipeline.
Beyond mechanics, the chapter maps a hands-on path to building these systems from scratch—covering transformers and vision transformers, basic and guided diffusion, CLIP for text–image alignment, latent diffusion, and VQGAN—so readers can implement min-DALL·E-like and Stable Diffusion–style generators. It also underscores current limitations and risks: models can misinterpret negations (“pink elephant”–type failures), raise concerns about originality and copyright, and struggle with geometric consistency due to limited 3D understanding. Social and environmental considerations include high compute and energy costs, potential misuse (e.g., deepfakes and misinformation), and representational biases, motivating ongoing work on efficiency, safeguards, and fairness.
Comparison of unimodal and multimodal models. Unimodal models handle only one type of data as both input and output. For example, GPT-3 is a unimodal model since it processes text as input and generates text as output. Multimodal models, in contrast, operate with more than one type of data. A prominent example is a text-to-image generation model such as Stable Diffusion, where the input is text (and, when editing existing images, an image) and the output is an image.
Examples of generating captions for images. Above each image, we first display the original caption from the training dataset, created by humans. We then feed the image into a trained image-to-text model to generate a caption, which is displayed above the image as well. While the generated captions differ from those created by humans, they accurately describe what’s going on in these images.
How the min-DALL-E model generates an image based on the prompt "panda with top hat reading a book." The model divides an image into 256 patches, organized in a 16x16 grid. When generating an image based on a text prompt, the model first predicts the top left patch. In the next iteration, the model predicts the patch next to it, based on the first patch and the prompt. The process is repeated until we have all the needed patches in the image. In this figure, the top left subplot shows the output when 32 image patches are generated. The second subplot in the top row shows the output when 64 patches are generated. The rest of the images show the outputs when 96, 128, ..., and 256 patches are generated.
A diagram of VQGAN. The encoder in VQGAN compresses an image into a lower-dimensional latent space. The latent representation of each image is divided into patches. The continuous latent vector for each patch is then compared with the discrete vectors in the codebook and replaced with the closest one, so the quantized latent vector approximates the continuous latent vector for each image patch using codebook entries. The quantized latent vectors are then passed through the decoder in VQGAN to reconstruct the image.
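In code, this quantization step amounts to a nearest-neighbor lookup in the codebook. The sketch below is a minimal illustration with made-up sizes (1,024 codes of dimension 256 for a 16×16 patch grid); it is not the full VQGAN implementation.

```python
import torch

# Minimal sketch of VQGAN-style quantization: each continuous patch latent is
# replaced by its nearest codebook entry. Sizes are illustrative assumptions.
codebook = torch.randn(1024, 256)          # 1,024 discrete codes, each 256-dimensional
patch_latents = torch.randn(16 * 16, 256)  # continuous latents for a 16x16 grid of patches

distances = torch.cdist(patch_latents, codebook)  # Euclidean distance, shape (256, 1024)
token_ids = distances.argmin(dim=1)               # index of the nearest code per patch
quantized = codebook[token_ids]                   # quantized latents fed to the decoder

print(token_ids.shape, quantized.shape)  # torch.Size([256]) torch.Size([256, 256])
```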
The left side of this figure depicts how a transformer-based text-to-image model is trained, while the right side illustrates the process of generating an image from a text prompt using the trained model. To train the model, images are encoded into image tokens using a VQGAN encoder. The corresponding captions are processed through a BART encoder and then a BART decoder to produce predicted image tokens. The objective is to train the BART decoder so that its predicted tokens match the image tokens produced by the VQGAN encoder. To generate an image with the trained model, the text prompt is fed into the BART encoder and then the BART decoder to produce predicted image tokens, which are passed through the VQGAN decoder to generate the final output, as shown at the top center of the figure.
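In code, the alignment objective boils down to a cross-entropy loss between the decoder's predicted distribution over the codebook and the image tokens produced by the VQGAN encoder. The snippet below is a schematic sketch with placeholder tensors and made-up shapes, not the chapter's actual training loop.

```python
import torch
import torch.nn.functional as F

# Schematic training step: for each of the 256 image-token positions, the
# text-conditioned decoder predicts a distribution over the VQGAN codebook,
# and the loss penalizes mismatches with the encoder's tokens.
batch_size, seq_len, codebook_size = 8, 256, 1024

# Target tokens: what the VQGAN encoder produced for the training images (placeholder)
target_image_tokens = torch.randint(0, codebook_size, (batch_size, seq_len))

# Predictions: logits over the codebook from the text-conditioned decoder (placeholder)
decoder_logits = torch.randn(batch_size, seq_len, codebook_size, requires_grad=True)

loss = F.cross_entropy(
    decoder_logits.reshape(-1, codebook_size),  # (batch * seq_len, codebook_size)
    target_image_tokens.reshape(-1),            # (batch * seq_len,)
)
loss.backward()  # gradients would flow back into the text-to-image model
print(loss.item())
```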
A diagram of the forward diffusion process. We start with a clean image from the training set, x₀, and add noise ε₀ to it to form a noisy image x₁, which is a weighted sum of x₀ and ε₀. We repeat this process for 1000 time steps until the image x₁₀₀₀ becomes random noise.
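The forward process has a convenient closed form: the noisy image at any time step is a weighted sum of the clean image and fresh Gaussian noise. The sketch below assumes a linear beta schedule with common default values, which may differ from the chapter's exact settings.

```python
import torch

# Minimal sketch of forward diffusion with a linear beta schedule (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product, one value per time step

def add_noise(x0, t):
    """Sample x_t directly from x_0 as a weighted sum of the clean image and noise."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise

x0 = torch.randn(3, 64, 64)    # stand-in for a training image
xt, eps = add_noise(x0, 999)   # near step 1000, x_t is essentially pure noise
```

Because x_t depends only on x₀ and the cumulative noise schedule, training can jump to any time step directly instead of simulating all the intermediate steps.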
How a trained latent diffusion model (LDM) generates an image based on a text prompt. The text prompt ("a banana riding a motorcycle, wearing sunglasses and a straw hat," for example, as shown at the top left corner) is first encoded into a text embedding. To generate an image in the lower-dimensional latent space, we start with an image of pure noise (bottom left). We use the trained U-Net to iteratively denoise the image, conditioned on the text embedding so the generated image matches the prompt, with the guidance of a trained Contrastive Language-Image Pre-training (CLIP) model. The generated latent image (bottom right) is then passed through a trained VAE decoder to convert it into a high-resolution image, which is the final output (top right).
Intermediate decoded outputs from a trained latent diffusion model at time steps 800, 600, ..., 200, and 0. The text prompt is "a banana riding a motorcycle, wearing sunglasses and a straw hat."
The StableDiffusionPipeline class in the diffusers library allows you to use Stable Diffusion as an off-the-shelf tool to generate high-quality images in just a few lines of code. This figure shows the image generated from the prompt "an astronaut in a spacesuit riding a unicorn."
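A typical invocation might look like the sketch below; the checkpoint id is just an example of a Stable Diffusion v1-style model, and a CUDA GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion checkpoint (the model id is an example).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU is available

image = pipe("an astronaut in a spacesuit riding a unicorn").images[0]
image.save("astronaut_unicorn.png")
```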
The eight steps to building a text-to-image generator from scratch. Steps 1-4 establish the foundation: you’ll learn to build a transformer, understand how a vision transformer (ViT) processes images, implement a basic diffusion model, and use classifier-free guidance to control image generation. In steps 5-8, you’ll train your own CLIP model to connect images and text, create a diffusion-based generator (such as Stable Diffusion), master the VQGAN architecture for discrete image encoding, and build a transformer-based generator inspired by DALL-E. Each step brings you closer to understanding and creating a text-to-image generator.
An image generated by ChatGPT 4o using the prompt "draw me an image of a man without a beard."
Summary
- Text-to-image models are a type of multimodal generative model designed to transform a text description into a corresponding image.
- Unimodal models operate within a single type of data modality, such as text-only or image-only models. In contrast, multimodal models connect different data modalities, enabling interactions across text, images, audio, and video.
- Transformer-based text-to-image generation models treat images as sequences by dividing them into patches, each patch acting as an element in the sequence. Image generation is then a sequence prediction problem, where the model predicts patches from top-left to bottom-right based on a text prompt.
- In diffusion-based text-to-image generation models, we start with an image of pure noise. The model iteratively denoises it based on the text prompt, reducing noise with each step until a clear image matching the prompt is produced.
- Instead of conducting forward and reverse diffusion processes on high-resolution images, latent diffusion models (LDMs) conduct them in a lower-dimensional latent space, making the process faster and more efficient. A variational autoencoder (VAE) decoder then converts the low-resolution latent images into the high-resolution final outputs.
- Despite significant advancements, text-to-image generative models face challenges like the Pink Elephant problem, copyright disputes, geometry inconsistencies, and social, ethical, and environmental concerns.
FAQ
What are text-to-image generation models, and why are they considered multimodal?
Text-to-image models are generative AI systems that take text as input and produce images as output. They are multimodal because they connect two different data types—language (text) and vision (images)—unlike unimodal models that operate within a single modality (for example, text-in/text-out).
How do transformer-based text-to-image models generate images from text?
They reframe image generation as a sequence prediction task:
- Images are encoded into discrete “image tokens” using a model like VQGAN, which turns image patches into indices from a learned codebook.
- Text is turned into “text tokens” using a language transformer (e.g., BART).
- Training aligns the BART decoder’s output with the VQGAN image-token sequence (minimizing cross-entropy), so the model learns to predict image tokens from text.
- At inference, the prompt is passed through BART to produce image tokens, and the VQGAN decoder reconstructs the image from those tokens (see the sketch after this list).
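Schematically, generation is an autoregressive loop over the image-token positions followed by a single VQGAN decode. The sketch below uses tiny stand-in modules for the BART encoder/decoder and the VQGAN decoder, so it shows only the control flow, not a trained pipeline.

```python
import torch
import torch.nn as nn

# Stand-in modules (placeholders for the trained BART encoder/decoder and VQGAN decoder).
codebook_size, seq_len, embed_dim, vocab_size = 1024, 256, 64, 30000
text_encoder = nn.Embedding(vocab_size, embed_dim)      # stand-in for the BART encoder
token_predictor = nn.Linear(embed_dim, codebook_size)   # stand-in for the BART decoder head
vqgan_decoder = nn.Linear(seq_len, 3 * 64 * 64)         # stand-in for the VQGAN decoder

prompt_ids = torch.randint(0, vocab_size, (1, 12))      # tokenized prompt (placeholder ids)
context = text_encoder(prompt_ids).mean(dim=1)          # a single conditioning vector

generated = []
for _ in range(seq_len):                        # predict image tokens one position at a time
    logits = token_predictor(context)           # a real decoder also conditions on previous tokens
    generated.append(logits.argmax(dim=-1))
tokens = torch.stack(generated, dim=1).float()  # (1, 256) predicted image tokens

image = vqgan_decoder(tokens).reshape(1, 3, 64, 64)  # decode tokens into an image-shaped tensor
```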
What are diffusion models, and how do forward and reverse diffusion work?
Diffusion models learn to denoise. In forward diffusion, small amounts of Gaussian noise are added to a clean image over many steps until it becomes pure noise. A denoising network (often a U-Net) is trained to reverse this process: starting from noise, it removes noise step-by-step to produce a coherent image. Conditioning on text steers the denoising toward images that match the prompt.
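In practice, the denoising network is usually trained to predict the noise that was added at a randomly chosen time step. The sketch below shows one such training step with a toy convolution standing in for the U-Net; the noise schedule, shapes, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One schematic denoising training step: predict the added noise and minimize
# the mean squared error between predicted and true noise.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # toy stand-in for a U-Net
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

x0 = torch.randn(8, 3, 64, 64)                 # batch of placeholder "clean" images
t = torch.randint(0, T, (8,))                  # a random time step per image
noise = torch.randn_like(x0)
a_bar = alpha_bars[t].view(-1, 1, 1, 1)
xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # noisy images

optimizer.zero_grad()
predicted_noise = denoiser(xt)                 # a real U-Net also takes t (and the text) as input
loss = F.mse_loss(predicted_noise, noise)
loss.backward()
optimizer.step()
```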
What is a latent diffusion model (LDM), and why is it more efficient?
LDMs run diffusion in a low-dimensional latent space instead of pixel space:
- A VAE encoder compresses images into a compact latent representation (e.g., 4×64×64 instead of 3×512×512), cutting computation dramatically.
- The U-Net denoises in this latent space, often guided by a text embedding and a CLIP-based similarity signal.
- A VAE decoder then upsamples the denoised latent to a high-resolution image, as sketched after this list.
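To see the compression concretely, the sketch below round-trips an image-sized tensor through the VAE of a Stable Diffusion v1-style checkpoint using the diffusers AutoencoderKL; the checkpoint id is an example.

```python
import torch
from diffusers import AutoencoderKL

# Load only the VAE component of a Stable Diffusion v1-style checkpoint (example id).
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    reconstructed = vae.decode(latents).sample

print(latents.shape)        # torch.Size([1, 4, 64, 64]) -- the compact latent
print(reconstructed.shape)  # torch.Size([1, 3, 512, 512]) -- back to pixel space
```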
What is Stable Diffusion, and how can I use it quickly?
Stable Diffusion is a popular open-source LDM trained on large-scale image–text pairs. It incorporates training optimizations and broad dataset coverage. You can generate images with just a few lines using the Hugging Face diffusers StableDiffusionPipeline, or explore its internals to customize guidance, sampling steps, and prompts.
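Building on the earlier snippet, the call below shows the kind of knobs you can turn; the checkpoint id and parameter values are examples, not prescribed settings.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "an astronaut in a spacesuit riding a unicorn",
    negative_prompt="blurry, low quality",              # concepts to steer away from
    guidance_scale=7.5,                                 # higher values follow the prompt more strictly
    num_inference_steps=50,                             # more denoising steps: slower, often cleaner
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for reproducible sampling
).images[0]
image.save("astronaut_unicorn_tuned.png")
```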
What are practical applications of text-to-image models?
- Creative content: art, illustrations, marketing visuals, rapid prototyping.
- Product and game design: concept art, characters, environments, fashion sketches.
- Education and communication: visualizing historical, scientific, or medical concepts.
- ML workflows: data augmentation via synthetic images.
- Related skills you’ll build: image captioning, text–image similarity (CLIP), and image selection/retrieval by prompt.
What is the “pink elephant” problem in text-to-image generation?
It’s a failure with negation (e.g., “a man without a beard” producing a bearded man). Causes include limited handling of negative constraints in training data/objectives, prompt parsing issues, and bias toward frequent co-occurrences. Emerging techniques in prompt conditioning and model training aim to reduce this.
Why do models struggle with geometric consistency?
They’re trained on diverse 2D images and typically don’t learn explicit 3D structure, depth, lighting, or physics. Positional encodings help with spatial order but don’t enforce strict geometric rules. This can yield inconsistent object parts, perspective errors, or impossible configurations. Stronger 3D priors, multi-view data, or physics-informed constraints can help.
Do text-to-image models “steal” from artists or create new works?
Two viewpoints:
- Critique: Models may reproduce elements of copyrighted works in training sets (especially overrepresented images), blurring the line between inspiration and copying.
- Defense: Models learn statistical patterns rather than memorizing; diffusion from noise and recombination of concepts produce novel outputs influenced by, but not duplicative of, training data.
The legal and ethical landscape is still evolving.
What are key social, environmental, and ethical concerns?
- Energy and compute: Training/inference can be resource-intensive; techniques like pruning, quantization, and transfer learning help but don’t eliminate the footprint.
- Misuse: Risk of deepfakes and misinformation calls for guardrails, moderation, and policy collaboration.
- Bias: Models may amplify stereotypes; auditing, dataset curation, and careful mitigation are needed—without overcorrection that distorts facts.