Table of contents

Part 1 Understanding attention and transformers

1 A tale of two models: Transformers and diffusions

1.1 What is a text-to-image generation model?

1.1.1 Unimodal vs. multimodal models

1.1.2 Practical use cases of text-to-image models

1.2 Transformer-based text-to-image generation

1.2.1 Converting an image into a sequence of integers and then back

1.2.2 Training and using a transformer-based text-to-image model

1.3 Text-to-image generation with diffusion models

1.3.1 Forward and reverse diffusions

1.3.2 Latent diffusion models and Stable Diffusion

1.4 How to build text-to-image models from scratch

1.5 Challenges for text-to-image generation models

1.5.1 Are generative AI models stealing from artists?

1.5.2 The geometric inconsistency problem

1.6 Social, environmental, and ethical concerns

2 Build a transformer

2.1 An overview of attention and transformers

2.1.1 How the attention mechanism works

2.1.2 How to create a transformer

2.2 Word embedding and positional encoding

2.2.1 Word tokenization with the Spacy library

2.2.2 A sequence padding function

2.2.3 Input embedding from word embedding and positional encoding

2.3 Creating an encoder–decoder transformer

2.3.1 Coding the attention mechanism

2.3.2 Defining the Transformer() class

2.3.3 Creating a language translator

2.4 Training and using the German-to-English translator

2.4.1 Training the encoder–decoder transformer

2.4.2 Translating German to English with the trained model

3 Classify images with a vision transformer

3.1 The blueprint to train a ViT

3.1.1 Converting images to sequences

3.1.2 Training a ViT for classification

3.2 The CIFAR-10 dataset

3.2.1 Downloading and visualizing CIFAR-10 images

3.2.2 Preparing datasets for training and testing

3.3 Building a ViT from scratch

3.3.1 Dividing images into patches

3.3.2 Modeling the positions of different patches in an image

3.3.3 Using the multi-head self-attention mechanism

3.3.4 Building an encoder-only transformer

3.3.5 Using the ViT to create a classifier

3.4 Training and using the ViT to classify images

3.4.1 Choosing the optimizer and the loss function

3.4.2 Training the ViT for image classification

3.4.3 Classifying images using the trained ViT

4 Add captions to images

4.1 Training and using a transformer to add captions

4.1.1 Preparing data and the causal attention mask

4.1.2 Creating and training a transformer

4.2 Preparing the training dataset

4.2.1 Downloading and visualizing Flickr 8k images

4.2.2 Building a vocabulary of tokens

4.2.3 Preparing the training dataset

4.3 Creating a multimodal transformer to add captions

4.3.1 Defining a ViT as the image encoder

4.3.2 Creating the decoder to generate text

4.4 Training and using the image-to-text transformer

4.4.1 Training the encoder–decoder transformer

4.4.2 Adding captions to images with the trained model

Part 2 Introduction to diffusion models

5 Generate images with diffusion models

5.1 The forward diffusion process

5.1.1 How diffusion models work

5.1.2 Visualizing the forward diffusion process

5.1.3 Different diffusion schedules

5.2 The reverse diffusion process

5.3 A blueprint to train the U-Net model

5.3.1 Steps in training a denoising U-Net model

5.3.2 Preprocessing the training data

5.4 Training and using the diffusion model

5.4.1 The Denoising Diffusion Probabilistic Model noise scheduler

5.4.2 Inference using the U-Net denoising model

5.4.3 Training and using the denoising U-Net model

6 Control what images to generate in diffusion models

6.1 Classifier-free guidance in diffusion models

6.1.1 An overview of classifier-free guidance

6.1.2 A blueprint to implement CFG

6.2 Different components of a denoising U-Net model

6.2.1 Time step embedding and label embedding

6.2.2 The U-Net denoising model architecture

6.2.3 Down blocks and up blocks in the U-Net

6.3 Building and training the denoising U-Net model

6.3.1 Building the denoising U-Net

6.3.2 The Denoising Diffusion Probabilistic Model

6.3.3 Training the diffusion model

6.4 Generating images with the trained diffusion model

6.4.1 Visualizing generated images

6.4.2 How the guidance parameter affects generated images

7 Generate high-resolution images with diffusion models

7.1 Attention in U-Net, DDIM, and image interpolation

7.1.1 Incorporating the attention mechanism in the U-Net model

7.1.2 Denoising Diffusion Implicit Models

7.1.3 Image interpolation in diffusion models

7.2 High-resolution flower images as training data

7.2.1 Visualizing images in the training dataset

7.2.2 Applying forward diffusion on flower images

7.3 Building and training a U-Net for high-resolution images

7.3.1 Building the denoising U-Net model

7.3.2 Training the denoising U-Net model

7.4 Image generation and interpolation

7.4.1 Using the trained denoising U-Net to generate images

7.4.2 Transition from one image to another

Part 3 Text-to-image generation with diffusion models

8 CLIP: A model to measure the similarity between image and text

8.1 The CLIP model

8.1.1 How the CLIP model works

8.1.2 Selecting an image from Flickr 8k based on a text description

8.2 Preparing the training dataset

8.2.1 Image-caption pairs in Flickr 8k

8.2.2 The DistilBERT tokenizer

8.2.3 Preprocess captions and images for training

8.3 Creating a CLIP model

8.3.1 Creating a text encoder

8.3.2 Creating an image encoder

8.3.3 Building a CLIP model

8.4 Training and using the CLIP model

8.4.1 Training the CLIP model

8.4.2 Using the trained CLIP model to select images

8.4.3 Using the OpenAI pretrained CLIP model to select images

9 Text-to-image generation with latent diffusion

9.1 What is a latent diffusion model?

9.1.1 How variational autoencoders work

9.1.2 Combining a latent diffusion model with a variational autoencoder

9.2 Compressing and reconstructing images with VAEs

9.2.1 Downloading the pretrained VAE

9.2.2 Encoding and decoding images with the pretrained VAE

9.3 Text-to-image generation with latent diffusion

9.3.1 Guidance by the CLIP model

9.3.2 Diffusion in the latent space

9.3.3 Converting latent images to high-resolution ones

9.4 Modifying existing images with text prompts

10 A deep dive into Stable Diffusion

10.1 Generating images with Stable Diffusion

10.2 The Stable Diffusion architecture

10.2.1 Generating images from text with Stable Diffusion

10.2.2 Text embedding interpolation

10.3 Creating text embeddings

10.4 Image generation in the latent space

10.5 Converting latent images to high-resolution ones

Part 4 Text-to-image generation with transformers

11 VQGAN: Convert images into sequences of integers

11.1 Converting images into sequences of integers and back

11.2 Variational autoencoders

11.2.1 What is an autoencoder?

11.2.2 The need for VAEs and their training methodology

11.3 Vector quantized variational autoencoders

11.3.1 The need for VQ-VAEs

11.3.2 The VQ-VAE model architecture and training process

11.4 Vector quantized generative adversarial networks

11.4.1 Generative adversarial networks

11.4.2 VQGAN: A GAN with a VQ-VAE generator

11.5 A pretrained VQGAN model

11.5.1 Reconstructing images with the pretrained VQGAN

11.5.2 Converting images into sequences of integers

12 A minimal implementation of DALL-E

12.1 How min-DALL-E works

12.1.1 Training min-DALL-E

12.1.2 From prompt to pixels: Image generation at inference time

12.2 Tokenizing and encoding the text prompt

12.2.1 Tokenizing the text prompt

12.2.2 Encoding the text prompt

12.3 Iterative prediction of image tokens

12.3.1 Loading the pretrained BART decoder

12.3.2 Predicting image tokens using the BART decoder

12.4 Converting image tokens to high-resolution images

12.4.1 Loading the pretrained VQGAN detokenizer

12.4.2 Visualizing the intermediate and final high-resolution outputs

Part 5 New developments and challenges

13 New developments and challenges in text-to-image generation

13.1 State-of-the-art text-to-image generators

13.1.1 DALL-E series

13.1.2 Google’s Imagen

13.1.3 Latent diffusion models: Stable Diffusion and Midjourney

13.2 Challenges and concerns

13.3 A blueprint to fine-tune ResNet50

13.3.1 The history and architecture of ResNet50

13.3.2 A plan to fine-tune ResNet50 for classification

13.3.3 Using ResNet50 to classify images

13.4 Fine-tuning ResNet50 to detect fake images

13.4.1 Downloading and preprocessing real and fake face images

13.4.2 Fine-tuning ResNet50

13.4.3 Detecting deepfakes using the fine-tuned ResNet50

Appendix

Installing PyTorch and enabling GPU training locally and in Colab

A.1 Installing Python and setting up a virtual environment

A.1.1 Installing Anaconda

A.1.2 Setting up a Python virtual environment

A.1.3 Installing Jupyter Notebook

A.2 Installing PyTorch

A.3 Using Google Colab for GPU training and inference

References

Overview

8 CLIP: A model to measure the similarity between image and text

Modern text-to-image systems rely on a bridge between language and vision that can score how well an image matches a description. This chapter introduces CLIP, a multimodal model that aligns captions and images in a shared latent space so their similarity can be measured directly. While CLIP does not generate images, it is foundational for the overall pipeline: first as an evaluator that retrieves or ranks images by a prompt, and later as a guide that helps generative models produce outputs that reflect the intent of the text.

Concretely, the chapter builds a CLIP variant using two encoders—a DistilBERT text encoder and a ResNet50 image encoder—followed by projection heads that map both modalities into 256-dimensional embeddings. Trained with contrastive learning on Flickr 8k image–caption pairs, the objective pulls together matching pairs and pushes apart mismatches. Key implementation details include tokenization with attention masks, standardized image preprocessing, freezing pretrained backbones to keep training lightweight, temperature-scaled similarity matrices, soft targets derived from intra-modal similarities, and a symmetric cross-entropy loss computed in both text-to-image and image-to-text directions.

Once trained, the model supports text-to-image selection: embed a prompt, compare it to cached image embeddings with cosine or dot-product similarity, and return the top matches. The chapter demonstrates this retrieval on Flickr 8k and then repeats the workflow with a pretrained OpenAI CLIP model (e.g., ViT-B/32) to show the gains from large-scale pretraining. It concludes by previewing how the same similarity signal is later used to condition and guide diffusion models, integrating CLIP’s alignment capability into end-to-end text-to-image generation.

Eight steps for building a text-to-image generator from scratch. In this chapter, you’ll focus on step 5: enabling the model to understand images in the context of natural language. By mastering this step, you’ll equip your models with the ability to align and compare images and texts, a capability that is foundational for all subsequent text-to-image generation methods.

How the CLIP model is trained. We collect a large-scale dataset of text-image pairs as the training dataset. The text encoder in the CLIP model compresses each description into a text embedding, and the image encoder converts the corresponding image into an image embedding of the same dimension (say, both embeddings are 256-value vectors). During training, a batch of N text-image pairs is transformed into N text embeddings and N image embeddings, which yield an N × N matrix of pairwise similarities. Using a contrastive learning approach, the CLIP model maximizes the similarity between embeddings of matching pairs (the diagonal values in the figure) while minimizing the similarity between nonmatching pairs (the off-diagonal values).

How to select an image from the Flickr 8k dataset using the trained CLIP model based on a text prompt. First, the text encoder in the CLIP model converts the prompt (e.g., “A dog plays on the beach” as shown at the top left) into a text embedding. Next, the image encoder processes every image in the dataset to generate N image embeddings. The similarity between the text embedding and each image embedding is then computed. Finally, the images are sorted by their similarity scores, and the one with the highest score is chosen as the match.

Ten image-caption pairs from the Flickr 8k dataset. We select ten images and place the shortest caption on top of each image.

Selecting an image from the Flickr 8k dataset using the trained CLIP model. The prompt used to select the image is “students having a class in the classroom.” The original caption of the image is “A woman helps boys on a computer.”

Five matched images based on the prompt "people eating at the restaurant" using the pretrained OpenAI CLIP model.

Summary

The contrastive language-image pretraining (CLIP) model is designed to understand and interpret images in the context of natural language. The model consists of a text encoder and an image encoder. The text encoder compresses the text description into a text embedding. The image encoder converts the corresponding image into an image embedding of the same dimension. During training, a batch of N text-image pairs is converted to N text embeddings and N image embeddings. CLIP uses a contrastive learning approach to maximize the similarity between paired embeddings while minimizing the similarity between embeddings from nonmatching text-image pairs.
When creating a CLIP model from scratch, we can use the pretrained DistilBERT model to encode image captions. The DistilBERT tokenizer adds a [CLS] token to the beginning and a [SEP] token to the end of each caption. The text embedding of a caption is the embedding of the [CLS] token at the last layer of the DistilBERT model. The [CLS] token is trained to capture the overall meaning of the entire caption. It acts as a learned summary, aggregating contextual information from all other tokens through self-attention.
When training the CLIP model, we compute two cross-entropy losses: one where the correct image should stand out among all images for a given text (text-to-image matching), and another where the correct text should be distinguished from all texts for a given image (image-to-text matching). Averaging these two losses creates a symmetric objective that ensures both the text encoder and image encoder are equally optimized to produce aligned, comparable representations.
A trained CLIP model can measure the cosine similarity between the embedding of a text description and the embedding of any given image. This has many practical applications in computer vision tasks such as image generation and image classification.
One application of the CLIP model is image selection. We can use the trained CLIP model to select an image from a large dataset (say, Flickr 8k) that best matches a text description.

FAQ

What is CLIP and why is it important for text-to-image systems?

CLIP (Contrastive Language-Image Pretraining) aligns images and text in a shared latent space so their similarity can be measured directly. In this chapter it is used to: 1) learn embeddings that bring matching image–caption pairs close and push mismatches apart; 2) retrieve the best-matching image for a prompt; and 3) later guide generative models so outputs reflect the prompt.

How does CLIP learn to align images and text?

It uses two encoders (text and image) to produce embeddings of the same dimension, then applies a contrastive objective over batches. The model maximizes similarity for matching pairs (diagonal of the similarity matrix) and minimizes it for non-matching pairs (off-diagonal). Temperature scaling controls sharpness, and losses are computed symmetrically (text-to-image and image-to-text).
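The symmetric objective described above can be sketched in plain Python. This is a minimal illustration with hard (one-hot) targets, not the chapter's implementation: the function names are hypothetical, and dividing the logits by the temperature is an assumed convention (some implementations multiply by a learned scale instead).

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def similarity_logits(text_emb, image_emb, temperature=1.0):
    """logits[i][j] = dot(text_i, image_j) / temperature (assumed convention)."""
    return [[sum(t * v for t, v in zip(text, image)) / temperature
             for image in image_emb]
            for text in text_emb]

def symmetric_contrastive_loss(text_emb, image_emb, temperature=1.0):
    logits = similarity_logits(text_emb, image_emb, temperature)
    n = len(logits)
    # Text-to-image: row i should put its probability mass on column i.
    t2i = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Image-to-text: the same computation on the transposed matrix.
    transposed = [list(col) for col in zip(*logits)]
    i2t = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return (t2i + i2t) / 2

# Toy 2-D embeddings: matching pairs share an axis, swapped pairs do not.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(text_emb, aligned_images, temperature=0.1)
loss_swapped = symmetric_contrastive_loss(text_emb, swapped_images, temperature=0.1)
loss_warm = symmetric_contrastive_loss(text_emb, aligned_images, temperature=1.0)
print(loss_aligned, loss_swapped, loss_warm)
```

With the toy data, the aligned batch gives a loss near zero and the swapped batch a large one. Raising the temperature from 0.1 to 1.0 flattens the softmax, so even the perfectly aligned batch incurs a noticeably higher loss.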

Which encoders are used in this chapter and why are they frozen?

Text encoder: DistilBERT. Image encoder: ResNet50. Both are loaded with pretrained weights and frozen to reduce compute and training time while leveraging strong prior knowledge. This leaves only the small projection heads trainable (~0.85M parameters), enabling practical training (about half an hour on a GPU) on Flickr 8k.

Why are projection heads needed and why use 256-dimensional embeddings?

DistilBERT outputs 768-d vectors and ResNet50 outputs 2048-d vectors. Projection heads map these to a shared 256-d space so text and image embeddings are comparable. The heads use a linear layer, GELU, residual connection, dropout, and LayerNorm, helping stabilize training and learn a common semantic space.
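As a rough sanity check on the trainable-parameter figure of about 0.85M quoted earlier, the arithmetic below counts the weights in the two projection heads. It assumes each head contains a second 256-to-256 linear layer before the residual connection, which the text does not state, and that GELU and dropout contribute no parameters. The helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def layer_norm_params(features):
    """One scale and one shift value per feature."""
    return 2 * features

def projection_head_params(in_features, proj_dim=256, second_linear=True):
    total = linear_params(in_features, proj_dim) + layer_norm_params(proj_dim)
    if second_linear:  # assumption: a 256 -> 256 layer inside the residual branch
        total += linear_params(proj_dim, proj_dim)
    return total

text_head = projection_head_params(768)    # DistilBERT output size
image_head = projection_head_params(2048)  # ResNet50 output size
print(text_head, image_head, text_head + image_head)  # 263168 590848 854016

single_layer_total = (projection_head_params(768, second_linear=False)
                      + projection_head_params(2048, second_linear=False))
print(single_layer_total)  # 722432
```

Under the two-layer assumption the total is 854,016, which agrees with the quoted figure. A head with a single linear layer would total 722,432, so the assumption appears plausible but remains unconfirmed by the text.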

Why use the CLS token from the last layer as the text embedding?

The CLS token acts as a learned summary of the entire caption via self-attention. Taking it from the last layer yields a compact, fixed-length embedding that captures the most contextualized features, making it well-suited for alignment with image embeddings.

How are captions tokenized and batched for the text encoder?

Captions are tokenized with DistilBERT’s tokenizer: [CLS] is added at the start and [SEP] at the end. Sequences in a batch are padded to the same length and an attention_mask marks real tokens (1) vs padding (0), ensuring the model ignores pad positions.
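The padding and masking step can be illustrated without any library. In this sketch the special-token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD]) are those of the uncased BERT vocabulary that DistilBERT's uncased checkpoint shares. The word IDs are made-up placeholders and `pad_batch` is a hypothetical helper. In practice the Hugging Face tokenizer produces both lists when called with `padding=True`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special tokens in the uncased BERT vocabulary

def pad_batch(token_id_lists):
    """Wrap each caption in [CLS] ... [SEP], pad to the longest, and build masks."""
    wrapped = [[CLS_ID] + ids + [SEP_ID] for ids in token_id_lists]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Placeholder word IDs standing in for a three-word and a one-word caption.
batch = pad_batch([[2001, 2002, 2003], [2004]])
print(batch["input_ids"])       # [[101, 2001, 2002, 2003, 102], [101, 2004, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```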

What loss is used and what is the purpose of “soft” targets?

The main logits are text_embeddings @ image_embeddings.T (with temperature). Besides the standard contrastive targets, the implementation computes image-to-image and text-to-text similarity matrices to create softened targets via softmax. This acknowledges semantic similarities within a batch and can improve robustness. Losses are computed in both directions and averaged.
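One way to build such softened targets is sketched below in plain Python. Averaging the text-to-text and image-to-image similarity matrices before the softmax is an assumption about the exact formula, and all function names are hypothetical. The point is only that near-duplicate items in a batch share target probability instead of being treated as strict negatives.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def matmul_t(a, b):
    """Return a @ b.T for two lists of row vectors."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def soft_targets(text_emb, image_emb):
    """Softmax over the average of the two intra-modal similarity matrices."""
    text_sim = matmul_t(text_emb, text_emb)
    image_sim = matmul_t(image_emb, image_emb)
    return [softmax([(t + v) / 2 for t, v in zip(t_row, v_row)])
            for t_row, v_row in zip(text_sim, image_sim)]

def soft_cross_entropy(logits, targets):
    """Mean over rows of -sum_j target_j * log_softmax(logits)_j."""
    total = 0.0
    for l_row, t_row in zip(logits, targets):
        m = max(l_row)
        log_z = m + math.log(sum(math.exp(x - m) for x in l_row))
        total -= sum(t * (x - log_z) for x, t in zip(l_row, t_row))
    return total / len(logits)

def soft_contrastive_loss(text_emb, image_emb, temperature=1.0):
    logits = [[x / temperature for x in row] for row in matmul_t(text_emb, image_emb)]
    targets = soft_targets(text_emb, image_emb)
    logits_t = [list(col) for col in zip(*logits)]
    targets_t = [list(col) for col in zip(*targets)]
    # Average the text-to-image and image-to-text directions.
    return (soft_cross_entropy(logits, targets)
            + soft_cross_entropy(logits_t, targets_t)) / 2

# Items 0 and 1 are near-duplicates; item 2 is unrelated.
text_emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
image_emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
targets = soft_targets(text_emb, image_emb)
loss = soft_contrastive_loss(text_emb, image_emb)
print(targets[0], loss)
```

In the toy batch, the first row of `targets` gives equal weight to items 0 and 1 and less to item 2, so the model is not penalized for scoring the duplicate caption highly.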

How do I retrieve the best-matching image for a text prompt?

- Precompute and cache image embeddings for the dataset using the trained image encoder + projection head.
- Encode the prompt to a text embedding with the text encoder + projection head.
- Compute similarity between the text embedding and all cached image embeddings (e.g., cosine similarity).
- Sort by similarity and return the top-k images.
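The ranking part of these steps can be sketched with toy vectors standing in for real embeddings. The file names, the three-dimensional embeddings, and the `top_k_images` helper are all hypothetical. A real cache would hold one 256-dimensional vector per Flickr 8k image.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_images(text_embedding, cached_image_embeddings, k=5):
    """Score every cached image against the prompt and return the k best."""
    scored = [(cosine_similarity(text_embedding, emb), name)
              for name, emb in cached_image_embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Hypothetical cache: image name -> precomputed embedding.
cache = {
    "beach_dog.jpg": [0.9, 0.1, 0.0],
    "classroom.jpg": [0.0, 1.0, 0.2],
    "restaurant.jpg": [0.1, 0.2, 0.9],
}
prompt_embedding = [1.0, 0.2, 0.0]  # stand-in for the encoded text prompt

matches = top_k_images(prompt_embedding, cache, k=2)
print(matches)
```

Because the image embeddings are computed once and cached, each new prompt costs only one text-encoder pass plus N similarity computations.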

Why use cosine similarity for matching embeddings?

Cosine similarity measures the angle between vectors, focusing on orientation, not magnitude. This is desirable for semantic embeddings. The dot product of L2-normalized embeddings equals cosine similarity exactly. For embeddings that pass through LayerNorm, whose norms are roughly constant, the dot product is approximately proportional to cosine similarity. Many implementations use either approach together with temperature scaling.
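A small numeric check makes the distinction concrete. The vectors are arbitrary illustrative values and the helper names are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity is the dot product of the unit-length versions.
    return dot(l2_normalize(a), l2_normalize(b))

text = [3.0, 4.0]
image = [4.0, 3.0]
scaled_image = [40.0, 30.0]  # same direction as `image`, ten times longer

print(dot(text, image), dot(text, scaled_image))        # raw dot product changes with length
print(cosine(text, image), cosine(text, scaled_image))  # cosine does not
```

Scaling an embedding multiplies the raw dot product by the same factor but leaves the cosine unchanged, which is why normalizing before comparison keeps long vectors from dominating the ranking.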

How does OpenAI’s pretrained CLIP differ and how do I use it here?

OpenAI’s CLIP (e.g., ViT-B/32) is trained on massive image–text data, offering stronger generalization. To use it: load the model and its preprocessing function, encode all dataset images once, tokenize and encode the prompt to get a text embedding, compute similarities to the image embeddings, and return the top-k matches. It typically yields better results than a small, custom-trained model.
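The workflow above might look like the following sketch, assuming the `torch`, `Pillow`, and OpenAI `clip` packages are installed. The function name and its parameters are hypothetical, and the third-party imports sit inside the function so that merely defining it has no dependencies. For a dataset the size of Flickr 8k, the images would more realistically be encoded in smaller batches and the features cached.

```python
def rank_images_with_openai_clip(prompt, image_paths, k=5, model_name="ViT-B/32"):
    """Return the k (path, score) pairs that best match the prompt."""
    import torch
    import clip  # OpenAI's CLIP package
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load(model_name, device=device)

    with torch.no_grad():
        # Encode every image once; here all images go through in one batch.
        images = torch.stack(
            [preprocess(Image.open(path).convert("RGB")) for path in image_paths]
        ).to(device)
        image_features = model.encode_image(images)
        # Tokenize and encode the prompt.
        text_features = model.encode_text(clip.tokenize([prompt]).to(device))

    # Normalize so the dot product below is a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    scores = (text_features @ image_features.T).squeeze(0)
    best = scores.topk(min(k, len(image_paths)))
    return [(image_paths[i], score)
            for score, i in zip(best.values.tolist(), best.indices.tolist())]
```

A call such as `rank_images_with_openai_clip("people eating at the restaurant", paths)` would return the five highest-scoring image paths, where `paths` is a list of image files supplied by the reader.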
