Overview

8 CLIP: a model to measure the similarity between image and text

Modern text-to-image systems rely on a bridge between language and vision that can score how well an image matches a description. This chapter introduces CLIP, a multimodal model that aligns captions and images in a shared latent space so their similarity can be measured directly. While CLIP does not generate images, it is foundational for the overall pipeline: first as an evaluator that retrieves or ranks images by a prompt, and later as a guide that helps generative models produce outputs that reflect the intent of the text.

Concretely, the chapter builds a CLIP variant using two encoders—a DistilBERT text encoder and a ResNet50 image encoder—followed by projection heads that map both modalities into 256-dimensional embeddings. Trained with contrastive learning on Flickr 8k image–caption pairs, the objective pulls together matching pairs and pushes apart mismatches. Key implementation details include tokenization with attention masks, standardized image preprocessing, freezing pretrained backbones to keep training lightweight, temperature-scaled similarity matrices, soft targets derived from intra-modal similarities, and a symmetric cross-entropy loss computed in both text-to-image and image-to-text directions.

Once trained, the model supports text-to-image selection: embed a prompt, compare it to cached image embeddings with cosine or dot-product similarity, and return the top matches. The chapter demonstrates this retrieval on Flickr 8k and then repeats the workflow with a pretrained OpenAI CLIP model (e.g., ViT-B/32) to show the gains from large-scale pretraining. It concludes by previewing how the same similarity signal is later used to condition and guide diffusion models, integrating CLIP’s alignment capability into end-to-end text-to-image generation.

Eight steps for building a text-to-image generator from scratch. In this chapter, you’ll focus on step 5: enabling the model to understand images in the context of natural language. By mastering this step, you’ll equip your models with the ability to align and compare images and texts, a capability that is foundational for all subsequent text-to-image generation methods.
How the CLIP model is trained. We collect a large-scale dataset of text-image pairs as the training dataset. The text encoder in the CLIP model compresses each description into a text embedding, and the image encoder in the CLIP model converts the corresponding image into an image embedding of the same dimension (say, both embeddings are 256-value vectors). During training, a batch of N text–image pairs is transformed into N text embeddings and N image embeddings. Using a contrastive learning approach, the CLIP model maximizes the similarity between embeddings of matching pairs (the diagonal values in the figure) while minimizing the similarity between nonmatching pairs (the off-diagonal values).
How to select an image from the Flickr 8k dataset using the trained CLIP model based on a text prompt. First, the text encoder in the CLIP model converts the prompt (e.g., “A dog plays on the beach” as shown at the top left) into a text embedding. Next, the image encoder processes every image in the dataset to generate N image embeddings. The similarity between the text embedding and each image embedding is then computed. Finally, the images are sorted by their similarity scores, and the one with the highest score is chosen as the match.
Ten image-caption pairs from the Flickr 8k dataset. We select ten images and place the shortest caption on top of each image.
Selecting an image from the Flickr 8k dataset using the trained CLIP model. The prompt used to select the image is “students having a class in the classroom.” The original caption of the image is “A woman helps boys on a computer.”
Five matched images based on the prompt "people eating at the restaurant" using the pretrained OpenAI CLIP model.

Summary

  • The contrastive language-image pretraining (CLIP) model is designed to understand and interpret images in the context of natural language. The model consists of a text encoder and an image encoder. The text encoder compresses the text description into a text embedding. The image encoder converts the corresponding image into an image embedding of the same dimension. During training, a batch of N text-image pairs is converted to N text embeddings and N image embeddings. CLIP uses a contrastive learning approach to maximize the similarity between paired embeddings while minimizing the similarity between embeddings from non-matching text-image pairs.
  • When creating a CLIP model from scratch, we can use the pretrained DistilBERT model to encode image captions. The DistilBERT tokenizer adds CLS and SEP tokens to the beginning and end of each caption. The text embedding of a caption is created by extracting the embedding of the CLS token at the very last layer of the DistilBERT model. The CLS token is trained to capture the overall meaning of the entire caption. It acts as a learned summary, aggregating contextual information from all other tokens through self-attention.
  • When training the CLIP model, we compute two cross-entropy losses: one where the correct image should stand out among all images for a given text (text-to-image matching), and another where the correct text should be distinguished from all texts for a given image (image-to-text matching). Averaging these two losses creates a symmetric objective that ensures both the text encoder and image encoder are equally optimized to produce aligned, comparable representations.
  • A trained CLIP model can measure the cosine similarity between a text description and any given image. This has many practical applications in computer vision tasks such as image generation and image classification.
  • One application of the CLIP model is image selection. We can use the trained CLIP model to select an image from a large dataset (say, Flickr 8k) that best matches a text description.

FAQ

What is CLIP and why is it important for text-to-image systems?
CLIP (Contrastive Language-Image Pretraining) aligns images and text in a shared latent space so their similarity can be measured directly. In this chapter it is used to: 1) learn embeddings that bring matching image–caption pairs close and push mismatches apart; 2) retrieve the best-matching image for a prompt; and 3) later guide generative models so outputs reflect the prompt.
How does CLIP learn to align images and text?
It uses two encoders (text and image) to produce embeddings of the same dimension, then applies a contrastive objective over batches. The model maximizes similarity for matching pairs (diagonal of the similarity matrix) and minimizes it for non-matching pairs (off-diagonal). Temperature scaling controls sharpness, and losses are computed symmetrically (text-to-image and image-to-text).
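
The symmetric objective described above can be sketched in a few lines of NumPy. This is a minimal illustration with hard (one-hot) targets, not the chapter's implementation; the function names and the `temperature` parameter are assumptions made for the example.

```python
import numpy as np

def log_softmax(x, axis):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(text_emb, image_emb, temperature=1.0):
    """Average of text-to-image and image-to-text cross-entropy.

    text_emb, image_emb: arrays of shape (N, D); row i of each is a matching pair.
    """
    logits = text_emb @ image_emb.T / temperature   # (N, N) similarity matrix
    diag = np.arange(logits.shape[0])
    # Text-to-image: each row should concentrate on its diagonal entry.
    loss_t2i = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Image-to-text: each column should concentrate on its diagonal entry.
    loss_i2t = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_t2i + loss_i2t) / 2

# Toy check: identical, unit-length embeddings on both sides are perfectly aligned.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
text /= np.linalg.norm(text, axis=1, keepdims=True)
aligned = symmetric_contrastive_loss(text, text, temperature=0.1)
shuffled = symmetric_contrastive_loss(text, text[::-1], temperature=0.1)
```

In this toy setup, pairing each text with the wrong row (`shuffled`) gives a larger loss than the aligned pairing, which is the behavior the objective is meant to reward. A lower temperature sharpens the softmax and widens that gap.
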
Which encoders are used in this chapter and why are they frozen?
Text encoder: DistilBERT. Image encoder: ResNet50. Both are loaded with pretrained weights and frozen to reduce compute and training time while leveraging strong prior knowledge. This leaves only the small projection heads trainable (~0.85M parameters), enabling practical training (about half an hour on a GPU) on Flickr 8k.
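
The ~0.85M figure can be sanity-checked with simple arithmetic. The sketch below assumes each head holds a linear projection from the backbone dimension (768 for DistilBERT, 2048 for ResNet50) to 256, a second 256-to-256 linear layer, and a LayerNorm with scale and shift. The second linear layer is an assumption, not something the text states; it is included because it makes the total land near the quoted number.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out

def head_params(backbone_dim, embed_dim=256):
    projection = linear_params(backbone_dim, embed_dim)  # backbone_dim -> 256
    hidden = linear_params(embed_dim, embed_dim)          # assumed 256 -> 256 layer
    layer_norm = 2 * embed_dim                            # learned scale and shift
    return projection + hidden + layer_norm

text_head = head_params(768)     # DistilBERT output size
image_head = head_params(2048)   # ResNet50 output size
total = text_head + image_head
print(text_head, image_head, total)  # 263168 590848 854016
```

Under these assumptions the two heads together have 854,016 trainable parameters, roughly 0.85M.
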
Why are projection heads needed and why use 256-dimensional embeddings?
DistilBERT outputs 768-d vectors and ResNet50 outputs 2048-d vectors. Projection heads map these to a shared 256-d space so text and image embeddings are comparable. The heads use a linear layer, GELU, residual connection, dropout, and LayerNorm, helping stabilize training and learn a common semantic space.
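
To make the data flow concrete, here is a forward-pass-only sketch of such a head in NumPy. It is illustrative rather than the chapter's code: the class name, the random initialization, the second linear layer inside the residual branch, and the tanh approximation of GELU are all assumptions. Dropout is omitted (it is inactive at inference time), and the LayerNorm has no learned scale or shift.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class ProjectionHead:
    """Hypothetical head: linear -> GELU -> linear -> residual add -> LayerNorm."""

    def __init__(self, in_dim, out_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        self.w_proj = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
        self.b_proj = np.zeros(out_dim)
        self.w_fc = rng.normal(scale=out_dim ** -0.5, size=(out_dim, out_dim))
        self.b_fc = np.zeros(out_dim)

    def __call__(self, features):
        projected = features @ self.w_proj + self.b_proj  # map to the shared size
        x = gelu(projected) @ self.w_fc + self.b_fc        # assumed second layer
        x = x + projected                                  # residual connection
        return layer_norm(x)

# Random stand-ins for backbone outputs: 4 captions and 4 images.
text_emb = ProjectionHead(768)(np.random.default_rng(2).normal(size=(4, 768)))
image_emb = ProjectionHead(2048, seed=1)(np.random.default_rng(3).normal(size=(4, 2048)))
print(text_emb.shape, image_emb.shape)  # (4, 256) (4, 256)
```

Both modalities come out with the same shape, which is what allows a single matrix product to compare every caption with every image.
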
Why use the CLS token from the last layer as the text embedding?
The CLS token acts as a learned summary of the entire caption via self-attention. Taking it from the last layer yields a compact, fixed-length embedding that captures the most contextualized features, making it well-suited for alignment with image embeddings.
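
In array terms, taking the CLS embedding amounts to slicing position 0 along the sequence axis. The sketch below uses random numbers as a stand-in for the encoder's last-layer output; the variable names are illustrative.

```python
import numpy as np

batch_size, seq_len, hidden_dim = 2, 6, 768
# Stand-in for the last-layer output of the text encoder,
# shaped (batch, sequence position, hidden size).
last_hidden_state = np.random.default_rng(0).normal(
    size=(batch_size, seq_len, hidden_dim)
)

# [CLS] is always the first token, so position 0 holds the caption summary.
cls_embedding = last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # (2, 768)
```

The result has one fixed-length vector per caption regardless of how many tokens the caption contains.
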
How are captions tokenized and batched for the text encoder?
Captions are tokenized with DistilBERT’s tokenizer: [CLS] is added at the start and [SEP] at the end. Sequences in a batch are padded to the same length and an attention_mask marks real tokens (1) vs padding (0), ensuring the model ignores pad positions.
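
The padding and masking logic can be shown with a toy tokenizer in plain Python. Whitespace splitting stands in for DistilBERT's subword tokenization, and tokens are kept as strings rather than converted to vocabulary ids; `encode_batch` is a hypothetical helper, not part of any library.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def encode_batch(captions):
    """Wrap each caption in [CLS]/[SEP], pad to the longest, and build masks."""
    # Whitespace splitting is a simplification of real subword tokenization.
    sequences = [[CLS] + caption.lower().split() + [SEP] for caption in captions]
    max_len = max(len(seq) for seq in sequences)
    tokens, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        tokens.append(seq + [PAD] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return tokens, attention_mask

tokens, attention_mask = encode_batch(["A dog plays on the beach", "Two kids run"])
print(tokens[1])
# ['[CLS]', 'two', 'kids', 'run', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(attention_mask[1])
# [1, 1, 1, 1, 1, 0, 0, 0]
```

The shorter caption is padded up to the length of the longer one, and its mask marks the three pad positions with zeros.
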
What loss is used and what is the purpose of “soft” targets?
The main logits are text_embeddings @ image_embeddings.T (with temperature). Besides the standard contrastive targets, the implementation computes image-to-image and text-to-text similarity matrices to create softened targets via softmax. This acknowledges semantic similarities within a batch and can improve robustness. Losses are computed in both directions and averaged.
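
One way to realize soft targets is sketched below in NumPy. How the two intra-modal matrices are combined is an assumption here (a plain average followed by a row-wise softmax), as are the function names; the chapter's implementation may differ in scaling and other details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def soft_target_loss(text_emb, image_emb, temperature=1.0):
    logits = text_emb @ image_emb.T / temperature
    # Intra-modal similarities: how alike the captions are, and how alike the images are.
    text_sim = text_emb @ text_emb.T
    image_sim = image_emb @ image_emb.T
    # Assumption: average the two matrices, then turn each row into a distribution.
    targets = softmax((text_sim + image_sim) / 2, axis=-1)
    # Cross-entropy against the soft targets in both directions. The averaged
    # similarity matrix is symmetric, so the same targets serve both.
    loss_t2i = -(targets * log_softmax(logits, axis=-1)).sum(axis=-1)
    loss_i2t = -(targets * log_softmax(logits.T, axis=-1)).sum(axis=-1)
    return ((loss_t2i + loss_i2t) / 2).mean(), targets

def unit_rows(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
text_emb = unit_rows(rng.normal(size=(4, 16)))
image_emb = unit_rows(rng.normal(size=(4, 16)))
loss, targets = soft_target_loss(text_emb, image_emb)
```

Each row of `targets` still peaks on the diagonal, but it also gives some weight to other pairs in the batch that look similar, so near-duplicate captions or images are not penalized as hard negatives.
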
How do I retrieve the best-matching image for a text prompt?
  • Precompute and cache image embeddings for the dataset using the trained image encoder + projection head.
  • Encode the prompt to a text embedding with the text encoder + projection head.
  • Compute similarity between the text embedding and all cached image embeddings (e.g., cosine similarity).
  • Sort by similarity and return the top-k images.
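
The last two steps can be sketched as follows, with random vectors standing in for the cached image embeddings and the prompt embedding. `top_k_images` is a hypothetical helper written for this example.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k_images(text_emb, image_embs, k=5):
    """Return indices and cosine scores of the k images closest to the prompt."""
    scores = l2_normalize(image_embs) @ l2_normalize(text_emb)  # shape (num_images,)
    order = np.argsort(-scores)[:k]                             # highest score first
    return order, scores[order]

# Stand-in data: random vectors play the role of cached image embeddings.
rng = np.random.default_rng(0)
image_cache = rng.normal(size=(100, 256))
# A prompt embedding constructed to sit close to image 42.
prompt_emb = image_cache[42] + 0.1 * rng.normal(size=256)

indices, scores = top_k_images(prompt_emb, image_cache, k=3)
print(indices[0])  # 42
```

Because the image embeddings are computed once and cached, each new prompt costs only one text encoding and one matrix-vector product.
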
Why use cosine similarity for matching embeddings?
Cosine similarity measures the angle between vectors, focusing on orientation, not magnitude. This is desirable for semantic embeddings. In practice, the dot product of L2-normalized embeddings equals their cosine similarity, and embeddings that pass through LayerNorm have roughly comparable norms, so their dot products approximate a scaled cosine. Many implementations use either approach with temperature scaling.
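
A small numeric check makes the point about magnitude concrete. The vectors are arbitrary illustrative values.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

cos_ab = cosine_similarity(a, b)
cos_scaled = cosine_similarity(10 * a, b)   # same direction, ten times the length
dot_ab = float(a @ b)
dot_scaled = float((10 * a) @ b)            # the raw dot product grows with length
unit_dot = float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))
```

Here `cos_ab` and `cos_scaled` agree, `dot_scaled` is ten times `dot_ab`, and `unit_dot` matches `cos_ab`: once vectors are scaled to unit length, the dot product and cosine similarity are the same quantity.
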
How does OpenAI’s pretrained CLIP differ and how do I use it here?
OpenAI’s CLIP (e.g., ViT-B/32) is trained on massive image–text data, offering stronger generalization. To use it: load the model and its preprocessing transform, encode all dataset images once, tokenize and encode the prompt to get a text embedding, compute similarities to the image embeddings, and return the top-k matches. It typically yields better results than a small, custom-trained model.
