Overview

2 Variational Autoencoders (VAEs)

This chapter introduces autoencoders as self-supervised neural networks that learn compact, meaningful representations by reconstructing their inputs. It explains the encoder–decoder architecture, the notion of a latent space, and why compressing through a bottleneck yields salient features useful for tasks such as dimensionality reduction, denoising, anomaly detection, and feature learning. While traditional autoencoders excel at reconstruction, the chapter motivates the need for truly generative models that can sample new data, setting the stage for Variational Autoencoders (VAEs).

VAEs extend autoencoders by making the latent space probabilistic: the encoder predicts the parameters of a distribution (typically the mean and variance of a Gaussian) for each input, and the decoder reconstructs data from samples drawn from that distribution. Training balances two objectives: a reconstruction loss for fidelity and a regularization term (the KL divergence) that aligns the latent distributions with a simple prior. The reparameterization trick makes this trainable by letting gradients flow through the sampling step. The chapter walks through a practical PyTorch implementation on MNIST, demonstrates evaluation via reconstructions and random sampling, and shows how latent space interpolation produces smooth, semantically coherent transitions that reflect a continuous, well-structured representation.
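As a rough illustration of the sampling step described above, the following sketch draws latent samples with the reparameterization trick. It is a framework-free approximation using only the Python standard library, not the chapter's PyTorch code; the function name `reparameterize` and the values of `mu` and `logvar` are assumptions made for this example.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    z = []
    for m, lv in zip(mu, logvar):
        sigma = math.exp(0.5 * lv)  # logvar = log(sigma^2), so sigma = exp(logvar / 2)
        eps = rng.gauss(0.0, 1.0)   # all randomness lives in eps, not in mu or sigma
        z.append(m + sigma * eps)
    return z


# Hypothetical encoder outputs for a single input with a 2-D latent space.
mu = [0.5, -1.0]
logvar = [0.0, -2.0]

rng = random.Random(0)  # seeded so repeated runs give the same draws
samples = [reparameterize(mu, logvar, rng) for _ in range(5)]
for z in samples:
    print([round(v, 3) for v in z])
```

Because `z` is a deterministic function of `mu`, `logvar`, and the noise `eps`, an automatic differentiation framework can compute gradients with respect to the encoder outputs while treating `eps` as a constant.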

Building on this foundation, the chapter presents β-VAE, which introduces a hyperparameter to weight the regularization term and promote disentangled latent factors. Increasing β typically trades some reconstruction accuracy for more interpretable, factorized representations where individual latent dimensions control distinct attributes, enabling controlled generation and analysis. The discussion underscores the broader implications: from reliable compression and generation to human-interpretable representations, charting a progression from basic autoencoders to VAEs and β-VAEs as powerful tools for image generation and representation learning.
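The β-weighted objective can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes a Bernoulli-style reconstruction loss (binary cross-entropy) and a diagonal Gaussian encoder with a standard normal prior; the function names and the toy values for `x`, `x_hat`, `mu`, and `logvar` are hypothetical and do not come from the chapter's code.

```python
import math


def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy summed over pixels; x and x_hat hold values in [0, 1]."""
    total = 0.0
    for t, p in zip(x, x_hat):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Reconstruction loss plus beta times the KL term; beta = 1 is a standard VAE."""
    return bce(x, x_hat) + beta * kl_to_standard_normal(mu, logvar)


# Toy 4-pixel "image", its reconstruction, and hypothetical encoder outputs.
x = [0.0, 1.0, 1.0, 0.0]
x_hat = [0.1, 0.8, 0.9, 0.2]
mu = [0.5, -1.0]
logvar = [0.0, -2.0]

for beta in (0.5, 1.0, 4.0):
    print(beta, round(beta_vae_loss(x, x_hat, mu, logvar, beta), 4))
```

With the reconstruction held fixed, a larger β makes any departure of the latent distribution from the prior more expensive, which is the pressure that trades reconstruction accuracy for more factorized latents.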

Autoencoder model architecture
Comparison of original MNIST digits (top row) with their autoencoder reconstructions (bottom row)
Normal (Gaussian) distribution. This figure illustrates the bell-shaped curve of a Normal distribution, denoted by N(μ, σ), where μ is the mean (average) of the distribution and σ is the standard deviation, a measure of how spread out the values are around the mean.
VAE model architecture
Randomly generated MNIST digits created by VAE
Latent space interpolation involves selecting two distinct points in the latent space, which represent different latent encodings, and creating a series of intermediate points between them. By feeding interpolated latent vectors into the VAE’s decoder, we can reconstruct the data corresponding to each point. This allows us to observe how the data transitions as we move from one latent representation to another.
[5] Disentangled latent variables in β-VAEs
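The latent space interpolation described in the captions above can be sketched as follows. This is a minimal illustration of linear interpolation between two latent vectors using only the Python standard library; the names `lerp` and `interpolate` and the 2-D vectors `z_a` and `z_b` are assumptions made for this example.

```python
def lerp(z_start, z_end, t):
    """Linear interpolation between two latent vectors for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]


def interpolate(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


# Hypothetical latent encodings of two different inputs.
z_a = [-1.0, 0.5]
z_b = [1.0, -0.5]

path = interpolate(z_a, z_b, steps=5)
for z in path:
    print(z)
```

In practice, each vector in `path` would be passed to the trained VAE decoder, and the decoded images would be viewed side by side to inspect the transition.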

Summary

  • Autoencoders: Neural network architectures used to learn efficient codings of the input data; autoencoders have an encoder-decoder structure where the encoder compresses the input and the decoder reconstructs it, aiming to match the original.
  • Latent space: The hidden, compact representation into which autoencoders compress the input data. It captures the essential features learned from the data, which the decoder relies on to reconstruct inputs or generate new data instances.
  • Variational autoencoders (VAEs): An advanced type of autoencoder that learns the distribution of the data in the latent space. Unlike traditional autoencoders, VAEs are generative models that can generate new instances that resemble the input data by sampling from the learned distribution in the latent space.
  • Reparameterization trick: A method used in VAEs to enable backpropagation through the random sampling step. It expresses each sample as a deterministic function of the encoder’s outputs and an independent noise variable, which decouples the randomness from the model’s parameters.
  • Beta-VAEs (β-VAEs): An extension of the standard VAE that introduces a hyperparameter β (beta) to control the trade-off between accurate reconstruction and keeping the latent distributions close to the prior, often leading to improved disentanglement of features in the latent space.

FAQ

What problem do Variational Autoencoders (VAEs) solve that traditional autoencoders do not? Traditional autoencoders learn a deterministic mapping to a single latent vector and focus on reconstruction. VAEs learn a probabilistic latent space by outputting a distribution (typically Gaussian) per input and sampling from it. This enables them to generate new, diverse data by sampling z from the prior (usually N(0,1)) and decoding it, making VAEs true generative models.
How does the VAE encoder differ from a standard autoencoder encoder? Instead of producing a single latent vector, the VAE encoder outputs the parameters of a distribution over the latent variables, typically the mean (mu) and log-variance (logvar) of a Gaussian. This models q(z|x), allowing sampling of many plausible latent codes for a given input.
What does the VAE decoder model? The decoder models the conditional probability p(x|z). Given a latent sample z drawn from the encoder’s distribution (or from the prior), it generates a reconstruction x’ that should resemble data drawn from the training distribution.
What is the reparameterization trick and why is it necessary? Backpropagation cannot pass through a raw sampling operation. The reparameterization trick makes sampling differentiable by expressing z as z = mu + sigma * epsilon, where epsilon is sampled from a standard normal distribution. This isolates randomness from the learnable parameters, enabling gradient-based training.
What are the components of the VAE loss function? The total loss combines: (1) Reconstruction loss (e.g., BCE for binary-like data or MSE for continuous data) to ensure output resembles input, and (2) a regularization term (KL divergence) that encourages the learned latent distribution q(z|x) to be close to a simple prior, typically N(0,1). Total Loss = Reconstruction Loss + KL Divergence.
Why is a standard normal prior (N(0,1)) used for the latent space? It regularizes the latent space, promoting structure, continuity (nearby z decode to similar outputs), and completeness (any z sampled from the prior decodes to a plausible output). It also provides mathematical convenience: the KL divergence between Gaussians has a closed form, which simplifies training and improves stability.
When should I use Binary Cross-Entropy (BCE) versus Mean Squared Error (MSE) for reconstruction? Use BCE when outputs are in [0,1] and can be treated as Bernoulli-like (e.g., normalized MNIST with a final sigmoid). Use MSE for continuous-valued targets (e.g., natural images modeled as real intensities). Match the loss to your data’s distributional assumptions.
What is latent space interpolation and what does it show? Pick two latent vectors, linearly interpolate between them, and decode the intermediate points. The resulting sequence should show smooth transitions in the generated outputs, demonstrating that the latent space is continuous and well-structured.
What is β-VAE and how does the β parameter affect results? β-VAE scales the KL term in the loss: Loss = Reconstruction + β × KL. β = 1 recovers a standard VAE. β > 1 emphasizes disentanglement and latent regularization (often more interpretable factors) at the cost of reconstruction fidelity. β < 1 favors sharper reconstructions but can reduce disentanglement.
How do you evaluate a trained VAE? Assess (1) reconstruction quality by comparing inputs to reconstructions, and (2) generative quality by sampling z from N(0,1) and decoding to produce new images. Good VAEs reconstruct well and generate diverse, plausible samples that reflect the training distribution.
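For reference, the loss described in the answers above can be written out explicitly. The notation here is an assumption made for illustration: d is the number of latent dimensions, μ_j and σ_j are the encoder's mean and standard deviation for dimension j, and N(0, I) is the multivariate form of the N(0,1) prior used in the text. Setting β = 1 gives the standard VAE loss.

```latex
\mathcal{L}(x) =
  \underbrace{-\,\mathbb{E}_{q(z \mid x)}\bigl[\log p(x \mid z)\bigr]}_{\text{reconstruction loss}}
  \;+\; \beta \, D_{\mathrm{KL}}\bigl(q(z \mid x) \,\|\, \mathcal{N}(0, I)\bigr)

D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\bigr)
  = -\frac{1}{2} \sum_{j=1}^{d} \bigl(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\bigr)
```

The second line is the closed form that makes the Gaussian prior convenient: the KL term is computed directly from the encoder's outputs, with no sampling required.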