This project implements a Variational Autoencoder (VAE) from scratch for educational purposes and experimentation. The focus is on understanding the fundamentals of VAEs, optimizing the training process, and exploring architectural design choices.
- Custom-built VAE architecture
- Image reconstruction and generation capabilities
- Experiment tracking with Weights & Biases (wandb)
- Trained on 128×128 pixel images
- Approximately 12 million parameters
- Dataset of ~15,000 images
The Evidence Lower Bound (ELBO) for a Variational Autoencoder is derived as follows.

Given a data point $x$, its marginal log-likelihood is

$$\log p(x) = \log \int p(x|z)\, p(z)\, dz$$

Since this integral is intractable, we approximate the true posterior with a variational distribution $q(z|x)$ and apply Jensen's inequality:

$$\log p(x) \geq \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x|z)\, p(z)}{q(z|x)}\right] = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) \parallel p(z))$$

This inequality gives us the Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) \parallel p(z))$$
- **Reconstruction Term**: $\mathbb{E}_{q(z|x)} [\log p(x|z)]$ represents the expected log-likelihood of the data given the latent variable, encouraging accurate reconstruction.
- **Regularization Term**: $-\text{KL}(q(z|x) \parallel p(z))$ is the Kullback–Leibler divergence that regularizes the latent space by minimizing the difference between the approximate posterior $q(z|x)$ and the prior $p(z)$ (for Gaussian distributions this term has a closed form, sketched below).
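When both $q(z|x)$ and $p(z)$ are diagonal Gaussians, the KL term can be computed in closed form. Below is a minimal sketch of that computation in PyTorch; the function name and tensor layout are illustrative and not taken from this repository's code.

```python
import torch

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || p(z)) for q = N(mu, diag(exp(logvar))) and p = N(0, I).

    mu, logvar: shape (batch, latent_dim); returns one KL value per sample.
    """
    # Closed form: -0.5 * sum(1 + logvar - mu^2 - exp(logvar)) over latent dims
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```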
The VAE consists of two main components, an encoder and a decoder, linked through a probabilistic latent space (a minimal sketch follows the list):
- Encoder: Transforms input data into a probabilistic latent representation
- Decoder: Reconstructs the original data from samples in the latent space
- Latent Space: Gaussian distribution with learned mean and variance parameters
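The sketch below illustrates this structure in PyTorch: the encoder predicts a mean and log-variance, the reparameterization trick draws a differentiable sample, and the decoder maps it back to image space. Layer sizes, channel counts, and the latent dimension are placeholders, not the actual ~12M-parameter architecture used here.

```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    """Illustrative VAE; shapes assume 3x128x128 inputs as described above."""

    def __init__(self, in_ch: int = 3, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1),  # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),     # 64 -> 32
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 32 * 32, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(64 * 32 * 32, latent_dim)  # log-variance of q(z|x)
        self.fc_dec = nn.Linear(latent_dim, 64 * 32 * 32)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (64, 32, 32)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),     # 32 -> 64
            nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1),  # 64 -> 128
            nn.Sigmoid(),
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and logvar
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(self.fc_dec(z)), mu, logvar
```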
The training process minimizes a loss based on the negative ELBO, combining the following terms (a sketch of the combined loss follows the list):
- LPIPS loss: Used for perceptual similarity, capturing human-perceived differences more effectively than pixel-wise losses alone
- L1 loss: Used for reconstruction, as its gradient does not shrink for small errors the way the MSE gradient does, giving a stronger training signal
- Beta factor: Set to 0.02 to scale the KL divergence, ensuring balanced regularization without excessive compression of the latent space
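A minimal sketch of how these terms might be combined is shown below. It assumes the widely used `lpips` package (the repo's actual perceptual-loss implementation may differ), images in `[0, 1]`, and a Gaussian latent with `mu`/`logvar` as produced by the encoder.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; assumed here, not necessarily this repo's dependency

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance network
BETA = 0.02  # KL weight, as described above

def vae_loss(recon, x, mu, logvar):
    """Combined objective: L1 reconstruction + LPIPS perceptual term + beta-weighted KL."""
    l1 = F.l1_loss(recon, x)
    # LPIPS expects inputs roughly in [-1, 1]; rescale from [0, 1]
    perceptual = lpips_fn(recon * 2 - 1, x * 2 - 1).mean()
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return l1 + perceptual + BETA * kl
```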
Below are sample results showing original images alongside their reconstructions: