What Is a Generative Model?
You have a pile of images — handwritten digits, faces, product photos. You want a model that can generate new images that look like they came from the same distribution. More precisely, you want to learn p(x), the probability distribution over images.
A natural assumption: each observed image x was "caused" by some hidden (latent) variables z — abstract features like stroke angle, lighting direction, head tilt. The generative story is: sample z from a simple prior p(z), then generate x from p(x|z).
So the marginal likelihood is p(x) = ∫ p(x|z) · p(z) dz. To train the model we'd maximize log p(x) over our data. But this integral is intractable: with a neural-network decoder there is no closed form, and we can't integrate over every possible configuration of z.
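To make the generative story concrete, here is a minimal sketch in PyTorch. The layer sizes, the flattened 28×28 image assumption, and the `decoder` network are purely illustrative stand-ins for p(x|z), not a prescribed architecture.

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 16, 784  # e.g. flattened 28x28 digits (illustrative sizes)

# Hypothetical decoder standing in for p(x|z): maps latent codes to pixel intensities.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Sigmoid(),
)

z = torch.randn(8, latent_dim)   # sample z from the simple prior p(z) = N(0, I)
x_mean = decoder(z)              # parameters of p(x|z), e.g. per-pixel Bernoulli means
print(x_mean.shape)              # torch.Size([8, 784])

# Evaluating p(x) for a given image would mean integrating p(x|z)p(z) over every
# possible z -- exactly the integral the text calls intractable.
```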
Three Intractable Things
We're stuck on three fronts at once. The integral ∫ p(x|z)p(z) dz has no closed form, so the marginal likelihood p(x) is intractable; and the true posterior p(z|x) = p(x|z)p(z)/p(x) is intractable because its denominator is that same p(x). They form a circular dependency: you'd need one to compute the others.
The prior p(z) is a simple Gaussian. The true posterior p(z|x) is complex and multimodal.
The VAE's key insight: Instead of computing the true posterior, we'll learn an approximation q(z|x) — a neural network (the encoder) that outputs a simple Gaussian for each input x. Then we derive a loss we can compute.
Architecture
The encoder outputs parameters (μ, σ) of a Gaussian q(z|x). We sample z = μ + σ · ε where ε ~ N(0,I) — the reparameterization trick — making sampling differentiable. The decoder reconstructs x̂ from z.
Shifting by μ and scaling by σ turns ε ~ N(0,I) into z ~ N(μ,σ²). Gradients flow through μ and σ, not through the random draw of ε.
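A minimal sketch of the encoder plus the reparameterization trick, assuming a diagonal-Gaussian q(z|x). The class name, layer sizes, and the log-variance parameterization are illustrative choices, not the only way to set this up.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q(z|x): outputs the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, image_dim=784, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I).
    # The randomness lives in eps, so gradients flow through mu and sigma.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

encoder = Encoder()
x = torch.rand(8, 784)           # dummy batch of flattened images
mu, logvar = encoder(x)
z = reparameterize(mu, logvar)   # a differentiable sample from q(z|x)
```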
Deriving the Evidence Lower Bound
Now the main event. We want to maximize log p(x) but can't compute it, so we derive something we can optimize. Multiply and divide by q(z|x) inside the integral, then apply Jensen's inequality:
log p(x) = log ∫ p(x,z) dz
= log ∫ [p(x,z)/q(z|x)] · q(z|x) dz
= log 𝔼q[p(x,z)/q(z|x)]
≥ 𝔼q[log p(x,z) − log q(z|x)]
= 𝔼q[log p(x|z)] − KL(q(z|x) ‖ p(z))
The last line is the evidence lower bound (ELBO): a reconstruction term plus a KL term, both of which we can estimate.
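As a sanity check on the bound (not part of the original derivation), here is a toy model where log p(x) is known in closed form; the specific distributions and the deliberately imperfect q are chosen only for illustration.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
x = torch.tensor(1.5)                      # one observed data point

# Toy model where everything is computable: p(z) = N(0,1), p(x|z) = N(z,1), so p(x) = N(0,2).
log_px = Normal(0.0, 2.0 ** 0.5).log_prob(x)

# A deliberately imperfect approximate posterior q(z|x) = N(x, 1).
q = Normal(x, 1.0)
z = q.sample((100_000,))
elbo = (Normal(z, 1.0).log_prob(x)         # log p(x|z)
        + Normal(0.0, 1.0).log_prob(z)     # log p(z)
        - q.log_prob(z)).mean()            # minus log q(z|x), averaged over samples
print(log_px.item(), elbo.item())          # the ELBO estimate sits below log p(x)
```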
With Gaussian assumptions, q(z|x) = N(μ, diag(σ²)) and p(z) = N(0, I), the KL term has a beautiful closed form: KL(q ‖ p) = ½ Σ (μ² + σ² − log σ² − 1), summed over latent dimensions. No sampling needed.
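Here is that closed form as a small function, a sketch assuming the encoder outputs log σ² (a common but not mandatory parameterization).

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) = 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1)

# Matching the prior gives KL = 0; moving away from it increases the penalty.
print(kl_to_standard_normal(torch.tensor([[0.0, 0.0]]),
                            torch.tensor([[0.0, 0.0]])))   # tensor([0.])
print(kl_to_standard_normal(torch.tensor([[2.0, 0.0]]),
                            torch.tensor([[-1.0, 0.0]])))  # roughly tensor([2.1839])
```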
The KL Term Visualized
The KL divergence measures how different q(z|x) = N(μ, σ²) is from the prior p(z) = N(0,1).
When μ=0 and σ=1, the encoder is the prior — KL=0. But then z carries no information about x! Training finds the sweet spot.
Two Forces in Tension
The ELBO is a tug-of-war between reconstruction (be accurate) and regularization (be simple). In β-VAE, we scale the KL term by β.
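A sketch of how the two forces combine into a single loss, with β as the knob. The Bernoulli/BCE reconstruction term and the function name are illustrative assumptions; β = 1 recovers the standard VAE objective.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction force: how closely the decoder reproduces x (here Bernoulli/BCE).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.shape[0]
    # Regularization force: closed-form KL from q(z|x) to the N(0, I) prior.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1).mean()
    # beta = 1 is the standard ELBO; beta > 1 leans harder on "be simple".
    return recon + beta * kl
```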
What the ELBO Shapes
The KL term ensures the latent space is smooth and continuous. The reconstruction term ensures each region maps to meaningful outputs.
In the latent-space plot, colored clusters are the per-class encodings and the dashed circle marks the prior N(0,I); higher β compresses the clusters toward the center.
Loss Curves Over Time
Reconstruction loss drops quickly. The KL term often stays near zero early in training (the encoder hugs the prior, risking posterior collapse), then rises as the encoder starts putting information about x into z.
Why Not Something Else?
Each of these is a reason the ELBO works as a training objective:
It can be estimated with a single Monte Carlo sample, so no integrals are needed.
The reparameterization trick lets gradients flow through the sampling step.
It is a rigorous lower bound on log p(x) with a known gap (the KL to the true posterior).
It splits into two interpretable terms: reconstruction fidelity versus latent regularity.
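Putting those properties together, here is a minimal end-to-end training step. Every concrete choice (layer shapes, optimizer, learning rate, the dummy batch) is a placeholder; the loss is a single-sample Monte Carlo estimate of the negative ELBO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
image_dim, latent_dim = 784, 16

# Tiny illustrative networks; any encoder/decoder pair would do.
enc = nn.Linear(image_dim, 2 * latent_dim)                # outputs [mu, logvar]
dec = nn.Sequential(nn.Linear(latent_dim, image_dim), nn.Sigmoid())
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(32, image_dim)                             # stand-in batch of images

# One step = one single-sample Monte Carlo estimate of the negative ELBO. No integrals.
mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
x_recon = dec(z)

recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.shape[0]
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1).mean()
loss = recon + kl                                         # negative ELBO

opt.zero_grad()
loss.backward()   # gradients flow through the sample via mu and logvar
opt.step()
```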
The ELBO comes from variational inference, predating deep learning by decades. Kingma & Welling (2013) made it practical with neural networks and the reparameterization trick.
The Core Idea
We can't optimize log p(x) directly because the integral over z is intractable. The ELBO provides a tractable lower bound: a reconstruction term plus a KL regularizer. The gap between the ELBO and log p(x) is the KL between our approximate posterior and the true one, so maximizing the ELBO both pushes log p(x) up and pulls q(z|x) toward the true posterior.
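Written as an identity, a standard result from variational inference: log p(x) = ELBO(x) + KL(q(z|x) ‖ p(z|x)). Because the KL term is never negative, the ELBO can never exceed log p(x), and it equals log p(x) exactly when q(z|x) matches the true posterior.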