What Is a Generative Model?
You have a pile of images — handwritten digits, faces, product photos. You want a model that can generate new images that look like they came from the same distribution. More precisely, you want to learn p(x), the probability distribution over images.
A natural assumption: each observed image x was "caused" by some hidden (latent) variables z — abstract features like stroke angle, lighting direction, head tilt. The generative story is: sample z from a simple prior p(z), then generate x from p(x|z).
So the marginal likelihood is p(x) = ∫ p(x|z) · p(z) dz. To train the model we'd maximize log p(x) over our data. But this integral is intractable: with a neural-network decoder there is no closed form, and we can't integrate over every possible configuration of z.
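To make the generative story concrete, here is a minimal sketch in PyTorch. The layer sizes, the flattened 28×28 image assumption, and the `decoder` network are purely illustrative stand-ins for p(x|z), not a prescribed architecture.

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 16, 784  # e.g. flattened 28x28 digits (illustrative sizes)

# Hypothetical decoder standing in for p(x|z): maps latent codes to pixel intensities.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Sigmoid(),
)

z = torch.randn(8, latent_dim)   # sample z from the simple prior p(z) = N(0, I)
x_mean = decoder(z)              # parameters of p(x|z), e.g. per-pixel Bernoulli means
print(x_mean.shape)              # torch.Size([8, 784])

# Evaluating p(x) for a given image would mean integrating p(x|z)p(z) over every
# possible z -- exactly the integral the text calls intractable.
```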
Three Intractable Things
We're stuck on three fronts at once. The integral ∫ p(x|z)p(z) dz has no closed form, so the marginal likelihood p(x) is intractable; and the true posterior p(z|x) = p(x|z)p(z)/p(x) is intractable because its denominator is that same p(x). They form a circular dependency: you'd need one to compute the others.
The prior p(z) is a simple Gaussian. The true posterior p(z|x) is complex and multimodal.
The VAE's key insight: Instead of computing the true posterior, we'll learn an approximation q(z|x) — a neural network (the encoder) that outputs a simple Gaussian for each input x. Then we derive a loss we can compute.
Architecture
The encoder outputs parameters (μ, σ) of a Gaussian q(z|x). We sample z = μ + σ · ε where ε ~ N(0,I) — the reparameterization trick — making sampling differentiable. The decoder reconstructs x̂ from z.
Shifting by μ and scaling by σ turns ε ~ N(0,I) into z ~ N(μ,σ²). Gradients flow through μ and σ, not through the random draw of ε.
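A minimal sketch of the encoder plus the reparameterization trick, assuming a diagonal-Gaussian q(z|x). The class name, layer sizes, and the log-variance parameterization are illustrative choices, not the only way to set this up.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q(z|x): outputs the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, image_dim=784, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I).
    # The randomness lives in eps, so gradients flow through mu and sigma.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

encoder = Encoder()
x = torch.rand(8, 784)           # dummy batch of flattened images
mu, logvar = encoder(x)
z = reparameterize(mu, logvar)   # a differentiable sample from q(z|x)
```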
Deriving the Evidence Lower Bound
Now the main event. We want to maximize log p(x) but can't compute it, so we derive something we can optimize. Multiply and divide by q(z|x) inside the integral, then apply Jensen's inequality:
log p(x) = log ∫ p(x,z) dz
= log ∫ [p(x,z)/q(z|x)] · q(z|x) dz
= log 𝔼q[p(x,z)/q(z|x)]
≥ 𝔼q[log p(x,z) − log q(z|x)]
= 𝔼q[log p(x|z)] − KL(q(z|x) ‖ p(z))
The last line is the evidence lower bound (ELBO): a reconstruction term plus a KL term, both of which we can estimate.
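As a sanity check on the bound (not part of the original derivation), here is a toy model where log p(x) is known in closed form; the specific distributions and the deliberately imperfect q are chosen only for illustration.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
x = torch.tensor(1.5)                      # one observed data point

# Toy model where everything is computable: p(z) = N(0,1), p(x|z) = N(z,1), so p(x) = N(0,2).
log_px = Normal(0.0, 2.0 ** 0.5).log_prob(x)

# A deliberately imperfect approximate posterior q(z|x) = N(x, 1).
q = Normal(x, 1.0)
z = q.sample((100_000,))
elbo = (Normal(z, 1.0).log_prob(x)         # log p(x|z)
        + Normal(0.0, 1.0).log_prob(z)     # log p(z)
        - q.log_prob(z)).mean()            # minus log q(z|x), averaged over samples
print(log_px.item(), elbo.item())          # the ELBO estimate sits below log p(x)
```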
With Gaussian assumptions, q(z|x) = N(μ, diag(σ²)) and p(z) = N(0, I), the KL term has a beautiful closed form: KL(q ‖ p) = ½ Σ (μ² + σ² − log σ² − 1), summed over latent dimensions. No sampling needed.
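Here is that closed form as a small function, a sketch assuming the encoder outputs log σ² (a common but not mandatory parameterization).

```python
import torch

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) = 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1)

# Matching the prior gives KL = 0; moving away from it increases the penalty.
print(kl_to_standard_normal(torch.tensor([[0.0, 0.0]]),
                            torch.tensor([[0.0, 0.0]])))   # tensor([0.])
print(kl_to_standard_normal(torch.tensor([[2.0, 0.0]]),
                            torch.tensor([[-1.0, 0.0]])))  # roughly tensor([2.1839])
```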
The KL Term Visualized
The KL divergence measures how different q(z|x) = N(μ, σ²) is from the prior p(z) = N(0,1).
When μ=0 and σ=1, the encoder is the prior — KL=0. But then z carries no information about x! Training finds the sweet spot.
Two Forces in Tension
The ELBO is a tug-of-war between reconstruction (be accurate) and regularization (be simple). In β-VAE, we scale the KL term by β.
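A sketch of how the two forces combine into a single loss, with β as the knob. The Bernoulli/BCE reconstruction term and the function name are illustrative assumptions; β = 1 recovers the standard VAE objective.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction force: how closely the decoder reproduces x (here Bernoulli/BCE).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.shape[0]
    # Regularization force: closed-form KL from q(z|x) to the N(0, I) prior.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1).mean()
    # beta = 1 is the standard ELBO; beta > 1 leans harder on "be simple".
    return recon + beta * kl
```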
What the ELBO Shapes
The KL term ensures the latent space is smooth and continuous. The reconstruction term ensures each region maps to meaningful outputs.
In the latent-space plot, colored clusters are the per-class encodings and the dashed circle marks the prior N(0,I); higher β compresses the clusters toward the center.
Loss Curves Over Time
Reconstruction loss drops quickly. The KL term often stays near zero early in training (the encoder hugs the prior, risking posterior collapse), then rises as the encoder starts putting information about x into z.
Why Not Something Else?
Each of these is a reason the ELBO works as a training objective:
It can be estimated with a single Monte Carlo sample, so no integrals are needed.
The reparameterization trick lets gradients flow through the sampling step.
It is a rigorous lower bound on log p(x) with a known gap (the KL to the true posterior).
It splits into two interpretable terms: reconstruction fidelity versus latent regularity.
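Putting those properties together, here is a minimal end-to-end training step. Every concrete choice (layer shapes, optimizer, learning rate, the dummy batch) is a placeholder; the loss is a single-sample Monte Carlo estimate of the negative ELBO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
image_dim, latent_dim = 784, 16

# Tiny illustrative networks; any encoder/decoder pair would do.
enc = nn.Linear(image_dim, 2 * latent_dim)                # outputs [mu, logvar]
dec = nn.Sequential(nn.Linear(latent_dim, image_dim), nn.Sigmoid())
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(32, image_dim)                             # stand-in batch of images

# One step = one single-sample Monte Carlo estimate of the negative ELBO. No integrals.
mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
x_recon = dec(z)

recon = F.binary_cross_entropy(x_recon, x, reduction="sum") / x.shape[0]
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=-1).mean()
loss = recon + kl                                         # negative ELBO

opt.zero_grad()
loss.backward()   # gradients flow through the sample via mu and logvar
opt.step()
```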
The ELBO comes from variational inference, predating deep learning by decades. Kingma & Welling (2013) made it practical with neural networks and the reparameterization trick.
The Core Idea
We can't optimize log p(x) directly because the integral over z is intractable. The ELBO provides a tractable lower bound: a reconstruction term plus a KL regularizer. The gap between the ELBO and log p(x) is the KL between our approximate posterior and the true one, so maximizing the ELBO both pushes log p(x) up and pulls q(z|x) toward the true posterior.
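Written as an identity, a standard result from variational inference: log p(x) = ELBO(x) + KL(q(z|x) ‖ p(z|x)). Because the KL term is never negative, the ELBO can never exceed log p(x), and it equals log p(x) exactly when q(z|x) matches the true posterior.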