How entropy becomes a negative log-probability in the Q-value, explained from first principles
Soft Actor-Critic (SAC) is a state-of-the-art reinforcement learning algorithm that bridges two worlds: stochastic policy optimization (like PPO) and off-policy Q-learning (like DDPG/TD3). The key insight that makes SAC special is entropy regularization.
Instead of just maximizing reward, SAC maximizes reward + entropy. This means the agent is rewarded for being uncertain, for keeping its options open, for exploring. The entropy term literally adds to the reward signal.
SAC simultaneously learns three things: a stochastic policy π_θ and two Q-functions, Q_φ₁ and Q_φ₂ (the clipped double-Q pair discussed below).
Before we can understand how entropy becomes part of the Q-value, we need to understand what entropy actually is and why we care about it.
Entropy measures how "random" or "uncertain" a probability distribution is. Think of it as measuring surprise: if you're pretty sure what's going to happen, there's no surprise (low entropy). If anything could happen, there's maximum surprise (high entropy).
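Formally, for a distribution P over outcomes x, the entropy is the expected negative log-probability:

$$
H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right]
$$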
In RL, we care about the entropy of our policy π(·|s) — the distribution over actions given a state. For a Gaussian policy (which SAC uses), the entropy has a nice closed form:
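For a k-dimensional Gaussian with mean μ and diagonal standard deviations σ (the distribution the policy network outputs before the tanh squashing covered later), the entropy depends only on the standard deviations:

$$
H\!\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\right) = \sum_{i=1}^{k} \left(\tfrac{1}{2} + \tfrac{1}{2}\log(2\pi) + \log \sigma_i\right)
$$

Widening the distribution (larger σ) is the only way to raise entropy; the mean plays no role.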
By adding entropy to the reward, we're telling the agent: "don't just find a good action, find all the good actions." This prevents premature convergence to suboptimal deterministic policies and encourages robust exploration.
Here's where it gets interesting. In standard RL, we maximize expected return. In entropy-regularized RL, we maximize expected return plus entropy bonuses:
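Writing R(s, a, s') for the reward, γ for the discount, α > 0 for the temperature that trades reward against entropy, and τ for a trajectory generated by running π, the entropy-regularized objective is:

$$
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(R(s_t, a_t, s_{t+1}) + \alpha\, H\big(\pi(\cdot|s_t)\big)\Big)\right]
$$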
The value function now includes entropy bonuses from every timestep:
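Conditioning that same sum on the start state gives the soft value function:

$$
V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(R(s_t, a_t, s_{t+1}) + \alpha\, H\big(\pi(\cdot|s_t)\big)\Big) \,\middle|\, s_0 = s\right]
$$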
The Q-function includes entropy bonuses from every timestep except the first. This is the crucial design choice:
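Note that the entropy sum starts at t = 1, not t = 0:

$$
Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^{t} H\big(\pi(\cdot|s_t)\big) \,\middle|\, s_0 = s,\, a_0 = a\right]
$$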
Q(s, a) measures the value of taking specific action a in state s. Since we've already committed to action a, there's no uncertainty about that first action — hence no entropy bonus for it. The entropy bonuses only apply to future action choices.
This is where the magic happens. The relationship between V and Q reveals why entropy shows up as a negative log probability in the Q-target:
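Taking the expectation of Q over a ~ π(·|s) and adding back the first-step entropy bonus:

$$
V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s,a)\right] + \alpha\, H\big(\pi(\cdot|s)\big) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s,a) - \alpha \log \pi(a|s)\right]
$$

The second equality simply substitutes the definition of entropy, H(π(·|s)) = E_{a~π}[-log π(a|s)].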
This last equation is the key insight. It says: the value of a state equals the expected [Q-value minus α times log-probability of the action].
Now let's derive the Bellman equation for the soft Q-function and see exactly how entropy transforms into -log π(a|s).
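Start from the usual recursive form (the Q-value equals the reward plus the discounted value of the next state) and substitute the expression for V^π(s') we just derived, with P the environment's transition distribution:

$$
Q^{\pi}(s,a) = \mathbb{E}_{s' \sim P}\left[R(s,a,s') + \gamma\, V^{\pi}(s')\right] = \mathbb{E}_{s' \sim P,\, a' \sim \pi}\left[R(s,a,s') + \gamma\Big(Q^{\pi}(s',a') - \alpha \log \pi(a'|s')\Big)\right]
$$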
The entropy bonus αH(π(·|s')) has become -α log π(a'|s') inside the Q-target; its expectation over a' ~ π recovers exactly αH. This is the fundamental transformation: maximizing entropy = subtracting log-probabilities from Q-values.
Since we're taking an expectation over s' (from replay buffer) and a' (from current policy), we can approximate with samples:
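Crucially, the next action ã' is drawn fresh from the current policy at s', not read from the replay buffer:

$$
Q^{\pi}(s,a) \approx r + \gamma\Big(Q^{\pi}(s', \tilde{a}') - \alpha \log \pi(\tilde{a}'|s')\Big), \qquad \tilde{a}' \sim \pi(\cdot|s')
$$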
SAC uses two Q-networks (the "clipped double-Q" trick from TD3) to reduce overestimation bias. Both are trained to minimize MSBE (mean squared Bellman error).
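With d the done flag and φ_targ,j the target-network parameters, the loss for each Q-network and its shared target are:

$$
L(\phi_i, \mathcal{D}) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\Big(Q_{\phi_i}(s,a) - y(r,s',d)\Big)^{2}\right]
$$

$$
y(r,s',d) = r + \gamma (1-d)\left(\min_{j=1,2} Q_{\phi_{\text{targ},j}}(s', \tilde{a}') - \alpha \log \pi_{\theta}(\tilde{a}'|s')\right), \qquad \tilde{a}' \sim \pi_{\theta}(\cdot|s')
$$

The structure is nearly identical to TD3's target; the differences are summarized below.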
| Aspect | TD3 | SAC |
|---|---|---|
| Policy type | Deterministic μ(s) | Stochastic π(a\|s) |
| Next action in target | From target policy μ'(s') + noise | From current policy π(·\|s') |
| Entropy term | None (uses explicit noise) | -α log π(a'\|s') |
| Double-Q | Yes (min of two) | Yes (min of two) |
| Target networks | Yes (Polyak averaged) | Yes (Polyak averaged) |
The policy should maximize the soft value function Vπ(s), that is, the expected Q-value plus an entropy bonus, or equivalently the expected [Q-value minus α times log-probability]:
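Since states come from the replay buffer 𝒟 but actions are sampled from the current policy, the policy objective is:

$$
\max_{\theta} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_{\theta}}\left[Q^{\pi_{\theta}}(s,a) - \alpha \log \pi_{\theta}(a|s)\right]
$$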
To backpropagate through the sampling process, SAC uses the reparameterization trick. Instead of sampling a ~ π(·|s), we sample noise ξ and compute the action deterministically from it:
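With μ_θ(s) and σ_θ(s) the mean and standard deviation produced by the policy network:

$$
\tilde{a}_{\theta}(s, \xi) = \tanh\!\big(\mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi\big), \qquad \xi \sim \mathcal{N}(0, I)
$$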
By writing a = f(s, θ, ξ) where ξ is independent noise, we can backpropagate gradients through the sampling: ∂a/∂θ is well-defined, even though a is stochastic. This is the same trick used in VAEs.
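Substituting the reparameterized action into the policy objective, and using the minimum of the two Q-networks just as in the critic targets, gives the quantity SAC actually ascends:

$$
\max_{\theta} \; \mathbb{E}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}(0,I)}\left[\min_{j=1,2} Q_{\phi_j}\big(s, \tilde{a}_{\theta}(s,\xi)\big) - \alpha \log \pi_{\theta}\big(\tilde{a}_{\theta}(s,\xi)\,\big|\,s\big)\right]
$$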
The tanh squashing changes the distribution, so we need to account for the Jacobian when computing log π(a|s):
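Writing u = μ_θ(s) + σ_θ(s) ⊙ ξ for the pre-squash sample (so a = tanh(u)) and p(u|s) for its Gaussian density, the change-of-variables formula gives:

$$
\log \pi_{\theta}(a|s) = \log p(u|s) - \sum_{i=1}^{k} \log\!\big(1 - \tanh^{2}(u_i)\big)
$$

Each log(1 - tanh²(uᵢ)) term is the log-Jacobian of the squashing; implementations usually compute it as 2(log 2 - uᵢ - softplus(-2uᵢ)) for numerical stability.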
Following Spinning Up's conventions, here's every symbol used in SAC:
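| Symbol | Meaning |
|---|---|
| s, a, r, s', d | state, action, reward, next state, done flag |
| π_θ | stochastic policy with parameters θ |
| μ_θ(s), σ_θ(s) | mean and standard deviation output by the policy network |
| ξ | standard normal noise for the reparameterization trick |
| ã_θ(s) | action sampled from the policy via reparameterization |
| Q_φ₁, Q_φ₂ | the two Q-networks, with parameters φ₁ and φ₂ |
| φ_targ,1, φ_targ,2 | Polyak-averaged target copies of φ₁ and φ₂ |
| 𝒟 | replay buffer |
| B | mini-batch sampled from 𝒟 |
| γ | discount factor |
| α | entropy temperature (trade-off between reward and entropy) |
| ρ | Polyak averaging coefficient for target updates |
| H | entropy |
| y | regression target for the Q-functions |

With those symbols fixed, the full SAC pseudocode: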
Initialize:
Policy parameters θ
Q-function parameters φ₁, φ₂
Target parameters φ_targ,1 ← φ₁, φ_targ,2 ← φ₂
Empty replay buffer 𝒟
Loop:
// Collect experience
Observe s, sample a ~ π_θ(·|s)
Execute a, observe s', r, d
Store (s, a, r, s', d) in 𝒟
// Update (when it's time)
Sample batch B from 𝒟
// Compute targets:
ã' ~ π_θ(·|s')
y = r + γ(1-d)(min_j Q_φtarg,j(s', ã') - α log π_θ(ã'|s'))
// Update Q-functions (one gradient descent step each):
∇_φᵢ (1/|B|) Σ (Q_φᵢ(s,a) - y)² for i=1,2
// Update policy (one gradient ascent step):
∇_θ (1/|B|) Σ (min_j Q_φⱼ(s, ã_θ(s)) - α log π_θ(ã_θ(s)|s))
where ã_θ(s) = tanh(μ_θ(s) + σ_θ(s) ⊙ ξ), ξ ~ 𝒩(0,I)
// Update target networks:
φ_targ,i ← ρ φ_targ,i + (1-ρ) φᵢ for i=1,2
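To make the pseudocode concrete, here is a minimal PyTorch-style sketch of a single update step. Every name in it is an illustrative assumption rather than part of any particular library: `actor` is a module whose forward pass returns a reparameterized action and its log-probability, `q1`/`q2` are the critics with Polyak-averaged copies `q1_targ`/`q2_targ`, `q_optimizer` covers both critics' parameters, and `batch` is a dict of tensors.

```python
# Illustrative sketch of one SAC update step (not a complete implementation).
# All module and variable names are assumptions, not library APIs.
import torch
import torch.nn.functional as F

def sac_update(actor, q1, q2, q1_targ, q2_targ,
               q_optimizer, pi_optimizer, batch,
               alpha=0.2, gamma=0.99, polyak=0.995):
    s, a, r, s2, d = batch["s"], batch["a"], batch["r"], batch["s2"], batch["d"]

    # --- Q-function update: regress both critics onto the soft Bellman target ---
    with torch.no_grad():
        a2, logp_a2 = actor(s2)                      # a' ~ pi_theta(.|s') and log pi(a'|s') (assumed actor output)
        q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1 - d) * (q_targ - alpha * logp_a2)   # entropy enters as -alpha * log pi

    loss_q = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_optimizer.zero_grad()
    loss_q.backward()
    q_optimizer.step()

    # --- Policy update: ascend E[min_j Q(s, a_theta) - alpha * log pi] ---
    for p in list(q1.parameters()) + list(q2.parameters()):
        p.requires_grad_(False)                      # freeze critics while updating the actor
    a_pi, logp_pi = actor(s)                         # reparameterized, differentiable sample
    q_pi = torch.min(q1(s, a_pi), q2(s, a_pi))
    loss_pi = (alpha * logp_pi - q_pi).mean()        # minimizing this = ascending the objective
    pi_optimizer.zero_grad()
    loss_pi.backward()
    pi_optimizer.step()
    for p in list(q1.parameters()) + list(q2.parameters()):
        p.requires_grad_(True)

    # --- Target network update: Polyak averaging ---
    with torch.no_grad():
        for targ, src in [(q1_targ, q1), (q2_targ, q2)]:
            for p_targ, p in zip(targ.parameters(), src.parameters()):
                p_targ.mul_(polyak).add_((1 - polyak) * p)
```

Note that the critics are frozen during the policy step so that the policy gradient flows only through the sampled action, not into the Q-network parameters.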
Every time you see -α log π(a|s) in SAC, remember: this IS the entropy regularization. The expectation of -log π IS the entropy H(π). By subtracting log-probabilities from Q-values, we give low-probability actions a larger bonus, which keeps the policy stochastic and encourages exploration.