Soft Actor-Critic

How entropy becomes negative Q-value, explained from first principles

Based on OpenAI Spinning Up Documentation

01 The Big Picture

Soft Actor-Critic (SAC) is a state-of-the-art reinforcement learning algorithm that bridges two worlds: stochastic policy optimization (like PPO) and off-policy Q-learning (like DDPG/TD3). The key insight that makes SAC special is entropy regularization.

The Core Idea

Instead of just maximizing reward, SAC maximizes reward + entropy. This means the agent is rewarded for being uncertain, for keeping its options open, for exploring. The entropy term literally adds to the reward signal.

What SAC Learns

SAC simultaneously learns three things:

1. A stochastic policy πθ: given a state s, it outputs a probability distribution over actions (not just a single action).
2. Two Q-functions Qφ₁, Qφ₂: two independent estimates of the action-value function (the "clipped double-Q trick" from TD3).
3. A temperature parameter α: controls the exploration-exploitation trade-off (can be fixed or learned).
[Diagram: state s → policy πθ(·|s) → sample action a → Qφ(s, a) → update]

02 Understanding Entropy

Before we can understand how entropy becomes part of the Q-value, we need to understand what entropy actually is and why we care about it.

What is Entropy?

Entropy measures how "random" or "uncertain" a probability distribution is. Think of it as measuring surprise: if you're pretty sure what's going to happen, there's no surprise (low entropy). If anything could happen, there's maximum surprise (high entropy).

// For a random variable x with distribution P:
H(P) = 𝔼x ~ P[ -log P(x) ]

// Expanding the expectation:
H(P) = -∑x P(x) log P(x)        // discrete
H(P) = -∫ P(x) log P(x) dx      // continuous
[Interactive widget: entropy of a two-outcome distribution, with a slider for the probability of outcome A]
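As a quick check on the definition (a minimal standalone Python sketch of my own, not code from the source), the entropy of the two-outcome distribution above peaks at P(A) = 0.5 and falls to zero as P(A) approaches 0 or 1:

import math

def two_outcome_entropy(p_a: float) -> float:
    # H(P) = -sum_x P(x) log P(x) over the two outcomes {A, B}
    entropy = 0.0
    for p in (p_a, 1.0 - p_a):
        if p > 0.0:                     # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return entropy

for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(f"P(A) = {p:.2f}   H = {two_outcome_entropy(p):.3f} nats")
# Maximum is log(2) ≈ 0.693 nats, reached at P(A) = 0.50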

Entropy of a Policy

In RL, we care about the entropy of our policy π(·|s) — the distribution over actions given a state. For a Gaussian policy (which SAC uses), the entropy has a nice closed form:

// For a Gaussian with standard deviation σ:
H(π(·|s)) = ½ log(2πeσ²)

// Key insight: entropy increases with σ
// Wider distribution → more uncertainty → higher entropy
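A quick sanity check of this closed form (my own sketch, not from the source): torch.distributions.Normal uses the same ½ log(2πeσ²) expression internally, and a Monte Carlo estimate of 𝔼[-log p(x)] converges to it.

import math
import torch

sigma = 0.3
dist = torch.distributions.Normal(loc=0.0, scale=sigma)

closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma**2)   # ½ log(2πeσ²)
library_value = dist.entropy().item()                            # same value from torch

x = dist.sample((100_000,))
mc_estimate = (-dist.log_prob(x)).mean().item()                  # estimate of E[-log p(x)]

print(closed_form, library_value, mc_estimate)   # all ≈ 0.215 nats for σ = 0.3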

Why Maximize Entropy?

By adding entropy to the reward, we're telling the agent: "don't just find a good action, find all the good actions." This prevents premature convergence to suboptimal deterministic policies and encourages robust exploration.

03 Entropy-Regularized Value Functions

Here's where it gets interesting. In standard RL, we maximize expected return. In entropy-regularized RL, we maximize expected return plus entropy bonuses:

// Standard RL objective:
π* = arg maxπ 𝔼τ ~ π[ ∑t=0 γt R(st, at, st+1) ]

// SAC's entropy-regularized objective:
π* = arg maxπ 𝔼τ ~ π[ ∑t=0 γt ( R(st, at, st+1) + α H(π(·|st)) ) ]

// α > 0 is the temperature: higher α → more exploration
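To make the new objective concrete, here is a tiny worked sketch with made-up numbers (my own illustration): the entropy-regularized return simply adds α·H(π(·|st)) to each reward before discounting.

# Toy 3-step trajectory (hypothetical rewards and per-state policy entropies)
rewards   = [1.0, 0.5, 2.0]
entropies = [1.2, 0.9, 0.4]     # H(π(·|s_t)) at each visited state
gamma, alpha = 0.99, 0.2

standard_return = sum(gamma**t * r for t, r in enumerate(rewards))
soft_return = sum(gamma**t * (r + alpha * h)
                  for t, (r, h) in enumerate(zip(rewards, entropies)))

print(standard_return)   # 1 + 0.99*0.5 + 0.99²*2.0 ≈ 3.455
print(soft_return)       # ≈ 3.952: the same return plus α times the discounted entropy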

The Modified Value Function Vπ

The value function now includes entropy bonuses from every timestep:

Vπ(s) = 𝔼τ ~ π[ ∑t=0 γt ( R(st, at, st+1) + α H(π(·|st)) ) | s0 = s ]

The Modified Q-Function Qπ

The Q-function includes entropy bonuses from every timestep except the first. This is the crucial design choice:

Qπ(s, a) = 𝔼τ ~ π[ ∑t=0 γt R(st, at, st+1) + α ∑t=1 γt H(π(·|st)) | s0 = s, a0 = a ]

// Notice: the entropy sum starts at t=1, not t=0!
// Q(s,a) assumes action a is already taken (no entropy for that choice)

Why Exclude the First Timestep?

Q(s, a) measures the value of taking specific action a in state s. Since we've already committed to action a, there's no uncertainty about that first action — hence no entropy bonus for it. The entropy bonuses only apply to future action choices.

The Connection Between V and Q

This is where the magic happens. The relationship between V and Q reveals why entropy shows up as a negative log probability in the Q-target:

// V is the expected Q plus the entropy of the current policy:
Vπ(s) = 𝔼a ~ π[ Qπ(s, a) ] + α H(π(·|s))

// Expand the entropy using its definition:
Vπ(s) = 𝔼a ~ π[ Qπ(s, a) ] + α 𝔼a ~ π[ -log π(a|s) ]

// Combine the expectations (both over a ~ π):
Vπ(s) = 𝔼a ~ π[ Qπ(s, a) - α log π(a|s) ]

This last equation is the key insight. It says: the value of a state equals the expected [Q-value minus α times log-probability of the action].

04 The Central Trick: Entropy → Negative Log π

Now let's derive the Bellman equation for the soft Q-function and see exactly how entropy transforms into -log π(a|s).

Step 1: Start with the Soft Bellman Equation

// Q(s,a) = immediate reward + discounted value of the next state
Qπ(s, a) = 𝔼s' ~ P, a' ~ π[ R(s, a, s') + γ ( Qπ(s', a') + α H(π(·|s')) ) ]

// The entropy H(π(·|s')) appears because V includes it

Step 2: Expand the Entropy

// Replace H with its definition: H(P) = 𝔼[-log P]
H(π(·|s')) = 𝔼a' ~ π[ -log π(a'|s') ]

// Substitute back into the Bellman equation:
Qπ(s, a) = 𝔼s' ~ P, a' ~ π[ R(s, a, s') + γ ( Qπ(s', a') + α · (-log π(a'|s')) ) ]

// Simplify:
Qπ(s, a) = 𝔼s' ~ P, a' ~ π[ R(s, a, s') + γ ( Qπ(s', a') - α log π(a'|s') ) ]

🎯 There It Is!

The entropy H(π) has become -α log π(a'|s') inside the Q-target. This is the fundamental transformation: maximizing entropy = subtracting log-probability from Q-values.

Step 3: Sample Approximation

Since we're taking an expectation over s' (from replay buffer) and a' (from current policy), we can approximate with samples:

// Approximate the expectation with a single sample:
Qπ(s, a) ≈ r + γ ( Qπ(s', ã') - α log π(ã'|s') )

where:
• r, s' come from the replay buffer
• ã' ~ π(·|s')     // fresh sample from the current policy

// The tilde on ã' emphasizes: sample fresh, don't use a stored action!
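As a concrete instance of this approximation (made-up numbers, purely illustrative): suppose the fresh sample ã' happens to have Q(s', ã') = 5.0 and log π(ã'|s') = -1.2.

r, gamma, alpha = 1.0, 0.99, 0.2
q_next, logp_next = 5.0, -1.2        # evaluated at a FRESH sample ã' ~ π(·|s')

soft_target = r + gamma * (q_next - alpha * logp_next)
print(soft_target)                    # 1.0 + 0.99 * (5.0 + 0.24) ≈ 6.19

The -α log π term raised the target by roughly 0.24, because ã' was a relatively unlikely (high-surprise) action.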
[Interactive widget: visualizing the soft Q-target, with sliders for the temperature α and the policy std dev σ]

05 Learning the Q-Functions

SAC uses two Q-networks (the "clipped double-Q" trick from TD3) to reduce overestimation bias. Both are trained to minimize MSBE (mean squared Bellman error).

The Loss Function

// Loss for each Q-function (i = 1, 2):
L(φi, 𝒟) = 𝔼(s,a,r,s',d) ~ 𝒟[ ( Qφi(s, a) - y(r, s', d) )² ]

// where 𝒟 is the replay buffer

The Target (This is Where It All Comes Together)

// The complete SAC target:
y(r, s', d) = r + γ (1 - d) · ( minj=1,2 Qφtarg,j(s', ã') - α log πθ(ã'|s') )
where ã' ~ πθ(·|s')

// Breaking it down:
// r          → immediate reward from the buffer
// γ          → discount factor
// (1 - d)    → zero out bootstrapping at terminal states
// min Q      → clipped double-Q (conservative estimate)
// -α log π   → THE ENTROPY TERM (encourages exploration)
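Putting the target and the loss together, here is a minimal PyTorch sketch (my own illustration, not Spinning Up's code; q1, q2, q1_targ, q2_targ, and policy are assumed placeholder networks, and policy(s) is assumed to return a sampled action together with its log-probability):

import torch

def compute_q_loss(batch, q1, q2, q1_targ, q2_targ, policy, gamma=0.99, alpha=0.2):
    s, a, r, s2, d = batch["s"], batch["a"], batch["r"], batch["s2"], batch["d"]

    with torch.no_grad():                       # no gradients through the target
        a2, logp_a2 = policy(s2)                # fresh ã' ~ π(·|s') and log π(ã'|s')
        q_next = torch.min(q1_targ(s2, a2),     # clipped double-Q: take the min
                           q2_targ(s2, a2))
        y = r + gamma * (1 - d) * (q_next - alpha * logp_a2)   # the soft target

    # Mean squared Bellman error for both Q-functions against the same target
    loss_q1 = ((q1(s, a) - y) ** 2).mean()
    loss_q2 = ((q2(s, a) - y) ** 2).mean()
    return loss_q1 + loss_q2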

Comparing with TD3

Aspect                  | TD3                                | SAC
Policy type             | Deterministic μ(s)                 | Stochastic π(a|s)
Next action in target   | From target policy μ'(s') + noise  | From current policy π(·|s')
Entropy term            | None (uses explicit noise)         | -α log π(a'|s')
Double-Q                | Yes (min of two)                   | Yes (min of two)
Target networks         | Yes (Polyak averaged)              | Yes (Polyak averaged)

06 Learning the Policy

The policy should maximize the soft value function Vπ(s), i.e., the expected Q-value minus α log π(a|s), which is the same as the expected Q-value plus an entropy bonus:

// We want to maximize:
Vπ(s) = 𝔼a ~ π[ Qπ(s, a) - α log π(a|s) ]

// Policy objective (maximize w.r.t. θ, with the reparameterized action ãθ(s, ξ)):
maxθ 𝔼s ~ 𝒟, ξ ~ 𝒩[ minj=1,2 Qφj(s, ãθ(s, ξ)) - α log πθ(ãθ(s, ξ)|s) ]

The Reparameterization Trick

To backpropagate through the sampling process, SAC uses the reparameterization trick. Instead of sampling a ~ π(·|s), we sample noise ξ and compute action deterministically:

// SAC uses a "squashed Gaussian" policy:
ãθ(s, ξ) = tanh( μθ(s) + σθ(s) ⊙ ξ )

where:
• μθ(s) → neural network output for the mean
• σθ(s) → neural network output for the std dev
• ξ ~ 𝒩(0, I) → standard Gaussian noise
• ⊙ → element-wise multiplication
• tanh → squashes to [-1, 1]

// Why tanh? It bounds actions to a valid range!
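A minimal PyTorch sketch of this reparameterized sampling (my own illustration; mu_net and log_std_net are assumed placeholder networks producing μθ(s) and log σθ(s)):

import torch

def sample_squashed_action(s, mu_net, log_std_net):
    mu = mu_net(s)                      # μθ(s)
    std = log_std_net(s).exp()          # σθ(s), kept positive via its log
    xi = torch.randn_like(mu)           # ξ ~ N(0, I), independent of θ
    u = mu + std * xi                   # reparameterized pre-tanh sample
    a = torch.tanh(u)                   # squash into [-1, 1]
    return a, u                         # gradients flow through mu and std into θ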

The Reparameterization Insight

By writing a = f(s, θ, ξ) where ξ is independent noise, we can backpropagate gradients through the sampling: ∂a/∂θ is well-defined, even though a is stochastic. This is the same trick used in VAEs.

Computing Log Probability with Tanh

The tanh squashing changes the distribution, so we need to account for the Jacobian when computing log π(a|s):

// Let u = μ + σ ⊙ ξ (pre-tanh), a = tanh(u)
log π(a|s) = log πgauss(u|s) - ∑i log(1 - ai²)

// The second term is the log-Jacobian of tanh, summed over action dimensions
// This correction is crucial for correct entropy computation!
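Continuing the sketch above (still my own illustration, with the assumed mu_net / log_std_net / q1 / q2 placeholders), the corrected log-probability and the resulting policy loss might look like this; the small epsilon guards the log against a² hitting exactly 1, and real implementations often use a more numerically stable identity for the same correction:

import torch

def squashed_log_prob(u, a, mu, std, eps=1e-6):
    # log of the underlying Gaussian density at the pre-tanh value u
    logp_gauss = torch.distributions.Normal(mu, std).log_prob(u).sum(dim=-1)
    # subtract the log-Jacobian of tanh, summed over action dimensions
    return logp_gauss - torch.log(1 - a.pow(2) + eps).sum(dim=-1)

def policy_loss(s, mu_net, log_std_net, q1, q2, alpha=0.2):
    mu, std = mu_net(s), log_std_net(s).exp()
    xi = torch.randn_like(mu)
    u = mu + std * xi                            # reparameterized sample
    a = torch.tanh(u)
    logp = squashed_log_prob(u, a, mu, std)
    q = torch.min(q1(s, a), q2(s, a))            # clipped double-Q, as in the target
    return (alpha * logp - q).mean()             # minimizing this maximizes E[min Q - α log π]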
[Interactive widget: the squashed Gaussian action distribution, with sliders for the mean μ and std dev σ]

07 Complete Notation Reference

Following Spinning Up's conventions, here's every symbol used in SAC:

s, s'        Current state, next state
a, a', ã'    Action (from the buffer); next action a' ~ π(·|s'); ã' a fresh sample used in targets
r            Reward (from the replay buffer)
d            Done flag (1 if terminal, 0 otherwise)
γ            Discount factor ∈ (0, 1)
πθ           Stochastic policy with parameters θ
Qφ₁, Qφ₂     Two Q-function approximators
Qφtarg       Target Q-networks (Polyak averaged)
α            Temperature/entropy coefficient
H(P)         Entropy of distribution P
𝒟            Replay buffer
ρ            Polyak averaging coefficient (typically 0.995)
μθ(s)        Mean of the Gaussian policy
σθ(s)        Standard deviation of the Gaussian policy
ξ            Noise sample from 𝒩(0, I)

The Complete Algorithm

Initialize:
  Policy parameters θ
  Q-function parameters φ₁, φ₂
  Target parameters φ_targ,1 ← φ₁, φ_targ,2 ← φ₂
  Empty replay buffer 𝒟

Loop:
  // Collect experience
  Observe s, sample a ~ π_θ(·|s)
  Execute a, observe s', r, d
  Store (s, a, r, s', d) in 𝒟

  // Update (when it's time)
  Sample batch B from 𝒟

  // Compute targets:
  ã' ~ π_θ(·|s')
  y = r + γ(1-d)(min_j Q_φtarg,j(s', ã') - α log π_θ(ã'|s'))

  // Update Q-functions:
  ∇_φᵢ (1/|B|) Σ (Q_φᵢ(s,a) - y)²   for i = 1, 2

  // Update policy:
  ∇_θ (1/|B|) Σ (min_j Q_φⱼ(s, ã_θ(s)) - α log π_θ(ã_θ(s)|s))
  where ã_θ(s) = tanh(μ_θ(s) + σ_θ(s) ⊙ ξ), ξ ~ 𝒩(0,I)

  // Update target networks:
  φ_targ,i ← ρ φ_targ,i + (1-ρ) φᵢ   for i = 1, 2
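The last step, the Polyak update of the target networks, is only a couple of lines in PyTorch; a sketch (my own, with q and q_targ standing in for each online/target pair, and rho as in the notation table):

import torch

@torch.no_grad()
def polyak_update(q, q_targ, rho=0.995):
    # φ_targ ← ρ φ_targ + (1 - ρ) φ, applied parameter by parameter
    for p, p_targ in zip(q.parameters(), q_targ.parameters()):
        p_targ.mul_(rho)
        p_targ.add_((1 - rho) * p)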

The Punchline

Every time you see -α log π(a|s) in SAC, remember: this IS the entropy regularization. The expectation of -log π IS the entropy H(π). By subtracting log-probabilities from Q-values, we make low-probability actions look more valuable, which encourages exploration.