🎯 Reinforcement Learning Study Quiz

50 Questions • From Q-Learning to SAC • Pen & Paper Ready

1. Q-Learning (8 questions)
1
Easy
Write the Bellman optimality equation for Q*(s, a). What does each term represent?
2
Easy
What is the TD (Temporal Difference) error in Q-learning? Write the formula and explain why it's called an "error."
δ = r + γ · max_a' Q(s', a') - Q(s, a)
3
Medium
What is epsilon-greedy exploration? Why do we typically decay ε over time? What happens if ε = 0 vs ε = 1?
4
Medium
Why does Q-learning need a replay buffer? What two problems does it solve?
5
Medium
What is a target network in DQN? Why can't we just use Q(s, a) for both the prediction and the target?
6
Hard
Q-learning is "off-policy." What does this mean, and why is it an advantage for sample efficiency?
7
Medium
Write the Q-learning update rule. What is the role of the learning rate α?
Q(s, a) ← Q(s, a) + α · [_______________]
8
Hard
Why does Q-learning suffer from "maximization bias"? How does Double DQN address this?
2. Policy Gradient (8 questions)
9
Easy
In policy gradient methods, what do we optimize directly? How is this different from Q-learning?
10
Medium
Write the policy gradient theorem. What does ∇_θ log π(a|s) represent intuitively?
∇_θ J(θ) = E_τ [ Σ_t ∇_θ log π_θ(a_t|s_t) · G_t ]
11
Hard
Explain the "log-derivative trick." Why do we need it, and how does it let us compute gradients through sampling?
∇_θ E[f(x)] = E[ f(x) · ∇_θ log p_θ(x) ]
12
Easy
What is G_t (return-to-go)? Write the formula. Why do we use this instead of total episode return?
13
Medium
What is a baseline in REINFORCE? Why does subtracting a baseline reduce variance without changing the expected gradient?
∇_θ J(θ) = E[ ∇_θ log π(a|s) · (G_t - b(s)) ]
14
Medium
If an action leads to reward +100, what does the policy gradient do to π(a|s)? What if reward is -100?
15
Hard
REINFORCE is "on-policy." Why can't we reuse old trajectories collected from a previous policy?
16
Medium
Why is vanilla policy gradient known for high variance? Name two techniques to reduce it.
3. Actor-Critic (8 questions)
17
Easy
What are the two networks in Actor-Critic? What does each one learn?
18
Easy
Write the formula for the advantage function A(s, a). What does a positive vs. negative advantage mean?
A(s, a) = Q(s, a) - V(s)
19
Medium
We often estimate advantage using TD error: A(s,a) ≈ δ = r + γV(s') - V(s). Why is this valid? (Hint: What is Q(s,a) in terms of r, γ, and V?)
20
Hard
Write the GAE (Generalized Advantage Estimation) formula. What does λ control?
A_t^GAE = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}
21
Medium
What is the bias-variance tradeoff in advantage estimation? Compare using TD(0) error vs. Monte Carlo returns.
22
Medium
What loss function does the Critic use? Write it out.
L_critic = E[ (V_θ(s) - _________ )² ]
23
Hard
In GAE, what happens when λ = 0? When λ = 1? Which has more bias? More variance?
24
Medium
Why is Actor-Critic generally lower variance than REINFORCE? What's the key difference?
4. PPO (Proximal Policy Optimization) (7 questions)
25
Easy
What is the probability ratio r(θ) in PPO? Write the formula.
r(θ) = π_θ(a|s) / π_{θ_old}(a|s)
26
Medium
Write the PPO clipped objective. What does clipping achieve?
L^CLIP = E[ min( r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A ) ]
27
Hard
Explain why clipping works differently for A > 0 vs A < 0. Draw or describe what happens in each case.
28
Medium
What is importance sampling? Why does PPO need it to reuse data from the old policy?
E_{x~p}[f(x)] = E_{x~q}[f(x) · p(x)/q(x)]
29
Medium
What is a "trust region" conceptually? Why do we want to stay close to the old policy?
30
Hard
If r(θ) = 2 and A > 0, what does PPO do? What if r(θ) = 0.5 and A < 0?
31
Medium
PPO is "on-policy" but reuses data for multiple epochs. How is this possible? What limits how many epochs we can do?
5. SAC (Soft Actor-Critic) (8 questions)
32
Easy
What is entropy H(π) of a policy? Why is higher entropy generally better for exploration?
H(π) = -E[ log π(a|s) ]
33
Medium
Write SAC's maximum entropy objective. How is it different from standard RL?
J(π) = E[ Σ_t r_t + α · H(π(·|s_t)) ]
34
Medium
Write the soft Bellman equation. What's the extra term compared to standard Bellman?
Q(s,a) = r + γ · E[ Q(s',a') - α·log π(a'|s') ]
35
Hard
THE REPARAMETERIZATION TRICK (in simple terms):

Problem: We want to take gradients to improve our policy, but actions come from sampling (like rolling dice). You can't do calculus on dice rolls!

Explain the solution:
Instead of: a = sample from π(a|s)
Use: a = μ_θ(s) + σ_θ(s) × ε, where ε ~ N(0, 1)
36
Medium
Why does SAC use twin Q-networks (two critics)? What problem does this solve?
37
Hard
What is automatic temperature tuning in SAC? What is the target entropy, and why do we tune α?
38
Medium
SAC is off-policy. What does this mean for sample efficiency compared to PPO?
39
Hard
Why can SAC directly optimize Q-values through the policy (unlike PPO which needs the log-derivative trick)?
6. Gradient Flow (6 questions)
40
Medium
In REINFORCE, gradients flow through log π(a|s). The return G_t is treated as a constant. Why must we NOT backprop through G_t?
41
Hard
In Actor-Critic (and PPO), why do we "detach" the advantage A when computing the actor loss? What would go wrong if we didn't?
42
Hard
In SAC, gradients flow from Q(s, a) through the action a back to the policy. Draw/describe this gradient path. Why is this possible?
43
Medium
Why do we use a target network that updates slowly (soft update)? What's the formula?
θ_target ← τ·θ + (1-τ)·θ_target, where τ << 1
44
Hard
Compare gradient flow in these three algorithms:
• REINFORCE: policy ← log π × G
• PPO: policy ← log π × A (detached)
• SAC: policy ← Q(s, a_sampled)
Which has lowest variance? Why?
45
Medium
What is "bootstrapping" in RL? Which algorithms bootstrap, and what's the tradeoff?
7. Big Picture (5 questions)
46
Medium
Fill in this table comparing on-policy vs off-policy:
                     On-Policy           Off-Policy
Example algorithms   _______________     _______________
Sample efficiency    _______________     _______________
Replay buffer?       _______________     _______________
Stability            _______________     _______________
47
Medium
Trace the evolution: REINFORCE → Actor-Critic → PPO. What problem did each solve?
48
Medium
When would you choose PPO over SAC? When would you choose SAC over PPO?
49
Hard
Why is continuous action space harder than discrete? Name two challenges and how SAC addresses them.
50
Hard
You're training a robot arm with expensive real-world samples. Which algorithm family would you prefer and why? What if you had a fast simulator instead?
✅ Complete Answer Key

Q-Learning Answers

Q1: Bellman Optimality Equation
Q*(s, a) = E[ r + γ · max_{a'} Q*(s', a') ]
Terms:
• Q*(s, a): Optimal action-value
• r: Immediate reward
• γ: Discount factor (0 to 1)
• max_{a'} Q*(s', a'): Best possible value from the next state
Q2: TD Error
δ = r + γ · max_{a'} Q(s', a') - Q(s, a)
It's called an "error" because it measures the difference between what we predicted (Q(s,a)) and a better, reward-informed estimate of the same quantity. If our Q-function were perfect, δ would be zero in expectation.
Q3: Epsilon-Greedy
Epsilon-greedy: With probability ε, choose random action (explore). With probability 1-ε, choose best action (exploit).

Why decay ε: Early: need exploration. Later: exploit what we've learned.

ε = 0: Pure exploitation - the agent can get stuck on a suboptimal policy because it never tries anything new.
ε = 1: Pure exploration - behavior is uniformly random; Q-values can still be learned (Q-learning is off-policy), but the agent never exploits what it has learned.
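A minimal sketch in Python (NumPy; hypothetical tabular setup where Q is a states × actions array):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """With probability ε take a random action (explore), else the greedy one (exploit)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit
```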
Q4: Replay Buffer
Solves two problems:
1. Correlation: Consecutive experiences are correlated. Random sampling breaks correlations.
2. Sample efficiency: Reuse experiences multiple times instead of once.
Q5: Target Network
A frozen copy of Q-network, updated slowly.

Why needed: Using Q for both prediction AND target creates "moving target" problem. Target network provides stable targets.
Q6: Off-Policy
Off-policy = learn about optimal policy while following different behavior policy.

Advantage: Can reuse ANY past experience. Very sample-efficient.
The "max" makes it off-policy - we consider the best action, not the one we took.
Q7: Q-Learning Update
Q(s, a) ← Q(s, a) + α · [r + γ·max_{a'} Q(s', a') - Q(s, a)]
α controls how much we adjust toward the new estimate. Small α = stability but slower learning.
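A one-step tabular version of this update in Python (a sketch for a hypothetical toy setup; Q is a NumPy array indexed by state and action):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q(s,a) ← Q(s,a) + α·[r + γ·max_a' Q(s',a') - Q(s,a)]"""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped target
    td_error = td_target - Q[s, a]              # δ from Q2
    Q[s, a] += alpha * td_error                 # move toward the target by α
    return Q
```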
Q8: Maximization Bias
Problem: Same network selects AND evaluates best action. Noisy overestimates get selected, causing systematic bias.

Double DQN: One network SELECTS, different network EVALUATES:
target = r + γ · Q_target(s', argmax_{a'} Q_online(s', a'))
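A sketch of that target in PyTorch (assumes q_online and q_target map a batch of states to per-action Q-values; the names and batch layout are illustrative):

```python
import torch

def double_dqn_target(q_online, q_target, r, s_next, gamma, done):
    """Online net SELECTS the next action, target net EVALUATES it."""
    with torch.no_grad():
        a_star = q_online(s_next).argmax(dim=1, keepdim=True)      # selection
        q_eval = q_target(s_next).gather(1, a_star).squeeze(1)     # evaluation
        return r + gamma * (1.0 - done) * q_eval
```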

Policy Gradient Answers

Q9: Policy Gradient vs Q-Learning
Policy gradient: Directly optimizes policy π(a|s).
Q-learning: Learns value function Q(s,a), derives policy from it.

Policy gradient handles continuous actions naturally.
Q10: Policy Gradient Theorem
∇_θ J(θ) = E_τ [ Σ_t ∇_θ log π_θ(a_t|s_t) · G_t ]
∇_θ log π(a|s) points toward making action a MORE likely.

Positive G_t → increase probability. Negative G_t → decrease probability.
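As a surrogate loss this is often written as below (a PyTorch sketch; log_probs are log π_θ(a_t|s_t) over a batch of trajectory steps and returns are the corresponding G_t):

```python
import torch

def reinforce_loss(log_probs, returns):
    """Negated policy-gradient surrogate: minimizing it ascends E[log π_θ(a|s) · G_t]."""
    # G_t is treated as a constant - never backprop through the returns (see Q40).
    return -(log_probs * returns.detach()).mean()
```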
Q11: Log-Derivative Trick
Problem: Can't move gradient inside expectation over θ-dependent distribution.

Solution: ∇_θ p_θ(x) = p_θ(x) · ∇_θ log p_θ(x)

Rewrites to: E[f(x)·∇_θ log p_θ(x)]
Now we can sample from current π_θ and compute the gradient!
Q12: Return-to-Go
G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...
Why: Action at time t only affects rewards from t onward. Total return would credit action for past rewards (noise).
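Computed backwards in one pass, for example (plain Python sketch):

```python
def returns_to_go(rewards, gamma=0.99):
    """G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... for every step of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]
```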
Q13: Baseline
Baseline b(s): Subtracted from return to reduce variance.

Why the expected gradient is unchanged: E_a[∇_θ log π(a|s) · b(s)] = b(s) · Σ_a ∇_θ π(a|s) = b(s) · ∇_θ 1 = 0, because the action probabilities always sum to 1.

Reduces variance by centering reward signal.
Q14: Positive/Negative Reward
+100: Gradient increases π(a|s) - do this more.
-100: Gradient decreases π(a|s) - do this less.
Policy gradient: "do more of what worked, less of what didn't."
Q15: Why On-Policy
Gradient requires E_{τ ~ π_θ}[...] - samples from CURRENT policy.

Old trajectories from π_old have wrong probability weights.
Every update invalidates old data - sample inefficient.
Q16: High Variance
Why: G_t depends on ALL future randomness (compounds).

Two techniques:
1. Baselines: Subtract V(s) to center signal
2. Actor-Critic: Replace G_t with TD estimates

Actor-Critic Answers

Q17: Two Networks
Actor: Policy π_θ(a|s) - learns WHICH actions to take.
Critic: Value V_φ(s) or Q_φ(s,a) - learns HOW GOOD states/actions are.
Q18: Advantage Function
A(s, a) = Q(s, a) - V(s)
Positive A: Action is BETTER than average - do more.
Negative A: Action is WORSE than average - do less.
Advantage = how much better/worse than typical.
Q19: TD Error as Advantage
Q(s,a) = E[r + γV(s')]

So: A = Q - V = E[r + γV(s')] - V(s) ≈ r + γV(s') - V(s) = δ

TD error is a sampled advantage estimate (biased but low variance).
Q20: GAE Formula
A_t^GAE = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}
λ controls bias-variance:
• λ=0: Just TD error (high bias, low variance)
• λ=1: Monte Carlo (low bias, high variance)
GAE = exponentially-weighted average of n-step advantages!
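The recursive form is easy to implement; a NumPy sketch for a single episode (assumes values has one extra bootstrap entry V(s_T) at the end):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_t = δ_t + (γλ)·A_{t+1}, where δ_t = r_t + γ·V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error δ_t
        running = delta + gamma * lam * running                  # recursive accumulation
        advantages[t] = running
    return advantages
```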
Q21: Bias-Variance Tradeoff
TD(0): Low variance (one step), high bias (trusts V)
Monte Carlo: High variance (many steps), low bias (actual returns)

Usually want something in between (GAE λ≈0.95).
Q22: Critic Loss
L_critic = E[ (V_θ(s) - G_t)² ] or E[ (V_θ(s) - (r + γV(s')))² ]
MSE between predicted value and target (Monte Carlo or TD).
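In code this is plain regression; a PyTorch sketch (the target - Monte Carlo return or TD target - is treated as fixed):

```python
import torch.nn.functional as F

def critic_loss(v_pred, v_target):
    """MSE between V_θ(s) and a fixed target such as G_t or r + γ·V(s')."""
    return F.mse_loss(v_pred, v_target.detach())
```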
Q23: GAE λ Extremes
λ=0: A_t = δ_t (TD error only). HIGH BIAS, LOW VARIANCE.
λ=1: A_t = G_t - V(s_t) (Monte Carlo). LOW BIAS, HIGH VARIANCE.
λ interpolates between TD (λ=0) and Monte Carlo (λ=1).
Q24: Lower Variance than REINFORCE
REINFORCE: Full Monte Carlo returns (high variance).
Actor-Critic: Critic provides learned baseline + shorter horizon estimates.
Trade some bias for much lower variance - usually worth it.

PPO Answers

Q25: Probability Ratio
r(θ) = π_θ(a|s) / π_{θ_old}(a|s)
• r=1: Same probability
• r>1: New policy makes action MORE likely
• r<1: New policy makes action LESS likely
Q26: Clipped Objective
L^CLIP = E[ min( r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A ) ]
Clipping: Prevents policy from changing too much. Creates "trust region" - only trust advantage estimates near old policy.
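A PyTorch sketch of the clipped surrogate (advantages are assumed to be precomputed and detached):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """-E[ min(r·A, clip(r, 1-ε, 1+ε)·A) ], with r = exp(log π_θ - log π_old)."""
    ratio = torch.exp(log_prob_new - log_prob_old)                  # r(θ)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```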
Q27: Clipping for A>0 vs A<0
A>0 (good action): Want to increase π. But capped at (1+ε)·A. Can't increase beyond clip.

A<0 (bad action): Want to decrease π. min() picks less negative term. Can't decrease beyond clip.
Clipping always prevents changes that make objective artificially good!
Q28: Importance Sampling
E_{x~p}[f(x)] = E_{x~q}[f(x) · p(x)/q(x)]
Why PPO needs it: Data from π_old, but optimizing π_θ. Ratio reweights samples for distribution mismatch.
Enables multiple gradient updates on same batch.
Q29: Trust Region
Trust region: Neighborhood where value estimates are reliable.

Why stay close:
1. Advantage estimates from old policy become wrong
2. Large changes can be catastrophic
3. Importance weights become high-variance
Q30: Specific Examples
r=2, A>0: min(2A, 1.2A) = 1.2A. Capped - gradient is zero.

r=0.5, A<0: min(0.5A, 0.8A). Since A<0, 0.5A more negative. Uses 0.8A. Gradient is zero.
Q31: Data Reuse
How: Importance sampling corrects for old data. Clipping prevents drift.

Limit: After many epochs, π_θ diverges from π_old. Clipping triggers constantly. Typically K=3-10 epochs.

SAC Answers

Q32: Entropy
H(π) = -E[ log π(a|s) ]
Entropy: Measures randomness/spread of policy.

Higher entropy → exploration: Maintains diversity, prevents premature convergence.
Q33: Maximum Entropy Objective
J(π) = E[ Σ_t r_t + α · H(π(·|s_t)) ]
Difference: Standard RL only maximizes return. SAC ALSO maximizes entropy.
"Among similar-reward policies, prefer the most random one."
Q34: Soft Bellman
Q(s,a) = r + γ · E[ Q(s',a') - α·log π(a'|s') ]
Extra term: -α·log π (entropy bonus). High-entropy futures valued more highly.
Q35: Reparameterization Trick
Problem: Can't do calculus on dice rolls (sampling).

Solution: a = μ_θ(s) + σ_θ(s) × ε, where ε ~ N(0,1)

1. Roll dice FIRST (sample ε)
2. Now a is a FORMULA (μ + σ × fixed number)
3. Formulas are differentiable!
Moved randomness into ε (constant w.r.t. θ). Action is now differentiable!
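A minimal PyTorch sketch of the trick (Gaussian case only; SAC additionally squashes the action with tanh, which is omitted here):

```python
import torch

def reparameterized_action(mu, log_std):
    """a = μ + σ·ε with ε ~ N(0, 1): the randomness lives in ε, so a is differentiable in θ."""
    std = log_std.exp()
    eps = torch.randn_like(std)   # sampled first, treated as a constant w.r.t. θ
    return mu + std * eps
```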
Q36: Twin Q-Networks
Uses min(Q_1, Q_2) for targets.

Problem solved: Q-function overestimation. Policy exploits errors.

Why min: Pessimistic estimate counteracts overestimation.
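A sketch of the soft target with the twin-critic min (PyTorch; q1_target/q2_target take (s, a) batches, and all names are illustrative):

```python
import torch

def sac_soft_target(q1_target, q2_target, r, s_next, a_next, log_pi_next, alpha, gamma, done):
    """r + γ·[min(Q1, Q2)(s', a') - α·log π(a'|s')], the soft Bellman target from Q34."""
    with torch.no_grad():
        q_min = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
        return r + gamma * (1.0 - done) * (q_min - alpha * log_pi_next)
```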
Q37: Auto Temperature Tuning
Target entropy: Typically -dim(action_space).

Why tune α: Fixed α works poorly. Optimal depends on task.

Entropy below target → increase α (more exploration)
Entropy above target → decrease α (more reward focus)
Q38: Sample Efficiency
Off-policy: Reuse ANY past experience from replay buffer.

vs PPO: PPO must throw away data after few epochs. SAC can train on same data indefinitely.
For expensive interactions (robotics), SAC's efficiency is huge.
Q39: Direct Q Optimization
PPO: Log-derivative trick (score function). High variance.

SAC: Reparameterization makes a differentiable. Gradient flows: Q → a → θ
SAC uses pathwise derivatives (low variance) vs score function (high variance).

Gradient Flow Answers

Q40: Why Not Backprop Through G_t
G_t comes from ENVIRONMENT (not differentiable). Can't compute ∂reward/∂θ.

Even with differentiable sim: high variance, exploding gradients, intractable.
G_t only scales the update (and sets its sign); the direction in parameter space comes entirely from ∇_θ log π.
Q41: Why Detach Advantage
A involves critic V. Without detaching:

Problem: Actor gradients flow into critic. Actor tries to change V to make A larger, not learn better policy.

Both networks become corrupted.
Actor treats A as "given information."
Q42: SAC Gradient Path
Loss = -Q(s,a) + α·log π

a = μ_θ(s) + σ_θ(s) × ε

∂Loss/∂θ = ∂Loss/∂a × ∂a/∂θ

Why possible: Reparameterization makes a differentiable. Can chain: Q → a → θ.
SAC asks critic "what's ∂Q/∂a?" and uses it directly.
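A sketch of that pathwise actor loss in PyTorch (assumes policy(s) returns a reparameterized action and its log-probability; names are illustrative):

```python
import torch

def sac_actor_loss(policy, q1, q2, s, alpha):
    """Minimize α·log π - Q: gradients flow from Q through the sampled action into θ."""
    a, log_pi = policy(s)                 # a = μ_θ(s) + σ_θ(s)·ε  (reparameterized)
    q = torch.min(q1(s, a), q2(s, a))     # pessimistic twin-critic value
    return (alpha * log_pi - q).mean()
```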
Q43: Target Network Soft Updates
θ_target ← τ·θ + (1-τ)·θ_target, τ ≈ 0.005
Why: Stable targets for Q-learning. Changing every step → moving target.

Soft update provides smoother training than hard copies.
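In PyTorch the soft update is a parameter-by-parameter blend, for example (a sketch):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """θ_target ← τ·θ + (1-τ)·θ_target for every parameter."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```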
Q44: Variance Comparison
REINFORCE: Highest. G_t = all future randomness + score function.

PPO: Medium. Learned baseline, GAE. Still score function.

SAC: Lowest. Pathwise derivatives through Q. No score function.
Pathwise (SAC) < score+baseline (PPO) < vanilla score (REINFORCE)
Q45: Bootstrapping
Bootstrapping: Using own estimates as targets (e.g., r + γV(s')).

Who bootstraps: Q-learning, Actor-Critic, PPO (TD), SAC. NOT REINFORCE.

Tradeoff: Introduces bias (estimates wrong) but reduces variance. Usually worth it.

Big Picture Answers

Q46: On vs Off-Policy Table
                     On-Policy                  Off-Policy
Examples             REINFORCE, PPO, A2C        DQN, SAC, DDPG, TD3
Sample efficiency    Lower (data discarded)     Higher (replay buffer)
Replay buffer?       No / limited               Yes (essential)
Stability            More stable                Can be less stable
Q47: Algorithm Evolution
REINFORCE: First policy gradient. Problem: HIGH VARIANCE.

Actor-Critic: Solved VARIANCE with critic's value estimates.

PPO: Solved STABILITY & DATA REUSE with clipping, enabling multiple epochs.
Each step trades off sample efficiency, stability, complexity.
Q48: PPO vs SAC
Choose PPO: Fast/cheap simulation, need stability, discrete actions, easy parallelization, simpler tuning.

Choose SAC: Real-world robotics (expensive samples), continuous actions, need max sample efficiency, want built-in exploration.
Q49: Continuous Action Challenges
1. Can't enumerate max: SAC learns policy that maximizes Q via reparameterization.

2. Exploration harder: Random continuous actions usually terrible. SAC uses entropy maximization for principled exploration.
Q50: Real Robot vs Simulator
Real robot: Choose SAC. Max sample efficiency. Every interaction costly.

Fast simulator: Choose PPO. Parallelize across many instances. Stability matters more than efficiency.
Optimal algorithm depends on cost of environment interaction!