🎯 Reinforcement Learning Study Quiz

50 Questions • From Q-Learning to SAC • Pen & Paper Ready

1. Q-Learning (8 questions)
1
Easy
Write the Bellman optimality equation for Q*(s, a). What does each term represent?
2
Easy
What is the TD (Temporal Difference) error in Q-learning? Write the formula and explain why it's called an "error."
δ = r + γ · max_a' Q(s', a') - Q(s, a)
3
Medium
What is epsilon-greedy exploration? Why do we typically decay ε over time? What happens if ε = 0 vs ε = 1?
4
Medium
Why does Q-learning need a replay buffer? What two problems does it solve?
5
Medium
What is a target network in DQN? Why can't we just use Q(s, a) for both the prediction and the target?
6
Hard
Q-learning is "off-policy." What does this mean, and why is it an advantage for sample efficiency?
7
Medium
Write the Q-learning update rule. What is the role of the learning rate α?
Q(s, a) ← Q(s, a) + α · [_______________]
8
Hard
Why does Q-learning suffer from "maximization bias"? How does Double DQN address this?
2. Policy Gradient (8 questions)
9
Easy
In policy gradient methods, what do we optimize directly? How is this different from Q-learning?
10
Medium
Write the policy gradient theorem. What does ∇_θ log π(a|s) represent intuitively?
∇_θ J(θ) = E_τ [ Σ_t ∇_θ log π_θ(a_t|s_t) · G_t ]
11
Hard
Explain the "log-derivative trick." Why do we need it, and how does it let us compute gradients through sampling?
∇_θ E[f(x)] = E[ f(x) · ∇_θ log p_θ(x) ]
12
Easy
What is G_t (return-to-go)? Write the formula. Why do we use this instead of total episode return?
13
Medium
What is a baseline in REINFORCE? Why does subtracting a baseline reduce variance without changing the expected gradient?
∇_θ J(θ) = E[ ∇_θ log π(a|s) · (G_t - b(s)) ]
14
Medium
If an action leads to reward +100, what does the policy gradient do to π(a|s)? What if reward is -100?
15
Hard
REINFORCE is "on-policy." Why can't we reuse old trajectories collected from a previous policy?
16
Medium
Why is vanilla policy gradient known for high variance? Name two techniques to reduce it.
3. Actor-Critic (8 questions)
17
Easy
What are the two networks in Actor-Critic? What does each one learn?
18
Easy
Write the formula for the advantage function A(s, a). What does a positive vs. negative advantage mean?
A(s, a) = Q(s, a) - V(s)
19
Medium
We often estimate advantage using TD error: A(s,a) ≈ δ = r + γV(s') - V(s). Why is this valid? (Hint: What is Q(s,a) in terms of r, γ, and V?)
20
Hard
Write the GAE (Generalized Advantage Estimation) formula. What does λ control?
A_t^GAE = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}
21
Medium
What is the bias-variance tradeoff in advantage estimation? Compare using TD(0) error vs. Monte Carlo returns.
22
Medium
What loss function does the Critic use? Write it out.
L_critic = E[ (V_θ(s) - _________ )² ]
23
Hard
In GAE, what happens when λ = 0? When λ = 1? Which has more bias? More variance?
24
Medium
Why is Actor-Critic generally lower variance than REINFORCE? What's the key difference?
4. PPO (Proximal Policy Optimization) (7 questions)
25
Easy
What is the probability ratio r(θ) in PPO? Write the formula.
r(θ) = π_θ(a|s) / π_{θ_old}(a|s)
26
Medium
Write the PPO clipped objective. What does clipping achieve?
L^CLIP = E[ min( r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A ) ]
27
Hard
Explain why clipping works differently for A > 0 vs A < 0. Draw or describe what happens in each case.
28
Medium
What is importance sampling? Why does PPO need it to reuse data from the old policy?
E_{x~p}[f(x)] = E_{x~q}[f(x) · p(x)/q(x)]
29
Medium
What is a "trust region" conceptually? Why do we want to stay close to the old policy?
30
Hard
If r(θ) = 2 and A > 0, what does PPO do? What if r(θ) = 0.5 and A < 0?
31
Medium
PPO is "on-policy" but reuses data for multiple epochs. How is this possible? What limits how many epochs we can do?
5. SAC (Soft Actor-Critic) (8 questions)
32
Easy
What is entropy H(π) of a policy? Why is higher entropy generally better for exploration?
H(π) = -E[ log π(a|s) ]
33
Medium
Write SAC's maximum entropy objective. How is it different from standard RL?
J(π) = E[ Σ_t r_t + α · H(π(·|s_t)) ]
34
Medium
Write the soft Bellman equation. What's the extra term compared to standard Bellman?
Q(s,a) = r + γ · E[ Q(s',a') - α·log π(a'|s') ]
35
Hard
THE REPARAMETERIZATION TRICK (in simple terms):

Problem: We want to take gradients to improve our policy, but actions come from sampling (like rolling dice). You can't do calculus on dice rolls!

Explain the solution:
Instead of: a = sample from π(a|s)
Use: a = μ_θ(s) + σ_θ(s) × ε, where ε ~ N(0, 1)
36
Medium
Why does SAC use twin Q-networks (two critics)? What problem does this solve?
37
Hard
What is automatic temperature tuning in SAC? What is the target entropy, and why do we tune α?
38
Medium
SAC is off-policy. What does this mean for sample efficiency compared to PPO?
39
Hard
Why can SAC directly optimize Q-values through the policy (unlike PPO which needs the log-derivative trick)?
6. Gradient Flow (6 questions)
40
Medium
In REINFORCE, gradients flow through log π(a|s). The return G_t is treated as a constant. Why must we NOT backprop through G_t?
41
Hard
In Actor-Critic (and PPO), why do we "detach" the advantage A when computing the actor loss? What would go wrong if we didn't?
42
Hard
In SAC, gradients flow from Q(s, a) through the action a back to the policy. Draw/describe this gradient path. Why is this possible?
43
Medium
Why do we use a target network that updates slowly (soft update)? What's the formula?
θ_target ← τ·θ + (1-τ)·θ_target, where τ << 1
44
Hard
Compare gradient flow in these three algorithms:
• REINFORCE: policy ← log π × G
• PPO: policy ← log π × A (detached)
• SAC: policy ← Q(s, a_sampled)
Which has lowest variance? Why?
45
Medium
What is "bootstrapping" in RL? Which algorithms bootstrap, and what's the tradeoff?
7. Big Picture (5 questions)
46
Medium
Fill in this table comparing on-policy vs off-policy:
                     On-Policy           Off-Policy
Example algorithms   _______________     _______________
Sample efficiency    _______________     _______________
Replay buffer?       _______________     _______________
Stability            _______________     _______________
47
Medium
Trace the evolution: REINFORCE → Actor-Critic → PPO. What problem did each solve?
48
Medium
When would you choose PPO over SAC? When would you choose SAC over PPO?
49
Hard
Why is continuous action space harder than discrete? Name two challenges and how SAC addresses them.
50
Hard
You're training a robot arm with expensive real-world samples. Which algorithm family would you prefer and why? What if you had a fast simulator instead?
✅ Complete Answer Key

Q-Learning Answers

Q1: Bellman Optimality Equation
Q*(s, a) = E[ r + γ · max_{a'} Q*(s', a') ]
Terms:
• Q*(s, a): Optimal action-value
• r: Immediate reward
• γ: Discount factor (0 to 1)
• max_{a'} Q*(s', a'): Best possible value from the next state
Q2: TD Error
δ = r + γ · max_{a'} Q(s', a') - Q(s, a)
It's called an "error" because it measures the difference between what we predicted (Q(s,a)) and a better, reward-informed estimate of the same quantity. If our Q-function were perfect, δ would be zero in expectation.
Q3: Epsilon-Greedy
Epsilon-greedy: With probability ε, choose random action (explore). With probability 1-ε, choose best action (exploit).

Why decay ε: Early: need exploration. Later: exploit what we've learned.

ε = 0: Pure exploitation - the agent can get stuck on a suboptimal policy because it never tries anything new.
ε = 1: Pure exploration - behavior is uniformly random; Q-values can still be learned (Q-learning is off-policy), but the agent never exploits what it has learned.
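A minimal sketch in Python (NumPy; hypothetical tabular setup where Q is a states × actions array):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """With probability ε take a random action (explore), else the greedy one (exploit)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit
```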
Q4: Replay Buffer
Solves two problems:
1. Correlation: Consecutive experiences are correlated. Random sampling breaks correlations.
2. Sample efficiency: Reuse experiences multiple times instead of once.
Q5: Target Network
A frozen copy of Q-network, updated slowly.

Why needed: Using Q for both prediction AND target creates "moving target" problem. Target network provides stable targets.
Q6: Off-Policy
Off-policy = learn about optimal policy while following different behavior policy.

Advantage: Can reuse ANY past experience. Very sample-efficient.
The "max" makes it off-policy - we consider the best action, not the one we took.
Q7: Q-Learning Update
Q(s, a) ← Q(s, a) + α · [r + γ·max_{a'} Q(s', a') - Q(s, a)]
α controls how much we adjust toward the new estimate. Small α = stability but slower learning.
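A one-step tabular version of this update in Python (a sketch for a hypothetical toy setup; Q is a NumPy array indexed by state and action):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q(s,a) ← Q(s,a) + α·[r + γ·max_a' Q(s',a') - Q(s,a)]"""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped target
    td_error = td_target - Q[s, a]              # δ from Q2
    Q[s, a] += alpha * td_error                 # move toward the target by α
    return Q
```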
Q8: Maximization Bias
Problem: Same network selects AND evaluates best action. Noisy overestimates get selected, causing systematic bias.

Double DQN: One network SELECTS, different network EVALUATES:
target = r + γ · Q_target(s', argmax_{a'} Q_online(s', a'))
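A sketch of that target in PyTorch (assumes q_online and q_target map a batch of states to per-action Q-values; the names and batch layout are illustrative):

```python
import torch

def double_dqn_target(q_online, q_target, r, s_next, gamma, done):
    """Online net SELECTS the next action, target net EVALUATES it."""
    with torch.no_grad():
        a_star = q_online(s_next).argmax(dim=1, keepdim=True)      # selection
        q_eval = q_target(s_next).gather(1, a_star).squeeze(1)     # evaluation
        return r + gamma * (1.0 - done) * q_eval
```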

Policy Gradient Answers

Q9: Policy Gradient vs Q-Learning
Policy gradient: Directly optimizes policy π(a|s).
Q-learning: Learns value function Q(s,a), derives policy from it.

Policy gradient handles continuous actions naturally.
Q10: Policy Gradient Theorem
∇_θ J(θ) = E_τ [ Σ_t ∇_θ log π_θ(a_t|s_t) · G_t ]
∇_θ log π(a|s) points toward making action a MORE likely.

Positive G_t → increase probability. Negative G_t → decrease probability.
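As a surrogate loss this is often written as below (a PyTorch sketch; log_probs are log π_θ(a_t|s_t) over a batch of trajectory steps and returns are the corresponding G_t):

```python
import torch

def reinforce_loss(log_probs, returns):
    """Negated policy-gradient surrogate: minimizing it ascends E[log π_θ(a|s) · G_t]."""
    # G_t is treated as a constant - never backprop through the returns (see Q40).
    return -(log_probs * returns.detach()).mean()
```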
Q11: Log-Derivative Trick
Problem: Can't move gradient inside expectation over θ-dependent distribution.

Solution: ∇_θ p_θ(x) = p_θ(x) · ∇_θ log p_θ(x)

Rewrites to: E[f(x)·∇_θ log p_θ(x)]
Now we can sample from current π_θ and compute the gradient!
Q12: Return-to-Go
G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...
Why: Action at time t only affects rewards from t onward. Total return would credit action for past rewards (noise).
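Computed backwards in one pass, for example (plain Python sketch):

```python
def returns_to_go(rewards, gamma=0.99):
    """G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... for every step of one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]
```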
Q13: Baseline
Baseline b(s): Subtracted from return to reduce variance.

Why the expected gradient is unchanged: E_a[∇_θ log π(a|s) · b(s)] = b(s) · Σ_a ∇_θ π(a|s) = b(s) · ∇_θ 1 = 0, because the action probabilities always sum to 1.

Reduces variance by centering reward signal.
Q14: Positive/Negative Reward
+100: Gradient increases π(a|s) - do this more.
-100: Gradient decreases π(a|s) - do this less.
Policy gradient: "do more of what worked, less of what didn't."
Q15: Why On-Policy
Gradient requires E_{τ ~ π_θ}[...] - samples from CURRENT policy.

Old trajectories from π_old have wrong probability weights.
Every update invalidates old data - sample inefficient.
Q16: High Variance
Why: G_t depends on ALL future randomness (compounds).

Two techniques:
1. Baselines: Subtract V(s) to center signal
2. Actor-Critic: Replace G_t with TD estimates

Actor-Critic Answers

Q17: Two Networks
Actor: Policy π_θ(a|s) - learns WHICH actions to take.
Critic: Value V_φ(s) or Q_φ(s,a) - learns HOW GOOD states/actions are.
Q18: Advantage Function
A(s, a) = Q(s, a) - V(s)
Positive A: Action is BETTER than average - do more.
Negative A: Action is WORSE than average - do less.
Advantage = how much better/worse than typical.
Q19: TD Error as Advantage
Q(s,a) = E[r + γV(s')]

So: A = Q - V = E[r + γV(s')] - V(s) ≈ r + γV(s') - V(s) = δ

TD error is a sampled advantage estimate (biased but low variance).
Q20: GAE Formula
A_t^GAE = Σ_{l=0}^{∞} (γλ)^l · δ_{t+l}
λ controls bias-variance:
• λ=0: Just TD error (high bias, low variance)
• λ=1: Monte Carlo (low bias, high variance)
GAE = exponentially-weighted average of n-step advantages!
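The recursive form is easy to implement; a NumPy sketch for a single episode (assumes values has one extra bootstrap entry V(s_T) at the end):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_t = δ_t + (γλ)·A_{t+1}, where δ_t = r_t + γ·V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error δ_t
        running = delta + gamma * lam * running                  # recursive accumulation
        advantages[t] = running
    return advantages
```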
Q21: Bias-Variance Tradeoff
TD(0): Low variance (one step), high bias (trusts V)
Monte Carlo: High variance (many steps), low bias (actual returns)

Usually want something in between (GAE λ≈0.95).
Q22: Critic Loss
L_critic = E[ (V_θ(s) - G_t)² ] or E[ (V_θ(s) - (r + γV(s')))² ]
MSE between predicted value and target (Monte Carlo or TD).
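In code this is plain regression; a PyTorch sketch (the target - Monte Carlo return or TD target - is treated as fixed):

```python
import torch.nn.functional as F

def critic_loss(v_pred, v_target):
    """MSE between V_θ(s) and a fixed target such as G_t or r + γ·V(s')."""
    return F.mse_loss(v_pred, v_target.detach())
```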
Q23: GAE λ Extremes
λ=0: A_t = δ_t (TD error only). HIGH BIAS, LOW VARIANCE.
λ=1: A_t = G_t - V(s_t) (Monte Carlo). LOW BIAS, HIGH VARIANCE.
λ interpolates between TD (λ=0) and Monte Carlo (λ=1).
Q24: Lower Variance than REINFORCE
REINFORCE: Full Monte Carlo returns (high variance).
Actor-Critic: Critic provides learned baseline + shorter horizon estimates.
Trade some bias for much lower variance - usually worth it.

PPO Answers

Q25: Probability Ratio
r(θ) = π_θ(a|s) / π_{θ_old}(a|s)
• r=1: Same probability
• r>1: New policy makes action MORE likely
• r<1: New policy makes action LESS likely
Q26: Clipped Objective
L^CLIP = E[ min( r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A ) ]
Clipping: Prevents policy from changing too much. Creates "trust region" - only trust advantage estimates near old policy.
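A PyTorch sketch of the clipped surrogate (advantages are assumed to be precomputed and detached):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """-E[ min(r·A, clip(r, 1-ε, 1+ε)·A) ], with r = exp(log π_θ - log π_old)."""
    ratio = torch.exp(log_prob_new - log_prob_old)                  # r(θ)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```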
Q27: Clipping for A>0 vs A<0
A>0 (good action): Want to increase π. But capped at (1+ε)·A. Can't increase beyond clip.

A<0 (bad action): Want to decrease π. min() picks less negative term. Can't decrease beyond clip.
Clipping always prevents changes that make objective artificially good!
Q28: Importance Sampling
E_{x~p}[f(x)] = E_{x~q}[f(x) · p(x)/q(x)]
Why PPO needs it: Data from π_old, but optimizing π_θ. Ratio reweights samples for distribution mismatch.
Enables multiple gradient updates on same batch.
Q29: Trust Region
Trust region: Neighborhood where value estimates are reliable.

Why stay close:
1. Advantage estimates from old policy become wrong
2. Large changes can be catastrophic
3. Importance weights become high-variance
Q30: Specific Examples
r=2, A>0: min(2A, 1.2A) = 1.2A. Capped - gradient is zero.

r=0.5, A<0: min(0.5A, 0.8A). Since A<0, 0.5A more negative. Uses 0.8A. Gradient is zero.
Q31: Data Reuse
How: Importance sampling corrects for old data. Clipping prevents drift.

Limit: After many epochs, π_θ diverges from π_old. Clipping triggers constantly. Typically K=3-10 epochs.

SAC Answers

Q32: Entropy
H(π) = -E[ log π(a|s) ]
Entropy: Measures randomness/spread of policy.

Higher entropy → exploration: Maintains diversity, prevents premature convergence.
Q33: Maximum Entropy Objective
J(π) = E[ Σ_t r_t + α · H(π(·|s_t)) ]
Difference: Standard RL only maximizes return. SAC ALSO maximizes entropy.
"Among similar-reward policies, prefer the most random one."
Q34: Soft Bellman
Q(s,a) = r + γ · E[ Q(s',a') - α·log π(a'|s') ]
Extra term: -α·log π (entropy bonus). High-entropy futures valued more highly.
Q35: Reparameterization Trick
Problem: Can't do calculus on dice rolls (sampling).

Solution: a = μ_θ(s) + σ_θ(s) × ε, where ε ~ N(0,1)

1. Roll dice FIRST (sample ε)
2. Now a is a FORMULA (μ + σ × fixed number)
3. Formulas are differentiable!
Moved randomness into ε (constant w.r.t. θ). Action is now differentiable!
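A minimal PyTorch sketch of the trick (Gaussian case only; SAC additionally squashes the action with tanh, which is omitted here):

```python
import torch

def reparameterized_action(mu, log_std):
    """a = μ + σ·ε with ε ~ N(0, 1): the randomness lives in ε, so a is differentiable in θ."""
    std = log_std.exp()
    eps = torch.randn_like(std)   # sampled first, treated as a constant w.r.t. θ
    return mu + std * eps
```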
Q36: Twin Q-Networks
Uses min(Q_1, Q_2) for targets.

Problem solved: Q-function overestimation. Policy exploits errors.

Why min: Pessimistic estimate counteracts overestimation.
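A sketch of the soft target with the twin-critic min (PyTorch; q1_target/q2_target take (s, a) batches, and all names are illustrative):

```python
import torch

def sac_soft_target(q1_target, q2_target, r, s_next, a_next, log_pi_next, alpha, gamma, done):
    """r + γ·[min(Q1, Q2)(s', a') - α·log π(a'|s')], the soft Bellman target from Q34."""
    with torch.no_grad():
        q_min = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
        return r + gamma * (1.0 - done) * (q_min - alpha * log_pi_next)
```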
Q37: Auto Temperature Tuning
Target entropy: Typically -dim(action_space).

Why tune α: Fixed α works poorly. Optimal depends on task.

Entropy below target → increase α (more exploration)
Entropy above target → decrease α (more reward focus)
Q38: Sample Efficiency
Off-policy: Reuse ANY past experience from replay buffer.

vs PPO: PPO must throw away data after few epochs. SAC can train on same data indefinitely.
For expensive interactions (robotics), SAC's efficiency is huge.
Q39: Direct Q Optimization
PPO: Log-derivative trick (score function). High variance.

SAC: Reparameterization makes a differentiable. Gradient flows: Q → a → θ
SAC uses pathwise derivatives (low variance) vs score function (high variance).

Gradient Flow Answers

Q40: Why Not Backprop Through G_t
G_t comes from ENVIRONMENT (not differentiable). Can't compute ∂reward/∂θ.

Even with differentiable sim: high variance, exploding gradients, intractable.
G_t only scales the update (and sets its sign); the direction in parameter space comes entirely from ∇_θ log π.
Q41: Why Detach Advantage
A involves critic V. Without detaching:

Problem: Actor gradients flow into critic. Actor tries to change V to make A larger, not learn better policy.

Both networks become corrupted.
Actor treats A as "given information."
Q42: SAC Gradient Path
Loss = -Q(s,a) + α·log π

a = μ_θ(s) + σ_θ(s) × ε

∂Loss/∂θ = ∂Loss/∂a × ∂a/∂θ

Why possible: Reparameterization makes a differentiable. Can chain: Q → a → θ.
SAC asks critic "what's ∂Q/∂a?" and uses it directly.
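A sketch of that pathwise actor loss in PyTorch (assumes policy(s) returns a reparameterized action and its log-probability; names are illustrative):

```python
import torch

def sac_actor_loss(policy, q1, q2, s, alpha):
    """Minimize α·log π - Q: gradients flow from Q through the sampled action into θ."""
    a, log_pi = policy(s)                 # a = μ_θ(s) + σ_θ(s)·ε  (reparameterized)
    q = torch.min(q1(s, a), q2(s, a))     # pessimistic twin-critic value
    return (alpha * log_pi - q).mean()
```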
Q43: Target Network Soft Updates
θ_target ← τ·θ + (1-τ)·θ_target, τ ≈ 0.005
Why: Stable targets for Q-learning. Changing every step → moving target.

Soft update provides smoother training than hard copies.
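In PyTorch the soft update is a parameter-by-parameter blend, for example (a sketch):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """θ_target ← τ·θ + (1-τ)·θ_target for every parameter."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```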
Q44: Variance Comparison
REINFORCE: Highest. G_t = all future randomness + score function.

PPO: Medium. Learned baseline, GAE. Still score function.

SAC: Lowest. Pathwise derivatives through Q. No score function.
Pathwise (SAC) < score+baseline (PPO) < vanilla score (REINFORCE)
Q45: Bootstrapping
Bootstrapping: Using own estimates as targets (e.g., r + γV(s')).

Who bootstraps: Q-learning, Actor-Critic, PPO (TD), SAC. NOT REINFORCE.

Tradeoff: Introduces bias (estimates wrong) but reduces variance. Usually worth it.

Big Picture Answers

Q46: On vs Off-Policy Table
                     On-Policy                  Off-Policy
Examples             REINFORCE, PPO, A2C        DQN, SAC, DDPG, TD3
Sample efficiency    Lower (data discarded)     Higher (replay buffer)
Replay buffer?       No / limited               Yes (essential)
Stability            More stable                Can be less stable
Q47: Algorithm Evolution
REINFORCE: First policy gradient. Problem: HIGH VARIANCE.

Actor-Critic: Solved VARIANCE with critic's value estimates.

PPO: Solved STABILITY & DATA REUSE with clipping, enabling multiple epochs.
Each step trades off sample efficiency, stability, complexity.
Q48: PPO vs SAC
Choose PPO: Fast/cheap simulation, need stability, discrete actions, easy parallelization, simpler tuning.

Choose SAC: Real-world robotics (expensive samples), continuous actions, need max sample efficiency, want built-in exploration.
Q49: Continuous Action Challenges
1. Can't enumerate max: SAC learns policy that maximizes Q via reparameterization.

2. Exploration harder: Random continuous actions usually terrible. SAC uses entropy maximization for principled exploration.
Q50: Real Robot vs Simulator
Real robot: Choose SAC. Max sample efficiency. Every interaction costly.

Fast simulator: Choose PPO. Parallelize across many instances. Stability matters more than efficiency.
Optimal algorithm depends on cost of environment interaction!