Explain why clipping works differently for A > 0 vs A < 0. Draw or describe what happens in each case.
28
Medium
What is importance sampling? Why does PPO need it to reuse data from the old policy?
E_{x~p}[f(x)] = E_{x~q}[f(x) · p(x)/q(x)]
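A quick numeric sanity check of this identity (a sketch with made-up Gaussians, assuming NumPy; here p = N(0,1), q = N(1,1), and f(x) = x²):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=200_000)           # samples drawn from q, not p
log_ratio = -0.5 * x**2 + 0.5 * (x - 1.0)**2     # log p(x) - log q(x) for these two Gaussians
w = np.exp(log_ratio)                            # importance weights p(x)/q(x)
print(np.mean(w * x**2))                         # ≈ E_{x~p}[x^2] = 1.0
```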
29
Medium
What is a "trust region" conceptually? Why do we want to stay close to the old policy?
30
Hard
If r(θ) = 2 and A > 0, what does PPO do? What if r(θ) = 0.5 and A < 0?
31
Medium
PPO is "on-policy" but reuses data for multiple epochs. How is this possible? What limits how many epochs we can do?
5. SAC (Soft Actor-Critic) 8 questions
32
Easy
What is entropy H(π) of a policy? Why is higher entropy generally better for exploration?
H(π) = -E[ log π(a|s) ]
33
Medium
Write SAC's maximum entropy objective. How is it different from standard RL?
J(π) = E[ Σ_t ( r_t + α · H(π(·|s_t)) ) ]
34
Medium
Write the soft Bellman equation. What's the extra term compared to standard Bellman?
Q(s,a) = r + γ · E[ Q(s',a') - α·log π(a'|s') ]
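For reference, a minimal sketch of computing that backup target in code (the names r, q_next, log_prob_next, done and the α value are assumptions for illustration, not from the original):

```python
def soft_bellman_target(r, q_next, log_prob_next, done, gamma=0.99, alpha=0.2):
    # Extra term vs. the standard Bellman backup: -alpha * log π(a'|s'),
    # i.e. an entropy bonus added to the bootstrapped value.
    # (1 - done) masks out the bootstrap term at terminal states.
    return r + gamma * (1.0 - done) * (q_next - alpha * log_prob_next)
```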
35
Hard
THE REPARAMETERIZATION TRICK (in simple terms):
Problem: We want to take gradients to improve our policy, but actions come from sampling (like rolling dice). You can't do calculus on dice rolls!
Explain the solution:
Instead of: a = sample from π(a|s)
Use: a = μ_θ(s) + σ_θ(s) × ε, where ε ~ N(0, 1)
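A minimal sketch of that substitution, assuming PyTorch (the tensors mu and log_std stand in for the policy network's outputs μ_θ(s) and log σ_θ(s)):

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)         # stands in for μ_θ(s)
log_std = torch.tensor([-1.0], requires_grad=True)   # stands in for log σ_θ(s)

eps = torch.randn_like(mu)           # noise is sampled OUTSIDE the computation graph
action = mu + log_std.exp() * eps    # deterministic, differentiable function of μ and σ

action.sum().backward()              # gradients now reach μ_θ and σ_θ
print(mu.grad, log_std.grad)
```

In PyTorch this is what Normal(mu, std).rsample() does, as opposed to .sample(), which blocks gradients.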
36
Medium
Why does SAC use twin Q-networks (two critics)? What problem does this solve?
37
Hard
What is automatic temperature tuning in SAC? What is the target entropy, and why do we tune α?
38
Medium
SAC is off-policy. What does this mean for sample efficiency compared to PPO?
39
Hard
Why can SAC directly optimize Q-values through the policy (unlike PPO which needs the log-derivative trick)?
6. Gradient Flow 6 questions
40
Medium
In REINFORCE, gradients flow through log π(a|s). The return G_t is treated as a constant. Why must we NOT backprop through G_t?
41
Hard
In Actor-Critic (and PPO), why do we "detach" the advantage A when computing the actor loss? What would go wrong if we didn't?
42
Hard
In SAC, gradients flow from Q(s, a) through the action a back to the policy. Draw/describe this gradient path. Why is this possible?
43
Medium
Why do we use a target network that updates slowly (soft update)? What's the formula?
θ_target ← τ·θ + (1-τ)·θ_target, where τ << 1
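That formula as a minimal code sketch, assuming PyTorch; q_net and q_target are hypothetical nn.Module instances with identical architectures:

```python
import torch

def soft_update(q_net, q_target, tau=0.005):
    # θ_target ← τ·θ + (1-τ)·θ_target, applied parameter-by-parameter
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), q_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(p, alpha=tau)
```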
44
Hard
Compare gradient flow in these three algorithms:
• REINFORCE: policy ← log π × G
• PPO: policy ← log π × A (detached)
• SAC: policy ← Q(s, a_sampled)
Which has lowest variance? Why?
45
Medium
What is "bootstrapping" in RL? Which algorithms bootstrap, and what's the tradeoff?
7. Big Picture 5 questions
46
Medium
Fill in this table comparing on-policy vs off-policy:

|                    | On-Policy | Off-Policy |
|--------------------|-----------|------------|
| Example algorithms |           |            |
| Sample efficiency  |           |            |
| Replay buffer?     |           |            |
| Stability          |           |            |
47
Medium
Trace the evolution: REINFORCE → Actor-Critic → PPO. What problem did each solve?
48
Medium
When would you choose PPO over SAC? When would you choose SAC over PPO?
49
Hard
Why is continuous action space harder than discrete? Name two challenges and how SAC addresses them.
50
Hard
You're training a robot arm with expensive real-world samples. Which algorithm family would you prefer and why? What if you had a fast simulator instead?
✅ Complete Answer Key
Q-Learning Answers
Q1: Bellman Optimality Equation
Q*(s, a) = E[ r + γ · max_{a'} Q*(s', a') ]
Terms:
• Q*(s, a): Optimal action-value
• r: Immediate reward
• γ: Discount factor (0 to 1)
• max_{a'} Q*(s', a'): Best possible value from the next state
Q2: TD Error
δ = r + γ · max_{a'} Q(s', a') - Q(s, a)
It's called an "error" because it measures the gap between what we predicted, Q(s,a), and a better estimate: the TD target r + γ·max_{a'} Q(s', a'), which incorporates the observed reward. If our Q-function were perfect, δ would be zero in expectation.
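A toy numeric example with made-up values (tabular Q, two actions available in the next state):

```python
gamma = 0.9
Q = {("s", "a"): 2.0, ("s2", "a1"): 1.0, ("s2", "a2"): 3.0}

r = 1.0                                            # observed reward for (s, a) -> s2
best_next = max(Q[("s2", "a1")], Q[("s2", "a2")])  # max_{a'} Q(s', a') = 3.0
delta = r + gamma * best_next - Q[("s", "a")]      # 1 + 0.9·3 - 2 = 1.7
print(delta)                                       # positive: Q(s, a) was an underestimate
```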
Q3: Epsilon-Greedy
Epsilon-greedy: With probability ε, choose random action (explore). With probability 1-ε, choose best action (exploit).
Why decay ε: Early: need exploration. Later: exploit what we've learned.
ε = 0: Pure exploitation - can get stuck
ε = 1: Pure random - won't converge
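A minimal sketch of ε-greedy selection with a linear decay schedule (all names and constants here are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: list of Q(s, a) over actions for the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)               # anneal from 1.0 down to 0.05
```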
Q4: Replay Buffer
Solves two problems:
1. Correlation: Consecutive experiences are correlated. Random sampling breaks correlations.
2. Sample efficiency: Reuse experiences multiple times instead of once.
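A minimal replay buffer sketch (uniform random sampling; the transition tuple layout is just one common choice):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)       # oldest experiences are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks temporal correlation
        return list(zip(*batch))                         # -> (states, actions, rewards, next_states, dones)
```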
Q5: Target Network
A frozen copy of the Q-network, updated slowly.
Why needed: Using Q for both prediction AND target creates "moving target" problem. Target network provides stable targets.
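A minimal sketch of one common implementation (a hard copy every fixed number of steps, assuming PyTorch modules q_net and q_target with the same architecture); the soft-update variant appears under Q43 above:

```python
def maybe_sync_target(step, q_net, q_target, copy_every=1_000):
    # Periodically freeze a snapshot of the online network as the target.
    if step % copy_every == 0:
        q_target.load_state_dict(q_net.state_dict())
```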
Q6: Off-Policy
Off-policy = learn about the optimal policy while following a different behavior policy.
Advantage: Can reuse ANY past experience. Very sample-efficient.
The "max" makes it off-policy - we consider the best action, not the one we took.
Q7: Q-Learning Update
Q(s, a) ← Q(s, a) + α · [r + γ·max_{a'} Q(s', a') - Q(s, a)]
α controls how much we adjust toward the new estimate. Small α = stability but slower learning.
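That update as a tabular sketch (defaultdict keeps unvisited entries at 0; all values and names are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # max_{a'} Q(s', a')
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])          # move a fraction α toward the target
```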
Q8: Maximization Bias
Problem: Same network selects AND evaluates best action. Noisy overestimates get selected, causing systematic bias.
Double DQN: One network SELECTS, a different network EVALUATES:
target = r + γ · Q_target(s', argmax_{a'} Q(s', a'))
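A sketch of that target computation, assuming PyTorch with batched tensors (q_net and q_target are hypothetical networks mapping states to per-action Q-values):

```python
import torch

def double_dqn_target(q_net, q_target, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)       # online net SELECTS a'
        q_next = q_target(s_next).gather(1, best_a).squeeze(1)   # target net EVALUATES it
        return r + gamma * (1.0 - done) * q_next
```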
Policy Gradient Answers
Q9: Policy Gradient vs Q-Learning
Policy gradient: Directly optimizes policy π(a|s). Q-learning: Learns value function Q(s,a), derives policy from it.