You've trained two robot manipulation policies. You run each one 50 times. Policy A succeeds 25 times (50%). Policy B succeeds 29 times (58%).
Is Policy B actually better?
Your gut says "probably, it scored 4 more." But here's the uncomfortable truth: with only 50 trials, random variation alone could easily produce a 4-success gap between two identical policies. You might be looking at noise, not signal.
Robot policy evaluation is uniquely painful: real-robot rollouts are slow and need human supervision (setup, resets, monitoring), so realistic budgets are on the order of 50-100 trials per policy, and every wasted rollout hurts.
We need statistical tools to separate signal from noise. This is the problem that STEP (Sequential Testing for Efficient Policy comparison) solves.
The simplest rigorous approach: Barnard's exact test. You decide in advance how many rollouts to collect, run them all, then test once.
The setup is a 2×2 contingency table:
| | Success | Failure |
|---|---|---|
| Policy A | 25 | 25 |
| Policy B | 29 | 21 |
The null hypothesis H₀: Policy B is not better than Policy A (their true success rates satisfy p_B ≤ p_A). The alternative H₁: Policy B is genuinely better (p_B > p_A).
Barnard's test computes the p-value: the probability of seeing a gap at least this large if the null hypothesis were true. If p < 0.05, we reject H₀ with 95% confidence.
With our example (25 vs 29 out of 50), Barnard's test fails to conclude B is better. The signal-to-noise ratio is too low. You'd need a larger gap or more trials.
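If you want to reproduce this check, SciPy ships an implementation as `scipy.stats.barnard_exact` (SciPy ≥ 1.7). A minimal sketch with the counts above; note that SciPy's documented convention is one binomial sample per column, so the policies go in the columns here, and a one-sided test is available via the `alternative` argument:

```python
from scipy.stats import barnard_exact

# Columns are the two policies, rows are (success, failure) counts.
table = [[25, 29],   # successes: Policy A, Policy B
         [25, 21]]   # failures:  Policy A, Policy B

res = barnard_exact(table)            # two-sided by default
print(f"p-value: {res.pvalue:.3f}")   # well above 0.05, so we cannot conclude B is better
```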
Imagine this tempting workflow: run 20 trials each, test. Not significant? Run 10 more, test again. Still nothing? Another 10. Keep going until you get p < 0.05 or give up.
This is p-hacking, and it completely breaks your statistical guarantees. The more times you peek at your data and test, the higher the chance of a false positive, even when the two policies are identical.
The paradox of batch testing: The number of rollouts you need depends on the true performance gap between policies. A 30-percentage-point gap needs ~20 trials. A 5-point gap might need 500+. But you don't know the gap before running the experiment; that's what you're trying to measure!
So you're stuck: commit to a number upfront (possibly way too few or wastefully many), or peek and break your statistics.
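To get a feel for how steeply the required sample size grows as the gap shrinks, here is a rough fixed-sample power calculation. It uses a normal approximation from statsmodels rather than Barnard's exact test, so treat the outputs as ballpark figures:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

solver = NormalIndPower()
for p_a, p_b in [(0.50, 0.80), (0.50, 0.55)]:          # 30pp gap vs. 5pp gap
    h = proportion_effectsize(p_b, p_a)                 # Cohen's h effect size
    n = solver.solve_power(effect_size=h, alpha=0.05, power=0.8,
                           alternative="larger")
    print(f"{p_b - p_a:.0%} gap: roughly {n:.0f} trials per policy")
# The large gap needs on the order of 15-20 trials per policy; the small gap
# needs several hundred, in line with the figures quoted above.
```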
The chart above shows what happens with a naive "sequential Barnard" approach: just running Barnard's test after each new trial pair. The false positive rate (supposed to be ≤ 5%) can climb to 15-20% or higher. STEP maintains it at exactly the promised level.
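You can reproduce this inflation with a small Monte Carlo experiment. A toy sketch (the simulation sizes are kept small so the runtime stays modest):

```python
import numpy as np
from scipy.stats import barnard_exact

rng = np.random.default_rng(0)
alpha, n_max, n_sims = 0.05, 30, 200
false_positives = 0

for _ in range(n_sims):
    # Two *identical* policies, both with true success rate 0.5.
    a = rng.random(n_max) < 0.5
    b = rng.random(n_max) < 0.5
    for n in range(5, n_max + 1):                     # peek after every trial pair
        s_a, s_b = int(a[:n].sum()), int(b[:n].sum())
        table = [[s_a, s_b],                          # successes per policy
                 [n - s_a, n - s_b]]                  # failures per policy
        if barnard_exact(table).pvalue < alpha:       # "significant" at this peek?
            false_positives += 1
            break

print(f"False positive rate with repeated peeking: {false_positives / n_sims:.0%}")
# Well above the nominal 5%: exactly the inflation STEP's precomputed boundaries prevent.
```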
STEP flips the script. Instead of committing to a fixed number of trials, after each paired trial you make one of three decisions: stop and declare Policy B better (reject H₀), stop and accept that no difference can be shown within your budget, or continue with another trial pair.
The key: these decision boundaries are precomputed offline via optimization, guaranteeing that the overall false positive rate stays below your chosen threshold (e.g., 5%) regardless of when you stop.
The state at step n is simple: x_n = (S_A, S_B, n), where S_A and S_B are the cumulative success counts for each policy. After each trial pair, S_A and S_B each either increment (success) or stay the same (failure), and n increments by 1.
At each point on this discrete lattice, STEP has precomputed whether to continue, reject, or accept. Easy comparisons (big performance gaps) trigger early stopping. Hard comparisons (small gaps) use more of the budget โ but never wastefully.
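To make this concrete, here is a minimal sketch of the online loop, assuming the offline optimization has already produced a lookup rule. The dictionary `decision_table`, keyed by (S_A, S_B, n), and the rollout callables are illustrative stand-ins, not the actual package API:

```python
import random
from typing import Callable, Dict, Tuple

Decision = str  # "continue", "reject" (B is better), or "accept" (stop without a call)

def run_step_experiment(rollout_a: Callable[[], bool],
                        rollout_b: Callable[[], bool],
                        decision_table: Dict[Tuple[int, int, int], Decision],
                        n_max: int) -> Tuple[Decision, int]:
    """Paired sequential evaluation against a precomputed decision rule."""
    s_a = s_b = 0
    for n in range(1, n_max + 1):
        s_a += int(rollout_a())                       # S_A: cumulative successes of A
        s_b += int(rollout_b())                       # S_B: cumulative successes of B
        decision = decision_table.get((s_a, s_b, n), "continue")
        if decision != "continue":
            return decision, n                        # easy comparisons stop early
    return "accept", n_max                            # budget exhausted, no call made

# Toy usage with simulated policies and an empty table (so it always continues):
verdict, trials = run_step_experiment(
    rollout_a=lambda: random.random() < 0.3,
    rollout_b=lambda: random.random() < 0.8,
    decision_table={},                                # a real one comes from STEP's offline LPs
    n_max=50,
)
print(verdict, trials)
```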
STEP precomputes its decision rule by solving a sequence of linear programs, one for each possible trial count n = 1, 2, ..., N_max.
The total false positive budget α* (e.g., 0.05) is split across time steps via a risk budget function f(n), with the per-step allowances summing to at most α*.
In practice, a uniform budget works well: f(n) = α*/N_max for all n. Each time step gets an equal share of the false-positive allowance.
At each step n, STEP solves an optimization problem over the lattice of reachable states.
In plain English: maximize the rejection region (maximize statistical power, the ability to detect a real difference) while ensuring that the probability of a false rejection, under any null hypothesis, stays within the allocated budget.
The matrix P_n encodes the probability of reaching each state under each possible null hypothesis. The constraint ensures that no matter what the true success rates are (as long as p_B ≤ p_A), the total rejection probability stays below α*.
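Written out schematically (a paraphrase of the description above, not the paper's exact formulation), the step-n problem is a linear program over per-state rejection probabilities z_n(x):

$$
\begin{aligned}
\max_{z_n(x) \,\in\, [0,1]} \quad & \sum_{x} z_n(x) \\
\text{subject to} \quad & \sum_{x} P_n(x \mid p_A, p_B)\, z_n(x) \;\le\; f(n)
\quad \text{for all } (p_A, p_B) \text{ with } p_B \le p_A,
\end{aligned}
$$

where P_n(x | p_A, p_B) is the probability of reaching lattice state x at step n without having stopped earlier. The objective enlarges the rejection region as much as possible; the constraints cap the false rejection probability under every admissible null simultaneously at the per-step budget f(n).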
Why this beats other methods: SAVI (Safe Anytime-Valid Inference) is valid for arbitrary N_max but overly conservative in small-sample regimes. Lai's method solves an asymptotic PDE (heat equation), which is ill-posed at small N. STEP solves the finite-sample exact problem, which is what matters when your budget is 50-100 trials.
Let's see the difference in action. This simulation draws random binary outcomes for two policies with true success rates you specify, then shows how batch testing and sequential testing compare.
Try it: Set both rates to the same value (e.g., 0.50 and 0.50) and watch STEP correctly fail to declare a difference, or occasionally reach the budget. Then set a big gap (e.g., 0.30 vs 0.80) and watch it stop very early.
The key practical benefit: when Policy A scored 2/9 and Policy B scored 8/9, STEP concluded early, saving 41 × 2 = 82 individual rollouts. That's hours of human effort saved.
What if you have more than two policies? Say you're comparing 5 checkpoints from different training runs. You'd need pairwise comparisons, but with a correction.
Bonferroni correction: If you're running k pairwise tests at confidence level 95%, each individual test must use confidence level 1 - 0.05/k to keep the overall false positive rate at 5%. Conservative, but safe.
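For the 5-checkpoint example above, the arithmetic works out as follows (a tiny helper, not tied to any particular library):

```python
from math import comb

n_policies = 5
k = comb(n_policies, 2)                  # 10 pairwise comparisons
alpha_family = 0.05
alpha_per_test = alpha_family / k        # Bonferroni: split the error budget evenly
print(f"{k} tests, per-test alpha = {alpha_per_test:.4f}, "
      f"per-test confidence = {1 - alpha_per_test:.2%}")
# 10 tests -> per-test alpha = 0.0050, i.e. 99.50% confidence per test.
```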
The results are visualized with two complementary tools:
1. Compact Letter Display (CLD)
Each policy gets one or more letters. The rule: two policies that share NO letter are statistically separated. Policies that share at least one letter are not significantly different.
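An illustrative example: if Policy A carries the letter "a", Policy B carries "ab", and Policy C carries "b", then A and C share no letter and are statistically separated, while B cannot be distinguished from either of them.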
2. Bayesian Violin Plots
Why not just use confidence intervals? Because overlapping confidence intervals are misleading โ two intervals can overlap substantially while the policies are still statistically separated by direct hypothesis testing. Violin plots show the full posterior distribution of each policy's success rate, giving a much richer picture.
The posterior uses a Beta distribution with uniform prior: Beta(1 + successes, 1 + failures). CLD letters are overlaid directly on the violin bodies.
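A minimal sketch of the posterior computation, using the example counts from the beginning of the post (the violin plotting itself is left to your plotting library of choice):

```python
from scipy.stats import beta

# Uniform Beta(1, 1) prior  =>  posterior is Beta(1 + successes, 1 + failures).
results = {"Policy A": (25, 25), "Policy B": (29, 21)}   # (successes, failures)

for name, (s, f) in results.items():
    post = beta(1 + s, 1 + f)
    lo, hi = post.ppf([0.025, 0.975])                    # 95% credible interval
    print(f"{name}: posterior mean {post.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
    samples = post.rvs(size=5000, random_state=0)        # feed these to a violin plot
```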
STEP was validated on real-world robot tasks at TRI and in simulation. Here are the results, showing the number of trials each method needed before reaching a conclusion:
| Task | α* | N_max | Gap | SAVI | Lai | STEP | Oracle |
|---|---|---|---|---|---|---|---|
| FoldRedTowel | 0.05 | 50 | 36pp | 20 | 17 | 19 | 17 |
| CleanUpSpill | 0.05 | 50 | 52pp | 7 | 8 | 8 | 7 |
| CarrotOnPlate | 0.05 | 100 | ~0pp | All methods: Fail to Decide | | | |
| SpoonOnTowel (sim) | 0.01 | 500 | 30pp | 33 | 36 | 36 | 26 |
| EggplantInBasket (sim) | 0.01 | 500 | 16pp | 192 | 125 | 131 | 128 |
| StackCube (sim) | 0.01 | 500 | 3pp | 329 | 417 | 225 | 135 |
Key findings: STEP stops within a few trials of the best competing method on the larger-gap tasks, it is dramatically more efficient on the hardest comparison (StackCube, 3pp gap: 225 trials vs. 329 for SAVI and 417 for Lai), and when there is essentially no gap (CarrotOnPlate) every method correctly fails to decide rather than manufacturing a false positive.
Step 1: Install

`pip install sequentialized_barnard_tests`

Step 2: Run the experiment right

Step 3: Don't forget the caveats

Let's make sure the core concepts are solid. For example, with 6 pairwise comparisons under a Bonferroni correction, each individual test must run at confidence level 1 - 0.05/6 = 99.17% (or equivalently, α ≈ 0.00833 per test). This ensures the overall family-wise error rate stays at 5%.