🤖

The Problem

Why comparing robot policies is harder than it looks
statistics robotics

You've trained two robot manipulation policies. You run each one 50 times. Policy A succeeds 25 times (50%). Policy B succeeds 29 times (58%).

Is Policy B actually better?

Your gut says "probably, it scored 4 more." But here's the uncomfortable truth: with only 50 trials, random variation alone could easily produce a 4-success gap between two identical policies. You might be looking at noise, not signal.
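A quick sanity check of that claim (a Monte Carlo sketch, not from the STEP paper): simulate two identical policies and count how often chance alone produces a gap of 4 or more successes.

# Monte Carlo sketch: how often do two IDENTICAL policies (true rate 50%)
# differ by 4+ successes over 50 trials each, purely by chance?
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_sims = 50, 100_000
a = rng.binomial(n_trials, 0.5, size=n_sims)  # Policy A success counts
b = rng.binomial(n_trials, 0.5, size=n_sims)  # Policy B success counts
print(f"P(B - A >= 4) = {(b - a >= 4).mean():.2f}")  # roughly 0.24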

The same policy run 50 times can produce wildly different success counts due to stochastic environments

Robot policy evaluation is uniquely painful:

  • Each trial is expensive: it requires manual scene setup, reset, and human supervision
  • Environments are stochastic: even deterministic policies produce varying results due to randomized initial conditions and real-world physics
  • Sample sizes are tiny: typically 10–50 rollouts per policy, not thousands
  • Reproducibility matters: would another researcher reach the same conclusion?

We need statistical tools to separate signal from noise. This is the problem that STEP (Sequential Testing for Efficient Policy comparison) solves.

The central question: how unlikely is it that pure randomness produced the apparent win of Policy B over Policy A? This is exactly what hypothesis testing answers.
📊

Batch Testing: Barnard's Test

The classical approach: commit upfront, test once
batch statistics testing

The simplest rigorous approach: Barnard's exact test. You decide in advance how many rollouts to collect, run them all, then test once.

The setup is a 2×2 contingency table:

            Success   Failure
Policy A    25        25
Policy B    29        21

The null hypothesis H₀: Policy B is not better than Policy A (their true success rates satisfy p₁ ≤ p₀). The alternative H₁: Policy B is genuinely better (p₁ > p₀).

Barnard's test computes the p-value: the probability of seeing a gap at least this large if the null hypothesis were true. If p < 0.05, we reject H₀ with 95% confidence.

Barnard's exact test in Python
from scipy.stats import barnard_exact

def is_b_better(succ_a, succ_b, n, confidence=0.95):
    result = barnard_exact(
        [[succ_a, succ_b], [n - succ_a, n - succ_b]],
        alternative="less",
    )
    return result.pvalue < 1.0 - confidence

# Our running example:
is_b_better(25, 29, 50)  # → False! Not significant.
[Interactive batch significance calculator: enter the rollouts per policy and the success counts for Policy A and Policy B]

With our example (25 vs 29 out of 50), Barnard's test fails to conclude B is better. The signal-to-noise ratio is too low. You'd need a larger gap or more trials.

What happens if Barnard's test is inconclusive and you decide to run 20 more trials per policy, then test again?
This is p-hacking (data dredging). By testing at multiple stopping points, you inflate the false positive rate beyond the claimed 5%. The batch approach strictly requires that you commit to a sample size upfront and test exactly once. If you peek and keep going, your statistical guarantees are void.
⚠️

The P-Hacking Trap

Why you can't just 'collect more data until it works'
phacking statistics

Imagine this tempting workflow: run 20 trials each, test. Not significant? Run 10 more, test again. Still nothing? Another 10. Keep going until you get p < 0.05 or give up.

This is p-hacking, and it completely breaks your statistical guarantees. The more times you peek at your data and test, the higher the chance of a false positive, even when the two policies are identical.

False positive rate inflates with repeated testing: the more you peek, the more you fool yourself
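To see the inflation concretely, here is a small simulation (my own sketch, not from the paper): two identical policies, with Barnard's test run at every peek. The exact test is slow, so the replication count is modest and the estimate rough, but it lands well above the nominal 5%.

# Sketch: repeated peeking inflates the false positive rate.
# Two IDENTICAL policies (p = 0.5); run Barnard's test at n = 10, 20, ..., 50
# and declare "B is better" the first time p < 0.05.
import numpy as np
from scipy.stats import barnard_exact

rng = np.random.default_rng(0)
n_sims, peeks = 200, [10, 20, 30, 40, 50]
false_positives = 0

for _ in range(n_sims):
    a = rng.random(50) < 0.5  # Policy A outcomes (success = True)
    b = rng.random(50) < 0.5  # Policy B outcomes
    for n in peeks:
        sa, sb = int(a[:n].sum()), int(b[:n].sum())
        table = [[sa, sb], [n - sa, n - sb]]
        if barnard_exact(table, alternative="less").pvalue < 0.05:
            false_positives += 1  # a false "B is better" conclusion
            break

print(f"false positive rate with peeking: {false_positives / n_sims:.2f}")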

The paradox of batch testing: The number of rollouts you need depends on the true performance gap between policies. A 30-percentage-point gap needs ~20 trials. A 5-point gap might need 500+. But you don't know the gap before running the experiment; that's what you're trying to measure!

So you're stuck: commit to a number upfront (possibly way too few or wastefully many), or peek and break your statistics.
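To make the trade-off concrete, here is a back-of-the-envelope power calculation (a classical normal-approximation estimate via statsmodels, not part of STEP) showing how the required sample size explodes as the gap shrinks:

# Rough sample sizes for a one-sided two-proportion test (80% power, alpha = 0.05).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def trials_per_policy(p_a, p_b, alpha=0.05, power=0.80):
    effect = proportion_effectsize(p_b, p_a)  # Cohen's h
    return NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="larger"
    )

print(round(trials_per_policy(0.50, 0.80)))  # 30pp gap: ~15 trials per policy
print(round(trials_per_policy(0.50, 0.55)))  # 5pp gap: ~600 trials per policy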

This is the fundamental tension that STEP resolves: how do you adaptively decide when you have enough data, without inflating your false positive rate? The answer: sequential testing with rigorous error control built into the stopping rule.

The chart above shows what happens with a naive "sequential Barnard" approach: just running Barnard's test after each new trial pair. The false positive rate (supposed to be ≤ 5%) can climb to 15–20% or higher. STEP maintains it at exactly the promised level.

⚡

STEP: The Core Idea

Sequential Testing for Efficient Policy Comparison
step sequential

STEP flips the script. Instead of committing to a fixed number of trials, after each paired trial you make one of three decisions:

  1. Continue: not enough evidence yet, run another trial pair
  2. Reject H₀: sufficient evidence that Policy B is better
  3. Accept H₀: sufficient evidence that Policy B is not better (or stop at budget)

The key: these decision boundaries are precomputed offline via optimization, guaranteeing that the overall false positive rate stays below your chosen threshold (e.g., 5%) regardless of when you stop.

STEP decision regions: after each trial, check your state (cumulative successes) against precomputed boundaries

The state at step n is simple: x_n = (S₀, S₁, n), where S₀ and S₁ are the cumulative success counts for each policy. After each trial pair, each of S₀ and S₁ either increments (success) or stays put (failure), and n increments by 1.

At each point on this discrete lattice, STEP has precomputed whether to continue, reject, or accept. Easy comparisons (big performance gaps) trigger early stopping. Hard comparisons (small gaps) use more of the budget, but never wastefully.

STEP is like having a lookup table: after each trial, check your (S₀, S₁, n) coordinates. The table tells you whether to stop or continue. The table was computed to be near-optimal: it stops as early as possible while keeping your error guarantees.
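In code, the runtime side could look like this minimal sketch (the policy_table dict is hypothetical; the real table comes from the precomputation described in the next section):

# Minimal runtime loop, assuming a precomputed lookup table
# policy_table: dict mapping (s0, s1, n) -> "continue" | "reject" | "accept".
def run_step(trial_pairs, policy_table):
    s0 = s1 = n = 0
    for n, (a_success, b_success) in enumerate(trial_pairs, start=1):
        s0 += int(a_success)  # cumulative successes of Policy A
        s1 += int(b_success)  # cumulative successes of Policy B
        decision = policy_table.get((s0, s1, n), "continue")
        if decision != "continue":
            return decision, n  # early stop: "reject" or "accept"
    return "accept", n  # budget exhausted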
🔧

How STEP Works Inside

Linear programs and risk budgets
step statistics

STEP precomputes its decision rule by solving a sequence of linear programs, one for each possible trial count n = 1, 2, ..., N_max.

The total false positive budget α* (e.g., 0.05) is split across time steps via a risk budget function f(n), with the per-step shares summing to the full budget: f(1) + f(2) + ⋯ + f(N_max) = α*.

In practice, a uniform budget works well: f(n) = α*/N_max for all n. Each time step gets an equal share of the false-positive allowance.

At each step n, STEP solves a linear program of (schematically) this form, reconstructed from the description below:

    maximize    Σ_x w_n(x)                                 (total rejection mass)
    subject to  Σ_x P_n(m, x) · w_n(x) ≤ f(n)    for every null hypothesis m
                0 ≤ w_n(x) ≤ 1                   for every state x

where w_n(x) is the probability of rejecting H₀ in state x.

In plain English: maximize the rejection region (maximize statistical power, the ability to detect a real difference) while ensuring that the probability of a false rejection, under any null hypothesis, stays within the allocated budget.

The matrix P_n encodes the probability of reaching each state under each possible null hypothesis. The constraint ensures that no matter what the true success rates are (as long as p₁ ≤ p₀), the total rejection probability stays below α*.
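To make this concrete, here is a deliberately simplified, single-shot version of the LP (my own toy reconstruction: it considers only one step n, ignores earlier stopping and the per-step budget split, and caps the rejection probability at α for every null on a 0.1 grid), solved with scipy:

# Toy single-step LP: choose rejection probabilities w[x] in [0, 1] per state
# x = (s0, s1) to maximize the rejection region, subject to a false-positive
# cap under every discretized null hypothesis (p1 <= p0).
import numpy as np
from scipy.optimize import linprog
from scipy.stats import binom

n, alpha = 10, 0.05
states = [(s0, s1) for s0 in range(n + 1) for s1 in range(n + 1)]
grid = np.round(np.arange(0.0, 1.01, 0.1), 1)
nulls = [(p0, p1) for p0 in grid for p1 in grid if p1 <= p0]

# A_ub[m][x] = probability of landing in state x under null m
A_ub = [[binom.pmf(s0, n, p0) * binom.pmf(s1, n, p1) for (s0, s1) in states]
        for (p0, p1) in nulls]
b_ub = [alpha] * len(nulls)

res = linprog(c=-np.ones(len(states)),  # maximize sum(w) == minimize -sum(w)
              A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(states))
reject = [x for x, w in zip(states, res.x) if w > 0.5]
print(f"{len(reject)} states in the rejection region, e.g. {reject[:3]}")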

STEP precomputation (pseudocode)
# Offline precomputation (done once, before any trials)
for n in range(1, N_max + 1):
    P_n = propagate(P_{n-1}, decisions_{n-1})  # forward evolution of the state distribution
    w_n = solve_LP(P_n, risk_budget)           # maximize the rejection region
    decisions_n = extract_regions(w_n)         # continue / reject / accept

# At runtime: dead simple
for each trial pair:
    update (S_0, S_1, n)
    if (S_0, S_1, n) in reject_region:   STOP → B is better
    elif (S_0, S_1, n) in accept_region: STOP → no difference
    else:                                CONTINUE

Why this beats other methods: SAVI (Safe Anytime-Valid Inference) is valid for arbitrary N_max but overly conservative in small-sample regimes. Lai's method solves an asymptotic PDE (a heat equation), which is ill-posed at small N. STEP solves the finite-sample exact problem, which is what matters when your budget is 50–100 trials.

Why does STEP use a discrete grid of null hypotheses instead of testing against all possible (pโ‚€, pโ‚) pairs?
The null hypothesis H₀: p₁ ≤ p₀ is composite: it contains infinitely many (p₀, p₁) pairs. To make the LP tractable, STEP discretizes this space into M representative points (e.g., intervals of 0.1). The constraint must hold for all M points, ensuring approximate control over the full continuous null. Finer grids give tighter control but increase computation.
🎲

Interactive: Sequential vs Batch

Watch STEP stop early on easy comparisons
interactive step sequential

Let's see the difference in action. This simulation draws random binary outcomes for two policies with true success rates you specify, then shows how batch testing and sequential testing compare.

[Interactive simulation: set the true success rates for Policy A and Policy B and the max trial budget N_max, then compare batch and sequential testing on simulated rollouts]
Cumulative success difference over trials: green zone = reject H₀ (B is better), red zone = accept H₀

Try it: Set both rates to the same value (e.g., 0.50 and 0.50) and watch STEP correctly fail to declare a difference, or occasionally reach the budget. Then set a big gap (e.g., 0.30 vs 0.80) and watch it stop very early.

The key practical benefit: when Policy A scored 2/9 and Policy B scored 8/9, STEP concluded early, saving 41 × 2 = 82 individual rollouts. That's hours of human effort saved.

📈

Multi-Policy Comparison

Compact Letter Display and Bayesian violin plots
visualization statistics

What if you have more than two policies? Say you're comparing 5 checkpoints from different training runs. You'd need pairwise comparisons, but with a correction.

Bonferroni correction: If you're running k pairwise tests at confidence level 95%, each individual test must use confidence level 1 - 0.05/k to keep the overall false positive rate at 5%. Conservative, but safe.
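The correction itself is simple arithmetic; for the 5-checkpoint example:

# Bonferroni correction for all pairwise comparisons among 5 policies
from math import comb

num_policies = 5
k = comb(num_policies, 2)      # 10 pairwise tests
alpha_family = 0.05            # desired family-wise error rate
alpha_each = alpha_family / k  # 0.005 per test
print(k, alpha_each, 1 - alpha_each)  # 10 0.005 0.995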

The results are visualized with two complementary tools:

1. Compact Letter Display (CLD)

Each policy gets one or more letters. The rule: two policies that share NO letter are statistically separated. Policies that share at least one letter are not significantly different.

Compact Letter Display: policies sharing a letter are not significantly different from each other
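The rule is easy to encode; a tiny helper (hypothetical, but matching the definition above):

# CLD rule: two policies are statistically separated iff their letter sets are disjoint
def statistically_separated(label_a: str, label_b: str) -> bool:
    return set(label_a).isdisjoint(label_b)

print(statistically_separated("ab", "bc"))  # False: they share 'b'
print(statistically_separated("a", "b"))    # True: no shared letter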

2. Bayesian Violin Plots

Why not just use confidence intervals? Because overlapping confidence intervals are misleading: two intervals can overlap substantially while the policies are still statistically separated by direct hypothesis testing. Violin plots show the full posterior distribution of each policy's success rate, giving a much richer picture.

The posterior uses a Beta distribution with uniform prior: Beta(1 + successes, 1 + failures). CLD letters are overlaid directly on the violin bodies.
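A minimal sketch of that posterior computation with scipy (the violin plotting itself is omitted):

# Posterior for one policy: Beta(1 + successes, 1 + failures),
# i.e., a uniform Beta(1, 1) prior updated with binary outcomes.
from scipy.stats import beta

successes, failures = 29, 21
posterior = beta(1 + successes, 1 + failures)

samples = posterior.rvs(size=10_000, random_state=0)  # feed these to a violin plot
lo, hi = posterior.ppf([0.025, 0.975])
print(f"mean {posterior.mean():.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")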

Bayesian violin plots with CLD labels: the full posterior uncertainty, not just point estimates
Policy X has CLD label "ab" and Policy Y has label "bc". Are they significantly different?
No. They share the letter "b", so they are NOT significantly different from each other. If X had label "a" and Y had label "b" (no shared letters), then they would be separated.
๐Ÿ†

Real Robot Results

STEP on real manipulation tasks
step robotics

STEP was validated on real-world robot tasks at TRI and in simulation. Here are the results: the number of trials each method needed before reaching a conclusion:

Task                     α*     N_max   Gap    SAVI   Lai   STEP   Oracle
FoldRedTowel             0.05   50      36pp   20     17    19     17
CleanUpSpill             0.05   50      52pp   7      8     8      7
CarrotOnPlate            0.05   100     ~0pp   All methods: fail to decide
SpoonOnTowel (sim)       0.01   500     30pp   33     36    36     26
EggplantInBasket (sim)   0.01   500     16pp   192    125   131    128
StackCube (sim)          0.01   500     3pp    329    417   225    135

Key findings:

  • Easy comparisons (large gaps like FoldRedTowel, CleanUpSpill): all methods stop early, similar performance
  • Hard comparisons (tiny gaps like StackCube at 3pp): STEP saves 104 trials vs SAVI, 192 vs Lai
  • Genuinely ambiguous (CarrotOnPlate): all methods correctly return "fail to decide"; the gap is too small for the budget
Multi-task total trials: STEP saves up to 32% of evaluation effort vs state-of-the-art baselines
STEP matters most where it matters most: on the hardest comparisons where every trial costs significant effort. Easy comparisons are easy for everyone.
📋

Practical Playbook

How to actually use STEP in your workflow
practice robotics

Step 1: Install

Installation & basic usage
pip install sequentialized_barnard_tests

from sequentialized_barnard_tests import Decision, get_mirrored_test, Hypothesis

# Create the sequential test (precomputes the decision rule)
test = get_mirrored_test(
    num_max_rollouts=50,
    alternative=Hypothesis.P0LessThanP1,
    alpha=0.05,
)

# After collecting trial data:
result = test.run_on_sequence(successes_a, successes_b)
if result.decision == Decision.AcceptAlternative:
    print(f"B is better! Stopped at trial {result.info['Time']}")
elif result.decision == Decision.AcceptNull:
    print("No significant difference found")

Step 2: Run the experiment right

  1. Define success criteria: detailed, unambiguous, and fixed before running anything
  2. Interleave and randomize: alternate policies in randomized order and blind the evaluator (see the sketch after this list)
  3. Same initial conditions: use the same distribution of scene setups for each policy
  4. Run within a continuous session: don't spread across days if conditions might drift
  5. Report everything: trial count, success criteria, initial conditions, stopping reason
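For item 2, a minimal way to generate a blinded, randomized within-pair ordering (a hypothetical helper, not part of the STEP library):

# Randomize which policy runs first in each trial pair, reproducibly
import random

rng = random.Random(0)  # fixed seed so the schedule can be re-created
for pair in range(1, 51):
    first, second = rng.sample(["Policy A", "Policy B"], 2)
    print(f"trial pair {pair}: run {first}, then {second}")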
Multi-policy comparison with CLD
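For the multi-policy case, one plausible pattern (my own sketch; only the get_mirrored_test call from Step 1 is library API as shown) is to build each pairwise test with a Bonferroni-corrected alpha:

# Pairwise STEP tests for 5 policies with a Bonferroni-corrected alpha
from itertools import combinations
from math import comb

from sequentialized_barnard_tests import Hypothesis, get_mirrored_test

policies = ["ckpt_1", "ckpt_2", "ckpt_3", "ckpt_4", "ckpt_5"]  # hypothetical names
k = comb(len(policies), 2)  # 10 pairwise comparisons

tests = {
    (a, b): get_mirrored_test(
        num_max_rollouts=50,
        alternative=Hypothesis.P0LessThanP1,
        alpha=0.05 / k,  # keep the family-wise error rate at 5%
    )
    for a, b in combinations(policies, 2)
}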

Step 3: Don't forget the caveats

  • STEP handles binary success/failure only, not continuous rewards
  • Trials must be IID: independently drawn from the same distribution
  • A "fail to decide" does not mean the policies are equal; it means the gap is too small for your trial budget
  • Complement success rates with other metrics: task completion time, partial progress, failure modes


🧪

Recap & Self-Test

Lock in your understanding
statistics step

Let's make sure the core concepts are solid:

1. Why is Barnard's test preferred over a chi-squared test for small robot evaluation sample sizes?
Barnard's test is an exact test: it computes exact probabilities without relying on asymptotic (large-sample) approximations. Chi-squared tests assume the sample is large enough for the normal approximation to hold, which fails when n is 10–50. Barnard's test is valid at any sample size.
2. STEP controls "Type-I error." What is Type-I error in this context, and why is it worse than Type-II error?
Type-I error (false positive): concluding Policy B is better when it's actually not. This is dangerous because you'd deploy an inferior policy based on bad evidence. Type-II error (false negative): failing to detect a real improvement. This is less harmful; you just miss an opportunity. STEP guarantees Type-I stays below α* while maximizing power (minimizing Type-II).
3. You're comparing 4 policies with STEP at 95% global confidence. How many pairwise tests do you run, and what confidence level does each use?
With 4 policies, there are C(4,2) = 6 pairwise tests. Using Bonferroni correction, each test uses confidence level 1 - 0.05/6 = 99.17% (or equivalently, α ≈ 0.00833 per test). This ensures the overall family-wise error rate stays at 5%.
4. Why should trials be interleaved and randomized rather than running all of Policy A first, then all of Policy B?
Running all of one policy first introduces confounding variables: the experimenter might fatigue, the environment might drift (temperature, lighting, battery charge), or the scene setup might subtly change over a long session. Interleaving ensures both policies face the same conditions. Randomization prevents unconscious bias in the ordering. This is the same principle as randomized controlled trials in medicine.
5. STEP's decision rule is "precomputed offline." What does this mean practically, and what parameters affect the precomputation?
Before any trials run, STEP solves a sequence of linear programs to determine, for every possible state (S₀, S₁, n), whether to continue, reject, or accept. This precomputation depends on: N_max (trial budget), α* (error tolerance), M (discretization resolution of the null hypothesis space), and the risk budget function f(n). Once computed, the runtime decision is just a table lookup, trivially fast.