🤖

The Problem

Why comparing robot policies is harder than it looks
statistics robotics

You've trained two robot manipulation policies. You run each one 50 times. Policy A succeeds 25 times (50%). Policy B succeeds 29 times (58%).

Is Policy B actually better?

Your gut says "probably, it scored 4 more." But here's the uncomfortable truth: with only 50 trials, random variation alone could easily produce a 4-success gap between two identical policies. You might be looking at noise, not signal.
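A quick sanity check of that claim (a Monte Carlo sketch, not from the STEP paper): simulate two identical policies and count how often chance alone produces a gap of 4 or more successes.

# Monte Carlo sketch: how often do two IDENTICAL policies (true rate 50%)
# differ by 4+ successes over 50 trials each, purely by chance?
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_sims = 50, 100_000
a = rng.binomial(n_trials, 0.5, size=n_sims)  # Policy A success counts
b = rng.binomial(n_trials, 0.5, size=n_sims)  # Policy B success counts
print(f"P(B - A >= 4) = {(b - a >= 4).mean():.2f}")  # roughly 0.24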

The same policy run 50 times can produce wildly different success counts due to stochastic environments

Robot policy evaluation is uniquely painful:

  • Each trial is expensive: it requires manual scene setup, reset, and human supervision
  • Environments are stochastic: even deterministic policies produce varying results due to randomized initial conditions and real-world physics
  • Sample sizes are tiny: typically 10–50 rollouts per policy, not thousands
  • Reproducibility matters: would another researcher reach the same conclusion?

We need statistical tools to separate signal from noise. This is the problem that STEP (Sequential Testing for Efficient Policy comparison) solves.

The central question: how unlikely is it that pure randomness produced the apparent win of Policy B over Policy A? This is exactly what hypothesis testing answers.
📊

Batch Testing: Barnard's Test

The classical approach: commit upfront, test once
batch statistics testing

The simplest rigorous approach: Barnard's exact test. You decide in advance how many rollouts to collect, run them all, then test once.

The setup is a 2×2 contingency table:

            Success   Failure
Policy A    25        25
Policy B    29        21

The null hypothesis H₀: Policy B is not better than Policy A (their true success rates satisfy p₁ ≤ p₀). The alternative H₁: Policy B is genuinely better (p₁ > p₀).

Barnard's test computes the p-value: the probability of seeing a gap at least this large if the null hypothesis were true. If p < 0.05, we reject H₀ with 95% confidence.

Barnard's exact test in Python
from scipy.stats import barnard_exact

def is_b_better(succ_a, succ_b, n, confidence=0.95):
    result = barnard_exact(
        [[succ_a, succ_b], [n - succ_a, n - succ_b]],
        alternative="less",
    )
    return result.pvalue < 1.0 - confidence

# Our running example:
is_b_better(25, 29, 50)  # → False! Not significant.
[Interactive batch significance calculator: enter the rollouts per policy and the success counts for Policy A and Policy B]

With our example (25 vs 29 out of 50), Barnard's test fails to conclude B is better. The signal-to-noise ratio is too low. You'd need a larger gap or more trials.

What happens if Barnard's test is inconclusive and you decide to run 20 more trials per policy, then test again?
This is p-hacking (data dredging). By testing at multiple stopping points, you inflate the false positive rate beyond the claimed 5%. The batch approach strictly requires that you commit to a sample size upfront and test exactly once. If you peek and keep going, your statistical guarantees are void.
⚠️

The P-Hacking Trap

Why you can't just 'collect more data until it works'
phacking statistics

Imagine this tempting workflow: run 20 trials each, test. Not significant? Run 10 more, test again. Still nothing? Another 10. Keep going until you get p < 0.05 or give up.

This is p-hacking, and it completely breaks your statistical guarantees. The more times you peek at your data and test, the higher the chance of a false positive, even when the two policies are identical.

False positive rate inflates with repeated testing: the more you peek, the more you fool yourself
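To see the inflation concretely, here is a small simulation (my own sketch, not from the paper): two identical policies, with Barnard's test run at every peek. The exact test is slow, so the replication count is modest and the estimate rough, but it lands well above the nominal 5%.

# Sketch: repeated peeking inflates the false positive rate.
# Two IDENTICAL policies (p = 0.5); run Barnard's test at n = 10, 20, ..., 50
# and declare "B is better" the first time p < 0.05.
import numpy as np
from scipy.stats import barnard_exact

rng = np.random.default_rng(0)
n_sims, peeks = 200, [10, 20, 30, 40, 50]
false_positives = 0

for _ in range(n_sims):
    a = rng.random(50) < 0.5  # Policy A outcomes (success = True)
    b = rng.random(50) < 0.5  # Policy B outcomes
    for n in peeks:
        sa, sb = int(a[:n].sum()), int(b[:n].sum())
        table = [[sa, sb], [n - sa, n - sb]]
        if barnard_exact(table, alternative="less").pvalue < 0.05:
            false_positives += 1  # a false "B is better" conclusion
            break

print(f"false positive rate with peeking: {false_positives / n_sims:.2f}")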

The paradox of batch testing: The number of rollouts you need depends on the true performance gap between policies. A 30-percentage-point gap needs ~20 trials. A 5-point gap might need 500+. But you don't know the gap before running the experiment; that's what you're trying to measure!

So you're stuck: commit to a number upfront (possibly way too few or wastefully many), or peek and break your statistics.
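To make the trade-off concrete, here is a back-of-the-envelope power calculation (a classical normal-approximation estimate via statsmodels, not part of STEP) showing how the required sample size explodes as the gap shrinks:

# Rough sample sizes for a one-sided two-proportion test (80% power, alpha = 0.05).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def trials_per_policy(p_a, p_b, alpha=0.05, power=0.80):
    effect = proportion_effectsize(p_b, p_a)  # Cohen's h
    return NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="larger"
    )

print(round(trials_per_policy(0.50, 0.80)))  # 30pp gap: ~15 trials per policy
print(round(trials_per_policy(0.50, 0.55)))  # 5pp gap: ~600 trials per policy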

This is the fundamental tension that STEP resolves: how do you adaptively decide when you have enough data, without inflating your false positive rate? The answer: sequential testing with rigorous error control built into the stopping rule.

The chart above shows what happens with a naive "sequential Barnard" approach: just running Barnard's test after each new trial pair. The false positive rate (supposed to be ≤ 5%) can climb to 15–20% or higher. STEP maintains it at exactly the promised level.

⚡

STEP: The Core Idea

Sequential Testing for Efficient Policy Comparison
step sequential

STEP flips the script. Instead of committing to a fixed number of trials, after each paired trial you make one of three decisions:

  1. Continue: not enough evidence yet, run another trial pair
  2. Reject H₀: sufficient evidence that Policy B is better
  3. Accept H₀: sufficient evidence that Policy B is not better (or stop at budget)

The key: these decision boundaries are precomputed offline via optimization, guaranteeing that the overall false positive rate stays below your chosen threshold (e.g., 5%) regardless of when you stop.

STEP decision regions: after each trial, check your state (cumulative successes) against precomputed boundaries

The state at step n is simple: x_n = (S₀, S₁, n), where S₀ and S₁ are the cumulative success counts for each policy. After each trial pair, each of S₀ and S₁ either increments (success) or stays put (failure), and n increments by 1.

At each point on this discrete lattice, STEP has precomputed whether to continue, reject, or accept. Easy comparisons (big performance gaps) trigger early stopping. Hard comparisons (small gaps) use more of the budget, but never wastefully.

STEP is like having a lookup table: after each trial, check your (S₀, S₁, n) coordinates. The table tells you whether to stop or continue. The table was computed to be near-optimal: it stops as early as possible while keeping your error guarantees.
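In code, the runtime side could look like this minimal sketch (the policy_table dict is hypothetical; the real table comes from the precomputation described in the next section):

# Minimal runtime loop, assuming a precomputed lookup table
# policy_table: dict mapping (s0, s1, n) -> "continue" | "reject" | "accept".
def run_step(trial_pairs, policy_table):
    s0 = s1 = n = 0
    for n, (a_success, b_success) in enumerate(trial_pairs, start=1):
        s0 += int(a_success)  # cumulative successes of Policy A
        s1 += int(b_success)  # cumulative successes of Policy B
        decision = policy_table.get((s0, s1, n), "continue")
        if decision != "continue":
            return decision, n  # early stop: "reject" or "accept"
    return "accept", n  # budget exhausted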
🔧

How STEP Works Inside

Linear programs and risk budgets
step statistics

STEP precomputes its decision rule by solving a sequence of linear programs, one for each possible trial count n = 1, 2, ..., N_max.

The total false positive budget α* (e.g., 0.05) is split across time steps via a risk budget function f(n), with the per-step shares summing to the full budget: f(1) + f(2) + ⋯ + f(N_max) = α*.

In practice, a uniform budget works well: f(n) = α*/N_max for all n. Each time step gets an equal share of the false-positive allowance.

At each step n, STEP solves a linear program of (schematically) this form, reconstructed from the description below:

    maximize    Σ_x w_n(x)                                 (total rejection mass)
    subject to  Σ_x P_n(m, x) · w_n(x) ≤ f(n)    for every null hypothesis m
                0 ≤ w_n(x) ≤ 1                   for every state x

where w_n(x) is the probability of rejecting H₀ in state x.

In plain English: maximize the rejection region (maximize statistical power, the ability to detect a real difference) while ensuring that the probability of a false rejection, under any null hypothesis, stays within the allocated budget.

The matrix P_n encodes the probability of reaching each state under each possible null hypothesis. The constraint ensures that no matter what the true success rates are (as long as p₁ ≤ p₀), the total rejection probability stays below α*.
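To make this concrete, here is a deliberately simplified, single-shot version of the LP (my own toy reconstruction: it considers only one step n, ignores earlier stopping and the per-step budget split, and caps the rejection probability at α for every null on a 0.1 grid), solved with scipy:

# Toy single-step LP: choose rejection probabilities w[x] in [0, 1] per state
# x = (s0, s1) to maximize the rejection region, subject to a false-positive
# cap under every discretized null hypothesis (p1 <= p0).
import numpy as np
from scipy.optimize import linprog
from scipy.stats import binom

n, alpha = 10, 0.05
states = [(s0, s1) for s0 in range(n + 1) for s1 in range(n + 1)]
grid = np.round(np.arange(0.0, 1.01, 0.1), 1)
nulls = [(p0, p1) for p0 in grid for p1 in grid if p1 <= p0]

# A_ub[m][x] = probability of landing in state x under null m
A_ub = [[binom.pmf(s0, n, p0) * binom.pmf(s1, n, p1) for (s0, s1) in states]
        for (p0, p1) in nulls]
b_ub = [alpha] * len(nulls)

res = linprog(c=-np.ones(len(states)),  # maximize sum(w) == minimize -sum(w)
              A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(states))
reject = [x for x, w in zip(states, res.x) if w > 0.5]
print(f"{len(reject)} states in the rejection region, e.g. {reject[:3]}")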

STEP precomputation (pseudocode)
# Offline precomputation (done once, before any trials)
for n in range(1, N_max + 1):
    P_n = propagate(P_{n-1}, decisions_{n-1})  # forward evolution of the state distribution
    w_n = solve_LP(P_n, risk_budget)           # maximize the rejection region
    decisions_n = extract_regions(w_n)         # continue / reject / accept

# At runtime: dead simple
for each trial pair:
    update (S_0, S_1, n)
    if (S_0, S_1, n) in reject_region:   STOP → B is better
    elif (S_0, S_1, n) in accept_region: STOP → no difference
    else:                                CONTINUE

Why this beats other methods: SAVI (Safe Anytime-Valid Inference) is valid for arbitrary N_max but overly conservative in small-sample regimes. Lai's method solves an asymptotic PDE (a heat equation), which is ill-posed at small N. STEP solves the finite-sample exact problem, which is what matters when your budget is 50–100 trials.

Why does STEP use a discrete grid of null hypotheses instead of testing against all possible (pโ‚€, pโ‚) pairs?
The null hypothesis H₀: p₁ ≤ p₀ is composite: it contains infinitely many (p₀, p₁) pairs. To make the LP tractable, STEP discretizes this space into M representative points (e.g., intervals of 0.1). The constraint must hold for all M points, ensuring approximate control over the full continuous null. Finer grids give tighter control but increase computation.
🎲

Interactive: Sequential vs Batch

Watch STEP stop early on easy comparisons
interactive step sequential

Let's see the difference in action. This simulation draws random binary outcomes for two policies with true success rates you specify, then shows how batch testing and sequential testing compare.

[Interactive simulation: set the true success rates for Policy A and Policy B and the max trial budget N_max, then compare batch and sequential testing on simulated rollouts]
Cumulative success difference over trials: green zone = reject H₀ (B is better), red zone = accept H₀

Try it: Set both rates to the same value (e.g., 0.50 and 0.50) and watch STEP correctly fail to declare a difference, or occasionally reach the budget. Then set a big gap (e.g., 0.30 vs 0.80) and watch it stop very early.

The key practical benefit: when Policy A scored 2/9 and Policy B scored 8/9, STEP concluded early, saving 41 × 2 = 82 individual rollouts. That's hours of human effort saved.

📈

Multi-Policy Comparison

Compact Letter Display and Bayesian violin plots
visualization statistics

What if you have more than two policies? Say you're comparing 5 checkpoints from different training runs. You'd need pairwise comparisons, but with a correction.

Bonferroni correction: If you're running k pairwise tests at confidence level 95%, each individual test must use confidence level 1 - 0.05/k to keep the overall false positive rate at 5%. Conservative, but safe.
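The correction itself is simple arithmetic; for the 5-checkpoint example:

# Bonferroni correction for all pairwise comparisons among 5 policies
from math import comb

num_policies = 5
k = comb(num_policies, 2)      # 10 pairwise tests
alpha_family = 0.05            # desired family-wise error rate
alpha_each = alpha_family / k  # 0.005 per test
print(k, alpha_each, 1 - alpha_each)  # 10 0.005 0.995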

The results are visualized with two complementary tools:

1. Compact Letter Display (CLD)

Each policy gets one or more letters. The rule: two policies that share NO letter are statistically separated. Policies that share at least one letter are not significantly different.

Compact Letter Display: policies sharing a letter are not significantly different from each other
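The rule is easy to encode; a tiny helper (hypothetical, but matching the definition above):

# CLD rule: two policies are statistically separated iff their letter sets are disjoint
def statistically_separated(label_a: str, label_b: str) -> bool:
    return set(label_a).isdisjoint(label_b)

print(statistically_separated("ab", "bc"))  # False: they share 'b'
print(statistically_separated("a", "b"))    # True: no shared letter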

2. Bayesian Violin Plots

Why not just use confidence intervals? Because overlapping confidence intervals are misleading: two intervals can overlap substantially while the policies are still statistically separated by direct hypothesis testing. Violin plots show the full posterior distribution of each policy's success rate, giving a much richer picture.

The posterior uses a Beta distribution with uniform prior: Beta(1 + successes, 1 + failures). CLD letters are overlaid directly on the violin bodies.
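A minimal sketch of that posterior computation with scipy (the violin plotting itself is omitted):

# Posterior for one policy: Beta(1 + successes, 1 + failures),
# i.e., a uniform Beta(1, 1) prior updated with binary outcomes.
from scipy.stats import beta

successes, failures = 29, 21
posterior = beta(1 + successes, 1 + failures)

samples = posterior.rvs(size=10_000, random_state=0)  # feed these to a violin plot
lo, hi = posterior.ppf([0.025, 0.975])
print(f"mean {posterior.mean():.3f}, 95% credible interval [{lo:.3f}, {hi:.3f}]")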

Bayesian violin plots with CLD labels: the full posterior uncertainty, not just point estimates
Policy X has CLD label "ab" and Policy Y has label "bc". Are they significantly different?
No. They share the letter "b", so they are NOT significantly different from each other. If X had label "a" and Y had label "b" (no shared letters), then they would be separated.
๐Ÿ†

Real Robot Results

STEP on real manipulation tasks
step robotics

STEP was validated on real-world robot tasks at TRI and in simulation. Here are the results: the number of trials each method needed before reaching a conclusion:

Task                     α*     N_max   Gap    SAVI   Lai   STEP   Oracle
FoldRedTowel             0.05   50      36pp   20     17    19     17
CleanUpSpill             0.05   50      52pp   7      8     8      7
CarrotOnPlate            0.05   100     ~0pp   All methods: fail to decide
SpoonOnTowel (sim)       0.01   500     30pp   33     36    36     26
EggplantInBasket (sim)   0.01   500     16pp   192    125   131    128
StackCube (sim)          0.01   500     3pp    329    417   225    135

Key findings:

  • Easy comparisons (large gaps like FoldRedTowel, CleanUpSpill): all methods stop early, similar performance
  • Hard comparisons (tiny gaps like StackCube at 3pp): STEP saves 104 trials vs SAVI, 192 vs Lai
  • Genuinely ambiguous (CarrotOnPlate): all methods correctly return "fail to decide"; the gap is too small for the budget
Multi-task total trials: STEP saves up to 32% of evaluation effort vs state-of-the-art baselines
STEP matters most where it matters most: on the hardest comparisons where every trial costs significant effort. Easy comparisons are easy for everyone.
📋

Practical Playbook

How to actually use STEP in your workflow
practice robotics

Step 1: Install

Installation & basic usage
pip install sequentialized_barnard_tests

from sequentialized_barnard_tests import Decision, get_mirrored_test, Hypothesis

# Create the sequential test (precomputes the decision rule)
test = get_mirrored_test(
    num_max_rollouts=50,
    alternative=Hypothesis.P0LessThanP1,
    alpha=0.05,
)

# After collecting trial data:
result = test.run_on_sequence(successes_a, successes_b)
if result.decision == Decision.AcceptAlternative:
    print(f"B is better! Stopped at trial {result.info['Time']}")
elif result.decision == Decision.AcceptNull:
    print("No significant difference found")

Step 2: Run the experiment right

  1. Define success criteria: detailed, unambiguous, and fixed before running anything
  2. Interleave and randomize: alternate policies in randomized order and blind the evaluator (see the sketch after this list)
  3. Same initial conditions: use the same distribution of scene setups for each policy
  4. Run within a continuous session: don't spread across days if conditions might drift
  5. Report everything: trial count, success criteria, initial conditions, stopping reason
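For item 2, a minimal way to generate a blinded, randomized within-pair ordering (a hypothetical helper, not part of the STEP library):

# Randomize which policy runs first in each trial pair, reproducibly
import random

rng = random.Random(0)  # fixed seed so the schedule can be re-created
for pair in range(1, 51):
    first, second = rng.sample(["Policy A", "Policy B"], 2)
    print(f"trial pair {pair}: run {first}, then {second}")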
Multi-policy comparison with CLD
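For the multi-policy case, one plausible pattern (my own sketch; only the get_mirrored_test call from Step 1 is library API as shown) is to build each pairwise test with a Bonferroni-corrected alpha:

# Pairwise STEP tests for 5 policies with a Bonferroni-corrected alpha
from itertools import combinations
from math import comb

from sequentialized_barnard_tests import Hypothesis, get_mirrored_test

policies = ["ckpt_1", "ckpt_2", "ckpt_3", "ckpt_4", "ckpt_5"]  # hypothetical names
k = comb(len(policies), 2)  # 10 pairwise comparisons

tests = {
    (a, b): get_mirrored_test(
        num_max_rollouts=50,
        alternative=Hypothesis.P0LessThanP1,
        alpha=0.05 / k,  # keep the family-wise error rate at 5%
    )
    for a, b in combinations(policies, 2)
}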

Step 3: Don't forget the caveats

  • STEP handles binary success/failure only, not continuous rewards
  • Trials must be IID: independently drawn from the same distribution
  • A "fail to decide" does not mean the policies are equal; it means the gap is too small for your trial budget
  • Complement success rates with other metrics: task completion time, partial progress, failure modes


🧪

Recap & Self-Test

Lock in your understanding
statistics step

Let's make sure the core concepts are solid:

1. Why is Barnard's test preferred over a chi-squared test for small robot evaluation sample sizes?
Barnard's test is an exact test: it computes exact probabilities without relying on asymptotic (large-sample) approximations. Chi-squared tests assume the sample is large enough for the normal approximation to hold, which fails when n is 10–50. Barnard's test is valid at any sample size.
2. STEP controls "Type-I error." What is Type-I error in this context, and why is it worse than Type-II error?
Type-I error (false positive): concluding Policy B is better when it's actually not. This is dangerous because you'd deploy an inferior policy based on bad evidence. Type-II error (false negative): failing to detect a real improvement. This is less harmful; you just miss an opportunity. STEP guarantees Type-I stays below α* while maximizing power (minimizing Type-II).
3. You're comparing 4 policies with STEP at 95% global confidence. How many pairwise tests do you run, and what confidence level does each use?
With 4 policies, there are C(4,2) = 6 pairwise tests. Using Bonferroni correction, each test uses confidence level 1 - 0.05/6 = 99.17% (or equivalently, α ≈ 0.00833 per test). This ensures the overall family-wise error rate stays at 5%.
4. Why should trials be interleaved and randomized rather than running all of Policy A first, then all of Policy B?
Running all of one policy first introduces confounding variables: the experimenter might fatigue, the environment might drift (temperature, lighting, battery charge), or the scene setup might subtly change over a long session. Interleaving ensures both policies face the same conditions. Randomization prevents unconscious bias in the ordering. This is the same principle as randomized controlled trials in medicine.
5. STEP's decision rule is "precomputed offline." What does this mean practically, and what parameters affect the precomputation?
Before any trials run, STEP solves a sequence of linear programs to determine, for every possible state (S₀, S₁, n), whether to continue, reject, or accept. This precomputation depends on: N_max (trial budget), α* (error tolerance), M (discretization resolution of the null hypothesis space), and the risk budget function f(n). Once computed, the runtime decision is just a table lookup, trivially fast.