Teleoperation latency-jitter report

FR3 (leader) → FR20 (follower) ServoJ teleop loop · Fairino SimMachine pair · measured end-to-end with ROBOT_TRACE=1

TL;DR

1 · How this was measured

The full stack was booted headless (no frontend) against two freshly-recreated Fairino SimMachine containers via scripts/profile_teleop_jitter.py, which sets ROBOT_TRACE=1, enters teleoperation, and drives the FR3 leader through 120 back-to-back circle laps over 60 s. The follower mirrors the leader through the normal ServoJ command path.

Every command carries a RobotState that is stamped with time.perf_counter() at each pipeline hop (robot_types.py:stamp, a no-op unless ROBOT_TRACE=1). Stamps are flushed to traces/robot_trace_*.jsonl and reshaped into per-hop deltas by servo7/utils/trace_stats.py. 3021 teleop commands were captured. To isolate the SDK lock specifically, an extra command_pre_lock stamp was added immediately before with self._sdk_lock: and command_lock_acquired immediately after, so lock-wait is separated from validation and from the RPC itself.

Environment: simulated controllers (CNDE realtime stream at 10 ms, publish loop 50 Hz). Absolute numbers will differ on real arms and on faster/slower hosts, but the relative breakdown and the contention conclusions hold. One 46.8 ms startup outlier (first command) is excluded from percentiles below.

2 · The command pipeline

A single teleop command travels leader→follower through these stamped hops (file:line of each stamp):

leader robot node robot_node.py:268 publish.pre_read └ read CNDE state robot_node.py:269 publish.got_state system-state manager system_state_node.py:40 ssm.deserialized system_state.py:69 sysstate.pre_callback ──┐ orchestrator teleop cb orchestrator_node.py:362 orch.cb_enter ◄─┘ ← scheduling gap └ leader→follower map orchestrator_node.py:368 orch.cb_converted └ publish to follower orchestrator_node.py:374 orch.cb_pre_serialize follower robot node robot_node.py:334 follower.command_parsed robot_node.py:339 follower.command_pre_hardware_write └ safety validate base.py:294 (validate_command) └ pre-lock fairino.py follower.command_pre_lock └ acquire _sdk_lock fairino.py:842 follower.command_lock_acquired └ ServoJ RPC fairino.py:843 follower.command_applied

3 · End-to-end result

2.43 msmedian (p50) latency
11.0 msp99 latency
±2.49 msstd dev (the jitter)

End-to-end leader-read → follower-ServoJ latency, n=3021: mean 3.20, p50 2.43, p90 7.09, p99 10.98, max 46.8 ms. The distribution is heavily right-skewed — most commands are fast (~1–2 ms) but a long tail reaches 7–11 ms. That spread is the jitter the operator feels.

21
1261
509
422
255
299
213
40
0–1
1–2
2–3
3–4
4–5
5–7.5
7.5–10
10–15

end-to-end latency (ms) — command count per bucket

4 · Where the time (and jitter) goes — per hop

Per-hop latency across all 3021 commands. The two highlighted rows are the jitter drivers; the green row is the SDK lock (the hypothesised culprit).

hopp50p95p99maxPDV p99contribution
leader CNDE read0.070.851.9345.3*2.52
state deserialize0.100.381.612.201.37
state-mgr → callback handoff0.020.100.140.200.07
callback scheduling pre_callback→cb_enter1.225.356.459.243.24
leader→follower convert0.061.532.696.942.80
serialize + bus publish0.040.200.240.500.17
safety validate + servo-mode pre_hw→pre_lock0.592.163.628.263.21
_sdk_lock wait pre_lock→lock_acquired0.0010.0010.0040.01~0
~0 µs
ServoJ RPC lock_acquired→applied1.063.524.438.783.82

Apply-path rows (validate / lock / ServoJ) are over the 869 commands that actually issued a ServoJ. *the 45.3 ms CNDE-read max is the single startup outlier. Bars show p99 relative to the largest hop.

5 · Verdict on the lock-contention hypothesis

NOT CONTENDED  The follower's _sdk_lock is taken between the command_pre_lock and command_lock_acquired stamps in p50 0.001 ms / p99 0.004 ms / max 0.01 ms across 869 ServoJ writes. There is effectively no queue.

Why the original intuition was reasonable but wrong: the apply hop (command_pre_hardware_write → command_applied) is visibly variable (p99 ≈ 5 ms), so it looks like the apply method stalls. Splitting it shows the variance lives in (a) safety.validate_command (numpy bound-checks, base.py:294) and (b) the ServoJ XML-RPC round-trip (fairino.py:843) — never in lock acquisition. On the follower, the only other _sdk_lock taker on the hot path would be the 50 Hz annotate() pass, but on a follower that path reads cached/packet data and does not take the lock (_poll_controller_drag early-returns; wrench/guidance read the CNDE packet). So there is simply nothing to contend with.

6 · Root causes, ranked

① Callback scheduling latency (primary) — p99 6.5 ms, PDV max 8.3 ms

The gap between sysstate.pre_callback (system_state.py:69, just before the state-change callbacks fire) and orch.cb_enter (orchestrator_node.py:362, the teleop conversion callback actually starting). The callback is invoked directly (system_state.py:71-73), so this gap is not code work — it is the time the leader's update waits for the Python interpreter to schedule the work, i.e. GIL hand-off / thread contention between the leader publish thread, the follower publish thread, the orchestrator, and the command-apply path all competing in one process. It is both the largest median hop and the largest jitter contributor.

② ServoJ XML-RPC round-trip (secondary) — p99 4.4 ms

The blocking interface.ServoJ(...) call (fairino.py:843) to the controller. Median ~1 ms but tails to ~9 ms; this is genuine I/O variance to the (simulated) controller and is expected to differ on real hardware.

③ Per-command safety validation (secondary) — p99 3.6 ms

safety.validate_command runs on every setpoint inside base.set_state (base.py:294) before the write. Its numpy bound/speed checks add ~0.6 ms typical, ~8 ms worst-case, and run on the same contended thread.

④ Leader→follower conversion spikes — p99 2.7 ms, max 6.9 ms

cb_enter → cb_converted (orchestrator_node.py:362→368): usually ~60 µs, but occasional multi-ms spikes, again consistent with GIL pauses rather than steady cost.

7 · Side finding — 71% of setpoints never reach the arm

Of 3021 teleop commands, only 869 (29%) issued a ServoJ; the remaining 2152 (71%) returned from set_state in ~8 µs without a hardware write. That path is taken when safety.validate_command returns not ok (base.py:294-296) — the setpoint is dropped, no e-stop. Dropping two of every three frames will read as motion stutter on the follower independent of the latency numbers above. The drop reason (speed/limit/workarea clamp, or duplicate frames) was not isolated here and is the most valuable next investigation. (This is measured behaviour, not yet a confirmed defect.)

8 · Consequence — the jitter makes the CI safety gate flaky

The follower's eef_speed / joint_space_speed guards estimate speed as Δposition/Δtime from the incoming state stream. When a state frame is delayed by the scheduling jitter above, the next frame lands a large Δposition over a small Δt, so the estimated speed spikes far above the real commanded speed and trips an e-stop on a single over-threshold sample (1/1). The identical runs on the faster local host completed 120 laps with no trip.

The CI evidence is unambiguous — and proves the trip is jitter, not motion:

commanded velCI resultguard trippedreported speed
70tripjoint_space_speedjoint 1 = 114.9 °/s > 70
40tripeef_speed1.099 m/s > 1.0
40 (retry)tripjoint_space_speedjoint 0 = 81.9 °/s > 70
20tripeef_speed3.049 m/s > 1.0

Halving the velocity from 40 to 20 made the reported speed worse (1.099 → 3.049 m/s) and the guard varies run-to-run — both impossible for a real, velocity-scaled motion. The trip is a delayed-frame artifact. No commanded velocity reliably passes on the stock runner, so a blocking safe-check gate would fail every PR. Until a mitigation lands, the CI step is therefore configured non-blocking (informational); see §9.

9 · Recommendations

10 · Reproduce

# fresh sims, then a 60 s traced teleop run with the per-hop breakdown docker rm -f fairino-fr3 fairino-fr20 uv run scripts/profile_teleop_jitter.py --duration-s 60 # render the standard interactive profiler HTML from the raw trace uv run python tools/profiler.py --last 1

Instrumentation is gated by ROBOT_TRACE=1 (set automatically by the profiler script) and is a no-op in production. The command_pre_lock / command_lock_acquired stamps used for the lock analysis live in fairino.py:_set_state_impl.

Generated for the fairino-sim-ci-and-jitter branch · data: 3021 teleop commands, 60 s, FR3→FR20 SimMachine pair · all latencies in milliseconds.