For a UR robotic arm on a MIR600 mobile base. Leader → follower over USB / Ethernet / Wi-Fi. Goal: a stack that is reliable, safe, and interpretable, with a clear path from in-process sim to real Wi-Fi.
- After #22 / #29, the follower is event-driven via bus.subscribe() (callbacks fire synchronously on the publisher's thread), and an EMA smoother + mode blender already sit in front of every command. The "jitter" you see in the wild is almost certainly not coming from the local executor anymore.
- The likely culprits are: (a) the ZMQ remote path is PUB/SUB over TCP, and Linux TCP_RTO_MIN = 200 ms means a single drop pauses the stream; (b) default DDS / multicast discovery over Wi-Fi (the team has already written servo7/dds/start_discovery_server.sh as a workaround, the smoking gun); (c) with the new inline command path, a slow hardware write blocks the whole pipeline — but the tracer makes that immediately visible.
- Primary recommendation: Zenoh with BEST_EFFORT / KEEP_LAST=1 / VOLATILE QoS for joint streams, on a dedicated 5/6 GHz Wi-Fi 6 AP with DSCP-tagged egress. Fallback: stay on FastDDS using the existing Discovery Server, but flip the QoS profile first — that alone likely removes 60-80% of jitter.
- The injection seam already exists: the PubSub Protocol in servo7/bus/pubsub.py with publish / get_latest / subscribe. Implementing a ZenohBus or FastDdsBus conforming to it is ~80-100 lines per backend and zero changes to nodes.
- On measurement: (1) harden the sinusoid plan into a proper metrics harness; (2) tests/test_latency.py predates #22 / #29 / #32 and almost certainly doesn't exercise the real path; (3) add tc netem impairment injection in CI to validate the harness can detect what we claim it detects.
- Bring up UR / MIR600 by extending the Robot interface as R1RosRobot does and use whichever bus we pick.
After #22 "moved from polling to event driven", #29 "no more tick loops", and
#32 "Inline command path + LeRobot temporal ensembling + profiler", the teleop pipeline
is fully event-driven. There is no command queue, no follower poll thread, no
orchestrator tick loop in the teleop hot path.
InProcessBus at servo7/bus/in_process_bus.py now exposes publish, get_latest, and subscribe(topic, callback). Subscribers are invoked synchronously, but a snapshot of the subscriber list is taken under the lock so callbacks don't block other publishers and re-entrant publishes don't deadlock. Exceptions are swallowed and logged so one bad subscriber can't take down the bus.
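A minimal sketch of that snapshot-under-lock fan-out, for reference (illustrative class name, not the actual file contents):

```python
# Illustrative sketch of the publish/subscribe semantics described above; not the real
# servo7/bus/in_process_bus.py, just the pattern it implements.
import logging
import threading
from typing import Callable

class InProcessBusSketch:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: dict[str, bytes] = {}
        self._subs: dict[str, list[Callable[[bytes], None]]] = {}

    def publish(self, topic: str, payload: bytes) -> None:
        with self._lock:
            self._latest[topic] = payload
            callbacks = list(self._subs.get(topic, ()))   # snapshot taken under the lock
        for cb in callbacks:                              # callbacks run outside the lock, so
            try:                                          # re-entrant publishes can't deadlock
                cb(payload)
            except Exception:                             # one bad subscriber can't take down the bus
                logging.exception("subscriber failed on %s", topic)

    def get_latest(self, topic: str) -> bytes | None:
        with self._lock:
            return self._latest.get(topic)

    def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None:
        with self._lock:
            self._subs.setdefault(topic, []).append(callback)
```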
OrchestratorNode._on_leader_state_change publishes directly to action.follower; RobotNode._on_command_received writes to hardware inline. Zero queue round-trips, zero polling. A slow hardware write becomes immediately visible as a long cmd_pre_hw_write → cmd_applied hop in the tracer.
An EMA smoother runs on every command (α = 0.1 for joints, α = 0.5 for the gripper). servo7/control/mode_blender.py ramps over 0.75 s on mode transitions so teleop↔AI handoffs don't snap. The EMA is reset on every set_control_mode().
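A sketch of the smoothing step with the α values quoted above (class and method names are illustrative; the real module also feeds the mode blender):

```python
# Illustrative EMA smoother: α = 0.1 for joints, α = 0.5 for the gripper, reset on mode change.
import numpy as np

class EmaSmoother:
    def __init__(self, alpha_joints: float = 0.1, alpha_gripper: float = 0.5) -> None:
        self.alpha_joints = alpha_joints
        self.alpha_gripper = alpha_gripper
        self._joints: np.ndarray | None = None
        self._gripper: float | None = None

    def reset(self) -> None:
        # Called on every set_control_mode() so the filter restarts from the next target.
        self._joints = None
        self._gripper = None

    def smooth(self, joints: np.ndarray, gripper: float) -> tuple[np.ndarray, float]:
        if self._joints is None:
            self._joints, self._gripper = joints.copy(), gripper
        else:
            a, b = self.alpha_joints, self.alpha_gripper
            self._joints = a * joints + (1.0 - a) * self._joints
            self._gripper = b * gripper + (1.0 - b) * self._gripper
        return self._joints, self._gripper
```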
The tracer (servo7/utils/trace_logger.py) follows each RobotState as it travels through the pipeline and flushes one JSONL record per completed trace on a daemon thread (non-blocking). Activated by ROBOT_TRACE=1 or --profile. A live in-memory deque (100k traces, ~7 min @ 240/s) feeds robot-frontend/src/components/ProfilerPanel.jsx; offline analysis via tools/profiler.py renders a self-contained HTML report.
OrchestratorNode.ai_inference_worker runs on a dedicated sync thread at command_hz (default 30 Hz). No event-loop jitter on AI ticks. Same EMA/blender pipeline as teleop.
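The pacing pattern such a dedicated thread can use looks roughly like this (illustrative names; the actual worker lives in OrchestratorNode and also runs the EMA/blender step):

```python
# Sketch of a fixed-rate worker loop: absolute-time pacing avoids drift, and a resync on
# overrun avoids bursting to "catch up" after a slow inference step.
import threading
import time
from typing import Callable

def run_at_fixed_rate(step: Callable[[], None], hz: float, stop: threading.Event) -> None:
    period = 1.0 / hz
    next_tick = time.perf_counter()
    while not stop.is_set():
        step()                                   # e.g. infer, EMA, blend, bus.publish(...)
        next_tick += period
        remaining = next_tick - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        else:
            next_tick = time.perf_counter()      # fell behind: resync instead of bursting
```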
Each scripts/main*.py constructs bus: PubSub = InProcessBus() and passes it to every node. Swapping to a network bus is a one-line change per script.
The existing remote path runs zmq.SUB / zmq.PUB over tcp://.
Linux's TCP_RTO_MIN is hard-coded to 200 ms regardless of measured RTT. A single dropped packet
on the wire stalls every subsequent message at the receiver for at least 200 ms; at 100 Hz that is ~20
stale joint frames piled up behind the gap. On Wi-Fi where micro-bursts of loss are routine, this is
the single most likely cause of the jitter you see across machines. This is not solved by the
in-process improvements on main.
Default DDS discovery multicasts to 239.255.0.1. Most enterprise APs convert
multicast to broadcast at the lowest mandatory rate (1 Mbps on 2.4 GHz), rate-limit it, or drop it
under load. The fact that servo7/dds/start_discovery_server.sh
exists with a hard-coded laptop IP 192.168.1.85:11811 means the team has already
hit this and worked around it once. Discovery Server is the FastDDS-specific fix; Zenoh sidesteps
the problem entirely by not using multicast for discovery.
Because InProcessBus.publish() fans out synchronously and _on_command_received
calls robot.set_state() inline, a slow USB/CAN/RTDE write on the follower stalls the leader's
state_publisher thread for the duration of the write. This is the right trade-off
(fewer queues, lower latency, deterministic) but it means we need to keep an eye on the
cmd_pre_hw_write → cmd_applied hop in the tracer. Move the slow side off the hot path
only if profiler data shows it.
There is no sequence numbering anywhere: InProcessBus overwrites the latest message by topic; the ZMQ path embeds no counter; the
trace envelope carries hop timestamps but not a monotonic per-(topic, source) sequence number.
We can't today distinguish "two adjacent samples were dropped" from "two were reordered" from "the
sender hiccupped." This is a 24-byte fix (§4.2).
tests/test_latency.py is stale. It still references methods removed by #22 / #29 / #32: follower.command_listener(),
follower.command_executer(), orchestrator.control_loop(),
orchestrator.publisher_worker(). It almost certainly is not exercising the current pipeline,
which means our latency CI is currently a placebo. Repair before extending.
RobotType covers Piper / SO100 / R1 / R1T / R1LiteSim / Quest / Dummy. UR and MIR are not in the enum or factory. UR20 appears in sim only (servo7/sim/trajectory_demo.py, Drake/MuJoCo). Adding URRobot means extending the base_ros.py pattern and bridging the official ur_robot_driver ROS 2 package (or RTDE directly).

There is no network-capable PubSub implementation yet. Only InProcessBus is wired in. RemoteRobotClient/RemoteRobotServer exist as a side-path through the Robot interface, not as a bus.

The intuition behind the sinusoid plan is right: drive a known signal through the leader, record what arrives at the follower, and measure phase shift = latency. The plan needs to be hardened so it can also catch the things you actually care about — packet loss, reordering, and tail latency.
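For the phase-shift part specifically, a cross-correlation estimate is enough (sketch below, assuming both signals are resampled onto a common uniform clock); the metrics table that follows covers everything the single phase number cannot see.

```python
# Sketch: latency as the lag that maximizes cross-correlation between the commanded
# sinusoid and the signal recorded at the follower. Illustrative only.
import numpy as np

def phase_shift_latency(sent: np.ndarray, received: np.ndarray, sample_hz: float) -> float:
    sent = sent - sent.mean()                      # remove DC so the correlation peaks cleanly
    received = received - received.mean()
    corr = np.correlate(received, sent, mode="full")
    lag_samples = int(corr.argmax()) - (len(sent) - 1)
    return lag_samples / sample_hz                 # seconds the follower lags the leader
```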
| Metric | Definition | What it surfaces |
|---|---|---|
| Latency p50 / p95 / p99 / max | End-to-end one-way delay distribution | Mean is useless for safety. Max and p99 are what trip protective stops. |
| PDV (jitter, RFC 3393) | Variation in one-way delay between packet pairs | True network jitter. Distinguish from inter-arrival variation, which conflates sender + network. |
| Packet loss rate | Drops / sent | For BEST_EFFORT: real loss. For RELIABLE: shows up as latency spikes from retransmit. |
| Reorder rate | Count where seq < max-seen-seq | The thing your sinusoid will silently miss. |
| Age-of-Information (AoI) | now − timestamp_of_last_received_sample | Single freshness number; correlates with closed-loop control quality better than raw latency. |
| Deadline-miss rate | Periods with no fresh command in time | The metric the safety case actually rests on. |
Run all of these on every link. Keep the sinusoid as a sanity check / phase-shift demo, not as the headline test.
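As a reference for how the table maps onto the (seq, send_ts, recv_ts) stream the §4.2 envelope provides, a sketch of the offline summary (field layout and the gap-based deadline approximation are assumptions, not existing code):

```python
# Offline metric summary over received samples (seq, send_ts, recv_ts); assumes the two
# hosts are PTP-synced so recv_ts - send_ts is a valid one-way delay.
import numpy as np

def summarize(samples: list[tuple[int, float, float]], deadline_s: float) -> dict[str, float]:
    seqs = np.array([s[0] for s in samples])
    send_ts = np.array([s[1] for s in samples])
    recv_ts = np.array([s[2] for s in samples])
    delay = recv_ts - send_ts                        # one-way delay per sample
    pdv = np.abs(np.diff(delay))                     # RFC 3393 delay variation, consecutive pairs
    expected = int(seqs.max() - seqs.min()) + 1
    peak_aoi = recv_ts[1:] - send_ts[:-1]            # age of last-received sample just before each arrival
    return {
        "latency_p50": float(np.percentile(delay, 50)),
        "latency_p99": float(np.percentile(delay, 99)),
        "latency_max": float(delay.max()),
        "pdv_p99": float(np.percentile(pdv, 99)),
        "loss_rate": 1.0 - len(seqs) / expected,
        "reorder_count": float(np.sum(np.diff(seqs) < 0)),
        "aoi_peak_max": float(peak_aoi.max()),
        # Approximation: a deadline miss is a receive gap longer than the allowed deadline.
        "deadline_miss_rate": float(np.mean(np.diff(recv_ts) > deadline_s)),
    }
```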
Log how late every deadline miss was (arrival_time − deadline).

Run identical workloads at every rung; the diff between rungs tells you where the time goes.
| # | Rung | Adds | Expected p99 | Red flag |
|---|---|---|---|---|
| 1 | In-process (single Python process, current default) | Inline callbacks + serializer + GIL | tens of µs | p99 > 1 ms ⇒ a callback is doing real work; check tracer hop times |
| 2 | Loopback localhost (UDP/DDS via 127.0.0.1) | Kernel UDP + DDS shm/UDP | ~100-300 µs | max > 5 ms or any loss ⇒ CPU contention or DDS QoS misconfig |
| 3 | Direct point-to-point Ethernet between two boxes | NIC + cable + IRQ handling | +100-500 µs over rung 2 | PDV > few hundred µs, or any reordering (a direct cable cannot reorder — host stack issue) |
| 4 | Switched Ethernet | Switch buffering | small bump over rung 3 | PDV grows with cross-traffic, or loss under iperf3 background ⇒ HoL blocking / cheap switch |
| 5 | Wi-Fi (5/6 GHz, dedicated AP) | RF contention, retries, roaming | 5-20× rung 4, bursty | max > 100 ms, AoI spikes correlated with airtime contention. This is where MIR600 lives. |
- Time sync: linuxptp (ptp4l + phc2sys) on both ends. Verify before each run with pmc.
- CPU governor: cpupower frequency-set -g performance on every machine. Frequency scaling alone moves p99 by milliseconds.
- Core isolation: isolcpus=, nohz_full=, rcu_nocbs= on the kernel cmdline; pin the teleop process and the NIC IRQ to isolated cores. Validate with cyclictest -p99 -t -m -D 1h; max should be <~50 µs on a tuned PREEMPT_RT system.
- Wi-Fi hygiene: disable power save (iw dev wlan0 set power_save off); document channel and RSSI; mask background sync services.
- tc netem — Linux traffic control impairment injection. If your harness can't see the impairment injected by tc qdisc add dev eth0 root netem loss 1%, your harness is broken before you start. Use it to sweep impairment levels and find the controller's failure point.
- iperf3 — background load for the contention rungs.
- tshark — packet captures for ground truth on the wire.
- performance_test — Apex.AI's middleware benchmarking tool.
- cyclictest — scheduler latency on the tuned hosts.

| Protocol | Verdict for hot path | Reason |
|---|---|---|
| TCP | Wrong | Linux TCP_RTO_MIN = 200 ms hard-coded. One drop ⇒ stream stalls 200 ms while every later packet waits for retransmit. At 500 Hz that's ~100 stale joint frames queued. Wi-Fi micro-bursts of loss are routine ⇒ this is the textbook teleop jitter cause. Your current ZMQ remote path is exactly this. |
| UDP | Right pattern | "Publish a fully-self-describing snapshot every tick, never retransmit." A dropped packet is replaced by the next one ~1-10 ms later — below human-perceptible. Handle ordering with a sequence number, drop OOO frames at the receiver. |
| QUIC | Skip for now | Multiplexed streams over UDP fix per-stream HoL across streams, but each stream is still reliable+ordered (= TCP-like). RFC 9221 DATAGRAM frames give you raw datagrams alongside, but Python tooling is immature. Not worth introducing here. |
| RT-Ethernet (EtherCAT, PROFINET, TSN) | N/A here | Sub-ms determinism but requires kernel bypass, dedicated NICs, layer-2 path with no Wi-Fi. It's how the UR controller talks to its joints internally — not your leader↔follower link. |
| Middleware | Role | Notes for our case |
|---|---|---|
| FastDDS / Cyclone DDS (fallback) | ROS 2 default; UR driver + MIR600 are DDS-native | Right QoS for 500 Hz joint stream: RELIABILITY=BEST_EFFORT, HISTORY=KEEP_LAST depth 1, DURABILITY=VOLATILE, explicit DEADLINE = control period. RELIABLE + deep history is the classic teleop footgun. Multicast discovery is the Wi-Fi pain point — use the Discovery Server you've already scripted. |
| Zenoh / rmw_zenoh (primary) | Eclipse project, pluggable transports | Promoted to Tier 1 in ROS 2 Kilted (May 2025); one env-var swap from FastDDS. Published evidence (arXiv 2309.07496): Cyclone wins on Ethernet, Zenoh wins on Wi-Fi and 4G with smallest trajectory drift. Doesn't depend on UDP multicast for discovery ⇒ skips the AP-broken-multicast disaster entirely. |
| ZeroMQ | Brokerless, simple, great Python | PUB/SUB with CONFLATE=1 mirrors today's "latest-wins" semantics exactly (sketch after this table). No QoS framework. If you want to keep the existing ZMQ code path, switch the transport from tcp:// to pgm:///epgm:// multicast, or to the draft radio/dish sockets over udp:// — that alone removes the 200 ms RTO problem. You lose graph introspection and free interop with the UR driver topics. |
| iceoryx2 | Rust rewrite, true zero-copy shm | Alpha for ROS 2 Rolling. Same-host only; no cross-host gateway yet. Useful if leader and follower ever run on the same box, irrelevant for the network leg. |
| WebRTC data channels | SCTP-over-DTLS-over-UDP | Each channel can be ordered=false, maxRetransmits=0. Real option for control over arbitrary networks (NAT/STUN/ICE built-in). Same-LAN, the encryption + congestion control are cost without benefit. Reasonable only if the same stack also has to carry video later. |
| gRPC | HTTP/2 over TCP | Same 200 ms RTO problem as raw TCP. Fine for config/RPC ("home robot", "set TCP", "estop"); wrong for joint streams. |
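For reference, the latest-wins receive side the ZeroMQ row points at is just two socket options (a sketch; the endpoint is a placeholder, and note that CONFLATE does not support multipart messages):

```python
# Sketch: latest-wins subscriber with ZeroMQ. CONFLATE keeps only the newest queued
# message, mirroring the bus's get_latest() semantics; set options before connect().
import zmq

ctx = zmq.Context.instance()
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.CONFLATE, 1)            # queue depth of one: stale frames are overwritten
sub.setsockopt(zmq.SUBSCRIBE, b"")         # subscribe to everything on this endpoint
sub.connect("tcp://leader.local:5555")     # placeholder endpoint

while True:
    frame = sub.recv()                     # always the most recent frame available
    # ... hand the frame to the consumer callback ...
```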
| Claim source | Reality |
|---|---|
| 5G teleop demos (Ericsson, Vodafone, HoloLight) | Marketing. The 40 ms is media-plane RTT in a clean lab and excludes any control/safety stack. Ignore. |
| Foxglove / Lichtblick | Visualization, not transport. Will not improve latency; can mask jitter in plots if you don't pin timestamps. |
| Nimble / Sanctuary / Reflex | Proprietary, hardware-locked. Not buyable as a stack. |
| UR RTDE | Real and useful. 500 Hz on e-Series, TCP on port 30004, dedicated short link ⇒ 200 ms RTO almost never fires. Use RTDE for the UR side; the ROS 2 driver wraps it. |
| URsim + ROS 2 (quiet local stack) | ~5-15 ms leader-input → follower-state. The cross-Wi-Fi hop is what blows the budget. |
| ros2_control + PREEMPT_RT | Real, free, well-understood. Removes scheduler jitter (sub-100 µs vs 1-10 ms on stock kernel). Does nothing for network jitter — layer it under whichever middleware you pick. |
Honest verdict: <40 ms end-to-end is achievable on quiet wired Ethernet and unreliable on shared Wi-Fi, regardless of whose slide deck claims it.
The codebase already has the right seam. Don't refactor; plug into it.
# servo7/bus/pubsub.py — already on main
from typing import Callable, Protocol
class PubSub(Protocol):
def publish(self, topic: str, payload: bytes) -> None: ...
def get_latest(self, topic: str) -> bytes | None: ...
def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None: ...
Three methods. Every node (RobotNode, OrchestratorNode, SystemStateManager)
takes a bus: PubSub in its constructor and never references a concrete implementation. Each
scripts/main*.py has exactly one construction site:
bus: PubSub = InProcessBus()
ZenohBus: holds a session from zenoh.open(). publish() calls session.put(topic, payload). subscribe() registers a Zenoh subscriber whose callback invokes the user's callback synchronously (matches InProcessBus semantics — the user's callback pays the cost). get_latest() reads from a per-topic latched cache populated by the subscriber. Estimate: ~80 lines.
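A sketch of what that backend could look like (hedged: the exact zenoh-python payload accessors differ between releases, so verify against the pinned version):

```python
# servo7/bus/zenoh_bus.py: sketch of a PubSub-conforming Zenoh backend, not a tested
# implementation; API names are from zenoh-python, payload decoding may need adjusting.
import threading
from typing import Callable

import zenoh

class ZenohBus:
    def __init__(self, config: zenoh.Config | None = None) -> None:
        self._session = zenoh.open(config or zenoh.Config())
        self._latest: dict[str, bytes] = {}
        self._lock = threading.Lock()
        self._subscribers = []                       # keep handles alive so they stay declared

    def publish(self, topic: str, payload: bytes) -> None:
        self._session.put(topic, payload)

    def get_latest(self, topic: str) -> bytes | None:
        with self._lock:
            return self._latest.get(topic)

    def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None:
        def _on_sample(sample) -> None:
            data = bytes(sample.payload)             # assumption: payload converts to bytes
            with self._lock:
                self._latest[topic] = data
            callback(data)                           # synchronous, mirrors InProcessBus semantics

        self._subscribers.append(self._session.declare_subscriber(topic, _on_sample))
```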
FastDdsBus: built on rclpy with QoS BEST_EFFORT / KEEP_LAST=1 / VOLATILE / DEADLINE using the existing Discovery Server XML. subscribe() attaches a DDS DataReader listener; get_latest() snapshots the last sample. Estimate: ~100 lines.
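The joint-stream QoS that backend should carry, written out as an rclpy sketch; the 2 ms deadline assumes the 500 Hz stream and should track whichever control period is actually configured:

```python
# Sketch of the joint-stream QoS profile from the middleware table, expressed with rclpy.
from rclpy.duration import Duration
from rclpy.qos import DurabilityPolicy, HistoryPolicy, QoSProfile, ReliabilityPolicy

JOINT_STREAM_QOS = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,   # never retransmit; the next sample replaces a lost one
    history=HistoryPolicy.KEEP_LAST,
    depth=1,                                     # latest-wins, matching the bus's get_latest semantics
    durability=DurabilityPolicy.VOLATILE,        # no replay for late joiners
    deadline=Duration(seconds=0.002),            # one 500 Hz control period; missed deadlines are observable
)
```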
An impairment variant: the same backends run through tc netem on a loopback interface. Used to validate that the harness can detect what we claim it detects.
Today's payload is a JSON dict produced by RobotState.to_dict(). Wrap it in a 24-byte
binary header so the consumer can compute drop / reorder / PDV without parsing JSON:
# servo7/bus/envelope.py (new, ~30 LOC)
import struct, time
HEADER_FMT = "<QdQ" # seq (u64), send_ts (f64 perf_counter), source_id (u64 hashed)
HEADER_LEN = struct.calcsize(HEADER_FMT)
def wrap(seq: int, source_id: int, payload: bytes) -> bytes:
return struct.pack(HEADER_FMT, seq, time.perf_counter(), source_id) + payload
def unwrap(b: bytes) -> tuple[int, float, int, bytes]:
seq, ts, src = struct.unpack(HEADER_FMT, b[:HEADER_LEN])
return seq, ts, src, b[HEADER_LEN:]
Wrap inside RobotNode.state_publisher (per-(topic, source) monotonic seq), unwrap inside
SystemStateManager._on_state_update and feed (seq, send_ts, recv_ts) into the
existing tracer as a structured event alongside the hop list.
The tracer at servo7/utils/trace_logger.py is good enough; add three columns derived from the new envelope:
- Track max_seen_seq per (topic, source).
- When a new packet arrives with seq > max_seen + 1, emit a "kind": "loss" event with the gap size.
- When seq < max_seen, emit a "kind": "reorder" event with the seq delta.
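A sketch of the counter logic feeding those events (the emit hook and event fields are placeholders for however the tracer ingests structured events):

```python
# Per-(topic, source) sequence monitor emitting "loss" / "reorder" events as described above.
from typing import Callable

class SeqMonitor:
    def __init__(self, emit: Callable[[dict], None]) -> None:
        self._emit = emit
        self._max_seen: dict[tuple[str, int], int] = {}

    def observe(self, topic: str, source_id: int, seq: int) -> None:
        key = (topic, source_id)
        last = self._max_seen.get(key)
        if last is not None and seq > last + 1:
            self._emit({"kind": "loss", "topic": topic, "source": source_id, "gap": seq - last - 1})
        elif last is not None and seq < last:
            self._emit({"kind": "reorder", "topic": topic, "source": source_id, "delta": last - seq})
        if last is None or seq > last:
            self._max_seen[key] = seq
```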
Surface these in robot-frontend/src/components/ProfilerPanel.jsx's existing
summary block (where p50/p95/p99/max already live) and in the offline HTML produced by
tools/profiler.py. The frontend filters by (origin_stage, terminal_stage)
already; we just add three counters per filter.
tests/test_latency.py
The existing test references methods that no longer exist on main: follower.command_listener(),
follower.command_executer(), orchestrator.control_loop(),
orchestrator.publisher_worker(). Rewrite it to drive the event-driven path directly:
- InProcessBus + a leader DummyRobot with random-walk state + a follower DummyRobot.
- Run leader_node.state_publisher as a thread.
- SystemStateManager + OrchestratorNode as on main; subscribe a probe to action.follower.
- Run with ROBOT_TRACE=1.
- Fail on any cmd_pre_hw_write → cmd_applied hop > 5 ms.

This is the in-process noise floor (rung 1). Every other rung re-runs this same test with a different bus.
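A sketch of the probe-and-assert shape for rung 1, using only the bus surface and the §4.2 envelope (the real rewrite wires up the nodes listed above and asserts on the tracer's cmd_pre_hw_write → cmd_applied hop rather than the raw bus hop):

```python
# Rung-1 shape for tests/test_latency.py (illustrative); node construction is elided and
# the stand-in publisher replaces the leader DummyRobot's random-walk state_publisher.
import time

import numpy as np

from servo7.bus.envelope import unwrap, wrap          # §4.2 (new)
from servo7.bus.in_process_bus import InProcessBus

def test_in_process_noise_floor():
    bus = InProcessBus()
    latencies: list[float] = []

    def probe(raw: bytes) -> None:
        _seq, send_ts, _src, _payload = unwrap(raw)
        latencies.append(time.perf_counter() - send_ts)

    bus.subscribe("action.follower", probe)

    q = np.zeros(6)
    for seq in range(2000):                            # random-walk command stream
        q = q + np.random.normal(scale=1e-3, size=6)
        bus.publish("action.follower", wrap(seq, 1, q.tobytes()))

    p99 = float(np.percentile(latencies, 99))
    assert p99 < 5e-3, f"in-process p99 {p99 * 1e3:.2f} ms exceeds the 5 ms budget"
```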
Two new robots, both extending the existing ROS robot pattern (R1RosRobot already does this in
servo7/robot/r1.py):
- URRobot(Robot) — wraps the official ur_robot_driver ROS 2 package. get_state() reads /joint_states; set_state() publishes to /scaled_joint_trajectory_controller, or — for true low-latency teleop — uses RTDE directly via the ur_rtde Python bindings at 500 Hz. Run URsim in a Docker container for sim-only testing as documented in the UR ROS 2 docs.
- MIR600Robot(Robot) — wraps the MIR REST API for high-level nav and the ROS 2 bridge for low-level state. The MIR is the carrier; the UR arm sits on top. Both expose state into the same bus.
- Add UR and MIR600 to RobotType in servo7/robot/robot_types.py and to the RobotFactory in servo7/robot/factory.py.

Neither of these touches the bus/network/QoS work — they're orthogonal and can land in any order.
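For the RTDE leg of URRobot, the hot-path calls are roughly these (a sketch against the ur_rtde Python bindings; servo parameters shown are illustrative starting points, not tuned values):

```python
# Sketch of a 500 Hz RTDE link for URRobot; not the full Robot implementation.
import numpy as np
from rtde_control import RTDEControlInterface
from rtde_receive import RTDEReceiveInterface

class URRtdeLink:
    def __init__(self, robot_ip: str) -> None:
        self.rtde_c = RTDEControlInterface(robot_ip)
        self.rtde_r = RTDEReceiveInterface(robot_ip)

    def get_joints(self) -> np.ndarray:
        return np.array(self.rtde_r.getActualQ())          # six joint positions, radians

    def set_joints(self, q: np.ndarray) -> None:
        # servoJ(q, speed, accel, time, lookahead_time, gain): stream one 2 ms setpoint.
        self.rtde_c.servoJ(list(q), 0.5, 0.5, 0.002, 0.1, 300)

    def close(self) -> None:
        self.rtde_c.servoStop()
        self.rtde_c.stopScript()
```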
1. Add tc netem impairment injection in CI. If injecting 1% loss / 5 ms jitter / 1% reorder doesn't show up in the new metrics, the harness is wrong. Fix it before measuring anything else.
2. Run the measurement ladder with InProcessBus + ZMQ remote. Diff the metrics. This alone tells you whether jitter is host-side, switch-side, or Wi-Fi-side — without changing any code beyond instrumentation.
3. Write ROS 2 QoS profiles with BEST_EFFORT / KEEP_LAST=1 / VOLATILE / DEADLINE=2×period. Build a FastDdsBus that uses them. Re-run the ladder. Likely removes the bulk of the jitter without any new dependency. (Decision gate — see below.)
4. Implement ZenohBus. ~80 LOC. Wire via a CLI flag in scripts/main_remote.py. Re-run the ladder. Pick whichever wins on the Wi-Fi rung.
5. Host tuning: linuxptp, performance governor, isolated cores for the teleop process and NIC IRQ. Validate with cyclictest. Re-run the ladder. The numbers from §3 only mean something on a tuned host.
6. Add the robots: URRobot via ur_robot_driver + RTDE, MIR600Robot via REST + ROS 2. Both consume the chosen bus. The teleop work above is what makes this safe.

Decision gates:
- If FastDdsBus already meets the budget at every rung, ship. Don't add Zenoh for its own sake. (We expect this likely doesn't hold on the Wi-Fi rung — but measure first.)
- If ZenohBus wins on the Wi-Fi rung by a clear margin (the published evidence says it should), promote it to default. Otherwise stay on tuned FastDDS — fewer moving parts.
Caveats and pointers:
- Envelope timestamps use perf_counter, which is per-host monotonic — fine within a host, useless across hosts without sync.
- Revisit the EMA α (or moving to a Kalman / one-euro filter) once the network jitter is quantified.
- tc-netem(8) man page
- cyclictest
- PubSub Protocol, now with subscribe() (the injection seam)
- _on_command_received
- _on_leader_state_change + sync ai_inference_worker + EMA + mode blender