For a UR robotic arm on a MIR600 mobile base. Leader → follower over USB / Ethernet / Wi-Fi. Goal: a stack that is reliable, safe, and interpretable, with a clear path from in-process sim to real Wi-Fi.
- After #22 / #29, the follower is event-driven via bus.subscribe() (callbacks fire synchronously on the publisher's thread), and an EMA smoother + mode blender already sit in front of every command. The "jitter" you see in the wild is almost certainly not coming from the local executor anymore.
- The likely culprits are: (a) the ZMQ remote path is PUB/SUB over TCP, and Linux TCP_RTO_MIN = 200 ms means a single drop pauses the stream; (b) default DDS / multicast discovery over Wi-Fi (the team has already written servo7/dds/start_discovery_server.sh as a workaround, the smoking gun); (c) with the new inline command path, a slow hardware write blocks the whole pipeline — but the tracer makes that immediately visible.
- Primary recommendation: Zenoh with BEST_EFFORT / KEEP_LAST=1 / VOLATILE QoS for joint streams, on a dedicated 5/6 GHz Wi-Fi 6 AP with DSCP-tagged egress. Fallback: stay on FastDDS using the existing Discovery Server, but flip the QoS profile first — that alone likely removes 60-80% of jitter.
- The injection seam already exists: the PubSub Protocol in servo7/bus/pubsub.py with publish / get_latest / subscribe. Implementing a ZenohBus or FastDdsBus conforming to it is ~80-100 lines per backend and zero changes to nodes.
- On measurement: (1) harden the sinusoid plan into a proper metrics harness; (2) tests/test_latency.py predates #22 / #29 / #32 and almost certainly doesn't exercise the real path; (3) add tc netem impairment injection in CI to validate the harness can detect what we claim it detects.
- Bring up UR / MIR600 by extending the Robot interface as R1RosRobot does and use whichever bus we pick.
After #22 "moved from polling to event driven", #29 "no more tick loops", and
#32 "Inline command path + LeRobot temporal ensembling + profiler", the teleop pipeline
is fully event-driven. There is no command queue, no follower poll thread, no
orchestrator tick loop in the teleop hot path.
InProcessBus at servo7/bus/in_process_bus.py now exposes publish, get_latest, and subscribe(topic, callback). Subscribers are invoked synchronously, but a snapshot of the subscriber list is taken under the lock so callbacks don't block other publishers and re-entrant publishes don't deadlock. Exceptions are swallowed and logged so one bad subscriber can't take down the bus.
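A minimal sketch of that snapshot-under-lock fan-out, for reference (illustrative class name, not the actual file contents):

```python
# Illustrative sketch of the publish/subscribe semantics described above; not the real
# servo7/bus/in_process_bus.py, just the pattern it implements.
import logging
import threading
from typing import Callable

class InProcessBusSketch:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: dict[str, bytes] = {}
        self._subs: dict[str, list[Callable[[bytes], None]]] = {}

    def publish(self, topic: str, payload: bytes) -> None:
        with self._lock:
            self._latest[topic] = payload
            callbacks = list(self._subs.get(topic, ()))   # snapshot taken under the lock
        for cb in callbacks:                              # callbacks run outside the lock, so
            try:                                          # re-entrant publishes can't deadlock
                cb(payload)
            except Exception:                             # one bad subscriber can't take down the bus
                logging.exception("subscriber failed on %s", topic)

    def get_latest(self, topic: str) -> bytes | None:
        with self._lock:
            return self._latest.get(topic)

    def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None:
        with self._lock:
            self._subs.setdefault(topic, []).append(callback)
```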
OrchestratorNode._on_leader_state_change publishes directly to action.follower; RobotNode._on_command_received writes to hardware inline. Zero queue round-trips, zero polling. A slow hardware write becomes immediately visible as a long cmd_pre_hw_write → cmd_applied hop in the tracer.
An EMA smoother runs on every command (α = 0.1 for joints, α = 0.5 for the gripper). servo7/control/mode_blender.py ramps over 0.75 s on mode transitions so teleop↔AI handoffs don't snap. The EMA is reset on every set_control_mode().
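A sketch of the smoothing step with the α values quoted above (class and method names are illustrative; the real module also feeds the mode blender):

```python
# Illustrative EMA smoother: α = 0.1 for joints, α = 0.5 for the gripper, reset on mode change.
import numpy as np

class EmaSmoother:
    def __init__(self, alpha_joints: float = 0.1, alpha_gripper: float = 0.5) -> None:
        self.alpha_joints = alpha_joints
        self.alpha_gripper = alpha_gripper
        self._joints: np.ndarray | None = None
        self._gripper: float | None = None

    def reset(self) -> None:
        # Called on every set_control_mode() so the filter restarts from the next target.
        self._joints = None
        self._gripper = None

    def smooth(self, joints: np.ndarray, gripper: float) -> tuple[np.ndarray, float]:
        if self._joints is None:
            self._joints, self._gripper = joints.copy(), gripper
        else:
            a, b = self.alpha_joints, self.alpha_gripper
            self._joints = a * joints + (1.0 - a) * self._joints
            self._gripper = b * gripper + (1.0 - b) * self._gripper
        return self._joints, self._gripper
```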
The tracer (servo7/utils/trace_logger.py) follows each RobotState as it travels through the pipeline and flushes one JSONL record per completed trace on a daemon thread (non-blocking). Activated by ROBOT_TRACE=1 or --profile. A live in-memory deque (100k traces, ~7 min @ 240/s) feeds robot-frontend/src/components/ProfilerPanel.jsx; offline analysis via tools/profiler.py renders a self-contained HTML report.
OrchestratorNode.ai_inference_worker runs on a dedicated sync thread at command_hz (default 30 Hz). No event-loop jitter on AI ticks. Same EMA/blender pipeline as teleop.
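The pacing pattern such a dedicated thread can use looks roughly like this (illustrative names; the actual worker lives in OrchestratorNode and also runs the EMA/blender step):

```python
# Sketch of a fixed-rate worker loop: absolute-time pacing avoids drift, and a resync on
# overrun avoids bursting to "catch up" after a slow inference step.
import threading
import time
from typing import Callable

def run_at_fixed_rate(step: Callable[[], None], hz: float, stop: threading.Event) -> None:
    period = 1.0 / hz
    next_tick = time.perf_counter()
    while not stop.is_set():
        step()                                   # e.g. infer, EMA, blend, bus.publish(...)
        next_tick += period
        remaining = next_tick - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        else:
            next_tick = time.perf_counter()      # fell behind: resync instead of bursting
```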
Each scripts/main*.py constructs bus: PubSub = InProcessBus() and passes it to every node. Swapping to a network bus is a one-line change per script.
The existing remote path runs zmq.SUB / zmq.PUB over tcp://.
Linux's TCP_RTO_MIN is hard-coded to 200 ms regardless of measured RTT. A single dropped packet
on the wire stalls every subsequent message at the receiver for at least 200 ms; at 100 Hz that is ~20
stale joint frames piled up behind the gap. On Wi-Fi where micro-bursts of loss are routine, this is
the single most likely cause of the jitter you see across machines. This is not solved by the
in-process improvements on main.
Default DDS discovery multicasts to 239.255.0.1. Most enterprise APs convert
multicast to broadcast at the lowest mandatory rate (1 Mbps on 2.4 GHz), rate-limit it, or drop it
under load. The fact that servo7/dds/start_discovery_server.sh
exists with a hard-coded laptop IP 192.168.1.85:11811 means the team has already
hit this and worked around it once. Discovery Server is the FastDDS-specific fix; Zenoh sidesteps
the problem entirely by not using multicast for discovery.
Because InProcessBus.publish() fans out synchronously and _on_command_received
calls robot.set_state() inline, a slow USB/CAN/RTDE write on the follower stalls the leader's
state_publisher thread for the duration of the write. This is the right trade-off
(fewer queues, lower latency, deterministic) but it means we need to keep an eye on the
cmd_pre_hw_write → cmd_applied hop in the tracer. Move the slow side off the hot path
only if profiler data shows it.
There is no sequence numbering anywhere: InProcessBus overwrites the latest message by topic; the ZMQ path embeds no counter; the
trace envelope carries hop timestamps but not a monotonic per-(topic, source) sequence number.
We can't today distinguish "two adjacent samples were dropped" from "two were reordered" from "the
sender hiccupped." This is a 24-byte fix (§4.2).
tests/test_latency.py is stale. It still references methods removed by #22 / #29 / #32: follower.command_listener(),
follower.command_executer(), orchestrator.control_loop(),
orchestrator.publisher_worker(). It almost certainly is not exercising the current pipeline,
which means our latency CI is currently a placebo. Repair before extending.
RobotType covers Piper / SO100 / R1 / R1T / R1LiteSim / Quest / Dummy. UR and MIR are not in the enum or factory. UR20 appears in sim only (servo7/sim/trajectory_demo.py, Drake/MuJoCo). Adding URRobot means extending the base_ros.py pattern and bridging the official ur_robot_driver ROS 2 package (or RTDE directly).

There is no network-capable PubSub implementation yet. Only InProcessBus is wired in. RemoteRobotClient/RemoteRobotServer exist as a side-path through the Robot interface, not as a bus.

The intuition behind the sinusoid plan is right: drive a known signal through the leader, record what arrives at the follower, and measure phase shift = latency. The plan needs to be hardened so it can also catch the things you actually care about — packet loss, reordering, and tail latency.
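For the phase-shift part specifically, a cross-correlation estimate is enough (sketch below, assuming both signals are resampled onto a common uniform clock); the metrics table that follows covers everything the single phase number cannot see.

```python
# Sketch: latency as the lag that maximizes cross-correlation between the commanded
# sinusoid and the signal recorded at the follower. Illustrative only.
import numpy as np

def phase_shift_latency(sent: np.ndarray, received: np.ndarray, sample_hz: float) -> float:
    sent = sent - sent.mean()                      # remove DC so the correlation peaks cleanly
    received = received - received.mean()
    corr = np.correlate(received, sent, mode="full")
    lag_samples = int(corr.argmax()) - (len(sent) - 1)
    return lag_samples / sample_hz                 # seconds the follower lags the leader
```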
| Metric | Definition | What it surfaces |
|---|---|---|
| Latency p50 / p95 / p99 / max | End-to-end one-way delay distribution | Mean is useless for safety. Max and p99 are what trip protective stops. |
| PDV (jitter, RFC 3393) | Variation in one-way delay between packet pairs | True network jitter. Distinguish from inter-arrival variation, which conflates sender + network. |
| Packet loss rate | Drops / sent | For BEST_EFFORT: real loss. For RELIABLE: shows up as latency spikes from retransmit. |
| Reorder rate | Count where seq < max-seen-seq | The thing your sinusoid will silently miss. |
| Age-of-Information (AoI) | now − timestamp_of_last_received_sample | Single freshness number; correlates with closed-loop control quality better than raw latency. |
| Deadline-miss rate | Periods with no fresh command in time | The metric the safety case actually rests on. |
Run all of these on every link. Keep the sinusoid as a sanity check / phase-shift demo, not as the headline test.
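As a reference for how the table maps onto the (seq, send_ts, recv_ts) stream the §4.2 envelope provides, a sketch of the offline summary (field layout and the gap-based deadline approximation are assumptions, not existing code):

```python
# Offline metric summary over received samples (seq, send_ts, recv_ts); assumes the two
# hosts are PTP-synced so recv_ts - send_ts is a valid one-way delay.
import numpy as np

def summarize(samples: list[tuple[int, float, float]], deadline_s: float) -> dict[str, float]:
    seqs = np.array([s[0] for s in samples])
    send_ts = np.array([s[1] for s in samples])
    recv_ts = np.array([s[2] for s in samples])
    delay = recv_ts - send_ts                        # one-way delay per sample
    pdv = np.abs(np.diff(delay))                     # RFC 3393 delay variation, consecutive pairs
    expected = int(seqs.max() - seqs.min()) + 1
    peak_aoi = recv_ts[1:] - send_ts[:-1]            # age of last-received sample just before each arrival
    return {
        "latency_p50": float(np.percentile(delay, 50)),
        "latency_p99": float(np.percentile(delay, 99)),
        "latency_max": float(delay.max()),
        "pdv_p99": float(np.percentile(pdv, 99)),
        "loss_rate": 1.0 - len(seqs) / expected,
        "reorder_count": float(np.sum(np.diff(seqs) < 0)),
        "aoi_peak_max": float(peak_aoi.max()),
        # Approximation: a deadline miss is a receive gap longer than the allowed deadline.
        "deadline_miss_rate": float(np.mean(np.diff(recv_ts) > deadline_s)),
    }
```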
Log how late every deadline miss was (arrival_time − deadline).

Run identical workloads at every rung; the diff between rungs tells you where the time goes.
| # | Rung | Adds | Expected p99 | Red flag |
|---|---|---|---|---|
| 1 | In-process (single Python process, current default) | Inline callbacks + serializer + GIL | tens of µs | p99 > 1 ms ⇒ a callback is doing real work; check tracer hop times |
| 2 | Loopback localhost (UDP/DDS via 127.0.0.1) | Kernel UDP + DDS shm/UDP | ~100-300 µs | max > 5 ms or any loss ⇒ CPU contention or DDS QoS misconfig |
| 3 | Direct point-to-point Ethernet between two boxes | NIC + cable + IRQ handling | +100-500 µs over rung 2 | PDV > few hundred µs, or any reordering (a direct cable cannot reorder — host stack issue) |
| 4 | Switched Ethernet | Switch buffering | small bump over rung 3 | PDV grows with cross-traffic, or loss under iperf3 background ⇒ HoL blocking / cheap switch |
| 5 | Wi-Fi (5/6 GHz, dedicated AP) | RF contention, retries, roaming | 5-20× rung 4, bursty | max > 100 ms, AoI spikes correlated with airtime contention. This is where MIR600 lives. |
- Time sync: linuxptp (ptp4l + phc2sys) on both ends. Verify before each run with pmc.
- CPU governor: cpupower frequency-set -g performance on every machine. Frequency scaling alone moves p99 by milliseconds.
- Core isolation: isolcpus=, nohz_full=, rcu_nocbs= on the kernel cmdline; pin the teleop process and the NIC IRQ to isolated cores. Validate with cyclictest -p99 -t -m -D 1h; max should be <~50 µs on a tuned PREEMPT_RT system.
- Wi-Fi hygiene: disable power save (iw dev wlan0 set power_save off); document channel and RSSI; mask background sync services.
- tc netem — Linux traffic control impairment injection. If your harness can't see the impairment injected by tc qdisc add dev eth0 root netem loss 1%, your harness is broken before you start. Use it to sweep impairment levels and find the controller's failure point.
- iperf3 — background load for the contention rungs.
- tshark — packet captures for ground truth on the wire.
- performance_test — Apex.AI's middleware benchmarking tool.
- cyclictest — scheduler latency on the tuned hosts.

| Protocol | Verdict for hot path | Reason |
|---|---|---|
| TCP | Wrong | Linux TCP_RTO_MIN = 200 ms hard-coded. One drop ⇒ stream stalls 200 ms while every later packet waits for retransmit. At 500 Hz that's ~100 stale joint frames queued. Wi-Fi micro-bursts of loss are routine ⇒ this is the textbook teleop jitter cause. Your current ZMQ remote path is exactly this. |
| UDP | Right pattern | "Publish a fully-self-describing snapshot every tick, never retransmit." A dropped packet is replaced by the next one ~1-10 ms later — below human-perceptible. Handle ordering with a sequence number, drop OOO frames at the receiver. |
| QUIC | Skip for now | Multiplexed streams over UDP fix per-stream HoL across streams, but each stream is still reliable+ordered (= TCP-like). RFC 9221 DATAGRAM frames give you raw datagrams alongside, but Python tooling is immature. Not worth introducing here. |
| RT-Ethernet (EtherCAT, PROFINET, TSN) | N/A here | Sub-ms determinism but requires kernel bypass, dedicated NICs, layer-2 path with no Wi-Fi. It's how the UR controller talks to its joints internally — not your leader↔follower link. |
| Middleware | Role | Notes for our case |
|---|---|---|
| FastDDS / Cyclone DDS (fallback) | ROS 2 default; UR driver + MIR600 are DDS-native | Right QoS for 500 Hz joint stream: RELIABILITY=BEST_EFFORT, HISTORY=KEEP_LAST depth 1, DURABILITY=VOLATILE, explicit DEADLINE = control period. RELIABLE + deep history is the classic teleop footgun. Multicast discovery is the Wi-Fi pain point — use the Discovery Server you've already scripted. |
| Zenoh / rmw_zenoh (primary) | Eclipse project, pluggable transports | Promoted to Tier 1 in ROS 2 Kilted (May 2025); one env-var swap from FastDDS. Published evidence (arXiv 2309.07496): Cyclone wins on Ethernet, Zenoh wins on Wi-Fi and 4G with smallest trajectory drift. Doesn't depend on UDP multicast for discovery ⇒ skips the AP-broken-multicast disaster entirely. |
| ZeroMQ | Brokerless, simple, great Python | PUB/SUB with CONFLATE=1 mirrors today's "latest-wins" semantics exactly (sketch after this table). No QoS framework. If you want to keep the existing ZMQ code path, switch the transport from tcp:// to pgm:///epgm:// multicast, or to the draft radio/dish sockets over udp:// — that alone removes the 200 ms RTO problem. You lose graph introspection and free interop with the UR driver topics. |
| iceoryx2 | Rust rewrite, true zero-copy shm | Alpha for ROS 2 Rolling. Same-host only; no cross-host gateway yet. Useful if leader and follower ever run on the same box, irrelevant for the network leg. |
| WebRTC data channels | SCTP-over-DTLS-over-UDP | Each channel can be ordered=false, maxRetransmits=0. Real option for control over arbitrary networks (NAT/STUN/ICE built-in). Same-LAN, the encryption + congestion control are cost without benefit. Reasonable only if the same stack also has to carry video later. |
| gRPC | HTTP/2 over TCP | Same 200 ms RTO problem as raw TCP. Fine for config/RPC ("home robot", "set TCP", "estop"); wrong for joint streams. |
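For reference, the latest-wins receive side the ZeroMQ row points at is just two socket options (a sketch; the endpoint is a placeholder, and note that CONFLATE does not support multipart messages):

```python
# Sketch: latest-wins subscriber with ZeroMQ. CONFLATE keeps only the newest queued
# message, mirroring the bus's get_latest() semantics; set options before connect().
import zmq

ctx = zmq.Context.instance()
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.CONFLATE, 1)            # queue depth of one: stale frames are overwritten
sub.setsockopt(zmq.SUBSCRIBE, b"")         # subscribe to everything on this endpoint
sub.connect("tcp://leader.local:5555")     # placeholder endpoint

while True:
    frame = sub.recv()                     # always the most recent frame available
    # ... hand the frame to the consumer callback ...
```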
| Claim source | Reality |
|---|---|
| 5G teleop demos (Ericsson, Vodafone, HoloLight) | Marketing. The 40 ms is media-plane RTT in a clean lab and excludes any control/safety stack. Ignore. |
| Foxglove / Lichtblick | Visualization, not transport. Will not improve latency; can mask jitter in plots if you don't pin timestamps. |
| Nimble / Sanctuary / Reflex | Proprietary, hardware-locked. Not buyable as a stack. |
| UR RTDE | Real and useful. 500 Hz on e-Series, TCP on port 30004, dedicated short link ⇒ 200 ms RTO almost never fires. Use RTDE for the UR side; the ROS 2 driver wraps it. |
| URsim + ROS 2 (quiet local stack) | ~5-15 ms leader-input → follower-state. The cross-Wi-Fi hop is what blows the budget. |
| ros2_control + PREEMPT_RT | Real, free, well-understood. Removes scheduler jitter (sub-100 µs vs 1-10 ms on stock kernel). Does nothing for network jitter — layer it under whichever middleware you pick. |
Honest verdict: <40 ms end-to-end is achievable on quiet wired Ethernet and unreliable on shared Wi-Fi, regardless of whose slide deck claims it.
The codebase already has the right seam. Don't refactor; plug into it.
# servo7/bus/pubsub.py — already on main
from typing import Callable, Protocol
class PubSub(Protocol):
def publish(self, topic: str, payload: bytes) -> None: ...
def get_latest(self, topic: str) -> bytes | None: ...
def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None: ...
Three methods. Every node (RobotNode, OrchestratorNode, SystemStateManager)
takes a bus: PubSub in its constructor and never references a concrete implementation. Each
scripts/main*.py has exactly one construction site:
bus: PubSub = InProcessBus()
ZenohBus: holds a session from zenoh.open(). publish() calls session.put(topic, payload). subscribe() registers a Zenoh subscriber whose callback invokes the user's callback synchronously (matches InProcessBus semantics — the user's callback pays the cost). get_latest() reads from a per-topic latched cache populated by the subscriber. Estimate: ~80 lines.
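A sketch of what that backend could look like (hedged: the exact zenoh-python payload accessors differ between releases, so verify against the pinned version):

```python
# servo7/bus/zenoh_bus.py: sketch of a PubSub-conforming Zenoh backend, not a tested
# implementation; API names are from zenoh-python, payload decoding may need adjusting.
import threading
from typing import Callable

import zenoh

class ZenohBus:
    def __init__(self, config: zenoh.Config | None = None) -> None:
        self._session = zenoh.open(config or zenoh.Config())
        self._latest: dict[str, bytes] = {}
        self._lock = threading.Lock()
        self._subscribers = []                       # keep handles alive so they stay declared

    def publish(self, topic: str, payload: bytes) -> None:
        self._session.put(topic, payload)

    def get_latest(self, topic: str) -> bytes | None:
        with self._lock:
            return self._latest.get(topic)

    def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None:
        def _on_sample(sample) -> None:
            data = bytes(sample.payload)             # assumption: payload converts to bytes
            with self._lock:
                self._latest[topic] = data
            callback(data)                           # synchronous, mirrors InProcessBus semantics

        self._subscribers.append(self._session.declare_subscriber(topic, _on_sample))
```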
FastDdsBus: built on rclpy with QoS BEST_EFFORT / KEEP_LAST=1 / VOLATILE / DEADLINE using the existing Discovery Server XML. subscribe() attaches a DDS DataReader listener; get_latest() snapshots the last sample. Estimate: ~100 lines.
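The joint-stream QoS that backend should carry, written out as an rclpy sketch; the 2 ms deadline assumes the 500 Hz stream and should track whichever control period is actually configured:

```python
# Sketch of the joint-stream QoS profile from the middleware table, expressed with rclpy.
from rclpy.duration import Duration
from rclpy.qos import DurabilityPolicy, HistoryPolicy, QoSProfile, ReliabilityPolicy

JOINT_STREAM_QOS = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,   # never retransmit; the next sample replaces a lost one
    history=HistoryPolicy.KEEP_LAST,
    depth=1,                                     # latest-wins, matching the bus's get_latest semantics
    durability=DurabilityPolicy.VOLATILE,        # no replay for late joiners
    deadline=Duration(seconds=0.002),            # one 500 Hz control period; missed deadlines are observable
)
```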
An impairment variant: the same backends run through tc netem on a loopback interface. Used to validate that the harness can detect what we claim it detects.
Today's payload is a JSON dict produced by RobotState.to_dict(). Wrap it in a 24-byte
binary header so the consumer can compute drop / reorder / PDV without parsing JSON:
# servo7/bus/envelope.py (new, ~30 LOC)
import struct, time
HEADER_FMT = "<QdQ" # seq (u64), send_ts (f64 perf_counter), source_id (u64 hashed)
HEADER_LEN = struct.calcsize(HEADER_FMT)
def wrap(seq: int, source_id: int, payload: bytes) -> bytes:
return struct.pack(HEADER_FMT, seq, time.perf_counter(), source_id) + payload
def unwrap(b: bytes) -> tuple[int, float, int, bytes]:
seq, ts, src = struct.unpack(HEADER_FMT, b[:HEADER_LEN])
return seq, ts, src, b[HEADER_LEN:]
Wrap inside RobotNode.state_publisher (per-(topic, source) monotonic seq), unwrap inside
SystemStateManager._on_state_update and feed (seq, send_ts, recv_ts) into the
existing tracer as a structured event alongside the hop list.
The tracer at servo7/utils/trace_logger.py is good enough; add three columns derived from the new envelope:
- Track max_seen_seq per (topic, source).
- When a new packet arrives with seq > max_seen + 1, emit a "kind": "loss" event with the gap size.
- When seq < max_seen, emit a "kind": "reorder" event with the seq delta.
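A sketch of the counter logic feeding those events (the emit hook and event fields are placeholders for however the tracer ingests structured events):

```python
# Per-(topic, source) sequence monitor emitting "loss" / "reorder" events as described above.
from typing import Callable

class SeqMonitor:
    def __init__(self, emit: Callable[[dict], None]) -> None:
        self._emit = emit
        self._max_seen: dict[tuple[str, int], int] = {}

    def observe(self, topic: str, source_id: int, seq: int) -> None:
        key = (topic, source_id)
        last = self._max_seen.get(key)
        if last is not None and seq > last + 1:
            self._emit({"kind": "loss", "topic": topic, "source": source_id, "gap": seq - last - 1})
        elif last is not None and seq < last:
            self._emit({"kind": "reorder", "topic": topic, "source": source_id, "delta": last - seq})
        if last is None or seq > last:
            self._max_seen[key] = seq
```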
Surface these in robot-frontend/src/components/ProfilerPanel.jsx's existing
summary block (where p50/p95/p99/max already live) and in the offline HTML produced by
tools/profiler.py. The frontend filters by (origin_stage, terminal_stage)
already; we just add three counters per filter.
tests/test_latency.py
The existing test references methods that no longer exist on main: follower.command_listener(),
follower.command_executer(), orchestrator.control_loop(),
orchestrator.publisher_worker(). Rewrite it to drive the event-driven path directly:
- InProcessBus + a leader DummyRobot with random-walk state + a follower DummyRobot.
- Run leader_node.state_publisher as a thread.
- SystemStateManager + OrchestratorNode as on main; subscribe a probe to action.follower.
- Run with ROBOT_TRACE=1.
- Fail on any cmd_pre_hw_write → cmd_applied hop > 5 ms.

This is the in-process noise floor (rung 1). Every other rung re-runs this same test with a different bus.
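A sketch of the probe-and-assert shape for rung 1, using only the bus surface and the §4.2 envelope (the real rewrite wires up the nodes listed above and asserts on the tracer's cmd_pre_hw_write → cmd_applied hop rather than the raw bus hop):

```python
# Rung-1 shape for tests/test_latency.py (illustrative); node construction is elided and
# the stand-in publisher replaces the leader DummyRobot's random-walk state_publisher.
import time

import numpy as np

from servo7.bus.envelope import unwrap, wrap          # §4.2 (new)
from servo7.bus.in_process_bus import InProcessBus

def test_in_process_noise_floor():
    bus = InProcessBus()
    latencies: list[float] = []

    def probe(raw: bytes) -> None:
        _seq, send_ts, _src, _payload = unwrap(raw)
        latencies.append(time.perf_counter() - send_ts)

    bus.subscribe("action.follower", probe)

    q = np.zeros(6)
    for seq in range(2000):                            # random-walk command stream
        q = q + np.random.normal(scale=1e-3, size=6)
        bus.publish("action.follower", wrap(seq, 1, q.tobytes()))

    p99 = float(np.percentile(latencies, 99))
    assert p99 < 5e-3, f"in-process p99 {p99 * 1e3:.2f} ms exceeds the 5 ms budget"
```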
Two new robots, both extending the existing ROS robot pattern (R1RosRobot already does this in
servo7/robot/r1.py):
- URRobot(Robot) — wraps the official ur_robot_driver ROS 2 package. get_state() reads /joint_states; set_state() publishes to /scaled_joint_trajectory_controller, or — for true low-latency teleop — uses RTDE directly via the ur_rtde Python bindings at 500 Hz. Run URsim in a Docker container for sim-only testing as documented in the UR ROS 2 docs.
- MIR600Robot(Robot) — wraps the MIR REST API for high-level nav and the ROS 2 bridge for low-level state. The MIR is the carrier; the UR arm sits on top. Both expose state into the same bus.
- Add UR and MIR600 to RobotType in servo7/robot/robot_types.py and to the RobotFactory in servo7/robot/factory.py.

Neither of these touches the bus/network/QoS work — they're orthogonal and can land in any order.
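For the RTDE leg of URRobot, the hot-path calls are roughly these (a sketch against the ur_rtde Python bindings; servo parameters shown are illustrative starting points, not tuned values):

```python
# Sketch of a 500 Hz RTDE link for URRobot; not the full Robot implementation.
import numpy as np
from rtde_control import RTDEControlInterface
from rtde_receive import RTDEReceiveInterface

class URRtdeLink:
    def __init__(self, robot_ip: str) -> None:
        self.rtde_c = RTDEControlInterface(robot_ip)
        self.rtde_r = RTDEReceiveInterface(robot_ip)

    def get_joints(self) -> np.ndarray:
        return np.array(self.rtde_r.getActualQ())          # six joint positions, radians

    def set_joints(self, q: np.ndarray) -> None:
        # servoJ(q, speed, accel, time, lookahead_time, gain): stream one 2 ms setpoint.
        self.rtde_c.servoJ(list(q), 0.5, 0.5, 0.002, 0.1, 300)

    def close(self) -> None:
        self.rtde_c.servoStop()
        self.rtde_c.stopScript()
```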
1. Add tc netem impairment injection in CI. If injecting 1% loss / 5 ms jitter / 1% reorder doesn't show up in the new metrics, the harness is wrong. Fix it before measuring anything else.
2. Run the measurement ladder with InProcessBus + ZMQ remote. Diff the metrics. This alone tells you whether jitter is host-side, switch-side, or Wi-Fi-side — without changing any code beyond instrumentation.
3. Write ROS 2 QoS profiles with BEST_EFFORT / KEEP_LAST=1 / VOLATILE / DEADLINE=2×period. Build a FastDdsBus that uses them. Re-run the ladder. Likely removes the bulk of the jitter without any new dependency. (Decision gate — see below.)
4. Implement ZenohBus. ~80 LOC. Wire via a CLI flag in scripts/main_remote.py. Re-run the ladder. Pick whichever wins on the Wi-Fi rung.
5. Host tuning: linuxptp, performance governor, isolated cores for the teleop process and NIC IRQ. Validate with cyclictest. Re-run the ladder. The numbers from §3 only mean something on a tuned host.
6. Add the robots: URRobot via ur_robot_driver + RTDE, MIR600Robot via REST + ROS 2. Both consume the chosen bus. The teleop work above is what makes this safe.

Decision gates:
- If FastDdsBus already meets the budget at every rung, ship. Don't add Zenoh for its own sake. (We expect this likely doesn't hold on the Wi-Fi rung — but measure first.)
- If ZenohBus wins on the Wi-Fi rung by a clear margin (the published evidence says it should), promote it to default. Otherwise stay on tuned FastDDS — fewer moving parts.
Caveats and pointers:
- Envelope timestamps use perf_counter, which is per-host monotonic — fine within a host, useless across hosts without sync.
- Revisit the EMA α (or moving to a Kalman / one-euro filter) once the network jitter is quantified.
- tc-netem(8) man page
- cyclictest
- PubSub Protocol, now with subscribe() (the injection seam)
- _on_command_received
- _on_leader_state_change + sync ai_inference_worker + EMA + mode blender