Servo7 · Atlas (main @ 68ae39c) · 2026-04-28

Teleoperation Stack — Plan, Tests, and Network Choice

For a UR robotic arm on a MIR600 mobile base. Leader → follower over USB / Ethernet / Wi-Fi. Goal: a stack that is reliable, safe, and interpretable, with a clear path from in-process sim to real Wi-Fi.

TL;DR

  1. The in-process pipeline on main is already in good shape: polling tick-loops were removed in #22 / #29, the follower is event-driven via bus.subscribe() (callbacks fire synchronously on the publisher's thread), and an EMA smoother + mode blender already sit in front of every command. The "jitter" you see in the wild is almost certainly not coming from the local executor anymore.
  2. The remaining suspects, in order of likelihood: (a) the remote path in servo7/robot/robot_client.py is ZMQ PUB/SUB over TCP, and Linux TCP_RTO_MIN = 200 ms means a single drop pauses the stream; (b) default DDS / multicast discovery over Wi-Fi (the team has already written servo7/dds/start_discovery_server.sh as a workaround, the smoking gun); (c) with the new inline command path, a slow hardware write blocks the whole pipeline — but the tracer makes that immediately visible.
  3. The sinusoid test idea is half right. Phase shift gives latency, but a smooth sinusoid silently hides reordering and can't stress congestion. Replace it with: a monotonic sequence number in every payload (always), plus alternating workloads of chirp, step, and PRBS. Compute packet-level metrics (loss, reorder, AoI, PDV) and signal-level metrics (cross-correlation latency, coherence, transfer function, RMSE) on each run.
  4. Walk a 5-rung test ladder: in-process → loopback localhost → direct wired → switched Ethernet → Wi-Fi. The diff between rungs is more informative than any absolute number.
  5. Network stack: Primary — rmw_zenoh on ROS 2 Kilted with BEST_EFFORT / KEEP_LAST=1 / VOLATILE QoS for joint streams, on a dedicated 5/6 GHz Wi-Fi 6 AP with DSCP-tagged egress. Fallback — stay on FastDDS using the existing Discovery Server, but flip the QoS profile first; that alone likely removes 60-80% of the jitter.
  6. Architecture: don't rewrite anything. The codebase already has the perfect injection seam — the PubSub Protocol in servo7/bus/pubsub.py with publish / get_latest / subscribe. Implementing a ZenohBus or FastDdsBus conforming to it is ~80-100 lines per backend and zero changes to nodes.
  7. Three concrete first deliverables: (1) add a 24-byte sequence-numbered envelope and surface drop/reorder/PDV in the existing tracer/profiler panel; (2) repair tests/test_latency.py — it currently references methods removed by #22 / #29 / #32 and almost certainly doesn't exercise the real path; (3) add tc netem impairment injection in CI to validate the harness can detect what we claim it detects.
  8. UR + MIR600 are not in the codebase yet. Add them last, after the bus and metrics work — they'll plug into the same Robot interface as R1RosRobot and use whichever bus we pick.

1. Status Quo on main — What the Pipeline Actually Does

After #22 "moved from polling to event driven", #29 "no more tick loops", and #32 "Inline command path + LeRobot temporal ensembling + profiler", the teleop pipeline is fully event-driven. There is no command queue, no follower poll thread, no orchestrator tick loop in the teleop hot path.

┌─────────────┐  robot.get_state()       ┌──────────────────────────┐
│   LEADER    │ ─────────────────────────│ RobotNode.state_publisher│
│   robot     │  publish_rate_hz (30 Hz) │  (single thread, fixed)  │
└─────────────┘                          └──────────┬───────────────┘
                                                    │ stamp publish.{id}.pre_read
                                                    │ stamp publish.{id}.got_state
                                                    │ orjson.dumps(state.to_dict())
                                                    ▼
                                  bus.publish("state.leader", bytes)
                                                    │
                                                    │ (synchronous fan-out, OUTSIDE lock)
                                                    ▼
                         ┌──────────────────────────────────────────┐
                         │ InProcessBus.publish() invokes every     │
                         │ subscriber callback inline               │
                         └────────┬─────────────────────────────────┘
                                  │
                                  ▼
                         ┌─────────────────────────────────────┐
                         │ SystemStateManager._on_state_       │
                         │ update("leader", payload)           │
                         │ — dedupe by ts, fire callbacks      │
                         └────────┬────────────────────────────┘
                                  │ register_state_change_callback("leader", …)
                                  ▼
                         ┌─────────────────────────────────────┐
                         │ OrchestratorNode._on_leader_state_  │
                         │ change() — runs INLINE on the       │
                         │ publisher's thread                  │
                         │                                     │
                         │ if mode != TELEOPERATION: return    │
                         │ payload = converter.convert(s)      │
                         │ payload = relative_offset(payload)  │
                         │ payload = ema_smoother.apply(p)     │ ← EMA smoother
                         │ payload = mode_blender.apply(p)     │ ← cross-fade on mode toggle
                         │ bus.publish("action.follower",      │
                         │   dumps(payload.to_dict()))         │
                         └────────┬────────────────────────────┘
                                  │ inline subscribe()-callback
                                  ▼
                         ┌─────────────────────────────────────┐
                         │ RobotNode._on_command_received(p)   │
                         │ stamp .cmd_parsed                   │
                         │ stamp .cmd_pre_hw_write             │
                         │ robot.set_state(payload)            │ ← hardware write,
                         │ stamp .cmd_applied                  │   INLINE on publisher
                         │ trace_logger.log_trace(...)         │   thread
                         └─────────────────────────────────────┘

1.1 What's already in place — credit where due

Event-driven bus

InProcessBus at servo7/bus/in_process_bus.py now exposes publish, get_latest, and subscribe(topic, callback). Subscribers are invoked synchronously, but a snapshot of the subscriber list is taken under the lock so callbacks don't block other publishers and re-entrant publishes don't deadlock. Exceptions are swallowed and logged so one bad subscriber can't take down the bus.

Inline command path

OrchestratorNode._on_leader_state_change publishes directly to action.follower; RobotNode._on_command_received writes to hardware inline. Zero queue round-trips, zero polling. A slow hardware write becomes immediately visible as a long cmd_pre_hw_write → cmd_applied hop in the tracer.

EMA + mode blender

servo7/control/ema_filter.py applies a per-joint EMA (α = 0.1 for joints, α = 0.5 for the gripper) on every command. servo7/control/mode_blender.py ramps over 0.75 s on mode transitions so teleop↔AI handoffs don't snap. The EMA is reset on every set_control_mode().
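The smoothing semantics can be sketched in a few lines — a stand-alone illustration, not the actual ema_filter.py, using the α values above:

```python
class EmaSmoother:
    """Per-channel exponential moving average: y = alpha*x + (1-alpha)*y_prev."""

    def __init__(self, alpha_joints: float = 0.1, alpha_gripper: float = 0.5):
        self.alpha_joints = alpha_joints
        self.alpha_gripper = alpha_gripper
        self._prev: dict[str, float] | None = None

    def reset(self) -> None:
        # Called on every mode change so the filter starts clean.
        self._prev = None

    def apply(self, cmd: dict[str, float]) -> dict[str, float]:
        if self._prev is None:
            self._prev = dict(cmd)   # first sample passes through unchanged
            return cmd
        out = {}
        for name, x in cmd.items():
            a = self.alpha_gripper if name == "gripper" else self.alpha_joints
            out[name] = a * x + (1.0 - a) * self._prev[name]
        self._prev = out
        return out
```

Note the step response: a unit step into apply() decays exponentially toward the target, which is exactly why §2.3 recommends steps as an EMA stress test.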

Tracer + profiler

servo7/utils/trace_logger.py stamps every hop on the RobotState as it travels through the pipeline; flushes one JSONL record per completed trace on a daemon thread (non-blocking). Activated by ROBOT_TRACE=1 or --profile. Live in-memory deque (100k traces, ~7 min @ 240/s) feeds robot-frontend/src/components/ProfilerPanel.jsx; offline analysis via tools/profiler.py renders a self-contained HTML report.

Sync AI worker (deterministic)

OrchestratorNode.ai_inference_worker runs on a dedicated sync thread at command_hz (default 30 Hz). No event-loop jitter on AI ticks. Same EMA/blender pipeline as teleop.
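A deterministic fixed-rate worker is usually built on absolute deadlines rather than relative sleeps, so a slow tick delays but never accumulates as drift. A sketch of the pattern (the actual worker internals may differ):

```python
import time


def run_fixed_rate(tick, hz: float, stop) -> int:
    """Call tick() at hz using absolute deadlines. Returns the tick count."""
    period = 1.0 / hz
    next_deadline = time.perf_counter() + period
    n = 0
    while not stop():
        tick()
        n += 1
        now = time.perf_counter()
        if now < next_deadline:
            time.sleep(next_deadline - now)
        next_deadline += period
        # If we overran the period, skip ahead instead of bursting to catch up.
        if next_deadline < time.perf_counter():
            next_deadline = time.perf_counter() + period
    return n
```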

Single bus construction site

Each scripts/main*.py instantiates exactly one bus — bus: PubSub = InProcessBus() — and passes it to every node. Swapping to a network bus is a one-line change per script.

1.2 What's missing — the real jitter sources

Suspect 1 — TCP under the ZMQ remote path
servo7/robot/robot_client.py uses zmq.SUB / zmq.PUB over tcp://. Linux's TCP_RTO_MIN is hard-coded to 200 ms regardless of measured RTT. A single dropped packet on the wire stalls every subsequent message at the receiver for at least 200 ms; at 100 Hz that is ~20 stale joint frames piled up behind the gap. On Wi-Fi where micro-bursts of loss are routine, this is the single most likely cause of the jitter you see across machines. This is not solved by the in-process improvements on main.
Suspect 2 — DDS multicast discovery on Wi-Fi
Default DDS discovery uses UDP multicast on 239.255.0.1. Most enterprise APs convert multicast to broadcast at the lowest mandatory rate (1 Mbps on 2.4 GHz), rate-limit it, or drop it under load. The fact that servo7/dds/start_discovery_server.sh exists with a hard-coded laptop IP 192.168.1.85:11811 means the team has already hit this and worked around it once. Discovery Server is the FastDDS-specific fix; Zenoh sidesteps the problem entirely by not using multicast for discovery.
Suspect 3 — inline pipeline blocks on slow hardware
Because InProcessBus.publish() fans out synchronously and _on_command_received calls robot.set_state() inline, a slow USB/CAN/RTDE write on the follower stalls the leader's state_publisher thread for the duration of the write. This is the right trade-off (fewer queues, lower latency, deterministic) but it means we need to keep an eye on the cmd_pre_hw_write → cmd_applied hop in the tracer. Move the slow side off the hot path only if profiler data shows it.
Suspect 4 — no sequence numbers, no ordering signal
InProcessBus overwrites the latest message by topic; the ZMQ path embeds no counter; the trace envelope carries hop timestamps but not a monotonic per-(topic, source) sequence number. We can't today distinguish "two adjacent samples were dropped" from "two were reordered" from "the sender hiccupped." This is a 24-byte fix (§4.2).
Suspect 5 — tests/test_latency.py is stale
The test references methods removed by #22 / #29 / #32: follower.command_listener(), follower.command_executer(), orchestrator.control_loop(), orchestrator.publisher_worker(). It almost certainly is not exercising the current pipeline, which means our latency CI is currently a placebo. Repair before extending.

1.3 What does not exist yet on main

2. The Test — Better Than a Sinusoid

The intuition behind the sinusoid plan is right: drive a known signal through the leader, record what arrives at the follower, and measure phase shift = latency. The plan needs to be hardened so it can also catch the things you actually care about — packet loss, reordering, and tail latency.

2.1 Metrics that matter

| Metric | Definition | What it surfaces |
|---|---|---|
| Latency p50 / p95 / p99 / max | End-to-end one-way delay distribution | The mean is useless for safety. Max and p99 are what trip protective stops. |
| PDV (jitter, RFC 3393) | Variation in one-way delay between packet pairs | True network jitter. Distinct from inter-arrival variation, which conflates sender + network. |
| Packet loss rate | Drops / sent | For BEST_EFFORT: real loss. For RELIABLE: shows up as latency spikes from retransmits. |
| Reorder rate | Count where seq < max-seen seq | The thing a sinusoid will silently miss. |
| Age-of-Information (AoI) | now − timestamp_of_last_received_sample | A single freshness number; correlates with closed-loop control quality better than raw latency. |
| Deadline-miss rate | Periods with no fresh command in time | The metric the safety case actually rests on. |
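Given sequence-numbered, timestamped records per packet (exactly what the envelope in §4.2 provides), the packet-level metrics reduce to a few lines. A sketch with illustrative field names:

```python
def packet_metrics(records, sent_count):
    """records: list of (seq, send_ts, recv_ts) in arrival order.
    Returns loss rate, reorder count, PDV samples (RFC 3393 style:
    delay difference between consecutive pairs), and an AoI function."""
    reorders, max_seen = 0, -1
    for seq, _, _ in records:
        if seq < max_seen:
            reorders += 1                  # arrived after a higher seq
        max_seen = max(max_seen, seq)
    loss_rate = 1.0 - len(records) / sent_count
    delays = [rx - tx for _, tx, rx in records]
    pdv = [abs(b - a) for a, b in zip(delays, delays[1:])]

    def aoi(now: float) -> float:          # freshness at time `now`
        return now - max(rx for _, _, rx in records)

    return loss_rate, reorders, pdv, aoi
```

Latency percentiles come from sorting `delays`; deadline-miss rate is counting control periods with no record whose `recv_ts` falls inside them.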

2.2 Why a sinusoid alone is not enough

Smoothness hides reordering
Two adjacent samples on a 1 Hz sine, swapped, reconstruct identically at the millimetre level. The reorder rate from a sinusoid alone reads zero even when the network is misbehaving. Single-frequency excitation also probes only one point on the channel's transfer function; bandwidth-dependent effects (queueing under burst, MTU fragmentation) won't show up. And a 6-DOF sinusoid at 250 Hz is ~12 kbps — the network is idle, so no congestion stress.

2.3 The signal set that does the job

Sequence number — non-negotiable

A monotonic counter inside the payload on every packet. This is how you detect drops and reorders independent of signal shape. Every other test below assumes it. Cost: 8 bytes.

PRBS / m-sequence on one joint

Each sample is uniquely identifiable, so reordering becomes visible at the signal level too. LFSR-generated maximum-length sequences have white-noise-like autocorrelation — ideal for system-identifying the network's transfer function.
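Generating one is a few lines with a Fibonacci LFSR; the sketch below uses the standard PRBS7 polynomial x⁷ + x⁶ + 1 (period 2⁷ − 1 = 127):

```python
def prbs7(seed: int = 0x7F):
    """Yield a maximum-length ±1 sequence from a 7-bit Fibonacci LFSR
    (x^7 + x^6 + 1, period 127). seed must be non-zero."""
    state = seed & 0x7F
    while True:
        bit = ((state >> 6) ^ (state >> 5)) & 1   # taps at bits 7 and 6
        state = ((state << 1) | bit) & 0x7F
        yield 1.0 if bit else -1.0
```

Scale by a small joint amplitude and emit one sample per command tick; the near-impulse autocorrelation is what makes it usable for system identification.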

Chirp (log frequency sweep, 0.1 Hz → Nyquist/2)

Exposes bandwidth-dependent fidelity. Lets you compute coherence vs frequency and a Bode-plot-style transfer function in a single short run.
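The log sweep has a closed-form phase, so no signal-processing library is needed. A sketch (unit amplitude — scaling to joint units is the caller's job):

```python
import math


def log_chirp(t: float, f0: float, f1: float, duration: float) -> float:
    """Unit-amplitude logarithmic chirp: instantaneous frequency sweeps
    f0 -> f1 exponentially over `duration` seconds."""
    k = (f1 / f0) ** (1.0 / duration)
    # phase(t) = 2*pi*f0*(k^t - 1)/ln(k);  f_inst(t) = f0 * k^t
    phase = 2.0 * math.pi * f0 * (k ** t - 1.0) / math.log(k)
    return math.sin(phase)
```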

Step / square wave

Exposes deadline misses, slew-rate clipping, and any rate-limiter in the controller path. Worst-case transient response — and a useful stress test for the EMA smoother (a step into EMA becomes an exponential, by design).

Run all of these on every link. Keep the sinusoid as a sanity check / phase-shift demo, not as the headline test.

2.4 What to compute from s_L(t) and s_F(t)

2.5 Pragmatic test ladder

Run identical workloads at every rung; the diff between rungs tells you where the time goes.

| # | Rung | Adds | Expected p99 | Red flag |
|---|---|---|---|---|
| 1 | In-process (single Python process, current default) | Inline callbacks + serializer + GIL | tens of µs | p99 > 1 ms ⇒ a callback is doing real work; check tracer hop times |
| 2 | Loopback localhost (UDP/DDS via 127.0.0.1) | Kernel UDP + DDS shm/UDP | ~100-300 µs | max > 5 ms or any loss ⇒ CPU contention or DDS QoS misconfig |
| 3 | Direct point-to-point Ethernet between two boxes | NIC + cable + IRQ handling | +100-500 µs over rung 2 | PDV > a few hundred µs, or any reordering (a direct cable cannot reorder — host-stack issue) |
| 4 | Switched Ethernet | Switch buffering | small bump over rung 3 | PDV grows with cross-traffic, or loss under iperf3 background ⇒ HoL blocking / cheap switch |
| 5 | Wi-Fi (5/6 GHz, dedicated AP) | RF contention, retries, roaming | 5-20× rung 4, bursty | max > 100 ms, AoI spikes correlated with airtime contention. This is where the MIR600 lives. |

2.6 Experiment hygiene (so the numbers mean something)

2.7 Tools that earn their place

tc netem — Linux traffic control

The single most useful tool. Inject controlled delay, jitter, loss, dup, reordering on the egress of either host. If your harness can't detect 1% loss injected with tc qdisc add dev eth0 root netem loss 1%, your harness is broken before you start. Use it to sweep impairment levels and find the controller's failure point.
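A sweep harness can be a few lines of shell. In the sketch below, IFACE, the impairment levels, and the dry-run default are assumptions to adjust; real runs need root:

```shell
#!/usr/bin/env bash
set -euo pipefail

# run: echo the command in dry-run mode (default), execute via sudo otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else sudo "$@"; fi; }

# Sweep netem loss levels on IFACE (default eth0 -- adjust for your host),
# running the teleop workload and collecting metrics at each point.
netem_sweep() {
  local iface="${IFACE:-eth0}"
  for loss in 0.1% 0.5% 1% 2%; do
    run tc qdisc replace dev "$iface" root netem loss "$loss" delay 2ms 1ms
    # ... run the chirp/step/PRBS workload and collect metrics here ...
    run tc qdisc del dev "$iface" root
  done
}
```

Inspect the echoed commands with `DRY_RUN=1` before letting it touch a real interface.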

iperf3 — background load

Run UDP at a controlled bitrate alongside teleop traffic; watch your metrics on the teleop link degrade. Tests whether QoS / DSCP tagging actually works.

Wireshark / tshark

Ground truth. Capture on both ends with PTP-synced timestamps; compute one-way delay and PDV directly from the pcaps. Indispensable for diagnosing DDS discovery storms.

Apex.AI performance_test

The right tool for isolating the DDS layer from your application. Run with the same QoS profile as your teleop topics on the same hosts before blaming the network.

cyclictest

Validates host real-time configuration. If it shows hundreds of µs of host jitter, your network numbers are dominated by host scheduling and you're measuring the wrong thing.

The existing tracer + profiler panel

Already industrial-grade. Extend with three columns derived from the new sequence number: loss count, reorder count, PDV histogram. Don't build a parallel dashboard; fold the metrics into robot-frontend/src/components/ProfilerPanel.jsx and tools/profiler.py.

3. The Network Stack — What Actually Works

3.1 Protocol: TCP vs UDP vs QUIC vs RT-Ethernet

| Protocol | Verdict for hot path | Reason |
|---|---|---|
| TCP | Wrong | Linux TCP_RTO_MIN = 200 ms, hard-coded. One drop ⇒ the stream stalls 200 ms while every later packet waits for the retransmit. At 500 Hz that's ~100 stale joint frames queued. Wi-Fi micro-bursts of loss are routine ⇒ this is the textbook teleop jitter cause. The current ZMQ remote path is exactly this. |
| UDP | Right pattern | "Publish a fully-self-describing snapshot every tick, never retransmit." A dropped packet is replaced by the next one ~1-10 ms later — below human-perceptible. Handle ordering with a sequence number; drop OOO frames at the receiver. |
| QUIC | Skip for now | Multiplexed streams over UDP fix head-of-line blocking across streams, but each stream is still reliable + ordered (= TCP-like). RFC 9221 DATAGRAM frames give you raw datagrams alongside, but the Python tooling is immature. Not worth introducing here. |
| RT-Ethernet (EtherCAT, PROFINET, TSN) | N/A here | Sub-ms determinism, but requires kernel bypass, dedicated NICs, and a layer-2 path with no Wi-Fi. It's how the UR controller talks to its joints internally — not the leader↔follower link. |

3.2 Middleware

| Middleware | Role | Notes for our case |
|---|---|---|
| FastDDS / Cyclone DDS | Fallback | ROS 2 default; UR driver + MIR600 are DDS-native. Right QoS for a 500 Hz joint stream: RELIABILITY=BEST_EFFORT, HISTORY=KEEP_LAST depth 1, DURABILITY=VOLATILE, explicit DEADLINE = control period. RELIABLE + deep history is the classic teleop footgun. Multicast discovery is the Wi-Fi pain point — use the Discovery Server you've already scripted. |
| Zenoh / rmw_zenoh | Primary | Eclipse project, pluggable transports. Promoted to Tier 1 in ROS 2 Kilted (May 2025); one env-var swap from FastDDS. Published evidence (arXiv 2309.07496): Cyclone wins on Ethernet, Zenoh wins on Wi-Fi and 4G with the smallest trajectory drift. Doesn't depend on UDP multicast for discovery ⇒ skips the AP-broken-multicast disaster entirely. |
| ZeroMQ | — | Brokerless, simple, great Python. PUB/SUB with CONFLATE=1 mirrors today's "latest-wins" semantics exactly, but there is no QoS framework. To keep the existing ZMQ code path, switch the transport from tcp:// to udp:// + epgm://, or to radio/dish (UDP multicast) — that alone removes the 200 ms RTO problem. You lose graph introspection and free interop with the UR driver topics. |
| iceoryx2 | — | Rust rewrite, true zero-copy shm. Alpha for ROS 2 Rolling. Same-host only; no cross-host gateway yet. Useful if leader and follower ever run on the same box, irrelevant for the network leg. |
| WebRTC data channels | — | SCTP over DTLS over UDP. Each channel can be ordered=false, maxRetransmits=0. A real option for control over arbitrary networks (NAT/STUN/ICE built in). On the same LAN, the encryption + congestion control are cost without benefit. Reasonable only if the same stack also has to carry video later. |
| gRPC | — | HTTP/2 over TCP. Same 200 ms RTO problem as raw TCP. Fine for config/RPC ("home robot", "set TCP", "estop"); wrong for joint streams. |

3.3 Off-the-shelf "<40 ms" claims — the honest table

| Claim source | Reality |
|---|---|
| 5G teleop demos (Ericsson, Vodafone, HoloLight) | Marketing. The 40 ms is media-plane RTT in a clean lab and excludes any control/safety stack. Ignore. |
| Foxglove / Lichtblick | Visualization, not transport. Will not improve latency; can mask jitter in plots if you don't pin timestamps. |
| Nimble / Sanctuary / Reflex | Proprietary, hardware-locked. Not buyable as a stack. |
| UR RTDE | Real and useful. 500 Hz on e-Series, TCP on port 30004; over a dedicated short link the 200 ms RTO almost never fires. Use RTDE for the UR side; the ROS 2 driver wraps it. |
| URsim + ROS 2 (quiet local stack) | ~5-15 ms leader-input → follower-state. The cross-Wi-Fi hop is what blows the budget. |
| ros2_control + PREEMPT_RT | Real, free, well-understood. Removes scheduler jitter (sub-100 µs vs 1-10 ms on a stock kernel). Does nothing for network jitter — layer it under whichever middleware you pick. |

Honest verdict: <40 ms end-to-end is achievable on quiet wired Ethernet and unreliable on shared Wi-Fi, regardless of whose slide deck claims it.

3.4 The Wi-Fi-specific story

4. Architecture — Where to Inject the Network Layer

The codebase already has the right seam. Don't refactor; plug into it.

# servo7/bus/pubsub.py — already on main
from typing import Callable, Protocol

class PubSub(Protocol):
    def publish(self, topic: str, payload: bytes) -> None: ...
    def get_latest(self, topic: str) -> bytes | None: ...
    def subscribe(self, topic: str, callback: Callable[[bytes], None]) -> None: ...

Three methods. Every node (RobotNode, OrchestratorNode, SystemStateManager) takes a bus: PubSub in its constructor and never references a concrete implementation. Each scripts/main*.py has exactly one construction site:

bus: PubSub = InProcessBus()

4.1 Proposed concrete buses

ZenohBus target

Wraps zenoh.open(). publish() calls session.put(topic, payload). subscribe() registers a Zenoh subscriber whose callback invokes the user's callback synchronously (matches InProcessBus semantics — the user's callback pays the cost). get_latest() reads from a per-topic latched cache populated by the subscriber. Estimate: ~80 lines.

FastDdsBus fallback

Wraps rclpy with QoS BEST_EFFORT / KEEP_LAST=1 / VOLATILE / DEADLINE using the existing Discovery Server XML. subscribe() attaches a DDS DataReader listener; get_latest() snapshots the last sample. Estimate: ~100 lines.

UdpBus test rig

Bare UDP with seq-numbered framing. Used as a noise-floor reference and as the netem target during the test ladder. ~60 lines.
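A sketch of what that could look like (stdlib only; the framing and single-socket topology are assumptions, and out-of-order frames are dropped by sequence number per the §3.1 pattern):

```python
import socket
import struct
import threading

_FRAME = "<QH"                       # seq (u64), topic length (u16)
_FRAME_LEN = struct.calcsize(_FRAME)  # then topic bytes + payload


class UdpBus:
    """Latest-wins pub/sub over bare UDP with seq-numbered framing."""

    def __init__(self, send_addr, bind_addr):
        self._send_addr = send_addr
        self._seq = 0
        self._lock = threading.Lock()
        self._latest, self._subs, self._max_seen = {}, {}, {}
        self._tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._rx.bind(bind_addr)
        threading.Thread(target=self._recv_loop, daemon=True).start()

    def publish(self, topic: str, payload: bytes) -> None:
        with self._lock:
            self._seq += 1
            seq = self._seq
        t = topic.encode()
        self._tx.sendto(struct.pack(_FRAME, seq, len(t)) + t + payload,
                        self._send_addr)

    def get_latest(self, topic):
        with self._lock:
            return self._latest.get(topic)

    def subscribe(self, topic, callback) -> None:
        with self._lock:
            self._subs.setdefault(topic, []).append(callback)

    def _recv_loop(self) -> None:
        while True:
            data, _ = self._rx.recvfrom(65535)
            seq, tlen = struct.unpack(_FRAME, data[:_FRAME_LEN])
            topic = data[_FRAME_LEN:_FRAME_LEN + tlen].decode()
            payload = data[_FRAME_LEN + tlen:]
            with self._lock:
                if seq <= self._max_seen.get(topic, 0):
                    continue         # out-of-order frame: never deliver stale
                self._max_seen[topic] = seq
                self._latest[topic] = payload
                subs = list(self._subs.get(topic, []))
            for cb in subs:
                cb(payload)
```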

NetemBus testing only

Decorator wrapping any other bus, injecting controlled delay/loss/reorder via tc netem on a loopback interface. Used to validate that the harness can detect what we claim it detects.

4.2 Sequence-numbered envelope (do this first, regardless of bus choice)

Today's payload is a JSON dict produced by RobotState.to_dict(). Wrap it in a 24-byte binary header so the consumer can compute drop / reorder / PDV without parsing JSON:

# servo7/bus/envelope.py  (new, ~30 LOC)
import struct, time

HEADER_FMT = "<QdQ"   # seq (u64), send_ts (f64 perf_counter), source_id (u64 hashed)
HEADER_LEN = struct.calcsize(HEADER_FMT)

def wrap(seq: int, source_id: int, payload: bytes) -> bytes:
    return struct.pack(HEADER_FMT, seq, time.perf_counter(), source_id) + payload

def unwrap(b: bytes) -> tuple[int, float, int, bytes]:
    seq, ts, src = struct.unpack(HEADER_FMT, b[:HEADER_LEN])
    return seq, ts, src, b[HEADER_LEN:]

Wrap inside RobotNode.state_publisher (per-(topic, source) monotonic seq), unwrap inside SystemStateManager._on_state_update and feed (seq, send_ts, recv_ts) into the existing tracer as a structured event alongside the hop list.

4.3 Extend the existing profiler — do not build a new one

The tracer at servo7/utils/trace_logger.py is good enough; add three columns derived from the new envelope: loss count, reorder count, and a PDV histogram.

Surface these in robot-frontend/src/components/ProfilerPanel.jsx's existing summary block (where p50/p95/p99/max already live) and in the offline HTML produced by tools/profiler.py. The frontend filters by (origin_stage, terminal_stage) already; we just add three counters per filter.

4.4 Repair tests/test_latency.py

The existing test references methods that no longer exist on main: follower.command_listener(), follower.command_executer(), orchestrator.control_loop(), orchestrator.publisher_worker(). Rewrite it to drive the event-driven path directly:

  1. Build an InProcessBus + a leader DummyRobot with random-walk state + a follower DummyRobot.
  2. Start leader_node.state_publisher as a thread.
  3. Start the SystemStateManager + OrchestratorNode as on main; subscribe a probe to action.follower.
  4. Drive a chirp / step / PRBS workload for ≥10 s (warm-up included), with ROBOT_TRACE=1.
  5. Read the JSONL trace file and assert: p99 latency < threshold, max loss = 0, max reorder = 0, no cmd_pre_hw_write → cmd_applied hop > 5 ms.

This is the in-process noise floor (rung 1). Every other rung re-runs this same test with a different bus.

4.5 UR + MIR600 integration sketch

Two new robots, both extending the existing ROS robot pattern (R1RosRobot already does this in servo7/robot/r1.py):

Neither of these touches the bus/network/QoS work — they're orthogonal and can land in any order.

5. The Pragmatic Plan

  1. Week 1 — Repair the harness. Add the 24-byte sequence-numbered envelope (§4.2). Surface drop / reorder / PDV in the existing profiler panel (§4.3). Rewrite tests/test_latency.py against the event-driven path on main (§4.4). This gives us the metric vocabulary to argue about everything else, and removes the placebo CI.
  2. Week 1 — Sanity-check the harness. Add tc netem impairment injection in CI. If injecting 1% loss / 5 ms jitter / 1% reorder doesn't show up in the new metrics, the harness is wrong. Fix it before measuring anything else.
  3. Week 2 — Walk the ladder on the dummy robot. Run sinusoid + chirp + step + PRBS at all five rungs (in-process, loopback, direct wired, switched, Wi-Fi) using the existing InProcessBus + ZMQ remote. Diff the metrics. This alone tells you whether jitter is host-side, switch-side, or Wi-Fi-side — without changing any code beyond instrumentation.
  4. Week 2 — Quick win: tune the existing FastDDS QoS. Move the existing leader/follower DDS XML configs to BEST_EFFORT / KEEP_LAST=1 / VOLATILE / DEADLINE=2×period. Build a FastDdsBus that uses them. Re-run the ladder. Likely removes the bulk of the jitter without any new dependency. (Decision gate — see below.)
  5. Week 3 — Build ZenohBus. ~80 LOC. Wire via a CLI flag in scripts/main_remote.py. Re-run the ladder. Pick whichever wins on the Wi-Fi rung.
  6. Week 3-4 — Host hygiene. PREEMPT_RT kernel on both hosts, PTP via linuxptp, performance governor, isolated cores for the teleop process and NIC IRQ. Validate with cyclictest. Re-run the ladder. The numbers from §3 only mean something on a tuned host.
  7. Week 4 — Wi-Fi hygiene. Dedicated 5/6 GHz Wi-Fi 6 AP, no other clients on the SSID, DSCP-tagged control egress, WMM verified at the AP. Document RSSI and channel. Re-run the Wi-Fi rung.
  8. Week 5+ — UR + MIR600 on top. URRobot via ur_robot_driver + RTDE, MIR600Robot via REST + ROS 2. Both consume the chosen bus. The teleop work above is what makes this safe.

5.1 Decision gates

Gate 1 — after week 2
If a tuned-QoS FastDdsBus already meets the budget at every rung, ship. Don't add Zenoh for its own sake. (We expect this likely doesn't hold on the Wi-Fi rung — but measure first.)
Gate 2 — after week 3
If ZenohBus wins on the Wi-Fi rung by a clear margin (the published evidence says it should), promote it to default. Otherwise stay on tuned FastDDS — fewer moving parts.
Gate 3 — after week 4
If after host + Wi-Fi hygiene the Wi-Fi rung still has p99 > 50 ms, escalate: either accept the budget cannot be met on this Wi-Fi infrastructure (and the system needs a wired teleop link or a buffer/predictor on the follower), or adopt one of the heavier options (WebRTC, dedicated UDP transport with FEC). Do not hand-wave past this gate.

6. Open Questions

7. References

Test methodology

Network & middleware

In the Atlas codebase (main @ 68ae39c)

Compiled 2026-04-28 against main @ 68ae39c, after #22 "moved from polling to event driven", #29 "no more tick loops", #32 "Inline command path + LeRobot temporal ensembling + profiler", and recent profiler / trace refinements. Multi-agent investigation: codebase trace · teleop QoS literature · 2026 ROS 2 / DDS / Zenoh state of the art. All recommendations are testable; the test ladder in §2.5 is the falsification protocol.