Teleop on the wire

Measuring, fixing, and rethinking the latency between an operator and a remote robot.

00 Introduction

Teleoperation is where modern robotics starts. Before any policy is trained, before any data set exists, before the full set of AI components that make a robot autonomous (vision, object detection, inverse kinematics, failure detection, and so on) is even built, a human with a controller can drive the robot through useful work. That fact carries more weight than it sounds like it should. It lets the robot do real tasks for real users while the autonomous stack is still being developed. It opens robot operation up to anyone who can hold a controller, not just the engineers who built the robot.

This is why teleoperation will be core to Atlas as a product even once autonomy is on the table. A deployed robot does not need to be fully autonomous to be useful; it needs to be useful, often, and predictable. Teleoperation closes the gap on the days when autonomy is not yet good enough, and the same loop produces the data that makes the autonomy better. The two run together. Ideally the operator can be anywhere, because the value of teleop grows with how far the operator can sit from the robot, and in practice the link between them runs over any kind of network: a cable in a lab, a Wi-Fi access point in a customer's building, a continent-spanning WAN from somebody's home to a robot in another country. Each of those links has different failure modes, which if not handled properly can split someone in half.

This document is the record of one branch of work on that link. The branch rebuilt the way the operator-to-robot loop is measured, ran the same workload across local, wired, and wireless setups, and chased the largest sources of variation in latency until each had been either fixed or eliminated as a suspect.

01 Network primer

A more thorough primer lives in the Networking primer tab at the top of this page. The compressed version below is the minimum mental model needed to read the rest of this tab: the layered model the whole stack rests on, and the link layer that sets the latency floor underneath everything else. Anything deeper (encapsulation, TCP versus UDP, head-of-line blocking, QUIC, WebRTC, ROS, ZeroMQ) lives in the primer.

The layered model

A network works because each layer does one job and trusts the layer below to do its job. The four that matter for teleop are summarised here.

Layer       | Job                                                   | Examples
Link        | Move bits between two devices that share a medium     | Ethernet, Wi-Fi, 5G
Network     | Route packets across many links                       | IPv4, IPv6
Transport   | Deliver to the right program with chosen reliability  | TCP, UDP, QUIC
Application | Whatever you wrote                                    | HTTP, the teleop control loop
[Diagram: the operator's four-layer stack (application, transport, network, link) mirrored by the robot's; only the two link layers touch the physical medium: copper cable, fibre, a Wi-Fi radio channel, a cellular link.]
Bits travel down the operator's stack, across the medium once, and up the robot's stack. The dashed horizontal lines are conceptual: each layer behaves as if it were talking to its peer at the same level, but only the link layer ever touches the physical medium.

The link layer sets the floor

If the operator-to-robot path includes a single Wi-Fi hop, then the floor on latency and on variation is set by that Wi-Fi hop, no matter how clever the protocol above it is. Wi-Fi is a take-turns shared medium (half-duplex): every device on the channel, including the neighbours' devices, waits for every other device to finish transmitting, and the wait gets quietly catastrophic when the access point is busy. Ethernet sends and receives simultaneously on dedicated wires (full-duplex), is switched, deterministic, sub-millisecond per hop, and essentially zero loss. Anything done in software is rounding error compared to the choice of medium.

Medium             | Latency           | Loss      | Variation
Ethernet           | <0.5 ms per hop   | ~0%       | microseconds
Wi-Fi 6 (5 GHz)    | 1 to 5 ms typical | 0.1 to 2% | tens of milliseconds
4G / 5G / Starlink | 10 to 60 ms       | variable  | tens of milliseconds

DSCP and Wi-Fi priority

Every IP packet has a small priority byte in its header. On Wi-Fi, the access point uses that byte to decide which packets get airtime first. Marking control packets with the highest priority value puts them ahead of background traffic such as a queued web download. This is applied on both ends of the teleop link.
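To make the mechanism concrete, here is what the marking looks like from user space. The socket option is the standard Linux/IPv4 one; the address and port are placeholders, and this is a sketch rather than the project's transport code.

```python
import socket

DSCP_EF = 46               # Expedited Forwarding codepoint
TOS_EF = DSCP_EF << 2      # DSCP occupies the upper six bits of the IPv4 TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_EF)   # every packet on this socket is now marked EF

# An EF-marked datagram is eligible for the access point's highest-priority
# airtime queue instead of the default best-effort one.
sock.sendto(b"teleop command bytes", ("192.0.2.10", 5555))  # placeholder address and port
```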

02 Measurements and metrics

The goal of the measurement work on this branch is to know, with confidence and in detail, where time is spent between an operator's hand and a robot's actuators. In the Atlas teleop loop, the leader process polls the controller (for example a Quest) at 100 Hz via a get_state call, runs the result through inverse kinematics and filtering, hands the resulting command to the orchestrator, which forwards it through the remote robot client, across the wire to the remote robot server, and finally into a set_state call on the robot driver. The number of interest is the elapsed time from get_state on the operator side to set_state applied on the robot side, broken down per stage so the slow step is identifiable whenever the total is slow.

What actually happens between hand and actuator

Before getting to the metrics themselves it is worth being concrete about the path the number measures. The only pull in the chain sits at the very top: the leader's publish thread calls get_state on the Quest robot adapter, which reaches into a controller-receiver thread for the latest pose. From there onward every stage is push. An in-process pub/sub bus delivers each new state to subscribers by callback, the orchestrator is not a separate process but a callback registered on the leader topic that runs inline on the publish thread and converts to follower joint space, and the remote robot client then serialises the command and hands it to the wire transport. The transport itself is selectable at launch (raw UDP or ZMQ pub/sub), but the stages, the trace stamps, and the threading model are identical either way, which is what keeps the A/B comparisons later in this document trustworthy.

After the robot applies the command the remote robot server sends a small receipt back to the operator carrying the three robot-side timestamps for that trace: kernel-receive, userspace-receive, and applied. Those stamps live in the robot host's perf_counter frame, which is meaningless on the operator host. When the receipt arrives the operator translates each stamp through the running clock-sync offset (the four-stamp exchange described in the next subsection) into its own clock frame and only then closes the trace. That translation is what turns two disjoint per-host timelines into a single elapsed number from get_state on the operator to set_state applied on the robot. Until clock sync has produced an offset, traces involving the wire are dropped rather than reported with a fake projected zero.

[Diagram: the full command path. On the operator host, the Meta Quest controller streams poses (11 floats per arm, tracking flags, grip values) over the local network to a controller-receiver thread; Leader.get_state() pulls the latest pose, runs per-arm IK (Pinocchio, thread pool) and filtering (stamp publish.{leader}.got_state); the in-process pub/sub bus fires subscriber callbacks inline (stamp ssm.leader.deserialized); the OrchestratorNode callback converts leader to follower joint space (stamps orch.cb_enter, orch.cb_converted, orch.cb_pre_serialize); RemoteRobotClient.set_state() serialises the command plus trace hops to JSON and hands it to the configured transport, UDP or ZMQ (stamp follower.{id}.command_wire_sent), caching the trace_id pending the receipt. On the robot host, the kernel ingests the packet (kernel receive timestamp on the UDP path), RemoteRobotServer.listener() returns it to Python (stamp remote.{id}.kernel_received), _handle_command() deserialises it into a RobotState (stamp remote.{id}.receive), robot.set_state() applies it on the hardware driver (stamp remote.{id}.applied), and _send_receipt() returns the three robot-side stamps on the same transport; the operator's receipt handler projects them onto the operator clock and writes the closed trace to traces/*.jsonl (terminal stamp follower.{id}.remote_applied). Parallel back-channels run on their own sockets and threads: robot state at ~100 Hz, the t0/t1/t2/t3 clock-sync exchange at 1 Hz, and JPEG camera frames at ~30 Hz. The command on the wire is ~200 B of JSON; the receipt carries the kernel-receive, receive, and apply stamps.]

The trace logger

A small piece of instrumentation called the trace logger captures this. Every teleop_cmd that flows through the system starts a trace record, and each subsequent stage along its path appends a timestamp to that record. When the command is applied on the robot, the record is closed and written as one line of JSON to traces/robot_trace_*.jsonl. The same file is read by the live profiler panel, the profiler sidecar, the offline analysis tool, and the test harness, so every consumer sees the same source of truth.

A typical record carries stamps for the stages below.

leader.get_state: operator pose polled from controller
operator.pipeline_done: IK, filtering, command shaping finished
command_wire_sent: serialised, handed to the transport socket
remote.kernel_received: robot kernel had the packet (UDP only)
remote.follower.receive: Python on the robot deserialised it
remote.follower.applied: set_state returned on the hardware driver

One trace record is one full operator-to-actuator path, with the breakdown that answers "which step took the time".
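The mechanics behind that record are small. The sketch below mirrors the start-stamp-close lifecycle described above; the class and method names are illustrative, not the project's actual logger.

```python
import json
import time

class TraceLogger:
    """Illustrative sketch: one record per teleop_cmd, one JSON line when it closes."""

    def __init__(self, path):
        self._file = open(path, "a")
        self._open = {}                                   # trace_id -> in-flight record

    def start(self, trace_id):
        self._open[trace_id] = {"trace_id": trace_id, "kind": "teleop_cmd", "stamps": {}}

    def stamp(self, trace_id, stage, t_ns=None):
        # Each stage appends a monotonic timestamp in its own host's clock frame.
        self._open[trace_id]["stamps"][stage] = t_ns if t_ns is not None else time.perf_counter_ns()

    def close(self, trace_id):
        record = self._open.pop(trace_id)
        self._file.write(json.dumps(record) + "\n")       # one line of JSONL per command
        self._file.flush()
```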

Clock synchronisation

There is a problem hiding in the stage list above. The operator runs on one machine, the robot on another, and each side reads its stamps from a monotonic clock: a clock that only counts forward and is never adjusted by the operating system for time-zone changes, daylight saving, or wall-clock corrections. Monotonic clocks are ideal for measuring how long something takes within a single process, but their zero point is arbitrary per process. The operator's stamps and the robot's stamps live in two different number spaces. You can subtract two stamps on the same machine to get a duration. You cannot subtract a stamp on the operator from one on the robot and get anything meaningful.

Before this branch, the trace worked around the gap by assuming the network was symmetric and dividing one round trip in half to get the one-way delay. That assumption breaks on Wi-Fi. The two directions of the link queue and schedule independently on the access point, and on a busy consumer AP the downlink direction (AP to client) is noticeably slower than the uplink direction (client to AP) because the AP has to serve every connected client. Splitting the round trip in half therefore over-reports one direction and under-reports the other, and any per-stage breakdown derived from that number is misleading exactly when it matters most.

There are a few standard ways to fix this. PTP, the Precision Time Protocol, achieves sub-microsecond accuracy by exchanging hardware-stamped messages and requires NIC support on both ends. NTP-style software exchange, the classic four-stamp protocol from RFC 5905, is accurate to roughly a millisecond on an ordinary link. The NTP-style approach is selected here because the measurement targets sit at milliseconds and tens of milliseconds, so a clock-sync accuracy of about one millisecond is comfortably below the variation being measured. PTP would give more headroom, but its complexity is unjustified at this layer.

The exchange itself is the standard one. The operator records the moment it sends a probe; the robot records the moment it receives the probe and the moment it sends the reply; and the operator records the moment the reply arrives. With those four timestamps, the robot's processing time is carved out of the round trip exactly, and the residual error in the offset is bounded by half the asymmetry between the two directions rather than by the whole round trip.

[Diagram: two parallel timelines, operator clock and robot clock. t0 probe sent (operator) → network outbound → t1 probe received (robot) → robot processing → t2 reply sent (robot) → network inbound → t3 reply received (operator).]
Four-stamp clock synchronisation. The two parallel lines are the two clocks; the diagonal arrows are network legs; the dashed segment is robot-side processing carved out of the path delay.

From the four stamps the path delay and the clock offset follow directly:

round_trip_delay_ns = (t3 - t0) - (t2 - t1)        # network only
offset_ns           = ((t1 - t0) + (t2 - t3)) // 2 # robot relative to operator
operator_perf_ns    = robot_perf_ns - offset_ns    # project robot time

The operator probes at 1 Hz so the offset keeps tracking drift between the two clocks. Each sample feeds into a sliding window of the most recent sixteen exchanges and the exposed offset is the median of that window, so a single noisy probe (a garbage collection pause, a scheduler hiccup, a one-off retransmit) cannot pull it.
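The bookkeeping around those formulas is a handful of lines. A sketch, with the window size from the text and otherwise illustrative names:

```python
from collections import deque
from statistics import median

class ClockSync:
    """Tracks the robot-minus-operator clock offset from four-stamp exchanges (sketch)."""

    def __init__(self, window=16):
        self._offsets = deque(maxlen=window)       # only the most recent exchanges count

    def add_exchange(self, t0, t1, t2, t3):
        # t0/t3 are operator stamps, t1/t2 are robot stamps, all perf_counter_ns.
        self._offsets.append(((t1 - t0) + (t2 - t3)) // 2)

    @property
    def offset_ns(self):
        # Median of the window, so a single noisy probe cannot pull the estimate.
        return int(median(self._offsets)) if self._offsets else None

    def to_operator_frame(self, robot_perf_ns):
        return robot_perf_ns - self.offset_ns      # project a robot stamp onto the operator clock
```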

What is reported, and what it means

For each measurement window, p50, p95, p99, and max are reported, never means. Network and scheduling distributions are heavy-tailed, and a mean over a heavy-tailed distribution is hostile: a single outlier moves it, and an "average" gives a falsely reassuring picture of an unstable system. Percentiles tell the truth. The four chosen carry different signals.

Statistic    | What it tells you
p50 (median) | The typical experience. Half the ticks are faster, half are slower.
p95          | Frequent worst case. Roughly one in twenty ticks lands above this.
p99          | Rare but recurring worst case. Roughly one in a hundred ticks lands above this.
max          | The single worst tick in the measurement window. Useful as a sanity bound, easy to overinterpret in isolation.

Two quantities matter for teleop. The first is one-way latency, which sets feel: how laggy the system seems to the operator. The second is the variation in that latency from one tick to the next, which sets controllability: a controller predictor stays accurate on a steady link and starts to fight a varying one.

Quantity          | Definition                                                                                        | Scope
One-way latency   | Elapsed time from the start of a tick on the operator to the moment it is applied on the robot.  | Full path
PDV               | Packet delay variation: abs(delay_i − delay_{i−1}) across consecutive packets.                   | Wire only
Latency variation | Same definition, applied to end-to-end one-way latency between consecutive ticks.                | Full path

PDV is the standard term in networking, applied only to packets on the wire. Latency variation is the broader term used in this document, because by the time the packets have been deserialised, applied to a robot driver, and possibly held for a recorder, the variation under measurement includes more than the wire. The math is identical. The tables that follow just call it "Variation" to avoid attaching a wire-only label to a full-stack number.
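In code, both quantities reduce to percentiles: one over the per-tick latencies, one over the absolute differences between consecutive latencies. A small sketch with numpy, assuming the latencies for a window are already collected in milliseconds:

```python
import numpy as np

def summarise(latencies_ms):
    """Summarise one measurement window the way the tables in this document do (sketch)."""
    lat = np.asarray(latencies_ms, dtype=float)
    variation = np.abs(np.diff(lat))                # abs(delay_i - delay_(i-1)) per consecutive tick

    def row(values):
        return {"p50": np.percentile(values, 50),
                "p95": np.percentile(values, 95),
                "p99": np.percentile(values, 99),
                "max": float(values.max())}

    return {"one_way": row(lat), "variation": row(variation)}
```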

A worked example, end to end

To make the measurement framework concrete, here is one synthetic teleop command flowing through the system. The numbers are illustrative but proportional to a clean Wi-Fi tick at 100 Hz.

A note on the very large numbers in the code block below: they are perf_counter_ns() readings, which count nanoseconds from an arbitrary per-process zero. The raw values carry no calendar meaning; "twelve trillion" does not mean "twelve trillion nanoseconds since midnight", it just means "twelve trillion nanoseconds since some moment in this process's history". Only differences within a single clock are meaningful, and the whole point of the clock-sync exchange is to make differences across the two clocks meaningful as well.

One stamp deserves a small detour. remote.kernel_received on the robot side does not come from Python at all. It is supplied by the Linux kernel via the SO_TIMESTAMPNS socket option, delivered alongside the packet when the receiver uses recvmsg rather than recvfrom. The stamp records the moment the kernel ingested the packet from the network interface, before any Python code has had a chance to run. This separates wire latency from Python overhead on the robot: any time Python stalls between the packet arriving and the receiver reading it, the gap shows up as its own segment instead of leaking into the wire number and blaming the wrong stage.

trace_id: tc-2026-05-11-08-14-22.123
kind: teleop_cmd

# Stamps as written by the trace logger
# (perf_counter_ns; operator and robot live in different number spaces)

operator clock:
  leader.get_state            =  12,345,600,000,000
  operator.pipeline_done      =  12,345,601,200,000
  command_wire_sent           =  12,345,601,250,000

robot clock:
  remote.kernel_received      =  89,876,543,800,000   # from SO_TIMESTAMPNS
  remote.follower.receive     =  89,876,544,000,000
  remote.follower.applied     =  89,876,544,800,000

# Clock-sync offset (robot − operator), median of recent window
clock_offset_ns               =  77,530,940,000,000

# Robot stamps projected onto operator clock (subtract offset)
projected:
  remote.kernel_received      =  12,345,603,800,000
  remote.follower.receive     =  12,345,604,000,000
  remote.follower.applied     =  12,345,604,800,000

Once everything is in the operator clock, the per-segment durations fall out by subtraction.

Segment                        | From → To                            | Duration
Operator pipeline              | get_state → pipeline_done            | 1.20 ms
Serialise & queue              | pipeline_done → wire_sent            | 0.05 ms
Wire (operator → robot kernel) | wire_sent → kernel_received          | 2.55 ms
Python overhead on robot       | kernel_received → follower.receive   | 0.20 ms
Robot apply                    | follower.receive → follower.applied  | 0.80 ms
Total                          | get_state → follower.applied         | 4.80 ms
[Chart: the same tick as a stacked bar on a 0 to 5 ms axis: operator pipeline 1.20 ms, wire (Wi-Fi) 2.55 ms, Python overhead 0.20 ms, robot apply 0.80 ms.]
One teleop command, decomposed across the stages of the trace. The wire hop dominates this clean tick at about half of the total time. On a bad Wi-Fi window the same orange segment grows to fifty milliseconds and the rest of the bar stays put, which is what the rest of this document is about.

Two things are visible in this single example that were hard to see before. The wire segment is now bounded from above by a kernel timestamp rather than by a Python read, so a slow Python on the robot side does not leak into the wire number. And every duration in the decomposition is in the operator clock, so the segments add up to the end-to-end total without any cross-host arithmetic the reader has to track. The same record format is what the live profiler panel reads, what the offline tool charts, what the test harness asserts against, and what the sidecar broadcasts. One file format, one set of stamps, one consistent number across every consumer.

03 Findings

What follows is the chronological path the work actually took: which layer was probed, what each measurement revealed, and what was fixed or ruled out before moving on. Every measurement ran through the same closed-loop test harness, so the numbers across the story are comparable.

The closed-loop test harness

A pytest-based harness automates the workload. It spawns the main teleop process with profiling on, waits for it to come up, switches the orchestrator into teleoperation mode, sleeps for the measurement window (typically 30 to 120 seconds, ten minutes for the full-system test), terminates the process, and writes a per-segment report next to the new trace file. Three flavours of the test cover three configurations: local in-process, wired remote, and wireless remote. The contract is "produce a report"; a human or agent loop reads the numbers and decides what to change next.
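The shape of that harness is roughly the following. The module name, flags, and readiness check are assumptions standing in for the real entry points; the contract of producing a report from the freshly written trace file is the part that matters.

```python
import json
import subprocess
import time
from pathlib import Path

MEASUREMENT_WINDOW_S = 60            # 30-120 s for targeted tests, 600 s for the full-system run
TRACE_DIR = Path("traces")           # where the trace logger writes robot_trace_*.jsonl

def test_teleop_latency_wifi(tmp_path):
    # Spawn the teleop process with profiling on; module name and flags are illustrative.
    proc = subprocess.Popen(
        ["python", "-u", "-m", "teleop.main", "--profile", "--mode", "teleoperation"],
        stdout=subprocess.PIPE, text=True,
    )
    try:
        for line in proc.stdout:                 # wait for the readiness line
            if "ready" in line.lower():          # (this is the wait that unbuffered output keeps short)
                break
        time.sleep(MEASUREMENT_WINDOW_S)         # let traces accumulate at 100 Hz
    finally:
        proc.terminate()
        proc.wait(timeout=10)

    # Contract: produce a per-segment report next to the newly written trace file.
    trace_file = max(TRACE_DIR.glob("robot_trace_*.jsonl"), key=lambda p: p.stat().st_mtime)
    records = [json.loads(line) for line in trace_file.read_text().splitlines() if line]
    assert records, "no traces were written during the measurement window"
    (tmp_path / "report.json").write_text(json.dumps({"ticks": len(records)}))
```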

The local baseline took work to become a baseline

The first configuration was the noise floor: leader, orchestrator, and follower all in one Python process, no network. The first runs reported end-to-end max latencies above 10 ms with isolated spikes near 100 ms. That is well above what a no-network path should ever show, and the cause turned out to be the measurement tooling itself. The in-process profiler builds a snapshot every couple of seconds on the event-loop thread, holding the GIL for 100 to 200 ms while it walks the trace and serialises. The 100 Hz publish thread shares that interpreter, so its ticks landed late while the snapshot ran, and the trace then blamed the gap on "operator pipeline".

Three fixes flattened the floor. The profiler was moved into a separate sidecar process so its snapshot work runs on its own GIL and cannot block the publish thread. The CPU frequency governor was switched from powersave to performance so a process that bursts at 100 Hz and sleeps is not served at minimum clock between bursts. And Python's stdout and stderr buffering was disabled in subprocesses, which had been making the harness wait pointlessly on a process that was actually ready but whose readiness line was sitting in a 4 KB buffer. After these changes, the local median is under 2 ms and the max is consistently under 5 ms, with no run-to-run jitter in the maximum. (These numbers come from an agent-driven run of the test harness.)

Wired Ethernet

With the local floor flat, the next configuration was the same workload over Ethernet to a separate machine.

Metric         | p50   | p95   | p99   | max
One-way (ms)   | 1.450 | 1.571 | 1.649 | 2.399
Variation (ms) | 0.042 | 0.163 | 0.240 | 0.953

Zero loss across 6,001 packets. The spread between p50 and max is under a millisecond. Ethernet plus the transport library is essentially deterministic at 100 Hz. Anything that shows up beyond this is the cost of leaving the wire.

Wi-Fi

Same code, same cadence, switched to Wi-Fi.

Metric         | p50   | p95   | p99   | max
One-way (ms)   | 2.737 | 6.962 | 40.28 | 124.1
Variation (ms) | 0.198 | 6.114 | 10.38 | 120.8

The medians are within 1.3 ms of Ethernet. The tails are fifty times worse. This is the moment the rest of the work became necessary: the median says one thing, the tail says another, and on Wi-Fi the tail is what the operator feels. Localising the 120 ms was the next several hypotheses.

Suspect 1: the protocol

The first hypothesis was TCP. On a lossy link, a single dropped segment forces TCP to wait for retransmission and head-of-line-blocks every later packet behind it. A second transport was added in parallel, a raw UDP path with the same wire format as the ZMQ one (described in §4), and the two were run back to back across several Wi-Fi sessions. They came out indistinguishable to within the noise of the test. Neither was consistently better than the other. Protocol-level behaviour was not the source of the Wi-Fi tail.

Suspect 2: TCP retransmit timeout

Even if UDP and TCP came out the same on average, TCP's 200 ms minimum retransmit timeout could still be hurting the worst ticks. The minimum was reduced to 10 ms per route, and a parallel experiment tightened the kernel's connection-drop timeout. Neither moved the numbers on the hardware available. The connection-drop experiment was reverted; the RTO floor change survives only because it is easy to revert and useful for confirming on future hardware that this is not the issue.

Confirming the wire: the ping baseline

With both protocol hypotheses ruled out, the last way to localise the spikes was to bypass the protocol entirely. ICMP ping at 100 Hz in parallel with a teleop session showed the same spikes at the same moments. ICMP is processed in the kernel on both ends; there is no user-space, no Python interpreter, no JSON parsing, no GIL. Two independent measurements pointing at the same hop established that the optimisation budget belonged on the wire rather than in Python or in TCP. In hindsight, it would have been better to start with this approach.

Tuning the wire: DSCP and Wi-Fi power save

With the medium identified as the bottleneck, the next round of changes targeted how the Wi-Fi link was being used rather than what was sent across it. Wi-Fi power save was silently enabled on at least two of the test PCs; turning it off was the single largest improvement on the branch, since a network interface in power-save mode holds packets for up to 100 ms while it sleeps the radio between beacon intervals. DSCP marking with the EF codepoint was applied to every outbound packet on both ends of the link, which gets control traffic into the access point's highest-priority airtime queue instead of the default best-effort one. Marking the return leg matters more than it sounds: consumer access points hurt most on the downlink, and the downlink queue is decided by what the robot marks on its reply.

Combined, these changes pressed the worst-case Wi-Fi spikes down from around 120 ms to around 50 ms. That is real progress, a halving of the worst tick. But the bursty shape of the tail did not change. The spikes still came in clusters rather than isolated points, and the inside of each cluster still held 30 to 50 ms ticks back to back. The numbers moved; the experience did not.

Stress-testing with the camera stream

The control loop is only part of what the deployed system does. A robot in production also streams camera frames to the operator so the operator can see what the robot sees, and feeds them into the recorder so each demonstration captures the full observation. Without that channel, the system could not be stress-tested as it will actually run. A camera path was added on its own socket and thread, separate from state and commands, so that a 50 KB JPEG send at 30 Hz cannot block a 200 B state send at 100 Hz. With the separation in place, the streaming-with-recording run produces essentially the same per-segment percentiles as the no-cameras run. Adding the camera stream did not move the latency or the variation, which is exactly the result a separate channel should produce.

The headline test: ten minutes with everything on

The full-system run has cameras streaming, the recorder writing episodes to disk, all operational fixes applied, and the Wi-Fi link active. The trace covers 62,332 ticks at 100 Hz, broken down by the trace logger into operator pipeline, network, and robot processing segments. By the original system targets, a maximum end-to-end latency below 250 ms and a p99 latency variation below 20 ms, the session is within spec on both: p99 variation 9.84 ms, max latency 230 ms.

Stage             | Latency (ms) p50 / p95 / p99 / max | Variation (ms) p50 / p95 / p99 / max
End to end        | 4.83 / 7.98 / 22.4 / 230           | 0.48 / 3.10 / 9.84 / 84.3
Operator pipeline | 3.92 / 4.16 / 28.5 / 230           | 0.14 / 1.79 / 8.40 / 84.3
Network           | 2.17 / 4.18 / 20.4 / 53.4          | 0.14 / 2.62 / 6.78 / 36.4
Robot processing  | 0.27 / 1.32 / 1.71 / 35.6          | 0.05 / 0.07 / 1.32 / 3.05

By the chosen metrics, this run passes. By the way the operator felt the robot during it, it did not. The conclusion returns to this gap.

Putting it together

Layer               | Median one-way | p99 one-way | Comment
Local in-process    | <1 ms          | ~2 ms       | Noise floor, once the profiler stopped polluting it.
Ethernet            | 1.5 ms         | 1.6 ms      | Wire plus transport. Deterministic.
Wi-Fi, after tuning | 2 to 3 ms      | 20 to 50 ms | Tail set by AP queueing, 802.11 retries, scheduling. Down from around 120 ms before tuning; still bursty.

04 Improvements & additions

The previous section folded each operational fix (Wi-Fi power save, CPU governor, unbuffered subprocess output, DSCP marking, sidecar profiler, TCP RTO experiment, closed-loop test harness) into the chronological story of what was measured before and after it. What remains here is the supporting infrastructure: things built or added to make the rest of the work possible, but without an obvious before-and-after number of their own.

UDP transport with kernel receive timestamps

A UDP transport was added alongside the existing ZMQ pub/sub transport so the two could be compared head to head on the same Wi-Fi window. They share the same wire shape (a small message-type tag plus UTF-8 JSON) and the same clock-sync window. They are deliberately kept side by side rather than refactored into a common base class: a unified parent would let a bug fixed on one path silently change the behaviour of the other, and the A/B comparison central to the protocol-comparison finding would lose meaning. For a benchmarking-driven branch, independently auditable beats DRY.

The UDP server is single-client by design, which keeps it short. The receive side uses the Linux kernel's SO_TIMESTAMPNS socket option, which delivers a kernel-supplied timestamp alongside each datagram. That stamp is what makes the kernel-to-Python gap visible as its own trace segment, so a robot-side Python stall stops being indistinguishable from wire latency.

Camera frames on a separate channel

Before this branch, the remote follower had no frames path at all. A leader operating against a remote follower had nothing for the recorder to write and nothing for inference to read; the system could not be stress-tested as it would actually run. The frames channel was added to fill that gap, deliberately structured to look as much as possible like the local setup once the wire is crossed.

Local versus remote, side by side

The local setup and the remote setup differ in one structural way: who owns the cameras. With everything in one process on one machine, the recorder reads the frame from the same Python that captured it; there is no serialisation, no wire, no clock skew, no question of stale frames. With the robot on a different machine, the frame has to be encoded on the robot side, sent across the network, decoded on the operator side, time-projected onto the operator's clock, and aligned with the state and command stream so that a recorded row carries the observation that was current at the moment the command was sent. The picture below puts the two side by side.

[Diagram: local versus remote camera path. Local: camera driver → shared memory → orchestrator → recorder, all in one process with no wire. Remote: the robot machine JPEG-encodes each frame with a capture timestamp and publishes frames and state on separate sockets; the operator machine subscribes, decodes, projects the capture stamp onto its own clock, caches the latest frame per camera, and the orchestrator joins state + command + frame before the recorder writes a step. Frames: 2 cameras × 30 Hz × ~50 KB ≈ 24 Mbit/s; state: 100 Hz × ~200 B ≈ 0.16 Mbit/s.]
In the local case, the recorder reads frames from the same process that captured them. In the remote case, the frame is encoded on the robot, sent over its own socket, decoded and cached on the operator side, then joined by the orchestrator with the current state and command before the recorder writes a step.

How much data the channel actually carries

Each camera produces frames at 30 Hz. A JPEG at quality 75 is around 50 KB. With the two cameras currently in the setup, that is roughly 3 MB/s sent across the link, or about 24 Mbit/s. The state path itself is roughly 200 bytes at 100 Hz, which is 20 KB/s, or 0.16 Mbit/s. Frames dominate the wire in raw byte terms by roughly two orders of magnitude, which is why they cannot share a socket with state: a slow JPEG send on a shared socket would hold the thread long enough to miss the next 10 ms state tick. The wire itself has more than enough capacity for 24 Mbit/s on Wi-Fi; the point of the separation is not bandwidth, it is timing isolation.

Encoding, sending, receiving, decoding

On the robot side a dedicated frames thread runs once per camera tick. It captures a high-resolution timestamp at the moment before reading the camera, calls the driver to get a raw frame, encodes the result as a JPEG, and publishes a multipart message containing the camera name, the capture timestamp, an encoding tag, and the JPEG bytes. The frames socket uses a small send-side buffer: if the wire cannot drain frames as fast as they are produced, the publisher drops the oldest queued frames at the sender rather than queueing them, so the operator never sees a stale backlog after a transient slowdown.

On the operator side a separate receiver thread mirrors the structure. It reads each multipart message, projects the capture timestamp from the robot's clock onto the operator's clock using the clock-sync offset (so the frame's age can be measured against the operator's other stamps), decodes the JPEG, and stores the result keyed by camera name in a cache. The cache always holds the most recent frame for every camera the local config expects to see. If a camera goes silent for longer than a one-second staleness threshold, the cache still returns its last known frame but flags the camera as stale so downstream code can decide what to do.
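The cache itself is small. A sketch of the shape described above, with illustrative names; the one-second threshold is the value from the text.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

STALENESS_NS = 1_000_000_000          # one second before a camera is flagged stale

@dataclass
class CachedFrame:
    image: Any                        # decoded JPEG
    capture_ns: int                   # capture stamp, already projected onto the operator clock
    received_ns: int                  # operator perf_counter_ns when the frame arrived

class FrameCache:
    """Latest-frame-per-camera cache with a staleness flag (illustrative sketch)."""

    def __init__(self, expected_cameras):
        self._expected = list(expected_cameras)
        self._frames: Dict[str, CachedFrame] = {}

    def put(self, camera: str, frame: CachedFrame) -> None:
        self._frames[camera] = frame                      # keep only the most recent frame

    def observation(self) -> Optional[Dict[str, Tuple[CachedFrame, bool]]]:
        """Return {camera: (frame, is_stale)}, or None until every expected camera has sent one."""
        now = time.perf_counter_ns()
        obs = {}
        for cam in self._expected:
            frame = self._frames.get(cam)
            if frame is None:
                return None                               # downstream skips the row / tick
            obs[cam] = (frame, now - frame.received_ns > STALENESS_NS)
        return obs
```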

Joining frames with state and commands at record time

The recorder needs each row to carry three things atomically: the command that was just sent, the follower state at that moment, and an observation that contains a frame for every configured camera. The orchestrator is the single place where that alignment happens. Every time the orchestrator publishes a command, it reads the current state and the current cached observation in the same instant and pins all three onto the recorder's next step. The recorder writes a step only when all three are present. A slightly stale frame, from a brief Wi-Fi dropout, still produces a valid row; a missing frame for any camera causes the row to be skipped entirely rather than writing a row with a gap. Inference uses the same cache and skips a tick when any camera is past the staleness threshold.

The whole pipeline above (encode, send, receive, decode, project, cache, atomic join) collapses to a single shared-memory read in the local case. Both cases use the same recorder code; the difference is everything between the camera and the orchestrator. On a clean Wi-Fi link the difference does not move the latency or the variation, which is what the streaming-with-recording run in the findings section established. On a degraded link, the cache-with-staleness pattern is what keeps a transient dropout from corrupting an episode rather than crashing it.


05 Conclusion

The headline finding of this branch is that Wi-Fi is stochastic. Latency is not the killer; variation in latency is. The protocol turned out to matter less than the Wi-Fi setup. The back-to-back protocol comparisons on the same window showed no consistent winner between raw UDP and ZMQ TCP, and a parallel ping baseline showed the same spikes as the teleop traces at the same moments. The Wi-Fi link itself is what dominates the tail. Against that, the operational settings around the link (power save, CPU governor, packet buffering, DSCP marking) each moved the numbers by far more than any choice of protocol did. The largest improvements on this branch were turning off Wi-Fi power save, switching the CPU governor to performance, unbuffering Python's stderr, marking packets with DSCP on both directions, and moving the profiler out of the teleop process so its snapshot work could no longer stall the publish thread.

The target was wrong

The ten-minute session reported in the Findings section passed both of the original system targets cleanly. Its end-to-end p99 latency variation came in at 9.84 ms, well under the 20 ms target. Its maximum end-to-end latency landed at 230 ms, comfortably under the 250 ms ceiling. By those numbers it was a successful run. And during that same session the operator reported clear moments where the robot felt jittery and unstable, lasting one to a few seconds at a time. The numbers said one thing, the operator said another, and the operator was right.

The visual evidence is in the end-to-end latency plot from the profiler. Most of the run is a flat band at a few milliseconds, but the plot also has distinct clusters of spikes that reach 50 ms or so and sit together for one to a few seconds at a stretch before disappearing back to baseline. Each one of those clusters is a window where the operator feels jitter, and there are several of them in the ten minutes. The summary statistics do not see the clusters because they are short relative to the session length. The operator does see them, because for the duration of each cluster the system is unstable in a way that no single tick captures.

A ten-minute p99 cannot see this. About fifty seconds of bad behaviour smeared against five hundred and fifty seconds of clean operation reads as roughly ten milliseconds at the 99th percentile, which sits comfortably under the 20 ms target. The maximum-latency bound is satisfied because the worst spike, while large, was rare. The session passes the targets while feeling broken, because what hurts operationally is not the run-level percentile. It is whether any one-second window contains a sustained run of back-to-back late packets. Three consecutive 40 ms ticks feel worse than thirty 40 ms ticks scattered randomly over ten minutes, and run-level statistics cannot tell those two cases apart.

The replacement target is binary and matches what operators actually feel. Zero one-second windows in which the wire-hop p95 exceeds 10 ms, which is equivalent to no run of more than one consecutive tick above the threshold. On this metric the same ten-minute run scores several failures rather than "fine", and the rest of the work on the branch finally has a number that fails when operators say it does and passes when operators say it does not.

What comes next

The work that follows naturally from this is a control path that pays an explicit latency penalty in exchange for latency variation that approaches zero. The right design takes a running measurement of the network's variation and feeds it back into a small, adjustable buffer on the receive side, so that an arriving command is held for just long enough to absorb the tail of the distribution before being applied. The buffer grows when variation grows and shrinks when variation shrinks. The operator experiences a slightly higher but very stable latency, and feels a smooth, predictable system rather than a fast but jittery one. For a teleoperated robot, the second feel is strictly better than the first, and the prerequisites for building it (a clock-synchronised view of per-tick latency, the ability to attribute time to the right hop, a metric that actually correlates with feel) are exactly the foundations laid in this branch.

06 Fix: a jitter buffer on the receive side

The fix that follows from the conclusion is a fixed-delay playout buffer on the robot side. The operator stamps every outgoing command with the moment it left the operator's clock, translated into the robot's clock via the existing clock-sync exchange. The robot reads each command, computes a target apply time of send_robot_ns + buffer, and sleeps until that target before handing the command to the hardware. Packets that arrive early get held; packets that arrive late get applied immediately and counted as over-budget. The trade is one-line: the system pays a flat latency equal to the buffer size, in exchange for a latency variance that collapses to whatever residual the sleep granularity and clock-sync noise leave behind.

The wire change is a single field on the command, send_robot_ns, computed on the operator with clock_sync.to_robot_frame(time.perf_counter_ns()). The robot reads it, gates apply on send_robot_ns + buffer, and falls back to receive_ns + buffer for the first few hundred milliseconds before clock-sync has accumulated its minimum samples. In steady state with a sender at 100 Hz, a 60 ms buffer keeps roughly six packets in flight at any moment; that depth is the slack the system has to spend when the wire misbehaves.
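A sketch of that gate under the assumptions above; the field name send_robot_ns is the one from the text, the other names are illustrative.

```python
import time

BUFFER_NS = 60_000_000                 # 60 ms playout buffer
_last_applied_send_ns = 0              # send stamp of the newest command already applied

def handle_command(command: dict, receive_ns: int, robot) -> bool:
    """Hold early packets until send + buffer, apply late ones immediately, drop stale ones."""
    global _last_applied_send_ns

    # The operator stamps send_robot_ns in the robot's clock frame; before clock sync has
    # converged the field is absent and the gate falls back to the local receive time.
    send_robot_ns = command.get("send_robot_ns") or receive_ns

    # Drop-stale check: a packet that spent longer than the buffer on the wire describes an
    # older operator pose than one already applied, so applying it would regress the actuator.
    if send_robot_ns <= _last_applied_send_ns:
        return False                                      # dropped, counted as over-budget

    target_ns = send_robot_ns + BUFFER_NS                 # constant send-to-apply delay in steady state
    wait_s = (target_ns - time.perf_counter_ns()) / 1e9
    if wait_s > 0:
        time.sleep(wait_s)                                # early packet: held until its playout time
    # A late packet has wait_s <= 0 and falls straight through to apply immediately.

    robot.set_state(command["state"])
    _last_applied_send_ns = send_robot_ns
    return True
```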

Three scenarios

Three looping visualisations follow. Each runs the same six packets, sent at 100 Hz, through the system; only the buffer policy differs. In every scene the top line is the operator's send moment, the middle dashed band is the wire, and the bottom line is the moment the actuator applied the command. Diagonals show packets in flight on the wire; horizontal segments in the middle band show packets waiting in the buffer for their apply target. The chart on the right of each scene plots the end-to-end latency of each packet after it applied.

A.  No buffer — actuator inherits all wire jitter

Each packet applies the instant it arrives. The latency bars on the right mirror the wire's variance: short when the wire was clean, taller when it spiked. The actuator fires at irregular intervals because the arrivals are irregular.

B.  Buffer at 60 ms — wire jitter absorbed, latency flat

Same packets, same wire variance. Each packet now waits in the buffer (the horizontal segment in the middle band) until send + 60 ms. A packet that hit the wire fast waits longer; one that took longer waits less. Total send→apply is identical for every packet, and the actuator fires on a metronome.

C.  Buffer at 60 ms with a 120 ms spike on packet 3 — stale packet dropped, immediate recovery

Packet 3 takes 120 ms on the wire — twice the budget. Because packets 4 and 5 arrived on time and were already applied at send + 60 ms, packet 3's state is older than what's currently on the robot. The receiver drops it (shown as the red ✕ on the wire, no apply dot, no bar) rather than regressing the actuator. Packets 4–6 are unaffected: they had normal wire times, fit inside the buffer, and apply on time. The cost of one over-budget packet is a single missed update at the actuator's cadence, not a backward jolt. A spike that persisted longer than the buffer would empty the queue and leave a real gap; that's the regime where bigger buffers, sender-side lookahead, or extrapolation start to matter.

What is still missing

The current implementation has the minimum viable shape: a single field on the wire, a gate in the receive loop, and a drop-stale check that throws away any command older than the most recent one applied. It hands the operator constant latency in steady state without ever regressing the actuator on a single-packet spike. Two things from the standard playout-buffer toolkit are still not in place. First, there is no underflow policy: if the queue empties, the actuator holds the last command rather than extrapolating from recent velocity. Second, the queue lives implicitly inside the ZeroMQ socket buffer, so the receive code can't ask "am I N packets behind?" and act on the answer. The cheap next step is to split the listener into a recv thread and a worker thread with a bounded deque between them; that refactor unlocks both policies and turns the buffer from a flat constant into something that can adapt.
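A sketch of the shape that refactor would take, with illustrative names; the two missing policies live where the comments mark them.

```python
import threading
from collections import deque

class BufferedReceiver:
    """Recv thread fills a bounded deque; a worker thread gates and applies (sketch)."""

    def __init__(self, transport, apply_fn, max_depth=32):
        self._queue = deque(maxlen=max_depth)   # bounded: the oldest commands fall off under backlog
        self._ready = threading.Condition()
        self._transport = transport
        self._apply = apply_fn

    def _recv_loop(self):
        while True:
            command = self._transport.recv()    # blocking network read only; no apply work here
            with self._ready:
                self._queue.append(command)
                self._ready.notify()

    def _worker_loop(self):
        while True:
            with self._ready:
                while not self._queue:
                    self._ready.wait()          # underflow: a hold-or-extrapolate policy belongs here
                backlog = len(self._queue)      # "am I N packets behind?" is now an answerable question
                command = self._queue.popleft()
            self._apply(command, backlog)       # playout gate + apply, outside the lock

    def start(self):
        threading.Thread(target=self._recv_loop, daemon=True).start()
        threading.Thread(target=self._worker_loop, daemon=True).start()
```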