## Summary

- New `--udp-gso` flag (Linux, requires `--udp-sendmmsg`) collapses same-destination, same-size sendmmsg batches into a single `sendmsg` with a `UDP_SEGMENT` cmsg, so the kernel allocates one super-skb that traverses the network stack once and is segmented at egress instead of running `udp_sendmsg → ip_finish_output → __dev_queue_xmit` per datagram.
- Also wraps the relay-side `recvmmsg` callback loop in `udp_sendmmsg_batch_begin/end` so peer→client sends triggered inside a recv batch can also coalesce; without that wrapping the relay path issues one `sendto` per delivered datagram.
- Sticky-disable on `EINVAL/ENOPROTOOPT` for older kernels/NICs that lack UDP-GSO; one warning logged, then transparent fallback to the existing `sendmmsg` and `udp_send` paths.

## Why

The `--udp-recvmmsg` and `--udp-sendmmsg` follow-ups confirmed (see [docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md)) that on the relay flood workload the dominant cost is the per-datagram kernel TX path. mmsg-style batching reduces only the syscall entry/exit, not the per-skb stack traversal; UDP-GSO collapses both.

## Result

DigitalOcean nyc1 c-4, 30 s alternating A/B, `-Y packet -m 1`, eth1 TX as the authoritative server forwarding metric:

| Variant | eth1 RX | eth1 TX | sys CPU | idle CPU |
|---|---:|---:|---:|---:|
| baseline (no flags) | 322,091 | 127,445 | 22.9 % | 67.5 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 266,068 | **257,996** | 15.0 % | 78.7 % |
| baseline (no flags) | 309,475 | 125,573 | 20.9 % | 70.7 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 275,992 | **225,366** | 14.9 % | 74.3 % |

Mean server forwarding rate: **126.5 k → 241.7 k pps (+91 %, 1.91×)**, mean system CPU **21.9 % → 14.9 %**, about **2.8× CPU efficiency** (TX pps per system-CPU-%). Full perf-children comparison and methodology in the new section of [docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md).

## Notes for reviewers

- `--udp-gso` is opt-in and requires `--udp-sendmmsg` (the help text states the dependency). Without `--udp-sendmmsg` the batch state never accumulates and GSO has nothing to flush.
- GSO eligibility resets on every `_begin/_end`. Mixed-destination, mixed-size, or oversize batches transparently fall back through `sendmmsg` / `udp_send`.
- Rebased onto current `master`; the recvmmsg dependency is already merged via #1906.

## Test plan

- [x] `cmake --build build --target turnserver` (RelWithDebInfo + ASan local builds clean)
- [x] `ctest --test-dir build --output-on-failure`: 3/3 unit tests pass
- [x] `examples/run_tests.sh`: TCP/TLS/UDP pass; DTLS pre-existing failure on macOS environment, unrelated to this change
- [x] DigitalOcean A/B perf validation captured above
- [ ] Reviewer to confirm CI green on Linux build/test/CodeQL

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Performance iteration log
Running notes for the multi-iteration performance work on the UDP relay data path. Pick this up to continue without re-deriving everything.
The harness, baseline command, and droplet topology are documented in CLAUDE.md under "Load Test on DigitalOcean" — this file captures the deltas: what was measured, what landed, what didn't, and where the next round should go.
Cumulative result
Five commits on claude/beautiful-black-c3b741 between 727ec2ab
("loadgen") and 321a2d18:
| # | Commit | Optimization |
|---|---|---|
| 1 | `ce7e7e53` | Hoist `turn_server_get_engine()` out of per-packet hot path |
| 2 | `8e28491a` | `ioa_socket_check_bandwidth` early fast-exit; drop dead `if (!(s->done || s->fd==-1))` in `send_data_from_ioa_socket_nbh` |
| 3 | `344360f6` | Cache `get_relay_socket_ss()` and `ioa_network_buffer_get_size()` in `write_to_peerchannel`, `handle_turn_send`, `read_client_connection` |
| 4 | `a6f6767f` | Inline `get_ioa_addr_len()` via `ns_turn_ioaddr.h` |
| 5 | `321a2d18` | Inline `addr_cpy()` via `ns_turn_ioaddr.h` |
Current relay-recvmmsg follow-up:
| # | Commit | Optimization |
|---|---|---|
| 6 | `54c589d0` / `4b1a8d71` | Initial Linux `recvmmsg` batching for UDP listener and connected relay sockets |
| 7 | `8d9a7292` | Share the existing `--udp-recvmmsg` flag across listener and relay UDP paths; remove separate relay flag; use the shared ancillary-data parser in `dtls_listener` |
| 8 | `d48686b7` | Reduce relay per-socket `recvmmsg` state from 16 x 64 KiB cmsg buffers to TTL/TOS-sized buffers, avoid an extra would-block fallback `recvmsg`, and clean up all preallocated buffers after partial batches |
| 9 | `ad81705e` | Add per-engine `recvmmsg` occupancy counters and 10 s log summaries (calls, packets, avg_batch, wouldblock, unavailable, no_buffer, batch-size histogram) |
| 10 | `388b15d4` | Move connected relay UDP `recvmmsg` scratch from per-socket state to per-engine/per-thread state |
| 11 | `4c4fd67e` | Make the occupancy summaries opt-in behind `--udp-recvmmsg-log`, so `--udp-recvmmsg` can ship without periodic stats logs |
Validation after #7-#11:
- Local `cmake -S . -B build -DBUILD_TESTING=ON` passed.
- Local `cmake --build build --parallel 8` passed.
- Local `ctest --test-dir build --output-on-failure` passed 3/3.
- Local `build/bin/turnserver --udp-recvmmsg --udp-recvmmsg-log --version` parsed both flags and printed `4.11.0`.
- Linux Docker `turnserver` build passed after #7, after #8, and after #10.
Shipping cleanup learning: keep the occupancy counters in place because they
are low overhead and useful for DigitalOcean diagnostics, but keep the periodic
summaries off by default. Use --udp-recvmmsg-log only during measured runs
where the log stream is part of the observation.
DigitalOcean check on 2026-05-09:
- Reused the existing `c-4` droplets in `nyc1`: turnserver public `157.230.3.102`, private `10.116.0.2`; loadgen public `167.99.153.216`, private `10.116.0.3`. Droplets were left running between steps.
- Built fresh current artifacts from `d48686b7` on both droplets under `/root/coturn_recvmmsg_current`.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120`, 5 alternating 30 s rounds each:
  - off mean 154,527, median 154,596, stdev 3,467
  - on mean 149,994, median 153,011, stdev 7,174
  - on was -2.9 % by mean and -1.0 % by median
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120`, 5 alternating rounds each. The client completed before the 30 s timeout and landed in two send-volume buckets, so treat this as a coarse many-connection signal:
  - off mean 59,432, median 65,071, stdev 7,952
  - on mean 59,640, median 65,421, stdev 7,963
  - on was +0.3 % by mean and +0.5 % by median
- Follow-up `m=100 -n 1000` run, 3 alternating rounds each, derived receive count from `tot_recv_bytes / 120` because this log format omits `tot_recv_msgs`:
  - off mean 8,540, median 8,990, stdev 1,004
  - on mean 8,857, median 8,749, stdev 759
  - on was +3.7 % by mean and -2.7 % by median
Learning: the corrected relay recvmmsg implementation is now buildable and
much safer for many connections, but these droplet runs still do not show a
clear throughput win. Keep --udp-recvmmsg opt-in for now. The next useful
step is to instrument actual batch occupancy on connected relay sockets; if
most readiness events return one datagram, recvmmsg will mostly add setup
work without reducing syscalls.
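The deltas quoted above are relative differences between the on and off summary statistics. A minimal sketch of that arithmetic, using the means and medians reported for the `-m 1` run; the helper name `pct_delta` is illustrative and not part of the harness:

```python
def pct_delta(off: float, on: float) -> float:
    """Relative change of `on` versus `off`, in percent."""
    return (on - off) / off * 100.0


# Reported m=1 summary statistics (tot_recv_msgs per 30 s round).
off_mean, on_mean = 154_527, 149_994
off_median, on_median = 154_596, 153_011
on_stdev = 7_174

mean_delta = pct_delta(off_mean, on_mean)
median_delta = pct_delta(off_median, on_median)

# The gap between the means is smaller than the on-side stdev,
# which is why the result reads as parity rather than a regression.
gap_within_noise = abs(on_mean - off_mean) < on_stdev

print(f"mean {mean_delta:+.1f} %, median {median_delta:+.1f} %")
```

Comparing the gap against the reported standard deviation is only a rough noise check, not a significance test.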
DigitalOcean occupancy check on 2026-05-09:
- Built fresh current artifacts from `388b15d4` on both droplets under `/root/coturn_recvmmsg_current`.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120`, 3 alternating 30 s rounds each:
  - off mean 153,133, median 153,608, stdev 4,383
  - on mean 148,452, median 149,711, stdev 10,833
  - on was -3.1 % by mean and -2.5 % by median
- `m=1` occupancy from the on runs: 1,129,427 `recvmmsg` calls returned 17,660,300 packets, average batch 15.64. Histogram buckets: `hist_1=1,353`, `hist_2=1,496`, `hist_3_4=3,707`, `hist_5_8=14,817`, `hist_9_16=1,108,057`; 98.1 % of calls were in the `9..16` bucket.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120`, 3 alternating runs each:
  - off mean 55,443, median 50,679, stdev 8,369
  - on mean 60,596, median 65,404, stdev 8,383
  - on was +9.3 % by mean and +29.1 % by median, but the client again landed in two send-volume buckets, so treat the throughput delta as noisy.
- `m=100` occupancy from the on runs across all relay threads: 1,426,401 `recvmmsg` calls returned 16,188,946 packets, average batch 11.35. Histogram buckets: `hist_1=83,057`, `hist_2=79,781`, `hist_3_4=130,066`, `hist_5_8=188,259`, `hist_9_16=945,238`; 66.3 % of calls were in the `9..16` bucket.
Learning: receive-side occupancy is high. The earlier hypothesis that
recvmmsg was mostly returning one packet is wrong for this harness. The
remaining bottleneck is after receive: per-packet callbacks, TURN processing,
and especially one sendto per relayed packet. The per-thread scratch change
is still worth keeping for memory/cache behavior with thousands of sockets,
but the next performance lever should be send-side batching or a design that
passes batches deeper instead of immediately decomposing them back into
single-packet callbacks.
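The occupancy figures above follow directly from the call, packet, and histogram counters. A small sketch of that derivation; the function names are illustrative, and the 16-datagram batch cap is an assumption inferred from the `9..16` bucket rather than something read from the code:

```python
def bucket_for(batch_size: int) -> str:
    """Map one recvmmsg return count to its histogram bucket label."""
    if batch_size < 1 or batch_size > 16:  # assumed batch cap of 16
        raise ValueError("batch size outside 1..16")
    if batch_size == 1:
        return "hist_1"
    if batch_size == 2:
        return "hist_2"
    if batch_size <= 4:
        return "hist_3_4"
    if batch_size <= 8:
        return "hist_5_8"
    return "hist_9_16"


def occupancy_summary(calls: int, packets: int, hist_9_16: int):
    """Return (average batch size, percent of calls in the 9..16 bucket)."""
    avg_batch = packets / calls
    top_share = hist_9_16 / calls * 100.0
    return round(avg_batch, 2), round(top_share, 1)


# Counters reported for the m=1 and m=100 "on" runs.
m1 = occupancy_summary(1_129_427, 17_660_300, 1_108_057)
m100 = occupancy_summary(1_426_401, 16_188_946, 945_238)
print(m1, m100)
```

An average batch well above one is what rules out the "mostly single-datagram readiness events" hypothesis for this harness.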
Alternating A/B run on the same droplet pair, m=1 packet flood, 30 s per run, with a 4 s warm-up between binary swaps:
- Baseline (clean `master` binary): mean 146,984 round-trips / 30 s
- Cumulative (all 5 iters): mean 155,468 round-trips / 30 s
- +5.8 % throughput
Per-iteration deltas were within run-to-run noise (~5–10 % variance). The cumulative effect is what's visible.
Test setup that was used
Two c-4 Ubuntu 24.04 droplets in nyc1, same VPC default-nyc1.
Current active pair:
- `coturn-turnserver`: public `157.230.3.102`, private `10.116.0.2`
- `coturn-loadgen`: public `167.99.153.216`, private `10.116.0.3`
Older pair used for the iter 5 cumulative run:
- `coturn-turnserver`: public `68.183.121.197`, private `10.116.0.2`
- `coturn-loadgen`: public `68.183.132.220`, private `10.116.0.3`
Created via the DigitalOcean v2 API (doctl is not installed; use
curl + $DIGITALOCEAN_TOKEN from the user's ~/.zshrc). SSH via
~/.ssh/id_rsa (matches DO ssh key id 23704483, fingerprint
37:3a:9b:e3:1e:1a:9b:42:a0:6f:58:f5:5a:3a:6a:2c).
State on the turnserver droplet (kept across iterations):
- `/root/coturn_clean.tar`: `git archive HEAD` of master at start of run. Re-extract this before applying any new patch.
- `/root/coturn_baseline/build/bin/turnserver`: clean baseline binary, used as the "B" in every A/B round. Don't overwrite.
- `/root/coturn/build/bin/turnserver`: current iteration binary.
- `/root/start_turnserver.sh`, `/root/baseline_run.sh`: helper scripts.
State on the loadgen droplet:
- `/root/coturn/build/bin/turnutils_uclient`, `turnutils_peer`.
- `turnutils_peer` runs as a daemon on `10.116.0.3:3480` (`pid` in `/root/peer.pid`).
A small env file was written to /tmp/coturn_perf_env.sh on the local
machine with the IPs / droplet IDs — recreate it from the current
state of the DO account if it gets lost.
The standard packet-flood command (matches CLAUDE.md baseline, runs without
--udp-recvmmsg; add --udp-recvmmsg to turnserver, not the client, for the
batched listener/relay receive path):
timeout -s INT 30s /root/coturn/build/bin/turnutils_uclient \
-Y packet -m 1 -l 120 \
-e 10.116.0.3 -r 3480 -X -g \
-u user -W secret \
10.116.0.2
Metric: the tot_recv_msgs field on the last start_mclient: log
line. (This is round-trips through the relay over the test window;
send_pps is loadgen-side only and can hit 262 K even when the relay
is dropping most of them, so it's not a useful proxy for relay
throughput.)
Hot-path map at the end of iter 5
perf record -F 99 -g on the turnserver during a 12 s -Y packet -m 1
run, sorted by user-space self-time:
0.80 % send_data_from_ioa_socket_nbh
0.76 % socket_input_worker
0.69 % read_client_connection.isra.0
0.60 % turn_report_session_usage
0.53 % peer_input_handler
0.51 % udp_server_input_handler
0.35 % udp_recvfrom # was 0.76 % at iter 1
0.34 % lm_map_get
0.27 % stun_is_channel_message_str
0.27 % get_relay_socket
0.26 % ioa_socket_check_bandwidth # was 0.33 % at iter 1
0.26 % udp_send # was 0.60 % at iter 1
0.18 % ioa_network_buffer_get_size
Total user-space coturn cycles: ~5–7 % of the relay thread. The relay thread sits at ~100 % CPU pinned to one core; the 4 relay threads aren't parallelised by the m=1 single-flow test (one 5-tuple hashes to one SO_REUSEPORT worker).
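The single-core pinning follows from how a flow hash selects a worker: the same 5-tuple always produces the same index. The sketch below is a toy stand-in for that idea, not the kernel's actual `SO_REUSEPORT` hash; `pick_worker`, the CRC32 hash, and the source ports are all assumptions for illustration:

```python
import zlib


def pick_worker(src_ip, src_port, dst_ip, dst_port, n_workers, proto="udp"):
    """Toy flow hash: the same 5-tuple always lands on the same worker."""
    key = f"{proto}|{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return zlib.crc32(key) % n_workers


# One client flow (hypothetical source port), hashed many times.
flow = ("10.116.0.3", 40000, "10.116.0.2", 3478)
picks = {pick_worker(*flow, n_workers=4) for _ in range(1000)}

# Many distinct source ports can reach more than one worker.
spread = {
    pick_worker("10.116.0.3", port, "10.116.0.2", 3478, 4)
    for port in range(40000, 40064)
}
print(len(picks), len(spread))
```

With one flow the set of chosen workers has exactly one member, which is the m=1 situation described above.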
Kernel side (children-aggregated) is the real cost:
36 % udp_sendmsg (sendto path)
14 % udp_recvmsg
17 % ip_finish_output / ip_output / __dev_queue_xmit
~23 % syscall enter / exit machinery (sysret, SYSRETQ, SYSCALL_64*)
That ~23 % syscall overhead is the next big lever. Halving it (via batching) is worth ~10 % wall-clock CPU.
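The estimate above is Amdahl-style accounting: the CPU freed is the component's share of cycles times the fraction of that component removed. A sketch using the rounded shares from this profile; the helper name is illustrative:

```python
def cpu_saved(fraction: float, speedup: float) -> float:
    """Share of total CPU freed when a component that takes `fraction`
    of the cycles becomes `speedup` times cheaper."""
    return fraction * (1.0 - 1.0 / speedup)


# Halve the ~23 % syscall enter/exit machinery.
syscall_saving = cpu_saved(0.23, 2.0)

# Halve the ~5 % user-space coturn share, for comparison.
userspace_saving = cpu_saved(0.05, 2.0)

print(f"syscall: {syscall_saving:.1%}, user space: {userspace_saving:.1%}")
```

The comparison shows why kernel-side batching is the larger lever: the same 2x improvement is worth several times more when applied to the syscall share than to user-space code.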
What didn't work
Default --udp-recvmmsg=true on Linux (tried in iter 1, kept opt-in)
The flag now covers both the unconnected listener socket in dtls_listener.c and connected plain-UDP relay sockets in ns_ioalib_engine_impl.c. DTLS session sockets remain on the SSL read path and are not batched by the relay socket helper.
Throughput parity or slight negative results were confirmed across multiple
A/B rounds on m=1 and m=100; keep this opt-in until batch occupancy
instrumentation proves that real deployments commonly receive multiple queued
datagrams per connected socket readiness event.
Caching get_relay_socket_ss (iter 3) — no measurable wall-clock win
The function is static inline already and the underlying
get_relay_socket() is a four-line accessor. Caching the result
does save a cross-TU function call per packet (the compiler can't
prove get_relay_socket pure across the
set_df_on_ioa_socket / ioa_network_buffer_* calls in between),
which the perf profile picked up as a small redistribution, but
throughput stayed in the noise band. Kept anyway: the cleanup is
defensible and matches the iter 4/5 inlining direction.
Methodology lessons
- Always alternate A/B per round rather than running 5×B then 5×I. The droplet pair has noticeable environmental drift over a few minutes (other tenants on the hypervisor, NIC ring backpressure, whatever); sequential blocks bias whichever binary ran on the worse half of the run.
- Discard the first run after a turnserver restart. The loadgen's first run after a server restart is consistently 30–80 % slower than steady-state; it looks like channel/permission state on the client side warming up, not the server. A 4 s "throwaway" run before the measured 30 s run is enough.
- Run-to-run variance is ~5–10 % even with alternation. Plan on 6–8 rounds (≈ 8 minutes wall-clock) before claiming a sub-10 % win. A single 3-round A/B will lie to you.
- Use the `tot_recv_msgs` field, not `send_pps`. Loadgen send rate saturates at ~262 K pps regardless of relay capacity; it's whatever the loadgen kernel will accept into its UDP send buffer. The receive count is what made it round-trip through the relay.
- The relay is kernel-bound. User-space coturn is ~5 % of cycles. Halving it gives at most ~2.5 % wall-clock, which is usually undetectable per-iteration and only visible cumulatively. Don't expect a 10 % jump from a CSE.
- Single-flow tests pin one core. With `SO_REUSEPORT` the kernel hashes 5-tuples to worker sockets; one client → one tuple → one worker thread. The other 3 cores sit idle. To exercise all 4 relay threads you'd need m≥4 with distinct source ports; ours don't spread cleanly because the loadgen reuses ports.
- Don't re-extract `/root/coturn` between iterations if you want to keep `git apply`-style patches working. The droplet copy is not a git checkout (it's the `git archive` tar). Use `patch -p1`. Each iteration uploaded a cumulative diff (current branch vs `master`) and re-extracted from `/root/coturn_clean.tar` first to get a clean apply.
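The alternation and warm-up rules above can be expressed as a run schedule. This is a planning sketch only: `ab_schedule` and the step tuples are hypothetical, and the real harness scripts are not reproduced here:

```python
from typing import List, Tuple


def ab_schedule(rounds: int, warmup_s: int = 4,
                measured_s: int = 30) -> List[Tuple[str, str, int]]:
    """Return (binary, phase, seconds) steps that alternate baseline and
    candidate every round, with a throwaway warm-up after each swap."""
    steps = []
    for _ in range(rounds):
        for binary in ("baseline", "candidate"):
            steps.append((binary, "warmup", warmup_s))      # discarded
            steps.append((binary, "measured", measured_s))  # counted
    return steps


plan = ab_schedule(rounds=6)
total_s = sum(seconds for _, _, seconds in plan)
measured_order = [b for b, phase, _ in plan if phase == "measured"]
print(total_s, measured_order[:4])
```

The total excludes binary swap and restart time, so wall-clock duration on the droplets will be somewhat longer than the sum of the steps.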
Optimization backlog (bigger fish for next session)
Ordered by expected impact for the m=1 packet-flood metric:
1. Batch the send side (`sendmmsg`) or pass receive batches deeper. The occupancy counters show receive batching is already working: `m=1` averaged 15.6 packets per call and `m=100` averaged 11.4. The code immediately invokes the existing per-packet callback for each received datagram, and each forwarded packet still pays a separate send syscall. The next measurable lever is to queue per-thread outbound datagrams during a receive batch and flush them with `sendmmsg`, or introduce a batch-aware callback path for the hot UDP relay case.
2. Keep `recvmmsg` occupancy counters available while developing send batching. They are cheap enough for targeted performance builds and make it obvious whether a benchmark is exercising one relay thread or all relay threads. Consider hiding periodic logs behind a verbose/debug option before shipping broadly.
3. GSO (`UDP_SEGMENT`) on the send path. Linux can take one "large" datagram and segment it in the kernel for back-to-back packets to the same destination. Our channel-data flood IS same-destination. Setting `UDP_SEGMENT` and submitting a single `sendmsg` of N×packet_size cuts skb-alloc / `__dev_queue_xmit` work substantially. Needs careful handling for short tails and non-uniform sizes; complementary to (2).
4. Inline more cross-TU per-packet accessors. Pattern from iter 4/5 still applies: `addr_eq` (called per channel-data packet for permission lookup), `ioa_network_buffer_get_size`, `get_ioa_socket_type` / `_app_type`. Each is small enough; the only reason to be cautious is they're declared in `ns_turn_ioalib.h`, which is part of the public-ish server library API. Moving the body inline doesn't break ABI but does require a recompile of all consumers. Likely <1 % each but cheap to do.
5. Re-evaluate `--udp-recvmmsg` default after instrumentation. The current measurements do not justify default-on. Revisit only if production-like traces show frequent batch sizes above one and no latency/memory downside.
Things investigated and ruled out (don't redo)
- `set_socket_ttl` / `set_socket_tos` already short-circuit on no-change via `s->current_ttl != ttl` / `s->current_tos != tos`. In a steady-state flood the per-packet call returns immediately without `setsockopt`. Already optimized.
- `set_df_on_ioa_socket` similarly guarded (`ns_ioalib_engine_impl.c:242`).
- `turn_report_session_usage` slow path runs once per 4096 packets (see iter 1 commit); the per-call overhead is now ~3 reads + 1 bitmask test + 1 conditional return.
- `MSG_CONFIRM` in `sendto` would skip ARP refresh, but `neigh_resolve_output` + `neigh_hh_output` show ~17 % combined in perf only because we're sending that many packets; per-packet it's the normal cached neighbor path, not a refresh.
- Increasing `MAX_TRIES` from 16 to 64 in `socket_input_worker` doesn't change syscall count; it only delays returning to libevent. Useless without (1) above.
How to resume
1. Verify the droplets are still up (the IPs above). If they were destroyed, re-create with `c-4` / `nyc1` / `default-nyc1` VPC and the `pavel` SSH key (id 23704483).
2. Re-upload `/tmp/coturn_clean.tar` from `git archive master` and rebuild `/root/coturn_baseline/build/bin/turnserver` if the baseline binary is gone. The A/B harness depends on having both binaries side-by-side on the turnserver droplet.
3. Run a 6-round alternating A/B as a sanity check that the current tip-of-branch still beats `master` by ~5 %. If it doesn't, the environment drifted and the baseline needs re-anchoring.
4. Pick the next item from the backlog. Item (1), `recvmmsg` into `socket_input_worker`, is where the next material gain lives.
2026-05-03 sendmmsg follow-up
A later run on two DigitalOcean CPU-optimized c-4 droplets in sfo3
(10.124.0.2 turnserver, 10.124.0.3 loadgen) tested an experimental
Linux-only --udp-sendmmsg flag with --udp-recvmmsg.
| Run | Code/flags | Generator max pps | Generator avg pps | Server RX avg pps | Server TX avg pps | Server TX peak pps | CPU avg | Perf conclusion |
|---|---|---|---|---|---|---|---|---|
| iter0 | baseline, `--udp-recvmmsg` | 335,872 | 286,721 | 360,900 | 257,357 | 323,488 | 97.8% | sendto/udp_sendmsg dominates |
| iter1 | `--udp-sendmmsg`, both directions | 409,600 | 312,662 | 428,184 | 197,300 | 260,453 | 99.8% | sendmmsg path dominates; TX regressed |
| iter2 | `sendmmsg` only for batches >= 4 | 393,216 | 315,393 | 398,121 | 163,626 | 215,068 | 98.9% | Threshold did not recover TX |
| iter3 | listener-side batching only | 425,984 | 286,038 | 376,444 | 210,050 | 332,417 | 97.4% | Peak ingress/TX improved, average TX still below baseline |
Validation result: sendmmsg() is not a proven general win for this workload.
It can increase generator max pps and peak server TX, but average delivered
server TX stayed below the --udp-recvmmsg baseline. Keep it opt-in until a
follow-up change proves better end-to-end relay throughput.
Perf still points at per-datagram kernel transmit cost:
- baseline: `udp_send -> sendto -> __sys_sendto -> udp_sendmsg -> udp_send_skb -> ip_output`
- sendmmsg variants: `udp_sendmmsg_flush -> __sendmmsg -> __sys_sendmmsg -> ___sys_sendmsg -> udp_sendmsg -> ip_output`
The key observation is that sendmmsg() reduces syscall entry count but still
walks udp_sendmsg and the IP output path once per datagram. On this workload,
the extra mmsghdr copy/looping overhead can offset the syscall savings.
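A toy cost model makes the observation concrete: batching divides only the syscall term by the batch size, while the stack term is still paid per datagram. The weights below are illustrative assumptions (borrowed from the ~23 % syscall share in the iter-5 profile), not measurements from this run:

```python
def per_datagram_cost(batch, syscall, stack, per_entry_overhead=0.0):
    """Cost per datagram when one syscall carries `batch` datagrams but
    the kernel stack still runs once per datagram."""
    return syscall / batch + stack + per_entry_overhead


# Illustrative weights: one sendto costs 1.0 unit in total.
SYSCALL, STACK = 0.23, 0.77

sendto = per_datagram_cost(1, SYSCALL, STACK)
sendmmsg_16 = per_datagram_cost(16, SYSCALL, STACK)

# Upper bound on the saving from batching 16 datagrams per syscall.
best_case_saving = 1.0 - sendmmsg_16 / sendto

# Any per-entry mmsghdr overhead above this erases the gain.
break_even_overhead = sendto - sendmmsg_16

# Example: a hypothetical 0.25-unit per-entry overhead is a net loss.
regressed = per_datagram_cost(16, SYSCALL, STACK, 0.25) > sendto
print(f"{best_case_saving:.1%}", regressed)
```

Under these assumed weights the saving is capped at roughly the syscall share, so a modest per-entry overhead is enough to turn `sendmmsg` into a regression.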
Deferred bigger refactors from this run:
- Per-peer connected UDP relay sockets or a destination cache could reduce address handling and route lookup for repeated peer sends, but it changes relay socket semantics and receive filtering.
- Shard a single hot allocation/flow across multiple relay workers only with a careful design for ordering, session accounting, socket ownership, and lock contention.
- Investigate `io_uring` send batching or kernel-bypass style transmit only as a larger architecture experiment.
- Consider a purpose-built benchmark mode that measures delivered relay pps at a controlled input rate. The current saturated packet flood is useful for finding hot functions but can obscure end-to-end delivery changes.
2026-05-09 UDP-GSO send path (--udp-gso)
Realizes the GSO backlog item from the iter-5 backlog above. The recvmmsg /
sendmmsg follow-ups confirmed that on this workload the dominant cost is the
per-datagram kernel TX path (udp_sendmsg → ip_finish_output → __dev_queue_xmit → start_xmit), which mmsg-style batching does not collapse. UDP-GSO (Linux
UDP_SEGMENT cmsg) does collapse it: N same-destination, same-size datagrams
are submitted as one sendmsg carrying an iovec; the kernel allocates one
super-skb that traverses the network stack once and is split at egress (NIC).
Implementation lives in src/apps/relay/ns_ioalib_engine_impl.c
and reuses the existing --udp-sendmmsg batch state. Eligibility (same fd,
same dest, same size, ≤ 1472 B per datagram) is tracked on every
udp_sendmmsg_enqueue; eligible flushes go through udp_gso_attempt_flush
ahead of the sendmmsg loop, with an automatic sticky disable on
EINVAL/ENOPROTOOPT so a kernel/NIC without GSO support gracefully falls back.
The relay-side socket_udp_read_batch_recvmmsg now wraps its callback loop
in udp_sendmmsg_batch_begin/end so peer→client sends triggered inside a
recvmmsg batch can also coalesce — without that wrapping, the relay path
issues one sendto per delivered datagram.
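The eligibility rule described above (same fd, same destination, same size, at most 1472 B per datagram) can be sketched as follows. This is a Python illustration under stated assumptions, not the C implementation: the `Batch` class and `gso_message` helper are hypothetical, and the constants are defined locally because Python's `socket` module does not export `UDP_SEGMENT`:

```python
import struct
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

UDP_SEGMENT = 103       # value from <linux/udp.h>
SOL_UDP = 17            # cmsg level for UDP_SEGMENT on Linux
MAX_GSO_PAYLOAD = 1472  # per-datagram limit quoted in the text


@dataclass
class Batch:
    """Hypothetical stand-in for the per-thread send batch state."""
    fd: Optional[int] = None
    dest: Optional[Tuple[str, int]] = None
    size: Optional[int] = None
    gso_eligible: bool = True
    payloads: List[bytes] = field(default_factory=list)

    def enqueue(self, fd: int, dest: Tuple[str, int], payload: bytes) -> None:
        if not self.payloads:
            # The first datagram defines the batch's fd, destination, size.
            self.fd, self.dest, self.size = fd, dest, len(payload)
            self.gso_eligible = len(payload) <= MAX_GSO_PAYLOAD
        elif (fd, dest, len(payload)) != (self.fd, self.dest, self.size):
            # Any mismatch makes the whole batch ineligible until reset.
            self.gso_eligible = False
        self.payloads.append(payload)


def gso_message(batch: Batch):
    """Return (buffers, ancdata) for one sendmsg covering the batch."""
    ancdata = [(SOL_UDP, UDP_SEGMENT, struct.pack("=H", batch.size))]
    return [b"".join(batch.payloads)], ancdata


peer = ("10.116.0.3", 3480)
uniform = Batch()
for _ in range(3):
    uniform.enqueue(5, peer, b"x" * 32)
buffers, ancdata = gso_message(uniform)

mixed = Batch()
mixed.enqueue(5, peer, b"x" * 32)
mixed.enqueue(5, peer, b"x" * 40)  # different size, so no GSO
```

On Linux, an eligible batch could then be submitted with `sock.sendmsg(buffers, ancdata, 0, batch.dest)`; an ineligible one would go through the per-datagram path instead.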
DigitalOcean validation on 2026-05-09 — fresh nyc1 c-4 droplets (turn
10.116.0.4, load 10.116.0.5), all variants built from the same source tree
under /root/coturn/build, -Y packet -m 1 -l 120, monitor window via sar -n DEV for eth1, mpstat, pidstat. The 12 s sweep first established the
ordering, then a 30 s alternating A/B (baseline → gso → baseline → gso)
confirmed the magnitude of the delta:
| Variant | eth1 RX pps | eth1 TX pps | sys CPU | idle CPU |
|---|---|---|---|---|
| baseline_r1 | 322,091 | 127,445 | 22.9% | 67.5% |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` (gso_r1) | 266,068 | 257,996 | 15.0% | 78.7% |
| baseline_r2 | 309,475 | 125,573 | 20.9% | 70.7% |
| gso_r2 | 275,992 | 225,366 | 14.9% | 74.3% |
Mean server forwarding rate (eth1 TX): baseline 126,509 pps → GSO 241,681 pps, +91 % (1.91×), with mean system CPU dropping from 21.9 % to 14.9 % — about 2.8× CPU efficiency in TX pps per system-CPU-%.
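The summary figures can be reproduced from the four A/B rows. A short sketch of that arithmetic, with "CPU efficiency" taken as mean TX pps divided by mean system-CPU-%, as defined in the sentence above; the function name is illustrative:

```python
from statistics import mean

# (eth1 TX pps, system CPU %) per 30 s round, from the A/B table above.
baseline_rounds = [(127_445, 22.9), (125_573, 20.9)]
gso_rounds = [(257_996, 15.0), (225_366, 14.9)]


def round_summary(rounds):
    """Return (mean TX pps, mean sys CPU %, TX pps per sys-CPU-%)."""
    tx = mean(r[0] for r in rounds)
    sys_cpu = mean(r[1] for r in rounds)
    return tx, sys_cpu, tx / sys_cpu


b_tx, b_cpu, b_eff = round_summary(baseline_rounds)
g_tx, g_cpu, g_eff = round_summary(gso_rounds)

speedup = g_tx / b_tx
efficiency_gain = g_eff / b_eff
print(f"{b_tx:,.0f} -> {g_tx:,.0f} pps, {speedup:.2f}x, "
      f"efficiency {efficiency_gain:.1f}x")
```

With only two rounds per variant these are point estimates; the methodology notes earlier in this log recommend more rounds before trusting small deltas, though a delta of this size is well outside the observed run-to-run variance.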
12 s packet sweep, all four variants, mean send_pps reported by uclient (used only for ordering — for absolute throughput trust eth1 TX above):
| Variant | m=1 | m=2 | m=4 | m=8 | m=16 | m=32 |
|---|---|---|---|---|---|---|
| baseline | 230,401 | 150,189 | 187,055 | 174,771 | 160,871 | 167,789 |
| `--udp-recvmmsg` | 255,660 | 148,824 | 174,767 | 142,997 | 150,743 | 144,200 |
| `--udp-recvmmsg --udp-sendmmsg` | 231,766 | 146,776 | 148,826 | 136,542 | 148,955 | 143,575 |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 136,876 | 147,458 | 124,250 | 131,081 | 137,636 | 114,714 |
The uclient generator reports its own send rate, which drops with GSO because
the loadgen droplet's turnutils_peer becomes the new bottleneck — it is
single-threaded and cannot reflect 240 k pps. The 30 s eth1 capture is the
authoritative server-side metric; sweep_m1 is retained only to show that
GSO does not regress in the moderately-loaded m=2..32 range relative to
recvmmsg+sendmmsg.
Perf children share, m=1 12 s perf record on the turnserver process:
| Symbol | baseline | recvmmsg | recvsendmmsg | gso |
|---|---|---|---|---|
| `__x64_sys_sendto` (children) | 43.6 % | 47.6 % | 22.8 % | 0.0 % |
| `__x64_sys_sendmsg` (children) | — | — | — | 38.1 % |
| `__x64_sys_sendmmsg` (children) | — | — | 27.0 % | 0.0 % |
| `udp_sendmsg` | 38.8 % | 41.9 % | 20.6 % | 35.9 % |
| `__dev_queue_xmit` | 18.5 % | — | — | 29.3 % |
| `skb_segment` (egress GSO split) | absent | absent | absent | 2.2 % |
| `syscall_return_via_sysret` (self) | 7.2 % | 4.7 % | 4.4 % | 2.4 % |
| `entry_SYSCALL_64_after_hwframe` (self) | 4.1 % | 3.6 % | 2.6 % | 1.8 % |
In the GSO column the per-packet kernel-stack cost is now amortized across
the segments of a single super-skb. The proportional rise of
__dev_queue_xmit is misleading on its own — it reflects a smaller
denominator (CPU usage dropped) while the per-packet absolute cost dropped.
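The denominator effect can be illustrated by normalizing a symbol's share by CPU used and packets sent. This is a rough proxy under loud assumptions: it combines perf shares from the 12 s profile with system CPU and TX pps from the separate 30 s A/B rounds, and machine-wide system CPU is not the exact denominator of a per-process perf share:

```python
def relative_cost_per_packet(share_pct, sys_cpu_pct, tx_pps):
    """Proxy for absolute per-packet cost: share of cycles, scaled by
    how much CPU was in use, divided by packets forwarded per second."""
    return (share_pct / 100.0) * sys_cpu_pct / tx_pps


# __dev_queue_xmit share, mean sys CPU %, mean eth1 TX pps.
baseline_cost = relative_cost_per_packet(18.5, 21.9, 126_509)
gso_cost = relative_cost_per_packet(29.3, 14.9, 241_681)

ratio = gso_cost / baseline_cost
print(f"__dev_queue_xmit cost per packet, GSO vs baseline: {ratio:.2f}x")
```

Under these assumptions the proxy falls by a bit more than 40 % even though the symbol's percentage share rises, which is the point the paragraph above makes.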
Operational notes:
- Flag is opt-in. `--udp-gso` requires `--udp-sendmmsg`; without that flag the batch state never accumulates and GSO has nothing to flush. The `--help` text states the dependency.
- GSO eligibility resets on every `_begin/_end`. Mixed-destination or mixed-size workloads transparently fall back through the existing `sendmmsg` and `udp_send` paths.
- Sticky disable on `EINVAL/ENOPROTOOPT` keeps a process running on a non-virtio host or older kernel from hot-looping in the failing GSO path. A WARNING line is logged once.
- Tested on Linux 6.8 + virtio-net (DO `c-4`), `gso_max_segs=65535`. Older hosts (kernel <4.18) lack `UDP_SEGMENT` entirely; the sticky-disable path covers them.
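The sticky-disable behavior described in these notes can be sketched as a small wrapper: attempt GSO, and on `EINVAL` or `ENOPROTOOPT` warn once and never try again. The class, callback names, and warning text are hypothetical illustrations, and re-raising other errors is an assumption rather than the documented behavior:

```python
import errno


class GsoSender:
    """Hypothetical wrapper: try GSO first, then fall back permanently
    after the first EINVAL/ENOPROTOOPT."""

    def __init__(self, gso_send, fallback_send, warn=print):
        self.gso_send = gso_send
        self.fallback_send = fallback_send
        self.warn = warn
        self.gso_disabled = False

    def flush(self, batch):
        if not self.gso_disabled:
            try:
                return self.gso_send(batch)
            except OSError as exc:
                if exc.errno not in (errno.EINVAL, errno.ENOPROTOOPT):
                    raise  # assumption: other errors are handled elsewhere
                self.gso_disabled = True  # sticky: never retry GSO
                name = errno.errorcode[exc.errno]
                self.warn(f"UDP GSO unavailable ({name}); using fallback")
        return self.fallback_send(batch)


attempts, warnings = [], []


def failing_gso(batch):
    attempts.append("gso")
    raise OSError(errno.EINVAL, "Invalid argument")


def fallback(batch):
    attempts.append("fallback")
    return len(batch)


sender = GsoSender(failing_gso, fallback, warn=warnings.append)
sent = [sender.flush([b"a", b"b"]) for _ in range(3)]
```

After the first failure only the fallback path runs, so the failing call is attempted once per process rather than once per flush.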
Suggested next levers if more relay throughput is needed:
- Move loadgen off turnutils_peer. The 240 k → 90 k tot_recv_msgs/30 s gap at GSO is dominated by single-threaded peer reflection, not the TURN server. A multi-thread peer or `pktgen`-style reflector would let us measure the real ceiling.
- Per-peer connected relay sockets. Same-destination is the GSO eligibility predicate; a connected relay socket would always be same-dest and would also save `route_lookup` per send.
- `MSG_ZEROCOPY` on the GSO sendmsg. `rep_movs_alternative` is still 3 % self in GSO, and zerocopy avoids the userspace→kernel copy. Probably small for 32-B STUN packets; revisit when payloads are larger.
Artifacts (perf.data, sar/mpstat/pidstat, sweep logs, AB logs) are saved at
perf-results-20260508-213056/ in the worktree.