Pavel Punsky fb94ab117d turnutils_uclient: sender thread pool + UDP-GSO send batching + recv_pps reporting (#1913)
## Summary

Three related changes to `turnutils_uclient` that together keep the
loadgen from being the bottleneck when benchmarking the relay:

1. **Sender thread pool** (`--sender-threads <N>`, max 4, auto-bumped to
2 at `-m >= 4`). Mirrors the listener pool that landed in #1911. Each
sender thread owns its own libevent base, a session shard (round-robin
assigned at allocation time via `elem->sender_id`), and a 100 µs timer
that runs the burst loop just like the legacy main-thread
`timer_handler` did. Send-side counters (`tot_send_messages`,
`tot_send_bytes`, `tot_send_dropped`, `load_sent_packets`) and the
completion accumulators in `client_timer_handler` (`total_loss` /
`total_latency` / `total_jitter`) are written into per-thread
cache-line-aligned slabs and reduced into the globals after
`pthread_join`. This avoids the cross-core atomic-counter contention
that the listener-pool work already documented.

2. **UDP-GSO send batching** in `send_buffer` for the plain-UDP path.
The sender pool opens a thread-local batch window around its per-tick
iteration; within the window, `send_buffer` copies the payload into a
per-thread slot and appends to a scatter-gather `iov[]`. On flush:
- **If `count > 1` and all segments share the same size** → one
`sendmsg(2)` with a `UDP_SEGMENT` cmsg.
- **If GSO is unavailable** (kernel returns
`EINVAL`/`ENOPROTOOPT`/`EOPNOTSUPP`) → sticky-disable per thread, fall
back to `sendmmsg(2)` over the same iov array.
- **Per-entry `send(2)`** as the final fallback for whatever sendmmsg
refused (EAGAIN tail, etc.).

Auto-flush triggers: different fd (next session in iteration), different
segment size, batch capacity (64), or end of iteration.

3. **`recv_pps` in `print_load_generator_rate`**, alongside the existing
`send_pps`. Once the sender pool + GSO let uclient push >>1 Mpps of UDP,
the meaningful end-to-end metric is the round-trip count, not the
send-side count — the relay/peer pipeline drops 95+% of packets when
uclient outpaces it. The progress line now reads:

`send_pps=6012928.00, recv_pps=101486.00, total_sent=112975924, total_recv=1853369`
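
For change 1, the per-thread slab pattern can be sketched roughly as follows. This is a minimal illustration, not the uclient code: `sender_slab`, `sender_arg`, `totals`, `sender_main`, and `run_senders` are hypothetical names, the 64-byte cache line is an assumption, and the threads only count packets instead of sending them. Each thread writes to its own aligned slab with plain stores, and the main thread sums the slabs only after `pthread_join`.

```c
#include <pthread.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache-line size */
#define MAX_SENDERS 4 /* the PR caps --sender-threads at 4 */

/* Hypothetical slab: alignas pads each one to a full cache line. */
struct sender_slab {
  alignas(CACHE_LINE) uint64_t send_messages;
  uint64_t send_bytes;
};

struct totals {
  uint64_t send_messages;
  uint64_t send_bytes;
};

struct sender_arg {
  struct sender_slab *slab;
  uint64_t packets;
  uint64_t packet_len;
};

/* Stand-in for a sender thread's burst loop: counts instead of sending. */
static void *sender_main(void *p) {
  struct sender_arg *a = p;
  for (uint64_t i = 0; i < a->packets; ++i) {
    a->slab->send_messages += 1; /* plain store: this thread is the only writer */
    a->slab->send_bytes += a->packet_len;
  }
  return NULL;
}

/* Start n threads (clamped to 1..MAX_SENDERS), join them, then reduce. */
struct totals run_senders(int n, uint64_t packets, uint64_t packet_len) {
  struct sender_slab slabs[MAX_SENDERS] = {0};
  struct sender_arg args[MAX_SENDERS];
  pthread_t tids[MAX_SENDERS];
  struct totals sum = {0, 0};
  int started = 0;
  if (n < 1) n = 1;
  if (n > MAX_SENDERS) n = MAX_SENDERS;
  for (int i = 0; i < n; ++i) {
    args[i] = (struct sender_arg){&slabs[i], packets, packet_len};
    if (pthread_create(&tids[i], NULL, sender_main, &args[i]) != 0) break;
    started++;
  }
  for (int i = 0; i < started; ++i) {
    pthread_join(tids[i], NULL); /* the join makes the slab safe to read */
    sum.send_messages += slabs[i].send_messages;
    sum.send_bytes += slabs[i].send_bytes;
  }
  return sum;
}
```

Because no slab is read until its thread has been joined, the hot loop needs neither atomics nor locks, and the alignment keeps two threads' counters off the same cache line.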

## Why

Benchmarking `--multiplex-client` / `--multiplex-peer` on a c-4
DigitalOcean droplet, the loadgen's single-threaded `timer_handler`
saturated one CPU around 300 kpps regardless of `-m`. The relay was
never put under real pressure, so the multiplex paths' value couldn't be
measured. With this patch the loadgen can produce >6 Mpps from a single
c-4 droplet, far above the relay's per-thread saturation point, so the
bottleneck moves to the server where it belongs.

## Benchmark — multiplex-client turnserver, c-4 loadgen, m=4, 20 s

| Round | OLD (master) | NEW (this PR) | Lift |
|-------|--------------|---------------|------|
| 1 | 246k send_pps | 7.48M | 30.4× |
| 2 | 459k | 6.06M | 13.2× |
| 3 | 360k | 5.07M | 14.1× |
| **avg** | **355k** | **6.20M** | **17.5×** |

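Throughput cap shifts from loadgen to relay. The average lift is the
ratio of the averaged rates (6.20M / 355k), not the mean of the
per-round lifts. End-to-end recv_pps (which is now first-class in the
progress line) is ~100 kpps in this configuration: limited by the
relay, not uclient.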

## Design notes

- **Cache-line alignment** on `uclient_sender` mirrors the
listener-pool's slab pattern. Same false-sharing trap, same fix.
- **Main-thread timer slows to 10 ms** when the sender pool is engaged.
The main timer still fires for lifecycle / `__turn_getMSTime` refresh,
but `timer_handler` early-returns when `num_sender_threads > 0` so we
don't burn a core on no-op 100 µs ticks.
- **Stop ordering**: `stop_sender_threads()` runs before
`stop_listener_threads()` — the senders own session mutation (wmsgnum,
to_send_timems, shutdown), so joining them first prevents a race where a
listener accumulates a stat into a session whose owning sender is still
iterating it.
- **UDP-GSO copy**: the per-slot memcpy is intentional. The caller
(`client_write`) reuses `elem->out_buffer` across burst iterations, so
pointing `iov[i]` at the session buffer would alias all entries to the
most recent payload. A rotating per-session output ring would eliminate
the copy — left out of this PR because the kernel-side savings from
collapsing N sendmsg into one GSO sendmsg dominate the per-packet copy
cost at the rates we measured.
- **Linux-only**: send-side batching machinery is gated by `#if
defined(__linux__)`. Non-Linux builds get no-op
`uclient_send_batch_begin`/`_end` and `uclient_tx_enqueue` returns
false, falling through to the legacy `send(2)` loop.
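
The batching notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `tx_batch`, `batch_must_flush`, `batch_enqueue`, `batch_pick_path`, `gso_unsupported`, and `batch_send_gso` are hypothetical names, and `SLOT_SIZE` is an assumed payload bound. It shows the per-slot copy, the flush triggers (different fd, different segment size, capacity 64), the choice of flush path, and a Linux-gated `sendmsg(2)` carrying a `UDP_SEGMENT` control message on a connected UDP socket.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH_CAP 64   /* batch capacity named in the PR */
#define SLOT_SIZE 2048 /* assumed upper bound on one payload */

enum flush_path { PATH_NONE, PATH_SEND, PATH_GSO, PATH_SENDMMSG };

/* Hypothetical thread-local batch; the real uclient layout may differ. */
struct tx_batch {
  int fd;
  size_t seg_size;
  size_t count;
  bool gso_disabled; /* sticky once the kernel rejects UDP_SEGMENT */
  struct iovec iov[BATCH_CAP];
  unsigned char slots[BATCH_CAP][SLOT_SIZE];
};

/* True when pending entries must be flushed before (fd, len) can join. */
bool batch_must_flush(const struct tx_batch *b, int fd, size_t len) {
  return b->count > 0 && (fd != b->fd || len != b->seg_size || b->count == BATCH_CAP);
}

/* Copy the payload into its own slot so that later reuse of the caller's
 * buffer cannot alias earlier entries. Returns false if a flush is needed. */
bool batch_enqueue(struct tx_batch *b, int fd, const void *payload, size_t len) {
  if (len == 0 || len > SLOT_SIZE || batch_must_flush(b, fd, len)) return false;
  memcpy(b->slots[b->count], payload, len);
  b->iov[b->count].iov_base = b->slots[b->count];
  b->iov[b->count].iov_len = len;
  b->fd = fd;
  b->seg_size = len;
  b->count++;
  return true;
}

/* Pick the flush path; entries share one size because a size change flushes. */
enum flush_path batch_pick_path(const struct tx_batch *b) {
  if (b->count == 0) return PATH_NONE;
  if (b->count == 1) return PATH_SEND;
  return b->gso_disabled ? PATH_SENDMMSG : PATH_GSO;
}

/* The errno values the PR treats as "GSO unavailable". */
bool gso_unsupported(int err) {
  return err == EINVAL || err == ENOPROTOOPT || err == EOPNOTSUPP;
}

#if defined(__linux__)
#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103
#endif
/* One sendmsg(2) for the whole batch; the kernel splits it every seg_size bytes. */
ssize_t batch_send_gso(struct tx_batch *b) {
  union { char buf[CMSG_SPACE(sizeof(uint16_t))]; struct cmsghdr align; } ctl;
  struct msghdr msg;
  uint16_t seg = (uint16_t)b->seg_size;
  memset(&msg, 0, sizeof(msg));
  memset(&ctl, 0, sizeof(ctl));
  msg.msg_iov = b->iov;
  msg.msg_iovlen = b->count;
  msg.msg_control = ctl.buf;
  msg.msg_controllen = sizeof(ctl.buf);
  struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
  cm->cmsg_level = SOL_UDP;
  cm->cmsg_type = UDP_SEGMENT;
  cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
  memcpy(CMSG_DATA(cm), &seg, sizeof(seg));
  ssize_t rc = sendmsg(b->fd, &msg, 0);
  if (rc < 0 && gso_unsupported(errno)) b->gso_disabled = true;
  return rc;
}
#endif
```

A caller would flush and reset `count` whenever `batch_enqueue` returns false for a valid payload, then retry the enqueue. The `sendmmsg(2)` and per-entry `send(2)` fallbacks are omitted here to keep the sketch short.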

## Test plan

- [x] macOS local build (Apple Silicon, AppleClang). Sender-pool code
paths compile under both Linux and non-Linux gates.
- [x] `clang-format-15 --dry-run --Werror` clean.
- [x] Linux build on a c-4 Ubuntu 24.04 droplet (`cmake
-DCMAKE_BUILD_TYPE=Release`).
- [x] `--help` includes the new `--sender-threads` option with
valid-range hint; out-of-range values rejected.
- [x] Benchmark on two c-4 droplets in nyc1 against `turnserver
--multiplex-client`: 3 alternating rounds OLD vs NEW, 17.5× average
send-side lift (data table above).
- [x] `print_load_generator_rate` output verified — `send_pps`,
`recv_pps`, `total_sent`, `total_recv` all populated and consistent
across listener slab reductions.

## Limitations

- `--multiplex-peer` is not driven by this PR. uclient's pattern (each
`-m N` opens two internal sessions per client that share the same peer
port) hits the multiplex-peer "one allocation per peer endpoint" rule;
benchmarking that flag at high concurrency requires a separate small
change (per-session secondary peer port) — not in scope here.
- The wider per-round variance under the sender pool (per-round lift in
our bench ranged from 13× to 30×) is timing/scheduler noise at small
per-thread shards. It smooths out as `-m` and per-thread session counts
grow.
2026-05-11 20:59:12 -07:00