fb94ab117d
## Summary

Three related changes to `turnutils_uclient` that together stop the loadgen from being the bottleneck when benchmarking the relay:

1. **Sender thread pool** (`--sender-threads <N>`, max 4, auto-bumped to 2 at `-m >= 4`). Mirrors the listener pool that landed in #1911. Each sender thread owns its own libevent base, a session shard (round-robin assigned at allocation time via `elem->sender_id`), and a 100 µs timer that runs the burst loop just like the legacy main-thread `timer_handler` did. Send-side counters (`tot_send_messages`, `tot_send_bytes`, `tot_send_dropped`, `load_sent_packets`) and the completion accumulators in `client_timer_handler` (`total_loss` / `total_latency` / `total_jitter`) are written into per-thread cache-line-aligned slabs and reduced into the globals after `pthread_join`. This avoids the cross-core atomic-counter contention that the listener-pool work already documented.

2. **UDP-GSO send batching** in `send_buffer` for the plain-UDP path. The sender pool opens a thread-local batch window around its per-tick iteration; within the window, `send_buffer` copies the payload into a per-thread slot and appends to a scatter-gather `iov[]`. On flush:
   - **If `count > 1` and all segments share the same size** → one `sendmsg(2)` with a `UDP_SEGMENT` cmsg.
   - **If GSO is unavailable** (kernel returns `EINVAL`/`ENOPROTOOPT`/`EOPNOTSUPP`) → sticky-disable per thread, fall back to `sendmmsg(2)` over the same iov array.
   - **Per-entry `send(2)`** as the final fallback for whatever `sendmmsg` refused (EAGAIN tail, etc.).

   Auto-flush triggers: different fd (next session in iteration), different segment size, batch capacity (64), or end of iteration.

3. **`recv_pps` in `print_load_generator_rate`**, alongside the existing `send_pps`. Once the sender pool + GSO let uclient push well over 1 Mpps of UDP, the meaningful end-to-end metric is the round-trip count, not the send-side count: the relay/peer pipeline drops more than 95% of packets when uclient outpaces it.

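The flush order described in item 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `demo_batch`, `demo_cmsg_buf`, `demo_build_gso_msg`, and `demo_flush` are hypothetical names, the socket is assumed to be a connected UDP socket, and all queued segments are assumed to share one size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for sendmmsg(2) and struct mmsghdr */
#endif
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/udp.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103 /* Linux value; older libc headers may lack it */
#endif

#define DEMO_BATCH_MAX 64 /* batch capacity quoted in the summary */

/* Hypothetical per-thread batch state (illustration only). */
struct demo_batch {
  struct iovec iov[DEMO_BATCH_MAX]; /* one entry per queued datagram */
  size_t count;                     /* queued datagrams */
  uint16_t seg_size;                /* common segment size in bytes */
  bool gso_disabled;                /* sticky once the kernel refuses GSO */
};

/* Control buffer sized and aligned for a single UDP_SEGMENT cmsg. */
union demo_cmsg_buf {
  char buf[CMSG_SPACE(sizeof(uint16_t))];
  struct cmsghdr align;
};

/* Describe the whole batch as one message; the kernel splits the
 * concatenated iov data into datagrams of seg_size bytes each. */
void demo_build_gso_msg(struct demo_batch *b, struct msghdr *msg, union demo_cmsg_buf *ctl) {
  assert(b->count > 0 && b->count <= DEMO_BATCH_MAX);
  memset(msg, 0, sizeof(*msg));
  memset(ctl, 0, sizeof(*ctl));
  msg->msg_iov = b->iov;
  msg->msg_iovlen = b->count;
  msg->msg_control = ctl->buf;
  msg->msg_controllen = sizeof(ctl->buf);
  struct cmsghdr *cm = CMSG_FIRSTHDR(msg);
  cm->cmsg_level = SOL_UDP;
  cm->cmsg_type = UDP_SEGMENT;
  cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
  memcpy(CMSG_DATA(cm), &b->seg_size, sizeof(uint16_t));
}

/* Flush the batch on a connected UDP socket: GSO first, then sendmmsg(2),
 * then per-entry send(2). Returns the number of datagrams accepted. */
size_t demo_flush(int fd, struct demo_batch *b) {
  size_t sent = 0;

  if (b->count > 1 && !b->gso_disabled) {
    struct msghdr msg;
    union demo_cmsg_buf ctl;
    demo_build_gso_msg(b, &msg, &ctl);
    if (sendmsg(fd, &msg, 0) >= 0) {
      sent = b->count; /* GSO send is all-or-nothing */
    } else if (errno == EINVAL || errno == ENOPROTOOPT || errno == EOPNOTSUPP) {
      b->gso_disabled = true; /* do not retry GSO on this thread */
    }
  }

  if (sent < b->count) {
    struct mmsghdr mm[DEMO_BATCH_MAX];
    memset(mm, 0, sizeof(mm));
    for (size_t i = 0; i < b->count; i++) {
      mm[i].msg_hdr.msg_iov = &b->iov[i];
      mm[i].msg_hdr.msg_iovlen = 1;
    }
    int n = sendmmsg(fd, mm, (unsigned int)b->count, 0);
    if (n > 0) {
      sent = (size_t)n;
    }
  }

  /* Final fallback for whatever sendmmsg did not take. */
  while (sent < b->count) {
    if (send(fd, b->iov[sent].iov_base, b->iov[sent].iov_len, 0) < 0) {
      break; /* e.g. EAGAIN: leave the tail to the caller's accounting */
    }
    sent++;
  }

  size_t accepted = sent;
  b->count = 0; /* the batch window is empty again */
  return accepted;
}
```

A caller would fill `iov[]`, set `count` and `seg_size`, and call `demo_flush(fd, &batch)` whenever one of the auto-flush triggers fires. How the remaining tail is accounted for after a failed `send(2)` is left open in this sketch.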
With `recv_pps` added, the progress line now reads: `send_pps=6012928.00, recv_pps=101486.00, total_sent=112975924, total_recv=1853369`

## Why

Benchmarking `--multiplex-client` / `--multiplex-peer` on a c-4 DigitalOcean droplet, the loadgen's single-threaded `timer_handler` saturated one CPU around 300 kpps regardless of `-m`. The relay was never put under real pressure, so the multiplex paths' value couldn't be measured.

With this patch the loadgen can produce >6 Mpps from a single c-4 droplet, far above the relay's per-thread saturation point, so the bottleneck moves to the server where it belongs.

## Benchmark: multiplex-client turnserver, c-4 loadgen, m=4, 20 s

| Round | OLD (master), send_pps | NEW (this PR), send_pps | Lift |
|-------|------------------------|-------------------------|------|
| 1 | 246k | 7.48M | 30.4× |
| 2 | 459k | 6.06M | 13.2× |
| 3 | 360k | 5.07M | 14.1× |
| **avg** | **355k** | **6.20M** | **17.5×** |

Throughput cap shifts from loadgen to relay. End-to-end `recv_pps` (which is now first-class in the progress line) is ~100 kpps in this configuration, limited by the relay, not uclient.

## Design notes

- **Cache-line alignment** on `uclient_sender` mirrors the listener-pool's slab pattern. Same false-sharing trap, same fix.
- **Main-thread timer slows to 10 ms** when the sender pool is engaged. The main timer still fires for lifecycle / `__turn_getMSTime` refresh, but `timer_handler` early-returns when `num_sender_threads > 0` so we don't burn a core on no-op 100 µs ticks.
- **Stop ordering**: `stop_sender_threads()` runs before `stop_listener_threads()`. The senders own session mutation (`wmsgnum`, `to_send_timems`, shutdown), so joining them first prevents a race where a listener accumulates a stat into a session whose owning sender is still iterating it.
- **UDP-GSO copy**: the per-slot memcpy is intentional. The caller (`client_write`) reuses `elem->out_buffer` across burst iterations, so pointing `iov[i]` at the session buffer would alias all entries to the most recent payload.
  A rotating per-session output ring would eliminate the copy. It was left out of this PR because the kernel-side savings from collapsing N `sendmsg` calls into one GSO `sendmsg` dominate the per-packet copy cost at the rates we measured.
- **Linux-only**: send-side batching machinery is gated by `#if defined(__linux__)`. Non-Linux builds get no-op `uclient_send_batch_begin`/`_end`, and `uclient_tx_enqueue` returns false, falling through to the legacy `send(2)` loop.

## Test plan

- [x] macOS local build (Apple Silicon, AppleClang). Sender-pool code paths compile under both Linux and non-Linux gates.
- [x] `clang-format-15 --dry-run --Werror` clean.
- [x] Linux build on a c-4 Ubuntu 24.04 droplet (`cmake -DCMAKE_BUILD_TYPE=Release`).
- [x] `--help` includes the new `--sender-threads` option with valid-range hint; out-of-range values rejected.
- [x] Benchmark on two c-4 droplets in nyc1 against `turnserver --multiplex-client`: 3 alternating rounds OLD vs NEW, 17.5× average send-side lift (data table above).
- [x] `print_load_generator_rate` output verified: `send_pps`, `recv_pps`, `total_sent`, `total_recv` all populated and consistent across listener slab reductions.

## Limitations

- `--multiplex-peer` is not driven by this PR. uclient's pattern (each `-m N` opens two internal sessions per client that share the same peer port) hits the multiplex-peer "one allocation per peer endpoint" rule. Benchmarking that flag at high concurrency requires a separate small change (per-session secondary peer port), which is not in scope here.
- The wider per-round variance under the sender pool (rounds in our bench ranged 13×–30× lift) is timing/scheduler noise at small per-thread shards. It smooths out as `-m` and per-thread session counts grow.
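
As a footnote to the **UDP-GSO copy** design note, the aliasing problem can be reproduced in isolation. The sketch below is illustrative only: `demo_enqueue_aliased`, `demo_enqueue_copied`, and `demo_run_burst` are hypothetical helpers, and a local `session_buf` stands in for a reused buffer such as `elem->out_buffer`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

#define DEMO_SLOTS 4
#define DEMO_SLOT_SIZE 16

/* Aliasing variant: the iov entry points straight at the reused buffer. */
void demo_enqueue_aliased(struct iovec *iov, size_t i, char *session_buf, size_t len) {
  iov[i].iov_base = session_buf;
  iov[i].iov_len = len;
}

/* Copying variant: the payload is copied into slot i before it is queued. */
void demo_enqueue_copied(struct iovec *iov, char slots[][DEMO_SLOT_SIZE], size_t i,
                         const char *session_buf, size_t len) {
  assert(len <= DEMO_SLOT_SIZE);
  memcpy(slots[i], session_buf, len);
  iov[i].iov_base = slots[i];
  iov[i].iov_len = len;
}

/* Simulate one burst: the same buffer is overwritten with "pkt0", "pkt1", ...
 * and queued after each write. Returns how many iov entries still hold the
 * payload that was queued at their index when the batch is flushed. */
size_t demo_run_burst(int copy) {
  char session_buf[DEMO_SLOT_SIZE];
  char slots[DEMO_SLOTS][DEMO_SLOT_SIZE];
  struct iovec iov[DEMO_SLOTS];

  for (size_t i = 0; i < DEMO_SLOTS; i++) {
    int len = snprintf(session_buf, sizeof(session_buf), "pkt%zu", i);
    size_t n = (size_t)len + 1; /* include the terminating NUL */
    if (copy) {
      demo_enqueue_copied(iov, slots, i, session_buf, n);
    } else {
      demo_enqueue_aliased(iov, i, session_buf, n);
    }
  }

  size_t intact = 0;
  for (size_t i = 0; i < DEMO_SLOTS; i++) {
    char expect[DEMO_SLOT_SIZE];
    snprintf(expect, sizeof(expect), "pkt%zu", i);
    if (strcmp((const char *)iov[i].iov_base, expect) == 0) {
      intact++;
    }
  }
  return intact;
}
```

Without the copy, only the last queued entry matches its intended payload, because every `iov[i]` refers to the same bytes. With the per-slot copy, all entries survive until the flush.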