docs(tts): refresh Benchmarks.md per #590; wire styletts2 + --variant into tts-benchmark (#593)

## Summary

Closes the work tracked in #590: bring `Documentation/TTS/Benchmarks.md` into agreement with what's actually shipped on `main` for CoreML TTS backends, and add the two CLI affordances needed to benchmark the in-scope backend × language matrix.

### Doc changes (`Documentation/TTS/Benchmarks.md`)

- Single consolidated **per-backend table** that merges basic info (license, language+voice, footprint in **GB**, sample rate, max chunk per pass, streaming flag) with performance metrics (TTFT p50/p95, synth p50/p95, agg RTFx, peak RSS, WER %, CER %). Five rows: Kokoro ANE en (`af_heart`), Kokoro ANE zh (`zf_001`), PocketTTS en (`alba` 6L), Magpie en (`John`, batch-only on `main`), StyleTTS2 en (LibriTTS iteration_3, zero-shot).
- Dropped from the top-line per scope decision: non-ANE Kokoro, CosyVoice3 zh, PocketTTS 24L variants, Hindi/Cantonese rows. CosyVoice3 narrative sections (decode budget cap + auto-chunker validation) stay verbatim.
- Refreshed Kokoro ANE per-stage breakdown (post-laishere 7-graph chain).
- Replaced the old Magpie per-stage table with a pointer paragraph (`MagpieSynthesisResult.timings` is still populated for callers; sub-1.5 s TTFA work referenced in #590 lives on `feat/magpie-lt-fusion`, not `main`).
- Corrected PocketTTS footprint to `fp16 ~0.77 / int8 ~0.55 GB` (was `~140 / ~520 MB`); enumerated all 10 packs in the corpus matrix; added zh to the Kokoro ANE corpus row; added a StyleTTS2 row.

### CLI changes (`Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift`)

- New `styletts2` / `style-tts2` backend wired to `StyleTTS2Manager.synthesize(text:referenceAudioURL:)`. Requires `--reference <wav>`; the shipped iteration_3 `ref_encoder` is fixed at `[1, 1, 80, 231]`, so the reference must be exactly **2.875 s @ 24 kHz mono**. The harness errors out at predict time on mismatched durations.
- New `--variant {english|mandarin}` flag for `kokoro-ane` so the `zf_001` Mandarin voice pack can be benchmarked alongside `af_heart`. Falls back to `english` when unset; the manager constructor now receives the parsed `KokoroAneVariant` and the default voice is variant-aware.

### Methodology

100-phrase MiniMax-Multilingual on MacBook Air M2 (16 GB, macOS 26, on AC), `--compute-units default`. English WER/CER via Parakeet TDT roundtrip; Mandarin CER via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`): macro 4.01% / micro 4.14% across all 100 zh phrases. WER omitted for Mandarin because `WERCalculator` splits on whitespace.

## Test plan

- [x] `swift build` clean on `main`-based branch.
- [x] `swift format lint --recursive --configuration .swift-format Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift` clean.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend styletts2 --reference ref.wav --corpus minimax-english --output-json /tmp/styletts2-smoke.json` produces a valid JSON report.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend kokoro-ane --variant mandarin --voice zf_001 --corpus minimax-chinese --skip-asr --output-json /tmp/kokoro-zh-smoke.json` pulls the Mandarin voice pack and produces audio.
- [x] Full 100-phrase runs for all five table rows produced under `Benchmarks/tts/runs/590/` (gitignored); table numbers come straight from those JSON reports.
- [ ] Reviewer cross-check: footnote markers (`*`, `‡`, `∥`, `¶`) in the consolidated table all have matching paragraphs below.
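The 2.875 s figure in the CLI notes follows from the `ref_encoder` input shape. The sketch below reproduces that arithmetic; it assumes centered STFT framing (`frames = samples / hop + 1`), which is consistent with the numbers quoted here but is not confirmed against the export script. The helper name `requiredSamples` is hypothetical.

```swift
// Sketch only: assumes centered STFT framing, i.e. frames = samples / hop + 1.
let sampleRate = 24_000   // Hz, from the PR description
let hopLength = 300       // samples per mel hop ("300-hop" in Benchmarks.md)
let melFrames = 231       // last dimension of the [1, 1, 80, 231] input shape

/// Number of samples that yields exactly `frames` mel frames under the
/// assumed framing rule. Hypothetical helper, not part of FluidAudio.
func requiredSamples(frames: Int, hop: Int) -> Int {
    (frames - 1) * hop
}

let samples = requiredSamples(frames: melFrames, hop: hopLength)
let seconds = Double(samples) / Double(sampleRate)
print("reference must be \(samples) samples (\(seconds) s)")
```

Under that assumption the reference clip is 69,000 samples long, which matches the `-t 2.875 -ar 24000` trim used in the reproducibility commands.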
2026-05-12 20:20:36 +00:00 · 2026-05-09 21:47:45 -04:00
parent a400080380
commit 2c45df3035
3 changed files with 278 additions and 186 deletions
@@ -5,9 +5,9 @@
 > phrases / language, CC-BY-SA-4.0) — the same public corpus used
 > by [MiniMax-Speech][mms], seed-tts-eval, and Gradium, so numbers
 > here are directly paper-comparable.
-> **Status:** Kokoro, Kokoro ANE, PocketTTS, Magpie all
-> complete the English run; CosyVoice3 completes the full Mandarin
-> run.
+> **Status:** Kokoro ANE (English + Mandarin), PocketTTS (English),
+> Magpie (English), and StyleTTS2 (English, zero-shot) all complete
+> the full 100-phrase MiniMax run.
 >
 > [minimax]: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
 > [mms]: https://arxiv.org/abs/2505.07916
@@ -53,10 +53,10 @@ Reference each language as `--corpus minimax-<lang>`:

 | Backend     | Default corpus     | Other supported MiniMax languages              |
 |-------------|--------------------|------------------------------------------------|
-| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) |
-| PocketTTS   | `minimax-english`  | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` |
+| Kokoro ANE  | `minimax-english` | `english` (`af_heart`); Kokoro ANE also ships `chinese` (`--variant mandarin`, voice `zf_001`) |
+| PocketTTS   | `minimax-english`  | 6L packs: `english`, `german`, `italian`, `portuguese`, `spanish`. 24L packs: `french_24l`, `german_24l`, `italian_24l`, `portuguese_24l`, `spanish_24l` |
 | Magpie      | `minimax-english`  | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` |
-| CosyVoice3  | `minimax-chinese`  | `chinese`, `cantonese`                         |
+| StyleTTS2   | `minimax-english`  | `english` only (LibriTTS iteration_3, zero-shot from `--reference` audio) |

 Lines beginning with `#` are comments. Custom corpora can still be
 passed with `--corpus-path <file.txt>`.
@@ -64,26 +64,28 @@ passed with `--corpus-path <file.txt>`.
 ### Metrics

 Per phrase:
-- `ttft_ms` — time-to-first-audio. For one-shot / batch backends this
-  equals `synth_ms`. **PocketTTS** is benchmarked through
-  `synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first
-  80 ms audio frame (1920 samples @ 24 kHz). **Magpie** is batch-only
-  (`synthesize(...)` returns a single `MagpieSynthesisResult` after
-  the full AR + codec pipeline completes), so `ttft_ms == synth_ms`.
+- `ttft_ms` — time-to-first-audio. The "first audio" granularity is
+  backend-defined; see [Audio chunk window
+  size](#audio-chunk-window-size) below for the per-backend numbers.
+  **PocketTTS** is benchmarked through `synthesizeStreaming`, so its
+  `ttft_ms` is the timestamp of the first 80 ms audio frame (1920
+  samples @ 24 kHz) — actually-perceptible TTFA. **Kokoro ANE,
+  Magpie, StyleTTS2** are batch / one-shot (`synthesize(...)` returns
+  the full waveform), so `ttft_ms == synth_ms == time-to-complete-wav`
+  for those — interpret it as full-wav latency, not as TTFA.
 - `synth_ms` — total synth wall time.
 - `audio_ms` — generated audio duration.
 - `rtfx` — `audio_ms / synth_ms`.
 - `wer`, `cer` — via Parakeet ASR roundtrip on the rendered WAV.
 - `stage_ms` — per-stage breakdown (backend-specific keys; populated
-  for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS /
-  CosyVoice3).
+  for Kokoro ANE; empty for PocketTTS / Magpie /
+  StyleTTS2 in this report).
 - Backend-specific extras: `encoder_tokens`, `acoustic_frames`,
-  `chunk_count`, `frame_count`, `code_count`, `finished_on_eos`,
-  `generated_token_count`, etc.
+  `chunk_count`, `frame_count`, `code_count`, `generated_token_count`,
+  etc.

 Aggregates:
-- `cold_start_s` — `manager.initialize()` wall time. CosyVoice3 also
-  includes voice-asset load.
+- `cold_start_s` — `manager.initialize()` wall time.
 - `first_synth_ms` — first synth call after init (still cold-ish).
 - `ttft_ms_p50` / `ttft_ms_p95`.
 - `warm_synth_ms_p50` / `warm_synth_ms_p95`.
@@ -92,6 +94,21 @@ Aggregates:
  `task_vm_info_data_t.resident_size_peak`.
 - Per-category macro WER / CER.
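The per-phrase and aggregate metrics listed above can be sketched as follows. This is an illustration under stated assumptions, not the harness code: the percentile uses the nearest-rank method, and aggregate RTFx is taken as total audio time over total synth time. The type and function names are hypothetical.

```swift
/// One benchmark phrase; property names mirror the JSON keys described above.
struct PhraseTiming {
    let synthMs: Double
    let audioMs: Double
    /// Real-time factor: audio produced per unit of synthesis time.
    var rtfx: Double { audioMs / synthMs }
}

/// Nearest-rank percentile (assumption: the real harness may interpolate).
func percentile(_ values: [Double], _ p: Double) -> Double {
    precondition(!values.isEmpty && p > 0 && p <= 100)
    let sorted = values.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

/// Aggregate RTFx as total audio over total synth time (assumed definition).
func aggregateRtfx(_ phrases: [PhraseTiming]) -> Double {
    let audio = phrases.reduce(0.0) { $0 + $1.audioMs }
    let synth = phrases.reduce(0.0) { $0 + $1.synthMs }
    return audio / synth
}

// Made-up timings, for illustration only.
let sample = [
    PhraseTiming(synthMs: 1000, audioMs: 4000),
    PhraseTiming(synthMs: 2000, audioMs: 6000),
    PhraseTiming(synthMs: 500, audioMs: 5000),
    PhraseTiming(synthMs: 1500, audioMs: 3000),
]
let synthTimes = sample.map(\.synthMs)
print(percentile(synthTimes, 50), percentile(synthTimes, 95), aggregateRtfx(sample))
```

Pooling totals weights long phrases more heavily than averaging per-phrase `rtfx` would, so the two definitions can diverge on a corpus with paragraph-length entries.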

+### Audio chunk window size
+
+What counts as "first audio" is backend-defined. The vocoder /
+codec emits in fixed-size chunks; only **PocketTTS** is wired to
+yield those chunks incrementally on `main`. Everything else returns
+the full waveform after the full pipeline runs, so `ttft_ms` for
+those backends measures full-wav latency rather than perceptual
+TTFA. The consolidated [per-backend table](#per-backend-top-line)
+below carries the per-backend sample rate, chunk window, and
+streaming flag inline alongside the performance metrics.
+
+For batch backends, the latency the user perceives is `synth_ms`
+(time to the full wav). `ttft_ms` equals `synth_ms` in that case,
+so the TTFT and Synth columns of the consolidated table match.
+
 ### Reproducibility

 ```bash
@@ -103,14 +120,29 @@ swift run fluidaudio tts-benchmark \
  --compute-units default \
  --output-json bench.json \
  --audio-dir bench-wavs/
+
+# Kokoro ANE Mandarin (skip Parakeet ASR; whisper CER scored separately).
+swift run fluidaudio tts-benchmark \
+  --backend kokoro-ane --variant mandarin --voice zf_001 \
+  --corpus minimax-chinese --skip-asr \
+  --output-json bench-zh.json --audio-dir bench-wavs-zh/
+
+# StyleTTS2 zero-shot (LibriTTS iteration_3). The shipped ref_encoder
+# is fixed at [1, 1, 80, 231], so the reference must be exactly
+# 2.875 s @ 24 kHz mono. Trim externally before invoking, e.g.:
+#   ffmpeg -i speaker.wav -t 2.875 -ar 24000 -ac 1 -c:a pcm_s16le ref.wav
+swift run fluidaudio tts-benchmark \
+  --backend styletts2 --reference ref.wav \
+  --corpus minimax-english \
+  --output-json bench-styletts2.json --audio-dir bench-wavs-styletts2/
 ```

 The harness writes a JSON report to `--output-json` and (optionally)
 keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR
 roundtrip. The default ASR backend is `parakeet` for English-only
-runs and is skipped for CosyVoice3; pass `--asr-backend cohere
---cohere-model-dir <dir>` to score Mandarin (or any of the 14
-Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).
+runs; pass `--asr-backend cohere --cohere-model-dir <dir>` to score
+Mandarin (or any of the 14 Cohere languages) against
+[Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).

 ## Results

@@ -118,88 +150,90 @@ Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Coher

 Reference machine: **MacBook Air, Apple M2 (2022), 8-core CPU /
 8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26**
-(`Mac14,2`, on AC). All English runs use `--compute-units default`,
-voice = backend default
-(`af_heart` for Kokoro, `alba` for PocketTTS, `John` for Magpie),
-corpus = `minimax-english` (100 phrases), Parakeet TDT roundtrip for
-WER / CER.
+(`Mac14,2`, on AC). All runs use `--compute-units default`, 100
+phrases per language. Voices are backend defaults
+(`af_heart` for Kokoro ANE en, `zf_001` for Kokoro ANE zh,
+`alba` for PocketTTS, `John` for Magpie, LibriTTS iteration_3 for
+StyleTTS2). English WER / CER via Parakeet TDT roundtrip; Mandarin
+CER via `whisper-large-v3`.

-| Backend     | License     | Languages              | Footprint | Cold start | TTFT p50 / p95\*   | Synth p50 / p95     | Agg RTFx | Peak RSS | WER     | CER     | Notes |
-|-------------|-------------|------------------------|-----------|------------|---------------------|---------------------|----------|----------|---------|---------|-------|
-| Kokoro ANE  | Apache-2.0  | en (af_heart only)     | ~330 MB   | 37.9 s     | 1586 / 2515 ms      | 1586 / 2515 ms      | 5.19×    | 738 MB   | 0.108   | 0.040   | one-shot; per-stage CU sweep, 7-graph pipeline |
-| Kokoro      | Apache-2.0  | en (af_heart only)     | ~330 MB   | 92.2 s     | 3113 / 4696 ms      | 3113 / 4696 ms      | 2.02×    | 736 MB   | 0.013   | 0.005   | one-shot; cleanest English ASR roundtrip |
-| PocketTTS   | research    | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | **1244 / 4749 ms**  | 8757 / 19174 ms     | 0.61×    | 1503 MB  | 0.014   | 0.006   | **streaming**; TTFT is first 80 ms audio frame |
-| Magpie      | research    | en/es/de/fr/it/vi/zh/hi | ~1.3 GB   | 38.5 s∥    | 15080 / 29895 ms∥   | 15080 / 29895 ms∥   | 0.64×∥   | 762 MB∥  | 0.056   | 0.033   | **batch-only**; `ttft_ms == synth_ms`; split-K/V decoder; outputBackings fast path with latched fallback |
-| CosyVoice3  | Apache-2.0  | zh (mandarin)          | ~1.5 GB   | 29.2 s†    | 14091 / 23679 ms†   | 14091 / 23679 ms†   | 0.357×†  | 3302 MB† | n/a‡    | 0.017‡  | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via [auto-chunker](#cosyvoice3-auto-chunker) but not benchmarked (no yue ASR) |
+One consolidated table per backend × language. **Basic info**
+(license, language, footprint, sample rate, max chunk per pass,
+streaming flag) is merged with **performance** (TTFT, synth, RTFx,
+peak RSS, WER, CER) so there is a single source of truth.
+
+| Backend    | License    | Language (voice)          | Footprint                  | Sample rate | Max chunk per pass                                               | Streaming | TTFT p50 / p95\*  | Synth p50 / p95   | Agg RTFx  | Peak RSS | WER    | CER    |
+|------------|------------|---------------------------|----------------------------|-------------|------------------------------------------------------------------|-----------|-------------------|-------------------|-----------|----------|--------|--------|
+| Kokoro ANE | Apache-2.0 | en (`af_heart`)           | ~0.33 GB                   | 24 kHz      | 510 phonemes / pass (≈25–30 s of audio)                          | No        | **988 / 2068 ms** | 988 / 2068 ms     | **7.47×** | 1027 MB  | 10.8%  | 4.0%   |
+| Kokoro ANE | Apache-2.0 | zh (`zf_001`)             | ~0.33 GB                   | 24 kHz      | 510 phonemes / pass (≈25–30 s of audio)                          | No        | **956 / 1802 ms** | 956 / 1802 ms     | 6.37×     | 685 MB   | n/a‡   | 4.0%‡  |
+| PocketTTS  | research   | en (`alba`, 6L pack)      | fp16 ~0.77 / int8 ~0.55 GB | 24 kHz      | 80 ms Mimi frame, streams until EOS (no fixed cap)               | Yes       | **710 / 1496 ms** | 5160 / 9801 ms    | 1.10×     | 1167 MB  | 1.0%   | 0.4%   |
+| Magpie     | research   | en (`John`)               | ~1.3 GB                    | 22.05 kHz   | 256 NanoCodec frames / pass (≈11.9 s); sentence-split for longer | No        | 11470 / 26042 ms∥ | 11470 / 26042 ms∥ | 0.87×∥    | 543 MB∥  | 3.8%   | 2.6%   |
+| StyleTTS2  | research   | en (LibriTTS iteration_3) | ~0.67 GB¶                  | 24 kHz      | 256 tokens / pass (≈30 s of audio max)                           | No        | 1574 / 3088 ms    | 1574 / 3088 ms    | 4.59×     | 522 MB   | 9.4%   | 4.1%   |

 \* TTFT for **PocketTTS** is first-frame emit through the streaming
-API; **Magpie** is batch-only (`ttft_ms == synth_ms`); the others
-are one-shot, so `ttft_ms == synth_ms`.
+API (perceptual TTFA). **Kokoro ANE / Magpie / StyleTTS2** all run
+one-shot per phrase (no streaming yield on `main`), so for those
+rows `ttft_ms == synth_ms == time-to-complete-wav`.

-† CosyVoice3 chinese: 100/100, 0 errors, ASR skipped. Cold-start
-dropped from 302.7 s to 29.2 s on the warm re-run.
-
-‡ CosyVoice3 CER measured on the **full 100-phrase**
+‡ Kokoro ANE Mandarin CER measured on the **full 100-phrase**
 `minimax-chinese` corpus via `whisper-large-v3` (Python CPU FP32,
-[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py)) on
-the WAVs rendered by `tts-benchmark --backend cosyvoice3 --corpus
-minimax-chinese --skip-asr --audio-dir <dir>`: **macro CER 1.68%
-(0.0168)**, **micro CER 1.84% (0.0184)** across 100 phrases.
-Whisper is the source of truth here because Cohere Transcribe q8
-hit a `MILCompilerForANE` cache failure on this M2 host and ran on
-the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple
-hours for the full 100-phrase set vs. ~70 min for whisper). WER is
-omitted because Mandarin has no word boundaries and `WERCalculator`
-splits on whitespace, so word-level WER reads near 100% and is
-meaningless.
+[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py))
+against the WAVs rendered by `tts-benchmark --backend kokoro-ane
+--variant mandarin --voice zf_001 --corpus minimax-chinese
+--skip-asr`: **macro CER 4.01% (0.0401)**, **micro CER 4.14%
+(0.0414)** across 100 phrases (table reports the macro figure).
+WER is omitted because Mandarin has no word boundaries and
+`WERCalculator` splits on whitespace — word-level WER reads near
+100% and is meaningless. Cohere Transcribe q8 hit a
+`MILCompilerForANE` cache failure on this M2 host, so whisper is
+the local source of truth for Mandarin CER.
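The macro and micro CER figures in the footnote above differ only in how per-phrase errors are pooled. The sketch below shows both, plus why whitespace-split WER degenerates on Mandarin. It is a simplified illustration: it uses a plain character-level Levenshtein distance and skips the text normalization a real scorer such as `Scripts/whisper_zh_cer.py` would apply. All names are hypothetical.

```swift
/// Levenshtein distance over characters (substitute / insert / delete, cost 1).
func editDistance(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    for i in 1...a.count {
        var curr = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = curr
    }
    return prev[b.count]
}

struct CerSummary {
    let macro: Double  // mean of per-phrase CER
    let micro: Double  // pooled edits / pooled reference characters
}

func cer(_ pairs: [(ref: String, hyp: String)]) -> CerSummary {
    var perPhrase: [Double] = []
    var edits = 0
    var refChars = 0
    for (ref, hyp) in pairs {
        let r = Array(ref)
        guard !r.isEmpty else { continue }  // skip empty references
        let d = editDistance(r, Array(hyp))
        perPhrase.append(Double(d) / Double(r.count))
        edits += d
        refChars += r.count
    }
    let macro = perPhrase.isEmpty ? 0 : perPhrase.reduce(0.0, +) / Double(perPhrase.count)
    let micro = refChars == 0 ? 0 : Double(edits) / Double(refChars)
    return CerSummary(macro: macro, micro: micro)
}

// Made-up transcripts: one exact match, one single-character substitution.
let summary = cer([
    (ref: "你好", hyp: "你好"),
    (ref: "今天天气很好", hyp: "今天天汽很好"),
])
print(summary.macro, summary.micro)

// Whitespace splitting sees a Mandarin sentence as a single "word".
let wordCount = "今天天气很好".split(separator: " ").count
print(wordCount)
```

Because the whole sentence counts as one word, a single wrong character makes that word wrong, which is why word-level WER reads near 100% on this corpus.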

 ∥ Magpie: batch-only. `synthesize(...)` returns one
 `MagpieSynthesisResult` after the full AR + codec pipeline completes,
 so `ttft_ms == synth_ms`. Long inputs are sentence-split internally
 (NanoCodec 256-frame static cap) and AR(N+1) ‖ codec(N) chunk-level
 pipelining overlaps the next chunk's AR loop with the current chunk's
-codec pass — wallclock optimization, not incremental yield.
+codec pass — wallclock optimization, not incremental yield. The
+sub-1.5 s TTFA work referenced in issue #590 (fused sampler +
+24-frame cap) lives on `feat/magpie-lt-fusion`, not `main`.
+
+¶ StyleTTS2 footprint is the sum of the shipped iteration_3 mlpackages
+(text encoder + bert + ref_encoder + post_albert + alignment + prosody
++ noise + decoder + tail). The shipped ref_encoder is exported with
+a fixed `[1, 1, 80, 231]` mel shape, so reference audio must be
+exactly 2.875 s @ 24 kHz (300-hop). The benchmark harness expects
+the caller to trim externally; mismatched durations error out at
+predict time.

 ### Kokoro ANE — per-stage breakdown (default preset, MiniMax-English)

-Means across 100 `minimax-english` phrases on M2. Stages map to the
-7-CoreML-graph split documented in [KokoroAne.md](KokoroAne.md). Vocoder
-+ noise together account for ~92% of synth time, which is the natural
-target for any further per-stage compute-unit re-tuning. The MiniMax
-mean is meaningfully higher than the prior Harvard-sentences run
-because phrases 81–100 are paragraph-length news / story sentences.
+Means across 100 `minimax-english` phrases on M2 (`af_heart`,
+post-laishere 7-graph chain). Stages map to the 7-CoreML-graph split
+documented in [KokoroAne.md](KokoroAne.md). Vocoder + noise together
+account for ~90% of synth time, which is the natural target for any
+further per-stage compute-unit re-tuning.

 | Stage         | Mean ms | % of total |
 |---------------|---------|------------|
-| `albert`      | 28.2    | 2.0%       |
-| `post_albert` | 12.1    | 0.9%       |
-| `alignment`   | 1.8     | 0.1%       |
-| `prosody`     | 49.2    | 3.5%       |
-| `noise`       | 242.6   | 17.4%      |
-| `vocoder`     | 1039.8  | 74.4%      |
-| `tail`        | 24.6    | 1.8%       |
-| **total**     | 1398.4  | 100%       |
+| `albert`      |   24.5  |  2.5%      |
+| `post_albert` |    9.3  |  1.0%      |
+| `alignment`   |    1.4  |  0.1%      |
+| `prosody`     |   40.0  |  4.1%      |
+| `noise`       |  169.3  | 17.5%      |
+| `vocoder`     |  704.4  | 72.9%      |
+| `tail`        |   17.0  |  1.8%      |
+| **total**     |  965.9  | 100%       |

-### Magpie — per-stage breakdown (default preset, MiniMax-English)
+### Magpie — per-stage breakdown

-Means across 100 `minimax-english` phrases on M2 (`John` voice, en,
-default compute units), captured during the original one-shot
-profiling run. `ar_loop` is the umbrella for the per-step
-`decoder_step` + `sampler` (so it is not added on top in the total).
-`nanocodec` runs concurrently with the next chunk's AR loop via
-chunk-level pipelining inside `synthesize(...)`, which is why the
-per-stage means do not sum to total warm-synth mean. The AR loop
-dominates the wall clock, and its cost grows super-linearly with
-phrase length — long news / story phrases drive the long-tail p95.
-
-| Stage              | Mean ms |
-|--------------------|---------|
-| `text_encoder`     | 91      |
-| `prefill`          | 281     |
-| `ar_loop`          | 17946   |
-| └── `decoder_step` | 14840   |
-| └── `sampler`      | 3081    |
-| `nanocodec`        | 17948   |
+Per-stage timings (`text_encoder`, `prefill`, `ar_loop`,
+`decoder_step`, `sampler`, `nanocodec`) are still populated on
+`MagpieSynthesisResult.timings` for callers that want them — see
+[`MagpieTypes.swift`](../../Sources/FluidAudio/TTS/Magpie/MagpieTypes.swift).
+This document does not currently re-publish the per-stage table on
+`main`: the AR loop dominates and its absolute numbers are
+in active flux on `feat/magpie-lt-fusion` (fused sampler + 24-frame
+NanoCodec cap). Republish here once that branch lands on `main`.

 ### About the WER / CER numbers

@@ -212,97 +246,3 @@ discussion](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/
 absolute WER is best read **relatively** (backend A vs. backend B on
 the same corpus + same ASR + same normalizer) rather than against
 raw paper numbers.
-
-## CosyVoice3 Decode budget cap
-
-CosyVoice3's Flow CFM was exported with a fixed input shape of
-`[1, 250]` speech tokens (`flowTotalTokens` in
-`CosyVoice3Constants.swift:45`). The LLM-Decode AR loop is allowed to
-emit up to `flowTotalTokens − N_prompt` tokens before being cut off
-(typically ~163 generated tokens after the speech-prompt portion).
-At `tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000`
-that's **40 ms of audio per generated token**, so the loop produces
-**at most ~6.5 s of speech per phrase**, regardless of how long the
-input text is.
-
-When the AR loop exits because it ran out of budget (i.e. no EOS
-token in `stopRange = 6_561…6_760`) instead of natural termination,
-`CosyVoice3Synthesizer` now:
-
-1. Logs a `.warning` (one-shot per phrase) naming the
-   `decoded.count / maxNew` budget and the produced audio duration.
-2. Sets `CosyVoice3SynthesisResult.finishedOnEos = false`, which the
-   benchmark harness surfaces as the `finished_on_eos` field on each
-   phrase in the JSON report.
-
-Footprint on the cantonese corpus (`minimax-cantonese`,
-100 phrases) **without the chunker**: 80 / 100 phrases would hit the
-cap, all producing exactly 163 generated tokens / ~6.5 s of audio.
-The mandarin corpus sees a much lower truncation rate because
-MiniMax-zh phrases are shorter on average.
-
-The structural fix — re-exporting the Flow CFM from
-[`mobius-cosyvoice3`](https://github.com/voicelink-ai/mobius-cosyvoice3)
-with a larger fixed input shape (e.g. `[1, 500]`) — is upstream
-work; bumping the constant in Swift alone would make the Flow
-input/output shapes mismatch at predict time. The shipped workaround
-is the call-site [auto-chunker](#cosyvoice3-auto-chunker), which
-drops cantonese truncation from 80/100 → 5/100 by splitting long
-inputs at clause boundaries and crossfading the results.
-
-Surfaced in
-`CosyVoice3Synthesizer.synthesize`
-(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift`)
-and
-`CosyVoice3SynthesisResult.finishedOnEos`
-(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift`).
-
-## CosyVoice3 auto-chunker
-
-Re-exporting Flow CFM with a larger fixed input shape is gated on
-upstream conversion work. Until that lands, `CosyVoice3TtsManager`
-splits long inputs at the call site, synthesizes each chunk
-independently, and merges with an 8 ms equal-power cosine crossfade.
-
-**Splitter policy** (`CosyVoice3TextChunker`):
-
-- **Hard enders** commit always: `.`, `!`, `?`, `。`, `！`, `？`,
-  `\n`.
-- **Soft enders** commit only when the running estimate is at or past
-  the budget: `，`, `、`, `；`, `：`, `;`, `,`, ASCII space.
-- **Force-split** at `budget + 30` tokens of overshoot if no natural
-  boundary appeared (rare; mostly continuous CJK with no
-  punctuation).
-
-**Token-rate estimate** (calibrated against minimax-zh + minimax-yue
-runs):
-
-| Char class | Tokens / char | Rationale                                                    |
-|------------|---------------|--------------------------------------------------------------|
-| CJK        | 7.5           | worst-case observed in real generation; varies 5.5–9 per char |
-| ASCII      | 1.5           | matches BPE rate on English text                              |
-| Other      | 2.5           | conservative for accented Latin / non-CJK Unicode             |
-
-`defaultMaxSpeechTokens` is **110**, leaving margin under the
-250-token Flow cap minus typical 60–90 token speech-prompt context.
-
-**Concatenation**: 8 ms equal-power cosine crossfade at 24 kHz
-between adjacent chunks; single-chunk path short-circuits to plain
-copy.
-
-**Validation** (full `minimax-cantonese`, 100 phrases, M2):
-
-| Metric                                    | Pre-chunker | Post-chunker | Δ          |
-|-------------------------------------------|-------------|--------------|------------|
-| `finished_on_eos=false` (truncated)       | 80 / 100    | **5 / 100**  | −94%       |
-| Longest audio output                      | 6.5 s       | **16.1 s**   | +148%      |
-| agg-RTFx                                  | 0.245×      | 0.249×       | +1.6%      |
-| TTFT p50                                  | 23.9 s      | 35.7 s       | +49%       |
-| TTFT p95                                  | 41.2 s      | 60.5 s       | +47%       |
-| Peak RSS                                  | 2016 MB     | 3264 MB      | +62%       |
-
-The 5/100 residual is the long-tail token-rate worst case (some
-Cantonese characters generate >9 speech tokens); raising the
-per-CJK heuristic further would over-fragment short phrases.
-Cleaner fix is the upstream Flow re-export.
-
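For reference, the decode-budget arithmetic and the token-rate heuristic from the removed CosyVoice3 sections can be sketched as below. The constants come from the removed text; the function name and the CJK test (CJK Unified Ideographs block only) are assumptions made for illustration and may not match `CosyVoice3TextChunker`.

```swift
// Constants quoted in the removed "Decode budget cap" section.
let tokenMelRatio = 2
let hiftSamplesPerFrame = 480
let cosySampleRate = 24_000
let maxGeneratedTokens = 163  // typical budget after the speech prompt

// 2 * 480 / 24000 = 0.04 s of audio per generated token.
let secondsPerToken = Double(tokenMelRatio * hiftSamplesPerFrame) / Double(cosySampleRate)
let maxSeconds = Double(maxGeneratedTokens) * secondsPerToken
print(secondsPerToken, maxSeconds)

/// Rough speech-token estimate using the per-class rates from the table.
/// Hypothetical helper; the CJK range check is a simplification.
func estimatedSpeechTokens(_ text: String) -> Double {
    text.unicodeScalars.reduce(0.0) { total, scalar in
        if (0x4E00...0x9FFF).contains(scalar.value) { return total + 7.5 }  // CJK
        if scalar.isASCII { return total + 1.5 }                            // ASCII
        return total + 2.5                                                  // other
    }
}

let budget = 110.0  // defaultMaxSpeechTokens from the removed text
let phrase = "今天天气很好今天天气很好今天天"  // 15 CJK characters
print(estimatedSpeechTokens(phrase), estimatedSpeechTokens(phrase) >= budget)
```

With these rates, a run of 15 CJK characters already reaches the 110-token budget, so the next soft ender would commit a chunk.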
@@ -115,6 +115,18 @@ public actor PunctuationCommitLayer {
    /// Active debounce timer task.
    private var debounceTask: Task<Void, Never>?

+    /// Monotonically increasing generation for the active debounce timer.
+    ///
+    /// Incremented whenever a pending timer is invalidated (new partial
+    /// text, EOU, manual commit, reset, or a fresh timer being started).
+    /// The timer task captures the generation it was created under and
+    /// re-checks it after landing back on the actor, which closes the
+    /// race where `Task.sleep` already returned and the
+    /// `!Task.isCancelled` guard passed before a subsequent call to
+    /// `processPartialText` / `processEOU` / `manualCommit` / `reset`
+    /// requested cancellation.
+    private var debounceGeneration: UInt64 = 0
+
    /// Callback invoked when updates occur.
    private var updateCallback: (@Sendable (CommitLayerUpdate) -> Void)?

@@ -149,7 +161,7 @@ public actor PunctuationCommitLayer {
    /// - Returns: Update containing committed text, ghost text, and commit reason.
    public func processPartialText(_ text: String) async -> CommitLayerUpdate {
        // Cancel existing debounce timer
-        debounceTask?.cancel()
+        invalidateDebounce()
        lastUpdateTime = Date()

        // Find last punctuation mark in text
@@ -223,7 +235,7 @@ public actor PunctuationCommitLayer {
    ///
    /// - Returns: Update with all text committed and EOU commit reason.
    public func processEOU() async -> CommitLayerUpdate {
-        debounceTask?.cancel()
+        invalidateDebounce()
        lastUpdateTime = Date()

        // EOU signals end of utterance: commit everything
@@ -262,7 +274,7 @@ public actor PunctuationCommitLayer {
    ///
    /// - Returns: Update with ghost text promoted to committed text.
    public func manualCommit() async -> CommitLayerUpdate {
-        debounceTask?.cancel()
+        invalidateDebounce()
        lastUpdateTime = Date()

        guard !ghostText.isEmpty else {
@@ -298,7 +310,7 @@ public actor PunctuationCommitLayer {

    /// Resets the commit layer, clearing all committed and ghost text.
    public func reset() async {
-        debounceTask?.cancel()
+        invalidateDebounce()
        committedText = ""
        ghostText = ""
        lastUpdateTime = Date()
@@ -323,9 +335,19 @@ public actor PunctuationCommitLayer {

    // MARK: - Private Helpers

+    /// Cancels any pending debounce timer and bumps the generation so any
+    /// timer task already past its `!Task.isCancelled` guard becomes a
+    /// no-op when it lands back on the actor.
+    private func invalidateDebounce() {
+        debounceTask?.cancel()
+        debounceTask = nil
+        debounceGeneration &+= 1
+    }
+
    /// Starts a debounce timer that commits ghost text after the timeout expires.
    private func startDebounceTimer() {
-        debounceTask?.cancel()
+        invalidateDebounce()
+        let pendingGeneration = debounceGeneration

        debounceTask = Task { [weak self, debounceTimeout, commitOnTimeout] in
            try? await Task.sleep(nanoseconds: UInt64(debounceTimeout * 1_000_000_000))
@@ -337,11 +359,21 @@ public actor PunctuationCommitLayer {
            if commitOnTimeout {
                // Check cancellation again before acquiring actor executor
                guard !Task.isCancelled else { return }
-                await self.commitGhostText(reason: .debounceTimeout)
+                await self.fireDebounceCommit(generation: pendingGeneration)
            }
        }
    }

+    /// Commits ghost text only if no newer timer / commit / reset has
+    /// superseded the timer that scheduled this commit. Closes the race
+    /// where `Task.sleep` returns and `!Task.isCancelled` is observed
+    /// `false` *before* a subsequent call to `processPartialText`,
+    /// `processEOU`, `manualCommit`, or `reset` cancels us.
+    private func fireDebounceCommit(generation: UInt64) async {
+        guard generation == debounceGeneration else { return }
+        await commitGhostText(reason: .debounceTimeout)
+    }
+
    /// Commits ghost text to committed text with the specified reason.
    ///
    /// - Parameter reason: The reason for committing.
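The generation-token pattern used in the `PunctuationCommitLayer` hunks above can be shown in isolation. The following is a minimal sketch, not the production type: `DebouncedCommitter` and its members are hypothetical names, and the commit is reduced to a counter. It assumes Swift 5.7 or later for top-level `await`.

```swift
actor DebouncedCommitter {
    private var generation: UInt64 = 0
    private var task: Task<Void, Never>?
    private(set) var commits = 0

    /// Cancels any pending timer and bumps the generation so a timer that
    /// already passed its cancellation check becomes a no-op.
    func invalidate() {
        task?.cancel()
        task = nil
        generation &+= 1
    }

    /// Starts a fresh timer tagged with the generation it was created under.
    func schedule(afterNanoseconds delay: UInt64) {
        invalidate()
        let pending = generation
        task = Task { [weak self] in
            try? await Task.sleep(nanoseconds: delay)
            guard !Task.isCancelled else { return }
            await self?.fire(generation: pending)
        }
    }

    /// Runs on the actor, so the generation check and the commit are atomic
    /// with respect to other actor calls.
    private func fire(generation pending: UInt64) {
        guard pending == generation else { return }
        commits += 1
    }
}

let committer = DebouncedCommitter()
await committer.schedule(afterNanoseconds: 50_000_000)
await committer.invalidate()  // supersedes the first timer
await committer.schedule(afterNanoseconds: 50_000_000)
try? await Task.sleep(nanoseconds: 300_000_000)
let observed = await committer.commits
print(observed)
```

Cancellation alone is advisory: a task that has already passed its `Task.isCancelled` check can still hop onto the actor. Comparing generations inside the actor-isolated method is what makes the stale timer harmless. This usage does not force that exact interleaving; it only shows that a superseded timer does not commit.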
@@ -14,6 +14,7 @@ import Foundation
 ///   kokoro        — single-graph CPU+GPU (chunk-level only)
 ///   pocket-tts    — streaming flow-matching (no per-stage timings)
 ///   magpie        — encoder-decoder + NanoCodec (6-stage timings, slow)
+///   styletts2     — LibriTTS iteration_3, zero-shot w/ reference audio
 ///   cosyvoice3    — Mandarin LLM-based (Mandarin corpus only, no WER)
 ///
 /// Usage:
@@ -105,6 +106,8 @@ public enum TtsBenchmarkCommand {
        var cohereModelDirArg: String?
        var asrLanguageArg: String?
        var cohereComputeUnitsArg: String?
+        var referencePath: String?
+        var variantArg: String?

        var i = 0
        while i < arguments.count {
@@ -177,6 +180,16 @@ public enum TtsBenchmarkCommand {
                    cohereComputeUnitsArg = arguments[i + 1]
                    i += 1
                }
+            case "--reference":
+                if i + 1 < arguments.count {
+                    referencePath = arguments[i + 1]
+                    i += 1
+                }
+            case "--variant":
+                if i + 1 < arguments.count {
+                    variantArg = arguments[i + 1]
+                    i += 1
+                }
            case "--help", "-h":
                printUsage()
                return
@@ -245,9 +258,11 @@ public enum TtsBenchmarkCommand {
        do {
            switch backend {
            case .kokoroAne:
+                let kaVariant = parseKokoroAneVariant(variantArg)
                try await runKokoroAne(
                    phrases: phrases, corpusLabel: corpusLabel,
-                    voice: voice ?? KokoroAneConstants.defaultVoice,
+                    variant: kaVariant,
+                    voice: voice ?? kaVariant.defaultVoice,
                    preset: preset, outputJson: outputJson, audioDir: audioDir,
                    asrChoice: asrChoice)
            case .kokoro:
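`parseKokoroAneVariant` and `KokoroAneVariant.defaultVoice` are not shown in this diff. A sketch of how the fallback described in the PR text might look, using hypothetical stand-in names (`BenchVariant`, `parseVariant`), is given below. Treating unrecognized values the same as an unset flag is an assumption; the real parser may reject them instead.

```swift
/// Hypothetical stand-in for KokoroAneVariant.
enum BenchVariant: String {
    case english
    case mandarin

    /// Variant-aware default voice, using the voices named in the PR.
    var defaultVoice: String {
        switch self {
        case .english: return "af_heart"
        case .mandarin: return "zf_001"
        }
    }
}

/// Falls back to `.english` when the flag is unset. Falling back on an
/// unrecognized value as well is an assumption of this sketch.
func parseVariant(_ raw: String?) -> BenchVariant {
    guard let raw, let variant = BenchVariant(rawValue: raw.lowercased()) else {
        return .english
    }
    return variant
}

// An explicit --voice wins; otherwise the variant supplies the default.
func resolveVoice(explicit: String?, variant: BenchVariant) -> String {
    explicit ?? variant.defaultVoice
}

print(resolveVoice(explicit: nil, variant: parseVariant("mandarin")))
print(resolveVoice(explicit: nil, variant: parseVariant(nil)))
```

Resolving the voice after the variant keeps `--voice` optional for Mandarin runs while leaving existing English invocations unchanged.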
@@ -269,6 +284,12 @@ public enum TtsBenchmarkCommand {
                    speakerName: speakerName, languageName: languageName,
                    preset: preset, outputJson: outputJson, audioDir: audioDir,
                    asrChoice: asrChoice)
+            case .styleTts2:
+                try await runStyleTTS2(
+                    phrases: phrases, corpusLabel: corpusLabel,
+                    referencePath: referencePath,
+                    preset: preset, outputJson: outputJson, audioDir: audioDir,
+                    asrChoice: asrChoice)
            case .cosyVoice3:
                try await runCosyVoice3(
                    phrases: phrases, corpusLabel: corpusLabel,
@@ -287,6 +308,7 @@ public enum TtsBenchmarkCommand {
    private static func runKokoroAne(
        phrases: [(category: String, text: String)],
        corpusLabel: String,
+        variant: KokoroAneVariant,
        voice: String,
        preset: TtsComputeUnitPreset,
        outputJson: String?,
@@ -294,7 +316,7 @@ public enum TtsBenchmarkCommand {
        asrChoice: AsrChoice
    ) async throws {
        let units = KokoroAneComputeUnits(preset: preset)
-        let manager = KokoroAneManager(defaultVoice: voice, computeUnits: units)
+        let manager = KokoroAneManager(variant: variant, defaultVoice: voice, computeUnits: units)

        let coldStart = Date()
        try await manager.initialize()
@@ -560,6 +582,80 @@ public enum TtsBenchmarkCommand {
        }
    }

+    // MARK: - StyleTTS2 driver
+
+    private static func runStyleTTS2(
+        phrases: [(category: String, text: String)],
+        corpusLabel: String,
+        referencePath: String?,
+        preset: TtsComputeUnitPreset,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice
+    ) async throws {
+        guard let referencePath, !referencePath.isEmpty else {
+            logger.error(
+                "styletts2 backend requires --reference <speaker-audio-file> "
+                    + "(resampled to 24 kHz mono; must be exactly 2.875 s long).")
+            exit(1)
+        }
+        let referenceURL = resolveURL(referencePath, isDirectory: false)
+        guard FileManager.default.fileExists(atPath: referenceURL.path) else {
+            logger.error("Reference audio not found: \(referenceURL.path)")
+            exit(1)
+        }
+        logger.info("StyleTTS2 reference audio: \(referenceURL.path)")
+
+        let units = preset.uniformUnits ?? .cpuAndNeuralEngine
+        let manager = StyleTTS2Manager(computeUnits: units)
+
+        let coldStart = Date()
+        try await manager.initialize()
+        let coldStartS = Date().timeIntervalSince(coldStart)
+        logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
+
+        let firstStart = Date()
+        _ = try await manager.synthesize(
+            text: "Initialization warm-up.",
+            referenceAudioURL: referenceURL)
+        let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
+        logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
+
+        try await runPhraseLoop(
+            backendId: "styletts2",
+            voiceLabel: referenceURL.lastPathComponent,
+            corpusLabel: corpusLabel,
+            phrases: phrases,
+            preset: preset,
+            coldStartS: coldStartS,
+            firstSynthMs: firstSynthMs,
+            outputJson: outputJson,
+            audioDir: audioDir,
+            asrChoice: asrChoice,
+            extraSummary: [
+                "reference": referenceURL.path,
+                "alpha": Double(StyleTTS2Constants.defaultAlpha),
+                "beta": Double(StyleTTS2Constants.defaultBeta),
+            ]
+        ) { text in
+            // StyleTTS2 is a one-shot, diffusion-based synthesizer with no
+            // streaming yield, so TTFT == synthMs. The per-phrase mel
+            // recompute is small next to the 5-step ADPM2 + decoder cost.
+            let t0 = Date()
+            let samples = try await manager.synthesize(
+                text: text, referenceAudioURL: referenceURL)
+            let synthMs = Date().timeIntervalSince(t0) * 1000
+            return BackendPhraseSample(
+                synthMs: synthMs,
+                ttftMs: synthMs,
+                samples: samples,
+                sampleRate: StyleTTS2Constants.sampleRate,
+                stageMs: [:],
+                extraFields: [:]
+            )
+        }
+    }
+
    // MARK: - CosyVoice3 driver

    private static func runCosyVoice3(
@@ -885,6 +981,7 @@ public enum TtsBenchmarkCommand {
        case kokoro
        case pocketTts
        case magpie
+        case styleTts2
        case cosyVoice3

        var defaultCorpus: String {
@@ -905,6 +1002,8 @@ public enum TtsBenchmarkCommand {
            return .pocketTts
        case "magpie":
            return .magpie
+        case "styletts2", "style-tts2", "styletts", "style-tts":
+            return .styleTts2
        case "cosyvoice3", "cosyvoice", "cosy":
            return .cosyVoice3
        default:
@@ -938,6 +1037,17 @@ public enum TtsBenchmarkCommand {
        }
    }

+    private static func parseKokoroAneVariant(_ name: String?) -> KokoroAneVariant {
+        switch name?.lowercased() {
+        case "mandarin", "zh", "chinese", "zh-cn":
+            return .mandarin
+        case "english", "en", "en-us":
+            return .english
+        default:  // nil, empty, or unrecognized names fall back to English.
+            return .english
+        }
+    }
+
    // MARK: - Helpers

    private static func percentile(_ sorted: [Double], _ p: Double) -> Double {
@@ -1193,6 +1303,7 @@ public enum TtsBenchmarkCommand {
              kokoro        Single-graph CPU+GPU
              pocket-tts    Streaming flow-matching (multilingual)
              magpie        Encoder-decoder + NanoCodec (per-stage, slow)
+              styletts2     LibriTTS iteration_3, zero-shot, requires --reference
              cosyvoice3    Mandarin LLM-based (auto-picks Cohere ASR for zh)

            Options:
@@ -1232,13 +1343,22 @@ public enum TtsBenchmarkCommand {
                                        fails (`MILCompilerForANE error: …`)
                                        — avoids the multi-minute fallback
                                        compile on first call.
+              --reference <path>        StyleTTS2 speaker-reference audio
+                                        (required for --backend styletts2;
+                                        resampled to 24 kHz mono internally;
+                                        must be exactly 2.875 s long)
+              --variant <name>          Kokoro ANE variant: english (default) or
+                                        mandarin (aliases: zh, zh-cn, chinese)
              --help, -h                Show this help

            Examples:
              fluidaudio tts-benchmark --backend kokoro-ane --output-json bench.json
+              fluidaudio tts-benchmark --backend kokoro-ane --variant mandarin \\
+                  --voice zf_001 --corpus minimax-chinese --skip-asr
              fluidaudio tts-benchmark --backend kokoro --corpus minimax-english
              fluidaudio tts-benchmark --backend pocket-tts --corpus minimax-german --language german
              fluidaudio tts-benchmark --backend magpie --speaker sofia --language en
+              fluidaudio tts-benchmark --backend styletts2 --reference speaker.wav
              fluidaudio tts-benchmark --backend cosyvoice3 --corpus minimax-chinese \\
                  --asr-backend cohere --cohere-model-dir ~/.fluidaudio/cohere/q8