docs(tts): refresh Benchmarks.md per #590; wire styletts2 + --variant into tts-benchmark (#593)

## Summary

Closes the work tracked in #590: bring `Documentation/TTS/Benchmarks.md`
into agreement with what's actually shipped on `main` for CoreML TTS
backends, and add the two CLI affordances needed to benchmark the
in-scope backend × language matrix.

### Doc changes (`Documentation/TTS/Benchmarks.md`)

- Single consolidated **per-backend table** that merges basic info
(license, language+voice, footprint in **GB**, sample rate, max chunk
per pass, streaming flag) with performance metrics (TTFT p50/p95, synth
p50/p95, agg RTFx, peak RSS, WER %, CER %). Five rows: Kokoro ANE en
(`af_heart`), Kokoro ANE zh (`zf_001`), PocketTTS en (`alba` 6L), Magpie
en (`John`, batch-only on `main`), StyleTTS2 en (LibriTTS iteration_3,
zero-shot).
- Dropped from the top-line per scope decision: non-ANE Kokoro,
CosyVoice3 zh, PocketTTS 24L variants, Hindi/Cantonese rows. CosyVoice3
narrative sections (decode budget cap + auto-chunker validation) stay
verbatim.
- Refreshed Kokoro ANE per-stage breakdown (post-laishere 7-graph
chain).
- Replaced the old Magpie per-stage table with a pointer paragraph
(`MagpieSynthesisResult.timings` is still populated for callers; sub-1.5
s TTFA work referenced in #590 lives on `feat/magpie-lt-fusion`, not
`main`).
- Corrected PocketTTS footprint to `fp16 ~0.77 / int8 ~0.55 GB` (was
`~140 / ~520 MB`); enumerated all 10 packs in the corpus matrix; added
zh to the Kokoro ANE corpus row; added a StyleTTS2 row.

### CLI changes
(`Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift`)

- New `styletts2` / `style-tts2` backend wired to
`StyleTTS2Manager.synthesize(text:referenceAudioURL:)`. Requires
`--reference <wav>`; the shipped iteration_3 `ref_encoder` is fixed at
`[1, 1, 80, 231]`, so the reference must be exactly **2.875 s @ 24 kHz
mono** — the harness errors out at predict time on mismatched durations.
- New `--variant {english|mandarin}` flag for `kokoro-ane` so the
`zf_001` Mandarin voice pack can be benchmarked alongside `af_heart`.
Falls back to `english` when unset; the manager constructor now receives
the parsed `KokoroAneVariant` and the default voice is variant-aware.
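
The 2.875 s requirement can be cross-checked against the fixed `[1, 1, 80, 231]` mel shape. The sketch below is only a sanity check, assuming a centered STFT where the frame count is `samples / hop + 1` and the 300-sample hop mentioned in the StyleTTS2 footnote of `Benchmarks.md`. The centering convention is an assumption, and `melFrames` is a hypothetical helper, not a FluidAudio API.

```swift
/// Hypothetical helper (not part of FluidAudio): number of mel frames a
/// centered STFT yields, assuming frames = samples / hop + 1.
func melFrames(sampleCount: Int, hop: Int) -> Int {
    sampleCount / hop + 1
}

let sampleRate = 24_000
let hop = 300  // "300-hop" per the StyleTTS2 footnote
let referenceSamples = Int(2.875 * Double(sampleRate))  // 69_000 samples

let frames = melFrames(sampleCount: referenceSamples, hop: hop)
print(frames)  // 231, matching the last dimension of the ref_encoder input
```

Under the same assumption, a clip of any other length produces a different frame count, which is consistent with the predict-time shape error described above.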

### Methodology

100-phrase MiniMax-Multilingual on MacBook Air M2 (16 GB, macOS 26, on
AC), `--compute-units default`. English WER/CER via Parakeet TDT
roundtrip; Mandarin CER via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) — macro 4.01% / micro 4.14% across all 100
zh phrases. WER omitted for Mandarin because `WERCalculator` splits on
whitespace.
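
To illustrate the macro / micro distinction and why whitespace-split WER is unusable for Mandarin, here is a minimal sketch. It assumes the usual definitions (macro = mean of per-phrase CER, micro = total edits over total reference characters). The phrases are made up for illustration and are not benchmark data, and `editDistance` is a hypothetical helper rather than the repository's `WERCalculator`.

```swift
/// Character-level Levenshtein distance (hypothetical helper).
func editDistance(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    for i in 1...a.count {
        var cur = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = cur
    }
    return prev[b.count]
}

// Illustrative (reference, hypothesis) pairs.
let pairs: [(ref: String, hyp: String)] = [
    ("你好", "你号"),                  // 1 substitution over 2 characters
    ("今天天气非常好啊", "今天天气非常好啊"),  // 0 errors over 8 characters
]

var perPhraseCER: [Double] = []
var totalEdits = 0
var totalChars = 0
for (ref, hyp) in pairs {
    let r = Array(ref)
    let edits = editDistance(r, Array(hyp))
    perPhraseCER.append(Double(edits) / Double(r.count))
    totalEdits += edits
    totalChars += r.count
}

let macroCER = perPhraseCER.reduce(0, +) / Double(perPhraseCER.count)  // mean of 0.5 and 0.0
let microCER = Double(totalEdits) / Double(totalChars)                 // 1 edit over 10 characters

// Whitespace splitting sees the whole Mandarin phrase as a single "word".
let wordCount = pairs[1].ref.split(separator: " ").count
print(macroCER, microCER, wordCount)
```

Because each phrase collapses to one whitespace-delimited token, a single wrong character turns the whole phrase into a word error, which is why only CER is reported for the zh row.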

## Test plan

- [x] `swift build` clean on `main`-based branch.
- [x] `swift format lint --recursive --configuration .swift-format
Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift` clean.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend
styletts2 --reference ref.wav --corpus minimax-english --output-json
/tmp/styletts2-smoke.json` — produces a valid JSON report.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend
kokoro-ane --variant mandarin --voice zf_001 --corpus minimax-chinese
--skip-asr --output-json /tmp/kokoro-zh-smoke.json` — pulls the Mandarin
voice pack and produces audio.
- [x] Full 100-phrase runs for all five table rows produced under
`Benchmarks/tts/runs/590/` (gitignored); table numbers come straight
from those JSON reports.
- [ ] Reviewer cross-check: footnote markers (`*`, `‡`, `∥`, `¶`) in the
consolidated table all have matching paragraphs below.
Alex
2026-05-09 21:47:45 -04:00
committed by GitHub
parent a400080380
commit 2c45df3035
3 changed files with 278 additions and 186 deletions
+118 -178
@@ -5,9 +5,9 @@
> phrases / language, CC-BY-SA-4.0) — the same public corpus used
> by [MiniMax-Speech][mms], seed-tts-eval, and Gradium, so numbers
> here are directly paper-comparable.
> **Status:** Kokoro, Kokoro ANE, PocketTTS, Magpie all
> complete the English run; CosyVoice3 completes the full Mandarin
> run.
> **Status:** Kokoro ANE (English + Mandarin), PocketTTS (English),
> Magpie (English), and StyleTTS2 (English, zero-shot) all complete
> the full 100-phrase MiniMax run.
>
> [minimax]: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
> [mms]: https://arxiv.org/abs/2505.07916
@@ -53,10 +53,10 @@ Reference each language as `--corpus minimax-<lang>`:
| Backend | Default corpus | Other supported MiniMax languages |
|-------------|--------------------|------------------------------------------------|
| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) |
| PocketTTS | `minimax-english` | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` |
| Kokoro ANE | `minimax-english` | `english` (`af_heart`); Kokoro ANE also ships `chinese` (`--variant mandarin`, voice `zf_001`) |
| PocketTTS | `minimax-english` | 6L packs: `english`, `german`, `italian`, `portuguese`, `spanish`. 24L packs: `french_24l`, `german_24l`, `italian_24l`, `portuguese_24l`, `spanish_24l` |
| Magpie | `minimax-english` | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` |
| CosyVoice3 | `minimax-chinese` | `chinese`, `cantonese` |
| StyleTTS2 | `minimax-english` | `english` only (LibriTTS iteration_3, zero-shot from `--reference` audio) |
Lines beginning with `#` are comments. Custom corpora can still be
passed with `--corpus-path <file.txt>`.
@@ -64,26 +64,28 @@ passed with `--corpus-path <file.txt>`.
### Metrics
Per phrase:
- `ttft_ms` — time-to-first-audio. For one-shot / batch backends this
equals `synth_ms`. **PocketTTS** is benchmarked through
`synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first
80 ms audio frame (1920 samples @ 24 kHz). **Magpie** is batch-only
(`synthesize(...)` returns a single `MagpieSynthesisResult` after
the full AR + codec pipeline completes), so `ttft_ms == synth_ms`.
- `ttft_ms` — time-to-first-audio. The "first audio" granularity is
backend-defined; see [Audio chunk window
size](#audio-chunk-window-size) below for the per-backend numbers.
**PocketTTS** is benchmarked through `synthesizeStreaming`, so its
`ttft_ms` is the timestamp of the first 80 ms audio frame (1920
samples @ 24 kHz) — actually-perceptible TTFA. **Kokoro ANE,
Magpie, StyleTTS2** are batch / one-shot (`synthesize(...)` returns
the full waveform), so `ttft_ms == synth_ms == time-to-complete-wav`
for those — interpret it as full-wav latency, not as TTFA.
- `synth_ms` — total synth wall time.
- `audio_ms` — generated audio duration.
- `rtfx` — `audio_ms / synth_ms`.
- `wer`, `cer` — via Parakeet ASR roundtrip on the rendered WAV.
- `stage_ms` — per-stage breakdown (backend-specific keys; populated
for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS /
CosyVoice3).
for Kokoro ANE; empty for PocketTTS / Magpie /
StyleTTS2 in this report).
- Backend-specific extras: `encoder_tokens`, `acoustic_frames`,
`chunk_count`, `frame_count`, `code_count`, `finished_on_eos`,
`generated_token_count`, etc.
`chunk_count`, `frame_count`, `code_count`, `generated_token_count`,
etc.
Aggregates:
- `cold_start_s` — `manager.initialize()` wall time. CosyVoice3 also
includes voice-asset load.
- `cold_start_s` — `manager.initialize()` wall time.
- `first_synth_ms` — first synth call after init (still cold-ish).
- `ttft_ms_p50` / `ttft_ms_p95`.
- `warm_synth_ms_p50` / `warm_synth_ms_p95`.
@@ -92,6 +94,21 @@ Aggregates:
`task_vm_info_data_t.resident_size_peak`.
- Per-category macro WER / CER.
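
As a rough illustration of how these aggregates relate to the per-phrase fields, the sketch below computes p50 / p95 with a nearest-rank percentile and contrasts per-phrase `rtfx` with an aggregate ratio. Both the nearest-rank rule and the "total audio over total synth" definition of aggregate RTFx are assumptions made for this example. The harness's own `percentile` helper and aggregation may differ, and the timings are invented.

```swift
/// Nearest-rank percentile (assumed convention; hypothetical helper).
func nearestRankPercentile(_ values: [Double], _ p: Double) -> Double {
    precondition(!values.isEmpty && (0...100).contains(p))
    let sorted = values.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

// Invented per-phrase timings in milliseconds.
let synthMs: [Double] = [1000, 3000]
let audioMs: [Double] = [4000, 6000]

// Per-phrase rtfx = audio_ms / synth_ms, as defined in the metrics list.
let perPhraseRtfx = zip(audioMs, synthMs).map { $0 / $1 }  // [4.0, 2.0]

// Assumed aggregate: total audio duration over total synth wall time.
let aggregateRtfx = audioMs.reduce(0, +) / synthMs.reduce(0, +)  // 2.5

let latencies: [Double] = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
let p50 = nearestRankPercentile(latencies, 50)
let p95 = nearestRankPercentile(latencies, 95)
print(perPhraseRtfx, aggregateRtfx, p50, p95)
```

Under this assumed definition the aggregate (2.5) weights long phrases more heavily than the plain mean of per-phrase values (3.0), so the two should not be compared directly.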
### Audio chunk window size
What counts as "first audio" is backend-defined. The vocoder /
codec emits in fixed-size chunks; only **PocketTTS** is wired to
yield those chunks incrementally on `main`. Everything else returns
the full waveform after the full pipeline runs, so `ttft_ms` for
those backends measures full-wav latency rather than perceptual
TTFA. The consolidated [per-backend table](#per-backend-top-line)
below carries the per-backend sample rate, chunk window, and
streaming flag inline alongside the performance metrics.
For batch backends, "average latency" the user perceives is
`synth_ms` (full wav) rather than `ttft_ms` — they're equal in
that case, so the consolidated table just reports them once.
### Reproducibility
```bash
@@ -103,14 +120,29 @@ swift run fluidaudio tts-benchmark \
--compute-units default \
--output-json bench.json \
--audio-dir bench-wavs/
# Kokoro ANE Mandarin (skip Parakeet ASR; whisper CER scored separately).
swift run fluidaudio tts-benchmark \
--backend kokoro-ane --variant mandarin --voice zf_001 \
--corpus minimax-chinese --skip-asr \
--output-json bench-zh.json --audio-dir bench-wavs-zh/
# StyleTTS2 zero-shot (LibriTTS iteration_3). The shipped ref_encoder
# is fixed at [1, 1, 80, 231], so the reference must be exactly
# 2.875 s @ 24 kHz mono. Trim externally before invoking, e.g.:
# ffmpeg -i speaker.wav -t 2.875 -ar 24000 -ac 1 -c:a pcm_s16le ref.wav
swift run fluidaudio tts-benchmark \
--backend styletts2 --reference ref.wav \
--corpus minimax-english \
--output-json bench-styletts2.json --audio-dir bench-wavs-styletts2/
```
The harness writes a JSON report to `--output-json` and (optionally)
keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR
roundtrip. The default ASR backend is `parakeet` for English-only
runs and is skipped for CosyVoice3; pass `--asr-backend cohere
--cohere-model-dir <dir>` to score Mandarin (or any of the 14
Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).
runs; pass `--asr-backend cohere --cohere-model-dir <dir>` to score
Mandarin (or any of the 14 Cohere languages) against
[Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).
## Results
@@ -118,88 +150,90 @@ Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Coher
Reference machine: **MacBook Air, Apple M2 (2022), 8-core CPU /
8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26**
(`Mac14,2`, on AC). All English runs use `--compute-units default`,
voice = backend default
(`af_heart` for Kokoro, `alba` for PocketTTS, `John` for Magpie),
corpus = `minimax-english` (100 phrases), Parakeet TDT roundtrip for
WER / CER.
(`Mac14,2`, on AC). All runs use `--compute-units default`, 100
phrases per language. Voices are backend defaults
(`af_heart` for Kokoro ANE en, `zf_001` for Kokoro ANE zh,
`alba` for PocketTTS, `John` for Magpie, LibriTTS iteration_3 for
StyleTTS2). English WER / CER via Parakeet TDT roundtrip; Mandarin
CER via `whisper-large-v3`.
| Backend | License | Languages | Footprint | Cold start | TTFT p50 / p95\* | Synth p50 / p95 | Agg RTFx | Peak RSS | WER | CER | Notes |
|-------------|-------------|------------------------|-----------|------------|---------------------|---------------------|----------|----------|---------|---------|-------|
| Kokoro ANE | Apache-2.0 | en (af_heart only) | ~330 MB | 37.9 s | 1586 / 2515 ms | 1586 / 2515 ms | 5.19× | 738 MB | 0.108 | 0.040 | one-shot; per-stage CU sweep, 7-graph pipeline |
| Kokoro | Apache-2.0 | en (af_heart only) | ~330 MB | 92.2 s | 3113 / 4696 ms | 3113 / 4696 ms | 2.02× | 736 MB | 0.013 | 0.005 | one-shot; cleanest English ASR roundtrip |
| PocketTTS | research | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | **1244 / 4749 ms** | 8757 / 19174 ms | 0.61× | 1503 MB | 0.014 | 0.006 | **streaming**; TTFT is first 80 ms audio frame |
| Magpie | research | en/es/de/fr/it/vi/zh/hi | ~1.3 GB | 38.5 s∥ | 15080 / 29895 ms∥ | 15080 / 29895 ms∥ | 0.64×∥ | 762 MB∥ | 0.056 | 0.033 | **batch-only**; `ttft_ms == synth_ms`; split-K/V decoder; outputBackings fast path with latched fallback |
| CosyVoice3 | Apache-2.0 | zh (mandarin) | ~1.5 GB | 29.2 s† | 14091 / 23679 ms† | 14091 / 23679 ms† | 0.357׆ | 3302 MB† | n/a‡ | 0.017‡ | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via [auto-chunker](#cosyvoice3-auto-chunker) but not benchmarked (no yue ASR) |
One consolidated table per backend × language. **Basic info**
(license, language, footprint, sample rate, max chunk per pass,
streaming flag) is merged with **performance** (TTFT, synth, RTFx,
peak RSS, WER, CER) so there is a single source of truth.
| Backend | License | Language (voice) | Footprint | Sample rate | Max chunk per pass | Streaming | TTFT p50 / p95\* | Synth p50 / p95 | Agg RTFx | Peak RSS | WER | CER |
|------------|------------|---------------------------|----------------------------|-------------|------------------------------------------------------------------|-----------|-------------------|-------------------|-----------|----------|--------|--------|
| Kokoro ANE | Apache-2.0 | en (`af_heart`) | ~0.33 GB | 24 kHz | 510 phonemes / pass (≈25–30 s of audio) | No | **988 / 2068 ms** | 988 / 2068 ms | **7.47×** | 1027 MB | 10.8% | 4.0% |
| Kokoro ANE | Apache-2.0 | zh (`zf_001`) | ~0.33 GB | 24 kHz | 510 phonemes / pass (≈25–30 s of audio) | No | **956 / 1802 ms** | 956 / 1802 ms | 6.37× | 685 MB | n/a‡ | 4.0%‡ |
| PocketTTS | research | en (`alba`, 6L pack) | int8 ~0.55 GB | 24 kHz | 80 ms Mimi frame, streams until EOS (no fixed cap) | Yes | **710 / 1496 ms** | 5160 / 9801 ms | 1.10× | 1167 MB | 1.0% | 0.4% |
| Magpie | research | en (`John`) | ~1.3 GB | 22.05 kHz | 256 NanoCodec frames / pass (≈11.9 s); sentence-split for longer | No | 11470 / 26042 ms∥ | 11470 / 26042 ms∥ | 0.87×∥ | 543 MB∥ | 3.8% | 2.6% |
| StyleTTS2 | research | en (LibriTTS iteration_3) | ~0.67 GB¶ | 24 kHz | 256 tokens / pass (≈30 s of audio max) | No | 1574 / 3088 ms | 1574 / 3088 ms | 4.59× | 522 MB | 9.4% | 4.1% |
\* TTFT for **PocketTTS** is first-frame emit through the streaming
API; **Magpie** is batch-only (`ttft_ms == synth_ms`); the others
are one-shot, so `ttft_ms == synth_ms`.
API (perceptual TTFA). **Kokoro ANE / Magpie / StyleTTS2** all run
one-shot per phrase (no streaming yield on `main`), so for those
rows `ttft_ms == synth_ms == time-to-complete-wav`.
† CosyVoice3 chinese: 100/100, 0 errors, ASR skipped. Cold-start
dropped from 302.7 s to 29.2 s on the warm re-run.
‡ CosyVoice3 CER measured on the **full 100-phrase**
‡ Kokoro ANE Mandarin CER measured on the **full 100-phrase**
`minimax-chinese` corpus via `whisper-large-v3` (Python CPU FP32,
[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py)) on
the WAVs rendered by `tts-benchmark --backend cosyvoice3 --corpus
minimax-chinese --skip-asr --audio-dir <dir>`: **macro CER 1.68%
(0.0168)**, **micro CER 1.84% (0.0184)** across 100 phrases.
Whisper is the source of truth here because Cohere Transcribe q8
hit a `MILCompilerForANE` cache failure on this M2 host and ran on
the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple
hours for the full 100-phrase set vs. ~70 min for whisper). WER is
omitted because Mandarin has no word boundaries and `WERCalculator`
splits on whitespace, so word-level WER reads near 100% and is
meaningless.
[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py))
against the WAVs rendered by `tts-benchmark --backend kokoro-ane
--variant mandarin --voice zf_001 --corpus minimax-chinese
--skip-asr`: **macro CER 4.01% (0.0401)**, **micro CER 4.14%
(0.0414)** across 100 phrases (table reports the macro figure).
WER is omitted because Mandarin has no word boundaries and
`WERCalculator` splits on whitespace — word-level WER reads near
100% and is meaningless. Cohere Transcribe q8 hit a
`MILCompilerForANE` cache failure on this M2 host, so whisper is
the local source of truth for Mandarin CER.
∥ Magpie: batch-only. `synthesize(...)` returns one
`MagpieSynthesisResult` after the full AR + codec pipeline completes,
so `ttft_ms == synth_ms`. Long inputs are sentence-split internally
(NanoCodec 256-frame static cap) and AR(N+1) ‖ codec(N) chunk-level
pipelining overlaps the next chunk's AR loop with the current chunk's
codec pass — wallclock optimization, not incremental yield.
codec pass — wallclock optimization, not incremental yield. The
sub-1.5 s TTFA work referenced in issue #590 (fused sampler +
24-frame cap) lives on `feat/magpie-lt-fusion`, not `main`.
¶ StyleTTS2 footprint is the sum of the shipped iteration_3 mlpackages
(text encoder + bert + ref_encoder + post_albert + alignment + prosody
+ noise + decoder + tail). The shipped ref_encoder is exported with
a fixed `[1, 1, 80, 231]` mel shape, so reference audio must be
exactly 2.875 s @ 24 kHz (300-hop). The benchmark harness expects
the caller to trim externally; mismatched durations error out at
predict time.
### Kokoro ANE — per-stage breakdown (default preset, MiniMax-English)
Means across 100 `minimax-english` phrases on M2. Stages map to the
7-CoreML-graph split documented in [KokoroAne.md](KokoroAne.md). Vocoder
+ noise together account for ~92% of synth time, which is the natural
target for any further per-stage compute-unit re-tuning. The MiniMax
mean is meaningfully higher than the prior Harvard-sentences run
because phrases 81–100 are paragraph-length news / story sentences.
Means across 100 `minimax-english` phrases on M2 (`af_heart`,
post-laishere 7-graph chain). Stages map to the 7-CoreML-graph split
documented in [KokoroAne.md](KokoroAne.md). Vocoder + noise together
account for ~90% of synth time, which is the natural target for any
further per-stage compute-unit re-tuning.
| Stage | Mean ms | % of total |
|---------------|---------|------------|
| `albert` | 28.2 | 2.0% |
| `post_albert` | 12.1 | 0.9% |
| `alignment` | 1.8 | 0.1% |
| `prosody` | 49.2 | 3.5% |
| `noise` | 242.6 | 17.4% |
| `vocoder` | 1039.8 | 74.4% |
| `tail` | 24.6 | 1.8% |
| **total** | 1398.4 | 100% |
| `albert` | 24.5 | 2.5% |
| `post_albert` | 9.3 | 1.0% |
| `alignment` | 1.4 | 0.1% |
| `prosody` | 40.0 | 4.1% |
| `noise` | 169.3 | 17.5% |
| `vocoder` | 704.4 | 72.9% |
| `tail` | 17.0 | 1.8% |
| **total** | 965.9 | 100% |
### Magpie — per-stage breakdown (default preset, MiniMax-English)
### Magpie — per-stage breakdown
Means across 100 `minimax-english` phrases on M2 (`John` voice, en,
default compute units), captured during the original one-shot
profiling run. `ar_loop` is the umbrella for the per-step
`decoder_step` + `sampler` (so it is not added on top in the total).
`nanocodec` runs concurrently with the next chunk's AR loop via
chunk-level pipelining inside `synthesize(...)`, which is why the
per-stage means do not sum to total warm-synth mean. The AR loop
dominates the wall clock, and its cost grows super-linearly with
phrase length — long news / story phrases drive the long-tail p95.
| Stage | Mean ms |
|--------------------|---------|
| `text_encoder` | 91 |
| `prefill` | 281 |
| `ar_loop` | 17946 |
| └── `decoder_step` | 14840 |
| └── `sampler` | 3081 |
| `nanocodec` | 17948 |
Per-stage timings (`text_encoder`, `prefill`, `ar_loop`,
`decoder_step`, `sampler`, `nanocodec`) are still populated on
`MagpieSynthesisResult.timings` for callers that want them — see
[`MagpieTypes.swift`](../../Sources/FluidAudio/TTS/Magpie/MagpieTypes.swift).
This document does not currently re-publish the per-stage table on
`main`: the AR loop dominates and its absolute numbers are
in active flux on `feat/magpie-lt-fusion` (fused sampler + 24-frame
NanoCodec cap). Republish here once that branch lands on `main`.
### About the WER / CER numbers
@@ -212,97 +246,3 @@ discussion](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/
absolute WER is best read **relatively** (backend A vs. backend B on
the same corpus + same ASR + same normalizer) rather than against
raw paper numbers.
## CosyVoice3 Decode budget cap
CosyVoice3's Flow CFM was exported with a fixed input shape of
`[1, 250]` speech tokens (`flowTotalTokens` in
`CosyVoice3Constants.swift:45`). The LLM-Decode AR loop is allowed to
emit up to `flowTotalTokens - N_prompt` tokens before being cut off
(typically ~163 generated tokens after the speech-prompt portion).
At `tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000`
that's **40 ms of audio per generated token**, so the loop produces
**at most ~6.5 s of speech per phrase**, regardless of how long the
input text is.
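
The arithmetic behind the 40 ms and ~6.5 s figures can be reproduced directly from the constants quoted above. The values are copied from the paragraph, and the variable names simply mirror the identifiers it mentions rather than being imported from `CosyVoice3Constants.swift`.

```swift
// Constants as quoted in the paragraph above.
let flowTotalTokens = 250
let tokenMelRatio = 2
let hiftSamplesPerFrame = 480
let sampleRate = 24_000

// Audio produced per generated speech token.
let samplesPerToken = tokenMelRatio * hiftSamplesPerFrame                 // 960 samples
let msPerToken = Double(samplesPerToken) / Double(sampleRate) * 1000      // about 40 ms

// Typical budget after the speech-prompt portion, per the text.
let typicalGeneratedTokens = 163
let impliedPromptTokens = flowTotalTokens - typicalGeneratedTokens        // 87
let maxSeconds = Double(typicalGeneratedTokens * samplesPerToken) / Double(sampleRate)

print(msPerToken, impliedPromptTokens, maxSeconds)
```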
When the AR loop exits because it ran out of budget (i.e. no EOS
token in `stopRange = 6_561…6_760`) instead of natural termination,
`CosyVoice3Synthesizer` now:
1. Logs a `.warning` (one-shot per phrase) naming the
`decoded.count / maxNew` budget and the produced audio duration.
2. Sets `CosyVoice3SynthesisResult.finishedOnEos = false`, which the
benchmark harness surfaces as the `finished_on_eos` field on each
phrase in the JSON report.
Footprint on the cantonese corpus (`minimax-cantonese`,
100 phrases) **without the chunker**: 80 / 100 phrases would hit the
cap, all producing exactly 163 generated tokens / ~6.5 s of audio.
The mandarin corpus sees a much lower truncation rate because
MiniMax-zh phrases are shorter on average.
The structural fix — re-exporting the Flow CFM from
[`mobius-cosyvoice3`](https://github.com/voicelink-ai/mobius-cosyvoice3)
with a larger fixed input shape (e.g. `[1, 500]`) — is upstream
work; bumping the constant in Swift alone would make the Flow
input/output shapes mismatch at predict time. The shipped workaround
is the call-site [auto-chunker](#cosyvoice3-auto-chunker), which
drops cantonese truncation from 80/100 → 5/100 by splitting long
inputs at clause boundaries and crossfading the results.
Surfaced in
`CosyVoice3Synthesizer.synthesize`
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift`)
and
`CosyVoice3SynthesisResult.finishedOnEos`
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift`).
## CosyVoice3 auto-chunker
Re-exporting Flow CFM with a larger fixed input shape is gated on
upstream conversion work. Until that lands, `CosyVoice3TtsManager`
splits long inputs at the call site, synthesizes each chunk
independently, and merges with an 8 ms equal-power cosine crossfade.
**Splitter policy** (`CosyVoice3TextChunker`):
- **Hard enders** commit always: `.`, `!`, `?`, `。`, `！`, `？`,
`\n`.
- **Soft enders** commit only when the running estimate is at or past
the budget: `，`, `、`, `；`, `：`, `;`, `,`, ASCII space.
- **Force-split** at `budget + 30` tokens of overshoot if no natural
boundary appeared (rare; mostly continuous CJK with no
punctuation).
**Token-rate estimate** (calibrated against minimax-zh + minimax-yue
runs):
| Char class | Tokens / char | Rationale |
|------------|---------------|--------------------------------------------------------------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
`defaultMaxSpeechTokens` is **110**, leaving margin under the
250-token Flow cap minus typical 60–90 token speech-prompt context.
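
The token-rate table can be read as a per-character cost function. The sketch below is a hypothetical re-implementation for illustration only. It assumes "CJK" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) and counts Unicode scalars, whereas the real `CosyVoice3TextChunker` may classify characters differently.

```swift
/// Hypothetical estimator using the per-class rates from the table above.
func estimatedSpeechTokens(_ text: String) -> Double {
    var total = 0.0
    for scalar in text.unicodeScalars {
        if (0x4E00...0x9FFF).contains(scalar.value) {
            total += 7.5   // CJK ideograph (assumed range)
        } else if scalar.isASCII {
            total += 1.5   // ASCII
        } else {
            total += 2.5   // other Unicode
        }
    }
    return total
}

let defaultMaxSpeechTokens = 110.0  // budget quoted in the text

let cjk = estimatedSpeechTokens("你好")          // 2 ideographs
let ascii = estimatedSpeechTokens("hi")          // 2 ASCII characters
let accented = estimatedSpeechTokens("\u{00E9}") // precomposed "é"

// Fifteen ideographs already exceed the budget under this estimate.
let fifteen = estimatedSpeechTokens(String(repeating: "好", count: 15))
print(cjk, ascii, accented, fifteen > defaultMaxSpeechTokens)
```

With these rates, roughly 14 unpunctuated ideographs fit in one chunk, which gives a feel for how aggressively long CJK input gets split.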
**Concatenation**: 8 ms equal-power cosine crossfade at 24 kHz
between adjacent chunks; single-chunk path short-circuits to plain
copy.
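
For readers unfamiliar with the term, an equal-power cosine crossfade pairs a `cos` fade-out with a `sin` fade-in so that the squared gains always sum to 1. The sketch below is one possible formulation, assuming the two chunks overlap by the fade length. `crossfadeConcat` is a hypothetical function and is not taken from `CosyVoice3TtsManager`.

```swift
import Foundation

/// Hypothetical equal-power crossfade: overlaps the tail of `a` with the
/// head of `b` for `fadeMs` milliseconds.
func crossfadeConcat(
    _ a: [Float], _ b: [Float], sampleRate: Int = 24_000, fadeMs: Double = 8
) -> [Float] {
    let n = min(Int(Double(sampleRate) * fadeMs / 1000), a.count, b.count)
    guard n > 0 else { return a + b }  // nothing to blend: plain copy
    var out = Array(a.dropLast(n))
    for i in 0..<n {
        // theta sweeps (0, pi/2): cos fades `a` out while sin fades `b` in.
        let theta = (Double(i) + 0.5) / Double(n) * Double.pi / 2
        let mixed = Double(a[a.count - n + i]) * cos(theta) + Double(b[i]) * sin(theta)
        out.append(Float(mixed))
    }
    out.append(contentsOf: b.dropFirst(n))
    return out
}

let fadeSamples = Int(24_000.0 * 8 / 1000)  // 192 samples at 24 kHz
let left = [Float](repeating: 1, count: 1000)
let right = [Float](repeating: 1, count: 1000)
let merged = crossfadeConcat(left, right)
print(fadeSamples, merged.count)
```

In this overlapping formulation the merged output is shorter than the plain concatenation by the fade length (192 samples here).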
**Validation** (full `minimax-cantonese`, 100 phrases, M2):
| Metric | Pre-chunker | Post-chunker | Δ |
|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |
| TTFT p95 | 41.2 s | 60.5 s | +47% |
| Peak RSS | 2016 MB | 3264 MB | +62% |
The 5/100 residual is the long-tail token-rate worst case (some
Cantonese characters generate >9 speech tokens); raising the
per-CJK heuristic further would over-fragment short phrases.
Cleaner fix is the upstream Flow re-export.
@@ -115,6 +115,18 @@ public actor PunctuationCommitLayer {
/// Active debounce timer task.
private var debounceTask: Task<Void, Never>?
/// Monotonically increasing generation for the active debounce timer.
///
/// Incremented whenever a pending timer is invalidated (new partial
/// text, EOU, manual commit, reset, or a fresh timer being started).
/// The timer task captures the generation it was created under and
/// re-checks it after landing back on the actor, which closes the
/// race where `Task.sleep` already returned and the
/// `!Task.isCancelled` guard passed before a subsequent call to
/// `processPartialText` / `processEOU` / `manualCommit` / `reset`
/// requested cancellation.
private var debounceGeneration: UInt64 = 0
/// Callback invoked when updates occur.
private var updateCallback: (@Sendable (CommitLayerUpdate) -> Void)?
@@ -149,7 +161,7 @@ public actor PunctuationCommitLayer {
/// - Returns: Update containing committed text, ghost text, and commit reason.
public func processPartialText(_ text: String) async -> CommitLayerUpdate {
// Cancel existing debounce timer
debounceTask?.cancel()
invalidateDebounce()
lastUpdateTime = Date()
// Find last punctuation mark in text
@@ -223,7 +235,7 @@ public actor PunctuationCommitLayer {
///
/// - Returns: Update with all text committed and EOU commit reason.
public func processEOU() async -> CommitLayerUpdate {
debounceTask?.cancel()
invalidateDebounce()
lastUpdateTime = Date()
// EOU signals end of utterance: commit everything
@@ -262,7 +274,7 @@ public actor PunctuationCommitLayer {
///
/// - Returns: Update with ghost text promoted to committed text.
public func manualCommit() async -> CommitLayerUpdate {
debounceTask?.cancel()
invalidateDebounce()
lastUpdateTime = Date()
guard !ghostText.isEmpty else {
@@ -298,7 +310,7 @@ public actor PunctuationCommitLayer {
/// Resets the commit layer, clearing all committed and ghost text.
public func reset() async {
debounceTask?.cancel()
invalidateDebounce()
committedText = ""
ghostText = ""
lastUpdateTime = Date()
@@ -323,9 +335,19 @@ public actor PunctuationCommitLayer {
// MARK: - Private Helpers
/// Cancels any pending debounce timer and bumps the generation so any
/// timer task already past its `!Task.isCancelled` guard becomes a
/// no-op when it lands back on the actor.
private func invalidateDebounce() {
debounceTask?.cancel()
debounceTask = nil
debounceGeneration &+= 1
}
/// Starts a debounce timer that commits ghost text after the timeout expires.
private func startDebounceTimer() {
debounceTask?.cancel()
invalidateDebounce()
let pendingGeneration = debounceGeneration
debounceTask = Task { [weak self, debounceTimeout, commitOnTimeout] in
try? await Task.sleep(nanoseconds: UInt64(debounceTimeout * 1_000_000_000))
@@ -337,11 +359,21 @@ public actor PunctuationCommitLayer {
if commitOnTimeout {
// Check cancellation again before acquiring actor executor
guard !Task.isCancelled else { return }
await self.commitGhostText(reason: .debounceTimeout)
await self.fireDebounceCommit(generation: pendingGeneration)
}
}
}
/// Commits ghost text only if no newer timer / commit / reset has
/// superseded the timer that scheduled this commit. Closes the race
/// where `Task.sleep` returns and `!Task.isCancelled` is observed
/// `false` *before* a subsequent call to `processPartialText`,
/// `processEOU`, `manualCommit`, or `reset` cancels us.
private func fireDebounceCommit(generation: UInt64) async {
guard generation == debounceGeneration else { return }
await commitGhostText(reason: .debounceTimeout)
}
/// Commits ghost text to committed text with the specified reason.
///
/// - Parameter reason: The reason for committing.
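
The generation check introduced in this hunk can be modelled without any concurrency. The sketch below is a simplified, synchronous stand-in: `DebounceGate` is a hypothetical type, not part of `PunctuationCommitLayer`, and it only demonstrates why a timer that captured an older generation becomes a no-op.

```swift
/// Hypothetical synchronous model of the generation-token pattern.
struct DebounceGate {
    private(set) var generation: UInt64 = 0

    /// Invalidates any pending timer (new text, EOU, manual commit, reset).
    mutating func invalidate() {
        generation &+= 1
    }

    /// Arms a new timer and returns the generation it must present later.
    mutating func arm() -> UInt64 {
        invalidate()
        return generation
    }

    /// A timer may commit only if nothing superseded it in the meantime.
    func shouldFire(_ captured: UInt64) -> Bool {
        captured == generation
    }
}

var gate = DebounceGate()
let firstTimer = gate.arm()
let fireImmediately = gate.shouldFire(firstTimer)   // true: nothing intervened

gate.invalidate()                                   // e.g. new partial text arrived
let fireAfterInvalidate = gate.shouldFire(firstTimer)  // false: timer is stale

let secondTimer = gate.arm()
print(fireImmediately, fireAfterInvalidate, gate.shouldFire(secondTimer))
```

Unlike `Task.isCancelled`, which the timer task reads before it hops back onto the actor, the generation comparison in the real code runs on the actor itself, so it cannot be reordered against the invalidating call.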
@@ -14,6 +14,7 @@ import Foundation
/// kokoro single-graph CPU+GPU (chunk-level only)
/// pocket-tts streaming flow-matching (no per-stage timings)
/// magpie encoder-decoder + NanoCodec (6-stage timings, slow)
/// styletts2 LibriTTS iteration_3, zero-shot w/ reference audio
/// cosyvoice3 Mandarin LLM-based (Mandarin corpus only, no WER)
///
/// Usage:
@@ -105,6 +106,8 @@ public enum TtsBenchmarkCommand {
var cohereModelDirArg: String?
var asrLanguageArg: String?
var cohereComputeUnitsArg: String?
var referencePath: String?
var variantArg: String?
var i = 0
while i < arguments.count {
@@ -177,6 +180,16 @@ public enum TtsBenchmarkCommand {
cohereComputeUnitsArg = arguments[i + 1]
i += 1
}
case "--reference":
if i + 1 < arguments.count {
referencePath = arguments[i + 1]
i += 1
}
case "--variant":
if i + 1 < arguments.count {
variantArg = arguments[i + 1]
i += 1
}
case "--help", "-h":
printUsage()
return
@@ -245,9 +258,11 @@ public enum TtsBenchmarkCommand {
do {
switch backend {
case .kokoroAne:
let kaVariant = parseKokoroAneVariant(variantArg)
try await runKokoroAne(
phrases: phrases, corpusLabel: corpusLabel,
voice: voice ?? KokoroAneConstants.defaultVoice,
variant: kaVariant,
voice: voice ?? kaVariant.defaultVoice,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .kokoro:
@@ -269,6 +284,12 @@ public enum TtsBenchmarkCommand {
speakerName: speakerName, languageName: languageName,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .styleTts2:
try await runStyleTTS2(
phrases: phrases, corpusLabel: corpusLabel,
referencePath: referencePath,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .cosyVoice3:
try await runCosyVoice3(
phrases: phrases, corpusLabel: corpusLabel,
@@ -287,6 +308,7 @@ public enum TtsBenchmarkCommand {
private static func runKokoroAne(
phrases: [(category: String, text: String)],
corpusLabel: String,
variant: KokoroAneVariant,
voice: String,
preset: TtsComputeUnitPreset,
outputJson: String?,
@@ -294,7 +316,7 @@ public enum TtsBenchmarkCommand {
asrChoice: AsrChoice
) async throws {
let units = KokoroAneComputeUnits(preset: preset)
let manager = KokoroAneManager(defaultVoice: voice, computeUnits: units)
let manager = KokoroAneManager(variant: variant, defaultVoice: voice, computeUnits: units)
let coldStart = Date()
try await manager.initialize()
@@ -560,6 +582,80 @@ public enum TtsBenchmarkCommand {
}
}
// MARK: - StyleTTS2 driver
private static func runStyleTTS2(
phrases: [(category: String, text: String)],
corpusLabel: String,
referencePath: String?,
preset: TtsComputeUnitPreset,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice
) async throws {
guard let referencePath, !referencePath.isEmpty else {
logger.error(
"styletts2 backend requires --reference <speaker-audio-file> "
+ "(any sample rate / channel layout — resampled to 24 kHz mono).")
exit(1)
}
let referenceURL = resolveURL(referencePath, isDirectory: false)
guard FileManager.default.fileExists(atPath: referenceURL.path) else {
logger.error("Reference audio not found: \(referenceURL.path)")
exit(1)
}
logger.info("StyleTTS2 reference audio: \(referenceURL.path)")
let units = preset.uniformUnits ?? .cpuAndNeuralEngine
let manager = StyleTTS2Manager(computeUnits: units)
let coldStart = Date()
try await manager.initialize()
let coldStartS = Date().timeIntervalSince(coldStart)
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
let firstStart = Date()
_ = try await manager.synthesize(
text: "Initialization warm-up.",
referenceAudioURL: referenceURL)
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
try await runPhraseLoop(
backendId: "styletts2",
voiceLabel: referenceURL.lastPathComponent,
corpusLabel: corpusLabel,
phrases: phrases,
preset: preset,
coldStartS: coldStartS,
firstSynthMs: firstSynthMs,
outputJson: outputJson,
audioDir: audioDir,
asrChoice: asrChoice,
extraSummary: [
"reference": referenceURL.path,
"alpha": Double(StyleTTS2Constants.defaultAlpha),
"beta": Double(StyleTTS2Constants.defaultBeta),
]
) { text in
// StyleTTS2 is a one-shot diffusion-based synthesizer with no
// streaming yield, so TTFT == synthMs. The per-phrase mel
// recompute is tiny vs. the 5-step ADPM2 + decoder cost.
let t0 = Date()
let samples = try await manager.synthesize(
text: text, referenceAudioURL: referenceURL)
let synthMs = Date().timeIntervalSince(t0) * 1000
return BackendPhraseSample(
synthMs: synthMs,
ttftMs: synthMs,
samples: samples,
sampleRate: StyleTTS2Constants.sampleRate,
stageMs: [:],
extraFields: [:]
)
}
}
// MARK: - CosyVoice3 driver
private static func runCosyVoice3(
@@ -885,6 +981,7 @@ public enum TtsBenchmarkCommand {
case kokoro
case pocketTts
case magpie
case styleTts2
case cosyVoice3
var defaultCorpus: String {
@@ -905,6 +1002,8 @@ public enum TtsBenchmarkCommand {
return .pocketTts
case "magpie":
return .magpie
case "styletts2", "style-tts2", "styletts", "style-tts":
return .styleTts2
case "cosyvoice3", "cosyvoice", "cosy":
return .cosyVoice3
default:
@@ -938,6 +1037,17 @@ public enum TtsBenchmarkCommand {
}
}
private static func parseKokoroAneVariant(_ name: String?) -> KokoroAneVariant {
switch name?.lowercased() {
case "mandarin", "zh", "chinese", "zh-cn":
return .mandarin
case "english", "en", "en-us", nil, "":
return .english
default:
return .english
}
}
// MARK: - Helpers
private static func percentile(_ sorted: [Double], _ p: Double) -> Double {
@@ -1193,6 +1303,7 @@ public enum TtsBenchmarkCommand {
kokoro Single-graph CPU+GPU
pocket-tts Streaming flow-matching (multilingual)
magpie Encoder-decoder + NanoCodec (per-stage, slow)
styletts2 LibriTTS iteration_3, zero-shot, requires --reference
cosyvoice3 Mandarin LLM-based (auto-picks Cohere ASR for zh)
Options:
@@ -1232,13 +1343,22 @@ public enum TtsBenchmarkCommand {
fails (`MILCompilerForANE error: …`)
— avoids the multi-minute fallback
compile on first call.
--reference <path> StyleTTS2 speaker-reference audio
(required for --backend styletts2;
any sample rate / channel layout —
resampled to 24 kHz mono internally)
--variant <name> Kokoro ANE variant: english (default) or
mandarin (aliases: zh, chinese)
--help, -h Show this help
Examples:
fluidaudio tts-benchmark --backend kokoro-ane --output-json bench.json
fluidaudio tts-benchmark --backend kokoro-ane --variant mandarin \\
--voice zf_001 --corpus minimax-chinese --skip-asr
fluidaudio tts-benchmark --backend kokoro --corpus minimax-english
fluidaudio tts-benchmark --backend pocket-tts --corpus minimax-german --language german
fluidaudio tts-benchmark --backend magpie --speaker sofia --language en
fluidaudio tts-benchmark --backend styletts2 --reference speaker.wav
fluidaudio tts-benchmark --backend cosyvoice3 --corpus minimax-chinese \\
--asr-backend cohere --cohere-model-dir ~/.fluidaudio/cohere/q8