mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Alex 7603ac6733 feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557)

## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring
**latency × efficiency × quality** across every shipping TTS backend in
FluidAudio, plus the model + runtime fixes needed to actually clear all
six backends end-to-end on the [MiniMax Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set).
Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs
level so users get a runtime warning on `initialize()` reflecting their
actual perf / quality posture.

### Backends — all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER | Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) | ✅ | 3.5 s / 8.0 s / 11.4 s | 5.19× | 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) | ✅ | 3.5 s / 6.8 s / 9.3 s | 2.02× | 1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip |
| PocketTTS | minimax-en (100/100) | ✅ | 2.8 s / 6.3 s / 9.4 s | 0.61× | 1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms; RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s | 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6 s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s / **16.0 s** | 0.357×† | n/a‡ | post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine); longest output 16.0 s (was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); **whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100 phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s / **16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 → 5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a
`logger.warning` flagging the perf / quality posture; safe to ship in
non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator`
whitespace-tokenizes and Mandarin has no word boundaries (word-level WER
reads ~100% and is meaningless). CER is `whisper-large-v3` against the
rendered WAVs from the full 100-phrase `minimax-chinese` run via
`Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this
PR via `--asr-backend cohere` (see [Cohere ASR backend in the
harness](#cohere-asr-backend-in-the-harness) below) and agrees with
whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a
`MILCompilerForANE` cache failure on this M2 host that drops it to RTFx
~0.13×, so whisper is the practical source-of-truth for the full
100-phrase run.
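
To make the WER-versus-CER point and the macro/micro distinction concrete, here is a minimal Python sketch. It is not the harness's `WERCalculator` and not `Scripts/whisper_zh_cer.py`; the function names and sample strings are hypothetical, and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(ref, hyp):
    """Whitespace-tokenized WER: one 'word' per whitespace-separated run."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def char_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)


def cer_macro_micro(pairs):
    """Macro = mean of per-phrase CER; micro = total edits / total ref chars."""
    macro = sum(char_error_rate(r, h) for r, h in pairs) / len(pairs)
    micro = (sum(edit_distance(r, h) for r, h in pairs)
             / sum(len(r) for r, _ in pairs))
    return macro, micro


# A Mandarin phrase has no spaces, so it is a single "word": one wrong
# character out of six yields WER 1.0 but CER 1/6.
ref, hyp = "今天天气很好", "今天天气真好"
print(word_error_rate(ref, hyp), char_error_rate(ref, hyp))
```

Macro CER weights every phrase equally, while micro CER weights every character equally, which is why the two figures can differ when phrase lengths vary.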

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category)
live in `Documentation/TTS/Benchmarks.md`. Corpus attribution +
reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat — phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly
comparable between backends when both produce similar phrase lengths and
yield audio at the same granularity**. Two things skew the headline
number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per
`minimax-english` phrase while Kokoro emits ~7 s: same input text, ~3×
more audio out. That's mostly long inter-word pauses + slow speaking
rate baked into the LibriTTS multi-speaker checkpoint, not a measurement
artifact. A 2.72× RTFx on 22 s audio = ~8 s wall, which is in the same
range as the 6.7 s TTFT p50 below (agg-RTFx is a ratio of sums, so it
does not reproduce the median exactly). Kokoro's 2.02× on 7 s audio =
~3.5 s wall. Same-corpus RTFx ratios alone hide this.

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs.
Kokoro's 2.02× but it's **not slower from a user perspective**:
PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**,
Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The
0.61× is the per-frame cost averaged across the streaming run; what
users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |
|---|---|---|---|
| PocketTTS | 1244 ms | 80 ms frame | true streaming; conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio; ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte |
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s, i.e. 36% earlier playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz | one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only |

For conversational use cases, **TTFT matters more than RTFx**. PocketTTS
(true streaming), Magpie (streaming via `synthesizeStream`), and Kokoro
ANE (small one-shot chunks) are the three backends that meaningfully
clear the "user feels it's responsive" bar today.

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should
weigh against an explicit caveat:

- **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
The misaki→espeak post-pass remap closed half the gap; the remainder is
BART G2P misses + diffusion-sampler formant breaks on long phrases.
- **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via
`synthesizeStream` so TTFT (9.6 s p50) is significantly better than
full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to
~30 s.
- **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the
longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token
Flow input cap is now worked around at the call site by the auto-chunker
(long phrases split + crossfaded), dropping cantonese truncation from
80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100
residual is the long-tail token-rate worst case; the structural fix is
re-exporting Flow with a larger fixed input shape (tracked in
`mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` +
a `.warning`-level `LLM-Decode budget exhausted` log still surface any
truncation, and the harness writes `finished_on_eos` into each phrase in
the JSON report.

Each manager now logs a `.warning`-level beta notice on `initialize()`
(mirroring the existing CosyVoice3 pattern) so anyone wiring these into
a product gets a console signal, not a silent surprise. Docs
(`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md`
StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same
caveat at the top.

### Model + runtime fixes landed in this PR

#### CosyVoice3 stateless port (`71130c9fb`)
Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the
non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on
HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain
`MLDictionaryFeatureProvider` prediction with explicit kv carry-forward;
lowers the availability gate from macOS 15 / iOS 18 back to the package
baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the
rename.

#### CosyVoice3 HiFT timeout fix (`267766b62`)
`minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async
failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has
timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`,
which let the planner place most of the graph on ANE but kept at least
one op on the BNNS CPU async-dispatch path; long phrases tripped the
BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of
user-supplied compute-units, removing the BNNS path entirely. Verified
on 100/100 zh + 100/100 yue.

#### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`)
The autoregressive decode loop runs ~163 steps per phrase to fill the
250-token cap. Each step takes the previous step's KV cache as `kv_k` /
`kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh
`kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side
`MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV
back-buffers + a logits backing, rotates front/back/spare across steps
via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc
on first rejection (one-shot `logger.warning`). Mirrors the Magpie
pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357
(+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470
MB.
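
The allocation figures above follow directly from the tensor shape. The Python sketch below only reproduces that arithmetic and shows a simplified two-set ping-pong; the real code is Swift using `MLPredictionOptions.outputBackings`, and the `buffer_sets` helper is a hypothetical illustration, not the rotation scheme used in the PR.

```python
from math import prod

KV_SHAPE = (24, 1, 2, 768, 64)  # kv_k / kv_v shape quoted above
BYTES_PER_FP32 = 4
MIB = 1024 * 1024

kv_bytes = prod(KV_SHAPE) * BYTES_PER_FP32   # one KV tensor
per_step_bytes = 4 * kv_bytes                # kv_k, kv_v, kv_k_out, kv_v_out
steps = 163                                  # typical decode steps per phrase
churn_bytes = steps * per_step_bytes         # allocated per phrase without reuse


def buffer_sets(step):
    """Simplified ping-pong: (index of the set read, index of the set written)."""
    return step % 2, (step + 1) % 2


print(f"one KV tensor: {kv_bytes / MIB:.1f} MiB")
print(f"per step:      {per_step_bytes / MIB:.1f} MiB")
print(f"per phrase:    {churn_bytes / MIB:.0f} MiB without buffer reuse")
```

With pre-allocated backings, the output written at step `i` is read as input at step `i + 1`, so the steady-state footprint stays at a fixed number of buffers instead of growing with the step count.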

#### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`)
The 250-token Flow input cap means a single synth pass produces at most
~6.5 s of audio regardless of input length. Re-exporting Flow with a
larger fixed input shape is gated on upstream conversion work, so this
PR works around it at the call site: long inputs are split at
sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized
independently, and merged with an 8 ms equal-power cosine crossfade.
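
As an illustration of the merge step, here is a minimal Python sketch of an equal-power cosine crossfade. It is an assumption about the general technique, not a port of `CosyVoice3TtsManager.concatWithCrossfade`; the function name and the half-sample offset in the fade curve are choices made for this example.

```python
import math


def concat_with_crossfade(a, b, sample_rate=24000, fade_ms=8.0):
    """Join two sample lists, overlapping the tail of `a` with the head of `b`."""
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    out = list(a[:len(a) - n])
    for i in range(n):
        theta = (i + 0.5) / n * (math.pi / 2)
        fade_out = math.cos(theta)   # gain applied to the outgoing chunk
        fade_in = math.sin(theta)    # gain applied to the incoming chunk
        # cos^2 + sin^2 == 1, so summed power stays constant for
        # uncorrelated signals.
        out.append(a[len(a) - n + i] * fade_out + b[i] * fade_in)
    out.extend(b[n:])
    return out
```

At 24 kHz an 8 ms fade spans 192 samples, so merging two chunks shortens the total by 192 samples.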

**Splitter policy**: hard enders (`. ! ? 。 ！ ？ \n`) commit always; soft
enders (`， 、 ； ： ; ,` + ASCII space) commit only at-or-past budget;
force-split at +30 token overshoot if no natural boundary exists.
`defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap
minus a typical 60–90-token speech-prompt context). Token-rate heuristic
is calibrated against minimax-zh + minimax-yue runs:

| Char class | Tokens / char | Rationale |
|---|---|---|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
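
The splitter policy and token-rate heuristic can be sketched in a few lines of Python. This is an illustrative approximation, not `CosyVoice3TextChunker`: the function names are hypothetical, "CJK" is assumed to mean the basic CJK Unified Ideographs block only, and the budget is checked after each character rather than with any lookahead.

```python
HARD_ENDERS = set(".!?。！？\n")
SOFT_ENDERS = set("，、；：;, ")
MAX_SPEECH_TOKENS = 110       # defaultMaxSpeechTokens from the text above
FORCE_SPLIT_OVERSHOOT = 30    # force-split margin from the text above


def tokens_per_char(ch):
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:   # assumption: basic CJK block only
        return 7.5
    if cp < 0x80:                # ASCII
        return 1.5
    return 2.5                   # everything else


def chunk_text(text, budget=MAX_SPEECH_TOKENS, overshoot=FORCE_SPLIT_OVERSHOOT):
    chunks, current, cost = [], [], 0.0

    def commit():
        nonlocal current, cost
        piece = "".join(current).strip()
        if piece:
            chunks.append(piece)
        current, cost = [], 0.0

    for ch in text:
        current.append(ch)
        cost += tokens_per_char(ch)
        if ch in HARD_ENDERS:
            commit()                          # hard enders always commit
        elif ch in SOFT_ENDERS and cost >= budget:
            commit()                          # soft enders only at or past budget
        elif cost >= budget + overshoot:
            commit()                          # no natural boundary: force split
    commit()
    return chunks


print(chunk_text("你" * 15 + "，" + "好" * 3))
```

Fifteen CJK characters are estimated at 112.5 tokens, so the following comma is past the 110-token budget and commits the chunk; a short phrase with the same comma stays in one piece.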

**Validation** on full `minimax-cantonese` (100 phrases, M2):

| Metric | Pre-chunker | Post-chunker | Δ |
|---|---|---|---|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |

The TTFT regression is the cost of running multiple synth passes per
long phrase — splitting unblocks long-form output at the price of
wall-clock latency. The 5/100 residual truncation is the long-tail
token-rate worst case (some chars hit ~9 tokens/char); raising the
per-CJK heuristic further would over-fragment short phrases. Cleaner fix
is the Flow re-export.

16-test suite covers tokenization estimates, hard/soft/force-split
policy, and the crossfade arithmetic. Lives in
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift`
+ `CosyVoice3TtsManager.concatWithCrossfade`.

#### Magpie streaming TTFT wire-up (`ace0bf485`)
`TtsBenchmarkCommand.swift` now drives Magpie through
`MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first
`MagpieAudioChunk` emit instead of conflating it with full-synth wall
time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6
s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier
than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run
benefit; fundamentals unchanged).

#### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`)
`text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT:
tensor_buffer has known strides while the model has FlexibleShapeInfo`.
The CoreML runtime rejects two access patterns on outputs from a
flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element
subscripts — and the original `sliceFirstAxis2D` helper used both. Fix
rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling
`.float32`, `.float16`, `.double`) and computes the flat index from the
known `(1, leading, trailing)` row-major layout. Verified on full
100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS
demo voice.
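
The flat-index arithmetic for a `(1, leading, trailing)` row-major buffer is sketched below in Python. What `sliceFirstAxis2D` returns exactly is not stated here, so this sketch assumes it extracts the first `rows` rows; the function name and that behavior are hypothetical, and the real fix reads through `dataPointer` in Swift.

```python
def slice_first_rows(flat, leading, trailing, rows):
    """Read the first `rows` rows from a flat (1, leading, trailing) buffer.

    The index is computed from the known layout instead of asking the
    array for its strides: element (0, r, c) lives at r * trailing + c.
    """
    if len(flat) != leading * trailing:
        raise ValueError("buffer length does not match (1, leading, trailing)")
    if not 0 <= rows <= leading:
        raise ValueError("rows out of range")
    return [[flat[r * trailing + c] for c in range(trailing)]
            for r in range(rows)]


print(slice_first_rows(list(range(12)), leading=3, trailing=4, rows=2))
```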

#### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`)
After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still
landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than
Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only
--corpus` mode and disproved the silent-vocab-drop hypothesis: only
**0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII
hyphens / 12247 scalars).

Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2
share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2
LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized**
LibriTTS — predating misaki by years. The 178-vocab accepts both forms
(e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic
embeddings for the misaki ligature glyphs are essentially untrained
noise.

Side-by-side comparison against locally-installed `espeak-ng -v en-us
--ipa -q` flagged four systematic divergences:

| misaki | espeak-ng | example                  |
|--------|-----------|--------------------------|
| `ʧ`    | `tʃ`      | choice → `tʃˈɔɪs`        |
| `ʤ`    | `dʒ`      | jump   → `dʒˈʌmps`       |
| `ɜɹ`   | `ɝ`       | girl   → `ɡˈɝl`          |
| `əɹ`   | `ɚ`       | over   → `ˈoʊvɚ`         |

Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish` and applied to the assembled phoneme string after
every word has been emitted by the BART G2P. Lives alongside the
existing per-piece misaki diphthong remap. Result on the same 100-phrase
MiniMax-English run with the same `libritts_696` voice and same Parakeet
TDT roundtrip:

| Metric          | Pre   | Post  | Δ      |
|-----------------|-------|-------|--------|
| Macro WER       | 0.581 | 0.440 | −24.2% |
| Macro CER       | 0.476 | 0.241 | −49.5% |
| TTFT p50 (ms)   | 8937  | 6671  | −25.4% |
| Agg RTFx        | 2.36× | 2.72× | +15.3% |
| Peak RSS (MB)   | 1428  | 963   | −32.6% |

Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice.
Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster
on word-level G2P misses from the BART itself (`practical →
practicckles`, `separation → expiration`) and diffusion-sampler formant
breaks; closing the rest of the gap to Kokoro likely needs richer espeak
coverage or libespeak-ng vendor — tracked separately.
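
The four-rule remap from the table above amounts to ordered string replacement on the assembled phoneme string. A minimal Python sketch follows; it is not `StyleTTS2Phonemizer.phonemize`, the function name is hypothetical, and the `.americanEnglish` gate is omitted.

```python
# (misaki form, espeak-ng form), taken from the divergence table above.
MISAKI_TO_ESPEAK = [
    ("ʧ", "tʃ"),
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),
    ("əɹ", "ɚ"),
]


def remap_to_espeak(phonemes):
    """Apply the four replacements to a fully assembled phoneme string."""
    for misaki, espeak in MISAKI_TO_ESPEAK:
        phonemes = phonemes.replace(misaki, espeak)
    return phonemes


print(remap_to_espeak("ʧˈɔɪs"))
```

Running the pass once over the whole string, after every word has been emitted, keeps it independent of how the G2P splits words into pieces.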

#### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`)
`StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit
`logger.warning` beta notices mirroring the existing
`CosyVoice3TtsManager.initialize` pattern. Backends docs (`Magpie.md`
Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️
Beta / experimental` callouts so the perf / quality posture is visible
at every entry point — runtime, manager docstring, doc top, PR body.

#### Magpie `outputBackings` rejection fallback (`72dae8400` + `9767e1ef9`)
The shipped `decoder_step.mlmodelc` reaches the user before the rebuild
lands, so CoreML can reject our `outputBackings` dictionary on a
name mismatch. On the first rejection the decoder latches a flag and
switches to a fresh-alloc decode for the rest of the run, so the model
still runs.
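
The latching pattern described above can be sketched generically in Python. The class and callables here are hypothetical stand-ins; the real path is Swift and reacts to a CoreML prediction error, which this sketch models as a `ValueError`.

```python
class LatchedFastPath:
    """Try the fast path until it fails once, then stay on the slow path."""

    def __init__(self, fast, slow):
        self.fast = fast
        self.slow = slow
        self.fast_path_disabled = False
        self.warnings = []

    def step(self, x):
        if not self.fast_path_disabled:
            try:
                return self.fast(x)
            except ValueError as err:
                # Latch: warn once and never retry the fast path this run.
                self.fast_path_disabled = True
                self.warnings.append(f"fast path rejected: {err}")
        return self.slow(x)


calls = {"fast": 0, "slow": 0}


def rejecting_fast(x):
    calls["fast"] += 1
    raise ValueError("output name mismatch")


def fresh_alloc(x):
    calls["slow"] += 1
    return x * 2


decoder = LatchedFastPath(rejecting_fast, fresh_alloc)
results = [decoder.step(i) for i in range(3)]
print(results, calls, decoder.warnings)
```

The fast path is attempted exactly once, so a mismatched model pays for one failed call rather than one per decode step.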

### Cohere ASR backend in the harness (`8e741e659`)

Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER
through the harness against [Cohere
Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into
`--skip-asr`. Four new flags on `tts-benchmark`:

- `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip
engine. Default is `parakeet` for English-only runs and skipped for
CosyVoice3.
- `--cohere-model-dir <path>` — path to a directory containing
`cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
and `vocab.json`.
- `--asr-language <code>` — overrides the inferred language code (covers
all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh,
ko, vi).
- `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins
`MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu`
when the q8 encoder fails ANE compilation (`MILCompilerForANE error:
failed to compile ANE model using ANEF`) to skip the multi-minute
fallback compile on the first call. The harness logs a WER caveat for
zh/ja runs flagging that whitespace-tokenized WER is meaningless and the
CER column is the real signal.

Example end-to-end:
```bash
fluidaudio tts-benchmark \
    --backend cosyvoice3 \
    --corpus minimax-chinese \
    --asr-backend cohere \
    --cohere-model-dir /path/to/cohere/q8 \
    --asr-language zh \
    --output-json benchmark_results/cv3-zh-cohere.json \
    --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure
(`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently
falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per
`Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is
unaffected (same graph, same output), only latency. The full 100-phrase
CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was
therefore produced via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100
phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5%
CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` /
`prose-zh` corpora shipped with the first cut of this PR with the [MiniMax
Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
(CC-BY-SA-4.0; 100 phrases × 24 languages). Same public corpus used by
[MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and
Gradium, so numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in
`Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an
on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them
from the upstream HF dataset at the pinned revision and writes them to
the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth
(HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no
hardcoded asset URLs. The `.txt` files now live in `.gitignore` since
they're CC-BY-SA-4.0 derivative content; only
`Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER
caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in
`ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior
`python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend
language scope:

| Backend | Languages benchmarked |
|---|---|
| Kokoro / Kokoro ANE | en (af_heart) |
| PocketTTS | en + de + it + pt + es + fr |
| Magpie | en + es + de + fr + it + vi + zh + hi |
| StyleTTS2 | en (LibriTTS multi-spk) |
| CosyVoice3 | zh + yue |

### PocketTTS streaming TTFT (`c26f1e163`)
PocketTTS now drives the harness through its `synthesizeStreaming` API
so TTFT measures time-to-first-80ms-frame instead of full one-shot
synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage
that one-shot benchmarking previously hid.
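
Measuring TTFT at the first streamed frame rather than at the end of synthesis can be sketched as follows. This is a generic Python illustration, not the code in `TtsBenchmarkCommand.swift`; `measure_stream` and the injectable `clock` parameter are hypothetical.

```python
import time


def measure_stream(frames, clock=time.perf_counter):
    """Consume an iterable of audio frames; return (ttft_ms, synth_ms, samples)."""
    start = clock()
    ttft_ms = None
    samples = 0
    for frame in frames:
        if ttft_ms is None:
            # First frame has arrived: this is time-to-first-audio.
            ttft_ms = (clock() - start) * 1000.0
        samples += len(frame)
    synth_ms = (clock() - start) * 1000.0
    return ttft_ms, synth_ms, samples


def fake_stream(frame_count, frame_samples=1920):
    """Stand-in synthesizer: 1920 samples is one 80 ms frame at 24 kHz."""
    for _ in range(frame_count):
        yield [0.0] * frame_samples


ttft_ms, synth_ms, samples = measure_stream(fake_stream(5))
print(samples, ttft_ms <= synth_ms)
```

For a one-shot backend the "stream" has a single frame, so `ttft_ms` and `synth_ms` coincide up to loop overhead, which is the behavior the benchmark tables describe.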

### Reference voice dumper helper (mobius-styletts2)
`mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo)
wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to
dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize`
consumes via `--voice`. Required because the shipped CoreML bundle
doesn't include those upstream-only PyTorch encoders.

## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x`
produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en
- [x] StyleTTS2 — full 100/100 minimax-en (verified after
`sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue
(verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error,
LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on
`initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`;
cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`,
`TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated — confirmed structural
(250-token Flow input shape, 40 ms / token); surfaced via
`finishedOnEos` + warning log + JSON `finished_on_eos` field. See
[Decode budget
cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site
workaround. Validated on full minimax-cantonese: truncation **80/100 →
5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×.
16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3
auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in
`TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50
**9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier
playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` +
`--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`).
Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8
macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this
M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04%
— both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via
`whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over
all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**.
Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote
‡)

2026-05-01 09:09:42 -04:00


# TTS Benchmarks

Setup: MacBook Air M2 (2022), 16 GB, macOS 26, on AC. Corpus: MiniMax Multilingual TTS Test Set (100 phrases / language, CC-BY-SA-4.0) — the same public corpus used by MiniMax-Speech, seed-tts-eval, and Gradium, so numbers here are directly paper-comparable. Status: Kokoro, Kokoro ANE, PocketTTS, Magpie, StyleTTS2 all complete the English run; CosyVoice3 completes the full Mandarin run.

## Why not just RTFx?

RTFx (audio_seconds / synth_seconds) is a useful single number for batch synthesis, but for conversational use it hides the things users actually feel:

- **Cold start**: first model load + ANE compile after install or reboot. On Apple Silicon the system's `anecompilerservice` can take tens of seconds on first invocation; subsequent loads finish in ~1 s.
- **TTFT (time-to-first-audio)**: for streaming agents the question is "how long until the user hears something", not "how long until the whole utterance is rendered". For one-shot backends in this slice `ttft_ms == synth_ms`. PocketTTS and Magpie are wired through their respective streaming APIs (`synthesizeStreaming` / `synthesizeStream`), so their `ttft_ms` is honest first-frame latency.
- **Per-stage compute units**: Kokoro ANE / Magpie are pipelines of 6–7 graphs. Sometimes ANE is slower per call but more efficient. The "right" compute-unit choice differs per stage.
- **Memory footprint**: drives whether a backend is mobile-viable.
- **Quality**: RTFx alone tells you nothing about whether the model pronounced "Reykjavík" or "$1,234.56" correctly. We measure WER + CER via Parakeet roundtrip on a fixed English corpus; non-English backends run with `--skip-asr` for now.

## Methodology

### Corpus

All shipped corpora come from the MiniMax Multilingual TTS Test Set (`MiniMaxAI/TTS-Multilingual-Test-Set` on Hugging Face, CC-BY-SA-4.0). The fetched files land under `Benchmarks/tts/corpus/minimax/<lang>.txt` (24 languages × 100 phrases = 2400 phrases) and are gitignored; populate them on demand with `swift run fluidaudio minimax-corpus`. Attribution, revision pin, and WER caveats live in `MinimaxCorpus.md`.

Reference each language as `--corpus minimax-<lang>`:

| Backend | Default corpus | Other supported MiniMax languages |
|---|---|---|
| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) |
| PocketTTS | `minimax-english` | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` |
| StyleTTS2 | `minimax-english` | `english` only (LibriTTS multi-speaker) |
| Magpie | `minimax-english` | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` |
| CosyVoice3 | `minimax-chinese` | `chinese`, `cantonese` |

Lines beginning with `#` are comments. Custom corpora can still be passed with `--corpus-path <file.txt>`.

### Metrics

Per phrase:

- `ttft_ms`: time-to-first-audio. For one-shot backends this equals `synth_ms`. PocketTTS is benchmarked through `synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first 80 ms audio frame (1920 samples @ 24 kHz). Magpie is benchmarked through `synthesizeStream`, so its `ttft_ms` is the first `MagpieAudioChunk` emit time (typically ~9.6 s on M2 vs ~15 s for full synth).
- `synth_ms`: total synth wall time.
- `audio_ms`: generated audio duration.
- `rtfx`: `audio_ms / synth_ms`.
- `wer`, `cer`: via Parakeet ASR roundtrip on the rendered WAV.
- `stage_ms`: per-stage breakdown (backend-specific keys; populated for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS / StyleTTS2 / CosyVoice3).
- Backend-specific extras: `encoder_tokens`, `acoustic_frames`, `chunk_count`, `frame_count`, `code_count`, `finished_on_eos`, `generated_token_count`, etc.

Aggregates:

- `cold_start_s`: `manager.initialize()` wall time. CosyVoice3 also includes voice-asset load.
- `first_synth_ms`: first synth call after init (still cold-ish).
- `ttft_ms_p50` / `ttft_ms_p95`.
- `warm_synth_ms_p50` / `warm_synth_ms_p95`.
- `agg_rtfx`: Σ `audio_ms` / Σ `synth_ms` across the corpus.
- `peak_rss_mb`: process-wide peak resident set, via `task_vm_info_data_t.resident_size_peak`.
- Per-category macro WER / CER.
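
As a rough illustration of how the aggregates relate to the per-phrase records, here is a minimal Python sketch. The field names follow the list above, but the nearest-rank percentile is an assumption (the interpolation method used by the harness is not stated here), and the sample records are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (assumed method, for illustration only)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(phrases):
    audio = sum(p["audio_ms"] for p in phrases)
    synth = sum(p["synth_ms"] for p in phrases)
    return {
        # Ratio of sums: long phrases dominate, unlike a mean of ratios.
        "agg_rtfx": audio / synth,
        "mean_phrase_rtfx": sum(p["audio_ms"] / p["synth_ms"]
                                for p in phrases) / len(phrases),
        "ttft_ms_p50": percentile([p["ttft_ms"] for p in phrases], 50),
        "ttft_ms_p95": percentile([p["ttft_ms"] for p in phrases], 95),
    }


sample = [
    {"audio_ms": 1000, "synth_ms": 500, "ttft_ms": 500},    # short, fast phrase
    {"audio_ms": 9000, "synth_ms": 9000, "ttft_ms": 9000},  # long, slow phrase
]
report = aggregate(sample)
print(report)
```

In the made-up sample the mean of per-phrase ratios is 1.5 while the ratio of sums is about 1.05, which shows why a few long phrases can pull `agg_rtfx` well below what most phrases achieve.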

### Reproducibility

```bash
# From the package root.
swift run fluidaudio tts-benchmark \
  --backend kokoro-ane \
  --corpus minimax-english \
  --voice af_heart \
  --compute-units default \
  --output-json bench.json \
  --audio-dir bench-wavs/
```

The harness writes a JSON report to `--output-json` and (optionally) keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR roundtrip. The default ASR backend is `parakeet` for English-only runs and is skipped for CosyVoice3; pass `--asr-backend cohere --cohere-model-dir <dir>` to score Mandarin (or any of the 14 Cohere languages) against Cohere Transcribe.

## Results

### Per-backend top-line

Reference machine: MacBook Air, Apple M2 (2022), 8-core CPU / 8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26 (Mac14,2, on AC). All English runs use --compute-units default, voice = backend default (af_heart for Kokoro, alba for PocketTTS, John for Magpie), corpus = minimax-english (100 phrases), Parakeet TDT roundtrip for WER / CER.

| Backend | License | Languages | Footprint | Cold start | TTFT p50 / p95* | Synth p50 / p95 | Agg RTFx | Peak RSS | WER | CER | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Kokoro ANE | Apache-2.0 | en (af_heart only) | ~330 MB | 37.9 s | 1586 / 2515 ms | 1586 / 2515 ms | 5.19× | 738 MB | 0.108 | 0.040 | one-shot; per-stage CU sweep, 7-graph pipeline |
| Kokoro | Apache-2.0 | en (af_heart only) | ~330 MB | 92.2 s | 3113 / 4696 ms | 3113 / 4696 ms | 2.02× | 736 MB | 0.013 | 0.005 | one-shot; cleanest English ASR roundtrip |
| PocketTTS | research | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | 1244 / 4749 ms | 8757 / 19174 ms | 0.61× | 1503 MB | 0.014 | 0.006 | streaming; TTFT is first 80 ms audio frame |
| StyleTTS2 | MIT | en (LibriTTS multi-spk) | ~280 MB | 955 s§ | 6671 / 15990 ms§ | 6671 / 15990 ms§ | 2.72×§ | 963 MB§ | 0.440§ | 0.241§ | full 100/100 `minimax-english` via misaki→espeak post-pass remap; ref_s = LibriTTS `696_92939_000016_000006.wav` (StyleTTS2 demo voice) |
| Magpie | research | en/es/de/fr/it/vi/zh/hi | ~1.3 GB | 38.5 s∥ | 9580 / 23796 ms∥ | 15080 / 29895 ms∥ | 0.64×∥ | 762 MB∥ | 0.056 | 0.033 | streaming TTFT: first audio chunk at 9.6 s p50 on M2 (full synth 15.1 s); split-K/V decoder; outputBackings fast path with latched fallback |
| CosyVoice3 | Apache-2.0 | zh (mandarin) | ~1.5 GB | 29.2 s† | 14091 / 23679 ms† | 14091 / 23679 ms† | 0.357×† | 3302 MB† | n/a‡ | 0.017‡ | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via auto-chunker but not benchmarked (no yue ASR) |

* TTFT for PocketTTS / Magpie is first-frame emit through the streaming API; the others are one-shot, so ttft_ms == synth_ms.

† CosyVoice3 chinese: 100/100, 0 errors, ASR skipped. Cold-start dropped from 302.7 s to 29.2 s on the warm re-run.

‡ CosyVoice3 CER measured on the full 100-phrase minimax-chinese corpus via whisper-large-v3 (Python CPU FP32, Scripts/whisper_zh_cer.py) on the WAVs rendered by tts-benchmark --backend cosyvoice3 --corpus minimax-chinese --skip-asr --audio-dir <dir>: macro CER 1.68% (0.0168), micro CER 1.84% (0.0184) across 100 phrases. Whisper is the source of truth here because Cohere Transcribe q8 hit a MILCompilerForANE cache failure on this M2 host and ran on the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple hours for the full 100-phrase set vs. ~70 min for whisper). WER is omitted because Mandarin has no word boundaries and WERCalculator splits on whitespace, so word-level WER reads near 100% and is meaningless.

∥ Magpie: streamed via synthesizeStream. TTFT (9.6 s p50) is first-chunk emit; synth (15.1 s p50) is full-utterance wall time — the 5.5 s gap is the streaming win.

§ StyleTTS2 (beta — StyleTTS2Manager.initialize emits a runtime warning): warm-cache run; first cold compile of the bucketed text_predictor / diffusion_step / decoder graphs is multi-second. ref_s dumped via 06_dump_ref_s.py. Read WER relatively per the WER caveat; StyleTTS2's own demo notebook reports artifacts on long sentences at default alpha/beta/diffusion_steps.

### Kokoro ANE: per-stage breakdown (default preset, MiniMax-English)

Means across 100 `minimax-english` phrases on M2. Stages map to the 7-CoreML-graph split documented in `KokoroAne.md`. Vocoder + noise together account for ~92% of synth time, which is the natural target for any further per-stage compute-unit re-tuning. The MiniMax mean is meaningfully higher than the prior Harvard-sentences run because phrases 81–100 are paragraph-length news / story sentences.

| Stage | Mean ms | % of total |
|---|---|---|
| `albert` | 28.2 | 2.0% |
| `post_albert` | 12.1 | 0.9% |
| `alignment` | 1.8 | 0.1% |
| `prosody` | 49.2 | 3.5% |
| `noise` | 242.6 | 17.4% |
| `vocoder` | 1039.8 | 74.4% |
| `tail` | 24.6 | 1.8% |
| total | 1398.4 | 100% |

### Magpie: per-stage breakdown (default preset, MiniMax-English)

Means across 100 `minimax-english` phrases on M2 (John voice, en, default compute units), captured during the original one-shot profiling run. `ar_loop` is the umbrella for the per-step `decoder_step` + `sampler` (so it is not added on top in the total). `nanocodec` runs concurrently with the AR loop in chunked-streaming mode, which is why the per-stage means do not sum to total warm-synth mean. The AR loop dominates the wall clock, and its cost grows super-linearly with phrase length; long news / story phrases drive the long-tail p95.

| Stage | Mean ms |
|---|---|
| `text_encoder` | 91 |
| `prefill` | 281 |
| `ar_loop` | 17946 |
| └── `decoder_step` | 14840 |
| └── `sampler` | 3081 |
| `nanocodec` | 17948 |

About the WER / CER numbers

The MiniMax corpus mixes short conversational phrases, medium news headlines, and long narrative paragraphs. WER on the long tail is sensitive to the ASR + text-normalizer stack (e.g. "3,5%" → "three point five percent" vs. "three and a half percent"); per the upstream community discussion, absolute WER is best read relatively (backend A vs. backend B on the same corpus + same ASR + same normalizer) rather than against raw paper numbers.
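To make the normalizer sensitivity concrete, here is a minimal sketch that scores the same ASR transcript against two equally valid spoken forms of the reference. The WER function is a generic word-level Levenshtein rate, not the repository's WERCalculator, and all names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# The ASR output is identical in both cases; only the normalized reference differs.
asr_output = "three point five percent"
wer_matching_normalizer = word_error_rate("three point five percent", asr_output)
wer_other_normalizer = word_error_rate("three and a half percent", asr_output)
```

The audio and transcript are unchanged, yet the score moves from 0% to 60% purely because of how the reference was normalized, which is why the numbers are only comparable across backends that share the same corpus, ASR, and normalizer.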

StyleTTS2 misaki → espeak post-pass remap

StyleTTS2's LibriTTS checkpoint was trained on espeak-ng-phonemized text, but the in-tree BART G2P (shared with Kokoro) emits misaki output. The 178-token vocab accepts both forms, but the acoustic embeddings for the misaki ligature glyphs are essentially untrained noise — every training utterance saw the espeak form.

Four systematic divergences vs. espeak-ng -v en-us --ipa -q:

misaki	espeak-ng	example
`ʧ`	`tʃ`	choice → `tʃˈɔɪs`
`ʤ`	`dʒ`	jumps → `dʒˈʌmps`
`ɜɹ`	`ɝ`	girl → `ɡˈɝl`
`əɹ`	`ɚ`	over → `ˈoʊvɚ`

Fix: 4-rule post-pass remap in StyleTTS2Phonemizer.phonemize, gated on .americanEnglish. Result on minimax-english: WER 0.581 → 0.440, CER 0.476 → 0.241, agg-RTFx 2.36× → 2.72× (warm-cache re-run, so latency / RSS deltas are noise; WER / CER are the real signal). WER is still roughly 30× Kokoro's (44.0% vs. 1.3%); remaining errors cluster on word-level BART mispronunciations and long-tail diffusion artifacts. Further gains likely need a richer remap layer or swapping BART for libespeak-ng directly.
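The four rules amount to a fixed string substitution over the phoneme sequence. The sketch below illustrates the idea in Python rather than reproducing StyleTTS2Phonemizer.phonemize; the function name, the `american_english` flag, and the misaki-style inputs (built by reversing the table rows) are assumptions.

```python
# (misaki form, espeak-ng form) pairs taken from the table above.
MISAKI_TO_ESPEAK = (
    ("ʧ", "tʃ"),
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),
    ("əɹ", "ɚ"),
)


def remap_misaki_to_espeak(phonemes: str, american_english: bool = True) -> str:
    """Rewrite misaki ligatures into the espeak-ng forms seen during training.

    The remap is skipped for other locales, mirroring the .americanEnglish gate.
    """
    if not american_english:
        return phonemes
    for misaki, espeak in MISAKI_TO_ESPEAK:
        phonemes = phonemes.replace(misaki, espeak)
    return phonemes


choice = remap_misaki_to_espeak("ʧˈɔɪs")
girl = remap_misaki_to_espeak("ɡˈɜɹl")
over = remap_misaki_to_espeak("ˈoʊvəɹ")
```

Rule order does not matter here because no replacement output contains any of the four source patterns.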

CosyVoice3 decode budget cap

CosyVoice3's Flow CFM was exported with a fixed input shape of [1, 250] speech tokens (flowTotalTokens in CosyVoice3Constants.swift:45). The LLM-Decode AR loop is allowed to emit up to flowTotalTokens − N_prompt tokens before being cut off (typically ~163 generated tokens after the speech-prompt portion). At tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000 that's 40 ms of audio per generated token, so the loop produces at most ~6.5 s of speech per phrase, regardless of how long the input text is.
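The arithmetic behind the ~6.5 s ceiling can be worked through directly. The constants below are the values quoted above; the Python names are hypothetical stand-ins for the Swift constants, and the prompt length of 87 tokens is inferred from 250 − 163 rather than stated in the source.

```python
FLOW_TOTAL_TOKENS = 250       # fixed Flow CFM input length
TOKEN_MEL_RATIO = 2           # mel frames per speech token
HIFT_SAMPLES_PER_FRAME = 480  # audio samples per mel frame
SAMPLE_RATE = 24_000          # Hz


def seconds_per_token() -> float:
    """Audio duration contributed by one generated speech token."""
    return TOKEN_MEL_RATIO * HIFT_SAMPLES_PER_FRAME / SAMPLE_RATE


def max_audio_seconds(prompt_tokens: int) -> float:
    """Upper bound on audio per phrase once the prompt has used its share."""
    max_new_tokens = FLOW_TOTAL_TOKENS - prompt_tokens
    return max_new_tokens * seconds_per_token()


per_token = seconds_per_token()   # 0.04 s, i.e. 40 ms
ceiling = max_audio_seconds(87)   # 163 tokens * 40 ms
```

A longer speech prompt shrinks the remaining budget, so the ceiling varies slightly per voice even though the Flow shape is fixed.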

When the AR loop exits because it ran out of budget (i.e. no EOS token in stopRange = 6_561…6_760) instead of natural termination, CosyVoice3Synthesizer now:

Logs a .warning (one-shot per phrase) naming the decoded.count / maxNew budget and the produced audio duration.
Sets CosyVoice3SynthesisResult.finishedOnEos = false, which the benchmark harness surfaces as the finished_on_eos field on each phrase in the JSON report.
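The distinction between the two exit paths can be sketched as a budgeted decode loop. This is an illustration of the control flow only, not CosyVoice3Synthesizer's code: `next_token` is a hypothetical callback standing in for one LLM-Decode step, and the stop range is assumed to be inclusive on both ends.

```python
from typing import Callable, List, Tuple

# Assumed inclusive: 6_561 through 6_760.
STOP_RANGE = range(6_561, 6_761)


def decode_with_budget(
    next_token: Callable[[List[int]], int], max_new: int
) -> Tuple[List[int], bool]:
    """Run the AR loop until an EOS-range token appears or the budget runs out.

    Returns the generated tokens and a finished_on_eos flag.
    """
    decoded: List[int] = []
    while len(decoded) < max_new:
        token = next_token(decoded)
        if token in STOP_RANGE:
            return decoded, True   # natural termination
        decoded.append(token)
    return decoded, False          # budget exhausted: audio is truncated


# A model that never emits EOS hits the cap at exactly max_new tokens.
truncated_tokens, truncated_eos = decode_with_budget(lambda d: 42, 163)
# A model that emits EOS after 10 tokens terminates naturally.
natural_tokens, natural_eos = decode_with_budget(
    lambda d: 6_561 if len(d) == 10 else 7, 163
)
```

A caller can then log the warning and set the result flag only on the second return path.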

Footprint on the Cantonese corpus (minimax-cantonese, 100 phrases) without the chunker: 80 / 100 phrases would hit the cap, all producing exactly 163 generated tokens / ~6.5 s of audio. The Mandarin corpus sees a much lower truncation rate because MiniMax-zh phrases are shorter on average.

The structural fix (re-exporting the Flow CFM from mobius-cosyvoice3 with a larger fixed input shape, e.g. [1, 500]) is upstream work; bumping the constant in Swift alone would make the Flow input/output shapes mismatch at predict time. The shipped workaround is the call-site auto-chunker, which drops Cantonese truncation from 80/100 → 5/100 by splitting long inputs at clause boundaries and crossfading the results.

Surfaced in CosyVoice3Synthesizer.synthesize (Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift) and CosyVoice3SynthesisResult.finishedOnEos (Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift).

CosyVoice3 auto-chunker

Re-exporting Flow CFM with a larger fixed input shape is gated on upstream conversion work. Until that lands, CosyVoice3TtsManager splits long inputs at the call site, synthesizes each chunk independently, and merges with an 8 ms equal-power cosine crossfade.

Splitter policy (CosyVoice3TextChunker):

Hard enders commit always: ., !, ?, 。, ！, ？, \n.
Soft enders commit only when the running estimate is at or past the budget: ，, 、, ；, ：, ;, ,, ASCII space.
Force-split at budget + 30 tokens of overshoot if no natural boundary appeared (rare; mostly continuous CJK with no punctuation).
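The three rules above can be sketched as a single pass over the characters. This is a simplified illustration, not CosyVoice3TextChunker itself: the function name and the pluggable `estimate` callback are hypothetical, and details such as whitespace trimming or how the real splitter handles consecutive enders are not covered.

```python
from typing import Callable, List

HARD_ENDERS = set(".!?。！？\n")
SOFT_ENDERS = set("，、；：;, ")
FORCE_SPLIT_OVERSHOOT = 30  # tokens past the budget before a forced split


def chunk_text(text: str, budget: float, estimate: Callable[[str], float]) -> List[str]:
    """Split text at hard enders, at soft enders once over budget, or by force."""
    chunks: List[str] = []
    current = ""
    running = 0.0
    for ch in text:
        current += ch
        running += estimate(ch)
        hard = ch in HARD_ENDERS
        soft = ch in SOFT_ENDERS and running >= budget
        forced = running >= budget + FORCE_SPLIT_OVERSHOOT
        if hard or soft or forced:
            if current.strip():  # skip chunks that contain only whitespace
                chunks.append(current)
            current, running = "", 0.0
    if current.strip():
        chunks.append(current)
    return chunks


# Toy estimate of one token per character, budget of 5 tokens.
soft_split = chunk_text("ab, cd, efgh, ij.", 5, lambda ch: 1.0)
forced_split = chunk_text("x" * 40, 5, lambda ch: 1.0)
```

In the toy run the first comma is skipped because the estimate is still under budget, while the unpunctuated string is only cut once it overshoots the budget by 30 tokens.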

Token-rate estimate (calibrated against minimax-zh + minimax-yue runs):

Char class	Tokens / char	Rationale
CJK	7.5	worst-case observed in real generation; varies 5.5–9 per char
ASCII	1.5	matches BPE rate on English text
Other	2.5	conservative for accented Latin / non-CJK Unicode

defaultMaxSpeechTokens is 110, which leaves margin under the 250-token Flow cap after subtracting the typical 60–90 token speech-prompt context (160–190 tokens remain).
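A per-character estimate along these lines might look like the sketch below. The rates are the ones in the table; the function names and the exact Unicode ranges treated as CJK are assumptions, since the source does not say how the real chunker classifies characters.

```python
TOKENS_PER_CHAR = {"cjk": 7.5, "ascii": 1.5, "other": 2.5}
DEFAULT_MAX_SPEECH_TOKENS = 110

# Assumed CJK ideograph ranges: Unified Ideographs, Extension A, Extension B.
CJK_RANGES = ((0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF))


def char_class(ch: str) -> str:
    """Classify one character as ascii, cjk, or other."""
    code = ord(ch)
    if code < 0x80:
        return "ascii"
    if any(lo <= code <= hi for lo, hi in CJK_RANGES):
        return "cjk"
    return "other"


def estimate_speech_tokens(text: str) -> float:
    """Sum the per-class token rates over every character in the text."""
    return sum(TOKENS_PER_CHAR[char_class(ch)] for ch in text)


two_cjk = estimate_speech_tokens("你好")       # 2 * 7.5
english = estimate_speech_tokens("hello")      # 5 * 1.5
accented = estimate_speech_tokens("\u00e9")    # precomposed e-acute, 2.5
```

At 7.5 tokens per character, the default budget of 110 is reached after about 14 CJK characters, which is why punctuation-free Cantonese text is the case that depends on the force-split rule.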

Concatenation: 8 ms equal-power cosine crossfade at 24 kHz between adjacent chunks; single-chunk path short-circuits to plain copy.
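An equal-power cosine crossfade weights the outgoing chunk by cos θ and the incoming chunk by sin θ as θ sweeps from 0 to π/2, so the squared gains always sum to 1. The sketch below is one way to write that with plain lists; it is not the CosyVoice3TtsManager implementation, the names are hypothetical, and it assumes the chunks overlap by the fade length (so each join shortens the output by 8 ms).

```python
import math
from typing import List

SAMPLE_RATE = 24_000
CROSSFADE_MS = 8


def crossfade_concat(
    chunks: List[List[float]],
    sample_rate: int = SAMPLE_RATE,
    fade_ms: int = CROSSFADE_MS,
) -> List[float]:
    """Join audio chunks with an equal-power cosine crossfade at each seam."""
    if not chunks:
        return []
    if len(chunks) == 1:
        return list(chunks[0])  # single-chunk path: plain copy
    fade_len = sample_rate * fade_ms // 1000  # 192 samples at 24 kHz
    out = list(chunks[0])
    for nxt in chunks[1:]:
        n = min(fade_len, len(out), len(nxt))
        start = len(out) - n
        for i in range(n):
            theta = (i + 0.5) / n * (math.pi / 2)
            # cos^2 + sin^2 = 1 keeps the summed power constant across the seam.
            out[start + i] = out[start + i] * math.cos(theta) + nxt[i] * math.sin(theta)
        out.extend(nxt[n:])
    return out


joined = crossfade_concat([[1.0] * 1000, [1.0] * 1000])
```

Equal-power weighting suits chunks that were synthesized independently, because their waveforms are uncorrelated at the seam and a linear fade would produce an audible dip in loudness.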

Validation (full minimax-cantonese, 100 phrases, M2):

Metric	Pre-chunker	Post-chunker	Δ
`finished_on_eos=false` (truncated)	80 / 100	5 / 100	−94%
Longest audio output	6.5 s	16.1 s	+148%
agg-RTFx	0.245×	0.249×	+1.6%
TTFT p50	23.9 s	35.7 s	+49%
TTFT p95	41.2 s	60.5 s	+47%
Peak RSS	2016 MB	3264 MB	+62%

The 5/100 residual is the long-tail token-rate worst case (some Cantonese characters generate >9 speech tokens); raising the per-CJK heuristic further would over-fragment short phrases. The cleaner fix is the upstream Flow re-export.
