Files
FluidAudio/Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift
T
Alex 7603ac6733 feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557)
## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring
**latency × efficiency × quality** across every shipping TTS backend in
FluidAudio, plus the model + runtime fixes needed to actually clear all
six backends end-to-end on the [MiniMax Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set).
Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs
level so users get a runtime warning on `initialize()` reflecting their
actual perf / quality posture.

### Backends — all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER |
Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) |  | 3.5 s / 8.0 s / 11.4 s | 5.19×
| 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) |  | 3.5 s / 6.8 s / 9.3 s | 2.02× |
1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest
English ASR roundtrip |
| PocketTTS | minimax-en (100/100) |  | 2.8 s / 6.3 s / 9.4 s | 0.61× |
1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms — RTFx looks
slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s
| 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s
p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast
path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6
s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak
post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning
on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s /
**16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases
now split + crossfaded (8 ms cosine) — longest output 16.0 s (was capped
at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx);
**whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100
phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s /
**16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 →
5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s →
16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a
`logger.warning` flagging the perf / quality posture; safe to ship in
non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator`
whitespace-tokenizes and Mandarin has no word boundaries (word-level WER
reads ~100% and is meaningless). CER is `whisper-large-v3` against the
rendered WAVs from the full 100-phrase `minimax-chinese` run via
`Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this
PR via `--asr-backend cohere` (see [Cohere ASR backend in the
harness](#cohere-asr-backend-in-the-harness) below) and agrees with
whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a
`MILCompilerForANE` cache failure on this M2 host that drops it to RTFx
~0.13×, so whisper is the practical source-of-truth for the full
100-phrase run.

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category)
live in `Documentation/TTS/Benchmarks.md`. Corpus attribution +
reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat — phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly
comparable between backends when both produce similar phrase lengths and
yield audio at the same granularity**. Two things skew the headline
number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per
`minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3×
more audio out. That's mostly long inter-word pauses + slow speaking
rate baked into the LibriTTS multi-speaker checkpoint, not a measurement
artifact. A 2.72× RTFx on 22 s audio = ~8 s wall — which matches the
TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus
RTFx ratios alone hide this.

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs.
Kokoro's 2.02× but it's **not slower from a user perspective**:
PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**,
Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The
0.61× is the per-frame cost averaged across the streaming run; what
users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |

|-------------|----------|------------------|--------------------------------------------|
| PocketTTS | 1244 ms | 80 ms frame | true streaming;
conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio;
ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte
|
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase
output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via
`synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s — 36% earlier
playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz |
one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk
only |

For conversational use cases, **TTFT > RTFx**. PocketTTS (true
streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE
(small one-shot chunks) are the three backends that meaningfully clear
the "user feels it's responsive" bar today.

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should
weigh against an explicit caveat:

- **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
The misaki→espeak post-pass remap closed half the gap; the remainder is
BART G2P misses + diffusion-sampler formant breaks on long phrases.
- **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via
`synthesizeStream` so TTFT (9.6 s p50) is significantly better than
full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to
~30 s.
- **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the
longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token
Flow input cap is now worked around at the call site by the auto-chunker
(long phrases split + crossfaded), dropping cantonese truncation from
80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100
residual is the long-tail token-rate worst case; the structural fix is
re-exporting Flow with a larger fixed input shape (tracked in
`mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` +
a `.warning`-level `LLM-Decode budget exhausted` log still surface any
truncation, and the harness writes `finished_on_eos` into each phrase in
the JSON report.

Each manager now logs a `.warning`-level beta notice on `initialize()`
(mirroring the existing CosyVoice3 pattern) so anyone wiring these into
a product gets a console signal, not a silent surprise. Docs
(`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md`
StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same
caveat at the top.

### Model + runtime fixes landed in this PR

#### CosyVoice3 stateless port (`71130c9fb`)
Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the
non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on
HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain
`MLDictionaryFeatureProvider` prediction with explicit kv carry-forward;
lowers the availability gate from macOS 15 / iOS 18 back to the package
baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the
rename.

#### CosyVoice3 HiFT timeout fix (`267766b62`)
`minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async
failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has
timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`,
which let the planner place most of the graph on ANE but kept at least
one op on the BNNS CPU async-dispatch path; long phrases tripped the
BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of
user-supplied compute-units, removing the BNNS path entirely. Verified
on 100/100 zh + 100/100 yue.

#### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`)
The autoregressive decode loop runs ~163 steps per phrase to fill the
250-token cap. Each step takes the previous step's KV cache as `kv_k` /
`kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh
`kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side
`MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV
back-buffers + a logits backing, rotates front/back/spare across steps
via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc
on first rejection (one-shot `logger.warning`). Mirrors the Magpie
pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357
(+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470
MB.

#### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`)
The 250-token Flow input cap means a single synth pass produces at most
~6.5 s of audio regardless of input length. Re-exporting Flow with a
larger fixed input shape is gated on upstream conversion work, so this
PR works around it at the call site: long inputs are split at
sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized
independently, and merged with an 8 ms equal-power cosine crossfade.

**Splitter policy**: hard enders (`. ! ? 。 ! ? \n`) commit always; soft
enders (`, 、 ; : ; ,` + ASCII space) commit only at-or-past budget;
force-split at +30 token overshoot if no natural boundary exists.
`defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap
minus a typical 60–90-token speech-prompt context). Token-rate heuristic
is calibrated against minimax-zh + minimax-yue runs:

| Char class | Tokens / char | Rationale |

|------------|---------------|--------------------------------------------------------------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per
char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |

**Validation** on full `minimax-cantonese` (100 phrases, M2):

| Metric | Pre-chunker | Post-chunker | Δ |

|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |

The TTFT regression is the cost of running multiple synth passes per
long phrase — splitting unblocks long-form output at the price of
wall-clock latency. The 5/100 residual truncation is the long-tail
token-rate worst case (some chars hit ~9 tokens/char); raising the
per-CJK heuristic further would over-fragment short phrases. Cleaner fix
is the Flow re-export.

16-test suite covers tokenization estimates, hard/soft/force-split
policy, and the crossfade arithmetic. Lives in
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift`
+ `CosyVoice3TtsManager.concatWithCrossfade`.

#### Magpie streaming TTFT wire-up (`ace0bf485`)
`TtsBenchmarkCommand.swift` now drives Magpie through
`MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first
`MagpieAudioChunk` emit instead of conflating it with full-synth wall
time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6
s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier
than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run
benefit; fundamentals unchanged).

#### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`)
`text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT:
tensor_buffer has known strides while the model has FlexibleShapeInfo`.
The CoreML runtime rejects two access patterns on outputs from a
flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element
subscripts — and the original `sliceFirstAxis2D` helper used both. Fix
rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling
`.float32`, `.float16`, `.double`) and computes the flat index from the
known `(1, leading, trailing)` row-major layout. Verified on full
100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS
demo voice.

#### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`)
After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still
landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than
Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only
--corpus` mode and disproved the silent-vocab-drop hypothesis: only
**0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII
hyphens / 12247 scalars).

Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2
share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2
LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized**
LibriTTS — predating misaki by years. The 178-vocab accepts both forms
(e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic
embeddings for the misaki ligature glyphs are essentially untrained
noise.

Side-by-side comparison against locally-installed `espeak-ng -v en-us
--ipa -q` flagged four systematic divergences:

| misaki | espeak-ng | example                  |
|--------|-----------|--------------------------|
| `ʧ`    | `tʃ`      | choice → `tʃˈɔɪs`        |
| `ʤ`    | `dʒ`      | jump   → `dʒˈʌmps`       |
| `ɜɹ`   | `ɝ`       | girl   → `ɡˈɝl`          |
| `əɹ`   | `ɚ`       | over   → `ˈoʊvɚ`         |

Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish` and applied to the assembled phoneme string after
every word has been emitted by the BART G2P. Lives alongside the
existing per-piece misaki diphthong remap. Result on the same 100-phrase
MiniMax-English run with the same `libritts_696` voice and same Parakeet
TDT roundtrip:

| Metric          | Pre   | Post  | Δ      |
|-----------------|-------|-------|--------|
| Macro WER       | 0.581 | 0.440 | −24.2% |
| Macro CER       | 0.476 | 0.241 | −49.5% |
| TTFT p50 (ms)   | 8937  | 6671  | −25.4% |
| Agg RTFx        | 2.36× | 2.72× | +15.3% |
| Peak RSS (MB)   | 1428  | 963   | −32.6% |

Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice.
Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster
on word-level G2P misses from the BART itself (`practical →
practicckles`, `separation → expiration`) and diffusion-sampler formant
breaks; closing the rest of the gap to Kokoro likely needs richer espeak
coverage or libespeak-ng vendor — tracked separately.

#### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`)
`StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit
`logger.warning` beta notices mirroring the existing
`CosyVoice3TtsManager.initialize` pattern. Backends docs (`Magpie.md`
Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️
Beta / experimental` callouts so the perf / quality posture is visible
at every entry point — runtime, manager docstring, doc top, PR body.

#### Magpie `outputBackings` rejection fallback (`72dae8400` +
`9767e1ef9`)
The shipped `decoder_step.mlmodelc` reaches the user before the rebuild
lands, so CoreML can reject our `outputBackings` dictionary on a
name-mismatch. Latched fallback path falls back to a fresh-alloc decode
so the model still runs; first rejection latches the flag for the rest
of the run.

### Cohere ASR backend in the harness (`8e741e659`)

Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER
through the harness against [Cohere
Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into
`--skip-asr`. Four new flags on `tts-benchmark`:

- `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip
engine. Default is `parakeet` for English-only runs and skipped for
CosyVoice3.
- `--cohere-model-dir <path>` — path to a directory containing
`cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
and `vocab.json`.
- `--asr-language <code>` — overrides the inferred language code (covers
all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh,
ko, vi).
- `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins
`MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu`
when the q8 encoder fails ANE compilation (`MILCompilerForANE error:
failed to compile ANE model using ANEF`) to skip the multi-minute
fallback compile on the first call. The harness logs a WER caveat for
zh/ja runs flagging that whitespace-tokenized WER is meaningless and the
CER column is the real signal.

Example end-to-end:
```bash
fluidaudio tts-benchmark \
    --backend cosyvoice3 \
    --corpus minimax-chinese \
    --asr-backend cohere \
    --cohere-model-dir /path/to/cohere/q8 \
    --asr-language zh \
    --output-json benchmark_results/cv3-zh-cohere.json \
    --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure
(`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently
falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per
`Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is
unaffected (same graph, same output), only latency. The full 100-phrase
CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was
therefore produced via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100
phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5%
CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` /
`prose-zh` shipped with the first cut of this PR with the [MiniMax
Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
(CC-BY-SA-4.0; 100 phrases × 25 languages). Same public corpus used by
[MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and
Gradium — numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in
`Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an
on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them
from the upstream HF dataset at the pinned revision and writes them to
the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth
(HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no
hardcoded asset URLs. The `.txt` files now live in `.gitignore` since
they're CC-BY-SA-4.0 derivative content; only
`Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER
caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in
`ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior
`python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend
language scope:

| Backend | Languages benchmarked |
|---|---|
| Kokoro / Kokoro ANE | en (af_heart) |
| PocketTTS | en + de + it + pt + es + fr |
| Magpie | en + es + de + fr + it + vi + zh + hi |
| StyleTTS2 | en (LibriTTS multi-spk) |
| CosyVoice3 | zh + yue |

### PocketTTS streaming TTFT (`c26f1e163`)
PocketTTS now drives the harness through its `synthesizeStreaming` API
so TTFT measures time-to-first-80ms-frame instead of full one-shot
synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage
that one-shot benchmarking previously hid.

### Reference voice dumper helper (mobius-styletts2)
`mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo)
wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to
dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize`
consumes via `--voice`. Required because the shipped CoreML bundle
doesn't include those upstream-only PyTorch encoders.

## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x`
produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en
- [x] StyleTTS2 — full 100/100 minimax-en (verified after
`sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue
(verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error,
LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on
`initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`;
cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`,
`TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated — confirmed structural
(250-token Flow input shape, 40 ms / token); surfaced via
`finishedOnEos` + warning log + JSON `finished_on_eos` field. See
[Decode budget
cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site
workaround. Validated on full minimax-cantonese: truncation **80/100 →
5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×.
16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3
auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in
`TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50
**9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier
playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` +
`--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`).
Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8
macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this
M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04%
— both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via
`whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over
all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**.
Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote
‡)
2026-05-01 09:09:42 -04:00

1360 lines
56 KiB
Swift
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
#if os(macOS)
import CoreML
import FluidAudio
import Foundation
/// `fluidaudio tts-benchmark` quantitative TTS benchmark harness.
///
/// Reports **TTFT / cold-start / warm-start latency, per-stage timings,
/// peak RSS, WER + CER per category** i.e. the things conversational
/// TTS users actually feel instead of just RTFx.
///
/// Backends:
/// kokoro-ane 7-stage ANE pipeline (per-stage timings, per-stage CU)
/// kokoro single-graph CPU+GPU (chunk-level only)
/// pocket-tts streaming flow-matching (no per-stage timings)
/// magpie encoder-decoder + NanoCodec (6-stage timings, slow)
/// cosyvoice3 Mandarin LLM-based (Mandarin corpus only, no WER)
/// styletts2 diffusion + HiFi-GAN (one-shot, requires --voice ref_s.bin)
///
/// Usage:
/// fluidaudio tts-benchmark --backend kokoro-ane \
/// --corpus minimax-english \
/// --voice af_heart \
/// --compute-units default \
/// --output-json bench.json
///
/// Corpora land in `Benchmarks/tts/corpus/minimax/<lang>.txt`
/// the MiniMax Multilingual TTS Test Set (CC-BY-SA-4.0,
/// 24 languages × 100 phrases). The `.txt` files are gitignored;
/// populate them with `swift run fluidaudio minimax-corpus`. See
/// `Documentation/TTS/MinimaxCorpus.md` for attribution + reproduction
/// notes and `Documentation/TTS/Benchmarks.md` for the per-backend
/// language coverage matrix. Reference with `--corpus minimax-<lang>`
/// (e.g. `minimax-english`, `minimax-chinese`, `minimax-vietnamese`, ).
public enum TtsBenchmarkCommand {
private static let logger = AppLogger(category: "TtsBenchmarkCommand")
// MARK: - Per-phrase sample emitted by every backend driver.
private struct BackendPhraseSample {
let synthMs: Double
let ttftMs: Double // For one-shot backends, == synthMs.
let samples: [Float]
let sampleRate: Int
let stageMs: [String: Double] // Empty if backend has no per-stage timings.
let extraFields: [String: Any] // encoder_tokens, finished_on_eos, etc.
}
// MARK: - ASR backend selection
//
// The harness supports two ASR backends for the TTSASR roundtrip:
// .parakeet Parakeet TDT (English-only, auto-downloaded).
// .cohere Cohere Transcribe cache-external (14 languages incl. zh).
// CosyVoice3's Mandarin output requires `.cohere` for a meaningful CER
// Parakeet's English-only output collapses to ~100% on zh.
fileprivate enum AsrChoice {
case skip
case parakeet
case cohere(modelDir: URL, language: CohereAsrConfig.Language, computeUnits: MLComputeUnits)
var label: String {
switch self {
case .skip: return "skip"
case .parakeet: return "parakeet-tdt"
case .cohere(_, let lang, let cu):
return "cohere-transcribe-\(lang.rawValue)/\(Self.computeLabel(cu))"
}
}
private static func computeLabel(_ cu: MLComputeUnits) -> String {
switch cu {
case .all: return "all"
case .cpuAndNeuralEngine: return "cpu+ane"
case .cpuAndGPU: return "cpu+gpu"
case .cpuOnly: return "cpu"
@unknown default: return "unknown"
}
}
var skipped: Bool {
if case .skip = self { return true } else { return false }
}
}
/// Closure-based ASR adapter so `runPhraseLoop` doesn't have to know
/// which backend it's driving. Built once before the per-phrase loop,
/// torn down after.
fileprivate struct AsrLoop {
let label: String
let transcribeOne: (URL) async throws -> String
let cleanup: () async -> Void
}
public static func run(arguments: [String]) async {
var backendName = "kokoro-ane"
var corpusName: String?
var corpusPath: String?
var voice: String?
var speakerName: String?
var languageName: String?
var computeUnitsName = "default"
var outputJson: String?
var audioDir: String?
var skipAsr = false
var asrBackendName: String?
var cohereModelDirArg: String?
var asrLanguageArg: String?
var cohereComputeUnitsArg: String?
var i = 0
while i < arguments.count {
let arg = arguments[i]
switch arg {
case "--backend":
if i + 1 < arguments.count {
backendName = arguments[i + 1]
i += 1
}
case "--corpus":
if i + 1 < arguments.count {
corpusName = arguments[i + 1]
i += 1
}
case "--corpus-path":
if i + 1 < arguments.count {
corpusPath = arguments[i + 1]
i += 1
}
case "--voice":
if i + 1 < arguments.count {
voice = arguments[i + 1]
i += 1
}
case "--speaker":
if i + 1 < arguments.count {
speakerName = arguments[i + 1]
i += 1
}
case "--language":
if i + 1 < arguments.count {
languageName = arguments[i + 1]
i += 1
}
case "--compute-units":
if i + 1 < arguments.count {
computeUnitsName = arguments[i + 1]
i += 1
}
case "--output-json":
if i + 1 < arguments.count {
outputJson = arguments[i + 1]
i += 1
}
case "--audio-dir":
if i + 1 < arguments.count {
audioDir = arguments[i + 1]
i += 1
}
case "--skip-asr":
skipAsr = true
case "--asr-backend":
if i + 1 < arguments.count {
asrBackendName = arguments[i + 1]
i += 1
}
case "--cohere-model-dir":
if i + 1 < arguments.count {
cohereModelDirArg = arguments[i + 1]
i += 1
}
case "--asr-language":
if i + 1 < arguments.count {
asrLanguageArg = arguments[i + 1]
i += 1
}
case "--cohere-compute-units":
if i + 1 < arguments.count {
cohereComputeUnitsArg = arguments[i + 1]
i += 1
}
case "--help", "-h":
printUsage()
return
default:
logger.warning("Unknown argument: \(arg)")
}
i += 1
}
let backend = parseBackend(backendName)
// Resolve corpus.
let phrases: [(category: String, text: String)]
let corpusLabel: String
do {
if let corpusPath {
let url = resolveURL(corpusPath, isDirectory: false)
let raw = try String(contentsOf: url, encoding: .utf8)
phrases = parseCorpus(raw, category: url.deletingPathExtension().lastPathComponent)
corpusLabel = url.lastPathComponent
} else {
let resolved = corpusName ?? backend.defaultCorpus
phrases = try loadShippedCorpus(resolved)
corpusLabel = resolved
}
} catch {
logger.error("Failed to load corpus: \(error.localizedDescription)")
exit(1)
}
guard !phrases.isEmpty else {
logger.error("Corpus is empty after parsing")
exit(1)
}
logger.info("Loaded \(phrases.count) phrase(s) from corpus '\(corpusLabel)'")
guard let preset = TtsComputeUnitPreset(cliValue: computeUnitsName) else {
logger.error(
"Unknown --compute-units value: \(computeUnitsName). Expected default | all-ane | cpu-and-gpu | cpu-only."
)
exit(1)
}
// Resolve ASR backend choice. Precedence:
// --skip-asr or --asr-backend none .skip
// --asr-backend cohere .cohere(modelDir, language)
// --asr-backend parakeet .parakeet
// no flag, backend == cosyvoice3 .skip (Parakeet is English-only;
// Mandarin output collapses to ~100% WER)
// no flag, otherwise .parakeet
let asrChoice: AsrChoice
do {
asrChoice = try resolveAsrChoice(
skipAsrFlag: skipAsr,
backendName: asrBackendName,
cohereModelDir: cohereModelDirArg,
asrLanguage: asrLanguageArg,
cohereComputeUnits: cohereComputeUnitsArg,
corpusLabel: corpusLabel,
ttsBackend: backend)
} catch {
logger.error("Failed to resolve ASR backend: \(error.localizedDescription)")
exit(1)
}
logger.info("ASR backend: \(asrChoice.label)")
do {
switch backend {
case .kokoroAne:
try await runKokoroAne(
phrases: phrases, corpusLabel: corpusLabel,
voice: voice ?? KokoroAneConstants.defaultVoice,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .kokoro:
try await runKokoro(
phrases: phrases, corpusLabel: corpusLabel,
voice: voice ?? TtsConstants.recommendedVoice,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .pocketTts:
try await runPocketTts(
phrases: phrases, corpusLabel: corpusLabel,
voice: voice ?? PocketTtsConstants.defaultVoice,
languageName: languageName,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .magpie:
try await runMagpie(
phrases: phrases, corpusLabel: corpusLabel,
speakerName: speakerName, languageName: languageName,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .cosyVoice3:
try await runCosyVoice3(
phrases: phrases, corpusLabel: corpusLabel,
voice: voice,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
case .styleTts2:
try await runStyleTts2(
phrases: phrases, corpusLabel: corpusLabel,
voicePath: voice,
preset: preset, outputJson: outputJson, audioDir: audioDir,
asrChoice: asrChoice)
}
} catch {
logger.error("tts-benchmark failed: \(error)")
exit(1)
}
}
// MARK: - Kokoro ANE driver
private static func runKokoroAne(
phrases: [(category: String, text: String)],
corpusLabel: String,
voice: String,
preset: TtsComputeUnitPreset,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice
) async throws {
let units = KokoroAneComputeUnits(preset: preset)
let manager = KokoroAneManager(defaultVoice: voice, computeUnits: units)
let coldStart = Date()
try await manager.initialize()
let coldStartS = Date().timeIntervalSince(coldStart)
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
let firstStart = Date()
_ = try await manager.synthesizeDetailed(
text: "Initialization warm-up.", voice: voice, speed: 1.0)
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
try await runPhraseLoop(
backendId: "kokoro-ane",
voiceLabel: voice,
corpusLabel: corpusLabel,
phrases: phrases,
preset: preset,
coldStartS: coldStartS,
firstSynthMs: firstSynthMs,
outputJson: outputJson,
audioDir: audioDir,
asrChoice: asrChoice,
extraSummary: ["voice": voice]
) { text in
let t0 = Date()
let result = try await manager.synthesizeDetailed(
text: text, voice: voice, speed: 1.0)
let synthMs = Date().timeIntervalSince(t0) * 1000
return BackendPhraseSample(
synthMs: synthMs,
ttftMs: synthMs,
samples: result.samples,
sampleRate: result.sampleRate,
stageMs: [
"albert": result.timings.albert,
"post_albert": result.timings.postAlbert,
"alignment": result.timings.alignment,
"prosody": result.timings.prosody,
"noise": result.timings.noise,
"vocoder": result.timings.vocoder,
"tail": result.timings.tail,
"total": result.timings.totalMs,
],
extraFields: [
"encoder_tokens": result.encoderTokens,
"acoustic_frames": result.acousticFrames,
]
)
}
}
// MARK: - Kokoro driver (single-graph)
private static func runKokoro(
phrases: [(category: String, text: String)],
corpusLabel: String,
voice: String,
preset: TtsComputeUnitPreset,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice
) async throws {
let units = preset.uniformUnits ?? .all
let manager = KokoroTtsManager(defaultVoice: voice, computeUnits: units)
let coldStart = Date()
try await manager.initialize(preloadVoices: [voice])
let coldStartS = Date().timeIntervalSince(coldStart)
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
let firstStart = Date()
_ = try await manager.synthesizeDetailed(text: "Initialization warm-up.", voice: voice)
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
try await runPhraseLoop(
backendId: "kokoro",
voiceLabel: voice,
corpusLabel: corpusLabel,
phrases: phrases,
preset: preset,
coldStartS: coldStartS,
firstSynthMs: firstSynthMs,
outputJson: outputJson,
audioDir: audioDir,
asrChoice: asrChoice,
extraSummary: ["voice": voice]
) { text in
let t0 = Date()
let result = try await manager.synthesizeDetailed(text: text, voice: voice)
let synthMs = Date().timeIntervalSince(t0) * 1000
let samples = result.chunks.flatMap { $0.samples }
return BackendPhraseSample(
synthMs: synthMs,
ttftMs: synthMs,
samples: samples,
sampleRate: 24000,
stageMs: [:],
extraFields: [
"chunk_count": result.chunks.count,
"wav_bytes": result.audio.count,
]
)
}
}
// MARK: - PocketTTS driver
private static func runPocketTts(
phrases: [(category: String, text: String)],
corpusLabel: String,
voice: String,
languageName: String?,
preset: TtsComputeUnitPreset,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice
) async throws {
if preset != .default {
logger.warning(
"PocketTTS does not expose per-call compute-unit overrides; --compute-units \(preset.cliValue) ignored."
)
}
let language = parsePocketLanguage(languageName)
logger.info("PocketTTS language: \(language.rawValue)")
let manager = PocketTtsManager(defaultVoice: voice, language: language)
let coldStart = Date()
try await manager.initialize()
let coldStartS = Date().timeIntervalSince(coldStart)
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
let firstStart = Date()
var firstFrameMs: Double = 0
var firstFrameCount = 0
let warmupStream = try await manager.synthesizeStreaming(
text: "Initialization warm-up.", voice: voice)
for try await frame in warmupStream {
if firstFrameCount == 0 {
firstFrameMs = Date().timeIntervalSince(firstStart) * 1000
}
firstFrameCount += 1
_ = frame.samples
}
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
logger.info(
String(
format: "First synth: %.0f ms total, %.0f ms TTFT (frames=%d)",
firstSynthMs, firstFrameMs, firstFrameCount))
try await runPhraseLoop(
backendId: "pocket-tts",
voiceLabel: voice,
corpusLabel: corpusLabel,
phrases: phrases,
preset: preset,
coldStartS: coldStartS,
firstSynthMs: firstSynthMs,
outputJson: outputJson,
audioDir: audioDir,
asrChoice: asrChoice,
extraSummary: ["voice": voice, "language": language.rawValue]
) { text in
// PocketTTS is streaming-first: we measure TTFT (time to first
// audio frame) separately from total synth time so the benchmark
// numbers reflect what a streaming consumer actually experiences.
let t0 = Date()
let stream = try await manager.synthesizeStreaming(text: text, voice: voice)
var aggregated: [Float] = []
var ttftMs: Double = 0
var frameCount = 0
var lastChunkCount = 0
for try await frame in stream {
if frameCount == 0 {
ttftMs = Date().timeIntervalSince(t0) * 1000
}
aggregated.append(contentsOf: frame.samples)
frameCount += 1
lastChunkCount = frame.chunkCount
}
let synthMs = Date().timeIntervalSince(t0) * 1000
return BackendPhraseSample(
synthMs: synthMs,
ttftMs: ttftMs,
samples: aggregated,
sampleRate: PocketTtsConstants.audioSampleRate,
stageMs: [:],
extraFields: [
"frame_count": frameCount,
"chunk_count": lastChunkCount,
]
)
}
}
// MARK: - Magpie driver
private static func runMagpie(
phrases: [(category: String, text: String)],
corpusLabel: String,
speakerName: String?,
languageName: String?,
preset: TtsComputeUnitPreset,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice
) async throws {
let units = preset.uniformUnits ?? .cpuAndNeuralEngine
let language = parseMagpieLanguage(languageName)
let speaker = parseMagpieSpeaker(speakerName)
logger.info("Magpie speaker=\(speaker.displayName) language=\(language.rawValue)")
let manager = MagpieTtsManager(
computeUnits: units, preferredLanguages: [language])
let coldStart = Date()
try await manager.initialize()
let coldStartS = Date().timeIntervalSince(coldStart)
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
let firstStart = Date()
_ = try await manager.synthesize(
text: "Initialization warm-up.", speaker: speaker, language: language)
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
try await runPhraseLoop(
backendId: "magpie",
voiceLabel: speaker.displayName,
corpusLabel: corpusLabel,
phrases: phrases,
preset: preset,
coldStartS: coldStartS,
firstSynthMs: firstSynthMs,
outputJson: outputJson,
audioDir: audioDir,
asrChoice: asrChoice,
extraSummary: [
"speaker": speaker.displayName, "language": language.rawValue,
]
) { text in
// Drive Magpie through `synthesizeStream` so TTFT measures
// time-to-first-chunk-yield rather than full-utterance wall.
// The chunker carves a small first chunk
// (`MagpieChunker.streamingFirstChunkCap` = 50 codec frames
// 2.3 s of audio) when the first sentence is long enough; for
// short phrases the stream degrades to one chunk == whole
// utterance and TTFT == synthMs (no streaming benefit, no
// measurement penalty).
//
// Trade-off vs. the prior `synthesize()` path: per-stage
// timings (`text_encoder`/`prefill`/`ar_loop`/) are only
// surfaced on `MagpieSynthesisResult`, not per
// `MagpieAudioChunk`, so `stageMs` is empty here. That matches
// PocketTTS streaming which also publishes empty `stageMs`.
let t0 = Date()
let stream = try await manager.synthesizeStream(
text: text, speaker: speaker, language: language)
var aggregated: [Float] = []
var ttftMs: Double = 0
var chunkCount = 0
var codeCount = 0
var finishedOnEos = false
var sampleRate = MagpieConstants.audioSampleRate
for try await chunk in stream {
if chunkCount == 0 {
ttftMs = Date().timeIntervalSince(t0) * 1000
}
aggregated.append(contentsOf: chunk.samples)
chunkCount += 1
codeCount += chunk.codeCount
sampleRate = chunk.sampleRate
if chunk.isFinal {
finishedOnEos = chunk.finishedOnEos
}
}
let synthMs = Date().timeIntervalSince(t0) * 1000
// Empty-stream guard (synthesizeStream returns immediately on
// zero-length input). Fall back to synthMs so downstream
// percentile math doesn't see ttftMs == 0.
if chunkCount == 0 { ttftMs = synthMs }
return BackendPhraseSample(
synthMs: synthMs,
ttftMs: ttftMs,
samples: aggregated,
sampleRate: sampleRate,
stageMs: [:],
extraFields: [
"code_count": codeCount,
"finished_on_eos": finishedOnEos,
"chunk_count": chunkCount,
]
)
}
}
// MARK: - CosyVoice3 driver
private static func runCosyVoice3(
phrases: [(category: String, text: String)],
corpusLabel: String,
voice: String?,
preset: TtsComputeUnitPreset,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice
) async throws {
let units = preset.uniformUnits ?? .cpuAndNeuralEngine
let voiceId = voice ?? "cosyvoice3-default-zh"
let coldStart = Date()
let manager = try await CosyVoice3TtsManager.downloadAndCreate(
cacheDirectory: nil, includeDefaultVoice: true, computeUnits: units)
try await manager.initialize()
let promptAssets = try await manager.loadVoice(voiceId)
let coldStartS = Date().timeIntervalSince(coldStart)
logger.info(String(format: "Cold start (download+init+voice): %.2fs", coldStartS))
let firstStart = Date()
_ = try await manager.synthesize(text: "你好", promptAssets: promptAssets)
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
try await runPhraseLoop(
backendId: "cosyvoice3",
voiceLabel: voiceId,
corpusLabel: corpusLabel,
phrases: phrases,
preset: preset,
coldStartS: coldStartS,
firstSynthMs: firstSynthMs,
outputJson: outputJson,
audioDir: audioDir,
asrChoice: asrChoice,
extraSummary: ["voice": voiceId]
) { text in
let t0 = Date()
let result = try await manager.synthesize(text: text, promptAssets: promptAssets)
let synthMs = Date().timeIntervalSince(t0) * 1000
return BackendPhraseSample(
synthMs: synthMs,
ttftMs: synthMs,
samples: result.samples,
sampleRate: result.sampleRate,
stageMs: [:],
extraFields: [
"generated_token_count": result.generatedTokenCount,
"decoded_token_count": result.decodedTokens.count,
// Surface the structural 250-token Flow-input cap as a
// per-phrase boolean so corpus reports can tally how many
// long phrases hit silent truncation.
"finished_on_eos": result.finishedOnEos,
]
)
}
}
// MARK: - StyleTTS2 driver
private static func runStyleTts2(
phrases: [(category: String, text: String)],
corpusLabel: String,
voicePath: String?,
preset: TtsComputeUnitPreset,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice
) async throws {
guard let voicePath, !voicePath.isEmpty else {
logger.error(
"StyleTTS2 requires --voice <path/to/ref_s.bin> "
+ "(256 fp32 LE blob from mobius-styletts2/scripts/06_dump_ref_s.py)")
exit(1)
}
let voiceURL = resolveURL(voicePath, isDirectory: false)
let voiceLabel = voiceURL.deletingPathExtension().lastPathComponent
// StyleTTS2 doesn't expose a compute-units knob today; --compute-units
// is accepted for parity with other backends but only labels the run.
let manager = StyleTTS2Manager()
let coldStart = Date()
try await manager.initialize()
let coldStartS = Date().timeIntervalSince(coldStart)
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
let firstStart = Date()
_ = try await manager.synthesizeSamples(
text: "Initialization warm-up.", voiceStyleURL: voiceURL, randomSeed: 42)
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
try await runPhraseLoop(
backendId: "styletts2",
voiceLabel: voiceLabel,
corpusLabel: corpusLabel,
phrases: phrases,
preset: preset,
coldStartS: coldStartS,
firstSynthMs: firstSynthMs,
outputJson: outputJson,
audioDir: audioDir,
asrChoice: asrChoice,
extraSummary: ["voice": voiceLabel]
) { text in
let t0 = Date()
let result = try await manager.synthesizeSamples(
text: text, voiceStyleURL: voiceURL, randomSeed: 42)
let synthMs = Date().timeIntervalSince(t0) * 1000
return BackendPhraseSample(
synthMs: synthMs,
ttftMs: synthMs,
samples: result.samples,
sampleRate: result.sampleRate,
stageMs: [:],
extraFields: [:]
)
}
}
// MARK: - Shared per-phrase loop + summary
private static func runPhraseLoop(
backendId: String,
voiceLabel: String,
corpusLabel: String,
phrases: [(category: String, text: String)],
preset: TtsComputeUnitPreset,
coldStartS: Double,
firstSynthMs: Double,
outputJson: String?,
audioDir: String?,
asrChoice: AsrChoice,
extraSummary: [String: Any],
synthOne: (String) async throws -> BackendPhraseSample
) async throws {
// Optional output dir for WAVs.
var audioDirURL: URL? = nil
if let audioDir {
let url = resolveURL(audioDir, isDirectory: true)
try FileManager.default.createDirectory(
at: url, withIntermediateDirectories: true)
audioDirURL = url
}
// Build optional ASR backend (Parakeet, Cohere, or none).
let asrLoop = try await buildAsrLoop(asrChoice)
var perPhrase: [[String: Any]] = []
var byCategory: [String: [Int]] = [:]
for (idx, item) in phrases.enumerated() {
let label = String(format: "[%02d/%02d]", idx + 1, phrases.count)
logger.info("\(label) [\(item.category)] \(item.text)")
let sample = try await synthOne(item.text)
let audioMs =
Double(sample.samples.count) / Double(sample.sampleRate) * 1000
let rtfx = sample.synthMs > 0 ? audioMs / sample.synthMs : 0
// Persist WAV (audioDir if set, else temp file for ASR).
let wavURL: URL
if let audioDirURL {
wavURL = audioDirURL.appendingPathComponent(
String(format: "phrase_%03d.wav", idx + 1))
} else {
wavURL = FileManager.default.temporaryDirectory
.appendingPathComponent("tts-benchmark-\(UUID().uuidString).wav")
}
let wavData = try AudioWAV.data(
from: sample.samples, sampleRate: Double(sample.sampleRate))
try wavData.write(to: wavURL)
var werValue = Double.nan
var cerValue = Double.nan
var hypothesis = ""
var asrMs = 0.0
if let asrLoop {
let asr0 = Date()
hypothesis = try await asrLoop.transcribeOne(wavURL)
asrMs = Date().timeIntervalSince(asr0) * 1000
let m = WERCalculator.calculateWERAndCER(
hypothesis: hypothesis, reference: item.text)
werValue = m.wer
cerValue = m.cer
}
if audioDirURL == nil {
try? FileManager.default.removeItem(at: wavURL)
}
logger.info(
String(
format:
" ttft=%.0f ms synth=%.0f ms audio=%.0f ms rtfx=%.2fx wer=%.1f%% cer=%.1f%%",
sample.ttftMs, sample.synthMs, audioMs, rtfx,
werValue.isNaN ? 0 : werValue * 100,
cerValue.isNaN ? 0 : cerValue * 100))
byCategory[item.category, default: []].append(perPhrase.count)
var phraseDict: [String: Any] = [
"index": idx + 1,
"category": item.category,
"reference": item.text,
"hypothesis": hypothesis,
"ttft_ms": sample.ttftMs,
"synth_ms": sample.synthMs,
"audio_ms": audioMs,
"rtfx": rtfx,
"wer": werValue.isNaN ? NSNull() : werValue as Any,
"cer": cerValue.isNaN ? NSNull() : cerValue as Any,
"asr_ms": asrMs,
"stage_ms": sample.stageMs,
"wav_path": audioDirURL == nil ? "" : wavURL.path,
]
for (k, v) in sample.extraFields {
phraseDict[k] = v
}
perPhrase.append(phraseDict)
}
if let asrLoop {
await asrLoop.cleanup()
}
// Aggregate.
let totalSynthMs = perPhrase.reduce(0.0) { $0 + ($1["synth_ms"] as? Double ?? 0) }
let totalAudioMs = perPhrase.reduce(0.0) { $0 + ($1["audio_ms"] as? Double ?? 0) }
let aggRtfx = totalSynthMs > 0 ? totalAudioMs / totalSynthMs : 0
let synthMsValues = perPhrase.compactMap { $0["synth_ms"] as? Double }.sorted()
let p50 = percentile(synthMsValues, 0.5)
let p95 = percentile(synthMsValues, 0.95)
let ttftValues = perPhrase.compactMap { $0["ttft_ms"] as? Double }.sorted()
let ttftP50 = percentile(ttftValues, 0.5)
let ttftP95 = percentile(ttftValues, 0.95)
var categories: [[String: Any]] = []
for (cat, indexes) in byCategory.sorted(by: { $0.key < $1.key }) {
let werVals = indexes.compactMap { perPhrase[$0]["wer"] as? Double }
let cerVals = indexes.compactMap { perPhrase[$0]["cer"] as? Double }
let synthVals = indexes.compactMap { perPhrase[$0]["synth_ms"] as? Double }
let audioVals = indexes.compactMap { perPhrase[$0]["audio_ms"] as? Double }
let synthSum = synthVals.reduce(0, +)
let audioSum = audioVals.reduce(0, +)
let macroWer =
werVals.isEmpty ? Double.nan : werVals.reduce(0, +) / Double(werVals.count)
let macroCer =
cerVals.isEmpty ? Double.nan : cerVals.reduce(0, +) / Double(cerVals.count)
categories.append([
"category": cat,
"phrase_count": indexes.count,
"macro_wer": macroWer.isNaN ? NSNull() : macroWer as Any,
"macro_cer": macroCer.isNaN ? NSNull() : macroCer as Any,
"synth_ms_p50": percentile(synthVals.sorted(), 0.5),
"synth_ms_p95": percentile(synthVals.sorted(), 0.95),
"rtfx": synthSum > 0 ? audioSum / synthSum : 0,
])
}
let peakRssMb =
Double(FluidAudioCLI.fetchPeakMemoryUsageBytes() ?? 0) / 1024 / 1024
// Banner.
logger.info("--- Summary ---")
logger.info(" backend: \(backendId)")
logger.info(" voice/speaker: \(voiceLabel)")
logger.info(" corpus: \(corpusLabel) (n=\(phrases.count))")
logger.info(" compute units: \(preset.cliValue)")
logger.info(String(format: " cold start: %.2fs", coldStartS))
logger.info(String(format: " first synth: %.0f ms", firstSynthMs))
logger.info(String(format: " TTFT p50/p95: %.0f / %.0f ms", ttftP50, ttftP95))
logger.info(String(format: " warm synth p50: %.0f ms", p50))
logger.info(String(format: " warm synth p95: %.0f ms", p95))
logger.info(String(format: " agg RTFx: %.2fx", aggRtfx))
logger.info(String(format: " peak RSS: %.0f MB", peakRssMb))
if !asrChoice.skipped {
let werVals = perPhrase.compactMap { $0["wer"] as? Double }
let cerVals = perPhrase.compactMap { $0["cer"] as? Double }
let macroWer =
werVals.isEmpty ? 0 : werVals.reduce(0, +) / Double(werVals.count)
let macroCer =
cerVals.isEmpty ? 0 : cerVals.reduce(0, +) / Double(cerVals.count)
logger.info(" ASR backend: \(asrChoice.label)")
logger.info(String(format: " macro WER: %.2f%%", macroWer * 100))
logger.info(String(format: " macro CER: %.2f%%", macroCer * 100))
// Word-level WER is meaningless on whitespace-free scripts (zh, ja).
// Surface that explicitly so readers don't trust ~100% WER for zh.
if case .cohere(_, let lang, _) = asrChoice,
lang == .chinese || lang == .japanese
{
logger.info(
" note: WER is whitespace-tokenized; trust CER for \(lang.rawValue).")
}
} else {
logger.info(" WER/CER: skipped")
}
if let outputJson {
var summary: [String: Any] = [
"backend": backendId,
"corpus": corpusLabel,
"phrase_count": phrases.count,
"compute_units": preset.cliValue,
"cold_start_s": coldStartS,
"first_synth_ms": firstSynthMs,
"ttft_ms_p50": ttftP50,
"ttft_ms_p95": ttftP95,
"warm_synth_ms_p50": p50,
"warm_synth_ms_p95": p95,
"agg_rtfx": aggRtfx,
"peak_rss_mb": peakRssMb,
"asr_skipped": asrChoice.skipped,
"asr_backend": asrChoice.label,
]
for (k, v) in extraSummary {
summary[k] = v
}
let report: [String: Any] = [
"summary": summary,
"categories": categories,
"phrases": perPhrase,
]
let url = resolveURL(outputJson, isDirectory: false)
try FileManager.default.createDirectory(
at: url.deletingLastPathComponent(),
withIntermediateDirectories: true)
let data = try JSONSerialization.data(
withJSONObject: report, options: [.prettyPrinted, .sortedKeys])
try data.write(to: url)
logger.info("Report written: \(url.path)")
}
}
// MARK: - Corpus loading
private static func loadShippedCorpus(
_ name: String
) throws -> [(category: String, text: String)] {
let cwd = URL(
fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
let relativePath = corpusRelativePath(for: name)
let url = cwd.appendingPathComponent(relativePath, isDirectory: false)
let raw = try String(contentsOf: url, encoding: .utf8)
return parseCorpus(raw, category: name)
}
/// Map a `--corpus` name to its on-disk relative path.
///
/// All shipped corpora are MiniMax Multilingual TTS Test Set
/// languages `minimax-<lang>` resolves to
/// `Benchmarks/tts/corpus/minimax/<lang>.txt`. The CC-BY-SA-4.0
/// attribution lives next to the data in `minimax/README.md`.
/// Pass `--corpus-path` for ad-hoc files outside the shipped set.
private static func corpusRelativePath(for name: String) -> String {
let prefix = "minimax-"
if name.hasPrefix(prefix) {
let lang = String(name.dropFirst(prefix.count))
return "Benchmarks/tts/corpus/minimax/\(lang).txt"
}
// Back-compat shim anything else is assumed to live next to
// the minimax subdirectory. Prefer `--corpus-path` for non-shipped
// corpora.
return "Benchmarks/tts/corpus/\(name).txt"
}
private static func parseCorpus(
_ raw: String, category: String
) -> [(category: String, text: String)] {
return
raw
.split(whereSeparator: \.isNewline)
.map { $0.trimmingCharacters(in: .whitespaces) }
.filter { !$0.isEmpty && !$0.hasPrefix("#") }
.map { (category: category, text: $0) }
}
// MARK: - Backend dispatch
private enum Backend: String {
case kokoroAne
case kokoro
case pocketTts
case magpie
case cosyVoice3
case styleTts2
var defaultCorpus: String {
switch self {
case .cosyVoice3: return "minimax-chinese"
default: return "minimax-english"
}
}
}
private static func parseBackend(_ name: String) -> Backend {
switch name.lowercased() {
case "kokoro-ane", "kokoroane", "kokoro_ane", "lai":
return .kokoroAne
case "kokoro":
return .kokoro
case "pocket-tts", "pockettts", "pocket":
return .pocketTts
case "magpie":
return .magpie
case "cosyvoice3", "cosyvoice", "cosy":
return .cosyVoice3
case "styletts2", "style-tts2", "styletts", "style":
return .styleTts2
default:
logger.warning("Unknown backend '\(name)' — defaulting to kokoro-ane")
return .kokoroAne
}
}
private static func parsePocketLanguage(_ name: String?) -> PocketTtsLanguage {
guard let name, let l = PocketTtsLanguage(rawValue: name.lowercased()) else {
return .english
}
return l
}
private static func parseMagpieLanguage(_ name: String?) -> MagpieLanguage {
guard let name, let l = MagpieLanguage(rawValue: name.lowercased()) else {
return .english
}
return l
}
private static func parseMagpieSpeaker(_ name: String?) -> MagpieSpeaker {
switch name?.lowercased() {
case "sofia": return .sofia
case "aria": return .aria
case "jason": return .jason
case "leo": return .leo
case "john", nil, "": return .john
default: return .john
}
}
// MARK: - Helpers
private static func percentile(_ sorted: [Double], _ p: Double) -> Double {
guard !sorted.isEmpty else { return 0 }
let idx = Int((Double(sorted.count - 1) * p).rounded())
return sorted[max(0, min(sorted.count - 1, idx))]
}
private static func resolveURL(_ path: String, isDirectory: Bool) -> URL {
let expanded = (path as NSString).expandingTildeInPath
if expanded.hasPrefix("/") {
return URL(fileURLWithPath: expanded, isDirectory: isDirectory)
}
let cwd = URL(
fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
return cwd.appendingPathComponent(expanded, isDirectory: isDirectory)
}
// MARK: - ASR backend resolution & adapter construction
/// Map CLI flags + TTS backend defaults to a concrete `AsrChoice`.
///
/// Precedence: `--skip-asr` and `--asr-backend none` always win. With
/// no flag, English-friendly TTS backends default to Parakeet TDT and
/// CosyVoice3 defaults to `.skip` (Parakeet is English-only its WER
/// on Mandarin output reads ~100% and is meaningless).
private static func resolveAsrChoice(
skipAsrFlag: Bool,
backendName: String?,
cohereModelDir: String?,
asrLanguage: String?,
cohereComputeUnits: String?,
corpusLabel: String,
ttsBackend: Backend
) throws -> AsrChoice {
let normalized = backendName?.lowercased()
if skipAsrFlag || normalized == "none" {
return .skip
}
switch normalized {
case "cohere":
let dir = try resolveCohereModelDir(cohereModelDir)
let language = inferCohereLanguage(
explicit: asrLanguage, corpus: corpusLabel)
let units = try resolveCohereComputeUnits(cohereComputeUnits)
return .cohere(modelDir: dir, language: language, computeUnits: units)
case "parakeet":
return .parakeet
case nil:
// Implicit defaults: skip for CosyVoice3 (no English ASR pairing),
// Parakeet otherwise.
if ttsBackend == .cosyVoice3 {
logger.info(
"CosyVoice3: no --asr-backend selected; skipping ASR. "
+ "Pass `--asr-backend cohere --cohere-model-dir <dir>` for CER.")
return .skip
}
return .parakeet
default:
logger.warning(
"Unknown --asr-backend value '\(normalized ?? "")', falling back to parakeet.")
return .parakeet
}
}
/// Resolve a Cohere Transcribe model directory (must contain
/// `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
/// and `vocab.json`).
///
/// Order of resolution:
/// 1. Explicit `--cohere-model-dir <path>`.
/// 2. The default cache location at
/// `~/Library/Application Support/FluidAudio/Models/cohere-transcribe/q8`,
/// matching `Repo.cohereTranscribeCoreml.folderName`.
///
/// Auto-download is intentionally not wired here: the upstream
/// `Repo.cohereTranscribeCoreml` registration ships `vocab.json` in
/// `requiredModels`, but the file lives at the repo root rather than
/// under the `q8/` subPath, so `DownloadUtils.downloadRepo` would fail
/// the post-download verify. Fix this when the registry learns about
/// repo-root files; until then, callers must pre-populate the cache
/// (e.g. via `fluidaudio cohere-transcribe ... --model-dir <dir>`).
private static func resolveCohereModelDir(_ override: String?) throws -> URL {
if let override {
return resolveURL(override, isDirectory: true)
}
let appSupport = try FileManager.default.url(
for: .applicationSupportDirectory,
in: .userDomainMask, appropriateFor: nil, create: true)
let target =
appSupport
.appendingPathComponent("FluidAudio/Models/cohere-transcribe/q8")
let needed = [
ModelNames.CohereTranscribe.encoderCompiledFile,
ModelNames.CohereTranscribe.decoderCacheExternalV2CompiledFile,
"vocab.json",
]
let missing = needed.filter { name in
!FileManager.default.fileExists(
atPath: target.appendingPathComponent(name).path)
}
guard missing.isEmpty else {
throw NSError(
domain: "TtsBenchmark", code: 1,
userInfo: [
NSLocalizedDescriptionKey:
"Cohere model dir incomplete at \(target.path). "
+ "Missing: \(missing.joined(separator: ", ")). "
+ "Pass --cohere-model-dir <dir> with the required files, or "
+ "pre-populate the cache via `fluidaudio cohere-transcribe`."
])
}
return target
}
/// Pick a `CohereAsrConfig.Language` from an explicit flag value or by
/// scanning the corpus label (covers the shipped `minimax-<lang>` set).
private static func inferCohereLanguage(
explicit: String?, corpus: String
) -> CohereAsrConfig.Language {
if let explicit,
let lang = CohereAsrConfig.Language(rawValue: explicit.lowercased())
{
return lang
}
let lower = corpus.lowercased()
if lower.contains("chinese") || lower.contains("mandarin") || lower.hasSuffix("-zh") {
return .chinese
}
if lower.contains("japanese") || lower.contains("-ja") { return .japanese }
if lower.contains("korean") || lower.contains("-ko") { return .korean }
if lower.contains("vietnamese") || lower.contains("-vi") { return .vietnamese }
if lower.contains("french") || lower.contains("-fr") { return .french }
if lower.contains("german") || lower.contains("-de") { return .german }
if lower.contains("spanish") || lower.contains("-es") { return .spanish }
if lower.contains("italian") || lower.contains("-it") { return .italian }
if lower.contains("portuguese") || lower.contains("-pt") { return .portuguese }
if lower.contains("dutch") || lower.contains("-nl") { return .dutch }
if lower.contains("polish") || lower.contains("-pl") { return .polish }
if lower.contains("greek") || lower.contains("-el") { return .greek }
if lower.contains("arabic") || lower.contains("-ar") { return .arabic }
return .english
}
/// Parse `--cohere-compute-units` into `MLComputeUnits`. Defaults to
/// `.all` (CoreML decides). Use `cpu-and-gpu` to skip the ANE compile
/// attempt when the q8 encoder fails ANE compilation (observed:
/// `MILCompilerForANE error: failed to compile ANE model using ANEF`,
/// CoreML falls back to CPU+GPU but pays a multi-minute compile cost
/// on the first call).
private static func resolveCohereComputeUnits(
_ flag: String?
) throws
-> MLComputeUnits
{
guard let raw = flag?.lowercased(), !raw.isEmpty else { return .all }
switch raw {
case "all", "default": return .all
case "all-ane", "ane", "neural-engine", "cpu-and-ane":
return .cpuAndNeuralEngine
case "cpu-and-gpu", "cpuandgpu", "gpu": return .cpuAndGPU
case "cpu-only", "cpu", "cpuonly": return .cpuOnly
default:
throw NSError(
domain: "TtsBenchmark", code: 3,
userInfo: [
NSLocalizedDescriptionKey:
"Unknown --cohere-compute-units value '\(raw)'. "
+ "Expected: all | cpu-and-gpu | cpu-only | all-ane."
])
}
}
/// Human-readable label for log lines.
private static func describeComputeUnits(_ cu: MLComputeUnits) -> String {
switch cu {
case .all: return "all (CPU+GPU+ANE)"
case .cpuAndNeuralEngine: return "cpu-and-ane"
case .cpuAndGPU: return "cpu-and-gpu"
case .cpuOnly: return "cpu-only"
@unknown default: return "unknown"
}
}
/// Build the per-phrase ASR adapter for a resolved choice. Returns
/// `nil` for `.skip` so the loop can short-circuit.
private static func buildAsrLoop(_ choice: AsrChoice) async throws -> AsrLoop? {
switch choice {
case .skip:
return nil
case .parakeet:
let asrModels = try await AsrModels.downloadAndLoad()
let asr = AsrManager()
try await asr.loadModels(asrModels)
let layers = await asr.decoderLayerCount
return AsrLoop(
label: "parakeet-tdt",
transcribeOne: { url in
var state = TdtDecoderState.make(decoderLayers: layers)
let r = try await asr.transcribe(url, decoderState: &state)
return r.text
},
cleanup: { await asr.cleanup() }
)
case .cohere(let modelDir, let language, let computeUnits):
guard #available(macOS 14, iOS 17, *) else {
throw NSError(
domain: "TtsBenchmark", code: 2,
userInfo: [
NSLocalizedDescriptionKey:
"Cohere ASR backend requires macOS 14+ / iOS 17+."
])
}
logger.info(
"Loading Cohere Transcribe (lang=\(language.englishName), "
+ "compute=\(describeComputeUnits(computeUnits))) from \(modelDir.path)")
let models = try await CoherePipeline.loadModels(
encoderDir: modelDir,
decoderDir: modelDir,
vocabDir: modelDir,
decoderVariant: .v2,
computeUnits: computeUnits)
let pipeline = CoherePipeline()
let converter = AudioConverter()
return AsrLoop(
label: "cohere-transcribe-\(language.rawValue)",
transcribeOne: { url in
let samples = try converter.resampleAudioFile(path: url.path)
let r = try await pipeline.transcribe(
audio: samples,
models: models,
language: language,
maxNewTokens: 108,
repetitionPenalty: 1.1,
noRepeatNgram: 3)
return r.text
},
cleanup: {}
)
}
}
private static func printUsage() {
logger.info(
"""
Usage: fluidaudio tts-benchmark [options]
Quantitative TTS benchmark — TTFT, cold/warm split, per-stage timings,
peak RSS, WER + CER per category, configurable compute-unit preset.
Backends:
kokoro-ane 7-stage ANE pipeline (per-stage timings, per-stage CU)
kokoro Single-graph CPU+GPU
pocket-tts Streaming flow-matching (multilingual)
magpie Encoder-decoder + NanoCodec (per-stage, slow)
cosyvoice3 Mandarin LLM-based (auto-picks Cohere ASR for zh)
Options:
--backend <name> See list above (default: kokoro-ane)
--corpus <name> MiniMax corpus name: minimax-<lang>
(e.g. minimax-english, minimax-chinese,
minimax-vietnamese — 24 languages total;
see Documentation/TTS/MinimaxCorpus.md)
--corpus-path <path> Custom corpus file (overrides --corpus)
--voice <name> Voice id (Kokoro/PocketTTS/CosyVoice3)
--speaker <name> Magpie speaker: john|sofia|aria|jason|leo
--language <code> PocketTTS lang pack or Magpie language code
--compute-units <preset> default | all-ane | cpu-and-gpu | cpu-only
--output-json <path> Write JSON report
--audio-dir <path> Keep generated WAVs under this dir
--skip-asr Skip ASR roundtrip (no WER/CER)
--asr-backend <name> ASR engine for the WER/CER pass:
parakeet English-only (default for en)
cohere Multilingual (default for non-en)
none Same as --skip-asr
--cohere-model-dir <path> Path to a directory containing Cohere
Transcribe encoder/decoder/vocab.json.
Required when --asr-backend cohere is
active (auto-download is not wired —
vocab.json lives at the repo root, not
under /q8). Default: cache at
~/Library/Application Support/FluidAudio/
Models/cohere-transcribe/q8
--asr-language <code> Override Cohere language code (default:
inferred from corpus name). One of:
en, zh, ja, ko, vi, fr, de, es, it, pt,
nl, pl, el, ar
--cohere-compute-units <p> Cohere ASR compute mapping:
all (default; CoreML decides) |
cpu-and-gpu | cpu-only | all-ane.
Use cpu-and-gpu when q8 ANE compile
fails (`MILCompilerForANE error: …`)
— avoids the multi-minute fallback
compile on first call.
--help, -h Show this help
Examples:
fluidaudio tts-benchmark --backend kokoro-ane --output-json bench.json
fluidaudio tts-benchmark --backend kokoro --corpus minimax-english
fluidaudio tts-benchmark --backend pocket-tts --corpus minimax-german --language german
fluidaudio tts-benchmark --backend magpie --speaker sofia --language en
fluidaudio tts-benchmark --backend cosyvoice3 --corpus minimax-chinese \\
--asr-backend cohere --cohere-model-dir ~/.fluidaudio/cohere/q8
Notes:
For Chinese (zh) and Japanese (ja), WER is meaningless because
WERCalculator splits on whitespace; trust the CER column instead.
The summary banner prints an explicit reminder for these langs.
"""
)
}
}
#endif