mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
7603ac6733
## Summary Adds `fluidaudio tts-benchmark`, a unified harness for measuring **latency × efficiency × quality** across every shipping TTS backend in FluidAudio, plus the model + runtime fixes needed to actually clear all six backends end-to-end on the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set). Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs level so users get a runtime warning on `initialize()` reflecting their actual perf / quality posture. ### Backends — all green on M2 / macOS 26 | Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER | Notes | |---|---|---|---|---|---|---| | Kokoro ANE | minimax-en (100/100) | ✅ | 3.5 s / 8.0 s / 11.4 s | 5.19× | 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep | | Kokoro | minimax-en (100/100) | ✅ | 3.5 s / 6.8 s / 9.3 s | 2.02× | 1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip | | PocketTTS | minimax-en (100/100) | ✅ | 2.8 s / 6.3 s / 9.4 s | 0.61× | 1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms — RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) | | Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s | 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init | | StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6 s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init | | CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s / **16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine) — longest output 16.0 s (was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); **whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100 phrases‡; RTFx < 1, runtime warning on init | | CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s / **16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 → 5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth | ⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a `logger.warning` flagging the perf / quality posture; safe to ship in non-latency-sensitive paths but read the per-backend doc first. ‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator` whitespace-tokenizes and Mandarin has no word boundaries (word-level WER reads ~100% and is meaningless). CER is `whisper-large-v3` against the rendered WAVs from the full 100-phrase `minimax-chinese` run via `Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this PR via `--asr-backend cohere` (see [Cohere ASR backend in the harness](#cohere-asr-backend-in-the-harness) below) and agrees with whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a `MILCompilerForANE` cache failure on this M2 host that drops it to RTFx ~0.13×, so whisper is the practical source-of-truth for the full 100-phrase run. Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category) live in `Documentation/TTS/Benchmarks.md`. Corpus attribution + reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`. ### RTFx caveat — phrase length and streaming granularity both matter Aggregate RTFx (audio_duration / wall_clock) is **only directly comparable between backends when both produce similar phrase lengths and yield audio at the same granularity**. Two things skew the headline number on this corpus: **1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per `minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3× more audio out. That's mostly long inter-word pauses + slow speaking rate baked into the LibriTTS multi-speaker checkpoint, not a measurement artifact. A 2.72× RTFx on 22 s audio = ~8 s wall — which matches the TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus RTFx ratios alone hide this. **2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs. Kokoro's 2.02× but it's **not slower from a user perspective**: PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**, Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The 0.61× is the per-frame cost averaged across the streaming run; what users feel is TTFT. | Backend | TTFT p50 | First yield | Implication | |-------------|----------|------------------|--------------------------------------------| | PocketTTS | 1244 ms | 80 ms frame | true streaming; conversational-ready | | Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio; ANE-tuned | | Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte | | StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase output amortizes the wall | | Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s — 36% earlier playback start | | CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz | one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only | For conversational use cases, **TTFT > RTFx**. PocketTTS (true streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE (small one-shot chunks) are the three backends that meaningfully clear the "user feels it's responsive" bar today. ### Beta callouts (StyleTTS2, Magpie, CosyVoice3) Three of the six shipping backends post numbers that callers should weigh against an explicit caveat: - **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%. The misaki→espeak post-pass remap closed half the gap; the remainder is BART G2P misses + diffusion-sampler formant breaks on long phrases. - **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via `synthesizeStream` so TTFT (9.6 s p50) is significantly better than full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to ~30 s. - **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token Flow input cap is now worked around at the call site by the auto-chunker (long phrases split + crossfaded), dropping cantonese truncation from 80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100 residual is the long-tail token-rate worst case; the structural fix is re-exporting Flow with a larger fixed input shape (tracked in `mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` + a `.warning`-level `LLM-Decode budget exhausted` log still surface any truncation, and the harness writes `finished_on_eos` into each phrase in the JSON report. Each manager now logs a `.warning`-level beta notice on `initialize()` (mirroring the existing CosyVoice3 pattern) so anyone wiring these into a product gets a console signal, not a silent surprise. Docs (`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md` StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same caveat at the top. ### Model + runtime fixes landed in this PR #### CosyVoice3 stateless port (`71130c9fb`) Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain `MLDictionaryFeatureProvider` prediction with explicit kv carry-forward; lowers the availability gate from macOS 15 / iOS 18 back to the package baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the rename. #### CosyVoice3 HiFT timeout fix (`267766b62`) `minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`, which let the planner place most of the graph on ANE but kept at least one op on the BNNS CPU async-dispatch path; long phrases tripped the BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of user-supplied compute-units, removing the BNNS path entirely. Verified on 100/100 zh + 100/100 yue. #### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`) The autoregressive decode loop runs ~163 steps per phrase to fill the 250-token cap. Each step takes the previous step's KV cache as `kv_k` / `kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh `kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side `MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV back-buffers + a logits backing, rotates front/back/spare across steps via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc on first rejection (one-shot `logger.warning`). Mirrors the Magpie pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357 (+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470 MB. #### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`) The 250-token Flow input cap means a single synth pass produces at most ~6.5 s of audio regardless of input length. Re-exporting Flow with a larger fixed input shape is gated on upstream conversion work, so this PR works around it at the call site: long inputs are split at sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized independently, and merged with an 8 ms equal-power cosine crossfade. **Splitter policy**: hard enders (`. ! ? 。 ! ? \n`) commit always; soft enders (`, 、 ; : ; ,` + ASCII space) commit only at-or-past budget; force-split at +30 token overshoot if no natural boundary exists. `defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap minus a typical 60–90-token speech-prompt context). Token-rate heuristic is calibrated against minimax-zh + minimax-yue runs: | Char class | Tokens / char | Rationale | |------------|---------------|--------------------------------------------------------------| | CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char | | ASCII | 1.5 | matches BPE rate on English text | | Other | 2.5 | conservative for accented Latin / non-CJK Unicode | **Validation** on full `minimax-cantonese` (100 phrases, M2): | Metric | Pre-chunker | Post-chunker | Δ | |-------------------------------------------|-------------|--------------|------------| | `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% | | Longest audio output | 6.5 s | **16.1 s** | +148% | | agg-RTFx | 0.245× | 0.249× | +1.6% | | TTFT p50 | 23.9 s | 35.7 s | +49% | The TTFT regression is the cost of running multiple synth passes per long phrase — splitting unblocks long-form output at the price of wall-clock latency. The 5/100 residual truncation is the long-tail token-rate worst case (some chars hit ~9 tokens/char); raising the per-CJK heuristic further would over-fragment short phrases. Cleaner fix is the Flow re-export. 16-test suite covers tokenization estimates, hard/soft/force-split policy, and the crossfade arithmetic. Lives in `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift` + `CosyVoice3TtsManager.concatWithCrossfade`. #### Magpie streaming TTFT wire-up (`ace0bf485`) `TtsBenchmarkCommand.swift` now drives Magpie through `MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first `MagpieAudioChunk` emit instead of conflating it with full-synth wall time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6 s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run benefit; fundamentals unchanged). #### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`) `text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT: tensor_buffer has known strides while the model has FlexibleShapeInfo`. The CoreML runtime rejects two access patterns on outputs from a flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element subscripts — and the original `sliceFirstAxis2D` helper used both. Fix rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling `.float32`, `.float16`, `.double`) and computes the flat index from the known `(1, leading, trailing)` row-major layout. Verified on full 100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS demo voice. #### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`) After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only --corpus` mode and disproved the silent-vocab-drop hypothesis: only **0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII hyphens / 12247 scalars). Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2 share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2 LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized** LibriTTS — predating misaki by years. The 178-vocab accepts both forms (e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic embeddings for the misaki ligature glyphs are essentially untrained noise. Side-by-side comparison against locally-installed `espeak-ng -v en-us --ipa -q` flagged four systematic divergences: | misaki | espeak-ng | example | |--------|-----------|--------------------------| | `ʧ` | `tʃ` | choice → `tʃˈɔɪs` | | `ʤ` | `dʒ` | jump → `dʒˈʌmps` | | `ɜɹ` | `ɝ` | girl → `ɡˈɝl` | | `əɹ` | `ɚ` | over → `ˈoʊvɚ` | Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated on `.americanEnglish` and applied to the assembled phoneme string after every word has been emitted by the BART G2P. Lives alongside the existing per-piece misaki diphthong remap. Result on the same 100-phrase MiniMax-English run with the same `libritts_696` voice and same Parakeet TDT roundtrip: | Metric | Pre | Post | Δ | |-----------------|-------|-------|--------| | Macro WER | 0.581 | 0.440 | −24.2% | | Macro CER | 0.476 | 0.241 | −49.5% | | TTFT p50 (ms) | 8937 | 6671 | −25.4% | | Agg RTFx | 2.36× | 2.72× | +15.3% | | Peak RSS (MB) | 1428 | 963 | −32.6% | Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice. Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster on word-level G2P misses from the BART itself (`practical → practicckles`, `separation → expiration`) and diffusion-sampler formant breaks; closing the rest of the gap to Kokoro likely needs richer espeak coverage or libespeak-ng vendor — tracked separately. #### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`) `StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit `logger.warning` beta notices mirroring the existing `CosyVoice3TtsManager.initialize` pattern. Backends docs (`Magpie.md` Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️ Beta / experimental` callouts so the perf / quality posture is visible at every entry point — runtime, manager docstring, doc top, PR body. #### Magpie `outputBackings` rejection fallback (`72dae8400` + `9767e1ef9`) The shipped `decoder_step.mlmodelc` reaches the user before the rebuild lands, so CoreML can reject our `outputBackings` dictionary on a name-mismatch. Latched fallback path falls back to a fresh-alloc decode so the model still runs; first rejection latches the flag for the rest of the run. ### Cohere ASR backend in the harness (`8e741e659`) Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER through the harness against [Cohere Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into `--skip-asr`. Four new flags on `tts-benchmark`: - `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip engine. Default is `parakeet` for English-only runs and skipped for CosyVoice3. - `--cohere-model-dir <path>` — path to a directory containing `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`, and `vocab.json`. - `--asr-language <code>` — overrides the inferred language code (covers all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh, ko, vi). - `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins `MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu` when the q8 encoder fails ANE compilation (`MILCompilerForANE error: failed to compile ANE model using ANEF`) to skip the multi-minute fallback compile on the first call. The harness logs a WER caveat for zh/ja runs flagging that whitespace-tokenized WER is meaningless and the CER column is the real signal. Example end-to-end: ```bash fluidaudio tts-benchmark \ --backend cosyvoice3 \ --corpus minimax-chinese \ --asr-backend cohere \ --cohere-model-dir /path/to/cohere/q8 \ --asr-language zh \ --output-json benchmark_results/cv3-zh-cohere.json \ --audio-dir benchmark_results/cv3-zh-cohere/audio ``` On this M2 host the q8 encoder hits a CoreML ANE-cache failure (`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per `Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is unaffected (same graph, same output), only latency. The full 100-phrase CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was therefore produced via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100 phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5% CER range. ### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`) Replaces the original `prose-en` / `numbers-en` / `names-en` / `prose-zh` shipped with the first cut of this PR with the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set) (CC-BY-SA-4.0; 100 phrases × 25 languages). Same public corpus used by [MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and Gradium — numbers in this PR are paper-comparable. The 24 per-language `.txt` files used to be vendored in `Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them from the upstream HF dataset at the pinned revision and writes them to the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth (HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no hardcoded asset URLs. The `.txt` files now live in `.gitignore` since they're CC-BY-SA-4.0 derivative content; only `Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in `ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior `python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend language scope: | Backend | Languages benchmarked | |---|---| | Kokoro / Kokoro ANE | en (af_heart) | | PocketTTS | en + de + it + pt + es + fr | | Magpie | en + es + de + fr + it + vi + zh + hi | | StyleTTS2 | en (LibriTTS multi-spk) | | CosyVoice3 | zh + yue | ### PocketTTS streaming TTFT (`c26f1e163`) PocketTTS now drives the harness through its `synthesizeStreaming` API so TTFT measures time-to-first-80ms-frame instead of full one-shot synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage that one-shot benchmarking previously hid. ### Reference voice dumper helper (mobius-styletts2) `mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo) wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize` consumes via `--voice`. Required because the shipped CoreML bundle doesn't include those upstream-only PyTorch encoders. ## Test plan - [x] `swift build -c release` clean - [x] `swift format lint` clean for new files - [x] `fluidaudio tts-benchmark --help` lists all 6 backends - [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x` produces byte-identical output to the deleted Python script - [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en - [x] StyleTTS2 — full 100/100 minimax-en (verified after `sliceFirstAxis2D` fix + post-pass remap) - [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue (verified after HiFT + LLM-Decode `outputBackings` fixes) - [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green - [x] No `@unchecked Sendable`; per-backend error enums use `Error, LocalizedError` - [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on `initialize()` - [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`; cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`, `TtsBenchmarkCommand.swift` updated - [x] CosyVoice3 6.5 s output cap investigated — confirmed structural (250-token Flow input shape, 40 ms / token); surfaced via `finishedOnEos` + warning log + JSON `finished_on_eos` field. See [Decode budget cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap) - [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site workaround. Validated on full minimax-cantonese: truncation **80/100 → 5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×. 16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3 auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker) - [x] **Magpie streaming TTFT** wired through `synthesizeStream` in `TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50 **9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run) - [x] **Cohere ASR harness wiring** (`--asr-backend cohere` + `--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`). Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8 macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04% — both backends agree - [x] **CosyVoice3 zh CER on full corpus** measured via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**. Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote ‡)
1360 lines
56 KiB
Swift
1360 lines
56 KiB
Swift
#if os(macOS)
|
||
import CoreML
|
||
import FluidAudio
|
||
import Foundation
|
||
|
||
/// `fluidaudio tts-benchmark` — quantitative TTS benchmark harness.
|
||
///
|
||
/// Reports **TTFT / cold-start / warm-start latency, per-stage timings,
|
||
/// peak RSS, WER + CER per category** — i.e. the things conversational
|
||
/// TTS users actually feel — instead of just RTFx.
|
||
///
|
||
/// Backends:
|
||
/// kokoro-ane — 7-stage ANE pipeline (per-stage timings, per-stage CU)
|
||
/// kokoro — single-graph CPU+GPU (chunk-level only)
|
||
/// pocket-tts — streaming flow-matching (no per-stage timings)
|
||
/// magpie — encoder-decoder + NanoCodec (6-stage timings, slow)
|
||
/// cosyvoice3 — Mandarin LLM-based (Mandarin corpus only, no WER)
|
||
/// styletts2 — diffusion + HiFi-GAN (one-shot, requires --voice ref_s.bin)
|
||
///
|
||
/// Usage:
|
||
/// fluidaudio tts-benchmark --backend kokoro-ane \
|
||
/// --corpus minimax-english \
|
||
/// --voice af_heart \
|
||
/// --compute-units default \
|
||
/// --output-json bench.json
|
||
///
|
||
/// Corpora land in `Benchmarks/tts/corpus/minimax/<lang>.txt` —
|
||
/// the MiniMax Multilingual TTS Test Set (CC-BY-SA-4.0,
|
||
/// 24 languages × 100 phrases). The `.txt` files are gitignored;
|
||
/// populate them with `swift run fluidaudio minimax-corpus`. See
|
||
/// `Documentation/TTS/MinimaxCorpus.md` for attribution + reproduction
|
||
/// notes and `Documentation/TTS/Benchmarks.md` for the per-backend ↔
|
||
/// language coverage matrix. Reference with `--corpus minimax-<lang>`
|
||
/// (e.g. `minimax-english`, `minimax-chinese`, `minimax-vietnamese`, …).
|
||
public enum TtsBenchmarkCommand {
|
||
|
||
private static let logger = AppLogger(category: "TtsBenchmarkCommand")
|
||
|
||
// MARK: - Per-phrase sample emitted by every backend driver.
|
||
private struct BackendPhraseSample {
|
||
let synthMs: Double
|
||
let ttftMs: Double // For one-shot backends, == synthMs.
|
||
let samples: [Float]
|
||
let sampleRate: Int
|
||
let stageMs: [String: Double] // Empty if backend has no per-stage timings.
|
||
let extraFields: [String: Any] // encoder_tokens, finished_on_eos, etc.
|
||
}
|
||
|
||
// MARK: - ASR backend selection
|
||
//
|
||
// The harness supports two ASR backends for the TTS→ASR roundtrip:
|
||
// .parakeet — Parakeet TDT (English-only, auto-downloaded).
|
||
// .cohere — Cohere Transcribe cache-external (14 languages incl. zh).
|
||
// CosyVoice3's Mandarin output requires `.cohere` for a meaningful CER
|
||
// — Parakeet's English-only output collapses to ~100% on zh.
|
||
fileprivate enum AsrChoice {
|
||
case skip
|
||
case parakeet
|
||
case cohere(modelDir: URL, language: CohereAsrConfig.Language, computeUnits: MLComputeUnits)
|
||
|
||
var label: String {
|
||
switch self {
|
||
case .skip: return "skip"
|
||
case .parakeet: return "parakeet-tdt"
|
||
case .cohere(_, let lang, let cu):
|
||
return "cohere-transcribe-\(lang.rawValue)/\(Self.computeLabel(cu))"
|
||
}
|
||
}
|
||
|
||
private static func computeLabel(_ cu: MLComputeUnits) -> String {
|
||
switch cu {
|
||
case .all: return "all"
|
||
case .cpuAndNeuralEngine: return "cpu+ane"
|
||
case .cpuAndGPU: return "cpu+gpu"
|
||
case .cpuOnly: return "cpu"
|
||
@unknown default: return "unknown"
|
||
}
|
||
}
|
||
|
||
var skipped: Bool {
|
||
if case .skip = self { return true } else { return false }
|
||
}
|
||
}
|
||
|
||
/// Closure-based ASR adapter so `runPhraseLoop` doesn't have to know
|
||
/// which backend it's driving. Built once before the per-phrase loop,
|
||
/// torn down after.
|
||
fileprivate struct AsrLoop {
|
||
let label: String
|
||
let transcribeOne: (URL) async throws -> String
|
||
let cleanup: () async -> Void
|
||
}
|
||
|
||
public static func run(arguments: [String]) async {
|
||
var backendName = "kokoro-ane"
|
||
var corpusName: String?
|
||
var corpusPath: String?
|
||
var voice: String?
|
||
var speakerName: String?
|
||
var languageName: String?
|
||
var computeUnitsName = "default"
|
||
var outputJson: String?
|
||
var audioDir: String?
|
||
var skipAsr = false
|
||
var asrBackendName: String?
|
||
var cohereModelDirArg: String?
|
||
var asrLanguageArg: String?
|
||
var cohereComputeUnitsArg: String?
|
||
|
||
var i = 0
|
||
while i < arguments.count {
|
||
let arg = arguments[i]
|
||
switch arg {
|
||
case "--backend":
|
||
if i + 1 < arguments.count {
|
||
backendName = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--corpus":
|
||
if i + 1 < arguments.count {
|
||
corpusName = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--corpus-path":
|
||
if i + 1 < arguments.count {
|
||
corpusPath = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--voice":
|
||
if i + 1 < arguments.count {
|
||
voice = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--speaker":
|
||
if i + 1 < arguments.count {
|
||
speakerName = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--language":
|
||
if i + 1 < arguments.count {
|
||
languageName = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--compute-units":
|
||
if i + 1 < arguments.count {
|
||
computeUnitsName = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--output-json":
|
||
if i + 1 < arguments.count {
|
||
outputJson = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--audio-dir":
|
||
if i + 1 < arguments.count {
|
||
audioDir = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--skip-asr":
|
||
skipAsr = true
|
||
case "--asr-backend":
|
||
if i + 1 < arguments.count {
|
||
asrBackendName = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--cohere-model-dir":
|
||
if i + 1 < arguments.count {
|
||
cohereModelDirArg = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--asr-language":
|
||
if i + 1 < arguments.count {
|
||
asrLanguageArg = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--cohere-compute-units":
|
||
if i + 1 < arguments.count {
|
||
cohereComputeUnitsArg = arguments[i + 1]
|
||
i += 1
|
||
}
|
||
case "--help", "-h":
|
||
printUsage()
|
||
return
|
||
default:
|
||
logger.warning("Unknown argument: \(arg)")
|
||
}
|
||
i += 1
|
||
}
|
||
|
||
let backend = parseBackend(backendName)
|
||
|
||
// Resolve corpus.
|
||
let phrases: [(category: String, text: String)]
|
||
let corpusLabel: String
|
||
do {
|
||
if let corpusPath {
|
||
let url = resolveURL(corpusPath, isDirectory: false)
|
||
let raw = try String(contentsOf: url, encoding: .utf8)
|
||
phrases = parseCorpus(raw, category: url.deletingPathExtension().lastPathComponent)
|
||
corpusLabel = url.lastPathComponent
|
||
} else {
|
||
let resolved = corpusName ?? backend.defaultCorpus
|
||
phrases = try loadShippedCorpus(resolved)
|
||
corpusLabel = resolved
|
||
}
|
||
} catch {
|
||
logger.error("Failed to load corpus: \(error.localizedDescription)")
|
||
exit(1)
|
||
}
|
||
guard !phrases.isEmpty else {
|
||
logger.error("Corpus is empty after parsing")
|
||
exit(1)
|
||
}
|
||
logger.info("Loaded \(phrases.count) phrase(s) from corpus '\(corpusLabel)'")
|
||
|
||
guard let preset = TtsComputeUnitPreset(cliValue: computeUnitsName) else {
|
||
logger.error(
|
||
"Unknown --compute-units value: \(computeUnitsName). Expected default | all-ane | cpu-and-gpu | cpu-only."
|
||
)
|
||
exit(1)
|
||
}
|
||
|
||
// Resolve ASR backend choice. Precedence:
|
||
// --skip-asr or --asr-backend none → .skip
|
||
// --asr-backend cohere → .cohere(modelDir, language)
|
||
// --asr-backend parakeet → .parakeet
|
||
// no flag, backend == cosyvoice3 → .skip (Parakeet is English-only;
|
||
// Mandarin output collapses to ~100% WER)
|
||
// no flag, otherwise → .parakeet
|
||
let asrChoice: AsrChoice
|
||
do {
|
||
asrChoice = try resolveAsrChoice(
|
||
skipAsrFlag: skipAsr,
|
||
backendName: asrBackendName,
|
||
cohereModelDir: cohereModelDirArg,
|
||
asrLanguage: asrLanguageArg,
|
||
cohereComputeUnits: cohereComputeUnitsArg,
|
||
corpusLabel: corpusLabel,
|
||
ttsBackend: backend)
|
||
} catch {
|
||
logger.error("Failed to resolve ASR backend: \(error.localizedDescription)")
|
||
exit(1)
|
||
}
|
||
logger.info("ASR backend: \(asrChoice.label)")
|
||
|
||
do {
|
||
switch backend {
|
||
case .kokoroAne:
|
||
try await runKokoroAne(
|
||
phrases: phrases, corpusLabel: corpusLabel,
|
||
voice: voice ?? KokoroAneConstants.defaultVoice,
|
||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||
asrChoice: asrChoice)
|
||
case .kokoro:
|
||
try await runKokoro(
|
||
phrases: phrases, corpusLabel: corpusLabel,
|
||
voice: voice ?? TtsConstants.recommendedVoice,
|
||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||
asrChoice: asrChoice)
|
||
case .pocketTts:
|
||
try await runPocketTts(
|
||
phrases: phrases, corpusLabel: corpusLabel,
|
||
voice: voice ?? PocketTtsConstants.defaultVoice,
|
||
languageName: languageName,
|
||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||
asrChoice: asrChoice)
|
||
case .magpie:
|
||
try await runMagpie(
|
||
phrases: phrases, corpusLabel: corpusLabel,
|
||
speakerName: speakerName, languageName: languageName,
|
||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||
asrChoice: asrChoice)
|
||
case .cosyVoice3:
|
||
try await runCosyVoice3(
|
||
phrases: phrases, corpusLabel: corpusLabel,
|
||
voice: voice,
|
||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||
asrChoice: asrChoice)
|
||
case .styleTts2:
|
||
try await runStyleTts2(
|
||
phrases: phrases, corpusLabel: corpusLabel,
|
||
voicePath: voice,
|
||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||
asrChoice: asrChoice)
|
||
}
|
||
} catch {
|
||
logger.error("tts-benchmark failed: \(error)")
|
||
exit(1)
|
||
}
|
||
}
|
||
|
||
// MARK: - Kokoro ANE driver
|
||
|
||
private static func runKokoroAne(
|
||
phrases: [(category: String, text: String)],
|
||
corpusLabel: String,
|
||
voice: String,
|
||
preset: TtsComputeUnitPreset,
|
||
outputJson: String?,
|
||
audioDir: String?,
|
||
asrChoice: AsrChoice
|
||
) async throws {
|
||
let units = KokoroAneComputeUnits(preset: preset)
|
||
let manager = KokoroAneManager(defaultVoice: voice, computeUnits: units)
|
||
|
||
let coldStart = Date()
|
||
try await manager.initialize()
|
||
let coldStartS = Date().timeIntervalSince(coldStart)
|
||
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
|
||
|
||
let firstStart = Date()
|
||
_ = try await manager.synthesizeDetailed(
|
||
text: "Initialization warm-up.", voice: voice, speed: 1.0)
|
||
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
|
||
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
|
||
|
||
try await runPhraseLoop(
|
||
backendId: "kokoro-ane",
|
||
voiceLabel: voice,
|
||
corpusLabel: corpusLabel,
|
||
phrases: phrases,
|
||
preset: preset,
|
||
coldStartS: coldStartS,
|
||
firstSynthMs: firstSynthMs,
|
||
outputJson: outputJson,
|
||
audioDir: audioDir,
|
||
asrChoice: asrChoice,
|
||
extraSummary: ["voice": voice]
|
||
) { text in
|
||
let t0 = Date()
|
||
let result = try await manager.synthesizeDetailed(
|
||
text: text, voice: voice, speed: 1.0)
|
||
let synthMs = Date().timeIntervalSince(t0) * 1000
|
||
return BackendPhraseSample(
|
||
synthMs: synthMs,
|
||
ttftMs: synthMs,
|
||
samples: result.samples,
|
||
sampleRate: result.sampleRate,
|
||
stageMs: [
|
||
"albert": result.timings.albert,
|
||
"post_albert": result.timings.postAlbert,
|
||
"alignment": result.timings.alignment,
|
||
"prosody": result.timings.prosody,
|
||
"noise": result.timings.noise,
|
||
"vocoder": result.timings.vocoder,
|
||
"tail": result.timings.tail,
|
||
"total": result.timings.totalMs,
|
||
],
|
||
extraFields: [
|
||
"encoder_tokens": result.encoderTokens,
|
||
"acoustic_frames": result.acousticFrames,
|
||
]
|
||
)
|
||
}
|
||
}
|
||
|
||
// MARK: - Kokoro driver (single-graph)
|
||
|
||
private static func runKokoro(
|
||
phrases: [(category: String, text: String)],
|
||
corpusLabel: String,
|
||
voice: String,
|
||
preset: TtsComputeUnitPreset,
|
||
outputJson: String?,
|
||
audioDir: String?,
|
||
asrChoice: AsrChoice
|
||
) async throws {
|
||
let units = preset.uniformUnits ?? .all
|
||
let manager = KokoroTtsManager(defaultVoice: voice, computeUnits: units)
|
||
|
||
let coldStart = Date()
|
||
try await manager.initialize(preloadVoices: [voice])
|
||
let coldStartS = Date().timeIntervalSince(coldStart)
|
||
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
|
||
|
||
let firstStart = Date()
|
||
_ = try await manager.synthesizeDetailed(text: "Initialization warm-up.", voice: voice)
|
||
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
|
||
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
|
||
|
||
try await runPhraseLoop(
|
||
backendId: "kokoro",
|
||
voiceLabel: voice,
|
||
corpusLabel: corpusLabel,
|
||
phrases: phrases,
|
||
preset: preset,
|
||
coldStartS: coldStartS,
|
||
firstSynthMs: firstSynthMs,
|
||
outputJson: outputJson,
|
||
audioDir: audioDir,
|
||
asrChoice: asrChoice,
|
||
extraSummary: ["voice": voice]
|
||
) { text in
|
||
let t0 = Date()
|
||
let result = try await manager.synthesizeDetailed(text: text, voice: voice)
|
||
let synthMs = Date().timeIntervalSince(t0) * 1000
|
||
let samples = result.chunks.flatMap { $0.samples }
|
||
return BackendPhraseSample(
|
||
synthMs: synthMs,
|
||
ttftMs: synthMs,
|
||
samples: samples,
|
||
sampleRate: 24000,
|
||
stageMs: [:],
|
||
extraFields: [
|
||
"chunk_count": result.chunks.count,
|
||
"wav_bytes": result.audio.count,
|
||
]
|
||
)
|
||
}
|
||
}
|
||
|
||
// MARK: - PocketTTS driver
|
||
|
||
private static func runPocketTts(
|
||
phrases: [(category: String, text: String)],
|
||
corpusLabel: String,
|
||
voice: String,
|
||
languageName: String?,
|
||
preset: TtsComputeUnitPreset,
|
||
outputJson: String?,
|
||
audioDir: String?,
|
||
asrChoice: AsrChoice
|
||
) async throws {
|
||
if preset != .default {
|
||
logger.warning(
|
||
"PocketTTS does not expose per-call compute-unit overrides; --compute-units \(preset.cliValue) ignored."
|
||
)
|
||
}
|
||
let language = parsePocketLanguage(languageName)
|
||
logger.info("PocketTTS language: \(language.rawValue)")
|
||
|
||
let manager = PocketTtsManager(defaultVoice: voice, language: language)
|
||
|
||
let coldStart = Date()
|
||
try await manager.initialize()
|
||
let coldStartS = Date().timeIntervalSince(coldStart)
|
||
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
|
||
|
||
let firstStart = Date()
|
||
var firstFrameMs: Double = 0
|
||
var firstFrameCount = 0
|
||
let warmupStream = try await manager.synthesizeStreaming(
|
||
text: "Initialization warm-up.", voice: voice)
|
||
for try await frame in warmupStream {
|
||
if firstFrameCount == 0 {
|
||
firstFrameMs = Date().timeIntervalSince(firstStart) * 1000
|
||
}
|
||
firstFrameCount += 1
|
||
_ = frame.samples
|
||
}
|
||
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
|
||
logger.info(
|
||
String(
|
||
format: "First synth: %.0f ms total, %.0f ms TTFT (frames=%d)",
|
||
firstSynthMs, firstFrameMs, firstFrameCount))
|
||
|
||
try await runPhraseLoop(
|
||
backendId: "pocket-tts",
|
||
voiceLabel: voice,
|
||
corpusLabel: corpusLabel,
|
||
phrases: phrases,
|
||
preset: preset,
|
||
coldStartS: coldStartS,
|
||
firstSynthMs: firstSynthMs,
|
||
outputJson: outputJson,
|
||
audioDir: audioDir,
|
||
asrChoice: asrChoice,
|
||
extraSummary: ["voice": voice, "language": language.rawValue]
|
||
) { text in
|
||
// PocketTTS is streaming-first: we measure TTFT (time to first
|
||
// audio frame) separately from total synth time so the benchmark
|
||
// numbers reflect what a streaming consumer actually experiences.
|
||
let t0 = Date()
|
||
let stream = try await manager.synthesizeStreaming(text: text, voice: voice)
|
||
var aggregated: [Float] = []
|
||
var ttftMs: Double = 0
|
||
var frameCount = 0
|
||
var lastChunkCount = 0
|
||
for try await frame in stream {
|
||
if frameCount == 0 {
|
||
ttftMs = Date().timeIntervalSince(t0) * 1000
|
||
}
|
||
aggregated.append(contentsOf: frame.samples)
|
||
frameCount += 1
|
||
lastChunkCount = frame.chunkCount
|
||
}
|
||
let synthMs = Date().timeIntervalSince(t0) * 1000
|
||
return BackendPhraseSample(
|
||
synthMs: synthMs,
|
||
ttftMs: ttftMs,
|
||
samples: aggregated,
|
||
sampleRate: PocketTtsConstants.audioSampleRate,
|
||
stageMs: [:],
|
||
extraFields: [
|
||
"frame_count": frameCount,
|
||
"chunk_count": lastChunkCount,
|
||
]
|
||
)
|
||
}
|
||
}
|
||
|
||
// MARK: - Magpie driver
|
||
|
||
private static func runMagpie(
|
||
phrases: [(category: String, text: String)],
|
||
corpusLabel: String,
|
||
speakerName: String?,
|
||
languageName: String?,
|
||
preset: TtsComputeUnitPreset,
|
||
outputJson: String?,
|
||
audioDir: String?,
|
||
asrChoice: AsrChoice
|
||
) async throws {
|
||
let units = preset.uniformUnits ?? .cpuAndNeuralEngine
|
||
let language = parseMagpieLanguage(languageName)
|
||
let speaker = parseMagpieSpeaker(speakerName)
|
||
logger.info("Magpie speaker=\(speaker.displayName) language=\(language.rawValue)")
|
||
|
||
let manager = MagpieTtsManager(
|
||
computeUnits: units, preferredLanguages: [language])
|
||
|
||
let coldStart = Date()
|
||
try await manager.initialize()
|
||
let coldStartS = Date().timeIntervalSince(coldStart)
|
||
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
|
||
|
||
let firstStart = Date()
|
||
_ = try await manager.synthesize(
|
||
text: "Initialization warm-up.", speaker: speaker, language: language)
|
||
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
|
||
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
|
||
|
||
try await runPhraseLoop(
|
||
backendId: "magpie",
|
||
voiceLabel: speaker.displayName,
|
||
corpusLabel: corpusLabel,
|
||
phrases: phrases,
|
||
preset: preset,
|
||
coldStartS: coldStartS,
|
||
firstSynthMs: firstSynthMs,
|
||
outputJson: outputJson,
|
||
audioDir: audioDir,
|
||
asrChoice: asrChoice,
|
||
extraSummary: [
|
||
"speaker": speaker.displayName, "language": language.rawValue,
|
||
]
|
||
) { text in
|
||
// Drive Magpie through `synthesizeStream` so TTFT measures
|
||
// time-to-first-chunk-yield rather than full-utterance wall.
|
||
// The chunker carves a small first chunk
|
||
// (`MagpieChunker.streamingFirstChunkCap` = 50 codec frames ≈
|
||
// 2.3 s of audio) when the first sentence is long enough; for
|
||
// short phrases the stream degrades to one chunk == whole
|
||
// utterance and TTFT == synthMs (no streaming benefit, no
|
||
// measurement penalty).
|
||
//
|
||
// Trade-off vs. the prior `synthesize()` path: per-stage
|
||
// timings (`text_encoder`/`prefill`/`ar_loop`/…) are only
|
||
// surfaced on `MagpieSynthesisResult`, not per
|
||
// `MagpieAudioChunk`, so `stageMs` is empty here. That matches
|
||
// PocketTTS streaming which also publishes empty `stageMs`.
|
||
let t0 = Date()
|
||
let stream = try await manager.synthesizeStream(
|
||
text: text, speaker: speaker, language: language)
|
||
var aggregated: [Float] = []
|
||
var ttftMs: Double = 0
|
||
var chunkCount = 0
|
||
var codeCount = 0
|
||
var finishedOnEos = false
|
||
var sampleRate = MagpieConstants.audioSampleRate
|
||
for try await chunk in stream {
|
||
if chunkCount == 0 {
|
||
ttftMs = Date().timeIntervalSince(t0) * 1000
|
||
}
|
||
aggregated.append(contentsOf: chunk.samples)
|
||
chunkCount += 1
|
||
codeCount += chunk.codeCount
|
||
sampleRate = chunk.sampleRate
|
||
if chunk.isFinal {
|
||
finishedOnEos = chunk.finishedOnEos
|
||
}
|
||
}
|
||
let synthMs = Date().timeIntervalSince(t0) * 1000
|
||
// Empty-stream guard (synthesizeStream returns immediately on
|
||
// zero-length input). Fall back to synthMs so downstream
|
||
// percentile math doesn't see ttftMs == 0.
|
||
if chunkCount == 0 { ttftMs = synthMs }
|
||
return BackendPhraseSample(
|
||
synthMs: synthMs,
|
||
ttftMs: ttftMs,
|
||
samples: aggregated,
|
||
sampleRate: sampleRate,
|
||
stageMs: [:],
|
||
extraFields: [
|
||
"code_count": codeCount,
|
||
"finished_on_eos": finishedOnEos,
|
||
"chunk_count": chunkCount,
|
||
]
|
||
)
|
||
}
|
||
}
|
||
|
||
// MARK: - CosyVoice3 driver
|
||
|
||
private static func runCosyVoice3(
|
||
phrases: [(category: String, text: String)],
|
||
corpusLabel: String,
|
||
voice: String?,
|
||
preset: TtsComputeUnitPreset,
|
||
outputJson: String?,
|
||
audioDir: String?,
|
||
asrChoice: AsrChoice
|
||
) async throws {
|
||
let units = preset.uniformUnits ?? .cpuAndNeuralEngine
|
||
let voiceId = voice ?? "cosyvoice3-default-zh"
|
||
|
||
let coldStart = Date()
|
||
let manager = try await CosyVoice3TtsManager.downloadAndCreate(
|
||
cacheDirectory: nil, includeDefaultVoice: true, computeUnits: units)
|
||
try await manager.initialize()
|
||
let promptAssets = try await manager.loadVoice(voiceId)
|
||
let coldStartS = Date().timeIntervalSince(coldStart)
|
||
logger.info(String(format: "Cold start (download+init+voice): %.2fs", coldStartS))
|
||
|
||
let firstStart = Date()
|
||
_ = try await manager.synthesize(text: "你好", promptAssets: promptAssets)
|
||
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
|
||
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
|
||
|
||
try await runPhraseLoop(
|
||
backendId: "cosyvoice3",
|
||
voiceLabel: voiceId,
|
||
corpusLabel: corpusLabel,
|
||
phrases: phrases,
|
||
preset: preset,
|
||
coldStartS: coldStartS,
|
||
firstSynthMs: firstSynthMs,
|
||
outputJson: outputJson,
|
||
audioDir: audioDir,
|
||
asrChoice: asrChoice,
|
||
extraSummary: ["voice": voiceId]
|
||
) { text in
|
||
let t0 = Date()
|
||
let result = try await manager.synthesize(text: text, promptAssets: promptAssets)
|
||
let synthMs = Date().timeIntervalSince(t0) * 1000
|
||
return BackendPhraseSample(
|
||
synthMs: synthMs,
|
||
ttftMs: synthMs,
|
||
samples: result.samples,
|
||
sampleRate: result.sampleRate,
|
||
stageMs: [:],
|
||
extraFields: [
|
||
"generated_token_count": result.generatedTokenCount,
|
||
"decoded_token_count": result.decodedTokens.count,
|
||
// Surface the structural 250-token Flow-input cap as a
|
||
// per-phrase boolean so corpus reports can tally how many
|
||
// long phrases hit silent truncation.
|
||
"finished_on_eos": result.finishedOnEos,
|
||
]
|
||
)
|
||
}
|
||
}
|
||
|
||
// MARK: - StyleTTS2 driver
|
||
|
||
private static func runStyleTts2(
|
||
phrases: [(category: String, text: String)],
|
||
corpusLabel: String,
|
||
voicePath: String?,
|
||
preset: TtsComputeUnitPreset,
|
||
outputJson: String?,
|
||
audioDir: String?,
|
||
asrChoice: AsrChoice
|
||
) async throws {
|
||
guard let voicePath, !voicePath.isEmpty else {
|
||
logger.error(
|
||
"StyleTTS2 requires --voice <path/to/ref_s.bin> "
|
||
+ "(256 fp32 LE blob from mobius-styletts2/scripts/06_dump_ref_s.py)")
|
||
exit(1)
|
||
}
|
||
let voiceURL = resolveURL(voicePath, isDirectory: false)
|
||
let voiceLabel = voiceURL.deletingPathExtension().lastPathComponent
|
||
|
||
// StyleTTS2 doesn't expose a compute-units knob today; --compute-units
|
||
// is accepted for parity with other backends but only labels the run.
|
||
let manager = StyleTTS2Manager()
|
||
|
||
let coldStart = Date()
|
||
try await manager.initialize()
|
||
let coldStartS = Date().timeIntervalSince(coldStart)
|
||
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
|
||
|
||
let firstStart = Date()
|
||
_ = try await manager.synthesizeSamples(
|
||
text: "Initialization warm-up.", voiceStyleURL: voiceURL, randomSeed: 42)
|
||
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
|
||
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
|
||
|
||
try await runPhraseLoop(
|
||
backendId: "styletts2",
|
||
voiceLabel: voiceLabel,
|
||
corpusLabel: corpusLabel,
|
||
phrases: phrases,
|
||
preset: preset,
|
||
coldStartS: coldStartS,
|
||
firstSynthMs: firstSynthMs,
|
||
outputJson: outputJson,
|
||
audioDir: audioDir,
|
||
asrChoice: asrChoice,
|
||
extraSummary: ["voice": voiceLabel]
|
||
) { text in
|
||
let t0 = Date()
|
||
let result = try await manager.synthesizeSamples(
|
||
text: text, voiceStyleURL: voiceURL, randomSeed: 42)
|
||
let synthMs = Date().timeIntervalSince(t0) * 1000
|
||
return BackendPhraseSample(
|
||
synthMs: synthMs,
|
||
ttftMs: synthMs,
|
||
samples: result.samples,
|
||
sampleRate: result.sampleRate,
|
||
stageMs: [:],
|
||
extraFields: [:]
|
||
)
|
||
}
|
||
}
|
||
|
||
// MARK: - Shared per-phrase loop + summary
|
||
|
||
private static func runPhraseLoop(
|
||
backendId: String,
|
||
voiceLabel: String,
|
||
corpusLabel: String,
|
||
phrases: [(category: String, text: String)],
|
||
preset: TtsComputeUnitPreset,
|
||
coldStartS: Double,
|
||
firstSynthMs: Double,
|
||
outputJson: String?,
|
||
audioDir: String?,
|
||
asrChoice: AsrChoice,
|
||
extraSummary: [String: Any],
|
||
synthOne: (String) async throws -> BackendPhraseSample
|
||
) async throws {
|
||
// Optional output dir for WAVs.
|
||
var audioDirURL: URL? = nil
|
||
if let audioDir {
|
||
let url = resolveURL(audioDir, isDirectory: true)
|
||
try FileManager.default.createDirectory(
|
||
at: url, withIntermediateDirectories: true)
|
||
audioDirURL = url
|
||
}
|
||
|
||
// Build optional ASR backend (Parakeet, Cohere, or none).
|
||
let asrLoop = try await buildAsrLoop(asrChoice)
|
||
|
||
var perPhrase: [[String: Any]] = []
|
||
var byCategory: [String: [Int]] = [:]
|
||
|
||
for (idx, item) in phrases.enumerated() {
|
||
let label = String(format: "[%02d/%02d]", idx + 1, phrases.count)
|
||
logger.info("\(label) [\(item.category)] \(item.text)")
|
||
|
||
let sample = try await synthOne(item.text)
|
||
let audioMs =
|
||
Double(sample.samples.count) / Double(sample.sampleRate) * 1000
|
||
let rtfx = sample.synthMs > 0 ? audioMs / sample.synthMs : 0
|
||
|
||
// Persist WAV (audioDir if set, else temp file for ASR).
|
||
let wavURL: URL
|
||
if let audioDirURL {
|
||
wavURL = audioDirURL.appendingPathComponent(
|
||
String(format: "phrase_%03d.wav", idx + 1))
|
||
} else {
|
||
wavURL = FileManager.default.temporaryDirectory
|
||
.appendingPathComponent("tts-benchmark-\(UUID().uuidString).wav")
|
||
}
|
||
let wavData = try AudioWAV.data(
|
||
from: sample.samples, sampleRate: Double(sample.sampleRate))
|
||
try wavData.write(to: wavURL)
|
||
|
||
var werValue = Double.nan
|
||
var cerValue = Double.nan
|
||
var hypothesis = ""
|
||
var asrMs = 0.0
|
||
if let asrLoop {
|
||
let asr0 = Date()
|
||
hypothesis = try await asrLoop.transcribeOne(wavURL)
|
||
asrMs = Date().timeIntervalSince(asr0) * 1000
|
||
let m = WERCalculator.calculateWERAndCER(
|
||
hypothesis: hypothesis, reference: item.text)
|
||
werValue = m.wer
|
||
cerValue = m.cer
|
||
}
|
||
|
||
if audioDirURL == nil {
|
||
try? FileManager.default.removeItem(at: wavURL)
|
||
}
|
||
|
||
logger.info(
|
||
String(
|
||
format:
|
||
" ttft=%.0f ms synth=%.0f ms audio=%.0f ms rtfx=%.2fx wer=%.1f%% cer=%.1f%%",
|
||
sample.ttftMs, sample.synthMs, audioMs, rtfx,
|
||
werValue.isNaN ? 0 : werValue * 100,
|
||
cerValue.isNaN ? 0 : cerValue * 100))
|
||
|
||
byCategory[item.category, default: []].append(perPhrase.count)
|
||
var phraseDict: [String: Any] = [
|
||
"index": idx + 1,
|
||
"category": item.category,
|
||
"reference": item.text,
|
||
"hypothesis": hypothesis,
|
||
"ttft_ms": sample.ttftMs,
|
||
"synth_ms": sample.synthMs,
|
||
"audio_ms": audioMs,
|
||
"rtfx": rtfx,
|
||
"wer": werValue.isNaN ? NSNull() : werValue as Any,
|
||
"cer": cerValue.isNaN ? NSNull() : cerValue as Any,
|
||
"asr_ms": asrMs,
|
||
"stage_ms": sample.stageMs,
|
||
"wav_path": audioDirURL == nil ? "" : wavURL.path,
|
||
]
|
||
for (k, v) in sample.extraFields {
|
||
phraseDict[k] = v
|
||
}
|
||
perPhrase.append(phraseDict)
|
||
}
|
||
|
||
if let asrLoop {
|
||
await asrLoop.cleanup()
|
||
}
|
||
|
||
// Aggregate.
|
||
let totalSynthMs = perPhrase.reduce(0.0) { $0 + ($1["synth_ms"] as? Double ?? 0) }
|
||
let totalAudioMs = perPhrase.reduce(0.0) { $0 + ($1["audio_ms"] as? Double ?? 0) }
|
||
let aggRtfx = totalSynthMs > 0 ? totalAudioMs / totalSynthMs : 0
|
||
|
||
let synthMsValues = perPhrase.compactMap { $0["synth_ms"] as? Double }.sorted()
|
||
let p50 = percentile(synthMsValues, 0.5)
|
||
let p95 = percentile(synthMsValues, 0.95)
|
||
let ttftValues = perPhrase.compactMap { $0["ttft_ms"] as? Double }.sorted()
|
||
let ttftP50 = percentile(ttftValues, 0.5)
|
||
let ttftP95 = percentile(ttftValues, 0.95)
|
||
|
||
var categories: [[String: Any]] = []
|
||
for (cat, indexes) in byCategory.sorted(by: { $0.key < $1.key }) {
|
||
let werVals = indexes.compactMap { perPhrase[$0]["wer"] as? Double }
|
||
let cerVals = indexes.compactMap { perPhrase[$0]["cer"] as? Double }
|
||
let synthVals = indexes.compactMap { perPhrase[$0]["synth_ms"] as? Double }
|
||
let audioVals = indexes.compactMap { perPhrase[$0]["audio_ms"] as? Double }
|
||
let synthSum = synthVals.reduce(0, +)
|
||
let audioSum = audioVals.reduce(0, +)
|
||
let macroWer =
|
||
werVals.isEmpty ? Double.nan : werVals.reduce(0, +) / Double(werVals.count)
|
||
let macroCer =
|
||
cerVals.isEmpty ? Double.nan : cerVals.reduce(0, +) / Double(cerVals.count)
|
||
categories.append([
|
||
"category": cat,
|
||
"phrase_count": indexes.count,
|
||
"macro_wer": macroWer.isNaN ? NSNull() : macroWer as Any,
|
||
"macro_cer": macroCer.isNaN ? NSNull() : macroCer as Any,
|
||
"synth_ms_p50": percentile(synthVals.sorted(), 0.5),
|
||
"synth_ms_p95": percentile(synthVals.sorted(), 0.95),
|
||
"rtfx": synthSum > 0 ? audioSum / synthSum : 0,
|
||
])
|
||
}
|
||
|
||
let peakRssMb =
|
||
Double(FluidAudioCLI.fetchPeakMemoryUsageBytes() ?? 0) / 1024 / 1024
|
||
|
||
// Banner.
|
||
logger.info("--- Summary ---")
|
||
logger.info(" backend: \(backendId)")
|
||
logger.info(" voice/speaker: \(voiceLabel)")
|
||
logger.info(" corpus: \(corpusLabel) (n=\(phrases.count))")
|
||
logger.info(" compute units: \(preset.cliValue)")
|
||
logger.info(String(format: " cold start: %.2fs", coldStartS))
|
||
logger.info(String(format: " first synth: %.0f ms", firstSynthMs))
|
||
logger.info(String(format: " TTFT p50/p95: %.0f / %.0f ms", ttftP50, ttftP95))
|
||
logger.info(String(format: " warm synth p50: %.0f ms", p50))
|
||
logger.info(String(format: " warm synth p95: %.0f ms", p95))
|
||
logger.info(String(format: " agg RTFx: %.2fx", aggRtfx))
|
||
logger.info(String(format: " peak RSS: %.0f MB", peakRssMb))
|
||
if !asrChoice.skipped {
|
||
let werVals = perPhrase.compactMap { $0["wer"] as? Double }
|
||
let cerVals = perPhrase.compactMap { $0["cer"] as? Double }
|
||
let macroWer =
|
||
werVals.isEmpty ? 0 : werVals.reduce(0, +) / Double(werVals.count)
|
||
let macroCer =
|
||
cerVals.isEmpty ? 0 : cerVals.reduce(0, +) / Double(cerVals.count)
|
||
logger.info(" ASR backend: \(asrChoice.label)")
|
||
logger.info(String(format: " macro WER: %.2f%%", macroWer * 100))
|
||
logger.info(String(format: " macro CER: %.2f%%", macroCer * 100))
|
||
// Word-level WER is meaningless on whitespace-free scripts (zh, ja).
|
||
// Surface that explicitly so readers don't trust ~100% WER for zh.
|
||
if case .cohere(_, let lang, _) = asrChoice,
|
||
lang == .chinese || lang == .japanese
|
||
{
|
||
logger.info(
|
||
" note: WER is whitespace-tokenized; trust CER for \(lang.rawValue).")
|
||
}
|
||
} else {
|
||
logger.info(" WER/CER: skipped")
|
||
}
|
||
|
||
if let outputJson {
|
||
var summary: [String: Any] = [
|
||
"backend": backendId,
|
||
"corpus": corpusLabel,
|
||
"phrase_count": phrases.count,
|
||
"compute_units": preset.cliValue,
|
||
"cold_start_s": coldStartS,
|
||
"first_synth_ms": firstSynthMs,
|
||
"ttft_ms_p50": ttftP50,
|
||
"ttft_ms_p95": ttftP95,
|
||
"warm_synth_ms_p50": p50,
|
||
"warm_synth_ms_p95": p95,
|
||
"agg_rtfx": aggRtfx,
|
||
"peak_rss_mb": peakRssMb,
|
||
"asr_skipped": asrChoice.skipped,
|
||
"asr_backend": asrChoice.label,
|
||
]
|
||
for (k, v) in extraSummary {
|
||
summary[k] = v
|
||
}
|
||
let report: [String: Any] = [
|
||
"summary": summary,
|
||
"categories": categories,
|
||
"phrases": perPhrase,
|
||
]
|
||
let url = resolveURL(outputJson, isDirectory: false)
|
||
try FileManager.default.createDirectory(
|
||
at: url.deletingLastPathComponent(),
|
||
withIntermediateDirectories: true)
|
||
let data = try JSONSerialization.data(
|
||
withJSONObject: report, options: [.prettyPrinted, .sortedKeys])
|
||
try data.write(to: url)
|
||
logger.info("Report written: \(url.path)")
|
||
}
|
||
}
|
||
|
||
// MARK: - Corpus loading
|
||
|
||
private static func loadShippedCorpus(
|
||
_ name: String
|
||
) throws -> [(category: String, text: String)] {
|
||
let cwd = URL(
|
||
fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
|
||
let relativePath = corpusRelativePath(for: name)
|
||
let url = cwd.appendingPathComponent(relativePath, isDirectory: false)
|
||
let raw = try String(contentsOf: url, encoding: .utf8)
|
||
return parseCorpus(raw, category: name)
|
||
}
|
||
|
||
/// Map a `--corpus` name to its on-disk relative path.
|
||
///
|
||
/// All shipped corpora are MiniMax Multilingual TTS Test Set
|
||
/// languages — `minimax-<lang>` resolves to
|
||
/// `Benchmarks/tts/corpus/minimax/<lang>.txt`. The CC-BY-SA-4.0
|
||
/// attribution lives next to the data in `minimax/README.md`.
|
||
/// Pass `--corpus-path` for ad-hoc files outside the shipped set.
|
||
private static func corpusRelativePath(for name: String) -> String {
|
||
let prefix = "minimax-"
|
||
if name.hasPrefix(prefix) {
|
||
let lang = String(name.dropFirst(prefix.count))
|
||
return "Benchmarks/tts/corpus/minimax/\(lang).txt"
|
||
}
|
||
// Back-compat shim — anything else is assumed to live next to
|
||
// the minimax subdirectory. Prefer `--corpus-path` for non-shipped
|
||
// corpora.
|
||
return "Benchmarks/tts/corpus/\(name).txt"
|
||
}
|
||
|
||
private static func parseCorpus(
|
||
_ raw: String, category: String
|
||
) -> [(category: String, text: String)] {
|
||
return
|
||
raw
|
||
.split(whereSeparator: \.isNewline)
|
||
.map { $0.trimmingCharacters(in: .whitespaces) }
|
||
.filter { !$0.isEmpty && !$0.hasPrefix("#") }
|
||
.map { (category: category, text: $0) }
|
||
}
|
||
|
||
// MARK: - Backend dispatch
|
||
|
||
private enum Backend: String {
|
||
case kokoroAne
|
||
case kokoro
|
||
case pocketTts
|
||
case magpie
|
||
case cosyVoice3
|
||
case styleTts2
|
||
|
||
var defaultCorpus: String {
|
||
switch self {
|
||
case .cosyVoice3: return "minimax-chinese"
|
||
default: return "minimax-english"
|
||
}
|
||
}
|
||
}
|
||
|
||
private static func parseBackend(_ name: String) -> Backend {
|
||
switch name.lowercased() {
|
||
case "kokoro-ane", "kokoroane", "kokoro_ane", "lai":
|
||
return .kokoroAne
|
||
case "kokoro":
|
||
return .kokoro
|
||
case "pocket-tts", "pockettts", "pocket":
|
||
return .pocketTts
|
||
case "magpie":
|
||
return .magpie
|
||
case "cosyvoice3", "cosyvoice", "cosy":
|
||
return .cosyVoice3
|
||
case "styletts2", "style-tts2", "styletts", "style":
|
||
return .styleTts2
|
||
default:
|
||
logger.warning("Unknown backend '\(name)' — defaulting to kokoro-ane")
|
||
return .kokoroAne
|
||
}
|
||
}
|
||
|
||
private static func parsePocketLanguage(_ name: String?) -> PocketTtsLanguage {
|
||
guard let name, let l = PocketTtsLanguage(rawValue: name.lowercased()) else {
|
||
return .english
|
||
}
|
||
return l
|
||
}
|
||
|
||
private static func parseMagpieLanguage(_ name: String?) -> MagpieLanguage {
|
||
guard let name, let l = MagpieLanguage(rawValue: name.lowercased()) else {
|
||
return .english
|
||
}
|
||
return l
|
||
}
|
||
|
||
private static func parseMagpieSpeaker(_ name: String?) -> MagpieSpeaker {
|
||
switch name?.lowercased() {
|
||
case "sofia": return .sofia
|
||
case "aria": return .aria
|
||
case "jason": return .jason
|
||
case "leo": return .leo
|
||
case "john", nil, "": return .john
|
||
default: return .john
|
||
}
|
||
}
|
||
|
||
// MARK: - Helpers
|
||
|
||
private static func percentile(_ sorted: [Double], _ p: Double) -> Double {
|
||
guard !sorted.isEmpty else { return 0 }
|
||
let idx = Int((Double(sorted.count - 1) * p).rounded())
|
||
return sorted[max(0, min(sorted.count - 1, idx))]
|
||
}
|
||
|
||
private static func resolveURL(_ path: String, isDirectory: Bool) -> URL {
|
||
let expanded = (path as NSString).expandingTildeInPath
|
||
if expanded.hasPrefix("/") {
|
||
return URL(fileURLWithPath: expanded, isDirectory: isDirectory)
|
||
}
|
||
let cwd = URL(
|
||
fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
|
||
return cwd.appendingPathComponent(expanded, isDirectory: isDirectory)
|
||
}
|
||
|
||
// MARK: - ASR backend resolution & adapter construction
|
||
|
||
/// Map CLI flags + TTS backend defaults to a concrete `AsrChoice`.
|
||
///
|
||
/// Precedence: `--skip-asr` and `--asr-backend none` always win. With
|
||
/// no flag, English-friendly TTS backends default to Parakeet TDT and
|
||
/// CosyVoice3 defaults to `.skip` (Parakeet is English-only — its WER
|
||
/// on Mandarin output reads ~100% and is meaningless).
|
||
private static func resolveAsrChoice(
|
||
skipAsrFlag: Bool,
|
||
backendName: String?,
|
||
cohereModelDir: String?,
|
||
asrLanguage: String?,
|
||
cohereComputeUnits: String?,
|
||
corpusLabel: String,
|
||
ttsBackend: Backend
|
||
) throws -> AsrChoice {
|
||
let normalized = backendName?.lowercased()
|
||
if skipAsrFlag || normalized == "none" {
|
||
return .skip
|
||
}
|
||
switch normalized {
|
||
case "cohere":
|
||
let dir = try resolveCohereModelDir(cohereModelDir)
|
||
let language = inferCohereLanguage(
|
||
explicit: asrLanguage, corpus: corpusLabel)
|
||
let units = try resolveCohereComputeUnits(cohereComputeUnits)
|
||
return .cohere(modelDir: dir, language: language, computeUnits: units)
|
||
case "parakeet":
|
||
return .parakeet
|
||
case nil:
|
||
// Implicit defaults: skip for CosyVoice3 (no English ASR pairing),
|
||
// Parakeet otherwise.
|
||
if ttsBackend == .cosyVoice3 {
|
||
logger.info(
|
||
"CosyVoice3: no --asr-backend selected; skipping ASR. "
|
||
+ "Pass `--asr-backend cohere --cohere-model-dir <dir>` for CER.")
|
||
return .skip
|
||
}
|
||
return .parakeet
|
||
default:
|
||
logger.warning(
|
||
"Unknown --asr-backend value '\(normalized ?? "")', falling back to parakeet.")
|
||
return .parakeet
|
||
}
|
||
}
|
||
|
||
/// Resolve a Cohere Transcribe model directory (must contain
|
||
/// `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
|
||
/// and `vocab.json`).
|
||
///
|
||
/// Order of resolution:
|
||
/// 1. Explicit `--cohere-model-dir <path>`.
|
||
/// 2. The default cache location at
|
||
/// `~/Library/Application Support/FluidAudio/Models/cohere-transcribe/q8`,
|
||
/// matching `Repo.cohereTranscribeCoreml.folderName`.
|
||
///
|
||
/// Auto-download is intentionally not wired here: the upstream
|
||
/// `Repo.cohereTranscribeCoreml` registration ships `vocab.json` in
|
||
/// `requiredModels`, but the file lives at the repo root rather than
|
||
/// under the `q8/` subPath, so `DownloadUtils.downloadRepo` would fail
|
||
/// the post-download verify. Fix this when the registry learns about
|
||
/// repo-root files; until then, callers must pre-populate the cache
|
||
/// (e.g. via `fluidaudio cohere-transcribe ... --model-dir <dir>`).
|
||
private static func resolveCohereModelDir(_ override: String?) throws -> URL {
|
||
if let override {
|
||
return resolveURL(override, isDirectory: true)
|
||
}
|
||
let appSupport = try FileManager.default.url(
|
||
for: .applicationSupportDirectory,
|
||
in: .userDomainMask, appropriateFor: nil, create: true)
|
||
let target =
|
||
appSupport
|
||
.appendingPathComponent("FluidAudio/Models/cohere-transcribe/q8")
|
||
let needed = [
|
||
ModelNames.CohereTranscribe.encoderCompiledFile,
|
||
ModelNames.CohereTranscribe.decoderCacheExternalV2CompiledFile,
|
||
"vocab.json",
|
||
]
|
||
let missing = needed.filter { name in
|
||
!FileManager.default.fileExists(
|
||
atPath: target.appendingPathComponent(name).path)
|
||
}
|
||
guard missing.isEmpty else {
|
||
throw NSError(
|
||
domain: "TtsBenchmark", code: 1,
|
||
userInfo: [
|
||
NSLocalizedDescriptionKey:
|
||
"Cohere model dir incomplete at \(target.path). "
|
||
+ "Missing: \(missing.joined(separator: ", ")). "
|
||
+ "Pass --cohere-model-dir <dir> with the required files, or "
|
||
+ "pre-populate the cache via `fluidaudio cohere-transcribe`."
|
||
])
|
||
}
|
||
return target
|
||
}
|
||
|
||
/// Pick a `CohereAsrConfig.Language` from an explicit flag value or by
|
||
/// scanning the corpus label (covers the shipped `minimax-<lang>` set).
|
||
private static func inferCohereLanguage(
|
||
explicit: String?, corpus: String
|
||
) -> CohereAsrConfig.Language {
|
||
if let explicit,
|
||
let lang = CohereAsrConfig.Language(rawValue: explicit.lowercased())
|
||
{
|
||
return lang
|
||
}
|
||
let lower = corpus.lowercased()
|
||
if lower.contains("chinese") || lower.contains("mandarin") || lower.hasSuffix("-zh") {
|
||
return .chinese
|
||
}
|
||
if lower.contains("japanese") || lower.contains("-ja") { return .japanese }
|
||
if lower.contains("korean") || lower.contains("-ko") { return .korean }
|
||
if lower.contains("vietnamese") || lower.contains("-vi") { return .vietnamese }
|
||
if lower.contains("french") || lower.contains("-fr") { return .french }
|
||
if lower.contains("german") || lower.contains("-de") { return .german }
|
||
if lower.contains("spanish") || lower.contains("-es") { return .spanish }
|
||
if lower.contains("italian") || lower.contains("-it") { return .italian }
|
||
if lower.contains("portuguese") || lower.contains("-pt") { return .portuguese }
|
||
if lower.contains("dutch") || lower.contains("-nl") { return .dutch }
|
||
if lower.contains("polish") || lower.contains("-pl") { return .polish }
|
||
if lower.contains("greek") || lower.contains("-el") { return .greek }
|
||
if lower.contains("arabic") || lower.contains("-ar") { return .arabic }
|
||
return .english
|
||
}
|
||
|
||
/// Parse `--cohere-compute-units` into `MLComputeUnits`. Defaults to
|
||
/// `.all` (CoreML decides). Use `cpu-and-gpu` to skip the ANE compile
|
||
/// attempt when the q8 encoder fails ANE compilation (observed:
|
||
/// `MILCompilerForANE error: failed to compile ANE model using ANEF`,
|
||
/// CoreML falls back to CPU+GPU but pays a multi-minute compile cost
|
||
/// on the first call).
|
||
private static func resolveCohereComputeUnits(
|
||
_ flag: String?
|
||
) throws
|
||
-> MLComputeUnits
|
||
{
|
||
guard let raw = flag?.lowercased(), !raw.isEmpty else { return .all }
|
||
switch raw {
|
||
case "all", "default": return .all
|
||
case "all-ane", "ane", "neural-engine", "cpu-and-ane":
|
||
return .cpuAndNeuralEngine
|
||
case "cpu-and-gpu", "cpuandgpu", "gpu": return .cpuAndGPU
|
||
case "cpu-only", "cpu", "cpuonly": return .cpuOnly
|
||
default:
|
||
throw NSError(
|
||
domain: "TtsBenchmark", code: 3,
|
||
userInfo: [
|
||
NSLocalizedDescriptionKey:
|
||
"Unknown --cohere-compute-units value '\(raw)'. "
|
||
+ "Expected: all | cpu-and-gpu | cpu-only | all-ane."
|
||
])
|
||
}
|
||
}
|
||
|
||
/// Human-readable label for log lines.
|
||
private static func describeComputeUnits(_ cu: MLComputeUnits) -> String {
|
||
switch cu {
|
||
case .all: return "all (CPU+GPU+ANE)"
|
||
case .cpuAndNeuralEngine: return "cpu-and-ane"
|
||
case .cpuAndGPU: return "cpu-and-gpu"
|
||
case .cpuOnly: return "cpu-only"
|
||
@unknown default: return "unknown"
|
||
}
|
||
}
|
||
|
||
/// Build the per-phrase ASR adapter for a resolved choice. Returns
|
||
/// `nil` for `.skip` so the loop can short-circuit.
|
||
private static func buildAsrLoop(_ choice: AsrChoice) async throws -> AsrLoop? {
|
||
switch choice {
|
||
case .skip:
|
||
return nil
|
||
case .parakeet:
|
||
let asrModels = try await AsrModels.downloadAndLoad()
|
||
let asr = AsrManager()
|
||
try await asr.loadModels(asrModels)
|
||
let layers = await asr.decoderLayerCount
|
||
return AsrLoop(
|
||
label: "parakeet-tdt",
|
||
transcribeOne: { url in
|
||
var state = TdtDecoderState.make(decoderLayers: layers)
|
||
let r = try await asr.transcribe(url, decoderState: &state)
|
||
return r.text
|
||
},
|
||
cleanup: { await asr.cleanup() }
|
||
)
|
||
case .cohere(let modelDir, let language, let computeUnits):
|
||
guard #available(macOS 14, iOS 17, *) else {
|
||
throw NSError(
|
||
domain: "TtsBenchmark", code: 2,
|
||
userInfo: [
|
||
NSLocalizedDescriptionKey:
|
||
"Cohere ASR backend requires macOS 14+ / iOS 17+."
|
||
])
|
||
}
|
||
logger.info(
|
||
"Loading Cohere Transcribe (lang=\(language.englishName), "
|
||
+ "compute=\(describeComputeUnits(computeUnits))) from \(modelDir.path)")
|
||
let models = try await CoherePipeline.loadModels(
|
||
encoderDir: modelDir,
|
||
decoderDir: modelDir,
|
||
vocabDir: modelDir,
|
||
decoderVariant: .v2,
|
||
computeUnits: computeUnits)
|
||
let pipeline = CoherePipeline()
|
||
let converter = AudioConverter()
|
||
return AsrLoop(
|
||
label: "cohere-transcribe-\(language.rawValue)",
|
||
transcribeOne: { url in
|
||
let samples = try converter.resampleAudioFile(path: url.path)
|
||
let r = try await pipeline.transcribe(
|
||
audio: samples,
|
||
models: models,
|
||
language: language,
|
||
maxNewTokens: 108,
|
||
repetitionPenalty: 1.1,
|
||
noRepeatNgram: 3)
|
||
return r.text
|
||
},
|
||
cleanup: {}
|
||
)
|
||
}
|
||
}
|
||
|
||
private static func printUsage() {
|
||
logger.info(
|
||
"""
|
||
Usage: fluidaudio tts-benchmark [options]
|
||
|
||
Quantitative TTS benchmark — TTFT, cold/warm split, per-stage timings,
|
||
peak RSS, WER + CER per category, configurable compute-unit preset.
|
||
|
||
Backends:
|
||
kokoro-ane 7-stage ANE pipeline (per-stage timings, per-stage CU)
|
||
kokoro Single-graph CPU+GPU
|
||
pocket-tts Streaming flow-matching (multilingual)
|
||
magpie Encoder-decoder + NanoCodec (per-stage, slow)
|
||
cosyvoice3 Mandarin LLM-based (auto-picks Cohere ASR for zh)
|
||
|
||
Options:
|
||
--backend <name> See list above (default: kokoro-ane)
|
||
--corpus <name> MiniMax corpus name: minimax-<lang>
|
||
(e.g. minimax-english, minimax-chinese,
|
||
minimax-vietnamese — 24 languages total;
|
||
see Documentation/TTS/MinimaxCorpus.md)
|
||
--corpus-path <path> Custom corpus file (overrides --corpus)
|
||
--voice <name> Voice id (Kokoro/PocketTTS/CosyVoice3)
|
||
--speaker <name> Magpie speaker: john|sofia|aria|jason|leo
|
||
--language <code> PocketTTS lang pack or Magpie language code
|
||
--compute-units <preset> default | all-ane | cpu-and-gpu | cpu-only
|
||
--output-json <path> Write JSON report
|
||
--audio-dir <path> Keep generated WAVs under this dir
|
||
--skip-asr Skip ASR roundtrip (no WER/CER)
|
||
--asr-backend <name> ASR engine for the WER/CER pass:
|
||
parakeet English-only (default for en)
|
||
cohere Multilingual (default for non-en)
|
||
none Same as --skip-asr
|
||
--cohere-model-dir <path> Path to a directory containing Cohere
|
||
Transcribe encoder/decoder/vocab.json.
|
||
Required when --asr-backend cohere is
|
||
active (auto-download is not wired —
|
||
vocab.json lives at the repo root, not
|
||
under /q8). Default: cache at
|
||
~/Library/Application Support/FluidAudio/
|
||
Models/cohere-transcribe/q8
|
||
--asr-language <code> Override Cohere language code (default:
|
||
inferred from corpus name). One of:
|
||
en, zh, ja, ko, vi, fr, de, es, it, pt,
|
||
nl, pl, el, ar
|
||
--cohere-compute-units <p> Cohere ASR compute mapping:
|
||
all (default; CoreML decides) |
|
||
cpu-and-gpu | cpu-only | all-ane.
|
||
Use cpu-and-gpu when q8 ANE compile
|
||
fails (`MILCompilerForANE error: …`)
|
||
— avoids the multi-minute fallback
|
||
compile on first call.
|
||
--help, -h Show this help
|
||
|
||
Examples:
|
||
fluidaudio tts-benchmark --backend kokoro-ane --output-json bench.json
|
||
fluidaudio tts-benchmark --backend kokoro --corpus minimax-english
|
||
fluidaudio tts-benchmark --backend pocket-tts --corpus minimax-german --language german
|
||
fluidaudio tts-benchmark --backend magpie --speaker sofia --language en
|
||
fluidaudio tts-benchmark --backend cosyvoice3 --corpus minimax-chinese \\
|
||
--asr-backend cohere --cohere-model-dir ~/.fluidaudio/cohere/q8
|
||
|
||
Notes:
|
||
For Chinese (zh) and Japanese (ja), WER is meaningless because
|
||
WERCalculator splits on whitespace; trust the CER column instead.
|
||
The summary banner prints an explicit reminder for these langs.
|
||
"""
|
||
)
|
||
}
|
||
}
|
||
#endif
|