## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring **latency × efficiency × quality** across every shipping TTS backend in FluidAudio, plus the model + runtime fixes needed to actually clear all six backends end-to-end on the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set). Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs level so users get a runtime warning on `initialize()` reflecting their actual perf / quality posture.

### Backends: all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER | Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) | ✅ | 3.5 s / 8.0 s / 11.4 s | 5.19× | 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) | ✅ | 3.5 s / 6.8 s / 9.3 s | 2.02× | 1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip |
| PocketTTS | minimax-en (100/100) | ✅ | 2.8 s / 6.3 s / 9.4 s | 0.61× | 1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms; RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s | 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6 s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s / **16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine); longest output 16.0 s (was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); **whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100 phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s / **16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 → 5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a `logger.warning` flagging the perf / quality posture; safe to ship in non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator` whitespace-tokenizes and Mandarin has no word boundaries (word-level WER reads ~100% and is meaningless). CER is `whisper-large-v3` against the rendered WAVs from the full 100-phrase `minimax-chinese` run via `Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this PR via `--asr-backend cohere` (see [Cohere ASR backend in the harness](#cohere-asr-backend-in-the-harness) below) and agrees with whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a `MILCompilerForANE` cache failure on this M2 host that drops it to RTFx ~0.13×, so whisper is the practical source-of-truth for the full 100-phrase run.

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category) live in `Documentation/TTS/Benchmarks.md`. Corpus attribution + reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat: phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly comparable between backends when both produce similar phrase lengths and yield audio at the same granularity**. Two things skew the headline number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per `minimax-english` phrase while Kokoro emits ~7 s: same input text, ~3× more audio out.
That's mostly long inter-word pauses + slow speaking rate baked into the LibriTTS multi-speaker checkpoint, not a measurement artifact. A 2.72× RTFx on 22 s audio = ~8 s wall, which matches the TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus RTFx ratios alone hide this.

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs. Kokoro's 2.02× but it's **not slower from a user perspective**: PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**, Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The 0.61× is the per-frame cost averaged across the streaming run; what users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |
|---|---|---|---|
| PocketTTS | 1244 ms | 80 ms frame | true streaming; conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio; ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte |
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s, 36% earlier playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz | one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only |

For conversational use cases, **TTFT matters more than RTFx**. PocketTTS (true streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE (small one-shot chunks) are the three backends that meaningfully clear the "user feels it's responsive" bar today.

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should weigh against an explicit caveat:

- **StyleTTS2**: WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
The misaki→espeak post-pass remap closed half the gap; the remainder is BART G2P misses + diffusion-sampler formant breaks on long phrases. - **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via `synthesizeStream` so TTFT (9.6 s p50) is significantly better than full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to ~30 s. - **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token Flow input cap is now worked around at the call site by the auto-chunker (long phrases split + crossfaded), dropping cantonese truncation from 80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100 residual is the long-tail token-rate worst case; the structural fix is re-exporting Flow with a larger fixed input shape (tracked in `mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` + a `.warning`-level `LLM-Decode budget exhausted` log still surface any truncation, and the harness writes `finished_on_eos` into each phrase in the JSON report. Each manager now logs a `.warning`-level beta notice on `initialize()` (mirroring the existing CosyVoice3 pattern) so anyone wiring these into a product gets a console signal, not a silent surprise. Docs (`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md` StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same caveat at the top. ### Model + runtime fixes landed in this PR #### CosyVoice3 stateless port (`71130c9fb`) Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain `MLDictionaryFeatureProvider` prediction with explicit kv carry-forward; lowers the availability gate from macOS 15 / iOS 18 back to the package baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the rename. 
#### CosyVoice3 HiFT timeout fix (`267766b62`) `minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`, which let the planner place most of the graph on ANE but kept at least one op on the BNNS CPU async-dispatch path; long phrases tripped the BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of user-supplied compute-units, removing the BNNS path entirely. Verified on 100/100 zh + 100/100 yue. #### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`) The autoregressive decode loop runs ~163 steps per phrase to fill the 250-token cap. Each step takes the previous step's KV cache as `kv_k` / `kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh `kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side `MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV back-buffers + a logits backing, rotates front/back/spare across steps via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc on first rejection (one-shot `logger.warning`). Mirrors the Magpie pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357 (+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470 MB. #### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`) The 250-token Flow input cap means a single synth pass produces at most ~6.5 s of audio regardless of input length. Re-exporting Flow with a larger fixed input shape is gated on upstream conversion work, so this PR works around it at the call site: long inputs are split at sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized independently, and merged with an 8 ms equal-power cosine crossfade. **Splitter policy**: hard enders (`. ! ? 。 ! ? 
\n`) commit always; soft enders (`, 、 ; : ; ,` + ASCII space) commit only at-or-past budget; force-split at +30 token overshoot if no natural boundary exists. `defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap minus a typical 60–90-token speech-prompt context). Token-rate heuristic is calibrated against minimax-zh + minimax-yue runs: | Char class | Tokens / char | Rationale | |------------|---------------|--------------------------------------------------------------| | CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char | | ASCII | 1.5 | matches BPE rate on English text | | Other | 2.5 | conservative for accented Latin / non-CJK Unicode | **Validation** on full `minimax-cantonese` (100 phrases, M2): | Metric | Pre-chunker | Post-chunker | Δ | |-------------------------------------------|-------------|--------------|------------| | `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% | | Longest audio output | 6.5 s | **16.1 s** | +148% | | agg-RTFx | 0.245× | 0.249× | +1.6% | | TTFT p50 | 23.9 s | 35.7 s | +49% | The TTFT regression is the cost of running multiple synth passes per long phrase — splitting unblocks long-form output at the price of wall-clock latency. The 5/100 residual truncation is the long-tail token-rate worst case (some chars hit ~9 tokens/char); raising the per-CJK heuristic further would over-fragment short phrases. Cleaner fix is the Flow re-export. 16-test suite covers tokenization estimates, hard/soft/force-split policy, and the crossfade arithmetic. Lives in `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift` + `CosyVoice3TtsManager.concatWithCrossfade`. #### Magpie streaming TTFT wire-up (`ace0bf485`) `TtsBenchmarkCommand.swift` now drives Magpie through `MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first `MagpieAudioChunk` emit instead of conflating it with full-synth wall time. 
Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6 s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run benefit; fundamentals unchanged). #### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`) `text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT: tensor_buffer has known strides while the model has FlexibleShapeInfo`. The CoreML runtime rejects two access patterns on outputs from a flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element subscripts — and the original `sliceFirstAxis2D` helper used both. Fix rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling `.float32`, `.float16`, `.double`) and computes the flat index from the known `(1, leading, trailing)` row-major layout. Verified on full 100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS demo voice. #### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`) After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only --corpus` mode and disproved the silent-vocab-drop hypothesis: only **0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII hyphens / 12247 scalars). Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2 share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2 LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized** LibriTTS — predating misaki by years. The 178-vocab accepts both forms (e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic embeddings for the misaki ligature glyphs are essentially untrained noise. 
Side-by-side comparison against locally-installed `espeak-ng -v en-us --ipa -q` flagged four systematic divergences: | misaki | espeak-ng | example | |--------|-----------|--------------------------| | `ʧ` | `tʃ` | choice → `tʃˈɔɪs` | | `ʤ` | `dʒ` | jump → `dʒˈʌmps` | | `ɜɹ` | `ɝ` | girl → `ɡˈɝl` | | `əɹ` | `ɚ` | over → `ˈoʊvɚ` | Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated on `.americanEnglish` and applied to the assembled phoneme string after every word has been emitted by the BART G2P. Lives alongside the existing per-piece misaki diphthong remap. Result on the same 100-phrase MiniMax-English run with the same `libritts_696` voice and same Parakeet TDT roundtrip: | Metric | Pre | Post | Δ | |-----------------|-------|-------|--------| | Macro WER | 0.581 | 0.440 | −24.2% | | Macro CER | 0.476 | 0.241 | −49.5% | | TTFT p50 (ms) | 8937 | 6671 | −25.4% | | Agg RTFx | 2.36× | 2.72× | +15.3% | | Peak RSS (MB) | 1428 | 963 | −32.6% | Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice. Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster on word-level G2P misses from the BART itself (`practical → practicckles`, `separation → expiration`) and diffusion-sampler formant breaks; closing the rest of the gap to Kokoro likely needs richer espeak coverage or libespeak-ng vendor — tracked separately. #### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`) `StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit `logger.warning` beta notices mirroring the existing `CosyVoice3TtsManager.initialize` pattern. Backends docs (`Magpie.md` Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️ Beta / experimental` callouts so the perf / quality posture is visible at every entry point — runtime, manager docstring, doc top, PR body. 
#### Magpie `outputBackings` rejection fallback (`72dae8400` + `9767e1ef9`) The shipped `decoder_step.mlmodelc` reaches the user before the rebuild lands, so CoreML can reject our `outputBackings` dictionary on a name-mismatch. Latched fallback path falls back to a fresh-alloc decode so the model still runs; first rejection latches the flag for the rest of the run. ### Cohere ASR backend in the harness (`8e741e659`) Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER through the harness against [Cohere Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into `--skip-asr`. Four new flags on `tts-benchmark`: - `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip engine. Default is `parakeet` for English-only runs and skipped for CosyVoice3. - `--cohere-model-dir <path>` — path to a directory containing `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`, and `vocab.json`. - `--asr-language <code>` — overrides the inferred language code (covers all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh, ko, vi). - `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins `MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu` when the q8 encoder fails ANE compilation (`MILCompilerForANE error: failed to compile ANE model using ANEF`) to skip the multi-minute fallback compile on the first call. The harness logs a WER caveat for zh/ja runs flagging that whitespace-tokenized WER is meaningless and the CER column is the real signal. 
Example end-to-end:

```bash
fluidaudio tts-benchmark \
  --backend cosyvoice3 \
  --corpus minimax-chinese \
  --asr-backend cohere \
  --cohere-model-dir /path/to/cohere/q8 \
  --asr-language zh \
  --output-json benchmark_results/cv3-zh-cohere.json \
  --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure (`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per `Documentation/ASR/Cohere.md`) to RTFx ~0.13×; correctness is unaffected (same graph, same output), only latency is. The full 100-phrase CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was therefore produced via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100 phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5% CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` / `prose-zh` shipped with the first cut of this PR with the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set) (CC-BY-SA-4.0; 100 phrases × 25 languages). Same public corpus used by [MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and Gradium, so numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in `Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them from the upstream HF dataset at the pinned revision and writes them to the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth (HF_TOKEN env) + retry/backoff; no `swift-transformers` dep added, no hardcoded asset URLs.
The `.txt` files now live in `.gitignore` since they're CC-BY-SA-4.0 derivative content; only `Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in `ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior `python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend language scope: | Backend | Languages benchmarked | |---|---| | Kokoro / Kokoro ANE | en (af_heart) | | PocketTTS | en + de + it + pt + es + fr | | Magpie | en + es + de + fr + it + vi + zh + hi | | StyleTTS2 | en (LibriTTS multi-spk) | | CosyVoice3 | zh + yue | ### PocketTTS streaming TTFT (`c26f1e163`) PocketTTS now drives the harness through its `synthesizeStreaming` API so TTFT measures time-to-first-80ms-frame instead of full one-shot synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage that one-shot benchmarking previously hid. ### Reference voice dumper helper (mobius-styletts2) `mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo) wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize` consumes via `--voice`. Required because the shipped CoreML bundle doesn't include those upstream-only PyTorch encoders. 
## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x` produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie: full 100/100 minimax-en
- [x] StyleTTS2: full 100/100 minimax-en (verified after `sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3: full 100/100 minimax-zh + 100/100 minimax-yue (verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error, LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on `initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`; cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`, `TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated and confirmed structural (250-token Flow input shape, 40 ms / token); surfaced via `finishedOnEos` + warning log + JSON `finished_on_eos` field. See [Decode budget cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site workaround. Validated on full minimax-cantonese: truncation **80/100 → 5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×. 16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3 auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in `TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50 **9.6 s** (first chunk) vs full-synth-p50 **15.1 s**, a 36% earlier playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` + `--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`). Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8 macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04%; both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**. Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote ‡)
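
The macro and micro CER figures quoted in this description aggregate per-phrase errors differently. The sketch below assumes the usual definitions (macro = mean of per-phrase CER, micro = total character edits over total reference characters). Whether `Scripts/whisper_zh_cer.py` normalizes text or aggregates in exactly this way is not stated here, and `editDistance` / `cerScores` are illustrative names, not part of the harness.

```swift
/// Levenshtein distance between two character sequences (two-row DP).
func editDistance(_ ref: [Character], _ hyp: [Character]) -> Int {
    if ref.isEmpty { return hyp.count }
    if hyp.isEmpty { return ref.count }
    var previous = Array(0...hyp.count)
    for i in 1...ref.count {
        var current = [Int](repeating: 0, count: hyp.count + 1)
        current[0] = i
        for j in 1...hyp.count {
            let substitution = previous[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[hyp.count]
}

/// Returns (macro, micro) CER over reference/hypothesis pairs.
/// Pairs with an empty reference are skipped to avoid dividing by zero.
func cerScores(_ pairs: [(ref: String, hyp: String)]) -> (macro: Double, micro: Double) {
    var perPhrase: [Double] = []
    var totalEdits = 0
    var totalChars = 0
    for pair in pairs {
        let ref = Array(pair.ref)
        guard !ref.isEmpty else { continue }
        let edits = editDistance(ref, Array(pair.hyp))
        perPhrase.append(Double(edits) / Double(ref.count))
        totalEdits += edits
        totalChars += ref.count
    }
    guard !perPhrase.isEmpty else { return (0, 0) }
    let macro = perPhrase.reduce(0.0, +) / Double(perPhrase.count)
    let micro = Double(totalEdits) / Double(totalChars)
    return (macro, micro)
}

// Two toy phrases: one substitution in a 2-character phrase, none in a 6-character phrase.
let scores = cerScores([
    (ref: "你好", hyp: "你号"),
    (ref: "今天天气很好", hyp: "今天天气很好"),
])
print("macro CER: \(scores.macro), micro CER: \(scores.micro)")
```

Short phrases weigh more heavily in the macro figure, which is one reason the two numbers can diverge on a corpus with mixed phrase lengths.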
# CosyVoice3 Swift Inference
Mandarin zero-shot voice cloning via Qwen2 LM + CFM Flow + HiFT vocoder, running on CoreML.
⚠️ Beta / experimental. End-to-end synthesis is below real-time on Apple Silicon: agg-RTFx 0.357× and p50 TTFT ~9.6 s on the full `minimax-chinese` 100-phrase corpus (M2, default compute units), after the HiFT timeout fix and the LLM-Decode `outputBackings` double-buffer. The slowdown is partly the Flow CFM stage (fp32, CPU-or-GPU only because fp16 + ANE produces NaNs through the fused `layer_norm`; a CoreMLTools limitation, tracked upstream) and partly HiFT sinegen / windowing ops that fall back to CPU. This may be a model issue, or it may be recoverable through better conversion. Treat performance numbers as preliminary; the Swift API, model layout, and prompt-asset format may change in subsequent releases without deprecation aliases.
## Files

| File | Role |
|---|---|
| `CosyVoice3TtsManager.swift` | Public actor: `initialize()`, `synthesize()`, `synthesizeFromFixture()`, `loadVoice()`, `downloadAndCreate()` |
| `CosyVoice3Models.swift` | The 4 CoreML model handles (prefill, decode, flow, hift) |
| `Assets/CosyVoice3ModelStore.swift` | Loads + compiles the four mlpackages, probes flat / nested layouts |
| `Assets/CosyVoice3ResourceDownloader.swift` | HuggingFace pull for `FluidInference/CosyVoice3-0.5B-coreml` |
| `Pipeline/Synthesize/CosyVoice3Synthesizer.swift` | Actor: prefill → decode loop → Flow → HiFT |
| `Pipeline/Synthesize/CosyVoice3RasSampler.swift` | top-p / top-k / repetition mask, seed-tokens bypass |
| `Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift` | mmap of 6761×896 fp16 speech-embedding table (12 MB) |
| `Pipeline/Synthesize/CosyVoice3Types.swift` | `CosyVoice3SynthesisOptions`, `CosyVoice3SynthesisResult`, `CosyVoice3ParityOptions` |
| `Pipeline/Preprocess/CosyVoice3TextFrontend.swift` | Special-token splitting + `lm_input_embeds` assembly |
| `Pipeline/Preprocess/Qwen2BpeTokenizer.swift` | tiktoken-compatible byte-level BPE, 151 936 vocab (incl. `fileprivate ByteEncoder` 188-symbol byte→unicode shim) |
| `Pipeline/Preprocess/CosyVoice3TextEmbeddings.swift` | mmap of 151 936×896 fp16 text embedding table |
| `Pipeline/Preprocess/CosyVoice3ChineseNormalizer.swift` | Minimal regex-free port of `frontend_utils.py` |
| `Pipeline/Preprocess/CosyVoice3PromptMel.swift` | 24 kHz 80-bin log-mel matching matcha `audio.py` |
| `Pipeline/Preprocess/CosyVoice3PromptAssets.swift` | Voice-prompt bundle DTO (precomputed IDs / mel / spk-emb) |
| `Pipeline/Preprocess/CosyVoice3FrontendFixture.swift` | Phase 1 parity-fixture loader |
| `CosyVoice3Constants.swift` | Stop-token range, hidden dim, frame counts, etc. |
| `Shared/SafetensorsReader.swift` | ~170 LoC pure-Swift mmap + fp16/fp32/i32 accessors |
## Call Flow
CosyVoice3TtsManager.synthesize(text:promptAssets:options:)
|
v
CosyVoice3TextFrontend.assembleLmInput(text:promptAssets:)
|
|-- normalizeText() split on <|endofprompt|>, replace_blank, etc.
|-- Qwen2BpeTokenizer.encode byte-level BPE → token IDs
|-- text_embedding lookup 151 936×896 fp16 mmap → [N_text, 896]
|-- speech_embedding lookup 6761×896 fp16 mmap → [N_speech, 896]
|-- concat([SOS, text, TASK, prompt_speech_ids]) → lm_input_embeds
|
v
CosyVoice3Synthesizer.synthesize(lm_input_embeds:promptAssets:)
|
|-- runPrefill() Qwen2 24L prefill, T <= 256
| |-- in: lm_input_embeds, attn_mask
| |-- out: logits[1,T,6761], kv_k / kv_v [24,1,2,768,64] fp32 each
|
|-- DECODE LOOP (until stop-range hit or maxNewTokens):
| |
| |-- runDecodeStep() takes prev token + cached KV
| | |-- in: token_id, kv_k, kv_v (carried forward from the previous step)
| | |-- out: logits[1,1,6761], kv_k_out, kv_v_out
| |
| |-- RasSampler.sample() top-p/top-k/repetition + seed-tokens bypass
| |-- if topId in stopRange (6561...6760): break
| |-- decoded.append(topId)
|
|-- runFlow() CFM 10-step ODE, conditional on prompt mel + spk_emb
| |-- in: decoded[N], prompt_mel, spk_embedding
| |-- out: full_mel[1, 80, M] fp32
|
|-- runHiFT() vocoder, chunk-packed (T<=500 frames)
| |-- in: full_mel slice from newMelStart..newMelStart+newMelFrames
| |-- out: audio samples [N*hop_len] @ 24 kHz
|
|-- concatenate chunks → CosyVoice3SynthesisResult.samples
## Public API
import FluidAudio
// One-shot creation that downloads everything to ~/.cache/fluidaudio/
let manager = try await CosyVoice3TtsManager.downloadAndCreate(
computeUnits: .cpuAndNeuralEngine
)
try await manager.initialize()
// Load a voice prompt bundle (precomputed by mobius/.../bootstrap_aishell3_voices.py)
let voice = try CosyVoice3PromptAssets.load(from: voiceBundleURL)
let result = try await manager.synthesize(
text: "希望你以后能够做的比我还好用",
promptAssets: voice,
options: CosyVoice3SynthesisOptions(maxNewTokens: 1024, seed: 42)
)
// result.samples : [Float] (mono fp32, 24 kHz)
// result.sampleRate : 24000
`CosyVoice3SynthesisOptions`:

| Field | Default | Notes |
|---|---|---|
| `maxNewTokens` | `nil` (= `flowTotalTokens − N_prompt`) | Soft ceiling on the LLM-Decode AR loop. The hard ceiling is the structural 250-token cap below; `maxNewTokens` only lets you generate fewer than that. |
| `seed` | `42` | Drives the RAS sampler RNG; reproducible runs |
| `disableAutoChunking` | `false` | When `true`, bypasses `CosyVoice3TextChunker` and runs a single synthesizer call regardless of input length. Use when you've pre-segmented input upstream (UI streaming, paragraph-at-a-time playback, etc.). The structural 250-token cap then applies and long inputs truncate mid-utterance. |
`CosyVoice3SynthesisResult`:

| Field | Type | Notes |
|---|---|---|
| `samples` | `[Float]` | mono, fp32, range ~[-1.0, 1.0] |
| `sampleRate` | `Int` | always 24000 |
| `generatedTokenCount` | `Int` | tokens before EOS |
| `decodedTokens` | `[Int32]` | full speech token sequence (debug) |
| `finishedOnEos` | `Bool` | `true` = AR loop exited on an EOS token (natural termination); `false` = budget exhausted, audio truncated mid-utterance. See "Decode budget cap" below. |
## Decode budget cap + auto-chunking
The Flow CFM model is exported with a fixed-shape `token_total` input of
`[1, 250]` (`CosyVoice3Constants.flowTotalTokens = 250`). Each LLM-Decode
token corresponds to 40 ms of audio (`tokenMelRatio` (2) × `hiftSamplesPerFrame` (480) / `sampleRate` (24 000) = 0.04 s),
so the generated portion of a single synthesizer call is bounded by
`(250 − N_prompt) × 40 ms`. With a typical prompt of ~85–95 tokens,
this leaves ~6.2–6.6 s of generated audio per call, so long Mandarin
phrases would truncate mid-utterance if synthesized in one shot.
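
The bound above is simple arithmetic; a minimal sketch follows. The constant names mirror the ones quoted in this section, but the standalone helper `maxGeneratedSeconds` is hypothetical and not part of the FluidAudio API.

```swift
// Values quoted in the paragraph above; this is an illustrative helper,
// not the library's CosyVoice3Constants.
let flowTotalTokens = 250        // fixed Flow `token_total` shape [1, 250]
let tokenMelRatio = 2            // ratio quoted above
let hiftSamplesPerFrame = 480    // value quoted above
let sampleRate = 24_000          // Hz

/// Seconds of audio represented by one speech token: 2 × 480 / 24 000.
let secondsPerToken = Double(tokenMelRatio * hiftSamplesPerFrame) / Double(sampleRate)

/// Upper bound on generated audio for a single synthesizer call,
/// given how many of the 250 slots the prompt already occupies.
func maxGeneratedSeconds(promptTokens: Int) -> Double {
    let roomForNew = max(0, flowTotalTokens - promptTokens)
    return Double(roomForNew) * secondsPerToken
}

for promptTokens in [85, 90, 95] {
    let seconds = maxGeneratedSeconds(promptTokens: promptTokens)
    print("prompt of \(promptTokens) tokens leaves at most \(seconds) s of audio")
}
```

Longer voice prompts eat directly into the per-call budget, which is why the achievable duration depends on the prompt bundle as much as on the input text.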
`CosyVoice3TtsManager.synthesize(...)` auto-chunks long input to
sidestep this. Pipeline:

- Run the existing Chinese normalizer (or skip it, per `prenormalized`).
- `CosyVoice3TextChunker.chunk(normalized)` greedily splits on hard sentence enders (`. ! ? 。 ! ?`) and falls back to soft clause separators (`, ; , ; 、 :`) when sentences exceed the budget. The default budget is `defaultMaxSpeechTokens = 110` speech tokens (~45-token margin under the typical 155 room-for-new; the 30-token force-split overshoot may push committed chunks to ~140 estimated).
- If the chunker returns one segment, take the fast path: single synthesizer call, no concat overhead.
- Otherwise loop, calling the synthesizer once per chunk, then merge results: PCM concatenated with an 8 ms cosine cross-fade at each boundary (masks DC/phase mismatch from independent synth calls); `generatedTokenCount` / `decodedTokens` summed / concatenated; `finishedOnEos` = AND across all chunks.
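
The merge step in the last bullet can be sketched as follows. This is a minimal illustration assuming an equal-power (sin/cos) fade that overlaps the tail of one chunk with the head of the next; `ChunkResult`, `crossfadeJoin`, and `mergeChunks` are hypothetical names, and the shipped `concatWithCrossfade` may differ in how it overlaps or windows the samples.

```swift
import Foundation

/// Hypothetical stand-in for one chunk's synthesis result.
struct ChunkResult {
    var samples: [Float]
    var finishedOnEos: Bool
}

/// Joins two mono PCM buffers, overlapping the last `fadeMs` of `a`
/// with the first `fadeMs` of `b` using equal-power cosine gains.
func crossfadeJoin(_ a: [Float], _ b: [Float],
                   sampleRate: Int = 24_000, fadeMs: Double = 8) -> [Float] {
    let requested = Int(Double(sampleRate) * fadeMs / 1000)   // 192 samples at 24 kHz
    let n = min(requested, a.count, b.count)
    guard n > 0 else { return a + b }
    var out = Array(a[..<(a.count - n)])
    for i in 0..<n {
        // theta sweeps (0, pi/2); cos^2 + sin^2 = 1 keeps the summed power constant.
        let theta = (Double(i) + 0.5) / Double(n) * Double.pi / 2
        let fadeOut = Float(cos(theta))
        let fadeIn = Float(sin(theta))
        out.append(a[a.count - n + i] * fadeOut + b[i] * fadeIn)
    }
    out.append(contentsOf: b[n...])
    return out
}

/// Merges per-chunk results: audio is crossfaded, `finishedOnEos` is ANDed.
/// An empty input returns an empty, "finished" result (an arbitrary choice here).
func mergeChunks(_ chunks: [ChunkResult]) -> ChunkResult {
    guard var merged = chunks.first else {
        return ChunkResult(samples: [], finishedOnEos: true)
    }
    for chunk in chunks.dropFirst() {
        merged.samples = crossfadeJoin(merged.samples, chunk.samples)
        merged.finishedOnEos = merged.finishedOnEos && chunk.finishedOnEos
    }
    return merged
}

let tone = [Float](repeating: 0.5, count: 2_400)   // 100 ms of constant signal
let merged = mergeChunks([
    ChunkResult(samples: tone, finishedOnEos: true),
    ChunkResult(samples: tone, finishedOnEos: false),
])
print(merged.samples.count, merged.finishedOnEos)
```

Because the AND is taken across chunks, a single truncated chunk marks the whole merged result as truncated, which matches the per-phrase `finished_on_eos` reporting described elsewhere in this document.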
Tunables: `CosyVoice3TextChunker.defaultMaxSpeechTokens` (110) is the
default budget; pass `disableAutoChunking: true` in
`CosyVoice3SynthesisOptions` to bypass the chunker entirely and run a
single call (useful for UI-driven sentence-at-a-time streaming where
the caller already controls segmentation).
Token-rate estimate inside the chunker (calibrated against minimax-zh corpus runs — initial 5.5 figure was too optimistic and let ~16% of phrases hit the cap; 7.5 covers the worst-case observed real rate):
| Class | Tokens/char | Rationale |
|---|---|---|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | BPE compresses; English speaks faster than Mandarin per char |
| Other (Latin-1, etc.) | 2.5 | middle ground |
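
A rough sketch of how the per-class rates and the split policy described in this section could fit together. The Unicode ranges treated as CJK, the exact ender sets, and the names `tokenRate`, `estimateSpeechTokens`, and `chunkSketch` are assumptions for illustration; the real `CosyVoice3TextChunker` may classify characters and pick boundaries differently.

```swift
import Foundation

/// Per-character token rate using the three classes from the table above.
/// The CJK ranges below (Unified Ideographs + Extension A) are an assumption.
func tokenRate(_ ch: Character) -> Double {
    guard let value = ch.unicodeScalars.first?.value else { return 0 }
    switch value {
    case 0x3400...0x4DBF, 0x4E00...0x9FFF: return 7.5   // CJK
    case 0x00...0x7F: return 1.5                        // ASCII
    default: return 2.5                                 // everything else
    }
}

func estimateSpeechTokens(_ text: String) -> Double {
    text.reduce(0.0) { $0 + tokenRate($1) }
}

/// Greedy splitter: hard enders always commit, soft enders commit only at or
/// past the budget, and a chunk is force-split once it overshoots by `overshoot`.
func chunkSketch(_ text: String, budget: Double = 110, overshoot: Double = 30) -> [String] {
    let hardEnders: Set<Character> = [".", "!", "?", "。", "\n"]
    let softEnders: Set<Character> = [",", ";", ":", "、", " "]
    var chunks: [String] = []
    var current = ""
    var tokens = 0.0

    func commit() {
        let trimmed = current.trimmingCharacters(in: .whitespacesAndNewlines)
        if !trimmed.isEmpty { chunks.append(trimmed) }
        current = ""
        tokens = 0
    }

    for ch in text {
        current.append(ch)
        tokens += tokenRate(ch)
        let atHard = hardEnders.contains(ch)
        let atSoft = softEnders.contains(ch) && tokens >= budget
        let forced = tokens >= budget + overshoot
        if atHard || atSoft || forced { commit() }
    }
    commit()   // flush whatever is left after the last boundary
    return chunks
}

print(estimateSpeechTokens("你好"))
print(chunkSketch("你好。再见。"))
```

With the 7.5 tokens/char CJK rate, an unpunctuated run of ideographs is force-split after 19 characters (142.5 estimated tokens), which is consistent with the ~140-token overshoot figure mentioned above.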
Caveats:

- **Prosody discontinuity at boundaries.** Each chunk re-establishes the pitch contour from the prompt, so concatenated audio has audible breaks at chunk seams. The 8 ms cross-fade hides clicks/DC offsets but cannot reconstruct cross-sentence prosody.
- **Per-chunk prefill cost.** Each segment pays the prefill cost separately, so total wall-clock for an N-chunk synth is roughly `N × prefill + Σ decode_per_chunk`. Single-chunk inputs are unaffected.
- **Estimate slack.** The token-per-char heuristic is rough; if a chunk somehow exceeds the model's structural budget at runtime, the synthesizer still emits the `LLM-Decode budget exhausted` warning and returns `finishedOnEos: false` for that chunk.
Behavior of the underlying synthesizer when its budget is hit (still
applies for `disableAutoChunking: true` or for one-shot mode):

- AR loop exhausts `maxNew` without observing an EOS in `CosyVoice3Constants.stopRange` (`6_561…6_760`).
- `CosyVoice3Synthesizer` emits a `.warning`-level log: `"LLM-Decode budget exhausted: <N> generated tokens / <maxNew> cap (no EOS observed). Output truncated at ~<S>s of audio."`.
- `result.finishedOnEos` is `false` so callers can detect it programmatically (the `tts-benchmark` harness surfaces this as a per-phrase `finished_on_eos` field in the JSON report).
Lifting the cap structurally (no auto-chunk, no prosody seams) requires
re-exporting Flow with a larger token_total shape (e.g. [1, 500] for
~16 s) — handled upstream in the mobius-cosyvoice3 conversion pipeline;
not changeable from the Swift host.
Key State
KV cache (`kv_k` / `kv_v` each `[24, 1, 2, 768, 64]` fp32)
- 24 transformer layers × `[K,V]` × heads × dim, split across two `MLMultiArray` outputs (`kv_k`, `kv_v`) that prefill produces and the decode loop carries forward across steps via `MLPredictionOptions.outputBackings` double-buffering.
- No `MLState` dependency; runs on the package baseline (macOS 14 / iOS 17).
- ~9 MB per array; pre-allocated front/back/spare buffers rotated each step (see LLM-Decode `outputBackings` fix).
- Reset per `synthesize()` call.
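
The "~9 MB per array" figure follows directly from the shape and precision quoted above, as this short calculation shows. The final line assumes that each of `kv_k` and `kv_v` keeps its own front/back/spare set (three buffers per array); that reading is an assumption, not something stated above.

```swift
// Shape and precision as quoted above: [24, 1, 2, 768, 64] fp32.
let kvShape = [24, 1, 2, 768, 64]
let bytesPerElement = MemoryLayout<Float>.size          // fp32 = 4 bytes
let elementCount = kvShape.reduce(1, *)
let bytesPerArray = elementCount * bytesPerElement
let mibPerArray = Double(bytesPerArray) / 1_048_576.0

// Assumption: front/back/spare buffers exist for both kv_k and kv_v.
let assumedResidentMiB = mibPerArray * 3 * 2

print(elementCount, bytesPerArray, mibPerArray, assumedResidentMiB)
```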
Prompt assets (`CosyVoice3PromptAssets`)
- `promptText`: Mandarin reference text (must contain `<|endofprompt|>`).
- `promptSpeechIds: [Int32]`: pre-tokenized speech IDs from the SpeechTokenizerV3 mlpackage (computed offline, reused across calls).
- `promptMel: [Float]`, `promptMelFrames`: 80-bin log-mel of the reference audio at 24 kHz.
- `spkEmbedding: [Float]`: 192-dim speaker embedding from CAMPPlus.
Bundles are produced by
`mobius/models/tts/cosyvoice3/coreml/verify/bootstrap_aishell3_voices.py`
or `extract_voice_prompt.py` for arbitrary speakers.
CoreML details
- Compute units: caller chooses (`.cpuAndNeuralEngine` works for prefill + decode + HiFT). Flow is forced to `.cpuAndGPU` regardless: it is an fp32 graph, and ANE NaNs through the fused `layer_norm`.
- All four mlpackages are compiled `.mlpackage → .mlmodelc` on first load and cached on disk under `~/.cache/fluidaudio/Models/cosyvoice3/`.
- `CosyVoice3ModelStore` is an actor; `CosyVoice3Synthesizer` is an actor.
- `CosyVoice3Models` (the four-tuple) conforms to `Sendable` via `@preconcurrency import CoreML`, matching the existing `TtsModels` pattern.
Stop-token handling
- Speech vocab is `0..<6761`; tokens `6561..<6761` are the EOS range.
- `CosyVoice3Constants.stopRange = 6561...6760` (closed range). The decode loop breaks when `topId` falls in that range.
- If the prefill emits a stop token at step 0, the synthesizer throws `CosyVoice3Error.predictionFailed` instead of falling through; feeding the stop-token embedding into the decode loop would accumulate semantically meaningless tokens.
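
The stop-token rules above can be sketched as a toy decode loop. Every name below is hypothetical and the "model" is a closure that stands in for the real LLM-Decode step; only the range bounds and the three behaviours (throw on a stop token at step 0, break on EOS, stop at the `maxNew` cap) are taken from the text. Whether the prefill token counts toward `maxNew` is an assumption of this sketch.

```swift
// Hypothetical names throughout; this is not the CosyVoice3Synthesizer source.
enum SketchConstants {
    static let stopRange: ClosedRange<Int32> = 6561...6760
}

enum SketchError: Error {
    case stopTokenAtPrefill
}

struct DecodeOutcome {
    var tokens: [Int32]
    var finishedOnEos: Bool
}

func decodeSketch(
    prefillToken: Int32,
    maxNew: Int,
    step: (Int32) -> Int32
) throws -> DecodeOutcome {
    // A stop token straight out of prefill is an error, not an empty result.
    guard !SketchConstants.stopRange.contains(prefillToken) else {
        throw SketchError.stopTokenAtPrefill
    }
    var outcome = DecodeOutcome(tokens: [prefillToken], finishedOnEos: false)
    while outcome.tokens.count < maxNew {
        let topId = step(outcome.tokens[outcome.tokens.count - 1])
        if SketchConstants.stopRange.contains(topId) {
            outcome.finishedOnEos = true   // EOS observed: stop without appending it
            break
        }
        outcome.tokens.append(topId)
    }
    return outcome                          // finishedOnEos stays false at the cap
}

// Toy "model": each step returns the previous token + 1.
let reachedEos = try decodeSketch(prefillToken: 6558, maxNew: 100) { $0 + 1 }
let hitCap = try decodeSketch(prefillToken: 100, maxNew: 3) { $0 + 1 }
print(reachedEos.tokens, reachedEos.finishedOnEos)
print(hitCap.tokens, hitCap.finishedOnEos)
```

The second call shows the truncation path: the loop stops at the cap with `finishedOnEos` still `false`, which is the condition a caller would check.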
CLI
fluidaudio tts --backend cosyvoice3 \
--text "希望你以后能够做的比我还好用" \
--models-dir ~/.cache/fluidaudio/Models/cosyvoice3 \
--tokenizer-dir … --embeddings-file … --special-tokens-file … \
--prompt-assets path/to/voice.safetensors \
--output out.wav
`--backend cosyvoice3` (and the `cv3` alias) runs the production
text-driven synthesis path. `--backend` help text flags it as
`[BETA — slow, RTFx < 1.0]` and the dispatcher emits a runtime
`logger.warning` so the beta status shows up without reading docs.
Dev sub-backends (for debugging the Python ↔ Swift contract)
These are the harnesses future contributors use to bisect divergence between the Swift port and the upstream Python reference. Each isolates a distinct stage of the pipeline:
fluidaudio tts --backend cosyvoice3-tokenizer-parity \
--tokenizer-dir … --fixture tokenizer_fixture.json
# Qwen2 BPE encode/decode parity vs tiktoken reference
fluidaudio tts --backend cosyvoice3-frontend-parity \
--tokenizer-dir … --embeddings-file … \
--fixture shipping.safetensors --tok-fixture …
# lm_input_embeds assembly parity (text+speech embed lookup, SOS/TASK splice)
fluidaudio tts --backend cosyvoice3-parity \
--fixture shipping.safetensors --models-dir build/
# Phase 1 fixture parity (Synthesizer: prefill → decode → Flow → HiFT)
Recommended bisection order when end-to-end output diverges from Python: tokenizer-parity → frontend-parity → fixture parity.
The production backend auto-downloads its CoreML mlpackages, tokenizer,
embeddings, and default voice from HuggingFace on first synthesis (cached
under ~/.cache/fluidaudio/Models/cosyvoice3/) — there is no separate
download CLI mode, matching how Kokoro and PocketTTS work.
Models
| Component | mlpackage | Precision | Notes |
|---|---|---|---|
| Qwen2 LLM Prefill (T=256, M=768) | `LLM-Prefill-T256-M768-fp16` | fp16 | KV-cache out |
| Qwen2 LLM Decode (M=768) | `LLM-Decode-M768-fp16` | fp16 | KV-cache in-place |
| CFM Flow (N=250 → M=500 mel) | `Flow-N250-fp32` | fp32 | CPU/GPU only |
| HiFT vocoder (T=500 → 10 s @ 24 kHz) | `HiFT-T500-fp16` | fp16 | sinegen on CPU |
| Qwen2 + speech embedding tables | `embeddings-fp16.safetensors` | fp16 | mmap'd at runtime |
All shipped at
FluidInference/CosyVoice3-0.5B-coreml.
The conversion pipeline that produced them lives in
FluidInference/mobius#42.
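
The shapes in the table imply a fixed relationship between speech tokens, mel frames, and seconds of audio, worked through below. The ratios are derived only from the table's figures (250 tokens → 500 mel frames, 500 mel frames → 10 s). The idea that prompt speech tokens share the Flow window with generated tokens, and the prompt length of 100 used in the example, are assumptions for illustration.

```swift
// Ratios derived from the Models table; names are hypothetical.
enum FlowShapeSketch {
    static let melFramesPerSecond = 500.0 / 10.0    // HiFT: 500 frames -> 10 s
    static let melFramesPerToken = 500.0 / 250.0    // Flow: 250 tokens -> 500 frames
    static let tokensPerSecond = melFramesPerSecond / melFramesPerToken
}

// Assumption: prompt tokens consume part of the Flow token window, so only
// the remainder is available for newly generated audio.
func newAudioSeconds(tokenTotal: Int, promptTokens: Int) -> Double {
    Double(max(tokenTotal - promptTokens, 0)) / FlowShapeSketch.tokensPerSecond
}

let fullWindow = newAudioSeconds(tokenTotal: 250, promptTokens: 0)
let largerExport = newAudioSeconds(tokenTotal: 500, promptTokens: 100)  // placeholder prompt length
print(FlowShapeSketch.tokensPerSecond, fullWindow, largerExport)
```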
Non-goals / known limits
- No on-device prompt-asset preparation. SpeechTokenizerV3 and CAMPPlus have CoreML mlpackages but the surrounding DSP isn't ported to Swift yet. Callers either use the bundled `cosyvoice3-default-zh` voice or run the Python `extract_voice_prompt.py` offline.
- No production-grade Mandarin TN. `CosyVoice3ChineseNormalizer` only mirrors the simple cleanups in upstream `frontend_utils.py`. For year / currency / decimal / unit normalization, run `wetext.ZhNormalizer` server-side and pass `prenormalized: true` on `synthesize()`.
- Flow stays fp32 (~1.2 GB). Until CoreMLTools pins fused-`layer_norm` fp16 the model NaNs on ANE. Loaded once, kept resident.
- Streaming API not yet exposed. The synthesizer runs Phase 1 (prefill) and Phase 2 (Flow + HiFT) sequentially against the full token sequence. Token streaming is internal but not surfaced through an `AsyncStream`.
License
- CosyVoice3 model weights: Apache 2.0, inherited from FunAudioLLM/CosyVoice upstream (`speech_300m`, `Fun-CosyVoice3-0.5B-2512`).
- FluidAudio SDK: Apache 2.0.