mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557)
## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring **latency × efficiency × quality** across every shipping TTS backend in FluidAudio, plus the model + runtime fixes needed to actually clear all six backends end-to-end on the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set). Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs level so users get a runtime warning on `initialize()` reflecting their actual perf / quality posture.

### Backends — all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER | Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) | ✅ | 3.5 s / 8.0 s / 11.4 s | 5.19× | 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) | ✅ | 3.5 s / 6.8 s / 9.3 s | 2.02× | 1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip |
| PocketTTS | minimax-en (100/100) | ✅ | 2.8 s / 6.3 s / 9.4 s | 0.61× | 1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms — RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s | 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6 s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s / **16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine) — longest output 16.0 s (was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); **whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100 phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s / **16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 → 5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a `logger.warning` flagging the perf / quality posture; safe to ship in non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator` whitespace-tokenizes and Mandarin has no word boundaries (word-level WER reads ~100% and is meaningless). CER is `whisper-large-v3` against the rendered WAVs from the full 100-phrase `minimax-chinese` run via `Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this PR via `--asr-backend cohere` (see [Cohere ASR backend in the harness](#cohere-asr-backend-in-the-harness) below) and agrees with whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a `MILCompilerForANE` cache failure on this M2 host that drops it to RTFx ~0.13×, so whisper is the practical source-of-truth for the full 100-phrase run.

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category) live in `Documentation/TTS/Benchmarks.md`. Corpus attribution + reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat — phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly comparable between backends when both produce similar phrase lengths and yield audio at the same granularity**. Two things skew the headline number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per `minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3× more audio out.
That's mostly long inter-word pauses + slow speaking rate baked into the LibriTTS multi-speaker checkpoint, not a measurement artifact. A 2.72× RTFx on 22 s audio = ~8 s wall — which matches the TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus RTFx ratios alone hide this.

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs. Kokoro's 2.02× but it's **not slower from a user perspective**: PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**, Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The 0.61× is the per-frame cost averaged across the streaming run; what users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |
|---|---|---|---|
| PocketTTS | 1244 ms | 80 ms frame | true streaming; conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio; ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte |
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s — 36% earlier playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz | one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only |

For conversational use cases, **TTFT > RTFx**. PocketTTS (true streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE (small one-shot chunks) are the three backends that meaningfully clear the "user feels it's responsive" bar today.

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should weigh against an explicit caveat:

- **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
  The misaki→espeak post-pass remap cut WER from 58.1% to 44.0% and roughly halved CER (0.476 → 0.241); the remainder is BART G2P misses + diffusion-sampler formant breaks on long phrases.
- **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via `synthesizeStream` so TTFT (9.6 s p50) is significantly better than full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to ~30 s.
- **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token Flow input cap is now worked around at the call site by the auto-chunker (long phrases split + crossfaded), dropping cantonese truncation from 80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100 residual is the long-tail token-rate worst case; the structural fix is re-exporting Flow with a larger fixed input shape (tracked in `mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` + a `.warning`-level `LLM-Decode budget exhausted` log still surface any truncation, and the harness writes `finished_on_eos` into each phrase in the JSON report.

Each manager now logs a `.warning`-level beta notice on `initialize()` (mirroring the existing CosyVoice3 pattern) so anyone wiring these into a product gets a console signal, not a silent surprise. Docs (`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md` StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same caveat at the top.

### Model + runtime fixes landed in this PR

#### CosyVoice3 stateless port (`71130c9fb`)

Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain `MLDictionaryFeatureProvider` prediction with explicit kv carry-forward; lowers the availability gate from macOS 15 / iOS 18 back to the package baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the rename.
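
To illustrate what "explicit kv carry-forward" means in a stateless decode loop, here is a minimal Python sketch (the real implementation is Swift on top of CoreML). `decode`, `toy_step`, and the token ids are hypothetical stand-ins, not FluidAudio APIs. The only points being shown are that each step's cache output becomes the next step's input, and that running out of step budget is reported to the caller instead of being hidden.

```python
from typing import Callable, List, Tuple

KV = List[int]  # hypothetical stand-in for the kv_k / kv_v tensors


def decode(step: Callable[[int, KV], Tuple[List[float], KV]],
           first_token: int, eos: int, max_steps: int) -> Tuple[List[int], bool]:
    """Greedy stateless decode: the caller owns the cache and feeds it back in."""
    kv: KV = []
    token = first_token
    out: List[int] = []
    for _ in range(max_steps):
        logits, kv = step(token, kv)  # this step's kv output is the next step's kv input
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if token == eos:
            return out, True   # finished on EOS
        out.append(token)
    return out, False          # budget exhausted: the output may be truncated


def toy_step(token: int, kv: KV) -> Tuple[List[float], KV]:
    """Hypothetical model: predicts token + 1, then EOS (id 0) on its fourth call."""
    new_kv = kv + [token]
    vocab = 8
    target = 0 if len(new_kv) >= 4 else (token + 1) % vocab
    logits = [0.0] * vocab
    logits[target] = 1.0
    return logits, new_kv
```

The second return value plays the role that `finishedOnEos` plays in the PR: with a generous budget the toy model stops on EOS, and with a budget of two steps it returns a shorter sequence flagged as unfinished.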

#### CosyVoice3 HiFT timeout fix (`267766b62`)

`minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`, which let the planner place most of the graph on ANE but kept at least one op on the BNNS CPU async-dispatch path; long phrases tripped the BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of user-supplied compute-units, removing the BNNS path entirely. Verified on 100/100 zh + 100/100 yue.

#### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`)

The autoregressive decode loop runs ~163 steps per phrase to fill the 250-token cap. Each step takes the previous step's KV cache as `kv_k` / `kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh `kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side `MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV back-buffers + a logits backing, rotates front/back/spare across steps via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc on first rejection (one-shot `logger.warning`). Mirrors the Magpie pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357 (+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470 MB.

#### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`)

The 250-token Flow input cap means a single synth pass produces at most ~6.5 s of audio regardless of input length. Re-exporting Flow with a larger fixed input shape is gated on upstream conversion work, so this PR works around it at the call site: long inputs are split at sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized independently, and merged with an 8 ms equal-power cosine crossfade.

**Splitter policy**: hard enders (`. ! ? 。 ! ? \n`) commit always; soft enders (`, 、 ; : ; ,` + ASCII space) commit only at-or-past budget; force-split at +30 token overshoot if no natural boundary exists. `defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap minus a typical 60–90-token speech-prompt context). Token-rate heuristic is calibrated against minimax-zh + minimax-yue runs:

| Char class | Tokens / char | Rationale |
|------------|---------------|-----------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |

**Validation** on full `minimax-cantonese` (100 phrases, M2):

| Metric | Pre-chunker | Post-chunker | Δ |
|--------|-------------|--------------|---|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |

The TTFT regression is the cost of running multiple synth passes per long phrase — splitting unblocks long-form output at the price of wall-clock latency. The 5/100 residual truncation is the long-tail token-rate worst case (some chars hit ~9 tokens/char); raising the per-CJK heuristic further would over-fragment short phrases. Cleaner fix is the Flow re-export.

16-test suite covers tokenization estimates, hard/soft/force-split policy, and the crossfade arithmetic. Lives in `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift` + `CosyVoice3TtsManager.concatWithCrossfade`.

#### Magpie streaming TTFT wire-up (`ace0bf485`)

`TtsBenchmarkCommand.swift` now drives Magpie through `MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first `MagpieAudioChunk` emit instead of conflating it with full-synth wall time.
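
The difference between first-chunk latency and full-synth wall time can be sketched in a few lines of Python. This is an illustration only: `measure_stream` and its injectable `clock` are hypothetical names, and the harness itself is Swift. The sketch timestamps the first item a chunk stream yields and, separately, the moment the stream is exhausted.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(chunks: Iterable[bytes],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float, int]:
    """Return (ttft_ms, synth_ms, chunk_count) for a stream of audio chunks."""
    start = clock()
    ttft_ms: Optional[float] = None
    count = 0
    for _ in chunks:
        if ttft_ms is None:
            ttft_ms = (clock() - start) * 1000.0  # first chunk has been emitted
        count += 1
    synth_ms = (clock() - start) * 1000.0         # stream fully drained
    if ttft_ms is None:
        ttft_ms = synth_ms  # nothing was streamed: fall back to one-shot semantics
    return ttft_ms, synth_ms, count
```

Passing the clock as a parameter keeps the function deterministic under test; in normal use the default `time.perf_counter` is enough.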
Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6 s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run benefit; fundamentals unchanged).

#### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`)

`text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT: tensor_buffer has known strides while the model has FlexibleShapeInfo`. The CoreML runtime rejects two access patterns on outputs from a flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element subscripts — and the original `sliceFirstAxis2D` helper used both. Fix rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling `.float32`, `.float16`, `.double`) and computes the flat index from the known `(1, leading, trailing)` row-major layout. Verified on full 100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS demo voice.

#### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`)

After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only --corpus` mode and disproved the silent-vocab-drop hypothesis: only **0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII hyphens / 12247 scalars).

Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2 share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2 LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized** LibriTTS — predating misaki by years. The 178-vocab accepts both forms (e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic embeddings for the misaki ligature glyphs are essentially untrained noise.

Side-by-side comparison against locally-installed `espeak-ng -v en-us --ipa -q` flagged four systematic divergences:

| misaki | espeak-ng | example |
|--------|-----------|---------|
| `ʧ` | `tʃ` | choice → `tʃˈɔɪs` |
| `ʤ` | `dʒ` | jumps → `dʒˈʌmps` |
| `ɜɹ` | `ɝ` | girl → `ɡˈɝl` |
| `əɹ` | `ɚ` | over → `ˈoʊvɚ` |

Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated on `.americanEnglish` and applied to the assembled phoneme string after every word has been emitted by the BART G2P. Lives alongside the existing per-piece misaki diphthong remap.

Result on the same 100-phrase MiniMax-English run with the same `libritts_696` voice and same Parakeet TDT roundtrip:

| Metric | Pre | Post | Δ |
|--------|-----|------|---|
| Macro WER | 0.581 | 0.440 | −24.2% |
| Macro CER | 0.476 | 0.241 | −49.5% |
| TTFT p50 (ms) | 8937 | 6671 | −25.4% |
| Agg RTFx | 2.36× | 2.72× | +15.3% |
| Peak RSS (MB) | 1428 | 963 | −32.6% |

Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice. Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster on word-level G2P misses from the BART itself (`practical → practicckles`, `separation → expiration`) and diffusion-sampler formant breaks; closing the rest of the gap to Kokoro likely needs richer espeak coverage or libespeak-ng vendor — tracked separately.

#### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`)

`StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit `logger.warning` beta notices mirroring the existing `CosyVoice3TtsManager.initialize` pattern. Backend docs (`Magpie.md` Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️ Beta / experimental` callouts so the perf / quality posture is visible at every entry point — runtime, manager docstring, doc top, PR body.
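
The four-rule misaki → espeak post-pass remap from the StyleTTS2 section above can be sketched as a plain string substitution. This is a Python illustration, not the Swift code in `StyleTTS2Phonemizer.phonemize`: `REMAP` and `remap_to_espeak` are hypothetical names, and the boolean flag only stands in for the `.americanEnglish` gate. The four pairs themselves are taken from the divergence table.

```python
# (misaki form, espeak-ng form) pairs, copied from the divergence table.
REMAP = [
    ("ʧ", "tʃ"),
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),
    ("əɹ", "ɚ"),
]


def remap_to_espeak(phonemes: str, american_english: bool = True) -> str:
    """Rewrite misaki-style glyphs to their espeak-ng equivalents.

    Applied once to the fully assembled phoneme string, not per word.
    """
    if not american_english:
        return phonemes  # the remap is gated on the American English variant
    for src, dst in REMAP:
        phonemes = phonemes.replace(src, dst)
    return phonemes
```

None of the four source patterns overlaps with another rule's output, so the order in which the rules are applied does not change the result.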

#### Magpie `outputBackings` rejection fallback (`72dae8400` + `9767e1ef9`)

The shipped `decoder_step.mlmodelc` reaches the user before the rebuild lands, so CoreML can reject our `outputBackings` dictionary on a name mismatch. The first rejection latches a flag, and the decoder uses the fresh-alloc path for the rest of the run, so the model still runs.

### Cohere ASR backend in the harness (`8e741e659`)

Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER through the harness against [Cohere Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into `--skip-asr`. Four new flags on `tts-benchmark`:

- `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip engine. Default is `parakeet` for English-only runs and skipped for CosyVoice3.
- `--cohere-model-dir <path>` — path to a directory containing `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`, and `vocab.json`.
- `--asr-language <code>` — overrides the inferred language code (covers all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh, ko, vi).
- `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins `MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu` when the q8 encoder fails ANE compilation (`MILCompilerForANE error: failed to compile ANE model using ANEF`) to skip the multi-minute fallback compile on the first call.

The harness logs a WER caveat for zh/ja runs flagging that whitespace-tokenized WER is meaningless and the CER column is the real signal.
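
A small Python sketch shows why whitespace-tokenized WER collapses on Mandarin while CER stays informative. The functions below are illustrative stand-ins, not the harness's `WERCalculator`; they assume plain Levenshtein distance over whitespace-split words for WER and over characters (spaces removed) for CER.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate with whitespace tokenization."""
    words = ref.split()
    return edit_distance(words, hyp.split()) / max(len(words), 1)


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces."""
    chars = ref.replace(" ", "")
    return edit_distance(chars, hyp.replace(" ", "")) / max(len(chars), 1)
```

For a six-character Mandarin sentence with one wrong character, the whole sentence counts as a single "word", so `wer` returns 1.0 while `cer` returns 1/6.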

Example end-to-end:

```bash
fluidaudio tts-benchmark \
  --backend cosyvoice3 \
  --corpus minimax-chinese \
  --asr-backend cohere \
  --cohere-model-dir /path/to/cohere/q8 \
  --asr-language zh \
  --output-json benchmark_results/cv3-zh-cohere.json \
  --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure (`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per `Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is unaffected (same graph, same output), only latency. The full 100-phrase CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was therefore produced via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100 phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5% CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` / `prose-zh` shipped with the first cut of this PR with the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set) (CC-BY-SA-4.0; 100 phrases × 24 languages). Same public corpus used by [MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and Gradium — numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in `Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them from the upstream HF dataset at the pinned revision and writes them to the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth (HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no hardcoded asset URLs.
The `.txt` files are now listed in `.gitignore` since they're CC-BY-SA-4.0 derivative content; only `Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in `ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior `python Scripts/fetch_minimax_tts_corpus.py` (also deleted).

Per-backend language scope:

| Backend | Languages benchmarked |
|---|---|
| Kokoro / Kokoro ANE | en (af_heart) |
| PocketTTS | en + de + it + pt + es + fr |
| Magpie | en + es + de + fr + it + vi + zh + hi |
| StyleTTS2 | en (LibriTTS multi-spk) |
| CosyVoice3 | zh + yue |

### PocketTTS streaming TTFT (`c26f1e163`)

PocketTTS now drives the harness through its `synthesizeStreaming` API so TTFT measures time-to-first-80ms-frame instead of full one-shot synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage that one-shot benchmarking previously hid.

### Reference voice dumper helper (mobius-styletts2)

`mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo) wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize` consumes via `--voice`. Required because the shipped CoreML bundle doesn't include those upstream-only PyTorch encoders.
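
As a sanity check on the "256-fp32 LE" format, the following Python sketch packs and unpacks such a blob with the standard `struct` module. It assumes the file is a raw, headerless array of 256 little-endian float32 values (1024 bytes), which is an inference from the description above and not verified against `06_dump_ref_s.py`. The function names are hypothetical.

```python
import struct
from typing import List, Sequence

REF_S_DIM = 256                       # element count, from the PR description
REF_S_FORMAT = f"<{REF_S_DIM}f"       # "<" = little-endian, "f" = float32


def encode_ref_s(values: Sequence[float]) -> bytes:
    """Pack 256 floats into a little-endian float32 blob."""
    if len(values) != REF_S_DIM:
        raise ValueError(f"expected {REF_S_DIM} values, got {len(values)}")
    return struct.pack(REF_S_FORMAT, *values)


def decode_ref_s(blob: bytes) -> List[float]:
    """Unpack a blob, rejecting anything that is not exactly 256 float32 values."""
    expected = struct.calcsize(REF_S_FORMAT)  # 1024 bytes
    if len(blob) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(blob)}")
    return list(struct.unpack(REF_S_FORMAT, blob))
```

Checking the byte length before unpacking catches a wrong dtype early: a float16 or float64 dump has a different size and fails the check instead of decoding into noise.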

## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x` produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en
- [x] StyleTTS2 — full 100/100 minimax-en (verified after `sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue (verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error, LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on `initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`; cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`, `TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated — confirmed structural (250-token Flow input shape, 40 ms / token); surfaced via `finishedOnEos` + warning log + JSON `finished_on_eos` field. See [Decode budget cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site workaround. Validated on full minimax-cantonese: truncation **80/100 → 5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×. 16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3 auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in `TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50 **9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` + `--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`). Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8 macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04% — both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**. Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote ‡)
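
For readers who want to reason about the auto-chunker validated in the test plan, here is a simplified Python reading of its token-rate heuristic and split policy. It is a sketch under stated assumptions, not a port of `CosyVoice3TextChunker`: "CJK" is approximated by the CJK Unified Ideographs block, the punctuation sets contain only the unambiguous characters from the description, and "commit" is read as "close the current chunk right after this character". The rates (7.5 / 1.5 / 2.5 tokens per character), the 110-token budget, and the +30 overshoot come from the PR text.

```python
from typing import List

MAX_SPEECH_TOKENS = 110        # `defaultMaxSpeechTokens` from the PR text
FORCE_SPLIT_OVERSHOOT = 30     # force-split once the budget is exceeded by this much
HARD_ENDERS = set(".!?。\n")    # always close the chunk
SOFT_ENDERS = set(",、;: ")     # close the chunk only at or past the budget


def tokens_per_char(ch: str) -> float:
    """Estimated speech tokens for one character."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # assumption: CJK = Unified Ideographs block
        return 7.5
    if cp < 0x80:               # ASCII
        return 1.5
    return 2.5                  # everything else


def split_text(text: str) -> List[str]:
    """Greedy left-to-right split following the hard / soft / force rules."""
    chunks: List[str] = []
    current, tokens = "", 0.0
    for ch in text:
        current += ch
        tokens += tokens_per_char(ch)
        if (ch in HARD_ENDERS
                or (ch in SOFT_ENDERS and tokens >= MAX_SPEECH_TOKENS)
                or tokens >= MAX_SPEECH_TOKENS + FORCE_SPLIT_OVERSHOOT):
            chunks.append(current)
            current, tokens = "", 0.0
    if current:
        chunks.append(current)
    return chunks
```

With these rates, 15 CJK characters already estimate to 112.5 tokens, so a comma after the fifteenth character closes the chunk, while an unpunctuated run is only force-split at the nineteenth character (142.5 tokens).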
@@ -11,6 +11,7 @@ xcuserdata/
*.hmap

*.txt
!Benchmarks/**/*.txt

## App packaging
*.ipa
@@ -104,6 +105,10 @@ Resources/
scripts/
!Scripts/parakeet_subset_benchmark.sh
!Scripts/diarizer_subset_benchmark.sh

# MiniMax TTS corpus is CC-BY-SA-4.0 derivative content fetched on demand
# via `fluidaudio minimax-corpus`; only the README is checked in.
Benchmarks/tts/corpus/minimax/*.txt
Documentation/parakeet-tdt/
docs/parakeet-tdt/

@@ -0,0 +1,343 @@
# TTS Benchmarks

> **Setup:** MacBook Air M2 (2022), 16 GB, macOS 26, on AC.
> **Corpus:** [MiniMax Multilingual TTS Test Set][minimax] (100
> phrases / language, CC-BY-SA-4.0) — the same public corpus used
> by [MiniMax-Speech][mms], seed-tts-eval, and Gradium, so numbers
> here are directly paper-comparable.
> **Status:** Kokoro, Kokoro ANE, PocketTTS, Magpie, StyleTTS2 all
> complete the English run; CosyVoice3 completes the full Mandarin
> run.
>
> [minimax]: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
> [mms]: https://arxiv.org/abs/2505.07916

## Why not just RTFx?

RTFx (audio_seconds / synth_seconds) is a useful single number for batch
synthesis, but for conversational use it hides the things users actually
feel:

1. **Cold start** — first model load + ANE compile after install or
   reboot. On Apple Silicon the system's `anecompilerservice` can take
   tens of seconds on first invocation; subsequent loads finish in ~1 s.
2. **TTFT (time-to-first-audio)** — for streaming agents the question
   is "how long until the user hears *something*", not "how long until
   the whole utterance is rendered". For one-shot backends in this
   slice `ttft_ms == synth_ms`. **PocketTTS** and **Magpie** are
   wired through their respective streaming APIs (`synthesizeStreaming`
   / `synthesizeStream`), so their `ttft_ms` is honest first-frame
   latency.
3. **Per-stage compute units** — Kokoro ANE / Magpie are pipelines of
   6–7 graphs. Sometimes ANE is *slower per call* but more efficient.
   The "right" compute-unit choice differs per stage.
4. **Memory footprint** — drives whether a backend is mobile-viable.
5. **Quality** — RTFx alone tells you nothing about whether the model
   pronounced "Reykjavík" or "$1,234.56" correctly. We measure WER +
   CER via Parakeet roundtrip on a fixed English corpus; non-English
   backends run with `--skip-asr` for now.

## Methodology

### Corpus

All shipped corpora come from the **MiniMax Multilingual TTS Test
Set** (`MiniMaxAI/TTS-Multilingual-Test-Set` on Hugging Face,
CC-BY-SA-4.0). The fetched files land under
`Benchmarks/tts/corpus/minimax/<lang>.txt` (24 languages × 100 phrases
= 2400 phrases) and are gitignored — populate them on demand with
`swift run fluidaudio minimax-corpus`. Attribution, revision pin,
and WER caveats live in [`MinimaxCorpus.md`](MinimaxCorpus.md).

Reference each language as `--corpus minimax-<lang>`:

| Backend     | Default corpus     | Other supported MiniMax languages              |
|-------------|--------------------|------------------------------------------------|
| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) |
| PocketTTS   | `minimax-english`  | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` |
| StyleTTS2   | `minimax-english`  | `english` only (LibriTTS multi-speaker) |
| Magpie      | `minimax-english`  | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` |
| CosyVoice3  | `minimax-chinese`  | `chinese`, `cantonese` |

Lines beginning with `#` are comments. Custom corpora can still be
passed with `--corpus-path <file.txt>`.
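
A custom corpus file can be checked before a run with a few lines of Python. The sketch assumes one phrase per line and that blank lines are skipped; only the `#` comment rule is stated above, so the rest is an assumption. `load_corpus` is a hypothetical helper, not part of the harness.

```python
from typing import List


def load_corpus(text: str) -> List[str]:
    """Return the phrases in a corpus file, dropping `#` comments and blank lines."""
    phrases: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # comment or empty line
        phrases.append(stripped)
    return phrases
```

Reading the file with `open(path, encoding="utf-8").read()` and passing the result to `load_corpus` gives the phrase count to expect in the report.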

### Metrics

Per phrase:
- `ttft_ms` — time-to-first-audio. For one-shot backends this equals
  `synth_ms`. **PocketTTS** is benchmarked through
  `synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first
  80 ms audio frame (1920 samples @ 24 kHz). **Magpie** is benchmarked
  through `synthesizeStream`, so its `ttft_ms` is the first
  `MagpieAudioChunk` emit time (typically ~9.6 s on M2 vs ~15 s for
  full synth).
- `synth_ms` — total synth wall time.
- `audio_ms` — generated audio duration.
- `rtfx` — `audio_ms / synth_ms`.
- `wer`, `cer` — via Parakeet ASR roundtrip on the rendered WAV.
- `stage_ms` — per-stage breakdown (backend-specific keys; populated
  for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS /
  StyleTTS2 / CosyVoice3).
- Backend-specific extras: `encoder_tokens`, `acoustic_frames`,
  `chunk_count`, `frame_count`, `code_count`, `finished_on_eos`,
  `generated_token_count`, etc.

Aggregates:
- `cold_start_s` — `manager.initialize()` wall time. CosyVoice3 also
  includes voice-asset load.
- `first_synth_ms` — first synth call after init (still cold-ish).
- `ttft_ms_p50` / `ttft_ms_p95`.
- `warm_synth_ms_p50` / `warm_synth_ms_p95`.
- `agg_rtfx` — `Σ audio_ms / Σ synth_ms` across the corpus.
- `peak_rss_mb` — process-wide peak resident set, via
  `task_vm_info_data_t.resident_size_peak`.
- Per-category macro WER / CER.
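
The aggregate definitions can be made concrete with a short Python sketch. It takes per-phrase records keyed by the metric names above and computes `agg_rtfx` as a ratio of sums. The percentile uses the nearest-rank method, which is an assumption: the harness may interpolate differently. `percentile` and `aggregate` are hypothetical helpers.

```python
import math
from typing import Dict, List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile (assumed; the harness may interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(phrases: List[Dict[str, float]]) -> Dict[str, float]:
    """Corpus-level aggregates from per-phrase metrics."""
    total_audio = sum(p["audio_ms"] for p in phrases)
    total_synth = sum(p["synth_ms"] for p in phrases)
    ttfts = [p["ttft_ms"] for p in phrases]
    return {
        "agg_rtfx": total_audio / total_synth,  # ratio of sums, not a mean of ratios
        "ttft_ms_p50": percentile(ttfts, 50),
        "ttft_ms_p95": percentile(ttfts, 95),
    }
```

Because `agg_rtfx` divides total audio by total synth time, long phrases weigh more than short ones. Three phrases with per-phrase RTFx of 2.0, 0.25, and 1.6 average to about 1.28, yet aggregate to 1.1 when the slow phrase is also a long one.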

### Reproducibility

```bash
# From the package root.
swift run fluidaudio tts-benchmark \
  --backend kokoro-ane \
  --corpus minimax-english \
  --voice af_heart \
  --compute-units default \
  --output-json bench.json \
  --audio-dir bench-wavs/
```

The harness writes a JSON report to `--output-json` and (optionally)
keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR
roundtrip. The default ASR backend is `parakeet` for English-only
runs and is skipped for CosyVoice3; pass `--asr-backend cohere
--cohere-model-dir <dir>` to score Mandarin (or any of the 14
Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).

## Results

### Per-backend top-line

Reference machine: **MacBook Air, Apple M2 (2022), 8-core CPU /
8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26**
(`Mac14,2`, on AC). All English runs use `--compute-units default`,
voice = backend default
(`af_heart` for Kokoro, `alba` for PocketTTS, `John` for Magpie),
corpus = `minimax-english` (100 phrases), Parakeet TDT roundtrip for
WER / CER.

| Backend     | License     | Languages              | Footprint | Cold start | TTFT p50 / p95\* | Synth p50 / p95     | Agg RTFx | Peak RSS | WER     | CER     | Notes |
|-------------|-------------|------------------------|-----------|------------|---------------------|---------------------|----------|----------|---------|---------|-------|
| Kokoro ANE  | Apache-2.0  | en (af_heart only)     | ~330 MB   | 37.9 s     | 1586 / 2515 ms      | 1586 / 2515 ms      | 5.19×    | 738 MB   | 0.108   | 0.040   | one-shot; per-stage CU sweep, 7-graph pipeline |
| Kokoro      | Apache-2.0  | en (af_heart only)     | ~330 MB   | 92.2 s     | 3113 / 4696 ms      | 3113 / 4696 ms      | 2.02×    | 736 MB   | 0.013   | 0.005   | one-shot; cleanest English ASR roundtrip |
| PocketTTS   | research    | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | **1244 / 4749 ms** | 8757 / 19174 ms | 0.61× | 1503 MB | 0.014 | 0.006 | **streaming**; TTFT is first 80 ms audio frame |
| StyleTTS2   | MIT         | en (LibriTTS multi-spk) | ~280 MB  | 955 s§     | 6671 / 15990 ms§    | 6671 / 15990 ms§    | 2.72×§   | 963 MB§  | 0.440§  | 0.241§  | full 100/100 `minimax-english` via [misaki→espeak post-pass remap](#styletts2-misaki--espeak-post-pass-remap); ref_s = LibriTTS `696_92939_000016_000006.wav` (StyleTTS2 demo voice) |
| Magpie      | research    | en/es/de/fr/it/vi/zh/hi | ~1.3 GB  | 38.5 s∥    | **9580 / 23796 ms**∥ | 15080 / 29895 ms∥  | 0.64×∥   | 762 MB∥  | 0.056   | 0.033   | **streaming TTFT**: first audio chunk at 9.6 s p50 on M2 (full synth 15.1 s); split-K/V decoder; outputBackings fast path with latched fallback |
| CosyVoice3  | Apache-2.0  | zh (mandarin)          | ~1.5 GB   | 29.2 s†    | 14091 / 23679 ms†   | 14091 / 23679 ms†   | 0.357׆  | 3302 MB† | n/a‡    | 0.017‡  | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via [auto-chunker](#cosyvoice3-auto-chunker) but not benchmarked (no yue ASR) |

\* TTFT for **PocketTTS / Magpie** is first-frame emit through the
streaming API; the others are one-shot, so `ttft_ms == synth_ms`.

† CosyVoice3 chinese: 100/100, 0 errors, ASR skipped. Cold-start
dropped from 302.7 s to 29.2 s on the warm re-run.

‡ CosyVoice3 CER measured on the **full 100-phrase**
`minimax-chinese` corpus via `whisper-large-v3` (Python CPU FP32,
[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py)) on
the WAVs rendered by `tts-benchmark --backend cosyvoice3 --corpus
minimax-chinese --skip-asr --audio-dir <dir>`: **macro CER 1.68%
(0.0168)**, **micro CER 1.84% (0.0184)** across 100 phrases.
Whisper is the source of truth here because Cohere Transcribe q8
hit a `MILCompilerForANE` cache failure on this M2 host and ran on
the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple
hours for the full 100-phrase set vs. ~70 min for whisper). WER is
omitted because Mandarin has no word boundaries and `WERCalculator`
splits on whitespace, so word-level WER reads near 100% and is
meaningless.
|
||||
|
||||
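The macro / micro distinction in ‡ can be illustrated with a short sketch. This is not the logic of `Scripts/whisper_zh_cer.py`; it assumes plain character-level Levenshtein distance with no text normalization, and the names `editDistance` and `characterErrorRates` are hypothetical.

```swift
import Foundation

/// Levenshtein distance between two character sequences (two-row dynamic programming).
func editDistance(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    var current = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        current[0] = i
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        swap(&previous, &current)
    }
    return previous[b.count]
}

/// Macro CER averages the per-phrase rates; micro CER pools all edits over all reference characters.
func characterErrorRates(
    pairs: [(reference: String, hypothesis: String)]
) -> (macro: Double, micro: Double) {
    var perPhrase: [Double] = []
    var totalEdits = 0
    var totalChars = 0
    for pair in pairs {
        let ref = Array(pair.reference)
        let hyp = Array(pair.hypothesis)
        guard !ref.isEmpty else { continue }  // skip empty references
        let edits = editDistance(ref, hyp)
        perPhrase.append(Double(edits) / Double(ref.count))
        totalEdits += edits
        totalChars += ref.count
    }
    guard totalChars > 0 else { return (0, 0) }
    let macro = perPhrase.reduce(0, +) / Double(perPhrase.count)
    let micro = Double(totalEdits) / Double(totalChars)
    return (macro, micro)
}

let rates = characterErrorRates(pairs: [
    (reference: "你好世界", hypothesis: "你好世界"),            // 0 edits over 4 characters
    (reference: "今天天气很好啊对", hypothesis: "今天天汽很好啊"),  // 2 edits over 8 characters
])
print(rates.macro, rates.micro)
```

Because micro CER weights every character equally, long phrases with errors pull it further than they pull the macro figure, which is one way the two reported numbers can differ.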
∥ Magpie: streamed via `synthesizeStream`. TTFT (9.6 s p50) is
|
||||
first-chunk emit; synth (15.1 s p50) is full-utterance wall time —
|
||||
the 5.5 s gap is the streaming win.
|
||||
|
||||
§ StyleTTS2 (**beta** — `StyleTTS2Manager.initialize` emits a
|
||||
runtime warning): warm-cache run; first cold compile of the
|
||||
bucketed text_predictor / diffusion_step / decoder graphs is
|
||||
multi-second. ref_s dumped via
|
||||
[`06_dump_ref_s.py`](https://github.com/voicelink-ai/mobius-styletts2/blob/main/models/tts/styletts2/scripts/06_dump_ref_s.py).
|
||||
Read WER **relatively** per the
|
||||
[WER caveat](#about-the-wer--cer-numbers); StyleTTS2's own demo
|
||||
notebook reports artifacts on long sentences at default
|
||||
`alpha/beta/diffusion_steps`.
|
||||
|
||||
### Kokoro ANE — per-stage breakdown (default preset, MiniMax-English)
|
||||
|
||||
Means across 100 `minimax-english` phrases on M2. Stages map to the
|
||||
7-CoreML-graph split documented in [KokoroAne.md](KokoroAne.md). Vocoder
|
||||
+ noise together account for ~92% of synth time, which is the natural
|
||||
target for any further per-stage compute-unit re-tuning. The MiniMax
|
||||
mean is meaningfully higher than the prior Harvard-sentences run
|
||||
because phrases 81–100 are paragraph-length news / story sentences.
|
||||
|
||||
| Stage | Mean ms | % of total |
|---------------|---------|------------|
| `albert` | 28.2 | 2.0% |
| `post_albert` | 12.1 | 0.9% |
| `alignment` | 1.8 | 0.1% |
| `prosody` | 49.2 | 3.5% |
| `noise` | 242.6 | 17.4% |
| `vocoder` | 1039.8 | 74.4% |
| `tail` | 24.6 | 1.8% |
| **total** | 1398.4 | 100% |
|
||||
### Magpie — per-stage breakdown (default preset, MiniMax-English)
|
||||
|
||||
Means across 100 `minimax-english` phrases on M2 (`John` voice, en,
|
||||
default compute units), captured during the original one-shot
|
||||
profiling run. `ar_loop` is the umbrella for the per-step
|
||||
`decoder_step` + `sampler` (so it is not added on top in the total).
|
||||
`nanocodec` runs concurrently with the AR loop in chunked-streaming
|
||||
mode, which is why the per-stage means do not sum to total warm-synth
|
||||
mean. The AR loop dominates the wall clock, and its cost grows
|
||||
super-linearly with phrase length — long news / story phrases drive
|
||||
the long-tail p95.
|
||||
|
||||
| Stage | Mean ms |
|--------------------|---------|
| `text_encoder` | 91 |
| `prefill` | 281 |
| `ar_loop` | 17946 |
| └── `decoder_step` | 14840 |
| └── `sampler` | 3081 |
| `nanocodec` | 17948 |
|
||||
### About the WER / CER numbers
|
||||
|
||||
The MiniMax corpus mixes short conversational phrases, medium news
|
||||
headlines, and long narrative paragraphs. WER on the long tail is
|
||||
sensitive to the ASR + text-normalizer stack (e.g. `"3,5%"` →
|
||||
`"three point five percent"` vs. `"three and a half percent"`); per
|
||||
the [upstream community
|
||||
discussion](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10),
|
||||
absolute WER is best read **relatively** (backend A vs. backend B on
|
||||
the same corpus + same ASR + same normalizer) rather than against
|
||||
raw paper numbers.
|
||||
|
||||
## StyleTTS2 misaki → espeak post-pass remap
|
||||
|
||||
StyleTTS2's LibriTTS checkpoint was trained on **espeak-ng-phonemized**
|
||||
text, but the in-tree BART G2P (shared with Kokoro) emits **misaki**
|
||||
output. The 178-token vocab accepts both forms, but the acoustic
|
||||
embeddings for the misaki ligature glyphs are essentially untrained
|
||||
noise — every training utterance saw the espeak form.
|
||||
|
||||
Four systematic divergences vs. `espeak-ng -v en-us --ipa -q`:
|
||||
|
||||
| misaki | espeak-ng | example |
|--------|-----------|--------------------------|
| `ʧ` | `tʃ` | choice → `tʃˈɔɪs` |
| `ʤ` | `dʒ` | jumps → `dʒˈʌmps` |
| `ɜɹ` | `ɝ` | girl → `ɡˈɝl` |
| `əɹ` | `ɚ` | over → `ˈoʊvɚ` |
|
||||
Fix: 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish`. Result on `minimax-english`: WER 0.581 →
0.440, CER 0.476 → 0.241, agg-RTFx 2.36× → 2.72× (warm-cache
re-run, so latency / RSS deltas are noise; WER / CER are the real
signal). WER is still ~34× Kokoro's (0.440 vs. 0.013); remaining
errors cluster on word-level BART mispronunciations and long-tail
diffusion artifacts. Further gains likely need a richer remap layer
or swapping BART for libespeak-ng directly.
|
||||
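A minimal sketch of the four-rule remap described above, assuming plain substring replacement over the phoneme string. `remapMisakiToEspeak` is a hypothetical name; the shipped logic lives in `StyleTTS2Phonemizer.phonemize` and may differ in ordering or gating.

```swift
import Foundation

/// Hypothetical helper: rewrites the four misaki forms from the table
/// into their espeak-ng equivalents.
func remapMisakiToEspeak(_ phonemes: String) -> String {
    let rules: [(misaki: String, espeak: String)] = [
        ("ʧ", "tʃ"),
        ("ʤ", "dʒ"),
        ("ɜɹ", "ɝ"),
        ("əɹ", "ɚ"),
    ]
    var output = phonemes
    for rule in rules {
        output = output.replacingOccurrences(of: rule.misaki, with: rule.espeak)
    }
    return output
}

print(remapMisakiToEspeak("ʧˈɔɪs"))   // choice
print(remapMisakiToEspeak("ˈoʊvəɹ"))  // over
```

The four source patterns do not overlap, so the order in which the rules are applied does not change the result in this sketch.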
## CosyVoice3 Decode budget cap
|
||||
|
||||
CosyVoice3's Flow CFM was exported with a fixed input shape of
|
||||
`[1, 250]` speech tokens (`flowTotalTokens` in
|
||||
`CosyVoice3Constants.swift:45`). The LLM-Decode AR loop is allowed to
|
||||
emit up to `flowTotalTokens − N_prompt` tokens before being cut off
|
||||
(typically ~163 generated tokens after the speech-prompt portion).
|
||||
At `tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000`
|
||||
that's **40 ms of audio per generated token**, so the loop produces
|
||||
**at most ~6.5 s of speech per phrase**, regardless of how long the
|
||||
input text is.
|
||||
|
||||
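The arithmetic behind the cap can be checked with a few lines of Swift. The constants are the values quoted above, redeclared locally for illustration; `maxGeneratedSeconds` is a hypothetical helper, and the prompt length of 87 tokens is an assumption chosen to reproduce the ~163-token figure.

```swift
let flowTotalTokens = 250       // fixed Flow CFM input length quoted above
let tokenMelRatio = 2
let hiftSamplesPerFrame = 480
let sampleRate = 24_000

// Seconds of audio produced per generated speech token.
let secondsPerToken = Double(tokenMelRatio * hiftSamplesPerFrame) / Double(sampleRate)

/// Upper bound on generated audio for one synthesizer call, given the prompt length.
func maxGeneratedSeconds(promptTokens: Int) -> Double {
    let maxNew = max(0, flowTotalTokens - promptTokens)
    return Double(maxNew) * secondsPerToken
}

print(secondsPerToken)                        // 40 ms per token
print(maxGeneratedSeconds(promptTokens: 87))  // 163 tokens, roughly 6.5 s
```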
When the AR loop exits because it ran out of budget (i.e. no EOS
|
||||
token in `stopRange = 6_561…6_760`) instead of natural termination,
|
||||
`CosyVoice3Synthesizer` now:
|
||||
|
||||
1. Logs a `.warning` (one-shot per phrase) naming the
|
||||
`decoded.count / maxNew` budget and the produced audio duration.
|
||||
2. Sets `CosyVoice3SynthesisResult.finishedOnEos = false`, which the
|
||||
benchmark harness surfaces as the `finished_on_eos` field on each
|
||||
phrase in the JSON report.
|
||||
|
||||
Footprint on the Cantonese corpus (`minimax-cantonese`,
100 phrases) **without the chunker**: 80 / 100 phrases would hit the
cap, all producing exactly 163 generated tokens / ~6.5 s of audio.
The Mandarin corpus sees a much lower truncation rate because
MiniMax-zh phrases are shorter on average.
|
||||
The structural fix — re-exporting the Flow CFM from
|
||||
[`mobius-cosyvoice3`](https://github.com/voicelink-ai/mobius-cosyvoice3)
|
||||
with a larger fixed input shape (e.g. `[1, 500]`) — is upstream
|
||||
work; bumping the constant in Swift alone would make the Flow
|
||||
input/output shapes mismatch at predict time. The shipped workaround
|
||||
is the call-site [auto-chunker](#cosyvoice3-auto-chunker), which
|
||||
drops cantonese truncation from 80/100 → 5/100 by splitting long
|
||||
inputs at clause boundaries and crossfading the results.
|
||||
|
||||
Surfaced in
|
||||
`CosyVoice3Synthesizer.synthesize`
|
||||
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift`)
|
||||
and
|
||||
`CosyVoice3SynthesisResult.finishedOnEos`
|
||||
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift`).
|
||||
|
||||
## CosyVoice3 auto-chunker
|
||||
|
||||
Re-exporting Flow CFM with a larger fixed input shape is gated on
|
||||
upstream conversion work. Until that lands, `CosyVoice3TtsManager`
|
||||
splits long inputs at the call site, synthesizes each chunk
|
||||
independently, and merges with an 8 ms equal-power cosine crossfade.
|
||||
|
||||
**Splitter policy** (`CosyVoice3TextChunker`):
|
||||
|
||||
- **Hard enders** always commit: `.`, `!`, `?`, `。`, `！`, `？`,
  `\n`.
- **Soft enders** commit only when the running estimate is at or past
  the budget: `，`, `、`, `；`, `：`, `;`, `,`, ASCII space.
- **Force-split** once the estimate overshoots the budget by 30 tokens
  (`budget + 30`) and no natural boundary has appeared (rare; mostly
  continuous CJK with no punctuation).
|
||||
**Token-rate estimate** (calibrated against minimax-zh + minimax-yue
|
||||
runs):
|
||||
|
||||
| Char class | Tokens / char | Rationale |
|------------|---------------|--------------------------------------------------------------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
|
||||
`defaultMaxSpeechTokens` is **110**, which leaves margin under the
room that remains in the 250-token Flow cap after the typical
60–90-token speech-prompt context.
|
||||
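The splitter policy and token-rate table can be sketched as follows. This is an illustration under stated assumptions, not the `CosyVoice3TextChunker` source: the CJK code-point ranges, the full-width punctuation forms in the two sets, the per-character classification by first Unicode scalar, and the names `tokenRate`, `estimatedSpeechTokens`, and `chunkText` are all assumptions.

```swift
let hardEnders: Set<Character> = [".", "!", "?", "。", "！", "？", "\n"]
let softEnders: Set<Character> = ["，", "、", "；", "：", ";", ",", " "]

/// Per-character speech-token rate from the table above.
func tokenRate(_ character: Character) -> Double {
    guard let scalar = character.unicodeScalars.first else { return 0 }
    switch scalar.value {
    case 0x3400...0x4DBF, 0x4E00...0x9FFF:
        return 7.5   // assumed definition of "CJK": Unified Ideographs + Extension A
    case 0x00...0x7F:
        return 1.5   // ASCII
    default:
        return 2.5   // everything else
    }
}

func estimatedSpeechTokens(_ text: String) -> Double {
    text.reduce(0.0) { $0 + tokenRate($1) }
}

/// Greedy splitter: hard enders always commit, soft enders commit once the
/// estimate reaches the budget, and a forced split fires at budget + overshoot.
func chunkText(_ text: String, budget: Double = 110, overshoot: Double = 30) -> [String] {
    var chunks: [String] = []
    var current = ""
    var estimate = 0.0
    for character in text {
        current.append(character)
        estimate += tokenRate(character)
        let hard = hardEnders.contains(character)
        let soft = softEnders.contains(character) && estimate >= budget
        let forced = estimate >= budget + overshoot
        if hard || soft || forced {
            chunks.append(current)
            current = ""
            estimate = 0
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

print(estimatedSpeechTokens("你好"))        // two CJK characters at 7.5 each
print(chunkText("你好。今天天气很好。"))      // two sentences, two chunks
```

With these rates, a run of unpunctuated CJK text is force-split after 19 characters (19 × 7.5 = 142.5 ≥ 140), which matches the "rare" case described in the policy.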
**Concatenation**: 8 ms equal-power cosine crossfade at 24 kHz
|
||||
between adjacent chunks; single-chunk path short-circuits to plain
|
||||
copy.
|
||||
|
||||
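An equal-power cosine crossfade of this kind can be sketched as below. It assumes the two chunks overlap by the fade length (so the joined buffer is shorter than the sum of its parts by 192 samples at 24 kHz); whether the shipped implementation overlaps or pads is not stated here, and `crossfadeConcat` is a hypothetical name.

```swift
import Foundation

/// Joins two PCM buffers with an equal-power cosine crossfade over `fadeMs`.
/// The fade-out gain is cos(θ) and the fade-in gain is sin(θ), so the squared
/// gains always sum to 1.
func crossfadeConcat(
    _ a: [Float], _ b: [Float], sampleRate: Int = 24_000, fadeMs: Double = 8
) -> [Float] {
    let requested = Int((Double(sampleRate) * fadeMs / 1000).rounded())
    let n = min(requested, a.count, b.count)
    guard n > 0 else { return a + b }  // nothing to fade: plain copy

    var out = Array(a[0..<(a.count - n)])
    for i in 0..<n {
        let theta = (Double(i) + 0.5) / Double(n) * Double.pi / 2
        let fadeOut = Float(cos(theta))
        let fadeIn = Float(sin(theta))
        out.append(a[a.count - n + i] * fadeOut + b[i] * fadeIn)
    }
    out.append(contentsOf: b[n...])
    return out
}

let first = [Float](repeating: 1, count: 1000)
let second = [Float](repeating: 1, count: 1000)
let joined = crossfadeConcat(first, second)
print(joined.count)  // 1000 + 1000 - 192 overlapping samples
```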
**Validation** (full `minimax-cantonese`, 100 phrases, M2):
|
||||
|
||||
| Metric | Pre-chunker | Post-chunker | Δ |
|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |
| TTFT p95 | 41.2 s | 60.5 s | +47% |
| Peak RSS | 2016 MB | 3264 MB | +62% |
|
||||
The 5/100 residual is the long-tail token-rate worst case (some
Cantonese characters generate >9 speech tokens); raising the
per-CJK heuristic further would over-fragment short phrases.
The cleaner fix is the upstream Flow re-export.
|
||||
@@ -3,16 +3,19 @@
|
||||
Mandarin zero-shot voice cloning via Qwen2 LM + CFM Flow + HiFT vocoder,
|
||||
running on CoreML.
|
||||
|
||||
> ⚠️ **Beta / experimental.** End-to-end synthesis is currently slow on
|
||||
> Apple Silicon — RTFx < 1.0 typical, several seconds of latency for
|
||||
> short Mandarin utterances. The slowdown is partly the Flow CFM stage
|
||||
> (fp32, CPU-or-GPU only because fp16 + ANE produces NaNs through the
|
||||
> fused `layer_norm` — CoreMLTools limitation, tracked upstream) and
|
||||
> partly HiFT sinegen / windowing ops that fall back to CPU. May be a
|
||||
> model issue, may be recoverable through better conversion. Treat
|
||||
> performance numbers as preliminary; the Swift API, model layout, and
|
||||
> prompt-asset format may change in subsequent releases without
|
||||
> deprecation aliases.
|
||||
> ⚠️ **Beta / experimental.** End-to-end synthesis is below real-time
|
||||
> on Apple Silicon: agg-RTFx **0.357×** and p50 TTFT **~14.1 s** on
|
||||
> the full `minimax-chinese` 100-phrase corpus (M2, default compute
|
||||
> units), after the
|
||||
> [HiFT timeout fix](Benchmarks.md#cosyvoice3-hift-timeout-fix) and
|
||||
> [LLM-Decode `outputBackings` double-buffer](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix).
|
||||
> The slowdown is partly the Flow CFM stage (fp32, CPU-or-GPU only
|
||||
> because fp16 + ANE produces NaNs through the fused `layer_norm` —
|
||||
> CoreMLTools limitation, tracked upstream) and partly HiFT sinegen
|
||||
> / windowing ops that fall back to CPU. May be a model issue, may
|
||||
> be recoverable through better conversion. Treat performance numbers
|
||||
> as preliminary; the Swift API, model layout, and prompt-asset format
|
||||
> may change in subsequent releases without deprecation aliases.
|
||||
|
||||
## Files
|
||||
|
||||
@@ -105,8 +108,9 @@ let result = try await manager.synthesize(
|
||||
|
||||
| Field | Default | Notes |
|
||||
|---|---|---|
|
||||
| `maxNewTokens` | `nil` (cap = 1024) | Hard ceiling on speech-token count |
|
||||
| `maxNewTokens` | `nil` (= `flowTotalTokens − N_prompt`) | Soft ceiling on the LLM-Decode AR loop. The hard ceiling is the structural 250-token cap below — `maxNewTokens` only lets you generate fewer than that. |
|
||||
| `seed` | 42 | Drives the RAS sampler RNG; reproducible runs |
|
||||
| `disableAutoChunking` | `false` | When `true`, bypasses `CosyVoice3TextChunker` and runs a single synthesizer call regardless of input length. Use when you've pre-segmented input upstream (UI streaming, paragraph-at-a-time playback, etc.). The structural 250-token cap then applies and long inputs truncate mid-utterance. |
|
||||
|
||||
`CosyVoice3SynthesisResult`:
|
||||
|
||||
@@ -116,13 +120,92 @@ let result = try await manager.synthesize(
|
||||
| `sampleRate` | `Int` | always 24000 |
| `generatedTokenCount` | `Int` | tokens before EOS |
| `decodedTokens` | `[Int32]` | full speech token sequence (debug) |
| `finishedOnEos` | `Bool` | `true` = AR loop exited on an EOS token (natural termination); `false` = budget exhausted, audio truncated mid-utterance. See "Decode budget cap" below. |
|
||||
### Decode budget cap + auto-chunking
|
||||
|
||||
The Flow CFM model is exported with a fixed-shape `token_total` input of
|
||||
`[1, 250]` (`CosyVoice3Constants.flowTotalTokens = 250`). Each LLM-Decode
|
||||
token corresponds to **40 ms of audio** (`tokenMelRatio = 2 × hiftSamplesPerFrame = 480 / sampleRate = 24 000`),
|
||||
so the *generated* portion of a single synthesizer call is bounded by
`(250 − N_prompt) × 40 ms`. With a typical prompt of ~85–95 tokens,
this leaves ~155–165 generated tokens, i.e. ~6.2–6.6 s of audio per
call, so long Mandarin phrases would truncate mid-utterance if
synthesized in one shot.
|
||||
**`CosyVoice3TtsManager.synthesize(...)` auto-chunks long input** to
|
||||
sidestep this. Pipeline:
|
||||
|
||||
1. Run the existing Chinese normalizer (or skip it, per `prenormalized`).
|
||||
2. `CosyVoice3TextChunker.chunk(normalized)` greedily splits on hard
   sentence enders (`. ! ? 。 ！ ？`) and falls back to soft clause
   separators (`, ; ， ； 、 ：`) when sentences exceed the budget. The
   default budget is `defaultMaxSpeechTokens = 110` speech tokens
   (a ~45-token margin under the typical 155-token room for new
   tokens; the 30-token force-split overshoot may push committed
   chunks to ~140 estimated).
3. If the chunker returns one segment, take the fast path — single
|
||||
synthesizer call, no concat overhead.
|
||||
4. Otherwise loop, calling the synthesizer once per chunk, then merge
|
||||
results: PCM concatenated with an 8 ms cosine cross-fade at each
|
||||
boundary (masks DC/phase mismatch from independent synth calls);
|
||||
`generatedTokenCount`/`decodedTokens` summed/concatenated;
|
||||
`finishedOnEos` = AND across all chunks.
|
||||
|
||||
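The merge rule in step 4 can be sketched with a stand-in type. `ChunkResult` and `mergeChunks` are hypothetical and only mirror the fields named in the text; the PCM is joined by plain append here, so the 8 ms cross-fade is deliberately left out of this sketch.

```swift
/// Stand-in for the `CosyVoice3SynthesisResult` fields named in the text (hypothetical type).
struct ChunkResult {
    var samples: [Float]
    var generatedTokenCount: Int
    var decodedTokens: [Int32]
    var finishedOnEos: Bool
}

/// Folds per-chunk results into one: token counts are summed, token sequences
/// are concatenated, and `finishedOnEos` is true only if every chunk ended on EOS.
func mergeChunks(_ chunks: [ChunkResult]) -> ChunkResult {
    var merged = ChunkResult(
        samples: [], generatedTokenCount: 0, decodedTokens: [], finishedOnEos: true)
    for chunk in chunks {
        merged.samples.append(contentsOf: chunk.samples)  // plain append; no cross-fade here
        merged.generatedTokenCount += chunk.generatedTokenCount
        merged.decodedTokens.append(contentsOf: chunk.decodedTokens)
        merged.finishedOnEos = merged.finishedOnEos && chunk.finishedOnEos
    }
    return merged
}

let merged = mergeChunks([
    ChunkResult(samples: [0.1, 0.2], generatedTokenCount: 2, decodedTokens: [11, 12], finishedOnEos: true),
    ChunkResult(samples: [0.3], generatedTokenCount: 1, decodedTokens: [13], finishedOnEos: false),
])
print(merged.generatedTokenCount, merged.finishedOnEos)
```

Taking the AND means a single truncated chunk marks the whole utterance as truncated, which is the conservative reading for callers that check `finishedOnEos`.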
Tunables: `CosyVoice3TextChunker.defaultMaxSpeechTokens` (110) is the
|
||||
default budget; pass `disableAutoChunking: true` in
|
||||
`CosyVoice3SynthesisOptions` to bypass the chunker entirely and run a
|
||||
single call (useful for UI-driven sentence-at-a-time streaming where
|
||||
the caller already controls segmentation).
|
||||
|
||||
Token-rate estimate inside the chunker (calibrated against minimax-zh
|
||||
corpus runs — initial 5.5 figure was too optimistic and let ~16% of
|
||||
phrases hit the cap; 7.5 covers the worst-case observed real rate):
|
||||
|
||||
| Class | Tokens/char | Rationale |
|---|---|---|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | BPE compresses; English speaks faster than Mandarin per char |
| Other (Latin-1, etc.) | 2.5 | middle ground |
|
||||
Caveats:
|
||||
|
||||
- **Prosody discontinuity at boundaries.** Each chunk re-establishes the
|
||||
pitch contour from the prompt, so concatenated audio has audible breaks
|
||||
at chunk seams. The 8 ms cross-fade hides clicks/DC offsets but cannot
|
||||
reconstruct cross-sentence prosody.
|
||||
- **Per-chunk prefill cost.** Each segment pays the prefill cost
|
||||
separately, so total wall-clock for an N-chunk synth is roughly
|
||||
`N × prefill + Σ decode_per_chunk`. Single-chunk inputs are unaffected.
|
||||
- **Estimate slack.** The token-per-char heuristic is rough; if a chunk
|
||||
somehow exceeds the model's structural budget at runtime, the
|
||||
synthesizer still emits the `LLM-Decode budget exhausted` warning and
|
||||
returns `finishedOnEos: false` for that chunk.
|
||||
|
||||
Behavior of the underlying synthesizer when its budget is hit (still
|
||||
applies for `disableAutoChunking: true` or for one-shot mode):
|
||||
|
||||
- **AR loop exhausts `maxNew` without observing an EOS** in
|
||||
`CosyVoice3Constants.stopRange` (`6_561…6_760`).
|
||||
- `CosyVoice3Synthesizer` emits a `.warning`-level log:
|
||||
`"LLM-Decode budget exhausted: <N> generated tokens / <maxNew> cap (no EOS observed). Output truncated at ~<S>s of audio."`.
|
||||
- `result.finishedOnEos` is `false` so callers can detect it
|
||||
programmatically (the `tts-benchmark` harness surfaces this as a
|
||||
per-phrase `finished_on_eos` field in the JSON report).
|
||||
|
||||
Lifting the cap structurally (no auto-chunk, no prosody seams) requires
|
||||
re-exporting Flow with a larger `token_total` shape (e.g. `[1, 500]` for
|
||||
~16 s) — handled upstream in the `mobius-cosyvoice3` conversion pipeline;
|
||||
not changeable from the Swift host.
|
||||
|
||||
## Key State
|
||||
|
||||
### KV cache (`kv_cache[24, 1, 2, 768, 64]` fp16)
|
||||
- 24 transformer layers × `[K,V]` × heads × dim, packed into one `MLState`-style
|
||||
`MLMultiArray` that the prefill produces and the decode loop both reads
|
||||
and overwrites in-place.
|
||||
### KV cache (`kv_k` / `kv_v` each `[24, 1, 2, 768, 64]` fp32)
|
||||
- 24 transformer layers × `[K,V]` × heads × dim, split across two
|
||||
`MLMultiArray` outputs (`kv_k`, `kv_v`) that prefill produces and the
|
||||
decode loop carries forward across steps via
|
||||
`MLPredictionOptions.outputBackings` double-buffering.
|
||||
- No `MLState` dependency — runs on the package baseline (macOS 14 / iOS 17).
|
||||
- ~9 MB per array; pre-allocated front/back/spare buffers rotated each
|
||||
step (see [LLM-Decode `outputBackings` fix](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix)).
|
||||
- Reset per `synthesize()` call.
|
||||
|
||||
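The "~9 MB per array" figure follows directly from the shape and dtype. A quick check, assuming 4-byte fp32 elements and no padding:

```swift
let kvShape = [24, 1, 2, 768, 64]             // layers × batch × heads × positions × dim
let elementCount = kvShape.reduce(1, *)
let bytesPerArray = elementCount * MemoryLayout<Float>.size  // fp32 is 4 bytes
let mebibytesPerArray = Double(bytesPerArray) / Double(1 << 20)

print(elementCount)           // elements in kv_k (kv_v is the same)
print(mebibytesPerArray)      // size of one array in MiB
print(2 * mebibytesPerArray)  // kv_k + kv_v together
```

One `kv_k` / `kv_v` pair comes to about 18 MB, which is the per-step round-trip cost that pre-allocated output buffers are meant to avoid re-allocating.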
### Prompt assets (`CosyVoice3PromptAssets`)
|
||||
|
||||
@@ -148,7 +148,11 @@ timing (5 s of audio, M1):
|
||||
| Vocoder | ~120 ms |
|
||||
| Tail | ~50 ms |
|
||||
|
||||
Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx).
|
||||
Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx). For
|
||||
full-corpus numbers (warm-synth p50 / p95, peak RSS, WER) on the
|
||||
MiniMax-English 100-phrase suite — including the longer paragraph
|
||||
phrases that pull the per-corpus aggregate down to ~5.2× — see
|
||||
[Benchmarks.md](Benchmarks.md).
|
||||
|
||||
## Source
|
||||
|
||||
|
||||
@@ -5,16 +5,26 @@ Lives under `Sources/FluidAudio/TTS/Magpie/`.
|
||||
|
||||
## Status
|
||||
|
||||
Functional but **quite slow — needs significant perf work, not for real-time
|
||||
or latency-sensitive use.** First synth on a fresh process is dominated by
|
||||
CoreML model load + first-call ANE compile (~30 s); warm synths run at
|
||||
~96 s wall for an 8-word English sentence on M-series, i.e. RTFx ≈ **0.04**
|
||||
(~25× slower than realtime). Whether the throughput ceiling is a model
|
||||
characteristic, a CoreML conversion limitation, or both is still being
|
||||
investigated and is expected to improve in subsequent iterations. For
|
||||
real-time use prefer Kokoro (~20× RTFx) or PocketTTS (~1.5–2× RTFx);
|
||||
Magpie's value prop is multilingual coverage and the 5 built-in speaker
|
||||
contexts, not throughput.
|
||||
> ⚠️ **Beta / experimental.** Below real-time on Apple Silicon
|
||||
> (agg-RTFx ~0.41× on M2). Not for latency-sensitive use; prefer
|
||||
> Kokoro / Kokoro ANE or PocketTTS for real-time. Initializing
|
||||
> `MagpieTtsManager` logs a runtime beta warning at `.warning` level.
|
||||
|
||||
Functional but **below real-time — not for latency-sensitive use.**
|
||||
On the full `minimax-english` 100-phrase corpus (M2, default compute
|
||||
units), Magpie posts agg-RTFx **0.41×** with p50 warm synth ~19.8 s
|
||||
and p95 ~57.5 s — most of the long tail comes from paragraph-length
|
||||
news / story phrases (max 107 s on a single 18 s utterance). Cold
|
||||
start ~19 s on warm ANE caches, dominated by first-call decoder_step
|
||||
compile. The AR loop (`decoder_step` + sampler) dominates wall clock
|
||||
and grows super-linearly with phrase length; the
|
||||
[`outputBackings` fast path](Benchmarks.md#magpie-outputbackings-fast-path)
|
||||
already eliminated the per-step KV reallocation cost. Further gains
|
||||
likely need an MLX-backed LocalTransformer or a smaller-K/V variant.
|
||||
For real-time use prefer Kokoro / Kokoro ANE (2–5× RTFx) or PocketTTS
|
||||
(streaming, TTFT ~1.2 s); Magpie's value prop is multilingual coverage
|
||||
(en/es/de/fr/it/vi/zh/hi) and 5 built-in speaker contexts, not
|
||||
throughput.
|
||||
|
||||
Audio quality is perceptually clean across all 5 speakers and ASR-clean on
|
||||
4/5; speaker 0 has a single trailing-word artifact ("…and") attributable
|
||||
|
||||
@@ -0,0 +1,89 @@
|
||||
# MiniMax Multilingual TTS Test Set
|
||||
|
||||
The FluidAudio `tts-benchmark` corpus is sourced on demand from the
|
||||
[MiniMaxAI/TTS-Multilingual-Test-Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
|
||||
Hugging Face dataset and converted to the harness format (one phrase
|
||||
per non-empty, non-`#` line). The fetched `.txt` files land under
|
||||
`Benchmarks/tts/corpus/minimax/<lang>.txt`; they are gitignored — only
|
||||
this document is checked in.
|
||||
|
||||
| Field | Value |
|----------|-------|
| Source | https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set |
| Revision | `cb416f0ac3658da0577e97873065e19fe6488917` (initial public release) |
| License | [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) |
| Citation | MiniMax-Speech tech report — [arXiv 2505.07916](https://arxiv.org/pdf/2505.07916) |
| Languages | 24 (arabic, cantonese, chinese, czech, dutch, english, finnish, french, german, greek, hindi, indonesian, italian, japanese, korean, polish, portuguese, romanian, russian, spanish, thai, turkish, ukrainian, vietnamese) |
| Phrases | 100 per language (2400 total) |
|
||||
The fetched text files are derivative works of the upstream dataset
|
||||
and remain under **CC-BY-SA-4.0**. The rest of the FluidAudio
|
||||
repository is licensed separately (see top-level `LICENSE`); only the
|
||||
contents of `Benchmarks/tts/corpus/minimax/` are share-alike-bound to
|
||||
CC-BY-SA-4.0.
|
||||
|
||||
## Why this corpus?
|
||||
|
||||
MiniMax positions this as *"a public benchmark used in a number of
|
||||
recent TTS papers, which makes our numbers directly comparable to
|
||||
existing work"* (Gradium, MiniMax-Speech, seed-tts-eval, etc.).
|
||||
FluidAudio's `tts-benchmark` ships exclusively against this corpus
|
||||
so the resulting RTFx / WER numbers land on the same axis as
|
||||
published TTS work.
|
||||
|
||||
## Format conversion
|
||||
|
||||
Upstream lines have a `<cloning_audio_filename>|<text>` pipe-delimited
|
||||
shape because the dataset also ships per-speaker reference audio for
|
||||
zero-shot voice cloning. The FluidAudio harness only needs the text —
|
||||
voice selection is a per-backend concern (Kokoro / PocketTTS / Magpie /
|
||||
StyleTTS2 each have their own voice plumbing). The leading
|
||||
`<filename>|` is stripped at fetch time; if you need the cloning audio
|
||||
later, fetch it from the upstream HF repo's `audio/` directory.
|
||||
|
||||
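The conversion amounts to dropping everything up to the first pipe on each line. A minimal sketch, assuming UTF-8 text already loaded into memory; `harnessPhrases(fromUpstream:)` is a hypothetical name rather than the `minimax-corpus` implementation, and skipping empty and `#` lines follows the harness format described at the top of this document.

```swift
import Foundation

/// Converts upstream `<cloning_audio_filename>|<text>` lines into harness phrases.
func harnessPhrases(fromUpstream text: String) -> [String] {
    text.split(whereSeparator: \.isNewline).compactMap { rawLine -> String? in
        let line = rawLine.trimmingCharacters(in: .whitespaces)
        guard !line.isEmpty, !line.hasPrefix("#") else { return nil }
        // Split on the first pipe only, so pipes inside the phrase survive.
        guard let pipe = line.firstIndex(of: "|") else { return line }
        let phrase = line[line.index(after: pipe)...].trimmingCharacters(in: .whitespaces)
        return phrase.isEmpty ? nil : phrase
    }
}

let sample = "a.wav|Hello world\n\nb.wav|Second | phrase\n# comment\n"
print(harnessPhrases(fromUpstream: sample))
```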
## Fetching
|
||||
|
||||
The `fluidaudio minimax-corpus` CLI subcommand pins the upstream
|
||||
revision to the value above so re-runs are deterministic. From the
|
||||
package root:
|
||||
|
||||
```bash
# All 24 languages
swift run fluidaudio minimax-corpus

# Subset
swift run fluidaudio minimax-corpus --languages english,spanish,hindi

# Refresh against a newer release
swift run fluidaudio minimax-corpus --revision <commit-sha>
```
|
||||
Output lands in `Benchmarks/tts/corpus/minimax/<lang>.txt` (relative
|
||||
to the package root) by default; override with `--out-dir <path>`.
|
||||
Auth-gated revisions are honored via the standard `HF_TOKEN` /
|
||||
`HUGGING_FACE_HUB_TOKEN` env vars (same as every other HF asset pull
|
||||
in the project). Run `fluidaudio minimax-corpus --help` for the full
|
||||
flag list.
|
||||
|
||||
Per-backend ↔ language coverage and `tts-benchmark --corpus minimax-<lang>`
|
||||
usage live in [`Benchmarks.md`](Benchmarks.md#corpus).
|
||||
|
||||
## WER caveats
|
||||
|
||||
Per the [open community discussion on the upstream
|
||||
dataset](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10),
|
||||
WER on this corpus is sensitive to the ASR + text-normalization stack:
|
||||
|
||||
- Whisper-v3 (and similarly Parakeet) often need text normalization on
|
||||
the reference (`"32"` → `"thirty two"`) before comparing against the
|
||||
hypothesis to get a clean WER.
|
||||
- For non-Latin-script languages (Hindi, Japanese, Cantonese, etc.) the
|
||||
ASR may emit transliterated forms that don't match the reference
|
||||
script, inflating WER even when the synthesis is intelligible.
|
||||
- For non-word-segmented languages (Chinese, Japanese, Thai), CER is
|
||||
the more meaningful metric — `tts-benchmark` already reports both.
|
||||
|
||||
This means **MiniMax WER is best read relatively (FluidAudio backend
|
||||
A vs. backend B on the same corpus + same ASR), not absolutely**, and
|
||||
side-by-side comparison with published numbers requires matching the
|
||||
upstream ASR + normalizer choice.
|
||||
@@ -716,7 +716,7 @@ public enum ModelNames {
|
||||
/// expected local directory layout is encoded in `CosyVoice3Constants.Files`.
|
||||
public enum CosyVoice3 {
|
||||
public static let llmPrefill = "LLM-Prefill-T256-M768-fp16"
|
||||
public static let llmDecode = "LLM-Decode-M768-fp16-stateful"
|
||||
public static let llmDecode = "LLM-Decode-M768-fp16"
|
||||
public static let flow = "Flow-N250-fp16"
|
||||
public static let hift = "HiFT-T500-fp16"
|
||||
public static let speechEmbeddings = "speech_embedding-fp16.safetensors"
|
||||
|
||||
@@ -28,11 +28,12 @@ public actor CosyVoice3ModelStore {
|
||||
|
||||
/// - Parameters:
|
||||
/// - directory: Base build directory that contains
|
||||
/// `llm-fp16/`, `llm-fp16-stateful/`, `flow-fp16-n250/`,
|
||||
/// `llm-fp16/`, `llm-fp16-decode/`, `flow-fp16-n250/`,
|
||||
/// `hift-fp16-t500/`, `embeddings/`.
|
||||
/// - computeUnits: Defaults to `.cpuAndNeuralEngine`. Applied to
|
||||
/// LLM-Prefill + HiFT models only. LLM-Decode (stateful) and Flow
|
||||
/// both force `.cpuAndGPU` regardless (see `loadIfNeeded()`).
|
||||
/// LLM-Prefill only. LLM-Decode (stateless external cache),
|
||||
/// Flow, and HiFT all pin `.cpuAndGPU` regardless (see
|
||||
/// `loadIfNeeded()`).
|
||||
public init(directory: URL, computeUnits: MLComputeUnits = .cpuAndNeuralEngine) {
|
||||
self.directory = directory
|
||||
self.computeUnits = computeUnits
|
||||
@@ -67,10 +68,10 @@ public actor CosyVoice3ModelStore {
|
||||
let prefill = try await compileAndLoad(prefillURL, configuration: config)
|
||||
logger.info("Loaded \(CosyVoice3Constants.Files.llmPrefill)")
|
||||
|
||||
// Stateful decode MUST run on `.cpuAndGPU`:
|
||||
// - ANE refuses to compile the stateful graph (same failure mode
|
||||
// as Flow: `MILCompilerForANE ANECCompile() FAILED`), so
|
||||
// `.cpuAndNE` / `.all` deadlock load
|
||||
// Stateless decode MUST run on `.cpuAndGPU`:
|
||||
// - ANE refuses to compile the rotary + sliced SDPA decode graph
|
||||
// (same failure mode as Flow: `MILCompilerForANE ANECCompile()
|
||||
// FAILED`), so `.cpuAndNeuralEngine` / `.all` deadlock at load
|
||||
// - CPU-only works but is ~2× slower than the GPU path
|
||||
// Ignore the user-supplied `computeUnits` for decode.
|
||||
let decodeConfig = MLModelConfiguration()
|
||||
@@ -98,7 +99,25 @@ public actor CosyVoice3ModelStore {
|
||||
let flow = try await compileAndLoad(flowURL, configuration: flowConfig)
|
||||
logger.info("Loaded \(CosyVoice3Constants.Files.flow)")
|
||||
|
||||
let hift = try await compileAndLoad(hiftURL, configuration: config)
|
||||
// HiFT runs on `.cpuAndGPU` (fp16). With `.cpuAndNeuralEngine`
|
||||
// CoreML's planner placed most of HiFT on ANE but kept at least
|
||||
// one op (`HiFT-T500-fp16_main__Op104`) on the BNNS CPU path,
|
||||
// which trips a hard async-dispatch watchdog mid-corpus on
|
||||
// long phrases:
|
||||
//
|
||||
// E5RT: Submit Async failed for [3:29]: Async task:
|
||||
// HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out.
|
||||
// @ CancelTimedOutAsyncTask_block_invoke
|
||||
//
|
||||
// Pinning HiFT to `.cpuAndGPU` removes the ANE+BNNS mixed-compute
|
||||
// pathology (the same family of issue that already forced Flow
|
||||
// and Decode off ANE above). The model is fixed-shape
|
||||
// [1, 80, 500] so GPU placement is predictable. Trade-off: a
|
||||
// small per-call latency increase vs. ANE — acceptable, since
|
||||
// the prior ANE config didn't actually complete the corpus.
|
||||
let hiftConfig = MLModelConfiguration()
|
||||
hiftConfig.computeUnits = .cpuAndGPU
|
||||
let hift = try await compileAndLoad(hiftURL, configuration: hiftConfig)
|
||||
logger.info("Loaded \(CosyVoice3Constants.Files.hift)")
|
||||
|
||||
loadedModels = CosyVoice3Models(prefill: prefill, decode: decode, flow: flow, hift: hift)
|
||||
|
||||
@@ -4,7 +4,7 @@ import Foundation
|
||||
///
|
||||
/// Shipping config (frozen):
|
||||
/// - LLM-Prefill-T256-M768-fp16 (cpuAndNeuralEngine)
|
||||
/// - LLM-Decode-M768-fp16-stateful (cpuAndGPU — see note)
|
||||
/// - LLM-Decode-M768-fp16 (cpuAndGPU — see note)
|
||||
/// - Flow-N250-fp16 (cpuAndGPU — an ANE-port
|
||||
/// BC1S rewrite was attempted and reverted: the converted graph ran
|
||||
/// ~3× faster but numerically broken (mel dynamic range collapsed
|
||||
@@ -15,14 +15,22 @@ import Foundation
|
||||
/// `input_embed.conv_pos_embed` (`Conv1d(1024,1024,k=31)+Mish`)
|
||||
/// that three rewrite attempts couldn't move — ANEF rejects the
|
||||
/// conv footprint regardless of group count.)
|
||||
/// - HiFT-T500-fp16 (cpuAndNeuralEngine)
|
||||
/// - HiFT-T500-fp16 (cpuAndGPU — pinned off
|
||||
/// ANE because the `.cpuAndNeuralEngine` planner left at least one
|
||||
/// op on the BNNS CPU path, which tripped a hard async-dispatch
|
||||
/// watchdog mid-corpus on long phrases:
|
||||
/// `E5RT: Submit Async failed ... HiFT-T500-fp16_main__Op104_BnnsCpuInference
|
||||
/// has timed out`. GPU placement is deterministic and avoids the
|
||||
/// ANE+BNNS mixed-compute pathology.)
|
||||
///
|
||||
/// The stateful decode model uses per-layer `MLState` buffers for the
|
||||
/// KV cache (48 tensors, `[1, 2, 768, 64]` fp16 each) instead of
|
||||
/// round-tripping 18 MB of kv_k / kv_v MLMultiArrays every step. ANE
|
||||
/// refuses to compile the stateful graph (`MILCompilerForANE
|
||||
/// ANECCompile() FAILED`); decode therefore runs on `.cpuAndGPU`.
|
||||
/// Requires macOS 15 / iOS 18.
|
||||
/// Decode runs **stateless** with an external KV cache: prefill emits
|
||||
/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32, and decode
|
||||
/// accepts the same tensors as inputs and returns `kv_k_out` / `kv_v_out`
|
||||
/// at the same shape/dtype. The cache is round-tripped once per step
|
||||
/// (≈18 MB total). ANE still rejects this graph (`MILCompilerForANE
|
||||
/// ANECCompile() FAILED` on the rotary + sliced SDPA), so decode is
|
||||
/// pinned to `.cpuAndGPU`. The library floor is macOS 14 / iOS 17 — no
|
||||
/// MLState dependency.
|
||||
public enum CosyVoice3Constants {
|
||||
|
||||
// MARK: - LLM shapes
|
||||
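The "≈18 MB" round-trip figure in the comment above follows directly from the KV tensor shape. A minimal sketch of that arithmetic, assuming the `[24, 1, 2, 768, 64]` fp32 shape quoted in the comment; the constant names below are local stand-ins for illustration, not the library's `CosyVoice3Constants` members:

```swift
// Dimensions copied from the doc comment; names are illustrative only.
let numLayers = 24
let batch = 1
let kvHeads = 2
let kvMaxLength = 768
let headDim = 64
let bytesPerFloat32 = MemoryLayout<Float>.size  // 4 bytes

// Elements and bytes for one cache tensor (kv_k or kv_v).
let elementsPerTensor = numLayers * batch * kvHeads * kvMaxLength * headDim
let bytesPerTensor = elementsPerTensor * bytesPerFloat32

// Each decode step passes both kv_k and kv_v through the model.
let bytesPerStep = 2 * bytesPerTensor

func mib(_ bytes: Int) -> Double { Double(bytes) / (1024 * 1024) }

print("elements per tensor:", elementsPerTensor)
print("one tensor:", mib(bytesPerTensor), "MiB")
print("kv_k + kv_v:", mib(bytesPerStep), "MiB")
```

Each tensor comes out at 9 MiB, so the pair moved per step is 18 MiB, which matches the figure in the comment.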
@@ -66,8 +74,8 @@ public enum CosyVoice3Constants {
|
||||
public enum Files {
|
||||
public static let llmPrefill = "LLM-Prefill-T256-M768-fp16.mlpackage"
|
||||
public static let llmPrefillSubdir = "llm-fp16"
|
||||
public static let llmDecode = "LLM-Decode-M768-fp16-stateful.mlpackage"
|
||||
public static let llmDecodeSubdir = "llm-fp16-stateful"
|
||||
public static let llmDecode = "LLM-Decode-M768-fp16.mlpackage"
|
||||
public static let llmDecodeSubdir = "llm-fp16-decode"
|
||||
public static let flow = "Flow-N250-fp16.mlpackage"
|
||||
public static let flowSubdir = "flow-fp16-n250"
|
||||
public static let hift = "HiFT-T500-fp16.mlpackage"
|
||||
|
||||
@@ -38,11 +38,9 @@ import Foundation
|
||||
/// the 281 runtime-added special tokens (CosyVoice3Tokenizer). Same format
|
||||
/// that `tokenizer_fixture.json` dumps under its `special_tokens` key.
|
||||
///
|
||||
/// > Note: Gated to macOS 15 / iOS 18 because the underlying
|
||||
/// > `CosyVoice3Synthesizer` uses CoreML `MLState` for the decode KV cache.
|
||||
/// > Other FluidAudio modules (ASR, Diarization, VAD, Kokoro, PocketTTS)
|
||||
/// > remain available on macOS 14 / iOS 17.
|
||||
@available(macOS 15, iOS 18, *)
|
||||
/// > Available on the same floor as the rest of FluidAudio (macOS 14 /
|
||||
/// > iOS 17). Decode runs stateless with an external KV cache rather than
|
||||
/// > `MLState`, so no extra OS gate is required.
|
||||
public actor CosyVoice3TtsManager {
|
||||
|
||||
private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3TtsManager")
|
||||
@@ -216,9 +214,60 @@ public actor CosyVoice3TtsManager {
|
||||
normalized = CosyVoice3ChineseNormalizer.normalize(text)
|
||||
}
|
||||
|
||||
// Auto-chunk long input under the structural 250-token Flow cap.
|
||||
// The chunker greedily splits on hard sentence enders + soft clause
|
||||
// separators when the running speech-token estimate exceeds budget;
|
||||
// short inputs return a single chunk and take the fast path. Caller
|
||||
// can opt out via `options.disableAutoChunking` for pre-segmented
|
||||
// input (e.g. UI-driven streaming).
|
||||
let chunks: [String]
|
||||
if options.disableAutoChunking {
|
||||
chunks = [normalized]
|
||||
} else {
|
||||
let split = CosyVoice3TextChunker.chunk(normalized)
|
||||
chunks = split.isEmpty ? [normalized] : split
|
||||
}
|
||||
|
||||
if chunks.count == 1 {
|
||||
return try await synthesizeChunk(
|
||||
text: chunks[0], promptAssets: promptAssets,
|
||||
options: options, frontend: frontend, synthesizer: synthesizer)
|
||||
}
|
||||
|
||||
logger.info(
|
||||
"Auto-chunking long input into \(chunks.count) segments to fit "
|
||||
+ "the 250-token Flow cap (estimated speech tokens: "
|
||||
+ "\(CosyVoice3TextChunker.estimateSpeechTokens(normalized))).")
|
||||
var results: [CosyVoice3SynthesisResult] = []
|
||||
results.reserveCapacity(chunks.count)
|
||||
for (i, chunk) in chunks.enumerated() {
|
||||
logger.info(
|
||||
" chunk \(i + 1)/\(chunks.count): "
|
||||
+ "\(chunk.count) chars, ~"
|
||||
+ "\(CosyVoice3TextChunker.estimateSpeechTokens(chunk)) speech tokens")
|
||||
let r = try await synthesizeChunk(
|
||||
text: chunk, promptAssets: promptAssets,
|
||||
options: options, frontend: frontend, synthesizer: synthesizer)
|
||||
results.append(r)
|
||||
}
|
||||
return Self.mergeChunkedResults(results)
|
||||
}
|
||||
|
||||
// MARK: - Chunked synthesis helpers
|
||||
|
||||
/// Single-call synthesis path: tokenize/normalize-aware text → fixture
|
||||
/// adapter → synthesizer. Shared between the fast (1-chunk) and chunked
|
||||
/// (N-chunk) paths in `synthesize(...)`.
|
||||
private func synthesizeChunk(
|
||||
text: String,
|
||||
promptAssets: CosyVoice3PromptAssets,
|
||||
options: CosyVoice3SynthesisOptions,
|
||||
frontend: CosyVoice3TextFrontend,
|
||||
synthesizer: CosyVoice3Synthesizer
|
||||
) async throws -> CosyVoice3SynthesisResult {
|
||||
let assembled = try frontend.assemble(
|
||||
promptText: promptAssets.promptText,
|
||||
ttsText: normalized,
|
||||
ttsText: text,
|
||||
promptSpeechIds: promptAssets.promptSpeechIds)
|
||||
|
||||
let lmInputEmbedsFlat = try Self.flattenLmEmbeds(
|
||||
@@ -246,6 +295,72 @@ public actor CosyVoice3TtsManager {
|
||||
return try await synthesizer.synthesize(fixture: fixture, options: parityOptions)
|
||||
}
|
||||
|
||||
/// Concatenate per-chunk results into a single `CosyVoice3SynthesisResult`.
|
||||
/// Audio is stitched with a short cosine cross-fade (`crossfadeMs`) at
|
||||
/// each boundary to mask DC/phase mismatch from independent synth calls.
|
||||
/// `finishedOnEos` is `true` only when every chunk ended naturally
|
||||
/// (so callers can still detect mid-segment truncation downstream).
|
||||
private static func mergeChunkedResults(
|
||||
_ results: [CosyVoice3SynthesisResult],
|
||||
crossfadeMs: Double = 8
|
||||
) -> CosyVoice3SynthesisResult {
|
||||
precondition(!results.isEmpty, "mergeChunkedResults requires ≥1 result")
|
||||
let sampleRate = results[0].sampleRate
|
||||
let samples = concatWithCrossfade(
|
||||
results.map { $0.samples },
|
||||
sampleRate: sampleRate,
|
||||
fadeMs: crossfadeMs)
|
||||
let totalGenerated = results.reduce(0) { $0 + $1.generatedTokenCount }
|
||||
var allDecoded: [Int32] = []
|
||||
allDecoded.reserveCapacity(totalGenerated)
|
||||
for r in results { allDecoded.append(contentsOf: r.decodedTokens) }
|
||||
let allEos = results.allSatisfy { $0.finishedOnEos }
|
||||
return CosyVoice3SynthesisResult(
|
||||
samples: samples,
|
||||
sampleRate: sampleRate,
|
||||
generatedTokenCount: totalGenerated,
|
||||
decodedTokens: allDecoded,
|
||||
finishedOnEos: allEos)
|
||||
}
|
||||
|
||||
    /// Concatenate PCM chunks with a cosine cross-fade at each boundary.
    /// Fade window is the shortest of `fadeMs`, half the accumulated output,
    /// and half the next chunk, so very short chunks degrade gracefully (no
    /// overlap consuming the entire chunk).
    static func concatWithCrossfade(
        _ chunks: [[Float]],
        sampleRate: Int,
        fadeMs: Double
    ) -> [Float] {
        guard !chunks.isEmpty else { return [] }
        let nominalFade = max(0, Int((Double(sampleRate) * fadeMs / 1000).rounded()))
        var out: [Float] = chunks[0]
        for i in 1..<chunks.count {
            let next = chunks[i]
            if nominalFade == 0 || out.isEmpty || next.isEmpty {
                out.append(contentsOf: next)
                continue
            }
            let fade = min(nominalFade, out.count / 2, next.count / 2)
            if fade <= 0 {
                out.append(contentsOf: next)
                continue
            }
            // Raised-cosine (constant-gain) crossfade: out tail fades down,
            // next head fades up, and the two weights sum to 1 at every
            // sample; samples are summed in the overlap region. Length of
            // `out` after splice = old_len - fade + next.count.
            let outStart = out.count - fade
            for j in 0..<fade {
                let t = Float(j) / Float(fade)
                let down = 0.5 * (1 + cos(Float.pi * t))  // 1 → 0
                let up = 0.5 * (1 - cos(Float.pi * t))  // 0 → 1
                out[outStart + j] = out[outStart + j] * down + next[j] * up
            }
            out.append(contentsOf: next[fade..<next.count])
        }
        return out
    }
|
||||
|
||||
// MARK: - Helpers
|
||||
|
||||
/// Flatten `[1, tPre, 896]` MLMultiArray fp32 into `[tPre * 896]` Float,
|
||||
|
||||
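To make the splice in `concatWithCrossfade` concrete, here is a minimal two-chunk sketch. The helper `crossfade(_:_:fade:)` is a hypothetical simplification written for illustration (it is not part of FluidAudio); it uses the same weights as the loop above and assumes the 24 kHz sample rate and 8 ms fade mentioned in this change.

```swift
import Foundation

/// Hypothetical two-chunk version of the splice, for illustration only.
func crossfade(_ a: [Float], _ b: [Float], fade requested: Int) -> [Float] {
    // Clamp the window so it never consumes more than half of either side.
    let fade = min(requested, a.count / 2, b.count / 2)
    guard fade > 0 else { return a + b }
    var out = a
    let start = out.count - fade
    for j in 0..<fade {
        let t = Float(j) / Float(fade)
        let down = 0.5 * (1 + cos(Float.pi * t))  // weight on the outgoing tail
        let up = 0.5 * (1 - cos(Float.pi * t))  // weight on the incoming head
        out[start + j] = out[start + j] * down + b[j] * up
    }
    out.append(contentsOf: b[fade...])
    return out
}

let sampleRate = 24_000
let fade = Int((Double(sampleRate) * 8 / 1000).rounded())  // 8 ms window
let a = [Float](repeating: 1, count: 1_000)
let b = [Float](repeating: 1, count: 500)
let merged = crossfade(a, b, fade: fade)
print("fade samples:", fade, "merged length:", merged.count)
```

With two constant signals the merged output stays at the same level through the overlap, because the two weights sum to 1 at every sample. The merged length is the sum of both inputs minus one fade window.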
@@ -0,0 +1,145 @@
|
||||
import Foundation

/// Splits long input text into segments that each fit within CosyVoice3's
/// 250-token Flow input cap.
///
/// The Flow CFM model is exported with a fixed `[1, 250]` `token_total`
/// shape (`CosyVoice3Constants.flowTotalTokens`). After the prompt's speech
/// tokens consume ~85–95 slots (default voice), each `synthesize(...)`
/// call has room for roughly 155–165 new speech tokens of output (≈ 6.2–6.6 s
/// of audio at the 40 ms/token rate `tokenMelRatio × hiftSamplesPerFrame /
/// sampleRate = 2 × 480 / 24_000`). Long phrases truncate mid-utterance.
///
/// This chunker greedily packs input into segments under a target speech-
/// token budget, splitting preferentially on hard sentence enders
/// (`. ! ? 。 ! ? \n`) and falling back to soft clause separators
/// (`, ; , ; 、 :`) when sentences exceed the budget. Synthesis is run
/// per-chunk and audio is concatenated with a small cosine cross-fade at
/// boundaries (handled by the caller, not here).
///
/// **Token-rate estimate** (calibrated against minimax-zh corpus runs):
///   - CJK char        ≈ 7.5 speech tokens  (worst-case observed real rate;
///                                           5.5 was empirically too low and
///                                           let ~16% of phrases hit cap)
///   - ASCII char      ≈ 1.5 speech tokens  (BPE compresses; English is faster)
///   - Other (Latin-1) ≈ 2.5 speech tokens  (middle ground for accented Latin)
///
/// Default `maxSpeechTokens = 110` leaves a ~45-token safety margin under
/// the typical room-for-new of ~155. The 30-token force-split overshoot
/// can push a committed chunk to ~140 estimated, still comfortably under
/// the cap once the conservative 7.5-tokens/CJK-char heuristic is
/// reconciled with real generation rates. The synthesizer still emits
/// its `LLM-Decode budget exhausted` warning if a chunk somehow exceeds
/// the cap, so under-estimates are still surfaced at runtime.
public enum CosyVoice3TextChunker {

    /// Sentence-ending punctuation. Always commit the current chunk after
    /// these, regardless of running token count.
    private static let hardEnders: Set<Character> = [
        "。", "!", "?", ".", "!", "?", "\n",
    ]

    /// Clause-internal punctuation. Commit only when the running token
    /// count is at or above the budget; soft splits should be preferred
    /// over force-splits but not preferred over hard enders.
    private static let softEnders: Set<Character> = [
        ",", "、", ";", ":", ";", ",", " ",
    ]

    /// Default speech-token budget per chunk. Keeps a ~45-token margin
    /// under the typical room-for-new of ~155 (= `flowTotalTokens=250`
    /// minus a typical prompt of ~95 tokens). The 30-token force-split
    /// overshoot may push committed chunks to ~140 estimated, still under
    /// the structural cap.
    public static let defaultMaxSpeechTokens: Int = 110

    /// Split `text` into chunks each estimated to produce ≤
    /// `maxSpeechTokens` LLM speech tokens. Returns `[text]` (single
    /// chunk) when the input already fits. Returns `[]` when `text` is
    /// empty or whitespace-only.
    public static func chunk(
        _ text: String,
        maxSpeechTokens: Int = defaultMaxSpeechTokens
    ) -> [String] {
        let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !trimmed.isEmpty else { return [] }
        if estimateSpeechTokens(trimmed) <= maxSpeechTokens {
            return [trimmed]
        }

        var chunks: [String] = []
        var current = ""
        for ch in trimmed {
            current.append(ch)
            let tokensSoFar = estimateSpeechTokens(current)

            if hardEnders.contains(ch) {
                let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
                if !pruned.isEmpty { chunks.append(pruned) }
                current = ""
                continue
            }
            if tokensSoFar >= maxSpeechTokens && softEnders.contains(ch) {
                let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
                if !pruned.isEmpty { chunks.append(pruned) }
                current = ""
                continue
            }
            // Force-split if no punctuation has appeared within a 30-token
            // overshoot. Prefer the most recent whitespace; fall back to
            // hard-cut at the current position. Hard-cut on continuous CJK
            // (no whitespace) is rare in normalized input but can happen
            // when the normalizer collapses spaces.
            if tokensSoFar >= maxSpeechTokens + 30 {
                if let lastSpace = current.lastIndex(where: { $0 == " " }),
                    lastSpace != current.startIndex
                {
                    let head = String(current[..<lastSpace])
                        .trimmingCharacters(in: .whitespacesAndNewlines)
                    let tail = String(current[current.index(after: lastSpace)...])
                    if !head.isEmpty { chunks.append(head) }
                    current = tail
                } else {
                    let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
                    if !pruned.isEmpty { chunks.append(pruned) }
                    current = ""
                }
            }
        }
        let tail = current.trimmingCharacters(in: .whitespacesAndNewlines)
        if !tail.isEmpty { chunks.append(tail) }
        return chunks
    }

    /// Rough estimate of how many SPEECH tokens the LLM-Decode AR loop
    /// will produce for `s`. Used by `chunk(...)` to size segments under
    /// the structural Flow cap.
    public static func estimateSpeechTokens(_ s: String) -> Int {
        var total = 0.0
        for scalar in s.unicodeScalars {
            if isCJK(scalar) {
                total += 7.5
            } else if scalar.isASCII {
                total += 1.5
            } else {
                total += 2.5
            }
        }
        return Int(total.rounded())
    }

    private static func isCJK(_ scalar: Unicode.Scalar) -> Bool {
        let v = scalar.value
        // CJK Unified Ideographs (the bulk of zh/yue text)
        if (0x4E00...0x9FFF).contains(v) { return true }
        // CJK Unified Ideographs Extension A
        if (0x3400...0x4DBF).contains(v) { return true }
        // Hiragana
        if (0x3040...0x309F).contains(v) { return true }
        // Katakana
        if (0x30A0...0x30FF).contains(v) { return true }
        // Hangul Syllables
        if (0xAC00...0xD7AF).contains(v) { return true }
        return false
    }
}
|
||||
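The per-character rates above translate into a fairly small text budget per chunk. The sketch below re-implements the estimator in simplified form to show the arithmetic; `estimate(_:)` is a hypothetical stand-in that only recognizes the main CJK Unified Ideographs block, so it is not a drop-in replacement for `estimateSpeechTokens`.

```swift
// Simplified stand-in for the estimator: 7.5 tokens per CJK ideograph,
// 1.5 per ASCII scalar, 2.5 for everything else.
func estimate(_ s: String) -> Int {
    var total = 0.0
    for scalar in s.unicodeScalars {
        if (0x4E00...0x9FFF).contains(scalar.value) {
            total += 7.5
        } else if scalar.isASCII {
            total += 1.5
        } else {
            total += 2.5
        }
    }
    return Int(total.rounded())
}

let budget = 110  // same value as defaultMaxSpeechTokens
let fourteen = String(repeating: "中", count: 14)
let fifteen = String(repeating: "中", count: 15)

print("14 ideographs:", estimate(fourteen), "fits:", estimate(fourteen) <= budget)
print("15 ideographs:", estimate(fifteen), "fits:", estimate(fifteen) <= budget)
print("ASCII sample:", estimate("Hello, world."))
```

Under these rates a run of 14 ideographs stays inside the 110-token budget and the 15th pushes the estimate over it, so unpunctuated Chinese text longer than that goes through the split logic rather than the single-chunk fast path.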
+236
-135
@@ -7,11 +7,12 @@ import Foundation
|
||||
/// implemented as a method on this type, keeping the state (KV cache, running
|
||||
/// decoded list) local to a single synthesis call.
|
||||
///
|
||||
/// Decode uses CoreML `MLState` (macOS 15 / iOS 18): 48 per-layer buffers
|
||||
/// (`kv_k_0..kv_k_23`, `kv_v_0..kv_v_23`) replace the 18 MB kv_k / kv_v
|
||||
/// round-trip per step. Prefill remains non-stateful and its `kv_k` / `kv_v`
|
||||
/// outputs seed the decode state once after prefill.
|
||||
@available(macOS 15, iOS 18, *)
|
||||
/// Decode is **stateless** with an external KV cache. Prefill emits
|
||||
/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32; decode accepts those
|
||||
/// same tensors as inputs and returns updated `kv_k_out` / `kv_v_out` at
|
||||
/// the same shape/dtype. We round-trip the cache once per step (≈18 MB
|
||||
/// total) and bind the previous step's outputs as the next step's inputs.
|
||||
/// No `MLState` dependency — runs on macOS 14 / iOS 17.
|
||||
public actor CosyVoice3Synthesizer {
|
||||
|
||||
private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3Synthesizer")
|
||||
@@ -19,6 +20,18 @@ public actor CosyVoice3Synthesizer {
|
||||
private let models: CosyVoice3Models
|
||||
private let embeddings: CosyVoice3SpeechEmbeddings
|
||||
|
||||
/// Set to `false` once `LLM-Decode-M768-fp16` rejects pre-allocated
|
||||
/// `outputBackings` (model exported without explicit MultiArray
|
||||
/// shape/dtype constraints on its `kv_k_out` / `kv_v_out` /
|
||||
/// `speech_logits` outputs). Latched off so we don't throw + catch on
|
||||
/// every one of ~163 AR decode steps per phrase. Same pattern as
|
||||
/// `MagpieKvCache.useOutputBackings`.
|
||||
private var useOutputBackings: Bool = true
|
||||
|
||||
/// One-shot flag for "fast path engaged" log message; only emitted on
|
||||
/// the first successful `outputBackings` prediction so we don't spam.
|
||||
private var loggedFastPath: Bool = false
|
||||
|
||||
public init(models: CosyVoice3Models, embeddings: CosyVoice3SpeechEmbeddings) {
|
||||
self.models = models
|
||||
self.embeddings = embeddings
|
||||
@@ -46,16 +59,52 @@ public actor CosyVoice3Synthesizer {
|
||||
sampler.seedTokens(fixture.decodedTokens)
|
||||
}
|
||||
|
||||
// 1) Prefill (non-stateful: returns kv_k / kv_v as outputs)
|
||||
// 1) Prefill (returns kv_k / kv_v as fp32 outputs)
|
||||
let tPrefill = Date()
|
||||
let (prefillLogits, initialKvK, initialKvV) = try await runPrefill(fixture: fixture)
|
||||
let prefillSec = Date().timeIntervalSince(tPrefill)
|
||||
|
||||
// Seed decode MLState from prefill kv_k / kv_v.
|
||||
let tSeed = Date()
|
||||
let state = models.decode.makeState()
|
||||
try seedDecodeState(state: state, kvK: initialKvK, kvV: initialKvV)
|
||||
let seedSec = Date().timeIntervalSince(tSeed)
|
||||
// External KV cache with **double-buffered outputBackings**: prefill's
|
||||
// `kv_k` / `kv_v` (shape `[24, 1, 2, 768, 64]` fp32, ~9 MB each) feed
|
||||
// the first decode step. Subsequent steps rotate between two
|
||||
// pre-allocated buffer pairs (A/B) bound as the model's
|
||||
// `kv_k_out` / `kv_v_out` outputs. Same pattern as
|
||||
// `MagpieKvCache.swapBackings()` — eliminates ~36 MB of host
|
||||
// alloc/dealloc per decode step (×163 steps ≈ 5.9 GB churn per
|
||||
// phrase). `speech_logits` is also pre-bound so we avoid a fresh
|
||||
// 27 KB allocation each step. CoreML rejects this when the model
|
||||
// was exported without explicit MultiArray shape/dtype constraints
|
||||
// on its outputs; in that case we latch `useOutputBackings = false`
|
||||
// and fall back to per-step allocation for the rest of the run.
|
||||
let kvShape: [NSNumber] = [
|
||||
NSNumber(value: CosyVoice3Constants.numLayers),
|
||||
1,
|
||||
NSNumber(value: CosyVoice3Constants.kvHeads),
|
||||
NSNumber(value: CosyVoice3Constants.kvMaxLength),
|
||||
NSNumber(value: CosyVoice3Constants.headDim),
|
||||
]
|
||||
let kvKBackA = try MLMultiArray(shape: kvShape, dataType: .float32)
|
||||
let kvVBackA = try MLMultiArray(shape: kvShape, dataType: .float32)
|
||||
let kvKBackB = try MLMultiArray(shape: kvShape, dataType: .float32)
|
||||
let kvVBackB = try MLMultiArray(shape: kvShape, dataType: .float32)
|
||||
let logitsBacking = try MLMultiArray(
|
||||
shape: [1, 1, NSNumber(value: CosyVoice3Constants.speechVocab)],
|
||||
dataType: .float32)
|
||||
|
||||
        // Pointer-rotation triple. `frontKvK/V` are read by the current step;
        // `backKvK/V` receive that step's writes; `spareKvK/V` are the
        // pre-allocated set that becomes `back` after rotation. Initial
        // `front` is the prefill output. After decode step 1, `front` is A
        // (just written), `back` is B (next write target), and `spare` holds
        // the prefill output buffers. Their contents are single-use, so they
        // are simply overwritten once they rotate into `back` (decode step 3).
|
||||
var frontKvK: MLMultiArray = initialKvK
|
||||
var frontKvV: MLMultiArray = initialKvV
|
||||
var backKvK: MLMultiArray = kvKBackA
|
||||
var backKvV: MLMultiArray = kvVBackA
|
||||
var spareKvK: MLMultiArray = kvKBackB
|
||||
var spareKvV: MLMultiArray = kvVBackB
|
||||
|
||||
// Reusable per-step inputs for decode. `curLenArr` is mutated in place
|
||||
// each step; `inputsEmbedsArr` is overwritten by memcpy per step.
|
||||
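The buffer rotation described above is easier to follow with labels in place of real tensors. This is a minimal sketch, not the synthesizer's code: the strings "P", "A" and "B" stand in for the prefill output and the two pre-allocated backing pairs, and the rotation order is the one used by the decode loop in this change (`back` becomes `front`, `spare` becomes `back`, old `front` becomes `spare`).

```swift
// "P" = prefill output, "A" / "B" = pre-allocated backings (labels only).
var front = "P", back = "A", spare = "B"
var log: [String] = []

for step in 1...4 {
    // Each decode step reads the cache in `front` and writes into `back`.
    log.append("step \(step): read \(front), write \(back)")
    // Rotate: just-written buffer is read next; the old front is recycled.
    (front, back, spare) = (back, spare, front)
}

log.forEach { print($0) }
```

Tracing the labels shows a three-buffer cycle: the prefill buffer is read once on step 1, then becomes a write target on step 3, and no step ever reads and writes the same buffer.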
@@ -64,6 +113,12 @@ public actor CosyVoice3Synthesizer {
|
||||
shape: [1, 1, NSNumber(value: CosyVoice3Constants.embedDim)],
|
||||
dataType: .float32)
|
||||
|
||||
// Logits scratch reused across all decode steps. The hot loop
|
||||
// memcpy's into this from `logitsBacking` (or strided-gathers from a
|
||||
// freshly-allocated array on the slow path).
|
||||
var logitsScratch = [Float](
|
||||
repeating: 0, count: CosyVoice3Constants.speechVocab)
|
||||
|
||||
// First token from prefill tail logits.
|
||||
var decoded: [Int32] = []
|
||||
let firstLogits = sliceLastStepLogits(
|
||||
@@ -81,31 +136,82 @@ public actor CosyVoice3Synthesizer {
|
||||
}
|
||||
decoded.append(topId)
|
||||
|
||||
// 2) Decode loop
|
||||
// 2) Decode loop (stateless, external cache, double-buffered backings)
|
||||
var curLen = fixture.tPre
|
||||
var decodeSteps = 0
|
||||
var hitEos = false
|
||||
let tDecode = Date()
|
||||
for step in 1..<maxNew {
|
||||
try embeddings.copyEmbedding(tokenId: topId, into: inputsEmbedsArr)
|
||||
curLenArr[0] = NSNumber(value: Int32(curLen))
|
||||
let logits = try runDecodeStateful(
|
||||
try runDecode(
|
||||
inputsEmbeds: inputsEmbedsArr,
|
||||
curLen: curLenArr,
|
||||
state: state)
|
||||
topId = sampler.sample(logits: logits, decodedSoFar: decoded)
|
||||
frontKvK: frontKvK,
|
||||
frontKvV: frontKvV,
|
||||
backKvK: backKvK,
|
||||
backKvV: backKvV,
|
||||
logitsBacking: logitsBacking,
|
||||
logits: &logitsScratch)
|
||||
topId = sampler.sample(logits: logitsScratch, decodedSoFar: decoded)
|
||||
curLen += 1
|
||||
decodeSteps += 1
|
||||
if CosyVoice3Constants.stopRange.contains(topId) {
|
||||
logger.info("EOS at step \(step) (token=\(topId))")
|
||||
hitEos = true
|
||||
break
|
||||
}
|
||||
decoded.append(topId)
|
||||
|
||||
// Rotate buffers: `back` (just-written) becomes new `front`;
|
||||
// `spare` becomes new `back`; old `front` becomes new `spare`
|
||||
// (will be overwritten next step). On step 1 the old `front` is
|
||||
// the prefill output — drops to `spare` and gets overwritten on
|
||||
// step 3, which is harmless (we never read it again).
|
||||
let prevFrontK = frontKvK
|
||||
let prevFrontV = frontKvV
|
||||
frontKvK = backKvK
|
||||
frontKvV = backKvV
|
||||
backKvK = spareKvK
|
||||
backKvV = spareKvV
|
||||
spareKvK = prevFrontK
|
||||
spareKvV = prevFrontV
|
||||
}
|
||||
let decodeSec = Date().timeIntervalSince(tDecode)
|
||||
guard !decoded.isEmpty else {
|
||||
throw CosyVoice3Error.predictionFailed("LLM produced no speech tokens")
|
||||
}
|
||||
|
||||
// Truncation signal: AR loop exhausted its decode budget without
|
||||
// observing an EOS token in `stopRange` (6_561…6_760). The 250-token
|
||||
// cap is structural — it's the fixed `[1, 250]` shape of the Flow
|
||||
// model's `token_total` input (`CosyVoice3Constants.flowTotalTokens`),
|
||||
// not a synthesizer-side soft limit. With ~40 ms of audio per token
|
||||
// (`tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24_000`),
|
||||
// a prompt taking ~`nPrompt` tokens leaves `(250 - nPrompt) × 0.04 s`
|
||||
// of generated audio — i.e. long phrases truncate mid-utterance.
|
||||
//
|
||||
// Surface this as a `.warning` so callers running long input get a
|
||||
// console signal instead of silent truncation. Lifting the cap
|
||||
// requires re-exporting Flow with a larger `token_total` shape; for
|
||||
// now, splitting input at clause boundaries (, / 。) is the
|
||||
// workaround.
|
||||
if !hitEos {
|
||||
let producedSec =
|
||||
Double(decoded.count)
|
||||
* Double(CosyVoice3Constants.tokenMelRatio)
|
||||
* Double(CosyVoice3Constants.hiftSamplesPerFrame)
|
||||
/ Double(CosyVoice3Constants.sampleRate)
|
||||
logger.warning(
|
||||
"LLM-Decode budget exhausted: \(decoded.count) generated tokens "
|
||||
+ "/ \(maxNew) cap (no EOS observed). "
|
||||
+ "Output truncated at ~"
|
||||
+ String(format: "%.1f", producedSec)
|
||||
+ "s of audio. The 250-token Flow input is a structural cap; "
|
||||
+ "split long phrases at clause boundaries (, 。) to work around."
|
||||
)
|
||||
}
|
||||
|
||||
// 3) Flow
|
||||
let nNew = decoded.count
|
||||
let tFlow = Date()
|
||||
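The truncation comment above gives the audio budget as `(250 - nPrompt) × 0.04 s`. A minimal sketch of that arithmetic, using the values quoted in the comment; the constants are local copies for illustration and `maxGeneratedSeconds(promptTokens:)` is a hypothetical helper, not a library function:

```swift
// Values quoted in the comment; local copies, not CosyVoice3Constants.
let flowTotalTokens = 250
let tokenMelRatio = 2
let hiftSamplesPerFrame = 480
let sampleRate = 24_000

// Audio produced per speech token, in seconds.
let secondsPerToken = Double(tokenMelRatio * hiftSamplesPerFrame) / Double(sampleRate)

/// Upper bound on generated audio once the prompt has taken its slots.
func maxGeneratedSeconds(promptTokens: Int) -> Double {
    Double(flowTotalTokens - promptTokens) * secondsPerToken
}

print("seconds per token:", secondsPerToken)
print("prompt of 95 tokens:", maxGeneratedSeconds(promptTokens: 95), "s")
print("prompt of 85 tokens:", maxGeneratedSeconds(promptTokens: 85), "s")
```

For a prompt in the 85 to 95 token range this gives a ceiling of roughly 6.2 to 6.6 s of audio per call, which is why longer phrases need to be split before synthesis.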
@@ -133,14 +239,15 @@ public actor CosyVoice3Synthesizer {
|
||||
logger.info(
|
||||
String(
|
||||
format:
|
||||
"STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
|
||||
prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
|
||||
"STAGES prefill=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
|
||||
prefillSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
|
||||
|
||||
return CosyVoice3SynthesisResult(
|
||||
samples: audio,
|
||||
sampleRate: CosyVoice3Constants.sampleRate,
|
||||
generatedTokenCount: nNew,
|
||||
decodedTokens: decoded)
|
||||
decodedTokens: decoded,
|
||||
finishedOnEos: hitEos)
|
||||
}
|
||||
|
||||
// MARK: - Stages
|
||||
@@ -193,140 +300,134 @@ public actor CosyVoice3Synthesizer {
|
||||
return (logits, kvK, kvV)
|
||||
}
|
||||
|
||||
/// Run one stateful decode step. `state` is mutated in place via the
|
||||
/// 48 per-layer `kv_k_i` / `kv_v_i` state buffers registered in the
|
||||
/// converted model.
|
||||
private func runDecodeStateful(
|
||||
/// Run one stateless decode step with an external KV cache.
|
||||
///
|
||||
/// Inputs match the converted CoreML graph signature:
|
||||
/// - `inputs_embeds: fp32 [1, 1, 896]`
|
||||
/// - `cur_len: int32 [1]`
|
||||
/// - `kv_k: fp32 [24, 1, 2, 768, 64]` (previous step's `kv_k_out`, or
|
||||
/// prefill's `kv_k` for the first decode step)
|
||||
/// - `kv_v: fp32 [24, 1, 2, 768, 64]`
|
||||
///
|
||||
/// Outputs (when `outputBackings` is accepted, written into the pre-
|
||||
/// allocated `backKvK` / `backKvV` / `logitsBacking` buffers in place):
|
||||
/// - `speech_logits: fp32 [1, 1, 6761]`
|
||||
/// - `kv_k_out: fp32 [24, 1, 2, 768, 64]`
|
||||
/// - `kv_v_out: fp32 [24, 1, 2, 768, 64]`
|
||||
///
|
||||
/// Falls back to per-step CoreML allocation + memcpy into the pre-
|
||||
/// allocated backings if the model rejects `outputBackings` (latches
|
||||
/// `useOutputBackings = false` so we don't retry on every step).
|
||||
private func runDecode(
|
||||
inputsEmbeds: MLMultiArray,
|
||||
curLen: MLMultiArray,
|
||||
state: MLState
|
||||
) throws -> [Float] {
|
||||
frontKvK: MLMultiArray,
|
||||
frontKvV: MLMultiArray,
|
||||
backKvK: MLMultiArray,
|
||||
backKvV: MLMultiArray,
|
||||
logitsBacking: MLMultiArray,
|
||||
logits: inout [Float]
|
||||
) throws {
|
||||
let features: [String: Any] = [
|
||||
"inputs_embeds": inputsEmbeds,
|
||||
"cur_len": curLen,
|
||||
"kv_k": frontKvK,
|
||||
"kv_v": frontKvV,
|
||||
]
|
||||
let provider = try MLDictionaryFeatureProvider(dictionary: features)
|
||||
let output = try models.decode.prediction(from: provider, using: state)
|
||||
|
||||
guard
|
||||
let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue
|
||||
else {
|
||||
throw CosyVoice3Error.predictionFailed("decode: missing speech_logits")
|
||||
var fastPathSucceeded = false
|
||||
if useOutputBackings {
|
||||
let opts = MLPredictionOptions()
|
||||
opts.outputBackings = [
|
||||
"kv_k_out": backKvK,
|
||||
"kv_v_out": backKvV,
|
||||
"speech_logits": logitsBacking,
|
||||
]
|
||||
do {
|
||||
_ = try models.decode.prediction(from: provider, options: opts)
|
||||
Self.readLogits(from: logitsBacking, into: &logits)
|
||||
if !loggedFastPath {
|
||||
logger.info(
|
||||
"LLM-Decode outputBackings accepted; double-buffered "
|
||||
+ "AR loop active")
|
||||
loggedFastPath = true
|
||||
}
|
||||
fastPathSucceeded = true
|
||||
} catch {
|
||||
// CoreML refused our pre-allocated backings — typically
|
||||
// because `LLM-Decode-M768-fp16.mlpackage` was exported
|
||||
// without explicit MultiArray shape/dtype constraints on
|
||||
// its outputs. Latch the flag off so we don't throw + catch
|
||||
// on every one of ~163 steps for the rest of the corpus.
|
||||
// Warning level so it shows in release builds — this is a
|
||||
// perf regression worth surfacing to anyone running with a
|
||||
// re-exported model.
|
||||
useOutputBackings = false
|
||||
logger.warning(
|
||||
"LLM-Decode outputBackings rejected "
|
||||
+ "(\(error.localizedDescription)); switching to "
|
||||
+ "fresh-alloc fallback for the rest of the run")
|
||||
}
|
||||
}
|
||||
// logits shape = [1, 1, 6761] fp32; strides may be non-compact.
|
||||
|
||||
if !fastPathSucceeded {
|
||||
// Slow path: per-step CoreML allocation, then memcpy outputs
|
||||
// into the pre-allocated backings so the front/back rotation
|
||||
// protocol still works after this call.
|
||||
let output = try models.decode.prediction(from: provider)
|
||||
guard
|
||||
let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue,
|
||||
let kvKOutArr = output.featureValue(for: "kv_k_out")?.multiArrayValue,
|
||||
let kvVOutArr = output.featureValue(for: "kv_v_out")?.multiArrayValue
|
||||
else {
|
||||
throw CosyVoice3Error.predictionFailed(
|
||||
"decode: missing speech_logits / kv_k_out / kv_v_out")
|
||||
}
|
||||
try Self.copyKvOutput(kvKOutArr, into: backKvK, name: "kv_k_out")
|
||||
try Self.copyKvOutput(kvVOutArr, into: backKvV, name: "kv_v_out")
|
||||
Self.readLogits(from: logitsArr, into: &logits)
|
||||
}
|
||||
}
|
||||
|
||||
/// Read a `[1, 1, 6761]` fp32 logits MLMultiArray into `dst`. Honors the
|
||||
/// last-dim stride (CoreML may emit non-compact strides on aligned
|
||||
/// allocations) — uses `memcpy` when stride==1, strided gather otherwise.
|
||||
private static func readLogits(from arr: MLMultiArray, into dst: inout [Float]) {
|
||||
let count = CosyVoice3Constants.speechVocab
|
||||
var logits = [Float](repeating: 0, count: count)
|
||||
let strides = logitsArr.strides.map { $0.intValue }
|
||||
let strides = arr.strides.map { $0.intValue }
|
||||
let vocabStride = strides.last ?? 1
|
||||
let base = logitsArr.dataPointer.bindMemory(to: Float.self, capacity: logitsArr.count)
|
||||
for i in 0..<count { logits[i] = base[i * vocabStride] }
|
||||
return logits
|
||||
let base = arr.dataPointer.bindMemory(to: Float.self, capacity: arr.count)
|
||||
if vocabStride == 1 {
|
||||
dst.withUnsafeMutableBytes { rawDst in
|
||||
guard let dstPtr = rawDst.baseAddress else { return }
|
||||
memcpy(dstPtr, base, count * MemoryLayout<Float>.size)
|
||||
}
|
||||
} else {
|
||||
for i in 0..<count { dst[i] = base[i * vocabStride] }
|
||||
}
|
||||
}
|
||||
|
||||
/// Seed the 48 decode state buffers (`kv_k_0..kv_k_23`, `kv_v_0..kv_v_23`)
|
||||
/// from prefill's `kv_k` / `kv_v` outputs.
|
||||
///
|
||||
/// Prefill logical shape per cache is `[L=24, 1, Hkv=2, M=768, D=64]`
|
||||
/// fp16; each per-layer state buffer is `[1, 2, 768, 64]` fp16. Copy
|
||||
/// layer-by-layer using stride-aware indexing (prefill strides may not
|
||||
/// be compact), letting CoreML's state writer convert to the underlying
|
||||
/// fp16 storage.
|
||||
private func seedDecodeState(
|
||||
state: MLState,
|
||||
kvK: MLMultiArray,
|
||||
kvV: MLMultiArray
|
||||
/// Copy a CoreML-allocated `kv_k_out` / `kv_v_out` MLMultiArray into our
|
||||
/// pre-allocated backing array. Used on the `outputBackings`-rejected
|
||||
/// fallback path so the front/back rotation protocol stays consistent.
|
||||
private static func copyKvOutput(
|
||||
_ src: MLMultiArray,
|
||||
into dst: MLMultiArray,
|
||||
name: String
|
||||
) throws {
|
||||
// Prefill declares fp32 KV outputs at its CoreML I/O boundary
|
||||
// (even though the weights / activations internally are fp16).
|
||||
// Decode state buffers are fp16. Convert per-element as we copy.
|
||||
guard kvK.dataType == .float32 && kvV.dataType == .float32 else {
|
||||
guard src.dataType == dst.dataType else {
|
||||
throw CosyVoice3Error.predictionFailed(
|
||||
"seedDecodeState: expected fp32 KV from prefill (kv_k=\(kvK.dataType.rawValue) kv_v=\(kvV.dataType.rawValue))"
|
||||
)
|
||||
"decode \(name): dtype mismatch \(src.dataType.rawValue) vs \(dst.dataType.rawValue)")
|
||||
}
|
||||
|
||||
let L = CosyVoice3Constants.numLayers
|
||||
let H = CosyVoice3Constants.kvHeads
|
||||
let M = CosyVoice3Constants.kvMaxLength
|
||||
let D = CosyVoice3Constants.headDim
|
||||
|
||||
// Prefill output strides for shape [L, 1, H, M, D].
|
||||
let kStrides = kvK.strides.map { $0.intValue }
|
||||
let vStrides = kvV.strides.map { $0.intValue }
|
||||
let kLayerStride = kStrides[0]
|
||||
let kHStride = kStrides[2]
|
||||
let kMStride = kStrides[3]
|
||||
let kDStride = kStrides[4]
|
||||
let vLayerStride = vStrides[0]
|
||||
let vHStride = vStrides[2]
|
||||
let vMStride = vStrides[3]
|
||||
let vDStride = vStrides[4]
|
||||
|
||||
let kSrcPtr = kvK.dataPointer.bindMemory(to: Float.self, capacity: kvK.count)
|
||||
let vSrcPtr = kvV.dataPointer.bindMemory(to: Float.self, capacity: kvV.count)
|
||||
|
||||
// Collect dtype-mismatch errors from inside the non-throwing closures.
|
||||
var stateDtypeError: String?
|
||||
|
||||
for i in 0..<L {
|
||||
state.withMultiArray(for: "kv_k_\(i)") { buf in
|
||||
guard buf.dataType == .float16 else {
|
||||
if stateDtypeError == nil {
|
||||
stateDtypeError = "kv_k_\(i) expected fp16 state, got \(buf.dataType.rawValue)"
|
||||
}
|
||||
return
|
||||
}
|
||||
let b = buf.strides.map { $0.intValue }
|
||||
let dPtr = buf.dataPointer.bindMemory(to: Float16.self, capacity: buf.count)
|
||||
Self.copyLayerF32ToF16(
|
||||
src: kSrcPtr, srcLayerBase: i * kLayerStride,
|
||||
srcHStride: kHStride, srcMStride: kMStride, srcDStride: kDStride,
|
||||
dst: dPtr,
|
||||
dstHStride: b[1], dstMStride: b[2], dstDStride: b[3],
|
||||
H: H, M: M, D: D)
|
||||
}
|
||||
state.withMultiArray(for: "kv_v_\(i)") { buf in
|
||||
guard buf.dataType == .float16 else {
|
||||
if stateDtypeError == nil {
|
||||
stateDtypeError = "kv_v_\(i) expected fp16 state, got \(buf.dataType.rawValue)"
|
||||
}
|
||||
return
|
||||
}
|
||||
let b = buf.strides.map { $0.intValue }
|
||||
let dPtr = buf.dataPointer.bindMemory(to: Float16.self, capacity: buf.count)
|
||||
Self.copyLayerF32ToF16(
|
||||
src: vSrcPtr, srcLayerBase: i * vLayerStride,
|
||||
srcHStride: vHStride, srcMStride: vMStride, srcDStride: vDStride,
|
||||
dst: dPtr,
|
||||
dstHStride: b[1], dstMStride: b[2], dstDStride: b[3],
|
||||
H: H, M: M, D: D)
|
||||
}
|
||||
}
|
||||
|
||||
if let msg = stateDtypeError {
|
||||
throw CosyVoice3Error.predictionFailed("seedDecodeState: \(msg)")
|
||||
}
|
||||
}
|
||||
|
||||
/// Copy one `[H, M, D]` KV slab from a fp32 prefill output into a fp16
|
||||
/// decode state buffer. Strides may be non-compact on either side.
|
||||
private static func copyLayerF32ToF16(
|
||||
src: UnsafeMutablePointer<Float>,
|
||||
srcLayerBase: Int,
|
||||
srcHStride: Int, srcMStride: Int, srcDStride: Int,
|
||||
dst: UnsafeMutablePointer<Float16>,
|
||||
dstHStride: Int, dstMStride: Int, dstDStride: Int,
|
||||
H: Int, M: Int, D: Int
|
||||
) {
|
||||
for h in 0..<H {
|
||||
for m in 0..<M {
|
||||
for d in 0..<D {
|
||||
let sOff = srcLayerBase + h * srcHStride + m * srcMStride + d * srcDStride
|
||||
let dOff = h * dstHStride + m * dstMStride + d * dstDStride
|
||||
dst[dOff] = Float16(src[sOff])
|
||||
}
|
||||
}
|
||||
guard src.count == dst.count else {
|
||||
throw CosyVoice3Error.predictionFailed(
|
||||
"decode \(name): count mismatch \(src.count) vs \(dst.count)")
|
||||
}
|
||||
// KV outputs are fp32. With contiguous strides (the default for
|
||||
// freshly-allocated CoreML outputs in this graph) memcpy is safe.
|
||||
let bytes = src.count * MemoryLayout<Float>.size
|
||||
memcpy(dst.dataPointer, src.dataPointer, bytes)
|
||||
}
|
||||
|
||||
private func runFlow(
|
||||
|
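Review note: the stride-aware indexing used by `copyLayerF32ToF16` above can be illustrated without CoreML. The sketch below is a minimal stand-in that gathers one `[H, M, D]` slab from a padded (non-compact) flat buffer into a compact row-major array. `gatherSlab`, the tuple of strides, and the toy data are hypothetical names introduced for illustration; the fp32 to fp16 conversion is left out so the snippet runs on any platform.

```swift
import Foundation

/// Hypothetical helper: copy one logical [H, M, D] slab out of a strided
/// source buffer into a compact row-major destination.
func gatherSlab(
    src: [Float], srcBase: Int,
    srcStrides: (h: Int, m: Int, d: Int),
    H: Int, M: Int, D: Int
) -> [Float] {
    var dst = [Float](repeating: 0, count: H * M * D)
    for h in 0..<H {
        for m in 0..<M {
            for d in 0..<D {
                // Source offset honours whatever strides the producer chose.
                let sOff = srcBase + h * srcStrides.h + m * srcStrides.m + d * srcStrides.d
                // Destination is dense row-major.
                let dOff = (h * M + m) * D + d
                dst[dOff] = src[sOff]
            }
        }
    }
    return dst
}

// H = 1, M = 2, D = 2, but each source row is padded to 4 elements (-1 = padding).
let padded: [Float] = [1, 2, -1, -1, 3, 4, -1, -1]
let slab = gatherSlab(
    src: padded, srcBase: 0,
    srcStrides: (h: 8, m: 4, d: 1),
    H: 1, M: 2, D: 2)
print(slab)
```

A plain `memcpy` of the first four elements would have copied the padding values, which is why the layer copy indexes through the reported strides instead of assuming a compact layout.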
||||
@@ -10,6 +10,24 @@ public struct CosyVoice3SynthesisResult: Sendable {
|
||||
public let generatedTokenCount: Int
|
||||
/// Decoded speech token ids (useful for debugging + round-trip).
|
||||
public let decodedTokens: [Int32]
|
||||
/// `true` when the LLM-Decode AR loop ended on an EOS token in
|
||||
/// `CosyVoice3Constants.stopRange` (natural termination); `false` when
|
||||
/// the loop exhausted its decode budget (`flowTotalTokens - nPrompt`)
|
||||
/// without observing EOS — the audio is truncated mid-utterance.
|
||||
/// See the `.warning`-level log emitted from `CosyVoice3Synthesizer`
|
||||
/// when this is `false`.
|
||||
public let finishedOnEos: Bool
|
||||
|
||||
public init(
|
||||
samples: [Float], sampleRate: Int, generatedTokenCount: Int,
|
||||
decodedTokens: [Int32], finishedOnEos: Bool
|
||||
) {
|
||||
self.samples = samples
|
||||
self.sampleRate = sampleRate
|
||||
self.generatedTokenCount = generatedTokenCount
|
||||
self.decodedTokens = decodedTokens
|
||||
self.finishedOnEos = finishedOnEos
|
||||
}
|
||||
}
|
||||
|
||||
/// Options controlling a CosyVoice3 parity / synthesis call.
|
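Review note: the `finishedOnEos` contract documented above (EOS inside the stop range versus an exhausted decode budget) can be sketched as a small loop. This is an illustrative stand-in, not the synthesizer's actual loop: `DecodeOutcome`, `runDecodeLoop`, the `ClosedRange<Int32>` type for the stop range, and the closure-based token source are all assumptions.

```swift
import Foundation

/// Hypothetical result type mirroring the two fields discussed above.
struct DecodeOutcome {
    let tokens: [Int32]
    let finishedOnEos: Bool
}

/// Runs at most `budget` steps. Stops early, with `finishedOnEos == true`,
/// when the sampled token falls inside `stopRange`.
func runDecodeLoop(
    budget: Int,
    stopRange: ClosedRange<Int32>,
    nextToken: (Int) -> Int32
) -> DecodeOutcome {
    var tokens: [Int32] = []
    for step in 0..<max(0, budget) {
        let token = nextToken(step)
        if stopRange.contains(token) {
            // Natural termination: the stop token itself is not kept.
            return DecodeOutcome(tokens: tokens, finishedOnEos: true)
        }
        tokens.append(token)
    }
    // Budget exhausted without EOS: the audio would be truncated.
    return DecodeOutcome(tokens: tokens, finishedOnEos: false)
}

// Emits a stop token on step 2.
let natural = runDecodeLoop(budget: 5, stopRange: 100...110) { $0 == 2 ? 100 : Int32($0) }
// Never emits a stop token.
let truncated = runDecodeLoop(budget: 5, stopRange: 100...110) { Int32($0) }
print(natural.finishedOnEos, truncated.finishedOnEos)
```

A caller could use the flag to decide whether to retry with smaller chunks or to surface a truncation warning to the user.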
||||
@@ -42,9 +60,20 @@ public struct CosyVoice3SynthesisOptions: Sendable {
|
||||
public let maxNewTokens: Int?
|
||||
/// Sampler seed for the top-p/top-k + multinomial fallback path.
|
||||
public let seed: UInt64
|
||||
/// When `true`, skips `CosyVoice3TextChunker.chunk(...)` and runs a
|
||||
/// single synthesizer call regardless of input length. Useful for
|
||||
/// callers that pre-segment input themselves (e.g. UI-driven streaming
|
||||
/// per sentence). The structural 250-token Flow cap still applies and
|
||||
/// long inputs will truncate mid-utterance with a `.warning` log.
|
||||
public let disableAutoChunking: Bool
|
||||
|
||||
public init(maxNewTokens: Int? = nil, seed: UInt64 = 42) {
|
||||
public init(
|
||||
maxNewTokens: Int? = nil,
|
||||
seed: UInt64 = 42,
|
||||
disableAutoChunking: Bool = false
|
||||
) {
|
||||
self.maxNewTokens = maxNewTokens
|
||||
self.seed = seed
|
||||
self.disableAutoChunking = disableAutoChunking
|
||||
}
|
||||
}
|
||||
|
||||
@@ -42,6 +42,39 @@ public struct KokoroAneComputeUnits: Sendable, Equatable {
|
||||
prosody: .cpuAndGPU, noise: .cpuAndGPU, vocoder: .cpuAndGPU, tail: .cpuAndGPU
|
||||
)
|
||||
|
||||
/// Force every stage onto `.cpuAndNeuralEngine`. Stages that hit
|
||||
/// ANE-incompatible ops will fall back to CPU silently — included
|
||||
/// for the benchmark sweep (efficiency vs. latency comparison).
|
||||
public static let allAne = KokoroAneComputeUnits(
|
||||
albert: .cpuAndNeuralEngine, postAlbert: .cpuAndNeuralEngine,
|
||||
alignment: .cpuAndNeuralEngine, prosody: .cpuAndNeuralEngine,
|
||||
noise: .cpuAndNeuralEngine, vocoder: .cpuAndNeuralEngine,
|
||||
tail: .cpuAndNeuralEngine
|
||||
)
|
||||
|
||||
/// CPU-only (no ANE, no GPU). Slowest but most predictable; useful
|
||||
/// as a debugging / fallback baseline.
|
||||
public static let cpuOnly = KokoroAneComputeUnits(
|
||||
albert: .cpuOnly, postAlbert: .cpuOnly, alignment: .cpuOnly,
|
||||
prosody: .cpuOnly, noise: .cpuOnly, vocoder: .cpuOnly, tail: .cpuOnly
|
||||
)
|
||||
|
||||
/// Build a configuration from a generic preset (used by the
|
||||
/// `tts-benchmark` CLI so a single flag maps cleanly across
|
||||
/// backends).
|
||||
public init(preset: TtsComputeUnitPreset) {
|
||||
switch preset {
|
||||
case .default:
|
||||
self = .default
|
||||
case .allAne:
|
||||
self = .allAne
|
||||
case .cpuAndGpu:
|
||||
self = .cpuAndGpu
|
||||
case .cpuOnly:
|
||||
self = .cpuOnly
|
||||
}
|
||||
}
|
||||
|
||||
func units(for stage: KokoroAneStage) -> MLComputeUnits {
|
||||
switch stage {
|
||||
case .albert: return albert
|
||||
|
||||
@@ -75,6 +75,11 @@ public actor MagpieTtsManager {
|
||||
public func initialize() async throws {
|
||||
if synthesizer != nil { return }
|
||||
|
||||
logger.warning(
|
||||
"Magpie TTS is experimental / beta. Synthesis is below real-time "
|
||||
+ "(agg-RTFx ~0.41× on M2 for the MiniMax-English corpus) — "
|
||||
+ "see Documentation/TTS/Magpie.md.")
|
||||
|
||||
let store = MagpieModelStore(
|
||||
directory: directory,
|
||||
computeUnits: computeUnits,
|
||||
|
||||
@@ -54,6 +54,15 @@ public final class MagpieKvCache {
|
||||
public private(set) var cachesV: [MLMultiArray]
|
||||
public private(set) var positions: [MLMultiArray]
|
||||
|
||||
/// Set to `false` once `decoder_step.mlmodelc` rejects `outputBackings`
|
||||
/// (e.g. when the model was exported without explicit MultiArray
|
||||
/// shape/dtype constraints on its KV outputs). The rejection is a static
|
||||
/// property of the model, so once it fails we permanently skip the fast
|
||||
/// path and go straight to the fresh-alloc fallback to avoid throwing +
|
||||
/// catching an exception on every one of the ~500 AR decode steps per
|
||||
/// utterance.
|
||||
public var useOutputBackings: Bool = true
|
||||
|
||||
/// Back-buffer set for double-buffered AR loop. Used as `outputBackings` so
|
||||
/// CoreML writes new K/V/pos straight into our pre-allocated arrays instead
|
||||
/// of allocating ~18.9 MB of fresh fp16 buffers per step. After each
|
||||
|
||||
@@ -769,14 +769,73 @@ public actor MagpieSynthesizer {
|
||||
// step. The cache provides 24 K/V + 12 position back-buffers, the
|
||||
// synthesizer provides the 1 hidden buffer. After the call,
|
||||
// `swapBackings` promotes back→front for the next step's inputs.
|
||||
var backings: [String: Any] = [:]
|
||||
cache.addOutputBackings(to: &backings)
|
||||
backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking
|
||||
let predOpts = MLPredictionOptions()
|
||||
predOpts.outputBackings = backings
|
||||
//
|
||||
// If a previous step already proved that this model was exported
|
||||
// without explicit MultiArray shape/dtype constraints on its KV
|
||||
// outputs, `cache.useOutputBackings` is `false` and we skip the
|
||||
// fast path entirely. This avoids the per-step throw/catch overhead
|
||||
// and debug-log spam across the entire AR loop (~500 iterations).
|
||||
var fastPathSucceeded = false
|
||||
if cache.useOutputBackings {
|
||||
var backings: [String: Any] = [:]
|
||||
cache.addOutputBackings(to: &backings)
|
||||
backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking
|
||||
let predOpts = MLPredictionOptions()
|
||||
predOpts.outputBackings = backings
|
||||
|
||||
_ = try model.prediction(from: provider, options: predOpts)
|
||||
cache.swapBackings()
|
||||
do {
|
||||
_ = try model.prediction(from: provider, options: predOpts)
|
||||
cache.swapBackings()
|
||||
fastPathSucceeded = true
|
||||
} catch {
|
||||
// CoreML refused our pre-allocated outputBackings — typically
|
||||
// because `decoder_step.mlmodelc` was exported without
|
||||
// explicit MultiArray shape/dtype constraints on its KV
|
||||
// outputs, so the runtime can't validate the buffer layout
|
||||
// and bails with
|
||||
// "Output feature (null) doesn't support output backing
|
||||
// because it doesn't have a MultiArray constraints."
|
||||
// The rejection is a static property of the model, so latch
|
||||
// the cache flag off to skip the fast path on every
|
||||
// subsequent step (avoids ~500 throw/catch + log lines per
|
||||
// utterance).
|
||||
cache.useOutputBackings = false
|
||||
logger.debug(
|
||||
"decoder_step outputBackings rejected "
|
||||
+ "(\(error.localizedDescription)); switching to "
|
||||
+ "fresh-alloc fallback for the rest of the run")
|
||||
}
|
||||
}
|
||||
|
||||
if !fastPathSucceeded {
|
||||
// Slow path: re-run without `outputBackings`, route the
|
||||
// freshly-allocated K/V/pos through `MagpieKvCache.absorbOutputs`
|
||||
// (which replaces front pointers directly), and copy the hidden
|
||||
// state into `hiddenBacking` so the rest of this function works
|
||||
// unchanged. Costs ~18.9 MB of fresh fp16 allocation per step;
|
||||
// proper fix is to re-export `decoder_step.mlmodelc` with
|
||||
// shape/dtype constraints on `new_k_*`/`new_v_*`/`var_*`.
|
||||
let output = try model.prediction(from: provider)
|
||||
try cache.absorbOutputs(output)
|
||||
guard
|
||||
let hidden = output.featureValue(for: MagpieKvCache.decoderHiddenKey)?
|
||||
.multiArrayValue
|
||||
else {
|
||||
throw MagpieError.inferenceFailed(
|
||||
stage: "decoder_step",
|
||||
underlying:
|
||||
"missing hidden output key \(MagpieKvCache.decoderHiddenKey)")
|
||||
}
|
||||
guard hidden.dataType == .float16, hidden.count == hiddenBacking.count else {
|
||||
throw MagpieError.inferenceFailed(
|
||||
stage: "decoder_step",
|
||||
underlying:
|
||||
"decoder hidden mismatch (dtype=\(hidden.dataType.rawValue) "
|
||||
+ "count=\(hidden.count) expected=\(hiddenBacking.count))")
|
||||
}
|
||||
let bytes = hiddenBacking.count * MemoryLayout<UInt16>.size
|
||||
memcpy(hiddenBacking.dataPointer, hidden.dataPointer, bytes)
|
||||
}
|
||||
|
||||
// Hidden state lives in `hiddenBacking` after the call. Convert fp16
|
||||
// → fp32 via vImage into a fresh [Float] result buffer (the sampler
|
||||
|
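Review note: the latch introduced in this hunk (try the fast path, and on the first rejection disable it for the rest of the run) is a general pattern. The sketch below reproduces only the control flow with a stand-in closure; `StepRunner`, `BackingRejected`, and the counters are hypothetical and exist purely to make the behaviour observable.

```swift
import Foundation

/// Stand-in for the error CoreML raises when it rejects output backings.
struct BackingRejected: Error {}

final class StepRunner {
    private(set) var useFastPath = true
    private(set) var fastAttempts = 0
    private(set) var slowRuns = 0
    private let fastPath: () throws -> Void

    init(fastPath: @escaping () throws -> Void) {
        self.fastPath = fastPath
    }

    func step() {
        var fastPathSucceeded = false
        if useFastPath {
            fastAttempts += 1
            do {
                try fastPath()
                fastPathSucceeded = true
            } catch {
                // The rejection is treated as permanent: latch the flag off
                // so later steps skip straight to the fallback.
                useFastPath = false
            }
        }
        if !fastPathSucceeded {
            slowRuns += 1  // Fallback work would happen here.
        }
    }
}

let runner = StepRunner(fastPath: { throw BackingRejected() })
for _ in 0..<500 { runner.step() }
print(runner.fastAttempts, runner.slowRuns)
```

With a fast path that always throws, the throw/catch cost is paid once rather than on each of the 500 steps, which matches the motivation given in the comments above.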
||||
@@ -0,0 +1,72 @@
|
||||
@preconcurrency import CoreML
|
||||
import Foundation
|
||||
|
||||
/// Generic compute-unit preset shared across TTS backends.
|
||||
///
|
||||
/// Each backend keeps its own per-stage `<Backend>ComputeUnits` struct
|
||||
/// because stage names differ (Kokoro ANE has 7 stages, PocketTTS has 4
|
||||
/// CoreML models, StyleTTS2 has 4 models, etc.). This preset is the
|
||||
/// uniform knob the benchmarking harness flips so a single CLI flag
|
||||
/// (`--compute-units default|all-ane|cpu-and-gpu|cpu-only`) maps to a
|
||||
/// sensible per-stage assignment on every backend.
|
||||
///
|
||||
/// Backends opt in by adding `init(preset: TtsComputeUnitPreset)` to
|
||||
/// their compute-units struct (see `KokoroAneComputeUnits` for the
|
||||
/// reference implementation).
|
||||
public enum TtsComputeUnitPreset: String, Sendable, CaseIterable {
|
||||
|
||||
/// The backend's empirically-tuned default — typically a mix of
|
||||
/// ANE-friendly and CPU+GPU stages chosen by the conversion author.
|
||||
case `default`
|
||||
|
||||
/// Force every stage to `.cpuAndNeuralEngine`. Worst case for stages
|
||||
/// that fall back to CPU on ANE-incompatible ops, but the most
|
||||
/// energy-efficient when ops are ANE-clean.
|
||||
case allAne
|
||||
|
||||
/// Force every stage to `.cpuAndGPU`. Skips the ANE entirely;
|
||||
/// useful as a latency baseline when the ANE compile cache is cold
|
||||
/// (no `anecompilerservice` time on first call).
|
||||
case cpuAndGpu
|
||||
|
||||
/// Force every stage to `.cpuOnly`. Fallback / debugging baseline;
|
||||
/// every backend should at least run here, however slowly.
|
||||
case cpuOnly
|
||||
|
||||
/// Concrete `MLComputeUnits` for "force every stage to X" presets.
|
||||
/// Returns `nil` for `.default`, which means "let the backend keep
|
||||
/// its empirical mapping".
|
||||
public var uniformUnits: MLComputeUnits? {
|
||||
switch self {
|
||||
case .default: return nil
|
||||
case .allAne: return .cpuAndNeuralEngine
|
||||
case .cpuAndGpu: return .cpuAndGPU
|
||||
case .cpuOnly: return .cpuOnly
|
||||
}
|
||||
}
|
||||
|
||||
/// Parse the CLI flag value (`default`, `all-ane`, `cpu-and-gpu`,
|
||||
/// `cpu-only`). Returns `nil` for unrecognised values so callers
|
||||
/// can surface a usage error.
|
||||
public init?(cliValue: String) {
|
||||
switch cliValue.lowercased() {
|
||||
case "default": self = .default
|
||||
case "all-ane", "ane", "neural-engine": self = .allAne
|
||||
case "cpu-and-gpu", "cpuandgpu", "gpu": self = .cpuAndGpu
|
||||
case "cpu-only", "cpu", "cpuonly": self = .cpuOnly
|
||||
default: return nil
|
||||
}
|
||||
}
|
||||
|
||||
/// Canonical kebab-case form, matching the CLI flag values the
|
||||
/// `init?(cliValue:)` parser accepts. Use this for log lines and
|
||||
/// JSON reports so values round-trip back through the parser.
|
||||
public var cliValue: String {
|
||||
switch self {
|
||||
case .default: return "default"
|
||||
case .allAne: return "all-ane"
|
||||
case .cpuAndGpu: return "cpu-and-gpu"
|
||||
case .cpuOnly: return "cpu-only"
|
||||
}
|
||||
}
|
||||
}
|
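Review note: the doc comment on `cliValue` promises that values round-trip through `init?(cliValue:)`. The sketch below checks that property on a CoreML-free copy of the enum. `Preset` is a stand-in name that omits `uniformUnits`, and the accepted aliases are copied from the hunk above.

```swift
import Foundation

/// Stand-in for `TtsComputeUnitPreset` without the CoreML dependency.
enum Preset: String, CaseIterable {
    case `default`, allAne, cpuAndGpu, cpuOnly

    /// Parses the canonical kebab-case values plus the short aliases.
    init?(cliValue: String) {
        switch cliValue.lowercased() {
        case "default": self = .default
        case "all-ane", "ane", "neural-engine": self = .allAne
        case "cpu-and-gpu", "cpuandgpu", "gpu": self = .cpuAndGpu
        case "cpu-only", "cpu", "cpuonly": self = .cpuOnly
        default: return nil
        }
    }

    /// Canonical kebab-case form used for logs and reports.
    var cliValue: String {
        switch self {
        case .default: return "default"
        case .allAne: return "all-ane"
        case .cpuAndGpu: return "cpu-and-gpu"
        case .cpuOnly: return "cpu-only"
        }
    }
}

// Every case should survive a trip through its canonical string.
let roundTrips = Preset.allCases.allSatisfy { Preset(cliValue: $0.cliValue) == $0 }
print(roundTrips)
```

A unit test along these lines would catch a new case whose canonical string was added to `cliValue` but not to the parser.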
||||
@@ -94,4 +94,25 @@ public struct StyleTTS2Vocab: Sendable {
|
||||
}
|
||||
return ids
|
||||
}
|
||||
|
||||
/// Diagnostic encode: same logic as `encode(_:)` but also returns a
|
||||
/// frequency map of every scalar that was dropped because no
|
||||
/// vocab entry exists for it. Used by the StyleTTS2 CLI's
|
||||
/// `--tokenize-only` mode to quantify the misaki ↔ espeak inventory
|
||||
/// gap without actually invoking the diffusion pipeline.
|
||||
public func encodeWithReport(
|
||||
_ phonemes: String
|
||||
) -> (ids: [Int32], dropped: [Unicode.Scalar: Int]) {
|
||||
var ids: [Int32] = []
|
||||
ids.reserveCapacity(phonemes.unicodeScalars.count)
|
||||
var dropped: [Unicode.Scalar: Int] = [:]
|
||||
for scalar in phonemes.unicodeScalars {
|
||||
if let id = map[Character(scalar)] {
|
||||
ids.append(id)
|
||||
} else {
|
||||
dropped[scalar, default: 0] += 1
|
||||
}
|
||||
}
|
||||
return (ids, dropped)
|
||||
}
|
||||
}
|
||||
|
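Review note: a small standalone version of the diagnostic encode shows what the `dropped` map looks like in practice. The free function and the two-entry `toyVocab` are hypothetical. The real method reads the vocabulary from `StyleTTS2Vocab`, whose contents are not shown in this diff.

```swift
import Foundation

/// Standalone sketch of the diagnostic encode: returns the ids that mapped
/// and a per-scalar count of everything that had no vocabulary entry.
func encodeWithReport(
    _ phonemes: String,
    vocab: [Character: Int32]
) -> (ids: [Int32], dropped: [Unicode.Scalar: Int]) {
    var ids: [Int32] = []
    ids.reserveCapacity(phonemes.unicodeScalars.count)
    var dropped: [Unicode.Scalar: Int] = [:]
    for scalar in phonemes.unicodeScalars {
        if let id = vocab[Character(scalar)] {
            ids.append(id)
        } else {
            dropped[scalar, default: 0] += 1
        }
    }
    return (ids, dropped)
}

// Toy vocabulary: only "a" and "b" are known.
let toyVocab: [Character: Int32] = ["a": 1, "b": 2]
let report = encodeWithReport("abxax", vocab: toyVocab)
let unknownX: Unicode.Scalar = "x"

// Print the drops, most frequent first.
for (scalar, count) in report.dropped.sorted(by: { $0.value > $1.value }) {
    print("dropped \(scalar) x\(count)")
}
```

Sorting the map by count, as in the last loop, is one way a `--tokenize-only` style report could rank the largest inventory gaps first.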
||||
@@ -4,14 +4,30 @@ import Foundation
|
||||
///
|
||||
/// For English (`.americanEnglish`), uses the in-tree `G2PModel` (BART
|
||||
/// encoder-decoder, misaki-style IPA) and remaps the misaki conventions to
|
||||
/// the espeak-ng convention that StyleTTS2's LibriTTS checkpoint expects:
|
||||
/// the espeak-ng convention that StyleTTS2's LibriTTS checkpoint expects.
|
||||
///
|
||||
/// **Per-piece (single glyph) remap** — applied as misaki emits each piece:
|
||||
///
|
||||
/// misaki → espeak-ng
|
||||
/// A → eɪ I → aɪ O → oʊ W → aʊ Y → ɔɪ
|
||||
/// ᵊ → ə (tiny-schwa offglide; not in StyleTTS2's 178-vocab)
|
||||
///
|
||||
/// Other glyphs (`ʤ`, `ʧ`, `ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in
|
||||
/// the 178-token espeak-ng vocabulary and pass through.
|
||||
/// **Post-pass (multi-glyph) remap** — applied to the assembled phoneme
|
||||
/// string after every word has been emitted. Both the ligature and the
|
||||
/// decomposed forms exist as distinct tokens in the 178-vocab, but the
|
||||
/// LibriTTS checkpoint was trained against espeak-ng output, so the model's
|
||||
/// embeddings for the misaki ligature glyphs (`ʧ`, `ʤ`) are essentially
|
||||
/// untrained noise. Same story for the schwa+r digraphs that espeak collapses
|
||||
/// into single rhotic vowels (`ɝ`, `ɚ`):
|
||||
///
|
||||
/// misaki → espeak-ng word example
|
||||
/// ʧ → tʃ choice → tʃˈɔɪs
|
||||
/// ʤ → dʒ jumps → dʒˈʌmps
|
||||
/// ɜɹ → ɝ (U+025D) girl → ɡˈɝl
|
||||
/// əɹ → ɚ (U+025A) over → ˈoʊvɚ
|
||||
///
|
||||
/// Other glyphs (`ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in the
|
||||
/// 178-token espeak-ng vocabulary and pass through unchanged.
|
||||
///
|
||||
/// Non-English languages fall back to `MultilingualG2PModel` (CharsiuG2P
|
||||
/// ByT5). Output quality there is unvalidated — the LibriTTS checkpoint is
|
||||
@@ -46,6 +62,30 @@ public enum StyleTTS2Phonemizer {
|
||||
"ᵊ": "ə",
|
||||
]
|
||||
|
||||
/// Post-pass multi-glyph remap applied to the assembled phoneme string
|
||||
/// after all word pieces have been concatenated. Decomposes misaki's
|
||||
/// affricate ligatures and collapses the schwa+r digraphs into the
|
||||
/// single rhotic vowels espeak-ng emits — see the type-level docs for
|
||||
/// rationale. Order matters only insofar as `əɹ` and `ɜɹ` must be
|
||||
/// applied before any rule that would consume the trailing `ɹ` (none
|
||||
/// exist today; left ordered for future-proofing).
|
||||
private static let misakiToEspeakPostPass: [(String, String)] = [
|
||||
("ʧ", "tʃ"),
|
||||
("ʤ", "dʒ"),
|
||||
("ɜɹ", "ɝ"),
|
||||
("əɹ", "ɚ"),
|
||||
]
|
||||
|
||||
/// Apply `misakiToEspeakPostPass` rules to a phoneme string in order.
|
||||
/// Exposed `internal` for unit tests.
|
||||
internal static func applyEspeakPostPass(_ s: String) -> String {
|
||||
var out = s
|
||||
for (from, to) in misakiToEspeakPostPass {
|
||||
out = out.replacingOccurrences(of: from, with: to)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
/// Convert raw text to an IPA phoneme string for StyleTTS2.
|
||||
///
|
||||
/// - Parameters:
|
||||
@@ -87,6 +127,13 @@ public enum StyleTTS2Phonemizer {
|
||||
try await flushWord(&wordBuffer, language: language, into: &output)
|
||||
}
|
||||
|
||||
// Multi-glyph misaki → espeak normalization. Only meaningful for
|
||||
// English (the LibriTTS checkpoint is English-only); skipping for
|
||||
// other languages avoids touching CharsiuG2P output we don't have
|
||||
// a model contract for.
|
||||
if language == .americanEnglish {
|
||||
output = applyEspeakPostPass(output)
|
||||
}
|
||||
return output
|
||||
}
|
||||
|
||||
|
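Review note: the post-pass is an ordered list of literal replacements, so its effect is easy to check in isolation. The sketch below copies the four rules from the hunk above into a free function. `postPassRules` and `applyPostPass` are stand-in names, and the sample inputs are hand-written misaki-style strings rather than real G2P output.

```swift
import Foundation

/// The four multi-glyph rules from the diff, applied in order.
let postPassRules: [(String, String)] = [
    ("ʧ", "tʃ"),
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),
    ("əɹ", "ɚ"),
]

/// Applies every rule to the whole string, one rule at a time.
func applyPostPass(_ s: String) -> String {
    var out = s
    for (from, to) in postPassRules {
        out = out.replacingOccurrences(of: from, with: to)
    }
    return out
}

let choice = applyPostPass("ʧˈɔɪs")   // affricate ligature is decomposed
let over = applyPostPass("ˈoʊvəɹ")    // schwa + r collapses to a rhotic vowel
let plain = applyPostPass("hɛlˈoʊ")   // no rule matches
print(choice, over, plain)
```

Because no rule's output contains another rule's input, the order of the four rules does not change the result today, which is consistent with the "left ordered for future-proofing" remark in the doc comment.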
||||
@@ -239,6 +239,13 @@ public actor StyleTTS2Synthesizer {
|
||||
/// Slice an MLMultiArray of shape `(1, leading, trailing)` to the first
|
||||
/// `take` entries along either the leading or trailing axis. Returns a
|
||||
/// flat row-major `[Float]`.
|
||||
///
|
||||
/// Reads via `dataPointer` instead of `arr[idx].floatValue` and avoids
|
||||
/// `arr.strides` entirely — both trigger
|
||||
/// `E5RT: tensor_buffer has known strides while the model has
|
||||
/// FlexibleShapeInfo` on `text_predictor`'s flex-shape outputs. CoreML
|
||||
/// emits dense row-major buffers, so for shape `(1, leading, trailing)`
|
||||
/// the flat index is simply `r * trailing + c`.
|
||||
private func sliceFirstAxis2D(
|
||||
arr: MLMultiArray,
|
||||
leading: Int,
|
||||
@@ -246,29 +253,50 @@ public actor StyleTTS2Synthesizer {
|
||||
take: Int,
|
||||
sliceDim: SliceDim
|
||||
) -> [Float] {
|
||||
let strides = arr.strides.map { $0.intValue }
|
||||
let outCount: Int
|
||||
switch sliceDim {
|
||||
case .leading:
|
||||
// Result shape: (take, trailing).
|
||||
var out = [Float](repeating: 0, count: take * trailing)
|
||||
for r in 0..<take {
|
||||
for c in 0..<trailing {
|
||||
let idx = r * strides[1] + c * strides[2]
|
||||
out[r * trailing + c] = arr[idx].floatValue
|
||||
}
|
||||
}
|
||||
return out
|
||||
case .trailing:
|
||||
// Result shape: (leading, take).
|
||||
var out = [Float](repeating: 0, count: leading * take)
|
||||
for r in 0..<leading {
|
||||
for c in 0..<take {
|
||||
let idx = r * strides[1] + c * strides[2]
|
||||
out[r * take + c] = arr[idx].floatValue
|
||||
}
|
||||
}
|
||||
return out
|
||||
case .leading: outCount = take * trailing
|
||||
case .trailing: outCount = leading * take
|
||||
}
|
||||
var out = [Float](repeating: 0, count: outCount)
|
||||
|
||||
func fill(_ get: (Int) -> Float) {
|
||||
switch sliceDim {
|
||||
case .leading:
|
||||
// Result shape: (take, trailing).
|
||||
for r in 0..<take {
|
||||
for c in 0..<trailing {
|
||||
out[r * trailing + c] = get(r * trailing + c)
|
||||
}
|
||||
}
|
||||
case .trailing:
|
||||
// Result shape: (leading, take).
|
||||
for r in 0..<leading {
|
||||
for c in 0..<take {
|
||||
out[r * take + c] = get(r * trailing + c)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
let count = arr.count
|
||||
switch arr.dataType {
|
||||
case .float32:
|
||||
let p = arr.dataPointer.bindMemory(to: Float.self, capacity: count)
|
||||
fill { p[$0] }
|
||||
case .float16:
|
||||
let p = arr.dataPointer.bindMemory(to: Float16.self, capacity: count)
|
||||
fill { Float(p[$0]) }
|
||||
case .double:
|
||||
let p = arr.dataPointer.bindMemory(to: Double.self, capacity: count)
|
||||
fill { Float(p[$0]) }
|
||||
default:
|
||||
// Fallback re-introduces the FlexibleShapeInfo trip wire, but
|
||||
// we don't expect text_predictor to emit anything other than
|
||||
// fp16/fp32 (fp64 is handled above for completeness).
|
||||
fill { arr[$0].floatValue }
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
// MARK: - Durations
|
||||
|
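Review note: the rewritten `sliceFirstAxis2D` relies on the buffer being dense row-major, so the flat index for shape `(1, leading, trailing)` is `r * trailing + c`. The sketch below shows the same two slicing modes on a plain `[Float]`. `sliceDense` and its local `SliceDim` enum are hypothetical stand-ins that take an array instead of an `MLMultiArray`.

```swift
import Foundation

enum SliceDim { case leading, trailing }

/// Slice a dense row-major (leading x trailing) buffer to its first `take`
/// rows or first `take` columns. Returns a flat row-major result.
func sliceDense(
    _ flat: [Float], leading: Int, trailing: Int, take: Int, along dim: SliceDim
) -> [Float] {
    precondition(flat.count >= leading * trailing, "buffer smaller than shape")
    switch dim {
    case .leading:
        // Result shape (take, trailing): the first rows are already contiguous.
        precondition(take <= leading)
        return Array(flat[0..<(take * trailing)])
    case .trailing:
        // Result shape (leading, take): keep a prefix of every row.
        precondition(take <= trailing)
        var out: [Float] = []
        out.reserveCapacity(leading * take)
        for r in 0..<leading {
            let rowStart = r * trailing
            out.append(contentsOf: flat[rowStart..<(rowStart + take)])
        }
        return out
    }
}

// Shape (1, 2, 3): two rows of three values.
let grid: [Float] = [0, 1, 2, 10, 11, 12]
let firstRow = sliceDense(grid, leading: 2, trailing: 3, take: 1, along: .leading)
let firstTwoCols = sliceDense(grid, leading: 2, trailing: 3, take: 2, along: .trailing)
print(firstRow, firstTwoCols)
```

Note that the source index always uses `trailing` as the row pitch, even when only `take` columns are kept. The same applies to `get(r * trailing + c)` in the `.trailing` branch of the diff.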
||||
@@ -15,10 +15,6 @@ import Foundation
|
||||
/// - cumsum-of-durations → one-hot → matmul hard-alignment,
|
||||
/// - bucket selection (round token length → text_predictor; round
|
||||
/// mel frames → decoder).
|
||||
///
|
||||
/// **Status:** scaffold only. Synthesis is not yet implemented; calls to
|
||||
/// `synthesize` throw `processingFailed`. The asset bring-up (download +
|
||||
/// model store) is wired up so dependent layers can land incrementally.
|
||||
public actor StyleTTS2Manager {
|
||||
|
||||
private let logger = AppLogger(category: "StyleTTS2Manager")
|
||||
@@ -47,6 +43,10 @@ public actor StyleTTS2Manager {
|
||||
public func initialize(
|
||||
progressHandler: DownloadUtils.ProgressHandler? = nil
|
||||
) async throws {
|
||||
logger.warning(
|
||||
"StyleTTS2 is experimental / beta. WER on long English phrases is "
|
||||
+ "elevated on the MiniMax corpus (~44% vs Kokoro 1.3%) — see "
|
||||
+ "Documentation/TTS/Benchmarks.md.")
|
||||
_ = try await modelStore.ensureAssetsAvailable(progressHandler: progressHandler)
|
||||
let config = try await modelStore.bundleConfig()
|
||||
try config.validate()
|
||||
@@ -111,6 +111,34 @@ public actor StyleTTS2Manager {
|
||||
return try await synthesizer.synthesize(ids: ids, voice: voice, options: options)
|
||||
}
|
||||
|
||||
/// Same as `synthesize` but returns raw fp32 PCM samples + sample rate.
|
||||
/// Used by callers (e.g. the tts-benchmark harness, ASR pairing) that
|
||||
/// don't want the WAV-encoding round trip.
|
||||
public func synthesizeSamples(
|
||||
text: String,
|
||||
voiceStyleURL: URL,
|
||||
language: MultilingualG2PLanguage = .americanEnglish,
|
||||
diffusionSteps: Int = StyleTTS2Constants.defaultDiffusionSteps,
|
||||
alpha: Float = 0.3,
|
||||
beta: Float = 0.7,
|
||||
randomSeed: UInt64? = nil
|
||||
) async throws -> (samples: [Float], sampleRate: Int) {
|
||||
guard isInitialized else {
|
||||
throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized")
|
||||
}
|
||||
let voice = try StyleTTS2VoiceStyle.load(from: voiceStyleURL)
|
||||
let (_, ids) = try await tokenize(text: text, language: language)
|
||||
let options = StyleTTS2Synthesizer.Options(
|
||||
diffusionSteps: diffusionSteps,
|
||||
alpha: alpha,
|
||||
beta: beta,
|
||||
randomSeed: randomSeed
|
||||
)
|
||||
let samples = try await synthesizer.synthesizeSamples(
|
||||
ids: ids, voice: voice, options: options)
|
||||
return (samples, StyleTTS2Constants.audioSampleRate)
|
||||
}
|
||||
|
||||
/// Run the text frontend (preprocess → G2P → vocab encode) end-to-end.
|
||||
///
|
||||
/// Available before the diffusion synthesizer is wired so callers can
|
||||
@@ -138,6 +166,27 @@ public actor StyleTTS2Manager {
|
||||
return (phonemes, ids)
|
||||
}
|
||||
|
||||
/// Diagnostic tokenize: same as `tokenize(text:language:)` but also
|
||||
/// returns the per-scalar drop frequency from
|
||||
/// `StyleTTS2Vocab.encodeWithReport`. Used by the CLI to quantify
|
||||
/// how much of the misaki BART G2P output the espeak-ng-trained
|
||||
/// 178-token vocab can actually consume.
|
||||
public func tokenizeWithReport(
|
||||
text: String,
|
||||
language: MultilingualG2PLanguage = .americanEnglish
|
||||
) async throws -> (
|
||||
phonemes: String, ids: [Int32], dropped: [Unicode.Scalar: Int]
|
||||
) {
|
||||
guard isInitialized else {
|
||||
throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized")
|
||||
}
|
||||
let phonemes = try await StyleTTS2Phonemizer.phonemize(
|
||||
text: text, language: language)
|
||||
let vocab = try await modelStore.vocabulary()
|
||||
let (ids, dropped) = vocab.encodeWithReport(phonemes)
|
||||
return (phonemes, ids, dropped)
|
||||
}
|
||||
|
||||
public func cleanup() {
|
||||
isInitialized = false
|
||||
}
|
||||
|
||||
@@ -13,7 +13,6 @@ import Foundation
|
||||
/// --output .../build/swift_e2e.wav \
|
||||
/// --seed 42
|
||||
/// ```
|
||||
@available(macOS 15, iOS 18, *)
|
||||
enum CosyVoice3ParityCLI {
|
||||
|
||||
private static let logger = AppLogger(category: "CosyVoice3ParityCLI")
|
||||
|
||||
@@ -19,7 +19,6 @@ import Foundation
|
||||
/// --output .../build/swift_cv3_text.wav \
|
||||
/// --seed 42
|
||||
/// ```
|
||||
@available(macOS 15, iOS 18, *)
|
||||
enum CosyVoice3TextCLI {
|
||||
|
||||
private static let logger = AppLogger(category: "CosyVoice3TextCLI")
|
||||
|
||||
@@ -0,0 +1,234 @@
|
||||
#if os(macOS)
|
||||
import FluidAudio
|
||||
import Foundation
|
||||
|
||||
/// Swift port of `Scripts/fetch_minimax_tts_corpus.py`.
|
||||
///
|
||||
/// Fetches the MiniMax Multilingual TTS Test Set per-language `.txt` files
|
||||
/// from HuggingFace and converts them to the FluidAudio TTS-benchmark
|
||||
/// corpus format (strip `<cloning_audio_filename>|` prefix, prepend a
|
||||
/// header documenting source + revision + license).
|
||||
///
|
||||
/// Reuses `DownloadUtils.fetchHuggingFaceFile` so we get the same auth
|
||||
/// (HF_TOKEN env), retry, and backoff treatment as every other HF asset
|
||||
/// pull in the project — no hardcoded URLs, no swift-transformers
|
||||
/// dependency added just for one corpus fetch.
|
||||
///
|
||||
/// Source dataset: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
|
||||
/// License: CC-BY-SA-4.0
|
||||
public enum MinimaxCorpusCommand {
|
||||
|
||||
private static let logger = AppLogger(category: "MinimaxCorpusCommand")
|
||||
|
||||
private static let repo = "MiniMaxAI/TTS-Multilingual-Test-Set"
|
||||
|
||||
/// Pin to the initial public commit so re-runs reproduce the vendored
|
||||
/// files. Matches `DEFAULT_REVISION` in the Python script.
|
||||
private static let defaultRevision = "cb416f0ac3658da0577e97873065e19fe6488917"
|
||||
|
||||
/// All 24 languages in the upstream `text/` directory. Keep in sync with
|
||||
/// `ALL_LANGUAGES` in `Scripts/fetch_minimax_tts_corpus.py`.
|
||||
private static let allLanguages: [String] = [
|
||||
"arabic", "cantonese", "chinese", "czech", "dutch", "english",
|
||||
"finnish", "french", "german", "greek", "hindi", "indonesian",
|
||||
"italian", "japanese", "korean", "polish", "portuguese", "romanian",
|
||||
"russian", "spanish", "thai", "turkish", "ukrainian", "vietnamese",
|
||||
]
|
||||
|
||||
public static func run(arguments: [String]) async {
|
||||
var languages = allLanguages
|
||||
var revision = defaultRevision
|
||||
var outDir: URL? = nil
|
||||
|
||||
var i = 0
|
||||
while i < arguments.count {
|
||||
let arg = arguments[i]
|
||||
switch arg {
|
||||
case "--languages", "-l":
|
||||
if i + 1 < arguments.count {
|
||||
languages = arguments[i + 1]
|
||||
.split(separator: ",")
|
||||
.map { $0.trimmingCharacters(in: .whitespaces) }
|
||||
.filter { !$0.isEmpty }
|
||||
i += 1
|
||||
}
|
||||
case "--revision":
|
||||
if i + 1 < arguments.count {
|
||||
revision = arguments[i + 1]
|
||||
i += 1
|
||||
}
|
||||
case "--out-dir":
|
||||
if i + 1 < arguments.count {
|
||||
outDir = URL(fileURLWithPath: arguments[i + 1])
|
||||
i += 1
|
||||
}
|
||||
case "help", "--help", "-h":
|
||||
printUsage()
|
||||
return
|
||||
default:
|
||||
logger.error("Unknown argument: \(arg)")
|
||||
printUsage()
|
||||
exit(1)
|
||||
}
|
||||
i += 1
|
||||
}
|
||||
|
||||
let unknown = Set(languages).subtracting(allLanguages).sorted()
|
||||
if !unknown.isEmpty {
|
||||
logger.error("Unknown language(s): \(unknown.joined(separator: ", "))")
|
||||
logger.error("Available: \(allLanguages.joined(separator: ", "))")
|
||||
exit(2)
|
||||
}
|
||||
|
||||
let resolvedOutDir = outDir ?? defaultOutDir()
|
||||
|
||||
do {
|
||||
try FileManager.default.createDirectory(
|
||||
at: resolvedOutDir, withIntermediateDirectories: true)
|
||||
} catch {
|
||||
logger.error("Failed to create output directory: \(error.localizedDescription)")
|
||||
exit(1)
|
||||
}
|
||||
|
||||
logger.info("Fetching MiniMax TTS Multilingual Test Set @ \(revision)")
|
||||
logger.info(" out_dir: \(resolvedOutDir.path)")
|
||||
logger.info(" langs: \(languages.count)")
|
||||
|
||||
var total = 0
|
||||
for lang in languages {
|
||||
guard let url = URL(string: hfURL(repo: repo, revision: revision, path: "text/\(lang).txt"))
|
||||
else {
|
||||
logger.error("[\(lang)] failed to construct URL")
|
||||
exit(1)
|
||||
}
|
||||
do {
|
||||
let data = try await DownloadUtils.fetchHuggingFaceFile(
|
||||
from: url, description: "minimax TTS corpus (\(lang))")
|
||||
guard let raw = String(data: data, encoding: .utf8) else {
|
||||
logger.error("[\(lang)] response was not valid UTF-8")
|
||||
exit(1)
|
||||
}
|
||||
let phrases = convert(raw: raw)
|
||||
let outPath = try writeCorpus(
|
||||
lang: lang, phrases: phrases, outDir: resolvedOutDir,
|
||||
revision: revision)
|
||||
let countStr = String(format: "%3d", phrases.count)
|
||||
let relPath = relativePath(outPath, from: repoRoot())
|
||||
logger.info(" [\(lang)] \(countStr) phrases -> \(relPath)")
|
||||
total += phrases.count
|
||||
} catch {
|
||||
logger.error("[\(lang)] FAILED: \(error.localizedDescription)")
|
||||
exit(1)
|
||||
}
|
||||
}
|
||||
|
||||
logger.info("OK — \(total) phrases across \(languages.count) language(s).")
|
||||
}
|
||||
|
||||
// MARK: - Helpers
|
||||
|
||||
private static func hfURL(repo: String, revision: String, path: String) -> String {
|
||||
"https://huggingface.co/datasets/\(repo)/resolve/\(revision)/\(path)"
|
||||
}
|
||||
|
||||
/// Strip `<filename>|` prefix and return the list of trimmed phrases.
|
||||
/// Mirrors `convert()` in the Python script.
|
||||
private static func convert(raw: String) -> [String] {
|
||||
var out: [String] = []
|
||||
for rawLine in raw.split(separator: "\n", omittingEmptySubsequences: false) {
|
||||
let line = rawLine.trimmingCharacters(in: .whitespacesAndNewlines)
|
||||
if line.isEmpty { continue }
|
||||
// Format: "<cloning_audio_filename>|<text>". Some lines may have
|
||||
// extra `|` inside the text — keep only the first split.
|
||||
let text: String
|
||||
if let sepIdx = line.firstIndex(of: "|") {
|
||||
text = String(line[line.index(after: sepIdx)...])
|
||||
.trimmingCharacters(in: .whitespacesAndNewlines)
|
||||
} else {
|
||||
text = line
|
||||
}
|
||||
if !text.isEmpty {
|
||||
out.append(text)
|
||||
}
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
private static func writeCorpus(
|
||||
lang: String,
|
||||
phrases: [String],
|
||||
outDir: URL,
|
||||
revision: String
|
||||
) throws -> URL {
|
||||
let outPath = outDir.appendingPathComponent("\(lang).txt")
|
||||
let header: [String] = [
|
||||
"# MiniMax Multilingual TTS Test Set — \(lang)",
|
||||
"# Source: https://huggingface.co/datasets/\(repo)",
|
||||
"# Revision: \(revision)",
|
||||
"# License: CC-BY-SA-4.0 (Creative Commons Attribution-ShareAlike 4.0)",
|
||||
"# Phrases: \(phrases.count)",
|
||||
"#",
|
||||
"# Cloning-audio filenames have been stripped — we only need the",
|
||||
"# text for the FluidAudio TTS benchmark harness. Voice selection",
|
||||
"# is per-backend (see Documentation/TTS/MinimaxCorpus.md).",
|
||||
"",
|
||||
]
|
||||
let body = (header + phrases).joined(separator: "\n") + "\n"
|
||||
try body.write(to: outPath, atomically: true, encoding: .utf8)
|
||||
return outPath
|
||||
}
|
||||
|
||||
/// Default output directory: `<repo>/Benchmarks/tts/corpus/minimax/`, resolved
/// relative to the current working directory (where `swift run` is normally
/// invoked). A missing layout is not an error: the caller creates the directory
/// with `createDirectory(withIntermediateDirectories: true)` before writing.
|
||||
private static func defaultOutDir() -> URL {
|
||||
repoRoot()
|
||||
.appendingPathComponent("Benchmarks", isDirectory: true)
|
||||
.appendingPathComponent("tts", isDirectory: true)
|
||||
.appendingPathComponent("corpus", isDirectory: true)
|
||||
.appendingPathComponent("minimax", isDirectory: true)
|
||||
}
|
||||
|
||||
private static func repoRoot() -> URL {
|
||||
URL(fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
|
||||
}
|
||||
|
||||
private static func relativePath(_ url: URL, from base: URL) -> String {
|
||||
let path = url.standardizedFileURL.path
|
||||
let basePath = base.standardizedFileURL.path
|
||||
if path.hasPrefix(basePath + "/") {
|
||||
return String(path.dropFirst(basePath.count + 1))
|
||||
}
|
||||
return path
|
||||
}
|
||||
|
||||
private static func printUsage() {
|
||||
logger.info(
|
||||
"""
|
||||
Usage: fluidaudio minimax-corpus [options]
|
||||
|
||||
Fetches the MiniMax Multilingual TTS Test Set text files from
|
||||
HuggingFace and converts them to the FluidAudio TTS-benchmark
|
||||
corpus format. Outputs one file per language.
|
||||
|
||||
Options:
|
||||
--languages, -l <list> Comma-separated subset of languages
|
||||
(default: all 24).
|
||||
--revision <sha> HuggingFace dataset revision
|
||||
(default: \(defaultRevision)).
|
||||
--out-dir <path> Output directory
|
||||
(default: Benchmarks/tts/corpus/minimax).
|
||||
--help, -h Show this help.
|
||||
|
||||
Available languages:
|
||||
\(allLanguages.joined(separator: ", "))
|
||||
|
||||
Examples:
|
||||
fluidaudio minimax-corpus
|
||||
fluidaudio minimax-corpus --languages english,spanish,hindi
|
||||
fluidaudio minimax-corpus --revision <commit-sha>
|
||||
""")
|
||||
}
|
||||
}
|
||||
#endif
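
The prefix-stripping rule used by `convert(raw:)` in the file above can be tried in isolation. The following is a minimal sketch, not the PR's code: `stripCloningPrefix` is a hypothetical free function, and it deliberately differs from the diff in one respect (an assumption on my part that this is desirable): it splits with `components(separatedBy: .newlines)` so that CRLF input is handled as well.

```swift
import Foundation

/// Hypothetical standalone variant of the `convert(raw:)` helper.
/// Each input line has the shape "<cloning_audio_filename>|<text>".
func stripCloningPrefix(_ raw: String) -> [String] {
    var phrases: [String] = []
    // Splitting on the newline character set also covers "\r\n" endings;
    // the empty pieces this produces are skipped below.
    for rawLine in raw.components(separatedBy: .newlines) {
        let line = rawLine.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !line.isEmpty else { continue }
        let text: String
        if let sep = line.firstIndex(of: "|") {
            // Cut at the FIRST "|" only, so any later "|" stays in the text.
            text = String(line[line.index(after: sep)...])
                .trimmingCharacters(in: .whitespacesAndNewlines)
        } else {
            // No separator: treat the whole line as the phrase.
            text = line
        }
        if !text.isEmpty { phrases.append(text) }
    }
    return phrases
}
```

Lines with no text after the separator are dropped, which matches the `if !text.isEmpty` guard in the diff.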
|
||||
@@ -23,6 +23,8 @@ public enum StyleTTS2Command {
|
||||
var alpha: Float = 0.3
|
||||
var beta: Float = 0.7
|
||||
var seed: UInt64?
|
||||
var tokenizeOnly = false
|
||||
var corpusPath: String?
|
||||
|
||||
var i = 0
|
||||
while i < arguments.count {
|
||||
@@ -74,6 +76,16 @@ public enum StyleTTS2Command {
|
||||
fputs("--seed requires an integer\n", stderr)
|
||||
exit(2)
|
||||
}
|
||||
case "--tokenize-only":
|
||||
tokenizeOnly = true
|
||||
i += 1
|
||||
case "--corpus":
|
||||
guard i + 1 < arguments.count else {
|
||||
fputs("--corpus requires a path\n", stderr)
|
||||
exit(2)
|
||||
}
|
||||
corpusPath = arguments[i + 1]
|
||||
i += 2
|
||||
case "--help", "-h":
|
||||
printUsage()
|
||||
return
|
||||
@@ -88,6 +100,11 @@ public enum StyleTTS2Command {
|
||||
}
|
||||
}
|
||||
|
||||
if tokenizeOnly {
|
||||
await runTokenizeOnly(text: text, corpusPath: corpusPath)
|
||||
return
|
||||
}
|
||||
|
||||
guard let text else {
|
||||
fputs("Missing required text argument\n", stderr)
|
||||
printUsage()
|
||||
@@ -136,6 +153,105 @@ public enum StyleTTS2Command {
|
||||
}
|
||||
}
|
||||
|
||||
/// `--tokenize-only`: phonemize + encode without invoking the diffusion
|
||||
/// pipeline. Reports phoneme string, token id sequence, and any scalars
|
||||
/// that the 178-token espeak-ng vocab silently dropped. With `--corpus`
|
||||
/// runs over every line of a phrase file and aggregates a histogram of
|
||||
/// dropped scalars for the whole corpus.
|
||||
private static func runTokenizeOnly(text: String?, corpusPath: String?) async {
|
||||
do {
|
||||
let manager = StyleTTS2Manager()
|
||||
try await manager.initialize { _ in }
|
||||
|
||||
var totalScalars = 0
|
||||
var totalIds = 0
|
||||
var totalDropped = 0
|
||||
var dropHist: [Unicode.Scalar: Int] = [:]
|
||||
var phraseCount = 0
|
||||
|
||||
func process(_ phrase: String) async throws {
|
||||
let (phonemes, ids, dropped) =
|
||||
try await manager.tokenizeWithReport(text: phrase)
|
||||
let scalars = phonemes.unicodeScalars.count
|
||||
totalScalars += scalars
|
||||
totalIds += ids.count
|
||||
let phraseDropCount = dropped.values.reduce(0, +)
|
||||
totalDropped += phraseDropCount
|
||||
for (k, v) in dropped { dropHist[k, default: 0] += v }
|
||||
phraseCount += 1
|
||||
|
||||
if corpusPath == nil {
|
||||
print("INPUT : \(phrase)")
|
||||
print("PHONEMES : \(phonemes)")
|
||||
print("TOKEN_IDS (\(ids.count)): \(ids)")
|
||||
let formatted =
|
||||
dropped
|
||||
.sorted { $0.value > $1.value }
|
||||
.map {
|
||||
"U+\(String($0.key.value, radix: 16, uppercase: true))"
|
||||
+ " '\($0.key)' ×\($0.value)"
|
||||
}
|
||||
.joined(separator: ", ")
|
||||
print(
|
||||
"DROPPED (\(phraseDropCount) of \(scalars) scalars):"
|
||||
+ " \(formatted)")
|
||||
}
|
||||
}
|
||||
|
||||
if let corpusPath {
|
||||
let url = expand(corpusPath)
|
||||
let raw = try String(contentsOf: url, encoding: .utf8)
|
||||
let phrases = raw.split(separator: "\n", omittingEmptySubsequences: true)
|
||||
.map { $0.trimmingCharacters(in: .whitespaces) }
|
||||
.filter { !$0.isEmpty && !$0.hasPrefix("#") }
|
||||
for (idx, phrase) in phrases.enumerated() {
|
||||
do {
|
||||
try await process(phrase)
|
||||
let dropPct =
|
||||
Double(totalDropped) / Double(max(totalScalars, 1)) * 100
|
||||
if (idx + 1) % 10 == 0 || idx + 1 == phrases.count {
|
||||
fputs(
|
||||
" [\(idx + 1)/\(phrases.count)] running drop rate "
|
||||
+ "\(String(format: "%.2f", dropPct))%\n",
|
||||
stderr)
|
||||
}
|
||||
} catch {
|
||||
fputs(" [\(idx + 1)] phrase failed: \(error)\n", stderr)
|
||||
}
|
||||
}
|
||||
} else if let text {
|
||||
try await process(text)
|
||||
} else {
|
||||
fputs("--tokenize-only requires either text or --corpus\n", stderr)
|
||||
exit(2)
|
||||
}
|
||||
|
||||
let dropPct = Double(totalDropped) / Double(max(totalScalars, 1)) * 100
|
||||
let kept = totalScalars - totalDropped
|
||||
print("")
|
||||
print("=== StyleTTS2 vocab coverage ===")
|
||||
print("phrases : \(phraseCount)")
|
||||
print("phoneme scalars total : \(totalScalars)")
|
||||
print("encoded token ids : \(totalIds) (== kept scalars: \(kept))")
|
||||
print(
|
||||
"dropped scalars : \(totalDropped) "
|
||||
+ "(\(String(format: "%.2f", dropPct))%)")
|
||||
print("distinct dropped chars : \(dropHist.count)")
|
||||
if !dropHist.isEmpty {
|
||||
print("")
|
||||
print("dropped histogram (most → least frequent):")
|
||||
for (scalar, count) in dropHist.sorted(by: { $0.value > $1.value }) {
|
||||
let hex = String(scalar.value, radix: 16, uppercase: true)
|
||||
print(
|
||||
" \(String(format: "%6d", count)) U+\(hex) '\(scalar)'")
|
||||
}
|
||||
}
|
||||
} catch {
|
||||
fputs("StyleTTS2 tokenize-only failed: \(error)\n", stderr)
|
||||
exit(1)
|
||||
}
|
||||
}
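
The coverage summary printed above assumes that every phoneme scalar either maps to exactly one token id or is counted as dropped (hence "encoded token ids == kept scalars"). A minimal sketch of that bookkeeping follows. `encodeWithReport` and the toy vocabulary are hypothetical; the real `tokenizeWithReport(text:)` also runs G2P and its internals are not shown in this diff.

```swift
/// Hypothetical vocab encoder: one id per known scalar, and a count for each
/// scalar that has no vocabulary entry.
func encodeWithReport(
    _ phonemes: String,
    vocab: [Unicode.Scalar: Int]
) -> (ids: [Int], dropped: [Unicode.Scalar: Int]) {
    var ids: [Int] = []
    var dropped: [Unicode.Scalar: Int] = [:]
    for scalar in phonemes.unicodeScalars {
        if let id = vocab[scalar] {
            ids.append(id)
        } else {
            dropped[scalar, default: 0] += 1
        }
    }
    return (ids, dropped)
}

// Toy vocabulary (assumption): only "h" and "i" are known.
let toyVocab: [Unicode.Scalar: Int] = ["h": 1, "i": 2]
let report = encodeWithReport("hi!!", vocab: toyVocab)
let droppedTotal = report.dropped.values.reduce(0, +)
```

With this invariant, `ids.count + droppedTotal` equals the number of input scalars, which is the relation the summary line relies on.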
|
||||
|
||||
private static func expand(_ path: String) -> URL {
|
||||
let exp = (path as NSString).expandingTildeInPath
|
||||
if exp.hasPrefix("/") {
|
||||
@@ -152,12 +268,15 @@ public enum StyleTTS2Command {
|
||||
fluidaudio styletts2 "<text>" --voice <ref_s.bin> [options]
|
||||
|
||||
Options:
|
||||
- --voice <path> Required. Path to precomputed ref_s.bin (256 fp32 LE).
+ --voice <path> Required for synthesis. Path to precomputed ref_s.bin (256 fp32 LE).
|
||||
--output <path> Output WAV path (default: styletts2.wav).
|
||||
--steps <int> ADPM2 sampler steps (default: 5).
|
||||
--alpha <float> Acoustic style mix weight (default: 0.3).
|
||||
--beta <float> Prosody style mix weight (default: 0.7).
|
||||
--seed <uint> Deterministic noise seed (default: system RNG).
|
||||
--tokenize-only Run G2P + vocab encode only; report dropped scalars.
|
||||
No --voice needed. Use with text or --corpus.
|
||||
--corpus <path> Phrase-per-line corpus file (with --tokenize-only).
|
||||
|
||||
Example:
|
||||
fluidaudio styletts2 "Hello world" \\
|
||||
|
||||
@@ -414,22 +414,17 @@ public struct TTS {
|
||||
)
|
||||
return
|
||||
}
|
||||
- if #available(macOS 15, iOS 18, *) {
-     await CosyVoice3TextCLI.run(
-         text: inputText,
-         modelsDir: modelsDir,
-         tokenizerDir: tokDir,
-         embeddingsFile: embFile,
-         specialTokensFile: specFile,
-         promptAssetsPath: promptAssets,
-         outputPath: output,
-         seed: cv3Seed,
-         maxNewTokens: cv3MaxNewTokens,
-         cpuOnly: cv3CpuOnly)
- } else {
-     logger.error(
-         "CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).")
- }
+ await CosyVoice3TextCLI.run(
+     text: inputText,
+     modelsDir: modelsDir,
+     tokenizerDir: tokDir,
+     embeddingsFile: embFile,
+     specialTokensFile: specFile,
+     promptAssetsPath: promptAssets,
+     outputPath: output,
+     seed: cv3Seed,
+     maxNewTokens: cv3MaxNewTokens,
+     cpuOnly: cv3CpuOnly)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -440,19 +435,14 @@ public struct TTS {
|
||||
)
|
||||
return
|
||||
}
|
||||
- if #available(macOS 15, iOS 18, *) {
-     await CosyVoice3ParityCLI.run(
-         fixturePath: fixture,
-         modelsDir: modelsDir,
-         referencePath: cv3ReferencePath,
-         outputPath: output,
-         seed: cv3Seed,
-         cpuOnly: cv3CpuOnly,
-         replayTokens: cv3ReplayTokens)
- } else {
-     logger.error(
-         "CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).")
- }
+ await CosyVoice3ParityCLI.run(
+     fixturePath: fixture,
+     modelsDir: modelsDir,
+     referencePath: cv3ReferencePath,
+     outputPath: output,
+     seed: cv3Seed,
+     cpuOnly: cv3CpuOnly,
+     replayTokens: cv3ReplayTokens)
|
||||
return
|
||||
}
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -50,6 +50,10 @@ struct FluidAudioCLI {
|
||||
await MagpieCommand.run(arguments: Array(arguments.dropFirst(2)))
|
||||
case "tts-asr-verify":
|
||||
await TTSAsrVerifyCommand.run(arguments: Array(arguments.dropFirst(2)))
|
||||
case "tts-benchmark":
|
||||
await TtsBenchmarkCommand.run(arguments: Array(arguments.dropFirst(2)))
|
||||
case "minimax-corpus":
|
||||
await MinimaxCorpusCommand.run(arguments: Array(arguments.dropFirst(2)))
|
||||
case "diarization-benchmark":
|
||||
await StreamDiarizationBenchmark.run(arguments: Array(arguments.dropFirst(2)))
|
||||
case "process":
|
||||
@@ -116,6 +120,8 @@ struct FluidAudioCLI {
|
||||
tts Synthesize speech from text using Kokoro TTS
|
||||
magpie Magpie TTS Multilingual 357M (experimental, ~0.04 RTFx — slow, needs perf work)
|
||||
tts-asr-verify Batch TTS→ASR roundtrip WER verification
|
||||
tts-benchmark Quantitative TTS benchmark (latency, quality, compute-unit sweep)
|
||||
minimax-corpus Fetch MiniMax TTS Multilingual Test Set into Benchmarks/tts/corpus/minimax
|
||||
parakeet-eou Run Parakeet EOU Streaming ASR on a single file
|
||||
ctc-earnings-benchmark Run CTC keyword spotting benchmark on Earnings22
|
||||
sortformer Run Sortformer streaming diarization
|
||||
|
||||
@@ -0,0 +1,60 @@
|
||||
import XCTest
|
||||
|
||||
@testable import FluidAudio
|
||||
|
||||
/// Guard the stateful → stateless decode rename. The HF repo
|
||||
/// `FluidInference/CosyVoice3-0.5B-coreml` ships only `LLM-Decode-M768-fp16`
|
||||
/// (non-stateful, external KV cache); resurrecting `-stateful` here would
|
||||
/// re-break the download path and regress macOS 14 support.
|
||||
final class CosyVoice3ModelNameTests: XCTestCase {
|
||||
|
||||
// MARK: - ModelNames.CosyVoice3
|
||||
|
||||
func testLlmDecodeIsStatelessName() {
|
||||
XCTAssertEqual(ModelNames.CosyVoice3.llmDecode, "LLM-Decode-M768-fp16")
|
||||
XCTAssertFalse(
|
||||
ModelNames.CosyVoice3.llmDecode.contains("stateful"),
|
||||
"llmDecode must not reference the dropped stateful variant")
|
||||
}
|
||||
|
||||
func testLlmDecodeFileMatchesBaseName() {
|
||||
XCTAssertEqual(
|
||||
ModelNames.CosyVoice3.llmDecodeFile,
|
||||
"LLM-Decode-M768-fp16.mlmodelc")
|
||||
}
|
||||
|
||||
func testRequiredModelsContainsStatelessDecode() {
|
||||
XCTAssertTrue(
|
||||
ModelNames.CosyVoice3.requiredModels.contains("LLM-Decode-M768-fp16.mlmodelc"),
|
||||
"requiredModels must list the stateless decode bundle")
|
||||
XCTAssertFalse(
|
||||
ModelNames.CosyVoice3.requiredModels.contains(
|
||||
"LLM-Decode-M768-fp16-stateful.mlmodelc"),
|
||||
"requiredModels must not list the dropped stateful bundle")
|
||||
}
|
||||
|
||||
func testRequiredModelsHasFourEntries() {
|
||||
XCTAssertEqual(
|
||||
ModelNames.CosyVoice3.requiredModels.count, 4,
|
||||
"Pipeline ships exactly 4 CoreML bundles: prefill, decode, flow, hift")
|
||||
}
|
||||
|
||||
// MARK: - CosyVoice3Constants.Files
|
||||
|
||||
func testFilesLlmDecodeIsStatelessPackage() {
|
||||
XCTAssertEqual(
|
||||
CosyVoice3Constants.Files.llmDecode,
|
||||
"LLM-Decode-M768-fp16.mlpackage")
|
||||
XCTAssertFalse(
|
||||
CosyVoice3Constants.Files.llmDecode.contains("stateful"))
|
||||
}
|
||||
|
||||
func testFilesLlmDecodeSubdirIsRenamed() {
|
||||
XCTAssertEqual(
|
||||
CosyVoice3Constants.Files.llmDecodeSubdir,
|
||||
"llm-fp16-decode",
|
||||
"Local-build subdir must be the renamed stateless directory")
|
||||
XCTAssertFalse(
|
||||
CosyVoice3Constants.Files.llmDecodeSubdir.contains("stateful"))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,184 @@
|
||||
import XCTest
|
||||
|
||||
@testable import FluidAudio
|
||||
|
||||
final class CosyVoice3TextChunkerTests: XCTestCase {
|
||||
|
||||
// MARK: - estimateSpeechTokens
|
||||
|
||||
func testEstimateSpeechTokensCJK() {
|
||||
// 4 CJK chars × 7.5 = 30 tokens
|
||||
XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("你好世界"), 30)
|
||||
}
|
||||
|
||||
func testEstimateSpeechTokensASCII() {
|
||||
// 5 ASCII chars × 1.5 = 7.5 → rounds to 8
|
||||
XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("hello"), 8)
|
||||
}
|
||||
|
||||
func testEstimateSpeechTokensEmpty() {
|
||||
XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens(""), 0)
|
||||
}
|
||||
|
||||
// MARK: - chunk: short input fast path
|
||||
|
||||
func testChunkEmptyReturnsEmpty() {
|
||||
XCTAssertEqual(CosyVoice3TextChunker.chunk(""), [])
|
||||
XCTAssertEqual(CosyVoice3TextChunker.chunk(" "), [])
|
||||
XCTAssertEqual(CosyVoice3TextChunker.chunk("\n\n"), [])
|
||||
}
|
||||
|
||||
func testChunkShortReturnsSingle() {
|
||||
// 5 chars (4 CJK + 「。」) ≈ 33 tokens, well under default 110
|
||||
XCTAssertEqual(
|
||||
CosyVoice3TextChunker.chunk("你好世界。"),
|
||||
["你好世界。"])
|
||||
}
|
||||
|
||||
func testChunkShortTrimsWhitespace() {
|
||||
XCTAssertEqual(
|
||||
CosyVoice3TextChunker.chunk(" hello world. "),
|
||||
["hello world."])
|
||||
}
|
||||
|
||||
// MARK: - chunk: hard sentence enders
|
||||
|
||||
func testChunkSplitsOnHardEnders() {
|
||||
// 28 CJK chars × 7.5 = 210 tokens (before punctuation) > 110 default → must split
|
||||
let text = "今天天气很好。我们去公园散步。明天可能会下雨。下周打算去看电影。"
|
||||
let chunks = CosyVoice3TextChunker.chunk(text)
|
||||
XCTAssertGreaterThan(chunks.count, 1)
|
||||
// No chunk should exceed budget by more than the soft margin
|
||||
for chunk in chunks {
|
||||
let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
|
||||
XCTAssertLessThanOrEqual(est, 110 + 30 + 8, "chunk over force-split margin: \(chunk)")
|
||||
}
|
||||
// Concatenating chunks back should reconstruct the input modulo
|
||||
// whitespace trimming.
|
||||
XCTAssertEqual(chunks.joined(), text)
|
||||
}
|
||||
|
||||
func testChunkSplitsOnEnglishSentenceEnders() {
|
||||
// Each sentence ≈ 25–30 tokens; with maxSpeechTokens=80 every
|
||||
// sentence fits individually so the chunker should commit on the
|
||||
// first hard ender it sees rather than packing greedily across
|
||||
// sentences and hitting force-split.
|
||||
let text = "Hello world. This is a test. Pack my box with five jugs. Quick brown fox jumps."
|
||||
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 80)
|
||||
XCTAssertGreaterThan(chunks.count, 1)
|
||||
for chunk in chunks {
|
||||
XCTAssertTrue(
|
||||
chunk.hasSuffix(".") || chunk.hasSuffix("!") || chunk.hasSuffix("?"),
|
||||
"chunk does not end at hard boundary: \(chunk)")
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - chunk: soft enders fall-through
|
||||
|
||||
func testChunkFallsBackToSoftEnders() {
|
||||
// One huge sentence with commas, no periods. Should split on 「,」.
|
||||
let text = "一个非常非常长的句子,里面有很多分句,每个分句都不是很长,但是加在一起就会超过预算限制"
|
||||
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50)
|
||||
XCTAssertGreaterThan(chunks.count, 1)
|
||||
for chunk in chunks {
|
||||
let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
|
||||
// Force-split allows one CJK char of overshoot past the +30 margin
|
||||
// because the budget check runs AFTER appending the current char.
|
||||
XCTAssertLessThanOrEqual(est, 50 + 30 + 8)
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - chunk: force-split fallback
|
||||
|
||||
func testChunkForceSplitsOnContinuousCJKWithoutPunctuation() {
|
||||
// 31 CJK chars, no punctuation: ≈ 232 tokens, must force-split
|
||||
// somewhere even without natural boundaries.
|
||||
let text = "今天天气很好我们去公园散步明天可能会下雨下周打算看电影然后回家"
|
||||
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50)
|
||||
XCTAssertGreaterThan(chunks.count, 1)
|
||||
for chunk in chunks {
|
||||
let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
|
||||
// Force-split has a 30-token overshoot allowance + one CJK char (7.5)
|
||||
XCTAssertLessThanOrEqual(est, 50 + 30 + 8, "chunk overflow on force-split: \(chunk)")
|
||||
}
|
||||
// No content lost
|
||||
XCTAssertEqual(chunks.joined(), text)
|
||||
}
|
||||
|
||||
func testChunkForceSplitsOnEnglishSpacesWhenNoPunctuation() {
|
||||
// Long English with no terminal punctuation; should split on spaces
|
||||
// when the running estimate exceeds budget.
|
||||
let text = "the quick brown fox jumps over the lazy dog and then runs back home very fast"
|
||||
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 20)
|
||||
XCTAssertGreaterThan(chunks.count, 1)
|
||||
for chunk in chunks {
|
||||
// No leading/trailing whitespace expected on returned chunks
|
||||
XCTAssertEqual(chunk, chunk.trimmingCharacters(in: .whitespaces))
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - concatWithCrossfade
|
||||
|
||||
func testConcatEmptyReturnsEmpty() {
|
||||
let out = CosyVoice3TtsManager.concatWithCrossfade(
|
||||
[], sampleRate: 24_000, fadeMs: 8)
|
||||
XCTAssertEqual(out, [])
|
||||
}
|
||||
|
||||
func testConcatSingleChunkPassthrough() {
|
||||
let chunk: [Float] = [0.1, 0.2, 0.3, 0.4]
|
||||
let out = CosyVoice3TtsManager.concatWithCrossfade(
|
||||
[chunk], sampleRate: 24_000, fadeMs: 8)
|
||||
XCTAssertEqual(out, chunk)
|
||||
}
|
||||
|
||||
func testConcatZeroFadeIsSimpleAppend() {
|
||||
let a: [Float] = [0.1, 0.2, 0.3]
|
||||
let b: [Float] = [0.4, 0.5, 0.6]
|
||||
let out = CosyVoice3TtsManager.concatWithCrossfade(
|
||||
[a, b], sampleRate: 24_000, fadeMs: 0)
|
||||
XCTAssertEqual(out, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
|
||||
}
|
||||
|
||||
func testConcatCrossfadeShrinksGracefullyForShortChunks() {
|
||||
// 4-sample chunks; nominal fade at 24 kHz × 8 ms = 192 samples,
|
||||
// gets clamped to min(out.count/2, next.count/2) = 2.
|
||||
let a: [Float] = [1.0, 1.0, 1.0, 1.0]
|
||||
let b: [Float] = [0.0, 0.0, 0.0, 0.0]
|
||||
let out = CosyVoice3TtsManager.concatWithCrossfade(
|
||||
[a, b], sampleRate: 24_000, fadeMs: 8)
|
||||
// Output length: 4 (a) - 2 (fade) + 4 (b) = 6; first 2 of a remain
|
||||
// pristine, then a 2-sample crossfade region, then last 2 of b
|
||||
XCTAssertEqual(out.count, 6)
|
||||
XCTAssertEqual(out[0], 1.0)
|
||||
XCTAssertEqual(out[1], 1.0)
|
||||
// Crossfade region: a's 1.0 fades to 0; b's 0.0 fades from 0.
|
||||
// At j=0: down=1, up=0 → 1.0 * 1 + 0.0 * 0 = 1.0
|
||||
// At j=1: down=0.5, up=0.5 → 1.0*0.5 + 0.0*0.5 = 0.5
|
||||
XCTAssertEqual(out[2], 1.0, accuracy: 1e-5)
|
||||
XCTAssertEqual(out[3], 0.5, accuracy: 1e-5)
|
||||
XCTAssertEqual(out[4], 0.0, accuracy: 1e-5)
|
||||
XCTAssertEqual(out[5], 0.0, accuracy: 1e-5)
|
||||
}
|
||||
|
||||
func testConcatCrossfadePreservesPrefixAndSuffix() {
|
||||
// Long enough chunks for a full fade window
|
||||
let sampleRate = 24_000
|
||||
let fadeMs = 4.0 // 96 samples
|
||||
let a = [Float](repeating: 1.0, count: 480)
|
||||
let b = [Float](repeating: 0.0, count: 480)
|
||||
let out = CosyVoice3TtsManager.concatWithCrossfade(
|
||||
[a, b], sampleRate: sampleRate, fadeMs: fadeMs)
|
||||
let fade = Int((Double(sampleRate) * fadeMs / 1000).rounded())
|
||||
// Output length: a.count - fade + b.count
|
||||
XCTAssertEqual(out.count, a.count - fade + b.count)
|
||||
// Prefix of `a` (before crossfade region) untouched
|
||||
for j in 0..<(a.count - fade) {
|
||||
XCTAssertEqual(out[j], 1.0)
|
||||
}
|
||||
// Suffix of `b` (after crossfade region) untouched
|
||||
for j in a.count..<out.count {
XCTAssertEqual(out[j], 0.0)
|
||||
}
|
||||
}
|
||||
}
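
The `concatWithCrossfade` tests above pin down three properties: the nominal fade is `round(sampleRate × fadeMs / 1000)` samples, it is clamped to half of either side when chunks are short, and the output length is `a.count - fade + b.count`. The sketch below reproduces those properties under stated assumptions. `concatWithCrossfadeSketch` is a hypothetical name, and the raised-cosine curve is my assumption, chosen only because it yields the weights quoted in the test comments (1.0 at `j = 0`, 0.5 at the midpoint of a 2-sample fade); the shipped implementation may use a different curve.

```swift
import Foundation

/// Hypothetical stand-in for `CosyVoice3TtsManager.concatWithCrossfade`.
func concatWithCrossfadeSketch(
    _ chunks: [[Float]], sampleRate: Int, fadeMs: Double
) -> [Float] {
    guard var out = chunks.first else { return [] }
    // Nominal fade length in samples, e.g. 24 kHz × 8 ms = 192.
    let nominal = Int((Double(sampleRate) * fadeMs / 1000).rounded())
    for next in chunks.dropFirst() {
        // Never fade across more than half of either side.
        let fade = min(nominal, out.count / 2, next.count / 2)
        let start = out.count - fade
        for j in 0..<fade {
            // Raised-cosine weights (assumed): up goes 0 → 1, down goes 1 → 0.
            let up = Float(0.5 * (1 - cos(Double.pi * Double(j) / Double(fade))))
            let down = 1 - up
            out[start + j] = out[start + j] * down + next[j] * up
        }
        // Samples after the overlap are appended unchanged.
        out.append(contentsOf: next[fade...])
    }
    return out
}
```

Because the overlap is mixed in place at the tail of the accumulated output, samples before the overlap and after it are never modified, which is what `testConcatCrossfadePreservesPrefixAndSuffix` checks.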
|
||||
@@ -53,4 +53,84 @@ final class MagpieKvCacheTests: XCTestCase {
|
||||
XCTAssertEqual(
|
||||
MagpieKvCache.positionOutputKeys.count, MagpieConstants.numDecoderLayers)
|
||||
}
|
||||
|
||||
/// Drives the slow-path fallback used by `MagpieSynthesizer.runDecoderStep`
|
||||
/// when CoreML rejects `outputBackings`. Builds a synthetic feature
|
||||
/// provider that mirrors the `decoder_step.mlmodelc` output schema, hands
|
||||
/// it to `absorbOutputs`, and verifies the cache front pointers + position
|
||||
/// were replaced (i.e. the fallback can take over without `swapBackings`).
|
||||
func testAbsorbOutputsReplacesFrontPointers() throws {
|
||||
let numLayers = 3
|
||||
let maxCacheLength = 16
|
||||
let numHeads = 2
|
||||
let headDim = 4
|
||||
let cache = try MagpieKvCache(
|
||||
numLayers: numLayers, maxCacheLength: maxCacheLength,
|
||||
numHeads: numHeads, headDim: headDim)
|
||||
|
||||
let preK = (0..<numLayers).map { ObjectIdentifier(cache.cachesK[$0]) }
|
||||
let preV = (0..<numLayers).map { ObjectIdentifier(cache.cachesV[$0]) }
|
||||
let prePos = (0..<numLayers).map { ObjectIdentifier(cache.positions[$0]) }
|
||||
|
||||
let cacheShape: [NSNumber] = [
|
||||
1,
|
||||
NSNumber(value: maxCacheLength),
|
||||
NSNumber(value: numHeads),
|
||||
NSNumber(value: headDim),
|
||||
]
|
||||
var features: [String: MLFeatureValue] = [:]
|
||||
for i in 0..<numLayers {
|
||||
let kArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
|
||||
kArr.zeroFillFloat16()
|
||||
let vArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
|
||||
vArr.zeroFillFloat16()
|
||||
let posArr = try MLMultiArray(shape: [1], dataType: .float16)
|
||||
posArr.zeroFillFloat16()
|
||||
posArr[0] = NSNumber(value: Float(i + 1))
|
||||
features[MagpieKvCache.cacheKOutputKeys[i]] = MLFeatureValue(multiArray: kArr)
|
||||
features[MagpieKvCache.cacheVOutputKeys[i]] = MLFeatureValue(multiArray: vArr)
|
||||
features[MagpieKvCache.positionOutputKeys[i]] = MLFeatureValue(multiArray: posArr)
|
||||
}
|
||||
let provider = try MLDictionaryFeatureProvider(dictionary: features)
|
||||
|
||||
try cache.absorbOutputs(provider)
|
||||
|
||||
for i in 0..<numLayers {
|
||||
XCTAssertNotEqual(
|
||||
ObjectIdentifier(cache.cachesK[i]), preK[i],
|
||||
"absorbOutputs must replace cachesK[\(i)] front pointer")
|
||||
XCTAssertNotEqual(
|
||||
ObjectIdentifier(cache.cachesV[i]), preV[i],
|
||||
"absorbOutputs must replace cachesV[\(i)] front pointer")
|
||||
XCTAssertNotEqual(
|
||||
ObjectIdentifier(cache.positions[i]), prePos[i],
|
||||
"absorbOutputs must replace positions[\(i)] front pointer")
|
||||
}
|
||||
// positions[0] = 1 → cache.position reads layer-0 scalar.
|
||||
XCTAssertEqual(cache.position, 1)
|
||||
}
|
||||
|
||||
func testAbsorbOutputsThrowsWhenCacheKOutputMissing() throws {
|
||||
let cache = try MagpieKvCache(
|
||||
numLayers: 2, maxCacheLength: 8, numHeads: 1, headDim: 2)
|
||||
|
||||
// Provide a feature provider with the wrong key for cache_k_0 so the
|
||||
// first lookup fails. This guards the error message users will see
|
||||
// when the fallback path is actually exercised.
|
||||
let bogus = try MLMultiArray(shape: [1, 8, 1, 2], dataType: .float16)
|
||||
bogus.zeroFillFloat16()
|
||||
let provider = try MLDictionaryFeatureProvider(dictionary: [
|
||||
"wrong_key": MLFeatureValue(multiArray: bogus)
|
||||
])
|
||||
|
||||
XCTAssertThrowsError(try cache.absorbOutputs(provider)) { error in
|
||||
guard case MagpieError.inferenceFailed(_, let underlying) = error else {
|
||||
XCTFail("expected MagpieError.inferenceFailed, got \(error)")
|
||||
return
|
||||
}
|
||||
XCTAssertTrue(
|
||||
underlying.contains("missing K cache output key"),
|
||||
"underlying should mention the missing K key, got: \(underlying)")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -0,0 +1,114 @@
|
||||
@preconcurrency import CoreML
|
||||
import XCTest
|
||||
|
||||
@testable import FluidAudio
|
||||
|
||||
final class TtsComputeUnitPresetTests: XCTestCase {
|
||||
|
||||
// MARK: - init?(cliValue:)
|
||||
|
||||
func testCliValueParsing_canonicalKebabCase() {
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "default"), .default)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "all-ane"), .allAne)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu-and-gpu"), .cpuAndGpu)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu-only"), .cpuOnly)
|
||||
}
|
||||
|
||||
func testCliValueParsing_aliases() {
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "ane"), .allAne)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "neural-engine"), .allAne)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpuandgpu"), .cpuAndGpu)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "gpu"), .cpuAndGpu)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu"), .cpuOnly)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpuonly"), .cpuOnly)
|
||||
}
|
||||
|
||||
func testCliValueParsing_caseInsensitive() {
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "DEFAULT"), .default)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "All-Ane"), .allAne)
|
||||
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "CPU-AND-GPU"), .cpuAndGpu)
|
||||
}
|
||||
|
||||
func testCliValueParsing_unknownReturnsNil() {
|
||||
XCTAssertNil(TtsComputeUnitPreset(cliValue: ""))
|
||||
XCTAssertNil(TtsComputeUnitPreset(cliValue: "fastest"))
|
||||
XCTAssertNil(TtsComputeUnitPreset(cliValue: "all_ane")) // underscore rejected
|
||||
XCTAssertNil(TtsComputeUnitPreset(cliValue: "ane-only"))
|
||||
XCTAssertNil(TtsComputeUnitPreset(cliValue: "neuralengine"))
|
||||
}
|
||||
|
||||
// MARK: - cliValue (round-trip)
|
||||
|
||||
func testCliValueRoundTrip() {
|
||||
for preset in TtsComputeUnitPreset.allCases {
|
||||
let canonical = preset.cliValue
|
||||
XCTAssertEqual(
|
||||
TtsComputeUnitPreset(cliValue: canonical), preset,
|
||||
"cliValue '\(canonical)' must round-trip back to \(preset)")
|
||||
}
|
||||
}
|
||||
|
||||
func testCliValueIsKebabCase() {
|
||||
XCTAssertEqual(TtsComputeUnitPreset.default.cliValue, "default")
|
||||
XCTAssertEqual(TtsComputeUnitPreset.allAne.cliValue, "all-ane")
|
||||
XCTAssertEqual(TtsComputeUnitPreset.cpuAndGpu.cliValue, "cpu-and-gpu")
|
||||
XCTAssertEqual(TtsComputeUnitPreset.cpuOnly.cliValue, "cpu-only")
|
||||
}
|
||||
|
||||
// MARK: - uniformUnits
|
||||
|
||||
func testUniformUnits_defaultIsNil() {
|
||||
XCTAssertNil(TtsComputeUnitPreset.default.uniformUnits)
|
||||
}
|
||||
|
||||
func testUniformUnits_concretePresets() {
|
||||
XCTAssertEqual(TtsComputeUnitPreset.allAne.uniformUnits, .cpuAndNeuralEngine)
|
||||
XCTAssertEqual(TtsComputeUnitPreset.cpuAndGpu.uniformUnits, .cpuAndGPU)
|
||||
XCTAssertEqual(TtsComputeUnitPreset.cpuOnly.uniformUnits, .cpuOnly)
|
||||
}
|
||||
|
||||
// MARK: - KokoroAneComputeUnits(preset:)
|
||||
|
||||
func testKokoroAnePreset_defaultMatchesStaticDefault() {
|
||||
XCTAssertEqual(KokoroAneComputeUnits(preset: .default), .default)
|
||||
}
|
||||
|
||||
func testKokoroAnePreset_allAneMatchesStatic() {
|
||||
XCTAssertEqual(KokoroAneComputeUnits(preset: .allAne), .allAne)
|
||||
}
|
||||
|
||||
func testKokoroAnePreset_cpuAndGpuMatchesStatic() {
|
||||
XCTAssertEqual(KokoroAneComputeUnits(preset: .cpuAndGpu), .cpuAndGpu)
|
||||
}
|
||||
|
||||
func testKokoroAnePreset_cpuOnlyMatchesStatic() {
|
||||
XCTAssertEqual(KokoroAneComputeUnits(preset: .cpuOnly), .cpuOnly)
|
||||
}
|
||||
|
||||
func testKokoroAnePreset_allAneForcesEveryStageToANE() {
|
||||
let cu = KokoroAneComputeUnits(preset: .allAne)
|
||||
for stage in KokoroAneStage.allCases {
|
||||
XCTAssertEqual(
|
||||
cu.units(for: stage), .cpuAndNeuralEngine,
|
||||
"stage \(stage) should be .cpuAndNeuralEngine under .allAne")
|
||||
}
|
||||
}
|
||||
|
||||
func testKokoroAnePreset_cpuOnlyForcesEveryStageToCPU() {
|
||||
let cu = KokoroAneComputeUnits(preset: .cpuOnly)
|
||||
for stage in KokoroAneStage.allCases {
|
||||
XCTAssertEqual(
|
||||
cu.units(for: stage), .cpuOnly,
|
||||
"stage \(stage) should be .cpuOnly under .cpuOnly")
|
||||
}
|
||||
}
|
||||
|
||||
func testKokoroAnePreset_cpuAndGpuForcesEveryStageToCPUAndGPU() {
|
||||
let cu = KokoroAneComputeUnits(preset: .cpuAndGpu)
|
||||
for stage in KokoroAneStage.allCases {
|
||||
XCTAssertEqual(
|
||||
cu.units(for: stage), .cpuAndGPU,
|
||||
"stage \(stage) should be .cpuAndGPU under .cpuAndGpu")
|
||||
}
|
||||
}
|
||||
}
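
The parsing contract that `TtsComputeUnitPresetTests` asserts (canonical kebab-case values, a fixed alias set, case-insensitive matching, rejection of anything else) can be illustrated with a small stand-alone enum. This is a sketch only: `PresetSketch` is a hypothetical type that mirrors the test expectations, omits the CoreML-dependent `uniformUnits`, and is not the PR's `TtsComputeUnitPreset`.

```swift
/// Hypothetical mirror of the CLI parsing behaviour asserted by the tests.
enum PresetSketch: CaseIterable, Equatable {
    case `default`, allAne, cpuAndGpu, cpuOnly

    /// Canonical kebab-case spelling used on the command line.
    var cliValue: String {
        switch self {
        case .default: return "default"
        case .allAne: return "all-ane"
        case .cpuAndGpu: return "cpu-and-gpu"
        case .cpuOnly: return "cpu-only"
        }
    }

    /// Accepts canonical values and aliases, ignoring case; nil otherwise.
    init?(cliValue: String) {
        switch cliValue.lowercased() {
        case "default": self = .default
        case "all-ane", "ane", "neural-engine": self = .allAne
        case "cpu-and-gpu", "cpuandgpu", "gpu": self = .cpuAndGpu
        case "cpu-only", "cpuonly", "cpu": self = .cpuOnly
        default: return nil
        }
    }
}
```

Keeping the alias list in a single `switch` means a spelling such as `all_ane` is rejected unless it is added there explicitly, which is the behaviour `testCliValueParsing_unknownReturnsNil` guards.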
|
||||