feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557)

## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring
**latency × efficiency × quality** across every shipping TTS backend in
FluidAudio, plus the model + runtime fixes needed to actually clear all
six backends end-to-end on the [MiniMax Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set).
Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs
level so users get a runtime warning on `initialize()` reflecting their
actual perf / quality posture.

### Backends — all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER | Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) |  | 3.5 s / 8.0 s / 11.4 s | 5.19× | 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) |  | 3.5 s / 6.8 s / 9.3 s | 2.02× | 1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip |
| PocketTTS | minimax-en (100/100) |  | 2.8 s / 6.3 s / 9.4 s | 0.61× | 1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms; RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s | 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6 s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s / **16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine), longest output 16.0 s (was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); **whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100 phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s / **16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 → 5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a
`logger.warning` flagging the perf / quality posture; safe to ship in
non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator`
whitespace-tokenizes and Mandarin has no word boundaries (word-level WER
reads ~100% and is meaningless). CER is `whisper-large-v3` against the
rendered WAVs from the full 100-phrase `minimax-chinese` run via
`Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this
PR via `--asr-backend cohere` (see [Cohere ASR backend in the
harness](#cohere-asr-backend-in-the-harness) below) and agrees with
whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a
`MILCompilerForANE` cache failure on this M2 host that drops it to RTFx
~0.13×, so whisper is the practical source-of-truth for the full
100-phrase run.
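
To make the WER-versus-CER point concrete, here is a minimal Python sketch. It is not the code in `WERCalculator` or `Scripts/whisper_zh_cer.py`; the function names, the sample sentences, and the macro/micro definitions (mean of per-phrase CER versus total edits over total reference characters) are assumptions chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def chars(text):
    """Characters of `text` with whitespace removed."""
    return [c for c in text if not c.isspace()]


def wer(ref, hyp):
    """Word error rate with whitespace tokenization."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def cer(ref, hyp):
    """Character error rate, ignoring whitespace."""
    ref_chars = chars(ref)
    return edit_distance(ref_chars, chars(hyp)) / max(len(ref_chars), 1)


def macro_micro_cer(pairs):
    """Macro = mean of per-phrase CER; micro = total edits / total ref chars."""
    macro = sum(cer(r, h) for r, h in pairs) / len(pairs)
    edits = sum(edit_distance(chars(r), chars(h)) for r, h in pairs)
    total = sum(len(chars(r)) for r, _ in pairs)
    return macro, edits / total


# Hypothetical pair: one wrong character out of six, no spaces anywhere.
reference, hypothesis = "今天天气很好", "今天天汽很好"
print(wer(reference, hypothesis))  # the whole sentence counts as one "word"
print(cer(reference, hypothesis))
```

With no whitespace in the reference, a single wrong character makes the only "word" wrong, so WER saturates while CER stays proportional to the actual error.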

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category)
live in `Documentation/TTS/Benchmarks.md`. Corpus attribution +
reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat — phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly
comparable between backends when both produce similar phrase lengths and
yield audio at the same granularity**. Two things skew the headline
number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per
`minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3×
more audio out. That's mostly long inter-word pauses + slow speaking
rate baked into the LibriTTS multi-speaker checkpoint, not a measurement
artifact. A 2.72× RTFx on 22 s audio = ~8 s wall, roughly in line with
the 6.7 s TTFT p50. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus
RTFx ratios alone hide this.
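
The wall-clock arithmetic above can be checked with a few lines of Python. This is only a worked example using the rounded p50 audio lengths and aggregate RTFx values quoted in this section; `wall_seconds` is a hypothetical helper, not part of the harness.

```python
def wall_seconds(audio_s, rtfx):
    """Wall-clock synth time implied by RTFx = audio_duration / wall_clock."""
    return audio_s / rtfx


# Rounded figures quoted in the paragraph above.
styletts2_wall = wall_seconds(22.0, 2.72)
kokoro_wall = wall_seconds(7.0, 2.02)

print(round(styletts2_wall, 1))  # 8.1
print(round(kokoro_wall, 1))     # 3.5
```

StyleTTS2 has the higher RTFx of the two, yet a listener waits more than twice as long for it because it renders about three times as much audio for the same text.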

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs.
Kokoro's 2.02× but it's **not slower from a user perspective**:
PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**,
Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The
0.61× is the per-frame cost averaged across the streaming run; what
users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |
|---|---|---|---|
| PocketTTS | 1244 ms | 80 ms frame | true streaming; conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio; ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte |
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s, 36% earlier playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz | one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only |

For conversational use cases, **TTFT > RTFx**. PocketTTS (true
streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE
(small one-shot chunks) are the three backends that meaningfully clear
the "user feels it's responsive" bar today.

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should
weigh against an explicit caveat:

- **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
The misaki→espeak post-pass remap closed about a quarter of the WER gap
(58.1% → 44.0%) and roughly halved CER; the remainder is
BART G2P misses + diffusion-sampler formant breaks on long phrases.
- **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via
`synthesizeStream` so TTFT (9.6 s p50) is significantly better than
full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to
~30 s.
- **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the
longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token
Flow input cap is now worked around at the call site by the auto-chunker
(long phrases split + crossfaded), dropping cantonese truncation from
80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100
residual is the long-tail token-rate worst case; the structural fix is
re-exporting Flow with a larger fixed input shape (tracked in
`mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` +
a `.warning`-level `LLM-Decode budget exhausted` log still surface any
truncation, and the harness writes `finished_on_eos` into each phrase in
the JSON report.

Each manager now logs a `.warning`-level beta notice on `initialize()`
(mirroring the existing CosyVoice3 pattern) so anyone wiring these into
a product gets a console signal, not a silent surprise. Docs
(`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md`
StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same
caveat at the top.

### Model + runtime fixes landed in this PR

#### CosyVoice3 stateless port (`71130c9fb`)
Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the
non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on
HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain
`MLDictionaryFeatureProvider` prediction with explicit kv carry-forward;
lowers the availability gate from macOS 15 / iOS 18 back to the package
baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the
rename.

#### CosyVoice3 HiFT timeout fix (`267766b62`)
`minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async
failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has
timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`,
which let the planner place most of the graph on ANE but kept at least
one op on the BNNS CPU async-dispatch path; long phrases tripped the
BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of
user-supplied compute-units, removing the BNNS path entirely. Verified
on 100/100 zh + 100/100 yue.

#### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`)
The autoregressive decode loop runs ~163 steps per phrase to fill the
250-token cap. Each step takes the previous step's KV cache as `kv_k` /
`kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh
`kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side
`MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV
back-buffers + a logits backing, rotates front/back/spare across steps
via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc
on first rejection (one-shot `logger.warning`). Mirrors the Magpie
pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357
(+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470
MB.
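
The buffer sizes quoted above follow directly from the tensor shape, and the reuse idea can be shown with a deliberately simplified two-slot rotation. The Python sketch below is an illustration only: the real code is Swift and rotates four KV back-buffers plus a logits backing through `MLPredictionOptions.outputBackings`, whereas `PingPong` here is a hypothetical class that models a single tensor with plain `bytearray` slots.

```python
from math import prod

KV_SHAPE = (24, 1, 2, 768, 64)  # shape quoted in the text
BYTES_PER_FP32 = 4

kv_bytes = prod(KV_SHAPE) * BYTES_PER_FP32
print(kv_bytes / 1e6)       # 9.437184 (MB per KV tensor)
print(4 * kv_bytes / 1e6)   # 37.748736 (kv_k, kv_v, kv_k_out, kv_v_out)


class PingPong:
    """Two preallocated slots: each step's output becomes the next step's input."""

    def __init__(self, nbytes):
        self.slots = [bytearray(nbytes), bytearray(nbytes)]
        self.front = 0  # index of the slot holding the current input

    def input_buffer(self):
        return self.slots[self.front]

    def output_buffer(self):
        return self.slots[1 - self.front]

    def swap(self):
        """Promote the freshly written output slot to be the next input."""
        self.front = 1 - self.front


# Tiny demo buffers; the decode loop never allocates after construction.
buffers = PingPong(16)
seen = set()
for step in range(5):
    out = buffers.output_buffer()
    out[0] = step            # stand-in for the model writing its output
    seen.add(id(out))
    buffers.swap()
print(len(seen))  # 2
```

The point of the pattern is that allocation cost is paid once up front instead of once per decode step.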

#### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`)
The 250-token Flow input cap means a single synth pass produces at most
~6.5 s of audio regardless of input length. Re-exporting Flow with a
larger fixed input shape is gated on upstream conversion work, so this
PR works around it at the call site: long inputs are split at
sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized
independently, and merged with an 8 ms equal-power cosine crossfade.
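
For readers unfamiliar with equal-power crossfades, here is a minimal Python sketch of the merge step. It is not `CosyVoice3TtsManager.concatWithCrossfade`; the overlap-and-shorten behaviour, the half-sample offset, and the function name are assumptions. The only facts taken from the text are the 8 ms window, the 24 kHz sample rate, and the cosine (equal-power) gain shape.

```python
import math

SAMPLE_RATE = 24_000
FADE_MS = 8


def concat_with_crossfade(a, b, sample_rate=SAMPLE_RATE, fade_ms=FADE_MS):
    """Join two sample lists, overlapping the tail of `a` with the head of `b`."""
    n = min(sample_rate * fade_ms // 1000, len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    out = list(a[:len(a) - n])
    for i in range(n):
        theta = (i + 0.5) / n * math.pi / 2
        fade_out, fade_in = math.cos(theta), math.sin(theta)
        # cos^2 + sin^2 == 1, so summed power stays constant across the fade.
        out.append(a[len(a) - n + i] * fade_out + b[i] * fade_in)
    out.extend(b[n:])
    return out


chunk_a = [1.0] * 1000
chunk_b = [1.0] * 1000
merged = concat_with_crossfade(chunk_a, chunk_b)
print(SAMPLE_RATE * FADE_MS // 1000)  # 192 samples in the fade window
print(len(merged))                    # 1808 = 1000 + 1000 - 192
```

An equal-power curve is the usual choice when the two segments are uncorrelated, as independently synthesized chunks are; a linear fade would dip in perceived loudness at the midpoint.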

**Splitter policy**: hard enders (`. ! ? 。 ! ? \n`) commit always; soft
enders (`, 、 ; : ; ,` + ASCII space) commit only at-or-past budget;
force-split at +30 token overshoot if no natural boundary exists.
`defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap
minus a typical 60–90-token speech-prompt context). Token-rate heuristic
is calibrated against minimax-zh + minimax-yue runs:

| Char class | Tokens / char | Rationale |
|------------|---------------|-----------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
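
The splitter policy and the token-rate table can be sketched together in Python. This is a hedged illustration, not the Swift `CosyVoice3TextChunker`: the exact punctuation sets (including the fullwidth forms), the single CJK code-point range, and all names below are assumptions. Only the per-class rates, the 110-token budget, and the +30 overshoot come from the text.

```python
HARD_ENDERS = set(".!?。！？\n")       # always commit (assumed set)
SOFT_ENDERS = set(",;:、，；： ")       # commit only at or past budget (assumed set)
MAX_SPEECH_TOKENS = 110
FORCE_OVERSHOOT = 30


def tokens_per_char(ch):
    """Heuristic speech-token cost of one character."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs only (simplification)
        return 7.5
    if cp < 0x80:                # ASCII
        return 1.5
    return 2.5                   # everything else


def estimate_tokens(text):
    return sum(tokens_per_char(c) for c in text)


def split_text(text, budget=MAX_SPEECH_TOKENS, overshoot=FORCE_OVERSHOOT):
    """Split `text` into chunks following the hard / soft / force-split policy."""
    chunks, current, cost = [], [], 0.0

    def commit():
        nonlocal current, cost
        piece = "".join(current).strip()
        if piece:
            chunks.append(piece)
        current, cost = [], 0.0

    for ch in text:
        current.append(ch)
        cost += tokens_per_char(ch)
        if ch in HARD_ENDERS:
            commit()                      # hard enders always commit
        elif ch in SOFT_ENDERS and cost >= budget:
            commit()                      # soft enders commit once over budget
        elif cost >= budget + overshoot:
            commit()                      # no natural boundary: force split
    commit()
    return chunks


print(estimate_tokens("abc"))     # 4.5
print(split_text("字" * 30))      # force-split after 19 characters
```

Because 19 CJK characters cost 142.5 estimated tokens, the first chunk of an unpunctuated run is committed at the 19th character, which is the first point at or past the 140-token force-split threshold.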

**Validation** on full `minimax-cantonese` (100 phrases, M2):

| Metric | Pre-chunker | Post-chunker | Δ |
|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |

The TTFT regression is the cost of running multiple synth passes per
long phrase — splitting unblocks long-form output at the price of
wall-clock latency. The 5/100 residual truncation is the long-tail
token-rate worst case (some chars hit ~9 tokens/char); raising the
per-CJK heuristic further would over-fragment short phrases. Cleaner fix
is the Flow re-export.

16-test suite covers tokenization estimates, hard/soft/force-split
policy, and the crossfade arithmetic. Lives in
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift`
+ `CosyVoice3TtsManager.concatWithCrossfade`.

#### Magpie streaming TTFT wire-up (`ace0bf485`)
`TtsBenchmarkCommand.swift` now drives Magpie through
`MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first
`MagpieAudioChunk` emit instead of conflating it with full-synth wall
time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6
s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier
than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run
benefit; fundamentals unchanged).

#### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`)
`text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT:
tensor_buffer has known strides while the model has FlexibleShapeInfo`.
The CoreML runtime rejects two access patterns on outputs from a
flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element
subscripts — and the original `sliceFirstAxis2D` helper used both. Fix
rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling
`.float32`, `.float16`, `.double`) and computes the flat index from the
known `(1, leading, trailing)` row-major layout. Verified on full
100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS
demo voice.
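
The flat-index arithmetic behind the rewritten helper is easy to show outside Swift. The Python sketch below is an analogue, not the actual `sliceFirstAxis2D`: it decodes a raw little-endian buffer with the standard `struct` module instead of `dataPointer.bindMemory(...)`, and the function signature is an assumption. What it shares with the fix is that element `(0, r, c)` of a `(1, leading, trailing)` row-major array sits at flat index `r * trailing + c`, so no stride query or element subscript is needed.

```python
import struct


def slice_first_axis_2d(raw, leading, trailing, fmt="f"):
    """Decode a (1, leading, trailing) row-major buffer into `leading` rows.

    `fmt` is a struct code: "f" = float32, "e" = float16, "d" = double.
    """
    itemsize = struct.calcsize("<" + fmt)
    count = leading * trailing
    if len(raw) != count * itemsize:
        raise ValueError("buffer size does not match the declared shape")
    flat = struct.unpack(f"<{count}{fmt}", raw)
    # Row r occupies flat indices [r * trailing, (r + 1) * trailing).
    return [list(flat[r * trailing:(r + 1) * trailing]) for r in range(leading)]


raw32 = struct.pack("<6f", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
print(slice_first_axis_2d(raw32, leading=2, trailing=3))
# [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
```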

#### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`)
After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still
landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than
Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only
--corpus` mode and disproved the silent-vocab-drop hypothesis: only
**0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII
hyphens / 12247 scalars).

Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2
share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2
LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized**
LibriTTS — predating misaki by years. The 178-vocab accepts both forms
(e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic
embeddings for the misaki ligature glyphs are essentially untrained
noise.

Side-by-side comparison against locally-installed `espeak-ng -v en-us
--ipa -q` flagged four systematic divergences:

| misaki | espeak-ng | example                  |
|--------|-----------|--------------------------|
| `ʧ`    | `tʃ`      | choice → `tʃˈɔɪs`        |
| `ʤ`    | `dʒ`      | jump   → `dʒˈʌmps`       |
| `ɜɹ`   | `ɝ`       | girl   → `ɡˈɝl`          |
| `əɹ`   | `ɚ`       | over   → `ˈoʊvɚ`         |

Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish` and applied to the assembled phoneme string after
every word has been emitted by the BART G2P. Lives alongside the
existing per-piece misaki diphthong remap. Result on the same 100-phrase
MiniMax-English run with the same `libritts_696` voice and same Parakeet
TDT roundtrip:

| Metric          | Pre   | Post  | Δ      |
|-----------------|-------|-------|--------|
| Macro WER       | 0.581 | 0.440 | −24.2% |
| Macro CER       | 0.476 | 0.241 | −49.5% |
| TTFT p50 (ms)   | 8937  | 6671  | −25.4% |
| Agg RTFx        | 2.36× | 2.72× | +15.3% |
| Peak RSS (MB)   | 1428  | 963   | −32.6% |

Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice.
Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster
on word-level G2P misses from the BART itself (`practical →
practicckles`, `separation → expiration`) and diffusion-sampler formant
breaks; closing the rest of the gap to Kokoro likely needs richer espeak
coverage or libespeak-ng vendor — tracked separately.
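
The four-rule remap described above amounts to ordered string replacement on the assembled phoneme string. Here is a minimal Python sketch; the real code lives in Swift in `StyleTTS2Phonemizer.phonemize`, and the function name below is hypothetical. The misaki-style inputs in the example were constructed by reversing the rules in the divergence table, so treat them as illustrative rather than as actual G2P output.

```python
# (misaki form, espeak-ng form), taken from the divergence table above.
MISAKI_TO_ESPEAK = [
    ("ʧ", "tʃ"),
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),
    ("əɹ", "ɚ"),
]


def remap_to_espeak(phonemes):
    """Apply the post-pass remap to a fully assembled phoneme string."""
    for misaki, espeak in MISAKI_TO_ESPEAK:
        phonemes = phonemes.replace(misaki, espeak)
    return phonemes


print(remap_to_espeak("ʧˈɔɪs"))    # tʃˈɔɪs  (choice)
print(remap_to_espeak("ˈoʊvəɹ"))   # ˈoʊvɚ   (over)
```

The rules do not overlap (no rule's output contains another rule's input), so the order of application does not change the result.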

#### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`)
`StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit
`logger.warning` beta notices mirroring the existing
`CosyVoice3TtsManager.initialize` pattern. Backends docs (`Magpie.md`
Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️
Beta / experimental` callouts so the perf / quality posture is visible
at every entry point — runtime, manager docstring, doc top, PR body.

#### Magpie `outputBackings` rejection fallback (`72dae8400` +
`9767e1ef9`)
The shipped `decoder_step.mlmodelc` reaches the user before the rebuild
lands, so CoreML can reject our `outputBackings` dictionary on a
name-mismatch. On the first rejection the manager latches a fallback
flag and uses fresh-alloc decode for the rest of the run, so the model
still runs.
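
A latched fallback of this kind can be sketched in a few lines of Python. Everything here is hypothetical scaffolding (the exception type, the class, and the two callables stand in for the CoreML prediction paths); it only illustrates the control flow of "try the fast path until it fails once, then never try it again".

```python
class BackingsRejected(Exception):
    """Stand-in for the runtime refusing the preallocated output buffers."""


class LatchedDecoder:
    def __init__(self, fast_step, fallback_step):
        self.fast_step = fast_step          # uses preallocated backings
        self.fallback_step = fallback_step  # allocates fresh outputs
        self.use_fallback = False           # the latch
        self.warnings_logged = 0

    def step(self, x):
        if not self.use_fallback:
            try:
                return self.fast_step(x)
            except BackingsRejected:
                self.use_fallback = True    # latch for the rest of the run
                self.warnings_logged += 1   # one-shot warning
        return self.fallback_step(x)


fast_calls = []


def rejecting_fast_step(x):
    fast_calls.append(x)
    raise BackingsRejected("output name mismatch")


decoder = LatchedDecoder(rejecting_fast_step, lambda x: x * 2)
results = [decoder.step(i) for i in range(3)]
print(results)          # [0, 2, 4]
print(len(fast_calls))  # 1: the fast path is attempted only once
```

Latching avoids paying the rejection cost on every decode step and keeps the log to a single warning.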

### Cohere ASR backend in the harness (`8e741e659`)

Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER
through the harness against [Cohere
Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into
`--skip-asr`. Four new flags on `tts-benchmark`:

- `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip
engine. Default is `parakeet` for English-only runs and skipped for
CosyVoice3.
- `--cohere-model-dir <path>` — path to a directory containing
`cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
and `vocab.json`.
- `--asr-language <code>` — overrides the inferred language code (covers
all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh,
ko, vi).
- `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins
`MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu`
when the q8 encoder fails ANE compilation (`MILCompilerForANE error:
failed to compile ANE model using ANEF`) to skip the multi-minute
fallback compile on the first call. The harness logs a WER caveat for
zh/ja runs flagging that whitespace-tokenized WER is meaningless and the
CER column is the real signal.

Example end-to-end:
```bash
fluidaudio tts-benchmark \
    --backend cosyvoice3 \
    --corpus minimax-chinese \
    --asr-backend cohere \
    --cohere-model-dir /path/to/cohere/q8 \
    --asr-language zh \
    --output-json benchmark_results/cv3-zh-cohere.json \
    --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure
(`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently
falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per
`Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is
unaffected (same graph, same output), only latency. The full 100-phrase
CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was
therefore produced via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100
phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5%
CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` /
`prose-zh` shipped with the first cut of this PR with the [MiniMax
Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
(CC-BY-SA-4.0; 100 phrases × 25 languages). Same public corpus used by
[MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and
Gradium — numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in
`Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an
on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them
from the upstream HF dataset at the pinned revision and writes them to
the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth
(HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no
hardcoded asset URLs. The `.txt` files now live in `.gitignore` since
they're CC-BY-SA-4.0 derivative content; only
`Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER
caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in
`ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior
`python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend
language scope:

| Backend | Languages benchmarked |
|---|---|
| Kokoro / Kokoro ANE | en (af_heart) |
| PocketTTS | en + de + it + pt + es + fr |
| Magpie | en + es + de + fr + it + vi + zh + hi |
| StyleTTS2 | en (LibriTTS multi-spk) |
| CosyVoice3 | zh + yue |

### PocketTTS streaming TTFT (`c26f1e163`)
PocketTTS now drives the harness through its `synthesizeStreaming` API
so TTFT measures time-to-first-80ms-frame instead of full one-shot
synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage
that one-shot benchmarking previously hid.
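
The difference between TTFT and full-synth time comes down to where the timer stops. The Python sketch below shows the measurement pattern against a fake streaming backend. It is not `TtsBenchmarkCommand.swift`: `measure_stream` and `fake_backend` are hypothetical names, and the per-frame sleep is a stand-in for model work. The 1920-sample frame matches the 80 ms @ 24 kHz frame size stated elsewhere in this PR.

```python
import time


def measure_stream(chunks):
    """Return (ttft_ms, synth_ms, frame_count) for an iterable of audio chunks."""
    start = time.perf_counter()
    ttft_ms = None
    frames = 0
    for _ in chunks:
        if ttft_ms is None:
            # Stop the TTFT clock at the first yielded chunk.
            ttft_ms = (time.perf_counter() - start) * 1000
        frames += 1
    synth_ms = (time.perf_counter() - start) * 1000
    return ttft_ms, synth_ms, frames


def fake_backend(n_frames, frame_cost_s=0.002):
    """Hypothetical streaming backend yielding 80 ms frames (1920 samples @ 24 kHz)."""
    for _ in range(n_frames):
        time.sleep(frame_cost_s)   # stand-in for per-frame model work
        yield [0.0] * 1920


ttft_ms, synth_ms, frames = measure_stream(fake_backend(5))
print(frames, ttft_ms <= synth_ms)
```

For a one-shot backend the iterable yields exactly once, so the same function reports `ttft_ms` approximately equal to `synth_ms`, which is the behaviour the harness documents for non-streaming backends.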

### Reference voice dumper helper (mobius-styletts2)
`mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo)
wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to
dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize`
consumes via `--voice`. Required because the shipped CoreML bundle
doesn't include those upstream-only PyTorch encoders.

## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x`
produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en
- [x] StyleTTS2 — full 100/100 minimax-en (verified after
`sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue
(verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error,
LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on
`initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`;
cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`,
`TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated — confirmed structural
(250-token Flow input shape, 40 ms / token); surfaced via
`finishedOnEos` + warning log + JSON `finished_on_eos` field. See
[Decode budget
cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site
workaround. Validated on full minimax-cantonese: truncation **80/100 →
5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×.
16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3
auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in
`TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50
**9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier
playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` +
`--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`).
Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8
macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this
M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04%
— both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via
`whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over
all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**.
Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote
‡)
Alex
2026-05-01 09:09:42 -04:00
committed by GitHub
parent b5d8017d1f
commit 7603ac6733
33 changed files with 3672 additions and 254 deletions
+5
@@ -11,6 +11,7 @@ xcuserdata/
*.hmap
*.txt
!Benchmarks/**/*.txt
## App packaging
*.ipa
@@ -104,6 +105,10 @@ Resources/
scripts/
!Scripts/parakeet_subset_benchmark.sh
!Scripts/diarizer_subset_benchmark.sh
# MiniMax TTS corpus is CC-BY-SA-4.0 derivative content fetched on demand
# via `fluidaudio minimax-corpus`; only the README is checked in.
Benchmarks/tts/corpus/minimax/*.txt
Documentation/parakeet-tdt/
docs/parakeet-tdt/
+343
@@ -0,0 +1,343 @@
# TTS Benchmarks
> **Setup:** MacBook Air M2 (2022), 16 GB, macOS 26, on AC.
> **Corpus:** [MiniMax Multilingual TTS Test Set][minimax] (100
> phrases / language, CC-BY-SA-4.0) — the same public corpus used
> by [MiniMax-Speech][mms], seed-tts-eval, and Gradium, so numbers
> here are directly paper-comparable.
> **Status:** Kokoro, Kokoro ANE, PocketTTS, Magpie, StyleTTS2 all
> complete the English run; CosyVoice3 completes the full Mandarin
> run.
>
> [minimax]: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
> [mms]: https://arxiv.org/abs/2505.07916
## Why not just RTFx?
RTFx (audio_seconds / synth_seconds) is a useful single number for batch
synthesis, but for conversational use it hides the things users actually
feel:
1. **Cold start** — first model load + ANE compile after install or
reboot. On Apple Silicon the system's `anecompilerservice` can take
tens of seconds on first invocation; subsequent loads finish in ~1 s.
2. **TTFT (time-to-first-audio)** — for streaming agents the question
is "how long until the user hears *something*", not "how long until
the whole utterance is rendered". For one-shot backends in this
slice `ttft_ms == synth_ms`. **PocketTTS** and **Magpie** are
wired through their respective streaming APIs (`synthesizeStreaming`
/ `synthesizeStream`), so their `ttft_ms` is honest first-frame
latency.
3. **Per-stage compute units** — Kokoro ANE / Magpie are pipelines of
6–7 graphs. Sometimes ANE is *slower per call* but more efficient.
The "right" compute-unit choice differs per stage.
4. **Memory footprint** — drives whether a backend is mobile-viable.
5. **Quality** — RTFx alone tells you nothing about whether the model
pronounced "Reykjavík" or "$1,234.56" correctly. We measure WER +
CER via Parakeet roundtrip on a fixed English corpus; non-English
backends run with `--skip-asr` for now.
## Methodology
### Corpus
All shipped corpora come from the **MiniMax Multilingual TTS Test
Set** (`MiniMaxAI/TTS-Multilingual-Test-Set` on Hugging Face,
CC-BY-SA-4.0). The fetched files land under
`Benchmarks/tts/corpus/minimax/<lang>.txt` (24 languages × 100 phrases
= 2400 phrases) and are gitignored — populate them on demand with
`swift run fluidaudio minimax-corpus`. Attribution, revision pin,
and WER caveats live in [`MinimaxCorpus.md`](MinimaxCorpus.md).
Reference each language as `--corpus minimax-<lang>`:
| Backend | Default corpus | Other supported MiniMax languages |
|-------------|--------------------|------------------------------------------------|
| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) |
| PocketTTS | `minimax-english` | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` |
| StyleTTS2 | `minimax-english` | `english` only (LibriTTS multi-speaker) |
| Magpie | `minimax-english` | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` |
| CosyVoice3 | `minimax-chinese` | `chinese`, `cantonese` |
Lines beginning with `#` are comments. Custom corpora can still be
passed with `--corpus-path <file.txt>`.
### Metrics
Per phrase:
- `ttft_ms` — time-to-first-audio. For one-shot backends this equals
`synth_ms`. **PocketTTS** is benchmarked through
`synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first
80 ms audio frame (1920 samples @ 24 kHz). **Magpie** is benchmarked
through `synthesizeStream`, so its `ttft_ms` is the first
`MagpieAudioChunk` emit time (typically ~9.6 s on M2 vs ~15 s for
full synth).
- `synth_ms` — total synth wall time.
- `audio_ms` — generated audio duration.
- `rtfx` = `audio_ms / synth_ms`.
- `wer`, `cer` — via Parakeet ASR roundtrip on the rendered WAV.
- `stage_ms` — per-stage breakdown (backend-specific keys; populated
for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS /
StyleTTS2 / CosyVoice3).
- Backend-specific extras: `encoder_tokens`, `acoustic_frames`,
`chunk_count`, `frame_count`, `code_count`, `finished_on_eos`,
`generated_token_count`, etc.
Aggregates:
- `cold_start_s` = `manager.initialize()` wall time. CosyVoice3 also
includes voice-asset load.
- `first_synth_ms` — first synth call after init (still cold-ish).
- `ttft_ms_p50` / `ttft_ms_p95`.
- `warm_synth_ms_p50` / `warm_synth_ms_p95`.
- `agg_rtfx` = `Σ audio_ms / Σ synth_ms` across the corpus.
- `peak_rss_mb` — process-wide peak resident set, via
`task_vm_info_data_t.resident_size_peak`.
- Per-category macro WER / CER.
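
As an illustration of how these aggregates behave, the Python sketch below computes `agg_rtfx` as defined above and contrasts it with a plain mean of per-phrase RTFx. The two-phrase input is made up, and the nearest-rank percentile is an assumption: the harness's actual percentile method is not specified here.

```python
import math


def agg_rtfx(phrases):
    """Σ audio_ms / Σ synth_ms, as defined in the Aggregates list."""
    return sum(p["audio_ms"] for p in phrases) / sum(p["synth_ms"] for p in phrases)


def mean_rtfx(phrases):
    """Unweighted mean of per-phrase RTFx, shown only for contrast."""
    return sum(p["audio_ms"] / p["synth_ms"] for p in phrases) / len(phrases)


def percentile(values, q):
    """Nearest-rank percentile (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up data: one short fast phrase, one long slow phrase.
phrases = [
    {"audio_ms": 2_000, "synth_ms": 1_000},    # per-phrase RTFx 2.0
    {"audio_ms": 20_000, "synth_ms": 40_000},  # per-phrase RTFx 0.5
]
print(round(agg_rtfx(phrases), 3))   # 0.537
print(mean_rtfx(phrases))            # 1.25
print(percentile(range(1, 11), 50))  # 5
print(percentile(range(1, 11), 95))  # 10
```

The aggregate is dominated by the phrases that take longest to synthesize, which is why long-tail phrases pull `agg_rtfx` down more than they would pull down a simple average.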
### Reproducibility
```bash
# From the package root.
swift run fluidaudio tts-benchmark \
--backend kokoro-ane \
--corpus minimax-english \
--voice af_heart \
--compute-units default \
--output-json bench.json \
--audio-dir bench-wavs/
```
The harness writes a JSON report to `--output-json` and (optionally)
keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR
roundtrip. The default ASR backend is `parakeet` for English-only
runs and is skipped for CosyVoice3; pass `--asr-backend cohere
--cohere-model-dir <dir>` to score Mandarin (or any of the 14
Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).
## Results
### Per-backend top-line
Reference machine: **MacBook Air, Apple M2 (2022), 8-core CPU /
8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26**
(`Mac14,2`, on AC). All English runs use `--compute-units default`,
voice = backend default
(`af_heart` for Kokoro, `alba` for PocketTTS, `John` for Magpie),
corpus = `minimax-english` (100 phrases), Parakeet TDT roundtrip for
WER / CER.
| Backend | License | Languages | Footprint | Cold start | TTFT p50 / p95\* | Synth p50 / p95 | Agg RTFx | Peak RSS | WER | CER | Notes |
|-------------|-------------|------------------------|-----------|------------|---------------------|---------------------|----------|----------|---------|---------|-------|
| Kokoro ANE | Apache-2.0 | en (af_heart only) | ~330 MB | 37.9 s | 1586 / 2515 ms | 1586 / 2515 ms | 5.19× | 738 MB | 0.108 | 0.040 | one-shot; per-stage CU sweep, 7-graph pipeline |
| Kokoro | Apache-2.0 | en (af_heart only) | ~330 MB | 92.2 s | 3113 / 4696 ms | 3113 / 4696 ms | 2.02× | 736 MB | 0.013 | 0.005 | one-shot; cleanest English ASR roundtrip |
| PocketTTS | research | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | **1244 / 4749 ms** | 8757 / 19174 ms | 0.61× | 1503 MB | 0.014 | 0.006 | **streaming**; TTFT is first 80 ms audio frame |
| StyleTTS2 | MIT | en (LibriTTS multi-spk) | ~280 MB | 955 s§ | 6671 / 15990 ms§ | 6671 / 15990 ms§ | 2.72×§ | 963 MB§ | 0.440§ | 0.241§ | full 100/100 `minimax-english` via [misaki→espeak post-pass remap](#styletts2-misaki--espeak-post-pass-remap); ref_s = LibriTTS `696_92939_000016_000006.wav` (StyleTTS2 demo voice) |
| Magpie | research | en/es/de/fr/it/vi/zh/hi | ~1.3 GB | 38.5 s∥ | **9580 / 23796 ms**∥ | 15080 / 29895 ms∥ | 0.64×∥ | 762 MB∥ | 0.056 | 0.033 | **streaming TTFT**: first audio chunk at 9.6 s p50 on M2 (full synth 15.1 s); split-K/V decoder; outputBackings fast path with latched fallback |
| CosyVoice3 | Apache-2.0 | zh (mandarin) | ~1.5 GB | 29.2 s† | 14091 / 23679 ms† | 14091 / 23679 ms† | 0.357׆ | 3302 MB† | n/a‡ | 0.017‡ | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via [auto-chunker](#cosyvoice3-auto-chunker) but not benchmarked (no yue ASR) |
\* TTFT for **PocketTTS / Magpie** is first-frame emit through the
streaming API; the others are one-shot, so `ttft_ms == synth_ms`.
† CosyVoice3 chinese: 100/100, 0 errors, ASR skipped. Cold-start
dropped from 302.7 s to 29.2 s on the warm re-run.
‡ CosyVoice3 CER measured on the **full 100-phrase**
`minimax-chinese` corpus via `whisper-large-v3` (Python CPU FP32,
[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py)) on
the WAVs rendered by `tts-benchmark --backend cosyvoice3 --corpus
minimax-chinese --skip-asr --audio-dir <dir>`: **macro CER 1.68%
(0.0168)**, **micro CER 1.84% (0.0184)** across 100 phrases.
Whisper is the source of truth here because Cohere Transcribe q8
hit a `MILCompilerForANE` cache failure on this M2 host and ran on
the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple
hours for the full 100-phrase set vs. ~70 min for whisper). WER is
omitted because Mandarin has no word boundaries and `WERCalculator`
splits on whitespace, so word-level WER reads near 100% and is
meaningless.
∥ Magpie: streamed via `synthesizeStream`. TTFT (9.6 s p50) is
first-chunk emit; synth (15.1 s p50) is full-utterance wall time —
the 5.5 s gap is the streaming win.
§ StyleTTS2 (**beta** — `StyleTTS2Manager.initialize` emits a
runtime warning): warm-cache run; first cold compile of the
bucketed text_predictor / diffusion_step / decoder graphs is
multi-second. ref_s dumped via
[`06_dump_ref_s.py`](https://github.com/voicelink-ai/mobius-styletts2/blob/main/models/tts/styletts2/scripts/06_dump_ref_s.py).
Read WER **relatively** per the
[WER caveat](#about-the-wer--cer-numbers); StyleTTS2's own demo
notebook reports artifacts on long sentences at default
`alpha/beta/diffusion_steps`.
### Kokoro ANE — per-stage breakdown (default preset, MiniMax-English)
Means across 100 `minimax-english` phrases on M2. Stages map to the
7-CoreML-graph split documented in [KokoroAne.md](KokoroAne.md). Vocoder
+ noise together account for ~92% of synth time, which is the natural
target for any further per-stage compute-unit re-tuning. The MiniMax
mean is meaningfully higher than the prior Harvard-sentences run
because phrases 81–100 are paragraph-length news / story sentences.
| Stage | Mean ms | % of total |
|---------------|---------|------------|
| `albert` | 28.2 | 2.0% |
| `post_albert` | 12.1 | 0.9% |
| `alignment` | 1.8 | 0.1% |
| `prosody` | 49.2 | 3.5% |
| `noise` | 242.6 | 17.4% |
| `vocoder` | 1039.8 | 74.4% |
| `tail` | 24.6 | 1.8% |
| **total** | 1398.4 | 100% |
### Magpie — per-stage breakdown (default preset, MiniMax-English)
Means across 100 `minimax-english` phrases on M2 (`John` voice, en,
default compute units), captured during the original one-shot
profiling run. `ar_loop` is the umbrella for the per-step
`decoder_step` + `sampler` (so it is not added on top in the total).
`nanocodec` runs concurrently with the AR loop in chunked-streaming
mode, which is why the per-stage means do not sum to total warm-synth
mean. The AR loop dominates the wall clock, and its cost grows
super-linearly with phrase length — long news / story phrases drive
the long-tail p95.
| Stage | Mean ms |
|--------------------|---------|
| `text_encoder` | 91 |
| `prefill` | 281 |
| `ar_loop` | 17946 |
| └── `decoder_step` | 14840 |
| └── `sampler` | 3081 |
| `nanocodec` | 17948 |
### About the WER / CER numbers
The MiniMax corpus mixes short conversational phrases, medium news
headlines, and long narrative paragraphs. WER on the long tail is
sensitive to the ASR + text-normalizer stack (e.g. `"3,5%"` →
`"three point five percent"` vs. `"three and a half percent"`); per
the [upstream community
discussion](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10),
absolute WER is best read **relatively** (backend A vs. backend B on
the same corpus + same ASR + same normalizer) rather than against
raw paper numbers.
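
To make the WER vs. CER distinction concrete, here is a minimal sketch of both metrics built on a plain Levenshtein distance. It is not the project's `WERCalculator`; the function names and the sample sentence are assumptions for illustration. It shows why whitespace-split WER saturates on unsegmented text while CER still reflects a single wrong character.

```swift
import Foundation

/// Levenshtein edit distance between two token sequences.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    for i in 1...a.count {
        var curr = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            // deletion, insertion, substitution
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = curr
    }
    return prev[b.count]
}

/// Error rate = edit distance / reference length.
func errorRate<T: Equatable>(ref: [T], hyp: [T]) -> Double {
    ref.isEmpty ? 0 : Double(editDistance(ref, hyp)) / Double(ref.count)
}

// Hypothetical sample: one substituted character out of six.
let reference = "今天天气很好"
let hypothesis = "今天天汽很好"

// Word level: with no spaces, each side is a single "word".
let wer = errorRate(
    ref: reference.split(separator: " ").map(String.init),
    hyp: hypothesis.split(separator: " ").map(String.init))

// Character level: six reference characters, one substitution.
let cer = errorRate(ref: Array(reference), hyp: Array(hypothesis))

print("WER:", wer, "CER:", cer)
```

The same `errorRate` helper serves both metrics because only the tokenization differs, which is also why the numbers are only comparable when the tokenizer and normalizer are held fixed.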
## StyleTTS2 misaki → espeak post-pass remap
StyleTTS2's LibriTTS checkpoint was trained on **espeak-ng-phonemized**
text, but the in-tree BART G2P (shared with Kokoro) emits **misaki**
output. The 178-token vocab accepts both forms, but the acoustic
embeddings for the misaki ligature glyphs are essentially untrained
noise — every training utterance saw the espeak form.
Four systematic divergences vs. `espeak-ng -v en-us --ipa -q`:
| misaki | espeak-ng | example |
|--------|-----------|--------------------------|
| `ʧ` | `tʃ` | choice → `tʃˈɔɪs` |
| `ʤ` | `dʒ` | jumps → `dʒˈʌmps` |
| `ɜɹ` | `ɝ` | girl → `ɡˈɝl` |
| `əɹ` | `ɚ` | over → `ˈoʊvɚ` |
Fix: 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish`. Result on `minimax-english`: WER 0.581 →
0.440, CER 0.476 → 0.241, agg-RTFx 2.36× → 2.72× (warm-cache
re-run, so latency / RSS deltas are noise — WER / CER are the real
signal). WER is still 30× worse than Kokoro; remaining errors cluster
on word-level BART mispronunciations and long-tail diffusion artifacts.
Further gains likely need a richer remap layer or swapping BART for
libespeak-ng directly.
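
The four rewrites in the table can be expressed as a small ordered substitution pass. This is an illustrative sketch, not the code in `StyleTTS2Phonemizer.phonemize`: the helper name `remapToEspeak` is hypothetical, and the `.americanEnglish` gate is left out for brevity.

```swift
import Foundation

// The four misaki → espeak-ng rewrites from the table above.
let misakiToEspeak: [(misaki: String, espeak: String)] = [
    ("ʧ", "tʃ"),
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),
    ("əɹ", "ɚ"),
]

/// Hypothetical helper: applies each rule in turn to a phoneme string.
func remapToEspeak(_ phonemes: String) -> String {
    misakiToEspeak.reduce(phonemes) { text, rule in
        text.replacingOccurrences(of: rule.misaki, with: rule.espeak)
    }
}

print(remapToEspeak("ʧˈɔɪs"))
print(remapToEspeak("ˈoʊvəɹ"))
```

Because none of the replacement strings contains another rule's source glyph, the order of the four rules does not change the result.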
## CosyVoice3 Decode budget cap
CosyVoice3's Flow CFM was exported with a fixed input shape of
`[1, 250]` speech tokens (`flowTotalTokens` in
`CosyVoice3Constants.swift:45`). The LLM-Decode AR loop is allowed to
emit up to `flowTotalTokens − N_prompt` tokens before being cut off
(typically ~163 generated tokens after the speech-prompt portion).
At `tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000`
that's **40 ms of audio per generated token**, so the loop produces
**at most ~6.5 s of speech per phrase**, regardless of how long the
input text is.
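
The arithmetic behind the cap can be checked with a few lines. The constants are the ones quoted above; the 87-token prompt length is an assumption, picked only because 250 − 87 matches the ~163 generated tokens mentioned in the text.

```swift
let flowTotalTokens = 250        // fixed Flow input shape [1, 250]
let tokenMelRatio = 2
let hiftSamplesPerFrame = 480
let sampleRate = 24_000

// Assumption: an 87-token speech prompt (not stated in the source).
let promptTokens = 87

// Audio seconds contributed by one generated speech token.
let secondsPerToken = Double(tokenMelRatio * hiftSamplesPerFrame) / Double(sampleRate)

// Budget left for the LLM-Decode AR loop, and the audio it can cover.
let maxNewTokens = flowTotalTokens - promptTokens
let maxAudioSeconds = Double(maxNewTokens) * secondsPerToken

print(secondsPerToken, maxNewTokens, maxAudioSeconds)
```

A longer prompt shrinks `maxNewTokens` one-for-one, so the per-call audio ceiling depends on the voice prompt as well as on the Flow export shape.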
When the AR loop exits because it ran out of budget (i.e. no EOS
token in `stopRange = 6_561…6_760`) instead of natural termination,
`CosyVoice3Synthesizer` now:
1. Logs a `.warning` (one-shot per phrase) naming the
`decoded.count / maxNew` budget and the produced audio duration.
2. Sets `CosyVoice3SynthesisResult.finishedOnEos = false`, which the
benchmark harness surfaces as the `finished_on_eos` field on each
phrase in the JSON report.
Footprint on the cantonese corpus (`minimax-cantonese`,
100 phrases) **without the chunker**: 80 / 100 phrases would hit the
cap, all producing exactly 163 generated tokens / ~6.5 s of audio.
The mandarin corpus sees a much lower truncation rate because
MiniMax-zh phrases are shorter on average.
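
One way to reproduce a truncation count such as the 80 / 100 above from a report file is sketched below. Only the field name `finished_on_eos` comes from the text; the top-level `phrases` array, the struct names, and the sample JSON are assumptions for illustration and would need adjusting to the real report layout.

```swift
import Foundation

// Hypothetical report shape: only `finished_on_eos` is taken from the docs.
struct PhraseReport: Decodable {
    let finishedOnEos: Bool

    enum CodingKeys: String, CodingKey {
        case finishedOnEos = "finished_on_eos"
    }
}

struct Report: Decodable {
    let phrases: [PhraseReport]   // assumed key name
}

/// Counts phrases whose AR loop ran out of budget before emitting EOS.
func truncatedCount(in json: Data) throws -> Int {
    let report = try JSONDecoder().decode(Report.self, from: json)
    return report.phrases.filter { !$0.finishedOnEos }.count
}

// Made-up sample data, not real benchmark output.
let sample = """
{"phrases": [{"finished_on_eos": true}, {"finished_on_eos": false}, {"finished_on_eos": false}]}
"""
let truncated = try truncatedCount(in: Data(sample.utf8))
print("\(truncated) truncated phrase(s)")
```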
The structural fix — re-exporting the Flow CFM from
[`mobius-cosyvoice3`](https://github.com/voicelink-ai/mobius-cosyvoice3)
with a larger fixed input shape (e.g. `[1, 500]`) — is upstream
work; bumping the constant in Swift alone would make the Flow
input/output shapes mismatch at predict time. The shipped workaround
is the call-site [auto-chunker](#cosyvoice3-auto-chunker), which
drops cantonese truncation from 80/100 → 5/100 by splitting long
inputs at clause boundaries and crossfading the results.
Surfaced in
`CosyVoice3Synthesizer.synthesize`
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift`)
and
`CosyVoice3SynthesisResult.finishedOnEos`
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift`).
## CosyVoice3 auto-chunker
Re-exporting Flow CFM with a larger fixed input shape is gated on
upstream conversion work. Until that lands, `CosyVoice3TtsManager`
splits long inputs at the call site, synthesizes each chunk
independently, and merges with an 8 ms cosine crossfade.
**Splitter policy** (`CosyVoice3TextChunker`):
- **Hard enders** commit always: `.`, `!`, `?`, `。`, ``, ``,
`\n`.
- **Soft enders** commit only when the running estimate is at or past
the budget: ``, `、`, ``, ``, `;`, `,`, ASCII space.
- **Force-split** at `budget + 30` tokens of overshoot if no natural
boundary appeared (rare; mostly continuous CJK with no
punctuation).
**Token-rate estimate** (calibrated against minimax-zh + minimax-yue
runs):
| Char class | Tokens / char | Rationale |
|------------|---------------|--------------------------------------------------------------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
`defaultMaxSpeechTokens` is **110**, leaving margin under the
250-token Flow cap minus typical 60–90 token speech-prompt context.
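
The splitter policy and token-rate table can be combined into a small greedy chunker. The sketch below is not the shipped `CosyVoice3TextChunker`: the type name is hypothetical, the CJK test uses an assumed code-point range (U+4E00 to U+9FFF), and the ender sets include only the characters legible in the list above.

```swift
import Foundation

/// Illustrative only; mirrors the documented policy, not the real source.
enum ChunkerSketch {
    static let budget = 110.0        // defaultMaxSpeechTokens
    static let overshoot = 30.0      // force-split slack past the budget
    static let hardEnders: Set<Character> = [".", "!", "?", "。", "\n"]
    static let softEnders: Set<Character> = ["、", ";", ",", " "]

    /// Estimated speech tokens for one character, per the table above.
    static func tokens(for ch: Character) -> Double {
        guard let scalar = ch.unicodeScalars.first else { return 0 }
        if (0x4E00...0x9FFF).contains(scalar.value) { return 7.5 }  // assumed CJK range
        if scalar.isASCII { return 1.5 }
        return 2.5
    }

    static func estimate(_ text: String) -> Double {
        text.reduce(0.0) { $0 + tokens(for: $1) }
    }

    static func chunk(_ text: String) -> [String] {
        var chunks: [String] = []
        var current = ""
        var running = 0.0

        // Push the pending segment (if non-blank) and reset the estimate.
        func commit() {
            let trimmed = current.trimmingCharacters(in: .whitespacesAndNewlines)
            if !trimmed.isEmpty { chunks.append(trimmed) }
            current = ""
            running = 0
        }

        for ch in text {
            current.append(ch)
            running += tokens(for: ch)
            if hardEnders.contains(ch) {
                commit()                       // hard enders always commit
            } else if softEnders.contains(ch) && running >= budget {
                commit()                       // soft enders commit at/past budget
            } else if running >= budget + overshoot {
                commit()                       // no boundary seen: force-split
            }
        }
        commit()
        return chunks
    }
}

print(ChunkerSketch.chunk("Hi. Yo!"))
print(ChunkerSketch.chunk(String(repeating: "中", count: 20)).map(\.count))
```

With the 7.5 tokens/char CJK rate, an unpunctuated run is force-split once the estimate reaches 140, which is why the second example breaks after 19 characters.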
**Concatenation**: 8 ms cosine crossfade at 24 kHz between adjacent
chunks (fade-out and fade-in gains sum to 1, i.e. constant gain rather
than equal power); single-chunk path short-circuits to plain copy.
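
As a quick check on the numbers involved, the sketch below derives the fade length in samples, the gain pair at a given overlap position, and the merged length after overlapping. The helper names are hypothetical, and `mergedLength` assumes every chunk is at least twice the fade length so the fade is never clamped.

```swift
import Foundation

let sampleRate = 24_000
let fadeMs = 8.0

// 8 ms at 24 kHz.
let fadeSamples = Int((Double(sampleRate) * fadeMs / 1000).rounded())

/// Gains at position `j` of a `fade`-sample overlap (raised cosine).
func gains(at j: Int, fade: Int) -> (down: Float, up: Float) {
    let t = Float(j) / Float(fade)
    return (0.5 * (1 + cos(Float.pi * t)), 0.5 * (1 - cos(Float.pi * t)))
}

/// Each boundary overlaps by `fade` samples, so it shortens the output by `fade`.
func mergedLength(chunkLengths: [Int], fade: Int) -> Int {
    chunkLengths.reduce(0, +) - fade * max(0, chunkLengths.count - 1)
}

let mid = gains(at: fadeSamples / 2, fade: fadeSamples)
let merged = mergedLength(chunkLengths: [48_000, 72_000], fade: fadeSamples)
print(fadeSamples, mid.down + mid.up, merged)
```

The overlap removes only 192 samples per seam, so the crossfade has a negligible effect on total duration.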
**Validation** (full `minimax-cantonese`, 100 phrases, M2):
| Metric | Pre-chunker | Post-chunker | Δ |
|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |
| TTFT p95 | 41.2 s | 60.5 s | +47% |
| Peak RSS | 2016 MB | 3264 MB | +62% |
The 5/100 residual is the long-tail token-rate worst case (some
Cantonese characters generate >9 speech tokens); raising the
per-CJK heuristic further would over-fragment short phrases.
Cleaner fix is the upstream Flow re-export.
+98 -15
@@ -3,16 +3,19 @@
Mandarin zero-shot voice cloning via Qwen2 LM + CFM Flow + HiFT vocoder,
running on CoreML.
> ⚠️ **Beta / experimental.** End-to-end synthesis is currently slow on
> Apple Silicon — RTFx < 1.0 typical, several seconds of latency for
> short Mandarin utterances. The slowdown is partly the Flow CFM stage
> (fp32, CPU-or-GPU only because fp16 + ANE produces NaNs through the
> fused `layer_norm` — CoreMLTools limitation, tracked upstream) and
> partly HiFT sinegen / windowing ops that fall back to CPU. May be a
> model issue, may be recoverable through better conversion. Treat
> performance numbers as preliminary; the Swift API, model layout, and
> prompt-asset format may change in subsequent releases without
> deprecation aliases.
> ⚠️ **Beta / experimental.** End-to-end synthesis is below real-time
> on Apple Silicon: agg-RTFx **0.357×** and p50 TTFT **~14.1 s** on
> the full `minimax-chinese` 100-phrase corpus (M2, default compute
> units), after the
> [HiFT timeout fix](Benchmarks.md#cosyvoice3-hift-timeout-fix) and
> [LLM-Decode `outputBackings` double-buffer](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix).
> The slowdown is partly the Flow CFM stage (fp32, CPU-or-GPU only
> because fp16 + ANE produces NaNs through the fused `layer_norm` —
> CoreMLTools limitation, tracked upstream) and partly HiFT sinegen
> / windowing ops that fall back to CPU. May be a model issue, may
> be recoverable through better conversion. Treat performance numbers
> as preliminary; the Swift API, model layout, and prompt-asset format
> may change in subsequent releases without deprecation aliases.
## Files
@@ -105,8 +108,9 @@ let result = try await manager.synthesize(
| Field | Default | Notes |
|---|---|---|
| `maxNewTokens` | `nil` (cap = 1024) | Hard ceiling on speech-token count |
| `maxNewTokens` | `nil` (= `flowTotalTokens − N_prompt`) | Soft ceiling on the LLM-Decode AR loop. The hard ceiling is the structural 250-token cap below; `maxNewTokens` only lets you generate fewer tokens than that. |
| `seed` | 42 | Drives the RAS sampler RNG; reproducible runs |
| `disableAutoChunking` | `false` | When `true`, bypasses `CosyVoice3TextChunker` and runs a single synthesizer call regardless of input length. Use when you've pre-segmented input upstream (UI streaming, paragraph-at-a-time playback, etc.). The structural 250-token cap then applies and long inputs truncate mid-utterance. |
`CosyVoice3SynthesisResult`:
@@ -116,13 +120,92 @@ let result = try await manager.synthesize(
| `sampleRate` | `Int` | always 24000 |
| `generatedTokenCount` | `Int` | tokens before EOS |
| `decodedTokens` | `[Int32]` | full speech token sequence (debug) |
| `finishedOnEos` | `Bool` | `true` = AR loop exited on an EOS token (natural termination); `false` = budget exhausted, audio truncated mid-utterance. See "Decode budget cap" below. |
### Decode budget cap + auto-chunking
The Flow CFM model is exported with a fixed-shape `token_total` input of
`[1, 250]` (`CosyVoice3Constants.flowTotalTokens = 250`). Each LLM-Decode
token corresponds to **40 ms of audio** (`tokenMelRatio = 2 × hiftSamplesPerFrame = 480 / sampleRate = 24 000`),
so the *generated* portion of a single synthesizer call is bounded by
`(250 − N_prompt) × 40 ms`. With a typical prompt of ~85–95 tokens,
this leaves ~6.4–6.6 s of generated audio per call, so long Mandarin
phrases would truncate mid-utterance if synthesized in one shot.
**`CosyVoice3TtsManager.synthesize(...)` auto-chunks long input** to
sidestep this. Pipeline:
1. Run the existing Chinese normalizer (or skip it, per `prenormalized`).
2. `CosyVoice3TextChunker.chunk(normalized)` greedily splits on hard
sentence enders (`. ! ? 。 `) and falls back to soft clause
separators (`, ; `) when sentences exceed the budget. The
default budget is `defaultMaxSpeechTokens = 110` speech tokens
(`~45-token margin under the typical 155 room-for-new`; the 30-token
force-split overshoot may push committed chunks to ~140 estimated).
3. If the chunker returns one segment, take the fast path — single
synthesizer call, no concat overhead.
4. Otherwise loop, calling the synthesizer once per chunk, then merge
results: PCM concatenated with an 8 ms cosine cross-fade at each
boundary (masks DC/phase mismatch from independent synth calls);
`generatedTokenCount`/`decodedTokens` summed/concatenated;
`finishedOnEos` = AND across all chunks.
Tunables: `CosyVoice3TextChunker.defaultMaxSpeechTokens` (110) is the
default budget; pass `disableAutoChunking: true` in
`CosyVoice3SynthesisOptions` to bypass the chunker entirely and run a
single call (useful for UI-driven sentence-at-a-time streaming where
the caller already controls segmentation).
Token-rate estimate inside the chunker (calibrated against minimax-zh
corpus runs — initial 5.5 figure was too optimistic and let ~16% of
phrases hit the cap; 7.5 covers the worst-case observed real rate):
| Class | Tokens/char | Rationale |
|---|---|---|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | BPE compresses; English speaks faster than Mandarin per char |
| Other (Latin-1, etc.) | 2.5 | middle ground |
Caveats:
- **Prosody discontinuity at boundaries.** Each chunk re-establishes the
pitch contour from the prompt, so concatenated audio has audible breaks
at chunk seams. The 8 ms cross-fade hides clicks/DC offsets but cannot
reconstruct cross-sentence prosody.
- **Per-chunk prefill cost.** Each segment pays the prefill cost
separately, so total wall-clock for an N-chunk synth is roughly
`N × prefill + Σ decode_per_chunk`. Single-chunk inputs are unaffected.
- **Estimate slack.** The token-per-char heuristic is rough; if a chunk
somehow exceeds the model's structural budget at runtime, the
synthesizer still emits the `LLM-Decode budget exhausted` warning and
returns `finishedOnEos: false` for that chunk.
Behavior of the underlying synthesizer when its budget is hit (still
applies for `disableAutoChunking: true` or for one-shot mode):
- **AR loop exhausts `maxNew` without observing an EOS** in
`CosyVoice3Constants.stopRange` (`6_561…6_760`).
- `CosyVoice3Synthesizer` emits a `.warning`-level log:
`"LLM-Decode budget exhausted: <N> generated tokens / <maxNew> cap (no EOS observed). Output truncated at ~<S>s of audio."`.
- `result.finishedOnEos` is `false` so callers can detect it
programmatically (the `tts-benchmark` harness surfaces this as a
per-phrase `finished_on_eos` field in the JSON report).
Lifting the cap structurally (no auto-chunk, no prosody seams) requires
re-exporting Flow with a larger `token_total` shape (e.g. `[1, 500]` for
~16 s) — handled upstream in the `mobius-cosyvoice3` conversion pipeline;
not changeable from the Swift host.
## Key State
### KV cache (`kv_cache[24, 1, 2, 768, 64]` fp16)
- 24 transformer layers × `[K,V]` × heads × dim, packed into one `MLState`-style
`MLMultiArray` that the prefill produces and the decode loop both reads
and overwrites in-place.
### KV cache (`kv_k` / `kv_v` each `[24, 1, 2, 768, 64]` fp32)
- 24 transformer layers × `[K,V]` × heads × dim, split across two
`MLMultiArray` outputs (`kv_k`, `kv_v`) that prefill produces and the
decode loop carries forward across steps via
`MLPredictionOptions.outputBackings` double-buffering.
- No `MLState` dependency — runs on the package baseline (macOS 14 / iOS 17).
- ~9 MB per array; pre-allocated front/back/spare buffers rotated each
step (see [LLM-Decode `outputBackings` fix](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix)).
- Reset per `synthesize()` call.
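
The "~9 MB per array" figure follows directly from the shape and dtype of the new split cache. A minimal check, assuming fp32 maps to a 4-byte `Float`:

```swift
let shape = [24, 1, 2, 768, 64]                    // kv_k (kv_v is identical)
let bytesPerElement = MemoryLayout<Float>.size     // fp32: 4 bytes

let elements = shape.reduce(1, *)
let bytesPerArray = elements * bytesPerElement
let mibPerArray = Double(bytesPerArray) / (1024 * 1024)

// kv_k + kv_v together are what gets carried from one decode step to the next.
let mibPerStep = 2 * mibPerArray

print(elements, bytesPerArray, mibPerArray, mibPerStep)
```

Keeping front, back, and spare buffers for both arrays therefore costs a few tens of MiB, which is small next to the model weights.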
### Prompt assets (`CosyVoice3PromptAssets`)
+5 -1
@@ -148,7 +148,11 @@ timing (5 s of audio, M1):
| Vocoder | ~120 ms |
| Tail | ~50 ms |
Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx).
Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx). For
full-corpus numbers (warm-synth p50 / p95, peak RSS, WER) on the
MiniMax-English 100-phrase suite — including the longer paragraph
phrases that pull the per-corpus aggregate down to ~5.2× — see
[Benchmarks.md](Benchmarks.md).
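
For readers comparing the per-call and per-corpus figures, RTFx here is seconds of audio produced per second of synthesis. The sketch below assumes the aggregate is total audio divided by total synthesis time (the usual definition, not confirmed by this document); the two-phrase corpus is made up to show how one long phrase pulls the aggregate down.

```swift
/// RTFx: seconds of audio produced per second of wall-clock synthesis.
func rtfx(audioSeconds: Double, synthSeconds: Double) -> Double {
    audioSeconds / synthSeconds
}

/// Assumed aggregate definition: total audio over total synthesis time.
func aggregateRtfx(_ runs: [(audio: Double, synth: Double)]) -> Double {
    let audio = runs.reduce(0.0) { $0 + $1.audio }
    let synth = runs.reduce(0.0) { $0 + $1.synth }
    return audio / synth
}

// Per-call figure quoted above: 5 s of audio in about 300 ms.
let single = rtfx(audioSeconds: 5.0, synthSeconds: 0.300)

// Hypothetical corpus: one short phrase plus one slow paragraph-length phrase.
let aggregate = aggregateRtfx([(audio: 5.0, synth: 0.3), (audio: 11.0, synth: 2.9)])

print(single, aggregate)
```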
## Source
+20 -10
@@ -5,16 +5,26 @@ Lives under `Sources/FluidAudio/TTS/Magpie/`.
## Status
Functional but **quite slow — needs significant perf work, not for real-time
or latency-sensitive use.** First synth on a fresh process is dominated by
CoreML model load + first-call ANE compile (~30 s); warm synths run at
~96 s wall for an 8-word English sentence on M-series, i.e. RTFx ≈ **0.04**
(~25× slower than realtime). Whether the throughput ceiling is a model
characteristic, a CoreML conversion limitation, or both is still being
investigated and is expected to improve in subsequent iterations. For
real-time use prefer Kokoro (~20× RTFx) or PocketTTS (~1.5–2× RTFx);
Magpie's value prop is multilingual coverage and the 5 built-in speaker
contexts, not throughput.
> ⚠️ **Beta / experimental.** Below real-time on Apple Silicon
> (agg-RTFx ~0.41× on M2). Not for latency-sensitive use; prefer
> Kokoro / Kokoro ANE or PocketTTS for real-time. Initializing
> `MagpieTtsManager` logs a runtime beta warning at `.warning` level.
Functional but **below real-time — not for latency-sensitive use.**
On the full `minimax-english` 100-phrase corpus (M2, default compute
units), Magpie posts agg-RTFx **0.41×** with p50 warm synth ~19.8 s
and p95 ~57.5 s — most of the long tail comes from paragraph-length
news / story phrases (max 107 s on a single 18 s utterance). Cold
start ~19 s on warm ANE caches, dominated by first-call decoder_step
compile. The AR loop (`decoder_step` + sampler) dominates wall clock
and grows super-linearly with phrase length; the
[`outputBackings` fast path](Benchmarks.md#magpie-outputbackings-fast-path)
already eliminated the per-step KV reallocation cost. Further gains
likely need an MLX-backed LocalTransformer or a smaller-K/V variant.
For real-time use prefer Kokoro / Kokoro ANE (2–5× RTFx) or PocketTTS
(streaming, TTFT ~1.2 s); Magpie's value prop is multilingual coverage
(en/es/de/fr/it/vi/zh/hi) and 5 built-in speaker contexts, not
throughput.
Audio quality is perceptually clean across all 5 speakers and ASR-clean on
4/5; speaker 0 has a single trailing-word artifact ("…and") attributable
+89
@@ -0,0 +1,89 @@
# MiniMax Multilingual TTS Test Set
The FluidAudio `tts-benchmark` corpus is sourced on demand from the
[MiniMaxAI/TTS-Multilingual-Test-Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
Hugging Face dataset and converted to the harness format (one phrase
per non-empty, non-`#` line). The fetched `.txt` files land under
`Benchmarks/tts/corpus/minimax/<lang>.txt`; they are gitignored — only
this document is checked in.
| Field | Value |
|----------|-------|
| Source | https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set |
| Revision | `cb416f0ac3658da0577e97873065e19fe6488917` (initial public release) |
| License | [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) |
| Citation | MiniMax-Speech tech report — [arXiv 2505.07916](https://arxiv.org/pdf/2505.07916) |
| Languages | 24 (arabic, cantonese, chinese, czech, dutch, english, finnish, french, german, greek, hindi, indonesian, italian, japanese, korean, polish, portuguese, romanian, russian, spanish, thai, turkish, ukrainian, vietnamese) |
| Phrases | 100 per language (2400 total) |
The fetched text files are derivative works of the upstream dataset
and remain under **CC-BY-SA-4.0**. The rest of the FluidAudio
repository is licensed separately (see top-level `LICENSE`); only the
contents of `Benchmarks/tts/corpus/minimax/` are share-alike-bound to
CC-BY-SA-4.0.
## Why this corpus?
MiniMax positions this as *"a public benchmark used in a number of
recent TTS papers, which makes our numbers directly comparable to
existing work"* (Gradium, MiniMax-Speech, seed-tts-eval, etc.).
FluidAudio's `tts-benchmark` ships exclusively against this corpus
so the resulting RTFx / WER numbers land on the same axis as
published TTS work.
## Format conversion
Upstream lines have a `<cloning_audio_filename>|<text>` pipe-delimited
shape because the dataset also ships per-speaker reference audio for
zero-shot voice cloning. The FluidAudio harness only needs the text —
voice selection is a per-backend concern (Kokoro / PocketTTS / Magpie /
StyleTTS2 each have their own voice plumbing). The leading
`<filename>|` is stripped at fetch time; if you need the cloning audio
later, fetch it from the upstream HF repo's `audio/` directory.
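
The conversion described above amounts to dropping everything up to the first pipe on each line. A minimal sketch, assuming a hypothetical helper name and made-up sample filenames (the real logic lives in the `minimax-corpus` subcommand and may differ):

```swift
import Foundation

/// Hypothetical helper: turns upstream `<cloning_audio_filename>|<text>`
/// lines into one harness phrase per non-empty line.
func harnessPhrases(fromUpstream contents: String) -> [String] {
    contents
        .split(whereSeparator: \.isNewline)
        .compactMap { line -> String? in
            // Keep the text after the first pipe; pass through lines without one.
            let text: Substring
            if let pipe = line.firstIndex(of: "|") {
                text = line[line.index(after: pipe)...]
            } else {
                text = line
            }
            let trimmed = text.trimmingCharacters(in: .whitespaces)
            return trimmed.isEmpty ? nil : trimmed
        }
}

// Made-up input; the filenames are placeholders.
let upstream = "spk1.wav|Hello world\n\nspk2.wav|Second line"
print(harnessPhrases(fromUpstream: upstream))
```

Splitting on the first pipe only keeps any later `|` characters that belong to the phrase itself.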
## Fetching
The `fluidaudio minimax-corpus` CLI subcommand pins the upstream
revision to the value above so re-runs are deterministic. From the
package root:
```bash
# All 24 languages
swift run fluidaudio minimax-corpus
# Subset
swift run fluidaudio minimax-corpus --languages english,spanish,hindi
# Refresh against a newer release
swift run fluidaudio minimax-corpus --revision <commit-sha>
```
Output lands in `Benchmarks/tts/corpus/minimax/<lang>.txt` (relative
to the package root) by default; override with `--out-dir <path>`.
Auth-gated revisions are honored via the standard `HF_TOKEN` /
`HUGGING_FACE_HUB_TOKEN` env vars (same as every other HF asset pull
in the project). Run `fluidaudio minimax-corpus --help` for the full
flag list.
Per-backend ↔ language coverage and `tts-benchmark --corpus minimax-<lang>`
usage live in [`Benchmarks.md`](Benchmarks.md#corpus).
## WER caveats
Per the [open community discussion on the upstream
dataset](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10),
WER on this corpus is sensitive to the ASR + text-normalization stack:
- Whisper-v3 (and similarly Parakeet) often need text normalization on
the reference (`"32"` → `"thirty two"`) before comparing against the
hypothesis to get a clean WER.
- For non-Latin-script languages (Hindi, Japanese, Cantonese, etc.) the
ASR may emit transliterated forms that don't match the reference
script, inflating WER even when the synthesis is intelligible.
- For non-word-segmented languages (Chinese, Japanese, Thai), CER is
the more meaningful metric — `tts-benchmark` already reports both.
This means **MiniMax WER is best read relatively (FluidAudio backend
A vs. backend B on the same corpus + same ASR), not absolutely**, and
side-by-side comparison with published numbers requires matching the
upstream ASR + normalizer choice.
+1 -1
@@ -716,7 +716,7 @@ public enum ModelNames {
/// expected local directory layout is encoded in `CosyVoice3Constants.Files`.
public enum CosyVoice3 {
public static let llmPrefill = "LLM-Prefill-T256-M768-fp16"
public static let llmDecode = "LLM-Decode-M768-fp16-stateful"
public static let llmDecode = "LLM-Decode-M768-fp16"
public static let flow = "Flow-N250-fp16"
public static let hift = "HiFT-T500-fp16"
public static let speechEmbeddings = "speech_embedding-fp16.safetensors"
@@ -28,11 +28,12 @@ public actor CosyVoice3ModelStore {
/// - Parameters:
/// - directory: Base build directory that contains
/// `llm-fp16/`, `llm-fp16-stateful/`, `flow-fp16-n250/`,
/// `llm-fp16/`, `llm-fp16-decode/`, `flow-fp16-n250/`,
/// `hift-fp16-t500/`, `embeddings/`.
/// - computeUnits: Defaults to `.cpuAndNeuralEngine`. Applied to
/// LLM-Prefill + HiFT models only. LLM-Decode (stateful) and Flow
/// both force `.cpuAndGPU` regardless (see `loadIfNeeded()`).
/// LLM-Prefill only. LLM-Decode (stateless external cache),
/// Flow, and HiFT all pin `.cpuAndGPU` regardless (see
/// `loadIfNeeded()`).
public init(directory: URL, computeUnits: MLComputeUnits = .cpuAndNeuralEngine) {
self.directory = directory
self.computeUnits = computeUnits
@@ -67,10 +68,10 @@ public actor CosyVoice3ModelStore {
let prefill = try await compileAndLoad(prefillURL, configuration: config)
logger.info("Loaded \(CosyVoice3Constants.Files.llmPrefill)")
// Stateful decode MUST run on `.cpuAndGPU`:
// - ANE refuses to compile the stateful graph (same failure mode
// as Flow: `MILCompilerForANE ANECCompile() FAILED`), so
// `.cpuAndNE` / `.all` deadlock load
// Stateless decode MUST run on `.cpuAndGPU`:
// - ANE refuses to compile the rotary + sliced SDPA decode graph
// (same failure mode as Flow: `MILCompilerForANE ANECCompile()
// FAILED`), so `.cpuAndNE` / `.all` deadlock load
// - CPU-only works but is ~2× slower than the GPU path
// Ignore the user-supplied `computeUnits` for decode.
let decodeConfig = MLModelConfiguration()
@@ -98,7 +99,25 @@ public actor CosyVoice3ModelStore {
let flow = try await compileAndLoad(flowURL, configuration: flowConfig)
logger.info("Loaded \(CosyVoice3Constants.Files.flow)")
let hift = try await compileAndLoad(hiftURL, configuration: config)
// HiFT runs on `.cpuAndGPU` (fp16). With `.cpuAndNeuralEngine`
// CoreML's planner placed most of HiFT on ANE but kept at least
// one op (`HiFT-T500-fp16_main__Op104`) on the BNNS CPU path,
// which trips a hard async-dispatch watchdog mid-corpus on
// long phrases:
//
// E5RT: Submit Async failed for [3:29]: Async task:
// HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out.
// @ CancelTimedOutAsyncTask_block_invoke
//
// Pinning HiFT to `.cpuAndGPU` removes the ANE+BNNS mixed-compute
// pathology (the same family of issue that already forced Flow
// and Decode off ANE above). The model is fixed-shape
// [1, 80, 500] so GPU placement is predictable. Trade-off: a
// small per-call latency increase vs. ANE, acceptable since
// the prior ANE config didn't actually complete the corpus.
let hiftConfig = MLModelConfiguration()
hiftConfig.computeUnits = .cpuAndGPU
let hift = try await compileAndLoad(hiftURL, configuration: hiftConfig)
logger.info("Loaded \(CosyVoice3Constants.Files.hift)")
loadedModels = CosyVoice3Models(prefill: prefill, decode: decode, flow: flow, hift: hift)
@@ -4,7 +4,7 @@ import Foundation
///
/// Shipping config (frozen):
/// - LLM-Prefill-T256-M768-fp16 (cpuAndNeuralEngine)
/// - LLM-Decode-M768-fp16-stateful (cpuAndGPU; see note)
/// - LLM-Decode-M768-fp16 (cpuAndGPU; see note)
/// - Flow-N250-fp16 (cpuAndGPU; an ANE-port
/// BC1S rewrite was attempted and reverted: the converted graph ran
/// ~3× faster but numerically broken (mel dynamic range collapsed
@@ -15,14 +15,22 @@ import Foundation
/// `input_embed.conv_pos_embed` (`Conv1d(1024,1024,k=31)+Mish`)
/// that three rewrite attempts couldn't move; ANEF rejects the
/// conv footprint regardless of group count.)
/// - HiFT-T500-fp16 (cpuAndNeuralEngine)
/// - HiFT-T500-fp16 (cpuAndGPU; pinned off
/// ANE because the `.cpuAndNeuralEngine` planner left at least one
/// op on the BNNS CPU path, which tripped a hard async-dispatch
/// watchdog mid-corpus on long phrases:
/// `E5RT: Submit Async failed ... HiFT-T500-fp16_main__Op104_BnnsCpuInference
/// has timed out`. GPU placement is deterministic and avoids the
/// ANE+BNNS mixed-compute pathology.)
///
/// The stateful decode model uses per-layer `MLState` buffers for the
/// KV cache (48 tensors, `[1, 2, 768, 64]` fp16 each) instead of
/// round-tripping 18 MB of kv_k / kv_v MLMultiArrays every step. ANE
/// refuses to compile the stateful graph (`MILCompilerForANE
/// ANECCompile() FAILED`); decode therefore runs on `.cpuAndGPU`.
/// Requires macOS 15 / iOS 18.
/// Decode runs **stateless** with an external KV cache: prefill emits
/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32, and decode
/// accepts the same tensors as inputs and returns `kv_k_out` / `kv_v_out`
/// at the same shape/dtype. The cache is round-tripped once per step
/// (18 MB total). ANE still rejects this graph (`MILCompilerForANE
/// ANECCompile() FAILED` on the rotary + sliced SDPA), so decode is
/// pinned to `.cpuAndGPU`. The library floor is macOS 14 / iOS 17, with no
/// MLState dependency.
public enum CosyVoice3Constants {
// MARK: - LLM shapes
@@ -66,8 +74,8 @@ public enum CosyVoice3Constants {
public enum Files {
public static let llmPrefill = "LLM-Prefill-T256-M768-fp16.mlpackage"
public static let llmPrefillSubdir = "llm-fp16"
public static let llmDecode = "LLM-Decode-M768-fp16-stateful.mlpackage"
public static let llmDecodeSubdir = "llm-fp16-stateful"
public static let llmDecode = "LLM-Decode-M768-fp16.mlpackage"
public static let llmDecodeSubdir = "llm-fp16-decode"
public static let flow = "Flow-N250-fp16.mlpackage"
public static let flowSubdir = "flow-fp16-n250"
public static let hift = "HiFT-T500-fp16.mlpackage"
@@ -38,11 +38,9 @@ import Foundation
/// the 281 runtime-added special tokens (CosyVoice3Tokenizer). Same format
/// that `tokenizer_fixture.json` dumps under its `special_tokens` key.
///
/// > Note: Gated to macOS 15 / iOS 18 because the underlying
/// > `CosyVoice3Synthesizer` uses CoreML `MLState` for the decode KV cache.
/// > Other FluidAudio modules (ASR, Diarization, VAD, Kokoro, PocketTTS)
/// > remain available on macOS 14 / iOS 17.
@available(macOS 15, iOS 18, *)
/// > Available on the same floor as the rest of FluidAudio (macOS 14 /
/// > iOS 17). Decode runs stateless with an external KV cache rather than
/// > `MLState`, so no extra OS gate is required.
public actor CosyVoice3TtsManager {
private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3TtsManager")
@@ -216,9 +214,60 @@ public actor CosyVoice3TtsManager {
normalized = CosyVoice3ChineseNormalizer.normalize(text)
}
// Auto-chunk long input under the structural 250-token Flow cap.
// The chunker greedily splits on hard sentence enders + soft clause
// separators when the running speech-token estimate exceeds budget;
// short inputs return a single chunk and take the fast path. Caller
// can opt out via `options.disableAutoChunking` for pre-segmented
// input (e.g. UI-driven streaming).
let chunks: [String]
if options.disableAutoChunking {
chunks = [normalized]
} else {
let split = CosyVoice3TextChunker.chunk(normalized)
chunks = split.isEmpty ? [normalized] : split
}
if chunks.count == 1 {
return try await synthesizeChunk(
text: chunks[0], promptAssets: promptAssets,
options: options, frontend: frontend, synthesizer: synthesizer)
}
logger.info(
"Auto-chunking long input into \(chunks.count) segments to fit "
+ "the 250-token Flow cap (estimated speech tokens: "
+ "\(CosyVoice3TextChunker.estimateSpeechTokens(normalized))).")
var results: [CosyVoice3SynthesisResult] = []
results.reserveCapacity(chunks.count)
for (i, chunk) in chunks.enumerated() {
logger.info(
" chunk \(i + 1)/\(chunks.count): "
+ "\(chunk.count) chars, ~"
+ "\(CosyVoice3TextChunker.estimateSpeechTokens(chunk)) speech tokens")
let r = try await synthesizeChunk(
text: chunk, promptAssets: promptAssets,
options: options, frontend: frontend, synthesizer: synthesizer)
results.append(r)
}
return Self.mergeChunkedResults(results)
}
// MARK: - Chunked synthesis helpers
/// Single-call synthesis path: tokenize/normalize-aware text → fixture
/// adapter → synthesizer. Shared between the fast (1-chunk) and chunked
/// (N-chunk) paths in `synthesize(...)`.
private func synthesizeChunk(
text: String,
promptAssets: CosyVoice3PromptAssets,
options: CosyVoice3SynthesisOptions,
frontend: CosyVoice3TextFrontend,
synthesizer: CosyVoice3Synthesizer
) async throws -> CosyVoice3SynthesisResult {
let assembled = try frontend.assemble(
promptText: promptAssets.promptText,
ttsText: normalized,
ttsText: text,
promptSpeechIds: promptAssets.promptSpeechIds)
let lmInputEmbedsFlat = try Self.flattenLmEmbeds(
@@ -246,6 +295,72 @@ public actor CosyVoice3TtsManager {
return try await synthesizer.synthesize(fixture: fixture, options: parityOptions)
}
/// Concatenate per-chunk results into a single `CosyVoice3SynthesisResult`.
/// Audio is stitched with a short cosine cross-fade (`crossfadeMs`) at
/// each boundary to mask DC/phase mismatch from independent synth calls.
/// `finishedOnEos` is `true` only when every chunk ended naturally
/// (so callers can still detect mid-segment truncation downstream).
private static func mergeChunkedResults(
_ results: [CosyVoice3SynthesisResult],
crossfadeMs: Double = 8
) -> CosyVoice3SynthesisResult {
precondition(!results.isEmpty, "mergeChunkedResults requires ≥1 result")
let sampleRate = results[0].sampleRate
let samples = concatWithCrossfade(
results.map { $0.samples },
sampleRate: sampleRate,
fadeMs: crossfadeMs)
let totalGenerated = results.reduce(0) { $0 + $1.generatedTokenCount }
var allDecoded: [Int32] = []
allDecoded.reserveCapacity(totalGenerated)
for r in results { allDecoded.append(contentsOf: r.decodedTokens) }
let allEos = results.allSatisfy { $0.finishedOnEos }
return CosyVoice3SynthesisResult(
samples: samples,
sampleRate: sampleRate,
generatedTokenCount: totalGenerated,
decodedTokens: allDecoded,
finishedOnEos: allEos)
}
/// Concatenate PCM chunks with a cosine cross-fade at each boundary.
/// Fade window is the shorter of `fadeMs` and `min(prev.tail, next.head)
/// / 2`, so very short chunks degrade gracefully (no overlap consuming
/// the entire chunk).
static func concatWithCrossfade(
_ chunks: [[Float]],
sampleRate: Int,
fadeMs: Double
) -> [Float] {
guard !chunks.isEmpty else { return [] }
let nominalFade = max(0, Int((Double(sampleRate) * fadeMs / 1000).rounded()))
var out: [Float] = chunks[0]
for i in 1..<chunks.count {
let next = chunks[i]
if nominalFade == 0 || out.isEmpty || next.isEmpty {
out.append(contentsOf: next)
continue
}
let fade = min(nominalFade, out.count / 2, next.count / 2)
if fade <= 0 {
out.append(contentsOf: next)
continue
}
// Raised-cosine (equal-gain) crossfade: out tail fades down, next head
// fades up, and the two weights sum to 1 at every sample; samples are
// summed in the overlap region. Length of `out` after splice =
// old_len - fade + next.count.
let outStart = out.count - fade
for j in 0..<fade {
let t = Float(j) / Float(fade)
let down = 0.5 * (1 + cos(Float.pi * t)) // 1 → 0
let up = 0.5 * (1 - cos(Float.pi * t)) // 0 → 1
out[outStart + j] = out[outStart + j] * down + next[j] * up
}
out.append(contentsOf: next[fade..<next.count])
}
return out
}
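As a worked check of the splice arithmetic, the sketch below reimplements the same raised-cosine weights for exactly two chunks on plain `[Float]` arrays. `crossfadeDemo` is a hypothetical standalone helper, not part of this PR; it assumes `fade` does not exceed either chunk length. It illustrates that the two weights sum to 1 in the overlap and that the output length is `a.count + b.count - fade`.

```swift
import Foundation

/// Hypothetical standalone helper mirroring the two-chunk splice in
/// `concatWithCrossfade`. Assumes `0 <= fade <= min(a.count, b.count)`.
func crossfadeDemo(_ a: [Float], _ b: [Float], fade: Int) -> [Float] {
    precondition(fade >= 0 && fade <= a.count && fade <= b.count)
    var out = a
    let start = out.count - fade
    for j in 0..<fade {
        let t = Float(j) / Float(fade)
        let down = 0.5 * (1 + cos(Float.pi * t))  // 1 → 0
        let up = 0.5 * (1 - cos(Float.pi * t))    // 0 → 1
        // Overlap region: weighted sum of the outgoing tail and incoming head.
        out[start + j] = out[start + j] * down + b[j] * up
    }
    // Everything after the overlap is copied from `b` unchanged.
    out.append(contentsOf: b[fade...])
    return out
}

// Two constant chunks: because the weights sum to 1, the mix stays constant.
let a = [Float](repeating: 1, count: 10)
let b = [Float](repeating: 1, count: 10)
let mixed = crossfadeDemo(a, b, fade: 4)
```

With two constant-valued chunks the overlap stays at the same level, which is the property an equal-gain fade gives for correlated signals.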
// MARK: - Helpers
/// Flatten `[1, tPre, 896]` MLMultiArray fp32 into `[tPre * 896]` Float,
@@ -0,0 +1,145 @@
import Foundation
/// Splits long input text into segments that each fit within CosyVoice3's
/// 250-token Flow input cap.
///
/// The Flow CFM model is exported with a fixed `[1, 250]` `token_total`
/// shape (`CosyVoice3Constants.flowTotalTokens`). After the prompt's speech
/// tokens consume `~85–95` slots (default voice), each `synthesize(...)`
/// call has room for roughly `~155` new speech tokens of output (≈ 6.4 s of
/// audio at the 40 ms/token rate `tokenMelRatio × hiftSamplesPerFrame /
/// sampleRate = 2 × 480 / 24_000`). Long phrases truncate mid-utterance.
///
/// This chunker greedily packs input into segments under a target speech-
/// token budget, splitting preferentially on hard sentence enders
/// (`. ! ? \n`) and falling back to soft clause separators
/// (`, ; `) when sentences exceed the budget. Synthesis is run
/// per-chunk and audio is concatenated with a small cosine cross-fade at
/// boundaries (handled by the caller, not here).
///
/// **Token-rate estimate** (calibrated against minimax-zh corpus runs):
/// - CJK char ≈ 7.5 speech tokens (worst-case observed real rate;
/// 5.5 was empirically too low and
/// let ~16% of phrases hit cap)
/// - ASCII char ≈ 1.5 speech tokens (BPE compresses; English is faster)
/// - Other (Latin-1) ≈ 2.5 speech tokens (middle ground for accented Latin)
///
/// Default `maxSpeechTokens = 110` leaves a ~45-token safety margin under
/// the typical room-for-new of ~155. The 30-token force-split overshoot
/// can push a committed chunk to ~140 estimated, still comfortably under
/// the cap once the conservative 7.5-tokens/CJK-char heuristic is
/// reconciled with real generation rates. The synthesizer still emits
/// its `LLM-Decode budget exhausted` warning if a chunk somehow exceeds
/// the cap, so under-estimates are surfaced rather than silent.
public enum CosyVoice3TextChunker {
/// Sentence-ending punctuation. Always commit the current chunk after
/// these, regardless of running token count.
private static let hardEnders: Set<Character> = [
"。", "！", "？", ".", "!", "?", "\n",
]
/// Clause-internal punctuation. Commit only when the running token
/// count is at or above the budget; soft splits should be preferred
/// over force-splits but not preferred over hard enders.
private static let softEnders: Set<Character> = [
"，", "、", "；", "：", ";", ",", " ",
]
/// Default speech-token budget per chunk. Keeps a ~45-token margin
/// under the typical room-for-new of ~155 (= `flowTotalTokens=250`
/// minus a typical prompt of ~95 tokens). The 30-token force-split
/// overshoot may push committed chunks to ~140 estimated, still under
/// the structural cap.
public static let defaultMaxSpeechTokens: Int = 110
/// Split `text` into chunks each estimated to produce
/// ≤ `maxSpeechTokens` LLM speech tokens. Returns `[text]` (single
/// chunk) when the input already fits. Returns `[]` when `text` is
/// empty or whitespace-only.
public static func chunk(
_ text: String,
maxSpeechTokens: Int = defaultMaxSpeechTokens
) -> [String] {
let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
guard !trimmed.isEmpty else { return [] }
if estimateSpeechTokens(trimmed) <= maxSpeechTokens {
return [trimmed]
}
var chunks: [String] = []
var current = ""
for ch in trimmed {
current.append(ch)
let tokensSoFar = estimateSpeechTokens(current)
if hardEnders.contains(ch) {
let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
if !pruned.isEmpty { chunks.append(pruned) }
current = ""
continue
}
if tokensSoFar >= maxSpeechTokens && softEnders.contains(ch) {
let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
if !pruned.isEmpty { chunks.append(pruned) }
current = ""
continue
}
// Force-split if no punctuation has appeared within a 30-token
// overshoot. Prefer the most recent whitespace; fall back to
// hard-cut at the current position. Hard-cut on continuous CJK
// (no whitespace) is rare in normalized input but can happen
// when the normalizer collapses spaces.
if tokensSoFar >= maxSpeechTokens + 30 {
if let lastSpace = current.lastIndex(where: { $0 == " " }),
lastSpace != current.startIndex
{
let head = String(current[..<lastSpace])
.trimmingCharacters(in: .whitespacesAndNewlines)
let tail = String(current[current.index(after: lastSpace)...])
if !head.isEmpty { chunks.append(head) }
current = tail
} else {
let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
if !pruned.isEmpty { chunks.append(pruned) }
current = ""
}
}
}
let tail = current.trimmingCharacters(in: .whitespacesAndNewlines)
if !tail.isEmpty { chunks.append(tail) }
return chunks
}
/// Rough estimate of how many SPEECH tokens the LLM-Decode AR loop
/// will produce for `s`. Used by `chunk(...)` to size segments under
/// the structural Flow cap.
public static func estimateSpeechTokens(_ s: String) -> Int {
var total = 0.0
for scalar in s.unicodeScalars {
if isCJK(scalar) {
total += 7.5
} else if scalar.isASCII {
total += 1.5
} else {
total += 2.5
}
}
return Int(total.rounded())
}
private static func isCJK(_ scalar: Unicode.Scalar) -> Bool {
let v = scalar.value
// CJK Unified Ideographs (the bulk of zh/yue text)
if (0x4E00...0x9FFF).contains(v) { return true }
// CJK Unified Ideographs Extension A
if (0x3400...0x4DBF).contains(v) { return true }
// Hiragana
if (0x3040...0x309F).contains(v) { return true }
// Katakana
if (0x30A0...0x30FF).contains(v) { return true }
// Hangul Syllables
if (0xAC00...0xD7AF).contains(v) { return true }
return false
}
}
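The budget arithmetic from the chunker's header comment can be checked with a small standalone sketch. The constants are copied from the comments above (250-token Flow cap, a typical prompt of about 95 tokens, 40 ms of audio per token, per-character rates of 7.5 / 1.5 / 2.5). `estimateTokens` is a simplified, hypothetical stand-in that treats only the CJK Unified Ideographs block as CJK, so it illustrates the estimate rather than reproducing the shipped function.

```swift
import Foundation

// Values quoted in the doc comments above; the ~95-token prompt is the
// "typical" figure from those comments, not a measured constant.
let flowTotalTokens = 250
let typicalPromptTokens = 95
// tokenMelRatio × hiftSamplesPerFrame / sampleRate
let secondsPerToken = 2.0 * 480.0 / 24_000.0

let roomForNew = flowTotalTokens - typicalPromptTokens
let maxAudioSeconds = Double(roomForNew) * secondsPerToken

/// Simplified stand-in for `estimateSpeechTokens`: only the main CJK
/// Unified Ideographs block counts as CJK here (an assumption of this sketch).
func estimateTokens(_ s: String) -> Int {
    var total = 0.0
    for scalar in s.unicodeScalars {
        if (0x4E00...0x9FFF).contains(scalar.value) {
            total += 7.5
        } else if scalar.isASCII {
            total += 1.5
        } else {
            total += 2.5
        }
    }
    return Int(total.rounded())
}

let zh = estimateTokens("你好世界")  // 4 CJK chars × 7.5 = 30
let en = estimateTokens("hello")     // 5 ASCII chars × 1.5 = 7.5, rounds to 8
print(roomForNew, maxAudioSeconds, zh, en)
```

With a 95-token prompt this gives 155 free tokens, or about 6.2 s of audio, in the same range as the roughly 6.4 s quoted above for a slightly shorter prompt. At 7.5 tokens per CJK character, the default 110-token budget corresponds to about 14 or 15 characters per chunk.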
@@ -7,11 +7,12 @@ import Foundation
/// implemented as a method on this type, keeping the state (KV cache, running
/// decoded list) local to a single synthesis call.
///
/// Decode uses CoreML `MLState` (macOS 15 / iOS 18): 48 per-layer buffers
/// (`kv_k_0..kv_k_23`, `kv_v_0..kv_v_23`) replace the 18 MB kv_k / kv_v
/// round-trip per step. Prefill remains non-stateful and its `kv_k` / `kv_v`
/// outputs seed the decode state once after prefill.
@available(macOS 15, iOS 18, *)
/// Decode is **stateless** with an external KV cache. Prefill emits
/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32; decode accepts those
/// same tensors as inputs and returns updated `kv_k_out` / `kv_v_out` at
/// the same shape/dtype. We round-trip the cache once per step (18 MB
/// total) and bind the previous step's outputs as the next step's inputs.
/// No `MLState` dependency; runs on macOS 14 / iOS 17.
public actor CosyVoice3Synthesizer {
private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3Synthesizer")
@@ -19,6 +20,18 @@ public actor CosyVoice3Synthesizer {
private let models: CosyVoice3Models
private let embeddings: CosyVoice3SpeechEmbeddings
/// Set to `false` once `LLM-Decode-M768-fp16` rejects pre-allocated
/// `outputBackings` (model exported without explicit MultiArray
/// shape/dtype constraints on its `kv_k_out` / `kv_v_out` /
/// `speech_logits` outputs). Latched off so we don't throw + catch on
/// every one of ~163 AR decode steps per phrase. Same pattern as
/// `MagpieKvCache.useOutputBackings`.
private var useOutputBackings: Bool = true
/// One-shot flag for "fast path engaged" log message; only emitted on
/// the first successful `outputBackings` prediction so we don't spam.
private var loggedFastPath: Bool = false
public init(models: CosyVoice3Models, embeddings: CosyVoice3SpeechEmbeddings) {
self.models = models
self.embeddings = embeddings
@@ -46,16 +59,52 @@ public actor CosyVoice3Synthesizer {
sampler.seedTokens(fixture.decodedTokens)
}
// 1) Prefill (non-stateful: returns kv_k / kv_v as outputs)
// 1) Prefill (returns kv_k / kv_v as fp32 outputs)
let tPrefill = Date()
let (prefillLogits, initialKvK, initialKvV) = try await runPrefill(fixture: fixture)
let prefillSec = Date().timeIntervalSince(tPrefill)
// Seed decode MLState from prefill kv_k / kv_v.
let tSeed = Date()
let state = models.decode.makeState()
try seedDecodeState(state: state, kvK: initialKvK, kvV: initialKvV)
let seedSec = Date().timeIntervalSince(tSeed)
// External KV cache with **double-buffered outputBackings**: prefill's
// `kv_k` / `kv_v` (shape `[24, 1, 2, 768, 64]` fp32, ~9 MB each) feed
// the first decode step. Subsequent steps rotate between two
// pre-allocated buffer pairs (A/B) bound as the model's
// `kv_k_out` / `kv_v_out` outputs. Same pattern as
// `MagpieKvCache.swapBackings()`: eliminates ~36 MB of host
// alloc/dealloc per decode step (×163 steps ≈ 5.9 GB churn per
// phrase). `speech_logits` is also pre-bound so we avoid a fresh
// 27 KB allocation each step. CoreML rejects this when the model
// was exported without explicit MultiArray shape/dtype constraints
// on its outputs; in that case we latch `useOutputBackings = false`
// and fall back to per-step allocation for the rest of the run.
let kvShape: [NSNumber] = [
NSNumber(value: CosyVoice3Constants.numLayers),
1,
NSNumber(value: CosyVoice3Constants.kvHeads),
NSNumber(value: CosyVoice3Constants.kvMaxLength),
NSNumber(value: CosyVoice3Constants.headDim),
]
let kvKBackA = try MLMultiArray(shape: kvShape, dataType: .float32)
let kvVBackA = try MLMultiArray(shape: kvShape, dataType: .float32)
let kvKBackB = try MLMultiArray(shape: kvShape, dataType: .float32)
let kvVBackB = try MLMultiArray(shape: kvShape, dataType: .float32)
let logitsBacking = try MLMultiArray(
shape: [1, 1, NSNumber(value: CosyVoice3Constants.speechVocab)],
dataType: .float32)
// Pointer-rotation triple. `frontKvK/V` are read by the next step;
// `backKvK/V` receive the next step's writes; `spareKvK/V` are the
// pre-allocated set ready to become `back` after rotation. Initial
// `front` is the prefill output. Once decode step 1 finishes, `front`
// becomes A (just written), `back` becomes B (next write target), and
// the prefill buffers rotate into `spare`, where they are recycled as
// a write target from step 3 onward (their contents are never read
// again).
var frontKvK: MLMultiArray = initialKvK
var frontKvV: MLMultiArray = initialKvV
var backKvK: MLMultiArray = kvKBackA
var backKvV: MLMultiArray = kvVBackA
var spareKvK: MLMultiArray = kvKBackB
var spareKvV: MLMultiArray = kvVBackB
// Reusable per-step inputs for decode. `curLenArr` is mutated in place
// each step; `inputsEmbedsArr` is overwritten by memcpy per step.
@@ -64,6 +113,12 @@ public actor CosyVoice3Synthesizer {
shape: [1, 1, NSNumber(value: CosyVoice3Constants.embedDim)],
dataType: .float32)
// Logits scratch reused across all decode steps. The hot loop
// memcpy's into this from `logitsBacking` (or strided-gathers from a
// freshly-allocated array on the slow path).
var logitsScratch = [Float](
repeating: 0, count: CosyVoice3Constants.speechVocab)
// First token from prefill tail logits.
var decoded: [Int32] = []
let firstLogits = sliceLastStepLogits(
@@ -81,31 +136,82 @@ public actor CosyVoice3Synthesizer {
}
decoded.append(topId)
// 2) Decode loop
// 2) Decode loop (stateless, external cache, double-buffered backings)
var curLen = fixture.tPre
var decodeSteps = 0
var hitEos = false
let tDecode = Date()
for step in 1..<maxNew {
try embeddings.copyEmbedding(tokenId: topId, into: inputsEmbedsArr)
curLenArr[0] = NSNumber(value: Int32(curLen))
let logits = try runDecodeStateful(
try runDecode(
inputsEmbeds: inputsEmbedsArr,
curLen: curLenArr,
state: state)
topId = sampler.sample(logits: logits, decodedSoFar: decoded)
frontKvK: frontKvK,
frontKvV: frontKvV,
backKvK: backKvK,
backKvV: backKvV,
logitsBacking: logitsBacking,
logits: &logitsScratch)
topId = sampler.sample(logits: logitsScratch, decodedSoFar: decoded)
curLen += 1
decodeSteps += 1
if CosyVoice3Constants.stopRange.contains(topId) {
logger.info("EOS at step \(step) (token=\(topId))")
hitEos = true
break
}
decoded.append(topId)
// Rotate buffers: `back` (just-written) becomes new `front`;
// `spare` becomes new `back`; old `front` becomes new `spare`
// (will be overwritten next step). On step 1 the old `front` is
// the prefill output; it drops to `spare` and gets overwritten on
// step 3, which is harmless (we never read it again).
let prevFrontK = frontKvK
let prevFrontV = frontKvV
frontKvK = backKvK
frontKvV = backKvV
backKvK = spareKvK
backKvV = spareKvV
spareKvK = prevFrontK
spareKvV = prevFrontV
}
let decodeSec = Date().timeIntervalSince(tDecode)
guard !decoded.isEmpty else {
throw CosyVoice3Error.predictionFailed("LLM produced no speech tokens")
}
// Truncation signal: AR loop exhausted its decode budget without
// observing an EOS token in `stopRange` (6_561–6_760). The 250-token
// cap is structural: it's the fixed `[1, 250]` shape of the Flow
// model's `token_total` input (`CosyVoice3Constants.flowTotalTokens`),
// not a synthesizer-side soft limit. With ~40 ms of audio per token
// (`tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24_000`),
// a prompt taking ~`nPrompt` tokens leaves `(250 - nPrompt) × 0.04 s`
// of generated audio, i.e. long phrases truncate mid-utterance.
//
// Surface this as a `.warning` so callers running long input get a
// console signal instead of silent truncation. Lifting the cap
// requires re-exporting Flow with a larger `token_total` shape; for
// now, splitting input at clause boundaries (， / 。) is the
// workaround.
if !hitEos {
let producedSec =
Double(decoded.count)
* Double(CosyVoice3Constants.tokenMelRatio)
* Double(CosyVoice3Constants.hiftSamplesPerFrame)
/ Double(CosyVoice3Constants.sampleRate)
logger.warning(
"LLM-Decode budget exhausted: \(decoded.count) generated tokens "
+ "/ \(maxNew) cap (no EOS observed). "
+ "Output truncated at ~"
+ String(format: "%.1f", producedSec)
+ "s of audio. The 250-token Flow input is a structural cap; "
+ "split long phrases at clause boundaries (， 。) to work around."
)
}
// 3) Flow
let nNew = decoded.count
let tFlow = Date()
@@ -133,14 +239,15 @@ public actor CosyVoice3Synthesizer {
logger.info(
String(
format:
"STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
"STAGES prefill=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
prefillSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
return CosyVoice3SynthesisResult(
samples: audio,
sampleRate: CosyVoice3Constants.sampleRate,
generatedTokenCount: nNew,
decodedTokens: decoded)
decodedTokens: decoded,
finishedOnEos: hitEos)
}
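The three-way pointer rotation in the decode loop above is easier to follow with labels in place of `MLMultiArray` buffers. The sketch below is a standalone illustration: `BufferTriple`, its `rotate()` method, and the string labels are hypothetical and not part of this PR. It records which buffer each decode step would write into.

```swift
/// Hypothetical stand-in for the (front, back, spare) buffer triple.
/// Strings label the buffers so the rotation order can be inspected.
struct BufferTriple {
    var front: String  // read by the next decode step
    var back: String   // written by the next decode step
    var spare: String  // becomes `back` after the next rotation

    /// Same rotation as the decode loop: back → front, spare → back,
    /// old front → spare.
    mutating func rotate() {
        let previousFront = front
        front = back
        back = spare
        spare = previousFront
    }
}

var buffers = BufferTriple(front: "prefill", back: "A", spare: "B")
var writeTargets: [String] = []
for _ in 1...4 {
    writeTargets.append(buffers.back)  // each step writes into `back`
    buffers.rotate()
}
```

Under these assumptions the write targets cycle A, B, prefill, A: the prefill buffers are first reused as a write target on step 3, after their contents have already been consumed by step 1.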
// MARK: - Stages
@@ -193,140 +300,134 @@ public actor CosyVoice3Synthesizer {
return (logits, kvK, kvV)
}
/// Run one stateful decode step. `state` is mutated in place via the
/// 48 per-layer `kv_k_i` / `kv_v_i` state buffers registered in the
/// converted model.
private func runDecodeStateful(
/// Run one stateless decode step with an external KV cache.
///
/// Inputs match the converted CoreML graph signature:
/// - `inputs_embeds: fp32 [1, 1, 896]`
/// - `cur_len: int32 [1]`
/// - `kv_k: fp32 [24, 1, 2, 768, 64]` (previous step's `kv_k_out`, or
/// prefill's `kv_k` for the first decode step)
/// - `kv_v: fp32 [24, 1, 2, 768, 64]`
///
/// Outputs (when `outputBackings` is accepted, written into the pre-
/// allocated `backKvK` / `backKvV` / `logitsBacking` buffers in place):
/// - `speech_logits: fp32 [1, 1, 6761]`
/// - `kv_k_out: fp32 [24, 1, 2, 768, 64]`
/// - `kv_v_out: fp32 [24, 1, 2, 768, 64]`
///
/// Falls back to per-step CoreML allocation + memcpy into the pre-
/// allocated backings if the model rejects `outputBackings` (latches
/// `useOutputBackings = false` so we don't retry on every step).
private func runDecode(
inputsEmbeds: MLMultiArray,
curLen: MLMultiArray,
state: MLState
) throws -> [Float] {
frontKvK: MLMultiArray,
frontKvV: MLMultiArray,
backKvK: MLMultiArray,
backKvV: MLMultiArray,
logitsBacking: MLMultiArray,
logits: inout [Float]
) throws {
let features: [String: Any] = [
"inputs_embeds": inputsEmbeds,
"cur_len": curLen,
"kv_k": frontKvK,
"kv_v": frontKvV,
]
let provider = try MLDictionaryFeatureProvider(dictionary: features)
let output = try models.decode.prediction(from: provider, using: state)
guard
let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue
else {
throw CosyVoice3Error.predictionFailed("decode: missing speech_logits")
var fastPathSucceeded = false
if useOutputBackings {
let opts = MLPredictionOptions()
opts.outputBackings = [
"kv_k_out": backKvK,
"kv_v_out": backKvV,
"speech_logits": logitsBacking,
]
do {
_ = try models.decode.prediction(from: provider, options: opts)
Self.readLogits(from: logitsBacking, into: &logits)
if !loggedFastPath {
logger.info(
"LLM-Decode outputBackings accepted; double-buffered "
+ "AR loop active")
loggedFastPath = true
}
fastPathSucceeded = true
} catch {
// CoreML refused our pre-allocated backings, typically
// because `LLM-Decode-M768-fp16.mlpackage` was exported
// without explicit MultiArray shape/dtype constraints on
// its outputs. Latch the flag off so we don't throw + catch
// on every one of ~163 steps for the rest of the corpus.
// Warning level so it shows in release builds: this is a
// perf regression worth surfacing to anyone running with a
// re-exported model.
useOutputBackings = false
logger.warning(
"LLM-Decode outputBackings rejected "
+ "(\(error.localizedDescription)); switching to "
+ "fresh-alloc fallback for the rest of the run")
}
}
// logits shape = [1, 1, 6761] fp32; strides may be non-compact.
if !fastPathSucceeded {
// Slow path: per-step CoreML allocation, then memcpy outputs
// into the pre-allocated backings so the front/back rotation
// protocol still works after this call.
let output = try models.decode.prediction(from: provider)
guard
let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue,
let kvKOutArr = output.featureValue(for: "kv_k_out")?.multiArrayValue,
let kvVOutArr = output.featureValue(for: "kv_v_out")?.multiArrayValue
else {
throw CosyVoice3Error.predictionFailed(
"decode: missing speech_logits / kv_k_out / kv_v_out")
}
try Self.copyKvOutput(kvKOutArr, into: backKvK, name: "kv_k_out")
try Self.copyKvOutput(kvVOutArr, into: backKvV, name: "kv_v_out")
Self.readLogits(from: logitsArr, into: &logits)
}
}
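The "try the fast path, latch it off on the first rejection" control flow used by `runDecode` can be sketched without CoreML. Everything below is hypothetical scaffolding (`BackingsRejected`, `LatchedRunner`, and the placeholder closures are not part of this PR); it only demonstrates that the fast path is attempted once and skipped on every later step.

```swift
/// Hypothetical error used only by this sketch.
struct BackingsRejected: Error {}

/// Minimal sketch of the latch pattern: once the fast path throws,
/// it is never attempted again for the lifetime of the runner.
final class LatchedRunner {
    private(set) var useFastPath = true
    private(set) var fastAttempts = 0

    /// `fast` and `slow` are placeholder closures standing in for the two
    /// prediction calls; both return the step's result.
    func step(fast: () throws -> Int, slow: () -> Int) -> Int {
        if useFastPath {
            fastAttempts += 1
            do {
                return try fast()
            } catch {
                // Rejection is treated as a static property of the model,
                // so the flag is latched off rather than retried.
                useFastPath = false
            }
        }
        return slow()
    }
}

let runner = LatchedRunner()
var results: [Int] = []
for _ in 0..<5 {
    let value = runner.step(
        fast: { throw BackingsRejected() },
        slow: { 1 })
    results.append(value)
}
```

In this sketch the throwing fast path runs on the first step only; the remaining four steps go straight to the fallback, which is the behavior the latch is meant to guarantee across the roughly 163 decode steps per phrase.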
/// Read a `[1, 1, 6761]` fp32 logits MLMultiArray into `dst`. Honors the
/// last-dim stride (CoreML may emit non-compact strides on aligned
/// allocations); uses `memcpy` when stride==1, strided gather otherwise.
private static func readLogits(from arr: MLMultiArray, into dst: inout [Float]) {
let count = CosyVoice3Constants.speechVocab
var logits = [Float](repeating: 0, count: count)
let strides = logitsArr.strides.map { $0.intValue }
let strides = arr.strides.map { $0.intValue }
let vocabStride = strides.last ?? 1
let base = logitsArr.dataPointer.bindMemory(to: Float.self, capacity: logitsArr.count)
for i in 0..<count { logits[i] = base[i * vocabStride] }
return logits
let base = arr.dataPointer.bindMemory(to: Float.self, capacity: arr.count)
if vocabStride == 1 {
dst.withUnsafeMutableBytes { rawDst in
guard let dstPtr = rawDst.baseAddress else { return }
memcpy(dstPtr, base, count * MemoryLayout<Float>.size)
}
} else {
for i in 0..<count { dst[i] = base[i * vocabStride] }
}
}
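The stride handling in `readLogits` can be illustrated on a plain array. In the sketch below, `gather` is a hypothetical helper (not part of this PR) that reads every `stride`-th element of a padded buffer, with a contiguous shortcut when the stride is 1, which plays the role of the `memcpy` branch.

```swift
/// Hypothetical helper: copy `count` logical elements out of `storage`,
/// where consecutive logical elements sit `stride` slots apart.
func gather(_ storage: [Float], count: Int, stride: Int) -> [Float] {
    precondition(stride >= 1)
    precondition(count == 0 || (count - 1) * stride < storage.count)
    if stride == 1 {
        // Compact layout: a single contiguous copy is enough.
        return Array(storage[0..<count])
    }
    // Non-compact layout: pick out every `stride`-th element.
    return (0..<count).map { storage[$0 * stride] }
}

let padded: [Float] = [1, 0, 2, 0, 3, 0]  // stride-2 layout with padding
let fromPadded = gather(padded, count: 3, stride: 2)
let fromCompact = gather([1, 2, 3], count: 3, stride: 1)
```

Both calls recover the same three logical values, which is why the fast and slow decode paths can share one reader.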
/// Seed the 48 decode state buffers (`kv_k_0..kv_k_23`, `kv_v_0..kv_v_23`)
/// from prefill's `kv_k` / `kv_v` outputs.
///
/// Prefill logical shape per cache is `[L=24, 1, Hkv=2, M=768, D=64]`
/// fp16; each per-layer state buffer is `[1, 2, 768, 64]` fp16. Copy
/// layer-by-layer using stride-aware indexing (prefill strides may not
/// be compact), letting CoreML's state writer convert to the underlying
/// fp16 storage.
private func seedDecodeState(
state: MLState,
kvK: MLMultiArray,
kvV: MLMultiArray
/// Copy a CoreML-allocated `kv_k_out` / `kv_v_out` MLMultiArray into our
/// pre-allocated backing array. Used on the `outputBackings`-rejected
/// fallback path so the front/back rotation protocol stays consistent.
private static func copyKvOutput(
_ src: MLMultiArray,
into dst: MLMultiArray,
name: String
) throws {
// Prefill declares fp32 KV outputs at its CoreML I/O boundary
// (even though the weights / activations internally are fp16).
// Decode state buffers are fp16. Convert per-element as we copy.
guard kvK.dataType == .float32 && kvV.dataType == .float32 else {
guard src.dataType == dst.dataType else {
throw CosyVoice3Error.predictionFailed(
"seedDecodeState: expected fp32 KV from prefill (kv_k=\(kvK.dataType.rawValue) kv_v=\(kvV.dataType.rawValue))"
)
"decode \(name): dtype mismatch \(src.dataType.rawValue) vs \(dst.dataType.rawValue)")
}
let L = CosyVoice3Constants.numLayers
let H = CosyVoice3Constants.kvHeads
let M = CosyVoice3Constants.kvMaxLength
let D = CosyVoice3Constants.headDim
// Prefill output strides for shape [L, 1, H, M, D].
let kStrides = kvK.strides.map { $0.intValue }
let vStrides = kvV.strides.map { $0.intValue }
let kLayerStride = kStrides[0]
let kHStride = kStrides[2]
let kMStride = kStrides[3]
let kDStride = kStrides[4]
let vLayerStride = vStrides[0]
let vHStride = vStrides[2]
let vMStride = vStrides[3]
let vDStride = vStrides[4]
let kSrcPtr = kvK.dataPointer.bindMemory(to: Float.self, capacity: kvK.count)
let vSrcPtr = kvV.dataPointer.bindMemory(to: Float.self, capacity: kvV.count)
// Collect dtype-mismatch errors from inside the non-throwing closures.
var stateDtypeError: String?
for i in 0..<L {
state.withMultiArray(for: "kv_k_\(i)") { buf in
guard buf.dataType == .float16 else {
if stateDtypeError == nil {
stateDtypeError = "kv_k_\(i) expected fp16 state, got \(buf.dataType.rawValue)"
}
return
}
let b = buf.strides.map { $0.intValue }
let dPtr = buf.dataPointer.bindMemory(to: Float16.self, capacity: buf.count)
Self.copyLayerF32ToF16(
src: kSrcPtr, srcLayerBase: i * kLayerStride,
srcHStride: kHStride, srcMStride: kMStride, srcDStride: kDStride,
dst: dPtr,
dstHStride: b[1], dstMStride: b[2], dstDStride: b[3],
H: H, M: M, D: D)
}
state.withMultiArray(for: "kv_v_\(i)") { buf in
guard buf.dataType == .float16 else {
if stateDtypeError == nil {
stateDtypeError = "kv_v_\(i) expected fp16 state, got \(buf.dataType.rawValue)"
}
return
}
let b = buf.strides.map { $0.intValue }
let dPtr = buf.dataPointer.bindMemory(to: Float16.self, capacity: buf.count)
Self.copyLayerF32ToF16(
src: vSrcPtr, srcLayerBase: i * vLayerStride,
srcHStride: vHStride, srcMStride: vMStride, srcDStride: vDStride,
dst: dPtr,
dstHStride: b[1], dstMStride: b[2], dstDStride: b[3],
H: H, M: M, D: D)
}
}
if let msg = stateDtypeError {
throw CosyVoice3Error.predictionFailed("seedDecodeState: \(msg)")
}
}
/// Copy one `[H, M, D]` KV slab from a fp32 prefill output into a fp16
/// decode state buffer. Strides may be non-compact on either side.
private static func copyLayerF32ToF16(
src: UnsafeMutablePointer<Float>,
srcLayerBase: Int,
srcHStride: Int, srcMStride: Int, srcDStride: Int,
dst: UnsafeMutablePointer<Float16>,
dstHStride: Int, dstMStride: Int, dstDStride: Int,
H: Int, M: Int, D: Int
) {
for h in 0..<H {
for m in 0..<M {
for d in 0..<D {
let sOff = srcLayerBase + h * srcHStride + m * srcMStride + d * srcDStride
let dOff = h * dstHStride + m * dstMStride + d * dstDStride
dst[dOff] = Float16(src[sOff])
}
}
guard src.count == dst.count else {
throw CosyVoice3Error.predictionFailed(
"decode \(name): count mismatch \(src.count) vs \(dst.count)")
}
// KV outputs are fp32. With contiguous strides (the default for
// freshly-allocated CoreML outputs in this graph) memcpy is safe.
let bytes = src.count * MemoryLayout<Float>.size
memcpy(dst.dataPointer, src.dataPointer, bytes)
}
private func runFlow(
@@ -10,6 +10,24 @@ public struct CosyVoice3SynthesisResult: Sendable {
public let generatedTokenCount: Int
/// Decoded speech token ids (useful for debugging + round-trip).
public let decodedTokens: [Int32]
/// `true` when the LLM-Decode AR loop ended on an EOS token in
/// `CosyVoice3Constants.stopRange` (natural termination); `false` when
/// the loop exhausted its decode budget (`flowTotalTokens - nPrompt`)
/// without observing EOS, so the audio is truncated mid-utterance.
/// See the `.warning`-level log emitted from `CosyVoice3Synthesizer`
/// when this is `false`.
public let finishedOnEos: Bool
public init(
samples: [Float], sampleRate: Int, generatedTokenCount: Int,
decodedTokens: [Int32], finishedOnEos: Bool
) {
self.samples = samples
self.sampleRate = sampleRate
self.generatedTokenCount = generatedTokenCount
self.decodedTokens = decodedTokens
self.finishedOnEos = finishedOnEos
}
}
/// Options controlling a CosyVoice3 parity / synthesis call.
@@ -42,9 +60,20 @@ public struct CosyVoice3SynthesisOptions: Sendable {
public let maxNewTokens: Int?
/// Sampler seed for the top-p/top-k + multinomial fallback path.
public let seed: UInt64
/// When `true`, skips `CosyVoice3TextChunker.chunk(...)` and runs a
/// single synthesizer call regardless of input length. Useful for
/// callers that pre-segment input themselves (e.g. UI-driven streaming
/// per sentence). The structural 250-token Flow cap still applies and
/// long inputs will truncate mid-utterance with a `.warning` log.
public let disableAutoChunking: Bool
public init(maxNewTokens: Int? = nil, seed: UInt64 = 42) {
public init(
maxNewTokens: Int? = nil,
seed: UInt64 = 42,
disableAutoChunking: Bool = false
) {
self.maxNewTokens = maxNewTokens
self.seed = seed
self.disableAutoChunking = disableAutoChunking
}
}
@@ -42,6 +42,39 @@ public struct KokoroAneComputeUnits: Sendable, Equatable {
prosody: .cpuAndGPU, noise: .cpuAndGPU, vocoder: .cpuAndGPU, tail: .cpuAndGPU
)
/// Force every stage onto `.cpuAndNeuralEngine`. Stages that hit
/// ANE-incompatible ops will fall back to CPU silently; included
/// for the benchmark sweep (efficiency vs. latency comparison).
public static let allAne = KokoroAneComputeUnits(
albert: .cpuAndNeuralEngine, postAlbert: .cpuAndNeuralEngine,
alignment: .cpuAndNeuralEngine, prosody: .cpuAndNeuralEngine,
noise: .cpuAndNeuralEngine, vocoder: .cpuAndNeuralEngine,
tail: .cpuAndNeuralEngine
)
/// CPU-only (no ANE, no GPU). Slowest but most predictable; useful
/// as a debugging / fallback baseline.
public static let cpuOnly = KokoroAneComputeUnits(
albert: .cpuOnly, postAlbert: .cpuOnly, alignment: .cpuOnly,
prosody: .cpuOnly, noise: .cpuOnly, vocoder: .cpuOnly, tail: .cpuOnly
)
/// Build a configuration from a generic preset (used by the
/// `tts-benchmark` CLI so a single flag maps cleanly across
/// backends).
public init(preset: TtsComputeUnitPreset) {
switch preset {
case .default:
self = .default
case .allAne:
self = .allAne
case .cpuAndGpu:
self = .cpuAndGpu
case .cpuOnly:
self = .cpuOnly
}
}
func units(for stage: KokoroAneStage) -> MLComputeUnits {
switch stage {
case .albert: return albert
@@ -75,6 +75,11 @@ public actor MagpieTtsManager {
public func initialize() async throws {
if synthesizer != nil { return }
logger.warning(
"Magpie TTS is experimental / beta. Synthesis is below real-time "
+ "(agg-RTFx ~0.41× on M2 for the MiniMax-English corpus) — "
+ "see Documentation/TTS/Magpie.md.")
let store = MagpieModelStore(
directory: directory,
computeUnits: computeUnits,
@@ -54,6 +54,15 @@ public final class MagpieKvCache {
public private(set) var cachesV: [MLMultiArray]
public private(set) var positions: [MLMultiArray]
/// Set to `false` once `decoder_step.mlmodelc` rejects `outputBackings`
/// (e.g. when the model was exported without explicit MultiArray
/// shape/dtype constraints on its KV outputs). The rejection is a static
/// property of the model, so once it fails we permanently skip the fast
/// path and go straight to the fresh-alloc fallback to avoid throwing +
/// catching an exception on every one of the ~500 AR decode steps per
/// utterance.
public var useOutputBackings: Bool = true
/// Back-buffer set for double-buffered AR loop. Used as `outputBackings` so
/// CoreML writes new K/V/pos straight into our pre-allocated arrays instead
/// of allocating ~18.9 MB of fresh fp16 buffers per step. After each
@@ -769,14 +769,73 @@ public actor MagpieSynthesizer {
// step. The cache provides 24 K/V + 12 position back-buffers, the
// synthesizer provides the 1 hidden buffer. After the call,
// `swapBackings` promotes back→front for the next step's inputs.
var backings: [String: Any] = [:]
cache.addOutputBackings(to: &backings)
backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking
let predOpts = MLPredictionOptions()
predOpts.outputBackings = backings
//
// If a previous step already proved that this model was exported
// without explicit MultiArray shape/dtype constraints on its KV
// outputs, `cache.useOutputBackings` is `false` and we skip the
// fast path entirely. This avoids the per-step throw/catch overhead
// and debug-log spam across the entire AR loop (~500 iterations).
var fastPathSucceeded = false
if cache.useOutputBackings {
var backings: [String: Any] = [:]
cache.addOutputBackings(to: &backings)
backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking
let predOpts = MLPredictionOptions()
predOpts.outputBackings = backings
_ = try model.prediction(from: provider, options: predOpts)
cache.swapBackings()
do {
_ = try model.prediction(from: provider, options: predOpts)
cache.swapBackings()
fastPathSucceeded = true
} catch {
// CoreML refused our pre-allocated outputBackings, typically
// because `decoder_step.mlmodelc` was exported without
// explicit MultiArray shape/dtype constraints on its KV
// outputs, so the runtime can't validate the buffer layout
// and bails with
// "Output feature (null) doesn't support output backing
// because it doesn't have a MultiArray constraints."
// The rejection is a static property of the model, so latch
// the cache flag off to skip the fast path on every
// subsequent step (avoids ~500 throw/catch + log lines per
// utterance).
cache.useOutputBackings = false
logger.debug(
"decoder_step outputBackings rejected "
+ "(\(error.localizedDescription)); switching to "
+ "fresh-alloc fallback for the rest of the run")
}
}
if !fastPathSucceeded {
// Slow path: re-run without `outputBackings`, route the
// freshly-allocated K/V/pos through `MagpieKvCache.absorbOutputs`
// (which replaces front pointers directly), and copy the hidden
// state into `hiddenBacking` so the rest of this function works
// unchanged. Costs ~18.9 MB of fresh fp16 allocation per step;
// proper fix is to re-export `decoder_step.mlmodelc` with
// shape/dtype constraints on `new_k_*`/`new_v_*`/`var_*`.
let output = try model.prediction(from: provider)
try cache.absorbOutputs(output)
guard
let hidden = output.featureValue(for: MagpieKvCache.decoderHiddenKey)?
.multiArrayValue
else {
throw MagpieError.inferenceFailed(
stage: "decoder_step",
underlying:
"missing hidden output key \(MagpieKvCache.decoderHiddenKey)")
}
guard hidden.dataType == .float16, hidden.count == hiddenBacking.count else {
throw MagpieError.inferenceFailed(
stage: "decoder_step",
underlying:
"decoder hidden mismatch (dtype=\(hidden.dataType.rawValue) "
+ "count=\(hidden.count) expected=\(hiddenBacking.count))")
}
let bytes = hiddenBacking.count * MemoryLayout<UInt16>.size
memcpy(hiddenBacking.dataPointer, hidden.dataPointer, bytes)
}
// Hidden state lives in `hiddenBacking` after the call. Convert fp16 →
// fp32 via vImage into a fresh [Float] result buffer (the sampler
@@ -0,0 +1,72 @@
@preconcurrency import CoreML
import Foundation
/// Generic compute-unit preset shared across TTS backends.
///
/// Each backend keeps its own per-stage `<Backend>ComputeUnits` struct
/// because stage names differ (Kokoro ANE has 7 stages, PocketTTS has 4
/// CoreML models, StyleTTS2 has 4 models, etc.). This preset is the
/// uniform knob the benchmarking harness flips so a single CLI flag
/// (`--compute-units default|all-ane|cpu-and-gpu|cpu-only`) maps to a
/// sensible per-stage assignment on every backend.
///
/// Backends opt in by adding `init(preset: TtsComputeUnitPreset)` to
/// their compute-units struct (see `KokoroAneComputeUnits` for the
/// reference implementation).
public enum TtsComputeUnitPreset: String, Sendable, CaseIterable {
/// The backend's empirically-tuned default, typically a mix of
/// ANE-friendly and CPU+GPU stages chosen by the conversion author.
case `default`
/// Force every stage to `.cpuAndNeuralEngine`. Worst case for stages
/// that fall back to CPU on ANE-incompatible ops, but the most
/// energy-efficient when ops are ANE-clean.
case allAne
/// Force every stage to `.cpuAndGPU`. Skips the ANE entirely;
/// useful as a latency baseline when the ANE compile cache is cold
/// (no `anecompilerservice` time on first call).
case cpuAndGpu
/// Force every stage to `.cpuOnly`. Fallback / debugging baseline;
/// every backend should at least run here, however slowly.
case cpuOnly
/// Concrete `MLComputeUnits` for "force every stage to X" presets.
/// Returns `nil` for `.default`, which means "let the backend keep
/// its empirical mapping".
public var uniformUnits: MLComputeUnits? {
switch self {
case .default: return nil
case .allAne: return .cpuAndNeuralEngine
case .cpuAndGpu: return .cpuAndGPU
case .cpuOnly: return .cpuOnly
}
}
/// Parse the CLI flag value (`default`, `all-ane`, `cpu-and-gpu`,
/// `cpu-only`). Returns `nil` for unrecognised values so callers
/// can surface a usage error.
public init?(cliValue: String) {
switch cliValue.lowercased() {
case "default": self = .default
case "all-ane", "ane", "neural-engine": self = .allAne
case "cpu-and-gpu", "cpuandgpu", "gpu": self = .cpuAndGpu
case "cpu-only", "cpu", "cpuonly": self = .cpuOnly
default: return nil
}
}
/// Canonical kebab-case form, matching the CLI flag values the
/// `init?(cliValue:)` parser accepts. Use this for log lines and
/// JSON reports so values round-trip back through the parser.
public var cliValue: String {
switch self {
case .default: return "default"
case .allAne: return "all-ane"
case .cpuAndGpu: return "cpu-and-gpu"
case .cpuOnly: return "cpu-only"
}
}
}
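The round-trip contract between `init?(cliValue:)` and `cliValue` can be checked in isolation. The sketch below is a minimal illustration only: `Preset` is a hypothetical stand-in that copies the parsing rules shown above and drops the CoreML dependency so it runs as a plain script.

```swift
// Hypothetical stand-in for TtsComputeUnitPreset (no CoreML import needed).
enum Preset: String, CaseIterable {
    case `default`, allAne, cpuAndGpu, cpuOnly

    /// Parse a CLI value; aliases copied from the parser in the diff above.
    init?(cliValue: String) {
        switch cliValue.lowercased() {
        case "default": self = .default
        case "all-ane", "ane", "neural-engine": self = .allAne
        case "cpu-and-gpu", "cpuandgpu", "gpu": self = .cpuAndGpu
        case "cpu-only", "cpu", "cpuonly": self = .cpuOnly
        default: return nil
        }
    }

    /// Canonical kebab-case form used for logs and reports.
    var cliValue: String {
        switch self {
        case .default: return "default"
        case .allAne: return "all-ane"
        case .cpuAndGpu: return "cpu-and-gpu"
        case .cpuOnly: return "cpu-only"
        }
    }
}

// Every canonical value should parse back to the case that produced it.
let roundTrips = Preset.allCases.allSatisfy { Preset(cliValue: $0.cliValue) == $0 }
print(roundTrips)

// Aliases are case-insensitive and normalise to the canonical form.
print(Preset(cliValue: "GPU")?.cliValue ?? "nil")

// Unrecognised values return nil so the caller can print a usage error.
print(Preset(cliValue: "tpu") == nil)
```

A check like this is a cheap unit test to keep alongside the enum if more aliases are added later.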
@@ -94,4 +94,25 @@ public struct StyleTTS2Vocab: Sendable {
}
return ids
}
/// Diagnostic encode: same logic as `encode(_:)` but also returns a
/// frequency map of every scalar that fell off the floor because no
/// vocab entry exists for it. Used by the StyleTTS2 CLI's
/// `--tokenize-only` mode to quantify the misaki → espeak inventory
/// gap without actually invoking the diffusion pipeline.
public func encodeWithReport(
_ phonemes: String
) -> (ids: [Int32], dropped: [Unicode.Scalar: Int]) {
var ids: [Int32] = []
ids.reserveCapacity(phonemes.unicodeScalars.count)
var dropped: [Unicode.Scalar: Int] = [:]
for scalar in phonemes.unicodeScalars {
if let id = map[Character(scalar)] {
ids.append(id)
} else {
dropped[scalar, default: 0] += 1
}
}
return (ids, dropped)
}
}
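To make the shape of the drop report concrete, here is a minimal sketch of the same encode-and-count loop against a toy vocabulary. The `toyVocab` ids are invented for illustration and do not correspond to the real 178-token table; the free function is a hypothetical stand-in for the method on `StyleTTS2Vocab`.

```swift
// Toy vocabulary; ids are made up and NOT the real StyleTTS2 token ids.
let toyVocab: [Character: Int32] = ["h": 1, "ə": 2, "l": 3, "o": 4]

/// Encode scalar by scalar, counting every scalar with no vocab entry.
func encodeWithReport(
    _ phonemes: String, vocab: [Character: Int32]
) -> (ids: [Int32], dropped: [Unicode.Scalar: Int]) {
    var ids: [Int32] = []
    ids.reserveCapacity(phonemes.unicodeScalars.count)
    var dropped: [Unicode.Scalar: Int] = [:]
    for scalar in phonemes.unicodeScalars {
        if let id = vocab[Character(scalar)] {
            ids.append(id)
        } else {
            dropped[scalar, default: 0] += 1
        }
    }
    return (ids, dropped)
}

// "ʧ" (U+02A7) is absent from the toy vocab, so both occurrences are dropped.
let report = encodeWithReport("həlʧoʧ", vocab: toyVocab)
print(report.ids)
for (scalar, count) in report.dropped {
    print("U+\(String(scalar.value, radix: 16, uppercase: true)) '\(scalar)' x\(count)")
}
```

The invariant worth noting is that kept ids plus dropped counts always equals the scalar count of the input, which is what the CLI summary relies on.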
@@ -4,14 +4,30 @@ import Foundation
///
/// For English (`.americanEnglish`), uses the in-tree `G2PModel` (BART
/// encoder-decoder, misaki-style IPA) and remaps the misaki conventions to
/// the espeak-ng convention that StyleTTS2's LibriTTS checkpoint expects.
///
/// **Per-piece (single glyph) remap** applied as misaki emits each piece:
///
///     misaki → espeak-ng
///     A → eɪ    I → aɪ    O → oʊ    W → aʊ    Y → ɔɪ
///     ᵊ → ə     (tiny-schwa offglide; not in StyleTTS2's 178-vocab)
///
/// Other glyphs (`ʤ`, `ʧ`, `ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in
/// the 178-token espeak-ng vocabulary and pass through.
/// **Post-pass (multi-glyph) remap** applied to the assembled phoneme
/// string after every word has been emitted. Both the ligature and the
/// decomposed forms exist as distinct tokens in the 178-vocab, but the
/// LibriTTS checkpoint was trained against espeak-ng output, so the model's
/// embeddings for the misaki ligature glyphs (`ʧ`, `ʤ`) are essentially
/// untrained noise. Same story for the schwa+r digraphs that espeak collapses
/// into single rhotic vowels (`ɝ`, `ɚ`):
///
///     misaki → espeak-ng     word    example
///     ʧ      → tʃ            choice  tʃˈɔɪs
///     ʤ      → dʒ            jump    dʒˈʌmps
///     ɜɹ     → ɝ (U+025D)    girl    ɡˈɝl
///     əɹ     → ɚ (U+025A)    over    ˈoʊvɚ
///
/// Other glyphs (`ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in the
/// 178-token espeak-ng vocabulary and pass through unchanged.
///
/// Non-English languages fall back to `MultilingualG2PModel` (CharsiuG2P
/// ByT5). Output quality there is unvalidated; the LibriTTS checkpoint is
@@ -46,6 +62,30 @@ public enum StyleTTS2Phonemizer {
"ᵊ": "ə",
]
/// Post-pass multi-glyph remap applied to the assembled phoneme string
/// after all word pieces have been concatenated. Decomposes misaki's
/// affricate ligatures and collapses the schwa+r digraphs into the
/// single rhotic vowels espeak-ng emits; see the type-level docs for
/// rationale. Order matters only insofar as `əɹ` and `ɜɹ` must be
/// applied before any rule that would consume the trailing `ɹ` (none
/// exist today; left ordered for future-proofing).
private static let misakiToEspeakPostPass: [(String, String)] = [
("ʧ", "tʃ"),
("ʤ", "dʒ"),
("ɜɹ", "ɝ"),
("əɹ", "ɚ"),
]
/// Apply `misakiToEspeakPostPass` rules to a phoneme string in order.
/// Exposed `internal` for unit tests.
internal static func applyEspeakPostPass(_ s: String) -> String {
var out = s
for (from, to) in misakiToEspeakPostPass {
out = out.replacingOccurrences(of: from, with: to)
}
return out
}
/// Convert raw text to an IPA phoneme string for StyleTTS2.
///
/// - Parameters:
@@ -87,6 +127,13 @@ public enum StyleTTS2Phonemizer {
try await flushWord(&wordBuffer, language: language, into: &output)
}
// Multi-glyph misaki → espeak normalization. Only meaningful for
// English (the LibriTTS checkpoint is English-only); skipping for
// other languages avoids touching CharsiuG2P output we don't have
// a model contract for.
if language == .americanEnglish {
output = applyEspeakPostPass(output)
}
return output
}
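The post-pass is an ordered list of plain string substitutions, so its effect is easy to check on the word examples from the type-level docs. The sketch below is an assumption-labelled illustration: `postPassRules` and `applyPostPass` are hypothetical names, and the rule values are taken from the mapping table in the doc comment (ligature to decomposed affricate, schwa+r digraph to rhotic vowel).

```swift
import Foundation

// Hypothetical copy of the rule table; values follow the doc-comment mapping.
let postPassRules: [(String, String)] = [
    ("ʧ", "tʃ"),   // affricate ligature -> decomposed form
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),   // schwa+r digraph -> single rhotic vowel (U+025D)
    ("əɹ", "ɚ"),   // (U+025A)
]

/// Apply each rule in order to the assembled phoneme string.
func applyPostPass(_ s: String) -> String {
    var out = s
    for (from, to) in postPassRules {
        out = out.replacingOccurrences(of: from, with: to)
    }
    return out
}

print(applyPostPass("ʧˈɔɪs"))    // misaki-style "choice"
print(applyPostPass("ɡˈɜɹl"))    // misaki-style "girl"
print(applyPostPass("ˈoʊvəɹ"))   // misaki-style "over"
print(applyPostPass("ðə"))       // no rule matches; passes through unchanged
```

Because none of the replacement strings contains the source side of another rule, the current rule order does not change the result; that would need rechecking if a rule consuming a bare `ɹ` were ever added.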
@@ -239,6 +239,13 @@ public actor StyleTTS2Synthesizer {
/// Slice an MLMultiArray of shape `(1, leading, trailing)` to the first
/// `take` entries along either the leading or trailing axis. Returns a
/// flat row-major `[Float]`.
///
/// Reads via `dataPointer` instead of `arr[idx].floatValue` and avoids
/// `arr.strides` entirely; both trigger
/// `E5RT: tensor_buffer has known strides while the model has
/// FlexibleShapeInfo` on `text_predictor`'s flex-shape outputs. CoreML
/// emits dense row-major buffers, so for shape `(1, leading, trailing)`
/// the flat index is simply `r * trailing + c`.
private func sliceFirstAxis2D(
arr: MLMultiArray,
leading: Int,
@@ -246,29 +253,50 @@ public actor StyleTTS2Synthesizer {
take: Int,
sliceDim: SliceDim
) -> [Float] {
let strides = arr.strides.map { $0.intValue }
let outCount: Int
switch sliceDim {
case .leading:
// Result shape: (take, trailing).
var out = [Float](repeating: 0, count: take * trailing)
for r in 0..<take {
for c in 0..<trailing {
let idx = r * strides[1] + c * strides[2]
out[r * trailing + c] = arr[idx].floatValue
}
}
return out
case .trailing:
// Result shape: (leading, take).
var out = [Float](repeating: 0, count: leading * take)
for r in 0..<leading {
for c in 0..<take {
let idx = r * strides[1] + c * strides[2]
out[r * take + c] = arr[idx].floatValue
}
}
return out
case .leading: outCount = take * trailing
case .trailing: outCount = leading * take
}
var out = [Float](repeating: 0, count: outCount)
func fill(_ get: (Int) -> Float) {
switch sliceDim {
case .leading:
// Result shape: (take, trailing).
for r in 0..<take {
for c in 0..<trailing {
out[r * trailing + c] = get(r * trailing + c)
}
}
case .trailing:
// Result shape: (leading, take).
for r in 0..<leading {
for c in 0..<take {
out[r * take + c] = get(r * trailing + c)
}
}
}
}
let count = arr.count
switch arr.dataType {
case .float32:
let p = arr.dataPointer.bindMemory(to: Float.self, capacity: count)
fill { p[$0] }
case .float16:
let p = arr.dataPointer.bindMemory(to: Float16.self, capacity: count)
fill { Float(p[$0]) }
case .double:
let p = arr.dataPointer.bindMemory(to: Double.self, capacity: count)
fill { Float(p[$0]) }
default:
// Fallback re-introduces the FlexibleShapeInfo trip wire, but
// we don't expect text_predictor to emit anything other than
// fp16/fp32.
fill { arr[$0].floatValue }
}
return out
}
// MARK: - Durations
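The indexing rule in the doc comment (flat index `r * trailing + c` for a dense row-major `(1, leading, trailing)` buffer) can be exercised without CoreML. This is a minimal sketch over a plain `[Float]`; `sliceRowMajor` and its `SliceDim` are hypothetical stand-ins, and the density assumption is stated as a precondition rather than verified against any real model output.

```swift
enum SliceDim { case leading, trailing }

/// Slice a dense row-major (leading x trailing) buffer to its first `take`
/// entries along one axis. Assumes no padding between rows.
func sliceRowMajor(
    _ flat: [Float], leading: Int, trailing: Int, take: Int, dim: SliceDim
) -> [Float] {
    precondition(flat.count == leading * trailing, "buffer must be dense")
    switch dim {
    case .leading:
        // Result shape (take, trailing): the first `take` rows are contiguous.
        precondition(take <= leading)
        return Array(flat[0..<(take * trailing)])
    case .trailing:
        // Result shape (leading, take): copy the first `take` columns per row.
        precondition(take <= trailing)
        var out = [Float](repeating: 0, count: leading * take)
        for r in 0..<leading {
            for c in 0..<take {
                out[r * take + c] = flat[r * trailing + c]
            }
        }
        return out
    }
}

// A 2 x 3 matrix stored row-major: [[1, 2, 3], [4, 5, 6]].
let m: [Float] = [1, 2, 3, 4, 5, 6]
let firstRow = sliceRowMajor(m, leading: 2, trailing: 3, take: 1, dim: .leading)
let firstTwoCols = sliceRowMajor(m, leading: 2, trailing: 3, take: 2, dim: .trailing)
print(firstRow)       // [1.0, 2.0, 3.0]
print(firstTwoCols)   // [1.0, 2.0, 4.0, 5.0]
```

Note the asymmetry: the source index always uses the full `trailing` width as the row pitch, while the destination index uses the width of the sliced result.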
@@ -15,10 +15,6 @@ import Foundation
/// - cumsum-of-durations one-hot matmul hard-alignment,
/// - bucket selection (round token length → text_predictor; round
/// mel frames → decoder).
///
/// **Status:** scaffold only. Synthesis is not yet implemented; calls to
/// `synthesize` throw `processingFailed`. The asset bring-up (download +
/// model store) is wired up so dependent layers can land incrementally.
public actor StyleTTS2Manager {
private let logger = AppLogger(category: "StyleTTS2Manager")
@@ -47,6 +43,10 @@ public actor StyleTTS2Manager {
public func initialize(
progressHandler: DownloadUtils.ProgressHandler? = nil
) async throws {
logger.warning(
"StyleTTS2 is experimental / beta. WER on long English phrases is "
+ "elevated on the MiniMax corpus (~44% vs Kokoro 1.3%) — see "
+ "Documentation/TTS/Benchmarks.md.")
_ = try await modelStore.ensureAssetsAvailable(progressHandler: progressHandler)
let config = try await modelStore.bundleConfig()
try config.validate()
@@ -111,6 +111,34 @@ public actor StyleTTS2Manager {
return try await synthesizer.synthesize(ids: ids, voice: voice, options: options)
}
/// Same as `synthesize` but returns raw fp32 PCM samples + sample rate.
/// Used by callers (e.g. the tts-benchmark harness, ASR pairing) that
/// don't want the WAV-encoding round trip.
public func synthesizeSamples(
text: String,
voiceStyleURL: URL,
language: MultilingualG2PLanguage = .americanEnglish,
diffusionSteps: Int = StyleTTS2Constants.defaultDiffusionSteps,
alpha: Float = 0.3,
beta: Float = 0.7,
randomSeed: UInt64? = nil
) async throws -> (samples: [Float], sampleRate: Int) {
guard isInitialized else {
throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized")
}
let voice = try StyleTTS2VoiceStyle.load(from: voiceStyleURL)
let (_, ids) = try await tokenize(text: text, language: language)
let options = StyleTTS2Synthesizer.Options(
diffusionSteps: diffusionSteps,
alpha: alpha,
beta: beta,
randomSeed: randomSeed
)
let samples = try await synthesizer.synthesizeSamples(
ids: ids, voice: voice, options: options)
return (samples, StyleTTS2Constants.audioSampleRate)
}
/// Run the text frontend (preprocess → G2P → vocab encode) end-to-end.
///
/// Available before the diffusion synthesizer is wired so callers can
@@ -138,6 +166,27 @@ public actor StyleTTS2Manager {
return (phonemes, ids)
}
/// Diagnostic tokenize: same as `tokenize(text:language:)` but also
/// returns the per-scalar drop frequency from
/// `StyleTTS2Vocab.encodeWithReport`. Used by the CLI to quantify
/// how much of the misaki BART G2P output the espeak-ng-trained
/// 178-token vocab can actually consume.
public func tokenizeWithReport(
text: String,
language: MultilingualG2PLanguage = .americanEnglish
) async throws -> (
phonemes: String, ids: [Int32], dropped: [Unicode.Scalar: Int]
) {
guard isInitialized else {
throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized")
}
let phonemes = try await StyleTTS2Phonemizer.phonemize(
text: text, language: language)
let vocab = try await modelStore.vocabulary()
let (ids, dropped) = vocab.encodeWithReport(phonemes)
return (phonemes, ids, dropped)
}
public func cleanup() {
isInitialized = false
}
@@ -13,7 +13,6 @@ import Foundation
/// --output .../build/swift_e2e.wav \
/// --seed 42
/// ```
@available(macOS 15, iOS 18, *)
enum CosyVoice3ParityCLI {
private static let logger = AppLogger(category: "CosyVoice3ParityCLI")
@@ -19,7 +19,6 @@ import Foundation
/// --output .../build/swift_cv3_text.wav \
/// --seed 42
/// ```
@available(macOS 15, iOS 18, *)
enum CosyVoice3TextCLI {
private static let logger = AppLogger(category: "CosyVoice3TextCLI")
@@ -0,0 +1,234 @@
#if os(macOS)
import FluidAudio
import Foundation
/// Swift port of `Scripts/fetch_minimax_tts_corpus.py`.
///
/// Fetches the MiniMax Multilingual TTS Test Set per-language `.txt` files
/// from HuggingFace and converts them to the FluidAudio TTS-benchmark
/// corpus format (strip `<cloning_audio_filename>|` prefix, prepend a
/// header documenting source + revision + license).
///
/// Reuses `DownloadUtils.fetchHuggingFaceFile` so we get the same auth
/// (HF_TOKEN env), retry, and backoff treatment as every other HF asset
/// pull in the project: no hardcoded URLs, no swift-transformers
/// dependency added just for one corpus fetch.
///
/// Source dataset: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
/// License: CC-BY-SA-4.0
public enum MinimaxCorpusCommand {
private static let logger = AppLogger(category: "MinimaxCorpusCommand")
private static let repo = "MiniMaxAI/TTS-Multilingual-Test-Set"
/// Pin to the initial public commit so re-runs reproduce the vendored
/// files. Matches `DEFAULT_REVISION` in the Python script.
private static let defaultRevision = "cb416f0ac3658da0577e97873065e19fe6488917"
/// All 24 languages in the upstream `text/` directory. Keep in sync with
/// `ALL_LANGUAGES` in `Scripts/fetch_minimax_tts_corpus.py`.
private static let allLanguages: [String] = [
"arabic", "cantonese", "chinese", "czech", "dutch", "english",
"finnish", "french", "german", "greek", "hindi", "indonesian",
"italian", "japanese", "korean", "polish", "portuguese", "romanian",
"russian", "spanish", "thai", "turkish", "ukrainian", "vietnamese",
]
public static func run(arguments: [String]) async {
var languages = allLanguages
var revision = defaultRevision
var outDir: URL? = nil
var i = 0
while i < arguments.count {
let arg = arguments[i]
switch arg {
case "--languages", "-l":
if i + 1 < arguments.count {
languages = arguments[i + 1]
.split(separator: ",")
.map { $0.trimmingCharacters(in: .whitespaces) }
.filter { !$0.isEmpty }
i += 1
}
case "--revision":
if i + 1 < arguments.count {
revision = arguments[i + 1]
i += 1
}
case "--out-dir":
if i + 1 < arguments.count {
outDir = URL(fileURLWithPath: arguments[i + 1])
i += 1
}
case "help", "--help", "-h":
printUsage()
return
default:
logger.error("Unknown argument: \(arg)")
printUsage()
exit(1)
}
i += 1
}
let unknown = Set(languages).subtracting(allLanguages).sorted()
if !unknown.isEmpty {
logger.error("Unknown language(s): \(unknown.joined(separator: ", "))")
logger.error("Available: \(allLanguages.joined(separator: ", "))")
exit(2)
}
let resolvedOutDir = outDir ?? defaultOutDir()
do {
try FileManager.default.createDirectory(
at: resolvedOutDir, withIntermediateDirectories: true)
} catch {
logger.error("Failed to create output directory: \(error.localizedDescription)")
exit(1)
}
logger.info("Fetching MiniMax TTS Multilingual Test Set @ \(revision)")
logger.info(" out_dir: \(resolvedOutDir.path)")
logger.info(" langs: \(languages.count)")
var total = 0
for lang in languages {
guard let url = URL(string: hfURL(repo: repo, revision: revision, path: "text/\(lang).txt"))
else {
logger.error("[\(lang)] failed to construct URL")
exit(1)
}
do {
let data = try await DownloadUtils.fetchHuggingFaceFile(
from: url, description: "minimax TTS corpus (\(lang))")
guard let raw = String(data: data, encoding: .utf8) else {
logger.error("[\(lang)] response was not valid UTF-8")
exit(1)
}
let phrases = convert(raw: raw)
let outPath = try writeCorpus(
lang: lang, phrases: phrases, outDir: resolvedOutDir,
revision: revision)
let countStr = String(format: "%3d", phrases.count)
let relPath = relativePath(outPath, from: repoRoot())
logger.info(" [\(lang)] \(countStr) phrases -> \(relPath)")
total += phrases.count
} catch {
logger.error("[\(lang)] FAILED: \(error.localizedDescription)")
exit(1)
}
}
logger.info("OK — \(total) phrases across \(languages.count) language(s).")
}
// MARK: - Helpers
private static func hfURL(repo: String, revision: String, path: String) -> String {
"https://huggingface.co/datasets/\(repo)/resolve/\(revision)/\(path)"
}
/// Strip `<filename>|` prefix and return the list of trimmed phrases.
/// Mirrors `convert()` in the Python script.
private static func convert(raw: String) -> [String] {
var out: [String] = []
for rawLine in raw.split(separator: "\n", omittingEmptySubsequences: false) {
let line = rawLine.trimmingCharacters(in: .whitespacesAndNewlines)
if line.isEmpty { continue }
// Format: "<cloning_audio_filename>|<text>". Some lines may have
// extra `|` inside the text; keep only the first split.
let text: String
if let sepIdx = line.firstIndex(of: "|") {
text = String(line[line.index(after: sepIdx)...])
.trimmingCharacters(in: .whitespacesAndNewlines)
} else {
text = line
}
if !text.isEmpty {
out.append(text)
}
}
return out
}
private static func writeCorpus(
lang: String,
phrases: [String],
outDir: URL,
revision: String
) throws -> URL {
let outPath = outDir.appendingPathComponent("\(lang).txt")
let header: [String] = [
"# MiniMax Multilingual TTS Test Set — \(lang)",
"# Source: https://huggingface.co/datasets/\(repo)",
"# Revision: \(revision)",
"# License: CC-BY-SA-4.0 (Creative Commons Attribution-ShareAlike 4.0)",
"# Phrases: \(phrases.count)",
"#",
"# Cloning-audio filenames have been stripped — we only need the",
"# text for the FluidAudio TTS benchmark harness. Voice selection",
"# is per-backend (see Documentation/TTS/MinimaxCorpus.md).",
"",
]
let body = (header + phrases).joined(separator: "\n") + "\n"
try body.write(to: outPath, atomically: true, encoding: .utf8)
return outPath
}
/// `<repo>/Benchmarks/tts/corpus/minimax/`. Resolves relative to the
/// current working directory (the standard place `swift run` is invoked
/// from); falls back gracefully if the layout doesn't exist yet because
/// we `createDirectory(withIntermediateDirectories: true)` before write.
private static func defaultOutDir() -> URL {
repoRoot()
.appendingPathComponent("Benchmarks", isDirectory: true)
.appendingPathComponent("tts", isDirectory: true)
.appendingPathComponent("corpus", isDirectory: true)
.appendingPathComponent("minimax", isDirectory: true)
}
private static func repoRoot() -> URL {
URL(fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
}
private static func relativePath(_ url: URL, from base: URL) -> String {
let path = url.standardizedFileURL.path
let basePath = base.standardizedFileURL.path
if path.hasPrefix(basePath + "/") {
return String(path.dropFirst(basePath.count + 1))
}
return path
}
private static func printUsage() {
logger.info(
"""
Usage: fluidaudio minimax-corpus [options]
Fetches the MiniMax Multilingual TTS Test Set text files from
HuggingFace and converts them to the FluidAudio TTS-benchmark
corpus format. Outputs one file per language.
Options:
--languages, -l <list> Comma-separated subset of languages
(default: all 24).
--revision <sha> HuggingFace dataset revision
(default: \(defaultRevision)).
--out-dir <path> Output directory
(default: Benchmarks/tts/corpus/minimax).
--help, -h Show this help.
Available languages:
\(allLanguages.joined(separator: ", "))
Examples:
fluidaudio minimax-corpus
fluidaudio minimax-corpus --languages english,spanish,hindi
fluidaudio minimax-corpus --revision <commit-sha>
""")
}
}
#endif
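One detail of the `<filename>|<text>` conversion is worth a sketch. In Swift, `"\r\n"` is a single `Character`, so splitting on `"\n"` alone would not break CRLF-terminated lines. Whether the upstream files ever use CRLF is an assumption, not something verified here; the hypothetical `stripCloningPrefix` below shows a variant that splits on `Character.isNewline` and otherwise follows the same first-`|` rule as `convert(raw:)`.

```swift
import Foundation

/// Hypothetical variant of the prefix-stripping step that tolerates both
/// LF and CRLF line endings. Keeps everything after the FIRST `|`.
func stripCloningPrefix(_ raw: String) -> [String] {
    var out: [String] = []
    // `isNewline` is true for "\n", "\r", and the single Character "\r\n".
    for rawLine in raw.split(whereSeparator: \.isNewline) {
        let line = rawLine.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !line.isEmpty else { continue }
        let text: String
        if let sep = line.firstIndex(of: "|") {
            text = String(line[line.index(after: sep)...])
                .trimmingCharacters(in: .whitespacesAndNewlines)
        } else {
            text = line  // no prefix present; keep the line as-is
        }
        if !text.isEmpty { out.append(text) }
    }
    return out
}

// Mixed line endings, a blank line, an embedded `|`, and a prefix-less line.
let sample = "a.wav|Hello there.\r\n\r\nb.wav|Yes | no\nplain line\n"
let phrases = stripCloningPrefix(sample)
print(phrases)
```

If the pinned revision only ever serves LF files, the original `split(separator: "\n")` behaves identically and this variant is purely defensive.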
@@ -23,6 +23,8 @@ public enum StyleTTS2Command {
var alpha: Float = 0.3
var beta: Float = 0.7
var seed: UInt64?
var tokenizeOnly = false
var corpusPath: String?
var i = 0
while i < arguments.count {
@@ -74,6 +76,16 @@ public enum StyleTTS2Command {
fputs("--seed requires an integer\n", stderr)
exit(2)
}
case "--tokenize-only":
tokenizeOnly = true
i += 1
case "--corpus":
guard i + 1 < arguments.count else {
fputs("--corpus requires a path\n", stderr)
exit(2)
}
corpusPath = arguments[i + 1]
i += 2
case "--help", "-h":
printUsage()
return
@@ -88,6 +100,11 @@ public enum StyleTTS2Command {
}
}
if tokenizeOnly {
await runTokenizeOnly(text: text, corpusPath: corpusPath)
return
}
guard let text else {
fputs("Missing required text argument\n", stderr)
printUsage()
@@ -136,6 +153,105 @@ public enum StyleTTS2Command {
}
}
/// `--tokenize-only`: phonemize + encode without invoking the diffusion
/// pipeline. Reports phoneme string, token id sequence, and any scalars
/// that the 178-token espeak-ng vocab silently dropped. With `--corpus`
/// runs over every line of a phrase file and aggregates a histogram of
/// dropped scalars for the whole corpus.
private static func runTokenizeOnly(text: String?, corpusPath: String?) async {
do {
let manager = StyleTTS2Manager()
try await manager.initialize { _ in }
var totalScalars = 0
var totalIds = 0
var totalDropped = 0
var dropHist: [Unicode.Scalar: Int] = [:]
var phraseCount = 0
func process(_ phrase: String) async throws {
let (phonemes, ids, dropped) =
try await manager.tokenizeWithReport(text: phrase)
let scalars = phonemes.unicodeScalars.count
totalScalars += scalars
totalIds += ids.count
let phraseDropCount = dropped.values.reduce(0, +)
totalDropped += phraseDropCount
for (k, v) in dropped { dropHist[k, default: 0] += v }
phraseCount += 1
if corpusPath == nil {
print("INPUT : \(phrase)")
print("PHONEMES : \(phonemes)")
print("TOKEN_IDS (\(ids.count)): \(ids)")
let formatted =
dropped
.sorted { $0.value > $1.value }
.map {
"U+\(String($0.key.value, radix: 16, uppercase: true))"
+ " '\($0.key)' ×\($0.value)"
}
.joined(separator: ", ")
print(
"DROPPED (\(phraseDropCount) of \(scalars) scalars):"
+ " \(formatted)")
}
}
if let corpusPath {
let url = expand(corpusPath)
let raw = try String(contentsOf: url, encoding: .utf8)
let phrases = raw.split(separator: "\n", omittingEmptySubsequences: true)
.map { $0.trimmingCharacters(in: .whitespaces) }
.filter { !$0.isEmpty && !$0.hasPrefix("#") }
for (idx, phrase) in phrases.enumerated() {
do {
try await process(phrase)
let dropPct =
Double(totalDropped) / Double(max(totalScalars, 1)) * 100
if (idx + 1) % 10 == 0 || idx + 1 == phrases.count {
fputs(
" [\(idx + 1)/\(phrases.count)] running drop rate "
+ "\(String(format: "%.2f", dropPct))%\n",
stderr)
}
} catch {
fputs(" [\(idx + 1)] phrase failed: \(error)\n", stderr)
}
}
} else if let text {
try await process(text)
} else {
fputs("--tokenize-only requires either text or --corpus\n", stderr)
exit(2)
}
let dropPct = Double(totalDropped) / Double(max(totalScalars, 1)) * 100
let kept = totalScalars - totalDropped
print("")
print("=== StyleTTS2 vocab coverage ===")
print("phrases : \(phraseCount)")
print("phoneme scalars total : \(totalScalars)")
print("encoded token ids : \(totalIds) (== kept scalars: \(kept))")
print(
"dropped scalars : \(totalDropped) "
+ "(\(String(format: "%.2f", dropPct))%)")
print("distinct dropped chars : \(dropHist.count)")
if !dropHist.isEmpty {
print("")
print("dropped histogram (most → least frequent):")
for (scalar, count) in dropHist.sorted(by: { $0.value > $1.value }) {
let hex = String(scalar.value, radix: 16, uppercase: true)
print(
" \(String(format: "%6d", count)) U+\(hex) '\(scalar)'")
}
}
} catch {
fputs("StyleTTS2 tokenize-only failed: \(error)\n", stderr)
exit(1)
}
}
private static func expand(_ path: String) -> URL {
let exp = (path as NSString).expandingTildeInPath
if exp.hasPrefix("/") {
@@ -152,12 +268,15 @@ public enum StyleTTS2Command {
fluidaudio styletts2 "<text>" --voice <ref_s.bin> [options]
Options:
--voice <path> Required for synthesis. Path to precomputed ref_s.bin (256 fp32 LE).
--output <path> Output WAV path (default: styletts2.wav).
--steps <int> ADPM2 sampler steps (default: 5).
--alpha <float> Acoustic style mix weight (default: 0.3).
--beta <float> Prosody style mix weight (default: 0.7).
--seed <uint> Deterministic noise seed (default: system RNG).
--tokenize-only Run G2P + vocab encode only; report dropped scalars.
No --voice needed. Use with text or --corpus.
--corpus <path> Phrase-per-line corpus file (with --tokenize-only).
Example:
fluidaudio styletts2 "Hello world" \\
@@ -414,22 +414,17 @@ public struct TTS {
)
return
}
if #available(macOS 15, iOS 18, *) {
await CosyVoice3TextCLI.run(
text: inputText,
modelsDir: modelsDir,
tokenizerDir: tokDir,
embeddingsFile: embFile,
specialTokensFile: specFile,
promptAssetsPath: promptAssets,
outputPath: output,
seed: cv3Seed,
maxNewTokens: cv3MaxNewTokens,
cpuOnly: cv3CpuOnly)
} else {
logger.error(
"CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).")
}
await CosyVoice3TextCLI.run(
text: inputText,
modelsDir: modelsDir,
tokenizerDir: tokDir,
embeddingsFile: embFile,
specialTokensFile: specFile,
promptAssetsPath: promptAssets,
outputPath: output,
seed: cv3Seed,
maxNewTokens: cv3MaxNewTokens,
cpuOnly: cv3CpuOnly)
return
}
@@ -440,19 +435,14 @@ public struct TTS {
)
return
}
if #available(macOS 15, iOS 18, *) {
await CosyVoice3ParityCLI.run(
fixturePath: fixture,
modelsDir: modelsDir,
referencePath: cv3ReferencePath,
outputPath: output,
seed: cv3Seed,
cpuOnly: cv3CpuOnly,
replayTokens: cv3ReplayTokens)
} else {
logger.error(
"CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).")
}
await CosyVoice3ParityCLI.run(
fixturePath: fixture,
modelsDir: modelsDir,
referencePath: cv3ReferencePath,
outputPath: output,
seed: cv3Seed,
cpuOnly: cv3CpuOnly,
replayTokens: cv3ReplayTokens)
return
}
File diff suppressed because it is too large
@@ -50,6 +50,10 @@ struct FluidAudioCLI {
await MagpieCommand.run(arguments: Array(arguments.dropFirst(2)))
case "tts-asr-verify":
await TTSAsrVerifyCommand.run(arguments: Array(arguments.dropFirst(2)))
case "tts-benchmark":
await TtsBenchmarkCommand.run(arguments: Array(arguments.dropFirst(2)))
case "minimax-corpus":
await MinimaxCorpusCommand.run(arguments: Array(arguments.dropFirst(2)))
case "diarization-benchmark":
await StreamDiarizationBenchmark.run(arguments: Array(arguments.dropFirst(2)))
case "process":
@@ -116,6 +120,8 @@ struct FluidAudioCLI {
tts Synthesize speech from text using Kokoro TTS
magpie Magpie TTS Multilingual 357M (experimental, ~0.04 RTFx, slow, needs perf work)
tts-asr-verify Batch TTS→ASR roundtrip WER verification
tts-benchmark Quantitative TTS benchmark (latency, quality, compute-unit sweep)
minimax-corpus Fetch MiniMax TTS Multilingual Test Set into Benchmarks/tts/corpus/minimax
parakeet-eou Run Parakeet EOU Streaming ASR on a single file
ctc-earnings-benchmark Run CTC keyword spotting benchmark on Earnings22
sortformer Run Sortformer streaming diarization
@@ -0,0 +1,60 @@
import XCTest
@testable import FluidAudio
/// Guard the stateful → stateless decode rename. The HF repo
/// `FluidInference/CosyVoice3-0.5B-coreml` ships only `LLM-Decode-M768-fp16`
/// (non-stateful, external KV cache); resurrecting `-stateful` here would
/// re-break the download path and regress macOS 14 support.
final class CosyVoice3ModelNameTests: XCTestCase {
// MARK: - ModelNames.CosyVoice3
func testLlmDecodeIsStatelessName() {
XCTAssertEqual(ModelNames.CosyVoice3.llmDecode, "LLM-Decode-M768-fp16")
XCTAssertFalse(
ModelNames.CosyVoice3.llmDecode.contains("stateful"),
"llmDecode must not reference the dropped stateful variant")
}
func testLlmDecodeFileMatchesBaseName() {
XCTAssertEqual(
ModelNames.CosyVoice3.llmDecodeFile,
"LLM-Decode-M768-fp16.mlmodelc")
}
func testRequiredModelsContainsStatelessDecode() {
XCTAssertTrue(
ModelNames.CosyVoice3.requiredModels.contains("LLM-Decode-M768-fp16.mlmodelc"),
"requiredModels must list the stateless decode bundle")
XCTAssertFalse(
ModelNames.CosyVoice3.requiredModels.contains(
"LLM-Decode-M768-fp16-stateful.mlmodelc"),
"requiredModels must not list the dropped stateful bundle")
}
func testRequiredModelsHasFourEntries() {
XCTAssertEqual(
ModelNames.CosyVoice3.requiredModels.count, 4,
"Pipeline ships exactly 4 CoreML bundles: prefill, decode, flow, hift")
}
// MARK: - CosyVoice3Constants.Files
func testFilesLlmDecodeIsStatelessPackage() {
XCTAssertEqual(
CosyVoice3Constants.Files.llmDecode,
"LLM-Decode-M768-fp16.mlpackage")
XCTAssertFalse(
CosyVoice3Constants.Files.llmDecode.contains("stateful"))
}
func testFilesLlmDecodeSubdirIsRenamed() {
XCTAssertEqual(
CosyVoice3Constants.Files.llmDecodeSubdir,
"llm-fp16-decode",
"Local-build subdir must be the renamed stateless directory")
XCTAssertFalse(
CosyVoice3Constants.Files.llmDecodeSubdir.contains("stateful"))
}
}
@@ -0,0 +1,184 @@
import XCTest
@testable import FluidAudio
final class CosyVoice3TextChunkerTests: XCTestCase {
// MARK: - estimateSpeechTokens
func testEstimateSpeechTokensCJK() {
// 4 CJK chars × 7.5 = 30 tokens
XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("你好世界"), 30)
}
func testEstimateSpeechTokensASCII() {
// 5 ASCII chars × 1.5 = 7.5, which rounds to 8
XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("hello"), 8)
}
func testEstimateSpeechTokensEmpty() {
XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens(""), 0)
}
// MARK: - chunk: short input fast path
func testChunkEmptyReturnsEmpty() {
XCTAssertEqual(CosyVoice3TextChunker.chunk(""), [])
XCTAssertEqual(CosyVoice3TextChunker.chunk(" "), [])
XCTAssertEqual(CosyVoice3TextChunker.chunk("\n\n"), [])
}
func testChunkShortReturnsSingle() {
// 5 chars (4 CJK + 。), about 33 tokens, well under default 110
XCTAssertEqual(
CosyVoice3TextChunker.chunk("你好世界。"),
["你好世界。"])
}
func testChunkShortTrimsWhitespace() {
XCTAssertEqual(
CosyVoice3TextChunker.chunk(" hello world. "),
["hello world."])
}
// MARK: - chunk: hard sentence enders
func testChunkSplitsOnHardEnders() {
// 25 CJK chars × 7.5 = 187.5 tokens > 110 default, so it must split
let text = "今天天气很好。我们去公园散步。明天可能会下雨。下周打算去看电影。"
let chunks = CosyVoice3TextChunker.chunk(text)
XCTAssertGreaterThan(chunks.count, 1)
// No chunk should exceed budget by more than the soft margin
for chunk in chunks {
let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
XCTAssertLessThanOrEqual(est, 110 + 30 + 8, "chunk over force-split margin: \(chunk)")
}
// Concatenating chunks back should reconstruct the input modulo
// whitespace trimming.
XCTAssertEqual(chunks.joined(), text)
}
func testChunkSplitsOnEnglishSentenceEnders() {
// Each sentence is roughly 25-30 tokens; with maxSpeechTokens=80 every
// sentence fits individually so the chunker should commit on the
// first hard ender it sees rather than packing greedily across
// sentences and hitting force-split.
let text = "Hello world. This is a test. Pack my box with five jugs. Quick brown fox jumps."
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 80)
XCTAssertGreaterThan(chunks.count, 1)
for chunk in chunks {
XCTAssertTrue(
chunk.hasSuffix(".") || chunk.hasSuffix("!") || chunk.hasSuffix("?"),
"chunk does not end at hard boundary: \(chunk)")
}
}
// MARK: - chunk: soft enders fall-through
func testChunkFallsBackToSoftEnders() {
// One huge sentence with commas, no periods. Should split on `,`.
let text = "一个非常非常长的句子,里面有很多分句,每个分句都不是很长,但是加在一起就会超过预算限制"
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50)
XCTAssertGreaterThan(chunks.count, 1)
for chunk in chunks {
let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
// Force-split allows one CJK char of overshoot past the +30 margin
// because the budget check runs AFTER appending the current char.
XCTAssertLessThanOrEqual(est, 50 + 30 + 8)
}
}
// MARK: - chunk: force-split fallback
func testChunkForceSplitsOnContinuousCJKWithoutPunctuation() {
// 31 CJK chars, no punctuation: 31 × 7.5 = 232.5 tokens, must force-split
// somewhere even without natural boundaries.
let text = "今天天气很好我们去公园散步明天可能会下雨下周打算看电影然后回家"
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50)
XCTAssertGreaterThan(chunks.count, 1)
for chunk in chunks {
let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
// Force-split has a 30-token overshoot allowance + one CJK char (7.5)
XCTAssertLessThanOrEqual(est, 50 + 30 + 8, "chunk overflow on force-split: \(chunk)")
}
// No content lost
XCTAssertEqual(chunks.joined(), text)
}
func testChunkForceSplitsOnEnglishSpacesWhenNoPunctuation() {
// Long English with no terminal punctuation; should split on spaces
// when the running estimate exceeds budget.
let text = "the quick brown fox jumps over the lazy dog and then runs back home very fast"
let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 20)
XCTAssertGreaterThan(chunks.count, 1)
for chunk in chunks {
// No leading/trailing whitespace expected on returned chunks
XCTAssertEqual(chunk, chunk.trimmingCharacters(in: .whitespaces))
}
}
// MARK: - concatWithCrossfade
func testConcatEmptyReturnsEmpty() {
let out = CosyVoice3TtsManager.concatWithCrossfade(
[], sampleRate: 24_000, fadeMs: 8)
XCTAssertEqual(out, [])
}
func testConcatSingleChunkPassthrough() {
let chunk: [Float] = [0.1, 0.2, 0.3, 0.4]
let out = CosyVoice3TtsManager.concatWithCrossfade(
[chunk], sampleRate: 24_000, fadeMs: 8)
XCTAssertEqual(out, chunk)
}
func testConcatZeroFadeIsSimpleAppend() {
let a: [Float] = [0.1, 0.2, 0.3]
let b: [Float] = [0.4, 0.5, 0.6]
let out = CosyVoice3TtsManager.concatWithCrossfade(
[a, b], sampleRate: 24_000, fadeMs: 0)
XCTAssertEqual(out, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
}
func testConcatCrossfadeShrinksGracefullyForShortChunks() {
// 4-sample chunks; nominal fade at 24 kHz × 8 ms = 192 samples,
// gets clamped to min(out.count/2, next.count/2) = 2.
let a: [Float] = [1.0, 1.0, 1.0, 1.0]
let b: [Float] = [0.0, 0.0, 0.0, 0.0]
let out = CosyVoice3TtsManager.concatWithCrossfade(
[a, b], sampleRate: 24_000, fadeMs: 8)
// Output length: 4 (a) - 2 (fade) + 4 (b) = 6; first 2 of a remain
// pristine, then a 2-sample crossfade region, then last 2 of b
XCTAssertEqual(out.count, 6)
XCTAssertEqual(out[0], 1.0)
XCTAssertEqual(out[1], 1.0)
// Crossfade region: a's 1.0 fades out to 0 while b's 0.0 fades in from 0.
// At j=0: down=1, up=0 → 1.0 * 1 + 0.0 * 0 = 1.0
// At j=1: down=0.5, up=0.5 → 1.0 * 0.5 + 0.0 * 0.5 = 0.5
XCTAssertEqual(out[2], 1.0, accuracy: 1e-5)
XCTAssertEqual(out[3], 0.5, accuracy: 1e-5)
XCTAssertEqual(out[4], 0.0, accuracy: 1e-5)
XCTAssertEqual(out[5], 0.0, accuracy: 1e-5)
}
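// Illustrative sketch only, not part of the shipped API: a reference linear
// crossfade that reproduces the values asserted in the test above. The helper
// name and the linear ramp are assumptions made for this example; the
// production `concatWithCrossfade` may use a different curve (for instance a
// cosine ramp) that happens to agree at these sample points.
private func referenceLinearCrossfade(_ a: [Float], _ b: [Float], fade: Int) -> [Float] {
    // Clamp the fade so it never exceeds half of either chunk.
    let n = max(0, min(fade, a.count / 2, b.count / 2))
    // Everything in `a` before the overlap region is copied unchanged.
    var out = Array(a.prefix(a.count - n))
    for j in 0..<n {
        let up = Float(j) / Float(n)  // weight of `b`, rises from 0 toward 1
        let down = 1 - up  // weight of `a`, falls from 1 toward 0
        out.append(a[a.count - n + j] * down + b[j] * up)
    }
    // Everything in `b` after the overlap region is copied unchanged.
    out.append(contentsOf: b.suffix(b.count - n))
    return out
}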
func testConcatCrossfadePreservesPrefixAndSuffix() {
// Long enough chunks for a full fade window
let sampleRate = 24_000
let fadeMs = 4.0 // 96 samples
let a = [Float](repeating: 1.0, count: 480)
let b = [Float](repeating: 0.0, count: 480)
let out = CosyVoice3TtsManager.concatWithCrossfade(
[a, b], sampleRate: sampleRate, fadeMs: fadeMs)
let fade = Int((Double(sampleRate) * fadeMs / 1000).rounded())
// Output length: a.count - fade + b.count
XCTAssertEqual(out.count, a.count - fade + b.count)
// Prefix of `a` (before crossfade region) untouched
for j in 0..<(a.count - fade) {
XCTAssertEqual(out[j], 1.0)
}
// Suffix of `b` (after crossfade region) untouched
for j in a.count..<out.count {
XCTAssertEqual(out[j], 0.0)
}
}
}
@@ -53,4 +53,84 @@ final class MagpieKvCacheTests: XCTestCase {
XCTAssertEqual(
MagpieKvCache.positionOutputKeys.count, MagpieConstants.numDecoderLayers)
}
/// Drives the slow-path fallback used by `MagpieSynthesizer.runDecoderStep`
/// when CoreML rejects `outputBackings`. Builds a synthetic feature
/// provider that mirrors the `decoder_step.mlmodelc` output schema, hands
/// it to `absorbOutputs`, and verifies the cache front pointers + position
/// were replaced (i.e. the fallback can take over without `swapBackings`).
func testAbsorbOutputsReplacesFrontPointers() throws {
let numLayers = 3
let maxCacheLength = 16
let numHeads = 2
let headDim = 4
let cache = try MagpieKvCache(
numLayers: numLayers, maxCacheLength: maxCacheLength,
numHeads: numHeads, headDim: headDim)
let preK = (0..<numLayers).map { ObjectIdentifier(cache.cachesK[$0]) }
let preV = (0..<numLayers).map { ObjectIdentifier(cache.cachesV[$0]) }
let prePos = (0..<numLayers).map { ObjectIdentifier(cache.positions[$0]) }
let cacheShape: [NSNumber] = [
1,
NSNumber(value: maxCacheLength),
NSNumber(value: numHeads),
NSNumber(value: headDim),
]
var features: [String: MLFeatureValue] = [:]
for i in 0..<numLayers {
let kArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
kArr.zeroFillFloat16()
let vArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
vArr.zeroFillFloat16()
let posArr = try MLMultiArray(shape: [1], dataType: .float16)
posArr.zeroFillFloat16()
posArr[0] = NSNumber(value: Float(i + 1))
features[MagpieKvCache.cacheKOutputKeys[i]] = MLFeatureValue(multiArray: kArr)
features[MagpieKvCache.cacheVOutputKeys[i]] = MLFeatureValue(multiArray: vArr)
features[MagpieKvCache.positionOutputKeys[i]] = MLFeatureValue(multiArray: posArr)
}
let provider = try MLDictionaryFeatureProvider(dictionary: features)
try cache.absorbOutputs(provider)
for i in 0..<numLayers {
XCTAssertNotEqual(
ObjectIdentifier(cache.cachesK[i]), preK[i],
"absorbOutputs must replace cachesK[\(i)] front pointer")
XCTAssertNotEqual(
ObjectIdentifier(cache.cachesV[i]), preV[i],
"absorbOutputs must replace cachesV[\(i)] front pointer")
XCTAssertNotEqual(
ObjectIdentifier(cache.positions[i]), prePos[i],
"absorbOutputs must replace positions[\(i)] front pointer")
}
// positions[0] = 1 → cache.position reads the layer-0 scalar.
XCTAssertEqual(cache.position, 1)
}
func testAbsorbOutputsThrowsWhenCacheKOutputMissing() throws {
let cache = try MagpieKvCache(
numLayers: 2, maxCacheLength: 8, numHeads: 1, headDim: 2)
// Provide a feature provider with the wrong key for cache_k_0 so the
// first lookup fails. This guards the error message users will see
// when the fallback path is actually exercised.
let bogus = try MLMultiArray(shape: [1, 8, 1, 2], dataType: .float16)
bogus.zeroFillFloat16()
let provider = try MLDictionaryFeatureProvider(dictionary: [
"wrong_key": MLFeatureValue(multiArray: bogus)
])
XCTAssertThrowsError(try cache.absorbOutputs(provider)) { error in
guard case MagpieError.inferenceFailed(_, let underlying) = error else {
XCTFail("expected MagpieError.inferenceFailed, got \(error)")
return
}
XCTAssertTrue(
underlying.contains("missing K cache output key"),
"underlying should mention the missing K key, got: \(underlying)")
}
}
}
@@ -0,0 +1,114 @@
@preconcurrency import CoreML
import XCTest
@testable import FluidAudio
final class TtsComputeUnitPresetTests: XCTestCase {
// MARK: - init?(cliValue:)
func testCliValueParsing_canonicalKebabCase() {
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "default"), .default)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "all-ane"), .allAne)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu-and-gpu"), .cpuAndGpu)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu-only"), .cpuOnly)
}
func testCliValueParsing_aliases() {
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "ane"), .allAne)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "neural-engine"), .allAne)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpuandgpu"), .cpuAndGpu)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "gpu"), .cpuAndGpu)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu"), .cpuOnly)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpuonly"), .cpuOnly)
}
func testCliValueParsing_caseInsensitive() {
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "DEFAULT"), .default)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "All-Ane"), .allAne)
XCTAssertEqual(TtsComputeUnitPreset(cliValue: "CPU-AND-GPU"), .cpuAndGpu)
}
func testCliValueParsing_unknownReturnsNil() {
XCTAssertNil(TtsComputeUnitPreset(cliValue: ""))
XCTAssertNil(TtsComputeUnitPreset(cliValue: "fastest"))
XCTAssertNil(TtsComputeUnitPreset(cliValue: "all_ane")) // underscore rejected
XCTAssertNil(TtsComputeUnitPreset(cliValue: "ane-only"))
XCTAssertNil(TtsComputeUnitPreset(cliValue: "neuralengine"))
}
// MARK: - cliValue (round-trip)
func testCliValueRoundTrip() {
for preset in TtsComputeUnitPreset.allCases {
let canonical = preset.cliValue
XCTAssertEqual(
TtsComputeUnitPreset(cliValue: canonical), preset,
"cliValue '\(canonical)' must round-trip back to \(preset)")
}
}
func testCliValueIsKebabCase() {
XCTAssertEqual(TtsComputeUnitPreset.default.cliValue, "default")
XCTAssertEqual(TtsComputeUnitPreset.allAne.cliValue, "all-ane")
XCTAssertEqual(TtsComputeUnitPreset.cpuAndGpu.cliValue, "cpu-and-gpu")
XCTAssertEqual(TtsComputeUnitPreset.cpuOnly.cliValue, "cpu-only")
}
// MARK: - uniformUnits
func testUniformUnits_defaultIsNil() {
XCTAssertNil(TtsComputeUnitPreset.default.uniformUnits)
}
func testUniformUnits_concretePresets() {
XCTAssertEqual(TtsComputeUnitPreset.allAne.uniformUnits, .cpuAndNeuralEngine)
XCTAssertEqual(TtsComputeUnitPreset.cpuAndGpu.uniformUnits, .cpuAndGPU)
XCTAssertEqual(TtsComputeUnitPreset.cpuOnly.uniformUnits, .cpuOnly)
}
// MARK: - KokoroAneComputeUnits(preset:)
func testKokoroAnePreset_defaultMatchesStaticDefault() {
XCTAssertEqual(KokoroAneComputeUnits(preset: .default), .default)
}
func testKokoroAnePreset_allAneMatchesStatic() {
XCTAssertEqual(KokoroAneComputeUnits(preset: .allAne), .allAne)
}
func testKokoroAnePreset_cpuAndGpuMatchesStatic() {
XCTAssertEqual(KokoroAneComputeUnits(preset: .cpuAndGpu), .cpuAndGpu)
}
func testKokoroAnePreset_cpuOnlyMatchesStatic() {
XCTAssertEqual(KokoroAneComputeUnits(preset: .cpuOnly), .cpuOnly)
}
func testKokoroAnePreset_allAneForcesEveryStageToANE() {
let cu = KokoroAneComputeUnits(preset: .allAne)
for stage in KokoroAneStage.allCases {
XCTAssertEqual(
cu.units(for: stage), .cpuAndNeuralEngine,
"stage \(stage) should be .cpuAndNeuralEngine under .allAne")
}
}
func testKokoroAnePreset_cpuOnlyForcesEveryStageToCPU() {
let cu = KokoroAneComputeUnits(preset: .cpuOnly)
for stage in KokoroAneStage.allCases {
XCTAssertEqual(
cu.units(for: stage), .cpuOnly,
"stage \(stage) should be .cpuOnly under .cpuOnly")
}
}
func testKokoroAnePreset_cpuAndGpuForcesEveryStageToCPUAndGPU() {
let cu = KokoroAneComputeUnits(preset: .cpuAndGpu)
for stage in KokoroAneStage.allCases {
XCTAssertEqual(
cu.units(for: stage), .cpuAndGPU,
"stage \(stage) should be .cpuAndGPU under .cpuAndGpu")
}
}
}