Files
FluidAudio/Tests/FluidAudioTests/TTS/Magpie/MagpieKvCacheTests.swift
T
Alex 7603ac6733 feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557)
## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring
**latency × efficiency × quality** across every shipping TTS backend in
FluidAudio, plus the model + runtime fixes needed to actually clear all
six backends end-to-end on the [MiniMax Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set).
Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs
level so users get a runtime warning on `initialize()` reflecting their
actual perf / quality posture.

### Backends — all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER |
Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) |  | 3.5 s / 8.0 s / 11.4 s | 5.19×
| 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) |  | 3.5 s / 6.8 s / 9.3 s | 2.02× |
1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest
English ASR roundtrip |
| PocketTTS | minimax-en (100/100) |  | 2.8 s / 6.3 s / 9.4 s | 0.61× |
1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms — RTFx looks
slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s
| 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s
p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast
path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6
s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak
post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning
on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s /
**16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases
now split + crossfaded (8 ms cosine) — longest output 16.0 s (was capped
at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx);
**whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100
phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s /
**16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 →
5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s →
16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a
`logger.warning` flagging the perf / quality posture; safe to ship in
non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator`
whitespace-tokenizes and Mandarin has no word boundaries (word-level WER
reads ~100% and is meaningless). CER is `whisper-large-v3` against the
rendered WAVs from the full 100-phrase `minimax-chinese` run via
`Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this
PR via `--asr-backend cohere` (see [Cohere ASR backend in the
harness](#cohere-asr-backend-in-the-harness) below) and agrees with
whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a
`MILCompilerForANE` cache failure on this M2 host that drops it to RTFx
~0.13×, so whisper is the practical source-of-truth for the full
100-phrase run.

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category)
live in `Documentation/TTS/Benchmarks.md`. Corpus attribution +
reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat — phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly
comparable between backends when both produce similar phrase lengths and
yield audio at the same granularity**. Two things skew the headline
number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per
`minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3×
more audio out. That's mostly long inter-word pauses + slow speaking
rate baked into the LibriTTS multi-speaker checkpoint, not a measurement
artifact. A 2.72× RTFx on 22 s audio = ~8 s wall — which matches the
TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus
RTFx ratios alone hide this.

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs.
Kokoro's 2.02× but it's **not slower from a user perspective**:
PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**,
Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The
0.61× is the per-frame cost averaged across the streaming run; what
users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |

|-------------|----------|------------------|--------------------------------------------|
| PocketTTS | 1244 ms | 80 ms frame | true streaming;
conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio;
ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte
|
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase
output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via
`synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s — 36% earlier
playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz |
one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk
only |

For conversational use cases, **TTFT > RTFx**. PocketTTS (true
streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE
(small one-shot chunks) are the three backends that meaningfully clear
the "user feels it's responsive" bar today.

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should
weigh against an explicit caveat:

- **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
The misaki→espeak post-pass remap closed half the gap; the remainder is
BART G2P misses + diffusion-sampler formant breaks on long phrases.
- **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via
`synthesizeStream` so TTFT (9.6 s p50) is significantly better than
full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to
~30 s.
- **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the
longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token
Flow input cap is now worked around at the call site by the auto-chunker
(long phrases split + crossfaded), dropping cantonese truncation from
80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100
residual is the long-tail token-rate worst case; the structural fix is
re-exporting Flow with a larger fixed input shape (tracked in
`mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` +
a `.warning`-level `LLM-Decode budget exhausted` log still surface any
truncation, and the harness writes `finished_on_eos` into each phrase in
the JSON report.

Each manager now logs a `.warning`-level beta notice on `initialize()`
(mirroring the existing CosyVoice3 pattern) so anyone wiring these into
a product gets a console signal, not a silent surprise. Docs
(`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md`
StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same
caveat at the top.

### Model + runtime fixes landed in this PR

#### CosyVoice3 stateless port (`71130c9fb`)
Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the
non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on
HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain
`MLDictionaryFeatureProvider` prediction with explicit kv carry-forward;
lowers the availability gate from macOS 15 / iOS 18 back to the package
baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the
rename.

#### CosyVoice3 HiFT timeout fix (`267766b62`)
`minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async
failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has
timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`,
which let the planner place most of the graph on ANE but kept at least
one op on the BNNS CPU async-dispatch path; long phrases tripped the
BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of
user-supplied compute-units, removing the BNNS path entirely. Verified
on 100/100 zh + 100/100 yue.

#### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`)
The autoregressive decode loop runs ~163 steps per phrase to fill the
250-token cap. Each step takes the previous step's KV cache as `kv_k` /
`kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh
`kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side
`MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV
back-buffers + a logits backing, rotates front/back/spare across steps
via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc
on first rejection (one-shot `logger.warning`). Mirrors the Magpie
pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357
(+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470
MB.

#### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`)
The 250-token Flow input cap means a single synth pass produces at most
~6.5 s of audio regardless of input length. Re-exporting Flow with a
larger fixed input shape is gated on upstream conversion work, so this
PR works around it at the call site: long inputs are split at
sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized
independently, and merged with an 8 ms equal-power cosine crossfade.

**Splitter policy**: hard enders (`. ! ? 。 ! ? \n`) commit always; soft
enders (`, 、 ; : ; ,` + ASCII space) commit only at-or-past budget;
force-split at +30 token overshoot if no natural boundary exists.
`defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap
minus a typical 60–90-token speech-prompt context). Token-rate heuristic
is calibrated against minimax-zh + minimax-yue runs:

| Char class | Tokens / char | Rationale |

|------------|---------------|--------------------------------------------------------------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per
char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |

**Validation** on full `minimax-cantonese` (100 phrases, M2):

| Metric | Pre-chunker | Post-chunker | Δ |

|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |

The TTFT regression is the cost of running multiple synth passes per
long phrase — splitting unblocks long-form output at the price of
wall-clock latency. The 5/100 residual truncation is the long-tail
token-rate worst case (some chars hit ~9 tokens/char); raising the
per-CJK heuristic further would over-fragment short phrases. Cleaner fix
is the Flow re-export.

16-test suite covers tokenization estimates, hard/soft/force-split
policy, and the crossfade arithmetic. Lives in
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift`
+ `CosyVoice3TtsManager.concatWithCrossfade`.

#### Magpie streaming TTFT wire-up (`ace0bf485`)
`TtsBenchmarkCommand.swift` now drives Magpie through
`MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first
`MagpieAudioChunk` emit instead of conflating it with full-synth wall
time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6
s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier
than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run
benefit; fundamentals unchanged).

#### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`)
`text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT:
tensor_buffer has known strides while the model has FlexibleShapeInfo`.
The CoreML runtime rejects two access patterns on outputs from a
flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element
subscripts — and the original `sliceFirstAxis2D` helper used both. Fix
rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling
`.float32`, `.float16`, `.double`) and computes the flat index from the
known `(1, leading, trailing)` row-major layout. Verified on full
100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS
demo voice.

#### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`)
After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still
landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than
Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only
--corpus` mode and disproved the silent-vocab-drop hypothesis: only
**0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII
hyphens / 12247 scalars).

Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2
share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2
LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized**
LibriTTS — predating misaki by years. The 178-vocab accepts both forms
(e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic
embeddings for the misaki ligature glyphs are essentially untrained
noise.

Side-by-side comparison against locally-installed `espeak-ng -v en-us
--ipa -q` flagged four systematic divergences:

| misaki | espeak-ng | example                  |
|--------|-----------|--------------------------|
| `ʧ`    | `tʃ`      | choice → `tʃˈɔɪs`        |
| `ʤ`    | `dʒ`      | jump   → `dʒˈʌmps`       |
| `ɜɹ`   | `ɝ`       | girl   → `ɡˈɝl`          |
| `əɹ`   | `ɚ`       | over   → `ˈoʊvɚ`         |

Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish` and applied to the assembled phoneme string after
every word has been emitted by the BART G2P. Lives alongside the
existing per-piece misaki diphthong remap. Result on the same 100-phrase
MiniMax-English run with the same `libritts_696` voice and same Parakeet
TDT roundtrip:

| Metric          | Pre   | Post  | Δ      |
|-----------------|-------|-------|--------|
| Macro WER       | 0.581 | 0.440 | −24.2% |
| Macro CER       | 0.476 | 0.241 | −49.5% |
| TTFT p50 (ms)   | 8937  | 6671  | −25.4% |
| Agg RTFx        | 2.36× | 2.72× | +15.3% |
| Peak RSS (MB)   | 1428  | 963   | −32.6% |

Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice.
Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster
on word-level G2P misses from the BART itself (`practical →
practicckles`, `separation → expiration`) and diffusion-sampler formant
breaks; closing the rest of the gap to Kokoro likely needs richer espeak
coverage or libespeak-ng vendor — tracked separately.

#### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`)
`StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit
`logger.warning` beta notices mirroring the existing
`CosyVoice3TtsManager.initialize` pattern. Backends docs (`Magpie.md`
Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️
Beta / experimental` callouts so the perf / quality posture is visible
at every entry point — runtime, manager docstring, doc top, PR body.

#### Magpie `outputBackings` rejection fallback (`72dae8400` +
`9767e1ef9`)
The shipped `decoder_step.mlmodelc` reaches the user before the rebuild
lands, so CoreML can reject our `outputBackings` dictionary on a
name-mismatch. Latched fallback path falls back to a fresh-alloc decode
so the model still runs; first rejection latches the flag for the rest
of the run.

### Cohere ASR backend in the harness (`8e741e659`)

Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER
through the harness against [Cohere
Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into
`--skip-asr`. Four new flags on `tts-benchmark`:

- `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip
engine. Default is `parakeet` for English-only runs and skipped for
CosyVoice3.
- `--cohere-model-dir <path>` — path to a directory containing
`cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
and `vocab.json`.
- `--asr-language <code>` — overrides the inferred language code (covers
all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh,
ko, vi).
- `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins
`MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu`
when the q8 encoder fails ANE compilation (`MILCompilerForANE error:
failed to compile ANE model using ANEF`) to skip the multi-minute
fallback compile on the first call. The harness logs a WER caveat for
zh/ja runs flagging that whitespace-tokenized WER is meaningless and the
CER column is the real signal.

Example end-to-end:
```bash
fluidaudio tts-benchmark \
    --backend cosyvoice3 \
    --corpus minimax-chinese \
    --asr-backend cohere \
    --cohere-model-dir /path/to/cohere/q8 \
    --asr-language zh \
    --output-json benchmark_results/cv3-zh-cohere.json \
    --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure
(`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently
falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per
`Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is
unaffected (same graph, same output), only latency. The full 100-phrase
CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was
therefore produced via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100
phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5%
CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` /
`prose-zh` shipped with the first cut of this PR with the [MiniMax
Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
(CC-BY-SA-4.0; 100 phrases × 25 languages). Same public corpus used by
[MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and
Gradium — numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in
`Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an
on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them
from the upstream HF dataset at the pinned revision and writes them to
the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth
(HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no
hardcoded asset URLs. The `.txt` files now live in `.gitignore` since
they're CC-BY-SA-4.0 derivative content; only
`Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER
caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in
`ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior
`python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend
language scope:

| Backend | Languages benchmarked |
|---|---|
| Kokoro / Kokoro ANE | en (af_heart) |
| PocketTTS | en + de + it + pt + es + fr |
| Magpie | en + es + de + fr + it + vi + zh + hi |
| StyleTTS2 | en (LibriTTS multi-spk) |
| CosyVoice3 | zh + yue |

### PocketTTS streaming TTFT (`c26f1e163`)
PocketTTS now drives the harness through its `synthesizeStreaming` API
so TTFT measures time-to-first-80ms-frame instead of full one-shot
synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage
that one-shot benchmarking previously hid.

### Reference voice dumper helper (mobius-styletts2)
`mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo)
wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to
dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize`
consumes via `--voice`. Required because the shipped CoreML bundle
doesn't include those upstream-only PyTorch encoders.

## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x`
produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en
- [x] StyleTTS2 — full 100/100 minimax-en (verified after
`sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue
(verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error,
LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on
`initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`;
cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`,
`TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated — confirmed structural
(250-token Flow input shape, 40 ms / token); surfaced via
`finishedOnEos` + warning log + JSON `finished_on_eos` field. See
[Decode budget
cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site
workaround. Validated on full minimax-cantonese: truncation **80/100 →
5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×.
16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3
auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in
`TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50
**9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier
playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` +
`--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`).
Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8
macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this
M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04%
— both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via
`whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over
all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**.
Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote
‡)
2026-05-01 09:09:42 -04:00

137 lines
6.0 KiB
Swift
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
import CoreML
import XCTest
@testable import FluidAudio
final class MagpieKvCacheTests: XCTestCase {
func testInitialShapeAndZeroPosition() throws {
let cache = try MagpieKvCache(
numLayers: MagpieConstants.numDecoderLayers,
maxCacheLength: MagpieConstants.maxCacheLength,
numHeads: MagpieConstants.numHeads,
headDim: MagpieConstants.headDim)
XCTAssertEqual(cache.cachesK.count, MagpieConstants.numDecoderLayers)
XCTAssertEqual(cache.cachesV.count, MagpieConstants.numDecoderLayers)
XCTAssertEqual(cache.positions.count, MagpieConstants.numDecoderLayers)
XCTAssertEqual(cache.position, 0)
// Rank-4 split-K/V layout: [1, T, H, D] per cache tensor.
let expectedShape: [NSNumber] = [
1,
NSNumber(value: MagpieConstants.maxCacheLength),
NSNumber(value: MagpieConstants.numHeads),
NSNumber(value: MagpieConstants.headDim),
]
XCTAssertEqual(cache.cachesK[0].shape, expectedShape)
XCTAssertEqual(cache.cachesV[0].shape, expectedShape)
XCTAssertEqual(cache.positions[0].shape, [1])
}
func testAddInputsProvidesAllLayerKeys() throws {
let cache = try MagpieKvCache(
numLayers: 3, maxCacheLength: 32, numHeads: 4, headDim: 8)
var inputs: [String: MLMultiArray] = [:]
cache.addInputs(to: &inputs)
// 3 layers × (cache_k, cache_v, position) = 9 entries.
XCTAssertEqual(inputs.count, 9)
for i in 0..<3 {
XCTAssertNotNil(inputs["cache_k\(i)"])
XCTAssertNotNil(inputs["cache_v\(i)"])
XCTAssertNotNil(inputs["position\(i)"])
}
}
func testStaticOutputKeyCountMatchesLayers() {
XCTAssertEqual(
MagpieKvCache.cacheKOutputKeys.count, MagpieConstants.numDecoderLayers,
"cacheKOutputKeys must match numDecoderLayers — regenerate list if the exporter changes.")
XCTAssertEqual(
MagpieKvCache.cacheVOutputKeys.count, MagpieConstants.numDecoderLayers,
"cacheVOutputKeys must match numDecoderLayers — regenerate list if the exporter changes.")
XCTAssertEqual(
MagpieKvCache.positionOutputKeys.count, MagpieConstants.numDecoderLayers)
}
/// Drives the slow-path fallback used by `MagpieSynthesizer.runDecoderStep`
/// when CoreML rejects `outputBackings`. Builds a synthetic feature
/// provider that mirrors the `decoder_step.mlmodelc` output schema, hands
/// it to `absorbOutputs`, and verifies the cache front pointers + position
/// were replaced (i.e. the fallback can take over without `swapBackings`).
func testAbsorbOutputsReplacesFrontPointers() throws {
let numLayers = 3
let maxCacheLength = 16
let numHeads = 2
let headDim = 4
let cache = try MagpieKvCache(
numLayers: numLayers, maxCacheLength: maxCacheLength,
numHeads: numHeads, headDim: headDim)
let preK = (0..<numLayers).map { ObjectIdentifier(cache.cachesK[$0]) }
let preV = (0..<numLayers).map { ObjectIdentifier(cache.cachesV[$0]) }
let prePos = (0..<numLayers).map { ObjectIdentifier(cache.positions[$0]) }
let cacheShape: [NSNumber] = [
1,
NSNumber(value: maxCacheLength),
NSNumber(value: numHeads),
NSNumber(value: headDim),
]
var features: [String: MLFeatureValue] = [:]
for i in 0..<numLayers {
let kArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
kArr.zeroFillFloat16()
let vArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
vArr.zeroFillFloat16()
let posArr = try MLMultiArray(shape: [1], dataType: .float16)
posArr.zeroFillFloat16()
posArr[0] = NSNumber(value: Float(i + 1))
features[MagpieKvCache.cacheKOutputKeys[i]] = MLFeatureValue(multiArray: kArr)
features[MagpieKvCache.cacheVOutputKeys[i]] = MLFeatureValue(multiArray: vArr)
features[MagpieKvCache.positionOutputKeys[i]] = MLFeatureValue(multiArray: posArr)
}
let provider = try MLDictionaryFeatureProvider(dictionary: features)
try cache.absorbOutputs(provider)
for i in 0..<numLayers {
XCTAssertNotEqual(
ObjectIdentifier(cache.cachesK[i]), preK[i],
"absorbOutputs must replace cachesK[\(i)] front pointer")
XCTAssertNotEqual(
ObjectIdentifier(cache.cachesV[i]), preV[i],
"absorbOutputs must replace cachesV[\(i)] front pointer")
XCTAssertNotEqual(
ObjectIdentifier(cache.positions[i]), prePos[i],
"absorbOutputs must replace positions[\(i)] front pointer")
}
// positions[0] = 1 cache.position reads layer-0 scalar.
XCTAssertEqual(cache.position, 1)
}
func testAbsorbOutputsThrowsWhenCacheKOutputMissing() throws {
let cache = try MagpieKvCache(
numLayers: 2, maxCacheLength: 8, numHeads: 1, headDim: 2)
// Provide a feature provider with the wrong key for cache_k_0 so the
// first lookup fails. This guards the error message users will see
// when the fallback path is actually exercised.
let bogus = try MLMultiArray(shape: [1, 8, 1, 2], dataType: .float16)
bogus.zeroFillFloat16()
let provider = try MLDictionaryFeatureProvider(dictionary: [
"wrong_key": MLFeatureValue(multiArray: bogus)
])
XCTAssertThrowsError(try cache.absorbOutputs(provider)) { error in
guard case MagpieError.inferenceFailed(_, let underlying) = error else {
XCTFail("expected MagpieError.inferenceFailed, got \(error)")
return
}
XCTAssertTrue(
underlying.contains("missing K cache output key"),
"underlying should mention the missing K key, got: \(underlying)")
}
}
}