diff --git a/.gitignore b/.gitignore
index 08103c74..e94d441a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,6 +11,7 @@ xcuserdata/
 *.hmap
 
 *.txt
+!Benchmarks/**/*.txt
 
 ## App packaging
 *.ipa
@@ -104,6 +105,10 @@ Resources/
 scripts/
 !Scripts/parakeet_subset_benchmark.sh
 !Scripts/diarizer_subset_benchmark.sh
+
+# MiniMax TTS corpus is CC-BY-SA-4.0 derivative content fetched on demand
+# via `fluidaudio minimax-corpus`; only the README is checked in.
+Benchmarks/tts/corpus/minimax/*.txt
 Documentation/parakeet-tdt/
 docs/parakeet-tdt/
 
diff --git a/Documentation/TTS/Benchmarks.md b/Documentation/TTS/Benchmarks.md
new file mode 100644
index 00000000..1d05f950
--- /dev/null
+++ b/Documentation/TTS/Benchmarks.md
@@ -0,0 +1,343 @@
+# TTS Benchmarks
+
+> **Setup:** MacBook Air M2 (2022), 16 GB, macOS 26, on AC.
+> **Corpus:** [MiniMax Multilingual TTS Test Set][minimax] (100
+> phrases / language, CC-BY-SA-4.0) — the same public corpus used
+> by [MiniMax-Speech][mms], seed-tts-eval, and Gradium, so numbers
+> here are directly paper-comparable.
+> **Status:** Kokoro, Kokoro ANE, PocketTTS, Magpie, StyleTTS2 all
+> complete the English run; CosyVoice3 completes the full Mandarin
+> run.
+>
+> [minimax]: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
+> [mms]: https://arxiv.org/abs/2505.07916
+
+## Why not just RTFx?
+
+RTFx (audio_seconds / synth_seconds) is a useful single number for batch
+synthesis, but for conversational use it hides the things users actually
+feel:
+
+1. **Cold start** — first model load + ANE compile after install or
+   reboot. On Apple Silicon the system's `anecompilerservice` can take
+   tens of seconds on first invocation; subsequent loads finish in ~1 s.
+2. **TTFT (time-to-first-audio)** — for streaming agents the question
+   is "how long until the user hears *something*", not "how long until
+   the whole utterance is rendered". For one-shot backends in this
+   slice `ttft_ms == synth_ms`. **PocketTTS** and **Magpie** are
+   wired through their respective streaming APIs (`synthesizeStreaming`
+   / `synthesizeStream`), so their `ttft_ms` is honest first-frame
+   latency.
+3. **Per-stage compute units** — Kokoro ANE / Magpie are pipelines of
+   6–7 graphs. The ANE is sometimes *slower per call* but more power-efficient.
+   The "right" compute-unit choice differs per stage.
+4. **Memory footprint** — drives whether a backend is mobile-viable.
+5. **Quality** — RTFx alone tells you nothing about whether the model
+   pronounced "Reykjavík" or "$1,234.56" correctly. We measure WER +
+   CER via Parakeet roundtrip on a fixed English corpus; non-English
+   backends run with `--skip-asr` for now.
+
+## Methodology
+
+### Corpus
+
+All shipped corpora come from the **MiniMax Multilingual TTS Test
+Set** (`MiniMaxAI/TTS-Multilingual-Test-Set` on Hugging Face,
+CC-BY-SA-4.0). The fetched files land under
+`Benchmarks/tts/corpus/minimax/<lang>.txt` (24 languages × 100 phrases
+= 2400 phrases) and are gitignored — populate them on demand with
+`swift run fluidaudio minimax-corpus`. Attribution, revision pin,
+and WER caveats live in [`MinimaxCorpus.md`](MinimaxCorpus.md).
+
+Reference each language as `--corpus minimax-<lang>`:
+
+| Backend     | Default corpus     | Other supported MiniMax languages              |
+|-------------|--------------------|------------------------------------------------|
+| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) |
+| PocketTTS   | `minimax-english`  | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` |
+| StyleTTS2   | `minimax-english`  | `english` only (LibriTTS multi-speaker)        |
+| Magpie      | `minimax-english`  | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` |
+| CosyVoice3  | `minimax-chinese`  | `chinese`, `cantonese`                         |
+
+Lines beginning with `#` are comments. Custom corpora can still be
+passed with `--corpus-path <file.txt>`.
+
+### Metrics
+
+Per phrase:
+- `ttft_ms` — time-to-first-audio. For one-shot backends this equals
+  `synth_ms`. **PocketTTS** is benchmarked through
+  `synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first
+  80 ms audio frame (1920 samples @ 24 kHz). **Magpie** is benchmarked
+  through `synthesizeStream`, so its `ttft_ms` is the first
+  `MagpieAudioChunk` emit time (typically ~9.6 s on M2 vs ~15 s for
+  full synth).
+- `synth_ms` — total synth wall time.
+- `audio_ms` — generated audio duration.
+- `rtfx` — `audio_ms / synth_ms`.
+- `wer`, `cer` — via Parakeet ASR roundtrip on the rendered WAV.
+- `stage_ms` — per-stage breakdown (backend-specific keys; populated
+  for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS /
+  StyleTTS2 / CosyVoice3).
+- Backend-specific extras: `encoder_tokens`, `acoustic_frames`,
+  `chunk_count`, `frame_count`, `code_count`, `finished_on_eos`,
+  `generated_token_count`, etc.
+
+Aggregates:
+- `cold_start_s` — `manager.initialize()` wall time. CosyVoice3 also
+  includes voice-asset load.
+- `first_synth_ms` — first synth call after init (still cold-ish).
+- `ttft_ms_p50` / `ttft_ms_p95`.
+- `warm_synth_ms_p50` / `warm_synth_ms_p95`.
+- `agg_rtfx` — `Σ audio_ms / Σ synth_ms` across the corpus.
+- `peak_rss_mb` — process-wide peak resident set, via
+  `task_vm_info_data_t.resident_size_peak`.
+- Per-category macro WER / CER.
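+
+The aggregate definitions above can be sketched in a few lines of
+Python. This is an illustration, not the harness code: the record
+layout, the output key names, and the nearest-rank percentile are
+assumptions, and the numbers are made up for the example.
+
+```python
+import math
+
+def percentile(values, pct):
+    """Nearest-rank percentile (assumed; the harness may interpolate)."""
+    ordered = sorted(values)
+    rank = max(1, math.ceil(pct / 100 * len(ordered)))
+    return ordered[rank - 1]
+
+def aggregate(phrases):
+    """Corpus-level aggregates from per-phrase records."""
+    synth = [p["synth_ms"] for p in phrases]
+    audio = [p["audio_ms"] for p in phrases]
+    return {
+        # Ratio of sums, not mean of per-phrase ratios, so long
+        # phrases weigh proportionally more.
+        "agg_rtfx": sum(audio) / sum(synth),
+        "synth_ms_p50": percentile(synth, 50),
+        "synth_ms_p95": percentile(synth, 95),
+    }
+
+# Hypothetical records, for illustration only.
+phrases = [
+    {"synth_ms": 1000, "audio_ms": 4000},
+    {"synth_ms": 2000, "audio_ms": 12000},
+    {"synth_ms": 4000, "audio_ms": 12000},
+]
+report = aggregate(phrases)
+print(report["agg_rtfx"])  # 4.0
+```
+
+Note that the mean of the per-phrase RTFx values in this toy data
+(4, 6, 3) would be about 4.33, which is why the ratio-of-sums
+definition is spelled out.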
+
+### Reproducibility
+
+```bash
+# From the package root.
+swift run fluidaudio tts-benchmark \
+  --backend kokoro-ane \
+  --corpus minimax-english \
+  --voice af_heart \
+  --compute-units default \
+  --output-json bench.json \
+  --audio-dir bench-wavs/
+```
+
+The harness writes a JSON report to `--output-json` and (optionally)
+keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR
+roundtrip. ASR defaults to `parakeet` for English runs and is
+skipped by default for CosyVoice3; pass `--asr-backend cohere
+--cohere-model-dir <dir>` to score Mandarin (or any of the 14
+Cohere languages) with [Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).
+
+## Results
+
+### Per-backend top-line
+
+Reference machine: **MacBook Air, Apple M2 (2022), 8-core CPU /
+8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26**
+(`Mac14,2`, on AC). All English runs use `--compute-units default`,
+voice = backend default
+(`af_heart` for Kokoro, `alba` for PocketTTS, `John` for Magpie),
+corpus = `minimax-english` (100 phrases), Parakeet TDT roundtrip for
+WER / CER.
+
+| Backend     | License     | Languages              | Footprint | Cold start | TTFT p50 / p95\*   | Synth p50 / p95     | Agg RTFx | Peak RSS | WER     | CER     | Notes |
+|-------------|-------------|------------------------|-----------|------------|---------------------|---------------------|----------|----------|---------|---------|-------|
+| Kokoro ANE  | Apache-2.0  | en (af_heart only)     | ~330 MB   | 37.9 s     | 1586 / 2515 ms      | 1586 / 2515 ms      | 5.19×    | 738 MB   | 0.108   | 0.040   | one-shot; per-stage CU sweep, 7-graph pipeline |
+| Kokoro      | Apache-2.0  | en (af_heart only)     | ~330 MB   | 92.2 s     | 3113 / 4696 ms      | 3113 / 4696 ms      | 2.02×    | 736 MB   | 0.013   | 0.005   | one-shot; cleanest English ASR roundtrip |
+| PocketTTS   | research    | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | **1244 / 4749 ms**  | 8757 / 19174 ms     | 0.61×    | 1503 MB  | 0.014   | 0.006   | **streaming**; TTFT is first 80 ms audio frame |
+| StyleTTS2   | MIT         | en (LibriTTS multi-spk) | ~280 MB  | 955 s§     | 6671 / 15990 ms§    | 6671 / 15990 ms§    | 2.72×§   | 963 MB§  | 0.440§  | 0.241§  | full 100/100 `minimax-english` via [misaki→espeak post-pass remap](#styletts2-misaki--espeak-post-pass-remap); ref_s = LibriTTS `696_92939_000016_000006.wav` (StyleTTS2 demo voice) |
+| Magpie      | research    | en/es/de/fr/it/vi/zh/hi | ~1.3 GB   | 38.5 s∥    | **9580 / 23796 ms**∥ | 15080 / 29895 ms∥   | 0.64×∥   | 762 MB∥  | 0.056   | 0.033   | **streaming TTFT**: first audio chunk at 9.6 s p50 on M2 (full synth 15.1 s); split-K/V decoder; outputBackings fast path with latched fallback |
+| CosyVoice3  | Apache-2.0  | zh (mandarin)          | ~1.5 GB   | 29.2 s†    | 14091 / 23679 ms†   | 14091 / 23679 ms†   | 0.357×†  | 3302 MB† | n/a‡    | 0.017‡  | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via [auto-chunker](#cosyvoice3-auto-chunker) but not benchmarked (no yue ASR) |
+
+\* TTFT for **PocketTTS / Magpie** is first-frame emit through the
+streaming API; the others are one-shot, so `ttft_ms == synth_ms`.
+
+† CosyVoice3 Chinese: 100/100 phrases, 0 errors, in-harness ASR skipped (CER is from the offline pass in ‡).
+Cold start dropped from 302.7 s to 29.2 s on the warm re-run.
+
+‡ CosyVoice3 CER measured on the **full 100-phrase**
+`minimax-chinese` corpus via `whisper-large-v3` (Python CPU FP32,
+[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py)) on
+the WAVs rendered by `tts-benchmark --backend cosyvoice3 --corpus
+minimax-chinese --skip-asr --audio-dir <dir>`: **macro CER 1.68%
+(0.0168)**, **micro CER 1.84% (0.0184)** across 100 phrases.
+Whisper is the source of truth here because Cohere Transcribe q8
+hit a `MILCompilerForANE` cache failure on this M2 host and ran on
+the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple
+hours for the full 100-phrase set vs. ~70 min for whisper). WER is
+omitted because Mandarin has no word boundaries and `WERCalculator`
+splits on whitespace, so word-level WER reads near 100% and is
+meaningless.
+
+∥ Magpie: streamed via `synthesizeStream`. TTFT (9.6 s p50) is
+first-chunk emit; synth (15.1 s p50) is full-utterance wall time —
+the 5.5 s gap is the streaming win.
+
+§ StyleTTS2 (**beta** — `StyleTTS2Manager.initialize` emits a
+runtime warning): warm-cache run; first cold compile of the
+bucketed text_predictor / diffusion_step / decoder graphs is
+multi-second. ref_s dumped via
+[`06_dump_ref_s.py`](https://github.com/voicelink-ai/mobius-styletts2/blob/main/models/tts/styletts2/scripts/06_dump_ref_s.py).
+Read WER **relatively** per the
+[WER caveat](#about-the-wer--cer-numbers); StyleTTS2's own demo
+notebook reports artifacts on long sentences at default
+`alpha/beta/diffusion_steps`.
+
+### Kokoro ANE — per-stage breakdown (default preset, MiniMax-English)
+
+Means across 100 `minimax-english` phrases on M2. Stages map to the
+7-CoreML-graph split documented in [KokoroAne.md](KokoroAne.md). Vocoder
++ noise together account for ~92% of synth time, which is the natural
+target for any further per-stage compute-unit re-tuning. The MiniMax
+mean is meaningfully higher than the prior Harvard-sentences run
+because phrases 81–100 are paragraph-length news / story sentences.
+
+| Stage         | Mean ms | % of total |
+|---------------|---------|------------|
+| `albert`      | 28.2    | 2.0%       |
+| `post_albert` | 12.1    | 0.9%       |
+| `alignment`   | 1.8     | 0.1%       |
+| `prosody`     | 49.2    | 3.5%       |
+| `noise`       | 242.6   | 17.4%      |
+| `vocoder`     | 1039.8  | 74.4%      |
+| `tail`        | 24.6    | 1.8%       |
+| **total**     | 1398.4  | 100%       |
+
+### Magpie — per-stage breakdown (default preset, MiniMax-English)
+
+Means across 100 `minimax-english` phrases on M2 (`John` voice, en,
+default compute units), captured during the original one-shot
+profiling run. `ar_loop` is the umbrella for the per-step
+`decoder_step` + `sampler` (so it is not added on top in the total).
+`nanocodec` runs concurrently with the AR loop in chunked-streaming
+mode, which is why the per-stage means do not sum to total warm-synth
+mean. The AR loop dominates the wall clock, and its cost grows
+super-linearly with phrase length — long news / story phrases drive
+the long-tail p95.
+
+| Stage              | Mean ms |
+|--------------------|---------|
+| `text_encoder`     | 91      |
+| `prefill`          | 281     |
+| `ar_loop`          | 17946   |
+| └── `decoder_step` | 14840   |
+| └── `sampler`      | 3081    |
+| `nanocodec`        | 17948   |
+
+### About the WER / CER numbers
+
+The MiniMax corpus mixes short conversational phrases, medium news
+headlines, and long narrative paragraphs. WER on the long tail is
+sensitive to the ASR + text-normalizer stack (e.g. `"3,5%"` →
+`"three point five percent"` vs. `"three and a half percent"`); per
+the [upstream community
+discussion](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10),
+absolute WER is best read **relatively** (backend A vs. backend B on
+the same corpus + same ASR + same normalizer) rather than against
+raw paper numbers.
+
+## StyleTTS2 misaki → espeak post-pass remap
+
+StyleTTS2's LibriTTS checkpoint was trained on **espeak-ng-phonemized**
+text, but the in-tree BART G2P (shared with Kokoro) emits **misaki**
+output. The 178-token vocab accepts both forms, but the acoustic
+embeddings for the misaki ligature glyphs are essentially untrained
+noise — every training utterance saw the espeak form.
+
+Four systematic divergences vs. `espeak-ng -v en-us --ipa -q`:
+
+| misaki | espeak-ng | example                  |
+|--------|-----------|--------------------------|
+| `ʧ`    | `tʃ`      | choice → `tʃˈɔɪs`        |
+| `ʤ`    | `dʒ`      | jumps  → `dʒˈʌmps`       |
+| `ɜɹ`   | `ɝ`       | girl   → `ɡˈɝl`          |
+| `əɹ`   | `ɚ`       | over   → `ˈoʊvɚ`         |
+
+Fix: 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
+on `.americanEnglish`. Result on `minimax-english`: WER 0.581 →
+0.440, CER 0.476 → 0.241, agg-RTFx 2.36× → 2.72× (warm-cache
+re-run, so latency / RSS deltas are noise — WER / CER are the real
+signal). WER is still ~34× Kokoro's (0.440 vs. 0.013); remaining errors cluster
+on word-level BART mispronunciations and long-tail diffusion artifacts.
+Further gains likely need a richer remap layer or swapping BART for
+libespeak-ng directly.
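+
+The four rules amount to plain substring replacement. A minimal
+Python sketch of the idea follows; the actual post-pass is Swift
+code in `StyleTTS2Phonemizer.phonemize`, and the function name, rule
+order, and sample inputs here are illustrative assumptions.
+
+```python
+# misaki form -> espeak-ng form, taken from the table above.
+MISAKI_TO_ESPEAK = [
+    ("ʧ", "tʃ"),
+    ("ʤ", "dʒ"),
+    ("ɜɹ", "ɝ"),
+    ("əɹ", "ɚ"),
+]
+
+def remap_to_espeak(phonemes: str) -> str:
+    """Rewrite misaki-style phonemes into the espeak-ng forms."""
+    for misaki, espeak in MISAKI_TO_ESPEAK:
+        phonemes = phonemes.replace(misaki, espeak)
+    return phonemes
+
+print(remap_to_espeak("ʧˈɔɪs"))   # tʃˈɔɪs
+print(remap_to_espeak("ˈoʊvəɹ"))  # ˈoʊvɚ
+```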
+
+## CosyVoice3 decode budget cap
+
+CosyVoice3's Flow CFM was exported with a fixed input shape of
+`[1, 250]` speech tokens (`flowTotalTokens` in
+`CosyVoice3Constants.swift:45`). The LLM-Decode AR loop is allowed to
+emit up to `flowTotalTokens − N_prompt` tokens before being cut off
+(typically ~163 generated tokens after the speech-prompt portion).
+At `tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000`
+that's **40 ms of audio per generated token**, so the loop produces
+**at most ~6.5 s of speech per phrase**, regardless of how long the
+input text is.
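+
+The arithmetic can be checked directly. In this Python sketch the
+constants are the values quoted above; `n_prompt = 87` is an
+assumption back-solved from the ~163-token figure, not a value read
+from the code.
+
+```python
+TOKEN_MEL_RATIO = 2
+HIFT_SAMPLES_PER_FRAME = 480
+SAMPLE_RATE = 24_000
+FLOW_TOTAL_TOKENS = 250
+
+# Audio duration covered by one generated speech token, in ms.
+ms_per_token = 1000 * TOKEN_MEL_RATIO * HIFT_SAMPLES_PER_FRAME / SAMPLE_RATE
+
+def max_audio_seconds(n_prompt: int) -> float:
+    """Upper bound on generated audio for one synthesizer call."""
+    return (FLOW_TOTAL_TOKENS - n_prompt) * ms_per_token / 1000
+
+print(ms_per_token)           # 40.0
+print(max_audio_seconds(87))  # 6.52
+```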
+
+When the AR loop exits because it ran out of budget (i.e. no EOS
+token in `stopRange = 6_561…6_760`) instead of natural termination,
+`CosyVoice3Synthesizer` now:
+
+1. Logs a `.warning` (one-shot per phrase) naming the
+   `decoded.count / maxNew` budget and the produced audio duration.
+2. Sets `CosyVoice3SynthesisResult.finishedOnEos = false`, which the
+   benchmark harness surfaces as the `finished_on_eos` field on each
+   phrase in the JSON report.
+
+Impact on the Cantonese corpus (`minimax-cantonese`,
+100 phrases) **without the chunker**: 80 / 100 phrases hit the
+cap, each producing exactly 163 generated tokens / ~6.5 s of audio.
+The Mandarin corpus sees a much lower truncation rate because
+MiniMax-zh phrases are shorter on average.
+
+The structural fix — re-exporting the Flow CFM from
+[`mobius-cosyvoice3`](https://github.com/voicelink-ai/mobius-cosyvoice3)
+with a larger fixed input shape (e.g. `[1, 500]`) — is upstream
+work; bumping the constant in Swift alone would make the Flow
+input/output shapes mismatch at predict time. The shipped workaround
+is the call-site [auto-chunker](#cosyvoice3-auto-chunker), which
+drops cantonese truncation from 80/100 → 5/100 by splitting long
+inputs at clause boundaries and crossfading the results.
+
+Surfaced in
+`CosyVoice3Synthesizer.synthesize`
+(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift`)
+and
+`CosyVoice3SynthesisResult.finishedOnEos`
+(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift`).
+
+## CosyVoice3 auto-chunker
+
+Re-exporting Flow CFM with a larger fixed input shape is gated on
+upstream conversion work. Until that lands, `CosyVoice3TtsManager`
+splits long inputs at the call site, synthesizes each chunk
+independently, and merges with an 8 ms equal-power cosine crossfade.
+
+**Splitter policy** (`CosyVoice3TextChunker`):
+
+- **Hard enders** commit always: `.`, `!`, `?`, `。`, `！`, `？`,
+  `\n`.
+- **Soft enders** commit only when the running estimate is at or past
+  the budget: `，`, `、`, `；`, `：`, `;`, `,`, ASCII space.
+- **Force-split** once the running estimate reaches `budget + 30`
+  tokens without a natural boundary (rare; mostly continuous CJK
+  with no punctuation).
+
+**Token-rate estimate** (calibrated against minimax-zh + minimax-yue
+runs):
+
+| Char class | Tokens / char | Rationale                                                    |
+|------------|---------------|--------------------------------------------------------------|
+| CJK        | 7.5           | worst-case observed in real generation; varies 5.5–9 per char |
+| ASCII      | 1.5           | matches BPE rate on English text                              |
+| Other      | 2.5           | conservative for accented Latin / non-CJK Unicode             |
+
+`defaultMaxSpeechTokens` is **110**, leaving margin under the
+250-token Flow cap minus typical 60–90 token speech-prompt context.
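+
+Taken together, the splitter policy and the token-rate table can be
+sketched as below. This is an illustrative Python rendering, not the
+`CosyVoice3TextChunker` source: the CJK range check, the whitespace
+trimming, and the exact commit order are assumptions.
+
+```python
+HARD_ENDERS = set(".!?。！？\n")
+SOFT_ENDERS = set("，、；：;, ")
+FORCE_OVERSHOOT = 30
+
+def tokens_per_char(ch: str) -> float:
+    """Estimated speech tokens for one character (table above)."""
+    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs only (assumed)
+        return 7.5
+    if ch.isascii():
+        return 1.5
+    return 2.5
+
+def chunk(text: str, budget: float = 110) -> list[str]:
+    """Split text so each chunk's estimate stays near the budget."""
+    chunks, current, estimate = [], "", 0.0
+    for ch in text:
+        current += ch
+        estimate += tokens_per_char(ch)
+        hard = ch in HARD_ENDERS                         # always commit
+        soft = ch in SOFT_ENDERS and estimate >= budget  # commit if full
+        forced = estimate >= budget + FORCE_OVERSHOOT    # no boundary seen
+        if hard or soft or forced:
+            if current.strip():
+                chunks.append(current.strip())
+            current, estimate = "", 0.0
+    if current.strip():
+        chunks.append(current.strip())
+    return chunks
+
+print(chunk("你好。再见！"))  # ['你好。', '再见！']
+```
+
+With the 7.5 tokens/char CJK rate, an unpunctuated run is force-split
+at the 19th character (19 × 7.5 = 142.5 ≥ 140).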
+
+**Concatenation**: 8 ms equal-power cosine crossfade at 24 kHz
+between adjacent chunks; single-chunk path short-circuits to plain
+copy.
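+
+A sketch of an equal-power cosine crossfade, for readers unfamiliar
+with the technique. It is plain Python over float lists and assumes
+the two chunks overlap by the fade length; whether the Swift merge
+overlaps or pads is not stated here, so treat that as an assumption.
+
+```python
+import math
+
+SAMPLE_RATE = 24_000
+FADE_SAMPLES = SAMPLE_RATE * 8 // 1000  # 8 ms -> 192 samples
+
+def crossfade_concat(a, b, fade=FADE_SAMPLES):
+    """Join two PCM chunks with an equal-power cosine crossfade."""
+    fade = min(fade, len(a), len(b))
+    if fade == 0:
+        return list(a) + list(b)
+    mixed = []
+    for i in range(fade):
+        theta = (i + 0.5) / fade * math.pi / 2
+        # cos^2 + sin^2 == 1, so the summed gain power stays constant.
+        mixed.append(a[len(a) - fade + i] * math.cos(theta)
+                     + b[i] * math.sin(theta))
+    return list(a[:len(a) - fade]) + mixed + list(b[fade:])
+
+out = crossfade_concat([1.0] * 1000, [0.0] * 1000)
+print(FADE_SAMPLES, len(out))  # 192 1808
+```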
+
+**Validation** (full `minimax-cantonese`, 100 phrases, M2):
+
+| Metric                                    | Pre-chunker | Post-chunker | Δ          |
+|-------------------------------------------|-------------|--------------|------------|
+| `finished_on_eos=false` (truncated)       | 80 / 100    | **5 / 100**  | −94%       |
+| Longest audio output                      | 6.5 s       | **16.1 s**   | +148%      |
+| agg-RTFx                                  | 0.245×      | 0.249×       | +1.6%      |
+| TTFT p50                                  | 23.9 s      | 35.7 s       | +49%       |
+| TTFT p95                                  | 41.2 s      | 60.5 s       | +47%       |
+| Peak RSS                                  | 2016 MB     | 3264 MB      | +62%       |
+
+The 5/100 residual is the long-tail token-rate worst case (some
+Cantonese characters generate >9 speech tokens); raising the
+per-CJK heuristic further would over-fragment short phrases.
+The cleaner fix is the upstream Flow re-export.
+
diff --git a/Documentation/TTS/CosyVoice3.md b/Documentation/TTS/CosyVoice3.md
index 7308e208..2bca7061 100644
--- a/Documentation/TTS/CosyVoice3.md
+++ b/Documentation/TTS/CosyVoice3.md
@@ -3,16 +3,19 @@
 Mandarin zero-shot voice cloning via Qwen2 LM + CFM Flow + HiFT vocoder,
 running on CoreML.
 
-> ⚠️ **Beta / experimental.** End-to-end synthesis is currently slow on
-> Apple Silicon — RTFx < 1.0 typical, several seconds of latency for
-> short Mandarin utterances. The slowdown is partly the Flow CFM stage
-> (fp32, CPU-or-GPU only because fp16 + ANE produces NaNs through the
-> fused `layer_norm` — CoreMLTools limitation, tracked upstream) and
-> partly HiFT sinegen / windowing ops that fall back to CPU. May be a
-> model issue, may be recoverable through better conversion. Treat
-> performance numbers as preliminary; the Swift API, model layout, and
-> prompt-asset format may change in subsequent releases without
-> deprecation aliases.
+> ⚠️ **Beta / experimental.** End-to-end synthesis is below real-time
+> on Apple Silicon: agg-RTFx **0.357×** and p50 TTFT **~14.1 s** on
+> the full `minimax-chinese` 100-phrase corpus (M2, default compute
+> units), after the
+> [HiFT timeout fix](Benchmarks.md#cosyvoice3-hift-timeout-fix) and
+> [LLM-Decode `outputBackings` double-buffer](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix).
+> The slowdown is partly the Flow CFM stage (fp32, CPU-or-GPU only
+> because fp16 + ANE produces NaNs through the fused `layer_norm` —
+> CoreMLTools limitation, tracked upstream) and partly HiFT sinegen
+> / windowing ops that fall back to CPU. May be a model issue, may
+> be recoverable through better conversion. Treat performance numbers
+> as preliminary; the Swift API, model layout, and prompt-asset format
+> may change in subsequent releases without deprecation aliases.
 
 ## Files
 
@@ -105,8 +108,9 @@ let result = try await manager.synthesize(
 
 | Field | Default | Notes |
 |---|---|---|
-| `maxNewTokens` | `nil` (cap = 1024) | Hard ceiling on speech-token count |
+| `maxNewTokens` | `nil` (= `flowTotalTokens − N_prompt`) | Soft ceiling on the LLM-Decode AR loop. The hard ceiling is the structural 250-token cap below — `maxNewTokens` only lets you generate fewer than that. |
 | `seed` | 42 | Drives the RAS sampler RNG; reproducible runs |
+| `disableAutoChunking` | `false` | When `true`, bypasses `CosyVoice3TextChunker` and runs a single synthesizer call regardless of input length. Use when you've pre-segmented input upstream (UI streaming, paragraph-at-a-time playback, etc.). The structural 250-token cap then applies and long inputs truncate mid-utterance. |
 
 `CosyVoice3SynthesisResult`:
 
@@ -116,13 +120,92 @@ let result = try await manager.synthesize(
 | `sampleRate` | `Int` | always 24000 |
 | `generatedTokenCount` | `Int` | tokens before EOS |
 | `decodedTokens` | `[Int32]` | full speech token sequence (debug) |
+| `finishedOnEos` | `Bool` | `true` = AR loop exited on an EOS token (natural termination); `false` = budget exhausted, audio truncated mid-utterance. See "Decode budget cap" below. |
+
+### Decode budget cap + auto-chunking
+
+The Flow CFM model is exported with a fixed-shape `token_total` input of
+`[1, 250]` (`CosyVoice3Constants.flowTotalTokens = 250`). Each LLM-Decode
+token corresponds to **40 ms of audio** (`tokenMelRatio = 2 × hiftSamplesPerFrame = 480 / sampleRate = 24 000`),
+so the *generated* portion of a single synthesizer call is bounded by
+`(250 − N_prompt) × 40 ms`. With a typical prompt of ~85–95 tokens,
+this leaves ~6.2–6.6 s of generated audio per call, so long Mandarin
+phrases would truncate mid-utterance if synthesized in one shot.
+
+**`CosyVoice3TtsManager.synthesize(...)` auto-chunks long input** to
+sidestep this. Pipeline:
+
+1. Run the existing Chinese normalizer (or skip it, per `prenormalized`).
+2. `CosyVoice3TextChunker.chunk(normalized)` greedily splits on hard
+   sentence enders (`. ! ? 。 ！ ？`) and falls back to soft clause
+   separators (`, ; ， ； 、 ：`) when sentences exceed the budget. The
+   default budget is `defaultMaxSpeechTokens = 110` speech tokens
+   (a ~45-token margin under the typical 155-token room for new tokens; the
+   30-token force-split overshoot may push committed chunks to ~140 estimated).
+3. If the chunker returns one segment, take the fast path — single
+   synthesizer call, no concat overhead.
+4. Otherwise loop, calling the synthesizer once per chunk, then merge
+   results: PCM concatenated with an 8 ms cosine cross-fade at each
+   boundary (masks DC/phase mismatch from independent synth calls);
+   `generatedTokenCount`/`decodedTokens` summed/concatenated;
+   `finishedOnEos` = AND across all chunks.
+
+Tunables: `CosyVoice3TextChunker.defaultMaxSpeechTokens` (110) is the
+default budget; pass `disableAutoChunking: true` in
+`CosyVoice3SynthesisOptions` to bypass the chunker entirely and run a
+single call (useful for UI-driven sentence-at-a-time streaming where
+the caller already controls segmentation).
+
+Token-rate estimate inside the chunker (calibrated against minimax-zh
+corpus runs — initial 5.5 figure was too optimistic and let ~16% of
+phrases hit the cap; 7.5 covers the worst-case observed real rate):
+
+| Class | Tokens/char | Rationale |
+|---|---|---|
+| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
+| ASCII | 1.5 | BPE compresses; English speaks faster than Mandarin per char |
+| Other (Latin-1, etc.) | 2.5 | middle ground |
+
+Caveats:
+
+- **Prosody discontinuity at boundaries.** Each chunk re-establishes the
+  pitch contour from the prompt, so concatenated audio has audible breaks
+  at chunk seams. The 8 ms cross-fade hides clicks/DC offsets but cannot
+  reconstruct cross-sentence prosody.
+- **Per-chunk prefill cost.** Each segment pays the prefill cost
+  separately, so total wall-clock for an N-chunk synth is roughly
+  `N × prefill + Σ decode_per_chunk`. Single-chunk inputs are unaffected.
+- **Estimate slack.** The token-per-char heuristic is rough; if a chunk
+  somehow exceeds the model's structural budget at runtime, the
+  synthesizer still emits the `LLM-Decode budget exhausted` warning and
+  returns `finishedOnEos: false` for that chunk.
+
+Behavior of the underlying synthesizer when its budget is hit (still
+applies for `disableAutoChunking: true` or for one-shot mode):
+
+- **AR loop exhausts `maxNew` without observing an EOS** in
+  `CosyVoice3Constants.stopRange` (`6_561…6_760`).
+- `CosyVoice3Synthesizer` emits a `.warning`-level log:
+  `"LLM-Decode budget exhausted: <N> generated tokens / <maxNew> cap (no EOS observed). Output truncated at ~<S>s of audio."`.
+- `result.finishedOnEos` is `false` so callers can detect it
+  programmatically (the `tts-benchmark` harness surfaces this as a
+  per-phrase `finished_on_eos` field in the JSON report).
+
+Lifting the cap structurally (no auto-chunk, no prosody seams) requires
+re-exporting Flow with a larger `token_total` shape (e.g. `[1, 500]` for
+~16 s) — handled upstream in the `mobius-cosyvoice3` conversion pipeline;
+not changeable from the Swift host.
 
 ## Key State
 
-### KV cache (`kv_cache[24, 1, 2, 768, 64]` fp16)
-- 24 transformer layers × `[K,V]` × heads × dim, packed into one `MLState`-style
-  `MLMultiArray` that the prefill produces and the decode loop both reads
-  and overwrites in-place.
+### KV cache (`kv_k` / `kv_v` each `[24, 1, 2, 768, 64]` fp32)
+- 24 transformer layers × `[K,V]` × heads × dim, split across two
+  `MLMultiArray` outputs (`kv_k`, `kv_v`) that prefill produces and the
+  decode loop carries forward across steps via
+  `MLPredictionOptions.outputBackings` double-buffering.
+- No `MLState` dependency — runs on the package baseline (macOS 14 / iOS 17).
+- ~9 MB per array; pre-allocated front/back/spare buffers rotated each
+  step (see [LLM-Decode `outputBackings` fix](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix)).
 - Reset per `synthesize()` call.
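+
+The ~9 MB figure follows from the shape and dtype. A quick Python
+check (the three-buffers-per-array total is an assumption based on
+the front/back/spare description above):
+
+```python
+from math import prod
+
+KV_SHAPE = (24, 1, 2, 768, 64)
+BYTES_PER_FP32 = 4
+
+bytes_per_array = prod(KV_SHAPE) * BYTES_PER_FP32
+print(bytes_per_array)          # 9437184
+print(bytes_per_array / 2**20)  # 9.0 (MiB)
+
+# Assumed: front/back/spare buffers for each of kv_k and kv_v.
+total_mib = 3 * 2 * bytes_per_array / 2**20
+print(total_mib)                # 54.0
+```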
 
 ### Prompt assets (`CosyVoice3PromptAssets`)
diff --git a/Documentation/TTS/KokoroAne.md b/Documentation/TTS/KokoroAne.md
index 88afa80a..b941f9e6 100644
--- a/Documentation/TTS/KokoroAne.md
+++ b/Documentation/TTS/KokoroAne.md
@@ -148,7 +148,11 @@ timing (5 s of audio, M1):
 | Vocoder    | ~120 ms  |
 | Tail       | ~50 ms   |
 
-Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx).
+Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx). For
+full-corpus numbers (warm-synth p50 / p95, peak RSS, WER) on the
+MiniMax-English 100-phrase suite — including the longer paragraph
+phrases that pull the per-corpus aggregate down to ~5.2× — see
+[Benchmarks.md](Benchmarks.md).
 
 ## Source
 
diff --git a/Documentation/TTS/Magpie.md b/Documentation/TTS/Magpie.md
index 9124b59e..74a93fc8 100644
--- a/Documentation/TTS/Magpie.md
+++ b/Documentation/TTS/Magpie.md
@@ -5,16 +5,26 @@ Lives under `Sources/FluidAudio/TTS/Magpie/`.
 
 ## Status
 
-Functional but **quite slow — needs significant perf work, not for real-time
-or latency-sensitive use.** First synth on a fresh process is dominated by
-CoreML model load + first-call ANE compile (~30 s); warm synths run at
-~96 s wall for an 8-word English sentence on M-series, i.e. RTFx ≈ **0.04**
-(~25× slower than realtime). Whether the throughput ceiling is a model
-characteristic, a CoreML conversion limitation, or both is still being
-investigated and is expected to improve in subsequent iterations. For
-real-time use prefer Kokoro (~20× RTFx) or PocketTTS (~1.5–2× RTFx);
-Magpie's value prop is multilingual coverage and the 5 built-in speaker
-contexts, not throughput.
+> ⚠️ **Beta / experimental.** Below real-time on Apple Silicon
+> (agg-RTFx ~0.41× on M2). Not for latency-sensitive use; prefer
+> Kokoro / Kokoro ANE or PocketTTS for real-time. Initializing
+> `MagpieTtsManager` logs a runtime beta warning at `.warning` level.
+
+Functional but **below real-time — not for latency-sensitive use.**
+On the full `minimax-english` 100-phrase corpus (M2, default compute
+units), Magpie posts agg-RTFx **0.41×** with p50 warm synth ~19.8 s
+and p95 ~57.5 s — most of the long tail comes from paragraph-length
+news / story phrases (max 107 s on a single 18 s utterance). Cold
+start ~19 s on warm ANE caches, dominated by first-call decoder_step
+compile. The AR loop (`decoder_step` + sampler) dominates wall clock
+and grows super-linearly with phrase length; the
+[`outputBackings` fast path](Benchmarks.md#magpie-outputbackings-fast-path)
+already eliminated the per-step KV reallocation cost. Further gains
+likely need an MLX-backed LocalTransformer or a smaller-K/V variant.
+For real-time use prefer Kokoro / Kokoro ANE (2–5× RTFx) or PocketTTS
+(streaming, TTFT ~1.2 s); Magpie's value prop is multilingual coverage
+(en/es/de/fr/it/vi/zh/hi) and 5 built-in speaker contexts, not
+throughput.
 
 Audio quality is perceptually clean across all 5 speakers and ASR-clean on
 4/5; speaker 0 has a single trailing-word artifact ("…and") attributable
diff --git a/Documentation/TTS/MinimaxCorpus.md b/Documentation/TTS/MinimaxCorpus.md
new file mode 100644
index 00000000..66dd679d
--- /dev/null
+++ b/Documentation/TTS/MinimaxCorpus.md
@@ -0,0 +1,89 @@
+# MiniMax Multilingual TTS Test Set
+
+The FluidAudio `tts-benchmark` corpus is sourced on demand from the
+[MiniMaxAI/TTS-Multilingual-Test-Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
+Hugging Face dataset and converted to the harness format (one phrase
+per non-empty, non-`#` line). The fetched `.txt` files land under
+`Benchmarks/tts/corpus/minimax/<lang>.txt`; they are gitignored — only
+this document is checked in.
+
+| Field    | Value |
+|----------|-------|
+| Source   | https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set |
+| Revision | `cb416f0ac3658da0577e97873065e19fe6488917` (initial public release) |
+| License  | [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) |
+| Citation | MiniMax-Speech tech report — [arXiv 2505.07916](https://arxiv.org/pdf/2505.07916) |
+| Languages | 24 (arabic, cantonese, chinese, czech, dutch, english, finnish, french, german, greek, hindi, indonesian, italian, japanese, korean, polish, portuguese, romanian, russian, spanish, thai, turkish, ukrainian, vietnamese) |
+| Phrases   | 100 per language (2400 total) |
+
+The fetched text files are derivative works of the upstream dataset
+and remain under **CC-BY-SA-4.0**. The rest of the FluidAudio
+repository is licensed separately (see top-level `LICENSE`); only the
+contents of `Benchmarks/tts/corpus/minimax/` are share-alike-bound to
+CC-BY-SA-4.0.
+
+## Why this corpus?
+
+MiniMax positions this as *"a public benchmark used in a number of
+recent TTS papers, which makes our numbers directly comparable to
+existing work"* (Gradium, MiniMax-Speech, seed-tts-eval, etc.).
+FluidAudio's `tts-benchmark` ships exclusively against this corpus
+so the resulting RTFx / WER numbers land on the same axis as
+published TTS work.
+
+## Format conversion
+
+Upstream lines have a `<cloning_audio_filename>|<text>` pipe-delimited
+shape because the dataset also ships per-speaker reference audio for
+zero-shot voice cloning. The FluidAudio harness only needs the text —
+voice selection is a per-backend concern (Kokoro / PocketTTS / Magpie /
+StyleTTS2 each have their own voice plumbing). The leading
+`<filename>|` is stripped at fetch time; if you need the cloning audio
+later, fetch it from the upstream HF repo's `audio/` directory.
+
+## Fetching
+
+The `fluidaudio minimax-corpus` CLI subcommand pins the upstream
+revision to the value above so re-runs are deterministic. From the
+package root:
+
+```bash
+# All 24 languages
+swift run fluidaudio minimax-corpus
+
+# Subset
+swift run fluidaudio minimax-corpus --languages english,spanish,hindi
+
+# Refresh against a newer release
+swift run fluidaudio minimax-corpus --revision <commit-sha>
+```
+
+Output lands in `Benchmarks/tts/corpus/minimax/<lang>.txt` (relative
+to the package root) by default; override with `--out-dir <path>`.
+Auth-gated revisions are honored via the standard `HF_TOKEN` /
+`HUGGING_FACE_HUB_TOKEN` env vars (same as every other HF asset pull
+in the project). Run `fluidaudio minimax-corpus --help` for the full
+flag list.
+
+Per-backend ↔ language coverage and `tts-benchmark --corpus minimax-<lang>`
+usage live in [`Benchmarks.md`](Benchmarks.md#corpus).
+
+## WER caveats
+
+Per the [open community discussion on the upstream
+dataset](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10),
+WER on this corpus is sensitive to the ASR + text-normalization stack:
+
+- Whisper-v3 (and similarly Parakeet) often need text normalization on
+  the reference (`"32"` → `"thirty two"`) before comparing against the
+  hypothesis to get a clean WER.
+- For non-Latin-script languages (Hindi, Japanese, Cantonese, etc.) the
+  ASR may emit transliterated forms that don't match the reference
+  script, inflating WER even when the synthesis is intelligible.
+- For non-word-segmented languages (Chinese, Japanese, Thai), CER is
+  the more meaningful metric — `tts-benchmark` already reports both.
+
+This means **MiniMax WER is best read relatively (FluidAudio backend
+A vs. backend B on the same corpus + same ASR), not absolutely**, and
+side-by-side comparison with published numbers requires matching the
+upstream ASR + normalizer choice.
diff --git a/Sources/FluidAudio/ModelNames.swift b/Sources/FluidAudio/ModelNames.swift
index b49066e1..fbb7d281 100644
--- a/Sources/FluidAudio/ModelNames.swift
+++ b/Sources/FluidAudio/ModelNames.swift
@@ -716,7 +716,7 @@ public enum ModelNames {
     /// expected local directory layout is encoded in `CosyVoice3Constants.Files`.
     public enum CosyVoice3 {
         public static let llmPrefill = "LLM-Prefill-T256-M768-fp16"
-        public static let llmDecode = "LLM-Decode-M768-fp16-stateful"
+        public static let llmDecode = "LLM-Decode-M768-fp16"
         public static let flow = "Flow-N250-fp16"
         public static let hift = "HiFT-T500-fp16"
         public static let speechEmbeddings = "speech_embedding-fp16.safetensors"
diff --git a/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift b/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift
index 7051143f..ae616a42 100644
--- a/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift
+++ b/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift
@@ -28,11 +28,12 @@ public actor CosyVoice3ModelStore {
 
     /// - Parameters:
     ///   - directory: Base build directory that contains
-    ///     `llm-fp16/`, `llm-fp16-stateful/`, `flow-fp16-n250/`,
+    ///     `llm-fp16/`, `llm-fp16-decode/`, `flow-fp16-n250/`,
     ///     `hift-fp16-t500/`, `embeddings/`.
     ///   - computeUnits: Defaults to `.cpuAndNeuralEngine`. Applied to
-    ///     LLM-Prefill + HiFT models only. LLM-Decode (stateful) and Flow
-    ///     both force `.cpuAndGPU` regardless (see `loadIfNeeded()`).
+    ///     LLM-Prefill only. LLM-Decode (stateless external cache),
+    ///     Flow, and HiFT all pin `.cpuAndGPU` regardless (see
+    ///     `loadIfNeeded()`).
     public init(directory: URL, computeUnits: MLComputeUnits = .cpuAndNeuralEngine) {
         self.directory = directory
         self.computeUnits = computeUnits
@@ -67,10 +68,10 @@ public actor CosyVoice3ModelStore {
         let prefill = try await compileAndLoad(prefillURL, configuration: config)
         logger.info("Loaded \(CosyVoice3Constants.Files.llmPrefill)")
 
-        // Stateful decode MUST run on `.cpuAndGPU`:
-        //   - ANE refuses to compile the stateful graph (same failure mode
-        //     as Flow: `MILCompilerForANE ANECCompile() FAILED`), so
-        //     `.cpuAndNE` / `.all` deadlock load
+        // Stateless decode MUST run on `.cpuAndGPU`:
+        //   - ANE refuses to compile the rotary + sliced SDPA decode graph
+        //     (same failure mode as Flow: `MILCompilerForANE ANECCompile()
+        //     FAILED`), so `.cpuAndNE` / `.all` deadlock load
         //   - CPU-only works but is ~2× slower than the GPU path
         // Ignore the user-supplied `computeUnits` for decode.
         let decodeConfig = MLModelConfiguration()
@@ -98,7 +99,25 @@ public actor CosyVoice3ModelStore {
         let flow = try await compileAndLoad(flowURL, configuration: flowConfig)
         logger.info("Loaded \(CosyVoice3Constants.Files.flow)")
 
-        let hift = try await compileAndLoad(hiftURL, configuration: config)
+        // HiFT runs on `.cpuAndGPU` (fp16). With `.cpuAndNeuralEngine`
+        // CoreML's planner placed most of HiFT on ANE but kept at least
+        // one op (`HiFT-T500-fp16_main__Op104`) on the BNNS CPU path,
+        // which trips a hard async-dispatch watchdog mid-corpus on
+        // long phrases:
+        //
+        //   E5RT: Submit Async failed for [3:29]: Async task:
+        //   HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out.
+        //   @ CancelTimedOutAsyncTask_block_invoke
+        //
+        // Pinning HiFT to `.cpuAndGPU` removes the ANE+BNNS mixed-compute
+        // pathology (the same family of issue that already forced Flow
+        // and Decode off ANE above). The model is fixed-shape
+        // [1, 80, 500] so GPU placement is predictable. Trade-off: a
+        // small per-call latency increase vs. ANE — acceptable, since
+        // the prior ANE config didn't actually complete the corpus.
+        let hiftConfig = MLModelConfiguration()
+        hiftConfig.computeUnits = .cpuAndGPU
+        let hift = try await compileAndLoad(hiftURL, configuration: hiftConfig)
         logger.info("Loaded \(CosyVoice3Constants.Files.hift)")
 
         loadedModels = CosyVoice3Models(prefill: prefill, decode: decode, flow: flow, hift: hift)
diff --git a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift
index b0a46f93..094879e0 100644
--- a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift
+++ b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift
@@ -4,7 +4,7 @@ import Foundation
 ///
 /// Shipping config (frozen):
 /// - LLM-Prefill-T256-M768-fp16           (cpuAndNeuralEngine)
-/// - LLM-Decode-M768-fp16-stateful        (cpuAndGPU — see note)
+/// - LLM-Decode-M768-fp16                 (cpuAndGPU — see note)
 /// - Flow-N250-fp16                       (cpuAndGPU — an ANE-port
 ///   BC1S rewrite was attempted and reverted: the converted graph ran
 ///   ~3× faster but numerically broken (mel dynamic range collapsed
@@ -15,14 +15,22 @@ import Foundation
 ///   `input_embed.conv_pos_embed` (`Conv1d(1024,1024,k=31)+Mish`)
 ///   that three rewrite attempts couldn't move — ANEF rejects the
 ///   conv footprint regardless of group count.)
-/// - HiFT-T500-fp16                       (cpuAndNeuralEngine)
+/// - HiFT-T500-fp16                       (cpuAndGPU — pinned off
+///   ANE because the `.cpuAndNeuralEngine` planner left at least one
+///   op on the BNNS CPU path, which tripped a hard async-dispatch
+///   watchdog mid-corpus on long phrases:
+///   `E5RT: Submit Async failed ... HiFT-T500-fp16_main__Op104_BnnsCpuInference
+///   has timed out`. GPU placement is deterministic and avoids the
+///   ANE+BNNS mixed-compute pathology.)
 ///
-/// The stateful decode model uses per-layer `MLState` buffers for the
-/// KV cache (48 tensors, `[1, 2, 768, 64]` fp16 each) instead of
-/// round-tripping 18 MB of kv_k / kv_v MLMultiArrays every step. ANE
-/// refuses to compile the stateful graph (`MILCompilerForANE
-/// ANECCompile() FAILED`); decode therefore runs on `.cpuAndGPU`.
-/// Requires macOS 15 / iOS 18.
+/// Decode runs **stateless** with an external KV cache: prefill emits
+/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32, and decode
+/// accepts the same tensors as inputs and returns `kv_k_out` / `kv_v_out`
+/// at the same shape/dtype. The cache is round-tripped once per step
+/// (≈18 MB total). ANE still rejects this graph (`MILCompilerForANE
+/// ANECCompile() FAILED` on the rotary + sliced SDPA), so decode is
+/// pinned to `.cpuAndGPU`. The library floor is macOS 14 / iOS 17 — no
+/// MLState dependency.
 public enum CosyVoice3Constants {
 
     // MARK: - LLM shapes
@@ -66,8 +74,8 @@ public enum CosyVoice3Constants {
     public enum Files {
         public static let llmPrefill = "LLM-Prefill-T256-M768-fp16.mlpackage"
         public static let llmPrefillSubdir = "llm-fp16"
-        public static let llmDecode = "LLM-Decode-M768-fp16-stateful.mlpackage"
-        public static let llmDecodeSubdir = "llm-fp16-stateful"
+        public static let llmDecode = "LLM-Decode-M768-fp16.mlpackage"
+        public static let llmDecodeSubdir = "llm-fp16-decode"
         public static let flow = "Flow-N250-fp16.mlpackage"
         public static let flowSubdir = "flow-fp16-n250"
         public static let hift = "HiFT-T500-fp16.mlpackage"
diff --git a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift
index d71f3ea6..7d76a45b 100644
--- a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift
+++ b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift
@@ -38,11 +38,9 @@ import Foundation
 ///   the 281 runtime-added special tokens (CosyVoice3Tokenizer). Same format
 ///   that `tokenizer_fixture.json` dumps under its `special_tokens` key.
 ///
-/// > Note: Gated to macOS 15 / iOS 18 because the underlying
-/// > `CosyVoice3Synthesizer` uses CoreML `MLState` for the decode KV cache.
-/// > Other FluidAudio modules (ASR, Diarization, VAD, Kokoro, PocketTTS)
-/// > remain available on macOS 14 / iOS 17.
-@available(macOS 15, iOS 18, *)
+/// > Available on the same floor as the rest of FluidAudio (macOS 14 /
+/// > iOS 17). Decode runs stateless with an external KV cache rather than
+/// > `MLState`, so no extra OS gate is required.
 public actor CosyVoice3TtsManager {
 
     private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3TtsManager")
@@ -216,9 +214,60 @@ public actor CosyVoice3TtsManager {
             normalized = CosyVoice3ChineseNormalizer.normalize(text)
         }
 
+        // Auto-chunk long input under the structural 250-token Flow cap.
+        // The chunker greedily splits on hard sentence enders + soft clause
+        // separators when the running speech-token estimate exceeds budget;
+        // short inputs return a single chunk and take the fast path. Caller
+        // can opt out via `options.disableAutoChunking` for pre-segmented
+        // input (e.g. UI-driven streaming).
+        let chunks: [String]
+        if options.disableAutoChunking {
+            chunks = [normalized]
+        } else {
+            let split = CosyVoice3TextChunker.chunk(normalized)
+            chunks = split.isEmpty ? [normalized] : split
+        }
+
+        if chunks.count == 1 {
+            return try await synthesizeChunk(
+                text: chunks[0], promptAssets: promptAssets,
+                options: options, frontend: frontend, synthesizer: synthesizer)
+        }
+
+        logger.info(
+            "Auto-chunking long input into \(chunks.count) segments to fit "
+                + "the 250-token Flow cap (estimated speech tokens: "
+                + "\(CosyVoice3TextChunker.estimateSpeechTokens(normalized))).")
+        var results: [CosyVoice3SynthesisResult] = []
+        results.reserveCapacity(chunks.count)
+        for (i, chunk) in chunks.enumerated() {
+            logger.info(
+                "  chunk \(i + 1)/\(chunks.count): "
+                    + "\(chunk.count) chars, ~"
+                    + "\(CosyVoice3TextChunker.estimateSpeechTokens(chunk)) speech tokens")
+            let r = try await synthesizeChunk(
+                text: chunk, promptAssets: promptAssets,
+                options: options, frontend: frontend, synthesizer: synthesizer)
+            results.append(r)
+        }
+        return Self.mergeChunkedResults(results)
+    }
+
+    // MARK: - Chunked synthesis helpers
+
+    /// Single-call synthesis path: normalized text → frontend assembly →
+    /// fixture adapter → synthesizer. Shared between the fast (1-chunk) and
+    /// chunked (N-chunk) paths in `synthesize(...)`.
+    private func synthesizeChunk(
+        text: String,
+        promptAssets: CosyVoice3PromptAssets,
+        options: CosyVoice3SynthesisOptions,
+        frontend: CosyVoice3TextFrontend,
+        synthesizer: CosyVoice3Synthesizer
+    ) async throws -> CosyVoice3SynthesisResult {
         let assembled = try frontend.assemble(
             promptText: promptAssets.promptText,
-            ttsText: normalized,
+            ttsText: text,
             promptSpeechIds: promptAssets.promptSpeechIds)
 
         let lmInputEmbedsFlat = try Self.flattenLmEmbeds(
@@ -246,6 +295,72 @@ public actor CosyVoice3TtsManager {
         return try await synthesizer.synthesize(fixture: fixture, options: parityOptions)
     }
 
+    /// Concatenate per-chunk results into a single `CosyVoice3SynthesisResult`.
+    /// Audio is stitched with a short cosine cross-fade (`crossfadeMs`) at
+    /// each boundary to mask DC/phase mismatch from independent synth calls.
+    /// `finishedOnEos` is `true` only when every chunk ended naturally
+    /// (so callers can still detect mid-segment truncation downstream).
+    private static func mergeChunkedResults(
+        _ results: [CosyVoice3SynthesisResult],
+        crossfadeMs: Double = 8
+    ) -> CosyVoice3SynthesisResult {
+        precondition(!results.isEmpty, "mergeChunkedResults requires ≥1 result")
+        let sampleRate = results[0].sampleRate
+        let samples = concatWithCrossfade(
+            results.map { $0.samples },
+            sampleRate: sampleRate,
+            fadeMs: crossfadeMs)
+        let totalGenerated = results.reduce(0) { $0 + $1.generatedTokenCount }
+        var allDecoded: [Int32] = []
+        allDecoded.reserveCapacity(totalGenerated)
+        for r in results { allDecoded.append(contentsOf: r.decodedTokens) }
+        let allEos = results.allSatisfy { $0.finishedOnEos }
+        return CosyVoice3SynthesisResult(
+            samples: samples,
+            sampleRate: sampleRate,
+            generatedTokenCount: totalGenerated,
+            decodedTokens: allDecoded,
+            finishedOnEos: allEos)
+    }
+
+    /// Concatenate PCM chunks with a cosine cross-fade at each boundary.
+    /// The fade length is the smallest of the `fadeMs` window, half of the
+    /// audio accumulated so far, and half of the next chunk, so very short
+    /// chunks degrade gracefully (the overlap never consumes a whole chunk).
+    static func concatWithCrossfade(
+        _ chunks: [[Float]],
+        sampleRate: Int,
+        fadeMs: Double
+    ) -> [Float] {
+        guard !chunks.isEmpty else { return [] }
+        let nominalFade = max(0, Int((Double(sampleRate) * fadeMs / 1000).rounded()))
+        var out: [Float] = chunks[0]
+        for i in 1..<chunks.count {
+            let next = chunks[i]
+            if nominalFade == 0 || out.isEmpty || next.isEmpty {
+                out.append(contentsOf: next)
+                continue
+            }
+            let fade = min(nominalFade, out.count / 2, next.count / 2)
+            if fade <= 0 {
+                out.append(contentsOf: next)
+                continue
+            }
+            // Raised-cosine (constant-gain) crossfade: out tail fades down,
+            // next head fades up, and the two weights sum to 1 across the
+            // overlap. Length of `out` after splice = old_len - fade + next.count.
+            let outStart = out.count - fade
+            for j in 0..<fade {
+                let t = Float(j) / Float(fade)
+                let down = 0.5 * (1 + cos(Float.pi * t))  // 1 → 0
+                let up = 0.5 * (1 - cos(Float.pi * t))  // 0 → 1
+                out[outStart + j] = out[outStart + j] * down + next[j] * up
+            }
+            out.append(contentsOf: next[fade..<next.count])
+        }
+        return out
+    }
+
     // MARK: - Helpers
 
     /// Flatten `[1, tPre, 896]` MLMultiArray fp32 into `[tPre * 896]` Float,
diff --git a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift
new file mode 100644
index 00000000..7d08650d
--- /dev/null
+++ b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift
@@ -0,0 +1,145 @@
+import Foundation
+
+/// Splits long input text into segments that each fit within CosyVoice3's
+/// 250-token Flow input cap.
+///
+/// The Flow CFM model is exported with a fixed `[1, 250]` `token_total`
+/// shape (`CosyVoice3Constants.flowTotalTokens`). After the prompt's speech
+/// tokens consume `~85–95` slots (default voice), each `synthesize(...)`
+/// call has room for roughly 155 new speech tokens of output (≈ 6.2 s of
+/// audio at the 40 ms/token rate `tokenMelRatio × hiftSamplesPerFrame /
+/// sampleRate = 2 × 480 / 24_000`). Longer phrases truncate mid-utterance.
+///
+/// When the input exceeds the target speech-token budget, this chunker
+/// commits a segment at every hard sentence ender (`. ! ? 。 ！ ？ \n`)
+/// and, once a sentence reaches the budget, at the next soft separator
+/// (`, ; ， ； 、 ：` or a space). Synthesis is run per-chunk and audio is
+/// concatenated with a small cosine cross-fade at boundaries (handled by
+/// the caller, not here).
+///
+/// **Token-rate estimate** (calibrated against minimax-zh corpus runs):
+/// - CJK char        ≈ 7.5 speech tokens (worst-case observed real rate;
+///                                          5.5 was empirically too low and
+///                                          let ~16% of phrases hit cap)
+/// - ASCII char      ≈ 1.5 speech tokens (BPE compresses; English is faster)
+/// - Other (Latin-1) ≈ 2.5 speech tokens (middle ground for accented Latin)
+///
+/// Default `maxSpeechTokens = 110` leaves a ~45-token safety margin under
+/// the typical room-for-new of ~155. The 30-token force-split overshoot
+/// can push a committed chunk to ~140 estimated, still under the cap
+/// because the 7.5-tokens/CJK-char heuristic is the worst-case observed
+/// rate and typical generation runs lower. The synthesizer still emits
+/// its `LLM-Decode budget exhausted` warning if a chunk somehow exceeds
+/// the cap, so an under-estimate is logged rather than truncating silently.
+public enum CosyVoice3TextChunker {
+
+    /// Sentence-ending punctuation. Always commit the current chunk after
+    /// these, regardless of running token count.
+    private static let hardEnders: Set<Character> = [
+        "。", "！", "？", ".", "!", "?", "\n",
+    ]
+
+    /// Clause-internal punctuation. Commit only when the running token
+    /// count is at or above the budget — soft splits should be preferred
+    /// over force-splits but not preferred over hard enders.
+    private static let softEnders: Set<Character> = [
+        "，", "、", "；", "：", ";", ",", " ",
+    ]
+
+    /// Default speech-token budget per chunk. Keeps a ~45-token margin
+    /// under the typical room-for-new of ~155 (= `flowTotalTokens=250`
+    /// minus a typical prompt of ~95 tokens). The 30-token force-split
+    /// overshoot may push committed chunks to ~140 estimated, still under
+    /// the structural cap.
+    public static let defaultMaxSpeechTokens: Int = 110
+
+    /// Split `text` into chunks each estimated to produce ≤
+    /// `maxSpeechTokens` LLM speech tokens. Returns `[text]` (single
+    /// chunk) when the input already fits. Returns `[]` when `text` is
+    /// empty or whitespace-only.
+    public static func chunk(
+        _ text: String,
+        maxSpeechTokens: Int = defaultMaxSpeechTokens
+    ) -> [String] {
+        let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
+        guard !trimmed.isEmpty else { return [] }
+        if estimateSpeechTokens(trimmed) <= maxSpeechTokens {
+            return [trimmed]
+        }
+
+        var chunks: [String] = []
+        var current = ""
+        for ch in trimmed {
+            current.append(ch)
+            let tokensSoFar = estimateSpeechTokens(current)
+
+            if hardEnders.contains(ch) {
+                let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
+                if !pruned.isEmpty { chunks.append(pruned) }
+                current = ""
+                continue
+            }
+            if tokensSoFar >= maxSpeechTokens && softEnders.contains(ch) {
+                let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
+                if !pruned.isEmpty { chunks.append(pruned) }
+                current = ""
+                continue
+            }
+            // Force-split if no punctuation has appeared within a 30-token
+            // overshoot. Prefer the most recent whitespace; fall back to
+            // hard-cut at the current position. Hard-cut on continuous CJK
+            // (no whitespace) is rare in normalized input but can happen
+            // when the normalizer collapses spaces.
+            if tokensSoFar >= maxSpeechTokens + 30 {
+                if let lastSpace = current.lastIndex(where: { $0 == " " }),
+                    lastSpace != current.startIndex
+                {
+                    let head = String(current[..<lastSpace])
+                        .trimmingCharacters(in: .whitespacesAndNewlines)
+                    let tail = String(current[current.index(after: lastSpace)...])
+                    if !head.isEmpty { chunks.append(head) }
+                    current = tail
+                } else {
+                    let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines)
+                    if !pruned.isEmpty { chunks.append(pruned) }
+                    current = ""
+                }
+            }
+        }
+        let tail = current.trimmingCharacters(in: .whitespacesAndNewlines)
+        if !tail.isEmpty { chunks.append(tail) }
+        return chunks
+    }
+
+    /// Rough estimate of how many SPEECH tokens the LLM-Decode AR loop
+    /// will produce for `s`. Used by `chunk(...)` to size segments under
+    /// the structural Flow cap.
+    public static func estimateSpeechTokens(_ s: String) -> Int {
+        var total = 0.0
+        for scalar in s.unicodeScalars {
+            if isCJK(scalar) {
+                total += 7.5
+            } else if scalar.isASCII {
+                total += 1.5
+            } else {
+                total += 2.5
+            }
+        }
+        return Int(total.rounded())
+    }
+
+    private static func isCJK(_ scalar: Unicode.Scalar) -> Bool {
+        let v = scalar.value
+        // CJK Unified Ideographs (the bulk of zh/yue text)
+        if (0x4E00...0x9FFF).contains(v) { return true }
+        // CJK Unified Ideographs Extension A
+        if (0x3400...0x4DBF).contains(v) { return true }
+        // Hiragana
+        if (0x3040...0x309F).contains(v) { return true }
+        // Katakana
+        if (0x30A0...0x30FF).contains(v) { return true }
+        // Hangul Syllables
+        if (0xAC00...0xD7AF).contains(v) { return true }
+        return false
+    }
+}
diff --git a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift
index 83a00b37..2402d0cf 100644
--- a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift
+++ b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift
@@ -7,11 +7,12 @@ import Foundation
 /// implemented as a method on this type, keeping the state (KV cache, running
 /// decoded list) local to a single synthesis call.
 ///
-/// Decode uses CoreML `MLState` (macOS 15 / iOS 18): 48 per-layer buffers
-/// (`kv_k_0..kv_k_23`, `kv_v_0..kv_v_23`) replace the 18 MB kv_k / kv_v
-/// round-trip per step. Prefill remains non-stateful and its `kv_k` / `kv_v`
-/// outputs seed the decode state once after prefill.
-@available(macOS 15, iOS 18, *)
+/// Decode is **stateless** with an external KV cache. Prefill emits
+/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32; decode accepts those
+/// same tensors as inputs and returns updated `kv_k_out` / `kv_v_out` at
+/// the same shape/dtype. We round-trip the cache once per step (≈18 MB
+/// total) and bind the previous step's outputs as the next step's inputs.
+/// No `MLState` dependency — runs on macOS 14 / iOS 17.
 public actor CosyVoice3Synthesizer {
 
     private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3Synthesizer")
@@ -19,6 +20,18 @@ public actor CosyVoice3Synthesizer {
     private let models: CosyVoice3Models
     private let embeddings: CosyVoice3SpeechEmbeddings
 
+    /// Set to `false` once `LLM-Decode-M768-fp16` rejects pre-allocated
+    /// `outputBackings` (model exported without explicit MultiArray
+    /// shape/dtype constraints on its `kv_k_out` / `kv_v_out` /
+    /// `speech_logits` outputs). Latched off so we don't throw + catch on
+    /// every one of ~163 AR decode steps per phrase. Same pattern as
+    /// `MagpieKvCache.useOutputBackings`.
+    private var useOutputBackings: Bool = true
+
+    /// One-shot flag for "fast path engaged" log message; only emitted on
+    /// the first successful `outputBackings` prediction so we don't spam.
+    private var loggedFastPath: Bool = false
+
     public init(models: CosyVoice3Models, embeddings: CosyVoice3SpeechEmbeddings) {
         self.models = models
         self.embeddings = embeddings
@@ -46,16 +59,52 @@ public actor CosyVoice3Synthesizer {
             sampler.seedTokens(fixture.decodedTokens)
         }
 
-        // 1) Prefill (non-stateful: returns kv_k / kv_v as outputs)
+        // 1) Prefill (returns kv_k / kv_v as fp32 outputs)
         let tPrefill = Date()
         let (prefillLogits, initialKvK, initialKvV) = try await runPrefill(fixture: fixture)
         let prefillSec = Date().timeIntervalSince(tPrefill)
 
-        // Seed decode MLState from prefill kv_k / kv_v.
-        let tSeed = Date()
-        let state = models.decode.makeState()
-        try seedDecodeState(state: state, kvK: initialKvK, kvV: initialKvV)
-        let seedSec = Date().timeIntervalSince(tSeed)
+        // External KV cache with **double-buffered outputBackings**: prefill's
+        // `kv_k` / `kv_v` (shape `[24, 1, 2, 768, 64]` fp32, ~9 MB each) feed
+        // the first decode step. Subsequent steps rotate between two
+        // pre-allocated buffer pairs (A/B) bound as the model's
+        // `kv_k_out` / `kv_v_out` outputs. Same pattern as
+        // `MagpieKvCache.swapBackings()` — eliminates ~36 MB of host
+        // alloc/dealloc per decode step (×163 steps ≈ 5.9 GB churn per
+        // phrase). `speech_logits` is also pre-bound so we avoid a fresh
+        // 27 KB allocation each step. CoreML rejects this when the model
+        // was exported without explicit MultiArray shape/dtype constraints
+        // on its outputs; in that case we latch `useOutputBackings = false`
+        // and fall back to per-step allocation for the rest of the run.
+        let kvShape: [NSNumber] = [
+            NSNumber(value: CosyVoice3Constants.numLayers),
+            1,
+            NSNumber(value: CosyVoice3Constants.kvHeads),
+            NSNumber(value: CosyVoice3Constants.kvMaxLength),
+            NSNumber(value: CosyVoice3Constants.headDim),
+        ]
+        let kvKBackA = try MLMultiArray(shape: kvShape, dataType: .float32)
+        let kvVBackA = try MLMultiArray(shape: kvShape, dataType: .float32)
+        let kvKBackB = try MLMultiArray(shape: kvShape, dataType: .float32)
+        let kvVBackB = try MLMultiArray(shape: kvShape, dataType: .float32)
+        let logitsBacking = try MLMultiArray(
+            shape: [1, 1, NSNumber(value: CosyVoice3Constants.speechVocab)],
+            dataType: .float32)
+
+        // Pointer-rotation triple. `frontKvK/V` are read by the next step;
+        // `backKvK/V` receive the next step's writes; `spareKvK/V` are the
+        // pre-allocated set ready to become `back` after rotation. Initial
+        // `front` is the prefill output. Once decode step 1 finishes and
+        // the buffers rotate, `front` becomes A (just written), `back`
+        // becomes B (next write target), and `spare` becomes the prefill
+        // output, which re-enters the rotation and is overwritten on
+        // step 3 (harmless, since it is never read again).
+        var frontKvK: MLMultiArray = initialKvK
+        var frontKvV: MLMultiArray = initialKvV
+        var backKvK: MLMultiArray = kvKBackA
+        var backKvV: MLMultiArray = kvVBackA
+        var spareKvK: MLMultiArray = kvKBackB
+        var spareKvV: MLMultiArray = kvVBackB
 
         // Reusable per-step inputs for decode. `curLenArr` is mutated in place
         // each step; `inputsEmbedsArr` is overwritten by memcpy per step.
@@ -64,6 +113,12 @@ public actor CosyVoice3Synthesizer {
             shape: [1, 1, NSNumber(value: CosyVoice3Constants.embedDim)],
             dataType: .float32)
 
+        // Logits scratch reused across all decode steps. The hot loop
+        // memcpy's into this from `logitsBacking` (or strided-gathers from a
+        // freshly-allocated array on the slow path).
+        var logitsScratch = [Float](
+            repeating: 0, count: CosyVoice3Constants.speechVocab)
+
         // First token from prefill tail logits.
         var decoded: [Int32] = []
         let firstLogits = sliceLastStepLogits(
@@ -81,31 +136,82 @@ public actor CosyVoice3Synthesizer {
         }
         decoded.append(topId)
 
-        // 2) Decode loop
+        // 2) Decode loop (stateless, external cache, double-buffered backings)
         var curLen = fixture.tPre
         var decodeSteps = 0
+        var hitEos = false
         let tDecode = Date()
         for step in 1..<maxNew {
             try embeddings.copyEmbedding(tokenId: topId, into: inputsEmbedsArr)
             curLenArr[0] = NSNumber(value: Int32(curLen))
-            let logits = try runDecodeStateful(
+            try runDecode(
                 inputsEmbeds: inputsEmbedsArr,
                 curLen: curLenArr,
-                state: state)
-            topId = sampler.sample(logits: logits, decodedSoFar: decoded)
+                frontKvK: frontKvK,
+                frontKvV: frontKvV,
+                backKvK: backKvK,
+                backKvV: backKvV,
+                logitsBacking: logitsBacking,
+                logits: &logitsScratch)
+            topId = sampler.sample(logits: logitsScratch, decodedSoFar: decoded)
             curLen += 1
             decodeSteps += 1
             if CosyVoice3Constants.stopRange.contains(topId) {
                 logger.info("EOS at step \(step) (token=\(topId))")
+                hitEos = true
                 break
             }
             decoded.append(topId)
+
+            // Rotate buffers: `back` (just written) becomes the new `front`;
+            // `spare` becomes the new `back` (overwritten next step); the old
+            // `front` becomes the new `spare`. On step 1 the old `front` is the
+            // prefill output: it drops to `spare` and is overwritten on step 3,
+            // which is harmless because it is never read again.
+            let prevFrontK = frontKvK
+            let prevFrontV = frontKvV
+            frontKvK = backKvK
+            frontKvV = backKvV
+            backKvK = spareKvK
+            backKvV = spareKvV
+            spareKvK = prevFrontK
+            spareKvV = prevFrontV
         }
         let decodeSec = Date().timeIntervalSince(tDecode)
         guard !decoded.isEmpty else {
             throw CosyVoice3Error.predictionFailed("LLM produced no speech tokens")
         }
 
+        // Truncation signal: AR loop exhausted its decode budget without
+        // observing an EOS token in `stopRange` (6_561…6_760). The 250-token
+        // cap is structural — it's the fixed `[1, 250]` shape of the Flow
+        // model's `token_total` input (`CosyVoice3Constants.flowTotalTokens`),
+        // not a synthesizer-side soft limit. With ~40 ms of audio per token
+        // (`tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24_000`),
+        // a prompt taking ~`nPrompt` tokens leaves `(250 - nPrompt) × 0.04 s`
+        // of generated audio — i.e. long phrases truncate mid-utterance.
+        //
+        // Surface this as a `.warning` so callers running long input get a
+        // console signal instead of silent truncation. Lifting the cap
+        // requires re-exporting Flow with a larger `token_total` shape; for
+        // now, splitting input at clause boundaries (， / 。) is the
+        // workaround.
+        if !hitEos {
+            let producedSec =
+                Double(decoded.count)
+                * Double(CosyVoice3Constants.tokenMelRatio)
+                * Double(CosyVoice3Constants.hiftSamplesPerFrame)
+                / Double(CosyVoice3Constants.sampleRate)
+            logger.warning(
+                "LLM-Decode budget exhausted: \(decoded.count) generated tokens "
+                    + "/ \(maxNew) cap (no EOS observed). "
+                    + "Output truncated at ~"
+                    + String(format: "%.1f", producedSec)
+                    + "s of audio. The 250-token Flow input is a structural cap; "
+                    + "split long phrases at clause boundaries (， 。) to work around."
+            )
+        }
+
         // 3) Flow
         let nNew = decoded.count
         let tFlow = Date()
@@ -133,14 +239,15 @@ public actor CosyVoice3Synthesizer {
         logger.info(
             String(
                 format:
-                    "STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
-                prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
+                    "STAGES prefill=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
+                prefillSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
 
         return CosyVoice3SynthesisResult(
             samples: audio,
             sampleRate: CosyVoice3Constants.sampleRate,
             generatedTokenCount: nNew,
-            decodedTokens: decoded)
+            decodedTokens: decoded,
+            finishedOnEos: hitEos)
     }
 
     // MARK: - Stages
@@ -193,140 +300,134 @@ public actor CosyVoice3Synthesizer {
         return (logits, kvK, kvV)
     }
 
-    /// Run one stateful decode step. `state` is mutated in place via the
-    /// 48 per-layer `kv_k_i` / `kv_v_i` state buffers registered in the
-    /// converted model.
-    private func runDecodeStateful(
+    /// Run one stateless decode step with an external KV cache.
+    ///
+    /// Inputs match the converted CoreML graph signature:
+    /// - `inputs_embeds: fp32 [1, 1, 896]`
+    /// - `cur_len: int32 [1]`
+    /// - `kv_k: fp32 [24, 1, 2, 768, 64]` (previous step's `kv_k_out`, or
+    ///   prefill's `kv_k` for the first decode step)
+    /// - `kv_v: fp32 [24, 1, 2, 768, 64]`
+    ///
+    /// Outputs (when `outputBackings` is accepted, written into the pre-
+    /// allocated `backKvK` / `backKvV` / `logitsBacking` buffers in place):
+    /// - `speech_logits: fp32 [1, 1, 6761]`
+    /// - `kv_k_out: fp32 [24, 1, 2, 768, 64]`
+    /// - `kv_v_out: fp32 [24, 1, 2, 768, 64]`
+    ///
+    /// Falls back to per-step CoreML allocation + memcpy into the pre-
+    /// allocated backings if the model rejects `outputBackings` (latches
+    /// `useOutputBackings = false` so we don't retry on every step).
+    private func runDecode(
         inputsEmbeds: MLMultiArray,
         curLen: MLMultiArray,
-        state: MLState
-    ) throws -> [Float] {
+        frontKvK: MLMultiArray,
+        frontKvV: MLMultiArray,
+        backKvK: MLMultiArray,
+        backKvV: MLMultiArray,
+        logitsBacking: MLMultiArray,
+        logits: inout [Float]
+    ) throws {
         let features: [String: Any] = [
             "inputs_embeds": inputsEmbeds,
             "cur_len": curLen,
+            "kv_k": frontKvK,
+            "kv_v": frontKvV,
         ]
         let provider = try MLDictionaryFeatureProvider(dictionary: features)
-        let output = try models.decode.prediction(from: provider, using: state)
 
-        guard
-            let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue
-        else {
-            throw CosyVoice3Error.predictionFailed("decode: missing speech_logits")
+        var fastPathSucceeded = false
+        if useOutputBackings {
+            let opts = MLPredictionOptions()
+            opts.outputBackings = [
+                "kv_k_out": backKvK,
+                "kv_v_out": backKvV,
+                "speech_logits": logitsBacking,
+            ]
+            do {
+                _ = try models.decode.prediction(from: provider, options: opts)
+                Self.readLogits(from: logitsBacking, into: &logits)
+                if !loggedFastPath {
+                    logger.info(
+                        "LLM-Decode outputBackings accepted; double-buffered "
+                            + "AR loop active")
+                    loggedFastPath = true
+                }
+                fastPathSucceeded = true
+            } catch {
+                // CoreML refused our pre-allocated backings — typically
+                // because `LLM-Decode-M768-fp16.mlpackage` was exported
+                // without explicit MultiArray shape/dtype constraints on
+                // its outputs. Latch the flag off so we don't throw + catch
+                // on every one of ~163 steps for the rest of the corpus.
+                // Warning level so it shows in release builds — this is a
+                // perf regression worth surfacing to anyone running with a
+                // re-exported model.
+                useOutputBackings = false
+                logger.warning(
+                    "LLM-Decode outputBackings rejected "
+                        + "(\(error.localizedDescription)); switching to "
+                        + "fresh-alloc fallback for the rest of the run")
+            }
         }
-        // logits shape = [1, 1, 6761] fp32; strides may be non-compact.
+
+        if !fastPathSucceeded {
+            // Slow path: per-step CoreML allocation, then memcpy outputs
+            // into the pre-allocated backings so the front/back rotation
+            // protocol still works after this call.
+            let output = try models.decode.prediction(from: provider)
+            guard
+                let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue,
+                let kvKOutArr = output.featureValue(for: "kv_k_out")?.multiArrayValue,
+                let kvVOutArr = output.featureValue(for: "kv_v_out")?.multiArrayValue
+            else {
+                throw CosyVoice3Error.predictionFailed(
+                    "decode: missing speech_logits / kv_k_out / kv_v_out")
+            }
+            try Self.copyKvOutput(kvKOutArr, into: backKvK, name: "kv_k_out")
+            try Self.copyKvOutput(kvVOutArr, into: backKvV, name: "kv_v_out")
+            Self.readLogits(from: logitsArr, into: &logits)
+        }
+    }
+
+    /// Read a `[1, 1, 6761]` fp32 logits MLMultiArray into `dst`. Honors the
+    /// last-dim stride (CoreML may emit non-compact strides on aligned
+    /// allocations) — uses `memcpy` when stride==1, strided gather otherwise.
+    private static func readLogits(from arr: MLMultiArray, into dst: inout [Float]) {
         let count = CosyVoice3Constants.speechVocab
-        var logits = [Float](repeating: 0, count: count)
-        let strides = logitsArr.strides.map { $0.intValue }
+        let strides = arr.strides.map { $0.intValue }
         let vocabStride = strides.last ?? 1
-        let base = logitsArr.dataPointer.bindMemory(to: Float.self, capacity: logitsArr.count)
-        for i in 0..<count { logits[i] = base[i * vocabStride] }
-        return logits
+        let base = arr.dataPointer.bindMemory(to: Float.self, capacity: arr.count)
+        if vocabStride == 1 {
+            dst.withUnsafeMutableBytes { rawDst in
+                guard let dstPtr = rawDst.baseAddress else { return }
+                memcpy(dstPtr, base, count * MemoryLayout<Float>.size)
+            }
+        } else {
+            for i in 0..<count { dst[i] = base[i * vocabStride] }
+        }
     }
 
-    /// Seed the 48 decode state buffers (`kv_k_0..kv_k_23`, `kv_v_0..kv_v_23`)
-    /// from prefill's `kv_k` / `kv_v` outputs.
-    ///
-    /// Prefill logical shape per cache is `[L=24, 1, Hkv=2, M=768, D=64]`
-    /// fp16; each per-layer state buffer is `[1, 2, 768, 64]` fp16. Copy
-    /// layer-by-layer using stride-aware indexing (prefill strides may not
-    /// be compact), letting CoreML's state writer convert to the underlying
-    /// fp16 storage.
-    private func seedDecodeState(
-        state: MLState,
-        kvK: MLMultiArray,
-        kvV: MLMultiArray
+    /// Copy a CoreML-allocated `kv_k_out` / `kv_v_out` MLMultiArray into our
+    /// pre-allocated backing array. Used on the `outputBackings`-rejected
+    /// fallback path so the front/back rotation protocol stays consistent.
+    private static func copyKvOutput(
+        _ src: MLMultiArray,
+        into dst: MLMultiArray,
+        name: String
     ) throws {
-        // Prefill declares fp32 KV outputs at its CoreML I/O boundary
-        // (even though the weights / activations internally are fp16).
-        // Decode state buffers are fp16. Convert per-element as we copy.
-        guard kvK.dataType == .float32 && kvV.dataType == .float32 else {
+        guard src.dataType == dst.dataType else {
             throw CosyVoice3Error.predictionFailed(
-                "seedDecodeState: expected fp32 KV from prefill (kv_k=\(kvK.dataType.rawValue) kv_v=\(kvV.dataType.rawValue))"
-            )
+                "decode \(name): dtype mismatch \(src.dataType.rawValue) vs \(dst.dataType.rawValue)")
         }
-
-        let L = CosyVoice3Constants.numLayers
-        let H = CosyVoice3Constants.kvHeads
-        let M = CosyVoice3Constants.kvMaxLength
-        let D = CosyVoice3Constants.headDim
-
-        // Prefill output strides for shape [L, 1, H, M, D].
-        let kStrides = kvK.strides.map { $0.intValue }
-        let vStrides = kvV.strides.map { $0.intValue }
-        let kLayerStride = kStrides[0]
-        let kHStride = kStrides[2]
-        let kMStride = kStrides[3]
-        let kDStride = kStrides[4]
-        let vLayerStride = vStrides[0]
-        let vHStride = vStrides[2]
-        let vMStride = vStrides[3]
-        let vDStride = vStrides[4]
-
-        let kSrcPtr = kvK.dataPointer.bindMemory(to: Float.self, capacity: kvK.count)
-        let vSrcPtr = kvV.dataPointer.bindMemory(to: Float.self, capacity: kvV.count)
-
-        // Collect dtype-mismatch errors from inside the non-throwing closures.
-        var stateDtypeError: String?
-
-        for i in 0..<L {
-            state.withMultiArray(for: "kv_k_\(i)") { buf in
-                guard buf.dataType == .float16 else {
-                    if stateDtypeError == nil {
-                        stateDtypeError = "kv_k_\(i) expected fp16 state, got \(buf.dataType.rawValue)"
-                    }
-                    return
-                }
-                let b = buf.strides.map { $0.intValue }
-                let dPtr = buf.dataPointer.bindMemory(to: Float16.self, capacity: buf.count)
-                Self.copyLayerF32ToF16(
-                    src: kSrcPtr, srcLayerBase: i * kLayerStride,
-                    srcHStride: kHStride, srcMStride: kMStride, srcDStride: kDStride,
-                    dst: dPtr,
-                    dstHStride: b[1], dstMStride: b[2], dstDStride: b[3],
-                    H: H, M: M, D: D)
-            }
-            state.withMultiArray(for: "kv_v_\(i)") { buf in
-                guard buf.dataType == .float16 else {
-                    if stateDtypeError == nil {
-                        stateDtypeError = "kv_v_\(i) expected fp16 state, got \(buf.dataType.rawValue)"
-                    }
-                    return
-                }
-                let b = buf.strides.map { $0.intValue }
-                let dPtr = buf.dataPointer.bindMemory(to: Float16.self, capacity: buf.count)
-                Self.copyLayerF32ToF16(
-                    src: vSrcPtr, srcLayerBase: i * vLayerStride,
-                    srcHStride: vHStride, srcMStride: vMStride, srcDStride: vDStride,
-                    dst: dPtr,
-                    dstHStride: b[1], dstMStride: b[2], dstDStride: b[3],
-                    H: H, M: M, D: D)
-            }
-        }
-
-        if let msg = stateDtypeError {
-            throw CosyVoice3Error.predictionFailed("seedDecodeState: \(msg)")
-        }
-    }
-
-    /// Copy one `[H, M, D]` KV slab from a fp32 prefill output into a fp16
-    /// decode state buffer. Strides may be non-compact on either side.
-    private static func copyLayerF32ToF16(
-        src: UnsafeMutablePointer<Float>,
-        srcLayerBase: Int,
-        srcHStride: Int, srcMStride: Int, srcDStride: Int,
-        dst: UnsafeMutablePointer<Float16>,
-        dstHStride: Int, dstMStride: Int, dstDStride: Int,
-        H: Int, M: Int, D: Int
-    ) {
-        for h in 0..<H {
-            for m in 0..<M {
-                for d in 0..<D {
-                    let sOff = srcLayerBase + h * srcHStride + m * srcMStride + d * srcDStride
-                    let dOff = h * dstHStride + m * dstMStride + d * dstDStride
-                    dst[dOff] = Float16(src[sOff])
-                }
-            }
+        guard src.count == dst.count else {
+            throw CosyVoice3Error.predictionFailed(
+                "decode \(name): count mismatch \(src.count) vs \(dst.count)")
         }
+        // KV outputs are fp32 and contiguous in this graph, so a flat memcpy
+        // is safe. The guards above do not themselves verify fp32 or strides.
+        let bytes = src.count * MemoryLayout<Float>.size
+        memcpy(dst.dataPointer, src.dataPointer, bytes)
     }
 
     private func runFlow(
diff --git a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift
index 14cbf532..54aabe80 100644
--- a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift
+++ b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift
@@ -10,6 +10,24 @@ public struct CosyVoice3SynthesisResult: Sendable {
     public let generatedTokenCount: Int
     /// Decoded speech token ids (useful for debugging + round-trip).
     public let decodedTokens: [Int32]
+    /// `true` when the LLM-Decode AR loop ended on an EOS token in
+    /// `CosyVoice3Constants.stopRange` (natural termination); `false` when
+    /// the loop exhausted its decode budget (`flowTotalTokens - nPrompt`)
+    /// without observing EOS — the audio is truncated mid-utterance.
+    /// See the `.warning`-level log emitted from `CosyVoice3Synthesizer`
+    /// when this is `false`.
+    public let finishedOnEos: Bool
+
+    public init(
+        samples: [Float], sampleRate: Int, generatedTokenCount: Int,
+        decodedTokens: [Int32], finishedOnEos: Bool
+    ) {
+        self.samples = samples
+        self.sampleRate = sampleRate
+        self.generatedTokenCount = generatedTokenCount
+        self.decodedTokens = decodedTokens
+        self.finishedOnEos = finishedOnEos
+    }
 }
 
 /// Options controlling a CosyVoice3 parity / synthesis call.
@@ -42,9 +60,20 @@ public struct CosyVoice3SynthesisOptions: Sendable {
     public let maxNewTokens: Int?
     /// Sampler seed for the top-p/top-k + multinomial fallback path.
     public let seed: UInt64
+    /// When `true`, skips `CosyVoice3TextChunker.chunk(...)` and runs a
+    /// single synthesizer call regardless of input length. Useful for
+    /// callers that pre-segment input themselves (e.g. UI-driven streaming
+    /// per sentence). The structural 250-token Flow cap still applies and
+    /// long inputs will truncate mid-utterance with a `.warning` log.
+    public let disableAutoChunking: Bool
 
-    public init(maxNewTokens: Int? = nil, seed: UInt64 = 42) {
+    public init(
+        maxNewTokens: Int? = nil,
+        seed: UInt64 = 42,
+        disableAutoChunking: Bool = false
+    ) {
         self.maxNewTokens = maxNewTokens
         self.seed = seed
+        self.disableAutoChunking = disableAutoChunking
     }
 }
diff --git a/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift b/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift
index 1acbea77..6171e44b 100644
--- a/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift
+++ b/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift
@@ -42,6 +42,39 @@ public struct KokoroAneComputeUnits: Sendable, Equatable {
         prosody: .cpuAndGPU, noise: .cpuAndGPU, vocoder: .cpuAndGPU, tail: .cpuAndGPU
     )
 
+    /// Force every stage onto `.cpuAndNeuralEngine`. Stages that hit
+    /// ANE-incompatible ops will fall back to CPU silently — included
+    /// for the benchmark sweep (efficiency vs. latency comparison).
+    public static let allAne = KokoroAneComputeUnits(
+        albert: .cpuAndNeuralEngine, postAlbert: .cpuAndNeuralEngine,
+        alignment: .cpuAndNeuralEngine, prosody: .cpuAndNeuralEngine,
+        noise: .cpuAndNeuralEngine, vocoder: .cpuAndNeuralEngine,
+        tail: .cpuAndNeuralEngine
+    )
+
+    /// CPU-only (no ANE, no GPU). Slowest but most predictable; useful
+    /// as a debugging / fallback baseline.
+    public static let cpuOnly = KokoroAneComputeUnits(
+        albert: .cpuOnly, postAlbert: .cpuOnly, alignment: .cpuOnly,
+        prosody: .cpuOnly, noise: .cpuOnly, vocoder: .cpuOnly, tail: .cpuOnly
+    )
+
+    /// Build a configuration from a generic preset (used by the
+    /// `tts-benchmark` CLI so a single flag maps cleanly across
+    /// backends).
+    public init(preset: TtsComputeUnitPreset) {
+        switch preset {
+        case .default:
+            self = .default
+        case .allAne:
+            self = .allAne
+        case .cpuAndGpu:
+            self = .cpuAndGpu
+        case .cpuOnly:
+            self = .cpuOnly
+        }
+    }
+
     func units(for stage: KokoroAneStage) -> MLComputeUnits {
         switch stage {
         case .albert: return albert
diff --git a/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift b/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift
index 68cc42bb..7f791e34 100644
--- a/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift
+++ b/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift
@@ -75,6 +75,11 @@ public actor MagpieTtsManager {
     public func initialize() async throws {
         if synthesizer != nil { return }
 
+        logger.warning(
+            "Magpie TTS is experimental / beta. Synthesis is below real-time "
+                + "(agg-RTFx ~0.41× on M2 for the MiniMax-English corpus) — "
+                + "see Documentation/TTS/Magpie.md.")
+
         let store = MagpieModelStore(
             directory: directory,
             computeUnits: computeUnits,
diff --git a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift
index addd5b60..ee3a816f 100644
--- a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift
+++ b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift
@@ -54,6 +54,15 @@ public final class MagpieKvCache {
     public private(set) var cachesV: [MLMultiArray]
     public private(set) var positions: [MLMultiArray]
 
+    /// Set to `false` once `decoder_step.mlmodelc` rejects `outputBackings`
+    /// (e.g. when the model was exported without explicit MultiArray
+    /// shape/dtype constraints on its KV outputs). The rejection is a static
+    /// property of the model, so once it fails we permanently skip the fast
+    /// path and go straight to the fresh-alloc fallback to avoid throwing +
+    /// catching an exception on every one of the ~500 AR decode steps per
+    /// utterance.
+    public var useOutputBackings: Bool = true
+
     /// Back-buffer set for double-buffered AR loop. Used as `outputBackings` so
     /// CoreML writes new K/V/pos straight into our pre-allocated arrays instead
     /// of allocating ~18.9 MB of fresh fp16 buffers per step. After each
diff --git a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift
index 68321cc6..492a2eeb 100644
--- a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift
+++ b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift
@@ -769,14 +769,73 @@ public actor MagpieSynthesizer {
         // step. The cache provides 24 K/V + 12 position back-buffers, the
         // synthesizer provides the 1 hidden buffer. After the call,
         // `swapBackings` promotes back→front for the next step's inputs.
-        var backings: [String: Any] = [:]
-        cache.addOutputBackings(to: &backings)
-        backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking
-        let predOpts = MLPredictionOptions()
-        predOpts.outputBackings = backings
+        //
+        // If a previous step already proved that this model was exported
+        // without explicit MultiArray shape/dtype constraints on its KV
+        // outputs, `cache.useOutputBackings` is `false` and we skip the
+        // fast path entirely. This avoids the per-step throw/catch overhead
+        // and debug-log spam across the entire AR loop (~500 iterations).
+        var fastPathSucceeded = false
+        if cache.useOutputBackings {
+            var backings: [String: Any] = [:]
+            cache.addOutputBackings(to: &backings)
+            backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking
+            let predOpts = MLPredictionOptions()
+            predOpts.outputBackings = backings
 
-        _ = try model.prediction(from: provider, options: predOpts)
-        cache.swapBackings()
+            do {
+                _ = try model.prediction(from: provider, options: predOpts)
+                cache.swapBackings()
+                fastPathSucceeded = true
+            } catch {
+                // CoreML refused our pre-allocated outputBackings — typically
+                // because `decoder_step.mlmodelc` was exported without
+                // explicit MultiArray shape/dtype constraints on its KV
+                // outputs, so the runtime can't validate the buffer layout
+                // and bails with
+                //   "Output feature (null) doesn't support output backing
+                //    because it doesn't have a MultiArray constraints."
+                // The rejection is a static property of the model, so latch
+                // the cache flag off to skip the fast path on every
+                // subsequent step (avoids ~500 throw/catch + log lines per
+                // utterance).
+                cache.useOutputBackings = false
+                logger.debug(
+                    "decoder_step outputBackings rejected "
+                        + "(\(error.localizedDescription)); switching to "
+                        + "fresh-alloc fallback for the rest of the run")
+            }
+        }
+
+        if !fastPathSucceeded {
+            // Slow path: re-run without `outputBackings`, route the
+            // freshly-allocated K/V/pos through `MagpieKvCache.absorbOutputs`
+            // (which replaces front pointers directly), and copy the hidden
+            // state into `hiddenBacking` so the rest of this function works
+            // unchanged. Costs ~18.9 MB of fresh fp16 allocation per step;
+            // proper fix is to re-export `decoder_step.mlmodelc` with
+            // shape/dtype constraints on `new_k_*`/`new_v_*`/`var_*`.
+            let output = try model.prediction(from: provider)
+            try cache.absorbOutputs(output)
+            guard
+                let hidden = output.featureValue(for: MagpieKvCache.decoderHiddenKey)?
+                    .multiArrayValue
+            else {
+                throw MagpieError.inferenceFailed(
+                    stage: "decoder_step",
+                    underlying:
+                        "missing hidden output key \(MagpieKvCache.decoderHiddenKey)")
+            }
+            guard hidden.dataType == .float16, hidden.count == hiddenBacking.count else {
+                throw MagpieError.inferenceFailed(
+                    stage: "decoder_step",
+                    underlying:
+                        "decoder hidden mismatch (dtype=\(hidden.dataType.rawValue) "
+                        + "count=\(hidden.count) expected=\(hiddenBacking.count))")
+            }
+            let bytes = hiddenBacking.count * MemoryLayout<UInt16>.size
+            memcpy(hiddenBacking.dataPointer, hidden.dataPointer, bytes)
+        }
 
         // Hidden state lives in `hiddenBacking` after the call. Convert fp16
         // → fp32 via vImage into a fresh [Float] result buffer (the sampler
diff --git a/Sources/FluidAudio/TTS/Shared/TtsComputeUnitPreset.swift b/Sources/FluidAudio/TTS/Shared/TtsComputeUnitPreset.swift
new file mode 100644
index 00000000..33744942
--- /dev/null
+++ b/Sources/FluidAudio/TTS/Shared/TtsComputeUnitPreset.swift
@@ -0,0 +1,72 @@
+@preconcurrency import CoreML
+import Foundation
+
+/// Generic compute-unit preset shared across TTS backends.
+///
+/// Each backend keeps its own per-stage `<Backend>ComputeUnits` struct
+/// because stage names differ (Kokoro ANE has 7 stages, PocketTTS has 4
+/// CoreML models, StyleTTS2 has 4 models, etc.). This preset is the
+/// uniform knob the benchmarking harness flips so a single CLI flag
+/// (`--compute-units default|all-ane|cpu-and-gpu|cpu-only`) maps to a
+/// sensible per-stage assignment on every backend.
+///
+/// Backends opt in by adding `init(preset: TtsComputeUnitPreset)` to
+/// their compute-units struct (see `KokoroAneComputeUnits` for the
+/// reference implementation).
+public enum TtsComputeUnitPreset: String, Sendable, CaseIterable {
+
+    /// The backend's empirically-tuned default — typically a mix of
+    /// ANE-friendly and CPU+GPU stages chosen by the conversion author.
+    case `default`
+
+    /// Force every stage to `.cpuAndNeuralEngine`. Worst case for stages
+    /// that fall back to CPU on ANE-incompatible ops, but the most
+    /// energy-efficient when ops are ANE-clean.
+    case allAne
+
+    /// Force every stage to `.cpuAndGPU`. Skips the ANE entirely;
+    /// useful as a latency baseline when the ANE compile cache is cold
+    /// (no `anecompilerservice` time on first call).
+    case cpuAndGpu
+
+    /// Force every stage to `.cpuOnly`. Fallback / debugging baseline;
+    /// every backend should at least run here, however slowly.
+    case cpuOnly
+
+    /// Concrete `MLComputeUnits` for "force every stage to X" presets.
+    /// Returns `nil` for `.default`, which means "let the backend keep
+    /// its empirical mapping".
+    public var uniformUnits: MLComputeUnits? {
+        switch self {
+        case .default: return nil
+        case .allAne: return .cpuAndNeuralEngine
+        case .cpuAndGpu: return .cpuAndGPU
+        case .cpuOnly: return .cpuOnly
+        }
+    }
+
+    /// Parse a CLI flag value, case-insensitively: `default`, `all-ane`,
+    /// `cpu-and-gpu`, `cpu-only`, or an alias such as `ane`, `gpu`, `cpu`.
+    /// Returns `nil` for unrecognised values so callers can report a usage error.
+    public init?(cliValue: String) {
+        switch cliValue.lowercased() {
+        case "default": self = .default
+        case "all-ane", "ane", "neural-engine": self = .allAne
+        case "cpu-and-gpu", "cpuandgpu", "gpu": self = .cpuAndGpu
+        case "cpu-only", "cpu", "cpuonly": self = .cpuOnly
+        default: return nil
+        }
+    }
+
+    /// Canonical kebab-case form, matching the CLI flag values the
+    /// `init?(cliValue:)` parser accepts. Use this for log lines and
+    /// JSON reports so values round-trip back through the parser.
+    public var cliValue: String {
+        switch self {
+        case .default: return "default"
+        case .allAne: return "all-ane"
+        case .cpuAndGpu: return "cpu-and-gpu"
+        case .cpuOnly: return "cpu-only"
+        }
+    }
+}
diff --git a/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift b/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift
index 8e21ec80..8b912c1f 100644
--- a/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift
+++ b/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift
@@ -94,4 +94,25 @@ public struct StyleTTS2Vocab: Sendable {
         }
         return ids
     }
+
+    /// Diagnostic encode: same logic as `encode(_:)` but also returns a
+    /// frequency map of every scalar that was dropped because no
+    /// vocab entry exists for it. Used by the StyleTTS2 CLI's
+    /// `--tokenize-only` mode to quantify the misaki ↔ espeak inventory
+    /// gap without actually invoking the diffusion pipeline.
+    public func encodeWithReport(
+        _ phonemes: String
+    ) -> (ids: [Int32], dropped: [Unicode.Scalar: Int]) {
+        var ids: [Int32] = []
+        ids.reserveCapacity(phonemes.unicodeScalars.count)
+        var dropped: [Unicode.Scalar: Int] = [:]
+        for scalar in phonemes.unicodeScalars {
+            if let id = map[Character(scalar)] {
+                ids.append(id)
+            } else {
+                dropped[scalar, default: 0] += 1
+            }
+        }
+        return (ids, dropped)
+    }
 }
diff --git a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift
index f30424fc..12436aa2 100644
--- a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift
+++ b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift
@@ -4,14 +4,30 @@ import Foundation
 ///
 /// For English (`.americanEnglish`), uses the in-tree `G2PModel` (BART
 /// encoder-decoder, misaki-style IPA) and remaps the misaki conventions to
-/// the espeak-ng convention that StyleTTS2's LibriTTS checkpoint expects:
+/// the espeak-ng convention that StyleTTS2's LibriTTS checkpoint expects.
+///
+/// **Per-piece (single glyph) remap** — applied as misaki emits each piece:
 ///
 ///   misaki → espeak-ng
 ///   A → eɪ   I → aɪ   O → oʊ   W → aʊ   Y → ɔɪ
 ///   ᵊ → ə   (tiny-schwa offglide; not in StyleTTS2's 178-vocab)
 ///
-/// Other glyphs (`ʤ`, `ʧ`, `ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in
-/// the 178-token espeak-ng vocabulary and pass through.
+/// **Post-pass (multi-glyph) remap** — applied to the assembled phoneme
+/// string after every word has been emitted. Both the ligature and the
+/// decomposed forms exist as distinct tokens in the 178-vocab, but the
+/// LibriTTS checkpoint was trained against espeak-ng output, so the model's
+/// embeddings for the misaki ligature glyphs (`ʧ`, `ʤ`) are essentially
+/// untrained noise. Same story for the schwa+r digraphs that espeak collapses
+/// into single rhotic vowels (`ɝ`, `ɚ`):
+///
+///   misaki → espeak-ng         word example
+///   ʧ      → tʃ               choice  → tʃˈɔɪs
+///   ʤ      → dʒ               jump    → dʒˈʌmp
+///   ɜɹ     → ɝ  (U+025D)      girl    → ɡˈɝl
+///   əɹ     → ɚ  (U+025A)      over    → ˈoʊvɚ
+///
+/// Other glyphs (`ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in the
+/// 178-token espeak-ng vocabulary and pass through unchanged.
 ///
 /// Non-English languages fall back to `MultilingualG2PModel` (CharsiuG2P
 /// ByT5). Output quality there is unvalidated — the LibriTTS checkpoint is
@@ -46,6 +62,30 @@ public enum StyleTTS2Phonemizer {
         "ᵊ": "ə",
     ]
 
+    /// Post-pass multi-glyph remap applied to the assembled phoneme string
+    /// after all word pieces have been concatenated. Decomposes misaki's
+    /// affricate ligatures and collapses the schwa+r digraphs into the
+    /// single rhotic vowels espeak-ng emits — see the type-level docs for
+    /// rationale. Order matters only insofar as `əɹ` and `ɜɹ` must be
+    /// applied before any rule that would consume the trailing `ɹ` (none
+    /// exist today; left ordered for future-proofing).
+    private static let misakiToEspeakPostPass: [(String, String)] = [
+        ("ʧ", "tʃ"),
+        ("ʤ", "dʒ"),
+        ("ɜɹ", "ɝ"),
+        ("əɹ", "ɚ"),
+    ]
+
+    /// Apply `misakiToEspeakPostPass` rules to a phoneme string in order.
+    /// Exposed `internal` for unit tests.
+    internal static func applyEspeakPostPass(_ s: String) -> String {
+        var out = s
+        for (from, to) in misakiToEspeakPostPass {
+            out = out.replacingOccurrences(of: from, with: to)
+        }
+        return out
+    }
+
     /// Convert raw text to an IPA phoneme string for StyleTTS2.
     ///
     /// - Parameters:
@@ -87,6 +127,13 @@ public enum StyleTTS2Phonemizer {
             try await flushWord(&wordBuffer, language: language, into: &output)
         }
 
+        // Multi-glyph misaki → espeak normalization. Only meaningful for
+        // English (the LibriTTS checkpoint is English-only); skipping for
+        // other languages avoids touching CharsiuG2P output we don't have
+        // a model contract for.
+        if language == .americanEnglish {
+            output = applyEspeakPostPass(output)
+        }
         return output
     }
 
diff --git a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift
index 1a09163d..923b6e4f 100644
--- a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift
+++ b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift
@@ -239,6 +239,13 @@ public actor StyleTTS2Synthesizer {
     /// Slice an MLMultiArray of shape `(1, leading, trailing)` to the first
     /// `take` entries along either the leading or trailing axis. Returns a
     /// flat row-major `[Float]`.
+    ///
+    /// Reads via `dataPointer` instead of `arr[idx].floatValue` and avoids
+    /// `arr.strides` entirely — both trigger
+    /// `E5RT: tensor_buffer has known strides while the model has
+    /// FlexibleShapeInfo` on `text_predictor`'s flex-shape outputs. CoreML
+    /// emits dense row-major buffers, so for shape `(1, leading, trailing)`
+    /// the flat index is simply `r * trailing + c`.
     private func sliceFirstAxis2D(
         arr: MLMultiArray,
         leading: Int,
@@ -246,29 +253,50 @@ public actor StyleTTS2Synthesizer {
         take: Int,
         sliceDim: SliceDim
     ) -> [Float] {
-        let strides = arr.strides.map { $0.intValue }
+        let outCount: Int
         switch sliceDim {
-        case .leading:
-            // Result shape: (take, trailing).
-            var out = [Float](repeating: 0, count: take * trailing)
-            for r in 0..<take {
-                for c in 0..<trailing {
-                    let idx = r * strides[1] + c * strides[2]
-                    out[r * trailing + c] = arr[idx].floatValue
-                }
-            }
-            return out
-        case .trailing:
-            // Result shape: (leading, take).
-            var out = [Float](repeating: 0, count: leading * take)
-            for r in 0..<leading {
-                for c in 0..<take {
-                    let idx = r * strides[1] + c * strides[2]
-                    out[r * take + c] = arr[idx].floatValue
-                }
-            }
-            return out
+        case .leading: outCount = take * trailing
+        case .trailing: outCount = leading * take
         }
+        var out = [Float](repeating: 0, count: outCount)
+
+        func fill(_ get: (Int) -> Float) {
+            switch sliceDim {
+            case .leading:
+                // Result shape: (take, trailing).
+                for r in 0..<take {
+                    for c in 0..<trailing {
+                        out[r * trailing + c] = get(r * trailing + c)
+                    }
+                }
+            case .trailing:
+                // Result shape: (leading, take).
+                for r in 0..<leading {
+                    for c in 0..<take {
+                        out[r * take + c] = get(r * trailing + c)
+                    }
+                }
+            }
+        }
+
+        let count = arr.count
+        switch arr.dataType {
+        case .float32:
+            let p = arr.dataPointer.bindMemory(to: Float.self, capacity: count)
+            fill { p[$0] }
+        case .float16:
+            let p = arr.dataPointer.bindMemory(to: Float16.self, capacity: count)
+            fill { Float(p[$0]) }
+        case .double:
+            let p = arr.dataPointer.bindMemory(to: Double.self, capacity: count)
+            fill { Float(p[$0]) }
+        default:
+            // Fallback re-introduces the FlexibleShapeInfo trip wire, but
+            // we don't expect text_predictor to emit anything other than
+            // fp16/fp32.
+            fill { arr[$0].floatValue }
+        }
+        return out
     }
 
     // MARK: - Durations
diff --git a/Sources/FluidAudio/TTS/StyleTTS2/StyleTTS2Manager.swift b/Sources/FluidAudio/TTS/StyleTTS2/StyleTTS2Manager.swift
index 56078ccd..7894b991 100644
--- a/Sources/FluidAudio/TTS/StyleTTS2/StyleTTS2Manager.swift
+++ b/Sources/FluidAudio/TTS/StyleTTS2/StyleTTS2Manager.swift
@@ -15,10 +15,6 @@ import Foundation
 ///   - cumsum-of-durations → one-hot → matmul hard-alignment,
 ///   - bucket selection (round token length → text_predictor; round
 ///     mel frames → decoder).
-///
-/// **Status:** scaffold only. Synthesis is not yet implemented; calls to
-/// `synthesize` throw `processingFailed`. The asset bring-up (download +
-/// model store) is wired up so dependent layers can land incrementally.
 public actor StyleTTS2Manager {
 
     private let logger = AppLogger(category: "StyleTTS2Manager")
@@ -47,6 +43,10 @@ public actor StyleTTS2Manager {
     public func initialize(
         progressHandler: DownloadUtils.ProgressHandler? = nil
     ) async throws {
+        logger.warning(
+            "StyleTTS2 is experimental / beta. WER on long English phrases is "
+                + "elevated on the MiniMax corpus (~44% vs Kokoro 1.3%) — see "
+                + "Documentation/TTS/Benchmarks.md.")
         _ = try await modelStore.ensureAssetsAvailable(progressHandler: progressHandler)
         let config = try await modelStore.bundleConfig()
         try config.validate()
@@ -111,6 +111,34 @@ public actor StyleTTS2Manager {
         return try await synthesizer.synthesize(ids: ids, voice: voice, options: options)
     }
 
+    /// Same as `synthesize` but returns raw fp32 PCM samples + sample rate.
+    /// Used by callers (e.g. the tts-benchmark harness, ASR pairing) that
+    /// don't want the WAV-encoding round trip.
+    public func synthesizeSamples(
+        text: String,
+        voiceStyleURL: URL,
+        language: MultilingualG2PLanguage = .americanEnglish,
+        diffusionSteps: Int = StyleTTS2Constants.defaultDiffusionSteps,
+        alpha: Float = 0.3,
+        beta: Float = 0.7,
+        randomSeed: UInt64? = nil
+    ) async throws -> (samples: [Float], sampleRate: Int) {
+        guard isInitialized else {
+            throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized")
+        }
+        let voice = try StyleTTS2VoiceStyle.load(from: voiceStyleURL)
+        let (_, ids) = try await tokenize(text: text, language: language)
+        let options = StyleTTS2Synthesizer.Options(
+            diffusionSteps: diffusionSteps,
+            alpha: alpha,
+            beta: beta,
+            randomSeed: randomSeed
+        )
+        let samples = try await synthesizer.synthesizeSamples(
+            ids: ids, voice: voice, options: options)
+        return (samples, StyleTTS2Constants.audioSampleRate)
+    }
+
     /// Run the text frontend (preprocess → G2P → vocab encode) end-to-end.
     ///
     /// Available before the diffusion synthesizer is wired so callers can
@@ -138,6 +166,27 @@ public actor StyleTTS2Manager {
         return (phonemes, ids)
     }
 
+    /// Diagnostic tokenize: same as `tokenize(text:language:)` but also
+    /// returns the per-scalar drop frequency from
+    /// `StyleTTS2Vocab.encodeWithReport`. Used by the CLI to quantify
+    /// how much of the misaki BART G2P output the espeak-ng-trained
+    /// 178-token vocab can actually consume.
+    public func tokenizeWithReport(
+        text: String,
+        language: MultilingualG2PLanguage = .americanEnglish
+    ) async throws -> (
+        phonemes: String, ids: [Int32], dropped: [Unicode.Scalar: Int]
+    ) {
+        guard isInitialized else {
+            throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized")
+        }
+        let phonemes = try await StyleTTS2Phonemizer.phonemize(
+            text: text, language: language)
+        let vocab = try await modelStore.vocabulary()
+        let (ids, dropped) = vocab.encodeWithReport(phonemes)
+        return (phonemes, ids, dropped)
+    }
+
     public func cleanup() {
         isInitialized = false
     }
diff --git a/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift b/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift
index 020a10f0..b1946d1f 100644
--- a/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift
+++ b/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift
@@ -13,7 +13,6 @@ import Foundation
 ///   --output     .../build/swift_e2e.wav \
 ///   --seed 42
 /// ```
-@available(macOS 15, iOS 18, *)
 enum CosyVoice3ParityCLI {
 
     private static let logger = AppLogger(category: "CosyVoice3ParityCLI")
diff --git a/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift b/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift
index cbf64d92..c4389ad5 100644
--- a/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift
+++ b/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift
@@ -19,7 +19,6 @@ import Foundation
 ///   --output              .../build/swift_cv3_text.wav \
 ///   --seed 42
 /// ```
-@available(macOS 15, iOS 18, *)
 enum CosyVoice3TextCLI {
 
     private static let logger = AppLogger(category: "CosyVoice3TextCLI")
diff --git a/Sources/FluidAudioCLI/Commands/MinimaxCorpusCommand.swift b/Sources/FluidAudioCLI/Commands/MinimaxCorpusCommand.swift
new file mode 100644
index 00000000..1880ae50
--- /dev/null
+++ b/Sources/FluidAudioCLI/Commands/MinimaxCorpusCommand.swift
@@ -0,0 +1,234 @@
+#if os(macOS)
+import FluidAudio
+import Foundation
+
+/// Swift port of `Scripts/fetch_minimax_tts_corpus.py`.
+///
+/// Fetches the MiniMax Multilingual TTS Test Set per-language `.txt` files
+/// from HuggingFace and converts them to the FluidAudio TTS-benchmark
+/// corpus format (strip `<cloning_audio_filename>|` prefix, prepend a
+/// header documenting source + revision + license).
+///
+/// Reuses `DownloadUtils.fetchHuggingFaceFile` so we get the same auth
+/// (HF_TOKEN env), retry, and backoff treatment as every other HF asset
+/// pull in the project, with no swift-transformers dependency added just
+/// for one corpus fetch. URLs are built from repo + revision in `hfURL`.
+///
+/// Source dataset:  https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
+/// License:         CC-BY-SA-4.0
+public enum MinimaxCorpusCommand {
+
+    private static let logger = AppLogger(category: "MinimaxCorpusCommand")
+
+    private static let repo = "MiniMaxAI/TTS-Multilingual-Test-Set"
+
+    /// Pin to the initial public commit so re-runs reproduce the same
+    /// corpus files. Matches `DEFAULT_REVISION` in the Python script.
+    private static let defaultRevision = "cb416f0ac3658da0577e97873065e19fe6488917"
+
+    /// All 24 languages in the upstream `text/` directory. Keep in sync with
+    /// `ALL_LANGUAGES` in `Scripts/fetch_minimax_tts_corpus.py`.
+    private static let allLanguages: [String] = [
+        "arabic", "cantonese", "chinese", "czech", "dutch", "english",
+        "finnish", "french", "german", "greek", "hindi", "indonesian",
+        "italian", "japanese", "korean", "polish", "portuguese", "romanian",
+        "russian", "spanish", "thai", "turkish", "ukrainian", "vietnamese",
+    ]
+
+    public static func run(arguments: [String]) async {
+        var languages = allLanguages
+        var revision = defaultRevision
+        var outDir: URL? = nil
+
+        var i = 0
+        while i < arguments.count {
+            let arg = arguments[i]
+            switch arg {
+            case "--languages", "-l":
+                if i + 1 < arguments.count {
+                    languages = arguments[i + 1]
+                        .split(separator: ",")
+                        .map { $0.trimmingCharacters(in: .whitespaces) }
+                        .filter { !$0.isEmpty }
+                    i += 1
+                }
+            case "--revision":
+                if i + 1 < arguments.count {
+                    revision = arguments[i + 1]
+                    i += 1
+                }
+            case "--out-dir":
+                if i + 1 < arguments.count {
+                    outDir = URL(fileURLWithPath: arguments[i + 1])
+                    i += 1
+                }
+            case "help", "--help", "-h":
+                printUsage()
+                return
+            default:
+                logger.error("Unknown argument: \(arg)")
+                printUsage()
+                exit(1)
+            }
+            i += 1
+        }
+
+        let unknown = Set(languages).subtracting(allLanguages).sorted()
+        if !unknown.isEmpty {
+            logger.error("Unknown language(s): \(unknown.joined(separator: ", "))")
+            logger.error("Available: \(allLanguages.joined(separator: ", "))")
+            exit(2)
+        }
+
+        let resolvedOutDir = outDir ?? defaultOutDir()
+
+        do {
+            try FileManager.default.createDirectory(
+                at: resolvedOutDir, withIntermediateDirectories: true)
+        } catch {
+            logger.error("Failed to create output directory: \(error.localizedDescription)")
+            exit(1)
+        }
+
+        logger.info("Fetching MiniMax TTS Multilingual Test Set @ \(revision)")
+        logger.info("  out_dir: \(resolvedOutDir.path)")
+        logger.info("  langs:   \(languages.count)")
+
+        var total = 0
+        for lang in languages {
+            guard let url = URL(string: hfURL(repo: repo, revision: revision, path: "text/\(lang).txt"))
+            else {
+                logger.error("[\(lang)] failed to construct URL")
+                exit(1)
+            }
+            do {
+                let data = try await DownloadUtils.fetchHuggingFaceFile(
+                    from: url, description: "minimax TTS corpus (\(lang))")
+                guard let raw = String(data: data, encoding: .utf8) else {
+                    logger.error("[\(lang)] response was not valid UTF-8")
+                    exit(1)
+                }
+                let phrases = convert(raw: raw)
+                let outPath = try writeCorpus(
+                    lang: lang, phrases: phrases, outDir: resolvedOutDir,
+                    revision: revision)
+                let countStr = String(format: "%3d", phrases.count)
+                let relPath = relativePath(outPath, from: repoRoot())
+                logger.info("  [\(lang)] \(countStr) phrases -> \(relPath)")
+                total += phrases.count
+            } catch {
+                logger.error("[\(lang)] FAILED: \(error.localizedDescription)")
+                exit(1)
+            }
+        }
+
+        logger.info("OK — \(total) phrases across \(languages.count) language(s).")
+    }
+
+    // MARK: - Helpers
+
+    private static func hfURL(repo: String, revision: String, path: String) -> String {
+        "https://huggingface.co/datasets/\(repo)/resolve/\(revision)/\(path)"
+    }
+
+    /// Strip `<filename>|` prefix and return the list of trimmed phrases.
+    /// Mirrors `convert()` in the Python script.
+    private static func convert(raw: String) -> [String] {
+        var out: [String] = []
+        for rawLine in raw.split(whereSeparator: \.isNewline) {
+            let line = rawLine.trimmingCharacters(in: .whitespacesAndNewlines)
+            if line.isEmpty { continue }
+            // Format: "<cloning_audio_filename>|<text>". Some lines may have
+            // extra `|` inside the text — keep only the first split.
+            let text: String
+            if let sepIdx = line.firstIndex(of: "|") {
+                text = String(line[line.index(after: sepIdx)...])
+                    .trimmingCharacters(in: .whitespacesAndNewlines)
+            } else {
+                text = line
+            }
+            if !text.isEmpty {
+                out.append(text)
+            }
+        }
+        return out
+    }
+
+    private static func writeCorpus(
+        lang: String,
+        phrases: [String],
+        outDir: URL,
+        revision: String
+    ) throws -> URL {
+        let outPath = outDir.appendingPathComponent("\(lang).txt")
+        let header: [String] = [
+            "# MiniMax Multilingual TTS Test Set — \(lang)",
+            "# Source:   https://huggingface.co/datasets/\(repo)",
+            "# Revision: \(revision)",
+            "# License:  CC-BY-SA-4.0 (Creative Commons Attribution-ShareAlike 4.0)",
+            "# Phrases:  \(phrases.count)",
+            "#",
+            "# Cloning-audio filenames have been stripped — we only need the",
+            "# text for the FluidAudio TTS benchmark harness. Voice selection",
+            "# is per-backend (see Documentation/TTS/MinimaxCorpus.md).",
+            "",
+        ]
+        let body = (header + phrases).joined(separator: "\n") + "\n"
+        try body.write(to: outPath, atomically: true, encoding: .utf8)
+        return outPath
+    }
+
+    /// `<repo>/Benchmarks/tts/corpus/minimax/`. Resolved relative to the
+    /// current working directory, which is assumed to be the repo root
+    /// (where `swift run` is normally invoked). The directory need not
+    /// exist yet: `run` creates it with intermediate directories first.
+    private static func defaultOutDir() -> URL {
+        repoRoot()
+            .appendingPathComponent("Benchmarks", isDirectory: true)
+            .appendingPathComponent("tts", isDirectory: true)
+            .appendingPathComponent("corpus", isDirectory: true)
+            .appendingPathComponent("minimax", isDirectory: true)
+    }
+
+    private static func repoRoot() -> URL {
+        URL(fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
+    }
+
+    private static func relativePath(_ url: URL, from base: URL) -> String {
+        let path = url.standardizedFileURL.path
+        let basePath = base.standardizedFileURL.path
+        if path.hasPrefix(basePath + "/") {
+            return String(path.dropFirst(basePath.count + 1))
+        }
+        return path
+    }
+
+    private static func printUsage() {
+        logger.info(
+            """
+            Usage: fluidaudio minimax-corpus [options]
+
+            Fetches the MiniMax Multilingual TTS Test Set text files from
+            HuggingFace and converts them to the FluidAudio TTS-benchmark
+            corpus format. Outputs one file per language.
+
+            Options:
+                --languages, -l <list>   Comma-separated subset of languages
+                                         (default: all 24).
+                --revision <sha>         HuggingFace dataset revision
+                                         (default: \(defaultRevision)).
+                --out-dir <path>         Output directory
+                                         (default: Benchmarks/tts/corpus/minimax).
+                --help, -h               Show this help.
+
+            Available languages:
+                \(allLanguages.joined(separator: ", "))
+
+            Examples:
+                fluidaudio minimax-corpus
+                fluidaudio minimax-corpus --languages english,spanish,hindi
+                fluidaudio minimax-corpus --revision <commit-sha>
+            """)
+    }
+}
+#endif
diff --git a/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift b/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift
index 3fa3f2d2..3b33d39b 100644
--- a/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift
+++ b/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift
@@ -23,6 +23,8 @@ public enum StyleTTS2Command {
         var alpha: Float = 0.3
         var beta: Float = 0.7
         var seed: UInt64?
+        var tokenizeOnly = false
+        var corpusPath: String?
 
         var i = 0
         while i < arguments.count {
@@ -74,6 +76,16 @@ public enum StyleTTS2Command {
                     fputs("--seed requires an integer\n", stderr)
                     exit(2)
                 }
+            case "--tokenize-only":
+                tokenizeOnly = true
+                i += 1
+            case "--corpus":
+                guard i + 1 < arguments.count else {
+                    fputs("--corpus requires a path\n", stderr)
+                    exit(2)
+                }
+                corpusPath = arguments[i + 1]
+                i += 2
             case "--help", "-h":
                 printUsage()
                 return
@@ -88,6 +100,11 @@ public enum StyleTTS2Command {
             }
         }
 
+        if tokenizeOnly {
+            await runTokenizeOnly(text: text, corpusPath: corpusPath)
+            return
+        }
+
         guard let text else {
             fputs("Missing required text argument\n", stderr)
             printUsage()
@@ -136,6 +153,105 @@ public enum StyleTTS2Command {
         }
     }
 
+    /// `--tokenize-only`: phonemize + encode without invoking the diffusion
+    /// pipeline. Reports phoneme string, token id sequence, and any scalars
+    /// that the 178-token espeak-ng vocab silently dropped. With `--corpus`
+    /// runs over every line of a phrase file and aggregates a histogram of
+    /// dropped scalars for the whole corpus.
+    private static func runTokenizeOnly(text: String?, corpusPath: String?) async {
+        do {
+            let manager = StyleTTS2Manager()
+            try await manager.initialize { _ in }
+
+            var totalScalars = 0
+            var totalIds = 0
+            var totalDropped = 0
+            var dropHist: [Unicode.Scalar: Int] = [:]
+            var phraseCount = 0
+
+            func process(_ phrase: String) async throws {
+                let (phonemes, ids, dropped) =
+                    try await manager.tokenizeWithReport(text: phrase)
+                let scalars = phonemes.unicodeScalars.count
+                totalScalars += scalars
+                totalIds += ids.count
+                let phraseDropCount = dropped.values.reduce(0, +)
+                totalDropped += phraseDropCount
+                for (k, v) in dropped { dropHist[k, default: 0] += v }
+                phraseCount += 1
+
+                if corpusPath == nil {
+                    print("INPUT      : \(phrase)")
+                    print("PHONEMES   : \(phonemes)")
+                    print("TOKEN_IDS  (\(ids.count)): \(ids)")
+                    let formatted =
+                        dropped
+                        .sorted { $0.value > $1.value }
+                        .map {
+                            "U+\(String(format: "%04X", $0.key.value))"
+                                + " '\($0.key)' ×\($0.value)"
+                        }
+                        .joined(separator: ", ")
+                    print(
+                        "DROPPED    (\(phraseDropCount) of \(scalars) scalars):"
+                            + " \(formatted)")
+                }
+            }
+
+            if let corpusPath {
+                let url = expand(corpusPath)
+                let raw = try String(contentsOf: url, encoding: .utf8)
+                let phrases = raw.split(separator: "\n", omittingEmptySubsequences: true)
+                    .map { $0.trimmingCharacters(in: .whitespaces) }
+                    .filter { !$0.isEmpty && !$0.hasPrefix("#") }
+                for (idx, phrase) in phrases.enumerated() {
+                    do {
+                        try await process(phrase)
+                        let dropPct =
+                            Double(totalDropped) / Double(max(totalScalars, 1)) * 100
+                        if (idx + 1) % 10 == 0 || idx + 1 == phrases.count {
+                            fputs(
+                                "  [\(idx + 1)/\(phrases.count)] running drop rate "
+                                    + "\(String(format: "%.2f", dropPct))%\n",
+                                stderr)
+                        }
+                    } catch {
+                        fputs("  [\(idx + 1)] phrase failed: \(error)\n", stderr)
+                    }
+                }
+            } else if let text {
+                try await process(text)
+            } else {
+                fputs("--tokenize-only requires either text or --corpus\n", stderr)
+                exit(2)
+            }
+
+            let dropPct = Double(totalDropped) / Double(max(totalScalars, 1)) * 100
+            let kept = totalScalars - totalDropped
+            print("")
+            print("=== StyleTTS2 vocab coverage ===")
+            print("phrases                : \(phraseCount)")
+            print("phoneme scalars total  : \(totalScalars)")
+            print("encoded token ids      : \(totalIds)  (== kept scalars: \(kept))")
+            print(
+                "dropped scalars        : \(totalDropped)  "
+                    + "(\(String(format: "%.2f", dropPct))%)")
+            print("distinct dropped chars : \(dropHist.count)")
+            if !dropHist.isEmpty {
+                print("")
+                print("dropped histogram (most → least frequent):")
+                for (scalar, count) in dropHist.sorted(by: { $0.value > $1.value }) {
+                    let hex = String(format: "%04X", scalar.value)
+                    print(
+                        "  \(String(format: "%6d", count))  U+\(hex)  '\(scalar)'")
+                }
+            }
+        } catch {
+            fputs("StyleTTS2 tokenize-only failed: \(error)\n", stderr)
+            exit(1)
+        }
+    }
+
     private static func expand(_ path: String) -> URL {
         let exp = (path as NSString).expandingTildeInPath
         if exp.hasPrefix("/") {
@@ -152,12 +268,15 @@ public enum StyleTTS2Command {
               fluidaudio styletts2 "<text>" --voice <ref_s.bin> [options]
 
             Options:
-              --voice <path>    Required. Path to precomputed ref_s.bin (256 fp32 LE).
+              --voice <path>    Required for synthesis. Path to precomputed ref_s.bin (256 fp32 LE).
               --output <path>   Output WAV path (default: styletts2.wav).
               --steps <int>     ADPM2 sampler steps (default: 5).
               --alpha <float>   Acoustic style mix weight (default: 0.3).
               --beta <float>    Prosody style mix weight (default: 0.7).
               --seed <uint>     Deterministic noise seed (default: system RNG).
+              --tokenize-only   Run G2P + vocab encode only; report dropped scalars.
+                                No --voice needed. Use with text or --corpus.
+              --corpus <path>   Phrase-per-line corpus file (with --tokenize-only).
 
             Example:
               fluidaudio styletts2 "Hello world" \\
diff --git a/Sources/FluidAudioCLI/Commands/TTSCommand.swift b/Sources/FluidAudioCLI/Commands/TTSCommand.swift
index 132fcc8d..9a79a8ac 100644
--- a/Sources/FluidAudioCLI/Commands/TTSCommand.swift
+++ b/Sources/FluidAudioCLI/Commands/TTSCommand.swift
@@ -414,22 +414,17 @@ public struct TTS {
                 )
                 return
             }
-            if #available(macOS 15, iOS 18, *) {
-                await CosyVoice3TextCLI.run(
-                    text: inputText,
-                    modelsDir: modelsDir,
-                    tokenizerDir: tokDir,
-                    embeddingsFile: embFile,
-                    specialTokensFile: specFile,
-                    promptAssetsPath: promptAssets,
-                    outputPath: output,
-                    seed: cv3Seed,
-                    maxNewTokens: cv3MaxNewTokens,
-                    cpuOnly: cv3CpuOnly)
-            } else {
-                logger.error(
-                    "CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).")
-            }
+            await CosyVoice3TextCLI.run(
+                text: inputText,
+                modelsDir: modelsDir,
+                tokenizerDir: tokDir,
+                embeddingsFile: embFile,
+                specialTokensFile: specFile,
+                promptAssetsPath: promptAssets,
+                outputPath: output,
+                seed: cv3Seed,
+                maxNewTokens: cv3MaxNewTokens,
+                cpuOnly: cv3CpuOnly)
             return
         }
 
@@ -440,19 +435,14 @@ public struct TTS {
                 )
                 return
             }
-            if #available(macOS 15, iOS 18, *) {
-                await CosyVoice3ParityCLI.run(
-                    fixturePath: fixture,
-                    modelsDir: modelsDir,
-                    referencePath: cv3ReferencePath,
-                    outputPath: output,
-                    seed: cv3Seed,
-                    cpuOnly: cv3CpuOnly,
-                    replayTokens: cv3ReplayTokens)
-            } else {
-                logger.error(
-                    "CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).")
-            }
+            await CosyVoice3ParityCLI.run(
+                fixturePath: fixture,
+                modelsDir: modelsDir,
+                referencePath: cv3ReferencePath,
+                outputPath: output,
+                seed: cv3Seed,
+                cpuOnly: cv3CpuOnly,
+                replayTokens: cv3ReplayTokens)
             return
         }
 
diff --git a/Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift b/Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift
new file mode 100644
index 00000000..ae268b9e
--- /dev/null
+++ b/Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift
@@ -0,0 +1,1359 @@
+#if os(macOS)
+import CoreML
+import FluidAudio
+import Foundation
+
+/// `fluidaudio tts-benchmark` — quantitative TTS benchmark harness.
+///
+/// Reports **TTFT / cold-start / warm-start latency, per-stage timings,
+/// peak RSS, WER + CER per category** — i.e. the things conversational
+/// TTS users actually feel — instead of just RTFx.
+///
+/// Backends:
+///   kokoro-ane    — 7-stage ANE pipeline (per-stage timings, per-stage CU)
+///   kokoro        — single-graph CPU+GPU (chunk-level only)
+///   pocket-tts    — streaming flow-matching (no per-stage timings)
+///   magpie        — encoder-decoder + NanoCodec (streamed for TTFT, no per-stage timings, slow)
+///   cosyvoice3    — Mandarin LLM-based (Mandarin corpus only, no WER)
+///   styletts2     — diffusion + HiFi-GAN (one-shot, requires --voice ref_s.bin)
+///
+/// Usage:
+///   fluidaudio tts-benchmark --backend kokoro-ane \
+///       --corpus minimax-english \
+///       --voice af_heart \
+///       --compute-units default \
+///       --output-json bench.json
+///
+/// Corpora land in `Benchmarks/tts/corpus/minimax/<lang>.txt` —
+/// the MiniMax Multilingual TTS Test Set (CC-BY-SA-4.0,
+/// 24 languages × 100 phrases). The `.txt` files are gitignored;
+/// populate them with `swift run fluidaudio minimax-corpus`. See
+/// `Documentation/TTS/MinimaxCorpus.md` for attribution + reproduction
+/// notes and `Documentation/TTS/Benchmarks.md` for the per-backend ↔
+/// language coverage matrix. Reference with `--corpus minimax-<lang>`
+/// (e.g. `minimax-english`, `minimax-chinese`, `minimax-vietnamese`, …).
+public enum TtsBenchmarkCommand {
+
+    private static let logger = AppLogger(category: "TtsBenchmarkCommand")
+
+    // MARK: - Per-phrase sample emitted by every backend driver.
+    private struct BackendPhraseSample {
+        let synthMs: Double
+        let ttftMs: Double  // For one-shot backends, == synthMs.
+        let samples: [Float]
+        let sampleRate: Int
+        let stageMs: [String: Double]  // Empty if backend has no per-stage timings.
+        let extraFields: [String: Any]  // encoder_tokens, finished_on_eos, etc.
+    }
+
+    // MARK: - ASR backend selection
+    //
+    // The harness supports two ASR backends for the TTS→ASR roundtrip:
+    //   .parakeet — Parakeet TDT (English-only, auto-downloaded).
+    //   .cohere   — Cohere Transcribe cache-external (14 languages incl. zh).
+    // CosyVoice3's Mandarin output requires `.cohere` for a meaningful CER;
+    // Parakeet is English-only, so its error rate on zh collapses to ~100%.
+    fileprivate enum AsrChoice {
+        case skip
+        case parakeet
+        case cohere(modelDir: URL, language: CohereAsrConfig.Language, computeUnits: MLComputeUnits)
+
+        var label: String {
+            switch self {
+            case .skip: return "skip"
+            case .parakeet: return "parakeet-tdt"
+            case .cohere(_, let lang, let cu):
+                return "cohere-transcribe-\(lang.rawValue)/\(Self.computeLabel(cu))"
+            }
+        }
+
+        private static func computeLabel(_ cu: MLComputeUnits) -> String {
+            switch cu {
+            case .all: return "all"
+            case .cpuAndNeuralEngine: return "cpu+ane"
+            case .cpuAndGPU: return "cpu+gpu"
+            case .cpuOnly: return "cpu"
+            @unknown default: return "unknown"
+            }
+        }
+
+        var skipped: Bool {
+            if case .skip = self { return true } else { return false }
+        }
+    }
+
+    /// Closure-based ASR adapter so `runPhraseLoop` doesn't have to know
+    /// which backend it's driving. Built once before the per-phrase loop,
+    /// torn down after.
+    fileprivate struct AsrLoop {
+        let label: String
+        let transcribeOne: (URL) async throws -> String
+        let cleanup: () async -> Void
+    }
+
+    public static func run(arguments: [String]) async {
+        var backendName = "kokoro-ane"
+        var corpusName: String?
+        var corpusPath: String?
+        var voice: String?
+        var speakerName: String?
+        var languageName: String?
+        var computeUnitsName = "default"
+        var outputJson: String?
+        var audioDir: String?
+        var skipAsr = false
+        var asrBackendName: String?
+        var cohereModelDirArg: String?
+        var asrLanguageArg: String?
+        var cohereComputeUnitsArg: String?
+
+        var i = 0
+        while i < arguments.count {
+            let arg = arguments[i]
+            switch arg {
+            case "--backend":
+                if i + 1 < arguments.count {
+                    backendName = arguments[i + 1]
+                    i += 1
+                }
+            case "--corpus":
+                if i + 1 < arguments.count {
+                    corpusName = arguments[i + 1]
+                    i += 1
+                }
+            case "--corpus-path":
+                if i + 1 < arguments.count {
+                    corpusPath = arguments[i + 1]
+                    i += 1
+                }
+            case "--voice":
+                if i + 1 < arguments.count {
+                    voice = arguments[i + 1]
+                    i += 1
+                }
+            case "--speaker":
+                if i + 1 < arguments.count {
+                    speakerName = arguments[i + 1]
+                    i += 1
+                }
+            case "--language":
+                if i + 1 < arguments.count {
+                    languageName = arguments[i + 1]
+                    i += 1
+                }
+            case "--compute-units":
+                if i + 1 < arguments.count {
+                    computeUnitsName = arguments[i + 1]
+                    i += 1
+                }
+            case "--output-json":
+                if i + 1 < arguments.count {
+                    outputJson = arguments[i + 1]
+                    i += 1
+                }
+            case "--audio-dir":
+                if i + 1 < arguments.count {
+                    audioDir = arguments[i + 1]
+                    i += 1
+                }
+            case "--skip-asr":
+                skipAsr = true
+            case "--asr-backend":
+                if i + 1 < arguments.count {
+                    asrBackendName = arguments[i + 1]
+                    i += 1
+                }
+            case "--cohere-model-dir":
+                if i + 1 < arguments.count {
+                    cohereModelDirArg = arguments[i + 1]
+                    i += 1
+                }
+            case "--asr-language":
+                if i + 1 < arguments.count {
+                    asrLanguageArg = arguments[i + 1]
+                    i += 1
+                }
+            case "--cohere-compute-units":
+                if i + 1 < arguments.count {
+                    cohereComputeUnitsArg = arguments[i + 1]
+                    i += 1
+                }
+            case "--help", "-h":
+                printUsage()
+                return
+            default:
+                logger.warning("Unknown argument: \(arg)")
+            }
+            i += 1
+        }
+
+        let backend = parseBackend(backendName)
+
+        // Resolve corpus.
+        let phrases: [(category: String, text: String)]
+        let corpusLabel: String
+        do {
+            if let corpusPath {
+                let url = resolveURL(corpusPath, isDirectory: false)
+                let raw = try String(contentsOf: url, encoding: .utf8)
+                phrases = parseCorpus(raw, category: url.deletingPathExtension().lastPathComponent)
+                corpusLabel = url.lastPathComponent
+            } else {
+                let resolved = corpusName ?? backend.defaultCorpus
+                phrases = try loadShippedCorpus(resolved)
+                corpusLabel = resolved
+            }
+        } catch {
+            logger.error("Failed to load corpus: \(error.localizedDescription)")
+            exit(1)
+        }
+        guard !phrases.isEmpty else {
+            logger.error("Corpus is empty after parsing")
+            exit(1)
+        }
+        logger.info("Loaded \(phrases.count) phrase(s) from corpus '\(corpusLabel)'")
+
+        guard let preset = TtsComputeUnitPreset(cliValue: computeUnitsName) else {
+            logger.error(
+                "Unknown --compute-units value: \(computeUnitsName). Expected default | all-ane | cpu-and-gpu | cpu-only."
+            )
+            exit(1)
+        }
+
+        // Resolve ASR backend choice. Precedence:
+        //   --skip-asr or --asr-backend none → .skip
+        //   --asr-backend cohere             → .cohere(modelDir, language)
+        //   --asr-backend parakeet           → .parakeet
+        //   no flag, backend == cosyvoice3   → .skip (Parakeet is English-only;
+        //                                       Mandarin output collapses to ~100% WER)
+        //   no flag, otherwise               → .parakeet
+        let asrChoice: AsrChoice
+        do {
+            asrChoice = try resolveAsrChoice(
+                skipAsrFlag: skipAsr,
+                backendName: asrBackendName,
+                cohereModelDir: cohereModelDirArg,
+                asrLanguage: asrLanguageArg,
+                cohereComputeUnits: cohereComputeUnitsArg,
+                corpusLabel: corpusLabel,
+                ttsBackend: backend)
+        } catch {
+            logger.error("Failed to resolve ASR backend: \(error.localizedDescription)")
+            exit(1)
+        }
+        logger.info("ASR backend: \(asrChoice.label)")
+
+        do {
+            switch backend {
+            case .kokoroAne:
+                try await runKokoroAne(
+                    phrases: phrases, corpusLabel: corpusLabel,
+                    voice: voice ?? KokoroAneConstants.defaultVoice,
+                    preset: preset, outputJson: outputJson, audioDir: audioDir,
+                    asrChoice: asrChoice)
+            case .kokoro:
+                try await runKokoro(
+                    phrases: phrases, corpusLabel: corpusLabel,
+                    voice: voice ?? TtsConstants.recommendedVoice,
+                    preset: preset, outputJson: outputJson, audioDir: audioDir,
+                    asrChoice: asrChoice)
+            case .pocketTts:
+                try await runPocketTts(
+                    phrases: phrases, corpusLabel: corpusLabel,
+                    voice: voice ?? PocketTtsConstants.defaultVoice,
+                    languageName: languageName,
+                    preset: preset, outputJson: outputJson, audioDir: audioDir,
+                    asrChoice: asrChoice)
+            case .magpie:
+                try await runMagpie(
+                    phrases: phrases, corpusLabel: corpusLabel,
+                    speakerName: speakerName, languageName: languageName,
+                    preset: preset, outputJson: outputJson, audioDir: audioDir,
+                    asrChoice: asrChoice)
+            case .cosyVoice3:
+                try await runCosyVoice3(
+                    phrases: phrases, corpusLabel: corpusLabel,
+                    voice: voice,
+                    preset: preset, outputJson: outputJson, audioDir: audioDir,
+                    asrChoice: asrChoice)
+            case .styleTts2:
+                try await runStyleTts2(
+                    phrases: phrases, corpusLabel: corpusLabel,
+                    voicePath: voice,
+                    preset: preset, outputJson: outputJson, audioDir: audioDir,
+                    asrChoice: asrChoice)
+            }
+        } catch {
+            logger.error("tts-benchmark failed: \(error)")
+            exit(1)
+        }
+    }
+
+    // MARK: - Kokoro ANE driver
+
+    private static func runKokoroAne(
+        phrases: [(category: String, text: String)],
+        corpusLabel: String,
+        voice: String,
+        preset: TtsComputeUnitPreset,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice
+    ) async throws {
+        let units = KokoroAneComputeUnits(preset: preset)
+        let manager = KokoroAneManager(defaultVoice: voice, computeUnits: units)
+
+        let coldStart = Date()
+        try await manager.initialize()
+        let coldStartS = Date().timeIntervalSince(coldStart)
+        logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
+
+        let firstStart = Date()
+        _ = try await manager.synthesizeDetailed(
+            text: "Initialization warm-up.", voice: voice, speed: 1.0)
+        let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
+        logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
+
+        try await runPhraseLoop(
+            backendId: "kokoro-ane",
+            voiceLabel: voice,
+            corpusLabel: corpusLabel,
+            phrases: phrases,
+            preset: preset,
+            coldStartS: coldStartS,
+            firstSynthMs: firstSynthMs,
+            outputJson: outputJson,
+            audioDir: audioDir,
+            asrChoice: asrChoice,
+            extraSummary: ["voice": voice]
+        ) { text in
+            let t0 = Date()
+            let result = try await manager.synthesizeDetailed(
+                text: text, voice: voice, speed: 1.0)
+            let synthMs = Date().timeIntervalSince(t0) * 1000
+            return BackendPhraseSample(
+                synthMs: synthMs,
+                ttftMs: synthMs,
+                samples: result.samples,
+                sampleRate: result.sampleRate,
+                stageMs: [
+                    "albert": result.timings.albert,
+                    "post_albert": result.timings.postAlbert,
+                    "alignment": result.timings.alignment,
+                    "prosody": result.timings.prosody,
+                    "noise": result.timings.noise,
+                    "vocoder": result.timings.vocoder,
+                    "tail": result.timings.tail,
+                    "total": result.timings.totalMs,
+                ],
+                extraFields: [
+                    "encoder_tokens": result.encoderTokens,
+                    "acoustic_frames": result.acousticFrames,
+                ]
+            )
+        }
+    }
+
+    // MARK: - Kokoro driver (single-graph)
+
+    private static func runKokoro(
+        phrases: [(category: String, text: String)],
+        corpusLabel: String,
+        voice: String,
+        preset: TtsComputeUnitPreset,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice
+    ) async throws {
+        let units = preset.uniformUnits ?? .all
+        let manager = KokoroTtsManager(defaultVoice: voice, computeUnits: units)
+
+        let coldStart = Date()
+        try await manager.initialize(preloadVoices: [voice])
+        let coldStartS = Date().timeIntervalSince(coldStart)
+        logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
+
+        let firstStart = Date()
+        _ = try await manager.synthesizeDetailed(text: "Initialization warm-up.", voice: voice)
+        let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
+        logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
+
+        try await runPhraseLoop(
+            backendId: "kokoro",
+            voiceLabel: voice,
+            corpusLabel: corpusLabel,
+            phrases: phrases,
+            preset: preset,
+            coldStartS: coldStartS,
+            firstSynthMs: firstSynthMs,
+            outputJson: outputJson,
+            audioDir: audioDir,
+            asrChoice: asrChoice,
+            extraSummary: ["voice": voice]
+        ) { text in
+            let t0 = Date()
+            let result = try await manager.synthesizeDetailed(text: text, voice: voice)
+            let synthMs = Date().timeIntervalSince(t0) * 1000
+            let samples = result.chunks.flatMap { $0.samples }
+            return BackendPhraseSample(
+                synthMs: synthMs,
+                ttftMs: synthMs,
+                samples: samples,
+                sampleRate: 24000,
+                stageMs: [:],
+                extraFields: [
+                    "chunk_count": result.chunks.count,
+                    "wav_bytes": result.audio.count,
+                ]
+            )
+        }
+    }
+
+    // MARK: - PocketTTS driver
+
+    private static func runPocketTts(
+        phrases: [(category: String, text: String)],
+        corpusLabel: String,
+        voice: String,
+        languageName: String?,
+        preset: TtsComputeUnitPreset,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice
+    ) async throws {
+        if preset != .default {
+            logger.warning(
+                "PocketTTS does not expose per-call compute-unit overrides; --compute-units \(preset.cliValue) ignored."
+            )
+        }
+        let language = parsePocketLanguage(languageName)
+        logger.info("PocketTTS language: \(language.rawValue)")
+
+        let manager = PocketTtsManager(defaultVoice: voice, language: language)
+
+        let coldStart = Date()
+        try await manager.initialize()
+        let coldStartS = Date().timeIntervalSince(coldStart)
+        logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
+
+        let firstStart = Date()
+        var firstFrameMs: Double = 0
+        var firstFrameCount = 0
+        let warmupStream = try await manager.synthesizeStreaming(
+            text: "Initialization warm-up.", voice: voice)
+        for try await frame in warmupStream {
+            if firstFrameCount == 0 {
+                firstFrameMs = Date().timeIntervalSince(firstStart) * 1000
+            }
+            firstFrameCount += 1
+            _ = frame.samples
+        }
+        let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
+        logger.info(
+            String(
+                format: "First synth: %.0f ms total, %.0f ms TTFT (frames=%d)",
+                firstSynthMs, firstFrameMs, firstFrameCount))
+
+        try await runPhraseLoop(
+            backendId: "pocket-tts",
+            voiceLabel: voice,
+            corpusLabel: corpusLabel,
+            phrases: phrases,
+            preset: preset,
+            coldStartS: coldStartS,
+            firstSynthMs: firstSynthMs,
+            outputJson: outputJson,
+            audioDir: audioDir,
+            asrChoice: asrChoice,
+            extraSummary: ["voice": voice, "language": language.rawValue]
+        ) { text in
+            // PocketTTS is streaming-first: we measure TTFT (time to first
+            // audio frame) separately from total synth time so the benchmark
+            // numbers reflect what a streaming consumer actually experiences.
+            let t0 = Date()
+            let stream = try await manager.synthesizeStreaming(text: text, voice: voice)
+            var aggregated: [Float] = []
+            var ttftMs: Double = 0
+            var frameCount = 0
+            var lastChunkCount = 0
+            for try await frame in stream {
+                if frameCount == 0 {
+                    ttftMs = Date().timeIntervalSince(t0) * 1000
+                }
+                aggregated.append(contentsOf: frame.samples)
+                frameCount += 1
+                lastChunkCount = frame.chunkCount
+            }
+            let synthMs = Date().timeIntervalSince(t0) * 1000
+            return BackendPhraseSample(
+                synthMs: synthMs,
+                ttftMs: frameCount == 0 ? synthMs : ttftMs,  // Empty stream: fall back to synthMs.
+                samples: aggregated,
+                sampleRate: PocketTtsConstants.audioSampleRate,
+                stageMs: [:],
+                extraFields: [
+                    "frame_count": frameCount,
+                    "chunk_count": lastChunkCount,
+                ]
+            )
+        }
+    }
+
+    // MARK: - Magpie driver
+
+    private static func runMagpie(
+        phrases: [(category: String, text: String)],
+        corpusLabel: String,
+        speakerName: String?,
+        languageName: String?,
+        preset: TtsComputeUnitPreset,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice
+    ) async throws {
+        let units = preset.uniformUnits ?? .cpuAndNeuralEngine
+        let language = parseMagpieLanguage(languageName)
+        let speaker = parseMagpieSpeaker(speakerName)
+        logger.info("Magpie speaker=\(speaker.displayName) language=\(language.rawValue)")
+
+        let manager = MagpieTtsManager(
+            computeUnits: units, preferredLanguages: [language])
+
+        let coldStart = Date()
+        try await manager.initialize()
+        let coldStartS = Date().timeIntervalSince(coldStart)
+        logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
+
+        let firstStart = Date()
+        _ = try await manager.synthesize(
+            text: "Initialization warm-up.", speaker: speaker, language: language)
+        let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
+        logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
+
+        try await runPhraseLoop(
+            backendId: "magpie",
+            voiceLabel: speaker.displayName,
+            corpusLabel: corpusLabel,
+            phrases: phrases,
+            preset: preset,
+            coldStartS: coldStartS,
+            firstSynthMs: firstSynthMs,
+            outputJson: outputJson,
+            audioDir: audioDir,
+            asrChoice: asrChoice,
+            extraSummary: [
+                "speaker": speaker.displayName, "language": language.rawValue,
+            ]
+        ) { text in
+            // Drive Magpie through `synthesizeStream` so TTFT measures
+            // time-to-first-chunk-yield rather than full-utterance wall.
+            // The chunker carves a small first chunk
+            // (`MagpieChunker.streamingFirstChunkCap` = 50 codec frames ≈
+            // 2.3 s of audio) when the first sentence is long enough; for
+            // short phrases the stream degrades to one chunk == whole
+            // utterance and TTFT == synthMs (no streaming benefit, no
+            // measurement penalty).
+            //
+            // Trade-off vs. the prior `synthesize()` path: per-stage
+            // timings (`text_encoder`/`prefill`/`ar_loop`/…) are only
+            // surfaced on `MagpieSynthesisResult`, not per
+            // `MagpieAudioChunk`, so `stageMs` is empty here. That matches
+            // PocketTTS streaming which also publishes empty `stageMs`.
+            let t0 = Date()
+            let stream = try await manager.synthesizeStream(
+                text: text, speaker: speaker, language: language)
+            var aggregated: [Float] = []
+            var ttftMs: Double = 0
+            var chunkCount = 0
+            var codeCount = 0
+            var finishedOnEos = false
+            var sampleRate = MagpieConstants.audioSampleRate
+            for try await chunk in stream {
+                if chunkCount == 0 {
+                    ttftMs = Date().timeIntervalSince(t0) * 1000
+                }
+                aggregated.append(contentsOf: chunk.samples)
+                chunkCount += 1
+                codeCount += chunk.codeCount
+                sampleRate = chunk.sampleRate
+                if chunk.isFinal {
+                    finishedOnEos = chunk.finishedOnEos
+                }
+            }
+            let synthMs = Date().timeIntervalSince(t0) * 1000
+            // Empty-stream guard (synthesizeStream returns immediately on
+            // zero-length input). Fall back to synthMs so downstream
+            // percentile math doesn't see ttftMs == 0.
+            if chunkCount == 0 { ttftMs = synthMs }
+            return BackendPhraseSample(
+                synthMs: synthMs,
+                ttftMs: ttftMs,
+                samples: aggregated,
+                sampleRate: sampleRate,
+                stageMs: [:],
+                extraFields: [
+                    "code_count": codeCount,
+                    "finished_on_eos": finishedOnEos,
+                    "chunk_count": chunkCount,
+                ]
+            )
+        }
+    }
+
+    // MARK: - CosyVoice3 driver
+
+    private static func runCosyVoice3(
+        phrases: [(category: String, text: String)],
+        corpusLabel: String,
+        voice: String?,
+        preset: TtsComputeUnitPreset,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice
+    ) async throws {
+        let units = preset.uniformUnits ?? .cpuAndNeuralEngine
+        let voiceId = voice ?? "cosyvoice3-default-zh"
+
+        let coldStart = Date()
+        let manager = try await CosyVoice3TtsManager.downloadAndCreate(
+            cacheDirectory: nil, includeDefaultVoice: true, computeUnits: units)
+        try await manager.initialize()
+        let promptAssets = try await manager.loadVoice(voiceId)
+        let coldStartS = Date().timeIntervalSince(coldStart)
+        logger.info(String(format: "Cold start (download+init+voice): %.2fs", coldStartS))
+
+        let firstStart = Date()
+        _ = try await manager.synthesize(text: "你好", promptAssets: promptAssets)
+        let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
+        logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
+
+        try await runPhraseLoop(
+            backendId: "cosyvoice3",
+            voiceLabel: voiceId,
+            corpusLabel: corpusLabel,
+            phrases: phrases,
+            preset: preset,
+            coldStartS: coldStartS,
+            firstSynthMs: firstSynthMs,
+            outputJson: outputJson,
+            audioDir: audioDir,
+            asrChoice: asrChoice,
+            extraSummary: ["voice": voiceId]
+        ) { text in
+            let t0 = Date()
+            let result = try await manager.synthesize(text: text, promptAssets: promptAssets)
+            let synthMs = Date().timeIntervalSince(t0) * 1000
+            return BackendPhraseSample(
+                synthMs: synthMs,
+                ttftMs: synthMs,
+                samples: result.samples,
+                sampleRate: result.sampleRate,
+                stageMs: [:],
+                extraFields: [
+                    "generated_token_count": result.generatedTokenCount,
+                    "decoded_token_count": result.decodedTokens.count,
+                    // A false `finished_on_eos` marks a phrase that stopped at the
+                    // structural 250-token Flow-input cap rather than at EOS, so
+                    // corpus reports can tally silently truncated long phrases.
+                    "finished_on_eos": result.finishedOnEos,
+                ]
+            )
+        }
+    }
+
+    // MARK: - StyleTTS2 driver
+
+    private static func runStyleTts2(
+        phrases: [(category: String, text: String)],
+        corpusLabel: String,
+        voicePath: String?,
+        preset: TtsComputeUnitPreset,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice
+    ) async throws {
+        guard let voicePath, !voicePath.isEmpty else {
+            logger.error(
+                "StyleTTS2 requires --voice <path/to/ref_s.bin> "
+                    + "(256 fp32 LE blob from mobius-styletts2/scripts/06_dump_ref_s.py)")
+            exit(1)
+        }
+        let voiceURL = resolveURL(voicePath, isDirectory: false)
+        let voiceLabel = voiceURL.deletingPathExtension().lastPathComponent
+
+        // StyleTTS2 doesn't expose a compute-units knob today; --compute-units
+        // is accepted for parity with other backends but only labels the run.
+        let manager = StyleTTS2Manager()
+
+        let coldStart = Date()
+        try await manager.initialize()
+        let coldStartS = Date().timeIntervalSince(coldStart)
+        logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
+
+        let firstStart = Date()
+        _ = try await manager.synthesizeSamples(
+            text: "Initialization warm-up.", voiceStyleURL: voiceURL, randomSeed: 42)
+        let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
+        logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
+
+        try await runPhraseLoop(
+            backendId: "styletts2",
+            voiceLabel: voiceLabel,
+            corpusLabel: corpusLabel,
+            phrases: phrases,
+            preset: preset,
+            coldStartS: coldStartS,
+            firstSynthMs: firstSynthMs,
+            outputJson: outputJson,
+            audioDir: audioDir,
+            asrChoice: asrChoice,
+            extraSummary: ["voice": voiceLabel]
+        ) { text in
+            let t0 = Date()
+            let result = try await manager.synthesizeSamples(
+                text: text, voiceStyleURL: voiceURL, randomSeed: 42)
+            let synthMs = Date().timeIntervalSince(t0) * 1000
+            return BackendPhraseSample(
+                synthMs: synthMs,
+                ttftMs: synthMs,
+                samples: result.samples,
+                sampleRate: result.sampleRate,
+                stageMs: [:],
+                extraFields: [:]
+            )
+        }
+    }
+
+    // MARK: - Shared per-phrase loop + summary
+
+    private static func runPhraseLoop(
+        backendId: String,
+        voiceLabel: String,
+        corpusLabel: String,
+        phrases: [(category: String, text: String)],
+        preset: TtsComputeUnitPreset,
+        coldStartS: Double,
+        firstSynthMs: Double,
+        outputJson: String?,
+        audioDir: String?,
+        asrChoice: AsrChoice,
+        extraSummary: [String: Any],
+        synthOne: (String) async throws -> BackendPhraseSample
+    ) async throws {
+        // Optional output dir for WAVs.
+        var audioDirURL: URL? = nil
+        if let audioDir {
+            let url = resolveURL(audioDir, isDirectory: true)
+            try FileManager.default.createDirectory(
+                at: url, withIntermediateDirectories: true)
+            audioDirURL = url
+        }
+
+        // Build optional ASR backend (Parakeet, Cohere, or none).
+        let asrLoop = try await buildAsrLoop(asrChoice)
+
+        var perPhrase: [[String: Any]] = []
+        var byCategory: [String: [Int]] = [:]
+
+        for (idx, item) in phrases.enumerated() {
+            let label = String(format: "[%02d/%02d]", idx + 1, phrases.count)
+            logger.info("\(label) [\(item.category)] \(item.text)")
+
+            let sample = try await synthOne(item.text)
+            let audioMs =
+                Double(sample.samples.count) / Double(sample.sampleRate) * 1000
+            let rtfx = sample.synthMs > 0 ? audioMs / sample.synthMs : 0
+
+            // Persist WAV (audioDir if set, else temp file for ASR).
+            let wavURL: URL
+            if let audioDirURL {
+                wavURL = audioDirURL.appendingPathComponent(
+                    String(format: "phrase_%03d.wav", idx + 1))
+            } else {
+                wavURL = FileManager.default.temporaryDirectory
+                    .appendingPathComponent("tts-benchmark-\(UUID().uuidString).wav")
+            }
+            let wavData = try AudioWAV.data(
+                from: sample.samples, sampleRate: Double(sample.sampleRate))
+            try wavData.write(to: wavURL)
+
+            var werValue = Double.nan
+            var cerValue = Double.nan
+            var hypothesis = ""
+            var asrMs = 0.0
+            if let asrLoop {
+                let asr0 = Date()
+                hypothesis = try await asrLoop.transcribeOne(wavURL)
+                asrMs = Date().timeIntervalSince(asr0) * 1000
+                let m = WERCalculator.calculateWERAndCER(
+                    hypothesis: hypothesis, reference: item.text)
+                werValue = m.wer
+                cerValue = m.cer
+            }
+
+            if audioDirURL == nil {
+                try? FileManager.default.removeItem(at: wavURL)
+            }
+
+            logger.info(
+                String(
+                    format:
+                        "  ttft=%.0f ms  synth=%.0f ms  audio=%.0f ms  rtfx=%.2fx  wer=%@  cer=%@",
+                    sample.ttftMs, sample.synthMs, audioMs, rtfx,
+                    werValue.isNaN ? "n/a" : String(format: "%.1f%%", werValue * 100),
+                    cerValue.isNaN ? "n/a" : String(format: "%.1f%%", cerValue * 100)))
+
+            byCategory[item.category, default: []].append(perPhrase.count)
+            var phraseDict: [String: Any] = [
+                "index": idx + 1,
+                "category": item.category,
+                "reference": item.text,
+                "hypothesis": hypothesis,
+                "ttft_ms": sample.ttftMs,
+                "synth_ms": sample.synthMs,
+                "audio_ms": audioMs,
+                "rtfx": rtfx,
+                "wer": werValue.isNaN ? NSNull() : werValue as Any,
+                "cer": cerValue.isNaN ? NSNull() : cerValue as Any,
+                "asr_ms": asrMs,
+                "stage_ms": sample.stageMs,
+                "wav_path": audioDirURL == nil ? "" : wavURL.path,
+            ]
+            for (k, v) in sample.extraFields {
+                phraseDict[k] = v
+            }
+            perPhrase.append(phraseDict)
+        }
+
+        if let asrLoop {
+            await asrLoop.cleanup()
+        }
+
+        // Aggregate.
+        let totalSynthMs = perPhrase.reduce(0.0) { $0 + ($1["synth_ms"] as? Double ?? 0) }
+        let totalAudioMs = perPhrase.reduce(0.0) { $0 + ($1["audio_ms"] as? Double ?? 0) }
+        let aggRtfx = totalSynthMs > 0 ? totalAudioMs / totalSynthMs : 0
+
+        let synthMsValues = perPhrase.compactMap { $0["synth_ms"] as? Double }.sorted()
+        let p50 = percentile(synthMsValues, 0.5)
+        let p95 = percentile(synthMsValues, 0.95)
+        let ttftValues = perPhrase.compactMap { $0["ttft_ms"] as? Double }.sorted()
+        let ttftP50 = percentile(ttftValues, 0.5)
+        let ttftP95 = percentile(ttftValues, 0.95)
+
+        var categories: [[String: Any]] = []
+        for (cat, indexes) in byCategory.sorted(by: { $0.key < $1.key }) {
+            let werVals = indexes.compactMap { perPhrase[$0]["wer"] as? Double }
+            let cerVals = indexes.compactMap { perPhrase[$0]["cer"] as? Double }
+            let synthVals = indexes.compactMap { perPhrase[$0]["synth_ms"] as? Double }
+            let audioVals = indexes.compactMap { perPhrase[$0]["audio_ms"] as? Double }
+            let synthSum = synthVals.reduce(0, +)
+            let audioSum = audioVals.reduce(0, +)
+            let macroWer =
+                werVals.isEmpty ? Double.nan : werVals.reduce(0, +) / Double(werVals.count)
+            let macroCer =
+                cerVals.isEmpty ? Double.nan : cerVals.reduce(0, +) / Double(cerVals.count)
+            categories.append([
+                "category": cat,
+                "phrase_count": indexes.count,
+                "macro_wer": macroWer.isNaN ? NSNull() : macroWer as Any,
+                "macro_cer": macroCer.isNaN ? NSNull() : macroCer as Any,
+                "synth_ms_p50": percentile(synthVals.sorted(), 0.5),
+                "synth_ms_p95": percentile(synthVals.sorted(), 0.95),
+                "rtfx": synthSum > 0 ? audioSum / synthSum : 0,
+            ])
+        }
+
+        let peakRssMb =
+            Double(FluidAudioCLI.fetchPeakMemoryUsageBytes() ?? 0) / 1024 / 1024
+
+        // Banner.
+        logger.info("--- Summary ---")
+        logger.info("  backend:        \(backendId)")
+        logger.info("  voice/speaker:  \(voiceLabel)")
+        logger.info("  corpus:         \(corpusLabel) (n=\(phrases.count))")
+        logger.info("  compute units:  \(preset.cliValue)")
+        logger.info(String(format: "  cold start:     %.2fs", coldStartS))
+        logger.info(String(format: "  first synth:    %.0f ms", firstSynthMs))
+        logger.info(String(format: "  TTFT p50/p95:   %.0f / %.0f ms", ttftP50, ttftP95))
+        logger.info(String(format: "  warm synth p50: %.0f ms", p50))
+        logger.info(String(format: "  warm synth p95: %.0f ms", p95))
+        logger.info(String(format: "  agg RTFx:       %.2fx", aggRtfx))
+        logger.info(String(format: "  peak RSS:       %.0f MB", peakRssMb))
+        if !asrChoice.skipped {
+            let werVals = perPhrase.compactMap { $0["wer"] as? Double }
+            let cerVals = perPhrase.compactMap { $0["cer"] as? Double }
+            let macroWer =
+                werVals.isEmpty ? 0 : werVals.reduce(0, +) / Double(werVals.count)
+            let macroCer =
+                cerVals.isEmpty ? 0 : cerVals.reduce(0, +) / Double(cerVals.count)
+            logger.info("  ASR backend:    \(asrChoice.label)")
+            logger.info(String(format: "  macro WER:      %.2f%%", macroWer * 100))
+            logger.info(String(format: "  macro CER:      %.2f%%", macroCer * 100))
+            // Word-level WER is meaningless on whitespace-free scripts (zh, ja).
+            // Surface that explicitly so readers don't trust ~100% WER for zh.
+            if case .cohere(_, let lang, _) = asrChoice,
+                lang == .chinese || lang == .japanese
+            {
+                logger.info(
+                    "  note:           WER is whitespace-tokenized; trust CER for \(lang.rawValue).")
+            }
+        } else {
+            logger.info("  WER/CER:        skipped")
+        }
+
+        if let outputJson {
+            var summary: [String: Any] = [
+                "backend": backendId,
+                "corpus": corpusLabel,
+                "phrase_count": phrases.count,
+                "compute_units": preset.cliValue,
+                "cold_start_s": coldStartS,
+                "first_synth_ms": firstSynthMs,
+                "ttft_ms_p50": ttftP50,
+                "ttft_ms_p95": ttftP95,
+                "warm_synth_ms_p50": p50,
+                "warm_synth_ms_p95": p95,
+                "agg_rtfx": aggRtfx,
+                "peak_rss_mb": peakRssMb,
+                "asr_skipped": asrChoice.skipped,
+                "asr_backend": asrChoice.label,
+            ]
+            for (k, v) in extraSummary {
+                summary[k] = v
+            }
+            let report: [String: Any] = [
+                "summary": summary,
+                "categories": categories,
+                "phrases": perPhrase,
+            ]
+            let url = resolveURL(outputJson, isDirectory: false)
+            try FileManager.default.createDirectory(
+                at: url.deletingLastPathComponent(),
+                withIntermediateDirectories: true)
+            let data = try JSONSerialization.data(
+                withJSONObject: report, options: [.prettyPrinted, .sortedKeys])
+            try data.write(to: url)
+            logger.info("Report written: \(url.path)")
+        }
+    }
+
+    // MARK: - Corpus loading
+
+    private static func loadShippedCorpus(
+        _ name: String
+    ) throws -> [(category: String, text: String)] {
+        let cwd = URL(
+            fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
+        let relativePath = corpusRelativePath(for: name)
+        let url = cwd.appendingPathComponent(relativePath, isDirectory: false)
+        let raw = try String(contentsOf: url, encoding: .utf8)
+        return parseCorpus(raw, category: name)
+    }
+
+    /// Map a `--corpus` name to its on-disk relative path.
+    ///
+    /// All shipped corpora are MiniMax Multilingual TTS Test Set
+    /// languages — `minimax-<lang>` resolves to
+    /// `Benchmarks/tts/corpus/minimax/<lang>.txt`. The CC-BY-SA-4.0
+    /// attribution lives next to the data in `minimax/README.md`.
+    /// Pass `--corpus-path` for ad-hoc files outside the shipped set.
+    private static func corpusRelativePath(for name: String) -> String {
+        let prefix = "minimax-"
+        if name.hasPrefix(prefix) {
+            let lang = String(name.dropFirst(prefix.count))
+            return "Benchmarks/tts/corpus/minimax/\(lang).txt"
+        }
+        // Back-compat shim — anything else is assumed to live next to
+        // the minimax subdirectory. Prefer `--corpus-path` for non-shipped
+        // corpora.
+        return "Benchmarks/tts/corpus/\(name).txt"
+    }
+
+    private static func parseCorpus(
+        _ raw: String, category: String
+    ) -> [(category: String, text: String)] {
+        return
+            raw
+            .split(whereSeparator: \.isNewline)
+            .map { $0.trimmingCharacters(in: .whitespaces) }
+            .filter { !$0.isEmpty && !$0.hasPrefix("#") }
+            .map { (category: category, text: $0) }
+    }
+
+    // MARK: - Backend dispatch
+
+    private enum Backend: String {
+        case kokoroAne
+        case kokoro
+        case pocketTts
+        case magpie
+        case cosyVoice3
+        case styleTts2
+
+        var defaultCorpus: String {
+            switch self {
+            case .cosyVoice3: return "minimax-chinese"
+            default: return "minimax-english"
+            }
+        }
+    }
+
+    private static func parseBackend(_ name: String) -> Backend {
+        switch name.lowercased() {
+        case "kokoro-ane", "kokoroane", "kokoro_ane", "lai":
+            return .kokoroAne
+        case "kokoro":
+            return .kokoro
+        case "pocket-tts", "pockettts", "pocket":
+            return .pocketTts
+        case "magpie":
+            return .magpie
+        case "cosyvoice3", "cosyvoice", "cosy":
+            return .cosyVoice3
+        case "styletts2", "style-tts2", "styletts", "style":
+            return .styleTts2
+        default:
+            logger.warning("Unknown backend '\(name)' — defaulting to kokoro-ane")
+            return .kokoroAne
+        }
+    }
+
+    private static func parsePocketLanguage(_ name: String?) -> PocketTtsLanguage {
+        guard let name, let l = PocketTtsLanguage(rawValue: name.lowercased()) else {
+            return .english
+        }
+        return l
+    }
+
+    private static func parseMagpieLanguage(_ name: String?) -> MagpieLanguage {
+        guard let name, let l = MagpieLanguage(rawValue: name.lowercased()) else {
+            return .english
+        }
+        return l
+    }
+
+    private static func parseMagpieSpeaker(_ name: String?) -> MagpieSpeaker {
+        switch name?.lowercased() {
+        case "sofia": return .sofia
+        case "aria": return .aria
+        case "jason": return .jason
+        case "leo": return .leo
+        case "john", nil, "": return .john
+        default: return .john
+        }
+    }
+
+    // MARK: - Helpers
+
+    private static func percentile(_ sorted: [Double], _ p: Double) -> Double {
+        guard !sorted.isEmpty else { return 0 }
+        let idx = Int((Double(sorted.count - 1) * p).rounded())
+        return sorted[max(0, min(sorted.count - 1, idx))]
+    }
+
+    private static func resolveURL(_ path: String, isDirectory: Bool) -> URL {
+        let expanded = (path as NSString).expandingTildeInPath
+        if expanded.hasPrefix("/") {
+            return URL(fileURLWithPath: expanded, isDirectory: isDirectory)
+        }
+        let cwd = URL(
+            fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true)
+        return cwd.appendingPathComponent(expanded, isDirectory: isDirectory)
+    }
+
+    // MARK: - ASR backend resolution & adapter construction
+
+    /// Map CLI flags + TTS backend defaults to a concrete `AsrChoice`.
+    ///
+    /// Precedence: `--skip-asr` and `--asr-backend none` always win. With
+    /// no flag, English-friendly TTS backends default to Parakeet TDT and
+    /// CosyVoice3 defaults to `.skip` (Parakeet is English-only — its WER
+    /// on Mandarin output reads ~100% and is meaningless).
+    private static func resolveAsrChoice(
+        skipAsrFlag: Bool,
+        backendName: String?,
+        cohereModelDir: String?,
+        asrLanguage: String?,
+        cohereComputeUnits: String?,
+        corpusLabel: String,
+        ttsBackend: Backend
+    ) throws -> AsrChoice {
+        let normalized = backendName?.lowercased()
+        if skipAsrFlag || normalized == "none" {
+            return .skip
+        }
+        switch normalized {
+        case "cohere":
+            let dir = try resolveCohereModelDir(cohereModelDir)
+            let language = inferCohereLanguage(
+                explicit: asrLanguage, corpus: corpusLabel)
+            let units = try resolveCohereComputeUnits(cohereComputeUnits)
+            return .cohere(modelDir: dir, language: language, computeUnits: units)
+        case "parakeet":
+            return .parakeet
+        case nil:
+            // Implicit defaults: skip for CosyVoice3 (no English ASR pairing),
+            // Parakeet otherwise.
+            if ttsBackend == .cosyVoice3 {
+                logger.info(
+                    "CosyVoice3: no --asr-backend selected; skipping ASR. "
+                        + "Pass `--asr-backend cohere --cohere-model-dir <dir>` for CER.")
+                return .skip
+            }
+            return .parakeet
+        default:
+            logger.warning(
+                "Unknown --asr-backend value '\(normalized ?? "")', falling back to parakeet.")
+            return .parakeet
+        }
+    }
+
+    /// Resolve a Cohere Transcribe model directory (must contain
+    /// `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
+    /// and `vocab.json`).
+    ///
+    /// Order of resolution:
+    ///   1. Explicit `--cohere-model-dir <path>`.
+    ///   2. The default cache location at
+    ///      `~/Library/Application Support/FluidAudio/Models/cohere-transcribe/q8`,
+    ///      matching `Repo.cohereTranscribeCoreml.folderName`.
+    ///
+    /// Auto-download is intentionally not wired here: the upstream
+    /// `Repo.cohereTranscribeCoreml` registration ships `vocab.json` in
+    /// `requiredModels`, but the file lives at the repo root rather than
+    /// under the `q8/` subPath, so `DownloadUtils.downloadRepo` would fail
+    /// the post-download verify. Fix this when the registry learns about
+    /// repo-root files; until then, callers must pre-populate the cache
+    /// (e.g. via `fluidaudio cohere-transcribe ... --model-dir <dir>`).
+    private static func resolveCohereModelDir(_ override: String?) throws -> URL {
+        if let override {
+            return resolveURL(override, isDirectory: true)
+        }
+        let appSupport = try FileManager.default.url(
+            for: .applicationSupportDirectory,
+            in: .userDomainMask, appropriateFor: nil, create: true)
+        let target =
+            appSupport
+            .appendingPathComponent("FluidAudio/Models/cohere-transcribe/q8")
+        let needed = [
+            ModelNames.CohereTranscribe.encoderCompiledFile,
+            ModelNames.CohereTranscribe.decoderCacheExternalV2CompiledFile,
+            "vocab.json",
+        ]
+        let missing = needed.filter { name in
+            !FileManager.default.fileExists(
+                atPath: target.appendingPathComponent(name).path)
+        }
+        guard missing.isEmpty else {
+            throw NSError(
+                domain: "TtsBenchmark", code: 1,
+                userInfo: [
+                    NSLocalizedDescriptionKey:
+                        "Cohere model dir incomplete at \(target.path). "
+                        + "Missing: \(missing.joined(separator: ", ")). "
+                        + "Pass --cohere-model-dir <dir> with the required files, or "
+                        + "pre-populate the cache via `fluidaudio cohere-transcribe`."
+                ])
+        }
+        return target
+    }
+
+    /// Pick a `CohereAsrConfig.Language` from an explicit flag value or by
+    /// scanning the corpus label (covers the shipped `minimax-<lang>` set).
+    private static func inferCohereLanguage(
+        explicit: String?, corpus: String
+    ) -> CohereAsrConfig.Language {
+        if let explicit,
+            let lang = CohereAsrConfig.Language(rawValue: explicit.lowercased())
+        {
+            return lang
+        }
+        let lower = corpus.lowercased()
+        if lower.contains("chinese") || lower.contains("mandarin") || lower.hasSuffix("-zh") {
+            return .chinese
+        }
+        if lower.contains("japanese") || lower.contains("-ja") { return .japanese }
+        if lower.contains("korean") || lower.contains("-ko") { return .korean }
+        if lower.contains("vietnamese") || lower.contains("-vi") { return .vietnamese }
+        if lower.contains("french") || lower.contains("-fr") { return .french }
+        if lower.contains("german") || lower.contains("-de") { return .german }
+        if lower.contains("spanish") || lower.contains("-es") { return .spanish }
+        if lower.contains("italian") || lower.contains("-it") { return .italian }
+        if lower.contains("portuguese") || lower.contains("-pt") { return .portuguese }
+        if lower.contains("dutch") || lower.contains("-nl") { return .dutch }
+        if lower.contains("polish") || lower.contains("-pl") { return .polish }
+        if lower.contains("greek") || lower.contains("-el") { return .greek }
+        if lower.contains("arabic") || lower.contains("-ar") { return .arabic }
+        return .english
+    }
+
+    /// Parse `--cohere-compute-units` into `MLComputeUnits`. Defaults to
+    /// `.all` (CoreML decides). If the q8 encoder fails ANE compilation
+    /// (observed: `MILCompilerForANE error: failed to compile ANE model
+    /// using ANEF`), CoreML falls back to CPU+GPU but pays a multi-minute
+    /// compile cost on the first call; pass `cpu-and-gpu` to skip the ANE
+    /// attempt entirely.
+    private static func resolveCohereComputeUnits(
+        _ flag: String?
+    ) throws
+        -> MLComputeUnits
+    {
+        guard let raw = flag?.lowercased(), !raw.isEmpty else { return .all }
+        switch raw {
+        case "all", "default": return .all
+        case "all-ane", "ane", "neural-engine", "cpu-and-ane":
+            return .cpuAndNeuralEngine
+        case "cpu-and-gpu", "cpuandgpu", "gpu": return .cpuAndGPU
+        case "cpu-only", "cpu", "cpuonly": return .cpuOnly
+        default:
+            throw NSError(
+                domain: "TtsBenchmark", code: 3,
+                userInfo: [
+                    NSLocalizedDescriptionKey:
+                        "Unknown --cohere-compute-units value '\(raw)'. "
+                        + "Expected: all | cpu-and-gpu | cpu-only | all-ane."
+                ])
+        }
+    }
+
+    /// Human-readable label for log lines.
+    private static func describeComputeUnits(_ cu: MLComputeUnits) -> String {
+        switch cu {
+        case .all: return "all (CPU+GPU+ANE)"
+        case .cpuAndNeuralEngine: return "cpu-and-ane"
+        case .cpuAndGPU: return "cpu-and-gpu"
+        case .cpuOnly: return "cpu-only"
+        @unknown default: return "unknown"
+        }
+    }
+
+    /// Build the per-phrase ASR adapter for a resolved choice. Returns
+    /// `nil` for `.skip` so the loop can short-circuit.
+    private static func buildAsrLoop(_ choice: AsrChoice) async throws -> AsrLoop? {
+        switch choice {
+        case .skip:
+            return nil
+        case .parakeet:
+            let asrModels = try await AsrModels.downloadAndLoad()
+            let asr = AsrManager()
+            try await asr.loadModels(asrModels)
+            let layers = await asr.decoderLayerCount
+            return AsrLoop(
+                label: "parakeet-tdt",
+                transcribeOne: { url in
+                    var state = TdtDecoderState.make(decoderLayers: layers)
+                    let r = try await asr.transcribe(url, decoderState: &state)
+                    return r.text
+                },
+                cleanup: { await asr.cleanup() }
+            )
+        case .cohere(let modelDir, let language, let computeUnits):
+            guard #available(macOS 14, iOS 17, *) else {
+                throw NSError(
+                    domain: "TtsBenchmark", code: 2,
+                    userInfo: [
+                        NSLocalizedDescriptionKey:
+                            "Cohere ASR backend requires macOS 14+ / iOS 17+."
+                    ])
+            }
+            logger.info(
+                "Loading Cohere Transcribe (lang=\(language.englishName), "
+                    + "compute=\(describeComputeUnits(computeUnits))) from \(modelDir.path)")
+            let models = try await CoherePipeline.loadModels(
+                encoderDir: modelDir,
+                decoderDir: modelDir,
+                vocabDir: modelDir,
+                decoderVariant: .v2,
+                computeUnits: computeUnits)
+            let pipeline = CoherePipeline()
+            let converter = AudioConverter()
+            return AsrLoop(
+                label: "cohere-transcribe-\(language.rawValue)",
+                transcribeOne: { url in
+                    let samples = try converter.resampleAudioFile(path: url.path)
+                    let r = try await pipeline.transcribe(
+                        audio: samples,
+                        models: models,
+                        language: language,
+                        maxNewTokens: 108,
+                        repetitionPenalty: 1.1,
+                        noRepeatNgram: 3)
+                    return r.text
+                },
+                cleanup: {}
+            )
+        }
+    }
+
+    private static func printUsage() {
+        logger.info(
+            """
+            Usage: fluidaudio tts-benchmark [options]
+
+            Quantitative TTS benchmark — TTFT, cold/warm split, per-stage timings,
+            peak RSS, WER + CER per category, configurable compute-unit preset.
+
+            Backends:
+              kokoro-ane    7-stage ANE pipeline (per-stage timings, per-stage CU)
+              kokoro        Single-graph CPU+GPU
+              pocket-tts    Streaming flow-matching (multilingual)
+              magpie        Encoder-decoder + NanoCodec (per-stage, slow)
+              cosyvoice3    Mandarin LLM-based (ASR skipped unless --asr-backend is set)
+
+            Options:
+              --backend <name>          See list above, plus styletts2 (default: kokoro-ane)
+              --corpus <name>           MiniMax corpus name: minimax-<lang>
+                                        (e.g. minimax-english, minimax-chinese,
+                                        minimax-vietnamese — 24 languages total;
+                                        see Documentation/TTS/MinimaxCorpus.md)
+              --corpus-path <path>      Custom corpus file (overrides --corpus)
+              --voice <name>            Voice id (Kokoro/PocketTTS/CosyVoice3)
+              --speaker <name>          Magpie speaker: john|sofia|aria|jason|leo
+              --language <code>         PocketTTS lang pack or Magpie language code
+              --compute-units <preset>  default | all-ane | cpu-and-gpu | cpu-only
+              --output-json <path>      Write JSON report
+              --audio-dir <path>        Keep generated WAVs under this dir
+              --skip-asr                Skip ASR roundtrip (no WER/CER)
+              --asr-backend <name>      ASR engine for the WER/CER pass:
+                                          parakeet  English-only (default, except cosyvoice3)
+                                          cohere    Multilingual (opt-in; never auto-selected)
+                                          none      Same as --skip-asr
+              --cohere-model-dir <path> Path to a directory containing Cohere
+                                        Transcribe encoder/decoder/vocab.json.
+                                        Used only with --asr-backend cohere.
+                                        Auto-download is not wired (vocab.json
+                                        lives at the repo root, not under /q8),
+                                        so the directory must already exist.
+                                        Default: ~/Library/Application Support/
+                                        FluidAudio/Models/cohere-transcribe/q8
+              --asr-language <code>     Override Cohere language code (default:
+                                        inferred from corpus name). One of:
+                                        en, zh, ja, ko, vi, fr, de, es, it, pt,
+                                        nl, pl, el, ar
+              --cohere-compute-units <p>  Cohere ASR compute mapping:
+                                        all (default; CoreML decides) |
+                                        cpu-and-gpu | cpu-only | all-ane.
+                                        Use cpu-and-gpu when q8 ANE compile
+                                        fails (`MILCompilerForANE error: …`)
+                                        — avoids the multi-minute fallback
+                                        compile on first call.
+              --help, -h                Show this help
+
+            Examples:
+              fluidaudio tts-benchmark --backend kokoro-ane --output-json bench.json
+              fluidaudio tts-benchmark --backend kokoro --corpus minimax-english
+              fluidaudio tts-benchmark --backend pocket-tts --corpus minimax-german --language german
+              fluidaudio tts-benchmark --backend magpie --speaker sofia --language en
+              fluidaudio tts-benchmark --backend cosyvoice3 --corpus minimax-chinese \\
+                  --asr-backend cohere --cohere-model-dir ~/.fluidaudio/cohere/q8
+
+            Notes:
+              For Chinese (zh) and Japanese (ja), WER is meaningless because
+              WERCalculator splits on whitespace; trust the CER column instead.
+              The summary banner prints an explicit reminder for these langs.
+            """
+        )
+    }
+}
+#endif
diff --git a/Sources/FluidAudioCLI/FluidAudioCLI.swift b/Sources/FluidAudioCLI/FluidAudioCLI.swift
index 903683a7..1b601209 100644
--- a/Sources/FluidAudioCLI/FluidAudioCLI.swift
+++ b/Sources/FluidAudioCLI/FluidAudioCLI.swift
@@ -50,6 +50,10 @@ struct FluidAudioCLI {
             await MagpieCommand.run(arguments: Array(arguments.dropFirst(2)))
         case "tts-asr-verify":
             await TTSAsrVerifyCommand.run(arguments: Array(arguments.dropFirst(2)))
+        case "tts-benchmark":
+            await TtsBenchmarkCommand.run(arguments: Array(arguments.dropFirst(2)))
+        case "minimax-corpus":
+            await MinimaxCorpusCommand.run(arguments: Array(arguments.dropFirst(2)))
         case "diarization-benchmark":
             await StreamDiarizationBenchmark.run(arguments: Array(arguments.dropFirst(2)))
         case "process":
@@ -116,6 +120,8 @@ struct FluidAudioCLI {
                 tts                     Synthesize speech from text using Kokoro TTS
                 magpie                  Magpie TTS Multilingual 357M (experimental, ~0.04 RTFx — slow, needs perf work)
                 tts-asr-verify          Batch TTS→ASR roundtrip WER verification
+                tts-benchmark           Quantitative TTS benchmark (latency, quality, compute-unit sweep)
+                minimax-corpus          Fetch MiniMax TTS Multilingual Test Set into Benchmarks/tts/corpus/minimax
                 parakeet-eou            Run Parakeet EOU Streaming ASR on a single file
                 ctc-earnings-benchmark  Run CTC keyword spotting benchmark on Earnings22
                 sortformer              Run Sortformer streaming diarization
diff --git a/Tests/FluidAudioTests/TTS/CosyVoice3ModelNameTests.swift b/Tests/FluidAudioTests/TTS/CosyVoice3ModelNameTests.swift
new file mode 100644
index 00000000..f0faeacc
--- /dev/null
+++ b/Tests/FluidAudioTests/TTS/CosyVoice3ModelNameTests.swift
@@ -0,0 +1,60 @@
+import XCTest
+
+@testable import FluidAudio
+
+/// Guard the stateful → stateless decode rename. The HF repo
+/// `FluidInference/CosyVoice3-0.5B-coreml` ships only `LLM-Decode-M768-fp16`
+/// (non-stateful, external KV cache); resurrecting `-stateful` here would
+/// re-break the download path and regress macOS 14 support.
+final class CosyVoice3ModelNameTests: XCTestCase {
+
+    // MARK: - ModelNames.CosyVoice3
+
+    func testLlmDecodeIsStatelessName() {
+        XCTAssertEqual(ModelNames.CosyVoice3.llmDecode, "LLM-Decode-M768-fp16")
+        XCTAssertFalse(
+            ModelNames.CosyVoice3.llmDecode.contains("stateful"),
+            "llmDecode must not reference the dropped stateful variant")
+    }
+
+    func testLlmDecodeFileMatchesBaseName() {
+        XCTAssertEqual(
+            ModelNames.CosyVoice3.llmDecodeFile,
+            "LLM-Decode-M768-fp16.mlmodelc")
+    }
+
+    func testRequiredModelsContainsStatelessDecode() {
+        XCTAssertTrue(
+            ModelNames.CosyVoice3.requiredModels.contains("LLM-Decode-M768-fp16.mlmodelc"),
+            "requiredModels must list the stateless decode bundle")
+        XCTAssertFalse(
+            ModelNames.CosyVoice3.requiredModels.contains(
+                "LLM-Decode-M768-fp16-stateful.mlmodelc"),
+            "requiredModels must not list the dropped stateful bundle")
+    }
+
+    func testRequiredModelsHasFourEntries() {
+        XCTAssertEqual(
+            ModelNames.CosyVoice3.requiredModels.count, 4,
+            "Pipeline ships exactly 4 CoreML bundles: prefill, decode, flow, hift")
+    }
+
+    // MARK: - CosyVoice3Constants.Files
+
+    func testFilesLlmDecodeIsStatelessPackage() {
+        XCTAssertEqual(
+            CosyVoice3Constants.Files.llmDecode,
+            "LLM-Decode-M768-fp16.mlpackage")
+        XCTAssertFalse(
+            CosyVoice3Constants.Files.llmDecode.contains("stateful"))
+    }
+
+    func testFilesLlmDecodeSubdirIsRenamed() {
+        XCTAssertEqual(
+            CosyVoice3Constants.Files.llmDecodeSubdir,
+            "llm-fp16-decode",
+            "Local-build subdir must be the renamed stateless directory")
+        XCTAssertFalse(
+            CosyVoice3Constants.Files.llmDecodeSubdir.contains("stateful"))
+    }
+}
diff --git a/Tests/FluidAudioTests/TTS/CosyVoice3TextChunkerTests.swift b/Tests/FluidAudioTests/TTS/CosyVoice3TextChunkerTests.swift
new file mode 100644
index 00000000..ea419224
--- /dev/null
+++ b/Tests/FluidAudioTests/TTS/CosyVoice3TextChunkerTests.swift
@@ -0,0 +1,184 @@
+import XCTest
+
+@testable import FluidAudio
+
+final class CosyVoice3TextChunkerTests: XCTestCase {
+
+    // MARK: - estimateSpeechTokens
+
+    func testEstimateSpeechTokensCJK() {
+        // 4 CJK chars × 7.5 = 30 tokens
+        XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("你好世界"), 30)
+    }
+
+    func testEstimateSpeechTokensASCII() {
+        // 5 ASCII chars × 1.5 = 7.5 → rounds to 8
+        XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("hello"), 8)
+    }
+
+    func testEstimateSpeechTokensEmpty() {
+        XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens(""), 0)
+    }
+
+    // MARK: - chunk: short input fast path
+
+    func testChunkEmptyReturnsEmpty() {
+        XCTAssertEqual(CosyVoice3TextChunker.chunk(""), [])
+        XCTAssertEqual(CosyVoice3TextChunker.chunk("   "), [])
+        XCTAssertEqual(CosyVoice3TextChunker.chunk("\n\n"), [])
+    }
+
+    func testChunkShortReturnsSingle() {
+        // 5 chars (4 CJK + 「。」) ≈ 33 tokens, well under default 110
+        XCTAssertEqual(
+            CosyVoice3TextChunker.chunk("你好世界。"),
+            ["你好世界。"])
+    }
+
+    func testChunkShortTrimsWhitespace() {
+        XCTAssertEqual(
+            CosyVoice3TextChunker.chunk("  hello world.  "),
+            ["hello world."])
+    }
+
+    // MARK: - chunk: hard sentence enders
+
+    func testChunkSplitsOnHardEnders() {
+        // 28 CJK chars × 7.5 = 210 tokens (before punctuation) > 110 default → must split
+        let text = "今天天气很好。我们去公园散步。明天可能会下雨。下周打算去看电影。"
+        let chunks = CosyVoice3TextChunker.chunk(text)
+        XCTAssertGreaterThan(chunks.count, 1)
+        // No chunk should exceed the budget by more than the +30 margin plus one CJK char (8)
+        for chunk in chunks {
+            let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
+            XCTAssertLessThanOrEqual(est, 110 + 30 + 8, "chunk over force-split margin: \(chunk)")
+        }
+        // Concatenating chunks back should reconstruct the input modulo
+        // whitespace trimming.
+        XCTAssertEqual(chunks.joined(), text)
+    }
+
+    func testChunkSplitsOnEnglishSentenceEnders() {
+        // Each sentence ≈ 18–41 tokens (1.5 per ASCII char); with maxSpeechTokens=80 every
+        // sentence fits individually so the chunker should commit on the
+        // first hard ender it sees rather than packing greedily across
+        // sentences and hitting force-split.
+        let text = "Hello world. This is a test. Pack my box with five jugs. Quick brown fox jumps."
+        let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 80)
+        XCTAssertGreaterThan(chunks.count, 1)
+        for chunk in chunks {
+            XCTAssertTrue(
+                chunk.hasSuffix(".") || chunk.hasSuffix("!") || chunk.hasSuffix("?"),
+                "chunk does not end at hard boundary: \(chunk)")
+        }
+    }
+
+    // MARK: - chunk: soft enders fall-through
+
+    func testChunkFallsBackToSoftEnders() {
+        // One huge sentence with commas, no periods. Should split on 「，」.
+        let text = "一个非常非常长的句子，里面有很多分句，每个分句都不是很长，但是加在一起就会超过预算限制"
+        let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50)
+        XCTAssertGreaterThan(chunks.count, 1)
+        for chunk in chunks {
+            let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
+            // Force-split allows one CJK char of overshoot past the +30 margin
+            // because the budget check runs AFTER appending the current char.
+            XCTAssertLessThanOrEqual(est, 50 + 30 + 8)
+        }
+    }
+
+    // MARK: - chunk: force-split fallback
+
+    func testChunkForceSplitsOnContinuousCJKWithoutPunctuation() {
+        // 31 CJK chars, no punctuation: 31 × 7.5 = 232.5 tokens, must force-split
+        // somewhere even without natural boundaries.
+        let text = "今天天气很好我们去公园散步明天可能会下雨下周打算看电影然后回家"
+        let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50)
+        XCTAssertGreaterThan(chunks.count, 1)
+        for chunk in chunks {
+            let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk)
+            // Force-split has a 30-token overshoot allowance + one CJK char (7.5)
+            XCTAssertLessThanOrEqual(est, 50 + 30 + 8, "chunk overflow on force-split: \(chunk)")
+        }
+        // No content lost
+        XCTAssertEqual(chunks.joined(), text)
+    }
+
+    func testChunkForceSplitsOnEnglishSpacesWhenNoPunctuation() {
+        // Long English with no terminal punctuation; should split on spaces
+        // when the running estimate exceeds budget.
+        let text = "the quick brown fox jumps over the lazy dog and then runs back home very fast"
+        let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 20)
+        XCTAssertGreaterThan(chunks.count, 1)
+        for chunk in chunks {
+            // No leading/trailing whitespace expected on returned chunks
+            XCTAssertEqual(chunk, chunk.trimmingCharacters(in: .whitespaces))
+        }
+    }
+
+    // MARK: - concatWithCrossfade
+
+    func testConcatEmptyReturnsEmpty() {
+        let out = CosyVoice3TtsManager.concatWithCrossfade(
+            [], sampleRate: 24_000, fadeMs: 8)
+        XCTAssertEqual(out, [])
+    }
+
+    func testConcatSingleChunkPassthrough() {
+        let chunk: [Float] = [0.1, 0.2, 0.3, 0.4]
+        let out = CosyVoice3TtsManager.concatWithCrossfade(
+            [chunk], sampleRate: 24_000, fadeMs: 8)
+        XCTAssertEqual(out, chunk)
+    }
+
+    func testConcatZeroFadeIsSimpleAppend() {
+        let a: [Float] = [0.1, 0.2, 0.3]
+        let b: [Float] = [0.4, 0.5, 0.6]
+        let out = CosyVoice3TtsManager.concatWithCrossfade(
+            [a, b], sampleRate: 24_000, fadeMs: 0)
+        XCTAssertEqual(out, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
+    }
+
+    func testConcatCrossfadeShrinksGracefullyForShortChunks() {
+        // 4-sample chunks; nominal fade at 24 kHz × 8 ms = 192 samples,
+        // gets clamped to min(out.count/2, next.count/2) = 2.
+        let a: [Float] = [1.0, 1.0, 1.0, 1.0]
+        let b: [Float] = [0.0, 0.0, 0.0, 0.0]
+        let out = CosyVoice3TtsManager.concatWithCrossfade(
+            [a, b], sampleRate: 24_000, fadeMs: 8)
+        // Output length: 4 (a) - 2 (fade) + 4 (b) = 6; first 2 of a remain
+        // pristine, then a 2-sample crossfade region, then last 2 of b
+        XCTAssertEqual(out.count, 6)
+        XCTAssertEqual(out[0], 1.0)
+        XCTAssertEqual(out[1], 1.0)
+        // Crossfade region: a (all 1.0) fades out while b (all 0.0) fades in.
+        // At j=0: down=1, up=0 → 1.0 * 1 + 0.0 * 0 = 1.0
+        // At j=1: down=0.5, up=0.5 → 1.0*0.5 + 0.0*0.5 = 0.5
+        XCTAssertEqual(out[2], 1.0, accuracy: 1e-5)
+        XCTAssertEqual(out[3], 0.5, accuracy: 1e-5)
+        XCTAssertEqual(out[4], 0.0, accuracy: 1e-5)
+        XCTAssertEqual(out[5], 0.0, accuracy: 1e-5)
+    }
+
+    func testConcatCrossfadePreservesPrefixAndSuffix() {
+        // Long enough chunks for a full fade window
+        let sampleRate = 24_000
+        let fadeMs = 4.0  // 96 samples
+        let a = [Float](repeating: 1.0, count: 480)
+        let b = [Float](repeating: 0.0, count: 480)
+        let out = CosyVoice3TtsManager.concatWithCrossfade(
+            [a, b], sampleRate: sampleRate, fadeMs: fadeMs)
+        let fade = Int((Double(sampleRate) * fadeMs / 1000).rounded())
+        // Output length: a.count - fade + b.count
+        XCTAssertEqual(out.count, a.count - fade + b.count)
+        // Prefix of `a` (before crossfade region) untouched
+        for j in 0..<(a.count - fade) {
+            XCTAssertEqual(out[j], 1.0)
+        }
+        // Suffix of `b` (after crossfade region) untouched
+        for j in a.count..<out.count {
+            XCTAssertEqual(out[j], 0.0)
+        }
+    }
+}
diff --git a/Tests/FluidAudioTests/TTS/Magpie/MagpieKvCacheTests.swift b/Tests/FluidAudioTests/TTS/Magpie/MagpieKvCacheTests.swift
index 28f1f28f..3f46bbde 100644
--- a/Tests/FluidAudioTests/TTS/Magpie/MagpieKvCacheTests.swift
+++ b/Tests/FluidAudioTests/TTS/Magpie/MagpieKvCacheTests.swift
@@ -53,4 +53,84 @@ final class MagpieKvCacheTests: XCTestCase {
         XCTAssertEqual(
             MagpieKvCache.positionOutputKeys.count, MagpieConstants.numDecoderLayers)
     }
+
+    /// Drives the slow-path fallback used by `MagpieSynthesizer.runDecoderStep`
+    /// when CoreML rejects `outputBackings`. Builds a synthetic feature
+    /// provider that mirrors the `decoder_step.mlmodelc` output schema, hands
+    /// it to `absorbOutputs`, and verifies the cache front pointers + position
+    /// were replaced (i.e. the fallback can take over without `swapBackings`).
+    func testAbsorbOutputsReplacesFrontPointers() throws {
+        let numLayers = 3
+        let maxCacheLength = 16
+        let numHeads = 2
+        let headDim = 4
+        let cache = try MagpieKvCache(
+            numLayers: numLayers, maxCacheLength: maxCacheLength,
+            numHeads: numHeads, headDim: headDim)
+
+        let preK = (0..<numLayers).map { ObjectIdentifier(cache.cachesK[$0]) }
+        let preV = (0..<numLayers).map { ObjectIdentifier(cache.cachesV[$0]) }
+        let prePos = (0..<numLayers).map { ObjectIdentifier(cache.positions[$0]) }
+
+        let cacheShape: [NSNumber] = [
+            1,
+            NSNumber(value: maxCacheLength),
+            NSNumber(value: numHeads),
+            NSNumber(value: headDim),
+        ]
+        var features: [String: MLFeatureValue] = [:]
+        for i in 0..<numLayers {
+            let kArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
+            kArr.zeroFillFloat16()
+            let vArr = try MLMultiArray(shape: cacheShape, dataType: .float16)
+            vArr.zeroFillFloat16()
+            let posArr = try MLMultiArray(shape: [1], dataType: .float16)
+            posArr.zeroFillFloat16()
+            posArr[0] = NSNumber(value: Float(i + 1))
+            features[MagpieKvCache.cacheKOutputKeys[i]] = MLFeatureValue(multiArray: kArr)
+            features[MagpieKvCache.cacheVOutputKeys[i]] = MLFeatureValue(multiArray: vArr)
+            features[MagpieKvCache.positionOutputKeys[i]] = MLFeatureValue(multiArray: posArr)
+        }
+        let provider = try MLDictionaryFeatureProvider(dictionary: features)
+
+        try cache.absorbOutputs(provider)
+
+        for i in 0..<numLayers {
+            XCTAssertNotEqual(
+                ObjectIdentifier(cache.cachesK[i]), preK[i],
+                "absorbOutputs must replace cachesK[\(i)] front pointer")
+            XCTAssertNotEqual(
+                ObjectIdentifier(cache.cachesV[i]), preV[i],
+                "absorbOutputs must replace cachesV[\(i)] front pointer")
+            XCTAssertNotEqual(
+                ObjectIdentifier(cache.positions[i]), prePos[i],
+                "absorbOutputs must replace positions[\(i)] front pointer")
+        }
+        // positions[0] = 1 → cache.position reads layer-0 scalar.
+        XCTAssertEqual(cache.position, 1)
+    }
+
+    func testAbsorbOutputsThrowsWhenCacheKOutputMissing() throws {
+        let cache = try MagpieKvCache(
+            numLayers: 2, maxCacheLength: 8, numHeads: 1, headDim: 2)
+
+        // Provide a feature provider with the wrong key for cache_k_0 so the
+        // first lookup fails. This guards the error message users will see
+        // when the fallback path is actually exercised.
+        let bogus = try MLMultiArray(shape: [1, 8, 1, 2], dataType: .float16)
+        bogus.zeroFillFloat16()
+        let provider = try MLDictionaryFeatureProvider(dictionary: [
+            "wrong_key": MLFeatureValue(multiArray: bogus)
+        ])
+
+        XCTAssertThrowsError(try cache.absorbOutputs(provider)) { error in
+            guard case MagpieError.inferenceFailed(_, let underlying) = error else {
+                XCTFail("expected MagpieError.inferenceFailed, got \(error)")
+                return
+            }
+            XCTAssertTrue(
+                underlying.contains("missing K cache output key"),
+                "underlying should mention the missing K key, got: \(underlying)")
+        }
+    }
 }
diff --git a/Tests/FluidAudioTests/TTS/TtsComputeUnitPresetTests.swift b/Tests/FluidAudioTests/TTS/TtsComputeUnitPresetTests.swift
new file mode 100644
index 00000000..206463cf
--- /dev/null
+++ b/Tests/FluidAudioTests/TTS/TtsComputeUnitPresetTests.swift
@@ -0,0 +1,114 @@
+@preconcurrency import CoreML
+import XCTest
+
+@testable import FluidAudio
+
+final class TtsComputeUnitPresetTests: XCTestCase {
+
+    // MARK: - init?(cliValue:)
+
+    func testCliValueParsing_canonicalKebabCase() {
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "default"), .default)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "all-ane"), .allAne)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu-and-gpu"), .cpuAndGpu)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu-only"), .cpuOnly)
+    }
+
+    func testCliValueParsing_aliases() {
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "ane"), .allAne)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "neural-engine"), .allAne)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpuandgpu"), .cpuAndGpu)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "gpu"), .cpuAndGpu)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpu"), .cpuOnly)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "cpuonly"), .cpuOnly)
+    }
+
+    func testCliValueParsing_caseInsensitive() {
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "DEFAULT"), .default)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "All-Ane"), .allAne)
+        XCTAssertEqual(TtsComputeUnitPreset(cliValue: "CPU-AND-GPU"), .cpuAndGpu)
+    }
+
+    func testCliValueParsing_unknownReturnsNil() {
+        XCTAssertNil(TtsComputeUnitPreset(cliValue: ""))
+        XCTAssertNil(TtsComputeUnitPreset(cliValue: "fastest"))
+        XCTAssertNil(TtsComputeUnitPreset(cliValue: "all_ane"))  // underscore rejected
+        XCTAssertNil(TtsComputeUnitPreset(cliValue: "ane-only"))
+        XCTAssertNil(TtsComputeUnitPreset(cliValue: "neuralengine"))
+    }
+
+    // MARK: - cliValue (round-trip)
+
+    func testCliValueRoundTrip() {
+        for preset in TtsComputeUnitPreset.allCases {
+            let canonical = preset.cliValue
+            XCTAssertEqual(
+                TtsComputeUnitPreset(cliValue: canonical), preset,
+                "cliValue '\(canonical)' must round-trip back to \(preset)")
+        }
+    }
+
+    func testCliValueIsKebabCase() {
+        XCTAssertEqual(TtsComputeUnitPreset.default.cliValue, "default")
+        XCTAssertEqual(TtsComputeUnitPreset.allAne.cliValue, "all-ane")
+        XCTAssertEqual(TtsComputeUnitPreset.cpuAndGpu.cliValue, "cpu-and-gpu")
+        XCTAssertEqual(TtsComputeUnitPreset.cpuOnly.cliValue, "cpu-only")
+    }
+
+    // MARK: - uniformUnits
+
+    func testUniformUnits_defaultIsNil() {
+        XCTAssertNil(TtsComputeUnitPreset.default.uniformUnits)
+    }
+
+    func testUniformUnits_concretePresets() {
+        XCTAssertEqual(TtsComputeUnitPreset.allAne.uniformUnits, .cpuAndNeuralEngine)
+        XCTAssertEqual(TtsComputeUnitPreset.cpuAndGpu.uniformUnits, .cpuAndGPU)
+        XCTAssertEqual(TtsComputeUnitPreset.cpuOnly.uniformUnits, .cpuOnly)
+    }
+
+    // MARK: - KokoroAneComputeUnits(preset:)
+
+    func testKokoroAnePreset_defaultMatchesStaticDefault() {
+        XCTAssertEqual(KokoroAneComputeUnits(preset: .default), .default)
+    }
+
+    func testKokoroAnePreset_allAneMatchesStatic() {
+        XCTAssertEqual(KokoroAneComputeUnits(preset: .allAne), .allAne)
+    }
+
+    func testKokoroAnePreset_cpuAndGpuMatchesStatic() {
+        XCTAssertEqual(KokoroAneComputeUnits(preset: .cpuAndGpu), .cpuAndGpu)
+    }
+
+    func testKokoroAnePreset_cpuOnlyMatchesStatic() {
+        XCTAssertEqual(KokoroAneComputeUnits(preset: .cpuOnly), .cpuOnly)
+    }
+
+    func testKokoroAnePreset_allAneForcesEveryStageToANE() {
+        let cu = KokoroAneComputeUnits(preset: .allAne)
+        for stage in KokoroAneStage.allCases {
+            XCTAssertEqual(
+                cu.units(for: stage), .cpuAndNeuralEngine,
+                "stage \(stage) should be .cpuAndNeuralEngine under .allAne")
+        }
+    }
+
+    func testKokoroAnePreset_cpuOnlyForcesEveryStageToCPU() {
+        let cu = KokoroAneComputeUnits(preset: .cpuOnly)
+        for stage in KokoroAneStage.allCases {
+            XCTAssertEqual(
+                cu.units(for: stage), .cpuOnly,
+                "stage \(stage) should be .cpuOnly under .cpuOnly")
+        }
+    }
+
+    func testKokoroAnePreset_cpuAndGpuForcesEveryStageToCPUAndGPU() {
+        let cu = KokoroAneComputeUnits(preset: .cpuAndGpu)
+        for stage in KokoroAneStage.allCases {
+            XCTAssertEqual(
+                cu.units(for: stage), .cpuAndGPU,
+                "stage \(stage) should be .cpuAndGPU under .cpuAndGpu")
+        }
+    }
+}