diff --git a/.gitignore b/.gitignore index 08103c74..e94d441a 100644 --- a/.gitignore +++ b/.gitignore @@ -11,6 +11,7 @@ xcuserdata/ *.hmap *.txt +!Benchmarks/**/*.txt ## App packaging *.ipa @@ -104,6 +105,10 @@ Resources/ scripts/ !Scripts/parakeet_subset_benchmark.sh !Scripts/diarizer_subset_benchmark.sh + +# MiniMax TTS corpus is CC-BY-SA-4.0 derivative content fetched on demand +# via `fluidaudio minimax-corpus`; only the README is checked in. +Benchmarks/tts/corpus/minimax/*.txt Documentation/parakeet-tdt/ docs/parakeet-tdt/ diff --git a/Documentation/TTS/Benchmarks.md b/Documentation/TTS/Benchmarks.md new file mode 100644 index 00000000..1d05f950 --- /dev/null +++ b/Documentation/TTS/Benchmarks.md @@ -0,0 +1,343 @@ +# TTS Benchmarks + +> **Setup:** MacBook Air M2 (2022), 16 GB, macOS 26, on AC. +> **Corpus:** [MiniMax Multilingual TTS Test Set][minimax] (100 +> phrases / language, CC-BY-SA-4.0) — the same public corpus used +> by [MiniMax-Speech][mms], seed-tts-eval, and Gradium, so numbers +> here are directly paper-comparable. +> **Status:** Kokoro, Kokoro ANE, PocketTTS, Magpie, StyleTTS2 all +> complete the English run; CosyVoice3 completes the full Mandarin +> run. +> +> [minimax]: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set +> [mms]: https://arxiv.org/abs/2505.07916 + +## Why not just RTFx? + +RTFx (audio_seconds / synth_seconds) is a useful single number for batch +synthesis, but for conversational use it hides the things users actually +feel: + +1. **Cold start** — first model load + ANE compile after install or + reboot. On Apple Silicon the system's `anecompilerservice` can take + tens of seconds on first invocation; subsequent loads finish in ~1 s. +2. **TTFT (time-to-first-audio)** — for streaming agents the question + is "how long until the user hears *something*", not "how long until + the whole utterance is rendered". For one-shot backends in this + slice `ttft_ms == synth_ms`. 
**PocketTTS** and **Magpie** are + wired through their respective streaming APIs (`synthesizeStreaming` + / `synthesizeStream`), so their `ttft_ms` is honest first-frame + latency. +3. **Per-stage compute units** — Kokoro ANE / Magpie are pipelines of + 6–7 graphs. Sometimes ANE is *slower per call* but more efficient. + The "right" compute-unit choice differs per stage. +4. **Memory footprint** — drives whether a backend is mobile-viable. +5. **Quality** — RTFx alone tells you nothing about whether the model + pronounced "Reykjavík" or "$1,234.56" correctly. We measure WER + + CER via Parakeet roundtrip on a fixed English corpus; non-English + backends run with `--skip-asr` for now. + +## Methodology + +### Corpus + +All shipped corpora come from the **MiniMax Multilingual TTS Test +Set** (`MiniMaxAI/TTS-Multilingual-Test-Set` on Hugging Face, +CC-BY-SA-4.0). The fetched files land under +`Benchmarks/tts/corpus/minimax/.txt` (24 languages × 100 phrases += 2400 phrases) and are gitignored — populate them on demand with +`swift run fluidaudio minimax-corpus`. Attribution, revision pin, +and WER caveats live in [`MinimaxCorpus.md`](MinimaxCorpus.md). + +Reference each language as `--corpus minimax-`: + +| Backend | Default corpus | Other supported MiniMax languages | +|-------------|--------------------|------------------------------------------------| +| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) | +| PocketTTS | `minimax-english` | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` | +| StyleTTS2 | `minimax-english` | `english` only (LibriTTS multi-speaker) | +| Magpie | `minimax-english` | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` | +| CosyVoice3 | `minimax-chinese` | `chinese`, `cantonese` | + +Lines beginning with `#` are comments. Custom corpora can still be +passed with `--corpus-path `. + +### Metrics + +Per phrase: +- `ttft_ms` — time-to-first-audio. 
For one-shot backends this equals + `synth_ms`. **PocketTTS** is benchmarked through + `synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first + 80 ms audio frame (1920 samples @ 24 kHz). **Magpie** is benchmarked + through `synthesizeStream`, so its `ttft_ms` is the first + `MagpieAudioChunk` emit time (typically ~9.6 s on M2 vs ~15 s for + full synth). +- `synth_ms` — total synth wall time. +- `audio_ms` — generated audio duration. +- `rtfx` — `audio_ms / synth_ms`. +- `wer`, `cer` — via Parakeet ASR roundtrip on the rendered WAV. +- `stage_ms` — per-stage breakdown (backend-specific keys; populated + for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS / + StyleTTS2 / CosyVoice3). +- Backend-specific extras: `encoder_tokens`, `acoustic_frames`, + `chunk_count`, `frame_count`, `code_count`, `finished_on_eos`, + `generated_token_count`, etc. + +Aggregates: +- `cold_start_s` — `manager.initialize()` wall time. CosyVoice3 also + includes voice-asset load. +- `first_synth_ms` — first synth call after init (still cold-ish). +- `ttft_ms_p50` / `ttft_ms_p95`. +- `warm_synth_ms_p50` / `warm_synth_ms_p95`. +- `agg_rtfx` — `Σ audio_ms / Σ synth_ms` across the corpus. +- `peak_rss_mb` — process-wide peak resident set, via + `task_vm_info_data_t.resident_size_peak`. +- Per-category macro WER / CER. + +### Reproducibility + +```bash +# From the package root. +swift run fluidaudio tts-benchmark \ + --backend kokoro-ane \ + --corpus minimax-english \ + --voice af_heart \ + --compute-units default \ + --output-json bench.json \ + --audio-dir bench-wavs/ +``` + +The harness writes a JSON report to `--output-json` and (optionally) +keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR +roundtrip. The default ASR backend is `parakeet` for English-only +runs and is skipped for CosyVoice3; pass `--asr-backend cohere +--cohere-model-dir ` to score Mandarin (or any of the 14 +Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/). 
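
The aggregate definitions under Metrics can be sketched as follows. This is a minimal Swift illustration, not the harness code: `PhraseTiming`, `percentile`, and `aggregateRtfx` are hypothetical names, and the nearest-rank percentile is an assumption (the harness may interpolate instead).

```swift
import Foundation

/// Hypothetical per-phrase record; fields mirror the `ttft_ms`, `synth_ms`,
/// and `audio_ms` keys described under Metrics.
struct PhraseTiming {
    let ttftMs: Double
    let synthMs: Double
    let audioMs: Double

    /// Per-phrase real-time factor: audio duration over synth wall time.
    var rtfx: Double { audioMs / synthMs }
}

/// Nearest-rank percentile (assumed method, p in 0...100).
func percentile(_ values: [Double], _ p: Double) -> Double {
    precondition(!values.isEmpty && (0.0...100.0).contains(p))
    let sorted = values.sorted()
    let rank = Int((p / 100.0 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

/// Corpus-level RTFx: sum of audio time over sum of synth time.
func aggregateRtfx(_ phrases: [PhraseTiming]) -> Double {
    let audio = phrases.reduce(0.0) { $0 + $1.audioMs }
    let synth = phrases.reduce(0.0) { $0 + $1.synthMs }
    return audio / synth
}

// Made-up timings, for illustration only.
let sample = [
    PhraseTiming(ttftMs: 1000, synthMs: 1000, audioMs: 2000),
    PhraseTiming(ttftMs: 500, synthMs: 4000, audioMs: 2000),
]
print(aggregateRtfx(sample), percentile(sample.map(\.synthMs), 50))
```

Because `agg_rtfx` divides summed durations, phrases with long synth times weigh more than in a plain mean of per-phrase `rtfx`, which is why a handful of paragraph-length phrases can pull the corpus figure down.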
+ +## Results + +### Per-backend top-line + +Reference machine: **MacBook Air, Apple M2 (2022), 8-core CPU / +8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26** +(`Mac14,2`, on AC). All English runs use `--compute-units default`, +voice = backend default +(`af_heart` for Kokoro, `alba` for PocketTTS, `John` for Magpie), +corpus = `minimax-english` (100 phrases), Parakeet TDT roundtrip for +WER / CER. + +| Backend | License | Languages | Footprint | Cold start | TTFT p50 / p95\* | Synth p50 / p95 | Agg RTFx | Peak RSS | WER | CER | Notes | +|-------------|-------------|------------------------|-----------|------------|---------------------|---------------------|----------|----------|---------|---------|-------| +| Kokoro ANE | Apache-2.0 | en (af_heart only) | ~330 MB | 37.9 s | 1586 / 2515 ms | 1586 / 2515 ms | 5.19× | 738 MB | 0.108 | 0.040 | one-shot; per-stage CU sweep, 7-graph pipeline | +| Kokoro | Apache-2.0 | en (af_heart only) | ~330 MB | 92.2 s | 3113 / 4696 ms | 3113 / 4696 ms | 2.02× | 736 MB | 0.013 | 0.005 | one-shot; cleanest English ASR roundtrip | +| PocketTTS | research | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | **1244 / 4749 ms** | 8757 / 19174 ms | 0.61× | 1503 MB | 0.014 | 0.006 | **streaming**; TTFT is first 80 ms audio frame | +| StyleTTS2 | MIT | en (LibriTTS multi-spk) | ~280 MB | 955 s§ | 6671 / 15990 ms§ | 6671 / 15990 ms§ | 2.72×§ | 963 MB§ | 0.440§ | 0.241§ | full 100/100 `minimax-english` via [misaki→espeak post-pass remap](#styletts2-misaki--espeak-post-pass-remap); ref_s = LibriTTS `696_92939_000016_000006.wav` (StyleTTS2 demo voice) | +| Magpie | research | en/es/de/fr/it/vi/zh/hi | ~1.3 GB | 38.5 s∥ | **9580 / 23796 ms**∥ | 15080 / 29895 ms∥ | 0.64×∥ | 762 MB∥ | 0.056 | 0.033 | **streaming TTFT**: first audio chunk at 9.6 s p50 on M2 (full synth 15.1 s); split-K/V decoder; outputBackings fast path with latched fallback | +| CosyVoice3 | Apache-2.0 | zh (mandarin) | ~1.5 GB | 29.2 s† | 14091 
/ 23679 ms† | 14091 / 23679 ms† | 0.357׆ | 3302 MB† | n/a‡ | 0.017‡ | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via [auto-chunker](#cosyvoice3-auto-chunker) but not benchmarked (no yue ASR) | + +\* TTFT for **PocketTTS / Magpie** is first-frame emit through the +streaming API; the others are one-shot, so `ttft_ms == synth_ms`. + +† CosyVoice3 chinese: 100/100, 0 errors, ASR skipped. Cold-start +dropped from 302.7 s to 29.2 s on the warm re-run. + +‡ CosyVoice3 CER measured on the **full 100-phrase** +`minimax-chinese` corpus via `whisper-large-v3` (Python CPU FP32, +[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py)) on +the WAVs rendered by `tts-benchmark --backend cosyvoice3 --corpus +minimax-chinese --skip-asr --audio-dir `: **macro CER 1.68% +(0.0168)**, **micro CER 1.84% (0.0184)** across 100 phrases. +Whisper is the source of truth here because Cohere Transcribe q8 +hit a `MILCompilerForANE` cache failure on this M2 host and ran on +the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple +hours for the full 100-phrase set vs. ~70 min for whisper). WER is +omitted because Mandarin has no word boundaries and `WERCalculator` +splits on whitespace, so word-level WER reads near 100% and is +meaningless. + +∥ Magpie: streamed via `synthesizeStream`. TTFT (9.6 s p50) is +first-chunk emit; synth (15.1 s p50) is full-utterance wall time — +the 5.5 s gap is the streaming win. + +§ StyleTTS2 (**beta** — `StyleTTS2Manager.initialize` emits a +runtime warning): warm-cache run; first cold compile of the +bucketed text_predictor / diffusion_step / decoder graphs is +multi-second. ref_s dumped via +[`06_dump_ref_s.py`](https://github.com/voicelink-ai/mobius-styletts2/blob/main/models/tts/styletts2/scripts/06_dump_ref_s.py). 
+Read WER **relatively** per the +[WER caveat](#about-the-wer--cer-numbers); StyleTTS2's own demo +notebook reports artifacts on long sentences at default +`alpha/beta/diffusion_steps`. + +### Kokoro ANE — per-stage breakdown (default preset, MiniMax-English) + +Means across 100 `minimax-english` phrases on M2. Stages map to the +7-CoreML-graph split documented in [KokoroAne.md](KokoroAne.md). Vocoder ++ noise together account for ~92% of synth time, which is the natural +target for any further per-stage compute-unit re-tuning. The MiniMax +mean is meaningfully higher than the prior Harvard-sentences run +because phrases 81–100 are paragraph-length news / story sentences. + +| Stage | Mean ms | % of total | +|---------------|---------|------------| +| `albert` | 28.2 | 2.0% | +| `post_albert` | 12.1 | 0.9% | +| `alignment` | 1.8 | 0.1% | +| `prosody` | 49.2 | 3.5% | +| `noise` | 242.6 | 17.4% | +| `vocoder` | 1039.8 | 74.4% | +| `tail` | 24.6 | 1.8% | +| **total** | 1398.4 | 100% | + +### Magpie — per-stage breakdown (default preset, MiniMax-English) + +Means across 100 `minimax-english` phrases on M2 (`John` voice, en, +default compute units), captured during the original one-shot +profiling run. `ar_loop` is the umbrella for the per-step +`decoder_step` + `sampler` (so it is not added on top in the total). +`nanocodec` runs concurrently with the AR loop in chunked-streaming +mode, which is why the per-stage means do not sum to total warm-synth +mean. The AR loop dominates the wall clock, and its cost grows +super-linearly with phrase length — long news / story phrases drive +the long-tail p95. + +| Stage | Mean ms | +|--------------------|---------| +| `text_encoder` | 91 | +| `prefill` | 281 | +| `ar_loop` | 17946 | +| └── `decoder_step` | 14840 | +| └── `sampler` | 3081 | +| `nanocodec` | 17948 | + +### About the WER / CER numbers + +The MiniMax corpus mixes short conversational phrases, medium news +headlines, and long narrative paragraphs. 
WER on the long tail is +sensitive to the ASR + text-normalizer stack (e.g. `"3,5%"` → +`"three point five percent"` vs. `"three and a half percent"`); per +the [upstream community +discussion](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10), +absolute WER is best read **relatively** (backend A vs. backend B on +the same corpus + same ASR + same normalizer) rather than against +raw paper numbers. + +## StyleTTS2 misaki → espeak post-pass remap + +StyleTTS2's LibriTTS checkpoint was trained on **espeak-ng-phonemized** +text, but the in-tree BART G2P (shared with Kokoro) emits **misaki** +output. The 178-token vocab accepts both forms, but the acoustic +embeddings for the misaki ligature glyphs are essentially untrained +noise — every training utterance saw the espeak form. + +Four systematic divergences vs. `espeak-ng -v en-us --ipa -q`: + +| misaki | espeak-ng | example | +|--------|-----------|--------------------------| +| `ʧ` | `tʃ` | choice → `tʃˈɔɪs` | +| `ʤ` | `dʒ` | jump → `dʒˈʌmps` | +| `ɜɹ` | `ɝ` | girl → `ɡˈɝl` | +| `əɹ` | `ɚ` | over → `ˈoʊvɚ` | + +Fix: 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated +on `.americanEnglish`. Result on `minimax-english`: WER 0.581 → +0.440, CER 0.476 → 0.241, agg-RTFx 2.36× → 2.72× (warm-cache +re-run, so latency / RSS deltas are noise — WER / CER are the real +signal). WER is still 30× worse than Kokoro; remaining errors cluster +on word-level BART mispronunciations and long-tail diffusion artifacts. +Further gains likely need a richer remap layer or swapping BART for +libespeak-ng directly. + +## CosyVoice3 Decode budget cap + +CosyVoice3's Flow CFM was exported with a fixed input shape of +`[1, 250]` speech tokens (`flowTotalTokens` in +`CosyVoice3Constants.swift:45`). The LLM-Decode AR loop is allowed to +emit up to `flowTotalTokens − N_prompt` tokens before being cut off +(typically ~163 generated tokens after the speech-prompt portion). 
+At `tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000` +that's **40 ms of audio per generated token**, so the loop produces +**at most ~6.5 s of speech per phrase**, regardless of how long the +input text is. + +When the AR loop exits because it ran out of budget (i.e. no EOS +token in `stopRange = 6_561…6_760`) instead of natural termination, +`CosyVoice3Synthesizer` now: + +1. Logs a `.warning` (one-shot per phrase) naming the + `decoded.count / maxNew` budget and the produced audio duration. +2. Sets `CosyVoice3SynthesisResult.finishedOnEos = false`, which the + benchmark harness surfaces as the `finished_on_eos` field on each + phrase in the JSON report. + +Footprint on the cantonese corpus (`minimax-cantonese`, +100 phrases) **without the chunker**: 80 / 100 phrases would hit the +cap, all producing exactly 163 generated tokens / ~6.5 s of audio. +The mandarin corpus sees a much lower truncation rate because +MiniMax-zh phrases are shorter on average. + +The structural fix — re-exporting the Flow CFM from +[`mobius-cosyvoice3`](https://github.com/voicelink-ai/mobius-cosyvoice3) +with a larger fixed input shape (e.g. `[1, 500]`) — is upstream +work; bumping the constant in Swift alone would make the Flow +input/output shapes mismatch at predict time. The shipped workaround +is the call-site [auto-chunker](#cosyvoice3-auto-chunker), which +drops cantonese truncation from 80/100 → 5/100 by splitting long +inputs at clause boundaries and crossfading the results. + +Surfaced in +`CosyVoice3Synthesizer.synthesize` +(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift`) +and +`CosyVoice3SynthesisResult.finishedOnEos` +(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift`). + +## CosyVoice3 auto-chunker + +Re-exporting Flow CFM with a larger fixed input shape is gated on +upstream conversion work. 
Until that lands, `CosyVoice3TtsManager` +splits long inputs at the call site, synthesizes each chunk +independently, and merges with an 8 ms equal-power cosine crossfade. + +**Splitter policy** (`CosyVoice3TextChunker`): + +- **Hard enders** commit always: `.`, `!`, `?`, `。`, `!`, `?`, + `\n`. +- **Soft enders** commit only when the running estimate is at or past + the budget: `,`, `、`, `;`, `:`, `;`, `,`, ASCII space. +- **Force-split** at `budget + 30` tokens of overshoot if no natural + boundary appeared (rare; mostly continuous CJK with no + punctuation). + +**Token-rate estimate** (calibrated against minimax-zh + minimax-yue +runs): + +| Char class | Tokens / char | Rationale | +|------------|---------------|--------------------------------------------------------------| +| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char | +| ASCII | 1.5 | matches BPE rate on English text | +| Other | 2.5 | conservative for accented Latin / non-CJK Unicode | + +`defaultMaxSpeechTokens` is **110**, leaving margin under the +250-token Flow cap minus typical 60–90 token speech-prompt context. + +**Concatenation**: 8 ms equal-power cosine crossfade at 24 kHz +between adjacent chunks; single-chunk path short-circuits to plain +copy. + +**Validation** (full `minimax-cantonese`, 100 phrases, M2): + +| Metric | Pre-chunker | Post-chunker | Δ | +|-------------------------------------------|-------------|--------------|------------| +| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% | +| Longest audio output | 6.5 s | **16.1 s** | +148% | +| agg-RTFx | 0.245× | 0.249× | +1.6% | +| TTFT p50 | 23.9 s | 35.7 s | +49% | +| TTFT p95 | 41.2 s | 60.5 s | +47% | +| Peak RSS | 2016 MB | 3264 MB | +62% | + +The 5/100 residual is the long-tail token-rate worst case (some +Cantonese characters generate >9 speech tokens); raising the +per-CJK heuristic further would over-fragment short phrases. +Cleaner fix is the upstream Flow re-export. 
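
The splitter policy and token-rate table above can be sketched in a few lines of Swift. This is a hedged illustration and not the shipped `CosyVoice3TextChunker`: the function names are hypothetical, the CJK class is assumed to be the CJK Unified Ideographs block (U+4E00 to U+9FFF), every scalar (including punctuation and spaces) is assumed to count toward the estimate, and the full-width punctuation variants are assumptions about what the enders lists contain.

```swift
import Foundation

/// Hypothetical per-character token estimate using the table's rates.
func estimatedSpeechTokens(_ text: String) -> Double {
    var total = 0.0
    for scalar in text.unicodeScalars {
        switch scalar.value {
        case 0x4E00...0x9FFF: total += 7.5  // CJK (assumed class boundary)
        case 0x00...0x7F: total += 1.5      // ASCII
        default: total += 2.5               // everything else
        }
    }
    return total
}

/// Greedy splitter: hard enders always commit, soft enders commit once the
/// running estimate reaches the budget, and a forced split happens at
/// `budget + overshoot` when no boundary appears.
func chunkText(_ text: String, budget: Double = 110, overshoot: Double = 30) -> [String] {
    let hardEnders: Set<Character> = [".", "!", "?", "\u{3002}", "\u{FF01}", "\u{FF1F}", "\n"]
    let softEnders: Set<Character> = [",", "\u{3001}", ";", ":", "\u{FF1B}", "\u{FF0C}", " "]
    var chunks: [String] = []
    var current = ""
    var running = 0.0

    func commit() {
        let trimmed = current.trimmingCharacters(in: .whitespacesAndNewlines)
        if !trimmed.isEmpty { chunks.append(trimmed) }
        current = ""
        running = 0.0
    }

    for ch in text {
        current.append(ch)
        running += estimatedSpeechTokens(String(ch))
        let hard = hardEnders.contains(ch)
        let soft = softEnders.contains(ch) && running >= budget
        let forced = running >= budget + overshoot
        if hard || soft || forced { commit() }
    }
    commit()  // flush any trailing text without a terminator
    return chunks
}

print(chunkText("Hi. Bye."))
```

With the default budget, an unpunctuated run of CJK characters is force-split once the estimate reaches 140 tokens, which matches the "rare; mostly continuous CJK" case described in the splitter policy.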
+ diff --git a/Documentation/TTS/CosyVoice3.md b/Documentation/TTS/CosyVoice3.md index 7308e208..2bca7061 100644 --- a/Documentation/TTS/CosyVoice3.md +++ b/Documentation/TTS/CosyVoice3.md @@ -3,16 +3,19 @@ Mandarin zero-shot voice cloning via Qwen2 LM + CFM Flow + HiFT vocoder, running on CoreML. -> ⚠️ **Beta / experimental.** End-to-end synthesis is currently slow on -> Apple Silicon — RTFx < 1.0 typical, several seconds of latency for -> short Mandarin utterances. The slowdown is partly the Flow CFM stage -> (fp32, CPU-or-GPU only because fp16 + ANE produces NaNs through the -> fused `layer_norm` — CoreMLTools limitation, tracked upstream) and -> partly HiFT sinegen / windowing ops that fall back to CPU. May be a -> model issue, may be recoverable through better conversion. Treat -> performance numbers as preliminary; the Swift API, model layout, and -> prompt-asset format may change in subsequent releases without -> deprecation aliases. +> ⚠️ **Beta / experimental.** End-to-end synthesis is below real-time +> on Apple Silicon — agg-RTFx **0.357×** and p50 TTFT **~9.6 s** on +> the full `minimax-chinese` 100-phrase corpus (M2, default compute +> units), after the +> [HiFT timeout fix](Benchmarks.md#cosyvoice3-hift-timeout-fix) and +> [LLM-Decode `outputBackings` double-buffer](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix). +> The slowdown is partly the Flow CFM stage (fp32, CPU-or-GPU only +> because fp16 + ANE produces NaNs through the fused `layer_norm` — +> CoreMLTools limitation, tracked upstream) and partly HiFT sinegen +> / windowing ops that fall back to CPU. May be a model issue, may +> be recoverable through better conversion. Treat performance numbers +> as preliminary; the Swift API, model layout, and prompt-asset format +> may change in subsequent releases without deprecation aliases. 
## Files @@ -105,8 +108,9 @@ let result = try await manager.synthesize( | Field | Default | Notes | |---|---|---| -| `maxNewTokens` | `nil` (cap = 1024) | Hard ceiling on speech-token count | +| `maxNewTokens` | `nil` (= `flowTotalTokens − N_prompt`) | Soft ceiling on the LLM-Decode AR loop. The hard ceiling is the structural 250-token cap below — `maxNewTokens` only lets you generate fewer than that. | | `seed` | 42 | Drives the RAS sampler RNG; reproducible runs | +| `disableAutoChunking` | `false` | When `true`, bypasses `CosyVoice3TextChunker` and runs a single synthesizer call regardless of input length. Use when you've pre-segmented input upstream (UI streaming, paragraph-at-a-time playback, etc.). The structural 250-token cap then applies and long inputs truncate mid-utterance. | `CosyVoice3SynthesisResult`: @@ -116,13 +120,92 @@ let result = try await manager.synthesize( | `sampleRate` | `Int` | always 24000 | | `generatedTokenCount` | `Int` | tokens before EOS | | `decodedTokens` | `[Int32]` | full speech token sequence (debug) | +| `finishedOnEos` | `Bool` | `true` = AR loop exited on an EOS token (natural termination); `false` = budget exhausted, audio truncated mid-utterance. See "Decode budget cap" below. | + +### Decode budget cap + auto-chunking + +The Flow CFM model is exported with a fixed-shape `token_total` input of +`[1, 250]` (`CosyVoice3Constants.flowTotalTokens = 250`). Each LLM-Decode +token corresponds to **40 ms of audio** (`tokenMelRatio = 2 × hiftSamplesPerFrame = 480 / sampleRate = 24 000`), +so the *generated* portion of a single synthesizer call is bounded by +`(250 − N_prompt) × 40 ms`. With a typical prompt of ~85–95 tokens, +this leaves ~6.4–6.6 s of generated audio per call — long Mandarin +phrases would truncate mid-utterance if synthesized in one shot. + +**`CosyVoice3TtsManager.synthesize(...)` auto-chunks long input** to +sidestep this. Pipeline: + +1. Run the existing Chinese normalizer (or skip it, per `prenormalized`). +2. 
`CosyVoice3TextChunker.chunk(normalized)` greedily splits on hard + sentence enders (`. ! ? 。 ! ?`) and falls back to soft clause + separators (`, ; , ; 、 :`) when sentences exceed the budget. The + default budget is `defaultMaxSpeechTokens = 110` speech tokens + (`~45-token margin under the typical 155 room-for-new`; the 30-token + force-split overshoot may push committed chunks to ~140 estimated). +3. If the chunker returns one segment, take the fast path — single + synthesizer call, no concat overhead. +4. Otherwise loop, calling the synthesizer once per chunk, then merge + results: PCM concatenated with an 8 ms cosine cross-fade at each + boundary (masks DC/phase mismatch from independent synth calls); + `generatedTokenCount`/`decodedTokens` summed/concatenated; + `finishedOnEos` = AND across all chunks. + +Tunables: `CosyVoice3TextChunker.defaultMaxSpeechTokens` (110) is the +default budget; pass `disableAutoChunking: true` in +`CosyVoice3SynthesisOptions` to bypass the chunker entirely and run a +single call (useful for UI-driven sentence-at-a-time streaming where +the caller already controls segmentation). + +Token-rate estimate inside the chunker (calibrated against minimax-zh +corpus runs — initial 5.5 figure was too optimistic and let ~16% of +phrases hit the cap; 7.5 covers the worst-case observed real rate): + +| Class | Tokens/char | Rationale | +|---|---|---| +| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char | +| ASCII | 1.5 | BPE compresses; English speaks faster than Mandarin per char | +| Other (Latin-1, etc.) | 2.5 | middle ground | + +Caveats: + +- **Prosody discontinuity at boundaries.** Each chunk re-establishes the + pitch contour from the prompt, so concatenated audio has audible breaks + at chunk seams. The 8 ms cross-fade hides clicks/DC offsets but cannot + reconstruct cross-sentence prosody. 
+- **Per-chunk prefill cost.** Each segment pays the prefill cost + separately, so total wall-clock for an N-chunk synth is roughly + `N × prefill + Σ decode_per_chunk`. Single-chunk inputs are unaffected. +- **Estimate slack.** The token-per-char heuristic is rough; if a chunk + somehow exceeds the model's structural budget at runtime, the + synthesizer still emits the `LLM-Decode budget exhausted` warning and + returns `finishedOnEos: false` for that chunk. + +Behavior of the underlying synthesizer when its budget is hit (still +applies for `disableAutoChunking: true` or for one-shot mode): + +- **AR loop exhausts `maxNew` without observing an EOS** in + `CosyVoice3Constants.stopRange` (`6_561…6_760`). +- `CosyVoice3Synthesizer` emits a `.warning`-level log: + `"LLM-Decode budget exhausted: generated tokens / cap (no EOS observed). Output truncated at ~s of audio."`. +- `result.finishedOnEos` is `false` so callers can detect it + programmatically (the `tts-benchmark` harness surfaces this as a + per-phrase `finished_on_eos` field in the JSON report). + +Lifting the cap structurally (no auto-chunk, no prosody seams) requires +re-exporting Flow with a larger `token_total` shape (e.g. `[1, 500]` for +~16 s) — handled upstream in the `mobius-cosyvoice3` conversion pipeline; +not changeable from the Swift host. ## Key State -### KV cache (`kv_cache[24, 1, 2, 768, 64]` fp16) -- 24 transformer layers × `[K,V]` × heads × dim, packed into one `MLState`-style - `MLMultiArray` that the prefill produces and the decode loop both reads - and overwrites in-place. +### KV cache (`kv_k` / `kv_v` each `[24, 1, 2, 768, 64]` fp32) +- 24 transformer layers × `[K,V]` × heads × dim, split across two + `MLMultiArray` outputs (`kv_k`, `kv_v`) that prefill produces and the + decode loop carries forward across steps via + `MLPredictionOptions.outputBackings` double-buffering. +- No `MLState` dependency — runs on the package baseline (macOS 14 / iOS 17). 
+- ~9 MB per array; pre-allocated front/back/spare buffers rotated each + step (see [LLM-Decode `outputBackings` fix](Benchmarks.md#cosyvoice3-llm-decode-outputbackings-fix)). - Reset per `synthesize()` call. ### Prompt assets (`CosyVoice3PromptAssets`) diff --git a/Documentation/TTS/KokoroAne.md b/Documentation/TTS/KokoroAne.md index 88afa80a..b941f9e6 100644 --- a/Documentation/TTS/KokoroAne.md +++ b/Documentation/TTS/KokoroAne.md @@ -148,7 +148,11 @@ timing (5 s of audio, M1): | Vocoder | ~120 ms | | Tail | ~50 ms | -Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx). +Vocoder dominates. Total ≈ 300 ms for 5 s audio (~16× RTFx). For +full-corpus numbers (warm-synth p50 / p95, peak RSS, WER) on the +MiniMax-English 100-phrase suite — including the longer paragraph +phrases that pull the per-corpus aggregate down to ~5.2× — see +[Benchmarks.md](Benchmarks.md). ## Source diff --git a/Documentation/TTS/Magpie.md b/Documentation/TTS/Magpie.md index 9124b59e..74a93fc8 100644 --- a/Documentation/TTS/Magpie.md +++ b/Documentation/TTS/Magpie.md @@ -5,16 +5,26 @@ Lives under `Sources/FluidAudio/TTS/Magpie/`. ## Status -Functional but **quite slow — needs significant perf work, not for real-time -or latency-sensitive use.** First synth on a fresh process is dominated by -CoreML model load + first-call ANE compile (~30 s); warm synths run at -~96 s wall for an 8-word English sentence on M-series, i.e. RTFx ≈ **0.04** -(~25× slower than realtime). Whether the throughput ceiling is a model -characteristic, a CoreML conversion limitation, or both is still being -investigated and is expected to improve in subsequent iterations. For -real-time use prefer Kokoro (~20× RTFx) or PocketTTS (~1.5–2× RTFx); -Magpie's value prop is multilingual coverage and the 5 built-in speaker -contexts, not throughput. +> ⚠️ **Beta / experimental.** Below real-time on Apple Silicon +> (agg-RTFx ~0.41× on M2). 
Not for latency-sensitive use; prefer +> Kokoro / Kokoro ANE or PocketTTS for real-time. Initializing +> `MagpieTtsManager` logs a runtime beta warning at `.warning` level. + +Functional but **below real-time — not for latency-sensitive use.** +On the full `minimax-english` 100-phrase corpus (M2, default compute +units), Magpie posts agg-RTFx **0.41×** with p50 warm synth ~19.8 s +and p95 ~57.5 s — most of the long tail comes from paragraph-length +news / story phrases (max 107 s on a single 18 s utterance). Cold +start ~19 s on warm ANE caches, dominated by first-call decoder_step +compile. The AR loop (`decoder_step` + sampler) dominates wall clock +and grows super-linearly with phrase length; the +[`outputBackings` fast path](Benchmarks.md#magpie-outputbackings-fast-path) +already eliminated the per-step KV reallocation cost. Further gains +likely need an MLX-backed LocalTransformer or a smaller-K/V variant. +For real-time use prefer Kokoro / Kokoro ANE (2–5× RTFx) or PocketTTS +(streaming, TTFT ~1.2 s); Magpie's value prop is multilingual coverage +(en/es/de/fr/it/vi/zh/hi) and 5 built-in speaker contexts, not +throughput. Audio quality is perceptually clean across all 5 speakers and ASR-clean on 4/5; speaker 0 has a single trailing-word artifact ("…and") attributable diff --git a/Documentation/TTS/MinimaxCorpus.md b/Documentation/TTS/MinimaxCorpus.md new file mode 100644 index 00000000..66dd679d --- /dev/null +++ b/Documentation/TTS/MinimaxCorpus.md @@ -0,0 +1,89 @@ +# MiniMax Multilingual TTS Test Set + +The FluidAudio `tts-benchmark` corpus is sourced on demand from the +[MiniMaxAI/TTS-Multilingual-Test-Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set) +Hugging Face dataset and converted to the harness format (one phrase +per non-empty, non-`#` line). The fetched `.txt` files land under +`Benchmarks/tts/corpus/minimax/.txt`; they are gitignored — only +this document is checked in. 
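
The harness format described above (one phrase per non-empty, non-`#` line) is simple enough to read with a short helper. The sketch below is an assumption about how such a file could be parsed, not the harness's actual loader; `parseCorpus` is a hypothetical name, and trimming surrounding whitespace before the comment check is an assumed behavior.

```swift
import Foundation

/// Hypothetical corpus reader: keeps one phrase per line, dropping blank
/// lines and lines whose first non-space character is `#`.
func parseCorpus(_ contents: String) -> [String] {
    contents
        .split(whereSeparator: \.isNewline)
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty && !$0.hasPrefix("#") }
}

let phrases = parseCorpus("# header comment\n\nHello there.\n   \nSecond phrase\n")
print(phrases.count)
```

A custom corpus written in the same shape should be consumable the same way, which is what makes hand-written phrase lists easy to author.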
+ +| Field | Value | +|----------|-------| +| Source | https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set | +| Revision | `cb416f0ac3658da0577e97873065e19fe6488917` (initial public release) | +| License | [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) | +| Citation | MiniMax-Speech tech report — [arXiv 2505.07916](https://arxiv.org/pdf/2505.07916) | +| Languages | 24 (arabic, cantonese, chinese, czech, dutch, english, finnish, french, german, greek, hindi, indonesian, italian, japanese, korean, polish, portuguese, romanian, russian, spanish, thai, turkish, ukrainian, vietnamese) | +| Phrases | 100 per language (2400 total) | + +The fetched text files are derivative works of the upstream dataset +and remain under **CC-BY-SA-4.0**. The rest of the FluidAudio +repository is licensed separately (see top-level `LICENSE`); only the +contents of `Benchmarks/tts/corpus/minimax/` are share-alike-bound to +CC-BY-SA-4.0. + +## Why this corpus? + +MiniMax positions this as *"a public benchmark used in a number of +recent TTS papers, which makes our numbers directly comparable to +existing work"* (Gradium, MiniMax-Speech, seed-tts-eval, etc.). +FluidAudio's `tts-benchmark` ships exclusively against this corpus +so the resulting RTFx / WER numbers land on the same axis as +published TTS work. + +## Format conversion + +Upstream lines have a `|` pipe-delimited +shape because the dataset also ships per-speaker reference audio for +zero-shot voice cloning. The FluidAudio harness only needs the text — +voice selection is a per-backend concern (Kokoro / PocketTTS / Magpie / +StyleTTS2 each have their own voice plumbing). The leading +`|` is stripped at fetch time; if you need the cloning audio +later, fetch it from the upstream HF repo's `audio/` directory. + +## Fetching + +The `fluidaudio minimax-corpus` CLI subcommand pins the upstream +revision to the value above so re-runs are deterministic. 
From the +package root: + +```bash +# All 24 languages +swift run fluidaudio minimax-corpus + +# Subset +swift run fluidaudio minimax-corpus --languages english,spanish,hindi + +# Refresh against a newer release +swift run fluidaudio minimax-corpus --revision +``` + +Output lands in `Benchmarks/tts/corpus/minimax/.txt` (relative +to the package root) by default; override with `--out-dir `. +Auth-gated revisions are honored via the standard `HF_TOKEN` / +`HUGGING_FACE_HUB_TOKEN` env vars (same as every other HF asset pull +in the project). Run `fluidaudio minimax-corpus --help` for the full +flag list. + +Per-backend ↔ language coverage and `tts-benchmark --corpus minimax-` +usage live in [`Benchmarks.md`](Benchmarks.md#corpus). + +## WER caveats + +Per the [open community discussion on the upstream +dataset](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/discussions/10), +WER on this corpus is sensitive to the ASR + text-normalization stack: + +- Whisper-v3 (and similarly Parakeet) often need text normalization on + the reference (`"32"` → `"thirty two"`) before comparing against the + hypothesis to get a clean WER. +- For non-Latin-script languages (Hindi, Japanese, Cantonese, etc.) the + ASR may emit transliterated forms that don't match the reference + script, inflating WER even when the synthesis is intelligible. +- For non-word-segmented languages (Chinese, Japanese, Thai), CER is + the more meaningful metric — `tts-benchmark` already reports both. + +This means **MiniMax WER is best read relatively (FluidAudio backend +A vs. backend B on the same corpus + same ASR), not absolutely**, and +side-by-side comparison with published numbers requires matching the +upstream ASR + normalizer choice. 
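
To make the CER-versus-WER point concrete, here is a small self-contained Swift sketch. It is not the project's `WERCalculator`; the function names are hypothetical, WER is assumed to tokenize on ASCII spaces only, and no text normalization is applied to either string.

```swift
import Foundation

/// Levenshtein edit distance over arbitrary token arrays.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    for i in 1...a.count {
        var cur = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = cur
    }
    return prev[b.count]
}

/// Edits divided by reference length.
func errorRate<T: Equatable>(reference: [T], hypothesis: [T]) -> Double {
    precondition(!reference.isEmpty, "reference must not be empty")
    return Double(editDistance(reference, hypothesis)) / Double(reference.count)
}

/// Word error rate with whitespace tokenization (assumed).
func wordErrorRate(_ ref: String, _ hyp: String) -> Double {
    errorRate(reference: ref.split(separator: " ").map(String.init),
              hypothesis: hyp.split(separator: " ").map(String.init))
}

/// Character error rate over grapheme clusters.
func characterErrorRate(_ ref: String, _ hyp: String) -> Double {
    errorRate(reference: Array(ref), hypothesis: Array(hyp))
}

// One wrong character in a six-character Mandarin phrase.
let reference = "今天天气很好"
let hypothesis = "今天天汽很好"
print(wordErrorRate(reference, hypothesis), characterErrorRate(reference, hypothesis))
```

With no spaces in the phrase, the whole utterance is a single "word", so one wrong character makes the word-level score a total miss while the character-level score stays small. That is the effect behind preferring CER for Chinese, Japanese, and Thai.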
diff --git a/Sources/FluidAudio/ModelNames.swift b/Sources/FluidAudio/ModelNames.swift index b49066e1..fbb7d281 100644 --- a/Sources/FluidAudio/ModelNames.swift +++ b/Sources/FluidAudio/ModelNames.swift @@ -716,7 +716,7 @@ public enum ModelNames { /// expected local directory layout is encoded in `CosyVoice3Constants.Files`. public enum CosyVoice3 { public static let llmPrefill = "LLM-Prefill-T256-M768-fp16" - public static let llmDecode = "LLM-Decode-M768-fp16-stateful" + public static let llmDecode = "LLM-Decode-M768-fp16" public static let flow = "Flow-N250-fp16" public static let hift = "HiFT-T500-fp16" public static let speechEmbeddings = "speech_embedding-fp16.safetensors" diff --git a/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift b/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift index 7051143f..ae616a42 100644 --- a/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift +++ b/Sources/FluidAudio/TTS/CosyVoice3/Assets/CosyVoice3ModelStore.swift @@ -28,11 +28,12 @@ public actor CosyVoice3ModelStore { /// - Parameters: /// - directory: Base build directory that contains - /// `llm-fp16/`, `llm-fp16-stateful/`, `flow-fp16-n250/`, + /// `llm-fp16/`, `llm-fp16-decode/`, `flow-fp16-n250/`, /// `hift-fp16-t500/`, `embeddings/`. /// - computeUnits: Defaults to `.cpuAndNeuralEngine`. Applied to - /// LLM-Prefill + HiFT models only. LLM-Decode (stateful) and Flow - /// both force `.cpuAndGPU` regardless (see `loadIfNeeded()`). + /// LLM-Prefill only. LLM-Decode (stateless external cache), + /// Flow, and HiFT all pin `.cpuAndGPU` regardless (see + /// `loadIfNeeded()`). 
public init(directory: URL, computeUnits: MLComputeUnits = .cpuAndNeuralEngine) { self.directory = directory self.computeUnits = computeUnits @@ -67,10 +68,10 @@ public actor CosyVoice3ModelStore { let prefill = try await compileAndLoad(prefillURL, configuration: config) logger.info("Loaded \(CosyVoice3Constants.Files.llmPrefill)") - // Stateful decode MUST run on `.cpuAndGPU`: - // - ANE refuses to compile the stateful graph (same failure mode - // as Flow: `MILCompilerForANE ANECCompile() FAILED`), so - // `.cpuAndNE` / `.all` deadlock load + // Stateless decode MUST run on `.cpuAndGPU`: + // - ANE refuses to compile the rotary + sliced SDPA decode graph + // (same failure mode as Flow: `MILCompilerForANE ANECCompile() + // FAILED`), so `.cpuAndNE` / `.all` deadlock load // - CPU-only works but is ~2× slower than the GPU path // Ignore the user-supplied `computeUnits` for decode. let decodeConfig = MLModelConfiguration() @@ -98,7 +99,25 @@ public actor CosyVoice3ModelStore { let flow = try await compileAndLoad(flowURL, configuration: flowConfig) logger.info("Loaded \(CosyVoice3Constants.Files.flow)") - let hift = try await compileAndLoad(hiftURL, configuration: config) + // HiFT runs on `.cpuAndGPU` (fp16). With `.cpuAndNeuralEngine` + // CoreML's planner placed most of HiFT on ANE but kept at least + // one op (`HiFT-T500-fp16_main__Op104`) on the BNNS CPU path, + // which trips a hard async-dispatch watchdog mid-corpus on + // long phrases: + // + // E5RT: Submit Async failed for [3:29]: Async task: + // HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out. + // @ CancelTimedOutAsyncTask_block_invoke + // + // Pinning HiFT to `.cpuAndGPU` removes the ANE+BNNS mixed-compute + // pathology (the same family of issue that already forced Flow + // and Decode off ANE above). The model is fixed-shape + // [1, 80, 500] so GPU placement is predictable. Trade-off: a + // small per-call latency increase vs. 
ANE — acceptable, since + // the prior ANE config didn't actually complete the corpus. + let hiftConfig = MLModelConfiguration() + hiftConfig.computeUnits = .cpuAndGPU + let hift = try await compileAndLoad(hiftURL, configuration: hiftConfig) logger.info("Loaded \(CosyVoice3Constants.Files.hift)") loadedModels = CosyVoice3Models(prefill: prefill, decode: decode, flow: flow, hift: hift) diff --git a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift index b0a46f93..094879e0 100644 --- a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift +++ b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3Constants.swift @@ -4,7 +4,7 @@ import Foundation /// /// Shipping config (frozen): /// - LLM-Prefill-T256-M768-fp16 (cpuAndNeuralEngine) -/// - LLM-Decode-M768-fp16-stateful (cpuAndGPU — see note) +/// - LLM-Decode-M768-fp16 (cpuAndGPU — see note) /// - Flow-N250-fp16 (cpuAndGPU — an ANE-port /// BC1S rewrite was attempted and reverted: the converted graph ran /// ~3× faster but numerically broken (mel dynamic range collapsed @@ -15,14 +15,22 @@ import Foundation /// `input_embed.conv_pos_embed` (`Conv1d(1024,1024,k=31)+Mish`) /// that three rewrite attempts couldn't move — ANEF rejects the /// conv footprint regardless of group count.) -/// - HiFT-T500-fp16 (cpuAndNeuralEngine) +/// - HiFT-T500-fp16 (cpuAndGPU — pinned off +/// ANE because the `.cpuAndNeuralEngine` planner left at least one +/// op on the BNNS CPU path, which tripped a hard async-dispatch +/// watchdog mid-corpus on long phrases: +/// `E5RT: Submit Async failed ... HiFT-T500-fp16_main__Op104_BnnsCpuInference +/// has timed out`. GPU placement is deterministic and avoids the +/// ANE+BNNS mixed-compute pathology.) /// -/// The stateful decode model uses per-layer `MLState` buffers for the -/// KV cache (48 tensors, `[1, 2, 768, 64]` fp16 each) instead of -/// round-tripping 18 MB of kv_k / kv_v MLMultiArrays every step. 
ANE -/// refuses to compile the stateful graph (`MILCompilerForANE -/// ANECCompile() FAILED`); decode therefore runs on `.cpuAndGPU`. -/// Requires macOS 15 / iOS 18. +/// Decode runs **stateless** with an external KV cache: prefill emits +/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32, and decode +/// accepts the same tensors as inputs and returns `kv_k_out` / `kv_v_out` +/// at the same shape/dtype. The cache is round-tripped once per step +/// (≈18 MB total). ANE still rejects this graph (`MILCompilerForANE +/// ANECCompile() FAILED` on the rotary + sliced SDPA), so decode is +/// pinned to `.cpuAndGPU`. The library floor is macOS 14 / iOS 17 — no +/// MLState dependency. public enum CosyVoice3Constants { // MARK: - LLM shapes @@ -66,8 +74,8 @@ public enum CosyVoice3Constants { public enum Files { public static let llmPrefill = "LLM-Prefill-T256-M768-fp16.mlpackage" public static let llmPrefillSubdir = "llm-fp16" - public static let llmDecode = "LLM-Decode-M768-fp16-stateful.mlpackage" - public static let llmDecodeSubdir = "llm-fp16-stateful" + public static let llmDecode = "LLM-Decode-M768-fp16.mlpackage" + public static let llmDecodeSubdir = "llm-fp16-decode" public static let flow = "Flow-N250-fp16.mlpackage" public static let flowSubdir = "flow-fp16-n250" public static let hift = "HiFT-T500-fp16.mlpackage" diff --git a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift index d71f3ea6..7d76a45b 100644 --- a/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift +++ b/Sources/FluidAudio/TTS/CosyVoice3/CosyVoice3TtsManager.swift @@ -38,11 +38,9 @@ import Foundation /// the 281 runtime-added special tokens (CosyVoice3Tokenizer). Same format /// that `tokenizer_fixture.json` dumps under its `special_tokens` key. /// -/// > Note: Gated to macOS 15 / iOS 18 because the underlying -/// > `CosyVoice3Synthesizer` uses CoreML `MLState` for the decode KV cache. 
-/// > Other FluidAudio modules (ASR, Diarization, VAD, Kokoro, PocketTTS) -/// > remain available on macOS 14 / iOS 17. -@available(macOS 15, iOS 18, *) +/// > Available on the same floor as the rest of FluidAudio (macOS 14 / +/// > iOS 17). Decode runs stateless with an external KV cache rather than +/// > `MLState`, so no extra OS gate is required. public actor CosyVoice3TtsManager { private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3TtsManager") @@ -216,9 +214,60 @@ public actor CosyVoice3TtsManager { normalized = CosyVoice3ChineseNormalizer.normalize(text) } + // Auto-chunk long input under the structural 250-token Flow cap. + // The chunker greedily splits on hard sentence enders + soft clause + // separators when the running speech-token estimate exceeds budget; + // short inputs return a single chunk and take the fast path. Caller + // can opt out via `options.disableAutoChunking` for pre-segmented + // input (e.g. UI-driven streaming). + let chunks: [String] + if options.disableAutoChunking { + chunks = [normalized] + } else { + let split = CosyVoice3TextChunker.chunk(normalized) + chunks = split.isEmpty ? 
[normalized] : split + } + + if chunks.count == 1 { + return try await synthesizeChunk( + text: chunks[0], promptAssets: promptAssets, + options: options, frontend: frontend, synthesizer: synthesizer) + } + + logger.info( + "Auto-chunking long input into \(chunks.count) segments to fit " + + "the 250-token Flow cap (estimated speech tokens: " + + "\(CosyVoice3TextChunker.estimateSpeechTokens(normalized))).") + var results: [CosyVoice3SynthesisResult] = [] + results.reserveCapacity(chunks.count) + for (i, chunk) in chunks.enumerated() { + logger.info( + " chunk \(i + 1)/\(chunks.count): " + + "\(chunk.count) chars, ~" + + "\(CosyVoice3TextChunker.estimateSpeechTokens(chunk)) speech tokens") + let r = try await synthesizeChunk( + text: chunk, promptAssets: promptAssets, + options: options, frontend: frontend, synthesizer: synthesizer) + results.append(r) + } + return Self.mergeChunkedResults(results) + } + + // MARK: - Chunked synthesis helpers + + /// Single-call synthesis path: tokenize/normalize-aware text → fixture + /// adapter → synthesizer. Shared between the fast (1-chunk) and chunked + /// (N-chunk) paths in `synthesize(...)`. + private func synthesizeChunk( + text: String, + promptAssets: CosyVoice3PromptAssets, + options: CosyVoice3SynthesisOptions, + frontend: CosyVoice3TextFrontend, + synthesizer: CosyVoice3Synthesizer + ) async throws -> CosyVoice3SynthesisResult { let assembled = try frontend.assemble( promptText: promptAssets.promptText, - ttsText: normalized, + ttsText: text, promptSpeechIds: promptAssets.promptSpeechIds) let lmInputEmbedsFlat = try Self.flattenLmEmbeds( @@ -246,6 +295,72 @@ public actor CosyVoice3TtsManager { return try await synthesizer.synthesize(fixture: fixture, options: parityOptions) } + /// Concatenate per-chunk results into a single `CosyVoice3SynthesisResult`. + /// Audio is stitched with a short cosine cross-fade (`crossfadeMs`) at + /// each boundary to mask DC/phase mismatch from independent synth calls. 
+ /// `finishedOnEos` is `true` only when every chunk ended naturally + /// (so callers can still detect mid-segment truncation downstream). + private static func mergeChunkedResults( + _ results: [CosyVoice3SynthesisResult], + crossfadeMs: Double = 8 + ) -> CosyVoice3SynthesisResult { + precondition(!results.isEmpty, "mergeChunkedResults requires ≥1 result") + let sampleRate = results[0].sampleRate + let samples = concatWithCrossfade( + results.map { $0.samples }, + sampleRate: sampleRate, + fadeMs: crossfadeMs) + let totalGenerated = results.reduce(0) { $0 + $1.generatedTokenCount } + var allDecoded: [Int32] = [] + allDecoded.reserveCapacity(totalGenerated) + for r in results { allDecoded.append(contentsOf: r.decodedTokens) } + let allEos = results.allSatisfy { $0.finishedOnEos } + return CosyVoice3SynthesisResult( + samples: samples, + sampleRate: sampleRate, + generatedTokenCount: totalGenerated, + decodedTokens: allDecoded, + finishedOnEos: allEos) + } + + /// Concatenate PCM chunks with a cosine cross-fade at each boundary. + /// Fade window is the shorter of `fadeMs` and `min(prev.tail, next.head) + /// / 2`, so very short chunks degrade gracefully (no overlap consuming + /// the entire chunk). + static func concatWithCrossfade( + _ chunks: [[Float]], + sampleRate: Int, + fadeMs: Double + ) -> [Float] { + guard !chunks.isEmpty else { return [] } + let nominalFade = max(0, Int((Double(sampleRate) * fadeMs / 1000).rounded())) + var out: [Float] = chunks[0] + for i in 1.. = [ + "。", "!", "?", ".", "!", "?", "\n", + ] + + /// Clause-internal punctuation. Commit only when the running token + /// count is at or above the budget — soft splits should be preferred + /// over force-splits but not preferred over hard enders. + private static let softEnders: Set = [ + ",", "、", ";", ":", ";", ",", " ", + ] + + /// Default speech-token budget per chunk. 
Keeps a ~45-token margin + /// under the typical room-for-new of ~155 (= `flowTotalTokens=250` + /// minus a typical prompt of ~95 tokens). The 30-token force-split + /// overshoot may push committed chunks to ~140 estimated, still under + /// the structural cap. + public static let defaultMaxSpeechTokens: Int = 110 + + /// Split `text` into chunks each estimated to produce ≤ + /// `maxSpeechTokens` LLM speech tokens. Returns `[text]` (single + /// chunk) when the input already fits. Returns `[]` when `text` is + /// empty or whitespace-only. + public static func chunk( + _ text: String, + maxSpeechTokens: Int = defaultMaxSpeechTokens + ) -> [String] { + let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines) + guard !trimmed.isEmpty else { return [] } + if estimateSpeechTokens(trimmed) <= maxSpeechTokens { + return [trimmed] + } + + var chunks: [String] = [] + var current = "" + for ch in trimmed { + current.append(ch) + let tokensSoFar = estimateSpeechTokens(current) + + if hardEnders.contains(ch) { + let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines) + if !pruned.isEmpty { chunks.append(pruned) } + current = "" + continue + } + if tokensSoFar >= maxSpeechTokens && softEnders.contains(ch) { + let pruned = current.trimmingCharacters(in: .whitespacesAndNewlines) + if !pruned.isEmpty { chunks.append(pruned) } + current = "" + continue + } + // Force-split if no punctuation has appeared within a 30-token + // overshoot. Prefer the most recent whitespace; fall back to + // hard-cut at the current position. Hard-cut on continuous CJK + // (no whitespace) is rare in normalized input but can happen + // when the normalizer collapses spaces. + if tokensSoFar >= maxSpeechTokens + 30 { + if let lastSpace = current.lastIndex(where: { $0 == " " }), + lastSpace != current.startIndex + { + let head = String(current[.. 
Int { + var total = 0.0 + for scalar in s.unicodeScalars { + if isCJK(scalar) { + total += 7.5 + } else if scalar.isASCII { + total += 1.5 + } else { + total += 2.5 + } + } + return Int(total.rounded()) + } + + private static func isCJK(_ scalar: Unicode.Scalar) -> Bool { + let v = scalar.value + // CJK Unified Ideographs (the bulk of zh/yue text) + if (0x4E00...0x9FFF).contains(v) { return true } + // CJK Unified Ideographs Extension A + if (0x3400...0x4DBF).contains(v) { return true } + // Hiragana + if (0x3040...0x309F).contains(v) { return true } + // Katakana + if (0x30A0...0x30FF).contains(v) { return true } + // Hangul Syllables + if (0xAC00...0xD7AF).contains(v) { return true } + return false + } +} diff --git a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift index 83a00b37..2402d0cf 100644 --- a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift +++ b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift @@ -7,11 +7,12 @@ import Foundation /// implemented as a method on this type, keeping the state (KV cache, running /// decoded list) local to a single synthesis call. /// -/// Decode uses CoreML `MLState` (macOS 15 / iOS 18): 48 per-layer buffers -/// (`kv_k_0..kv_k_23`, `kv_v_0..kv_v_23`) replace the 18 MB kv_k / kv_v -/// round-trip per step. Prefill remains non-stateful and its `kv_k` / `kv_v` -/// outputs seed the decode state once after prefill. -@available(macOS 15, iOS 18, *) +/// Decode is **stateless** with an external KV cache. Prefill emits +/// `kv_k` / `kv_v` of shape `[24, 1, 2, 768, 64]` fp32; decode accepts those +/// same tensors as inputs and returns updated `kv_k_out` / `kv_v_out` at +/// the same shape/dtype. We round-trip the cache once per step (≈18 MB +/// total) and bind the previous step's outputs as the next step's inputs. 
+/// No `MLState` dependency — runs on macOS 14 / iOS 17. public actor CosyVoice3Synthesizer { private let logger = AppLogger(subsystem: "com.fluidaudio.tts", category: "CosyVoice3Synthesizer") @@ -19,6 +20,18 @@ public actor CosyVoice3Synthesizer { private let models: CosyVoice3Models private let embeddings: CosyVoice3SpeechEmbeddings + /// Set to `false` once `LLM-Decode-M768-fp16` rejects pre-allocated + /// `outputBackings` (model exported without explicit MultiArray + /// shape/dtype constraints on its `kv_k_out` / `kv_v_out` / + /// `speech_logits` outputs). Latched off so we don't throw + catch on + /// every one of ~163 AR decode steps per phrase. Same pattern as + /// `MagpieKvCache.useOutputBackings`. + private var useOutputBackings: Bool = true + + /// One-shot flag for "fast path engaged" log message; only emitted on + /// the first successful `outputBackings` prediction so we don't spam. + private var loggedFastPath: Bool = false + public init(models: CosyVoice3Models, embeddings: CosyVoice3SpeechEmbeddings) { self.models = models self.embeddings = embeddings @@ -46,16 +59,52 @@ public actor CosyVoice3Synthesizer { sampler.seedTokens(fixture.decodedTokens) } - // 1) Prefill (non-stateful: returns kv_k / kv_v as outputs) + // 1) Prefill (returns kv_k / kv_v as fp32 outputs) let tPrefill = Date() let (prefillLogits, initialKvK, initialKvV) = try await runPrefill(fixture: fixture) let prefillSec = Date().timeIntervalSince(tPrefill) - // Seed decode MLState from prefill kv_k / kv_v. - let tSeed = Date() - let state = models.decode.makeState() - try seedDecodeState(state: state, kvK: initialKvK, kvV: initialKvV) - let seedSec = Date().timeIntervalSince(tSeed) + // External KV cache with **double-buffered outputBackings**: prefill's + // `kv_k` / `kv_v` (shape `[24, 1, 2, 768, 64]` fp32, ~9 MB each) feed + // the first decode step. 
Subsequent steps rotate between two + // pre-allocated buffer pairs (A/B) bound as the model's + // `kv_k_out` / `kv_v_out` outputs. Same pattern as + // `MagpieKvCache.swapBackings()` — eliminates ~36 MB of host + // alloc/dealloc per decode step (×163 steps ≈ 5.9 GB churn per + // phrase). `speech_logits` is also pre-bound so we avoid a fresh + // 27 KB allocation each step. CoreML rejects this when the model + // was exported without explicit MultiArray shape/dtype constraints + // on its outputs; in that case we latch `useOutputBackings = false` + // and fall back to per-step allocation for the rest of the run. + let kvShape: [NSNumber] = [ + NSNumber(value: CosyVoice3Constants.numLayers), + 1, + NSNumber(value: CosyVoice3Constants.kvHeads), + NSNumber(value: CosyVoice3Constants.kvMaxLength), + NSNumber(value: CosyVoice3Constants.headDim), + ] + let kvKBackA = try MLMultiArray(shape: kvShape, dataType: .float32) + let kvVBackA = try MLMultiArray(shape: kvShape, dataType: .float32) + let kvKBackB = try MLMultiArray(shape: kvShape, dataType: .float32) + let kvVBackB = try MLMultiArray(shape: kvShape, dataType: .float32) + let logitsBacking = try MLMultiArray( + shape: [1, 1, NSNumber(value: CosyVoice3Constants.speechVocab)], + dataType: .float32) + + // Pointer-rotation triple. `frontKvK/V` are read by the next step; + // `backKvK/V` receive the next step's writes; `spareKvK/V` are the + // pre-allocated set ready to become `back` after rotation. Initial + // `front` is the prefill output; we don't reuse those buffers as + // `spare`/`back` — once decode step 1 finishes, `front` becomes A + // (just-written), `back` becomes B (next write target), `spare` + // becomes A's previous contents (which we drop, since prefill + // output is single-use). 
+ var frontKvK: MLMultiArray = initialKvK + var frontKvV: MLMultiArray = initialKvV + var backKvK: MLMultiArray = kvKBackA + var backKvV: MLMultiArray = kvVBackA + var spareKvK: MLMultiArray = kvKBackB + var spareKvV: MLMultiArray = kvVBackB // Reusable per-step inputs for decode. `curLenArr` is mutated in place // each step; `inputsEmbedsArr` is overwritten by memcpy per step. @@ -64,6 +113,12 @@ public actor CosyVoice3Synthesizer { shape: [1, 1, NSNumber(value: CosyVoice3Constants.embedDim)], dataType: .float32) + // Logits scratch reused across all decode steps. The hot loop + // memcpy's into this from `logitsBacking` (or strided-gathers from a + // freshly-allocated array on the slow path). + var logitsScratch = [Float]( + repeating: 0, count: CosyVoice3Constants.speechVocab) + // First token from prefill tail logits. var decoded: [Int32] = [] let firstLogits = sliceLastStepLogits( @@ -81,31 +136,82 @@ public actor CosyVoice3Synthesizer { } decoded.append(topId) - // 2) Decode loop + // 2) Decode loop (stateless, external cache, double-buffered backings) var curLen = fixture.tPre var decodeSteps = 0 + var hitEos = false let tDecode = Date() for step in 1.. 
[Float] { + frontKvK: MLMultiArray, + frontKvV: MLMultiArray, + backKvK: MLMultiArray, + backKvV: MLMultiArray, + logitsBacking: MLMultiArray, + logits: inout [Float] + ) throws { let features: [String: Any] = [ "inputs_embeds": inputsEmbeds, "cur_len": curLen, + "kv_k": frontKvK, + "kv_v": frontKvV, ] let provider = try MLDictionaryFeatureProvider(dictionary: features) - let output = try models.decode.prediction(from: provider, using: state) - guard - let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue - else { - throw CosyVoice3Error.predictionFailed("decode: missing speech_logits") + var fastPathSucceeded = false + if useOutputBackings { + let opts = MLPredictionOptions() + opts.outputBackings = [ + "kv_k_out": backKvK, + "kv_v_out": backKvV, + "speech_logits": logitsBacking, + ] + do { + _ = try models.decode.prediction(from: provider, options: opts) + Self.readLogits(from: logitsBacking, into: &logits) + if !loggedFastPath { + logger.info( + "LLM-Decode outputBackings accepted; double-buffered " + + "AR loop active") + loggedFastPath = true + } + fastPathSucceeded = true + } catch { + // CoreML refused our pre-allocated backings — typically + // because `LLM-Decode-M768-fp16.mlpackage` was exported + // without explicit MultiArray shape/dtype constraints on + // its outputs. Latch the flag off so we don't throw + catch + // on every one of ~163 steps for the rest of the corpus. + // Warning level so it shows in release builds — this is a + // perf regression worth surfacing to anyone running with a + // re-exported model. + useOutputBackings = false + logger.warning( + "LLM-Decode outputBackings rejected " + + "(\(error.localizedDescription)); switching to " + + "fresh-alloc fallback for the rest of the run") + } } - // logits shape = [1, 1, 6761] fp32; strides may be non-compact. 
+ + if !fastPathSucceeded { + // Slow path: per-step CoreML allocation, then memcpy outputs + // into the pre-allocated backings so the front/back rotation + // protocol still works after this call. + let output = try models.decode.prediction(from: provider) + guard + let logitsArr = output.featureValue(for: "speech_logits")?.multiArrayValue, + let kvKOutArr = output.featureValue(for: "kv_k_out")?.multiArrayValue, + let kvVOutArr = output.featureValue(for: "kv_v_out")?.multiArrayValue + else { + throw CosyVoice3Error.predictionFailed( + "decode: missing speech_logits / kv_k_out / kv_v_out") + } + try Self.copyKvOutput(kvKOutArr, into: backKvK, name: "kv_k_out") + try Self.copyKvOutput(kvVOutArr, into: backKvV, name: "kv_v_out") + Self.readLogits(from: logitsArr, into: &logits) + } + } + + /// Read a `[1, 1, 6761]` fp32 logits MLMultiArray into `dst`. Honors the + /// last-dim stride (CoreML may emit non-compact strides on aligned + /// allocations) — uses `memcpy` when stride==1, strided gather otherwise. + private static func readLogits(from arr: MLMultiArray, into dst: inout [Float]) { let count = CosyVoice3Constants.speechVocab - var logits = [Float](repeating: 0, count: count) - let strides = logitsArr.strides.map { $0.intValue } + let strides = arr.strides.map { $0.intValue } let vocabStride = strides.last ?? 
1 - let base = logitsArr.dataPointer.bindMemory(to: Float.self, capacity: logitsArr.count) - for i in 0...size) + } + } else { + for i in 0.., - srcLayerBase: Int, - srcHStride: Int, srcMStride: Int, srcDStride: Int, - dst: UnsafeMutablePointer, - dstHStride: Int, dstMStride: Int, dstDStride: Int, - H: Int, M: Int, D: Int - ) { - for h in 0...size + memcpy(dst.dataPointer, src.dataPointer, bytes) } private func runFlow( diff --git a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift index 14cbf532..54aabe80 100644 --- a/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift +++ b/Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift @@ -10,6 +10,24 @@ public struct CosyVoice3SynthesisResult: Sendable { public let generatedTokenCount: Int /// Decoded speech token ids (useful for debugging + round-trip). public let decodedTokens: [Int32] + /// `true` when the LLM-Decode AR loop ended on an EOS token in + /// `CosyVoice3Constants.stopRange` (natural termination); `false` when + /// the loop exhausted its decode budget (`flowTotalTokens - nPrompt`) + /// without observing EOS — the audio is truncated mid-utterance. + /// See the `.warning`-level log emitted from `CosyVoice3Synthesizer` + /// when this is `false`. + public let finishedOnEos: Bool + + public init( + samples: [Float], sampleRate: Int, generatedTokenCount: Int, + decodedTokens: [Int32], finishedOnEos: Bool + ) { + self.samples = samples + self.sampleRate = sampleRate + self.generatedTokenCount = generatedTokenCount + self.decodedTokens = decodedTokens + self.finishedOnEos = finishedOnEos + } } /// Options controlling a CosyVoice3 parity / synthesis call. @@ -42,9 +60,20 @@ public struct CosyVoice3SynthesisOptions: Sendable { public let maxNewTokens: Int? /// Sampler seed for the top-p/top-k + multinomial fallback path. 
public let seed: UInt64 + /// When `true`, skips `CosyVoice3TextChunker.chunk(...)` and runs a + /// single synthesizer call regardless of input length. Useful for + /// callers that pre-segment input themselves (e.g. UI-driven streaming + /// per sentence). The structural 250-token Flow cap still applies and + /// long inputs will truncate mid-utterance with a `.warning` log. + public let disableAutoChunking: Bool - public init(maxNewTokens: Int? = nil, seed: UInt64 = 42) { + public init( + maxNewTokens: Int? = nil, + seed: UInt64 = 42, + disableAutoChunking: Bool = false + ) { self.maxNewTokens = maxNewTokens self.seed = seed + self.disableAutoChunking = disableAutoChunking } } diff --git a/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift b/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift index 1acbea77..6171e44b 100644 --- a/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift +++ b/Sources/FluidAudio/TTS/KokoroAne/Pipeline/KokoroAneModelStore.swift @@ -42,6 +42,39 @@ public struct KokoroAneComputeUnits: Sendable, Equatable { prosody: .cpuAndGPU, noise: .cpuAndGPU, vocoder: .cpuAndGPU, tail: .cpuAndGPU ) + /// Force every stage onto `.cpuAndNeuralEngine`. Stages that hit + /// ANE-incompatible ops will fall back to CPU silently — included + /// for the benchmark sweep (efficiency vs. latency comparison). + public static let allAne = KokoroAneComputeUnits( + albert: .cpuAndNeuralEngine, postAlbert: .cpuAndNeuralEngine, + alignment: .cpuAndNeuralEngine, prosody: .cpuAndNeuralEngine, + noise: .cpuAndNeuralEngine, vocoder: .cpuAndNeuralEngine, + tail: .cpuAndNeuralEngine + ) + + /// CPU-only (no ANE, no GPU). Slowest but most predictable; useful + /// as a debugging / fallback baseline. 
+ public static let cpuOnly = KokoroAneComputeUnits( + albert: .cpuOnly, postAlbert: .cpuOnly, alignment: .cpuOnly, + prosody: .cpuOnly, noise: .cpuOnly, vocoder: .cpuOnly, tail: .cpuOnly + ) + + /// Build a configuration from a generic preset (used by the + /// `tts-benchmark` CLI so a single flag maps cleanly across + /// backends). + public init(preset: TtsComputeUnitPreset) { + switch preset { + case .default: + self = .default + case .allAne: + self = .allAne + case .cpuAndGpu: + self = .cpuAndGpu + case .cpuOnly: + self = .cpuOnly + } + } + func units(for stage: KokoroAneStage) -> MLComputeUnits { switch stage { case .albert: return albert diff --git a/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift b/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift index 68cc42bb..7f791e34 100644 --- a/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift +++ b/Sources/FluidAudio/TTS/Magpie/MagpieTtsManager.swift @@ -75,6 +75,11 @@ public actor MagpieTtsManager { public func initialize() async throws { if synthesizer != nil { return } + logger.warning( + "Magpie TTS is experimental / beta. Synthesis is below real-time " + + "(agg-RTFx ~0.41× on M2 for the MiniMax-English corpus) — " + + "see Documentation/TTS/Magpie.md.") + let store = MagpieModelStore( directory: directory, computeUnits: computeUnits, diff --git a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift index addd5b60..ee3a816f 100644 --- a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift +++ b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieKvCache.swift @@ -54,6 +54,15 @@ public final class MagpieKvCache { public private(set) var cachesV: [MLMultiArray] public private(set) var positions: [MLMultiArray] + /// Set to `false` once `decoder_step.mlmodelc` rejects `outputBackings` + /// (e.g. 
when the model was exported without explicit MultiArray + /// shape/dtype constraints on its KV outputs). The rejection is a static + /// property of the model, so once it fails we permanently skip the fast + /// path and go straight to the fresh-alloc fallback to avoid throwing + + /// catching an exception on every one of the ~500 AR decode steps per + /// utterance. + public var useOutputBackings: Bool = true + /// Back-buffer set for double-buffered AR loop. Used as `outputBackings` so /// CoreML writes new K/V/pos straight into our pre-allocated arrays instead /// of allocating ~18.9 MB of fresh fp16 buffers per step. After each diff --git a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift index 68321cc6..492a2eeb 100644 --- a/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift +++ b/Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpieSynthesizer.swift @@ -769,14 +769,73 @@ public actor MagpieSynthesizer { // step. The cache provides 24 K/V + 12 position back-buffers, the // synthesizer provides the 1 hidden buffer. After the call, // `swapBackings` promotes back→front for the next step's inputs. - var backings: [String: Any] = [:] - cache.addOutputBackings(to: &backings) - backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking - let predOpts = MLPredictionOptions() - predOpts.outputBackings = backings + // + // If a previous step already proved that this model was exported + // without explicit MultiArray shape/dtype constraints on its KV + // outputs, `cache.useOutputBackings` is `false` and we skip the + // fast path entirely. This avoids the per-step throw/catch overhead + // and debug-log spam across the entire AR loop (~500 iterations). 
+ var fastPathSucceeded = false + if cache.useOutputBackings { + var backings: [String: Any] = [:] + cache.addOutputBackings(to: &backings) + backings[MagpieKvCache.decoderHiddenKey] = hiddenBacking + let predOpts = MLPredictionOptions() + predOpts.outputBackings = backings - _ = try model.prediction(from: provider, options: predOpts) - cache.swapBackings() + do { + _ = try model.prediction(from: provider, options: predOpts) + cache.swapBackings() + fastPathSucceeded = true + } catch { + // CoreML refused our pre-allocated outputBackings — typically + // because `decoder_step.mlmodelc` was exported without + // explicit MultiArray shape/dtype constraints on its KV + // outputs, so the runtime can't validate the buffer layout + // and bails with + // "Output feature (null) doesn't support output backing + // because it doesn't have a MultiArray constraints." + // The rejection is a static property of the model, so latch + // the cache flag off to skip the fast path on every + // subsequent step (avoids ~500 throw/catch + log lines per + // utterance). + cache.useOutputBackings = false + logger.debug( + "decoder_step outputBackings rejected " + + "(\(error.localizedDescription)); switching to " + + "fresh-alloc fallback for the rest of the run") + } + } + + if !fastPathSucceeded { + // Slow path: re-run without `outputBackings`, route the + // freshly-allocated K/V/pos through `MagpieKvCache.absorbOutputs` + // (which replaces front pointers directly), and copy the hidden + // state into `hiddenBacking` so the rest of this function works + // unchanged. Costs ~18.9 MB of fresh fp16 allocation per step; + // proper fix is to re-export `decoder_step.mlmodelc` with + // shape/dtype constraints on `new_k_*`/`new_v_*`/`var_*`. + let output = try model.prediction(from: provider) + try cache.absorbOutputs(output) + guard + let hidden = output.featureValue(for: MagpieKvCache.decoderHiddenKey)? 
+ .multiArrayValue + else { + throw MagpieError.inferenceFailed( + stage: "decoder_step", + underlying: + "missing hidden output key \(MagpieKvCache.decoderHiddenKey)") + } + guard hidden.dataType == .float16, hidden.count == hiddenBacking.count else { + throw MagpieError.inferenceFailed( + stage: "decoder_step", + underlying: + "decoder hidden mismatch (dtype=\(hidden.dataType.rawValue) " + + "count=\(hidden.count) expected=\(hiddenBacking.count))") + } + let bytes = hiddenBacking.count * MemoryLayout.size + memcpy(hiddenBacking.dataPointer, hidden.dataPointer, bytes) + } // Hidden state lives in `hiddenBacking` after the call. Convert fp16 // → fp32 via vImage into a fresh [Float] result buffer (the sampler diff --git a/Sources/FluidAudio/TTS/Shared/TtsComputeUnitPreset.swift b/Sources/FluidAudio/TTS/Shared/TtsComputeUnitPreset.swift new file mode 100644 index 00000000..33744942 --- /dev/null +++ b/Sources/FluidAudio/TTS/Shared/TtsComputeUnitPreset.swift @@ -0,0 +1,72 @@ +@preconcurrency import CoreML +import Foundation + +/// Generic compute-unit preset shared across TTS backends. +/// +/// Each backend keeps its own per-stage `ComputeUnits` struct +/// because stage names differ (Kokoro ANE has 7 stages, PocketTTS has 4 +/// CoreML models, StyleTTS2 has 4 models, etc.). This preset is the +/// uniform knob the benchmarking harness flips so a single CLI flag +/// (`--compute-units default|all-ane|cpu-and-gpu|cpu-only`) maps to a +/// sensible per-stage assignment on every backend. +/// +/// Backends opt in by adding `init(preset: TtsComputeUnitPreset)` to +/// their compute-units struct (see `KokoroAneComputeUnits` for the +/// reference implementation). +public enum TtsComputeUnitPreset: String, Sendable, CaseIterable { + + /// The backend's empirically-tuned default — typically a mix of + /// ANE-friendly and CPU+GPU stages chosen by the conversion author. + case `default` + + /// Force every stage to `.cpuAndNeuralEngine`. 
Worst case for stages + /// that fall back to CPU on ANE-incompatible ops, but the most + /// energy-efficient when ops are ANE-clean. + case allAne + + /// Force every stage to `.cpuAndGPU`. Skips the ANE entirely; + /// useful as a latency baseline when the ANE compile cache is cold + /// (no `anecompilerservice` time on first call). + case cpuAndGpu + + /// Force every stage to `.cpuOnly`. Fallback / debugging baseline; + /// every backend should at least run here, however slowly. + case cpuOnly + + /// Concrete `MLComputeUnits` for "force every stage to X" presets. + /// Returns `nil` for `.default`, which means "let the backend keep + /// its empirical mapping". + public var uniformUnits: MLComputeUnits? { + switch self { + case .default: return nil + case .allAne: return .cpuAndNeuralEngine + case .cpuAndGpu: return .cpuAndGPU + case .cpuOnly: return .cpuOnly + } + } + + /// Parse the CLI flag value (`default`, `all-ane`, `cpu-and-gpu`, + /// `cpu-only`). Returns `nil` for unrecognised values so callers + /// can surface a usage error. + public init?(cliValue: String) { + switch cliValue.lowercased() { + case "default": self = .default + case "all-ane", "ane", "neural-engine": self = .allAne + case "cpu-and-gpu", "cpuandgpu", "gpu": self = .cpuAndGpu + case "cpu-only", "cpu", "cpuonly": self = .cpuOnly + default: return nil + } + } + + /// Canonical kebab-case form, matching the CLI flag values the + /// `init?(cliValue:)` parser accepts. Use this for log lines and + /// JSON reports so values round-trip back through the parser. 
+ public var cliValue: String { + switch self { + case .default: return "default" + case .allAne: return "all-ane" + case .cpuAndGpu: return "cpu-and-gpu" + case .cpuOnly: return "cpu-only" + } + } +} diff --git a/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift b/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift index 8e21ec80..8b912c1f 100644 --- a/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift +++ b/Sources/FluidAudio/TTS/StyleTTS2/Assets/StyleTTS2Vocab.swift @@ -94,4 +94,25 @@ public struct StyleTTS2Vocab: Sendable { } return ids } + + /// Diagnostic encode: same logic as `encode(_:)` but also returns a + /// frequency map of every scalar that fell off the floor because no + /// vocab entry exists for it. Used by the StyleTTS2 CLI's + /// `--tokenize-only` mode to quantify the misaki ↔ espeak inventory + /// gap without actually invoking the diffusion pipeline. + public func encodeWithReport( + _ phonemes: String + ) -> (ids: [Int32], dropped: [Unicode.Scalar: Int]) { + var ids: [Int32] = [] + ids.reserveCapacity(phonemes.unicodeScalars.count) + var dropped: [Unicode.Scalar: Int] = [:] + for scalar in phonemes.unicodeScalars { + if let id = map[Character(scalar)] { + ids.append(id) + } else { + dropped[scalar, default: 0] += 1 + } + } + return (ids, dropped) + } } diff --git a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift index f30424fc..12436aa2 100644 --- a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift +++ b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Phonemizer.swift @@ -4,14 +4,30 @@ import Foundation /// /// For English (`.americanEnglish`), uses the in-tree `G2PModel` (BART /// encoder-decoder, misaki-style IPA) and remaps the misaki conventions to -/// the espeak-ng convention that StyleTTS2's LibriTTS checkpoint expects: +/// the espeak-ng convention that StyleTTS2's LibriTTS checkpoint expects. 
+/// +/// **Per-piece (single glyph) remap** — applied as misaki emits each piece: /// /// misaki → espeak-ng /// A → eɪ I → aɪ O → oʊ W → aʊ Y → ɔɪ /// ᵊ → ə (tiny-schwa offglide; not in StyleTTS2's 178-vocab) /// -/// Other glyphs (`ʤ`, `ʧ`, `ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in -/// the 178-token espeak-ng vocabulary and pass through. +/// **Post-pass (multi-glyph) remap** — applied to the assembled phoneme +/// string after every word has been emitted. Both the ligature and the +/// decomposed forms exist as distinct tokens in the 178-vocab, but the +/// LibriTTS checkpoint was trained against espeak-ng output, so the model's +/// embeddings for the misaki ligature glyphs (`ʧ`, `ʤ`) are essentially +/// untrained noise. Same story for the schwa+r digraphs that espeak collapses +/// into single rhotic vowels (`ɝ`, `ɚ`): +/// +/// misaki → espeak-ng word example +/// ʧ → tʃ choice → tʃˈɔɪs +/// ʤ → dʒ jumps → dʒˈʌmps +/// ɜɹ → ɝ (U+025D) girl → ɡˈɝl +/// əɹ → ɚ (U+025A) over → ˈoʊvɚ +/// +/// Other glyphs (`ˈ`, `ˌ`, `ð`, `θ`, `ɹ`, `ɾ`, etc.) are already in the +/// 178-token espeak-ng vocabulary and pass through unchanged. /// /// Non-English languages fall back to `MultilingualG2PModel` (CharsiuG2P /// ByT5). Output quality there is unvalidated — the LibriTTS checkpoint is @@ -46,6 +62,30 @@ public enum StyleTTS2Phonemizer { "ᵊ": "ə", ] + /// Post-pass multi-glyph remap applied to the assembled phoneme string + /// after all word pieces have been concatenated. Decomposes misaki's + /// affricate ligatures and collapses the schwa+r digraphs into the + /// single rhotic vowels espeak-ng emits — see the type-level docs for + /// rationale. Order matters only insofar as `əɹ` and `ɜɹ` must be + /// applied before any rule that would consume the trailing `ɹ` (none + /// exist today; left ordered for future-proofing).
+ private static let misakiToEspeakPostPass: [(String, String)] = [ + ("ʧ", "tʃ"), + ("ʤ", "dʒ"), + ("ɜɹ", "ɝ"), + ("əɹ", "ɚ"), + ] + + /// Apply `misakiToEspeakPostPass` rules to a phoneme string in order. + /// Exposed `internal` for unit tests. + internal static func applyEspeakPostPass(_ s: String) -> String { + var out = s + for (from, to) in misakiToEspeakPostPass { + out = out.replacingOccurrences(of: from, with: to) + } + return out + } + /// Convert raw text to an IPA phoneme string for StyleTTS2. /// /// - Parameters: @@ -87,6 +127,13 @@ public enum StyleTTS2Phonemizer { try await flushWord(&wordBuffer, language: language, into: &output) } + // Multi-glyph misaki → espeak normalization. Only meaningful for + // English (the LibriTTS checkpoint is English-only); skipping for + // other languages avoids touching CharsiuG2P output we don't have + // a model contract for. + if language == .americanEnglish { + output = applyEspeakPostPass(output) + } return output } diff --git a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift index 1a09163d..923b6e4f 100644 --- a/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift +++ b/Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift @@ -239,6 +239,13 @@ public actor StyleTTS2Synthesizer { /// Slice an MLMultiArray of shape `(1, leading, trailing)` to the first /// `take` entries along either the leading or trailing axis. Returns a /// flat row-major `[Float]`. + /// + /// Reads via `dataPointer` instead of `arr[idx].floatValue` and avoids + /// `arr.strides` entirely — both trigger + /// `E5RT: tensor_buffer has known strides while the model has + /// FlexibleShapeInfo` on `text_predictor`'s flex-shape outputs. CoreML + /// emits dense row-major buffers, so for shape `(1, leading, trailing)` + /// the flat index is simply `r * trailing + c`. 
private func sliceFirstAxis2D( arr: MLMultiArray, leading: Int, @@ -246,29 +253,50 @@ public actor StyleTTS2Synthesizer { take: Int, sliceDim: SliceDim ) -> [Float] { - let strides = arr.strides.map { $0.intValue } + let outCount: Int switch sliceDim { - case .leading: - // Result shape: (take, trailing). - var out = [Float](repeating: 0, count: take * trailing) - for r in 0.. Float) { + switch sliceDim { + case .leading: + // Result shape: (take, trailing). + for r in 0.. (samples: [Float], sampleRate: Int) { + guard isInitialized else { + throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized") + } + let voice = try StyleTTS2VoiceStyle.load(from: voiceStyleURL) + let (_, ids) = try await tokenize(text: text, language: language) + let options = StyleTTS2Synthesizer.Options( + diffusionSteps: diffusionSteps, + alpha: alpha, + beta: beta, + randomSeed: randomSeed + ) + let samples = try await synthesizer.synthesizeSamples( + ids: ids, voice: voice, options: options) + return (samples, StyleTTS2Constants.audioSampleRate) + } + /// Run the text frontend (preprocess → G2P → vocab encode) end-to-end. /// /// Available before the diffusion synthesizer is wired so callers can @@ -138,6 +166,27 @@ public actor StyleTTS2Manager { return (phonemes, ids) } + /// Diagnostic tokenize: same as `tokenize(text:language:)` but also + /// returns the per-scalar drop frequency from + /// `StyleTTS2Vocab.encodeWithReport`. Used by the CLI to quantify + /// how much of the misaki BART G2P output the espeak-ng-trained + /// 178-token vocab can actually consume. 
+ public func tokenizeWithReport( + text: String, + language: MultilingualG2PLanguage = .americanEnglish + ) async throws -> ( + phonemes: String, ids: [Int32], dropped: [Unicode.Scalar: Int] + ) { + guard isInitialized else { + throw StyleTTS2Error.modelNotFound("StyleTTS2 model not initialized") + } + let phonemes = try await StyleTTS2Phonemizer.phonemize( + text: text, language: language) + let vocab = try await modelStore.vocabulary() + let (ids, dropped) = vocab.encodeWithReport(phonemes) + return (phonemes, ids, dropped) + } + public func cleanup() { isInitialized = false } diff --git a/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift b/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift index 020a10f0..b1946d1f 100644 --- a/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift +++ b/Sources/FluidAudioCLI/Commands/CosyVoice3/ParityCommand.swift @@ -13,7 +13,6 @@ import Foundation /// --output .../build/swift_e2e.wav \ /// --seed 42 /// ``` -@available(macOS 15, iOS 18, *) enum CosyVoice3ParityCLI { private static let logger = AppLogger(category: "CosyVoice3ParityCLI") diff --git a/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift b/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift index cbf64d92..c4389ad5 100644 --- a/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift +++ b/Sources/FluidAudioCLI/Commands/CosyVoice3/TextCommand.swift @@ -19,7 +19,6 @@ import Foundation /// --output .../build/swift_cv3_text.wav \ /// --seed 42 /// ``` -@available(macOS 15, iOS 18, *) enum CosyVoice3TextCLI { private static let logger = AppLogger(category: "CosyVoice3TextCLI") diff --git a/Sources/FluidAudioCLI/Commands/MinimaxCorpusCommand.swift b/Sources/FluidAudioCLI/Commands/MinimaxCorpusCommand.swift new file mode 100644 index 00000000..1880ae50 --- /dev/null +++ b/Sources/FluidAudioCLI/Commands/MinimaxCorpusCommand.swift @@ -0,0 +1,234 @@ +#if os(macOS) +import FluidAudio +import Foundation + +/// Swift port of 
`Scripts/fetch_minimax_tts_corpus.py`. +/// +/// Fetches the MiniMax Multilingual TTS Test Set per-language `.txt` files +/// from HuggingFace and converts them to the FluidAudio TTS-benchmark +/// corpus format (strip `|` prefix, prepend a +/// header documenting source + revision + license). +/// +/// Reuses `DownloadUtils.fetchHuggingFaceFile` so we get the same auth +/// (HF_TOKEN env), retry, and backoff treatment as every other HF asset +/// pull in the project — no hardcoded URLs, no swift-transformers +/// dependency added just for one corpus fetch. +/// +/// Source dataset: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set +/// License: CC-BY-SA-4.0 +public enum MinimaxCorpusCommand { + + private static let logger = AppLogger(category: "MinimaxCorpusCommand") + + private static let repo = "MiniMaxAI/TTS-Multilingual-Test-Set" + + /// Pin to the initial public commit so re-runs reproduce the vendored + /// files. Matches `DEFAULT_REVISION` in the Python script. + private static let defaultRevision = "cb416f0ac3658da0577e97873065e19fe6488917" + + /// All 24 languages in the upstream `text/` directory. Keep in sync with + /// `ALL_LANGUAGES` in `Scripts/fetch_minimax_tts_corpus.py`. + private static let allLanguages: [String] = [ + "arabic", "cantonese", "chinese", "czech", "dutch", "english", + "finnish", "french", "german", "greek", "hindi", "indonesian", + "italian", "japanese", "korean", "polish", "portuguese", "romanian", + "russian", "spanish", "thai", "turkish", "ukrainian", "vietnamese", + ] + + public static func run(arguments: [String]) async { + var languages = allLanguages + var revision = defaultRevision + var outDir: URL? 
= nil + + var i = 0 + while i < arguments.count { + let arg = arguments[i] + switch arg { + case "--languages", "-l": + if i + 1 < arguments.count { + languages = arguments[i + 1] + .split(separator: ",") + .map { $0.trimmingCharacters(in: .whitespaces) } + .filter { !$0.isEmpty } + i += 1 + } + case "--revision": + if i + 1 < arguments.count { + revision = arguments[i + 1] + i += 1 + } + case "--out-dir": + if i + 1 < arguments.count { + outDir = URL(fileURLWithPath: arguments[i + 1]) + i += 1 + } + case "help", "--help", "-h": + printUsage() + return + default: + logger.error("Unknown argument: \(arg)") + printUsage() + exit(1) + } + i += 1 + } + + let unknown = Set(languages).subtracting(allLanguages).sorted() + if !unknown.isEmpty { + logger.error("Unknown language(s): \(unknown.joined(separator: ", "))") + logger.error("Available: \(allLanguages.joined(separator: ", "))") + exit(2) + } + + let resolvedOutDir = outDir ?? defaultOutDir() + + do { + try FileManager.default.createDirectory( + at: resolvedOutDir, withIntermediateDirectories: true) + } catch { + logger.error("Failed to create output directory: \(error.localizedDescription)") + exit(1) + } + + logger.info("Fetching MiniMax TTS Multilingual Test Set @ \(revision)") + logger.info(" out_dir: \(resolvedOutDir.path)") + logger.info(" langs: \(languages.count)") + + var total = 0 + for lang in languages { + guard let url = URL(string: hfURL(repo: repo, revision: revision, path: "text/\(lang).txt")) + else { + logger.error("[\(lang)] failed to construct URL") + exit(1) + } + do { + let data = try await DownloadUtils.fetchHuggingFaceFile( + from: url, description: "minimax TTS corpus (\(lang))") + guard let raw = String(data: data, encoding: .utf8) else { + logger.error("[\(lang)] response was not valid UTF-8") + exit(1) + } + let phrases = convert(raw: raw) + let outPath = try writeCorpus( + lang: lang, phrases: phrases, outDir: resolvedOutDir, + revision: revision) + let countStr = String(format: "%3d", 
phrases.count) + let relPath = relativePath(outPath, from: repoRoot()) + logger.info(" [\(lang)] \(countStr) phrases -> \(relPath)") + total += phrases.count + } catch { + logger.error("[\(lang)] FAILED: \(error.localizedDescription)") + exit(1) + } + } + + logger.info("OK — \(total) phrases across \(languages.count) language(s).") + } + + // MARK: - Helpers + + private static func hfURL(repo: String, revision: String, path: String) -> String { + "https://huggingface.co/datasets/\(repo)/resolve/\(revision)/\(path)" + } + + /// Strip `|` prefix and return the list of trimmed phrases. + /// Mirrors `convert()` in the Python script. + private static func convert(raw: String) -> [String] { + var out: [String] = [] + for rawLine in raw.split(separator: "\n", omittingEmptySubsequences: false) { + let line = rawLine.trimmingCharacters(in: .whitespacesAndNewlines) + if line.isEmpty { continue } + // Format: "|". Some lines may have + // extra `|` inside the text — keep only the first split. + let text: String + if let sepIdx = line.firstIndex(of: "|") { + text = String(line[line.index(after: sepIdx)...]) + .trimmingCharacters(in: .whitespacesAndNewlines) + } else { + text = line + } + if !text.isEmpty { + out.append(text) + } + } + return out + } + + private static func writeCorpus( + lang: String, + phrases: [String], + outDir: URL, + revision: String + ) throws -> URL { + let outPath = outDir.appendingPathComponent("\(lang).txt") + let header: [String] = [ + "# MiniMax Multilingual TTS Test Set — \(lang)", + "# Source: https://huggingface.co/datasets/\(repo)", + "# Revision: \(revision)", + "# License: CC-BY-SA-4.0 (Creative Commons Attribution-ShareAlike 4.0)", + "# Phrases: \(phrases.count)", + "#", + "# Cloning-audio filenames have been stripped — we only need the", + "# text for the FluidAudio TTS benchmark harness. 
Voice selection", + "# is per-backend (see Documentation/TTS/MinimaxCorpus.md).", + "", + ] + let body = (header + phrases).joined(separator: "\n") + "\n" + try body.write(to: outPath, atomically: true, encoding: .utf8) + return outPath + } + + /// `/Benchmarks/tts/corpus/minimax/`. Resolves relative to the + /// current working directory (the standard place `swift run` is invoked + /// from); falls back gracefully if the layout doesn't exist yet because + /// we `createDirectory(withIntermediateDirectories: true)` before write. + private static func defaultOutDir() -> URL { + repoRoot() + .appendingPathComponent("Benchmarks", isDirectory: true) + .appendingPathComponent("tts", isDirectory: true) + .appendingPathComponent("corpus", isDirectory: true) + .appendingPathComponent("minimax", isDirectory: true) + } + + private static func repoRoot() -> URL { + URL(fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true) + } + + private static func relativePath(_ url: URL, from base: URL) -> String { + let path = url.standardizedFileURL.path + let basePath = base.standardizedFileURL.path + if path.hasPrefix(basePath + "/") { + return String(path.dropFirst(basePath.count + 1)) + } + return path + } + + private static func printUsage() { + logger.info( + """ + Usage: fluidaudio minimax-corpus [options] + + Fetches the MiniMax Multilingual TTS Test Set text files from + HuggingFace and converts them to the FluidAudio TTS-benchmark + corpus format. Outputs one file per language. + + Options: + --languages, -l Comma-separated subset of languages + (default: all 24). + --revision HuggingFace dataset revision + (default: \(defaultRevision)). + --out-dir Output directory + (default: Benchmarks/tts/corpus/minimax). + --help, -h Show this help. 
+ + Available languages: + \(allLanguages.joined(separator: ", ")) + + Examples: + fluidaudio minimax-corpus + fluidaudio minimax-corpus --languages english,spanish,hindi + fluidaudio minimax-corpus --revision + """) + } +} +#endif diff --git a/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift b/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift index 3fa3f2d2..3b33d39b 100644 --- a/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift +++ b/Sources/FluidAudioCLI/Commands/StyleTTS2Command.swift @@ -23,6 +23,8 @@ public enum StyleTTS2Command { var alpha: Float = 0.3 var beta: Float = 0.7 var seed: UInt64? + var tokenizeOnly = false + var corpusPath: String? var i = 0 while i < arguments.count { @@ -74,6 +76,16 @@ public enum StyleTTS2Command { fputs("--seed requires an integer\n", stderr) exit(2) } + case "--tokenize-only": + tokenizeOnly = true + i += 1 + case "--corpus": + guard i + 1 < arguments.count else { + fputs("--corpus requires a path\n", stderr) + exit(2) + } + corpusPath = arguments[i + 1] + i += 2 case "--help", "-h": printUsage() return @@ -88,6 +100,11 @@ public enum StyleTTS2Command { } } + if tokenizeOnly { + await runTokenizeOnly(text: text, corpusPath: corpusPath) + return + } + guard let text else { fputs("Missing required text argument\n", stderr) printUsage() @@ -136,6 +153,105 @@ public enum StyleTTS2Command { } } + /// `--tokenize-only`: phonemize + encode without invoking the diffusion + /// pipeline. Reports phoneme string, token id sequence, and any scalars + /// that the 178-token espeak-ng vocab silently dropped. With `--corpus` + /// runs over every line of a phrase file and aggregates a histogram of + /// dropped scalars for the whole corpus. + private static func runTokenizeOnly(text: String?, corpusPath: String?) 
async { + do { + let manager = StyleTTS2Manager() + try await manager.initialize { _ in } + + var totalScalars = 0 + var totalIds = 0 + var totalDropped = 0 + var dropHist: [Unicode.Scalar: Int] = [:] + var phraseCount = 0 + + func process(_ phrase: String) async throws { + let (phonemes, ids, dropped) = + try await manager.tokenizeWithReport(text: phrase) + let scalars = phonemes.unicodeScalars.count + totalScalars += scalars + totalIds += ids.count + let phraseDropCount = dropped.values.reduce(0, +) + totalDropped += phraseDropCount + for (k, v) in dropped { dropHist[k, default: 0] += v } + phraseCount += 1 + + if corpusPath == nil { + print("INPUT : \(phrase)") + print("PHONEMES : \(phonemes)") + print("TOKEN_IDS (\(ids.count)): \(ids)") + let formatted = + dropped + .sorted { $0.value > $1.value } + .map { + "U+\(String($0.key.value, radix: 16, uppercase: true))" + + " '\($0.key)' ×\($0.value)" + } + .joined(separator: ", ") + print( + "DROPPED (\(phraseDropCount) of \(scalars) scalars):" + + " \(formatted)") + } + } + + if let corpusPath { + let url = expand(corpusPath) + let raw = try String(contentsOf: url, encoding: .utf8) + let phrases = raw.split(separator: "\n", omittingEmptySubsequences: true) + .map { $0.trimmingCharacters(in: .whitespaces) } + .filter { !$0.isEmpty && !$0.hasPrefix("#") } + for (idx, phrase) in phrases.enumerated() { + do { + try await process(phrase) + let dropPct = + Double(totalDropped) / Double(max(totalScalars, 1)) * 100 + if (idx + 1) % 10 == 0 || idx + 1 == phrases.count { + fputs( + " [\(idx + 1)/\(phrases.count)] running drop rate " + + "\(String(format: "%.2f", dropPct))%\n", + stderr) + } + } catch { + fputs(" [\(idx + 1)] phrase failed: \(error)\n", stderr) + } + } + } else if let text { + try await process(text) + } else { + fputs("--tokenize-only requires either text or --corpus\n", stderr) + exit(2) + } + + let dropPct = Double(totalDropped) / Double(max(totalScalars, 1)) * 100 + let kept = totalScalars - totalDropped + 
print("") + print("=== StyleTTS2 vocab coverage ===") + print("phrases : \(phraseCount)") + print("phoneme scalars total : \(totalScalars)") + print("encoded token ids : \(totalIds) (== kept scalars: \(kept))") + print( + "dropped scalars : \(totalDropped) " + + "(\(String(format: "%.2f", dropPct))%)") + print("distinct dropped chars : \(dropHist.count)") + if !dropHist.isEmpty { + print("") + print("dropped histogram (most → least frequent):") + for (scalar, count) in dropHist.sorted(by: { $0.value > $1.value }) { + let hex = String(scalar.value, radix: 16, uppercase: true) + print( + " \(String(format: "%6d", count)) U+\(hex) '\(scalar)'") + } + } + } catch { + fputs("StyleTTS2 tokenize-only failed: \(error)\n", stderr) + exit(1) + } + } + private static func expand(_ path: String) -> URL { let exp = (path as NSString).expandingTildeInPath if exp.hasPrefix("/") { @@ -152,12 +268,15 @@ public enum StyleTTS2Command { fluidaudio styletts2 "" --voice [options] Options: - --voice Required. Path to precomputed ref_s.bin (256 fp32 LE). + --voice Required for synthesis. Path to precomputed ref_s.bin (256 fp32 LE). --output Output WAV path (default: styletts2.wav). --steps ADPM2 sampler steps (default: 5). --alpha Acoustic style mix weight (default: 0.3). --beta Prosody style mix weight (default: 0.7). --seed Deterministic noise seed (default: system RNG). + --tokenize-only Run G2P + vocab encode only; report dropped scalars. + No --voice needed. Use with text or --corpus. + --corpus Phrase-per-line corpus file (with --tokenize-only). 
Example: fluidaudio styletts2 "Hello world" \\ diff --git a/Sources/FluidAudioCLI/Commands/TTSCommand.swift b/Sources/FluidAudioCLI/Commands/TTSCommand.swift index 132fcc8d..9a79a8ac 100644 --- a/Sources/FluidAudioCLI/Commands/TTSCommand.swift +++ b/Sources/FluidAudioCLI/Commands/TTSCommand.swift @@ -414,22 +414,17 @@ public struct TTS { ) return } - if #available(macOS 15, iOS 18, *) { - await CosyVoice3TextCLI.run( - text: inputText, - modelsDir: modelsDir, - tokenizerDir: tokDir, - embeddingsFile: embFile, - specialTokensFile: specFile, - promptAssetsPath: promptAssets, - outputPath: output, - seed: cv3Seed, - maxNewTokens: cv3MaxNewTokens, - cpuOnly: cv3CpuOnly) - } else { - logger.error( - "CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).") - } + await CosyVoice3TextCLI.run( + text: inputText, + modelsDir: modelsDir, + tokenizerDir: tokDir, + embeddingsFile: embFile, + specialTokensFile: specFile, + promptAssetsPath: promptAssets, + outputPath: output, + seed: cv3Seed, + maxNewTokens: cv3MaxNewTokens, + cpuOnly: cv3CpuOnly) return } @@ -440,19 +435,14 @@ public struct TTS { ) return } - if #available(macOS 15, iOS 18, *) { - await CosyVoice3ParityCLI.run( - fixturePath: fixture, - modelsDir: modelsDir, - referencePath: cv3ReferencePath, - outputPath: output, - seed: cv3Seed, - cpuOnly: cv3CpuOnly, - replayTokens: cv3ReplayTokens) - } else { - logger.error( - "CosyVoice3 requires macOS 15 / iOS 18 (uses CoreML MLState).") - } + await CosyVoice3ParityCLI.run( + fixturePath: fixture, + modelsDir: modelsDir, + referencePath: cv3ReferencePath, + outputPath: output, + seed: cv3Seed, + cpuOnly: cv3CpuOnly, + replayTokens: cv3ReplayTokens) return } diff --git a/Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift b/Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift new file mode 100644 index 00000000..ae268b9e --- /dev/null +++ b/Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift @@ -0,0 +1,1359 @@ +#if os(macOS) +import CoreML +import 
FluidAudio +import Foundation + +/// `fluidaudio tts-benchmark` — quantitative TTS benchmark harness. +/// +/// Reports **TTFT / cold-start / warm-start latency, per-stage timings, +/// peak RSS, WER + CER per category** — i.e. the things conversational +/// TTS users actually feel — instead of just RTFx. +/// +/// Backends: +/// kokoro-ane — 7-stage ANE pipeline (per-stage timings, per-stage CU) +/// kokoro — single-graph CPU+GPU (chunk-level only) +/// pocket-tts — streaming flow-matching (no per-stage timings) +/// magpie — encoder-decoder + NanoCodec (6-stage timings, slow) +/// cosyvoice3 — Mandarin LLM-based (Mandarin corpus only, no WER) +/// styletts2 — diffusion + HiFi-GAN (one-shot, requires --voice ref_s.bin) +/// +/// Usage: +/// fluidaudio tts-benchmark --backend kokoro-ane \ +/// --corpus minimax-english \ +/// --voice af_heart \ +/// --compute-units default \ +/// --output-json bench.json +/// +/// Corpora land in `Benchmarks/tts/corpus/minimax/.txt` — +/// the MiniMax Multilingual TTS Test Set (CC-BY-SA-4.0, +/// 24 languages × 100 phrases). The `.txt` files are gitignored; +/// populate them with `swift run fluidaudio minimax-corpus`. See +/// `Documentation/TTS/MinimaxCorpus.md` for attribution + reproduction +/// notes and `Documentation/TTS/Benchmarks.md` for the per-backend ↔ +/// language coverage matrix. Reference with `--corpus minimax-` +/// (e.g. `minimax-english`, `minimax-chinese`, `minimax-vietnamese`, …). +public enum TtsBenchmarkCommand { + + private static let logger = AppLogger(category: "TtsBenchmarkCommand") + + // MARK: - Per-phrase sample emitted by every backend driver. + private struct BackendPhraseSample { + let synthMs: Double + let ttftMs: Double // For one-shot backends, == synthMs. + let samples: [Float] + let sampleRate: Int + let stageMs: [String: Double] // Empty if backend has no per-stage timings. + let extraFields: [String: Any] // encoder_tokens, finished_on_eos, etc. 
+ } + + // MARK: - ASR backend selection + // + // The harness supports two ASR backends for the TTS→ASR roundtrip: + // .parakeet — Parakeet TDT (English-only, auto-downloaded). + // .cohere — Cohere Transcribe cache-external (14 languages incl. zh). + // CosyVoice3's Mandarin output requires `.cohere` for a meaningful CER + // — Parakeet's English-only output collapses to ~100% on zh. + fileprivate enum AsrChoice { + case skip + case parakeet + case cohere(modelDir: URL, language: CohereAsrConfig.Language, computeUnits: MLComputeUnits) + + var label: String { + switch self { + case .skip: return "skip" + case .parakeet: return "parakeet-tdt" + case .cohere(_, let lang, let cu): + return "cohere-transcribe-\(lang.rawValue)/\(Self.computeLabel(cu))" + } + } + + private static func computeLabel(_ cu: MLComputeUnits) -> String { + switch cu { + case .all: return "all" + case .cpuAndNeuralEngine: return "cpu+ane" + case .cpuAndGPU: return "cpu+gpu" + case .cpuOnly: return "cpu" + @unknown default: return "unknown" + } + } + + var skipped: Bool { + if case .skip = self { return true } else { return false } + } + } + + /// Closure-based ASR adapter so `runPhraseLoop` doesn't have to know + /// which backend it's driving. Built once before the per-phrase loop, + /// torn down after. + fileprivate struct AsrLoop { + let label: String + let transcribeOne: (URL) async throws -> String + let cleanup: () async -> Void + } + + public static func run(arguments: [String]) async { + var backendName = "kokoro-ane" + var corpusName: String? + var corpusPath: String? + var voice: String? + var speakerName: String? + var languageName: String? + var computeUnitsName = "default" + var outputJson: String? + var audioDir: String? + var skipAsr = false + var asrBackendName: String? + var cohereModelDirArg: String? + var asrLanguageArg: String? + var cohereComputeUnitsArg: String? 
+ + var i = 0 + while i < arguments.count { + let arg = arguments[i] + switch arg { + case "--backend": + if i + 1 < arguments.count { + backendName = arguments[i + 1] + i += 1 + } + case "--corpus": + if i + 1 < arguments.count { + corpusName = arguments[i + 1] + i += 1 + } + case "--corpus-path": + if i + 1 < arguments.count { + corpusPath = arguments[i + 1] + i += 1 + } + case "--voice": + if i + 1 < arguments.count { + voice = arguments[i + 1] + i += 1 + } + case "--speaker": + if i + 1 < arguments.count { + speakerName = arguments[i + 1] + i += 1 + } + case "--language": + if i + 1 < arguments.count { + languageName = arguments[i + 1] + i += 1 + } + case "--compute-units": + if i + 1 < arguments.count { + computeUnitsName = arguments[i + 1] + i += 1 + } + case "--output-json": + if i + 1 < arguments.count { + outputJson = arguments[i + 1] + i += 1 + } + case "--audio-dir": + if i + 1 < arguments.count { + audioDir = arguments[i + 1] + i += 1 + } + case "--skip-asr": + skipAsr = true + case "--asr-backend": + if i + 1 < arguments.count { + asrBackendName = arguments[i + 1] + i += 1 + } + case "--cohere-model-dir": + if i + 1 < arguments.count { + cohereModelDirArg = arguments[i + 1] + i += 1 + } + case "--asr-language": + if i + 1 < arguments.count { + asrLanguageArg = arguments[i + 1] + i += 1 + } + case "--cohere-compute-units": + if i + 1 < arguments.count { + cohereComputeUnitsArg = arguments[i + 1] + i += 1 + } + case "--help", "-h": + printUsage() + return + default: + logger.warning("Unknown argument: \(arg)") + } + i += 1 + } + + let backend = parseBackend(backendName) + + // Resolve corpus. 
+ let phrases: [(category: String, text: String)] + let corpusLabel: String + do { + if let corpusPath { + let url = resolveURL(corpusPath, isDirectory: false) + let raw = try String(contentsOf: url, encoding: .utf8) + phrases = parseCorpus(raw, category: url.deletingPathExtension().lastPathComponent) + corpusLabel = url.lastPathComponent + } else { + let resolved = corpusName ?? backend.defaultCorpus + phrases = try loadShippedCorpus(resolved) + corpusLabel = resolved + } + } catch { + logger.error("Failed to load corpus: \(error.localizedDescription)") + exit(1) + } + guard !phrases.isEmpty else { + logger.error("Corpus is empty after parsing") + exit(1) + } + logger.info("Loaded \(phrases.count) phrase(s) from corpus '\(corpusLabel)'") + + guard let preset = TtsComputeUnitPreset(cliValue: computeUnitsName) else { + logger.error( + "Unknown --compute-units value: \(computeUnitsName). Expected default | all-ane | cpu-and-gpu | cpu-only." + ) + exit(1) + } + + // Resolve ASR backend choice. Precedence: + // --skip-asr or --asr-backend none → .skip + // --asr-backend cohere → .cohere(modelDir, language) + // --asr-backend parakeet → .parakeet + // no flag, backend == cosyvoice3 → .skip (Parakeet is English-only; + // Mandarin output collapses to ~100% WER) + // no flag, otherwise → .parakeet + let asrChoice: AsrChoice + do { + asrChoice = try resolveAsrChoice( + skipAsrFlag: skipAsr, + backendName: asrBackendName, + cohereModelDir: cohereModelDirArg, + asrLanguage: asrLanguageArg, + cohereComputeUnits: cohereComputeUnitsArg, + corpusLabel: corpusLabel, + ttsBackend: backend) + } catch { + logger.error("Failed to resolve ASR backend: \(error.localizedDescription)") + exit(1) + } + logger.info("ASR backend: \(asrChoice.label)") + + do { + switch backend { + case .kokoroAne: + try await runKokoroAne( + phrases: phrases, corpusLabel: corpusLabel, + voice: voice ?? 
KokoroAneConstants.defaultVoice, + preset: preset, outputJson: outputJson, audioDir: audioDir, + asrChoice: asrChoice) + case .kokoro: + try await runKokoro( + phrases: phrases, corpusLabel: corpusLabel, + voice: voice ?? TtsConstants.recommendedVoice, + preset: preset, outputJson: outputJson, audioDir: audioDir, + asrChoice: asrChoice) + case .pocketTts: + try await runPocketTts( + phrases: phrases, corpusLabel: corpusLabel, + voice: voice ?? PocketTtsConstants.defaultVoice, + languageName: languageName, + preset: preset, outputJson: outputJson, audioDir: audioDir, + asrChoice: asrChoice) + case .magpie: + try await runMagpie( + phrases: phrases, corpusLabel: corpusLabel, + speakerName: speakerName, languageName: languageName, + preset: preset, outputJson: outputJson, audioDir: audioDir, + asrChoice: asrChoice) + case .cosyVoice3: + try await runCosyVoice3( + phrases: phrases, corpusLabel: corpusLabel, + voice: voice, + preset: preset, outputJson: outputJson, audioDir: audioDir, + asrChoice: asrChoice) + case .styleTts2: + try await runStyleTts2( + phrases: phrases, corpusLabel: corpusLabel, + voicePath: voice, + preset: preset, outputJson: outputJson, audioDir: audioDir, + asrChoice: asrChoice) + } + } catch { + logger.error("tts-benchmark failed: \(error)") + exit(1) + } + } + + // MARK: - Kokoro ANE driver + + private static func runKokoroAne( + phrases: [(category: String, text: String)], + corpusLabel: String, + voice: String, + preset: TtsComputeUnitPreset, + outputJson: String?, + audioDir: String?, + asrChoice: AsrChoice + ) async throws { + let units = KokoroAneComputeUnits(preset: preset) + let manager = KokoroAneManager(defaultVoice: voice, computeUnits: units) + + let coldStart = Date() + try await manager.initialize() + let coldStartS = Date().timeIntervalSince(coldStart) + logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS)) + + let firstStart = Date() + _ = try await manager.synthesizeDetailed( + text: "Initialization warm-up.", 
voice: voice, speed: 1.0) + let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000 + logger.info(String(format: "First synth: %.0f ms", firstSynthMs)) + + try await runPhraseLoop( + backendId: "kokoro-ane", + voiceLabel: voice, + corpusLabel: corpusLabel, + phrases: phrases, + preset: preset, + coldStartS: coldStartS, + firstSynthMs: firstSynthMs, + outputJson: outputJson, + audioDir: audioDir, + asrChoice: asrChoice, + extraSummary: ["voice": voice] + ) { text in + let t0 = Date() + let result = try await manager.synthesizeDetailed( + text: text, voice: voice, speed: 1.0) + let synthMs = Date().timeIntervalSince(t0) * 1000 + return BackendPhraseSample( + synthMs: synthMs, + ttftMs: synthMs, + samples: result.samples, + sampleRate: result.sampleRate, + stageMs: [ + "albert": result.timings.albert, + "post_albert": result.timings.postAlbert, + "alignment": result.timings.alignment, + "prosody": result.timings.prosody, + "noise": result.timings.noise, + "vocoder": result.timings.vocoder, + "tail": result.timings.tail, + "total": result.timings.totalMs, + ], + extraFields: [ + "encoder_tokens": result.encoderTokens, + "acoustic_frames": result.acousticFrames, + ] + ) + } + } + + // MARK: - Kokoro driver (single-graph) + + private static func runKokoro( + phrases: [(category: String, text: String)], + corpusLabel: String, + voice: String, + preset: TtsComputeUnitPreset, + outputJson: String?, + audioDir: String?, + asrChoice: AsrChoice + ) async throws { + let units = preset.uniformUnits ?? 
.all + let manager = KokoroTtsManager(defaultVoice: voice, computeUnits: units) + + let coldStart = Date() + try await manager.initialize(preloadVoices: [voice]) + let coldStartS = Date().timeIntervalSince(coldStart) + logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS)) + + let firstStart = Date() + _ = try await manager.synthesizeDetailed(text: "Initialization warm-up.", voice: voice) + let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000 + logger.info(String(format: "First synth: %.0f ms", firstSynthMs)) + + try await runPhraseLoop( + backendId: "kokoro", + voiceLabel: voice, + corpusLabel: corpusLabel, + phrases: phrases, + preset: preset, + coldStartS: coldStartS, + firstSynthMs: firstSynthMs, + outputJson: outputJson, + audioDir: audioDir, + asrChoice: asrChoice, + extraSummary: ["voice": voice] + ) { text in + let t0 = Date() + let result = try await manager.synthesizeDetailed(text: text, voice: voice) + let synthMs = Date().timeIntervalSince(t0) * 1000 + let samples = result.chunks.flatMap { $0.samples } + return BackendPhraseSample( + synthMs: synthMs, + ttftMs: synthMs, + samples: samples, + sampleRate: 24000, + stageMs: [:], + extraFields: [ + "chunk_count": result.chunks.count, + "wav_bytes": result.audio.count, + ] + ) + } + } + + // MARK: - PocketTTS driver + + private static func runPocketTts( + phrases: [(category: String, text: String)], + corpusLabel: String, + voice: String, + languageName: String?, + preset: TtsComputeUnitPreset, + outputJson: String?, + audioDir: String?, + asrChoice: AsrChoice + ) async throws { + if preset != .default { + logger.warning( + "PocketTTS does not expose per-call compute-unit overrides; --compute-units \(preset.cliValue) ignored." 
+ ) + } + let language = parsePocketLanguage(languageName) + logger.info("PocketTTS language: \(language.rawValue)") + + let manager = PocketTtsManager(defaultVoice: voice, language: language) + + let coldStart = Date() + try await manager.initialize() + let coldStartS = Date().timeIntervalSince(coldStart) + logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS)) + + let firstStart = Date() + var firstFrameMs: Double = 0 + var firstFrameCount = 0 + let warmupStream = try await manager.synthesizeStreaming( + text: "Initialization warm-up.", voice: voice) + for try await frame in warmupStream { + if firstFrameCount == 0 { + firstFrameMs = Date().timeIntervalSince(firstStart) * 1000 + } + firstFrameCount += 1 + _ = frame.samples + } + let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000 + logger.info( + String( + format: "First synth: %.0f ms total, %.0f ms TTFT (frames=%d)", + firstSynthMs, firstFrameMs, firstFrameCount)) + + try await runPhraseLoop( + backendId: "pocket-tts", + voiceLabel: voice, + corpusLabel: corpusLabel, + phrases: phrases, + preset: preset, + coldStartS: coldStartS, + firstSynthMs: firstSynthMs, + outputJson: outputJson, + audioDir: audioDir, + asrChoice: asrChoice, + extraSummary: ["voice": voice, "language": language.rawValue] + ) { text in + // PocketTTS is streaming-first: we measure TTFT (time to first + // audio frame) separately from total synth time so the benchmark + // numbers reflect what a streaming consumer actually experiences. 
+ let t0 = Date() + let stream = try await manager.synthesizeStreaming(text: text, voice: voice) + var aggregated: [Float] = [] + var ttftMs: Double = 0 + var frameCount = 0 + var lastChunkCount = 0 + for try await frame in stream { + if frameCount == 0 { + ttftMs = Date().timeIntervalSince(t0) * 1000 + } + aggregated.append(contentsOf: frame.samples) + frameCount += 1 + lastChunkCount = frame.chunkCount + } + let synthMs = Date().timeIntervalSince(t0) * 1000 + // Empty-stream guard, mirroring the Magpie driver: if the stream + // yields no frames, fall back to synthMs so downstream percentile + // math doesn't see ttftMs == 0. + if frameCount == 0 { ttftMs = synthMs } + return BackendPhraseSample( + synthMs: synthMs, + ttftMs: ttftMs, + samples: aggregated, + sampleRate: PocketTtsConstants.audioSampleRate, + stageMs: [:], + extraFields: [ + "frame_count": frameCount, + "chunk_count": lastChunkCount, + ] + ) + } + } + + // MARK: - Magpie driver + + private static func runMagpie( + phrases: [(category: String, text: String)], + corpusLabel: String, + speakerName: String?, + languageName: String?, + preset: TtsComputeUnitPreset, + outputJson: String?, + audioDir: String?, + asrChoice: AsrChoice + ) async throws { + let units = preset.uniformUnits ?? 
.cpuAndNeuralEngine + let language = parseMagpieLanguage(languageName) + let speaker = parseMagpieSpeaker(speakerName) + logger.info("Magpie speaker=\(speaker.displayName) language=\(language.rawValue)") + + let manager = MagpieTtsManager( + computeUnits: units, preferredLanguages: [language]) + + let coldStart = Date() + try await manager.initialize() + let coldStartS = Date().timeIntervalSince(coldStart) + logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS)) + + let firstStart = Date() + _ = try await manager.synthesize( + text: "Initialization warm-up.", speaker: speaker, language: language) + let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000 + logger.info(String(format: "First synth: %.0f ms", firstSynthMs)) + + try await runPhraseLoop( + backendId: "magpie", + voiceLabel: speaker.displayName, + corpusLabel: corpusLabel, + phrases: phrases, + preset: preset, + coldStartS: coldStartS, + firstSynthMs: firstSynthMs, + outputJson: outputJson, + audioDir: audioDir, + asrChoice: asrChoice, + extraSummary: [ + "speaker": speaker.displayName, "language": language.rawValue, + ] + ) { text in + // Drive Magpie through `synthesizeStream` so TTFT measures + // time-to-first-chunk-yield rather than full-utterance wall. + // The chunker carves a small first chunk + // (`MagpieChunker.streamingFirstChunkCap` = 50 codec frames ≈ + // 2.3 s of audio) when the first sentence is long enough; for + // short phrases the stream degrades to one chunk == whole + // utterance and TTFT == synthMs (no streaming benefit, no + // measurement penalty). + // + // Trade-off vs. the prior `synthesize()` path: per-stage + // timings (`text_encoder`/`prefill`/`ar_loop`/…) are only + // surfaced on `MagpieSynthesisResult`, not per + // `MagpieAudioChunk`, so `stageMs` is empty here. That matches + // PocketTTS streaming which also publishes empty `stageMs`. 
+ let t0 = Date() + let stream = try await manager.synthesizeStream( + text: text, speaker: speaker, language: language) + var aggregated: [Float] = [] + var ttftMs: Double = 0 + var chunkCount = 0 + var codeCount = 0 + var finishedOnEos = false + var sampleRate = MagpieConstants.audioSampleRate + for try await chunk in stream { + if chunkCount == 0 { + ttftMs = Date().timeIntervalSince(t0) * 1000 + } + aggregated.append(contentsOf: chunk.samples) + chunkCount += 1 + codeCount += chunk.codeCount + sampleRate = chunk.sampleRate + if chunk.isFinal { + finishedOnEos = chunk.finishedOnEos + } + } + let synthMs = Date().timeIntervalSince(t0) * 1000 + // Empty-stream guard (synthesizeStream returns immediately on + // zero-length input). Fall back to synthMs so downstream + // percentile math doesn't see ttftMs == 0. + if chunkCount == 0 { ttftMs = synthMs } + return BackendPhraseSample( + synthMs: synthMs, + ttftMs: ttftMs, + samples: aggregated, + sampleRate: sampleRate, + stageMs: [:], + extraFields: [ + "code_count": codeCount, + "finished_on_eos": finishedOnEos, + "chunk_count": chunkCount, + ] + ) + } + } + + // MARK: - CosyVoice3 driver + + private static func runCosyVoice3( + phrases: [(category: String, text: String)], + corpusLabel: String, + voice: String?, + preset: TtsComputeUnitPreset, + outputJson: String?, + audioDir: String?, + asrChoice: AsrChoice + ) async throws { + let units = preset.uniformUnits ?? .cpuAndNeuralEngine + let voiceId = voice ?? 
"cosyvoice3-default-zh" + + let coldStart = Date() + let manager = try await CosyVoice3TtsManager.downloadAndCreate( + cacheDirectory: nil, includeDefaultVoice: true, computeUnits: units) + try await manager.initialize() + let promptAssets = try await manager.loadVoice(voiceId) + let coldStartS = Date().timeIntervalSince(coldStart) + logger.info(String(format: "Cold start (download+init+voice): %.2fs", coldStartS)) + + let firstStart = Date() + _ = try await manager.synthesize(text: "你好", promptAssets: promptAssets) + let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000 + logger.info(String(format: "First synth: %.0f ms", firstSynthMs)) + + try await runPhraseLoop( + backendId: "cosyvoice3", + voiceLabel: voiceId, + corpusLabel: corpusLabel, + phrases: phrases, + preset: preset, + coldStartS: coldStartS, + firstSynthMs: firstSynthMs, + outputJson: outputJson, + audioDir: audioDir, + asrChoice: asrChoice, + extraSummary: ["voice": voiceId] + ) { text in + let t0 = Date() + let result = try await manager.synthesize(text: text, promptAssets: promptAssets) + let synthMs = Date().timeIntervalSince(t0) * 1000 + return BackendPhraseSample( + synthMs: synthMs, + ttftMs: synthMs, + samples: result.samples, + sampleRate: result.sampleRate, + stageMs: [:], + extraFields: [ + "generated_token_count": result.generatedTokenCount, + "decoded_token_count": result.decodedTokens.count, + // Surface the structural 250-token Flow-input cap as a + // per-phrase boolean so corpus reports can tally how many + // long phrases hit silent truncation. 
+ "finished_on_eos": result.finishedOnEos, + ] + ) + } + } + + // MARK: - StyleTTS2 driver + + private static func runStyleTts2( + phrases: [(category: String, text: String)], + corpusLabel: String, + voicePath: String?, + preset: TtsComputeUnitPreset, + outputJson: String?, + audioDir: String?, + asrChoice: AsrChoice + ) async throws { + guard let voicePath, !voicePath.isEmpty else { + logger.error( + "StyleTTS2 requires --voice " + + "(256 fp32 LE blob from mobius-styletts2/scripts/06_dump_ref_s.py)") + exit(1) + } + let voiceURL = resolveURL(voicePath, isDirectory: false) + let voiceLabel = voiceURL.deletingPathExtension().lastPathComponent + + // StyleTTS2 doesn't expose a compute-units knob today; --compute-units + // is accepted for parity with other backends but only labels the run. + let manager = StyleTTS2Manager() + + let coldStart = Date() + try await manager.initialize() + let coldStartS = Date().timeIntervalSince(coldStart) + logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS)) + + let firstStart = Date() + _ = try await manager.synthesizeSamples( + text: "Initialization warm-up.", voiceStyleURL: voiceURL, randomSeed: 42) + let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000 + logger.info(String(format: "First synth: %.0f ms", firstSynthMs)) + + try await runPhraseLoop( + backendId: "styletts2", + voiceLabel: voiceLabel, + corpusLabel: corpusLabel, + phrases: phrases, + preset: preset, + coldStartS: coldStartS, + firstSynthMs: firstSynthMs, + outputJson: outputJson, + audioDir: audioDir, + asrChoice: asrChoice, + extraSummary: ["voice": voiceLabel] + ) { text in + let t0 = Date() + let result = try await manager.synthesizeSamples( + text: text, voiceStyleURL: voiceURL, randomSeed: 42) + let synthMs = Date().timeIntervalSince(t0) * 1000 + return BackendPhraseSample( + synthMs: synthMs, + ttftMs: synthMs, + samples: result.samples, + sampleRate: result.sampleRate, + stageMs: [:], + extraFields: [:] + ) + } + } + + // MARK: 
- Shared per-phrase loop + summary + + private static func runPhraseLoop( + backendId: String, + voiceLabel: String, + corpusLabel: String, + phrases: [(category: String, text: String)], + preset: TtsComputeUnitPreset, + coldStartS: Double, + firstSynthMs: Double, + outputJson: String?, + audioDir: String?, + asrChoice: AsrChoice, + extraSummary: [String: Any], + synthOne: (String) async throws -> BackendPhraseSample + ) async throws { + // Optional output dir for WAVs. + var audioDirURL: URL? = nil + if let audioDir { + let url = resolveURL(audioDir, isDirectory: true) + try FileManager.default.createDirectory( + at: url, withIntermediateDirectories: true) + audioDirURL = url + } + + // Build optional ASR backend (Parakeet, Cohere, or none). + let asrLoop = try await buildAsrLoop(asrChoice) + + var perPhrase: [[String: Any]] = [] + var byCategory: [String: [Int]] = [:] + + for (idx, item) in phrases.enumerated() { + let label = String(format: "[%02d/%02d]", idx + 1, phrases.count) + logger.info("\(label) [\(item.category)] \(item.text)") + + let sample = try await synthOne(item.text) + let audioMs = + Double(sample.samples.count) / Double(sample.sampleRate) * 1000 + let rtfx = sample.synthMs > 0 ? audioMs / sample.synthMs : 0 + + // Persist WAV (audioDir if set, else temp file for ASR). 
+ let wavURL: URL + if let audioDirURL { + wavURL = audioDirURL.appendingPathComponent( + String(format: "phrase_%03d.wav", idx + 1)) + } else { + wavURL = FileManager.default.temporaryDirectory + .appendingPathComponent("tts-benchmark-\(UUID().uuidString).wav") + } + let wavData = try AudioWAV.data( + from: sample.samples, sampleRate: Double(sample.sampleRate)) + try wavData.write(to: wavURL) + + var werValue = Double.nan + var cerValue = Double.nan + var hypothesis = "" + var asrMs = 0.0 + if let asrLoop { + let asr0 = Date() + hypothesis = try await asrLoop.transcribeOne(wavURL) + asrMs = Date().timeIntervalSince(asr0) * 1000 + let m = WERCalculator.calculateWERAndCER( + hypothesis: hypothesis, reference: item.text) + werValue = m.wer + cerValue = m.cer + } + + if audioDirURL == nil { + try? FileManager.default.removeItem(at: wavURL) + } + + // Print "n/a" rather than 0.0% when ASR was skipped, so the + // per-phrase log can't be misread as a perfect transcription. + let werText = werValue.isNaN ? "n/a" : String(format: "%.1f%%", werValue * 100) + let cerText = cerValue.isNaN ? "n/a" : String(format: "%.1f%%", cerValue * 100) + logger.info( + String( + format: + " ttft=%.0f ms synth=%.0f ms audio=%.0f ms rtfx=%.2fx", + sample.ttftMs, sample.synthMs, audioMs, rtfx) + + " wer=\(werText) cer=\(cerText)") + + byCategory[item.category, default: []].append(perPhrase.count) + var phraseDict: [String: Any] = [ + "index": idx + 1, + "category": item.category, + "reference": item.text, + "hypothesis": hypothesis, + "ttft_ms": sample.ttftMs, + "synth_ms": sample.synthMs, + "audio_ms": audioMs, + "rtfx": rtfx, + "wer": werValue.isNaN ? NSNull() : werValue as Any, + "cer": cerValue.isNaN ? NSNull() : cerValue as Any, + "asr_ms": asrMs, + "stage_ms": sample.stageMs, + "wav_path": audioDirURL == nil ? 
"" : wavURL.path, + ] + for (k, v) in sample.extraFields { + phraseDict[k] = v + } + perPhrase.append(phraseDict) + } + + if let asrLoop { + await asrLoop.cleanup() + } + + // Aggregate. + let totalSynthMs = perPhrase.reduce(0.0) { $0 + ($1["synth_ms"] as? Double ?? 0) } + let totalAudioMs = perPhrase.reduce(0.0) { $0 + ($1["audio_ms"] as? Double ?? 0) } + let aggRtfx = totalSynthMs > 0 ? 
totalAudioMs / totalSynthMs : 0 + + let synthMsValues = perPhrase.compactMap { $0["synth_ms"] as? Double }.sorted() + let p50 = percentile(synthMsValues, 0.5) + let p95 = percentile(synthMsValues, 0.95) + let ttftValues = perPhrase.compactMap { $0["ttft_ms"] as? Double }.sorted() + let ttftP50 = percentile(ttftValues, 0.5) + let ttftP95 = percentile(ttftValues, 0.95) + + var categories: [[String: Any]] = [] + for (cat, indexes) in byCategory.sorted(by: { $0.key < $1.key }) { + let werVals = indexes.compactMap { perPhrase[$0]["wer"] as? Double } + let cerVals = indexes.compactMap { perPhrase[$0]["cer"] as? Double } + let synthVals = indexes.compactMap { perPhrase[$0]["synth_ms"] as? Double } + let audioVals = indexes.compactMap { perPhrase[$0]["audio_ms"] as? Double } + let synthSum = synthVals.reduce(0, +) + let audioSum = audioVals.reduce(0, +) + let macroWer = + werVals.isEmpty ? Double.nan : werVals.reduce(0, +) / Double(werVals.count) + let macroCer = + cerVals.isEmpty ? Double.nan : cerVals.reduce(0, +) / Double(cerVals.count) + categories.append([ + "category": cat, + "phrase_count": indexes.count, + "macro_wer": macroWer.isNaN ? NSNull() : macroWer as Any, + "macro_cer": macroCer.isNaN ? NSNull() : macroCer as Any, + "synth_ms_p50": percentile(synthVals.sorted(), 0.5), + "synth_ms_p95": percentile(synthVals.sorted(), 0.95), + "rtfx": synthSum > 0 ? audioSum / synthSum : 0, + ]) + } + + let peakRssMb = + Double(FluidAudioCLI.fetchPeakMemoryUsageBytes() ?? 0) / 1024 / 1024 + + // Banner. 
+ logger.info("--- Summary ---") + logger.info(" backend: \(backendId)") + logger.info(" voice/speaker: \(voiceLabel)") + logger.info(" corpus: \(corpusLabel) (n=\(phrases.count))") + logger.info(" compute units: \(preset.cliValue)") + logger.info(String(format: " cold start: %.2fs", coldStartS)) + logger.info(String(format: " first synth: %.0f ms", firstSynthMs)) + logger.info(String(format: " TTFT p50/p95: %.0f / %.0f ms", ttftP50, ttftP95)) + logger.info(String(format: " warm synth p50: %.0f ms", p50)) + logger.info(String(format: " warm synth p95: %.0f ms", p95)) + logger.info(String(format: " agg RTFx: %.2fx", aggRtfx)) + logger.info(String(format: " peak RSS: %.0f MB", peakRssMb)) + if !asrChoice.skipped { + let werVals = perPhrase.compactMap { $0["wer"] as? Double } + let cerVals = perPhrase.compactMap { $0["cer"] as? Double } + let macroWer = + werVals.isEmpty ? 0 : werVals.reduce(0, +) / Double(werVals.count) + let macroCer = + cerVals.isEmpty ? 0 : cerVals.reduce(0, +) / Double(cerVals.count) + logger.info(" ASR backend: \(asrChoice.label)") + logger.info(String(format: " macro WER: %.2f%%", macroWer * 100)) + logger.info(String(format: " macro CER: %.2f%%", macroCer * 100)) + // Word-level WER is meaningless on whitespace-free scripts (zh, ja). + // Surface that explicitly so readers don't trust ~100% WER for zh. 
+ if case .cohere(_, let lang, _) = asrChoice, + lang == .chinese || lang == .japanese + { + logger.info( + " note: WER is whitespace-tokenized; trust CER for \(lang.rawValue).") + } + } else { + logger.info(" WER/CER: skipped") + } + + if let outputJson { + var summary: [String: Any] = [ + "backend": backendId, + "corpus": corpusLabel, + "phrase_count": phrases.count, + "compute_units": preset.cliValue, + "cold_start_s": coldStartS, + "first_synth_ms": firstSynthMs, + "ttft_ms_p50": ttftP50, + "ttft_ms_p95": ttftP95, + "warm_synth_ms_p50": p50, + "warm_synth_ms_p95": p95, + "agg_rtfx": aggRtfx, + "peak_rss_mb": peakRssMb, + "asr_skipped": asrChoice.skipped, + "asr_backend": asrChoice.label, + ] + for (k, v) in extraSummary { + summary[k] = v + } + let report: [String: Any] = [ + "summary": summary, + "categories": categories, + "phrases": perPhrase, + ] + let url = resolveURL(outputJson, isDirectory: false) + try FileManager.default.createDirectory( + at: url.deletingLastPathComponent(), + withIntermediateDirectories: true) + let data = try JSONSerialization.data( + withJSONObject: report, options: [.prettyPrinted, .sortedKeys]) + try data.write(to: url) + logger.info("Report written: \(url.path)") + } + } + + // MARK: - Corpus loading + + private static func loadShippedCorpus( + _ name: String + ) throws -> [(category: String, text: String)] { + let cwd = URL( + fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true) + let relativePath = corpusRelativePath(for: name) + let url = cwd.appendingPathComponent(relativePath, isDirectory: false) + let raw = try String(contentsOf: url, encoding: .utf8) + return parseCorpus(raw, category: name) + } + + /// Map a `--corpus` name to its on-disk relative path. + /// + /// All shipped corpora are MiniMax Multilingual TTS Test Set + /// languages — `minimax-` resolves to + /// `Benchmarks/tts/corpus/minimax/.txt`. The CC-BY-SA-4.0 + /// attribution lives next to the data in `minimax/README.md`. 
+ /// Pass `--corpus-path` for ad-hoc files outside the shipped set. + private static func corpusRelativePath(for name: String) -> String { + let prefix = "minimax-" + if name.hasPrefix(prefix) { + let lang = String(name.dropFirst(prefix.count)) + return "Benchmarks/tts/corpus/minimax/\(lang).txt" + } + // Back-compat shim — anything else is assumed to live next to + // the minimax subdirectory. Prefer `--corpus-path` for non-shipped + // corpora. + return "Benchmarks/tts/corpus/\(name).txt" + } + + private static func parseCorpus( + _ raw: String, category: String + ) -> [(category: String, text: String)] { + return + raw + .split(whereSeparator: \.isNewline) + .map { $0.trimmingCharacters(in: .whitespaces) } + .filter { !$0.isEmpty && !$0.hasPrefix("#") } + .map { (category: category, text: $0) } + } + + // MARK: - Backend dispatch + + private enum Backend: String { + case kokoroAne + case kokoro + case pocketTts + case magpie + case cosyVoice3 + case styleTts2 + + var defaultCorpus: String { + switch self { + case .cosyVoice3: return "minimax-chinese" + default: return "minimax-english" + } + } + } + + private static func parseBackend(_ name: String) -> Backend { + switch name.lowercased() { + case "kokoro-ane", "kokoroane", "kokoro_ane", "lai": + return .kokoroAne + case "kokoro": + return .kokoro + case "pocket-tts", "pockettts", "pocket": + return .pocketTts + case "magpie": + return .magpie + case "cosyvoice3", "cosyvoice", "cosy": + return .cosyVoice3 + case "styletts2", "style-tts2", "styletts", "style": + return .styleTts2 + default: + logger.warning("Unknown backend '\(name)' — defaulting to kokoro-ane") + return .kokoroAne + } + } + + private static func parsePocketLanguage(_ name: String?) -> PocketTtsLanguage { + guard let name, let l = PocketTtsLanguage(rawValue: name.lowercased()) else { + return .english + } + return l + } + + private static func parseMagpieLanguage(_ name: String?) 
-> MagpieLanguage { + guard let name, let l = MagpieLanguage(rawValue: name.lowercased()) else { + return .english + } + return l + } + + private static func parseMagpieSpeaker(_ name: String?) -> MagpieSpeaker { + switch name?.lowercased() { + case "sofia": return .sofia + case "aria": return .aria + case "jason": return .jason + case "leo": return .leo + case "john", nil, "": return .john + default: return .john + } + } + + // MARK: - Helpers + + private static func percentile(_ sorted: [Double], _ p: Double) -> Double { + guard !sorted.isEmpty else { return 0 } + let idx = Int((Double(sorted.count - 1) * p).rounded()) + return sorted[max(0, min(sorted.count - 1, idx))] + } + + private static func resolveURL(_ path: String, isDirectory: Bool) -> URL { + let expanded = (path as NSString).expandingTildeInPath + if expanded.hasPrefix("/") { + return URL(fileURLWithPath: expanded, isDirectory: isDirectory) + } + let cwd = URL( + fileURLWithPath: FileManager.default.currentDirectoryPath, isDirectory: true) + return cwd.appendingPathComponent(expanded, isDirectory: isDirectory) + } + + // MARK: - ASR backend resolution & adapter construction + + /// Map CLI flags + TTS backend defaults to a concrete `AsrChoice`. + /// + /// Precedence: `--skip-asr` and `--asr-backend none` always win. With + /// no flag, English-friendly TTS backends default to Parakeet TDT and + /// CosyVoice3 defaults to `.skip` (Parakeet is English-only — its WER + /// on Mandarin output reads ~100% and is meaningless). 
+ private static func resolveAsrChoice( + skipAsrFlag: Bool, + backendName: String?, + cohereModelDir: String?, + asrLanguage: String?, + cohereComputeUnits: String?, + corpusLabel: String, + ttsBackend: Backend + ) throws -> AsrChoice { + let normalized = backendName?.lowercased() + if skipAsrFlag || normalized == "none" { + return .skip + } + switch normalized { + case "cohere": + let dir = try resolveCohereModelDir(cohereModelDir) + let language = inferCohereLanguage( + explicit: asrLanguage, corpus: corpusLabel) + let units = try resolveCohereComputeUnits(cohereComputeUnits) + return .cohere(modelDir: dir, language: language, computeUnits: units) + case "parakeet": + return .parakeet + case nil: + // Implicit defaults: skip for CosyVoice3 (no English ASR pairing), + // Parakeet otherwise. + if ttsBackend == .cosyVoice3 { + logger.info( + "CosyVoice3: no --asr-backend selected; skipping ASR. " + + "Pass `--asr-backend cohere --cohere-model-dir ` for CER.") + return .skip + } + return .parakeet + default: + logger.warning( + "Unknown --asr-backend value '\(normalized ?? "")', falling back to parakeet.") + return .parakeet + } + } + + /// Resolve a Cohere Transcribe model directory (must contain + /// `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`, + /// and `vocab.json`). + /// + /// Order of resolution: + /// 1. Explicit `--cohere-model-dir `. + /// 2. The default cache location at + /// `~/Library/Application Support/FluidAudio/Models/cohere-transcribe/q8`, + /// matching `Repo.cohereTranscribeCoreml.folderName`. + /// + /// Auto-download is intentionally not wired here: the upstream + /// `Repo.cohereTranscribeCoreml` registration ships `vocab.json` in + /// `requiredModels`, but the file lives at the repo root rather than + /// under the `q8/` subPath, so `DownloadUtils.downloadRepo` would fail + /// the post-download verify. 
Fix this when the registry learns about + /// repo-root files; until then, callers must pre-populate the cache + /// (e.g. via `fluidaudio cohere-transcribe ... --model-dir `). + private static func resolveCohereModelDir(_ override: String?) throws -> URL { + if let override { + return resolveURL(override, isDirectory: true) + } + let appSupport = try FileManager.default.url( + for: .applicationSupportDirectory, + in: .userDomainMask, appropriateFor: nil, create: true) + let target = + appSupport + .appendingPathComponent("FluidAudio/Models/cohere-transcribe/q8") + let needed = [ + ModelNames.CohereTranscribe.encoderCompiledFile, + ModelNames.CohereTranscribe.decoderCacheExternalV2CompiledFile, + "vocab.json", + ] + let missing = needed.filter { name in + !FileManager.default.fileExists( + atPath: target.appendingPathComponent(name).path) + } + guard missing.isEmpty else { + throw NSError( + domain: "TtsBenchmark", code: 1, + userInfo: [ + NSLocalizedDescriptionKey: + "Cohere model dir incomplete at \(target.path). " + + "Missing: \(missing.joined(separator: ", ")). " + + "Pass --cohere-model-dir with the required files, or " + + "pre-populate the cache via `fluidaudio cohere-transcribe`." + ]) + } + return target + } + + /// Pick a `CohereAsrConfig.Language` from an explicit flag value or by + /// scanning the corpus label (covers the shipped `minimax-` set). 
+ private static func inferCohereLanguage( + explicit: String?, corpus: String + ) -> CohereAsrConfig.Language { + if let explicit, + let lang = CohereAsrConfig.Language(rawValue: explicit.lowercased()) + { + return lang + } + let lower = corpus.lowercased() + if lower.contains("chinese") || lower.contains("mandarin") || lower.hasSuffix("-zh") { + return .chinese + } + if lower.contains("japanese") || lower.contains("-ja") { return .japanese } + if lower.contains("korean") || lower.contains("-ko") { return .korean } + if lower.contains("vietnamese") || lower.contains("-vi") { return .vietnamese } + if lower.contains("french") || lower.contains("-fr") { return .french } + if lower.contains("german") || lower.contains("-de") { return .german } + if lower.contains("spanish") || lower.contains("-es") { return .spanish } + if lower.contains("italian") || lower.contains("-it") { return .italian } + if lower.contains("portuguese") || lower.contains("-pt") { return .portuguese } + if lower.contains("dutch") || lower.contains("-nl") { return .dutch } + if lower.contains("polish") || lower.contains("-pl") { return .polish } + if lower.contains("greek") || lower.contains("-el") { return .greek } + if lower.contains("arabic") || lower.contains("-ar") { return .arabic } + return .english + } + + /// Parse `--cohere-compute-units` into `MLComputeUnits`. Defaults to + /// `.all` (CoreML decides). Use `cpu-and-gpu` to skip the ANE compile + /// attempt when the q8 encoder fails ANE compilation (observed: + /// `MILCompilerForANE error: failed to compile ANE model using ANEF`, + /// CoreML falls back to CPU+GPU but pays a multi-minute compile cost + /// on the first call). + private static func resolveCohereComputeUnits( + _ flag: String? 
+ ) throws + -> MLComputeUnits + { + guard let raw = flag?.lowercased(), !raw.isEmpty else { return .all } + switch raw { + case "all", "default": return .all + case "all-ane", "ane", "neural-engine", "cpu-and-ane": + return .cpuAndNeuralEngine + case "cpu-and-gpu", "cpuandgpu", "gpu": return .cpuAndGPU + case "cpu-only", "cpu", "cpuonly": return .cpuOnly + default: + throw NSError( + domain: "TtsBenchmark", code: 3, + userInfo: [ + NSLocalizedDescriptionKey: + "Unknown --cohere-compute-units value '\(raw)'. " + + "Expected: all | cpu-and-gpu | cpu-only | all-ane." + ]) + } + } + + /// Human-readable label for log lines. + private static func describeComputeUnits(_ cu: MLComputeUnits) -> String { + switch cu { + case .all: return "all (CPU+GPU+ANE)" + case .cpuAndNeuralEngine: return "cpu-and-ane" + case .cpuAndGPU: return "cpu-and-gpu" + case .cpuOnly: return "cpu-only" + @unknown default: return "unknown" + } + } + + /// Build the per-phrase ASR adapter for a resolved choice. Returns + /// `nil` for `.skip` so the loop can short-circuit. + private static func buildAsrLoop(_ choice: AsrChoice) async throws -> AsrLoop? { + switch choice { + case .skip: + return nil + case .parakeet: + let asrModels = try await AsrModels.downloadAndLoad() + let asr = AsrManager() + try await asr.loadModels(asrModels) + let layers = await asr.decoderLayerCount + return AsrLoop( + label: "parakeet-tdt", + transcribeOne: { url in + var state = TdtDecoderState.make(decoderLayers: layers) + let r = try await asr.transcribe(url, decoderState: &state) + return r.text + }, + cleanup: { await asr.cleanup() } + ) + case .cohere(let modelDir, let language, let computeUnits): + guard #available(macOS 14, iOS 17, *) else { + throw NSError( + domain: "TtsBenchmark", code: 2, + userInfo: [ + NSLocalizedDescriptionKey: + "Cohere ASR backend requires macOS 14+ / iOS 17+." 
+ ]) + } + logger.info( + "Loading Cohere Transcribe (lang=\(language.englishName), " + + "compute=\(describeComputeUnits(computeUnits))) from \(modelDir.path)") + let models = try await CoherePipeline.loadModels( + encoderDir: modelDir, + decoderDir: modelDir, + vocabDir: modelDir, + decoderVariant: .v2, + computeUnits: computeUnits) + let pipeline = CoherePipeline() + let converter = AudioConverter() + return AsrLoop( + label: "cohere-transcribe-\(language.rawValue)", + transcribeOne: { url in + let samples = try converter.resampleAudioFile(path: url.path) + let r = try await pipeline.transcribe( + audio: samples, + models: models, + language: language, + maxNewTokens: 108, + repetitionPenalty: 1.1, + noRepeatNgram: 3) + return r.text + }, + cleanup: {} + ) + } + } + + private static func printUsage() { + logger.info( + """ + Usage: fluidaudio tts-benchmark [options] + + Quantitative TTS benchmark — TTFT, cold/warm split, per-stage timings, + peak RSS, WER + CER per category, configurable compute-unit preset. + + Backends: + kokoro-ane 7-stage ANE pipeline (per-stage timings, per-stage CU) + kokoro Single-graph CPU+GPU + pocket-tts Streaming flow-matching (multilingual) + magpie Encoder-decoder + NanoCodec (per-stage, slow) + cosyvoice3 Mandarin LLM-based (auto-picks Cohere ASR for zh) + + Options: + --backend See list above (default: kokoro-ane) + --corpus MiniMax corpus name: minimax- + (e.g. 
minimax-english, minimax-chinese, + minimax-vietnamese — 24 languages total; + see Documentation/TTS/MinimaxCorpus.md) + --corpus-path Custom corpus file (overrides --corpus) + --voice Voice id (Kokoro/PocketTTS/CosyVoice3) + --speaker Magpie speaker: john|sofia|aria|jason|leo + --language PocketTTS lang pack or Magpie language code + --compute-units default | all-ane | cpu-and-gpu | cpu-only + --output-json Write JSON report + --audio-dir Keep generated WAVs under this dir + --skip-asr Skip ASR roundtrip (no WER/CER) + --asr-backend ASR engine for the WER/CER pass: + parakeet English-only (default for en) + cohere Multilingual (default for non-en) + none Same as --skip-asr + --cohere-model-dir Path to a directory containing Cohere + Transcribe encoder/decoder/vocab.json. + Required when --asr-backend cohere is + active (auto-download is not wired — + vocab.json lives at the repo root, not + under /q8). Default: cache at + ~/Library/Application Support/FluidAudio/ + Models/cohere-transcribe/q8 + --asr-language Override Cohere language code (default: + inferred from corpus name). One of: + en, zh, ja, ko, vi, fr, de, es, it, pt, + nl, pl, el, ar + --cohere-compute-units
Cohere ASR compute mapping: + all (default; CoreML decides) | + cpu-and-gpu | cpu-only | all-ane. + Use cpu-and-gpu when q8 ANE compile + fails (`MILCompilerForANE error: …`) + — avoids the multi-minute fallback + compile on first call. + --help, -h Show this help + + Examples: + fluidaudio tts-benchmark --backend kokoro-ane --output-json bench.json + fluidaudio tts-benchmark --backend kokoro --corpus minimax-english + fluidaudio tts-benchmark --backend pocket-tts --corpus minimax-german --language german + fluidaudio tts-benchmark --backend magpie --speaker sofia --language en + fluidaudio tts-benchmark --backend cosyvoice3 --corpus minimax-chinese \\ + --asr-backend cohere --cohere-model-dir ~/.fluidaudio/cohere/q8 + + Notes: + For Chinese (zh) and Japanese (ja), WER is meaningless because + WERCalculator splits on whitespace; trust the CER column instead. + The summary banner prints an explicit reminder for these langs. + """ + ) + } +} +#endif diff --git a/Sources/FluidAudioCLI/FluidAudioCLI.swift b/Sources/FluidAudioCLI/FluidAudioCLI.swift index 903683a7..1b601209 100644 --- a/Sources/FluidAudioCLI/FluidAudioCLI.swift +++ b/Sources/FluidAudioCLI/FluidAudioCLI.swift @@ -50,6 +50,10 @@ struct FluidAudioCLI { await MagpieCommand.run(arguments: Array(arguments.dropFirst(2))) case "tts-asr-verify": await TTSAsrVerifyCommand.run(arguments: Array(arguments.dropFirst(2))) + case "tts-benchmark": + await TtsBenchmarkCommand.run(arguments: Array(arguments.dropFirst(2))) + case "minimax-corpus": + await MinimaxCorpusCommand.run(arguments: Array(arguments.dropFirst(2))) case "diarization-benchmark": await StreamDiarizationBenchmark.run(arguments: Array(arguments.dropFirst(2))) case "process": @@ -116,6 +120,8 @@ struct FluidAudioCLI { tts Synthesize speech from text using Kokoro TTS magpie Magpie TTS Multilingual 357M (experimental, ~0.04 RTFx — slow, needs perf work) tts-asr-verify Batch TTS→ASR roundtrip WER verification + tts-benchmark Quantitative TTS benchmark 
(latency, quality, compute-unit sweep) + minimax-corpus Fetch MiniMax TTS Multilingual Test Set into Benchmarks/tts/corpus/minimax parakeet-eou Run Parakeet EOU Streaming ASR on a single file ctc-earnings-benchmark Run CTC keyword spotting benchmark on Earnings22 sortformer Run Sortformer streaming diarization diff --git a/Tests/FluidAudioTests/TTS/CosyVoice3ModelNameTests.swift b/Tests/FluidAudioTests/TTS/CosyVoice3ModelNameTests.swift new file mode 100644 index 00000000..f0faeacc --- /dev/null +++ b/Tests/FluidAudioTests/TTS/CosyVoice3ModelNameTests.swift @@ -0,0 +1,60 @@ +import XCTest + +@testable import FluidAudio + +/// Guard the stateful → stateless decode rename. The HF repo +/// `FluidInference/CosyVoice3-0.5B-coreml` ships only `LLM-Decode-M768-fp16` +/// (non-stateful, external KV cache); resurrecting `-stateful` here would +/// re-break the download path and regress macOS 14 support. +final class CosyVoice3ModelNameTests: XCTestCase { + + // MARK: - ModelNames.CosyVoice3 + + func testLlmDecodeIsStatelessName() { + XCTAssertEqual(ModelNames.CosyVoice3.llmDecode, "LLM-Decode-M768-fp16") + XCTAssertFalse( + ModelNames.CosyVoice3.llmDecode.contains("stateful"), + "llmDecode must not reference the dropped stateful variant") + } + + func testLlmDecodeFileMatchesBaseName() { + XCTAssertEqual( + ModelNames.CosyVoice3.llmDecodeFile, + "LLM-Decode-M768-fp16.mlmodelc") + } + + func testRequiredModelsContainsStatelessDecode() { + XCTAssertTrue( + ModelNames.CosyVoice3.requiredModels.contains("LLM-Decode-M768-fp16.mlmodelc"), + "requiredModels must list the stateless decode bundle") + XCTAssertFalse( + ModelNames.CosyVoice3.requiredModels.contains( + "LLM-Decode-M768-fp16-stateful.mlmodelc"), + "requiredModels must not list the dropped stateful bundle") + } + + func testRequiredModelsHasFourEntries() { + XCTAssertEqual( + ModelNames.CosyVoice3.requiredModels.count, 4, + "Pipeline ships exactly 4 CoreML bundles: prefill, decode, flow, hift") + } + + // MARK: - 
CosyVoice3Constants.Files + + func testFilesLlmDecodeIsStatelessPackage() { + XCTAssertEqual( + CosyVoice3Constants.Files.llmDecode, + "LLM-Decode-M768-fp16.mlpackage") + XCTAssertFalse( + CosyVoice3Constants.Files.llmDecode.contains("stateful")) + } + + func testFilesLlmDecodeSubdirIsRenamed() { + XCTAssertEqual( + CosyVoice3Constants.Files.llmDecodeSubdir, + "llm-fp16-decode", + "Local-build subdir must be the renamed stateless directory") + XCTAssertFalse( + CosyVoice3Constants.Files.llmDecodeSubdir.contains("stateful")) + } +} diff --git a/Tests/FluidAudioTests/TTS/CosyVoice3TextChunkerTests.swift b/Tests/FluidAudioTests/TTS/CosyVoice3TextChunkerTests.swift new file mode 100644 index 00000000..ea419224 --- /dev/null +++ b/Tests/FluidAudioTests/TTS/CosyVoice3TextChunkerTests.swift @@ -0,0 +1,184 @@ +import XCTest + +@testable import FluidAudio + +final class CosyVoice3TextChunkerTests: XCTestCase { + + // MARK: - estimateSpeechTokens + + func testEstimateSpeechTokensCJK() { + // 4 CJK chars × 7.5 = 30 tokens + XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("你好世界"), 30) + } + + func testEstimateSpeechTokensASCII() { + // 5 ASCII chars × 1.5 = 7.5 → rounds to 8 + XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens("hello"), 8) + } + + func testEstimateSpeechTokensEmpty() { + XCTAssertEqual(CosyVoice3TextChunker.estimateSpeechTokens(""), 0) + } + + // MARK: - chunk: short input fast path + + func testChunkEmptyReturnsEmpty() { + XCTAssertEqual(CosyVoice3TextChunker.chunk(""), []) + XCTAssertEqual(CosyVoice3TextChunker.chunk(" "), []) + XCTAssertEqual(CosyVoice3TextChunker.chunk("\n\n"), []) + } + + func testChunkShortReturnsSingle() { + // 5 chars (4 CJK + 「。」) ≈ 33 tokens, well under default 110 + XCTAssertEqual( + CosyVoice3TextChunker.chunk("你好世界。"), + ["你好世界。"]) + } + + func testChunkShortTrimsWhitespace() { + XCTAssertEqual( + CosyVoice3TextChunker.chunk(" hello world. 
"), + ["hello world."]) + } + + // MARK: - chunk: hard sentence enders + + func testChunkSplitsOnHardEnders() { + // 25 CJK chars × 7.5 = 187.5 tokens > 110 default → must split + let text = "今天天气很好。我们去公园散步。明天可能会下雨。下周打算去看电影。" + let chunks = CosyVoice3TextChunker.chunk(text) + XCTAssertGreaterThan(chunks.count, 1) + // No chunk should exceed budget by more than the soft margin + for chunk in chunks { + let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk) + XCTAssertLessThanOrEqual(est, 110 + 30 + 8, "chunk over force-split margin: \(chunk)") + } + // Concatenating chunks back should reconstruct the input modulo + // whitespace trimming. + XCTAssertEqual(chunks.joined(), text) + } + + func testChunkSplitsOnEnglishSentenceEnders() { + // Each sentence ≈ 25–30 tokens; with maxSpeechTokens=80 every + // sentence fits individually so the chunker should commit on the + // first hard ender it sees rather than packing greedily across + // sentences and hitting force-split. + let text = "Hello world. This is a test. Pack my box with five jugs. Quick brown fox jumps." + let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 80) + XCTAssertGreaterThan(chunks.count, 1) + for chunk in chunks { + XCTAssertTrue( + chunk.hasSuffix(".") || chunk.hasSuffix("!") || chunk.hasSuffix("?"), + "chunk does not end at hard boundary: \(chunk)") + } + } + + // MARK: - chunk: soft enders fall-through + + func testChunkFallsBackToSoftEnders() { + // One huge sentence with commas, no periods. Should split on 「,」. + let text = "一个非常非常长的句子,里面有很多分句,每个分句都不是很长,但是加在一起就会超过预算限制" + let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50) + XCTAssertGreaterThan(chunks.count, 1) + for chunk in chunks { + let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk) + // Force-split allows one CJK char of overshoot past the +30 margin + // because the budget check runs AFTER appending the current char. 
+ XCTAssertLessThanOrEqual(est, 50 + 30 + 8) + } + } + + // MARK: - chunk: force-split fallback + + func testChunkForceSplitsOnContinuousCJKWithoutPunctuation() { + // 30 CJK chars, no punctuation: ≈ 225 tokens, must force-split + // somewhere even without natural boundaries. + let text = "今天天气很好我们去公园散步明天可能会下雨下周打算看电影然后回家" + let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 50) + XCTAssertGreaterThan(chunks.count, 1) + for chunk in chunks { + let est = CosyVoice3TextChunker.estimateSpeechTokens(chunk) + // Force-split has a 30-token overshoot allowance + one CJK char (7.5) + XCTAssertLessThanOrEqual(est, 50 + 30 + 8, "chunk overflow on force-split: \(chunk)") + } + // No content lost + XCTAssertEqual(chunks.joined(), text) + } + + func testChunkForceSplitsOnEnglishSpacesWhenNoPunctuation() { + // Long English with no terminal punctuation; should split on spaces + // when the running estimate exceeds budget. + let text = "the quick brown fox jumps over the lazy dog and then runs back home very fast" + let chunks = CosyVoice3TextChunker.chunk(text, maxSpeechTokens: 20) + XCTAssertGreaterThan(chunks.count, 1) + for chunk in chunks { + // No leading/trailing whitespace expected on returned chunks + XCTAssertEqual(chunk, chunk.trimmingCharacters(in: .whitespaces)) + } + } + + // MARK: - concatWithCrossfade + + func testConcatEmptyReturnsEmpty() { + let out = CosyVoice3TtsManager.concatWithCrossfade( + [], sampleRate: 24_000, fadeMs: 8) + XCTAssertEqual(out, []) + } + + func testConcatSingleChunkPassthrough() { + let chunk: [Float] = [0.1, 0.2, 0.3, 0.4] + let out = CosyVoice3TtsManager.concatWithCrossfade( + [chunk], sampleRate: 24_000, fadeMs: 8) + XCTAssertEqual(out, chunk) + } + + func testConcatZeroFadeIsSimpleAppend() { + let a: [Float] = [0.1, 0.2, 0.3] + let b: [Float] = [0.4, 0.5, 0.6] + let out = CosyVoice3TtsManager.concatWithCrossfade( + [a, b], sampleRate: 24_000, fadeMs: 0) + XCTAssertEqual(out, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]) + } + + func 
testConcatCrossfadeShrinksGracefullyForShortChunks() { + // 4-sample chunks; nominal fade at 24 kHz × 8 ms = 192 samples, + // gets clamped to min(out.count/2, next.count/2) = 2. + let a: [Float] = [1.0, 1.0, 1.0, 1.0] + let b: [Float] = [0.0, 0.0, 0.0, 0.0] + let out = CosyVoice3TtsManager.concatWithCrossfade( + [a, b], sampleRate: 24_000, fadeMs: 8) + // Output length: 4 (a) - 2 (fade) + 4 (b) = 6; first 2 of a remain + // pristine, then a 2-sample crossfade region, then last 2 of b + XCTAssertEqual(out.count, 6) + XCTAssertEqual(out[0], 1.0) + XCTAssertEqual(out[1], 1.0) + // Crossfade region: a's 1.0 fades to 0; b's 0.0 fades from 0. + // At j=0: down=1, up=0 → 1.0 * 1 + 0.0 * 0 = 1.0 + // At j=1: down=0.5, up=0.5 → 1.0*0.5 + 0.0*0.5 = 0.5 + XCTAssertEqual(out[2], 1.0, accuracy: 1e-5) + XCTAssertEqual(out[3], 0.5, accuracy: 1e-5) + XCTAssertEqual(out[4], 0.0, accuracy: 1e-5) + XCTAssertEqual(out[5], 0.0, accuracy: 1e-5) + } + + func testConcatCrossfadePreservesPrefixAndSuffix() { + // Long enough chunks for a full fade window + let sampleRate = 24_000 + let fadeMs = 4.0 // 96 samples + let a = [Float](repeating: 1.0, count: 480) + let b = [Float](repeating: 0.0, count: 480) + let out = CosyVoice3TtsManager.concatWithCrossfade( + [a, b], sampleRate: sampleRate, fadeMs: fadeMs) + let fade = Int((Double(sampleRate) * fadeMs / 1000).rounded()) + // Output length: a.count - fade + b.count + XCTAssertEqual(out.count, a.count - fade + b.count) + // Prefix of `a` (before crossfade region) untouched + for j in 0..<(a.count - fade) { + XCTAssertEqual(out[j], 1.0) + } + // Suffix of `b` (after crossfade region) untouched + for j in (a.count..