mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
## Summary Closes the work tracked in #590: bring `Documentation/TTS/Benchmarks.md` into agreement with what's actually shipped on `main` for CoreML TTS backends, and add the two CLI affordances needed to benchmark the in-scope backend × language matrix. ### Doc changes (`Documentation/TTS/Benchmarks.md`) - Single consolidated **per-backend table** that merges basic info (license, language+voice, footprint in **GB**, sample rate, max chunk per pass, streaming flag) with performance metrics (TTFT p50/p95, synth p50/p95, agg RTFx, peak RSS, WER %, CER %). Five rows: Kokoro ANE en (`af_heart`), Kokoro ANE zh (`zf_001`), PocketTTS en (`alba` 6L), Magpie en (`John`, batch-only on `main`), StyleTTS2 en (LibriTTS iteration_3, zero-shot). - Dropped from the top-line per scope decision: non-ANE Kokoro, CosyVoice3 zh, PocketTTS 24L variants, Hindi/Cantonese rows. CosyVoice3 narrative sections (decode budget cap + auto-chunker validation) stay verbatim. - Refreshed Kokoro ANE per-stage breakdown (post-laishere 7-graph chain). - Replaced the old Magpie per-stage table with a pointer paragraph (`MagpieSynthesisResult.timings` is still populated for callers; sub-1.5 s TTFA work referenced in #590 lives on `feat/magpie-lt-fusion`, not `main`). - Corrected PocketTTS footprint to `fp16 ~0.77 / int8 ~0.55 GB` (was `~140 / ~520 MB`); enumerated all 10 packs in the corpus matrix; added zh to the Kokoro ANE corpus row; added a StyleTTS2 row. ### CLI changes (`Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift`) - New `styletts2` / `style-tts2` backend wired to `StyleTTS2Manager.synthesize(text:referenceAudioURL:)`. Requires `--reference <wav>`; the shipped iteration_3 `ref_encoder` is fixed at `[1, 1, 80, 231]`, so the reference must be exactly **2.875 s @ 24 kHz mono** — the harness errors out at predict time on mismatched durations. - New `--variant {english|mandarin}` flag for `kokoro-ane` so the `zf_001` Mandarin voice pack can be benchmarked alongside `af_heart`. Falls back to `english` when unset; the manager constructor now receives the parsed `KokoroAneVariant` and the default voice is variant-aware. ### Methodology 100-phrase MiniMax-Multilingual on MacBook Air M2 (16 GB, macOS 26, on AC), `--compute-units default`. English WER/CER via Parakeet TDT roundtrip; Mandarin CER via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) — macro 4.01% / micro 4.14% across all 100 zh phrases. WER omitted for Mandarin because `WERCalculator` splits on whitespace. ## Test plan - [x] `swift build` clean on `main`-based branch. - [x] `swift format lint --recursive --configuration .swift-format Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift` clean. - [x] Smoke test: `swift run fluidaudio tts-benchmark --backend styletts2 --reference ref.wav --corpus minimax-english --output-json /tmp/styletts2-smoke.json` — produces a valid JSON report. - [x] Smoke test: `swift run fluidaudio tts-benchmark --backend kokoro-ane --variant mandarin --voice zf_001 --corpus minimax-chinese --skip-asr --output-json /tmp/kokoro-zh-smoke.json` — pulls the Mandarin voice pack and produces audio. - [x] Full 100-phrase runs for all five table rows produced under `Benchmarks/tts/runs/590/` (gitignored); table numbers come straight from those JSON reports. - [ ] Reviewer cross-check: footnote markers (`*`, `‡`, `∥`, `¶`) in the consolidated table all have matching paragraphs below.
This commit is contained in:
+118
-178
@@ -5,9 +5,9 @@
|
||||
> phrases / language, CC-BY-SA-4.0) — the same public corpus used
|
||||
> by [MiniMax-Speech][mms], seed-tts-eval, and Gradium, so numbers
|
||||
> here are directly paper-comparable.
|
||||
> **Status:** Kokoro, Kokoro ANE, PocketTTS, Magpie all
|
||||
> complete the English run; CosyVoice3 completes the full Mandarin
|
||||
> run.
|
||||
> **Status:** Kokoro ANE (English + Mandarin), PocketTTS (English),
|
||||
> Magpie (English), and StyleTTS2 (English, zero-shot) all complete
|
||||
> the full 100-phrase MiniMax run.
|
||||
>
|
||||
> [minimax]: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
|
||||
> [mms]: https://arxiv.org/abs/2505.07916
|
||||
@@ -53,10 +53,10 @@ Reference each language as `--corpus minimax-<lang>`:
|
||||
|
||||
| Backend | Default corpus | Other supported MiniMax languages |
|
||||
|-------------|--------------------|------------------------------------------------|
|
||||
| Kokoro / Kokoro ANE | `minimax-english` | `english` only (`af_heart` voice) |
|
||||
| PocketTTS | `minimax-english` | `english`, `german`, `italian`, `portuguese`, `spanish`, `french` |
|
||||
| Kokoro ANE | `minimax-english` | `english` (`af_heart`); Kokoro ANE also ships `chinese` (`--variant mandarin`, voice `zf_001`) |
|
||||
| PocketTTS | `minimax-english` | 6L packs: `english`, `german`, `italian`, `portuguese`, `spanish`. 24L packs: `french_24l`, `german_24l`, `italian_24l`, `portuguese_24l`, `spanish_24l` |
|
||||
| Magpie | `minimax-english` | `english`, `spanish`, `german`, `french`, `italian`, `vietnamese`, `chinese`, `hindi` |
|
||||
| CosyVoice3 | `minimax-chinese` | `chinese`, `cantonese` |
|
||||
| StyleTTS2 | `minimax-english` | `english` only (LibriTTS iteration_3, zero-shot from `--reference` audio) |
|
||||
|
||||
Lines beginning with `#` are comments. Custom corpora can still be
|
||||
passed with `--corpus-path <file.txt>`.
|
||||
@@ -64,26 +64,28 @@ passed with `--corpus-path <file.txt>`.
|
||||
### Metrics
|
||||
|
||||
Per phrase:
|
||||
- `ttft_ms` — time-to-first-audio. For one-shot / batch backends this
|
||||
equals `synth_ms`. **PocketTTS** is benchmarked through
|
||||
`synthesizeStreaming`, so its `ttft_ms` is the timestamp of the first
|
||||
80 ms audio frame (1920 samples @ 24 kHz). **Magpie** is batch-only
|
||||
(`synthesize(...)` returns a single `MagpieSynthesisResult` after
|
||||
the full AR + codec pipeline completes), so `ttft_ms == synth_ms`.
|
||||
- `ttft_ms` — time-to-first-audio. The "first audio" granularity is
|
||||
backend-defined; see [Audio chunk window
|
||||
size](#audio-chunk-window-size) below for the per-backend numbers.
|
||||
**PocketTTS** is benchmarked through `synthesizeStreaming`, so its
|
||||
`ttft_ms` is the timestamp of the first 80 ms audio frame (1920
|
||||
samples @ 24 kHz) — actually-perceptible TTFA. **Kokoro ANE,
|
||||
Magpie, StyleTTS2** are batch / one-shot (`synthesize(...)` returns
|
||||
the full waveform), so `ttft_ms == synth_ms == time-to-complete-wav`
|
||||
for those — interpret it as full-wav latency, not as TTFA.
|
||||
- `synth_ms` — total synth wall time.
|
||||
- `audio_ms` — generated audio duration.
|
||||
- `rtfx` — `audio_ms / synth_ms`.
|
||||
- `wer`, `cer` — via Parakeet ASR roundtrip on the rendered WAV.
|
||||
- `stage_ms` — per-stage breakdown (backend-specific keys; populated
|
||||
for Kokoro ANE + Magpie; empty for Kokoro / PocketTTS /
|
||||
CosyVoice3).
|
||||
for Kokoro ANE; empty for / PocketTTS / Magpie /
|
||||
StyleTTS2 in this report).
|
||||
- Backend-specific extras: `encoder_tokens`, `acoustic_frames`,
|
||||
`chunk_count`, `frame_count`, `code_count`, `finished_on_eos`,
|
||||
`generated_token_count`, etc.
|
||||
`chunk_count`, `frame_count`, `code_count`, `generated_token_count`,
|
||||
etc.
|
||||
|
||||
Aggregates:
|
||||
- `cold_start_s` — `manager.initialize()` wall time. CosyVoice3 also
|
||||
includes voice-asset load.
|
||||
- `cold_start_s` — `manager.initialize()` wall time.
|
||||
- `first_synth_ms` — first synth call after init (still cold-ish).
|
||||
- `ttft_ms_p50` / `ttft_ms_p95`.
|
||||
- `warm_synth_ms_p50` / `warm_synth_ms_p95`.
|
||||
@@ -92,6 +94,21 @@ Aggregates:
|
||||
`task_vm_info_data_t.resident_size_peak`.
|
||||
- Per-category macro WER / CER.
|
||||
|
||||
### Audio chunk window size
|
||||
|
||||
What counts as "first audio" is backend-defined. The vocoder /
|
||||
codec emits in fixed-size chunks; only **PocketTTS** is wired to
|
||||
yield those chunks incrementally on `main`. Everything else returns
|
||||
the full waveform after the full pipeline runs, so `ttft_ms` for
|
||||
those backends measures full-wav latency rather than perceptual
|
||||
TTFA. The consolidated [per-backend table](#per-backend-top-line)
|
||||
below carries the per-backend sample rate, chunk window, and
|
||||
streaming flag inline alongside the performance metrics.
|
||||
|
||||
For batch backends, "average latency" the user perceives is
|
||||
`synth_ms` (full wav) rather than `ttft_ms` — they're equal in
|
||||
that case, so the consolidated table just reports them once.
|
||||
|
||||
### Reproducibility
|
||||
|
||||
```bash
|
||||
@@ -103,14 +120,29 @@ swift run fluidaudio tts-benchmark \
|
||||
--compute-units default \
|
||||
--output-json bench.json \
|
||||
--audio-dir bench-wavs/
|
||||
|
||||
# Kokoro ANE Mandarin (skip Parakeet ASR; whisper CER scored separately).
|
||||
swift run fluidaudio tts-benchmark \
|
||||
--backend kokoro-ane --variant mandarin --voice zf_001 \
|
||||
--corpus minimax-chinese --skip-asr \
|
||||
--output-json bench-zh.json --audio-dir bench-wavs-zh/
|
||||
|
||||
# StyleTTS2 zero-shot (LibriTTS iteration_3). The shipped ref_encoder
|
||||
# is fixed at [1, 1, 80, 231], so the reference must be exactly
|
||||
# 2.875 s @ 24 kHz mono. Trim externally before invoking, e.g.:
|
||||
# ffmpeg -i speaker.wav -t 2.875 -ar 24000 -ac 1 -c:a pcm_s16le ref.wav
|
||||
swift run fluidaudio tts-benchmark \
|
||||
--backend styletts2 --reference ref.wav \
|
||||
--corpus minimax-english \
|
||||
--output-json bench-styletts2.json --audio-dir bench-wavs-styletts2/
|
||||
```
|
||||
|
||||
The harness writes a JSON report to `--output-json` and (optionally)
|
||||
keeps WAVs under `--audio-dir`. Pass `--skip-asr` to drop the ASR
|
||||
roundtrip. The default ASR backend is `parakeet` for English-only
|
||||
runs and is skipped for CosyVoice3; pass `--asr-backend cohere
|
||||
--cohere-model-dir <dir>` to score Mandarin (or any of the 14
|
||||
Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).
|
||||
runs; pass `--asr-backend cohere --cohere-model-dir <dir>` to score
|
||||
Mandarin (or any of the 14 Cohere languages) against
|
||||
[Cohere Transcribe](../../Sources/FluidAudio/ASR/Cohere/).
|
||||
|
||||
## Results
|
||||
|
||||
@@ -118,88 +150,90 @@ Cohere languages) against [Cohere Transcribe](../../Sources/FluidAudio/ASR/Coher
|
||||
|
||||
Reference machine: **MacBook Air, Apple M2 (2022), 8-core CPU /
|
||||
8-core GPU / 16-core Neural Engine, 16 GB unified memory, macOS 26**
|
||||
(`Mac14,2`, on AC). All English runs use `--compute-units default`,
|
||||
voice = backend default
|
||||
(`af_heart` for Kokoro, `alba` for PocketTTS, `John` for Magpie),
|
||||
corpus = `minimax-english` (100 phrases), Parakeet TDT roundtrip for
|
||||
WER / CER.
|
||||
(`Mac14,2`, on AC). All runs use `--compute-units default`, 100
|
||||
phrases per language. Voices are backend defaults
|
||||
(`af_heart` for Kokoro ANE en, `zf_001` for Kokoro ANE zh,
|
||||
`alba` for PocketTTS, `John` for Magpie, LibriTTS iteration_3 for
|
||||
StyleTTS2). English WER / CER via Parakeet TDT roundtrip; Mandarin
|
||||
CER via `whisper-large-v3`.
|
||||
|
||||
| Backend | License | Languages | Footprint | Cold start | TTFT p50 / p95\* | Synth p50 / p95 | Agg RTFx | Peak RSS | WER | CER | Notes |
|
||||
|-------------|-------------|------------------------|-----------|------------|---------------------|---------------------|----------|----------|---------|---------|-------|
|
||||
| Kokoro ANE | Apache-2.0 | en (af_heart only) | ~330 MB | 37.9 s | 1586 / 2515 ms | 1586 / 2515 ms | 5.19× | 738 MB | 0.108 | 0.040 | one-shot; per-stage CU sweep, 7-graph pipeline |
|
||||
| Kokoro | Apache-2.0 | en (af_heart only) | ~330 MB | 92.2 s | 3113 / 4696 ms | 3113 / 4696 ms | 2.02× | 736 MB | 0.013 | 0.005 | one-shot; cleanest English ASR roundtrip |
|
||||
| PocketTTS | research | en + de + it + pt + es + fr (6L / 24L) | ~140 / ~520 MB | 6.0 s | **1244 / 4749 ms** | 8757 / 19174 ms | 0.61× | 1503 MB | 0.014 | 0.006 | **streaming**; TTFT is first 80 ms audio frame |
|
||||
| Magpie | research | en/es/de/fr/it/vi/zh/hi | ~1.3 GB | 38.5 s∥ | 15080 / 29895 ms∥ | 15080 / 29895 ms∥ | 0.64×∥ | 762 MB∥ | 0.056 | 0.033 | **batch-only**; `ttft_ms == synth_ms`; split-K/V decoder; outputBackings fast path with latched fallback |
|
||||
| CosyVoice3 | Apache-2.0 | zh (mandarin) | ~1.5 GB | 29.2 s† | 14091 / 23679 ms† | 14091 / 23679 ms† | 0.357׆ | 3302 MB† | n/a‡ | 0.017‡ | beta; full `minimax-chinese` (100/100 phrases) for latency / RSS and whisper-large-v3 CER‡; cantonese supported via [auto-chunker](#cosyvoice3-auto-chunker) but not benchmarked (no yue ASR) |
|
||||
One consolidated table per backend × language. **Basic info**
|
||||
(license, language, footprint, sample rate, max chunk per pass,
|
||||
streaming flag) is merged with **performance** (TTFT, synth, RTFx,
|
||||
peak RSS, WER, CER) so there is a single source of truth.
|
||||
|
||||
| Backend | License | Language (voice) | Footprint | Sample rate | Max chunk per pass | Streaming | TTFT p50 / p95\* | Synth p50 / p95 | Agg RTFx | Peak RSS | WER | CER |
|
||||
|------------|------------|---------------------------|----------------------------|-------------|------------------------------------------------------------------|-----------|-------------------|-------------------|-----------|----------|--------|--------|
|
||||
| Kokoro ANE | Apache-2.0 | en (`af_heart`) | ~0.33 GB | 24 kHz | 510 phonemes / pass (≈25–30 s of audio) | No | **988 / 2068 ms** | 988 / 2068 ms | **7.47×** | 1027 MB | 10.8% | 4.0% |
|
||||
| Kokoro ANE | Apache-2.0 | zh (`zf_001`) | ~0.33 GB | 24 kHz | 510 phonemes / pass (≈25–30 s of audio) | No | **956 / 1802 ms** | 956 / 1802 ms | 6.37× | 685 MB | n/a‡ | 4.0%‡ |
|
||||
| PocketTTS | research | en (`alba`, 6L pack) | int8 ~0.55 GB | 24 kHz | 80 ms Mimi frame, streams until EOS (no fixed cap) | Yes | **710 / 1496 ms** | 5160 / 9801 ms | 1.10× | 1167 MB | 1.0% | 0.4% |
|
||||
| Magpie | research | en (`John`) | ~1.3 GB | 22.05 kHz | 256 NanoCodec frames / pass (≈11.9 s); sentence-split for longer | No | 11470 / 26042 ms∥ | 11470 / 26042 ms∥ | 0.87×∥ | 543 MB∥ | 3.8% | 2.6% |
|
||||
| StyleTTS2 | research | en (LibriTTS iteration_3) | ~0.67 GB¶ | 24 kHz | 256 tokens / pass (≈30 s of audio max) | No | 1574 / 3088 ms | 1574 / 3088 ms | 4.59× | 522 MB | 9.4% | 4.1% |
|
||||
|
||||
\* TTFT for **PocketTTS** is first-frame emit through the streaming
|
||||
API; **Magpie** is batch-only (`ttft_ms == synth_ms`); the others
|
||||
are one-shot, so `ttft_ms == synth_ms`.
|
||||
API (perceptual TTFA). **Kokoro ANE / Magpie / StyleTTS2** all run
|
||||
one-shot per phrase (no streaming yield on `main`), so for those
|
||||
rows `ttft_ms == synth_ms == time-to-complete-wav`.
|
||||
|
||||
† CosyVoice3 chinese: 100/100, 0 errors, ASR skipped. Cold-start
|
||||
dropped from 302.7 s to 29.2 s on the warm re-run.
|
||||
|
||||
‡ CosyVoice3 CER measured on the **full 100-phrase**
|
||||
‡ Kokoro ANE Mandarin CER measured on the **full 100-phrase**
|
||||
`minimax-chinese` corpus via `whisper-large-v3` (Python CPU FP32,
|
||||
[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py)) on
|
||||
the WAVs rendered by `tts-benchmark --backend cosyvoice3 --corpus
|
||||
minimax-chinese --skip-asr --audio-dir <dir>`: **macro CER 1.68%
|
||||
(0.0168)**, **micro CER 1.84% (0.0184)** across 100 phrases.
|
||||
Whisper is the source of truth here because Cohere Transcribe q8
|
||||
hit a `MILCompilerForANE` cache failure on this M2 host and ran on
|
||||
the CPU+GPU fallback path at RTFx ~0.13× (would have taken multiple
|
||||
hours for the full 100-phrase set vs. ~70 min for whisper). WER is
|
||||
omitted because Mandarin has no word boundaries and `WERCalculator`
|
||||
splits on whitespace, so word-level WER reads near 100% and is
|
||||
meaningless.
|
||||
[`Scripts/whisper_zh_cer.py`](../../Scripts/whisper_zh_cer.py))
|
||||
against the WAVs rendered by `tts-benchmark --backend kokoro-ane
|
||||
--variant mandarin --voice zf_001 --corpus minimax-chinese
|
||||
--skip-asr`: **macro CER 4.01% (0.0401)**, **micro CER 4.14%
|
||||
(0.0414)** across 100 phrases (table reports the macro figure).
|
||||
WER is omitted because Mandarin has no word boundaries and
|
||||
`WERCalculator` splits on whitespace — word-level WER reads near
|
||||
100% and is meaningless. Cohere Transcribe q8 hit a
|
||||
`MILCompilerForANE` cache failure on this M2 host, so whisper is
|
||||
the local source of truth for Mandarin CER.
|
||||
|
||||
∥ Magpie: batch-only. `synthesize(...)` returns one
|
||||
`MagpieSynthesisResult` after the full AR + codec pipeline completes,
|
||||
so `ttft_ms == synth_ms`. Long inputs are sentence-split internally
|
||||
(NanoCodec 256-frame static cap) and AR(N+1) ‖ codec(N) chunk-level
|
||||
pipelining overlaps the next chunk's AR loop with the current chunk's
|
||||
codec pass — wallclock optimization, not incremental yield.
|
||||
codec pass — wallclock optimization, not incremental yield. The
|
||||
sub-1.5 s TTFA work referenced in issue #590 (fused sampler +
|
||||
24-frame cap) lives on `feat/magpie-lt-fusion`, not `main`.
|
||||
|
||||
¶ StyleTTS2 footprint is the sum of the shipped iteration_3 mlpackages
|
||||
(text encoder + bert + ref_encoder + post_albert + alignment + prosody
|
||||
+ noise + decoder + tail). The shipped ref_encoder is exported with
|
||||
a fixed `[1, 1, 80, 231]` mel shape, so reference audio must be
|
||||
exactly 2.875 s @ 24 kHz (300-hop). The benchmark harness expects
|
||||
the caller to trim externally; mismatched durations error out at
|
||||
predict time.
|
||||
|
||||
### Kokoro ANE — per-stage breakdown (default preset, MiniMax-English)
|
||||
|
||||
Means across 100 `minimax-english` phrases on M2. Stages map to the
|
||||
7-CoreML-graph split documented in [KokoroAne.md](KokoroAne.md). Vocoder
|
||||
+ noise together account for ~92% of synth time, which is the natural
|
||||
target for any further per-stage compute-unit re-tuning. The MiniMax
|
||||
mean is meaningfully higher than the prior Harvard-sentences run
|
||||
because phrases 81–100 are paragraph-length news / story sentences.
|
||||
Means across 100 `minimax-english` phrases on M2 (`af_heart`,
|
||||
post-laishere 7-graph chain). Stages map to the 7-CoreML-graph split
|
||||
documented in [KokoroAne.md](KokoroAne.md). Vocoder + noise together
|
||||
account for ~90% of synth time, which is the natural target for any
|
||||
further per-stage compute-unit re-tuning.
|
||||
|
||||
| Stage | Mean ms | % of total |
|
||||
|---------------|---------|------------|
|
||||
| `albert` | 28.2 | 2.0% |
|
||||
| `post_albert` | 12.1 | 0.9% |
|
||||
| `alignment` | 1.8 | 0.1% |
|
||||
| `prosody` | 49.2 | 3.5% |
|
||||
| `noise` | 242.6 | 17.4% |
|
||||
| `vocoder` | 1039.8 | 74.4% |
|
||||
| `tail` | 24.6 | 1.8% |
|
||||
| **total** | 1398.4 | 100% |
|
||||
| `albert` | 24.5 | 2.5% |
|
||||
| `post_albert` | 9.3 | 1.0% |
|
||||
| `alignment` | 1.4 | 0.1% |
|
||||
| `prosody` | 40.0 | 4.1% |
|
||||
| `noise` | 169.3 | 17.5% |
|
||||
| `vocoder` | 704.4 | 72.9% |
|
||||
| `tail` | 17.0 | 1.8% |
|
||||
| **total** | 965.9 | 100% |
|
||||
|
||||
### Magpie — per-stage breakdown (default preset, MiniMax-English)
|
||||
### Magpie — per-stage breakdown
|
||||
|
||||
Means across 100 `minimax-english` phrases on M2 (`John` voice, en,
|
||||
default compute units), captured during the original one-shot
|
||||
profiling run. `ar_loop` is the umbrella for the per-step
|
||||
`decoder_step` + `sampler` (so it is not added on top in the total).
|
||||
`nanocodec` runs concurrently with the next chunk's AR loop via
|
||||
chunk-level pipelining inside `synthesize(...)`, which is why the
|
||||
per-stage means do not sum to total warm-synth mean. The AR loop
|
||||
dominates the wall clock, and its cost grows super-linearly with
|
||||
phrase length — long news / story phrases drive the long-tail p95.
|
||||
|
||||
| Stage | Mean ms |
|
||||
|--------------------|---------|
|
||||
| `text_encoder` | 91 |
|
||||
| `prefill` | 281 |
|
||||
| `ar_loop` | 17946 |
|
||||
| └── `decoder_step` | 14840 |
|
||||
| └── `sampler` | 3081 |
|
||||
| `nanocodec` | 17948 |
|
||||
Per-stage timings (`text_encoder`, `prefill`, `ar_loop`,
|
||||
`decoder_step`, `sampler`, `nanocodec`) are still populated on
|
||||
`MagpieSynthesisResult.timings` for callers that want them — see
|
||||
[`MagpieTypes.swift`](../../Sources/FluidAudio/TTS/Magpie/MagpieTypes.swift).
|
||||
This document does not currently re-publish the per-stage table on
|
||||
`main`: the AR loop dominates and its absolute numbers are
|
||||
in active flux on `feat/magpie-lt-fusion` (fused sampler + 24-frame
|
||||
NanoCodec cap). Republish here once that branch lands on `main`.
|
||||
|
||||
### About the WER / CER numbers
|
||||
|
||||
@@ -212,97 +246,3 @@ discussion](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set/
|
||||
absolute WER is best read **relatively** (backend A vs. backend B on
|
||||
the same corpus + same ASR + same normalizer) rather than against
|
||||
raw paper numbers.
|
||||
|
||||
## CosyVoice3 Decode budget cap
|
||||
|
||||
CosyVoice3's Flow CFM was exported with a fixed input shape of
|
||||
`[1, 250]` speech tokens (`flowTotalTokens` in
|
||||
`CosyVoice3Constants.swift:45`). The LLM-Decode AR loop is allowed to
|
||||
emit up to `flowTotalTokens − N_prompt` tokens before being cut off
|
||||
(typically ~163 generated tokens after the speech-prompt portion).
|
||||
At `tokenMelRatio=2 × hiftSamplesPerFrame=480 / sampleRate=24000`
|
||||
that's **40 ms of audio per generated token**, so the loop produces
|
||||
**at most ~6.5 s of speech per phrase**, regardless of how long the
|
||||
input text is.
|
||||
|
||||
When the AR loop exits because it ran out of budget (i.e. no EOS
|
||||
token in `stopRange = 6_561…6_760`) instead of natural termination,
|
||||
`CosyVoice3Synthesizer` now:
|
||||
|
||||
1. Logs a `.warning` (one-shot per phrase) naming the
|
||||
`decoded.count / maxNew` budget and the produced audio duration.
|
||||
2. Sets `CosyVoice3SynthesisResult.finishedOnEos = false`, which the
|
||||
benchmark harness surfaces as the `finished_on_eos` field on each
|
||||
phrase in the JSON report.
|
||||
|
||||
Footprint on the cantonese corpus (`minimax-cantonese`,
|
||||
100 phrases) **without the chunker**: 80 / 100 phrases would hit the
|
||||
cap, all producing exactly 163 generated tokens / ~6.5 s of audio.
|
||||
The mandarin corpus sees a much lower truncation rate because
|
||||
MiniMax-zh phrases are shorter on average.
|
||||
|
||||
The structural fix — re-exporting the Flow CFM from
|
||||
[`mobius-cosyvoice3`](https://github.com/voicelink-ai/mobius-cosyvoice3)
|
||||
with a larger fixed input shape (e.g. `[1, 500]`) — is upstream
|
||||
work; bumping the constant in Swift alone would make the Flow
|
||||
input/output shapes mismatch at predict time. The shipped workaround
|
||||
is the call-site [auto-chunker](#cosyvoice3-auto-chunker), which
|
||||
drops cantonese truncation from 80/100 → 5/100 by splitting long
|
||||
inputs at clause boundaries and crossfading the results.
|
||||
|
||||
Surfaced in
|
||||
`CosyVoice3Synthesizer.synthesize`
|
||||
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift`)
|
||||
and
|
||||
`CosyVoice3SynthesisResult.finishedOnEos`
|
||||
(`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Types.swift`).
|
||||
|
||||
## CosyVoice3 auto-chunker
|
||||
|
||||
Re-exporting Flow CFM with a larger fixed input shape is gated on
|
||||
upstream conversion work. Until that lands, `CosyVoice3TtsManager`
|
||||
splits long inputs at the call site, synthesizes each chunk
|
||||
independently, and merges with an 8 ms equal-power cosine crossfade.
|
||||
|
||||
**Splitter policy** (`CosyVoice3TextChunker`):
|
||||
|
||||
- **Hard enders** commit always: `.`, `!`, `?`, `。`, `!`, `?`,
|
||||
`\n`.
|
||||
- **Soft enders** commit only when the running estimate is at or past
|
||||
the budget: `,`, `、`, `;`, `:`, `;`, `,`, ASCII space.
|
||||
- **Force-split** at `budget + 30` tokens of overshoot if no natural
|
||||
boundary appeared (rare; mostly continuous CJK with no
|
||||
punctuation).
|
||||
|
||||
**Token-rate estimate** (calibrated against minimax-zh + minimax-yue
|
||||
runs):
|
||||
|
||||
| Char class | Tokens / char | Rationale |
|
||||
|------------|---------------|--------------------------------------------------------------|
|
||||
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
|
||||
| ASCII | 1.5 | matches BPE rate on English text |
|
||||
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
|
||||
|
||||
`defaultMaxSpeechTokens` is **110**, leaving margin under the
|
||||
250-token Flow cap minus typical 60–90 token speech-prompt context.
|
||||
|
||||
**Concatenation**: 8 ms equal-power cosine crossfade at 24 kHz
|
||||
between adjacent chunks; single-chunk path short-circuits to plain
|
||||
copy.
|
||||
|
||||
**Validation** (full `minimax-cantonese`, 100 phrases, M2):
|
||||
|
||||
| Metric | Pre-chunker | Post-chunker | Δ |
|
||||
|-------------------------------------------|-------------|--------------|------------|
|
||||
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
|
||||
| Longest audio output | 6.5 s | **16.1 s** | +148% |
|
||||
| agg-RTFx | 0.245× | 0.249× | +1.6% |
|
||||
| TTFT p50 | 23.9 s | 35.7 s | +49% |
|
||||
| TTFT p95 | 41.2 s | 60.5 s | +47% |
|
||||
| Peak RSS | 2016 MB | 3264 MB | +62% |
|
||||
|
||||
The 5/100 residual is the long-tail token-rate worst case (some
|
||||
Cantonese characters generate >9 speech tokens); raising the
|
||||
per-CJK heuristic further would over-fragment short phrases.
|
||||
Cleaner fix is the upstream Flow re-export.
|
||||
|
||||
|
||||
@@ -115,6 +115,18 @@ public actor PunctuationCommitLayer {
|
||||
/// Active debounce timer task.
|
||||
private var debounceTask: Task<Void, Never>?
|
||||
|
||||
/// Monotonically increasing generation for the active debounce timer.
|
||||
///
|
||||
/// Incremented whenever a pending timer is invalidated (new partial
|
||||
/// text, EOU, manual commit, reset, or a fresh timer being started).
|
||||
/// The timer task captures the generation it was created under and
|
||||
/// re-checks it after landing back on the actor, which closes the
|
||||
/// race where `Task.sleep` already returned and the
|
||||
/// `!Task.isCancelled` guard passed before a subsequent call to
|
||||
/// `processPartialText` / `processEOU` / `manualCommit` / `reset`
|
||||
/// requested cancellation.
|
||||
private var debounceGeneration: UInt64 = 0
|
||||
|
||||
/// Callback invoked when updates occur.
|
||||
private var updateCallback: (@Sendable (CommitLayerUpdate) -> Void)?
|
||||
|
||||
@@ -149,7 +161,7 @@ public actor PunctuationCommitLayer {
|
||||
/// - Returns: Update containing committed text, ghost text, and commit reason.
|
||||
public func processPartialText(_ text: String) async -> CommitLayerUpdate {
|
||||
// Cancel existing debounce timer
|
||||
debounceTask?.cancel()
|
||||
invalidateDebounce()
|
||||
lastUpdateTime = Date()
|
||||
|
||||
// Find last punctuation mark in text
|
||||
@@ -223,7 +235,7 @@ public actor PunctuationCommitLayer {
|
||||
///
|
||||
/// - Returns: Update with all text committed and EOU commit reason.
|
||||
public func processEOU() async -> CommitLayerUpdate {
|
||||
debounceTask?.cancel()
|
||||
invalidateDebounce()
|
||||
lastUpdateTime = Date()
|
||||
|
||||
// EOU signals end of utterance: commit everything
|
||||
@@ -262,7 +274,7 @@ public actor PunctuationCommitLayer {
|
||||
///
|
||||
/// - Returns: Update with ghost text promoted to committed text.
|
||||
public func manualCommit() async -> CommitLayerUpdate {
|
||||
debounceTask?.cancel()
|
||||
invalidateDebounce()
|
||||
lastUpdateTime = Date()
|
||||
|
||||
guard !ghostText.isEmpty else {
|
||||
@@ -298,7 +310,7 @@ public actor PunctuationCommitLayer {
|
||||
|
||||
/// Resets the commit layer, clearing all committed and ghost text.
|
||||
public func reset() async {
|
||||
debounceTask?.cancel()
|
||||
invalidateDebounce()
|
||||
committedText = ""
|
||||
ghostText = ""
|
||||
lastUpdateTime = Date()
|
||||
@@ -323,9 +335,19 @@ public actor PunctuationCommitLayer {
|
||||
|
||||
// MARK: - Private Helpers
|
||||
|
||||
/// Cancels any pending debounce timer and bumps the generation so any
|
||||
/// timer task already past its `!Task.isCancelled` guard becomes a
|
||||
/// no-op when it lands back on the actor.
|
||||
private func invalidateDebounce() {
|
||||
debounceTask?.cancel()
|
||||
debounceTask = nil
|
||||
debounceGeneration &+= 1
|
||||
}
|
||||
|
||||
/// Starts a debounce timer that commits ghost text after the timeout expires.
|
||||
private func startDebounceTimer() {
|
||||
debounceTask?.cancel()
|
||||
invalidateDebounce()
|
||||
let pendingGeneration = debounceGeneration
|
||||
|
||||
debounceTask = Task { [weak self, debounceTimeout, commitOnTimeout] in
|
||||
try? await Task.sleep(nanoseconds: UInt64(debounceTimeout * 1_000_000_000))
|
||||
@@ -337,11 +359,21 @@ public actor PunctuationCommitLayer {
|
||||
if commitOnTimeout {
|
||||
// Check cancellation again before acquiring actor executor
|
||||
guard !Task.isCancelled else { return }
|
||||
await self.commitGhostText(reason: .debounceTimeout)
|
||||
await self.fireDebounceCommit(generation: pendingGeneration)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Commits ghost text only if no newer timer / commit / reset has
|
||||
/// superseded the timer that scheduled this commit. Closes the race
|
||||
/// where `Task.sleep` returns and `!Task.isCancelled` is observed
|
||||
/// `false` *before* a subsequent call to `processPartialText`,
|
||||
/// `processEOU`, `manualCommit`, or `reset` cancels us.
|
||||
private func fireDebounceCommit(generation: UInt64) async {
|
||||
guard generation == debounceGeneration else { return }
|
||||
await commitGhostText(reason: .debounceTimeout)
|
||||
}
|
||||
|
||||
/// Commits ghost text to committed text with the specified reason.
|
||||
///
|
||||
/// - Parameter reason: The reason for committing.
|
||||
|
||||
@@ -14,6 +14,7 @@ import Foundation
|
||||
/// kokoro — single-graph CPU+GPU (chunk-level only)
|
||||
/// pocket-tts — streaming flow-matching (no per-stage timings)
|
||||
/// magpie — encoder-decoder + NanoCodec (6-stage timings, slow)
|
||||
/// styletts2 — LibriTTS iteration_3, zero-shot w/ reference audio
|
||||
/// cosyvoice3 — Mandarin LLM-based (Mandarin corpus only, no WER)
|
||||
///
|
||||
/// Usage:
|
||||
@@ -105,6 +106,8 @@ public enum TtsBenchmarkCommand {
|
||||
var cohereModelDirArg: String?
|
||||
var asrLanguageArg: String?
|
||||
var cohereComputeUnitsArg: String?
|
||||
var referencePath: String?
|
||||
var variantArg: String?
|
||||
|
||||
var i = 0
|
||||
while i < arguments.count {
|
||||
@@ -177,6 +180,16 @@ public enum TtsBenchmarkCommand {
|
||||
cohereComputeUnitsArg = arguments[i + 1]
|
||||
i += 1
|
||||
}
|
||||
case "--reference":
|
||||
if i + 1 < arguments.count {
|
||||
referencePath = arguments[i + 1]
|
||||
i += 1
|
||||
}
|
||||
case "--variant":
|
||||
if i + 1 < arguments.count {
|
||||
variantArg = arguments[i + 1]
|
||||
i += 1
|
||||
}
|
||||
case "--help", "-h":
|
||||
printUsage()
|
||||
return
|
||||
@@ -245,9 +258,11 @@ public enum TtsBenchmarkCommand {
|
||||
do {
|
||||
switch backend {
|
||||
case .kokoroAne:
|
||||
let kaVariant = parseKokoroAneVariant(variantArg)
|
||||
try await runKokoroAne(
|
||||
phrases: phrases, corpusLabel: corpusLabel,
|
||||
voice: voice ?? KokoroAneConstants.defaultVoice,
|
||||
variant: kaVariant,
|
||||
voice: voice ?? kaVariant.defaultVoice,
|
||||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||||
asrChoice: asrChoice)
|
||||
case .kokoro:
|
||||
@@ -269,6 +284,12 @@ public enum TtsBenchmarkCommand {
|
||||
speakerName: speakerName, languageName: languageName,
|
||||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||||
asrChoice: asrChoice)
|
||||
case .styleTts2:
|
||||
try await runStyleTTS2(
|
||||
phrases: phrases, corpusLabel: corpusLabel,
|
||||
referencePath: referencePath,
|
||||
preset: preset, outputJson: outputJson, audioDir: audioDir,
|
||||
asrChoice: asrChoice)
|
||||
case .cosyVoice3:
|
||||
try await runCosyVoice3(
|
||||
phrases: phrases, corpusLabel: corpusLabel,
|
||||
@@ -287,6 +308,7 @@ public enum TtsBenchmarkCommand {
|
||||
private static func runKokoroAne(
|
||||
phrases: [(category: String, text: String)],
|
||||
corpusLabel: String,
|
||||
variant: KokoroAneVariant,
|
||||
voice: String,
|
||||
preset: TtsComputeUnitPreset,
|
||||
outputJson: String?,
|
||||
@@ -294,7 +316,7 @@ public enum TtsBenchmarkCommand {
|
||||
asrChoice: AsrChoice
|
||||
) async throws {
|
||||
let units = KokoroAneComputeUnits(preset: preset)
|
||||
let manager = KokoroAneManager(defaultVoice: voice, computeUnits: units)
|
||||
let manager = KokoroAneManager(variant: variant, defaultVoice: voice, computeUnits: units)
|
||||
|
||||
let coldStart = Date()
|
||||
try await manager.initialize()
|
||||
@@ -560,6 +582,80 @@ public enum TtsBenchmarkCommand {
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - StyleTTS2 driver
|
||||
|
||||
private static func runStyleTTS2(
|
||||
phrases: [(category: String, text: String)],
|
||||
corpusLabel: String,
|
||||
referencePath: String?,
|
||||
preset: TtsComputeUnitPreset,
|
||||
outputJson: String?,
|
||||
audioDir: String?,
|
||||
asrChoice: AsrChoice
|
||||
) async throws {
|
||||
guard let referencePath, !referencePath.isEmpty else {
|
||||
logger.error(
|
||||
"styletts2 backend requires --reference <speaker-audio-file> "
|
||||
+ "(any sample rate / channel layout — resampled to 24 kHz mono).")
|
||||
exit(1)
|
||||
}
|
||||
let referenceURL = resolveURL(referencePath, isDirectory: false)
|
||||
guard FileManager.default.fileExists(atPath: referenceURL.path) else {
|
||||
logger.error("Reference audio not found: \(referenceURL.path)")
|
||||
exit(1)
|
||||
}
|
||||
logger.info("StyleTTS2 reference audio: \(referenceURL.path)")
|
||||
|
||||
let units = preset.uniformUnits ?? .cpuAndNeuralEngine
|
||||
let manager = StyleTTS2Manager(computeUnits: units)
|
||||
|
||||
let coldStart = Date()
|
||||
try await manager.initialize()
|
||||
let coldStartS = Date().timeIntervalSince(coldStart)
|
||||
logger.info(String(format: "Cold start (initialize): %.2fs", coldStartS))
|
||||
|
||||
let firstStart = Date()
|
||||
_ = try await manager.synthesize(
|
||||
text: "Initialization warm-up.",
|
||||
referenceAudioURL: referenceURL)
|
||||
let firstSynthMs = Date().timeIntervalSince(firstStart) * 1000
|
||||
logger.info(String(format: "First synth: %.0f ms", firstSynthMs))
|
||||
|
||||
try await runPhraseLoop(
|
||||
backendId: "styletts2",
|
||||
voiceLabel: referenceURL.lastPathComponent,
|
||||
corpusLabel: corpusLabel,
|
||||
phrases: phrases,
|
||||
preset: preset,
|
||||
coldStartS: coldStartS,
|
||||
firstSynthMs: firstSynthMs,
|
||||
outputJson: outputJson,
|
||||
audioDir: audioDir,
|
||||
asrChoice: asrChoice,
|
||||
extraSummary: [
|
||||
"reference": referenceURL.path,
|
||||
"alpha": Double(StyleTTS2Constants.defaultAlpha),
|
||||
"beta": Double(StyleTTS2Constants.defaultBeta),
|
||||
]
|
||||
) { text in
|
||||
// StyleTTS2 is a one-shot diffusion-based synthesizer — no
|
||||
// streaming yield, so TTFT == synthMs. The per-phrase mel
|
||||
// recompute is tiny vs. the 5-step ADPM2 + decoder cost.
|
||||
let t0 = Date()
|
||||
let samples = try await manager.synthesize(
|
||||
text: text, referenceAudioURL: referenceURL)
|
||||
let synthMs = Date().timeIntervalSince(t0) * 1000
|
||||
return BackendPhraseSample(
|
||||
synthMs: synthMs,
|
||||
ttftMs: synthMs,
|
||||
samples: samples,
|
||||
sampleRate: StyleTTS2Constants.sampleRate,
|
||||
stageMs: [:],
|
||||
extraFields: [:]
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - CosyVoice3 driver
|
||||
|
||||
private static func runCosyVoice3(
|
||||
@@ -885,6 +981,7 @@ public enum TtsBenchmarkCommand {
|
||||
case kokoro
|
||||
case pocketTts
|
||||
case magpie
|
||||
case styleTts2
|
||||
case cosyVoice3
|
||||
|
||||
var defaultCorpus: String {
|
||||
@@ -905,6 +1002,8 @@ public enum TtsBenchmarkCommand {
|
||||
return .pocketTts
|
||||
case "magpie":
|
||||
return .magpie
|
||||
case "styletts2", "style-tts2", "styletts", "style-tts":
|
||||
return .styleTts2
|
||||
case "cosyvoice3", "cosyvoice", "cosy":
|
||||
return .cosyVoice3
|
||||
default:
|
||||
@@ -938,6 +1037,17 @@ public enum TtsBenchmarkCommand {
|
||||
}
|
||||
}
|
||||
|
||||
private static func parseKokoroAneVariant(_ name: String?) -> KokoroAneVariant {
|
||||
switch name?.lowercased() {
|
||||
case "mandarin", "zh", "chinese", "zh-cn":
|
||||
return .mandarin
|
||||
case "english", "en", "en-us", nil, "":
|
||||
return .english
|
||||
default:
|
||||
return .english
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - Helpers
|
||||
|
||||
private static func percentile(_ sorted: [Double], _ p: Double) -> Double {
|
||||
@@ -1193,6 +1303,7 @@ public enum TtsBenchmarkCommand {
|
||||
kokoro Single-graph CPU+GPU
|
||||
pocket-tts Streaming flow-matching (multilingual)
|
||||
magpie Encoder-decoder + NanoCodec (per-stage, slow)
|
||||
styletts2 LibriTTS iteration_3, zero-shot, requires --reference
|
||||
cosyvoice3 Mandarin LLM-based (auto-picks Cohere ASR for zh)
|
||||
|
||||
Options:
|
||||
@@ -1232,13 +1343,22 @@ public enum TtsBenchmarkCommand {
|
||||
fails (`MILCompilerForANE error: …`)
|
||||
— avoids the multi-minute fallback
|
||||
compile on first call.
|
||||
--reference <path> StyleTTS2 speaker-reference audio
|
||||
(required for --backend styletts2;
|
||||
any sample rate / channel layout —
|
||||
resampled to 24 kHz mono internally)
|
||||
--variant <name> Kokoro ANE variant: english (default) or
|
||||
mandarin (aliases: zh, chinese)
|
||||
--help, -h Show this help
|
||||
|
||||
Examples:
|
||||
fluidaudio tts-benchmark --backend kokoro-ane --output-json bench.json
|
||||
fluidaudio tts-benchmark --backend kokoro-ane --variant mandarin \\
|
||||
--voice zf_001 --corpus minimax-chinese --skip-asr
|
||||
fluidaudio tts-benchmark --backend kokoro --corpus minimax-english
|
||||
fluidaudio tts-benchmark --backend pocket-tts --corpus minimax-german --language german
|
||||
fluidaudio tts-benchmark --backend magpie --speaker sofia --language en
|
||||
fluidaudio tts-benchmark --backend styletts2 --reference speaker.wav
|
||||
fluidaudio tts-benchmark --backend cosyvoice3 --corpus minimax-chinese \\
|
||||
--asr-backend cohere --cohere-model-dir ~/.fluidaudio/cohere/q8
|
||||
|
||||
|
||||
Reference in New Issue
Block a user