## Summary

Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the [laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml) 7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise → Vocoder → Tail) behind an actor-based facade, used with the upstream author's permission. Per-stage `MLComputeUnits` assignment routes Albert/PostAlbert/Alignment/Vocoder to **ANE**; Prosody/Noise/Tail stay on CPU+GPU for fp32/iSTFT-heavy ops.

The companion mobius PR for the conversion side: https://github.com/FluidInference/mobius/pull/45

Existing `KokoroTtsManager` (single fp32 model) is untouched. Both backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo; KokoroAne lives under the `ANE/` subdirectory.

## What's added

**Module: `Sources/FluidAudio/TTS/KokoroAne/`**

- `KokoroAneManager`: actor facade with `initialize`, `synthesize(text|phonemes)`, `synthesizeDetailed`
- `KokoroAneSynthesizer`: 7-stage orchestration with fp16↔fp32 vImage boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32` helpers so each output is fetched once.
- `KokoroAneModelStore`: per-stage MLModel handles + vocab + voice pack cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so partial-load failures stay retryable.
- `KokoroAneVoicePack`: `[510, 256]` flat fp32 row indexing (timbre cols `[0:128]`, style_s cols `[128:256]`)
- `KokoroAneVocab`: IPA → token IDs with BOS/EOS wrap, max 512
- `KokoroAneResourceDownloader`: HF cache management via existing `DownloadUtils`; also downloads the shared kokoro G2P assets on first init (see fix below)
- G2P reuses existing `G2PModel.shared`

**CLI:**

```bash
fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json]
fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json
```

The `tts-asr-verify` batch command synthesizes each phrase, transcribes with Parakeet, and emits per-phrase + macro/micro WER with stage timings.
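For readers unfamiliar with the macro/micro distinction, here is a minimal sketch of how the two aggregates could be computed from per-phrase results. It is not the code behind `tts-asr-verify`; the helper names (`wordErrors`, `normalizedWords`, `PhraseScore`) and the example phrases are hypothetical, and the normalization (lowercase, split on non-alphanumerics) is an assumption.

```swift
import Foundation

/// Word-level edit distance (substitutions + insertions + deletions).
func wordErrors(reference: [String], hypothesis: [String]) -> Int {
    if reference.isEmpty { return hypothesis.count }
    if hypothesis.isEmpty { return reference.count }
    // Single-row dynamic programming over the Levenshtein table.
    var previous = Array(0...hypothesis.count)
    for (i, refWord) in reference.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: hypothesis.count)
        for (j, hypWord) in hypothesis.enumerated() {
            let substitution = previous[j] + (refWord == hypWord ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return previous[hypothesis.count]
}

/// Assumed normalization: lowercase and split on anything non-alphanumeric,
/// so "Hello world." compares equal to "Hello world".
func normalizedWords(_ text: String) -> [String] {
    text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
}

struct PhraseScore {
    let errors: Int
    let referenceWords: Int
    // Assumes a non-empty reference phrase.
    var wer: Double { Double(errors) / Double(referenceWords) }
}

func score(reference: String, hypothesis: String) -> PhraseScore {
    let ref = normalizedWords(reference)
    let hyp = normalizedWords(hypothesis)
    return PhraseScore(errors: wordErrors(reference: ref, hypothesis: hyp),
                       referenceWords: ref.count)
}

/// Macro WER: mean of the per-phrase WERs (every phrase weighs the same).
func macroWER(_ scores: [PhraseScore]) -> Double {
    scores.map(\.wer).reduce(0, +) / Double(scores.count)
}

/// Micro WER: total errors over total reference words (long phrases weigh more).
func microWER(_ scores: [PhraseScore]) -> Double {
    Double(scores.map(\.errors).reduce(0, +))
        / Double(scores.map(\.referenceWords).reduce(0, +))
}

// Made-up phrases, not the PR's test set.
let scores = [
    score(reference: "Hello world", hypothesis: "Hello world."),
    score(reference: "the cat sat on the mat", hypothesis: "cat sat on the mat"),
]
let macro = macroWER(scores)   // (0/2 + 1/6) / 2
let micro = microWER(scores)   // 1 error over 8 reference words
```

Macro and micro diverge whenever phrase lengths differ, which is why a batch report usually carries both.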
**Tests** (`Tests/FluidAudioTests/TTS/KokoroAne/`):

- 13 unit tests (vocab, voice pack): no model deps, run on CI
- 5 E2E tests (synth + ASR roundtrip): gated by `FLUIDAUDIO_RUN_KOKOROANE_E2E=1`

**Docs:**

- New `Documentation/TTS/KokoroAne.md`: when-to-pick decision table, CLI/Swift quick start, per-stage compute targets, voice pack layout, limits, perf numbers, source links.
- Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the ANE-resident variant.
- Updated `Documentation/README.md` index, `Documentation/Models.md` TTS table, `Documentation/API.md` reference, `Documentation/CLI.md` example.

## Verified end-to-end on M2

Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation). Warm load: ~300ms.

| Phrase | Synth | Audio | RTFx | ASR roundtrip |
|---|---|---|---|---|
| Hello world | 0.47s | 1.65s | 3.5× | "Hello world." (WER 0%) |
| The quick brown fox… | 0.32s | 3.18s | 9.9× | dropped "The" (WER 11%) |
| She had been waiting… | 0.25s | 2.80s | 11.4× | "Shay" misheard (WER 12.5%) |

Aggregate macro WER 7.9%, micro WER 10.5%. The error is ASR-side; TTS audio is intelligible. Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert ~7-10ms each).

## Devin Review fixes addressed in this PR

- 🔴 **Partial model load wedged the store** (`KokoroAneModelStore.loadIfNeeded`): fixed via local `pendingModels` accumulator + atomic commit, matching `PocketTtsModelStore`.
- 🐛 **G2P models not downloaded standalone**: `G2PModel.loadIfNeeded` only reads from `~/.cache/fluidaudio/Models/kokoro/` and never downloads. The kokoroAne download set didn't include G2P, so first-time `--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic `vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to `getRequiredModelNames(.kokoro, …)` and a new `KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs before `G2PModel.shared.ensureModelsAvailable()` in `KokoroAneManager.initialize()`.
- 🟡 **Voice pack off-by-one (false positive)**: verified upstream `convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly matching the existing Swift `phonemeCount - 1`. No change.

## Refactor pass

Internal cleanup applied across the module after the initial implementation landed:

- `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11 inline `outputShape + outputArray + float16Array` patterns; F0/N shapes cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in `outputArray` error reporting.
- `KokoroAneSynthesizer+Conversion`: extracted `convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers (eliminates 4× duplicated vImage buffer setup).
- `KokoroAneModelStore`: folded `voicePack(_)` + `loadVoicePackIfNeeded(_)` into one method; dropped unreachable post-load guard and dead synthesized-URL throw.
- `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL, String)` so a malformed top-level JSON object reports parse-failure instead of file-not-found; removed dead NSNumber bridging fallback.
- `KokoroAneConstants`: dropped unused `defaultLanguage`, `voicePackTimbreSlice`, `voicePackStyleSSlice`. Changed `defaultSpeed` from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg sites).
- `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)`; `KokoroAneVocab.encode` silently drops unknown chars per the upstream Python convention.
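To make the encode and voice-pack conventions above concrete, here is a rough sketch rather than the module's actual code. `SketchVocab` and `SketchVoicePack` are hypothetical stand-ins for `KokoroAneVocab` and `KokoroAneVoicePack`, the token IDs are invented, and treating BOS and EOS as the same ID `0` is an assumption.

```swift
import Foundation

// Hypothetical stand-in for KokoroAneVocab; the real API may differ.
struct SketchVocab {
    let tokenIDs: [Character: Int]
    let boundaryToken = 0   // assumption: BOS and EOS share ID 0
    let maxTokens = 512     // from the PR: max 512 including BOS/EOS

    /// Maps IPA characters to IDs, silently dropping unknown characters,
    /// then wraps the sequence in BOS/EOS. Returns nil if nothing survives
    /// or the wrapped sequence would exceed `maxTokens`.
    func encode(_ phonemes: String) -> [Int]? {
        let ids = phonemes.compactMap { tokenIDs[$0] }
        guard !ids.isEmpty, ids.count + 2 <= maxTokens else { return nil }
        return [boundaryToken] + ids + [boundaryToken]
    }
}

// Hypothetical stand-in for KokoroAneVoicePack.
struct SketchVoicePack {
    static let rows = 510, columns = 256, timbreWidth = 128
    let values: [Float]   // flat row-major [510, 256]

    init?(values: [Float]) {
        guard values.count == Self.rows * Self.columns else { return nil }
        self.values = values
    }

    /// Picks row `phonemeCount - 1` and splits it into
    /// timbre (cols 0..<128) and style_s (cols 128..<256).
    func embedding(phonemeCount: Int) -> (timbre: ArraySlice<Float>, styleS: ArraySlice<Float>)? {
        guard (1...Self.rows).contains(phonemeCount) else { return nil }
        let start = (phonemeCount - 1) * Self.columns
        let mid = start + Self.timbreWidth
        return (values[start..<mid], values[mid..<(start + Self.columns)])
    }
}

// Invented IDs, for illustration only.
let vocab = SketchVocab(tokenIDs: ["h": 50, "ə": 83, "l": 54, "o": 57])
let tokens = vocab.encode("hə?lo")   // "?" is not in the vocab and is dropped

// Synthetic pack where each value equals its flat index.
let pack = SketchVoicePack(values: (0..<(510 * 256)).map { Float($0) })!
let row = pack.embedding(phonemeCount: 4)!   // row index 3, flat offset 768
```

The 512-token cap and the 510-row pack line up: 510 phonemes plus BOS and EOS gives 512, so every valid phoneme count from 1 to 510 has a row.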
## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAne`: 13 unit tests pass, 5 E2E gated
- [x] With models staged at `~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`:
  - [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne`: all 18 pass
  - [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane --output /tmp/ane.wav --metrics /tmp/m.json`: produces non-silent audio + metrics with WER
  - [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json /tmp/r.json`: aggregate WER ≤ 0.20

## Models

`FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/` subdirectory:

```
ANE/KokoroAlbert.mlmodelc      fp16 + int8pal  (CPU+ANE)
ANE/KokoroPostAlbert.mlmodelc  fp16 + int8pal  (CPU+ANE)
ANE/KokoroAlignment.mlmodelc   fp16 + int8pal  (CPU+ANE)
ANE/KokoroProsody.mlmodelc     fp32            (CPU+GPU)
ANE/KokoroNoise.mlmodelc       fp32            (CPU+GPU)
ANE/KokoroVocoder.mlmodelc     fp16 + int8pal  (CPU+ANE)
ANE/KokoroTail.mlmodelc        fp32 + iSTFT    (CPU+GPU)
ANE/vocab.json                 114 IPA tokens
ANE/af_heart.bin               [510, 256] fp32 voice pack
```

G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`, `g2p_vocab.json`) are pulled from the same repo's root and cached at `~/.cache/fluidaudio/Models/kokoro/`, shared with the regular `KokoroTtsManager` backend.

## License

Upstream (laishere) is MIT, carried forward in the mobius PR's LICENSE file. Used with the upstream author's permission.
# Kokoro Text-to-Speech

Looking for the ANE-resident variant? See KokoroAne, which splits the same model into 7 stages so the ANE-friendly layers stay resident on the Neural Engine (3-11× faster on Apple Silicon), at the cost of a single voice (`af_heart`), no chunker (≤510 IPA phonemes), and no custom lexicon. Use this page for the default multi-voice / long-form / SSML / custom-lexicon path.

## Overview
Kokoro is a TTS backend that generates the entire audio representation in one pass (all frames at once) using flow matching over mel spectrograms, then converts to audio with the Vocos vocoder.
## Quick Start

### CLI

```bash
swift run fluidaudiocli tts "Welcome to FluidAudio text to speech" \
  --output ~/Desktop/demo.wav \
  --voice af_heart
```
The first invocation downloads Kokoro models, phoneme dictionaries, and voice embeddings; later runs reuse the cached assets.
### Swift

```swift
import FluidAudio

let manager = KokoroTtsManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello from FluidAudio!")

let outputURL = URL(fileURLWithPath: "/tmp/demo.wav")
try audioData.write(to: outputURL)
```

Swap in `manager.initialize(models:)` when you want to preload only the long-form `.fifteenSecond` variant.
## Inspecting Chunk Metadata

```swift
let manager = KokoroTtsManager()
try await manager.initialize()

let detailed = try await manager.synthesizeDetailed(
    text: "FluidAudio can report chunk splits for you.",
    variantPreference: .fifteenSecond
)

for chunk in detailed.chunks {
    print("Chunk #\(chunk.index) -> variant: \(chunk.variant), tokens: \(chunk.tokenCount)")
    print(" text: \(chunk.text)")
}
```

`KokoroSynthesizer.SynthesisResult` also exposes diagnostics for per-run variant and audio footprint totals.
## Pipeline

```
text → G2P model → IPA phonemes → Kokoro model → audio
          ↑              ↑
   custom lexicon   SSML <phoneme>
   overrides here   overrides here
```
The G2P (grapheme-to-phoneme) step runs outside the model as a preprocessing step using a CoreML BART encoder-decoder. Words found in the built-in lexicon use dictionary lookup; out-of-vocabulary words fall back to the G2P model. You can intercept and edit phonemes before they reach the neural network. This is what enables all pronunciation control features.
## Pronunciation Control

Kokoro supports three ways to override pronunciation:

- SSML tags: `<phoneme>`, `<sub>`, `<say-as>` (cardinal, ordinal, digits, date, time, telephone, fraction, characters). See SSML.md.
- Custom lexicon: word → IPA mapping files loaded via `setCustomLexicon()`. Entries matched case-sensitive first, then case-insensitive, then normalized. See CustomPronunciation.md.
- Markdown syntax: inline `[word](/ipa/)` overrides in the input text. Example: `[Kokoro](/kəˈkɔɹo/)`.
Precedence: custom lexicon > built-in dictionaries > morphological stemming > G2P model.
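The precedence chain above can be pictured as a series of fallbacks. This sketch is only an illustration under assumptions: `SketchPronunciationResolver` is not a FluidAudio type, the stemming and G2P steps are placeholder closures, the built-in dictionary is assumed to use lowercase keys, "normalized" is assumed to mean lowercased with non-letters removed, and the IPA strings are sample values.

```swift
import Foundation

enum PronunciationSource { case customLexicon, builtInDictionary, stemming, g2pModel }

// Hypothetical resolver; names and shapes are illustrative, not the FluidAudio API.
struct SketchPronunciationResolver {
    var customLexicon: [String: String] = [:]
    var builtInDictionary: [String: String] = [:]   // assumed lowercase keys
    var stem: (String) -> String? = { _ in nil }    // placeholder for morphological stemming
    var g2p: (String) -> String = { _ in "" }       // placeholder for the G2P model

    /// Assumption: "normalized" means lowercased with non-letters removed.
    private func normalize(_ word: String) -> String {
        word.lowercased().filter { $0.isLetter }
    }

    /// Custom lexicon lookup: case-sensitive, then case-insensitive, then normalized.
    private func lookupCustom(_ word: String) -> String? {
        if let exact = customLexicon[word] { return exact }
        let lowered = word.lowercased()
        if let hit = customLexicon.first(where: { $0.key.lowercased() == lowered }) {
            return hit.value
        }
        let normalized = normalize(word)
        return customLexicon.first(where: { normalize($0.key) == normalized })?.value
    }

    /// Walks the chain: custom lexicon > built-in dictionary > stemming > G2P.
    func resolve(_ word: String) -> (ipa: String, source: PronunciationSource) {
        if let ipa = lookupCustom(word) { return (ipa, .customLexicon) }
        if let ipa = builtInDictionary[word.lowercased()] { return (ipa, .builtInDictionary) }
        if let ipa = stem(word) { return (ipa, .stemming) }
        return (g2p(word), .g2pModel)
    }
}

let resolver = SketchPronunciationResolver(
    customLexicon: ["Kokoro": "kəˈkɔɹo"],
    builtInDictionary: ["hello": "həlˈoʊ"],   // sample entry, not from the real dictionary
    g2p: { _ in "<g2p output>" }
)
let custom = resolver.resolve("kokoro")    // matches "Kokoro" case-insensitively
let builtIn = resolver.resolve("Hello")    // falls through to the built-in dictionary
let fallback = resolver.resolve("xyzzy")   // nothing matches, so the G2P closure runs
```

Returning the source alongside the IPA string makes it easy to log which stage produced a given pronunciation when debugging an override that does not seem to apply.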
## Text Preprocessing
Kokoro includes comprehensive text normalization (numbers, currencies, times, decimal numbers, units, abbreviations, dates). SSML processing runs first, then markdown-style overrides, then normalization.
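As an illustration of the markdown-style override step, the following sketch pulls `[word](/ipa/)` pairs out of the input and leaves plain text behind for normalization. It is an assumption about how such a pass could look, not FluidAudio's implementation; `extractOverrides` and `PronunciationOverride` are hypothetical names.

```swift
import Foundation

struct PronunciationOverride: Equatable {
    let word: String
    let ipa: String
}

/// Finds inline `[word](/ipa/)` overrides, returning the text with each
/// override replaced by its bare word plus the collected (word, ipa) pairs.
func extractOverrides(from text: String) -> (plainText: String, overrides: [PronunciationOverride]) {
    // Group 1: the word inside [...]; group 2: the IPA between the slashes.
    let pattern = #"\[([^\]]+)\]\(/([^/]+)/\)"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return (text, []) }

    let fullRange = NSRange(text.startIndex..., in: text)
    var overrides: [PronunciationOverride] = []
    for match in regex.matches(in: text, range: fullRange) {
        guard let wordRange = Range(match.range(at: 1), in: text),
              let ipaRange = Range(match.range(at: 2), in: text) else { continue }
        overrides.append(PronunciationOverride(word: String(text[wordRange]),
                                               ipa: String(text[ipaRange])))
    }
    let plain = regex.stringByReplacingMatches(in: text, range: fullRange, withTemplate: "$1")
    return (plain, overrides)
}

let parsed = extractOverrides(from: "Say [Kokoro](/kəˈkɔɹo/) twice.")
// parsed.plainText is "Say Kokoro twice."
// parsed.overrides holds one entry: word "Kokoro", ipa "kəˈkɔɹo"
```

A real pass would also need to remember where each override sits in the text so that two occurrences of the same word can carry different pronunciations; this sketch ignores position.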
## How It Differs From PocketTTS

| | Kokoro | PocketTTS |
|---|---|---|
| Pipeline | text → CoreML G2P → IPA → model | text → SentencePiece → model |
| Voice conditioning | Style embedding vector | 125 audio prompt tokens |
| Generation | All frames at once | Frame-by-frame autoregressive |
| Flow matching target | Mel spectrogram | 32-dim latent per frame |
| Audio synthesis | Vocos vocoder | Mimi streaming codec |
| Latency to first audio | Must wait for full generation | ~80ms after prefill |
| SSML support | Yes (`<phoneme>`, `<sub>`, `<say-as>`) | No |
| Custom lexicon | Yes (word → IPA) | No |
| Markdown pronunciation | Yes (`[word](/ipa/)`) | No |
| Text preprocessing | Full (numbers, dates, currencies) | Minimal (whitespace, punctuation) |
Kokoro parallelizes across time (fast total, but must wait for everything). PocketTTS is sequential across time (slower total, but audio starts immediately).
PocketTTS cannot support phoneme-level features because it has no phoneme stage — the model was trained on text tokens, not IPA. See PocketTTS.md for details on what can and cannot be added.
## V2 Models (ANE-Optimized)

The v2 models (`kokoro_21_5s_v2`, `kokoro_21_15s_v2`) are converted with `compute_precision=FLOAT16`, which moves 833 ops (BERT transformer layers + generator convolutions) to the Apple Neural Engine.
| Metric | V1 (fp32, cpuAndGPU) | V2 (fp16, .all) |
|---|---|---|
| Median latency (5s) | 417 ms | 250 ms |
| RTFx (5s audio) | 12.0x | 20.0x |
| Speedup | — | 1.67x |
| Quality | baseline | identical (round-trip TTS→ASR) |
The 6 LSTM ops (duration predictor) remain on CPU — CoreML does not schedule recurrent ops to ANE regardless of precision. The original v1 models are still available on HuggingFace for backward compatibility.
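The RTFx and speedup rows follow directly from the latency row. As a quick arithmetic check, assuming the "5s" in the table is exactly 5.0 seconds of audio and RTFx is audio duration divided by synthesis time:

```swift
import Foundation

/// Real-time factor: seconds of audio produced per second of compute.
func rtfx(audioSeconds: Double, synthesisSeconds: Double) -> Double {
    audioSeconds / synthesisSeconds
}

let v1 = rtfx(audioSeconds: 5.0, synthesisSeconds: 0.417)   // V1 median latency
let v2 = rtfx(audioSeconds: 5.0, synthesisSeconds: 0.250)   // V2 median latency
let speedup = v2 / v1                                        // same as 417 ms / 250 ms

print(String(format: "V1 %.1fx, V2 %.1fx, speedup %.2fx", v1, v2, speedup))
// prints "V1 12.0x, V2 20.0x, speedup 1.67x"
```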
## Known Issues

- Sibilance in high-pitched voices: Some female `af_*` voices (e.g. `af_heart`, `af_bella`) produce harsh sibilant sounds (s, sh, z). This is baked into the model output and cannot be fixed with post-processing EQ. Lower-pitched voices (male `am_*` variants and some female voices) are unaffected. See mobius#23.
- G2P phoneme mismatch limitation: FluidAudio currently uses `graphemes_to_phonemes_en_us` (from HuggingFace: PeterReid/graphemes_to_phonemes_en_us) for grapheme-to-phoneme conversion. The original Kokoro and KittenTTS models were trained using espeak for phoneme generation. This G2P mismatch can cause pronunciation issues in some words (e.g., "hello" and "day" in KittenTTS). We cannot use espeak directly due to licensing constraints. Need: An espeak-compatible alternative with a permissive license that produces matching phoneme outputs. This affects any TTS model in FluidAudio that relies on the shared Kokoro G2P pipeline. See PR #409 for examples.
## Enable TTS in Your Project

Kokoro TTS is included in the FluidAudio product; no separate product is needed.

Package.swift:

```swift
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.12.4"),
],
targets: [
    .target(
        name: "YourTarget",
        dependencies: [
            .product(name: "FluidAudio", package: "FluidAudio")
        ]
    )
]
```

Import in your code:

```swift
import FluidAudio
```
### CLI

```bash
swift run fluidaudiocli tts "Welcome to FluidAudio" --output ~/Desktop/demo.wav
```