Alex 7c115f6b4e feat(tts/kokoro-ane): add laishere 7-stage CoreML chain (ANE-optimized) (#547)
## Summary

Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the
[laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml)
7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise →
Vocoder → Tail) behind an actor-based facade, used with the upstream
author's permission. Per-stage `MLComputeUnits` assignment routes
Albert/PostAlbert/Alignment/Vocoder to **ANE**; Prosody/Noise/Tail stay
on CPU+GPU for fp32/iSTFT-heavy ops.

The companion mobius PR for the conversion side:
https://github.com/FluidInference/mobius/pull/45

Existing `KokoroTtsManager` (single fp32 model) is untouched. Both
backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo —
KokoroAne lives under the `ANE/` subdirectory.

## What's added

**Module: `Sources/FluidAudio/TTS/KokoroAne/`**
- `KokoroAneManager` — actor facade: `initialize`,
`synthesize(text|phonemes)`, `synthesizeDetailed`
- `KokoroAneSynthesizer` — 7-stage orchestration with fp16↔fp32 vImage
boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32`
helpers so each output is fetched once.
- `KokoroAneModelStore` — per-stage MLModel handles + vocab + voice pack
cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so
partial-load failures stay retryable.
- `KokoroAneVoicePack` — `[510, 256]` flat fp32 row indexing (timbre
cols `[0:128]`, style_s cols `[128:256]`)
- `KokoroAneVocab` — IPA → token IDs with BOS/EOS wrap, max 512
- `KokoroAneResourceDownloader` — HF cache management via existing
`DownloadUtils`; also downloads the shared kokoro G2P assets on first
init (see fix below)
- G2P reuses existing `G2PModel.shared`
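
The voice pack layout above can be sketched as follows. This is a minimal illustration, not the shipped `KokoroAneVoicePack`: the type name `VoicePackSketch` and its methods are hypothetical, and only the `[510, 256]` shape, the column split, and the `phonemeCount - 1` row choice come from this PR.

```swift
import Foundation

/// Hypothetical view over a flat voice pack: 510 rows x 256 fp32 columns, row-major.
struct VoicePackSketch {
    static let rows = 510
    static let cols = 256
    let values: [Float]

    /// Rejects buffers that do not hold exactly rows * cols values.
    init?(values: [Float]) {
        guard values.count == Self.rows * Self.cols else { return nil }
        self.values = values
    }

    /// Returns the timbre and style_s halves of the row selected by phoneme count.
    /// Valid counts are 1...510; anything else returns nil.
    func style(forPhonemeCount count: Int) -> (timbre: ArraySlice<Float>, styleS: ArraySlice<Float>)? {
        guard (1...Self.rows).contains(count) else { return nil }
        let start = (count - 1) * Self.cols               // row = phonemeCount - 1
        let timbre = values[start ..< start + 128]        // cols [0:128]
        let styleS = values[start + 128 ..< start + 256]  // cols [128:256]
        return (timbre, styleS)
    }
}
```

Returning slices avoids copying the row; a caller that needs contiguous storage for an `MLMultiArray` would copy at that point.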

**CLI:**
```bash
fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json]
fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json
```
The `tts-asr-verify` batch command synthesizes each phrase, transcribes
with Parakeet, and emits per-phrase + macro/micro WER with stage
timings.

**Tests** (`Tests/FluidAudioTests/TTS/KokoroAne/`):
- 13 unit tests (vocab, voice pack) — no model deps, run on CI
- 5 E2E tests (synth + ASR roundtrip) — gated by
`FLUIDAUDIO_RUN_KOKOROANE_E2E=1`

**Docs:**
- New `Documentation/TTS/KokoroAne.md` — when-to-pick decision table,
CLI/Swift quick start, per-stage compute targets, voice pack layout,
limits, perf numbers, source links.
- Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the
ANE-resident variant.
- Updated `Documentation/README.md` index, `Documentation/Models.md` TTS
table, `Documentation/API.md` reference, `Documentation/CLI.md` example.

## Verified end-to-end on M2

Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation).
Warm load: ~300ms.

| Phrase | Synth | Audio | RTFx | ASR roundtrip |
|---|---|---|---|---|
| Hello world | 0.47s | 1.65s | 3.5× | "Hello world." (WER 0%) |
| The quick brown fox… | 0.32s | 3.18s | 9.9× | dropped "The" (WER 11%) |
| She had been waiting… | 0.25s | 2.80s | 11.4× | "Shay" misheard (WER 12.5%) |

Aggregate macro WER 7.9%, micro WER 10.5% — error is ASR-side; TTS audio
is intelligible.
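
For reference, the two aggregates differ only in weighting. The sketch below is not the CLI's implementation; `PhraseScore` and both functions are hypothetical names, and the word counts (2, 9, 8) are assumptions chosen because they are consistent with the per-phrase WERs in the table.

```swift
import Foundation

/// One phrase's ASR result: word errors against the reference word count.
struct PhraseScore {
    let errors: Int
    let referenceWords: Int
    var wer: Double { Double(errors) / Double(referenceWords) }
}

/// Macro WER: mean of per-phrase WERs, so every phrase counts equally.
/// Assumes a non-empty input.
func macroWER(_ scores: [PhraseScore]) -> Double {
    scores.map(\.wer).reduce(0, +) / Double(scores.count)
}

/// Micro WER: total errors over total reference words, so every word counts equally.
/// Assumes at least one reference word.
func microWER(_ scores: [PhraseScore]) -> Double {
    let errors = scores.reduce(0) { $0 + $1.errors }
    let words = scores.reduce(0) { $0 + $1.referenceWords }
    return Double(errors) / Double(words)
}

// Assumed counts: 2, 9, and 8 reference words with 0, 1, and 1 errors.
let scores = [
    PhraseScore(errors: 0, referenceWords: 2),
    PhraseScore(errors: 1, referenceWords: 9),
    PhraseScore(errors: 1, referenceWords: 8),
]
print(String(format: "macro %.1f%%, micro %.1f%%", macroWER(scores) * 100, microWER(scores) * 100))
```

Under those assumed counts the sketch reproduces the reported 7.9% macro and 10.5% micro figures.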

Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert
~7-10ms each).

## Devin Review fixes addressed in this PR

- 🔴 **Partial model load wedged the store**
(`KokoroAneModelStore.loadIfNeeded`) — fixed via local `pendingModels`
accumulator + atomic commit, matching `PocketTtsModelStore`.
- 🐛 **G2P models not downloaded standalone** — `G2PModel.loadIfNeeded`
only reads from `~/.cache/fluidaudio/Models/kokoro/` and never
downloads. The kokoroAne download set didn't include G2P, so first-time
`--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic
`vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to
`getRequiredModelNames(.kokoro, …)` and a new
`KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs
before `G2PModel.shared.ensureModelsAvailable()` in
`KokoroAneManager.initialize()`.
- 🟡 **Voice pack off-by-one (false positive)** — verified upstream
`convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly
matching the existing Swift `phonemeCount - 1`. No change.
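
The atomic-commit fix in the first item follows this general shape. A minimal sketch only: the real store is actor-isolated and holds `MLModel` handles, whereas this hypothetical `StageStoreSketch` is a plain struct with string stand-ins so it stays synchronous and dependency-free.

```swift
import Foundation

/// Hypothetical store: stages load into a local accumulator and are
/// committed to stored state only after every stage has succeeded.
struct StageStoreSketch {
    enum LoadError: Error { case missing(String) }

    /// nil until a full load has been committed.
    private(set) var models: [String: String]? = nil

    init() {}

    mutating func loadIfNeeded(stages: [String], loader: (String) throws -> String) throws {
        guard models == nil else { return }       // already committed
        var pendingModels: [String: String] = [:] // local accumulator
        for stage in stages {
            // A throw here exits before `models` is touched, so a retry starts clean.
            pendingModels[stage] = try loader(stage)
        }
        models = pendingModels                    // single commit point
    }
}
```

Because nothing is written to `models` until the loop completes, a failure on any stage leaves the store in its initial state and a later call can retry the whole load.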

## Refactor pass

Internal cleanup applied across the module after the initial
implementation landed:
- `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11
inline `outputShape + outputArray + float16Array` patterns; F0/N shapes
cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in
`outputArray` error reporting.
- `KokoroAneSynthesizer+Conversion`: extracted
`convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers
(eliminates 4× duplicated vImage buffer setup).
- `KokoroAneModelStore`: folded `voicePack(_)` +
`loadVoicePackIfNeeded(_)` into one method; dropped unreachable
post-load guard and dead synthesized-URL throw.
- `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL,
String)` so a malformed top-level JSON object reports parse-failure
instead of file-not-found; removed dead NSNumber bridging fallback.
- `KokoroAneConstants`: dropped unused `defaultLanguage`,
`voicePackTimbreSlice`, `voicePackStyleSSlice`. Changed `defaultSpeed`
from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg
sites).
- `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)` —
`KokoroAneVocab.encode` silently drops unknown chars per the upstream
Python convention.
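
The encode behavior described in the last item (drop unknown characters, wrap in BOS/EOS, cap at 512 tokens) can be sketched like this. `VocabSketch` is hypothetical, the boundary token ID of 0 is an assumption, and returning `nil` for overlong input is an illustrative choice; the real `KokoroAneVocab.encode` may signal that case differently.

```swift
import Foundation

/// Hypothetical encoder: maps each IPA character to a token ID, skips
/// characters missing from the vocab, and wraps the result in BOS/EOS.
struct VocabSketch {
    let table: [Character: Int]
    let bos = 0          // assumed boundary token ID
    let eos = 0          // assumed boundary token ID
    let maxTokens = 512  // cap from this PR, including BOS and EOS

    func encode(_ phonemes: String) -> [Int]? {
        let ids = phonemes.compactMap { table[$0] }           // unknown chars are dropped
        guard ids.count + 2 <= maxTokens else { return nil }  // no chunker: reject overlong input
        return [bos] + ids + [eos]
    }
}
```

The 512-token cap with two boundary tokens leaves room for 510 phonemes, which lines up with the 510-row voice pack.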

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAne` — 13 unit tests pass, 5 E2E gated
- [x] With models staged at
`~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`:
- [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne` —
all 18 pass
- [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane
--output /tmp/ane.wav --metrics /tmp/m.json` — produces non-silent audio
+ metrics with WER
- [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt
--output-json /tmp/r.json` — aggregate WER ≤ 0.20

## Models

`FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/`
subdirectory:
```
ANE/KokoroAlbert.mlmodelc       fp16 + int8pal  (CPU+ANE)
ANE/KokoroPostAlbert.mlmodelc   fp16 + int8pal  (CPU+ANE)
ANE/KokoroAlignment.mlmodelc    fp16 + int8pal  (CPU+ANE)
ANE/KokoroProsody.mlmodelc      fp32            (CPU+GPU)
ANE/KokoroNoise.mlmodelc        fp32            (CPU+GPU)
ANE/KokoroVocoder.mlmodelc      fp16 + int8pal  (CPU+ANE)
ANE/KokoroTail.mlmodelc         fp32 + iSTFT    (CPU+GPU)
ANE/vocab.json                  114 IPA tokens
ANE/af_heart.bin                [510, 256] fp32 voice pack
```

G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`,
`g2p_vocab.json`) are pulled from the same repo's root and cached at
`~/.cache/fluidaudio/Models/kokoro/`, shared with the regular
`KokoroTtsManager` backend.

## License

Upstream (laishere) is MIT — carried forward in the mobius PR's LICENSE
file. Used with the upstream author's permission.
2026-04-27 20:08:49 -04:00


Kokoro Text-to-Speech

Looking for the ANE-resident variant? See KokoroAne, which splits the same model into 7 stages so the ANE-friendly layers stay resident on the Neural Engine (3-11× faster on Apple Silicon), at the cost of a single voice (af_heart), no chunker (≤510 IPA phonemes), and no custom lexicon. Use this page for the default multi-voice / long-form / SSML / custom-lexicon path.

Overview

Kokoro is a TTS backend that generates the entire audio representation in one pass (all frames at once) using flow matching over mel spectrograms, then converts to audio with the Vocos vocoder.

Quick Start

CLI

swift run fluidaudiocli tts "Welcome to FluidAudio text to speech" \
  --output ~/Desktop/demo.wav \
  --voice af_heart

The first invocation downloads Kokoro models, phoneme dictionaries, and voice embeddings; later runs reuse the cached assets.

Swift

import FluidAudio

let manager = KokoroTtsManager()
try await manager.initialize()

let audioData = try await manager.synthesize(text: "Hello from FluidAudio!")

let outputURL = URL(fileURLWithPath: "/tmp/demo.wav")
try audioData.write(to: outputURL)

Swap in manager.initialize(models:) when you want to preload only the long-form .fifteenSecond variant.

Inspecting Chunk Metadata

let manager = KokoroTtsManager()
try await manager.initialize()

let detailed = try await manager.synthesizeDetailed(
    text: "FluidAudio can report chunk splits for you.",
    variantPreference: .fifteenSecond
)

for chunk in detailed.chunks {
    print("Chunk #\(chunk.index) -> variant: \(chunk.variant), tokens: \(chunk.tokenCount)")
    print("  text: \(chunk.text)")
}

KokoroSynthesizer.SynthesisResult also exposes diagnostics for per-run variant and audio footprint totals.

Pipeline

text → G2P model → IPA phonemes → Kokoro model → audio
         ↑                ↑
   custom lexicon    SSML <phoneme>
   overrides here    overrides here

The G2P (grapheme-to-phoneme) step runs outside the model as a preprocessing step using a CoreML BART encoder-decoder. Words found in the built-in lexicon use dictionary lookup; out-of-vocabulary words fall back to the G2P model. You can intercept and edit phonemes before they reach the neural network. This is what enables all pronunciation control features.

Pronunciation Control

Kokoro supports three ways to override pronunciation:

  1. SSML tags: <phoneme>, <sub>, <say-as> (cardinal, ordinal, digits, date, time, telephone, fraction, characters). See SSML.md.
  2. Custom lexicon: word → IPA mapping files loaded via setCustomLexicon(). Entries are matched case-sensitively first, then case-insensitively, then in normalized form. See CustomPronunciation.md.
  3. Markdown syntax — inline [word](/ipa/) overrides in the input text. Example: [Kokoro](/kəˈkɔɹo/).

Precedence: custom lexicon > built-in dictionaries > morphological stemming > G2P model.
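
The matching tiers and the precedence chain can be sketched as below. This is an illustration under stated assumptions, not FluidAudio's implementation: `LexiconSketch`, `normalize`, and `resolve` are hypothetical names, and the normalization rule (lowercase, letters only) is a guess at what "normalized" means here.

```swift
import Foundation

/// Hypothetical custom lexicon with the three-tier match order described above.
struct LexiconSketch {
    let entries: [String: String]  // word -> IPA

    /// Assumed normalization: lowercase and keep only letters.
    static func normalize(_ word: String) -> String {
        String(word.lowercased().filter { $0.isLetter })
    }

    func lookup(_ word: String) -> String? {
        // 1. Case-sensitive exact match.
        if let exact = entries[word] { return exact }
        // 2. Case-insensitive match.
        let lowered = word.lowercased()
        if let hit = entries.first(where: { $0.key.lowercased() == lowered }) {
            return hit.value
        }
        // 3. Match on the normalized form.
        let normalized = Self.normalize(word)
        return entries.first(where: { Self.normalize($0.key) == normalized })?.value
    }
}

/// Tries each pronunciation source in precedence order and returns the first hit.
func resolve(_ word: String, sources: [(String) -> String?]) -> String? {
    for source in sources {
        if let ipa = source(word) { return ipa }
    }
    return nil
}
```

Passing the sources as an ordered array (custom lexicon, built-in dictionaries, stemming, G2P) keeps the precedence rule in one place. Tiers 2 and 3 scan linearly, and if several keys collide after lowercasing, which one wins is unspecified in this sketch.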

Text Preprocessing

Kokoro includes comprehensive text normalization (numbers, currencies, times, decimal numbers, units, abbreviations, dates). SSML processing runs first, then markdown-style overrides, then normalization.
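
As an illustration of the markdown-style override step only, the sketch below pulls `[word](/ipa/)` pairs out of the input with a regular expression. The function name `extractOverrides` and the pattern are hypothetical, and the pattern assumes the IPA string contains no `/`; the parser FluidAudio actually uses may differ.

```swift
import Foundation

/// Extracts inline `[word](/ipa/)` overrides. Returns the text with each
/// override replaced by its plain word, plus the word/IPA pairs in order.
func extractOverrides(from text: String) -> (plain: String, overrides: [(word: String, ipa: String)]) {
    // Group 1: the word inside [...]; group 2: the IPA between the slashes.
    let pattern = #"\[([^\]]+)\]\(/([^/]+)/\)"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return (text, [])
    }
    let full = NSRange(text.startIndex..., in: text)
    var overrides: [(word: String, ipa: String)] = []
    for match in regex.matches(in: text, options: [], range: full) {
        guard let wordRange = Range(match.range(at: 1), in: text),
              let ipaRange = Range(match.range(at: 2), in: text) else { continue }
        overrides.append((word: String(text[wordRange]), ipa: String(text[ipaRange])))
    }
    // "$1" keeps only the captured word in the output text.
    let plain = regex.stringByReplacingMatches(in: text, options: [], range: full, withTemplate: "$1")
    return (plain, overrides)
}
```

For the input `Say [Kokoro](/kəˈkɔɹo/) now.`, this yields the plain text `Say Kokoro now.` and a single override for `Kokoro`.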

How It Differs From PocketTTS

| | Kokoro | PocketTTS |
|---|---|---|
| Pipeline | text → CoreML G2P → IPA → model | text → SentencePiece → model |
| Voice conditioning | Style embedding vector | 125 audio prompt tokens |
| Generation | All frames at once | Frame-by-frame autoregressive |
| Flow matching target | Mel spectrogram | 32-dim latent per frame |
| Audio synthesis | Vocos vocoder | Mimi streaming codec |
| Latency to first audio | Must wait for full generation | ~80ms after prefill |
| SSML support | Yes (<phoneme>, <sub>, <say-as>) | No |
| Custom lexicon | Yes (word → IPA) | No |
| Markdown pronunciation | Yes ([word](/ipa/)) | No |
| Text preprocessing | Full (numbers, dates, currencies) | Minimal (whitespace, punctuation) |

Kokoro parallelizes across time (fast total, but must wait for everything). PocketTTS is sequential across time (slower total, but audio starts immediately).

PocketTTS cannot support phoneme-level features because it has no phoneme stage — the model was trained on text tokens, not IPA. See PocketTTS.md for details on what can and cannot be added.

V2 Models (ANE-Optimized)

The v2 models (kokoro_21_5s_v2, kokoro_21_15s_v2) are converted with compute_precision=FLOAT16, which moves 833 ops (BERT transformer layers + generator convolutions) to the Apple Neural Engine.

| Metric | V1 (fp32, cpuAndGPU) | V2 (fp16, .all) |
|---|---|---|
| Median latency (5s) | 417 ms | 250 ms |
| RTFx (5s audio) | 12.0x | 20.0x |
| Speedup | | 1.67x |
| Quality | baseline | identical (round-trip TTS→ASR) |

The 6 LSTM ops (duration predictor) remain on CPU — CoreML does not schedule recurrent ops to ANE regardless of precision. The original v1 models are still available on HuggingFace for backward compatibility.
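
The figures above are consistent with the usual definition of RTFx as audio duration divided by synthesis time. A small sketch to check the arithmetic, assuming that definition (this page does not spell it out) and a hypothetical helper name `rtfx`:

```swift
import Foundation

/// Real-time factor: seconds of audio produced per second of synthesis.
func rtfx(audioSeconds: Double, synthSeconds: Double) -> Double {
    audioSeconds / synthSeconds
}

let v1 = rtfx(audioSeconds: 5.0, synthSeconds: 0.417)  // V1 median latency
let v2 = rtfx(audioSeconds: 5.0, synthSeconds: 0.250)  // V2 median latency
let speedup = 0.417 / 0.250                            // ratio of the two latencies
print(String(format: "V1 %.1fx, V2 %.1fx, speedup %.2fx", v1, v2, speedup))
```

With the median latencies from the table, this gives 12.0x, 20.0x, and a 1.67x speedup.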

Known Issues

  • Sibilance in high-pitched voices: Some female af_* voices (e.g. af_heart, af_bella) produce harsh sibilant sounds (s, sh, z). This is baked into the model output and cannot be fixed with post-processing EQ. Lower-pitched voices (male am_* variants and some female voices) are unaffected. See mobius#23.

  • G2P phoneme mismatch: FluidAudio currently uses graphemes_to_phonemes_en_us (from HuggingFace: PeterReid/graphemes_to_phonemes_en_us) for grapheme-to-phoneme conversion, whereas the original Kokoro and KittenTTS models were trained on espeak-generated phonemes. This mismatch can cause mispronunciations in some words (e.g., "hello" and "day" in KittenTTS). espeak cannot be used directly due to licensing constraints, so what is needed is an espeak-compatible alternative with a permissive license that produces matching phoneme output. The limitation affects any TTS model in FluidAudio that relies on the shared Kokoro G2P pipeline. See PR #409 for examples.

Enable TTS in Your Project

Kokoro TTS is included in the FluidAudio product — no separate product needed.

Package.swift:

dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.12.4"),
],
targets: [
    .target(
        name: "YourTarget",
        dependencies: [
            .product(name: "FluidAudio", package: "FluidAudio")
        ]
    )
]

Import in your code:

import FluidAudio

CLI

swift run fluidaudiocli tts "Welcome to FluidAudio" --output ~/Desktop/demo.wav