mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Alex 7c115f6b4e feat(tts/kokoro-ane): add laishere 7-stage CoreML chain (ANE-optimized) (#547)

## Summary

Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the
[laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml)
7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise →
Vocoder → Tail) behind an actor-based facade, used with the upstream
author's permission. Per-stage `MLComputeUnits` assignment routes
Albert/PostAlbert/Alignment/Vocoder to **ANE**; Prosody/Noise/Tail stay
on CPU+GPU for fp32/iSTFT-heavy ops.

The companion mobius PR for the conversion side:
https://github.com/FluidInference/mobius/pull/45

Existing `KokoroTtsManager` (single fp32 model) is untouched. Both
backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo —
KokoroAne lives under the `ANE/` subdirectory.

## What's added

**Module: `Sources/FluidAudio/TTS/KokoroAne/`**
- `KokoroAneManager` — actor facade: `initialize`,
`synthesize(text|phonemes)`, `synthesizeDetailed`
- `KokoroAneSynthesizer` — 7-stage orchestration with fp16↔fp32 vImage
boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32`
helpers so each output is fetched once.
- `KokoroAneModelStore` — per-stage MLModel handles + vocab + voice pack
cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so
partial-load failures stay retryable.
- `KokoroAneVoicePack` — `[510, 256]` flat fp32 row indexing (timbre
cols `[0:128]`, style_s cols `[128:256]`)
- `KokoroAneVocab` — IPA → token IDs with BOS/EOS wrap, max 512
- `KokoroAneResourceDownloader` — HF cache management via existing
`DownloadUtils`; also downloads the shared kokoro G2P assets on first
init (see fix below)
- G2P reuses existing `G2PModel.shared`
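
The `[510, 256]` voice-pack layout listed above can be sketched as follows. This is a minimal illustration, not the module's code: `VoicePackSketch` and its method name are hypothetical, and row-major storage is an assumption. The row index `phonemeCount - 1` follows the convention confirmed in the review fixes of this PR.

```swift
import Foundation

/// Hypothetical stand-in for a [510, 256] fp32 voice pack held as one flat array.
/// Assumes row-major layout: row r occupies values[r * 256 ..< (r + 1) * 256].
struct VoicePackSketch {
    static let rows = 510
    static let cols = 256
    static let timbreCols = 0..<128    // timbre columns [0:128]
    static let styleSCols = 128..<256  // style_s columns [128:256]

    let values: [Float]

    /// Fails unless the buffer holds exactly rows * cols floats.
    init?(values: [Float]) {
        guard values.count == Self.rows * Self.cols else { return nil }
        self.values = values
    }

    /// Picks the row for an utterance of `phonemeCount` phonemes (row = phonemeCount - 1)
    /// and splits it into its timbre and style_s halves.
    func style(forPhonemeCount phonemeCount: Int) -> (timbre: [Float], styleS: [Float])? {
        guard (1...Self.rows).contains(phonemeCount) else { return nil }
        let base = (phonemeCount - 1) * Self.cols
        let timbre = Array(values[(base + Self.timbreCols.lowerBound)..<(base + Self.timbreCols.upperBound)])
        let styleS = Array(values[(base + Self.styleSCols.lowerBound)..<(base + Self.styleSCols.upperBound)])
        return (timbre, styleS)
    }
}
```

Returning `nil` for out-of-range counts is a choice made for the sketch; the shipped type may report such cases differently.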

**CLI:**
```bash
fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json]
fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json
```
The `tts-asr-verify` batch command synthesizes each phrase, transcribes
with Parakeet, and emits per-phrase + macro/micro WER with stage
timings.

**Tests** (`Tests/FluidAudioTests/TTS/KokoroAne/`):
- 13 unit tests (vocab, voice pack) — no model deps, run on CI
- 5 E2E tests (synth + ASR roundtrip) — gated by
`FLUIDAUDIO_RUN_KOKOROANE_E2E=1`

**Docs:**
- New `Documentation/TTS/KokoroAne.md` — when-to-pick decision table,
CLI/Swift quick start, per-stage compute targets, voice pack layout,
limits, perf numbers, source links.
- Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the
ANE-resident variant.
- Updated `Documentation/README.md` index, `Documentation/Models.md` TTS
table, `Documentation/API.md` reference, `Documentation/CLI.md` example.

## Verified end-to-end on M2

Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation).
Warm load: ~300ms.

| Phrase | Synth | Audio | RTFx | ASR roundtrip |
|---|---|---|---|---|
| Hello world | 0.47s | 1.65s | 3.5× | "Hello world." (WER 0%) |
| The quick brown fox… | 0.32s | 3.18s | 9.9× | dropped "The" (WER 11%) |
| She had been waiting… | 0.25s | 2.80s | 11.4× | "Shay" misheard (WER 12.5%) |

Aggregate macro WER 7.9%, micro WER 10.5% — error is ASR-side; TTS audio
is intelligible.
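
The two aggregates differ only in weighting, and they can be reproduced from the per-phrase rows. The sketch below assumes reference lengths of 2, 9, and 8 words with 0, 1, and 1 word errors respectively. These counts are inferred from the reported percentages (the phrases are truncated in the table), so treat them as an assumption. `PhraseScore`, `macroWER`, and `microWER` are illustrative names, not part of the CLI.

```swift
import Foundation

/// Word-level error count and reference length for one phrase (illustrative type).
struct PhraseScore {
    let errors: Int
    let referenceWords: Int
    var wer: Double { Double(errors) / Double(referenceWords) }
}

/// Macro WER: unweighted mean of the per-phrase WERs.
func macroWER(_ scores: [PhraseScore]) -> Double {
    scores.map(\.wer).reduce(0, +) / Double(scores.count)
}

/// Micro WER: total word errors divided by total reference words.
func microWER(_ scores: [PhraseScore]) -> Double {
    let errors = scores.reduce(0) { $0 + $1.errors }
    let words = scores.reduce(0) { $0 + $1.referenceWords }
    return Double(errors) / Double(words)
}

// Assumed counts (errors / reference words): 0/2, 1/9, 1/8.
let scores = [
    PhraseScore(errors: 0, referenceWords: 2),
    PhraseScore(errors: 1, referenceWords: 9),
    PhraseScore(errors: 1, referenceWords: 8),
]
print(String(format: "macro %.1f%%, micro %.1f%%",
             macroWER(scores) * 100, microWER(scores) * 100))
```

Under these assumed counts the macro value is (0 + 1/9 + 1/8) / 3 ≈ 7.9% and the micro value is 2/19 ≈ 10.5%, matching the figures above. Short phrases pull the macro average down because each phrase counts equally regardless of length.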

Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert
~7-10ms each).

## Devin Review fixes addressed in this PR

- 🔴 **Partial model load wedged the store**
(`KokoroAneModelStore.loadIfNeeded`) — fixed via local `pendingModels`
accumulator + atomic commit, matching `PocketTtsModelStore`.
- 🐛 **G2P models not downloaded standalone** — `G2PModel.loadIfNeeded`
only reads from `~/.cache/fluidaudio/Models/kokoro/` and never
downloads. The kokoroAne download set didn't include G2P, so first-time
`--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic
`vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to
`getRequiredModelNames(.kokoro, …)` and a new
`KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs
before `G2PModel.shared.ensureModelsAvailable()` in
`KokoroAneManager.initialize()`.
- 🟡 **Voice pack off-by-one (false positive)** — verified upstream
`convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly
matching the existing Swift `phonemeCount - 1`. No change.
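
The partial-load fix follows a stage-then-commit pattern. The sketch below illustrates the idea under simplified assumptions: `StageStoreSketch` and `StageLoadError` are hypothetical, strings stand in for `MLModel` handles, and the loader closure is injected purely for demonstration.

```swift
import Foundation

struct StageLoadError: Error { let stage: String }

/// Hypothetical store: loads every stage into a local accumulator and
/// publishes the result only after all stages have loaded.
final class StageStoreSketch {
    private(set) var models: [String: String] = [:]
    var isLoaded: Bool { !models.isEmpty }

    func loadIfNeeded(stages: [String], loader: (String) throws -> String) throws {
        guard !isLoaded else { return }
        var pendingModels: [String: String] = [:]
        for stage in stages {
            // A throw here leaves `models` untouched, so a later call can retry.
            pendingModels[stage] = try loader(stage)
        }
        models = pendingModels  // single commit point
    }
}
```

Because nothing is written to `models` until the loop finishes, a failure on any stage leaves the store in its initial state rather than half-populated.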

## Refactor pass

Internal cleanup applied across the module after the initial
implementation landed:
- `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11
inline `outputShape + outputArray + float16Array` patterns; F0/N shapes
cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in
`outputArray` error reporting.
- `KokoroAneSynthesizer+Conversion`: extracted
`convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers
(eliminates 4× duplicated vImage buffer setup).
- `KokoroAneModelStore`: folded `voicePack(_)` +
`loadVoicePackIfNeeded(_)` into one method; dropped unreachable
post-load guard and dead synthesized-URL throw.
- `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL,
String)` so a malformed top-level JSON object reports parse-failure
instead of file-not-found; removed dead NSNumber bridging fallback.
- `KokoroAneConstants`: dropped unused `defaultLanguage`,
`voicePackTimbreSlice`, `voicePackStyleSSlice`. Changed `defaultSpeed`
from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg
sites).
- `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)` —
`KokoroAneVocab.encode` silently drops unknown chars per the upstream
Python convention.

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAne` — 13 unit tests pass, 5 E2E gated
- [x] With models staged at
`~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`:
- [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne` —
all 18 pass
- [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane
--output /tmp/ane.wav --metrics /tmp/m.json` — produces non-silent audio
+ metrics with WER
- [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt
--output-json /tmp/r.json` — aggregate WER ≤ 0.20

## Models

`FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/`
subdirectory:
```
ANE/KokoroAlbert.mlmodelc       fp16 + int8pal  (CPU+ANE)
ANE/KokoroPostAlbert.mlmodelc   fp16 + int8pal  (CPU+ANE)
ANE/KokoroAlignment.mlmodelc    fp16 + int8pal  (CPU+ANE)
ANE/KokoroProsody.mlmodelc      fp32             (CPU+GPU)
ANE/KokoroNoise.mlmodelc        fp32             (CPU+GPU)
ANE/KokoroVocoder.mlmodelc      fp16 + int8pal   (CPU+ANE)
ANE/KokoroTail.mlmodelc         fp32 + iSTFT     (CPU+GPU)
ANE/vocab.json                  114 IPA tokens
ANE/af_heart.bin                [510, 256] fp32 voice pack
```

G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`,
`g2p_vocab.json`) are pulled from the same repo's root and cached at
`~/.cache/fluidaudio/Models/kokoro/`, shared with the regular
`KokoroTtsManager` backend.

## License

Upstream (laishere) is MIT — carried forward in the mobius PR's LICENSE
file. Used with the upstream author's permission.

2026-04-27 20:08:49 -04:00

API Reference

Primary public APIs for FluidAudio components. See inline doc comments for complete details.

Components:

Common Patterns
Diarization
Voice Activity Detection
Automatic Speech Recognition
Text-to-Speech

Common Patterns

Audio Format: All modules expect 16kHz mono Float32 audio samples. Use FluidAudio.AudioConverter to convert AVAudioPCMBuffer or files to 16kHz mono for both CLI and library paths.

Model Registry: Models auto-download from HuggingFace by default. Customize the registry URL using:

ModelRegistry.baseURL (programmatic) - recommended for apps
REGISTRY_URL or MODEL_REGISTRY_URL environment variables - recommended for CLI/testing
Priority order: programmatic override → env vars → default (HuggingFace)
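
The priority order can be sketched as a pure function. This is a minimal illustration, not ModelRegistry's code: resolveRegistryURL is a hypothetical helper, the precedence of REGISTRY_URL over MODEL_REGISTRY_URL is an assumption (the text does not specify it), and the default URL shown is a placeholder for the HuggingFace default.

```swift
import Foundation

/// Hypothetical resolver: programmatic override → env vars → default.
func resolveRegistryURL(
    programmatic: String?,
    environment: [String: String] = ProcessInfo.processInfo.environment,
    defaultURL: String = "https://huggingface.co"  // placeholder default
) -> String {
    if let override = programmatic, !override.isEmpty { return override }
    // Assumed order between the two variables; the documentation does not state it.
    for key in ["REGISTRY_URL", "MODEL_REGISTRY_URL"] {
        if let value = environment[key], !value.isEmpty { return value }
    }
    return defaultURL
}
```

Passing the environment as a parameter keeps the function testable without mutating the process environment.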

Proxy Configuration: If behind a corporate firewall, set the https_proxy (or http_proxy) environment variable. Both registry URL and proxy configuration are centralized in ModelRegistry.

Error Handling: All async methods throw descriptive errors. Use proper error handling in production code.

Thread Safety: All managers are thread-safe and can be used concurrently across different queues.

Diarization

DiarizerManager

Main class for speaker diarization and "who spoke when" analysis.

Key Methods:

performCompleteDiarization(_:sampleRate:) throws -> DiarizationResult
- Process complete audio file and return speaker segments
- Parameters: RandomAccessCollection<Float> audio samples, sample rate (default: 16000)
- Returns: DiarizationResult with speaker segments and timing
validateAudio(_:) throws -> AudioValidationResult
- Validate audio quality, length, and format requirements

Configuration:

DiarizerConfig: Clustering threshold, minimum durations, activity thresholds
Optimal threshold: 0.7 (17.7% DER on AMI dataset)

OfflineDiarizerManager

Full batch pipeline that mirrors the pyannote/Core ML exporter (powerset segmentation + VBx clustering).

Requires macOS 14 / iOS 17 or later because the manager relies on Swift Concurrency features and C++ clustering shims that are unavailable on older OS releases.

Key Methods:

init(config: OfflineDiarizerConfig = .default)
- Creates manager with configuration
prepareModels(directory:configuration:forceRedownload:) async throws
- Downloads / compiles the Core ML bundles as needed and records timing metadata. Call once before processing when you don't already have OfflineDiarizerModels.
initialize(models: OfflineDiarizerModels)
- Initializes with models containing segmentation, embedding, and PLDA components (useful when you hydrate the bundles yourself).
process(audio: [Float]) async throws -> DiarizationResult
- Runs the full 10 s window pipeline: segmentation → soft mask interpolation → embedding → VBx → timeline reconstruction.
process(audioSource: StreamingAudioSampleSource, audioLoadingSeconds: TimeInterval) async throws -> DiarizationResult
- Streams audio from disk-backed sources without materializing the entire buffer in memory. Pair with StreamingAudioSourceFactory for large meetings.

Supporting Types:

OfflineDiarizerConfig
- Mirrors pyannote config.yaml (clusteringThreshold, Fa, Fb, maxVBxIterations, minDurationOn/off, batch sizes, logging flags).
SegmentationRunner
- Batches 160 k-sample chunks through the segmentation model (589 frames per chunk).
Binarization
- Converts log probabilities to soft VAD weights while retaining binary masks for diagnostics.
WeightInterpolation
- Reimplements scipy.ndimage.zoom (half-pixel offsets) so 589-frame weights align with the embedding model’s pooling stride.
EmbeddingRunner
- Runs the FBANK frontend + embedding backend, resamples masks to 589 frames, and emits 256-d L2-normalized embeddings.
PLDAScoring / VBxClustering
- Apply the exported PLDA transforms and iterative VBx refinement to group embeddings into speakers.
TimelineReconstruction
- Derives timestamps directly from the segmentation frame count and OfflineDiarizerConfig.windowDuration, then enforces minimum gap/duration constraints.
StreamingAudioSourceFactory
- Creates disk-backed or in-memory StreamingAudioSampleSource instances so large meetings never require fully materialized [Float] buffers.

Use OfflineDiarizerManager when you need offline DER parity or want to run the new CLI offline mode (fluidaudio process --mode offline, fluidaudio diarization-benchmark --mode offline).

Diarizer Protocol

SortformerDiarizer and LSEENDDiarizer both conform to the Diarizer protocol, providing a unified streaming and offline API.

Streaming: addAudio(_:sourceSampleRate:) → process() → read timeline. Convenience process(samples:sourceSampleRate:) combines both steps. Returns DiarizerTimelineUpdate? (nil when not enough audio has accumulated).

Offline: processComplete(_:sourceSampleRate:...) or processComplete(audioFileURL:...) to process a full recording in one call.

Speaker Enrollment: enrollSpeaker(withAudio:sourceSampleRate:named:...) feeds known-speaker audio before streaming to label a slot.

Lifecycle: finalizeSession() flushes trailing context so the last true frame becomes finalized. reset() clears streaming state but keeps the model loaded. cleanup() releases everything.

DiarizerTimeline & DiarizerSpeaker

DiarizerTimeline accumulates per-frame speaker probabilities and derives DiarizerSpeaker segments. Each speaker has finalizedSegments (confirmed) and tentativeSegments (may be revised). Segments expose startTime, endTime, duration, and isFinalized.

DiarizerTimelineConfig controls post-processing (onset/offset thresholds default to 0.5, min segment/gap duration, optional rolling window cap). Both diarizers accept this at init.
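
Onset/offset thresholding over per-frame probabilities can be sketched as below. This is a generic illustration of the technique rather than the timeline's implementation: SegmentSketch and segments(from:frameDuration:onset:offset:) are hypothetical names, and the minimum segment/gap durations are omitted for brevity.

```swift
import Foundation

/// Illustrative segment type; times are in seconds.
struct SegmentSketch {
    let startTime: Double
    let endTime: Double
}

/// A segment opens when a frame reaches `onset` and closes when a later frame
/// falls below `offset`. With onset > offset this acts as hysteresis.
func segments(
    from probabilities: [Float],
    frameDuration: Double,
    onset: Float = 0.5,
    offset: Float = 0.5
) -> [SegmentSketch] {
    var result: [SegmentSketch] = []
    var activeStart: Int? = nil
    for (index, p) in probabilities.enumerated() {
        if let start = activeStart {
            if p < offset {
                result.append(SegmentSketch(startTime: Double(start) * frameDuration,
                                            endTime: Double(index) * frameDuration))
                activeStart = nil
            }
        } else if p >= onset {
            activeStart = index
        }
    }
    // Close a segment that is still open at the end of the input.
    if let start = activeStart {
        result.append(SegmentSketch(startTime: Double(start) * frameDuration,
                                    endTime: Double(probabilities.count) * frameDuration))
    }
    return result
}
```

For example, with 80 ms frames the probabilities [0.1, 0.6, 0.7, 0.2, 0.9] yield two segments: frames 1 to 3 and frame 4 to the end of the input.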

Speaker Management:

upsertSpeaker(named:atIndex:) -> DiarizerSpeaker?
- Add a speaker to a slot, or update the existing speaker's name if that slot is already occupied
- If atIndex is nil, the first unused diarizer slot is chosen
upsertSpeaker(_:atIndex:transferCurrentSegment:) -> DiarizerSpeaker?
- Insert an existing DiarizerSpeaker into a slot, replacing any speaker already assigned there
- If atIndex is nil, the first unused diarizer slot is chosen
- transferCurrentSegment moves the in-progress segment (if one exists) to the new speaker before continuing
removeSpeaker(atIndex:clearCurrentSegment:) -> DiarizerSpeaker?
- Remove the speaker assigned to a diarizer output slot and return the removed speaker if present
- clearCurrentSegment resets the in-progress speaking state for that slot before continuing
speakers: [Int: DiarizerSpeaker]
- Read or replace the full slot-to-speaker mapping directly when needed

SortformerDiarizer

Streaming diarization using NVIDIA's Sortformer. 4 fixed speaker slots, 16 kHz input, 80 ms frame duration.

let diarizer = SortformerDiarizer(config: .default, timelineConfig: .sortformerDefault)
try await diarizer.initialize(mainModelPath: modelURL)

Config presets: .default / .fastV2_1 (1.04 s latency), .balancedV2_1 (1.04 s, 20.6% DER on AMI SDM), .highContextV2_1 (30.4 s latency). v2 variants also available.

LSEENDDiarizer

Streaming diarization using LS-EEND. Variable speaker slots, 8 kHz input, 100 ms frame duration, 20.7% DER on AMI SDM.

let diarizer = LSEENDDiarizer(computeUnits: .cpuOnly)
try await diarizer.initialize(variant: .dihard3)

Variants: ami, callhome, dihard2, dihard3 (via LSEENDModelDescriptor.loadFromHuggingFace(variant:)).

Call finalizeSession() at end-of-stream to flush pending audio before reading the final timeline.

Voice Activity Detection

VadManager

Voice activity detection using the Silero VAD Core ML model with 256 ms unified inference and ANE optimizations.

Key Methods:

process(_ url: URL) async throws -> [VadResult]
- Process an audio file end-to-end. Automatically converts to 16kHz mono Float32 and processes in 4096-sample frames (256 ms).
process(_ buffer: AVAudioPCMBuffer) async throws -> [VadResult]
- Convert and process an in-memory buffer. Supports any input format; resampled to 16kHz mono internally.
process(_ samples: [Float]) async throws -> [VadResult]
- Process pre-converted 16kHz mono samples.
processChunk(_:inputState:) async throws -> VadResult
- Process a single 4096-sample frame (256 ms at 16 kHz) with optional recurrent state.

Constants:

VadManager.chunkSize = 4096 // samples per frame (256 ms @ 16 kHz, plus 64-sample context managed internally)
VadManager.sampleRate = 16000
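
The frame arithmetic follows from these constants: 4096 samples at 16 000 Hz is 0.256 s. Below is a minimal sketch of splitting a sample buffer into fixed-size frames, assuming the final partial frame is zero-padded. How VadManager treats the tail is not stated here, and frames(from:chunkSize:) is a hypothetical helper.

```swift
import Foundation

let chunkSize = 4096     // samples per frame, as in VadManager.chunkSize
let sampleRate = 16_000  // Hz, as in VadManager.sampleRate

/// Splits `samples` into frames of `chunkSize`, zero-padding the last frame (assumed behavior).
func frames(from samples: [Float], chunkSize: Int) -> [[Float]] {
    stride(from: 0, to: samples.count, by: chunkSize).map { (start: Int) -> [Float] in
        let end = min(start + chunkSize, samples.count)
        var frame = Array(samples[start..<end])
        frame.append(contentsOf: repeatElement(Float(0), count: chunkSize - frame.count))
        return frame
    }
}

let frameSeconds = Double(chunkSize) / Double(sampleRate)  // 0.256
```

A 10 000-sample buffer therefore produces three frames: two full ones and a third holding 1808 real samples followed by padding.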

Configuration (VadConfig):

defaultThreshold: Float — Baseline decision threshold (0.0–1.0) used when segmentation does not override. Default: 0.85.
debugMode: Bool — Extra logging for benchmarking and troubleshooting. Default: false.
computeUnits: MLComputeUnits — Core ML compute target. Default: .cpuAndNeuralEngine.

Recommended defaultThreshold ranges depend on your acoustic conditions:

Clean speech: 0.7–0.9
Noisy/mixed content: 0.3–0.6 (higher recall, more false positives)

Performance:

Optimized for Apple Neural Engine (ANE) with aligned MLMultiArray buffers, silent-frame short-circuiting, and recurrent state reuse (hidden/cell/context) for sequential inference.
Processes 8×32 ms audio windows (256 ms) in a single Core ML call, which significantly improves throughput.

Automatic Speech Recognition

AsrManager

Automatic speech recognition using Parakeet TDT models (v2 English-only, v3 multilingual).

Key Methods:

transcribe(_:source:) async throws -> ASRResult
- Accepts [Float] samples already converted to 16 kHz mono; returns transcription text, confidence, and token timings.
transcribe(_ url: URL, source:) async throws -> ASRResult
- Loads the file directly and performs format conversion internally (AudioConverter).
transcribe(_ buffer: AVAudioPCMBuffer, source:) async throws -> ASRResult
- Convenience overload for capture pipelines that already produce PCM buffers.
initialize(models:) async throws
- Load and initialize ASR models (automatic download if needed)

Model Management:

AsrModels.downloadAndLoad(version: AsrModelVersion = .v3) async throws -> AsrModels
- Download models from HuggingFace and compile for CoreML
- Pass .v2 to load the English-only bundle when you do not need multilingual coverage
- Models cached locally after first download
ASRConfig: Beam size, temperature, language model weights

Audio Processing:

AudioConverter.resampleAudioFile(path:) throws -> [Float]
- Load and convert audio files to 16kHz mono Float32 (WAV, M4A, MP3, FLAC)
AudioConverter.resampleBuffer(_ buffer: AVAudioPCMBuffer) throws -> [Float]
- Convert a buffer to 16kHz mono (stateless conversion)
AudioSource: .microphone or .system for different processing paths

Warning: Avoid hand-decoding audio payloads (e.g., truncating WAV headers or treating bytes as raw Int16 samples). The Core ML models require correctly resampled 16 kHz mono Float32 tensors; manual parsing will silently corrupt input when formats carry metadata chunks, different bit depths, stereo channels, or compression. Always route files and live buffers through AudioConverter before calling AsrManager.transcribe.

Performance:

Real-time factor: ~120x on M4 Pro (processes 1min audio in 0.5s)
Languages: 25 European languages supported
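
The real-time factor quoted here is audio duration divided by processing time, so 60 s of audio in 0.5 s gives 120×. A one-function sketch (the helper name realTimeFactor is illustrative):

```swift
import Foundation

/// Speed-up over real time: seconds of audio handled per second of compute.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

print(realTimeFactor(audioSeconds: 60, processingSeconds: 0.5))  // 120.0
```

Values above 1 mean faster than real time under this convention.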

StreamingEouAsrManager

Real-time streaming ASR with End-of-Utterance detection using Parakeet EOU models.

Key Methods:

init(configuration:chunkSize:eouDebounceMs:debugFeatures:)
- Create manager with MLModel configuration and chunk size
- chunkSize: .ms160 (default), .ms320, or .ms1600
- eouDebounceMs: Minimum silence duration before EOU triggers (default: 1280)
loadModels(modelDir:) async throws
- Load CoreML models from directory (encoder, decoder, joint, vocab)
process(audioBuffer:) async throws -> String
- Process audio incrementally, returns empty string (use finish() for transcript)
finish() async throws -> String
- Finalize processing and return accumulated transcript
reset() async
- Reset all state for next utterance
setEouCallback(_:)
- Set callback invoked when End-of-Utterance is detected
appendAudio(_:) throws
- Append audio to buffer without processing (for VAD integration)

Properties:

eouDetected: Bool — Whether EOU was detected in the last chunk
eouDebounceMs: Int — Minimum silence duration before EOU triggers
chunkSize: StreamingChunkSize — Current chunk size configuration

StreamingChunkSize:

.ms160 — 160ms chunks, lowest latency, ~8% WER
.ms320 — 320ms chunks, balanced, ~5% WER
.ms1600 — 1600ms chunks, highest throughput

Usage:

let manager = StreamingEouAsrManager(chunkSize: .ms160, eouDebounceMs: 1280)
try await manager.loadModels(modelDir: modelsURL)

// Process audio incrementally
_ = try await manager.process(audioBuffer: buffer1)
_ = try await manager.process(audioBuffer: buffer2)

// Get final transcript
let transcript = try await manager.finish()

// Reset for next utterance
await manager.reset()

Performance:

Real-time factor: ~5x RTF (160ms), ~12x RTF (320ms) on Apple Silicon
WER: ~8% (160ms), ~5% (320ms) on LibriSpeech test-clean

SlidingWindowAsrManager

Real-time sliding window ASR with overlap and cancellation support.

Key Methods:

init(models:config:) async throws
- Initialize with ASR models and configuration
transcribeChunk(_:isLastChunk:) async throws -> ASRResult
- Process audio chunk with sliding window overlap
- Returns accumulated transcript with proper handling of chunk boundaries
reset()
- Reset internal state for new session

Configuration:

Default chunk size: ~14.96 seconds
Default overlap: 2.0 seconds
Supports cancellation via Task cancellation
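
The default chunk and overlap imply a fixed hop between windows. Below is a minimal sketch of the resulting sample ranges at 16 kHz, assuming each window starts chunk - overlap samples after the previous one. The manager's actual boundary handling may differ, and windowRanges is a hypothetical helper.

```swift
import Foundation

/// Hypothetical helper: sample ranges for overlapping windows.
/// Assumes each window starts `chunk - overlap` samples after the previous one.
func windowRanges(totalSamples: Int, chunk: Int, overlap: Int) -> [Range<Int>] {
    precondition(overlap >= 0 && chunk > overlap, "chunk must exceed overlap")
    let hop = chunk - overlap
    var ranges: [Range<Int>] = []
    var start = 0
    while start < totalSamples {
        let end = min(start + chunk, totalSamples)
        ranges.append(start..<end)
        if end == totalSamples { break }  // last window reached the end of the audio
        start += hop
    }
    return ranges
}

let sampleRate = 16_000.0
let chunk = Int((14.96 * sampleRate).rounded())  // 239_360 samples
let overlap = Int(2.0 * sampleRate)              // 32_000 samples
```

With these values the hop is 207 360 samples (12.96 s), so 500 000 samples of audio are covered by three windows.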

Usage:

let manager = try await SlidingWindowAsrManager()

for audioChunk in audioStream {
    let result = try await manager.transcribeChunk(
        audioChunk,
        isLastChunk: false
    )
    print("Partial: \(result.text)")
}

// Process final chunk
let final = try await manager.transcribeChunk(lastChunk, isLastChunk: true)
print("Final: \(final.text)")

StreamingNemotronAsrManager

NVIDIA Nemotron streaming ASR with encoder cache for low-latency processing.

Key Methods:

init(chunkSize:configuration:) async throws
- Initialize with chunk size (160ms, 320ms, or 1600ms)
loadModels(modelDir:) async throws
- Load CoreML models from directory
transcribe(_:) async throws -> String
- Process audio and return transcript
reset() async
- Reset encoder cache and decoder state

Chunk Sizes:

.ms160 — 160ms chunks, lowest latency
.ms320 — 320ms chunks, balanced
.ms1600 — 1600ms chunks, highest throughput

Performance:

Real-time factor: ~0.2x on Apple Silicon
Maintains encoder cache across chunks for efficiency

Qwen3AsrManager

Qwen3-based speech recognition with Whisper mel spectrogram frontend.

Key Methods:

init(modelDir:configuration:) async throws
- Initialize with model directory and CoreML configuration
transcribe(_:) async throws -> String
- Transcribe audio samples (16kHz mono Float32)
transcribe(_:) async throws -> String
- Transcribe from audio file URL

Features:

Whisper-style mel spectrogram processing
Multi-language support
Experimental high-accuracy model

Text-to-Speech (TTS)

KokoroTtsManager

Text-to-speech synthesis using Kokoro CoreML models.

Key Methods:

init(defaultVoice:defaultSpeakerId:directory:computeUnits:customLexicon:)
- Create TTS manager with optional configuration
- computeUnits: Use .cpuAndGPU on iOS 26+ to avoid ANE issues
initialize(preloadVoices:) async throws
- Download and initialize TTS models
- Optionally preload specific voices
synthesize(text:voice:speakerId:speed:pitch:) async throws -> [Float]
- Synthesize speech from text
- Returns audio samples at 24kHz
synthesizeDetailed(text:voice:speakerId:speed:pitch:) async throws -> (audio: [Float], alignments: [AlignmentInfo])
- Synthesize with phoneme-level timing information
synthesizeToFile(text:outputURL:voice:speakerId:speed:pitch:) async throws
- Synthesize directly to WAV file
setDefaultVoice(_:speakerId:) async throws
- Change default voice for subsequent synthesis
setCustomLexicon(_:)
- Set custom pronunciation dictionary
cleanup()
- Release models and free memory

Available Voices:

af — American Female
af_bella, af_nicole, af_sarah — American Female variants
am — American Male
am_adam, am_michael — American Male variants
bf — British Female
bm — British Male

Configuration:

defaultVoice: Voice identifier (default: "af")
defaultSpeakerId: Speaker ID for multi-speaker voices (default: 0)
speed: Speech rate multiplier (0.5–2.0, default: 1.0)
pitch: Pitch shift in semitones (-12 to +12, default: 0)
customLexicon: Custom pronunciation dictionary

Usage:

let manager = KokoroTtsManager(defaultVoice: "af")
try await manager.initialize()

let audio = try await manager.synthesize(
    text: "Hello from FluidAudio!",
    speed: 1.0,
    pitch: 0
)

// Save to file
try await manager.synthesizeToFile(
    text: "Hello world",
    outputURL: URL(fileURLWithPath: "output.wav")
)

Performance:

Real-time factor: ~5-10x on Apple Silicon
Output sample rate: 24kHz
Supports SSML for prosody control

KokoroAneManager

ANE-resident sibling of KokoroTtsManager: splits the Kokoro 82M graph into 7 CoreML stages so the ANE-friendly layers stay resident on the Neural Engine. Runs at 3-11× RTFx on Apple Silicon, compared with ~5-10× for the single-graph KokoroTtsManager. See KokoroAne for the full pipeline.

Key Methods:

init(defaultVoice:directory:computeUnits:modelStore:)
- Defaults: defaultVoice = "af_heart", computeUnits = .default (per-stage assignment matching the laishere upstream)
initialize(preloadVoices:) async throws
- Download (if missing) and load all 7 .mlmodelc bundles + vocab.json + af_heart.bin
synthesize(text:voice:speed:) async throws -> Data
- One-shot text → 24 kHz mono 16-bit PCM WAV
synthesizeDetailed(text:voice:speed:) async throws -> KokoroAneSynthesisResult
- Returns samples + per-stage timings
synthesizeFromPhonemes(_:voice:speed:) async throws -> Data
- Bypass G2P; feed an already-IPA phoneme string directly
synthesizeFromPhonemesDetailed(_:voice:speed:) async throws -> KokoroAneSynthesisResult
setDefaultVoice(_:) — override default voice for subsequent calls
isAvailable() async -> Bool
cleanup() async — drop loaded mlmodelcs + voice packs

Configuration:

defaultVoice: voice id (default "af_heart" — only voice currently shipped)
directory: optional cache directory override
computeUnits: KokoroAneComputeUnits (per-stage MLComputeUnits)
- .default — Albert/PostAlbert/Alignment/Vocoder on cpuAndNeuralEngine, Prosody/Noise/Tail on .all
- .cpuAndGpu — skip ANE entirely (debug baseline)
speed: speech rate multiplier (default 1.0)

Limits:

≤ 510 IPA phonemes per call (no built-in chunker)
Single voice (af_heart)
No SSML / custom lexicon / markdown overrides
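
The 510-phoneme limit is consistent with a 512-token maximum minus the BOS and EOS wrap. Below is a minimal sketch of such an encoder, assuming a per-character vocabulary and a shared BOS/EOS id of 0 (both are assumptions). encodeSketch is hypothetical and returns nil on overflow, whereas the real error behavior is not described here. Unknown characters are dropped silently, as in the upstream Python convention.

```swift
import Foundation

/// Hypothetical encoder: IPA characters → token ids, wrapped in BOS/EOS.
/// Returns nil when the phoneme ids would not fit in `maxTokens` after wrapping.
func encodeSketch(
    _ phonemes: String,
    vocab: [Character: Int],
    bosEos: Int = 0,      // assumed id used for both BOS and EOS
    maxTokens: Int = 512
) -> [Int]? {
    let ids = phonemes.compactMap { vocab[$0] }  // characters missing from vocab are dropped
    guard ids.count <= maxTokens - 2 else { return nil }
    return [bosEos] + ids + [bosEos]
}
```

Because there is no built-in chunker, longer input would need to be split by the caller before encoding.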

Usage:

let manager = KokoroAneManager()
try await manager.initialize()

let wav = try await manager.synthesize(text: "Hello from FluidAudio!")
try wav.write(to: URL(fileURLWithPath: "/tmp/demo.wav"))

// With per-stage timings:
let detail = try await manager.synthesizeDetailed(text: "Hi.")
print("samples: \(detail.samples.count) @ \(detail.sampleRate) Hz")
let t = detail.timings
print("  albert=\(t.albert) postAlbert=\(t.postAlbert) alignment=\(t.alignment)")
print("  prosody=\(t.prosody) noise=\(t.noise) vocoder=\(t.vocoder) tail=\(t.tail)")
print("  total: \(t.totalMs) ms")

Performance:

Real-time factor: 3-11× RTFx on Apple Silicon (vs. 5-10× for KokoroTtsManager)
Cold load (first ever, ANE compile): ~20 s; warm load: ~0.3 s
Output sample rate: 24 kHz

PocketTtsManager

Lightweight streaming TTS with voice cloning support.

Key Methods:

init(directory:computeUnits:) async throws
- Initialize with optional directory and compute units
synthesize(text:voice:speakerId:) async throws -> [Float]
- Synthesize speech from text
- Returns audio samples at 24kHz
synthesizeStreaming(text:voice:speakerId:) -> AsyncThrowingStream<[Float], Error>
- Stream audio chunks as they are generated
cloneVoice(from:name:) async throws -> String
- Clone voice from reference audio
- Returns voice ID for later use
cleanup()
- Release models and free memory

Features:

Streaming synthesis for long text
Voice cloning from short audio samples
Lower memory footprint than Kokoro
Faster synthesis for real-time applications

Usage:

let manager = try await PocketTtsManager()

// Basic synthesis
let audio = try await manager.synthesize(text: "Hello world")

// Streaming synthesis
for try await chunk in manager.synthesizeStreaming(text: longText) {
    // Play audio chunk immediately
    playAudio(chunk)
}
