mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
a0092cf163
1. LS-EEND had a memory leak since the autorelease pool was not releasing the multiarrays properly and was allocating new ones every chunk. Switched to backed output arrays to eliminate new allocations 2. LS-EEND docs were somewhat stale. Updated them to reflect the new API ---------
497 lines
22 KiB
Markdown
497 lines
22 KiB
Markdown
# API Reference
|
||
|
||
Primary public APIs for FluidAudio components. See inline doc comments for complete details.
|
||
|
||
**Components:**
|
||
- [Common Patterns](#common-patterns)
|
||
- [Diarization](#diarization)
|
||
- [Voice Activity Detection](#voice-activity-detection)
|
||
- [Automatic Speech Recognition](#automatic-speech-recognition)
|
||
- [Text-to-Speech](#text-to-speech)
|
||
|
||
## Common Patterns
|
||
|
||
**Audio Format:** All modules expect 16kHz mono Float32 audio samples. Use `FluidAudio.AudioConverter` to convert `AVAudioPCMBuffer` or files to 16kHz mono for both CLI and library paths.
|
||
|
||
**Model Registry:** Models auto-download from HuggingFace by default. Customize the registry URL using:
|
||
- `ModelRegistry.baseURL` (programmatic) - recommended for apps
|
||
- `REGISTRY_URL` or `MODEL_REGISTRY_URL` environment variables - recommended for CLI/testing
|
||
- Priority order: programmatic override → env vars → default (HuggingFace)
|
||
|
||
**Proxy Configuration:** If behind a corporate firewall, set the `https_proxy` (or `http_proxy`) environment variable. Both registry URL and proxy configuration are centralized in `ModelRegistry`.
|
||
|
||
**Error Handling:** All async methods throw descriptive errors. Use proper error handling in production code.
|
||
|
||
**Thread Safety:** All managers are thread-safe and can be used concurrently across different queues.
|
||
|
||
## Diarization
|
||
|
||
### DiarizerManager
|
||
Main class for speaker diarization and "who spoke when" analysis.
|
||
|
||
**Key Methods:**
|
||
- `performCompleteDiarization(_:sampleRate:) throws -> DiarizationResult`
|
||
- Process complete audio file and return speaker segments
|
||
- Parameters: `RandomAccessCollection<Float>` audio samples, sample rate (default: 16000)
|
||
- Returns: `DiarizerResult` with speaker segments and timing
|
||
- `validateAudio(_:) throws -> AudioValidationResult`
|
||
- Validate audio quality, length, and format requirements
|
||
|
||
**Configuration:**
|
||
- `DiarizerConfig`: Clustering threshold, minimum durations, activity thresholds
|
||
- Optimal threshold: 0.7 (17.7% DER on AMI dataset)
|
||
|
||
### OfflineDiarizerManager
|
||
Full batch pipeline that mirrors the pyannote/Core ML exporter (powerset segmentation + VBx clustering).
|
||
|
||
> Requires macOS 14 / iOS 17 or later because the manager relies on Swift Concurrency features and C++ clustering shims that are unavailable on older OS releases.
|
||
|
||
**Key Methods:**
|
||
- `init(config: OfflineDiarizerConfig = .default)`
|
||
- Creates manager with configuration
|
||
- `prepareModels(directory:configuration:forceRedownload:) async throws`
|
||
- Downloads / compiles the Core ML bundles as needed and records timing metadata. Call once before processing when you don't already have `OfflineDiarizerModels`.
|
||
- `initialize(models: OfflineDiarizerModels)`
|
||
- Initializes with models containing segmentation, embedding, and PLDA components (useful when you hydrate the bundles yourself).
|
||
- `process(audio: [Float]) async throws -> DiarizationResult`
|
||
- Runs the full 10 s window pipeline: segmentation → soft mask interpolation → embedding → VBx → timeline reconstruction.
|
||
- `process(audioSource: StreamingAudioSampleSource, audioLoadingSeconds: TimeInterval) async throws -> DiarizationResult`
|
||
- Streams audio from disk-backed sources without materializing the entire buffer in memory. Pair with `StreamingAudioSourceFactory` for large meetings.
|
||
|
||
**Supporting Types:**
|
||
- `OfflineDiarizerConfig`
|
||
- Mirrors pyannote `config.yaml` (`clusteringThreshold`, `Fa`, `Fb`, `maxVBxIterations`, `minDurationOn/off`, batch sizes, logging flags).
|
||
- `SegmentationRunner`
|
||
- Batches 160 k-sample chunks through the segmentation model (589 frames per chunk).
|
||
- `Binarization`
|
||
- Converts log probabilities to soft VAD weights while retaining binary masks for diagnostics.
|
||
- `WeightInterpolation`
|
||
- Reimplements `scipy.ndimage.zoom` (half-pixel offsets) so 589-frame weights align with the embedding model’s pooling stride.
|
||
- `EmbeddingRunner`
|
||
- Runs the FBANK frontend + embedding backend, resamples masks to 589 frames, and emits 256-d L2-normalized embeddings.
|
||
- `PLDAScoring` / `VBxClustering`
|
||
- Apply the exported PLDA transforms and iterative VBx refinement to group embeddings into speakers.
|
||
- `TimelineReconstruction`
|
||
- Derives timestamps directly from the segmentation frame count and `OfflineDiarizerConfig.windowDuration`, then enforces minimum gap/duration constraints.
|
||
- `StreamingAudioSourceFactory`
|
||
- Creates disk-backed or in-memory `StreamingAudioSampleSource` instances so large meetings never require fully materialized `[Float]` buffers.
|
||
|
||
Use `OfflineDiarizerManager` when you need offline DER parity or want to run the new CLI offline mode (`fluidaudio process --mode offline`, `fluidaudio diarization-benchmark --mode offline`).
|
||
|
||
---
|
||
|
||
### Diarizer Protocol
|
||
|
||
`SortformerDiarizer` and `LSEENDDiarizer` both conform to the `Diarizer` protocol, providing a unified streaming and offline API.
|
||
|
||
**Streaming:** `addAudio(_:sourceSampleRate:)` → `process()` → read `timeline`. Convenience `process(samples:sourceSampleRate:)` combines both steps. Returns `DiarizerTimelineUpdate?` (`nil` when not enough audio has accumulated).
|
||
|
||
**Offline:** `processComplete(_:sourceSampleRate:...)` or `processComplete(audioFileURL:...)` to process a full recording in one call.
|
||
|
||
**Speaker Enrollment:** `enrollSpeaker(withAudio:sourceSampleRate:named:...)` feeds known-speaker audio before streaming to label a slot.
|
||
|
||
**Lifecycle:** `finalizeSession()` flushes trailing context so the last true frame becomes finalized. `reset()` clears streaming state but keeps the model loaded. `cleanup()` releases everything.
|
||
|
||
---
|
||
|
||
### DiarizerTimeline & DiarizerSpeaker
|
||
|
||
`DiarizerTimeline` accumulates per-frame speaker probabilities and derives `DiarizerSpeaker` segments. Each speaker has `finalizedSegments` (confirmed) and `tentativeSegments` (may be revised). Segments expose `startTime`, `endTime`, `duration`, and `isFinalized`.
|
||
|
||
**`DiarizerTimelineConfig`** controls post-processing (onset/offset thresholds default to 0.5, min segment/gap duration, optional rolling window cap, and `storeSegments` for emit-only mode that skips creating `DiarizerSpeaker` objects entirely). Both diarizers accept this at init. See [DiarizerTimeline.md](Diarization/DiarizerTimeline.md#emit-only-mode-storesegments--false) for the emit-only mode contract.
|
||
|
||
**Speaker Management:**
|
||
- `upsertSpeaker(named:atIndex:) -> DiarizerSpeaker?`
|
||
- Add a speaker to a slot, or update the existing speaker's name if that slot is already occupied
|
||
- If `atIndex` is `nil`, the first unused diarizer slot is chosen
|
||
- `upsertSpeaker(_:atIndex:transferCurrentSegment:) -> DiarizerSpeaker?`
|
||
- Insert an existing `DiarizerSpeaker` into a slot, replacing any speaker already assigned there
|
||
- If `atIndex` is `nil`, the first unused diarizer slot is chosen
|
||
- `transferCurrentSegment` moves the in-progress segment (if one exists) to the new speaker before continuing
|
||
- `removeSpeaker(atIndex:clearCurrentSegment:) -> DiarizerSpeaker?`
|
||
- Remove the speaker assigned to a diarizer output slot and return the removed speaker if present
|
||
- `clearCurrentSegment` resets the in-progress speaking state for that slot before continuing
|
||
- `speakers: [Int: DiarizerSpeaker]`
|
||
- Read or replace the full slot-to-speaker mapping directly when needed
|
||
|
||
---
|
||
|
||
### SortformerDiarizer
|
||
|
||
Streaming diarization using NVIDIA's Sortformer. 4 fixed speaker slots, 16 kHz input, 80 ms frame duration.
|
||
|
||
```swift
|
||
let diarizer = SortformerDiarizer(config: .default, timelineConfig: .sortformerDefault)
|
||
try await diarizer.initialize(mainModelPath: modelURL)
|
||
```
|
||
|
||
**Config presets:** `.default` / `.fastV2_1` (1.04 s latency), `.balancedV2_1` (1.04 s, 20.6% DER on AMI SDM), `.highContextV2_1` (30.4 s latency). v2 variants also available.
|
||
|
||
---
|
||
|
||
### LSEENDDiarizer
|
||
|
||
Streaming diarization using LS-EEND. Variable speaker slots, 8 kHz input, 100 ms frame duration, 20.7% DER on AMI SDM.
|
||
|
||
```swift
|
||
let diarizer = try await LSEENDDiarizer(variant: .dihard3)
|
||
```
|
||
|
||
**Variants:** ami, callhome, dihard2, dihard3 (via `LSEENDModel.loadFromHuggingFace(variant:stepSize:)`). The optional `LSEENDStepSize` selects how many output frames the model commits per CoreML call (`.step100ms` … `.step500ms`); smaller steps reduce latency, larger steps raise throughput.
|
||
|
||
Call `finalizeSession()` at end-of-stream to flush pending audio before reading the final timeline.
|
||
|
||
## Voice Activity Detection
|
||
|
||
### VadManager
|
||
Voice activity detection using the Silero VAD Core ML model with 256 ms unified inference and ANE optimizations.
|
||
|
||
**Key Methods:**
|
||
- `process(_ url: URL) async throws -> [VadResult]`
|
||
- Process an audio file end-to-end. Automatically converts to 16kHz mono Float32 and processes in 4096-sample frames (256 ms).
|
||
- `process(_ buffer: AVAudioPCMBuffer) async throws -> [VadResult]`
|
||
- Convert and process an in-memory buffer. Supports any input format; resampled to 16kHz mono internally.
|
||
- `process(_ samples: [Float]) async throws -> [VadResult]`
|
||
- Process pre-converted 16kHz mono samples.
|
||
- `processChunk(_:inputState:) async throws -> VadResult`
|
||
- Process a single 4096-sample frame (256 ms at 16 kHz) with optional recurrent state.
|
||
|
||
**Constants:**
|
||
- `VadManager.chunkSize = 4096` // samples per frame (256 ms @ 16 kHz, plus 64-sample context managed internally)
|
||
- `VadManager.sampleRate = 16000`
|
||
|
||
**Configuration (`VadConfig`):**
|
||
- `defaultThreshold: Float` — Baseline decision threshold (0.0–1.0) used when segmentation does not override. Default: `0.85`.
|
||
- `debugMode: Bool` — Extra logging for benchmarking and troubleshooting. Default: `false`.
|
||
- `computeUnits: MLComputeUnits` — Core ML compute target. Default: `.cpuAndNeuralEngine`.
|
||
|
||
Recommended `defaultThreshold` ranges depend on your acoustic conditions:
|
||
- Clean speech: 0.7–0.9
|
||
- Noisy/mixed content: 0.3–0.6 (higher recall, more false positives)
|
||
|
||
**Performance:**
|
||
- Optimized for Apple Neural Engine (ANE) with aligned `MLMultiArray` buffers, silent-frame short-circuiting, and recurrent state reuse (hidden/cell/context) for sequential inference.
|
||
- Significantly improved throughput by processing 8×32 ms audio windows in a single Core ML call.
|
||
|
||
## Automatic Speech Recognition
|
||
|
||
### AsrManager
|
||
Automatic speech recognition using Parakeet TDT models (v2 English-only, v3 multilingual).
|
||
|
||
**Key Methods:**
|
||
- `transcribe(_:source:) async throws -> ASRResult`
|
||
- Accepts `[Float]` samples already converted to 16 kHz mono; returns transcription text, confidence, and token timings.
|
||
- `transcribe(_ url: URL, source:) async throws -> ASRResult`
|
||
- Loads the file directly and performs format conversion internally (`AudioConverter`).
|
||
- `transcribe(_ buffer: AVAudioPCMBuffer, source:) async throws -> ASRResult`
|
||
- Convenience overload for capture pipelines that already produce PCM buffers.
|
||
- `initialize(models:) async throws`
|
||
- Load and initialize ASR models (automatic download if needed)
|
||
|
||
**Model Management:**
|
||
- `AsrModels.downloadAndLoad(version: AsrModelVersion = .v3) async throws -> AsrModels`
|
||
- Download models from HuggingFace and compile for CoreML
|
||
- Pass `.v2` to load the English-only bundle when you do not need multilingual coverage
|
||
- Models cached locally after first download
|
||
- `ASRConfig`: Beam size, temperature, language model weights
|
||
|
||
- **Audio Processing:**
|
||
- `AudioConverter.resampleAudioFile(path:) throws -> [Float]`
|
||
- Load and convert audio files to 16kHz mono Float32 (WAV, M4A, MP3, FLAC)
|
||
- `AudioConverter.resampleBuffer(_ buffer: AVAudioPCMBuffer) throws -> [Float]`
|
||
- Convert a buffer to 16kHz mono (stateless conversion)
|
||
- `AudioSource`: `.microphone` or `.system` for different processing paths
|
||
|
||
> **Warning:** Avoid hand-decoding audio payloads (e.g., truncating WAV headers or treating bytes as raw `Int16` samples).
|
||
> The Core ML models require correctly resampled 16 kHz mono Float32 tensors; manual parsing will silently corrupt input when
|
||
> formats carry metadata chunks, different bit depths, stereo channels, or compression. Always route files and live buffers
|
||
> through `AudioConverter` before calling `AsrManager.transcribe`.
|
||
|
||
**Performance:**
|
||
- Real-time factor: ~120x on M4 Pro (processes 1min audio in 0.5s)
|
||
- Languages: 25 European languages supported
|
||
|
||
### StreamingEouAsrManager
|
||
Real-time streaming ASR with End-of-Utterance detection using Parakeet EOU models.
|
||
|
||
**Key Methods:**
|
||
- `init(configuration:chunkSize:eouDebounceMs:debugFeatures:)`
|
||
- Create manager with MLModel configuration and chunk size
|
||
- `chunkSize`: `.ms160` (default), `.ms320`, or `.ms1600`
|
||
- `eouDebounceMs`: Minimum silence duration before EOU triggers (default: 1280)
|
||
- `loadModels(modelDir:) async throws`
|
||
- Load CoreML models from directory (encoder, decoder, joint, vocab)
|
||
- `process(audioBuffer:) async throws -> String`
|
||
- Process audio incrementally, returns empty string (use `finish()` for transcript)
|
||
- `finish() async throws -> String`
|
||
- Finalize processing and return accumulated transcript
|
||
- `reset() async`
|
||
- Reset all state for next utterance
|
||
- `setEouCallback(_:)`
|
||
- Set callback invoked when End-of-Utterance is detected
|
||
- `appendAudio(_:) throws`
|
||
- Append audio to buffer without processing (for VAD integration)
|
||
|
||
**Properties:**
|
||
- `eouDetected: Bool` — Whether EOU was detected in the last chunk
|
||
- `eouDebounceMs: Int` — Minimum silence duration before EOU triggers
|
||
- `chunkSize: StreamingChunkSize` — Current chunk size configuration
|
||
|
||
**StreamingChunkSize:**
|
||
- `.ms160` — 160ms chunks, lowest latency, ~8% WER
|
||
- `.ms320` — 320ms chunks, balanced, ~5% WER
|
||
- `.ms1600` — 1600ms chunks, highest throughput
|
||
|
||
**Usage:**
|
||
```swift
|
||
let manager = StreamingEouAsrManager(chunkSize: .ms160, eouDebounceMs: 1280)
|
||
try await manager.loadModels(modelDir: modelsURL)
|
||
|
||
// Process audio incrementally
|
||
_ = try await manager.process(audioBuffer: buffer1)
|
||
_ = try await manager.process(audioBuffer: buffer2)
|
||
|
||
// Get final transcript
|
||
let transcript = try await manager.finish()
|
||
|
||
// Reset for next utterance
|
||
await manager.reset()
|
||
```
|
||
|
||
**Performance:**
|
||
- Real-time factor: ~5x RTF (160ms), ~12x RTF (320ms) on Apple Silicon
|
||
- WER: ~8% (160ms), ~5% (320ms) on LibriSpeech test-clean
|
||
|
||
### SlidingWindowAsrManager
|
||
Real-time sliding window ASR with overlap and cancellation support.
|
||
|
||
**Key Methods:**
|
||
- `init(models:config:) async throws`
|
||
- Initialize with ASR models and configuration
|
||
- `transcribeChunk(_:isLastChunk:) async throws -> ASRResult`
|
||
- Process audio chunk with sliding window overlap
|
||
- Returns accumulated transcript with proper handling of chunk boundaries
|
||
- `reset()`
|
||
- Reset internal state for new session
|
||
|
||
**Configuration:**
|
||
- Default chunk size: ~14.96 seconds
|
||
- Default overlap: 2.0 seconds
|
||
- Supports cancellation via Task cancellation
|
||
|
||
**Usage:**
|
||
```swift
|
||
let manager = try await SlidingWindowAsrManager()
|
||
|
||
for audioChunk in audioStream {
|
||
let result = try await manager.transcribeChunk(
|
||
audioChunk,
|
||
isLastChunk: false
|
||
)
|
||
print("Partial: \(result.text)")
|
||
}
|
||
|
||
// Process final chunk
|
||
let final = try await manager.transcribeChunk(lastChunk, isLastChunk: true)
|
||
print("Final: \(final.text)")
|
||
```
|
||
|
||
### StreamingNemotronAsrManager
|
||
NVIDIA Nemotron streaming ASR with encoder cache for low-latency processing.
|
||
|
||
**Key Methods:**
|
||
- `init(chunkSize:configuration:) async throws`
|
||
- Initialize with chunk size (160ms, 320ms, or 1600ms)
|
||
- `loadModels(modelDir:) async throws`
|
||
- Load CoreML models from directory
|
||
- `transcribe(_:) async throws -> String`
|
||
- Process audio and return transcript
|
||
- `reset() async`
|
||
- Reset encoder cache and decoder state
|
||
|
||
**Chunk Sizes:**
|
||
- `.ms160` — 160ms chunks, lowest latency
|
||
- `.ms320` — 320ms chunks, balanced
|
||
- `.ms1600` — 1600ms chunks, highest throughput
|
||
|
||
**Performance:**
|
||
- Real-time factor: ~0.2x on Apple Silicon
|
||
- Maintains encoder cache across chunks for efficiency
|
||
|
||
### Qwen3AsrManager
|
||
Qwen3-based speech recognition with Whisper mel spectrogram frontend.
|
||
|
||
**Key Methods:**
|
||
- `init(modelDir:configuration:) async throws`
|
||
- Initialize with model directory and CoreML configuration
|
||
- `transcribe(_:) async throws -> String`
|
||
- Transcribe audio samples (16kHz mono Float32)
|
||
- `transcribe(_:) async throws -> String`
|
||
- Transcribe from audio file URL
|
||
|
||
**Features:**
|
||
- Whisper-style mel spectrogram processing
|
||
- Multi-language support
|
||
- Experimental high-accuracy model
|
||
|
||
## Text-to-Speech (TTS)
|
||
|
||
### KokoroTtsManager
|
||
Text-to-speech synthesis using Kokoro CoreML models.
|
||
|
||
**Key Methods:**
|
||
- `init(defaultVoice:defaultSpeakerId:directory:computeUnits:customLexicon:)`
|
||
- Create TTS manager with optional configuration
|
||
- `computeUnits`: Use `.cpuAndGPU` on iOS 26+ to avoid ANE issues
|
||
- `initialize(preloadVoices:) async throws`
|
||
- Download and initialize TTS models
|
||
- Optionally preload specific voices
|
||
- `synthesize(text:voice:speakerId:speed:pitch:) async throws -> [Float]`
|
||
- Synthesize speech from text
|
||
- Returns audio samples at 24kHz
|
||
- `synthesizeDetailed(text:voice:speakerId:speed:pitch:) async throws -> (audio: [Float], alignments: [AlignmentInfo])`
|
||
- Synthesize with phoneme-level timing information
|
||
- `synthesizeToFile(text:outputURL:voice:speakerId:speed:pitch:) async throws`
|
||
- Synthesize directly to WAV file
|
||
- `setDefaultVoice(_:speakerId:) async throws`
|
||
- Change default voice for subsequent synthesis
|
||
- `setCustomLexicon(_:)`
|
||
- Set custom pronunciation dictionary
|
||
- `cleanup()`
|
||
- Release models and free memory
|
||
|
||
**Available Voices:**
|
||
- `af` — American Female
|
||
- `af_bella`, `af_nicole`, `af_sarah` — American Female variants
|
||
- `am` — American Male
|
||
- `am_adam`, `am_michael` — American Male variants
|
||
- `bf` — British Female
|
||
- `bm` — British Male
|
||
|
||
**Configuration:**
|
||
- `defaultVoice`: Voice identifier (default: `"af"`)
|
||
- `defaultSpeakerId`: Speaker ID for multi-speaker voices (default: 0)
|
||
- `speed`: Speech rate multiplier (0.5–2.0, default: 1.0)
|
||
- `pitch`: Pitch shift in semitones (-12 to +12, default: 0)
|
||
- `customLexicon`: Custom pronunciation dictionary
|
||
|
||
**Usage:**
|
||
```swift
|
||
let manager = KokoroTtsManager(defaultVoice: "af")
|
||
try await manager.initialize()
|
||
|
||
let audio = try await manager.synthesize(
|
||
text: "Hello from FluidAudio!",
|
||
speed: 1.0,
|
||
pitch: 0
|
||
)
|
||
|
||
// Save to file
|
||
try await manager.synthesizeToFile(
|
||
text: "Hello world",
|
||
outputURL: URL(fileURLWithPath: "output.wav")
|
||
)
|
||
```
|
||
|
||
**Performance:**
|
||
- Real-time factor: ~5-10x on Apple Silicon
|
||
- Output sample rate: 24kHz
|
||
- Supports SSML for prosody control
|
||
|
||
### KokoroAneManager
|
||
ANE-resident sibling of `KokoroTtsManager` — splits the Kokoro 82M graph into
|
||
7 CoreML stages so the ANE-friendly layers stay resident on the Neural
|
||
Engine. **3-11× RTFx** on Apple Silicon vs. the single-graph default. See
|
||
[KokoroAne](TTS/KokoroAne.md) for the full pipeline.
|
||
|
||
**Key Methods:**
|
||
- `init(defaultVoice:directory:computeUnits:modelStore:)`
|
||
- Defaults: `defaultVoice = "af_heart"`, `computeUnits = .default`
|
||
(per-stage assignment matching the laishere upstream)
|
||
- `initialize(preloadVoices:) async throws`
|
||
- Download (if missing) and load all 7 `.mlmodelc` bundles + `vocab.json`
|
||
+ `af_heart.bin`
|
||
- `synthesize(text:voice:speed:) async throws -> Data`
|
||
- One-shot text → 24 kHz mono 16-bit PCM WAV
|
||
- `synthesizeDetailed(text:voice:speed:) async throws -> KokoroAneSynthesisResult`
|
||
- Returns samples + per-stage timings
|
||
- `synthesizeFromPhonemes(_:voice:speed:) async throws -> Data`
|
||
- Bypass G2P; feed an already-IPA phoneme string directly
|
||
- `synthesizeFromPhonemesDetailed(_:voice:speed:) async throws -> KokoroAneSynthesisResult`
|
||
- `setDefaultVoice(_:)` — override default voice for subsequent calls
|
||
- `isAvailable() async -> Bool`
|
||
- `cleanup() async` — drop loaded mlmodelcs + voice packs
|
||
|
||
**Configuration:**
|
||
- `defaultVoice`: voice id (default `"af_heart"` — only voice currently shipped)
|
||
- `directory`: optional cache directory override
|
||
- `computeUnits`: `KokoroAneComputeUnits` (per-stage `MLComputeUnits`)
|
||
- `.default` — Albert/PostAlbert/Alignment/Vocoder on `cpuAndNeuralEngine`,
|
||
Prosody/Noise/Tail on `.all`
|
||
- `.cpuAndGpu` — skip ANE entirely (debug baseline)
|
||
- `speed`: speech rate multiplier (default `1.0`)
|
||
|
||
**Limits:**
|
||
- ≤ 510 IPA phonemes per call (no built-in chunker)
|
||
- Single voice (`af_heart`)
|
||
- No SSML / custom lexicon / markdown overrides
|
||
|
||
**Usage:**
|
||
```swift
|
||
let manager = KokoroAneManager()
|
||
try await manager.initialize()
|
||
|
||
let wav = try await manager.synthesize(text: "Hello from FluidAudio!")
|
||
try wav.write(to: URL(fileURLWithPath: "/tmp/demo.wav"))
|
||
|
||
// With per-stage timings:
|
||
let detail = try await manager.synthesizeDetailed(text: "Hi.")
|
||
print("samples: \(detail.samples.count) @ \(detail.sampleRate) Hz")
|
||
let t = detail.timings
|
||
print(" albert=\(t.albert) postAlbert=\(t.postAlbert) alignment=\(t.alignment)")
|
||
print(" prosody=\(t.prosody) noise=\(t.noise) vocoder=\(t.vocoder) tail=\(t.tail)")
|
||
print(" total: \(t.totalMs) ms")
|
||
```
|
||
|
||
**Performance:**
|
||
- Real-time factor: 3-11× RTFx on Apple Silicon (vs. 5-10× for `KokoroTtsManager`)
|
||
- Cold load (first ever, ANE compile): ~20 s; warm load: ~0.3 s
|
||
- Output sample rate: 24 kHz
|
||
|
||
### PocketTtsManager
|
||
Lightweight streaming TTS with voice cloning support.
|
||
|
||
**Key Methods:**
|
||
- `init(directory:computeUnits:) async throws`
|
||
- Initialize with optional directory and compute units
|
||
- `synthesize(text:voice:speakerId:) async throws -> [Float]`
|
||
- Synthesize speech from text
|
||
- Returns audio samples at 24kHz
|
||
- `synthesizeStreaming(text:voice:speakerId:) -> AsyncThrowingStream<[Float], Error>`
|
||
- Stream audio chunks as they are generated
|
||
- `cloneVoice(from:name:) async throws -> String`
|
||
- Clone voice from reference audio
|
||
- Returns voice ID for later use
|
||
- `cleanup()`
|
||
- Release models and free memory
|
||
|
||
**Features:**
|
||
- Streaming synthesis for long text
|
||
- Voice cloning from short audio samples
|
||
- Lower memory footprint than Kokoro
|
||
- Faster synthesis for real-time applications
|
||
|
||
**Usage:**
|
||
```swift
|
||
let manager = try await PocketTtsManager()
|
||
|
||
// Basic synthesis
|
||
let audio = try await manager.synthesize(text: "Hello world")
|
||
|
||
// Streaming synthesis
|
||
for try await chunk in manager.synthesizeStreaming(text: longText) {
|
||
// Play audio chunk immediately
|
||
playAudio(chunk)
|
||
}
|
||
```
|