
# API Reference

Primary public APIs for FluidAudio components. See inline doc comments for complete details.

**Components:**
- [Common Patterns](#common-patterns)
- [Diarization](#diarization)
- [Voice Activity Detection](#voice-activity-detection)
- [Automatic Speech Recognition](#automatic-speech-recognition)
- [Text-to-Speech](#text-to-speech-tts)

## Common Patterns

**Audio Format:** All modules expect 16kHz mono Float32 audio samples. Use `FluidAudio.AudioConverter` to convert `AVAudioPCMBuffer` or files to 16kHz mono for both CLI and library paths.

**Model Registry:** Models auto-download from HuggingFace by default. Customize the registry URL using:
- `ModelRegistry.baseURL` (programmatic) - recommended for apps
- `REGISTRY_URL` or `MODEL_REGISTRY_URL` environment variables - recommended for CLI/testing
- Priority order: programmatic override → env vars → default (HuggingFace)
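
The priority order above can be sketched as a small resolver. This is an illustration only: `resolveRegistryURL` is a hypothetical helper, not a FluidAudio API, and both the lookup order between the two environment variables and the `https://huggingface.co` default are assumptions.

```swift
import Foundation

// Hypothetical helper, not part of FluidAudio: picks a registry URL using the
// documented priority (programmatic override, then env vars, then default).
// Assumptions: REGISTRY_URL is checked before MODEL_REGISTRY_URL, and the
// default registry is https://huggingface.co.
func resolveRegistryURL(
    programmatic: URL?,
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> URL {
    // 1. A programmatic override always wins.
    if let override = programmatic {
        return override
    }

    // 2. Fall back to environment variables, skipping empty or malformed values.
    for key in ["REGISTRY_URL", "MODEL_REGISTRY_URL"] {
        if let raw = environment[key], !raw.isEmpty, let url = URL(string: raw) {
            return url
        }
    }

    // 3. Otherwise use the default registry.
    return URL(string: "https://huggingface.co")!
}
```

Passing the environment as a parameter keeps the resolver easy to exercise in tests without mutating the process environment.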

**Proxy Configuration:** If behind a corporate firewall, set the `https_proxy` (or `http_proxy`) environment variable. Both registry URL and proxy configuration are centralized in `ModelRegistry`.

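**Error Handling:** All async methods throw descriptive errors. In production code, wrap calls in `do`/`catch` (or propagate with `throws`) instead of using `try!` or `try?`, so failures such as a missing model or invalid audio are surfaced rather than silently dropped.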

**Thread Safety:** All managers are thread-safe and can be used concurrently across different queues.

## Diarization

### DiarizerManager
Main class for speaker diarization and "who spoke when" analysis.

**Key Methods:**
- `performCompleteDiarization(_:sampleRate:) throws -> DiarizationResult`
  - Process a complete audio recording and return speaker segments
  - Parameters: `RandomAccessCollection<Float>` audio samples, sample rate (default: 16000)
  - Returns: `DiarizationResult` with speaker segments and timing
- `validateAudio(_:) throws -> AudioValidationResult`
  - Validate audio quality, length, and format requirements

**Configuration:**
- `DiarizerConfig`: Clustering threshold, minimum durations, activity thresholds
- Optimal threshold: 0.7 (17.7% DER on AMI dataset)

### OfflineDiarizerManager
Full batch pipeline that mirrors the pyannote/Core ML exporter (powerset segmentation + VBx clustering).

> Requires macOS 14 / iOS 17 or later because the manager relies on Swift Concurrency features and C++ clustering shims that are unavailable on older OS releases.

**Key Methods:**
- `init(config: OfflineDiarizerConfig = .default)`
  - Creates manager with configuration
- `prepareModels(directory:configuration:forceRedownload:) async throws`
  - Downloads / compiles the Core ML bundles as needed and records timing metadata. Call once before processing when you don't already have `OfflineDiarizerModels`.
- `initialize(models: OfflineDiarizerModels)`
  - Initializes with models containing segmentation, embedding, and PLDA components (useful when you hydrate the bundles yourself).
- `process(audio: [Float]) async throws -> DiarizationResult`
  - Runs the full 10 s window pipeline: segmentation → soft mask interpolation → embedding → VBx → timeline reconstruction.
- `process(audioSource: StreamingAudioSampleSource, audioLoadingSeconds: TimeInterval) async throws -> DiarizationResult`
  - Streams audio from disk-backed sources without materializing the entire buffer in memory. Pair with `StreamingAudioSourceFactory` for large meetings.

**Supporting Types:**
- `OfflineDiarizerConfig`
  - Mirrors pyannote `config.yaml` (`clusteringThreshold`, `Fa`, `Fb`, `maxVBxIterations`, `minDurationOn/off`, batch sizes, logging flags).
- `SegmentationRunner`
  - Batches 160 k-sample chunks through the segmentation model (589 frames per chunk).
- `Binarization`
  - Converts log probabilities to soft VAD weights while retaining binary masks for diagnostics.
- `WeightInterpolation`
  - Reimplements `scipy.ndimage.zoom` (half-pixel offsets) so 589-frame weights align with the embedding model’s pooling stride.
- `EmbeddingRunner`
  - Runs the FBANK frontend + embedding backend, resamples masks to 589 frames, and emits 256-d L2-normalized embeddings.
- `PLDAScoring` / `VBxClustering`
  - Apply the exported PLDA transforms and iterative VBx refinement to group embeddings into speakers.
- `TimelineReconstruction`
  - Derives timestamps directly from the segmentation frame count and `OfflineDiarizerConfig.windowDuration`, then enforces minimum gap/duration constraints.
- `StreamingAudioSourceFactory`
  - Creates disk-backed or in-memory `StreamingAudioSampleSource` instances so large meetings never require fully materialized `[Float]` buffers.
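
The numbers quoted above (160 k-sample chunks, 589 frames per chunk, 16 kHz input) imply a 10 s window and roughly 17 ms per segmentation frame. The sketch below only reproduces that arithmetic; `frameStartTime` is a hypothetical helper, and evenly spaced frames are an assumption rather than a statement about how `TimelineReconstruction` is implemented.

```swift
import Foundation

// Values taken from the descriptions above.
let sampleRate = 16_000.0        // Hz
let samplesPerChunk = 160_000.0  // samples per segmentation chunk
let framesPerChunk = 589.0       // segmentation frames per chunk

// Window length in seconds: 160 000 / 16 000 = 10 s.
let windowDuration = samplesPerChunk / sampleRate

// Assumption: frames are evenly spaced across the window.
let frameDuration = windowDuration / framesPerChunk

// Hypothetical helper: start time (seconds) of a frame, where `windowStart`
// is the offset of its window within the recording.
func frameStartTime(windowStart: Double, frameIndex: Int) -> Double {
    return windowStart + Double(frameIndex) * frameDuration
}

print(windowDuration)                                    // 10.0
print(String(format: "%.2f ms", frameDuration * 1000))   // 16.98 ms
```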

Use `OfflineDiarizerManager` when you need offline DER parity or want to run the new CLI offline mode (`fluidaudio process --mode offline`, `fluidaudio diarization-benchmark --mode offline`).

---

### Diarizer Protocol

`SortformerDiarizer` and `LSEENDDiarizer` both conform to the `Diarizer` protocol, providing a unified streaming and offline API.

**Streaming:** `addAudio(_:sourceSampleRate:)` → `process()` → read `timeline`. Convenience `process(samples:sourceSampleRate:)` combines both steps. Returns `DiarizerTimelineUpdate?` (`nil` when not enough audio has accumulated).

**Offline:** `processComplete(_:sourceSampleRate:...)` or `processComplete(audioFileURL:...)` to process a full recording in one call.

**Speaker Enrollment:** `enrollSpeaker(withAudio:sourceSampleRate:named:...)` feeds known-speaker audio before streaming to label a slot.

**Lifecycle:** `finalizeSession()` flushes trailing context so the last true frame becomes finalized. `reset()` clears streaming state but keeps the model loaded. `cleanup()` releases everything.

---

### DiarizerTimeline & DiarizerSpeaker

`DiarizerTimeline` accumulates per-frame speaker probabilities and derives `DiarizerSpeaker` segments. Each speaker has `finalizedSegments` (confirmed) and `tentativeSegments` (may be revised). Segments expose `startTime`, `endTime`, `duration`, and `isFinalized`.

**`DiarizerTimelineConfig`** controls post-processing (onset/offset thresholds default to 0.5, min segment/gap duration, optional rolling window cap, and `storeSegments` for emit-only mode that skips creating `DiarizerSpeaker` objects entirely). Both diarizers accept this at init. See [DiarizerTimeline.md](Diarization/DiarizerTimeline.md#emit-only-mode-storesegments--false) for the emit-only mode contract.
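
To make the onset/offset thresholds concrete, here is a generic hysteresis pass over one speaker's per-frame probabilities. It is a minimal sketch of the general technique, not FluidAudio's implementation: `Segment`, `segments(from:onset:offset:minFrames:)`, and the frame-based minimum length are all hypothetical, and gap merging is left out.

```swift
// Illustrative only: a generic onset/offset (hysteresis) pass for one speaker.
struct Segment: Equatable {
    let startFrame: Int
    let endFrame: Int  // exclusive
}

func segments(
    from probabilities: [Float],
    onset: Float = 0.5,
    offset: Float = 0.5,
    minFrames: Int = 1
) -> [Segment] {
    var result: [Segment] = []
    var openStart: Int? = nil

    for (index, probability) in probabilities.enumerated() {
        if let start = openStart {
            // Currently speaking: close the segment once we drop below `offset`.
            if probability < offset {
                if index - start >= minFrames {
                    result.append(Segment(startFrame: start, endFrame: index))
                }
                openStart = nil
            }
        } else if probability >= onset {
            // Currently silent: open a segment once we reach `onset`.
            openStart = index
        }
    }

    // Close a segment that is still open when the input ends.
    if let start = openStart, probabilities.count - start >= minFrames {
        result.append(Segment(startFrame: start, endFrame: probabilities.count))
    }
    return result
}
```

Multiplying frame indices by the diarizer's frame duration (for example 0.08 s for Sortformer) converts these segments to seconds.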

**Speaker Management:**
- `upsertSpeaker(named:atIndex:) -> DiarizerSpeaker?`
  - Add a speaker to a slot, or update the existing speaker's name if that slot is already occupied
  - If `atIndex` is `nil`, the first unused diarizer slot is chosen
- `upsertSpeaker(_:atIndex:transferCurrentSegment:) -> DiarizerSpeaker?`
  - Insert an existing `DiarizerSpeaker` into a slot, replacing any speaker already assigned there
  - If `atIndex` is `nil`, the first unused diarizer slot is chosen
  - `transferCurrentSegment` moves the in-progress segment (if one exists) to the new speaker before continuing
- `removeSpeaker(atIndex:clearCurrentSegment:) -> DiarizerSpeaker?`
  - Remove the speaker assigned to a diarizer output slot and return the removed speaker if present
  - `clearCurrentSegment` resets the in-progress speaking state for that slot before continuing
- `speakers: [Int: DiarizerSpeaker]`
  - Read or replace the full slot-to-speaker mapping directly when needed

---

### SortformerDiarizer

Streaming diarization using NVIDIA's Sortformer. 4 fixed speaker slots, 16 kHz input, 80 ms frame duration.

```swift
let diarizer = SortformerDiarizer(config: .default, timelineConfig: .sortformerDefault)
try await diarizer.initialize(mainModelPath: modelURL)
```

**Config presets:** `.default` / `.fastV2_1` (1.04 s latency), `.balancedV2_1` (1.04 s, 20.6% DER on AMI SDM), `.highContextV2_1` (30.4 s latency). v2 variants also available.

---

### LSEENDDiarizer

Streaming diarization using LS-EEND. Variable speaker slots, 8 kHz input, 100 ms frame duration, 20.7% DER on AMI SDM.

```swift
let diarizer = try await LSEENDDiarizer(variant: .dihard3)
```

**Variants:** ami, callhome, dihard2, dihard3 (via `LSEENDModel.loadFromHuggingFace(variant:stepSize:)`). The optional `LSEENDStepSize` selects how many output frames the model commits per CoreML call (`.step100ms` … `.step500ms`); smaller steps reduce latency, larger steps raise throughput.

Call `finalizeSession()` at end-of-stream to flush pending audio before reading the final timeline.

## Voice Activity Detection

### VadManager
Voice activity detection using the Silero VAD Core ML model with 256 ms unified inference and ANE optimizations.

**Key Methods:**
- `process(_ url: URL) async throws -> [VadResult]`
  - Process an audio file end-to-end. Automatically converts to 16kHz mono Float32 and processes in 4096-sample frames (256 ms).
- `process(_ buffer: AVAudioPCMBuffer) async throws -> [VadResult]`
  - Convert and process an in-memory buffer. Supports any input format; resampled to 16kHz mono internally.
- `process(_ samples: [Float]) async throws -> [VadResult]`
  - Process pre-converted 16kHz mono samples.
- `processChunk(_:inputState:) async throws -> VadResult`
  - Process a single 4096-sample frame (256 ms at 16 kHz) with optional recurrent state.

**Constants:**
- `VadManager.chunkSize = 4096`  // samples per frame (256 ms @ 16 kHz, plus 64-sample context managed internally)
- `VadManager.sampleRate = 16000`
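
As a rough illustration of what these constants mean for callers of `processChunk`, the sketch below splits a sample buffer into 4096-sample frames. The `frames(from:size:)` helper is hypothetical, and zero-padding the final partial frame is an assumption of this sketch, not documented `VadManager` behavior.

```swift
let chunkSize = 4096     // mirrors VadManager.chunkSize
let sampleRate = 16_000  // mirrors VadManager.sampleRate

// Hypothetical helper: split samples into fixed-size frames, zero-padding
// the final partial frame (an assumption made for this sketch).
func frames(from samples: [Float], size: Int = 4096) -> [[Float]] {
    return stride(from: 0, to: samples.count, by: size).map { (start: Int) -> [Float] in
        let end = min(start + size, samples.count)
        var frame = Array(samples[start..<end])
        if frame.count < size {
            frame.append(contentsOf: repeatElement(Float(0), count: size - frame.count))
        }
        return frame
    }
}

// 4096 samples at 16 kHz last 256 ms.
let frameDurationMs = Double(chunkSize) * 1000 / Double(sampleRate)

// One second of audio needs four frames (the last one is padded).
let oneSecond = [Float](repeating: 0.1, count: sampleRate)
let split = frames(from: oneSecond)
print(frameDurationMs, split.count)  // 256.0 4
```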

**Configuration (`VadConfig`):**
- `defaultThreshold: Float` — Baseline decision threshold (0.0–1.0) used when segmentation does not override. Default: `0.85`.
- `debugMode: Bool` — Extra logging for benchmarking and troubleshooting. Default: `false`.
- `computeUnits: MLComputeUnits` — Core ML compute target. Default: `.cpuAndNeuralEngine`.

Recommended `defaultThreshold` ranges depend on your acoustic conditions:
- Clean speech: 0.7–0.9
- Noisy/mixed content: 0.3–0.6 (higher recall, more false positives)

**Performance:**
- Optimized for Apple Neural Engine (ANE) with aligned `MLMultiArray` buffers, silent-frame short-circuiting, and recurrent state reuse (hidden/cell/context) for sequential inference.
- Processes 8×32 ms audio windows (256 ms total) in a single Core ML call, which improves throughput compared with running inference once per 32 ms window.

## Automatic Speech Recognition

### AsrManager
Automatic speech recognition using Parakeet TDT models (v2 English-only, v3 multilingual).

**Key Methods:**
- `transcribe(_:source:) async throws -> ASRResult`
  - Accepts `[Float]` samples already converted to 16 kHz mono; returns transcription text, confidence, and token timings.
- `transcribe(_ url: URL, source:) async throws -> ASRResult`
  - Loads the file directly and performs format conversion internally (`AudioConverter`).
- `transcribe(_ buffer: AVAudioPCMBuffer, source:) async throws -> ASRResult`
  - Convenience overload for capture pipelines that already produce PCM buffers.
- `initialize(models:) async throws`
  - Load and initialize ASR models (automatic download if needed)

**Model Management:**
- `AsrModels.downloadAndLoad(version: AsrModelVersion = .v3) async throws -> AsrModels`
  - Download models from HuggingFace and compile for CoreML
  - Pass `.v2` to load the English-only bundle when you do not need multilingual coverage
  - Models cached locally after first download
- `ASRConfig`: Beam size, temperature, language model weights

**Audio Processing:**
- `AudioConverter.resampleAudioFile(path:) throws -> [Float]`
  - Load and convert audio files to 16kHz mono Float32 (WAV, M4A, MP3, FLAC)
- `AudioConverter.resampleBuffer(_ buffer: AVAudioPCMBuffer) throws -> [Float]`
  - Convert a buffer to 16kHz mono (stateless conversion)
- `AudioSource`: `.microphone` or `.system` for different processing paths

> **Warning:** Avoid hand-decoding audio payloads (e.g., truncating WAV headers or treating bytes as raw `Int16` samples).
> The Core ML models require correctly resampled 16 kHz mono Float32 tensors; manual parsing will silently corrupt input when
> formats carry metadata chunks, different bit depths, stereo channels, or compression. Always route files and live buffers
> through `AudioConverter` before calling `AsrManager.transcribe`.

**Performance:**
- Real-time factor: ~120x on M4 Pro (processes 1 min of audio in about 0.5 s)
- Languages: 25 European languages supported
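
The real-time factor quoted here is audio duration divided by processing time. A minimal sketch of that calculation follows; `realTimeFactor` is a hypothetical helper, not a FluidAudio API.

```swift
// Hypothetical helper: how many seconds of audio are processed per second of
// wall-clock time (often written RTFx).
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// One minute of audio transcribed in half a second.
print(realTimeFactor(audioSeconds: 60, processingSeconds: 0.5))  // 120.0
```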

### StreamingEouAsrManager
Real-time streaming ASR with End-of-Utterance detection using Parakeet EOU models.

**Key Methods:**
- `init(configuration:chunkSize:eouDebounceMs:debugFeatures:)`
  - Create manager with MLModel configuration and chunk size
  - `chunkSize`: `.ms160` (default), `.ms320`, or `.ms1600`
  - `eouDebounceMs`: Minimum silence duration before EOU triggers (default: 1280)
- `loadModels(modelDir:) async throws`
  - Load CoreML models from directory (encoder, decoder, joint, vocab)
- `process(audioBuffer:) async throws -> String`
  - Process audio incrementally. The return value is an empty string; call `finish()` to obtain the transcript.
- `finish() async throws -> String`
  - Finalize processing and return accumulated transcript
- `reset() async`
  - Reset all state for next utterance
- `setEouCallback(_:)`
  - Set callback invoked when End-of-Utterance is detected
- `appendAudio(_:) throws`
  - Append audio to buffer without processing (for VAD integration)

**Properties:**
- `eouDetected: Bool` — Whether EOU was detected in the last chunk
- `eouDebounceMs: Int` — Minimum silence duration before EOU triggers
- `chunkSize: StreamingChunkSize` — Current chunk size configuration

**StreamingChunkSize:**
- `.ms160` — 160ms chunks, lowest latency, ~8% WER
- `.ms320` — 320ms chunks, balanced, ~5% WER
- `.ms1600` — 1600ms chunks, highest throughput
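
For sizing capture buffers, it can help to translate these chunk durations into sample counts. The enum below is a hypothetical stand-in for `StreamingChunkSize` (its raw values and methods are assumptions), and the debounce calculation is only arithmetic on the documented 1280 ms default, assuming 16 kHz input.

```swift
// Hypothetical stand-in for `StreamingChunkSize`; raw values are chunk
// durations in milliseconds.
enum ChunkSize: Int, CaseIterable {
    case ms160 = 160
    case ms320 = 320
    case ms1600 = 1600

    // Samples per chunk at the given sample rate (16 kHz assumed by default).
    func sampleCount(sampleRate: Int = 16_000) -> Int {
        return rawValue * sampleRate / 1000
    }

    // Number of chunks needed to span a debounce window, rounded up.
    func chunks(forDebounceMs debounceMs: Int) -> Int {
        return (debounceMs + rawValue - 1) / rawValue
    }
}

for size in ChunkSize.allCases {
    print(size, size.sampleCount(), size.chunks(forDebounceMs: 1280))
}
// ms160 2560 8
// ms320 5120 4
// ms1600 25600 1
```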

**Usage:**
```swift
let manager = StreamingEouAsrManager(chunkSize: .ms160, eouDebounceMs: 1280)
try await manager.loadModels(modelDir: modelsURL)

// Process audio incrementally
_ = try await manager.process(audioBuffer: buffer1)
_ = try await manager.process(audioBuffer: buffer2)

// Get final transcript
let transcript = try await manager.finish()

// Reset for next utterance
await manager.reset()
```

**Performance:**
- Real-time factor: ~5x RTF (160ms), ~12x RTF (320ms) on Apple Silicon
- WER: ~8% (160ms), ~5% (320ms) on LibriSpeech test-clean

### SlidingWindowAsrManager
Real-time sliding window ASR with overlap and cancellation support.

**Key Methods:**
- `init(models:config:) async throws`
  - Initialize with ASR models and configuration
- `transcribeChunk(_:isLastChunk:) async throws -> ASRResult`
  - Process audio chunk with sliding window overlap
  - Returns accumulated transcript with proper handling of chunk boundaries
- `reset()`
  - Reset internal state for new session

**Configuration:**
- Default chunk size: ~14.96 seconds
- Default overlap: 2.0 seconds
- Supports cancellation via Task cancellation
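
To see how the default chunk size and overlap interact, the sketch below lists window start offsets for a recording. It assumes each window advances by the chunk length minus the overlap (about 12.96 s with the defaults); `windowStarts` is a hypothetical helper and the manager's actual scheduling may differ.

```swift
// Hypothetical helper: start offsets (seconds) of successive windows.
// Assumption: each window advances by `chunkSeconds - overlapSeconds`.
func windowStarts(
    totalSeconds: Double,
    chunkSeconds: Double = 14.96,
    overlapSeconds: Double = 2.0
) -> [Double] {
    let hop = chunkSeconds - overlapSeconds
    precondition(hop > 0, "overlap must be shorter than the chunk")

    var starts: [Double] = [0]
    var index = 1
    // Keep adding windows until the last one reaches the end of the audio.
    while starts[starts.count - 1] + chunkSeconds < totalSeconds {
        starts.append(Double(index) * hop)
        index += 1
    }
    return starts
}

// A 40 s recording needs three windows with the default settings.
print(windowStarts(totalSeconds: 40).count)  // 3
```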

**Usage:**
```swift
let manager = try await SlidingWindowAsrManager()

for audioChunk in audioStream {
    let result = try await manager.transcribeChunk(
        audioChunk,
        isLastChunk: false
    )
    print("Partial: \(result.text)")
}

// Process final chunk
let final = try await manager.transcribeChunk(lastChunk, isLastChunk: true)
print("Final: \(final.text)")
```

### StreamingNemotronAsrManager
NVIDIA Nemotron streaming ASR with encoder cache for low-latency processing.

**Key Methods:**
- `init(chunkSize:configuration:) async throws`
  - Initialize with chunk size (160ms, 320ms, or 1600ms)
- `loadModels(modelDir:) async throws`
  - Load CoreML models from directory
- `transcribe(_:) async throws -> String`
  - Process audio and return transcript
- `reset() async`
  - Reset encoder cache and decoder state

**Chunk Sizes:**
- `.ms160` — 160ms chunks, lowest latency
- `.ms320` — 320ms chunks, balanced
- `.ms1600` — 1600ms chunks, highest throughput

**Performance:**
- Real-time factor: ~0.2x on Apple Silicon
- Maintains encoder cache across chunks for efficiency

### Qwen3AsrManager
Qwen3-based speech recognition with Whisper mel spectrogram frontend.

**Key Methods:**
- `init(modelDir:configuration:) async throws`
  - Initialize with model directory and CoreML configuration
- `transcribe(_:) async throws -> String`
  - Transcribe audio samples (16kHz mono Float32)
- `transcribe(_:) async throws -> String`
  - Transcribe from audio file URL

**Features:**
- Whisper-style mel spectrogram processing
- Multi-language support
- Experimental high-accuracy model

## Text-to-Speech (TTS)

### KokoroTtsManager
Text-to-speech synthesis using Kokoro CoreML models.

**Key Methods:**
- `init(defaultVoice:defaultSpeakerId:directory:computeUnits:customLexicon:)`
  - Create TTS manager with optional configuration
  - `computeUnits`: Use `.cpuAndGPU` on iOS 26+ to avoid ANE issues
- `initialize(preloadVoices:) async throws`
  - Download and initialize TTS models
  - Optionally preload specific voices
- `synthesize(text:voice:speakerId:speed:pitch:) async throws -> [Float]`
  - Synthesize speech from text
  - Returns audio samples at 24kHz
- `synthesizeDetailed(text:voice:speakerId:speed:pitch:) async throws -> (audio: [Float], alignments: [AlignmentInfo])`
  - Synthesize with phoneme-level timing information
- `synthesizeToFile(text:outputURL:voice:speakerId:speed:pitch:) async throws`
  - Synthesize directly to WAV file
- `setDefaultVoice(_:speakerId:) async throws`
  - Change default voice for subsequent synthesis
- `setCustomLexicon(_:)`
  - Set custom pronunciation dictionary
- `cleanup()`
  - Release models and free memory

**Available Voices:**
- `af` — American Female
- `af_bella`, `af_nicole`, `af_sarah` — American Female variants
- `am` — American Male
- `am_adam`, `am_michael` — American Male variants
- `bf` — British Female
- `bm` — British Male

**Configuration:**
- `defaultVoice`: Voice identifier (default: `"af"`)
- `defaultSpeakerId`: Speaker ID for multi-speaker voices (default: 0)
- `speed`: Speech rate multiplier (0.5–2.0, default: 1.0)
- `pitch`: Pitch shift in semitones (-12 to +12, default: 0)
- `customLexicon`: Custom pronunciation dictionary
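
A caller-side guard for the documented `speed` and `pitch` ranges might look like the sketch below. The helpers are hypothetical (whether the library clamps or rejects out-of-range values is not stated here), and `frequencyRatio` simply applies the standard equal-temperament relation, where n semitones scale frequency by 2^(n/12).

```swift
import Foundation

// Hypothetical helpers, not part of FluidAudio: clamp values to the
// documented ranges before passing them to a synthesis call.
func clampedSpeed(_ speed: Float) -> Float {
    return min(max(speed, 0.5), 2.0)
}

func clampedPitch(_ semitones: Float) -> Float {
    return min(max(semitones, -12), 12)
}

// Equal temperament: shifting by n semitones multiplies frequency by 2^(n/12).
func frequencyRatio(semitones: Float) -> Double {
    return pow(2.0, Double(semitones) / 12.0)
}

print(clampedSpeed(3.0))               // 2.0
print(frequencyRatio(semitones: 12))   // 2.0 (one octave up)
```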

**Usage:**
```swift
let manager = KokoroTtsManager(defaultVoice: "af")
try await manager.initialize()

let audio = try await manager.synthesize(
    text: "Hello from FluidAudio!",
    speed: 1.0,
    pitch: 0
)

// Save to file
try await manager.synthesizeToFile(
    text: "Hello world",
    outputURL: URL(fileURLWithPath: "output.wav")
)
```

**Performance:**
- Real-time factor: ~5-10x on Apple Silicon
- Output sample rate: 24kHz
- Supports SSML for prosody control

### KokoroAneManager
ANE-resident sibling of `KokoroTtsManager` — splits the Kokoro 82M graph into
7 CoreML stages so the ANE-friendly layers stay resident on the Neural
Engine. **3-11× RTFx** on Apple Silicon vs. the single-graph default. See
[KokoroAne](TTS/KokoroAne.md) for the full pipeline.

**Key Methods:**
- `init(defaultVoice:directory:computeUnits:modelStore:)`
  - Defaults: `defaultVoice = "af_heart"`, `computeUnits = .default`
    (per-stage assignment matching the laishere upstream)
- `initialize(preloadVoices:) async throws`
  - Download (if missing) and load all 7 `.mlmodelc` bundles + `vocab.json`
    + `af_heart.bin`
- `synthesize(text:voice:speed:) async throws -> Data`
  - One-shot text → 24 kHz mono 16-bit PCM WAV
- `synthesizeDetailed(text:voice:speed:) async throws -> KokoroAneSynthesisResult`
  - Returns samples + per-stage timings
- `synthesizeFromPhonemes(_:voice:speed:) async throws -> Data`
  - Bypass G2P; feed an already-IPA phoneme string directly
- `synthesizeFromPhonemesDetailed(_:voice:speed:) async throws -> KokoroAneSynthesisResult`
- `setDefaultVoice(_:)` — override default voice for subsequent calls
- `isAvailable() async -> Bool`
- `cleanup() async` — drop loaded mlmodelcs + voice packs

**Configuration:**
- `defaultVoice`: voice id (default `"af_heart"` — only voice currently shipped)
- `directory`: optional cache directory override
- `computeUnits`: `KokoroAneComputeUnits` (per-stage `MLComputeUnits`)
  - `.default` — Albert/PostAlbert/Alignment/Vocoder on `cpuAndNeuralEngine`,
    Prosody/Noise/Tail on `.all`
  - `.cpuAndGpu` — skip ANE entirely (debug baseline)
- `speed`: speech rate multiplier (default `1.0`)

**Limits:**
- ≤ 510 IPA phonemes per call (no built-in chunker)
- Single voice (`af_heart`)
- No SSML / custom lexicon / markdown overrides
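
Because there is no built-in chunker, longer inputs have to be split by the caller. The following is one possible sketch, assuming that one Swift `Character` counts as one phoneme toward the 510 limit and that breaking at spaces is acceptable; `chunkPhonemes` is a hypothetical helper, not a FluidAudio API.

```swift
// Hypothetical helper: split an IPA phoneme string into pieces of at most
// `limit` characters, breaking at spaces where possible.
// Assumption: one `Character` counts as one phoneme toward the limit.
func chunkPhonemes(_ phonemes: String, limit: Int = 510) -> [String] {
    precondition(limit > 0, "limit must be positive")
    var chunks: [String] = []
    var current = ""

    for piece in phonemes.split(separator: " ") {
        var word = String(piece)

        // Hard-split any single word that exceeds the limit on its own.
        while word.count > limit {
            if !current.isEmpty {
                chunks.append(current)
                current = ""
            }
            chunks.append(String(word.prefix(limit)))
            word = String(word.dropFirst(limit))
        }

        // Start a new chunk if adding this word (plus a space) would overflow.
        let needed = current.isEmpty ? word.count : current.count + 1 + word.count
        if needed > limit {
            chunks.append(current)
            current = word
        } else {
            current = current.isEmpty ? word : current + " " + word
        }
    }

    if !current.isEmpty {
        chunks.append(current)
    }
    return chunks
}
```

Each chunk could then be passed to `synthesizeFromPhonemes(_:voice:speed:)` separately. Splitting at arbitrary spaces may affect prosody across chunk boundaries, so sentence-level splitting is likely a better choice when sentence boundaries are known.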

**Usage:**
```swift
let manager = KokoroAneManager()
try await manager.initialize()

let wav = try await manager.synthesize(text: "Hello from FluidAudio!")
try wav.write(to: URL(fileURLWithPath: "/tmp/demo.wav"))

// With per-stage timings:
let detail = try await manager.synthesizeDetailed(text: "Hi.")
print("samples: \(detail.samples.count) @ \(detail.sampleRate) Hz")
let t = detail.timings
print("  albert=\(t.albert) postAlbert=\(t.postAlbert) alignment=\(t.alignment)")
print("  prosody=\(t.prosody) noise=\(t.noise) vocoder=\(t.vocoder) tail=\(t.tail)")
print("  total: \(t.totalMs) ms")
```

**Performance:**
- Real-time factor: 3-11× RTFx on Apple Silicon (vs. 5-10× for `KokoroTtsManager`)
- Cold load (first ever, ANE compile): ~20 s; warm load: ~0.3 s
- Output sample rate: 24 kHz

### PocketTtsManager
Lightweight streaming TTS with voice cloning support.

**Key Methods:**
- `init(directory:computeUnits:) async throws`
  - Initialize with optional directory and compute units
- `synthesize(text:voice:speakerId:) async throws -> [Float]`
  - Synthesize speech from text
  - Returns audio samples at 24kHz
- `synthesizeStreaming(text:voice:speakerId:) -> AsyncThrowingStream<[Float], Error>`
  - Stream audio chunks as they are generated
- `cloneVoice(from:name:) async throws -> String`
  - Clone voice from reference audio
  - Returns voice ID for later use
- `cleanup()`
  - Release models and free memory

**Features:**
- Streaming synthesis for long text
- Voice cloning from short audio samples
- Lower memory footprint than Kokoro
- Faster synthesis for real-time applications

**Usage:**
```swift
let manager = try await PocketTtsManager()

// Basic synthesis
let audio = try await manager.synthesize(text: "Hello world")

// Streaming synthesis
for try await chunk in manager.synthesizeStreaming(text: longText) {
    // Play audio chunk immediately
    playAudio(chunk)
}
```