## Summary - Extract shared fixture helpers into `DiarizationTestFixtures` enum, removing ~200 lines of duplicate code across `LSEENDIntegrationTests` and `SpeakerEnrollmentTests` - Replace fragile `Mirror`-based private state inspection with `internal` `hasActiveSession` property on `LSEENDDiarizerAPI` - Fix non-deterministic `srand48` seed in `SortformerTests` (use constant `42` instead of time-based seed) - Fix asymmetric skip guards in Sortformer enrollment tests (`XCTSkipIf` instead of `XCTAssertNotNil` for host-dependent segments) ## Test plan - [x] `swift build --build-tests` passes - [ ] `swift test --filter SortformerTests` passes - [ ] `swift test --filter LSEENDIntegrationTests` passes - [ ] `swift test --filter SpeakerEnrollmentTests` passes <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/fluidinference/fluidaudio/pull/395" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end -->
12 KiB
API Reference
This page summarizes the primary public APIs across modules. See inline doc comments and module-specific documentation for complete details.
Common Patterns
Audio Format: All modules expect 16kHz mono Float32 audio samples. Use FluidAudio.AudioConverter to convert AVAudioPCMBuffer or files to 16kHz mono for both CLI and library paths.
Model Registry: Models auto-download from HuggingFace by default. Customize the registry URL using:
ModelRegistry.baseURL(programmatic) - recommended for appsREGISTRY_URLorMODEL_REGISTRY_URLenvironment variables - recommended for CLI/testing- Priority order: programmatic override → env vars → default (HuggingFace)
Proxy Configuration: If behind a corporate firewall, set the https_proxy (or http_proxy) environment variable. Both registry URL and proxy configuration are centralized in ModelRegistry.
Error Handling: All async methods throw descriptive errors. Use proper error handling in production code.
Thread Safety: All managers are thread-safe and can be used concurrently across different queues.
Diarization
DiarizerManager
Main class for speaker diarization and "who spoke when" analysis.
Key Methods:
performCompleteDiarization(_:sampleRate:) throws -> DiarizationResult- Process complete audio file and return speaker segments
- Parameters:
RandomAccessCollection<Float>audio samples, sample rate (default: 16000) - Returns:
DiarizerResultwith speaker segments and timing
validateAudio(_:) throws -> AudioValidationResult- Validate audio quality, length, and format requirements
Configuration:
DiarizerConfig: Clustering threshold, minimum durations, activity thresholds- Optimal threshold: 0.7 (17.7% DER on AMI dataset)
OfflineDiarizerManager
Full batch pipeline that mirrors the pyannote/Core ML exporter (powerset segmentation + VBx clustering).
Requires macOS 14 / iOS 17 or later because the manager relies on Swift Concurrency features and C++ clustering shims that are unavailable on older OS releases.
Key Methods:
init(config: OfflineDiarizerConfig = .default)- Creates manager with configuration
prepareModels(directory:configuration:forceRedownload:) async throws- Downloads / compiles the Core ML bundles as needed and records timing metadata. Call once before processing when you don't already have
OfflineDiarizerModels.
- Downloads / compiles the Core ML bundles as needed and records timing metadata. Call once before processing when you don't already have
initialize(models: OfflineDiarizerModels)- Initializes with models containing segmentation, embedding, and PLDA components (useful when you hydrate the bundles yourself).
process(audio: [Float]) async throws -> DiarizationResult- Runs the full 10 s window pipeline: segmentation → soft mask interpolation → embedding → VBx → timeline reconstruction.
process(audioSource: StreamingAudioSampleSource, audioLoadingSeconds: TimeInterval) async throws -> DiarizationResult- Streams audio from disk-backed sources without materializing the entire buffer in memory. Pair with
StreamingAudioSourceFactoryfor large meetings.
- Streams audio from disk-backed sources without materializing the entire buffer in memory. Pair with
Supporting Types:
OfflineDiarizerConfig- Mirrors pyannote
config.yaml(clusteringThreshold,Fa,Fb,maxVBxIterations,minDurationOn/off, batch sizes, logging flags).
- Mirrors pyannote
SegmentationRunner- Batches 160 k-sample chunks through the segmentation model (589 frames per chunk).
Binarization- Converts log probabilities to soft VAD weights while retaining binary masks for diagnostics.
WeightInterpolation- Reimplements
scipy.ndimage.zoom(half-pixel offsets) so 589-frame weights align with the embedding model’s pooling stride.
- Reimplements
EmbeddingRunner- Runs the FBANK frontend + embedding backend, resamples masks to 589 frames, and emits 256-d L2-normalized embeddings.
PLDAScoring/VBxClustering- Apply the exported PLDA transforms and iterative VBx refinement to group embeddings into speakers.
TimelineReconstruction- Derives timestamps directly from the segmentation frame count and
OfflineDiarizerConfig.windowDuration, then enforces minimum gap/duration constraints.
- Derives timestamps directly from the segmentation frame count and
StreamingAudioSourceFactory- Creates disk-backed or in-memory
StreamingAudioSampleSourceinstances so large meetings never require fully materialized[Float]buffers.
- Creates disk-backed or in-memory
Use OfflineDiarizerManager when you need offline DER parity or want to run the new CLI offline mode (fluidaudio process --mode offline, fluidaudio diarization-benchmark --mode offline).
Diarizer Protocol
SortformerDiarizer and LSEENDDiarizer both conform to the Diarizer protocol, providing a unified streaming and offline API.
Streaming: addAudio(_:sourceSampleRate:) → process() → read timeline. Convenience process(samples:sourceSampleRate:) combines both steps. Returns DiarizerTimelineUpdate? (nil when not enough audio has accumulated).
Offline: processComplete(_:sourceSampleRate:...) or processComplete(audioFileURL:...) to process a full recording in one call.
Speaker Enrollment: enrollSpeaker(withAudio:sourceSampleRate:named:...) feeds known-speaker audio before streaming to label a slot.
Lifecycle: reset() clears streaming state but keeps the model loaded. cleanup() releases everything.
DiarizerTimeline & DiarizerSpeaker
DiarizerTimeline accumulates per-frame speaker probabilities and derives DiarizerSpeaker segments. Each speaker has finalizedSegments (confirmed) and tentativeSegments (may be revised). Segments expose startTime, endTime, duration, and isFinalized.
DiarizerTimelineConfig controls post-processing (onset/offset thresholds default to 0.5, min segment/gap duration, optional rolling window cap). Both diarizers accept this at init.
SortformerDiarizer
Streaming diarization using NVIDIA's Sortformer. 4 fixed speaker slots, 16 kHz input, 80 ms frame duration.
let diarizer = SortformerDiarizer(config: .default, timelineConfig: .sortformerDefault)
try await diarizer.initialize(mainModelPath: modelURL)
Config presets: .default / .fastV2_1 (1.04 s latency), .balancedV2_1 (1.04 s, 20.6% DER on AMI SDM), .highContextV2_1 (30.4 s latency). v2 variants also available.
LSEENDDiarizer
Streaming diarization using LS-EEND. Variable speaker slots, 8 kHz input, 100 ms frame duration, 20.7% DER on AMI SDM.
let diarizer = LSEENDDiarizer(computeUnits: .cpuOnly)
try await diarizer.initialize(variant: .dihard3)
Variants: ami, callhome, dihard2, dihard3 (via LSEENDModelDescriptor.loadFromHuggingFace(variant:)).
Call finalizeSession() at end-of-stream to flush pending audio before reading the final timeline.
Voice Activity Detection
VadManager
Voice activity detection using the Silero VAD Core ML model with 256 ms unified inference and ANE optimizations.
Key Methods:
process(_ url: URL) async throws -> [VadResult]- Process an audio file end-to-end. Automatically converts to 16kHz mono Float32 and processes in 4096-sample frames (256 ms).
process(_ buffer: AVAudioPCMBuffer) async throws -> [VadResult]- Convert and process an in-memory buffer. Supports any input format; resampled to 16kHz mono internally.
process(_ samples: [Float]) async throws -> [VadResult]- Process pre-converted 16kHz mono samples.
processChunk(_:inputState:) async throws -> VadResult- Process a single 4096-sample frame (256 ms at 16 kHz) with optional recurrent state.
Constants:
VadManager.chunkSize = 4096// samples per frame (256 ms @ 16 kHz, plus 64-sample context managed internally)VadManager.sampleRate = 16000
Configuration (VadConfig):
defaultThreshold: Float— Baseline decision threshold (0.0–1.0) used when segmentation does not override. Default:0.85.debugMode: Bool— Extra logging for benchmarking and troubleshooting. Default:false.computeUnits: MLComputeUnits— Core ML compute target. Default:.cpuAndNeuralEngine.
Recommended defaultThreshold ranges depend on your acoustic conditions:
- Clean speech: 0.7–0.9
- Noisy/mixed content: 0.3–0.6 (higher recall, more false positives)
Performance:
- Optimized for Apple Neural Engine (ANE) with aligned
MLMultiArraybuffers, silent-frame short-circuiting, and recurrent state reuse (hidden/cell/context) for sequential inference. - Significantly improved throughput by processing 8×32 ms audio windows in a single Core ML call.
Automatic Speech Recognition
AsrManager
Automatic speech recognition using Parakeet TDT models (v2 English-only, v3 multilingual).
Key Methods:
transcribe(_:source:) async throws -> ASRResult- Accepts
[Float]samples already converted to 16 kHz mono; returns transcription text, confidence, and token timings.
- Accepts
transcribe(_ url: URL, source:) async throws -> ASRResult- Loads the file directly and performs format conversion internally (
AudioConverter).
- Loads the file directly and performs format conversion internally (
transcribe(_ buffer: AVAudioPCMBuffer, source:) async throws -> ASRResult- Convenience overload for capture pipelines that already produce PCM buffers.
initialize(models:) async throws- Load and initialize ASR models (automatic download if needed)
Model Management:
-
AsrModels.downloadAndLoad(version: AsrModelVersion = .v3) async throws -> AsrModels- Download models from HuggingFace and compile for CoreML
- Pass
.v2to load the English-only bundle when you do not need multilingual coverage - Models cached locally after first download
-
ASRConfig: Beam size, temperature, language model weights -
Audio Processing:
-
AudioConverter.resampleAudioFile(path:) throws -> [Float]- Load and convert audio files to 16kHz mono Float32 (WAV, M4A, MP3, FLAC)
-
AudioConverter.resampleBuffer(_ buffer: AVAudioPCMBuffer) throws -> [Float]- Convert a buffer to 16kHz mono (stateless conversion)
-
AudioSource:.microphoneor.systemfor different processing paths
Warning: Avoid hand-decoding audio payloads (e.g., truncating WAV headers or treating bytes as raw
Int16samples). The Core ML models require correctly resampled 16 kHz mono Float32 tensors; manual parsing will silently corrupt input when formats carry metadata chunks, different bit depths, stereo channels, or compression. Always route files and live buffers throughAudioConverterbefore callingAsrManager.transcribe.
Performance:
- Real-time factor: ~120x on M4 Pro (processes 1min audio in 0.5s)
- Languages: 25 European languages supported
StreamingEouAsrManager
Real-time streaming ASR with End-of-Utterance detection using Parakeet EOU models.
Key Methods:
init(configuration:chunkSize:eouDebounceMs:debugFeatures:)- Create manager with MLModel configuration and chunk size
chunkSize:.ms160(default),.ms320, or.ms1600eouDebounceMs: Minimum silence duration before EOU triggers (default: 1280)
loadModels(modelDir:) async throws- Load CoreML models from directory (encoder, decoder, joint, vocab)
process(audioBuffer:) async throws -> String- Process audio incrementally, returns empty string (use
finish()for transcript)
- Process audio incrementally, returns empty string (use
finish() async throws -> String- Finalize processing and return accumulated transcript
reset() async- Reset all state for next utterance
setEouCallback(_:)- Set callback invoked when End-of-Utterance is detected
appendAudio(_:) throws- Append audio to buffer without processing (for VAD integration)
Properties:
eouDetected: Bool— Whether EOU was detected in the last chunkeouDebounceMs: Int— Minimum silence duration before EOU triggerschunkSize: StreamingChunkSize— Current chunk size configuration
StreamingChunkSize:
.ms160— 160ms chunks, lowest latency, ~8% WER.ms320— 320ms chunks, balanced, ~5% WER.ms1600— 1600ms chunks, highest throughput
Usage:
let manager = StreamingEouAsrManager(chunkSize: .ms160, eouDebounceMs: 1280)
try await manager.loadModels(modelDir: modelsURL)
// Process audio incrementally
_ = try await manager.process(audioBuffer: buffer1)
_ = try await manager.process(audioBuffer: buffer2)
// Get final transcript
let transcript = try await manager.finish()
// Reset for next utterance
await manager.reset()
Performance:
- Real-time factor: ~5x RTF (160ms), ~12x RTF (320ms) on Apple Silicon
- WER: ~8% (160ms), ~5% (320ms) on LibriSpeech test-clean