### Why is this change needed? <!-- Explain the motivation for this change. What problem does it solve? --> Readme has grown way too much, splitting it up to just preserve the essence and move the verbose and details to Documentation
3.3 KiB
API Reference
This page summarizes the primary public APIs across modules. See inline doc comments and module-specific documentation for complete details.
Common Patterns
Audio Format: All modules expect 16kHz mono Float32 audio samples. Use AudioProcessor.loadAudioFile() for conversion.
Model Loading: Models auto-download from HuggingFace on first use. Set https_proxy environment variable if behind corporate firewall.
Error Handling: All async methods throw descriptive errors. Use proper error handling in production code.
Thread Safety: All managers are thread-safe and can be used concurrently across different queues.
Diarization
DiarizerManager
Main class for speaker diarization and "who spoke when" analysis.
Key Methods:
performCompleteDiarization(_:sampleRate:) throws -> DiarizerResult- Process complete audio file and return speaker segments
- Parameters:
RandomAccessCollection<Float>audio samples, sample rate (default: 16000) - Returns:
DiarizerResultwith speaker segments and timing
compareSpeakers(audio1:audio2:) throws -> Float- Compare speaker similarity between two audio samples
- Returns: Similarity score (0.0-1.0, higher = more similar)
validateAudio(_:) throws -> AudioValidationResult- Validate audio quality, length, and format requirements
Configuration:
DiarizerConfig: Clustering threshold, minimum durations, activity thresholds- Optimal threshold: 0.7 (17.7% DER on AMI dataset)
Voice Activity Detection
VadManager
Voice activity detection using Silero VAD models.
Key Methods:
processChunk(_:) throws -> VadResult- Process single 512-sample chunk (32ms at 16kHz)
- Returns: Voice activity probability and boolean decision
processAudioFile(_:) throws -> [VadResult]- Process complete audio file in 512-sample chunks
- Returns: Array of VAD results for each frame
Configuration:
VadConfig: Threshold (0.0-1.0), window size, post-processingVadAudioProcessor: SNR filtering, noise reduction, adaptive thresholding- Recommended threshold: 0.3-0.5 depending on noise conditions
Automatic Speech Recognition
AsrManager
Automatic speech recognition using Parakeet TDT v3 models.
Key Methods:
transcribe(_:source:) throws -> AsrTranscription- Process complete audio and return transcription
- Parameters:
RandomAccessCollection<Float>samples,AudioSource(microphone/system) - Returns:
AsrTranscriptionwith text, confidence, and timing
initialize(models:) async throws- Load and initialize ASR models (automatic download if needed)
Model Management:
AsrModels.downloadAndLoad() async throws -> AsrModels- Download models from HuggingFace and compile for CoreML
- Models cached locally after first download
ASRConfig: Beam size, temperature, language model weights
Audio Processing:
AudioProcessor.loadAudioFile(path:) throws -> [Float]- Load and convert audio to 16kHz mono Float32
- Supports: WAV, M4A, MP3, FLAC, and other common formats
AudioSource:.microphoneor.systemfor different processing paths
Performance:
- Real-time factor: ~120x on M4 Pro (processes 1min audio in 0.5s)
- Languages: 25 European languages supported
- Streaming: Available via
StreamingAsrManager(beta)