Automatic Speech Recognition (ASR) / Transcription
- Model (multilingual): FluidInference/parakeet-tdt-0.6b-v3-coreml
- Model (English-only): FluidInference/parakeet-tdt-0.6b-v2-coreml
- Languages: v3 spans 25 European languages; v2 focuses on English accuracy
- Processing Mode: Batch transcription for complete audio files
- Real-time Factor: ~120x on M4 Pro (1 minute of audio ≈ 0.5 seconds of processing)
- Streaming Support: Coming soon — batch processing recommended for production use
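To make the real-time factor concrete, here is a minimal sketch of the arithmetic. It assumes the figure is audio duration divided by processing time; the function name `estimatedProcessingSeconds` is hypothetical and not part of FluidAudio. Actual throughput will vary with hardware and clip content.

```swift
/// Estimated wall-clock processing time for a clip.
/// Assumes real-time factor = audio duration / processing time,
/// so processing time = audio duration / real-time factor.
func estimatedProcessingSeconds(audioSeconds: Double, realTimeFactor: Double) -> Double {
    precondition(realTimeFactor > 0, "real-time factor must be positive")
    return audioSeconds / realTimeFactor
}

// ~120x is the figure quoted above for an M4 Pro.
let oneMinute = estimatedProcessingSeconds(audioSeconds: 60, realTimeFactor: 120)
let oneHour = estimatedProcessingSeconds(audioSeconds: 3_600, realTimeFactor: 120)

print(oneMinute) // 0.5
print(oneHour)   // 30.0
```

Under that assumption, an hour-long recording would take roughly half a minute in batch mode.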
Choosing a model version
- Prefer v2 when you only need English. It reuses the fused TDT decoder from v3 but ships with a tighter vocabulary, delivering better recall on long-form English audio.
- Use v3 for multilingual coverage (25 languages). English accuracy is still strong, but the broader vocab slightly trails v2 on rare words.
- Both versions share the same API surface: set AsrModelVersion in code or pass --model-version in the CLI.
// Download the English-only bundle when you only need English transcripts
let models = try await AsrModels.downloadAndLoad(version: .v2)
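The selection guidance above can be expressed as a small helper. This is only a sketch: `ModelVersionChoice` and `recommendedVersion(for:)` are hypothetical names that stand in for the library's own version type so the snippet compiles on its own, and the language-tag matching (treating "en", "en-US", and "en_us" as English) is an assumption.

```swift
/// Hypothetical stand-in for the library's model version type,
/// defined here only so the sketch is self-contained.
enum ModelVersionChoice: String {
    case v2  // English-only
    case v3  // multilingual (25 European languages)
}

/// Picks v2 when every requested language is English, otherwise v3.
/// An empty request falls back to v3 (an arbitrary choice for this sketch).
func recommendedVersion(for languageCodes: [String]) -> ModelVersionChoice {
    let allEnglish = !languageCodes.isEmpty && languageCodes.allSatisfy { code in
        let lower = code.lowercased()
        return lower == "en" || lower.hasPrefix("en-") || lower.hasPrefix("en_")
    }
    return allEnglish ? .v2 : .v3
}

print(recommendedVersion(for: ["en-US"]).rawValue)    // v2
print(recommendedVersion(for: ["en", "fr"]).rawValue) // v3
print(recommendedVersion(for: []).rawValue)           // v3
```

The result would then be mapped to the corresponding AsrModelVersion case before calling AsrModels.downloadAndLoad.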
Quick Start (Code)
import FluidAudio
// Batch transcription from an audio file
Task {
    // 1) Initialize ASR manager and load models
    let models = try await AsrModels.downloadAndLoad(version: .v3) // Switch to .v2 for English-only
    let asrManager = AsrManager(config: .default)
    try await asrManager.initialize(models: models)

    // 2) Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/audio.wav")

    // 3) Transcribe the audio
    let result = try await asrManager.transcribe(samples, source: .system)
    print("Transcription: \(result.text)")
    print("Confidence: \(result.confidence)")
}
Important: Do not parse WAV/PCM bytes by hand (e.g., slicing headers or assuming 16-bit samples). Always convert with AudioConverter so differing bit depths, channel layouts, metadata chunks, or compressed formats (MP3/M4A/FLAC) get normalized to the 16 kHz mono Float32 tensors that Parakeet expects. Manually decoded buffers frequently contain garbage values, which show up as empty transcripts even though the models load successfully.
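If empty transcripts do appear, a quick sanity check on the sample buffer can help tell a decoding problem from a model problem. The sketch below is a hypothetical debugging aid, not part of FluidAudio: `checkSamples` and `SampleCheck` are made-up names, and the `maxPeak` threshold assumes Float32 audio normalized to roughly [-1, 1]. It does not replace proper conversion.

```swift
/// Result of a pre-flight check on a sample buffer (hypothetical type).
struct SampleCheck {
    let isPlausible: Bool
    let reason: String
}

/// Flags buffers that are unlikely to be valid normalized Float32 audio.
/// `maxPeak` is an assumed tolerance above the nominal [-1, 1] range.
func checkSamples(_ samples: [Float], maxPeak: Float = 1.5) -> SampleCheck {
    guard !samples.isEmpty else {
        return SampleCheck(isPlausible: false, reason: "empty buffer")
    }
    guard samples.allSatisfy({ $0.isFinite }) else {
        return SampleCheck(isPlausible: false, reason: "contains NaN or infinity")
    }
    let peak = samples.map { abs($0) }.max() ?? 0
    guard peak > 0 else {
        return SampleCheck(isPlausible: false, reason: "all zeros (silence)")
    }
    guard peak <= maxPeak else {
        // Typical symptom of integer PCM values being treated as floats.
        return SampleCheck(isPlausible: false, reason: "peak exceeds expected range")
    }
    return SampleCheck(isPlausible: true, reason: "ok")
}

print(checkSamples([0.0, 0.25, -0.5]).reason)   // ok
print(checkSamples([Float.nan, 0.1]).reason)    // contains NaN or infinity
print(checkSamples([12_000, -8_000]).reason)    // peak exceeds expected range
```

A buffer that fails this check most likely came from hand-rolled decoding and should be regenerated through the converter.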
Transcribing directly from a file URL
If you already have an audio file on disk, you can skip manual sample loading: AsrManager.transcribe(_ url:source:) handles format conversion internally via AudioConverter.
let models = try await AsrModels.downloadAndLoad(version: .v3)
let asrManager = AsrManager()
try await asrManager.initialize(models: models)
let audioURL = URL(fileURLWithPath: "/path/to/audio.wav")
let result = try await asrManager.transcribe(audioURL, source: .system)
print(result.text)
Manual model loading
Working offline? Follow the Manual Model Loading guide to stage the CoreML bundles and call AsrModels.load without triggering HuggingFace downloads.
CLI
# Transcribe an audio file (batch)
swift run fluidaudio transcribe audio.wav
# English-only run (better recall)
swift run fluidaudio transcribe audio.wav --model-version v2
# Transcribe multiple files in parallel
swift run fluidaudio multi-stream audio1.wav audio2.wav
# Benchmark ASR on LibriSpeech
swift run fluidaudio asr-benchmark --subset test-clean --max-files 50
# Run the English-only benchmark
swift run fluidaudio asr-benchmark --subset test-clean --max-files 50 --model-version v2
# Multilingual ASR (FLEURS) benchmark
swift run fluidaudio fleurs-benchmark --languages en_us,fr_fr --samples 10
# Download LibriSpeech test sets
swift run fluidaudio download --dataset librispeech-test-clean
swift run fluidaudio download --dataset librispeech-test-other