FluidAudio/Documentation/ASR/GettingStarted.md
Brandon Weng 7fd5ac5446 pyannote community-1 model for offline speaker diarization pipeline (#150)
2025-10-22 15:11:57 -04:00


Automatic Speech Recognition (ASR) / Transcription

  • Model (multilingual): FluidInference/parakeet-tdt-0.6b-v3-coreml
  • Model (English-only): FluidInference/parakeet-tdt-0.6b-v2-coreml
  • Languages: v3 spans 25 European languages; v2 focuses on English accuracy
  • Processing Mode: Batch transcription for complete audio files
  • Real-time Factor: ~120x on M4 Pro (1 minute of audio ≈ 0.5 seconds of processing)
  • Streaming Support: Coming soon — batch processing recommended for production use
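
To make the real-time factor concrete, here is a minimal sketch of the arithmetic. `estimatedProcessingSeconds` is a hypothetical helper, not part of FluidAudio, and the ~120x figure is the M4 Pro number quoted above; other hardware will likely differ.

```swift
/// Hypothetical helper (not part of FluidAudio).
/// RTFx = audio duration / processing time,
/// so processing time = audio duration / RTFx.
func estimatedProcessingSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

// Using the ~120x figure quoted for M4 Pro.
let oneMinute = estimatedProcessingSeconds(audioSeconds: 60, rtfx: 120)
let oneHour = estimatedProcessingSeconds(audioSeconds: 3600, rtfx: 120)
print(oneMinute)  // 0.5
print(oneHour)    // 30.0
```

At that rate, an hour-long recording would take roughly half a minute of compute, not counting model download and load time.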

Choosing a model version

  • Prefer v2 when you only need English. It reuses the fused TDT decoder from v3 but ships with a tighter vocabulary, delivering better recall on long-form English audio.
  • Use v3 for multilingual coverage (25 languages). English accuracy is still strong, but with its broader vocabulary v3 slightly trails v2 on rare words.
  • Both versions share the same API surface—set AsrModelVersion in code or pass --model-version in the CLI.

```swift
// Download the English-only bundle when you only need English transcripts
let models = try await AsrModels.downloadAndLoad(version: .v2)
```
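
The selection rule above can be sketched as a small function. `modelVersionFlag(forLanguages:)` is a hypothetical helper, not part of FluidAudio. It returns a string intended for the CLI's `--model-version` flag, and it assumes that `v3` is accepted there just as `v2` is. Falling back to v3 when no language is known is also an assumption.

```swift
/// Hypothetical helper (not part of FluidAudio): choose a model version
/// string from the language codes expected in the audio.
func modelVersionFlag(forLanguages codes: [String]) -> String {
    // Treat "en", "en-US", "en_us", ... as English (case-insensitive).
    let isEnglishOnly = !codes.isEmpty && codes.allSatisfy { code in
        let lower = code.lowercased()
        return lower == "en" || lower.hasPrefix("en-") || lower.hasPrefix("en_")
    }
    // v2 is the English-only bundle; v3 covers the multilingual case.
    // Assumption: with no language information, fall back to v3.
    return isEnglishOnly ? "v2" : "v3"
}

print(modelVersionFlag(forLanguages: ["en_us"]))           // v2
print(modelVersionFlag(forLanguages: ["en_us", "fr_fr"]))  // v3
```

In code, the same decision would map onto `.v2` or `.v3` when calling `AsrModels.downloadAndLoad(version:)`.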

Quick Start (Code)

```swift
import FluidAudio

// Batch transcription from an audio file
Task {
    // 1) Initialize ASR manager and load models
    let models = try await AsrModels.downloadAndLoad(version: .v3)  // Switch to .v2 for English-only
    let asrManager = AsrManager(config: .default)
    try await asrManager.initialize(models: models)

    // 2) Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/audio.wav")

    // 3) Transcribe the audio
    let result = try await asrManager.transcribe(samples, source: .system)
    print("Transcription: \(result.text)")
    print("Confidence: \(result.confidence)")
}
```

Important: Do not parse WAV/PCM bytes by hand (e.g., slicing headers or assuming 16-bit samples). Always convert with AudioConverter so differing bit depths, channel layouts, metadata chunks, or compressed formats (MP3/M4A/FLAC) get normalized to the 16 kHz mono Float32 tensors that Parakeet expects. Manually decoded buffers frequently contain garbage values, which shows up as empty transcripts even though the models load successfully.
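
When transcripts come back empty, a quick look at the sample buffer may help tell a decoding problem from a model problem. The sketch below is a hypothetical pre-flight check, not part of FluidAudio. It assumes normalized Float32 audio stays roughly within [-1, 1]; the 1.5 headroom threshold is an arbitrary choice. It also flags all-zero buffers.

```swift
/// Hypothetical pre-flight report (not part of FluidAudio).
struct SampleReport {
    let count: Int
    let nonFiniteCount: Int
    let peak: Float

    /// Assumption: normalized Float32 audio peaks near or below 1.0;
    /// 1.5 leaves arbitrary headroom for resampler overshoot.
    var looksPlausible: Bool {
        count > 0 && nonFiniteCount == 0 && peak > 0 && peak <= 1.5
    }
}

/// Counts NaN/infinite values and tracks the largest finite magnitude.
func inspect(_ samples: [Float]) -> SampleReport {
    var nonFinite = 0
    var peak: Float = 0
    for sample in samples {
        if sample.isFinite {
            peak = max(peak, abs(sample))
        } else {
            nonFinite += 1
        }
    }
    return SampleReport(count: samples.count, nonFiniteCount: nonFinite, peak: peak)
}

let converted: [Float] = [0.0, 0.25, -0.5]
let handParsed: [Float] = [0.0, .nan, 32767.0]
print(inspect(converted).looksPlausible)   // true
print(inspect(handParsed).looksPlausible)  // false
```

A failing report suggests the buffer was decoded incorrectly; it is a heuristic only and does not replace converting with AudioConverter.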

Transcribing directly from a file URL

If you already have an audio file on disk, you can skip manual sample loading: AsrManager.transcribe(_:source:) accepts a file URL and handles format conversion internally via AudioConverter.

```swift
let models = try await AsrModels.downloadAndLoad(version: .v3)
let asrManager = AsrManager()
try await asrManager.initialize(models: models)

let audioURL = URL(fileURLWithPath: "/path/to/audio.wav")
let result = try await asrManager.transcribe(audioURL, source: .system)
print(result.text)
```
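
For a folder of recordings, the URL-based call can be driven by a simple file listing. The sketch below uses only Foundation. `audioFiles(in:extensions:)` is a hypothetical helper, not part of FluidAudio, and its extension list simply mirrors the formats mentioned above.

```swift
import Foundation

/// Hypothetical helper (not part of FluidAudio): list audio files in a
/// directory, sorted by file name. Matching is case-insensitive.
func audioFiles(in directory: URL,
                extensions: Set<String> = ["wav", "mp3", "m4a", "flac"]) throws -> [URL] {
    let contents = try FileManager.default.contentsOfDirectory(
        at: directory,
        includingPropertiesForKeys: nil,
        options: [.skipsHiddenFiles]
    )
    return contents
        .filter { extensions.contains($0.pathExtension.lowercased()) }
        .sorted { $0.lastPathComponent < $1.lastPathComponent }
}

// With an initialized AsrManager (as shown above), each URL could then be
// passed to the URL-based transcribe call from this section:
//
// for url in try audioFiles(in: folder) {
//     let result = try await asrManager.transcribe(url, source: .system)
//     print("\(url.lastPathComponent): \(result.text)")
// }
```

Processing files one after another keeps memory use predictable. Whether a single AsrManager can safely run calls in parallel is not covered here.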

Manual model loading

Working offline? Follow the Manual Model Loading guide to stage the CoreML bundles and call AsrModels.load without triggering Hugging Face downloads.

CLI

```bash
# Transcribe an audio file (batch)
swift run fluidaudio transcribe audio.wav

# English-only run (better recall)
swift run fluidaudio transcribe audio.wav --model-version v2

# Transcribe multiple files in parallel
swift run fluidaudio multi-stream audio1.wav audio2.wav

# Benchmark ASR on LibriSpeech
swift run fluidaudio asr-benchmark --subset test-clean --max-files 50

# Run the English-only benchmark
swift run fluidaudio asr-benchmark --subset test-clean --max-files 50 --model-version v2

# Multilingual ASR (FLEURS) benchmark
swift run fluidaudio fleurs-benchmark --languages en_us,fr_fr --samples 10

# Download LibriSpeech test sets
swift run fluidaudio download --dataset librispeech-test-clean
swift run fluidaudio download --dataset librispeech-test-other
```