Files
FluidAudio/Sources/FluidAudioCLI
Brandon Weng 7fd5ac5446 pyannote community-1 model for offline speaker diarization pipeline (#150)
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Keeping the streaming one around as the VBx and AHC clustering gets
pretty expensive after 30mins of audio and running it constantly gets
expensive. Its still possible to support clustering between files but
will save that for another PR.

Pyannote's Bench mark is around 11% - i increased steps to 0.2s instead
of 0.1 to double the speed but also selective fp16 results in more
operations to run on ANE but also means that we lose some precision.

```
Average DER: 14.95% | Median DER: 10.89% | Average JER: 39.27% | Median JER: 40.74% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 139.63 (from 232 clips)
Metrics summary saved to: /Users/brandonweng/FluidAudioDatasets/voxconverse/metrics/test_metrics_release.json
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
```

See benchmark.md for more info but compared to Pytorch model, we are
100x faster than the CPU version and ~6x faster compared to the mps
backend on mb pro 4

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Brandon Weng <BrandonWeng@users.noreply.github.com>
Co-authored-by: Alex <36247722+Alex-Wengg@users.noreply.github.com>
Co-authored-by: Alex-Wengg <hanweng9@gmail.com>
2025-10-22 15:11:57 -04:00
..

FluidAudio CLI

The FluidAudio CLI provides various commands for audio processing, transcription, and benchmarking.

Installation

Build the CLI tool:

swift build -c release

Commands Overview

1. process - Audio Diarization

Process a single audio file to identify speakers and their segments.

# Basic usage
swift run fluidaudio process audio.wav

# With custom output and threshold
swift run fluidaudio process audio.wav --output results.json --threshold 0.7

# With debug mode
swift run fluidaudio process audio.wav --debug

Options:

  • --output <file>: Output JSON file (default: prints to console)
  • --threshold <value>: Clustering threshold 0.0-1.0 (default: 0.7)
  • --debug: Enable debug output

2. transcribe - Audio Transcription

Transcribe audio files using streaming ASR with real-time updates.

# Basic transcription
swift run fluidaudio transcribe audio.wav

# With low-latency configuration
swift run fluidaudio transcribe audio.wav --config low-latency

# With debug output
swift run fluidaudio transcribe audio.wav --debug

# Compare with direct ASR API
swift run fluidaudio transcribe audio.wav --compare

Options:

  • --config <type>: Configuration type: default, low-latency, high-accuracy
  • --debug: Show debug information
  • --compare: Compare streaming API with direct ASR API
  • --help, -h: Show help message

Configurations:

  • default: 2.5s chunks, 0.85 confirmation threshold
  • low-latency: 2.0s chunks, 0.75 confirmation threshold
  • high-accuracy: 3.0s chunks, 0.90 confirmation threshold

3. multi-stream - Parallel Transcription

Transcribe multiple audio files in parallel using shared ASR models.

# Process two different files
swift run fluidaudio multi-stream mic_audio.wav system_audio.wav

# Process same file on both streams
swift run fluidaudio multi-stream audio.wav

# With debug output
swift run fluidaudio multi-stream audio1.wav audio2.wav --debug

Options:

  • --debug: Show debug information
  • --help, -h: Show help message

4. diarization-benchmark - Speaker Diarization Benchmark

Run comprehensive benchmarks on evaluation datasets.

# Run on AMI dataset with auto-download
swift run fluidaudio diarization-benchmark --auto-download

# Test single file
swift run fluidaudio diarization-benchmark --single-file ES2004a --threshold 0.7

# Run on specific dataset
swift run fluidaudio diarization-benchmark --dataset ami-sdm --max-files 10

# Save results to file
swift run fluidaudio diarization-benchmark --output benchmark_results.json

Options:

  • --dataset <name>: Dataset to use (ami-sdm, ami-mdm, voxconverse)
  • --auto-download: Automatically download required datasets
  • --single-file <id>: Test a single file (e.g., ES2004a)
  • --threshold <value>: Clustering threshold (default: 0.7)
  • --max-files <n>: Maximum files to process
  • --output <file>: Save results to JSON file
  • --verbose: Show detailed progress

5. vad-benchmark - Voice Activity Detection Benchmark

Benchmark VAD performance on test datasets.

# Run VAD benchmark
swift run fluidaudio vad-benchmark --num-files 40

# With custom threshold 
swift run fluidaudio vad-benchmark --threshold 0.8

# Test on specific dataset
swift run fluidaudio vad-benchmark --dataset voices-subset

Options:

  • --num-files <n>: Number of files to test (or --all-files)
  • --threshold <value>: VAD threshold (default: 0.3)
  • --dataset <name>: Dataset to use (e.g., mini50, voices-subset, musan-full)
  • --debug: Verbose logging and per-file RTFx

6. asr-benchmark - ASR Benchmark

Benchmark ASR performance on LibriSpeech or other datasets.

# Run on LibriSpeech test-clean
swift run fluidaudio asr-benchmark --subset test-clean --max-files 100

# Run on test-other subset
swift run fluidaudio asr-benchmark --subset test-other --max-files 50

# With verbose output
swift run fluidaudio asr-benchmark --verbose

Options:

  • --subset <name>: LibriSpeech subset (test-clean, test-other)
  • --max-files <n>: Maximum files to process
  • --verbose: Show detailed progress

7. download - Download Datasets

Download evaluation datasets for benchmarking.

# Download AMI-SDM dataset
swift run fluidaudio download --dataset ami-sdm

# Download multiple datasets
swift run fluidaudio download --dataset ami-sdm --dataset voxconverse

# List available datasets
swift run fluidaudio download --list

Options:

  • --dataset <name>: Dataset to download
  • --list: List available datasets

Output Formats

Diarization Output (JSON)

{
  "audioFile": "audio.wav",
  "durationSeconds": 300.5,
  "speakerCount": 3,
  "segments": [
    {
      "speakerId": "Speaker 1",
      "startTimeSeconds": 0.0,
      "endTimeSeconds": 45.2,
      "qualityScore": 0.85
    }
  ],
  "processingTimeSeconds": 15.2,
  "realTimeFactor": 0.05
}

Transcription Output

  • Real-time updates showing volatile and confirmed text
  • Final transcription with performance metrics
  • RTFx (Real-Time Factor) showing processing speed

Performance Notes

  • All commands process audio as fast as possible (no artificial delays)
  • Multi-stream command demonstrates parallel processing with shared models
  • Benchmarks provide detailed performance metrics including DER, WER, and RTFx

Examples

Complete Workflow Example

# 1. Download dataset
swift run fluidaudio download --dataset ami-sdm

# 2. Run diarization benchmark
swift run fluidaudio diarization-benchmark --dataset ami-sdm --output results.json

# 3. Process individual file
swift run fluidaudio process audio.wav --threshold 0.7

# 4. Transcribe audio
swift run fluidaudio transcribe audio.wav --config low-latency

# 5. Multi-stream transcription
swift run fluidaudio multi-stream mic.wav system.wav

Quick Test

# Test with included sample files
swift run fluidaudio transcribe medical.wav
swift run fluidaudio process IS1001a.Mix-Headset.wav --threshold 0.7

Troubleshooting

  1. Model Download Issues: The CLI will automatically download required models on first use. Ensure you have internet connectivity.

  2. Memory Usage: For long audio files, ensure sufficient memory is available.

  3. Performance: Use release build (swift build -c release) for best performance.

  4. Audio Format: The CLI automatically handles various audio formats and sample rates.