mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Files

T

Alex 8aa0dfcdac fix: clean up diarization test infrastructure (#395 )

## Summary
- Extract shared fixture helpers into `DiarizationTestFixtures` enum,
removing ~200 lines of duplicate code across `LSEENDIntegrationTests`
and `SpeakerEnrollmentTests`
- Replace fragile `Mirror`-based private state inspection with
`internal` `hasActiveSession` property on `LSEENDDiarizerAPI`
- Fix non-deterministic `srand48` seed in `SortformerTests` (use
constant `42` instead of time-based seed)
- Fix asymmetric skip guards in Sortformer enrollment tests (`XCTSkipIf`
instead of `XCTAssertNotNil` for host-dependent segments)

## Test plan
- [x] `swift build --build-tests` passes
- [ ] `swift test --filter SortformerTests` passes
- [ ] `swift test --filter LSEENDIntegrationTests` passes
- [ ] `swift test --filter SpeakerEnrollmentTests` passes
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/fluidinference/fluidaudio/pull/395"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

2026-03-18 12:51:34 -04:00

5.9 KiB

Raw Permalink Blame History

NeMo Sortformer AMI Benchmark

This directory contains tools for comparing the Swift/CoreML Sortformer implementation against NVIDIA's original NeMo Sortformer model.

Overview

The nemo_ami_benchmark.py script runs NVIDIA's Sortformer model on the AMI SDM dataset to provide a baseline comparison for the Swift/CoreML implementation.

Requirements

Python Environment

# Create virtual environment with Python 3.10+
python3.10 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install torch torchaudio torchcodec
pip install nemo_toolkit[asr] pyannote.metrics

HuggingFace Authentication

The NVIDIA Sortformer model is gated and requires HuggingFace authentication:

Create an account at huggingface.co
Accept the model license at nvidia/diar_sortformer_4spk-v2.1
Create an access token at huggingface.co/settings/tokens

AMI Dataset

Download the AMI SDM test set audio files and RTTM ground truth:

# Audio files should be in:
~/FluidAudioDatasets/ami_official/sdm/

# RTTM files should be in:
~/FluidAudioDatasets/ami_official/rttm/

RTTM files can be downloaded from pyannote AMI diarization setup.

Usage

Basic Usage

# Run on single file
HF_TOKEN="your_token" python nemo_ami_benchmark.py --single-file ES2004a --device cpu

# Run on all 16 AMI test meetings
HF_TOKEN="your_token" python nemo_ami_benchmark.py --device cpu

# Save results to JSON
HF_TOKEN="your_token" python nemo_ami_benchmark.py --output results.json

Command Line Options

Option	Description	Default
`--audio-dir`	Path to AMI audio files	`~/FluidAudioDatasets/ami_official/sdm`
`--rttm-dir`	Path to RTTM ground truth files	`~/FluidAudioDatasets/ami_official/rttm`
`--output`, `-o`	Output JSON file path	None
`--single-file`	Run on single meeting (e.g., ES2004a)	All 16 meetings
`--device`	Device to use (cpu, cuda, mps)	mps if available, else cpu
`--batch`	Use batch mode instead of streaming	False
`--model-path`	Path to local .nemo model file	Downloads from HuggingFace

Configuration Settings

Model Configuration

Parameter	Value	Description
Model	`nvidia/diar_sortformer_4spk-v1`	NVIDIA Sortformer 4-speaker model
Sample Rate	16000 Hz	Audio sample rate
Frame Duration	80 ms	Duration per output frame
Num Speakers	4	Maximum number of speakers

High-Latency Streaming Config

These settings match the Swift SortformerConfig.highContextV2_1:

Parameter	Value	Description
Chunk Length	48 frames	Core chunk length in encoder frames
Left Context	56 frames	Left context in encoder frames
Right Context	56 frames	Right context in encoder frames
Subsampling Factor	8	Mel frames per encoder frame
Total Context	30.4 seconds	(48 + 56 + 56) * 8 * 10ms

Post-Processing Config

Parameter	Value	Description
Onset Threshold	0.5	Threshold for speaker activity detection
Offset Threshold	0.5	Threshold for speaker activity end

AMI Test Meetings

The benchmark runs on 16 AMI SDM test meetings:

Series	Meetings
EN2002	EN2002a, EN2002b, EN2002c, EN2002d
ES2004	ES2004a, ES2004b, ES2004c, ES2004d
IS1009	IS1009a, IS1009b, IS1009c, IS1009d
TS3003	TS3003a, TS3003b, TS3003c, TS3003d

Output Metrics

Metric	Description
DER	Diarization Error Rate (Miss + FA + SE)
Miss %	Missed speech (false negatives)
FA %	False alarm (false positives)
SE %	Speaker error (wrong speaker assigned)
Speakers	Detected / Ground truth speaker count
RTFx	Real-time factor (audio duration / processing time)

Example Output

================================================================================
NEMO SORTFORMER AMI BENCHMARK
================================================================================
Device: cpu
Mode: Streaming (30.4s chunks)
Audio dir: /Users/user/FluidAudioDatasets/ami_official/sdm
RTTM dir: /Users/user/FluidAudioDatasets/ami_official/rttm
Meetings: 1

Loading Sortformer model...
Model loaded in 2.35s

----------------------------------------------------------------------
Meeting         DER %   Miss %     FA %     SE %   Speakers     RTFx
----------------------------------------------------------------------
ES2004a         34.0%    30.7%     0.9%     2.3% 4/       4     0.2x
----------------------------------------------------------------------
AVERAGE         34.0%    30.7%     0.9%     2.3%          -     0.2x
======================================================================

Comparison with Swift/CoreML

Metric	NeMo Python (CPU)	Swift/CoreML (ANE)
DER	34.0%	32.3%
Miss Rate	30.7%	~29%
False Alarm	0.9%	~1%
Speaker Error	2.3%	~2%
RTFx	0.2x	~5x

The Swift/CoreML implementation achieves comparable accuracy while being significantly faster due to Apple Neural Engine acceleration.

Notes

CPU inference is slow (~0.2x real-time). Use CUDA for faster inference if available.
MPS (Apple Silicon GPU) may have memory issues with long audio files.
The NeMo model runs in batch mode; Swift implements true streaming chunking on top.

5.9 KiB Raw Permalink Blame History