Files
FluidAudio/CLAUDE.md
T
Brandon Weng 7fd5ac5446 pyannote community-1 model for offline speaker diarization pipeline (#150)
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Keeping the streaming one around as the VBx and AHC clustering gets
pretty expensive after 30mins of audio and running it constantly gets
expensive. Its still possible to support clustering between files but
will save that for another PR.

Pyannote's Bench mark is around 11% - i increased steps to 0.2s instead
of 0.1 to double the speed but also selective fp16 results in more
operations to run on ANE but also means that we lose some precision.

```
Average DER: 14.95% | Median DER: 10.89% | Average JER: 39.27% | Median JER: 40.74% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 139.63 (from 232 clips)
Metrics summary saved to: /Users/brandonweng/FluidAudioDatasets/voxconverse/metrics/test_metrics_release.json
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
```

See benchmark.md for more info but compared to Pytorch model, we are
100x faster than the CPU version and ~6x faster compared to the mps
backend on mb pro 4

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Brandon Weng <BrandonWeng@users.noreply.github.com>
Co-authored-by: Alex <36247722+Alex-Wengg@users.noreply.github.com>
Co-authored-by: Alex-Wengg <hanweng9@gmail.com>
2025-10-22 15:11:57 -04:00

343 lines
14 KiB
Markdown

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
FluidAudio is a comprehensive Swift framework for local, low-latency audio processing on Apple platforms. It provides state-of-the-art speaker diarization, automatic speech recognition (ASR), and voice activity detection (VAD) through open-source models converted to Core ML. The system processes audio to identify "who spoke when" by segmenting audio and clustering speaker embeddings, with industry-competitive performance (17.7% DER).
## Critical Development Rules
### ⚠️ NEVER USE "unchecked Sendable"
- **DO NOT** use `@unchecked Sendable` under any circumstances
- Always properly implement thread-safe code with proper synchronization
- Use actors, `@MainActor`, or proper locking mechanisms instead
- If you encounter Sendable conformance issues, fix them properly rather than bypassing with `@unchecked`
### ⚠️ NEVER CREATE DUMMY MODELS OR SYNTHETIC DATA
- **DO NOT** create dummy, mock, or fake models for testing or development
- **DO NOT** generate synthetic audio data for testing
- **DO NOT** use random/fake models as placeholders
- **DO NOT** create "demonstration" or "simulated" models that don't contain real weights
- Always use the actual models required by the code
- If model authentication is required, inform the user rather than creating dummy versions
- Mock models produce meaningless results and waste development time
- Placeholder models with random weights will destroy performance (e.g., 17.8% → 77.1% DER)
### ⚠️ MODEL OPERATIONS - CONSULT BEFORE IMPLEMENTING
- When asked to merge, convert, or modify models:
- If it seems impossible or I have significant objections, CONSULT YOU FIRST
- Explain the concerns and let you decide whether to proceed
- If you say proceed, then DO IT IMMEDIATELY without further objections
- **DO NOT** create demonstration or placeholder models without permission
- **DO NOT** implement alternatives without asking
- Only after your approval: Implementation, then explanation of results
### Code Style and Formatting
- **Swift Format**: This project uses swift-format for consistent code style
- **Configuration**: See `.swift-format` for style rules
- **Auto-formatting**: PRs are automatically checked for formatting compliance
- **Local formatting**: Run `swift format --in-place --recursive --configuration .swift-format Sources/ Tests/`
#### Style Guidelines
- **Line length**: 120 characters
- **Indentation**: 4 spaces
- **Import order**: `import CoreML`, `import Foundation`, `import OSLog` (OrderedImports rule)
- **Naming conventions**:
- lowerCamelCase for variables/functions
- UpperCamelCase for types
- **Error handling**: Use proper Swift error handling, no force unwrapping in production
- **Documentation**: Triple-slash comments (`///`) for public APIs
- **Thread safety**: Use actors, `@MainActor`, or proper locking - never `@unchecked Sendable`
- **Control flow**: Prefer flattened if statements with early returns/continues over nested if statements. Use guard statements and inverted conditions to exit early. Nested if statements should be absolutely avoided to improve readability and reduce cognitive complexity.
## Current Performance Status
- **Achieved**: 17.7% DER
- **Target**: < 30% DER
- **Competitive with**: State-of-the-art research (Powerset BCE: 18.5%)
### Optimal Configuration
```swift
DiarizerConfig(
clusteringThreshold: 0.7, // Optimal value: 17.7% DER
minDurationOn: 1.0, // Minimum speaker segment duration
minDurationOff: 0.5, // Minimum silence between speakers
minActivityThreshold: 10.0, // Minimum activity threshold
debugMode: false
)
```
## Key Features
### 1. Speaker Diarization
- **Status**: Production ready with 17.7% DER
- **Models**: CoreML-based speaker segmentation and embedding
- **Auto-recovery**: Handles corrupted model downloads automatically
### 2. Auto-Recovery Mechanism
- Automatic detection and recovery from CoreML compilation failures
- Re-downloads corrupted models from Hugging Face
- Up to 3 retry attempts with comprehensive logging
## Essential Development Commands
### Core Commands
```bash
# Build
swift build # Debug build
swift build -c release # Release build (recommended for benchmarks)
# Test
swift test # Run all tests
swift test --parallel # Parallel test execution
swift test --filter CITests # Run CI-specific tests only
swift test --filter AsrManagerTests # Run specific test class
# Package management
swift package update # Update dependencies
swift package resolve # Resolve dependencies
swift package clean # Clean build cache
```
### Code Quality
```bash
# Format code (requires Swift 6+ for development)
swift format --in-place --recursive --configuration .swift-format Sources/ Tests/
# Check formatting without modifying
swift format lint --recursive --configuration .swift-format Sources/ Tests/
# Verify formatting compliance (CI-style check)
swift format --configuration .swift-format Sources/ Tests/
```
### CLI Commands
#### Benchmarking
```bash
# Diarization benchmarks
swift run fluidaudio diarization-benchmark --auto-download
swift run fluidaudio diarization-benchmark --single-file ES2004a --threshold 0.7 --output results.json
# ASR benchmarks
swift run fluidaudio asr-benchmark --subset test-clean --max-files 100
swift run fluidaudio asr-benchmark --subset test-other --output asr_results.json
# FLEURS multilingual benchmark
swift run fluidaudio fleurs-benchmark --languages en_us,fr_fr --samples 10
# VAD benchmark
swift run fluidaudio vad-benchmark --num-files 40 --threshold 0.5
```
#### Audio Processing
```bash
# Transcription
swift run fluidaudio transcribe audio.wav
swift run fluidaudio transcribe audio.wav --low-latency
# Multi-stream processing
swift run fluidaudio multi-stream audio1.wav audio2.wav
# Diarization processing
swift run fluidaudio process meeting.wav --output results.json --threshold 0.6
```
#### Dataset Management
```bash
# Download evaluation datasets
swift run fluidaudio download --dataset ami-sdm
swift run fluidaudio download --dataset librispeech-test-clean
swift run fluidaudio download --dataset librispeech-test-other
```
## Parameter Tuning Guide
### DiarizerConfig Parameters
1. **clusteringThreshold** (0.0-1.0)
- Sweet spot: 0.7-0.8
- Impact: Speaker separation accuracy
- 0.7 = 17.7% DER (optimal)
2. **minDurationOn** (seconds)
- Default: 1.0
- Filters short speech segments
3. **minDurationOff** (seconds)
- Default: 0.5
- Minimum gap between speakers
4. **minActivityThreshold** (frames)
- Default: 10.0
- Affects missed speech detection
## High-Level Architecture
### Project Structure
```
FluidAudio/
├── Sources/
│ ├── FluidAudio/ # Main library
│ │ ├── ASR/ # Automatic Speech Recognition
│ │ ├── Diarizer/ # Speaker diarization system
│ │ ├── VAD/ # Voice Activity Detection
│ │ └── Shared/ # Common utilities (audio conversion, memory optimization)
│ └── FluidAudioCLI/ # Command-line interface (macOS only)
├── Tests/ # Comprehensive test suite
└── Datasets/ # Evaluation datasets (AMI corpus)
```
### Core Components
#### 1. ASR (Automatic Speech Recognition)
- **AsrManager**: Main class for speech-to-text processing
- **TDT (Token Duration Transducer)**: Advanced decoding architecture
- **Streaming Support**: Real-time processing with persistent decoder states
- **Models**: Parakeet TDT v3 (0.6b) supporting 25 European languages
- **Performance**: ~110x RTF on M4 Pro (0.5s to process 1 minute of audio)
#### 2. Diarization System
- **DiarizerManager**: Main orchestrator for speaker separation
- **SegmentationProcessor**: Voice activity detection and segmentation
- **EmbeddingExtractor**: Speaker embedding generation for clustering
- **SpeakerManager**: Consistent speaker ID tracking across chunks
- **Performance**: 17.7% DER on AMI dataset (competitive with research)
#### 3. Voice Activity Detection
- **VadManager**: Voice activity detection with CoreML models
- **VadAudioProcessor**: additional filtering process
#### 4. Shared Infrastructure
- **ANEMemoryOptimizer**: Apple Neural Engine memory management
- **AudioConverter**: Universal audio format conversion to 16kHz mono Float32
- **ModelDownloader**: Automatic model retrieval from HuggingFace with recovery
- **Zero-Copy Processing**: Efficient model chaining without data duplication
### Processing Pipeline
1. **Audio Input** → AudioConverter (16kHz mono Float32)
2. **VAD Processing** → Voice activity segments
3. **Diarization** → Speaker embeddings + clustering
4. **ASR Processing** → Speech-to-text transcription
5. **Output** → Timestamped speaker-attributed transcripts
### Threading and Concurrency
- **Actor-based Architecture**: Thread-safe processing without `@unchecked Sendable`
- **Persistent States**: Decoder states maintained across chunks for streaming
- **Memory Management**: Automatic cleanup and ANE optimization
- **Parallel Processing**: Multi-stream support for batch operations
### Model Management
- **Automatic Downloads**: Models fetched from HuggingFace on first use
- **Auto-Recovery**: Corrupt model detection and re-download
- **CoreML Compilation**: Optimized for Apple Neural Engine
- **Caching**: Local model storage with validation
## Architecture Notes
- **Online diarization**: Works well with chunk-based processing
- **10-second chunks**: No significant performance impact
- **Speaker tracking**: Effective across chunks
- **DER calculation**: Fixed with optimal speaker mapping (Hungarian algorithm)
- **Cross-platform**: Supports macOS 14.0+, iOS 17.0+ (library), CLI macOS-only
## Streaming Diarization (Work in Progress)
### Goal
Develop a custom streaming speaker diarization manager that maintains consistent speaker IDs across chunks WITHOUT using the Hungarian algorithm for retroactive remapping (which is "cheating" in real-time scenarios).
## Development Environment
### Requirements
- **Swift**: 5.10+ (Swift 6+ required for contributors using swift-format)
- **Platforms**: macOS 14.0+, iOS 17.0+
- **Xcode**: Latest stable version for iOS development
- **Hardware**: Apple Silicon recommended for optimal performance
### CI/CD Pipeline
The project uses GitHub Actions with the following workflows:
- **swift-format.yml**: Code formatting compliance checks
- **tests.yml**: Cross-platform build and test execution
- **asr-benchmark.yml**: ASR performance validation
- **diarizer-benchmark.yml**: Speaker diarization benchmarks
- **vad-benchmark.yml**: Voice activity detection validation
### Code Style Configuration
- **Swift Format**: Enforced via `.swift-format` config
- **Line Length**: 120 characters
- **Indentation**: 4 spaces
- **Formatting Rules**: Automatic via swift-format, CI enforced
## User Preferences
- Never start responses with positive re-affirming text like "You're absolutely right!", "Good change!", "Excellent progress!", or similar
- Get straight to the point with technical facts
- For debugging, use print statements and delete them at the end when instructed
- Never create fallbacks or simplified solutions that don't actually solve the problem
- Always go for the proper solution over the "simplified" solution
- When asked to implement something specific, DO IT FIRST before explaining why it might not be optimal - implementation first, explanation second
- Just do as instructed - don't try to over-do things that aren't asked
## Development Guidelines
1. **Testing**: Always run benchmarks on multiple files for validation
2. **Logging**: Use comprehensive logging for debugging
3. **Error Handling**: Implement graceful degradation
4. **Performance**: Keep RTFx > 1.0x for real-time capability
5. **Thread Safety**: Never use `@unchecked Sendable` - implement proper synchronization
6. **Follow Instructions**: When the user asks to implement something specific, DO IT FIRST before explaining why it might not be optimal. Implementation first, explanation second.
7. **Avoid Deprecated Code**: Do not add support for deprecated models or features unless explicitly requested. Keep the codebase clean by only supporting current versions.
8. **Testing Policy**: ONLY add or run tests when explicitly requested by the user
9. **Git Operations**: NEVER run `git push` unless explicitly requested by the user. Only commit when asked.
10. **Code Formatting**: All code must pass swift-format checks before merge
## Next Steps
1. **Multi-file validation**: Test optimal config on all AMI files
2. **Real-world testing**: Validate on non-AMI audio
3. **Documentation**: Update API documentation
## Testing Strategy
### Test Categories
- **Unit Tests**: Component-level testing for individual classes
- **Integration Tests**: End-to-end workflow validation
- **Performance Tests**: Benchmarking against standard datasets
- **Memory Tests**: ANE memory optimization validation
- **Edge Case Tests**: Boundary condition handling
- **CI Tests**: Smoke tests for continuous integration
### Key Test Classes
- **CITests**: Lightweight tests for CI pipeline
- **AsrManagerTests**: ASR functionality validation
- **DiarizerMemoryTests**: Memory management validation
- **SendableTests**: Thread safety compliance
- **SegmentationProcessorTests**: Audio segmentation accuracy
### Running Specific Tests
```bash
# CI-specific tests (lightweight)
swift test --filter CITests
# ASR component tests
swift test --filter AsrManagerTests
# Memory optimization tests
swift test --filter ANEMemoryOptimizerTests
# Edge case validation
swift test --filter EdgeCaseTests
```
## Model Sources
- **Diarization**: [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
- **VAD CoreML**: [FluidInference/silero-vad-coreml](https://huggingface.co/FluidInference/silero-vad-coreml)
- **ASR Models**: [FluidInference/parakeet-tdt-0.6b-v3-coreml](https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v3-coreml)
- **Test Data**: [alexwengg/musan_mini*](https://huggingface.co/datasets/alexwengg) variants