1. LS-EEND had a memory leak: the autorelease pool was not releasing the
multiarrays properly, and new ones were allocated on every chunk.
Switched to backed output arrays to eliminate the per-chunk allocations.
2. LS-EEND docs were somewhat stale. Updated them to reflect the new API.
---------
I accidentally deleted the old constructor in my last PR.
---------
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
## Summary
Completes API documentation with missing components and updates ASR and
TTS documentation to match current capabilities.
## Changes
### Documentation/API.md
- Add table of contents with component links
- Add missing ASR managers:
  - **SlidingWindowAsrManager**: Sliding window ASR with overlap and
    cancellation
  - **StreamingNemotronAsrManager**: Nemotron streaming ASR with encoder
    cache
  - **Qwen3AsrManager**: Qwen3-based ASR with Whisper frontend
- Add complete TTS section:
  - **KokoroTtsManager**: TTS with multiple voices (American/British,
    male/female)
  - **PocketTtsManager**: Lightweight streaming TTS with voice cloning
- Match table of contents order to actual sections
### Documentation/ASR/GettingStarted.md
- Update API from `loadModels()` to `configure(models:)` (current
method)
- Fix all code examples to use correct method signatures
### Documentation/TTS/Kokoro.md
- Remove promotional language ("high-quality")
- Remove "English-only" claim (Kokoro supports other languages, just not
tested yet)
## Result
Complete, accurate API reference with all public managers documented and
current code examples throughout.
## Summary
- add coverage for diarizer timeline synchronization, tentative timeline
compatibility, and Sortformer streaming flush behavior
- move LS-EEND tail-flush finalization into the streaming session so
offline and streaming paths share the same finalize semantics
- update API and diarization docs for explicit `endingOnTime`, timeline
behavior, and finalization details
## Verification
- swift build
- swift test --filter SortformerTimelineTests
- swift test --filter SortformerStreamingIntegrationTests
- swift test --filter LSEENDIntegrationTests.testDiarizerStreamingFinalizeMatchesProcessComplete
- swift test --filter LSEENDIntegrationTests.testStreamingSessionMatchesOfflineInferenceOnRealFixtureAudio
- swift test --filter LSEENDIntegrationTests.testDiarizerProcessEndingOnTimeAlignsVisibleRange
## Summary
- expose a public setter for DiarizerTimeline.speakers
- keep the existing queue-synchronized access pattern for reads and
writes
## Testing
- not run
---------
---
## Add LS-EEND speaker diarization
Sortformer handles up to 4 speakers and works best at 16 kHz in noisy
environments. That leaves a gap for phone calls, large meetings, and
recordings with unknown conditions. LS-EEND fills it: up to 10 speakers
(variant-dependent), trained on telephone, meeting, and in-the-wild
corpora, operating at 8 kHz.
This PR adds LS-EEND as a first-class diarizer alongside Sortformer —
same `Diarizer` protocol, same CLI patterns, same post-processing
pipeline.
### Why these changes are needed
**Unified timeline** — `SortformerTimeline` was Sortformer-specific and
couldn't be shared. LS-EEND needs the same post-processing (threshold,
median filter, onset/offset padding, min-duration filtering, finalized
vs tentative segments). `DiarizerTimeline` replaces `SortformerTimeline`
with a shared implementation that both models use, eliminating
duplicated logic.
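As a rough illustration of the kind of post-processing a shared timeline applies, the sketch below thresholds per-frame speaker probabilities, smooths them with a median filter, and drops runs shorter than a minimum length. It is not the `DiarizerTimeline` implementation: the function names, the choice to filter the binarized values, and the frame-based units are all assumptions made for the example, and onset/offset padding and the finalized vs tentative split are left out.

```swift
/// Hypothetical helper: median filter over binary activity.
/// For binary values the median is a majority vote over the window.
/// Windows are truncated at the edges of the array.
func medianFilter(_ active: [Bool], width: Int) -> [Bool] {
    precondition(width > 0 && width % 2 == 1, "width must be odd")
    let half = width / 2
    return active.indices.map { i -> Bool in
        let lo = max(0, i - half)
        let hi = min(active.count - 1, i + half)
        let ones = active[lo...hi].filter { $0 }.count
        return ones * 2 > (hi - lo + 1)
    }
}

/// Hypothetical helper: threshold, smooth, then keep only runs of at
/// least `minFrames` active frames. Returns half-open frame ranges.
func segments(
    probabilities: [Float],
    threshold: Float,
    medianWidth: Int,
    minFrames: Int
) -> [Range<Int>] {
    let smoothed = medianFilter(probabilities.map { $0 > threshold }, width: medianWidth)
    var result: [Range<Int>] = []
    var start: Int? = nil
    for (i, isActive) in smoothed.enumerated() {
        if isActive, start == nil { start = i }
        if !isActive, let s = start {
            if i - s >= minFrames { result.append(s..<i) }
            start = nil
        }
    }
    // Close a run that is still open at the end of the input.
    if let s = start, smoothed.count - s >= minFrames {
        result.append(s..<smoothed.count)
    }
    return result
}

// One speaker's frame probabilities with a one-frame dropout and a one-frame blip.
let exampleProbabilities: [Float] = [0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1]
let exampleSegments = segments(
    probabilities: exampleProbabilities,
    threshold: 0.5,
    medianWidth: 3,
    minFrames: 2
)
```

In this example the median filter fills the single-frame dropout and removes the isolated blip, so one segment covering frames 0 through 4 remains.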
**LS-EEND diarizer** — The model was partially wired up but missing a
clean public API, proper `Diarizer` protocol conformance, and
integration with `DiarizerTimeline`. This completes the implementation:
offline file processing with automatic resampling, streaming with
committed + speculative preview frames, and session-level control via
`LSEENDStreamingSession`.
**CLI** — Without `lseend` and `lseend-benchmark`, the model can't be
used or evaluated outside of Swift code. The benchmark also validates
that DER matches the paper's reported numbers before shipping to users.
**AMI ground truth fallback** — `lseend-benchmark --variant ami`
silently produced no results because the benchmark looked for RTTM files
that don't exist in the standard dataset layout. Added the same
`AMIParser` XML annotation fallback that the Sortformer benchmark uses.
**Tests** — `LSEENDRuntimeTests` runs the inference engine, streaming
session, and feature extractor against known-good outputs to catch
regressions in the CoreML pipeline.
**Documentation** — LS-EEND has a substantially different API surface
than Sortformer (five source files, streaming session layer, matrix
type, full evaluation namespace, per-variant speaker caps). Documents
the entire public API and provides a variant selection guide.
### Changes
**`DiarizerTimeline.swift`** (new) — Unified post-processing timeline
shared by both Sortformer and LS-EEND. Replaces
`SortformerTimeline.swift` (deleted). `SortformerDiarizerPipeline`
updated to use it.
**`LSEENDDiarizer.swift`** — `Diarizer` protocol conformance; offline
(`processComplete(audioFileURL:)`) and streaming (`addAudio` / `process`
/ `finalizeSession`) APIs; thread-safe via `NSLock`.
**`LSEENDInference.swift`** — `LSEENDInferenceEngine` (offline,
streaming, simulation) and `LSEENDStreamingSession` (stateful,
frame-in-frame-out with committed + preview outputs).
**`LSEENDFeatureExtraction.swift`** — `LSEENDOfflineFeatureExtractor`
and `LSEENDStreamingFeatureExtractor`; log-mel cumulative mean
normalization and splice-and-subsample.
**`LSEENDEvaluation.swift`** — DER computation with collar masking and
optimal speaker assignment (Hungarian); RTTM parsing and writing.
**`LSEENDCommand.swift`**, **`LSEENDBenchmark.swift`** — CLI commands
`lseend` and `lseend-benchmark`, with the same post-processing flags as
the Sortformer equivalents.
**`LSEENDRuntimeTests.swift`** — Integration tests for offline
inference, streaming, session behavior, and feature extraction.
**`Documentation/Diarization/LSEEND.md`** — Full public API reference
and variant selection guide (`.ami` → 4 speakers, `.callhome` → 7,
`.dihard2`/`.dihard3` → 10; DER numbers from the paper).
---------
### Why is this change needed?
The priority order for `ModelRegistry.baseURL` is:
1. Programmatic override (highest priority):
   `ModelRegistry.baseURL = "https://custom.com"`
2. `REGISTRY_URL` environment variable:
   `export REGISTRY_URL=https://custom.com`
3. `MODEL_REGISTRY_URL` environment variable:
   `export MODEL_REGISTRY_URL=https://custom.com`
4. Default (lowest priority):
   `https://huggingface.co`

`https_proxy` is left in place so as not to break existing users.
Updated the cache key for the GitHub workflows to trigger a redownload.
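
The lookup order for `ModelRegistry.baseURL` listed above can be sketched as a small resolver. This is an illustration only, not the code in this PR: `resolveBaseURL` is a hypothetical function, and treating an empty value as "not set" is an assumption.

```swift
import Foundation

/// Hypothetical resolver mirroring the documented priority order.
func resolveBaseURL(
    override: String?,
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> String {
    // 1. A programmatic override always wins.
    if let override = override, !override.isEmpty {
        return override
    }
    // 2 and 3. REGISTRY_URL is consulted before MODEL_REGISTRY_URL.
    for key in ["REGISTRY_URL", "MODEL_REGISTRY_URL"] {
        if let value = environment[key], !value.isEmpty {
            return value
        }
    }
    // 4. Fall back to the default host.
    return "https://huggingface.co"
}
```

Passing the environment as a parameter keeps the function easy to unit test without mutating the process environment.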
### Why is this change needed?
Adds a unit test that reproduces what the user was seeing. The root
cause is shared ML arrays in the ANE optimizer when multiple models run
at once, or when too many instances are running (a meeting note taker,
for instance).
Previously we had applied fixes at the manager level, but the culprit is
the underlying ANE optimizer, so the problem manifested in different forms.
### Why is this change needed?
Keeping the streaming one around, as the VBx and AHC clustering gets
pretty expensive after 30 minutes of audio and running it constantly
gets expensive. It's still possible to support clustering between files,
but I will save that for another PR.
Pyannote's benchmark is around 11%. I increased the step to 0.2s from
0.1s to double the speed. Selective fp16 also lets more operations run
on the ANE, at the cost of some precision.
```
Average DER: 14.95% | Median DER: 10.89% | Average JER: 39.27% | Median JER: 40.74% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 139.63 (from 232 clips)
Metrics summary saved to: /Users/brandonweng/FluidAudioDatasets/voxconverse/metrics/test_metrics_release.json
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
```
See benchmark.md for more info, but compared to the PyTorch model, we
are 100x faster than the CPU version and ~6x faster than the MPS
backend on mb pro 4.
---------
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Brandon Weng <BrandonWeng@users.noreply.github.com>
Co-authored-by: Alex <36247722+Alex-Wengg@users.noreply.github.com>
Co-authored-by: Alex-Wengg <hanweng9@gmail.com>
### Why is this change needed?
Follows up on https://github.com/FluidInference/FluidAudio/pull/153; I
thought it would be easier to just clean it up entirely myself.
Renames the actor-level threshold to `defaultThreshold` and actually
allows overriding it per segment.
Previously the end-of-speech (negative) threshold wasn't being used either.
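
To make the two thresholds concrete, here is a minimal hysteresis sketch: speech starts when the probability reaches the onset threshold and ends only when it falls below a lower, negative threshold. It is not the VAD manager's code. The function name, the parameter names other than `defaultThreshold`, and the 0.15 offset between the two thresholds are assumptions chosen for the example.

```swift
/// Hypothetical frame classifier with a per-call threshold override.
func speechFrames(
    probabilities: [Float],
    defaultThreshold: Float,
    overrideThreshold: Float? = nil,
    negativeOffset: Float = 0.15  // assumed gap between onset and end thresholds
) -> [Bool] {
    // A per-segment override takes precedence over the default.
    let onset = overrideThreshold ?? defaultThreshold
    // The end-of-speech (negative) threshold sits below the onset threshold.
    let negative = max(onset - negativeOffset, 0.01)

    var inSpeech = false
    return probabilities.map { p -> Bool in
        if !inSpeech, p >= onset {
            inSpeech = true
        } else if inSpeech, p < negative {
            inSpeech = false
        }
        return inSpeech
    }
}

let frameProbabilities: [Float] = [0.1, 0.9, 0.6, 0.5, 0.2, 0.1]
let withDefault = speechFrames(probabilities: frameProbabilities, defaultThreshold: 0.7)
let withOverride = speechFrames(
    probabilities: frameProbabilities,
    defaultThreshold: 0.7,
    overrideThreshold: 0.95
)
```

With the default of 0.7, the frame at 0.6 stays in speech because it is still above the lower end threshold; with the stricter override, no frame ever starts a speech region.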
### Why is this change needed?
We had ~3 developers separately ask to bring support back, so by popular
demand it is back. It is better for strictly English use cases.
For English-only transcription, it is still much better than v3 based on
what I've seen, even though the average WER is only 0.4 percentage
points better.
v3 average WER is 2.6%.
```
[01:35:16.894] [INFO] [Benchmark] 2620 files per dataset • Test runtime: 3m 25s • 09/26/2025, 1:35 AM EDT
[01:35:16.894] [INFO] [Benchmark] --- Benchmark Results ---
[01:35:16.894] [INFO] [Benchmark] Dataset: librispeech test-clean
[01:35:16.894] [INFO] [Benchmark] Files processed: 2620
[01:35:16.894] [INFO] [Benchmark] Average WER: 2.2%
[01:35:16.894] [INFO] [Benchmark] Median WER: 0.0%
[01:35:16.894] [INFO] [Benchmark] Average CER: 0.7%
[01:35:16.894] [INFO] [Benchmark] Median RTFx: 125.6x
[01:35:16.894] [INFO] [Benchmark] Overall RTFx: 141.2x (19452.5s / 137.7s)
[01:35:16.894] [INFO] [Benchmark] Results saved to: asr_benchmark_results.json
[01:35:16.894] [INFO] [Benchmark] ASR benchmark completed successfully
```
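
For reference, the overall RTFx in the log is audio duration divided by processing time. The snippet below recomputes it from the two durations printed in the log; `realTimeFactor` is a hypothetical helper, and because the logged durations are themselves rounded, the result only agrees with the logged 141.2x to within about a tenth.

```swift
/// Hypothetical helper: real-time factor = seconds of audio / seconds spent processing.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// Durations taken from the "Overall RTFx" log line above.
let overallRTFx = realTimeFactor(audioSeconds: 19452.5, processingSeconds: 137.7)
```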
### Why is this change needed?
The previous model was based on v5 and had too much custom tuning. This
newly converted model nearly perfectly matches the output of the Python
PyTorch JIT model, with < 0.05% deviation. Quantized models showed no
real improvement, so there is no need for them; the base models are
more than enough.
<img width="3533" height="1774" alt="image"
src="https://github.com/user-attachments/assets/16da2b62-2f6c-4dfa-9534-9688377e521e"
/>
<img width="4170" height="2365" alt="image"
src="https://github.com/user-attachments/assets/fb90d0a1-4297-4a47-9d6b-5fb87b8ceacd"
/>
<img width="1674" height="1170" alt="bweng-Ghostty-2025-09-15-at-21 33
30"
src="https://github.com/user-attachments/assets/c6c77fc8-1218-495d-967d-f5320d2e4eda"
/>
```text
[21:34:10.201] [INFO] [VAD] RTFx: 1517.8x faster than real-time
[21:34:10.444] [INFO] [VAD] VAD Benchmark Results:
[21:34:10.444] [INFO] [VAD] Accuracy: 90.9%
[21:34:10.444] [INFO] [VAD] Precision: 70.0%
[21:34:10.444] [INFO] [VAD] Recall: 100.0%
[21:34:10.444] [INFO] [VAD] F1-Score: 82.3%
[21:34:10.444] [INFO] [VAD] Total Time: 259.22s
[21:34:10.444] [INFO] [VAD] RTFx: 1517.8x faster than real-time
[21:34:10.444] [INFO] [VAD] Files Processed: 2016
[21:34:10.444] [INFO] [VAD] Avg Time per File: 0.129s
[21:34:10.484] [INFO] [VAD] Results saved to: vad_benchmark_results.json
[21:34:10.484] [INFO] [VAD] EXCELLENT: F1-Score above 70%
```
### Why is this change needed?
The post processing logic we had previously was convoluted and hacked
together.
- Remove the normalization
- Remove the post processing of the output from the model
- Refactor the model name and centralize in ModelNames
- Batch processing of the chunks: 100 RTFx --> 1200 RTFx
- Accelerate for array operations helped a ton here too, an increase
from 100 RTFx -> 220 RTFx
### Why is this change needed?
The README has grown too large. This splits it up so that it preserves
just the essentials and moves the verbose details to Documentation.