19 Commits

Author SHA1 Message Date
Benjamin Lee a0092cf163 Fixed LS-EEND Memory Leak + Updated Docs (#605)
1. LS-EEND had a memory leak since the autorelease pool was not
releasing the multiarrays properly and was allocating new ones every
chunk. Switched to backed output arrays to eliminate new allocations
2. LS-EEND docs were somewhat stale. Updated them to reflect the new API

---------
2026-05-12 08:53:59 -04:00
Benjamin Lee 35f6ba697f Added Back the Old LS-EEND Constructors (#563)
I accidentally deleted the old constructor in my last PR.

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-30 17:24:18 -07:00
Alex 7c115f6b4e feat(tts/kokoro-ane): add laishere 7-stage CoreML chain (ANE-optimized) (#547)
## Summary

Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the
[laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml)
7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise →
Vocoder → Tail) behind an actor-based facade, used with the upstream
author's permission. Per-stage `MLComputeUnits` assignment routes
Albert/PostAlbert/Alignment/Vocoder to **ANE**; Prosody/Noise/Tail stay
on CPU+GPU for fp32/iSTFT-heavy ops.

The companion mobius PR for the conversion side:
https://github.com/FluidInference/mobius/pull/45

Existing `KokoroTtsManager` (single fp32 model) is untouched. Both
backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo —
KokoroAne lives under the `ANE/` subdirectory.

## What's added

**Module: `Sources/FluidAudio/TTS/KokoroAne/`**
- `KokoroAneManager` — actor facade: `initialize`,
`synthesize(text|phonemes)`, `synthesizeDetailed`
- `KokoroAneSynthesizer` — 7-stage orchestration with fp16↔fp32 vImage
boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32`
helpers so each output is fetched once.
- `KokoroAneModelStore` — per-stage MLModel handles + vocab + voice pack
cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so
partial-load failures stay retryable.
- `KokoroAneVoicePack` — `[510, 256]` flat fp32 row indexing (timbre
cols `[0:128]`, style_s cols `[128:256]`)
- `KokoroAneVocab` — IPA → token IDs with BOS/EOS wrap, max 512
- `KokoroAneResourceDownloader` — HF cache management via existing
`DownloadUtils`; also downloads the shared kokoro G2P assets on first
init (see fix below)
- G2P reuses existing `G2PModel.shared`
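
The atomic-commit load mentioned for `KokoroAneModelStore` can be sketched as follows. This is a minimal illustration, not the module's code: `StageStore`, `StageLoadError`, and the `loader` closure are hypothetical names, and the real store is described as an actor rather than a plain class.

```swift
import Foundation

// Hypothetical error type, used only for this illustration.
enum StageLoadError: Error { case failed(String) }

// Hypothetical store; `Handle` stands in for a per-stage model handle.
final class StageStore<Handle> {
    private(set) var handles: [String: Handle] = [:]
    var isLoaded: Bool { !handles.isEmpty }

    /// Loads every stage into a local accumulator and commits only if all succeed.
    func loadIfNeeded(stages: [String], loader: (String) throws -> Handle) throws {
        guard !isLoaded else { return }
        var pending: [String: Handle] = [:]
        for stage in stages {
            // A throw here leaves `handles` untouched, so a later call can retry.
            pending[stage] = try loader(stage)
        }
        handles = pending  // single assignment: the commit point
    }
}
```

Because nothing is published until every stage has loaded, a failure halfway through leaves the store in its initial state instead of half-populated.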

**CLI:**
```bash
fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json]
fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json
```
The `tts-asr-verify` batch command synthesizes each phrase, transcribes
with Parakeet, and emits per-phrase + macro/micro WER with stage
timings.

**Tests** (`Tests/FluidAudioTests/TTS/KokoroAne/`):
- 13 unit tests (vocab, voice pack) — no model deps, run on CI
- 5 E2E tests (synth + ASR roundtrip) — gated by
`FLUIDAUDIO_RUN_KOKOROANE_E2E=1`

**Docs:**
- New `Documentation/TTS/KokoroAne.md` — when-to-pick decision table,
CLI/Swift quick start, per-stage compute targets, voice pack layout,
limits, perf numbers, source links.
- Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the
ANE-resident variant.
- Updated `Documentation/README.md` index, `Documentation/Models.md` TTS
table, `Documentation/API.md` reference, `Documentation/CLI.md` example.

## Verified end-to-end on M2

Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation).
Warm load: ~300ms.

| Phrase | Synth | Audio | RTFx | ASR roundtrip |
|---|---|---|---|---|
| Hello world | 0.47s | 1.65s | 3.5× | "Hello world." (WER 0%) |
| The quick brown fox… | 0.32s | 3.18s | 9.9× | dropped "The" (WER 11%) |
| She had been waiting… | 0.25s | 2.80s | 11.4× | "Shay" misheard (WER 12.5%) |

Aggregate macro WER 7.9%, micro WER 10.5% — error is ASR-side; TTS audio
is intelligible.
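
The difference between the macro and micro figures can be illustrated with a short sketch. The type and function names are hypothetical, and the per-phrase word counts are assumptions inferred from the reported percentages (1 error in 9 words ≈ 11%, 1 in 8 = 12.5%); they are not taken from the PR.

```swift
import Foundation

// Per-phrase counts. The word counts used below are assumptions inferred from
// the reported percentages, not values stated in the PR.
struct PhraseScore {
    let errors: Int          // substitutions + deletions + insertions
    let referenceWords: Int  // words in the reference text
    var wer: Double { Double(errors) / Double(referenceWords) }
}

/// Macro WER: mean of the per-phrase WERs (every phrase weighs the same).
func macroWER(_ scores: [PhraseScore]) -> Double {
    guard !scores.isEmpty else { return 0 }
    return scores.map { $0.wer }.reduce(0, +) / Double(scores.count)
}

/// Micro WER: total errors over total reference words (long phrases weigh more).
func microWER(_ scores: [PhraseScore]) -> Double {
    let words = scores.reduce(0) { $0 + $1.referenceWords }
    guard words > 0 else { return 0 }
    return Double(scores.reduce(0) { $0 + $1.errors }) / Double(words)
}

let scores = [
    PhraseScore(errors: 0, referenceWords: 2),
    PhraseScore(errors: 1, referenceWords: 9),
    PhraseScore(errors: 1, referenceWords: 8),
]
print(String(format: "macro %.1f%%, micro %.1f%%",
             macroWER(scores) * 100, microWER(scores) * 100))
// prints "macro 7.9%, micro 10.5%"
```

Under these assumed counts the two aggregates reproduce the reported 7.9% and 10.5%.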

Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert
~7-10ms each).

## Devin Review fixes addressed in this PR

- 🔴 **Partial model load wedged the store**
(`KokoroAneModelStore.loadIfNeeded`) — fixed via local `pendingModels`
accumulator + atomic commit, matching `PocketTtsModelStore`.
- 🐛 **G2P models not downloaded standalone** — `G2PModel.loadIfNeeded`
only reads from `~/.cache/fluidaudio/Models/kokoro/` and never
downloads. The kokoroAne download set didn't include G2P, so first-time
`--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic
`vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to
`getRequiredModelNames(.kokoro, …)` and a new
`KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs
before `G2PModel.shared.ensureModelsAvailable()` in
`KokoroAneManager.initialize()`.
- 🟡 **Voice pack off-by-one (false positive)** — verified upstream
`convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly
matching the existing Swift `phonemeCount - 1`. No change.
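
The row selection discussed in the voice pack item can be sketched like this, assuming the `[510, 256]` pack is stored row-major as a flat `[Float]`. `VoicePackSketch` and its members are hypothetical names, and returning `nil` for an out-of-range phoneme count is an assumption; the actual handling is not described here.

```swift
// Hypothetical stand-in for a flat [510, 256] fp32 voice pack.
struct VoicePackSketch {
    static let rows = 510
    static let cols = 256
    let values: [Float]  // row-major, rows * cols elements

    init?(values: [Float]) {
        guard values.count == Self.rows * Self.cols else { return nil }
        self.values = values
    }

    /// Selects row `phonemeCount - 1`, then splits it into
    /// timbre (cols 0..<128) and style_s (cols 128..<256).
    func vectors(phonemeCount: Int) -> (timbre: ArraySlice<Float>, styleS: ArraySlice<Float>)? {
        let row = phonemeCount - 1
        guard (0..<Self.rows).contains(row) else { return nil }  // assumption: reject, do not clamp
        let start = row * Self.cols
        return (values[start..<(start + 128)], values[(start + 128)..<(start + 256)])
    }
}
```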

## Refactor pass

Internal cleanup applied across the module after the initial
implementation landed:
- `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11
inline `outputShape + outputArray + float16Array` patterns; F0/N shapes
cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in
`outputArray` error reporting.
- `KokoroAneSynthesizer+Conversion`: extracted
`convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers
(eliminates 4× duplicated vImage buffer setup).
- `KokoroAneModelStore`: folded `voicePack(_)` +
`loadVoicePackIfNeeded(_)` into one method; dropped unreachable
post-load guard and dead synthesized-URL throw.
- `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL,
String)` so a malformed top-level JSON object reports parse-failure
instead of file-not-found; removed dead NSNumber bridging fallback.
- `KokoroAneConstants`: dropped unused `defaultLanguage`,
`voicePackTimbreSlice`, `voicePackStyleSSlice`. Changed `defaultSpeed`
from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg
sites).
- `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)` —
`KokoroAneVocab.encode` silently drops unknown chars per the upstream
Python convention.
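
The encoding behaviour described above (BOS/EOS wrap, a 512-token ceiling, unknown characters dropped silently) might look roughly like the sketch below. `VocabSketch` is a hypothetical type; using one shared boundary ID for BOS and EOS, counting the boundaries toward the 512 limit, and returning `nil` on overflow are all assumptions.

```swift
// Hypothetical vocabulary sketch; not the KokoroAneVocab implementation.
struct VocabSketch {
    let table: [Character: Int]  // IPA character -> token ID
    let boundaryID: Int          // assumption: one ID serves as both BOS and EOS
    let maxLength = 512          // assumption: the limit includes BOS and EOS

    /// Maps phonemes to token IDs and wraps them in BOS/EOS.
    func encode(_ phonemes: String) -> [Int]? {
        // compactMap drops characters missing from the table without raising an error.
        let body = phonemes.compactMap { table[$0] }
        let ids = [boundaryID] + body + [boundaryID]
        return ids.count <= maxLength ? ids : nil
    }
}
```

Dropping unknown characters keeps synthesis running on unexpected input, at the cost of hiding G2P problems that an error would have surfaced.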

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAne` — 13 unit tests pass, 5 E2E gated
- [x] With models staged at
`~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`:
- [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne` —
all 18 pass
- [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane
--output /tmp/ane.wav --metrics /tmp/m.json` — produces non-silent audio
+ metrics with WER
- [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt
--output-json /tmp/r.json` — aggregate WER ≤ 0.20

## Models

`FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/`
subdirectory:
```
ANE/KokoroAlbert.mlmodelc       fp16 + int8pal  (CPU+ANE)
ANE/KokoroPostAlbert.mlmodelc   fp16 + int8pal  (CPU+ANE)
ANE/KokoroAlignment.mlmodelc    fp16 + int8pal  (CPU+ANE)
ANE/KokoroProsody.mlmodelc      fp32             (CPU+GPU)
ANE/KokoroNoise.mlmodelc        fp32             (CPU+GPU)
ANE/KokoroVocoder.mlmodelc      fp16 + int8pal   (CPU+ANE)
ANE/KokoroTail.mlmodelc         fp32 + iSTFT     (CPU+GPU)
ANE/vocab.json                  114 IPA tokens
ANE/af_heart.bin                [510, 256] fp32 voice pack
```

G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`,
`g2p_vocab.json`) are pulled from the same repo's root and cached at
`~/.cache/fluidaudio/Models/kokoro/`, shared with the regular
`KokoroTtsManager` backend.

## License

Upstream (laishere) is MIT — carried forward in the mobius PR's LICENSE
file. Used with the upstream author's permission.
2026-04-27 20:08:49 -04:00
Alex 0143bf8dce docs: Complete API reference and update ASR documentation (#498)
## Summary
Completes API documentation with missing components and updates ASR and
TTS documentation to match current capabilities.

## Changes

### Documentation/API.md
- Add table of contents with component links
- Add missing ASR managers:
  - **SlidingWindowAsrManager**: Sliding window ASR with overlap and cancellation
  - **StreamingNemotronAsrManager**: Nemotron streaming ASR with encoder cache
  - **Qwen3AsrManager**: Qwen3-based ASR with Whisper frontend
- Add complete TTS section:
  - **KokoroTtsManager**: TTS with multiple voices (American/British, male/female)
  - **PocketTtsManager**: Lightweight streaming TTS with voice cloning
- Match table of contents order to actual sections

### Documentation/ASR/GettingStarted.md
- Update API from `loadModels()` to `configure(models:)` (current
method)
- Fix all code examples to use correct method signatures

### Documentation/TTS/Kokoro.md
- Remove promotional language ("high-quality")
- Remove "English-only" claim (Kokoro supports other languages, just not
tested yet)

## Result
Complete, accurate API reference with all public managers documented and
current code examples throughout.
2026-04-07 20:34:26 -04:00
Benjamin Lee d68352510c Update diarizer timeline sync and LS-EEND finalization (#421)
## Summary
- add coverage for diarizer timeline synchronization, tentative timeline
compatibility, and Sortformer streaming flush behavior
- move LS-EEND tail-flush finalization into the streaming session so
offline and streaming paths share the same finalize semantics
- update API and diarization docs for explicit `endingOnTime`, timeline
behavior, and finalization details

## Verification
- swift build
- swift test --filter SortformerTimelineTests
- swift test --filter SortformerStreamingIntegrationTests
- swift test --filter
LSEENDIntegrationTests.testDiarizerStreamingFinalizeMatchesProcessComplete
- swift test --filter
LSEENDIntegrationTests.testStreamingSessionMatchesOfflineInferenceOnRealFixtureAudio
- swift test --filter
LSEENDIntegrationTests.testDiarizerProcessEndingOnTimeAlignsVisibleRange
2026-03-25 19:12:06 -04:00
Benjamin Lee 401324de1f Make speakers publicly mutable in DiarizerTimeline (#402)
## Summary
- expose a public setter for DiarizerTimeline.speakers
- keep the existing queue-synchronized access pattern for reads and
writes

## Testing
- not run

---------
2026-03-20 00:41:26 +00:00
Alex 8aa0dfcdac fix: clean up diarization test infrastructure (#395)
## Summary
- Extract shared fixture helpers into `DiarizationTestFixtures` enum,
removing ~200 lines of duplicate code across `LSEENDIntegrationTests`
and `SpeakerEnrollmentTests`
- Replace fragile `Mirror`-based private state inspection with
`internal` `hasActiveSession` property on `LSEENDDiarizerAPI`
- Fix non-deterministic `srand48` seed in `SortformerTests` (use
constant `42` instead of time-based seed)
- Fix asymmetric skip guards in Sortformer enrollment tests (`XCTSkipIf`
instead of `XCTAssertNotNil` for host-dependent segments)
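
Why a constant seed matters can be shown with a small sketch. `makeSamples` is a hypothetical helper, not code from `SortformerTests`; it relies only on the C library's `srand48`/`drand48`, which Foundation exposes.

```swift
import Foundation

/// Generates `count` pseudo-random values after seeding the drand48 generator.
/// With a fixed seed the sequence is identical on every run, so a test failure
/// can be reproduced; a time-based seed would change the input each time.
func makeSamples(count: Int, seed: Int) -> [Double] {
    srand48(seed)
    return (0..<count).map { _ in drand48() }
}

let first = makeSamples(count: 4, seed: 42)
let second = makeSamples(count: 4, seed: 42)
print(first == second)  // prints "true"
```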

## Test plan
- [x] `swift build --build-tests` passes
- [ ] `swift test --filter SortformerTests` passes
- [ ] `swift test --filter LSEENDIntegrationTests` passes
- [ ] `swift test --filter SpeakerEnrollmentTests` passes
2026-03-18 12:51:34 -04:00
Benjamin Lee ba17ebc600 LS-EEND Diarizer (#376)
---

## Add LS-EEND speaker diarization

Sortformer handles up to 4 speakers and works best at 16 kHz in noisy
environments. That leaves a gap for phone calls, large meetings, and
recordings with unknown conditions. LS-EEND fills it: up to 10 speakers
(variant-dependent), trained on telephone, meeting, and in-the-wild
corpora, operating at 8 kHz.

This PR adds LS-EEND as a first-class diarizer alongside Sortformer —
same `Diarizer` protocol, same CLI patterns, same post-processing
pipeline.

### Why these changes are needed

**Unified timeline** — `SortformerTimeline` was Sortformer-specific and
couldn't be shared. LS-EEND needs the same post-processing (threshold,
median filter, onset/offset padding, min-duration filtering, finalized
vs tentative segments). `DiarizerTimeline` replaces `SortformerTimeline`
with a shared implementation that both models use, eliminating
duplicated logic.

**LS-EEND diarizer** — The model was partially wired up but missing a
clean public API, proper `Diarizer` protocol conformance, and
integration with `DiarizerTimeline`. This completes the implementation:
offline file processing with automatic resampling, streaming with
committed + speculative preview frames, and session-level control via
`LSEENDStreamingSession`.

**CLI** — Without `lseend` and `lseend-benchmark`, the model can't be
used or evaluated outside of Swift code. The benchmark also validates
that DER matches the paper's reported numbers before shipping to users.

**AMI ground truth fallback** — `lseend-benchmark --variant ami`
silently produced no results because the benchmark looked for RTTM files
that don't exist in the standard dataset layout. Added the same
`AMIParser` XML annotation fallback that the Sortformer benchmark uses.

**Tests** — `LSEENDRuntimeTests` runs the inference engine, streaming
session, and feature extractor against known-good outputs to catch
regressions in the CoreML pipeline.

**Documentation** — LS-EEND has a substantially different API surface
than Sortformer (five source files, streaming session layer, matrix
type, full evaluation namespace, per-variant speaker caps). Documents
the entire public API and provides a variant selection guide.

### Changes

**`DiarizerTimeline.swift`** (new) — Unified post-processing timeline
shared by both Sortformer and LS-EEND. Replaces
`SortformerTimeline.swift` (deleted). `SortformerDiarizerPipeline`
updated to use it.

**`LSEENDDiarizer.swift`** — `Diarizer` protocol conformance; offline
(`processComplete(audioFileURL:)`) and streaming (`addAudio` / `process`
/ `finalizeSession`) APIs; thread-safe via `NSLock`.

**`LSEENDInference.swift`** — `LSEENDInferenceEngine` (offline,
streaming, simulation) and `LSEENDStreamingSession` (stateful,
frame-in-frame-out with committed + preview outputs).

**`LSEENDFeatureExtraction.swift`** — `LSEENDOfflineFeatureExtractor`
and `LSEENDStreamingFeatureExtractor`; log-mel cumulative mean
normalization and splice-and-subsample.
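
As a rough illustration of cumulative mean normalization, the sketch below subtracts from each frame the mean of all frames seen so far, which is what makes the operation usable in streaming. It is a generic sketch with a hypothetical function name, assuming frames are arrays of equal length; it is not the `LSEENDStreamingFeatureExtractor` implementation.

```swift
/// Subtracts, per feature dimension, the running mean of frames 0...t from frame t.
func cumulativeMeanNormalize(_ frames: [[Float]]) -> [[Float]] {
    guard let dim = frames.first?.count else { return [] }
    var runningSum = [Float](repeating: 0, count: dim)
    var normalized: [[Float]] = []
    normalized.reserveCapacity(frames.count)
    for (t, frame) in frames.enumerated() {
        precondition(frame.count == dim, "all frames must share one dimension")
        for d in 0..<dim { runningSum[d] += frame[d] }
        let seen = Float(t + 1)  // number of frames observed so far
        normalized.append((0..<dim).map { frame[$0] - runningSum[$0] / seen })
    }
    return normalized
}
```

Only past frames contribute to the mean, so the output for a frame never changes once it has been emitted.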

**`LSEENDEvaluation.swift`** — DER computation with collar masking and
optimal speaker assignment (Hungarian); RTTM parsing and writing.

**`LSEENDCommand.swift`**, **`LSEENDBenchmark.swift`** — CLI commands
`lseend` and `lseend-benchmark`, with the same post-processing flags as
the Sortformer equivalents.

**`LSEENDRuntimeTests.swift`** — Integration tests for offline
inference, streaming, session behavior, and feature extraction.

**`Documentation/Diarization/LSEEND.md`** — Full public API reference
and variant selection guide (`.ami` → 4 speakers, `.callhome` → 7,
`.dihard2`/`.dihard3` → 10; DER numbers from the paper).

All tasks from the previous session are complete:

1. **Merge conflict** in `LSEENDRuntimeProbeSupport.swift` — resolved
using the async approach, merged `claude/nice-brattain` into `ls-eend`
2. **RTTM not found bug** in `lseend-benchmark` — fixed with AMI XML
annotation fallback, `public init` on `LSEENDRTTMEntry`, async
`processMeeting`
3. **Documentation** — `Documentation/Diarization/LSEEND.md` with full
public API reference, correct speaker counts (AMI→4, CALLHOME→7,
DIHARD2/3→10)
4. **PR description** — written in chat covering the full `ls-eend`
branch scope


---------
2026-03-17 18:03:54 -04:00
Alex 5d9176eb35 docs: organize Documentation folder structure and fix stale content (#280)
## Summary

Closes #274

- **Reorganized docs into subdirectories**: moved diarization docs →
`Diarization/`, custom vocabulary docs → `ASR/`, eSpeak docs → `TTS/`
- **Added `Models.md`**: comprehensive guide to all CoreML model
pipelines (ASR, VAD, Diarization, TTS) with architecture, performance,
and source links
- **Fixed stale content across 5 files**: removed nonexistent
`compareSpeakers` from API.md, fixed year in Benchmarks.md, fixed
`TtSManager` casing in SSML.md, updated cross-references in
SpeakerManager.md, removed dead MCP link from README
- **Rewrote README.md index** with correct paths for all reorganized
docs
- **Added `.gitignore` patterns** for benchmark artifact files
(`*benchmark*.json`, `.sortformer_progress*.json`)

## Files changed

| Change | File |
|--------|------|
| Moved | `SpeakerDiarization.md` → `Diarization/GettingStarted.md` |
| Moved | `SpeakerManager.md` → `Diarization/SpeakerManager.md` |
| Moved | `Sortformer.md` → `Diarization/Sortformer.md` |
| Moved | `DIARIZATION_INVESTIGATION_REPORT.md` →
`Diarization/InvestigationReport.md` |
| Moved | `CtcCustomVocabulary.md` → `ASR/CustomVocabulary.md` |
| Moved | `CustomPronunciationDictionary.md` →
`ASR/CustomPronunciation.md` |
| Moved | `EspeakFramework.md` → `TTS/EspeakFramework.md` |
| New | `Models.md` — all CoreML model pipelines |
| Updated | `README.md` — new index with correct paths |
| Fixed | `API.md` — removed nonexistent `compareSpeakers` |
| Fixed | `Benchmarks.md` — year 2024 → 2025 |
| Fixed | `TTS/SSML.md` — `TtsManager` → `TtSManager` |
| Fixed | `Diarization/SpeakerManager.md` — cross-references |
| Updated | `.gitignore` — benchmark artifact patterns |

## Test plan

- [x] No code changes — documentation only
- [ ] Verify all internal doc links resolve correctly
- [ ] Review Models.md content for accuracy
2026-01-30 13:42:33 -05:00
Alex 892da4f9a9 Feat: Parakeet EOU streaming ASR with 160ms/320ms chunk support (#216)
- Add Parakeet EOU 120M streaming ASR with End-of-Utterance detection
- Support 160ms and 320ms chunk sizes with automatic HuggingFace model
downloads
- Update benchmarks.md
- Add GitHub Actions CI benchmark workflow for Parakeet EOU

Changes
- StreamingEouAsrManager - streaming pipeline with configurable chunk
sizes
- NeMoMelSpectrogram - native Swift mel spectrogram with vDSP
vectorization
- RnntDecoder - RNN-T greedy decoder with EOU detection
- Configurable EOU debounce (default 1280ms)

---------
2025-12-17 17:18:01 -05:00
Brandon Weng 549f8d1262 Standardize registry override (#175)
### Why is this change needed?

The priority order for `ModelRegistry.baseURL` is:
1. Programmatic override (highest priority): `ModelRegistry.baseURL = "https://custom.com"`
2. `REGISTRY_URL` environment variable: `export REGISTRY_URL=https://custom.com`
3. `MODEL_REGISTRY_URL` environment variable: `export MODEL_REGISTRY_URL=https://custom.com`
4. Default (lowest priority): `https://huggingface.co`
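
The lookup order above can be sketched as a single function. `resolveBaseURL` is a hypothetical helper, not the `ModelRegistry` implementation, and treating empty strings as unset is an assumption.

```swift
import Foundation

/// Resolves the registry base URL: programmatic override first, then
/// REGISTRY_URL, then MODEL_REGISTRY_URL, then the Hugging Face default.
func resolveBaseURL(override: String?, environment: [String: String]) -> String {
    if let override = override, !override.isEmpty { return override }
    for key in ["REGISTRY_URL", "MODEL_REGISTRY_URL"] {
        // Assumption: an empty value counts as "not set".
        if let value = environment[key], !value.isEmpty { return value }
    }
    return "https://huggingface.co"
}

// Typical call site: read the real process environment.
let resolved = resolveBaseURL(override: nil,
                              environment: ProcessInfo.processInfo.environment)
print(resolved)
```

Passing the environment in as a dictionary keeps the function testable without mutating process state.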

The `https_proxy` handling is left in place so existing users are not broken.

Updated the cache key for the GitHub workflows to trigger a re-download.
2025-11-02 11:46:55 -05:00
Brandon Weng a5cecf8278 Make ANE Utils concurrency safe (#172)
### Why is this change needed?

Adds a unit test that reproduces what the user was seeing. The root cause is shared ML arrays in the ANE optimizer when multiple models run at once, or when too many instances are running (a meeting note taker, for instance).

Previously we had applied fixes at the manager level, but the culprit is the underlying ANE optimizer, so the problem manifested in different forms.
2025-10-31 19:27:10 -04:00
Brandon Weng 7fd5ac5446 pyannote community-1 model for offline speaker diarization pipeline (#150)
### Why is this change needed?

The streaming pipeline stays, because VBx and AHC clustering get expensive after about 30 minutes of audio, and running them constantly is costly. It's still possible to support clustering across files, but that is left for another PR.

Pyannote's benchmark is around 11%. I increased the step from 0.1s to 0.2s to double the speed, and selective fp16 lets more operations run on the ANE at the cost of some precision.

```
Average DER: 14.95% | Median DER: 10.89% | Average JER: 39.27% | Median JER: 40.74% (collar=0.25s, ignoreOverlap=True)
Average RTFx: 139.63 (from 232 clips)
Metrics summary saved to: /Users/brandonweng/FluidAudioDatasets/voxconverse/metrics/test_metrics_release.json
Completed. New results: 232, Skipped existing: 0, Total attempted: 232
```

See benchmark.md for more info. Compared to the PyTorch model, we are
100x faster than the CPU version and ~6x faster than the mps
backend on mb pro 4.

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Brandon Weng <BrandonWeng@users.noreply.github.com>
Co-authored-by: Alex <36247722+Alex-Wengg@users.noreply.github.com>
Co-authored-by: Alex-Wengg <hanweng9@gmail.com>
2025-10-22 15:11:57 -04:00
Brandon Weng 0935593bef Fix VAD threshold overriding per segment (#155)
### Why is this change needed?

This follows up on https://github.com/FluidInference/FluidAudio/pull/153; it seemed easier to clean the code up entirely in a separate PR.

Rename the actor-level threshold to "defaultThreshold" and actually
allow overriding it per segment.

Previously the end-of-speech (negative) threshold wasn't being used either.
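
A minimal sketch of the override logic, under stated assumptions: `SegmentOptions` and `ThresholdResolver` are hypothetical names rather than the FluidAudio VAD API, and deriving the negative threshold as the start threshold minus 0.15 (floored at 0.01) is an assumed default, not something this PR states.

```swift
// Hypothetical per-segment options; nil means "use the default".
struct SegmentOptions {
    var threshold: Float? = nil
    var negativeThreshold: Float? = nil
}

// Hypothetical resolver holding the actor-level default.
struct ThresholdResolver {
    let defaultThreshold: Float

    /// Returns the speech-start and speech-end thresholds for one segment.
    func resolve(_ options: SegmentOptions) -> (start: Float, end: Float) {
        let start = options.threshold ?? defaultThreshold
        // Assumption: without an explicit value, derive the end threshold from the start one.
        let end = options.negativeThreshold ?? max(start - 0.15, 0.01)
        return (start, end)
    }
}
```

Keeping the default and the per-segment value in separate places makes it explicit which one wins when both are present.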
2025-10-21 19:45:08 -04:00
Brandon Weng 21a88fdb45 Bring nvidia/parakeet-tdt-0.6b-v2 back (#125)
### Why is this change needed?

About three developers separately asked us to bring support back, so by popular demand it is back. It is better for strictly English use cases.

For English-only transcription, it is still much better than v3 based on
what I've seen, even though the average WER is only 0.4% better.

v3 average WER is 2.6%
```
[01:35:16.894] [INFO] [Benchmark] 2620 files per dataset • Test runtime: 3m 25s • 09/26/2025, 1:35 AM EDT
[01:35:16.894] [INFO] [Benchmark] --- Benchmark Results ---
[01:35:16.894] [INFO] [Benchmark]    Dataset: librispeech test-clean
[01:35:16.894] [INFO] [Benchmark]    Files processed: 2620
[01:35:16.894] [INFO] [Benchmark]    Average WER: 2.2%
[01:35:16.894] [INFO] [Benchmark]    Median WER: 0.0%
[01:35:16.894] [INFO] [Benchmark]    Average CER: 0.7%
[01:35:16.894] [INFO] [Benchmark]    Median RTFx: 125.6x
[01:35:16.894] [INFO] [Benchmark]    Overall RTFx: 141.2x (19452.5s / 137.7s)
[01:35:16.894] [INFO] [Benchmark] Results saved to: asr_benchmark_results.json
[01:35:16.894] [INFO] [Benchmark] ASR benchmark completed successfully
```
2025-09-26 16:00:47 +00:00
Brandon Weng 5ce8b41ae9 silero-vad-v6.0.0 (#108)
### Why is this change needed?

The previous model was based on v5 and had too much custom tuning. This newly
converted model nearly perfectly matches the output of the Python
PyTorch JIT model, with < 0.05% deviation. Quantized models brought no real
improvement, so the base models are more than enough.

<img width="3533" height="1774" alt="image"
src="https://github.com/user-attachments/assets/16da2b62-2f6c-4dfa-9534-9688377e521e"
/>

<img width="4170" height="2365" alt="image"
src="https://github.com/user-attachments/assets/fb90d0a1-4297-4a47-9d6b-5fb87b8ceacd"
/>


<img width="1674" height="1170" alt="bweng-Ghostty-2025-09-15-at-21 33
30"
src="https://github.com/user-attachments/assets/c6c77fc8-1218-495d-967d-f5320d2e4eda"
/>


```text
[21:34:10.201] [INFO] [VAD] RTFx: 1517.8x faster than real-time
[21:34:10.444] [INFO] [VAD] VAD Benchmark Results:
[21:34:10.444] [INFO] [VAD] Accuracy: 90.9%
[21:34:10.444] [INFO] [VAD] Precision: 70.0%
[21:34:10.444] [INFO] [VAD] Recall: 100.0%
[21:34:10.444] [INFO] [VAD] F1-Score: 82.3%
[21:34:10.444] [INFO] [VAD] Total Time: 259.22s
[21:34:10.444] [INFO] [VAD] RTFx: 1517.8x faster than real-time
[21:34:10.444] [INFO] [VAD] Files Processed: 2016
[21:34:10.444] [INFO] [VAD] Avg Time per File: 0.129s
[21:34:10.484] [INFO] [VAD] Results saved to: vad_benchmark_results.json
[21:34:10.484] [INFO] [VAD] EXCELLENT: F1-Score above 70%
```
2025-09-15 22:20:15 -04:00
Brandon Weng a9726e5c2a 10x faster VAD + removing post process logic (#106)
### Why is this change needed?
The post processing logic we had previously was convoluted and hacked
together.

- Remove the normalization
- Remove the post processing of the output from the model
- Refactor the model name and centralize in ModelNames
- Batch processing of the chunks: 100 RTFx --> 1200 RTFx
- Accelerate for array operations helped a lot here too: 100 RTFx --> 220 RTFx
2025-09-14 11:00:08 -04:00
Brandon Weng 245880345a Cleanup AudioConverter (#103) 2025-09-13 12:33:30 -04:00
Brandon Weng 4e8b54ed78 Clean up README (#98)
### Why is this change needed?

The README has grown too large. This splits it up, keeping only the
essentials and moving the verbose details to Documentation.
2025-09-11 10:11:25 -04:00