FluidAudio

mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Author	SHA1	Message	Date
Benjamin Lee	a0092cf163	Fixed LS-EEND Memory Leak + Updated Docs (#605) 1. LS-EEND had a memory leak: the autorelease pool was not releasing the multiarrays properly, and new ones were allocated on every chunk. Switched to backed output arrays to eliminate the new allocations. 2. The LS-EEND docs were somewhat stale. Updated them to reflect the new API.	2026-05-12 08:53:59 -04:00
Benjamin Lee	35f6ba697f	Added Back the Old LS-EEND Constructors (#563 ) I accidentally deleted the old constructor in my last PR. --------- Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-30 17:24:18 -07:00
Alex	7c115f6b4e	feat(tts/kokoro-ane): add laishere 7-stage CoreML chain (ANE-optimized) (#547 ) ## Summary Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the [laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml) 7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise → Vocoder → Tail) behind an actor-based facade, used with the upstream author's permission. Per-stage `MLComputeUnits` assignment routes Albert/PostAlbert/Alignment/Vocoder to ANE; Prosody/Noise/Tail stay on CPU+GPU for fp32/iSTFT-heavy ops. The companion mobius PR for the conversion side: https://github.com/FluidInference/mobius/pull/45 Existing `KokoroTtsManager` (single fp32 model) is untouched. Both backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo — KokoroAne lives under the `ANE/` subdirectory. ## What's added Module: `Sources/FluidAudio/TTS/KokoroAne/` - `KokoroAneManager` — actor facade: `initialize`, `synthesize(text\|phonemes)`, `synthesizeDetailed` - `KokoroAneSynthesizer` — 7-stage orchestration with fp16↔fp32 vImage boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32` helpers so each output is fetched once. - `KokoroAneModelStore` — per-stage MLModel handles + vocab + voice pack cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so partial-load failures stay retryable. 
- `KokoroAneVoicePack` — `[510, 256]` flat fp32 row indexing (timbre cols `[0:128]`, style_s cols `[128:256]`) - `KokoroAneVocab` — IPA → token IDs with BOS/EOS wrap, max 512 - `KokoroAneResourceDownloader` — HF cache management via existing `DownloadUtils`; also downloads the shared kokoro G2P assets on first init (see fix below) - G2P reuses existing `G2PModel.shared` CLI: ```bash fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json] fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json ``` The `tts-asr-verify` batch command synthesizes each phrase, transcribes with Parakeet, and emits per-phrase + macro/micro WER with stage timings. Tests (`Tests/FluidAudioTests/TTS/KokoroAne/`): - 13 unit tests (vocab, voice pack) — no model deps, run on CI - 5 E2E tests (synth + ASR roundtrip) — gated by `FLUIDAUDIO_RUN_KOKOROANE_E2E=1` Docs: - New `Documentation/TTS/KokoroAne.md` — when-to-pick decision table, CLI/Swift quick start, per-stage compute targets, voice pack layout, limits, perf numbers, source links. - Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the ANE-resident variant. - Updated `Documentation/README.md` index, `Documentation/Models.md` TTS table, `Documentation/API.md` reference, `Documentation/CLI.md` example. ## Verified end-to-end on M2 Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation). Warm load: ~300ms. \| Phrase \| Synth \| Audio \| RTFx \| ASR roundtrip \| \|---\|---\|---\|---\|---\| \| Hello world \| 0.47s \| 1.65s \| 3.5× \| "Hello world." (WER 0%) \| \| The quick brown fox… \| 0.32s \| 3.18s \| 9.9× \| dropped "The" (WER 11%) \| \| She had been waiting… \| 0.25s \| 2.80s \| 11.4× \| "Shay" misheard (WER 12.5%) \| Aggregate macro WER 7.9%, micro WER 10.5% — error is ASR-side; TTS audio is intelligible. Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert ~7-10ms each). 
## Devin Review fixes addressed in this PR - 🔴 Partial model load wedged the store (`KokoroAneModelStore.loadIfNeeded`) — fixed via local `pendingModels` accumulator + atomic commit, matching `PocketTtsModelStore`. - 🐛 G2P models not downloaded standalone — `G2PModel.loadIfNeeded` only reads from `~/.cache/fluidaudio/Models/kokoro/` and never downloads. The kokoroAne download set didn't include G2P, so first-time `--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic `vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to `getRequiredModelNames(.kokoro, …)` and a new `KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs before `G2PModel.shared.ensureModelsAvailable()` in `KokoroAneManager.initialize()`. - 🟡 Voice pack off-by-one (false positive) — verified upstream `convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly matching the existing Swift `phonemeCount - 1`. No change. ## Refactor pass Internal cleanup applied across the module after the initial implementation landed: - `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11 inline `outputShape + outputArray + float16Array` patterns; F0/N shapes cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in `outputArray` error reporting. - `KokoroAneSynthesizer+Conversion`: extracted `convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers (eliminates 4× duplicated vImage buffer setup). - `KokoroAneModelStore`: folded `voicePack(_)` + `loadVoicePackIfNeeded(_)` into one method; dropped unreachable post-load guard and dead synthesized-URL throw. - `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL, String)` so a malformed top-level JSON object reports parse-failure instead of file-not-found; removed dead NSNumber bridging fallback. - `KokoroAneConstants`: dropped unused `defaultLanguage`, `voicePackTimbreSlice`, `voicePackStyleSSlice`. 
Changed `defaultSpeed` from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg sites). - `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)` — `KokoroAneVocab.encode` silently drops unknown chars per the upstream Python convention. ## Test plan - [x] `swift build` clean - [x] `swift test --filter KokoroAne` — 13 unit tests pass, 5 E2E gated - [x] With models staged at `~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`: - [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne` — all 18 pass - [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane --output /tmp/ane.wav --metrics /tmp/m.json` — produces non-silent audio + metrics with WER - [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json /tmp/r.json` — aggregate WER ≤ 0.20 ## Models `FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/` subdirectory: ``` ANE/KokoroAlbert.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroPostAlbert.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroAlignment.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroProsody.mlmodelc fp32 (CPU+GPU) ANE/KokoroNoise.mlmodelc fp32 (CPU+GPU) ANE/KokoroVocoder.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroTail.mlmodelc fp32 + iSTFT (CPU+GPU) ANE/vocab.json 114 IPA tokens ANE/af_heart.bin [510, 256] fp32 voice pack ``` G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`, `g2p_vocab.json`) are pulled from the same repo's root and cached at `~/.cache/fluidaudio/Models/kokoro/`, shared with the regular `KokoroTtsManager` backend. ## License Upstream (laishere) is MIT — carried forward in the mobius PR's LICENSE file. Used with the upstream author's permission.	2026-04-27 20:08:49 -04:00
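The voice pack layout described in #547 (a flat `[510, 256]` fp32 buffer, row selected by `phonemeCount - 1`, timbre in columns `[0:128]` and style in `[128:256]`) can be sketched as below. This is only an illustration under those stated dimensions: `VoicePackRow`, `VoicePackError`, and `voicePackRow` are hypothetical names, not the `KokoroAneVoicePack` API.

```swift
import Foundation

// Hypothetical container for one voice-pack row; not a FluidAudio type.
struct VoicePackRow {
    let timbre: [Float]  // columns 0..<128
    let style: [Float]   // columns 128..<256
}

// Hypothetical error type for this sketch.
enum VoicePackError: Error {
    case unexpectedSize
    case phonemeCountOutOfRange
}

// Slices one row out of a flat, row-major [rows, cols] buffer.
func voicePackRow(
    flat: [Float],
    phonemeCount: Int,
    rows: Int = 510,
    cols: Int = 256
) throws -> VoicePackRow {
    guard flat.count == rows * cols else { throw VoicePackError.unexpectedSize }
    // Row index is phonemeCount - 1, mirroring the upstream
    // `voice_pack[len(phonemes) - 1]` quoted in the PR description.
    let row = phonemeCount - 1
    guard (0..<rows).contains(row) else { throw VoicePackError.phonemeCountOutOfRange }
    let base = row * cols
    let half = cols / 2
    return VoicePackRow(
        timbre: Array(flat[base..<(base + half)]),
        style: Array(flat[(base + half)..<(base + cols)])
    )
}
```

Rejecting out-of-range phoneme counts with an error, rather than clamping, is a choice made for this sketch; the PR description does not say how the real code handles that case.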
Alex	0143bf8dce	docs: Complete API reference and update ASR documentation (#498 ) ## Summary Completes API documentation with missing components and updates ASR and TTS documentation to match current capabilities. ## Changes ### Documentation/API.md - Add table of contents with component links - Add missing ASR managers: - SlidingWindowAsrManager: Sliding window ASR with overlap and cancellation - StreamingNemotronAsrManager: Nemotron streaming ASR with encoder cache - Qwen3AsrManager: Qwen3-based ASR with Whisper frontend - Add complete TTS section: - KokoroTtsManager: TTS with multiple voices (American/British, male/female) - PocketTtsManager: Lightweight streaming TTS with voice cloning - Match table of contents order to actual sections ### Documentation/ASR/GettingStarted.md - Update API from `loadModels()` to `configure(models:)` (current method) - Fix all code examples to use correct method signatures ### Documentation/TTS/Kokoro.md - Remove promotional language ("high-quality") - Remove "English-only" claim (Kokoro supports other languages, just not tested yet) ## Result Complete, accurate API reference with all public managers documented and current code examples throughout.	2026-04-07 20:34:26 -04:00
Benjamin Lee	d68352510c	Update diarizer timeline sync and LS-EEND finalization (#421) ## Summary - add coverage for diarizer timeline synchronization, tentative timeline compatibility, and Sortformer streaming flush behavior - move LS-EEND tail-flush finalization into the streaming session so offline and streaming paths share the same finalize semantics - update API and diarization docs for explicit `endingOnTime`, timeline behavior, and finalization details ## Verification - swift build - swift test --filter SortformerTimelineTests - swift test --filter SortformerStreamingIntegrationTests - swift test --filter LSEENDIntegrationTests.testDiarizerStreamingFinalizeMatchesProcessComplete - swift test --filter LSEENDIntegrationTests.testStreamingSessionMatchesOfflineInferenceOnRealFixtureAudio - swift test --filter LSEENDIntegrationTests.testDiarizerProcessEndingOnTimeAlignsVisibleRange	2026-03-25 19:12:06 -04:00
Benjamin Lee	401324de1f	Make speakers publicly mutable in DiarizerTimeline (#402) ## Summary - expose a public setter for DiarizerTimeline.speakers - keep the existing queue-synchronized access pattern for reads and writes ## Testing - not run	2026-03-20 00:41:26 +00:00
Alex	8aa0dfcdac	fix: clean up diarization test infrastructure (#395) ## Summary - Extract shared fixture helpers into `DiarizationTestFixtures` enum, removing ~200 lines of duplicate code across `LSEENDIntegrationTests` and `SpeakerEnrollmentTests` - Replace fragile `Mirror`-based private state inspection with `internal` `hasActiveSession` property on `LSEENDDiarizerAPI` - Fix non-deterministic `srand48` seed in `SortformerTests` (use constant `42` instead of time-based seed) - Fix asymmetric skip guards in Sortformer enrollment tests (`XCTSkipIf` instead of `XCTAssertNotNil` for host-dependent segments) ## Test plan - [x] `swift build --build-tests` passes - [ ] `swift test --filter SortformerTests` passes - [ ] `swift test --filter LSEENDIntegrationTests` passes - [ ] `swift test --filter SpeakerEnrollmentTests` passes	2026-03-18 12:51:34 -04:00
Benjamin Lee	ba17ebc600	LS-EEND Diarizer (#376 ) --- ## Add LS-EEND speaker diarization Sortformer handles up to 4 speakers and works best at 16 kHz in noisy environments. That leaves a gap for phone calls, large meetings, and recordings with unknown conditions. LS-EEND fills it: up to 10 speakers (variant-dependent), trained on telephone, meeting, and in-the-wild corpora, operating at 8 kHz. This PR adds LS-EEND as a first-class diarizer alongside Sortformer — same `Diarizer` protocol, same CLI patterns, same post-processing pipeline. ### Why these changes are needed Unified timeline — `SortformerTimeline` was Sortformer-specific and couldn't be shared. LS-EEND needs the same post-processing (threshold, median filter, onset/offset padding, min-duration filtering, finalized vs tentative segments). `DiarizerTimeline` replaces `SortformerTimeline` with a shared implementation that both models use, eliminating duplicated logic. LS-EEND diarizer — The model was partially wired up but missing a clean public API, proper `Diarizer` protocol conformance, and integration with `DiarizerTimeline`. This completes the implementation: offline file processing with automatic resampling, streaming with committed + speculative preview frames, and session-level control via `LSEENDStreamingSession`. CLI — Without `lseend` and `lseend-benchmark`, the model can't be used or evaluated outside of Swift code. The benchmark also validates that DER matches the paper's reported numbers before shipping to users. AMI ground truth fallback — `lseend-benchmark --variant ami` silently produced no results because the benchmark looked for RTTM files that don't exist in the standard dataset layout. Added the same `AMIParser` XML annotation fallback that the Sortformer benchmark uses. Tests — `LSEENDRuntimeTests` runs the inference engine, streaming session, and feature extractor against known-good outputs to catch regressions in the CoreML pipeline. 
Documentation — LS-EEND has a substantially different API surface than Sortformer (five source files, streaming session layer, matrix type, full evaluation namespace, per-variant speaker caps). Documents the entire public API and provides a variant selection guide. ### Changes `DiarizerTimeline.swift` (new) — Unified post-processing timeline shared by both Sortformer and LS-EEND. Replaces `SortformerTimeline.swift` (deleted). `SortformerDiarizerPipeline` updated to use it. `LSEENDDiarizer.swift` — `Diarizer` protocol conformance; offline (`processComplete(audioFileURL:)`) and streaming (`addAudio` / `process` / `finalizeSession`) APIs; thread-safe via `NSLock`. `LSEENDInference.swift` — `LSEENDInferenceEngine` (offline, streaming, simulation) and `LSEENDStreamingSession` (stateful, frame-in-frame-out with committed + preview outputs). `LSEENDFeatureExtraction.swift` — `LSEENDOfflineFeatureExtractor` and `LSEENDStreamingFeatureExtractor`; log-mel cumulative mean normalization and splice-and-subsample. `LSEENDEvaluation.swift` — DER computation with collar masking and optimal speaker assignment (Hungarian); RTTM parsing and writing. `LSEENDCommand.swift`, `LSEENDBenchmark.swift` — CLI commands `lseend` and `lseend-benchmark`, with the same post-processing flags as the Sortformer equivalents. `LSEENDRuntimeTests.swift` — Integration tests for offline inference, streaming, session behavior, and feature extraction. `Documentation/Diarization/LSEEND.md` — Full public API reference and variant selection guide (`.ami` → 4 speakers, `.callhome` → 7, `.dihard2`/`.dihard3` → 10; DER numbers from the paper). All tasks from the previous session are complete: 1. Merge conflict in `LSEENDRuntimeProbeSupport.swift` — resolved using the async approach, merged `claude/nice-brattain` into `ls-eend` 2. RTTM not found bug in `lseend-benchmark` — fixed with AMI XML annotation fallback, `public init` on `LSEENDRTTMEntry`, async `processMeeting` 3. 
Documentation: `Documentation/Diarization/LSEEND.md` with full public API reference, correct speaker counts (AMI→4, CALLHOME→7, DIHARD2/3→10) 4. PR description: written in chat covering the full `ls-eend` branch scope Everything is committed to the `ls-eend` branch at `/Users/benjaminlee/Documents/FluidAudio`.	2026-03-17 18:03:54 -04:00
Alex	5d9176eb35	docs: organize Documentation folder structure and fix stale content (#280 ) ## Summary Closes #274 - Reorganized docs into subdirectories: moved diarization docs → `Diarization/`, custom vocabulary docs → `ASR/`, eSpeak docs → `TTS/` - Added `Models.md`: comprehensive guide to all CoreML model pipelines (ASR, VAD, Diarization, TTS) with architecture, performance, and source links - Fixed stale content across 5 files: removed nonexistent `compareSpeakers` from API.md, fixed year in Benchmarks.md, fixed `TtSManager` casing in SSML.md, updated cross-references in SpeakerManager.md, removed dead MCP link from README - Rewrote README.md index with correct paths for all reorganized docs - Added `.gitignore` patterns for benchmark artifact files (`benchmark.json`, `.sortformer_progress*.json`) ## Files changed \| Change \| File \| \|--------\|------\| \| Moved \| `SpeakerDiarization.md` → `Diarization/GettingStarted.md` \| \| Moved \| `SpeakerManager.md` → `Diarization/SpeakerManager.md` \| \| Moved \| `Sortformer.md` → `Diarization/Sortformer.md` \| \| Moved \| `DIARIZATION_INVESTIGATION_REPORT.md` → `Diarization/InvestigationReport.md` \| \| Moved \| `CtcCustomVocabulary.md` → `ASR/CustomVocabulary.md` \| \| Moved \| `CustomPronunciationDictionary.md` → `ASR/CustomPronunciation.md` \| \| Moved \| `EspeakFramework.md` → `TTS/EspeakFramework.md` \| \| New \| `Models.md` — all CoreML model pipelines \| \| Updated \| `README.md` — new index with correct paths \| \| Fixed \| `API.md` — removed nonexistent `compareSpeakers` \| \| Fixed \| `Benchmarks.md` — year 2024 → 2025 \| \| Fixed \| `TTS/SSML.md` — `TtsManager` → `TtSManager` \| \| Fixed \| `Diarization/SpeakerManager.md` — cross-references \| \| Updated \| `.gitignore` — benchmark artifact patterns \| ## Test plan - [x] No code changes — documentation only - [ ] Verify all internal doc links resolve correctly - [ ] Review Models.md content for accuracy	2026-01-30 13:42:33 -05:00
Alex	892da4f9a9	Feat: Parakeet EOU streaming ASR with 160ms/320ms chunk support (#216 ) - Add Parakeet EOU 120M streaming ASR with End-of-Utterance detection - Support 160ms and 320ms chunk sizes with automatic HuggingFace model downloads - benchmarks.md - Add GitHub Actions CI benchmark workflow for Parakeet EOU Changes - StreamingEouAsrManager - streaming pipeline with configurable chunk sizes - NeMoMelSpectrogram - native Swift mel spectrogram with vDSP vectorization - RnntDecoder - RNN-T greedy decoder with EOU detection - Configurable EOU debounce (default 1280ms) ---------	2025-12-17 17:18:01 -05:00
Brandon Weng	549f8d1262	Standardize registry override (#175) ### Why is this change needed? The priority order for ModelRegistry.baseURL is: 1. Programmatic override (highest priority): ModelRegistry.baseURL = "https://custom.com" 2. REGISTRY_URL environment variable: export REGISTRY_URL=https://custom.com 3. MODEL_REGISTRY_URL environment variable: export MODEL_REGISTRY_URL=https://custom.com 4. Default (lowest priority): https://huggingface.co. The https_proxy is left around so as not to break existing users. Updated the caching key for the GitHub workflows to trigger a redownload.	2025-11-02 11:46:55 -05:00
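The priority order listed in #175 can be sketched as a small resolver. This is an illustration only, not the actual `ModelRegistry` implementation: `resolveRegistryBaseURL` is a hypothetical function, and treating an empty string as "unset" is an assumption made here.

```swift
import Foundation

// Hypothetical resolver mirroring the documented priority order:
// programmatic override, then REGISTRY_URL, then MODEL_REGISTRY_URL, then default.
func resolveRegistryBaseURL(
    programmaticOverride: String?,
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> String {
    // 1. Programmatic override wins (assumption: empty strings count as unset).
    if let override = programmaticOverride, !override.isEmpty {
        return override
    }
    // 2 and 3. Environment variables, checked in priority order.
    for key in ["REGISTRY_URL", "MODEL_REGISTRY_URL"] {
        if let value = environment[key], !value.isEmpty {
            return value
        }
    }
    // 4. Default registry.
    return "https://huggingface.co"
}
```

Passing the environment as a parameter keeps the function easy to exercise in tests without mutating the process environment.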
Brandon Weng	a5cecf8278	Make ANE Utils concurrency safe (#172) ### Why is this change needed? Includes a unit test to reproduce what the user was seeing. The issue comes from shared ML arrays in the ANE optimizer when multiple models run at once, or when too many instances are running (a meeting note taker, for instance). Previously we had applied fixes at the manager level, but the culprit is the underlying ANE optimizer, so the problem manifested in different forms.	2025-10-31 19:27:10 -04:00
Brandon Weng	7fd5ac5446	pyannote community-1 model for offline speaker diarization pipeline (#150) ### Why is this change needed? Keeping the streaming one around, as the VBx and AHC clustering gets pretty expensive after 30 minutes of audio and running it constantly gets expensive. It's still possible to support clustering between files, but that is saved for another PR. Pyannote's benchmark is around 11%. I increased the step to 0.2s instead of 0.1s to double the speed; selective fp16 also lets more operations run on the ANE, at the cost of some precision. ``` Average DER: 14.95% \| Median DER: 10.89% \| Average JER: 39.27% \| Median JER: 40.74% (collar=0.25s, ignoreOverlap=True) Average RTFx: 139.63 (from 232 clips) Metrics summary saved to: /Users/brandonweng/FluidAudioDatasets/voxconverse/metrics/test_metrics_release.json Completed. New results: 232, Skipped existing: 0, Total attempted: 232 ``` See benchmark.md for more info, but compared to the PyTorch model, we are 100x faster than the CPU version and ~6x faster than the mps backend on mb pro 4 --------- Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Co-authored-by: Brandon Weng <BrandonWeng@users.noreply.github.com> Co-authored-by: Alex <36247722+Alex-Wengg@users.noreply.github.com> Co-authored-by: Alex-Wengg <hanweng9@gmail.com>	2025-10-22 15:11:57 -04:00
Brandon Weng	0935593bef	Fix VAD threshold overriding per segment (#155) ### Why is this change needed? Follows up on https://github.com/FluidInference/FluidAudio/pull/153; I thought it would be easier to just clean it up entirely. Renames the actor-level threshold to "defaultThreshold" and actually allows overriding it per segment. Previously the end-of-speech (negative) threshold wasn't being used either.	2025-10-21 19:45:08 -04:00
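The behavior described in #155 (a default threshold that a per-segment setting can override, plus a lower end-of-speech threshold) can be sketched as a simple hysteresis pass over frame probabilities. Everything here is hypothetical and is not the FluidAudio VAD API: `VadSegmentOptions` and `speechRanges` are made-up names, and deriving the negative threshold as `threshold - 0.15` when none is given is an assumption chosen for illustration.

```swift
import Foundation

// Hypothetical per-segment overrides; nil means "use the default".
struct VadSegmentOptions {
    var threshold: Float? = nil
    var negativeThreshold: Float? = nil
}

// Returns half-open frame ranges classified as speech.
func speechRanges(
    probabilities: [Float],
    defaultThreshold: Float,
    options: VadSegmentOptions = VadSegmentOptions()
) -> [Range<Int>] {
    // A per-segment override takes precedence over the default threshold.
    let startThreshold = options.threshold ?? defaultThreshold
    // Assumption: fall back to a slightly lower end-of-speech threshold.
    let endThreshold = options.negativeThreshold ?? max(startThreshold - 0.15, 0.01)

    var ranges: [Range<Int>] = []
    var openStart: Int? = nil
    for (index, probability) in probabilities.enumerated() {
        if openStart == nil, probability >= startThreshold {
            openStart = index                 // speech begins
        } else if let start = openStart, probability < endThreshold {
            ranges.append(start..<index)      // speech ends below the negative threshold
            openStart = nil
        }
    }
    // Close a segment that is still open at the end of the input.
    if let start = openStart {
        ranges.append(start..<probabilities.count)
    }
    return ranges
}
```

Using two thresholds keeps a segment open through brief dips in probability, which is the usual reason for having a separate end-of-speech value.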
Brandon Weng	a5cecf8278	Bring nvidia/parakeet-tdt-0.6b-v2 back (#125) ### Why is this change needed? We had ~3 developers separately ask to bring support back, so by popular demand it is back. For English-only transcription, it is still much better than v3 based on what I've seen, even though the average WER is only 0.4% better. v3 average WER is 2.6%. ``` [01:35:16.894] [INFO] [Benchmark] 2620 files per dataset • Test runtime: 3m 25s • 09/26/2025, 1:35 AM EDT [01:35:16.894] [INFO] [Benchmark] --- Benchmark Results --- [01:35:16.894] [INFO] [Benchmark] Dataset: librispeech test-clean [01:35:16.894] [INFO] [Benchmark] Files processed: 2620 [01:35:16.894] [INFO] [Benchmark] Average WER: 2.2% [01:35:16.894] [INFO] [Benchmark] Median WER: 0.0% [01:35:16.894] [INFO] [Benchmark] Average CER: 0.7% [01:35:16.894] [INFO] [Benchmark] Median RTFx: 125.6x [01:35:16.894] [INFO] [Benchmark] Overall RTFx: 141.2x (19452.5s / 137.7s) [01:35:16.894] [INFO] [Benchmark] Results saved to: asr_benchmark_results.json [01:35:16.894] [INFO] [Benchmark] ASR benchmark completed successfully ```	2025-09-26 16:00:47 +00:00
Brandon Weng	5ce8b41ae9	silero-vad-v6.0.0 (#108) ### Why is this change needed? The previous model was based on v5 and had too much custom tuning. This new converted model nearly perfectly matches the output of the Python PyTorch JIT model, with < 0.05% deviation. Quantized models showed no real improvement, so there is no need for them; the base models are more than enough. ```text [21:34:10.201] [INFO] [VAD] RTFx: 1517.8x faster than real-time [21:34:10.444] [INFO] [VAD] VAD Benchmark Results: [21:34:10.444] [INFO] [VAD] Accuracy: 90.9% [21:34:10.444] [INFO] [VAD] Precision: 70.0% [21:34:10.444] [INFO] [VAD] Recall: 100.0% [21:34:10.444] [INFO] [VAD] F1-Score: 82.3% [21:34:10.444] [INFO] [VAD] Total Time: 259.22s [21:34:10.444] [INFO] [VAD] RTFx: 1517.8x faster than real-time [21:34:10.444] [INFO] [VAD] Files Processed: 2016 [21:34:10.444] [INFO] [VAD] Avg Time per File: 0.129s [21:34:10.484] [INFO] [VAD] Results saved to: vad_benchmark_results.json [21:34:10.484] [INFO] [VAD] EXCELLENT: F1-Score above 70% ```	2025-09-15 22:20:15 -04:00
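The F1-Score in the #108 benchmark log is the harmonic mean of precision and recall. A minimal sketch of that formula follows; `f1Score` is a hypothetical helper written for this note, not the benchmark's own code.

```swift
import Foundation

// Harmonic mean of precision and recall; returns 0 when both are 0.
func f1Score(precision: Double, recall: Double) -> Double {
    let sum = precision + recall
    guard sum > 0 else { return 0 }
    return 2 * precision * recall / sum
}

// Inputs taken from the log above (70.0% precision, 100.0% recall).
let f1 = f1Score(precision: 0.70, recall: 1.0)
print(String(format: "F1 = %.3f", f1))
```

With the rounded inputs from the log this comes out at roughly 0.82, in line with the reported 82.3%; an exact match is not expected because the logged precision and recall are themselves rounded.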
Brandon Weng	a9726e5c2a	10x faster VAD + removing post process logic (#106) ### Why is this change needed? The post-processing logic we had previously was convoluted and hacked together. - Remove the normalization - Remove the post-processing of the model output - Refactor the model name and centralize it in ModelNames - Batch processing of the chunks: 100 RTFx --> 1200 RTFx - Accelerate for array operations helped a ton here too: 100 RTFx -> 220 RTFx	2025-09-14 11:00:08 -04:00
Brandon Weng	245880345a	Cleanup AudioConverter (#103)	2025-09-13 12:33:30 -04:00
Brandon Weng	4e8b54ed78	Clean up README (#98) ### Why is this change needed? The README has grown way too much; splitting it up to preserve the essence and move the verbose details to Documentation.	2025-09-11 10:11:25 -04:00

19 Commits