FluidAudio

mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Author	SHA1	Message	Date
Alex	2ea0727541	ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512 ) (#515 ) Fixes #512. ## TL;DR Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google kropka com" as Cyrillic (`Впиш Гугл к ком.`) because the joint decoder's top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This PR adds an opt-in script filter: when a caller passes `language: .polish` (or any other language with a declared script), the decoder rejects top-1 if it's the wrong script and walks top-K to the highest-probability candidate matching the expected script. - Opt-in: `language:` defaults to `nil` — zero behavior change for existing callers. - No acoustic-model changes — this is purely a decoder-side post-processing step over the joint logits. - Requires `JointDecisionv3.mlmodelc` (exposes top-K outputs). Auto-downloaded from HuggingFace alongside the other v3 files; falls back to standard argmax when absent. ## Empirical validation — reporter's own audio Samples pulled via `gdown --folder <link-from-issue-#512-comment>` from @tajchert's Drive folder. `JointDecisionv3.mlmodelc` is loaded in both columns — this isolates the Swift filter as the mechanism, not a model swap. \| sample \| ground truth \| `language: nil` (current) \| `language: .polish` (this PR) \| \|---\|---\|---\|---\| \| pl \| Wpisz Google kropka com \| Впиш Гугл к ком. \| Wpis Google.com. \| \| pl2 \| Wpisz Google kropka com \| Впиш Гугл крокаком. \| Wpish Google, Com. \| \| pl3 \| Wpisz Google kropka com \| Впишь куглькрабком. \| VP Kugl.com. \| \| pl4 \| Wpisz Google kropka com \| Впиш гугл к ком. \| Wpish gugl c. \| \| pl5 \| Wpisz Google kropka com \| Впиш гугл кракаком. \| Wpish Google Croca kom. \| \| pl6 \| Wpisz Google kropka com \| Впиш, гугл крокаком. \| Wpish, Google, Com. \| \| pl_complex \| Cały spichlarz jest ze spiżu \| Cały spichlarz jest ze spiżu. \| Cały spichlarz jest ze spiżu. \| 6/6 short samples flip Cyrillic → Latin. 
`pl_complex` was never broken (long context → high joint confidence → no drift) and is unchanged. ## Scope & limitations (important — please don't overclaim) This PR fixes the script the tokens are drawn from. It does NOT fix per-word acoustic accuracy. \| \| `language: nil` \| `language: .polish` \| \|---\|---\|---\| \| Script correct (Latin, not Cyrillic) \| ✗ \| ✓ (6/6) \| \| Word spelling matches ground truth \| ✗ \| ✗ (still 6/7 wrong on short) \| The residual errors — `Wpisz` → `Wpish`/`Wpis`, `kropka` → `Croca` / dropped — are Parakeet TDT v3 acoustic weaknesses on short Polish commands. No amount of output post-processing can turn `Wpish` into `Wpisz`; that needs better acoustic modeling, a Polish LM rescorer, or more training data. Out of scope here. What users actually get by merging: - Output is visually Polish (Latin script), not pseudo-Russian — works with locale-aware post-processing, spell-check, and UI rendering - Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin substitution - Opt-in; zero risk for callers who don't pass `language:` What users do not get: - Higher word accuracy on short Polish/Slavic Latin utterances - Support for languages outside the `Language` enum (Greek needs its own script range; Maltese, Hungarian, Turkish, Baltic characters fit the Latin Unicode ranges but aren't exposed; easy follow-up) - A meaningful FLEURS WER delta — see [Documentation/fleurs-script-filtering-comparison.md](./Documentation/fleurs-script-filtering-comparison.md); full sentences aren't in the failure regime ## Implementation ### New - `Sources/FluidAudio/Shared/ScriptDetection.swift` (new, +112) - `public enum Language` — 13 Latin (en, es, fr, de, it, pt, ro, pl, cs, sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr) - `public enum Script { case latin, cyrillic }` - `matches(_:script:)` over Unicode ranges: ASCII (0x20–0x7F), Latin-1 (0xA0–0xFF), Latin Extended-A (0x100–0x17F), Latin Extended-B (0x180–0x24F — Romanian ș/ț), Latin Extended Additional 
(0x1E00–0x1EFF — Vietnamese), Cyrillic (0x400–0x4FF). Strips SentencePiece boundary marker U+2581 before checking. - `filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) -> (tokenId, probability)?` — returns the highest-probability top-K candidate matching the target script; probability via softmax over the top-K subset with the max-logit stability trick; guarded against top-K array length mismatch. ### Changed - `TdtJointDecision` — optional `topKIds` / `topKLogits` fields (populated by JointDecisionv3 only) - `TdtDecoderV3` — script filter runs only when top-1 is already wrong script; both decode sites feed `filtered.probability` (a real [0,1]) into `TdtDurationMapping.clampProbability`, not raw logits - `AsrManager.transcribe(...)` — `language: Language? = nil` plumbed through all three overloads: `[Float]`, `URL`, `AVAudioPCMBuffer` - `AsrModels` + `ModelNames` — `requiredModelsV3` set includes `JointDecisionv3.mlmodelc` so the download utility fetches it on fresh installs and also backfills it for existing users on next `.v3` load - CLI — `fluidaudiocli transcribe <file> --language {en\|pl\|cs\|sk\|sl\|hr\|bs\|ro\|es\|fr\|de\|it\|pt\|ru\|uk\|be\|bg\|sr}` ### How to try it ```bash swift run -c release fluidaudiocli transcribe sample.wav --language pl ``` ## Model dependency `JointDecisionv3.mlmodelc` must be present in `FluidInference/parakeet-tdt-0.6b-v3-coreml` on HuggingFace. It exposes `top_k_ids` / `top_k_logits` outputs (K=64 in our export) alongside the standard argmax. When absent, `AsrModels` falls back to `JointDecision.mlmodelc` and the script filter becomes a no-op — backward compatible. Cache-upgrade verified: removed `JointDecisionv3.mlmodelc` from a populated cache, re-ran `--language pl`; the file was auto-fetched and Polish output was Latin. Existing users pick up the fix on next `.v3` load without manual intervention. 
## Review notes / risky bits - Softmax over top-K subset, not the full vocab — probabilities won't exactly match a true full-softmax, but K=64 captures ~all the mass when the model is anywhere near confident. If you prefer, we can expose the raw top-K logits to callers and let them compute confidence however they want. - Top-1 escape hatch: filter is only triggered when top-1 fails `matches(_, script:)`. When top-1 is already correct, nothing is changed — so we can't regress the common case. - Length-mismatch guard in `filterTopK` uses `min(topKIds.count, topKLogits.count)`. If CoreML output arrays ever diverge, we iterate the common prefix instead of crashing. - Latin Extended-B (0x0180–0x024F) was added specifically so Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional (0x1E00–0x1EFF) was added for free — helps Vietnamese should anyone want it later. ## Tests - `ScriptDetectionTests` — 37 tests: Unicode range coverage (Latin-1 / Extended-A / Extended-B / Extended Additional / Cyrillic), SentencePiece boundary-marker stripping, `filterTopK` happy path, length-mismatch guard, probability-range invariant, Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script rejection - Build clean; `swift format lint` clean on all touched files - A/B end-to-end run against reporter's actual Polish audio (table above) ## Checklist - [x] Builds clean (`swift build`, `swift build -c release`) - [x] `swift format lint` clean on touched files - [x] `ScriptDetectionTests` 37/37 pass - [x] A/B reproduction on #512 reporter's audio - [x] Cache-upgrade path verified (JointDecisionv3 auto-fetched on existing caches) - [x] CLI accepts all 18 language codes end-to-end - [ ] CI green ## Follow-ups (not blocking) - Expose more Latin languages in the enum (Hungarian, Turkish, Baltic, Maltese) — all character ranges already supported, just need enum cases - Add `Script.greek` for `el_gr` (separate Unicode range) - Short-utterance benchmark dataset (FLEURS is the 
wrong tool — it's all long sentences where drift doesn't happen) - Optional: publish a Polish LM rescorer to address the underlying acoustic-accuracy issue the script filter cannot fix	2026-04-23 17:43:09 -04:00
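The top-K script filter described in the commit above can be sketched roughly as follows. This is a minimal illustration built only from the description in the message, not the code in `ScriptDetection.swift`: the parameter types, the `[Int: String]` vocabulary shape, and the decision to treat Cyrillic as the single 0x400–0x4FF block (with no shared punctuation) are all assumptions made for the example.

```swift
import Foundation

// Hypothetical stand-in for the PR's `Script` enum.
enum Script { case latin, cyrillic }

/// Returns true when every scalar of `token`, after dropping the
/// SentencePiece boundary marker U+2581, lies in the ranges for `script`.
/// Assumption: Cyrillic is modelled as one block only.
func matches(_ token: String, script: Script) -> Bool {
    let latin: [ClosedRange<UInt32>] = [
        0x20...0x7F, 0xA0...0xFF, 0x100...0x17F, 0x180...0x24F, 0x1E00...0x1EFF,
    ]
    let cyrillic: [ClosedRange<UInt32>] = [0x400...0x4FF]
    let ranges = script == .latin ? latin : cyrillic
    let scalars = token.unicodeScalars.filter { $0.value != 0x2581 }
    return scalars.allSatisfy { scalar in
        ranges.contains { $0.contains(scalar.value) }
    }
}

/// Picks the highest-probability top-K candidate written in `preferredScript`.
/// The probability is a softmax over the top-K subset only, computed with the
/// max-logit subtraction for numerical stability.
func filterTopK(
    topKIds: [Int],
    topKLogits: [Float],
    vocabulary: [Int: String],
    preferredScript: Script
) -> (tokenId: Int, probability: Float)? {
    // Iterate the common prefix if the two arrays ever differ in length.
    let count = min(topKIds.count, topKLogits.count)
    guard count > 0, let maxLogit = topKLogits.prefix(count).max() else { return nil }

    let exps = topKLogits.prefix(count).map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)

    var best: (tokenId: Int, probability: Float)? = nil
    for index in 0..<count {
        guard let text = vocabulary[topKIds[index]],
              matches(text, script: preferredScript) else { continue }
        let probability = exps[index] / total
        if let current = best, current.probability >= probability { continue }
        best = (tokenId: topKIds[index], probability: probability)
    }
    return best
}
```

Under these assumptions a decoder would call `filterTopK` only when its top-1 token fails `matches(_:script:)`, which keeps the common case untouched, as the review notes describe.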
Alex	7c9be31c05	fix(benchmark): repair 3 pre-existing script/download bugs (#534 ) ## Summary Three unrelated pre-existing bugs surfaced while validating PR #515. All of them block `Scripts/parakeet_subset_benchmark.sh --download` from succeeding, but none are related to the v3 script-filtering work. Consolidating into one PR since each fix is ~1–3 lines. ### 1. Japanese TDT folder-name mismatch `Scripts/parakeet_subset_benchmark.sh` verifies the Japanese TDT model at `$MODELS_DIR/parakeet-tdt-ja/`, but the folder was renamed to `parakeet-ja` in `4ef33f0b6` (`Repo.parakeetJa.folderName = "parakeet-ja"`). Result: `verify_assets()` always reported missing assets even on a fully provisioned machine. One-line rename to match. ### 2. EOU streaming CLI writes to wrong path `ParakeetEouCommand` had a default / `--use-cache` split where the default branch produced `$CWD/Models/<chunk>/<chunk>/` (double-nested, relative to CWD) as the load path, while `downloadModels()` called `deletingLastPathComponent().deletingLastPathComponent()` then `DownloadUtils.downloadRepo(repo, to:)` which appended `folderName = "parakeet-eou-streaming/<chunk>"`. Net effect: files landed at `$CWD/Models/parakeet-eou-streaming/<chunk>/` while `loadModels()` looked at `$CWD/Models/<chunk>/<chunk>/` — model load failed silently. Unified on Application Support (matches every other CoreML model in FluidAudio). `--use-cache` retained as a no-op flag for backward compatibility. ### 3. earnings22-kws dataset 404 HuggingFace consolidated `argmaxinc/earnings22-kws-golden` into `argmaxinc/contextual-earnings22`. The old id now returns 404 from the Datasets-Server REST API (no redirect follow). The new dataset has the same feature schema (`audio`, `file_id`, `text`, `dictionary`, ...), so swapping the id is sufficient — no downstream consumer changes needed. 
## Test plan Ran `Scripts/parakeet_subset_benchmark.sh --download` end-to-end: - [x] `verify_assets` correctly resolves `parakeet-ja/` (all 5 expected files present) - [x] EOU warmup: `Models downloaded to ~/Library/Application Support/FluidAudio/Models/parakeet-eou-streaming/320ms`, 0.00% WER on warmup file - [x] earnings22-kws: 1140+ files downloaded (was 0 before), no 404 - [x] `swift build` passes Out of scope but observed (pre-existing, unrelated): - `ctc-earnings-benchmark --auto-download` does not actually auto-download CTC-110m model - THCHS-30 dataset hit HF IP rate limit (429) — transient	2026-04-21 04:22:18 -04:00
Alex	b789a56609	Fix Japanese TDT model download filename mismatch (#522 ) Fixes the infinite re-download loop for Japanese TDT models reported in #521. ## Problem The `download()` function was using hardcoded `Names.decoderFile` and `Names.jointFile` for all model versions. For `.tdtJa`, this downloaded: - `Decoder.mlmodelc` - `JointDecision.mlmodelc` But `modelsExist()` checks for version-specific filenames: - `Decoderv2.mlmodelc` - `Jointerv2.mlmodelc` This mismatch caused the existence check to fail, triggering cache purge and re-download in an infinite loop. ## Solution Use `getModelFileNames(version)` in the download function to get the correct filenames for each version, matching what `modelsExist()` expects. ## Testing - [x] Build passes - [x] Filenames now match between download and existence check	2026-04-20 17:56:10 -04:00
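The fix in the commit above comes down to having the download path and the existence check read filenames from the same place. A minimal sketch of that idea follows. The `ModelVersion` cases, the `ModelFileNames` struct, and both helper signatures are hypothetical stand-ins; only the four `.mlmodelc` filenames come from the commit message.

```swift
import Foundation

// Hypothetical version enum; `.other` stands in for every non-Japanese version.
enum ModelVersion { case other, tdtJa }

// Hypothetical container for the per-version filenames.
struct ModelFileNames {
    let decoder: String
    let joint: String
}

/// Single source of truth for filenames (assumed shape of `getModelFileNames`).
func getModelFileNames(_ version: ModelVersion) -> ModelFileNames {
    switch version {
    case .other:
        return ModelFileNames(decoder: "Decoder.mlmodelc", joint: "JointDecision.mlmodelc")
    case .tdtJa:
        return ModelFileNames(decoder: "Decoderv2.mlmodelc", joint: "Jointerv2.mlmodelc")
    }
}

/// Names the download step should fetch for `version`.
func filesToDownload(for version: ModelVersion) -> [String] {
    let names = getModelFileNames(version)
    return [names.decoder, names.joint]
}

/// Existence check over a set of filenames already present in the cache.
func modelsExist(version: ModelVersion, cachedFiles: Set<String>) -> Bool {
    let names = getModelFileNames(version)
    return cachedFiles.contains(names.decoder) && cachedFiles.contains(names.joint)
}
```

Because both functions derive their names from `getModelFileNames`, whatever the download step fetches is what the check looks for, so the purge-and-re-download loop cannot start from a naming mismatch.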
Alex	2593f55415	Add Japanese ASR support with JSUT and Common Voice datasets (#478 ) ## Summary Adds comprehensive Japanese ASR support to FluidAudio with benchmark datasets and CLI commands. ## Changes ### Core Japanese ASR Support - CtcJaManager.swift - Japanese CTC transcription manager (actor-based) - CtcJaModels.swift - Japanese model loading and management - ModelNames.swift - Added Japanese model registry (`parakeetCtcJa`, `CTCJa` enum) - AsrModels.swift - Added `.ctcJa` model version (3,072 vocab, 1,024 hidden, blank_id=3072) - AsrManager.swift - Added `.ctcJa` case with error directing to `CtcJaManager` ### CLI Commands - JapaneseAsrBenchmark.swift (459 lines) - New `ja-benchmark` command - JSUT basic5000 dataset support - Mozilla Common Voice (MCV) test set support - Auto-download capability - CER (Character Error Rate) evaluation - DownloadCommand.swift - Added JSUT and MCV Japanese dataset downloads - TranscribeCommand.swift - Added `.ctcJa` model version support - AsrBenchmark.swift - Added `.ctcJa` switch case ### Dataset Support - JapaneseDatasetDownloader.swift (387 lines) - Dataset download and parsing - JSUT basic5000 (5,000 sentences, clean studio recordings) - Mozilla Common Voice Japanese test split - Efficient streaming downloads - Metadata extraction and validation ## Usage ### CLI Commands ```bash # Benchmark on JSUT basic5000 (100 samples) swift run fluidaudiocli ja-benchmark --dataset jsut --samples 100 # Benchmark on Common Voice test (500 samples, auto-download) swift run fluidaudiocli ja-benchmark --dataset cv-test --samples 500 --auto-download # Download datasets swift run fluidaudiocli download --dataset jsut swift run fluidaudiocli download --dataset cv-ja-test ``` ### Swift API ```swift // Load and use Japanese CTC transcription let manager = try await CtcJaManager.load() let text = try manager.transcribe(audioURL: japaneseAudioFile) ``` ## Model Info - Repo: `FluidInference/parakeet-ctc-0.6b-ja-coreml` - Architecture: 600M 
parameter CTC-only - Vocabulary: 3,072 Japanese SentencePiece tokens + 1 blank (id: 3072) - Encoder: 1,024 hidden size - Expected CER: 6.5% on JSUT basic5000, 13.3% on MCV 16.1 test ## Testing - ✅ Builds successfully (`swift build`) - ✅ Model loading integration tested - ✅ CLI commands compile and link correctly - ⏳ Runtime benchmark testing pending (requires model download) ## Related - Mobius PR #39: Japanese CTC CoreML conversion (https://github.com/FluidInference/mobius/pull/39) 🤖 Generated with Claude Code	2026-04-04 12:57:32 -04:00
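The commit above reports results as CER (Character Error Rate). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: character-level edit distance divided by the reference length. The function name is hypothetical, and whatever text normalization `ja-benchmark` applies before scoring is not stated in the message and is not modelled here.

```swift
import Foundation

/// Character error rate: Levenshtein distance between the two strings,
/// measured in Swift `Character`s, divided by the reference length.
/// Hypothetical helper; not the benchmark's own implementation.
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference)
    let hyp = Array(hypothesis)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }

    // previous[j] holds the distance between ref[..<i] and hyp[..<j].
    var previous = Array(0...hyp.count)
    for i in 0..<ref.count {
        var current = [i + 1]
        current.reserveCapacity(hyp.count + 1)
        for j in 0..<hyp.count {
            let substitution = previous[j] + (ref[i] == hyp[j] ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current.append(min(substitution, deletion, insertion))
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}
```

With one substituted character in a five-character reference, this returns 0.2, that is, a CER of 20%.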
Alex	d9eef864d2	ASR tech debt cleanup: remove dead code, fix bugs, add benchmark script 28/03/2026 (#460 ) ## Summary Systematic cleanup of the ASR module addressing tech debt items from #457. Net reduction of ~430 lines while fixing real bugs and improving maintainability. ### Bug fixes - `enableFP16` silently ignored — `optimizedConfiguration(enableFP16:)` delegated to a shared factory that hardcoded `allowLowPrecisionAccumulationOnGPU = true`, ignoring the caller's parameter - `MLArrayCache.returnArray` only reset float32 data — cached arrays of other types (float16, int32) retained stale data from previous use - CTC model auto-detection broken — `Repo.parakeetCtc110m.folderName` returned `"parakeet-ctc-110m"` instead of `"parakeet-ctc-110m-coreml"` because the `folderName` switch fell through to a `default` case that stripped the `-coreml` suffix. Same for `parakeetCtc06b`. - Duplicate tokens at chunk merge boundary — `mergeByMidpoint` used `<=`/`>=` so tokens exactly at the cutoff appeared in both left and right chunks ### Dead code removal - Deleted `ANEOptimizer` indirection layer (166 lines) — was a pass-through wrapping `MLModel` with no optimization - Deleted `PerformanceMonitor` actor and `AggregatedMetrics` — never instantiated, component times hardcoded to 0 - Deleted `getFloat16Array` from MLArrayCache — never called - Deleted `sliceEncoderOutput` from AsrTranscription — never called (30 lines) - Deleted `loadWithANEOptimization` from AsrModels — never called - Removed unused `tokenTimings` parameter chain through `processTranscriptionResult` - Removed unused `import OSLog` / `import CoreML` across 5 files - Removed `nonisolated(unsafe)` from SlidingWindowAsrManager (types already Sendable) ### Duplication elimination - Extracted `clearCachedCtcData()` helper (replaced 3× triple-nil assignments) - Extracted `decoderState(for:)` / `setDecoderState(_:for:)` (replaced 4× switch blocks) - Extracted `frameAlignedAudio()` (replaced 2× duplicated 
frame-alignment blocks) - Added `ASRConstants.secondsPerEncoderFrame` (replaced 5× magic `0.08`) - Replaced hardcoded `16_000` with `config.sampleRate` / `ASRConstants.sampleRate` - Extracted `MLModelConfigurationUtils.defaultConfiguration()` (replaced 5× copy-pasted config methods) - Extracted `MLModelConfigurationUtils.defaultModelsDirectory()` (replaced 3× copy-pasted directory methods) - Consolidated duplicate `vocabularyFile` / `vocabularyFileArray` constants ### File organization - Moved `PerformanceMetrics.swift`, `ProgressEmitter.swift`, `MLArrayCache.swift` from `ASR/Parakeet/` to `Shared/` (used by multiple modules) - Renamed `StreamingAudioSourceFactory` → `AudioSourceFactory`, `StreamingAudioSampleSource` → `AudioSampleSource` (types used by both ASR and Diarizer) - Renamed files to match type names: `SortformerDiarizerPipeline.swift` → `SortformerDiarizer.swift`, `LSEENDDiarizerAPI.swift` → `LSEENDDiarizer.swift`, `NemotronPipeline.swift` → `NemotronStreamingAsrManager+Pipeline.swift` - Replaced force unwraps in `RnntDecoder.swift` with `guard let` + descriptive errors - Removed stale TODO about decoder state in AsrManager ### Benchmark script - Added `Scripts/run_parakeet_benchmarks.sh` — runs all 6 benchmarks (v3, v2, TDT-CTC-110M, CTC earnings, EOU 320ms, Nemotron 1120ms) with WER comparison against `benchmarks100.md` baselines and regression detection - Referenced from `Documentation/ASR/benchmarks100.md` ## Verified — no regressions ``` Model Baseline Current Delta Parakeet TDT v3 (0.6B) 2.6% 2.64% +0.04% Parakeet TDT v2 (0.6B) 3.8% 3.79% -0.01% CTC-TDT 110M 3.6% 3.56% -0.04% CTC Earnings 16.54% 16.51% -0.03% EOU 320ms (120M) 7.11% 7.11% +0.00% Nemotron 1120ms (0.6B) 1.99% 1.99% +0.00% ``` ## Test plan - [x] `swift build` passes - [x] `swift test` passes (all existing tests, updated for removed dead code) - [x] All 6 ASR benchmarks match baselines (100 files each) - [ ] `swift format lint` passes	2026-03-28 23:44:10 -04:00
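The chunk-merge bug listed under "Bug fixes" in the commit above (tokens exactly at the cutoff kept by both sides because of `<=`/`>=`) can be illustrated with a small sketch. The `TimedToken` type and this signature are hypothetical, and which side the real `mergeByMidpoint` made strict is not stated in the message. The sketch arbitrarily gives boundary tokens to the right chunk.

```swift
import Foundation

// Hypothetical token type carrying only what the example needs.
struct TimedToken: Equatable {
    let id: Int
    let time: Double
}

/// Merges two overlapping chunks at the midpoint of their overlap.
/// The left chunk keeps tokens strictly before the cutoff and the right
/// chunk keeps tokens at or after it, so the two sides partition the
/// timeline and a token sitting exactly on the cutoff is emitted once.
func mergeByMidpoint(
    left: [TimedToken],
    right: [TimedToken],
    overlapStart: Double,
    overlapEnd: Double
) -> [TimedToken] {
    let cutoff = (overlapStart + overlapEnd) / 2
    let keptLeft = left.filter { $0.time < cutoff }     // strict: excludes the cutoff
    let keptRight = right.filter { $0.time >= cutoff }  // inclusive: owns the cutoff
    return keptLeft + keptRight
}
```

With `<=` on the left and `>=` on the right, a token at exactly the cutoff time would pass both filters and appear twice in the merged output. Making one comparison strict removes the duplicate.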
Alex	8aa0dfcdac	fix: clean up diarization test infrastructure (#395 ) ## Summary - Extract shared fixture helpers into `DiarizationTestFixtures` enum, removing ~200 lines of duplicate code across `LSEENDIntegrationTests` and `SpeakerEnrollmentTests` - Replace fragile `Mirror`-based private state inspection with `internal` `hasActiveSession` property on `LSEENDDiarizerAPI` - Fix non-deterministic `srand48` seed in `SortformerTests` (use constant `42` instead of time-based seed) - Fix asymmetric skip guards in Sortformer enrollment tests (`XCTSkipIf` instead of `XCTAssertNotNil` for host-dependent segments) ## Test plan - [x] `swift build --build-tests` passes - [ ] `swift test --filter SortformerTests` passes - [ ] `swift test --filter LSEENDIntegrationTests` passes - [ ] `swift test --filter SpeakerEnrollmentTests` passes	2026-03-18 12:51:34 -04:00
Alex	7d074e1ee6	chore: consolidate Python scripts into Scripts/ (#344 ) ## Summary - Move `Benchmarks/nemo` to `Scripts/nemo_ami_benchmark` - Move `Tools/voice_cloning` to `Scripts/voice_cloning` - Remove now-empty `Benchmarks/` and `Tools/` top-level directories Consolidates standalone Python utilities into a single `Scripts/` directory to reduce top-level clutter. ## Test plan - [x] Verify files moved correctly (no content changes)	2026-03-04 12:46:03 -05:00

7 Commits