7 Commits

Author SHA1 Message Date
Alex 2ea0727541 ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512) (#515)
Fixes #512.

## TL;DR

Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google
kropka com" as Cyrillic (`Впиш Гугл к ком.`) because the joint decoder's
top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This
PR adds an **opt-in** script filter: when a caller passes `language:
.polish` (or any other language with a declared script), the decoder
rejects top-1 if it's the wrong script and walks top-K to the
highest-probability candidate matching the expected script.

- **Opt-in**: `language:` defaults to `nil` — zero behavior change for
existing callers.
- **No acoustic-model changes** — this is purely a decoder-side
post-processing step over the joint logits.
- **Requires `JointDecisionv3.mlmodelc`** (exposes top-K outputs).
Auto-downloaded from HuggingFace alongside the other v3 files; falls
back to standard argmax when absent.

## Empirical validation — reporter's own audio

Samples pulled via `gdown --folder <link-from-issue-#512-comment>` from
@tajchert's Drive folder. **`JointDecisionv3.mlmodelc` is loaded in both
columns** — this isolates the Swift filter as the mechanism, not a model
swap.

| sample | ground truth | `language: nil` (current) | `language: .polish` (this PR) |
|---|---|---|---|
| pl | Wpisz Google kropka com | **Впиш Гугл к ком.** | Wpis Google.com. |
| pl2 | Wpisz Google kropka com | **Впиш Гугл крокаком.** | Wpish Google, Com. |
| pl3 | Wpisz Google kropka com | **Впишь куглькрабком.** | VP Kugl.com. |
| pl4 | Wpisz Google kropka com | **Впиш гугл к ком.** | Wpish gugl c. |
| pl5 | Wpisz Google kropka com | **Впиш гугл кракаком.** | Wpish Google Croca kom. |
| pl6 | Wpisz Google kropka com | **Впиш, гугл крокаком.** | Wpish, Google, Com. |
| pl_complex | Cały spichlarz jest ze spiżu | Cały spichlarz jest ze spiżu. | Cały spichlarz jest ze spiżu. |

**6/6 short samples flip Cyrillic → Latin.** `pl_complex` was never
broken (long context → high joint confidence → no drift) and is
unchanged.

## Scope & limitations (important — please don't overclaim)

**This PR fixes the *script* the tokens are drawn from. It does NOT fix
per-word acoustic accuracy.**

| | `language: nil` | `language: .polish` |
|---|---|---|
| Script correct (Latin, not Cyrillic) | ✗ | ✓ (6/6) |
| Word spelling matches ground truth | ✗ | ✗ (6/7 samples still wrong, all of them short) |

The residual errors — `Wpisz` → `Wpish`/`Wpis`, `kropka` → `Croca` /
dropped — are **Parakeet TDT v3 acoustic weaknesses on short Polish
commands**. No amount of output post-processing can turn `Wpish` into
`Wpisz`; that needs better acoustic modeling, a Polish LM rescorer, or
more training data. Out of scope here.

What users actually get by merging:

- Output is visually Polish (Latin script), not pseudo-Russian — works
with locale-aware post-processing, spell-check, and UI rendering
- Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin
substitution
- Opt-in; zero risk for callers who don't pass `language:`

What users do **not** get:

- Higher word accuracy on short Polish/Slavic Latin utterances
- Support for languages outside the `Language` enum (Maltese,
Hungarian, Turkish, Baltic: their characters fit the Latin Unicode
ranges but aren't exposed, an easy follow-up; Greek needs its own
script range)
- A meaningful FLEURS WER delta — see
[Documentation/fleurs-script-filtering-comparison.md](./Documentation/fleurs-script-filtering-comparison.md);
full sentences aren't in the failure regime

## Implementation

### New
- `Sources/FluidAudio/Shared/ScriptDetection.swift` (new, +112)
  - `public enum Language`: 13 Latin (en, es, fr, de, it, pt, ro, pl, cs,
    sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr)
  - `public enum Script { case latin, cyrillic }`
  - `matches(_:script:)` over Unicode ranges: ASCII (0x20–0x7F), Latin-1
    (0xA0–0xFF), Latin Extended-A (0x100–0x17F), **Latin Extended-B
    (0x180–0x24F, covers Romanian ș/ț)**, **Latin Extended Additional
    (0x1E00–0x1EFF, covers Vietnamese)**, Cyrillic (0x400–0x4FF). Strips
    SentencePiece boundary marker U+2581 before checking.
  - `filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) ->
    (tokenId, probability)?`: returns the highest-probability top-K
    candidate matching the target script; probability via **softmax over the
    top-K subset** with the max-logit stability trick; guarded against top-K
    array length mismatch.
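
A minimal sketch of the two helpers described above, to make the filtering step concrete. It is not the code in `ScriptDetection.swift`: the Unicode ranges are the ones listed, but the function bodies, the dictionary-based `vocabulary`, the choice to accept only 0x400–0x4FF for Cyrillic, and the sample tokens are all assumptions made for illustration.

```swift
import Foundation

enum Script { case latin, cyrillic }

/// True when every scalar of `token` falls inside a range allowed for `script`.
/// The SentencePiece boundary marker U+2581 is stripped before checking.
func matches(_ token: String, script: Script) -> Bool {
    let latin: [ClosedRange<UInt32>] = [
        0x20...0x7F, 0xA0...0xFF, 0x100...0x17F, 0x180...0x24F, 0x1E00...0x1EFF,
    ]
    let cyrillic: [ClosedRange<UInt32>] = [0x400...0x4FF]  // assumption: no ASCII here
    let allowed = script == .latin ? latin : cyrillic
    let scalars = token.unicodeScalars.filter { $0.value != 0x2581 }
    guard !scalars.isEmpty else { return true }  // assumption: a bare marker passes
    return scalars.allSatisfy { s in allowed.contains { $0.contains(s.value) } }
}

/// Highest-probability top-K candidate whose token matches `preferredScript`.
func filterTopK(
    topKIds: [Int], topKLogits: [Float],
    vocabulary: [Int: String], preferredScript: Script
) -> (tokenId: Int, probability: Float)? {
    // Length-mismatch guard: only the common prefix is visited.
    let count = min(topKIds.count, topKLogits.count)
    guard count > 0, let maxLogit = topKLogits[..<count].max() else { return nil }

    // Softmax over the top-K subset, shifted by the max logit for stability.
    let exps = topKLogits[..<count].map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)

    var best: (tokenId: Int, probability: Float)? = nil
    for i in 0..<count {
        guard let token = vocabulary[topKIds[i]],
              matches(token, script: preferredScript) else { continue }
        let p = exps[i] / sum
        if let current = best, current.probability >= p { continue }
        best = (tokenId: topKIds[i], probability: p)
    }
    return best
}

// Hypothetical three-entry top-K where the top-1 token is Cyrillic.
let vocab = [0: "▁Впиш", 1: "▁Wpis", 2: "▁Г"]
let pick = filterTopK(
    topKIds: [0, 1, 2], topKLogits: [2.0, 1.5, 0.1],
    vocabulary: vocab, preferredScript: .latin)
print(pick?.tokenId ?? -1, pick?.probability ?? 0)
```

In this made-up case the Latin candidate at index 1 is selected, and its probability is normalized over the three top-K entries rather than over a full vocabulary.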

### Changed
- `TdtJointDecision` — optional `topKIds` / `topKLogits` fields
(populated by JointDecisionv3 only)
- `TdtDecoderV3` — script filter runs **only when top-1 is already wrong
script**; both decode sites feed `filtered.probability` (a real [0,1])
into `TdtDurationMapping.clampProbability`, not raw logits
- `AsrManager.transcribe(...)` — `language: Language? = nil` plumbed
through all three overloads: `[Float]`, `URL`, `AVAudioPCMBuffer`
- `AsrModels` + `ModelNames` — `requiredModelsV3` set includes
`JointDecisionv3.mlmodelc` so the download utility fetches it on fresh
installs and also backfills it for existing users on next `.v3` load
- CLI — `fluidaudiocli transcribe <file> --language
{en|pl|cs|sk|sl|hr|bs|ro|es|fr|de|it|pt|ru|uk|be|bg|sr}`

### How to try it

```bash
swift run -c release fluidaudiocli transcribe sample.wav --language pl
```

## Model dependency

`JointDecisionv3.mlmodelc` must be present in
`FluidInference/parakeet-tdt-0.6b-v3-coreml` on HuggingFace. It exposes
`top_k_ids` / `top_k_logits` outputs (K=64 in our export) alongside the
standard argmax. When absent, `AsrModels` falls back to
`JointDecision.mlmodelc` and the script filter becomes a no-op —
backward compatible.

**Cache-upgrade verified**: removed `JointDecisionv3.mlmodelc` from a
populated cache, re-ran `--language pl`; the file was auto-fetched and
Polish output was Latin. Existing users pick up the fix on next `.v3`
load without manual intervention.

## Review notes / risky bits

- **Softmax over top-K subset, not the full vocab** — probabilities
won't exactly match a true full-softmax, but K=64 captures ~all the mass
when the model is anywhere near confident. If you prefer, we can expose
the raw top-K logits to callers and let them compute confidence however
they want.
- **Top-1 escape hatch**: filter is only triggered when top-1 fails
`matches(_, script:)`. When top-1 is already correct, nothing is changed
— so we can't regress the common case.
- **Length-mismatch guard** in `filterTopK` uses `min(topKIds.count,
topKLogits.count)`. If CoreML output arrays ever diverge, we iterate the
common prefix instead of crashing.
- **Latin Extended-B (0x0180–0x024F)** was added specifically so
Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional
(0x1E00–0x1EFF) was added for free — helps Vietnamese should anyone want
it later.
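
To illustrate the first review note, here is a small sketch comparing a softmax restricted to the top 64 logits with one over the full vocabulary. The logits are synthetic and chosen only to resemble a confident decoding step; nothing here is taken from the model or from `filterTopK`.

```swift
import Foundation

/// Probability assigned to the largest logit by a softmax over `logits`.
func topProbability(_ logits: [Float]) -> Float {
    guard let maxLogit = logits.max() else { return 0 }
    // Shift by the max logit so the largest term is exp(0) == 1.
    let sum = logits.reduce(Float(0)) { $0 + exp($1 - maxLogit) }
    return 1 / sum
}

// Synthetic logits: one confident token, 63 runners-up, and a long low tail.
var fullVocab: [Float] = [10.0]
fullVocab += [Float](repeating: 2.0, count: 63)
fullVocab += [Float](repeating: -5.0, count: 936)

// Keep only the 64 largest logits, as a top-K export would.
let topK = Array(fullVocab.sorted(by: >).prefix(64))

let fullP = topProbability(fullVocab)
let topKP = topProbability(topK)
print("full softmax:", fullP, "top-K softmax:", topKP)
```

The top-K value can never be lower than the full-softmax value, because the denominator drops the non-negative tail terms. The gap only becomes noticeable when a lot of probability mass sits outside the top K.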

## Tests

- `ScriptDetectionTests` — **37 tests**: Unicode range coverage (Latin-1
/ Extended-A / Extended-B / Extended Additional / Cyrillic),
SentencePiece boundary-marker stripping, `filterTopK` happy path,
length-mismatch guard, probability-range invariant,
Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script
rejection
- Build clean; `swift format lint` clean on all touched files
- A/B end-to-end run against reporter's actual Polish audio (table
above)

## Checklist

- [x] Builds clean (`swift build`, `swift build -c release`)
- [x] `swift format lint` clean on touched files
- [x] `ScriptDetectionTests` 37/37 pass
- [x] A/B reproduction on #512 reporter's audio
- [x] Cache-upgrade path verified (JointDecisionv3 auto-fetched on
existing caches)
- [x] CLI accepts all 18 language codes end-to-end
- [ ] CI green

## Follow-ups (not blocking)

- Expose more Latin languages in the enum (Hungarian, Turkish, Baltic,
Maltese) — all character ranges already supported, just need enum cases
- Add `Script.greek` for `el_gr` (separate Unicode range)
- Short-utterance benchmark dataset (FLEURS is the wrong tool — it's all
long sentences where drift doesn't happen)
- Optional: publish a Polish LM rescorer to address the underlying
acoustic-accuracy issue the script filter cannot fix

---------
2026-04-23 17:43:09 -04:00
Alex 7c9be31c05 fix(benchmark): repair 3 pre-existing script/download bugs (#534)
## Summary

Three unrelated pre-existing bugs surfaced while validating PR #515. All
of them block `Scripts/parakeet_subset_benchmark.sh --download` from
succeeding, but none are related to the v3 script-filtering work.
Consolidating into one PR since each fix is ~1–3 lines.

### 1. Japanese TDT folder-name mismatch

`Scripts/parakeet_subset_benchmark.sh` verifies the Japanese TDT model
at `$MODELS_DIR/parakeet-tdt-ja/`, but the folder was renamed to
`parakeet-ja` in 4ef33f0b6 (`Repo.parakeetJa.folderName =
"parakeet-ja"`). Result: `verify_assets()` always reported missing
assets even on a fully provisioned machine. One-line rename to match.

### 2. EOU streaming CLI writes to wrong path

`ParakeetEouCommand` split path handling between a default branch and a
`--use-cache` branch, and the default branch disagreed with itself:

- The load path was `$CWD/Models/<chunk>/<chunk>/` (double-nested,
relative to CWD).
- `downloadModels()` called
`deletingLastPathComponent().deletingLastPathComponent()` and then
`DownloadUtils.downloadRepo(repo, to:)`, which appended `folderName =
"parakeet-eou-streaming/<chunk>"`.

Net effect: files landed at
`$CWD/Models/parakeet-eou-streaming/<chunk>/` while `loadModels()`
looked at `$CWD/Models/<chunk>/<chunk>/`, so the model load failed
silently.

Unified on Application Support (matches every other CoreML model in
FluidAudio). `--use-cache` retained as a no-op flag for backward
compatibility.
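
The path mismatch can be reproduced with plain `URL` arithmetic. This is a sketch under assumptions: `/work` stands in for `$CWD`, `320ms` for `<chunk>`, and the variable names are illustrative rather than taken from `ParakeetEouCommand`.

```swift
import Foundation

// Hypothetical working directory and chunk name, for illustration only.
let cwd = URL(fileURLWithPath: "/work", isDirectory: true)
let chunk = "320ms"

// Load path as the old default branch built it: double-nested under Models/.
let loadPath = cwd
    .appendingPathComponent("Models", isDirectory: true)
    .appendingPathComponent(chunk, isDirectory: true)
    .appendingPathComponent(chunk, isDirectory: true)

// Download side: strip two components, then append the repo folder name.
let downloadRoot = loadPath
    .deletingLastPathComponent()
    .deletingLastPathComponent()
let downloadedTo = downloadRoot
    .appendingPathComponent("parakeet-eou-streaming/\(chunk)", isDirectory: true)

print("load path:    ", loadPath.path)
print("downloaded to:", downloadedTo.path)
print("same location:", loadPath.path == downloadedTo.path)
```

The two printed paths differ, which is why a load that only looks at the first location never finds the files written to the second.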

### 3. earnings22-kws dataset 404

HuggingFace consolidated `argmaxinc/earnings22-kws-golden` into
`argmaxinc/contextual-earnings22`. The old id now returns 404 from the
Datasets-Server REST API (no redirect follow). The new dataset has the
same feature schema (`audio`, `file_id`, `text`, `dictionary`, ...), so
swapping the id is sufficient — no downstream consumer changes needed.

## Test plan

Ran `Scripts/parakeet_subset_benchmark.sh --download` end-to-end:

- [x] `verify_assets` correctly resolves `parakeet-ja/` (all 5 expected
files present)
- [x] EOU warmup: `Models downloaded to ~/Library/Application
Support/FluidAudio/Models/parakeet-eou-streaming/320ms`, 0.00% WER on
warmup file
- [x] earnings22-kws: 1140+ files downloaded (was 0 before), no 404
- [x] `swift build` passes

Out of scope but observed (pre-existing, unrelated):
- `ctc-earnings-benchmark --auto-download` does not actually
auto-download CTC-110m model
- THCHS-30 dataset hit HF IP rate limit (429) — transient
2026-04-21 04:22:18 -04:00
Alex b789a56609 Fix Japanese TDT model download filename mismatch (#522)
Fixes the infinite re-download loop for Japanese TDT models reported in
#521.

## Problem
The `download()` function was using hardcoded `Names.decoderFile` and
`Names.jointFile` for all model versions. For `.tdtJa`, this downloaded:
- `Decoder.mlmodelc` 
- `JointDecision.mlmodelc`

But `modelsExist()` checks for version-specific filenames:
- `Decoderv2.mlmodelc`
- `Jointerv2.mlmodelc`

This mismatch caused the existence check to fail, triggering cache purge
and re-download in an infinite loop.

## Solution
Use `getModelFileNames(version)` in the download function to get the
correct filenames for each version, matching what `modelsExist()`
expects.
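
The idea behind the fix, sketched with stand-in types: both the download list and the existence check derive from one filename lookup, so they cannot drift apart. The `ModelVersion` cases, the `ModelFileNames` struct, and the helper signatures are hypothetical; only the four filenames come from the description above.

```swift
// Hypothetical subset of model versions, for illustration only.
enum ModelVersion { case standard, tdtJa }

struct ModelFileNames {
    let decoder: String
    let joint: String
}

/// Single source of truth for per-version filenames.
func getModelFileNames(_ version: ModelVersion) -> ModelFileNames {
    switch version {
    case .standard:
        return ModelFileNames(decoder: "Decoder.mlmodelc", joint: "JointDecision.mlmodelc")
    case .tdtJa:
        return ModelFileNames(decoder: "Decoderv2.mlmodelc", joint: "Jointerv2.mlmodelc")
    }
}

/// Files the downloader fetches for `version`.
func filesToDownload(_ version: ModelVersion) -> Set<String> {
    let names = getModelFileNames(version)
    return [names.decoder, names.joint]
}

/// Existence check against the set of files already on disk.
func modelsExist(_ version: ModelVersion, onDisk: Set<String>) -> Bool {
    let names = getModelFileNames(version)
    return onDisk.contains(names.decoder) && onDisk.contains(names.joint)
}

// Old behavior: hardcoded names were downloaded regardless of version.
let hardcoded: Set<String> = ["Decoder.mlmodelc", "JointDecision.mlmodelc"]
print(modelsExist(.tdtJa, onDisk: hardcoded))                // check fails, triggers re-download
print(modelsExist(.tdtJa, onDisk: filesToDownload(.tdtJa)))  // check passes
```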

## Testing
- [x] Build passes
- [x] Filenames now match between download and existence check

---------
2026-04-20 17:56:10 -04:00
Alex 2593f55415 Add Japanese ASR support with JSUT and Common Voice datasets (#478)
## Summary

Adds comprehensive Japanese ASR support to FluidAudio with benchmark
datasets and CLI commands.

## Changes

### Core Japanese ASR Support
- **CtcJaManager.swift** - Japanese CTC transcription manager
(actor-based)
- **CtcJaModels.swift** - Japanese model loading and management
- **ModelNames.swift** - Added Japanese model registry (`parakeetCtcJa`,
`CTCJa` enum)
- **AsrModels.swift** - Added `.ctcJa` model version (3,072 vocab, 1,024
hidden, blank_id=3072)
- **AsrManager.swift** - Added `.ctcJa` case with error directing to
`CtcJaManager`

### CLI Commands
- **JapaneseAsrBenchmark.swift** (459 lines) - New `ja-benchmark`
command
  - JSUT basic5000 dataset support
  - Mozilla Common Voice (MCV) test set support
  - Auto-download capability
  - CER (Character Error Rate) evaluation
- **DownloadCommand.swift** - Added JSUT and MCV Japanese dataset
downloads
- **TranscribeCommand.swift** - Added `.ctcJa` model version support
- **AsrBenchmark.swift** - Added `.ctcJa` switch case

### Dataset Support
- **JapaneseDatasetDownloader.swift** (387 lines) - Dataset download and
parsing
  - JSUT basic5000 (5,000 sentences, clean studio recordings)
  - Mozilla Common Voice Japanese test split
  - Efficient streaming downloads
  - Metadata extraction and validation

## Usage

### CLI Commands
```bash
# Benchmark on JSUT basic5000 (100 samples)
swift run fluidaudiocli ja-benchmark --dataset jsut --samples 100

# Benchmark on Common Voice test (500 samples, auto-download)
swift run fluidaudiocli ja-benchmark --dataset cv-test --samples 500 --auto-download

# Download datasets
swift run fluidaudiocli download --dataset jsut
swift run fluidaudiocli download --dataset cv-ja-test
```

### Swift API
```swift
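// Load and use Japanese CTC transcription.
// CtcJaManager is actor-based, so calls into it are awaited.
// `japaneseAudioFile` is a URL to an audio file supplied by the caller.
let manager = try await CtcJaManager.load()
let text = try await manager.transcribe(audioURL: japaneseAudioFile)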
```

## Model Info
- **Repo**: `FluidInference/parakeet-ctc-0.6b-ja-coreml`
- **Architecture**: 600M parameter CTC-only
- **Vocabulary**: 3,072 Japanese SentencePiece tokens + 1 blank (id:
3072)
- **Encoder**: 1,024 hidden size
- **Expected CER**: 6.5% on JSUT basic5000, 13.3% on MCV 16.1 test

## Testing
- Builds successfully (`swift build`)
- Model loading integration tested
- CLI commands compile and link correctly
- Runtime benchmark testing pending (requires model download)

## Related
- Mobius PR #39: Japanese CTC CoreML conversion
(https://github.com/FluidInference/mobius/pull/39)

🤖 Generated with Claude Code

---------
2026-04-04 12:57:32 -04:00
Alex d9eef864d2 ASR tech debt cleanup: remove dead code, fix bugs, add benchmark script 28/03/2026 (#460)
## Summary

Systematic cleanup of the ASR module addressing tech debt items from
#457. Net reduction of ~430 lines while fixing real bugs and improving
maintainability.

### Bug fixes
- **`enableFP16` silently ignored** —
`optimizedConfiguration(enableFP16:)` delegated to a shared factory that
hardcoded `allowLowPrecisionAccumulationOnGPU = true`, ignoring the
caller's parameter
- **`MLArrayCache.returnArray` only reset float32 data** — cached arrays
of other types (float16, int32) retained stale data from previous use
- **CTC model auto-detection broken** —
`Repo.parakeetCtc110m.folderName` returned `"parakeet-ctc-110m"` instead
of `"parakeet-ctc-110m-coreml"` because the `folderName` switch fell
through to a `default` case that stripped the `-coreml` suffix. Same for
`parakeetCtc06b`.
- **Duplicate tokens at chunk merge boundary** — `mergeByMidpoint` used
`<=`/`>=` so tokens exactly at the cutoff appeared in both left and
right chunks
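
The chunk-boundary bug in the last bullet is easy to see in a reduced form. The sketch below is not the project's `mergeByMidpoint`: the `TimedToken` type and the signature are assumptions, and which side ends up owning the cutoff in the real fix is not stated above (here the right chunk does).

```swift
struct TimedToken: Equatable {
    let text: String
    let time: Double
}

/// Merge two overlapping chunks at `cutoff`.
/// Exactly one side owns the boundary, so a token at the cutoff is kept once.
func mergeByMidpoint(
    left: [TimedToken], right: [TimedToken], cutoff: Double
) -> [TimedToken] {
    let keptLeft = left.filter { $0.time < cutoff }     // strict: excludes the cutoff
    let keptRight = right.filter { $0.time >= cutoff }  // inclusive: owns the cutoff
    return keptLeft + keptRight
}

// Both chunks decoded the same token exactly at the 2.0 s cutoff.
let leftChunk = [TimedToken(text: "hello", time: 1.0), TimedToken(text: "world", time: 2.0)]
let rightChunk = [TimedToken(text: "world", time: 2.0), TimedToken(text: "again", time: 3.0)]

let merged = mergeByMidpoint(left: leftChunk, right: rightChunk, cutoff: 2.0)
print(merged.map(\.text))
```

With `<=` on the left and `>=` on the right, the token at 2.0 s would survive both filters and appear twice in the merged output.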

### Dead code removal
- Deleted `ANEOptimizer` indirection layer (166 lines) — was a
pass-through wrapping `MLModel` with no optimization
- Deleted `PerformanceMonitor` actor and `AggregatedMetrics` — never
instantiated, component times hardcoded to 0
- Deleted `getFloat16Array` from MLArrayCache — never called
- Deleted `sliceEncoderOutput` from AsrTranscription — never called (30
lines)
- Deleted `loadWithANEOptimization` from AsrModels — never called
- Removed unused `tokenTimings` parameter chain through
`processTranscriptionResult`
- Removed unused `import OSLog` / `import CoreML` across 5 files
- Removed `nonisolated(unsafe)` from SlidingWindowAsrManager (types
already Sendable)

### Duplication elimination
- Extracted `clearCachedCtcData()` helper (replaced 3× triple-nil
assignments)
- Extracted `decoderState(for:)` / `setDecoderState(_:for:)` (replaced
4× switch blocks)
- Extracted `frameAlignedAudio()` (replaced 2× duplicated
frame-alignment blocks)
- Added `ASRConstants.secondsPerEncoderFrame` (replaced 5× magic `0.08`)
- Replaced hardcoded `16_000` with `config.sampleRate` /
`ASRConstants.sampleRate`
- Extracted `MLModelConfigurationUtils.defaultConfiguration()` (replaced
5× copy-pasted config methods)
- Extracted `MLModelConfigurationUtils.defaultModelsDirectory()`
(replaced 3× copy-pasted directory methods)
- Consolidated duplicate `vocabularyFile` / `vocabularyFileArray`
constants

### File organization
- Moved `PerformanceMetrics.swift`, `ProgressEmitter.swift`,
`MLArrayCache.swift` from `ASR/Parakeet/` to `Shared/` (used by multiple
modules)
- Renamed `StreamingAudioSourceFactory` → `AudioSourceFactory`,
`StreamingAudioSampleSource` → `AudioSampleSource` (types used by both
ASR and Diarizer)
- Renamed files to match type names: `SortformerDiarizerPipeline.swift`
→ `SortformerDiarizer.swift`, `LSEENDDiarizerAPI.swift` →
`LSEENDDiarizer.swift`, `NemotronPipeline.swift` →
`NemotronStreamingAsrManager+Pipeline.swift`
- Replaced force unwraps in `RnntDecoder.swift` with `guard let` +
descriptive errors
- Removed stale TODO about decoder state in AsrManager

### Benchmark script
- Added `Scripts/run_parakeet_benchmarks.sh` — runs all 6 benchmarks
(v3, v2, TDT-CTC-110M, CTC earnings, EOU 320ms, Nemotron 1120ms) with
WER comparison against `benchmarks100.md` baselines and regression
detection
- Referenced from `Documentation/ASR/benchmarks100.md`

## Verified — no regressions

```
Model                       Baseline    Current      Delta
Parakeet TDT v3 (0.6B)          2.6%      2.64%     +0.04%
Parakeet TDT v2 (0.6B)          3.8%      3.79%     -0.01%
CTC-TDT 110M                    3.6%      3.56%     -0.04%
CTC Earnings                  16.54%     16.51%     -0.03%
EOU 320ms (120M)               7.11%      7.11%     +0.00%
Nemotron 1120ms (0.6B)         1.99%      1.99%     +0.00%
```

## Test plan
- [x] `swift build` passes
- [x] `swift test` passes (all existing tests, updated for removed dead
code)
- [x] All 6 ASR benchmarks match baselines (100 files each)
- [ ] `swift format lint` passes
2026-03-28 23:44:10 -04:00
Alex 8aa0dfcdac fix: clean up diarization test infrastructure (#395)
## Summary
- Extract shared fixture helpers into `DiarizationTestFixtures` enum,
removing ~200 lines of duplicate code across `LSEENDIntegrationTests`
and `SpeakerEnrollmentTests`
- Replace fragile `Mirror`-based private state inspection with
`internal` `hasActiveSession` property on `LSEENDDiarizerAPI`
- Fix non-deterministic `srand48` seed in `SortformerTests` (use
constant `42` instead of time-based seed)
- Fix asymmetric skip guards in Sortformer enrollment tests (`XCTSkipIf`
instead of `XCTAssertNotNil` for host-dependent segments)
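
A short sketch of why the constant seed matters: seeding `srand48` with a fixed value makes the `drand48` sequence repeatable across runs, whereas a time-based seed does not. The `samples` helper is hypothetical and not part of `SortformerTests`.

```swift
import Foundation

/// Draw `count` pseudo-random values after seeding the generator with `seed`.
func samples(seed: Int, count: Int) -> [Double] {
    srand48(seed)
    return (0..<count).map { _ in drand48() }
}

let first = samples(seed: 42, count: 5)
let second = samples(seed: 42, count: 5)
print(first == second)
```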

## Test plan
- [x] `swift build --build-tests` passes
- [ ] `swift test --filter SortformerTests` passes
- [ ] `swift test --filter LSEENDIntegrationTests` passes
- [ ] `swift test --filter SpeakerEnrollmentTests` passes
2026-03-18 12:51:34 -04:00
Alex 7d074e1ee6 chore: consolidate Python scripts into Scripts/ (#344)
## Summary
- Move `Benchmarks/nemo` to `Scripts/nemo_ami_benchmark`
- Move `Tools/voice_cloning` to `Scripts/voice_cloning`
- Remove now-empty `Benchmarks/` and `Tools/` top-level directories

Consolidates standalone Python utilities into a single `Scripts/`
directory to reduce top-level clutter.

## Test plan
- [x] Verify files moved correctly (no content changes)
2026-03-04 12:46:03 -05:00