mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
2ea0727541
Fixes #512.

## TL;DR

Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google kropka com" as Cyrillic (`Впиш Гугл к ком.`) because the joint decoder's top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This PR adds an **opt-in** script filter: when a caller passes `language: .polish` (or any other language with a declared script), the decoder rejects top-1 if it's the wrong script and walks top-K to the highest-probability candidate matching the expected script.

- **Opt-in**: `language:` defaults to `nil`, so there is zero behavior change for existing callers.
- **No acoustic-model changes**: this is purely a decoder-side post-processing step over the joint logits.
- **Requires `JointDecisionv3.mlmodelc`** (exposes top-K outputs). Auto-downloaded from HuggingFace alongside the other v3 files; falls back to standard argmax when absent.

## Empirical validation: reporter's own audio

Samples pulled via `gdown --folder <link-from-issue-#512-comment>` from @tajchert's Drive folder. **`JointDecisionv3.mlmodelc` is loaded in both columns**, which isolates the Swift filter as the mechanism, not a model swap.

| sample | ground truth | `language: nil` (current) | `language: .polish` (this PR) |
|---|---|---|---|
| pl | Wpisz Google kropka com | **Впиш Гугл к ком.** | Wpis Google.com. |
| pl2 | Wpisz Google kropka com | **Впиш Гугл крокаком.** | Wpish Google, Com. |
| pl3 | Wpisz Google kropka com | **Впишь куглькрабком.** | VP Kugl.com. |
| pl4 | Wpisz Google kropka com | **Впиш гугл к ком.** | Wpish gugl c. |
| pl5 | Wpisz Google kropka com | **Впиш гугл кракаком.** | Wpish Google Croca kom. |
| pl6 | Wpisz Google kropka com | **Впиш, гугл крокаком.** | Wpish, Google, Com. |
| pl_complex | Cały spichlarz jest ze spiżu | Cały spichlarz jest ze spiżu. | Cały spichlarz jest ze spiżu. |

**6/6 short samples flip Cyrillic → Latin.** `pl_complex` was never broken (long context → high joint confidence → no drift) and is unchanged.

## Scope & limitations (important: please don't overclaim)

**This PR fixes the *script* the tokens are drawn from. It does NOT fix per-word acoustic accuracy.**

| | `language: nil` | `language: .polish` |
|---|---|---|
| Script correct (Latin, not Cyrillic) | ✗ | ✓ (6/6) |
| Word spelling matches ground truth | ✗ | ✗ (6/7 samples still wrong, all of them short) |

The residual errors (`Wpisz` → `Wpish`/`Wpis`, `kropka` → `Croca` / dropped) are **Parakeet TDT v3 acoustic weaknesses on short Polish commands**. No amount of output post-processing can turn `Wpish` into `Wpisz`; that needs better acoustic modeling, a Polish LM rescorer, or more training data. Out of scope here.

What users actually get by merging:

- Output is visually Polish (Latin script), not pseudo-Russian, so it works with locale-aware post-processing, spell-check, and UI rendering
- Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin substitution
- Opt-in; zero risk for callers who don't pass `language:`

What users do **not** get:

- Higher word accuracy on short Polish/Slavic Latin utterances
- Support for languages outside the `Language` enum. Maltese, Hungarian, Turkish, and Baltic characters already fit the Latin Unicode ranges but aren't exposed (easy follow-up); Greek needs its own script range
- A meaningful FLEURS WER delta; see [Documentation/fleurs-script-filtering-comparison.md](./Documentation/fleurs-script-filtering-comparison.md). Full sentences aren't in the failure regime

## Implementation

### New

- `Sources/FluidAudio/Shared/ScriptDetection.swift` (new, +112)
  - `public enum Language`: 13 Latin (en, es, fr, de, it, pt, ro, pl, cs, sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr)
  - `public enum Script { case latin, cyrillic }`
  - `matches(_:script:)` over Unicode ranges: ASCII (0x20–0x7F), Latin-1 (0xA0–0xFF), Latin Extended-A (0x100–0x17F), **Latin Extended-B (0x180–0x24F, Romanian ș/ț)**, **Latin Extended Additional (0x1E00–0x1EFF, Vietnamese)**, Cyrillic (0x400–0x4FF). Strips SentencePiece boundary marker U+2581 before checking.
  - `filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) -> (tokenId, probability)?`: returns the highest-probability top-K candidate matching the target script; probability via **softmax over the top-K subset** with the max-logit stability trick; guarded against top-K array length mismatch.

### Changed

- `TdtJointDecision`: optional `topKIds` / `topKLogits` fields (populated by JointDecisionv3 only)
- `TdtDecoderV3`: script filter runs **only when top-1 is already wrong script**; both decode sites feed `filtered.probability` (a real [0,1]) into `TdtDurationMapping.clampProbability`, not raw logits
- `AsrManager.transcribe(...)`: `language: Language? = nil` plumbed through all three overloads: `[Float]`, `URL`, `AVAudioPCMBuffer`
- `AsrModels` + `ModelNames`: `requiredModelsV3` set includes `JointDecisionv3.mlmodelc` so the download utility fetches it on fresh installs and also backfills it for existing users on next `.v3` load
- CLI: `fluidaudiocli transcribe <file> --language {en|pl|cs|sk|sl|hr|bs|ro|es|fr|de|it|pt|ru|uk|be|bg|sr}`

### How to try it

```bash
swift run -c release fluidaudiocli transcribe sample.wav --language pl
```

## Model dependency

`JointDecisionv3.mlmodelc` must be present in `FluidInference/parakeet-tdt-0.6b-v3-coreml` on HuggingFace. It exposes `top_k_ids` / `top_k_logits` outputs (K=64 in our export) alongside the standard argmax. When absent, `AsrModels` falls back to `JointDecision.mlmodelc` and the script filter becomes a no-op, which keeps it backward compatible.

**Cache-upgrade verified**: removed `JointDecisionv3.mlmodelc` from a populated cache, re-ran `--language pl`; the file was auto-fetched and Polish output was Latin. Existing users pick up the fix on next `.v3` load without manual intervention.
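
The script check and top-K walk described under Implementation can be sketched as follows. This is a language-neutral illustration in Python, not the Swift code from this PR: the names `matches` and `filter_top_k` only mirror the PR's identifiers, the `vocabulary` dict and the token strings are hypothetical, and the strict treatment of punctuation for Cyrillic (nothing outside 0x400–0x4FF passes) is an assumption, since the description does not say how the real implementation handles it.

```python
import math

# Inclusive Unicode ranges as listed in the PR description.
SCRIPT_RANGES = {
    "latin": [
        (0x20, 0x7F),      # ASCII
        (0xA0, 0xFF),      # Latin-1
        (0x100, 0x17F),    # Latin Extended-A
        (0x180, 0x24F),    # Latin Extended-B (Romanian ș/ț)
        (0x1E00, 0x1EFF),  # Latin Extended Additional
    ],
    "cyrillic": [(0x400, 0x4FF)],
}
BOUNDARY_MARKER = "\u2581"  # SentencePiece word-boundary marker


def matches(token, script):
    """True if every character of the token falls in the script's ranges.

    Assumption: a token that is empty after stripping the boundary marker
    is accepted; the real implementation may decide otherwise.
    """
    text = token.replace(BOUNDARY_MARKER, "")
    ranges = SCRIPT_RANGES[script]
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in text)


def filter_top_k(top_k_ids, top_k_logits, vocabulary, preferred_script):
    """Return (token_id, probability) for the best candidate in the script.

    Probability is a softmax over the top-K subset only, computed with the
    max-logit trick for numerical stability. Returns None if no candidate
    matches.
    """
    # Length-mismatch guard: only iterate the common prefix.
    n = min(len(top_k_ids), len(top_k_logits))
    if n == 0:
        return None
    ids, logits = top_k_ids[:n], top_k_logits[:n]

    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)

    best = None
    for token_id, e in zip(ids, exps):
        token = vocabulary.get(token_id)
        if token is None or not matches(token, preferred_script):
            continue
        probability = e / total
        if best is None or probability > best[1]:
            best = (token_id, probability)
    return best


# Hypothetical three-token example: top-1 is Cyrillic, so the walk
# settles on the best Latin candidate instead.
demo_vocab = {0: "\u2581Впиш", 1: "\u2581Wpis", 2: "\u2581W"}
print(filter_top_k([0, 1, 2], [2.0, 1.0, 0.0], demo_vocab, "latin"))
```

In the decoder described above, a function like this would only be consulted when the top-1 token already fails `matches`; when top-1 is in the expected script the argmax result is used unchanged.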

## Review notes / risky bits

- **Softmax over top-K subset, not the full vocab**: probabilities won't exactly match a true full softmax, but K=64 captures nearly all the probability mass when the model is anywhere near confident. If you prefer, we can expose the raw top-K logits to callers and let them compute confidence however they want.
- **Top-1 escape hatch**: the filter is only triggered when top-1 fails `matches(_, script:)`. When top-1 is already correct, nothing is changed, so we can't regress the common case.
- **Length-mismatch guard** in `filterTopK` uses `min(topKIds.count, topKLogits.count)`. If CoreML output arrays ever diverge, we iterate the common prefix instead of crashing.
- **Latin Extended-B (0x0180–0x024F)** was added specifically so Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional (0x1E00–0x1EFF) was added for free; it helps Vietnamese should anyone want it later.

## Tests

- `ScriptDetectionTests`, **37 tests**: Unicode range coverage (Latin-1 / Extended-A / Extended-B / Extended Additional / Cyrillic), SentencePiece boundary-marker stripping, `filterTopK` happy path, length-mismatch guard, probability-range invariant, Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script rejection
- Build clean; `swift format lint` clean on all touched files
- A/B end-to-end run against reporter's actual Polish audio (table above)

## Checklist

- [x] Builds clean (`swift build`, `swift build -c release`)
- [x] `swift format lint` clean on touched files
- [x] `ScriptDetectionTests` 37/37 pass
- [x] A/B reproduction on #512 reporter's audio
- [x] Cache-upgrade path verified (JointDecisionv3 auto-fetched on existing caches)
- [x] CLI accepts all 18 language codes end-to-end
- [ ] CI green

## Follow-ups (not blocking)

- Expose more Latin languages in the enum (Hungarian, Turkish, Baltic, Maltese); all character ranges are already supported, they just need enum cases
- Add `Script.greek` for `el_gr` (separate Unicode range)
- Short-utterance benchmark dataset (FLEURS is the wrong tool: it's all long sentences where drift doesn't happen)
- Optional: publish a Polish LM rescorer to address the underlying acoustic-accuracy issue the script filter cannot fix

---------
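
The review note about softmax over the top-K subset can be made concrete with a small numeric comparison. The sketch below uses made-up logit vectors (no model output is involved) and hypothetical helper names. It shows that the subset probability is never smaller than the full-vocabulary value, because the denominator drops the tail terms, and that the gap is small when one logit dominates and large when the distribution is flat.

```python
import math


def softmax_prob(logits, index):
    """Softmax probability of logits[index], using the max-logit trick."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    return exps[index] / sum(exps)


def subset_vs_full(logits, k):
    """Return (p_subset, p_full) for the highest-logit token.

    p_subset normalizes over the k largest logits only;
    p_full normalizes over every logit.
    """
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top = order[:k]
    subset_logits = [logits[i] for i in top]
    return softmax_prob(subset_logits, 0), softmax_prob(logits, top[0])


# Hypothetical 100-token vocabularies with k=4.
confident = [10.0] + [0.0] * 99   # one token clearly dominates
flat = [0.0] * 100                # model has no preference at all

for name, logits in (("confident", confident), ("flat", flat)):
    p_subset, p_full = subset_vs_full(logits, k=4)
    print(f"{name}: subset={p_subset:.4f} full={p_full:.4f}")
```

A caller that treats the filtered probability as a confidence score may therefore want to read it as an upper bound, especially on the low-confidence frames where the filter is most likely to fire.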
99 lines
3.0 KiB
Bash
Executable File
#!/bin/bash
# Run the FLEURS benchmark across all 24 Parakeet TDT v3 languages for
# local testing / regression checks. 100 samples/language by default.
#
# Usage:
#   ./Scripts/fleurs_parakeet_sub_benchmark.sh              # 100 samples/lang
#   SAMPLES=10 ./Scripts/fleurs_parakeet_sub_benchmark.sh   # quick smoke test
#
# FLEURS data and models download automatically if missing. Results land in
# benchmark_results/ as per-language JSON plus a combined summary CSV.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
RESULTS_DIR="$PROJECT_DIR/benchmark_results"
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$RESULTS_DIR/fleurs_${TIMESTAMP}.log"
SUMMARY_CSV="$RESULTS_DIR/fleurs_${TIMESTAMP}_summary.csv"
SAMPLES="${SAMPLES:-100}"

LANGUAGES=(
    en_us es_419 it_it fr_fr de_de
    ru_ru nl_nl pl_pl uk_ua sk_sk
    cs_cz bg_bg hr_hr ro_ro fi_fi
    hu_hu sv_se et_ee da_dk lt_lt
    el_gr mt_mt lv_lv sl_si
)

mkdir -p "$RESULTS_DIR"

log() {
    printf '[%s] %s\n' "$(date '+%H:%M:%S')" "$*" | tee -a "$LOG_FILE"
}

command -v python3 >/dev/null \
    || { log "ERROR: python3 required for summary extraction"; exit 1; }

cd "$PROJECT_DIR"

log "Building release binary..."
if ! swift build -c release 2>&1 | tee -a "$LOG_FILE"; then
    log "ERROR: swift build failed"
    exit 1
fi
CLI="$PROJECT_DIR/.build/release/fluidaudiocli"

log "=== FLEURS: $SAMPLES samples x ${#LANGUAGES[@]} languages = $(( SAMPLES * ${#LANGUAGES[@]} )) total ==="

SUITE_START=$(date +%s)

for i in "${!LANGUAGES[@]}"; do
    lang="${LANGUAGES[$i]}"
    output_file="$RESULTS_DIR/fleurs_${lang}_${TIMESTAMP}.json"

    log "[$((i+1))/${#LANGUAGES[@]}] $lang: starting ($SAMPLES samples)"
    start_time=$(date +%s)

    if ! "$CLI" fleurs-benchmark \
        --languages "$lang" \
        --samples "$SAMPLES" \
        --output "$output_file" \
        2>&1 | tee -a "$LOG_FILE"; then
        log "WARN: $lang failed - continuing"
    fi

    log "[$((i+1))/${#LANGUAGES[@]}] $lang: done in $(( $(date +%s) - start_time ))s"
done

SUITE_ELAPSED=$(( $(date +%s) - SUITE_START ))
log "=== Suite complete in $((SUITE_ELAPSED / 60))m $((SUITE_ELAPSED % 60))s ==="

log ""
log "=== Summary ($SAMPLES samples per language) ==="
printf 'lang,wer_pct,cer_pct,rtfx\n' > "$SUMMARY_CSV"
printf '%-10s %10s %10s %10s\n' "Language" "WER%" "CER%" "RTFx" | tee -a "$LOG_FILE"

for lang in "${LANGUAGES[@]}"; do
    json_file="$RESULTS_DIR/fleurs_${lang}_${TIMESTAMP}.json"
    row=$(python3 - "$json_file" <<'PY' 2>/dev/null || printf 'N/A,N/A,N/A'
import json, sys
try:
    d = json.load(open(sys.argv[1]))
    s = d["summary"]
    print(f"{s['averageWER']*100:.2f},{s['averageCER']*100:.2f},{s['averageRTFx']:.1f}")
except Exception:
    print("N/A,N/A,N/A")
PY
)
    printf '%s,%s\n' "$lang" "$row" >> "$SUMMARY_CSV"

    IFS=',' read -r wer cer rtfx <<< "$row"
    printf '%-10s %9s%% %9s%% %9sx\n' "$lang" "$wer" "$cer" "$rtfx" | tee -a "$LOG_FILE"
done

log ""
log "Results: $RESULTS_DIR"
log "Summary CSV: $SUMMARY_CSV"