mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
docs/update-documentation
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2593f55415 |
Add Japanese ASR support with JSUT and Common Voice datasets (#478)
## Summary

Adds comprehensive Japanese ASR support to FluidAudio with benchmark datasets and CLI commands.

## Changes

### Core Japanese ASR Support

- **CtcJaManager.swift** - Japanese CTC transcription manager (actor-based)
- **CtcJaModels.swift** - Japanese model loading and management
- **ModelNames.swift** - Added Japanese model registry (`parakeetCtcJa`, `CTCJa` enum)
- **AsrModels.swift** - Added `.ctcJa` model version (3,072 vocab, 1,024 hidden, blank_id=3072)
- **AsrManager.swift** - Added `.ctcJa` case with error directing to `CtcJaManager`

### CLI Commands

- **JapaneseAsrBenchmark.swift** (459 lines) - New `ja-benchmark` command
  - JSUT basic5000 dataset support
  - Mozilla Common Voice (MCV) test set support
  - Auto-download capability
  - CER (Character Error Rate) evaluation
- **DownloadCommand.swift** - Added JSUT and MCV Japanese dataset downloads
- **TranscribeCommand.swift** - Added `.ctcJa` model version support
- **AsrBenchmark.swift** - Added `.ctcJa` switch case

### Dataset Support

- **JapaneseDatasetDownloader.swift** (387 lines) - Dataset download and parsing
  - JSUT basic5000 (5,000 sentences, clean studio recordings)
  - Mozilla Common Voice Japanese test split
  - Efficient streaming downloads
  - Metadata extraction and validation

## Usage

### CLI Commands

```bash
# Benchmark on JSUT basic5000 (100 samples)
swift run fluidaudiocli ja-benchmark --dataset jsut --samples 100

# Benchmark on Common Voice test (500 samples, auto-download)
swift run fluidaudiocli ja-benchmark --dataset cv-test --samples 500 --auto-download

# Download datasets
swift run fluidaudiocli download --dataset jsut
swift run fluidaudiocli download --dataset cv-ja-test
```

### Swift API

```swift
// Load and use Japanese CTC transcription
let manager = try await CtcJaManager.load()
let text = try manager.transcribe(audioURL: japaneseAudioFile)
```

## Model Info

- **Repo**: `FluidInference/parakeet-ctc-0.6b-ja-coreml`
- **Architecture**: 600M parameter CTC-only
- **Vocabulary**: 3,072 Japanese SentencePiece tokens + 1 blank (id: 3072)
- **Encoder**: 1,024 hidden size
- **Expected CER**: 6.5% on JSUT basic5000, 13.3% on MCV 16.1 test

## Testing

- ✅ Builds successfully (`swift build`)
- ✅ Model loading integration tested
- ✅ CLI commands compile and link correctly
- ⏳ Runtime benchmark testing pending (requires model download)

## Related

- Mobius PR #39: Japanese CTC CoreML conversion (https://github.com/FluidInference/mobius/pull/39)

🤖 Generated with Claude Code |
||
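The `ja-benchmark` command in commit 2593f55415 reports CER (Character Error Rate). As a rough illustration of the metric, CER is commonly computed as the character-level Levenshtein distance divided by the reference length. The sketch below assumes that definition; the function names are hypothetical and this is not the code in `JapaneseAsrBenchmark.swift`. Any text normalization applied before scoring is not described in the commit and is omitted here.

```swift
/// Levenshtein edit distance between two character sequences
/// (insertions, deletions, and substitutions each cost 1).
func editDistance(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    // `previous` is the DP row for the first i-1 characters of `a`.
    var previous = Array(0...b.count)
    var current = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        current[0] = i
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            let deletion = previous[j] + 1
            let insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        }
        swap(&previous, &current)
    }
    // After the final swap, `previous` holds the last completed row.
    return previous[b.count]
}

/// Hypothetical helper: character error rate of `hypothesis` against `reference`.
/// An empty reference yields 0 for an empty hypothesis and 1 otherwise (an assumption).
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference)
    let hyp = Array(hypothesis)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}
```

Swift's `Character` is an extended grapheme cluster, so each kana or kanji counts as one unit regardless of its UTF-8 byte length.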
|
|
d9eef864d2 |
ASR tech debt cleanup: remove dead code, fix bugs, add benchmark script 28/03/2026 (#460)
## Summary

Systematic cleanup of the ASR module addressing tech debt items from #457. Net reduction of ~430 lines while fixing real bugs and improving maintainability.

### Bug fixes

- **`enableFP16` silently ignored**: `optimizedConfiguration(enableFP16:)` delegated to a shared factory that hardcoded `allowLowPrecisionAccumulationOnGPU = true`, ignoring the caller's parameter
- **`MLArrayCache.returnArray` only reset float32 data**: cached arrays of other types (float16, int32) retained stale data from previous use
- **CTC model auto-detection broken**: `Repo.parakeetCtc110m.folderName` returned `"parakeet-ctc-110m"` instead of `"parakeet-ctc-110m-coreml"` because the `folderName` switch fell through to a `default` case that stripped the `-coreml` suffix. Same for `parakeetCtc06b`.
- **Duplicate tokens at chunk merge boundary**: `mergeByMidpoint` used `<=`/`>=` so tokens exactly at the cutoff appeared in both left and right chunks

### Dead code removal

- Deleted `ANEOptimizer` indirection layer (166 lines): was a pass-through wrapping `MLModel` with no optimization
- Deleted `PerformanceMonitor` actor and `AggregatedMetrics`: never instantiated, component times hardcoded to 0
- Deleted `getFloat16Array` from MLArrayCache: never called
- Deleted `sliceEncoderOutput` from AsrTranscription: never called (30 lines)
- Deleted `loadWithANEOptimization` from AsrModels: never called
- Removed unused `tokenTimings` parameter chain through `processTranscriptionResult`
- Removed unused `import OSLog` / `import CoreML` across 5 files
- Removed `nonisolated(unsafe)` from SlidingWindowAsrManager (types already Sendable)

### Duplication elimination

- Extracted `clearCachedCtcData()` helper (replaced 3× triple-nil assignments)
- Extracted `decoderState(for:)` / `setDecoderState(_:for:)` (replaced 4× switch blocks)
- Extracted `frameAlignedAudio()` (replaced 2× duplicated frame-alignment blocks)
- Added `ASRConstants.secondsPerEncoderFrame` (replaced 5× magic `0.08`)
- Replaced hardcoded `16_000` with `config.sampleRate` / `ASRConstants.sampleRate`
- Extracted `MLModelConfigurationUtils.defaultConfiguration()` (replaced 5× copy-pasted config methods)
- Extracted `MLModelConfigurationUtils.defaultModelsDirectory()` (replaced 3× copy-pasted directory methods)
- Consolidated duplicate `vocabularyFile` / `vocabularyFileArray` constants

### File organization

- Moved `PerformanceMetrics.swift`, `ProgressEmitter.swift`, `MLArrayCache.swift` from `ASR/Parakeet/` to `Shared/` (used by multiple modules)
- Renamed `StreamingAudioSourceFactory` → `AudioSourceFactory`, `StreamingAudioSampleSource` → `AudioSampleSource` (types used by both ASR and Diarizer)
- Renamed files to match type names: `SortformerDiarizerPipeline.swift` → `SortformerDiarizer.swift`, `LSEENDDiarizerAPI.swift` → `LSEENDDiarizer.swift`, `NemotronPipeline.swift` → `NemotronStreamingAsrManager+Pipeline.swift`
- Replaced force unwraps in `RnntDecoder.swift` with `guard let` + descriptive errors
- Removed stale TODO about decoder state in AsrManager

### Benchmark script

- Added `Scripts/run_parakeet_benchmarks.sh`: runs all 6 benchmarks (v3, v2, TDT-CTC-110M, CTC earnings, EOU 320ms, Nemotron 1120ms) with WER comparison against `benchmarks100.md` baselines and regression detection
- Referenced from `Documentation/ASR/benchmarks100.md`

## Verified: no regressions

```
Model                    Baseline  Current  Delta
Parakeet TDT v3 (0.6B)   2.6%      2.64%    +0.04%
Parakeet TDT v2 (0.6B)   3.8%      3.79%    -0.01%
CTC-TDT 110M             3.6%      3.56%    -0.04%
CTC Earnings             16.54%    16.51%   -0.03%
EOU 320ms (120M)         7.11%     7.11%    +0.00%
Nemotron 1120ms (0.6B)   1.99%     1.99%    +0.00%
```

## Test plan

- [x] `swift build` passes
- [x] `swift test` passes (all existing tests, updated for removed dead code)
- [x] All 6 ASR benchmarks match baselines (100 files each)
- [ ] `swift format lint` passes |
||
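The duplicate-token bug fixed in commit d9eef864d2 comes from using inclusive comparisons on both sides of the cutoff. A minimal sketch of a midpoint merge with one strict and one inclusive comparison is shown below. The `TimedToken` type, the function name `mergeAtMidpoint`, and the choice of which side keeps a token exactly at the cutoff are all assumptions made for illustration; this is not FluidAudio's `mergeByMidpoint`.

```swift
/// Hypothetical token type for illustration; not FluidAudio's actual type.
struct TimedToken: Equatable {
    let id: Int
    let time: Double  // seconds from the start of the audio
}

/// Merges tokens from two overlapping chunks at the midpoint of the overlap.
/// The left chunk keeps tokens strictly before the cutoff and the right chunk
/// keeps tokens at or after it, so a token exactly at the cutoff is kept once.
func mergeAtMidpoint(
    left: [TimedToken],
    right: [TimedToken],
    overlapStart: Double,
    overlapEnd: Double
) -> [TimedToken] {
    let cutoff = (overlapStart + overlapEnd) / 2
    // Using `<=` here together with `>=` below would keep a boundary token twice.
    let keptLeft = left.filter { $0.time < cutoff }
    let keptRight = right.filter { $0.time >= cutoff }
    return keptLeft + keptRight
}
```

With an overlap of 4.0 s to 6.0 s the cutoff is 5.0 s, so a token at exactly 5.0 s that appears in both chunks is taken only from the right chunk.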
|
|
8aa0dfcdac |
fix: clean up diarization test infrastructure (#395)
## Summary

- Extract shared fixture helpers into `DiarizationTestFixtures` enum, removing ~200 lines of duplicate code across `LSEENDIntegrationTests` and `SpeakerEnrollmentTests`
- Replace fragile `Mirror`-based private state inspection with `internal` `hasActiveSession` property on `LSEENDDiarizerAPI`
- Fix non-deterministic `srand48` seed in `SortformerTests` (use constant `42` instead of time-based seed)
- Fix asymmetric skip guards in Sortformer enrollment tests (`XCTSkipIf` instead of `XCTAssertNotNil` for host-dependent segments)

## Test plan

- [x] `swift build --build-tests` passes
- [ ] `swift test --filter SortformerTests` passes
- [ ] `swift test --filter LSEENDIntegrationTests` passes
- [ ] `swift test --filter SpeakerEnrollmentTests` passes |
||
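The `srand48` fix in commit 8aa0dfcdac replaces a time-based seed with the constant `42`, which makes the generated test data repeatable across runs. The sketch below illustrates the idea with the C library's `srand48`/`drand48` as exposed through Foundation; the helper name `seededSamples` is hypothetical and is not taken from `SortformerTests`.

```swift
import Foundation

/// Hypothetical helper: generates `count` pseudo-random samples in [0, 1).
/// Reseeding with the same constant before each run yields the same sequence,
/// whereas a time-based seed would change the data on every test run.
func seededSamples(count: Int, seed: Int = 42) -> [Double] {
    srand48(seed)
    return (0..<count).map { _ in drand48() }
}
```

Because `srand48` sets process-global state, this pattern assumes no other code draws from `drand48` between the seed call and the samples.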
|
|
7d074e1ee6 |
chore: consolidate Python scripts into Scripts/ (#344)
## Summary

- Move `Benchmarks/nemo` to `Scripts/nemo_ami_benchmark`
- Move `Tools/voice_cloning` to `Scripts/voice_cloning`
- Remove now-empty `Benchmarks/` and `Tools/` top-level directories

Consolidates standalone Python utilities into a single `Scripts/` directory to reduce top-level clutter.

## Test plan

- [x] Verify files moved correctly (no content changes) |