FluidAudio

mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Author	SHA1	Message	Date
Alex	847a985ae4	fix(tts/pocket-tts): repair v1 voice cloning for pocket-tts 2.0.0 (#592 ) (#601 ) ## Summary Fixes #592 — PocketTTS voice cloning produced garbled audio on macOS after the `pocket-tts==2.0.0` upgrade. v2 (pre-baked KV snapshot) voices were unaffected — only the v1 path (user audio → `mimi_encoder` → `cond_step` prefill) was broken. Two compounding bugs: ### RCA 1 — stale `mimi_encoder` The `mimi_encoder.mlpackage` originally published on HF was traced against pre-2.0.0 `pocket-tts` (torch 2.9.1, Float32, scalar output) and no longer matched the runtime cond_step contract. Re-traced as `mimi_encoderv2` from `pocket-tts==2.0.0` (torch 2.11.0, Float16, fixed `[1, 1, 240000]` → `[1, 125, 1024]`). Both files now live at the HF repo root (legacy file kept for backwards compat); `ModelNames.mimiEncoder` points at the new one. ### RCA 2 — missing `bos_before_voice` prepend `pocket-tts` 2.0.0 added a learned 1024-d `flow_lm.bos_before_voice` buffer that has to be prepended to the audio_prompt during cond_step prefill. Without it the FlowLM sees a different token distribution than training. Extracted per-language as `constants_bin/bos_before_voice.bin` (4096 bytes each, 10 packs × distinct SHA-256s, all verified byte-for-byte against the HF upload). ### Swift-side changes - `PocketTtsVoiceCloner` pads/truncates input to the encoder's fixed 240 000 samples (10 s @ 24 kHz, non-flexible shape) and trims output frames to real-audio duration so zero-padded frames don't bleed into the prompt. - `PocketTtsSynthesizer+KVCache.prefillKVCache` prepends `bos_before_voice` ahead of the audio_prompt on the v1 path. v2 snapshots skip this — their pre-baked KV cache already encodes the prefix. - `PocketTtsResourceDownloader.ensureModels` backfills `bos_before_voice.bin` for caches that predate this fix (per-file fetch) instead of forcing a full language-pack re-download. 
Conversion artifacts and per-language SHA-256s documented in `mobius/models/tts/pocket_tts/coreml/TRIALS.md` (Phase 7). ## Test plan - [x] `swift build` clean - [x] `swift test --filter PocketTtsConstantsLoaderTests` — 3 new tests pass - [x] `swift format` applied - [x] E2E v1 cloning: `am_michael.wav` (7.5 s) → 3.92 s @ 24 kHz Int16, intelligible voice match. KV cache prefill lands at position 113 = 1 BOS + 95 voice + 17 text tokens (matches pocket-tts 2.0.0 layout). - [x] v2 snapshot regression check: default `alba.safetensors` voice still synthesizes correctly (prefill position 140, no `bos_before_voice` involvement) - [x] Backfill path: deleted `bos_before_voice.bin` from cache, re-ran cloning — file auto-fetched from HF (4096 bytes) before synthesis - [x] All 10 language packs verified on HF: SHA-256 match between local extraction and uploaded `v2/<lang>/constants_bin/bos_before_voice.bin`	2026-05-12 08:55:44 -04:00
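The pad/truncate and frame-trimming step described for `PocketTtsVoiceCloner` in 847a985ae4 can be sketched roughly as follows. This is a minimal illustration built only from the shapes quoted in the commit message (`[1, 1, 240000]` in, `[1, 125, 1024]` out); the function names are hypothetical, and the round-up rule for the real-frame count is an assumption that may differ from the shipped code.

```swift
// Shapes quoted in the commit message; names are hypothetical.
let encoderInputSamples = 240_000   // fixed input: 10 s @ 24 kHz
let encoderOutputFrames = 125       // fixed output frame count

/// Zero-pads or truncates `samples` to exactly `encoderInputSamples`.
func padOrTruncate(_ samples: [Float]) -> [Float] {
    if samples.count >= encoderInputSamples {
        return Array(samples.prefix(encoderInputSamples))
    }
    let padding = [Float](repeating: 0, count: encoderInputSamples - samples.count)
    return samples + padding
}

/// Number of encoder output frames that overlap real (non-padded) audio.
/// Assumption: frames map uniformly onto input samples
/// (240000 / 125 = 1920 samples per frame) and partial frames round up.
func realFrameCount(forSampleCount count: Int) -> Int {
    let samplesPerFrame = encoderInputSamples / encoderOutputFrames
    let clamped = min(max(count, 0), encoderInputSamples)
    return (clamped + samplesPerFrame - 1) / samplesPerFrame
}
```

A caller would run the encoder on `padOrTruncate(samples)` and keep only the first `realFrameCount(forSampleCount: samples.count)` output frames, so zero-padded frames do not reach the prompt.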
Benjamin Lee	a0092cf163	Fixed LS-EEND Memory Leak + Updated Docs (#605) 1. LS-EEND had a memory leak: the autorelease pool was not releasing the multiarrays properly, and new ones were allocated for every chunk. Switched to backed output arrays to eliminate the new allocations. 2. The LS-EEND docs were somewhat stale; updated them to reflect the new API.	2026-05-12 08:53:59 -04:00
Alex	fb8b779380	feat(tts/magpie): warmup API for cold-start mitigation (#60 Track 2) (#595 )	2026-05-10 16:51:09 -04:00
Alex	2c45df3035	docs(tts): refresh Benchmarks.md per #590 ; wire styletts2 + --variant into tts-benchmark (#593 ) ## Summary Closes the work tracked in #590: bring `Documentation/TTS/Benchmarks.md` into agreement with what's actually shipped on `main` for CoreML TTS backends, and add the two CLI affordances needed to benchmark the in-scope backend × language matrix. ### Doc changes (`Documentation/TTS/Benchmarks.md`) - Single consolidated per-backend table that merges basic info (license, language+voice, footprint in GB, sample rate, max chunk per pass, streaming flag) with performance metrics (TTFT p50/p95, synth p50/p95, agg RTFx, peak RSS, WER %, CER %). Five rows: Kokoro ANE en (`af_heart`), Kokoro ANE zh (`zf_001`), PocketTTS en (`alba` 6L), Magpie en (`John`, batch-only on `main`), StyleTTS2 en (LibriTTS iteration_3, zero-shot). - Dropped from the top-line per scope decision: non-ANE Kokoro, CosyVoice3 zh, PocketTTS 24L variants, Hindi/Cantonese rows. CosyVoice3 narrative sections (decode budget cap + auto-chunker validation) stay verbatim. - Refreshed Kokoro ANE per-stage breakdown (post-laishere 7-graph chain). - Replaced the old Magpie per-stage table with a pointer paragraph (`MagpieSynthesisResult.timings` is still populated for callers; sub-1.5 s TTFA work referenced in #590 lives on `feat/magpie-lt-fusion`, not `main`). - Corrected PocketTTS footprint to `fp16 ~0.77 / int8 ~0.55 GB` (was `~140 / ~520 MB`); enumerated all 10 packs in the corpus matrix; added zh to the Kokoro ANE corpus row; added a StyleTTS2 row. ### CLI changes (`Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift`) - New `styletts2` / `style-tts2` backend wired to `StyleTTS2Manager.synthesize(text:referenceAudioURL:)`. Requires `--reference <wav>`; the shipped iteration_3 `ref_encoder` is fixed at `[1, 1, 80, 231]`, so the reference must be exactly 2.875 s @ 24 kHz mono — the harness errors out at predict time on mismatched durations. 
- New `--variant {english\|mandarin}` flag for `kokoro-ane` so the `zf_001` Mandarin voice pack can be benchmarked alongside `af_heart`. Falls back to `english` when unset; the manager constructor now receives the parsed `KokoroAneVariant` and the default voice is variant-aware. ### Methodology 100-phrase MiniMax-Multilingual on MacBook Air M2 (16 GB, macOS 26, on AC), `--compute-units default`. English WER/CER via Parakeet TDT roundtrip; Mandarin CER via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) — macro 4.01% / micro 4.14% across all 100 zh phrases. WER omitted for Mandarin because `WERCalculator` splits on whitespace. ## Test plan - [x] `swift build` clean on `main`-based branch. - [x] `swift format lint --recursive --configuration .swift-format Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift` clean. - [x] Smoke test: `swift run fluidaudio tts-benchmark --backend styletts2 --reference ref.wav --corpus minimax-english --output-json /tmp/styletts2-smoke.json` — produces a valid JSON report. - [x] Smoke test: `swift run fluidaudio tts-benchmark --backend kokoro-ane --variant mandarin --voice zf_001 --corpus minimax-chinese --skip-asr --output-json /tmp/kokoro-zh-smoke.json` — pulls the Mandarin voice pack and produces audio. - [x] Full 100-phrase runs for all five table rows produced under `Benchmarks/tts/runs/590/` (gitignored); table numbers come straight from those JSON reports. - [ ] Reviewer cross-check: footnote markers (`*`, `‡`, `∥`, `¶`) in the consolidated table all have matching paragraphs below.	2026-05-09 21:47:45 -04:00
panv-kw	a400080380	Make SpeakerManager a struct and de-async DiarizerManager (#591) ### Why is this change needed? `DiarizerManager.performCompleteDiarization` is `async`, even though no asynchronous operations occur when running the models and processing the results. This is plain, synchronous computation; it doesn't wait on the network or anything similar. It is important to be able to integrate it into other synchronous compute workflows. The reason it had to be `async` until now is that the `SpeakerManager` type containing the speaker database was a `class`, meaning that it was shared mutable state. It was made an `actor` because this shared mutable state could be mutated concurrently. But really, there should not be concurrent mutations to the `SpeakerManager` in the first place. The user of this type, `DiarizerManager`, is not actually prepared for other code to be modifying this database while it is using it, and anybody who tries is almost certainly writing a bug, because their code would be logically racing with `DiarizerManager` and the results would be unpredictable. This change makes `SpeakerManager` a struct. It has copy-on-write value semantics because it wraps a `Dictionary` for its storage, and mutations are marked by the `mutating` keyword and require exclusive ownership of the variable -- again, just like `Dictionary`. The compiler statically diagnoses attempts to concurrently mutate the `DiarizerManager`'s speaker database, so the test for this can be removed (it no longer compiles). 
In summary, this change significantly reduces the cognitive load of using and maintaining this code, promotes correct usage through static diagnostics rather than allowing unpredictable results through concurrent mutation of the speaker database, and enables diarization to be used in more contexts in more programs. (BTW, `SpeakerManager` doesn't strictly _need_ to be `Sendable`, but the previous one was by virtue of being an `actor`, so I marked this one as being `Sendable` too in case anybody was relying on it. I don't think the implementation of this type is going to change radically in the future to the point where that might be a problem.)	2026-05-09 17:14:07 -04:00
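The value-semantics argument in a400080380 can be illustrated with a small stand-in type. This is a sketch only: `SpeakerStore` and its members are hypothetical and are not the real `SpeakerManager` API; the point is that a struct wrapping a `Dictionary` gets independent logical copies and compiler-enforced `mutating` access.

```swift
/// Hypothetical stand-in for a speaker database with value semantics.
struct SpeakerStore: Sendable {
    // Dictionary storage gives copy-on-write behaviour for free.
    private var embeddings: [String: [Float]] = [:]

    var count: Int { embeddings.count }

    /// Mutation requires exclusive access to the variable holding the store.
    mutating func upsert(id: String, embedding: [Float]) {
        embeddings[id] = embedding
    }

    func embedding(for id: String) -> [Float]? { embeddings[id] }
}

var original = SpeakerStore()
original.upsert(id: "spk1", embedding: [0.1, 0.2])

// Assigning creates an independent logical copy; later edits do not leak back.
var copy = original
copy.upsert(id: "spk2", embedding: [0.3, 0.4])
```

Because every mutation goes through a `mutating` method on a uniquely owned variable, no `await` is needed to read or update the store from synchronous code.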
Alex	3ff5ae2d0c	refactor(tts): async StyleTTS2 predict + drop non-native Magpie synthesizeStream (#589 )	2026-05-09 12:54:07 -04:00
Alex	ce59fb14b8	feat(tts): StyleTTS2 LibriTTS (iteration_3) CoreML backend (#588 ) ## Summary Swift port of `mobius/models/tts/styletts2/coreml/inference.py` against the `FluidInference/StyleTTS-2-coreml/iteration_3/compiled` mlmodelc assets. New `StyleTTS2Manager` actor exposes the same public shape as `MagpieTtsManager` / `KokoroSynthesizer`, plus a `--backend styletts2` route in the CLI. ## Architecture `StyleTTS2Manager` orchestrates four pieces: 1. `StyleTTS2ModelStore` — actor-managed lazy load of the 8 default `.mlmodelc` stages plus 6 token-axis bucket variants (T = 64 / 128 / 256 fp16). 2. `StyleTTS2Phonemizer` — wraps shared `MultilingualG2PModel` (CharsiuG2P) with an espeak-fallback note in the docs; `synthesize(ipa:)` escape hatch preserves parity for callers that already have espeak output. 3. `StyleTTS2MelExtractor` — vDSP FFT + 80-bin HTK mel filterbank with the training-time `sample_rate=16000` quirk for the speaker-reference path. 4. `StyleTTS2Synthesizer` — drives the 8-stage CoreML graph (`text_encoder`, `bert`, `ref_encoder`, `fused_diffusion_sampler`, `duration_predictor`, `fused_f0n_har_source`, `decoder_pre`, `decoder_upsample`) and returns 24 kHz mono Float32 PCM. Eager-glue ops (`StyleTTS2GlueOps`) bridge the stages on the CPU side: sigmoid+round of duration logits, one-hot alignment matrix, BLAS `cblas_sgemm` matmul, vDSP transpose, HiFi-GAN causal asr-shift, and the alpha/beta style blend (`s_pred[:, 128:]` / `s_pred[:, :128]`). The fused diffusion sampler consumes pre-materialized noise — `StyleTTS2DiffusionSchedule` provides the Karras sigma formula plus a SplitMix64 + Box-Muller source so a fixed `noiseSeed` reproduces the same audio. ## CLI ``` swift run fluidaudiocli tts "Hello from StyleTTS2." \ --backend styletts2 \ --reference path/to/speaker.wav \ --output out.wav \ --alpha 0.3 --beta 0.7 --seed 0 ``` `--ipa` overrides the text path with a verbatim IPA string for espeak parity. 
## Test plan - [x] `swift build` clean - [x] `swift format lint` clean on touched files - [x] `swift test --filter StyleTTS2` — 32 / 32 passing - `StyleTTS2TextCleanerTests` — symbol vocab + encode round-trip + drop-unknown - `StyleTTS2GlueOpsTests` — duration rounding, alignment matrix, BLAS matmul, transpose, HiFi-GAN shift, alpha/beta blend - `StyleTTS2DiffusionScheduleTests` — Karras boundary conditions + monotonicity, RNG determinism, Gaussian stats - `StyleTTS2MultiArrayTests` — Float32 / Int32 round-trip, `extractFloats` for double / int32 backings - [ ] End-to-end smoke run via `swift run fluidaudiocli tts ... --backend styletts2 --reference ...` against a downloaded `iteration_3/compiled` asset bundle	2026-05-09 00:25:54 -04:00
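The seeded-noise idea described for `StyleTTS2DiffusionSchedule` in ce59fb14b8 (Karras sigmas plus a SplitMix64 + Box-Muller source) can be sketched as below. This uses the textbook forms of the three algorithms; the type and function names are hypothetical, and the parameter values a real caller passes are not stated in the commit message.

```swift
import Foundation

/// SplitMix64: a small deterministic 64-bit generator (textbook constants).
struct SplitMix64 {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }

    /// Uniform Double in (0, 1] from the top 53 bits; never 0, so log() is safe.
    mutating func nextUnit() -> Double {
        (Double(next() >> 11) + 1.0) / 9_007_199_254_740_992.0  // 2^53
    }
}

/// Box-Muller: turns pairs of uniforms into pairs of standard normal samples.
func gaussianNoise(count: Int, seed: UInt64) -> [Float] {
    var rng = SplitMix64(seed: seed)
    var out: [Float] = []
    out.reserveCapacity(count)
    while out.count < count {
        let u1 = rng.nextUnit()
        let u2 = rng.nextUnit()
        let radius = (-2.0 * log(u1)).squareRoot()
        let theta = 2.0 * Double.pi * u2
        out.append(Float(radius * cos(theta)))
        if out.count < count { out.append(Float(radius * sin(theta))) }
    }
    return out
}

/// Karras schedule: sigma_i = (max^(1/rho) + t_i * (min^(1/rho) - max^(1/rho)))^rho,
/// with t_i running from 0 to 1, so sigmas decrease from sigmaMax to sigmaMin.
func karrasSigmas(steps: Int, sigmaMin: Double, sigmaMax: Double, rho: Double) -> [Double] {
    precondition(steps >= 2 && sigmaMin > 0 && sigmaMax > sigmaMin && rho > 0)
    let hi = pow(sigmaMax, 1.0 / rho)
    let lo = pow(sigmaMin, 1.0 / rho)
    return (0..<steps).map { i in
        let t = Double(i) / Double(steps - 1)
        return pow(hi + t * (lo - hi), rho)
    }
}
```

With a fixed seed, `gaussianNoise` returns the same samples on every call, which is the property that lets a fixed `noiseSeed` reproduce the same audio.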
Greg Young	b3a725db3e	Fix: Prevent Metal crash when targetTokens is 0 in Kokoro TTS (#586)
Adds a defensive guard against `targetTokens == 0` reaching CoreML in the Kokoro TTS pipeline. A zero-length `input_ids` tensor causes the Metal backend to dispatch compute shaders with threadgroupsPerGrid.width(0), which is an uncatchable assertion failure: -[MTLDebugComputeCommandEncoder dispatchThreadgroups:threadsPerThreadgroup:]:1377: failed assertion `(threadgroupsPerGrid.width(0) * ...) must not be 0.'
Changes
1. KokoroSynthesizer.swift: synthesizeChunk() now throws a descriptive TTSError.processingFailed when targetTokens == 0, before any MLMultiArray allocation or model prediction. This converts an uncatchable Metal assertion into a recoverable Swift error.
2. KokoroModelCache.swift: Cached token lengths are clamped with max(1, inferTokenLength(...)) at all 3 caching sites (loadModelsIfNeeded, tokenLength(for:), registerPreloadedModels). Defense-in-depth: although inferTokenLength() already returns a positive value or falls back to 124, this guarantees the cache invariant is locally enforced regardless of future changes to the inference helper.
Testing
- Manual: confirmed synthesizeChunk now throws TTSError.processingFailed instead of trapping when a 0 token length is forced.	2026-05-08 17:56:13 -04:00
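The two guards described in b3a725db3e (throw before prediction, clamp at cache time) follow a common pattern that can be sketched as below. The error type and function names here are hypothetical stand-ins, not the real `TTSError` or `KokoroModelCache` API.

```swift
/// Hypothetical stand-in for the library's error type.
enum SynthesisError: Error, Equatable {
    case processingFailed(String)
}

/// Rejects a zero (or negative) token count before any tensor is allocated,
/// turning a would-be Metal assertion into a recoverable Swift error.
func validatedTokenCount(_ targetTokens: Int) throws -> Int {
    guard targetTokens > 0 else {
        throw SynthesisError.processingFailed(
            "targetTokens must be > 0, got \(targetTokens)")
    }
    return targetTokens
}

/// Defense in depth: never cache a token length below 1.
func clampedTokenLength(_ inferred: Int) -> Int {
    max(1, inferred)
}
```

Callers can wrap synthesis in `do`/`catch` and surface the error, whereas the Metal assertion would have terminated the process.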
local	024bd8e454	chore(tts): remove StyleTTS2 backend, models, and references	2026-05-07 13:32:16 -04:00
Prakash Joshi Pax	a53aff438b	fix(tts): guard direct Float16 reads with #if arch(arm64) (CosyVoice3, StyleTTS2) (#582 ) ## Summary `Float16` is an arm64-only Swift built-in, so any direct `Float16` typing fails to compile in the x86_64 slice of a Universal build. Four sites in CosyVoice3 and StyleTTS2 do raw `Float16` pointer binds with no arch guard, which currently breaks Universal archive builds with errors like: ``` 'Float16' is unavailable in macOS No exact matches in call to initializer Failed to produce diagnostic for expression; please submit a bug report ``` Affected sites: - `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift:65` — `assumingMemoryBound(to: Float16.self)` for the fp16 safetensors lookup table. - `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift:554` — fp16 branch of the Flow→HiFT mel copy in `runHiFT`. - `Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift:288` — fp16 case in `sliceFirstAxis2D`. - `Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift:541` — fp16 case in `readMLMultiArrayPrefix`. This PR mirrors the package's existing pattern for fp16 reads on non-arm64 (`ASR/Qwen3/Qwen3AsrModels.swift:310`, `Diarizer/Sortformer/SortformerModelInference.swift:278`): wrap each `Float16`-touching arm in `#if arch(arm64) ... #endif`. ## Behavior on x86_64 - CosyVoice3SpeechEmbeddings — throws `CosyVoice3Error.predictionFailed("requires Apple Silicon (arm64); fp16 lookup table cannot be read on x86_64")`. The `speech_embedding` safetensors table is fp16-only on disk with no fp32 alternative, so this matches the Qwen3-ASR posture (its embedding table is also fp16-only on disk; it `fatalError`s on x86_64). - CosyVoice3Synthesizer.runHiFT — the `case .float16:` arm is omitted on x86_64. The `case .float32:` path is unchanged. 
If a Flow variant emits fp16 at runtime on Intel, control falls into the existing `default:` arm, which already throws `"runHiFT: unexpected Flow mel dtype …"`. - StyleTTS2Synthesizer (`sliceFirstAxis2D`, `readMLMultiArrayPrefix`) — the `case .float16:` arm is omitted on x86_64. fp16 arrays fall through to the existing NSNumber-bridged `default:` arm (`arr[i].floatValue` / `fill { arr[$0].floatValue }`), which already converts fp16 correctly. Slightly slower on Intel; no behavior regression. No new error types or dependencies. Diff is +13 lines across 3 files. ## Why this approach (vs. vImage byte-level conversion) FluidAudio already uses two patterns for cross-arch fp16 handling: - `#if arch(arm64)` guard — used for fp16 reads in `Qwen3AsrModels.swift` and `SortformerModelInference.swift`. - vImage `Planar16FtoPlanarF` / `PlanarFtoPlanar16F` — used for fp16 writes in `KokoroAneSynthesizer+Conversion.swift`, `TtsModels.swift`, `KokoroSynthesizer.swift`, with the explicit comment in `TtsModels.swift:182`: "This avoids direct Float16 usage which isn't available in all build configurations". This PR matches the existing fp16-read precedent (Pattern A). A follow-up could port these read paths to vImage for full Intel runtime support (Pattern B), but that's a larger change and would need testing on Intel hardware. The minimal goal here is unblocking the Universal compile. ## Test plan - [x] Universal archive build of a downstream macOS app that links FluidAudio as a local SPM package now succeeds (failed prior to this patch with the errors above). - [ ] CI lint / build on the package itself. - [ ] No CosyVoice3 / StyleTTS2 runtime regression on Apple Silicon (the arm64 path is byte-identical to before).	2026-05-05 09:27:27 -04:00
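The `#if arch(arm64)` pattern from a53aff438b can be sketched in isolation as follows. The function name is hypothetical, and returning `nil` on other architectures is only one possible fallback (the PR itself throws or falls through to an `NSNumber`-bridged path, depending on the site). The sketch uses `load(fromByteOffset:as:)` rather than `assumingMemoryBound` so it is safe on memory that was not bound to `Float16`.

```swift
/// Reads `count` half-precision values from raw memory as Float.
/// Returns nil on architectures where this sketch does not touch Float16.
func readHalfPrecision(_ raw: UnsafeRawPointer, count: Int) -> [Float]? {
    #if arch(arm64)
    // Float16 is only referenced inside the arm64 branch, so the x86_64
    // slice of a Universal build never sees the type.
    return (0..<count).map { index in
        Float(raw.load(fromByteOffset: index * 2, as: Float16.self))
    }
    #else
    return nil
    #endif
}

// 0x3C00 and 0x4000 are the IEEE 754 half-precision bit patterns for 1.0 and 2.0.
let halfBits: [UInt16] = [0x3C00, 0x4000]
let halfCount = halfBits.count
let decoded = halfBits.withUnsafeBytes { buffer in
    readHalfPrecision(buffer.baseAddress!, count: halfCount)
}
```

On Apple Silicon `decoded` holds the converted floats; on other architectures it is `nil` and the caller chooses its own fallback.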
Alex	284ce520f9	feat(tts/magpie): nanocodec v4 (fp32 + int8 palettize) precision (#581)
## Summary
Add `MagpieNanocodecPrecision.fp32Pal` selecting `nanocodec_decoder_v4`: v3's fp32 architecture with 8-bit kmeans-palettized weights. Acoustically transparent vs v3 at ~4× smaller on disk and ~11% lower peak RSS. Same recipe Kokoro Noise uses for `fp32 + int8pal`. Compute units track precision: `.fp32Pal` pins `.cpuOnly` (palettized weights dequantize to fp32 at runtime; ANE refuses fp32 / GPU is 50%+ slower than CPU on fp32 codec).
## Bench (M2, 16 GB, .cpuOnly, T_in=24, 5 warmup + 50 timed iters)
| metric | v3 fp32 | v4 fp32+int8pal | delta |
|---|---|---|---|
| mlpackage on disk | 121.0MB | 30.9MB | -74% |
| post-load RSS delta | +59.9MB | +61.7MB | eq. |
| peak RSS | 700.8MB | 621.8MB | -11% |
| latency median | 117.6ms | 117.1ms | eq. |
| latency p95 | 145.9ms | 123.6ms | -15% |
| RTFx (codec) | 9.48× | 9.52× | eq. |
| SNR vs v3 (AR codes) | inf. | 33.6dB | clean* |

*User-confirmed acoustically transparent on AR-emitted speech.
## Fallback chain
Each candidate carries its own config so the fallback doesn't inherit the primary's compute-unit selection. fp16 (v2) is only reached when explicitly requested or when no other candidate is present, since it's audibly noisy on voiced speech:
| Requested | Order |
|---|---|
| `.fp32Pal` | v4 → v3 → v2 |
| `.fp32` | v3 → v4 → v2 |
| `.fp16` | v2 → v4 → v3 |

If every chunked artifact is missing the loader falls through to legacy monolithic v1 with `.cpuOnly` (audibly noisy).
## HF artifacts Already uploaded to `FluidInference/magpie-tts-multilingual-357m-coreml`: - `nanocodec_decoder_v4.mlmodelc/` - `nanocodec_decoder_v4.mlpackage/` ## Companion PR mobius converter: https://github.com/FluidInference/mobius/pull/54 ## Test plan - [x] `swift build` green - [x] `swift test --filter MagpieConstantsTests` 5/5 pass - [x] `swift format lint` clean for changed files - [ ] End-to-end `MagpieTtsManager` synth with `.fp32Pal` once HF artifacts propagate to user caches	2026-05-04 23:22:34 -04:00
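The fallback table in 284ce520f9 maps directly onto an ordered candidate list. The sketch below encodes that table and the final v1 fallback; the enum and function names are hypothetical and do not reflect the real `MagpieModelStore` loader, which also carries per-candidate compute-unit configuration.

```swift
/// Hypothetical labels for the nanocodec builds named in the commit message.
enum NanocodecBuild { case v1, v2, v3, v4 }

/// Hypothetical mirror of the requested precision.
enum RequestedPrecision { case fp32Pal, fp32, fp16 }

/// Candidate order per requested precision, copied from the fallback table.
func candidateOrder(for precision: RequestedPrecision) -> [NanocodecBuild] {
    switch precision {
    case .fp32Pal: return [.v4, .v3, .v2]
    case .fp32: return [.v3, .v4, .v2]
    case .fp16: return [.v2, .v4, .v3]
    }
}

/// Picks the first available chunked build, else legacy monolithic v1, else nil.
func resolveBuild(
    requested: RequestedPrecision,
    available: Set<NanocodecBuild>
) -> NanocodecBuild? {
    if let hit = candidateOrder(for: requested).first(where: { available.contains($0) }) {
        return hit
    }
    return available.contains(.v1) ? .v1 : nil
}
```

For example, a cache holding only v2 and v4 resolves a `.fp32` request to v4, so the noisier fp16 build is reached only when nothing else is present.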
Alex	8389c1b714	feat(tts/magpie): nanocodec v1/v2/v3 + decoder_step ANE pin + dual-precision API (#580 ) ## Summary Companion to mobius PR FluidInference/mobius#53. Wires the new nanocodec v2/v3 builds into the FluidAudio Magpie runtime, plus pins `decoder_step` to ANE for ~2× wall speedup. ## Commits - `5879a32b3` — `fix(tts/magpie): pin decoder_step to ANE for ~2x speedup + correct EOS` - `decoder_step.mlmodelc` was running CPU+GPU. Pinning to `.cpuAndNeuralEngine` halves wall on M2. - EOS handling: don't emit the post-EOS frame. - `ec7051504` — `feat(tts/magpie): chunked T=24 fp32 nanocodec + edge-pad (Phase C v2)` - Slide a 24-frame window with stride 8, overlap 16 (= dilated-conv input receptive field). - Edge-replicate context at sequence boundaries instead of zero-padding (zero-pad produces a sharp pop in the first ~30 ms). - `2f0aab7a7` — `feat(tts/magpie): dual fp16/fp32 nanocodec t24 builds via MagpieNanocodecPrecision` - New `MagpieNanocodecPrecision` enum (`.fp16` / `.fp32`). - Compute-unit dispatch: fp32 → `.cpuOnly` (ANE is fp16-only); fp16 → `.cpuAndNeuralEngine` unless caller pinned CPU. - Plumbed through `MagpieModelStore.init` and `MagpieTtsManager.init` / `downloadAndCreate`. - `4bd31469f` — `refactor(tts/magpie): nanocodec v1/v2/v3 versioning (drop t24 prefix)` - Final naming: v1 = legacy mono, v2 = chunked fp16, v3 = chunked fp32. - `requiredModels` now lists `nanocodecDecoderV3File` so legacy v1-only users auto-upgrade on next bulk fetch. - Load chain: primary (precision-matched) → secondary (cross-precision warning) → legacy v1 fallback. 
## Production state
| Build | File | Precision | Shape | Selector | Audio |
|---|---|---|---|---|---|
| v1 | `nanocodec_decoder.mlmodelc` | fp16 | T=256 monolithic | legacy fallback | noisy + slow |
| v2 | `nanocodec_decoder_v2.mlmodelc` | fp16 | T_in=24 chunked | `MagpieNanocodecPrecision.fp16` | noisy / fast |
| v3 | `nanocodec_decoder_v3.mlmodelc` | fp32 | T_in=24 chunked | `MagpieNanocodecPrecision.fp32` (default) | clean |

All three live on `FluidInference/magpie-tts-multilingual-357m-coreml`.
## Background
Phase F mixed-precision sweep (mobius#53) confirmed no fp16 op/location combination recovers cleanliness; production stays on v3 (fp32) with v2 as opt-in for throughput-bound callers willing to accept the 27 dB SNR floor.
## Test plan
- [x] `swift format` clean
- [x] `swift build` clean
- [ ] Sanity-check `swift test --filter MagpieTtsTests` (if present)
- [ ] Spot-check synthesis via CLI on default speaker	2026-05-04 22:35:22 -04:00
Alex	bdbff4d88a	feat(tts/kokoro-ane/zh): consolidated Mandarin G2P (erhua + jieba HMM + g2pW) (#572 items 1, 3, 4) (#579 ) ## Summary Consolidates PRs #574, #575, and #576 into a single landing for Mandarin G2P enhancements per [issue #572](https://github.com/FluidInference/FluidAudio/issues/572). All three features are non-overlapping and stack cleanly inside `MandarinG2P.phonemize`: - Item 3 — Erhua merging (was #574): folds trailing `儿` into the previous syllable so `小孩儿` emits a single r-coloured token instead of a stray `er` tail. - Item 4 — Jieba HMM tail (was #575): re-segments OOV runs of single-char fallbacks via a 4-state B/M/E/S Viterbi to recover proper-noun boundaries (`特朗普`, `比特币`); recovered words are then retried against the phrase dict before per-char fallback. - Item 1 — g2pW polyphone disambiguation (was #576): int8 BERT-base classifier (152 MB CoreML) picks the right reading for polyphonic Hanzi (`行`/`长`/`重`/`朝`/…) using the full sentence as context. Best-effort: falls back to dict-only when assets are missing. Item 2 (number normalization) already merged via #573. Items 5 (POS sandhi, #577) and 6 (custom lexicon, #578) remain as separate PRs. ## Pipeline order ``` text → MandarinNumberNormalizer.normalize (already on main) → normalizeText (punctuation) → segment(): FMM phrases + jieba HMM tail (NEW: item 4) → polyphone disambiguation via g2pW (NEW: item 1) → diacritic → digit (MandarinPinyinNormalizer) → MandarinErhua.merge (NEW: item 3) → MandarinToneSandhi.apply → MandarinBopomofoMap.encode ``` ## API changes - `MandarinG2P.phonemize` is now `async throws` (g2pW disambiguation requires async). Backwards-compatible callers must add `try await`. - `MandarinG2P.init(dict:, jiebaHmm:, g2pw:)` — both new parameters are optional, default `nil` keeps baseline behaviour. - New `Segment.bopomofoOverride(String)` case carries g2pW's pre-encoded bopomofo + tone digit; bypasses sandhi. 
## Asset requirements Pulled from `huggingface.co/FluidInference/kokoro-82m-coreml`: - `ANE-zh/g2pw/g2pw.mlmodelc/` — bulk `ensureModels` (added to `requiredModelsZh`) - `ANE-zh/g2pw/vocab.txt` + `POLYPHONIC_CHARS.txt` — `ensureMandarinG2pw` lazy fetch - `ANE-zh/assets/jieba_hmm_{start,trans,emit}.bin` — `ensureMandarinJiebaHmm` lazy fetch All three optional asset groups degrade gracefully: missing g2pW falls back to dict-first reading, missing jieba HMM falls back to per-char singles. ## Test plan - [x] `swift build` clean - [x] 102 tests pass across `MandarinG2PTests`, `MandarinErhuaTests`, `MandarinJiebaHmmTests`, `MandarinPolyphoneCatalogTests`, `MandarinBertTokenizerTests`, `MandarinNumberNormalizerTests` - [x] Polyphone target tracking through jieba HMM resegmentation: `flushHanziRun` carries absolute char positions so g2pW sees the right context window - [x] Backward-compat: `MandarinG2P(dict:)` (no jieba, no g2pW) still passes baseline tests ## Closes - #574 (erhua) - #575 (jieba HMM) - #576 (g2pW) Refs #572.	2026-05-04 01:01:39 -04:00
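The erhua merge described in bdbff4d88a (folding a trailing `儿` into the previous syllable) might look roughly like this at the word level. Everything about the representation is an assumption: the `Syllable` type, the digit-suffixed pinyin strings, and the merged spelling (`hai2` becoming `hair2`) are hypothetical and may not match what `MandarinErhua.merge` actually emits.

```swift
/// Hypothetical syllable record: one Hanzi plus pinyin with a trailing tone digit.
struct Syllable: Equatable {
    var hanzi: Character
    var pinyin: String
}

/// Folds a word-final 儿 into the preceding syllable by inserting "r" before
/// that syllable's tone digit. Words where 儿 is not final (e.g. 儿子) are untouched.
func mergeErhua(_ word: [Syllable]) -> [Syllable] {
    guard word.count >= 2, let last = word.last, last.hanzi == "儿" else {
        return word
    }
    var merged = Array(word.dropLast())   // drop the standalone 儿 syllable
    var previous = merged.removeLast()
    if let tone = previous.pinyin.last, tone.isNumber {
        previous.pinyin = String(previous.pinyin.dropLast()) + "r" + String(tone)
    } else {
        previous.pinyin += "r"
    }
    merged.append(previous)
    return merged
}
```

Under these assumptions, `小孩儿` collapses from three syllables to two, which matches the commit's description of emitting a single r-coloured token instead of a stray `er` tail.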
Alex	684ceaf42b	feat(tts/kokoro-ane/zh): POS-aware tone sandhi (#572 item 5) (#577) ## Summary Issue #572 item 5. The baseline `MandarinToneSandhi` rules are POS-independent and audibly misfire on three contexts: - 一 ordinals (`第一 dì-yī`, `一月 yī-yuè`, `一号`) keep tone 1; baseline promotes them to 2/4 unconditionally - 不 reduplication (`要不要`, `好不好`, `行不行`) keeps `不` at tone 4 inside `[X, 不, X]`; baseline misfires with bu4+tone4 → bu2 - 3+3 chains apply within prosodic words; cross-word 3+3 only promotes the word-final syllable. Baseline's pure-run rule cascades too far left (`我也想去` → wrong `2 2 2 3` instead of correct `2 2 3 4`) ## Design `MandarinToneSandhiPOS.apply(_:words:tags:)` is a pure function that takes the syllable buffer plus pre-computed word ranges + jieba POS tags. Backward-compat path stays on `MandarinToneSandhi.apply` for callers without a POS tagger (existing behavior preserved). ## Test plan - [x] 一 ordinal carve-outs (`第一`, `一月`) - [x] 一 contextual sandhi still fires in non-numeral words (`一定`, `一起`) - [x] 不 reduplication keeps tone 4 (`要不要`) - [x] 不 promotion still fires for non-reduplication (`不要`) - [x] In-word 3+3 run promotes all but last - [x] Cross-word 3+3 only promotes the boundary - [x] Cross-word chain stops at non-3 (`我是你的`) - [x] Backward-compat for single-word ranges - [x] `swift build` + `swift format lint` clean - 14 unit tests, all passing ## Out of scope (follow-up) - MandarinG2P routing to `MandarinToneSandhiPOS` lands once PR #575 (jieba HMM + POS tagger tables) merges and the POS tagger is loaded by `KokoroAneModelStore.mandarinG2PPipeline`. Until then this module is testable in isolation via synthetic POS input. ## Depends on - #575, for the POS tagger Viterbi + tables that produce the `words`/`tags` arrays at runtime	2026-05-04 00:39:50 -04:00
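The `不` rule from 684ceaf42b is the easiest of the three to show in isolation: promote `不` to tone 2 before a tone-4 syllable, except inside an `[X, 不, X]` reduplication. The sketch below is a simplified illustration with hypothetical types; it ignores word ranges and POS tags, which the real `MandarinToneSandhiPOS.apply(_:words:tags:)` takes as input.

```swift
/// Hypothetical syllable record: the Hanzi and its tone number (1-5).
struct TonedSyllable: Equatable {
    var hanzi: String
    var tone: Int
}

/// Applies only the 不 rule:
/// - inside [X, 不, X] the tone stays 4 (reduplication carve-out);
/// - otherwise 不 (tone 4) before another tone-4 syllable becomes tone 2.
func applyBuSandhi(_ syllables: [TonedSyllable]) -> [TonedSyllable] {
    var out = syllables
    for i in syllables.indices where syllables[i].hanzi == "不" && syllables[i].tone == 4 {
        let hasNeighbours = i > 0 && i + 1 < syllables.count
        let isReduplication = hasNeighbours
            && syllables[i - 1].hanzi == syllables[i + 1].hanzi
        if isReduplication { continue }
        if i + 1 < syllables.count && syllables[i + 1].tone == 4 {
            out[i].tone = 2
        }
    }
    return out
}
```

With this rule `不要` becomes bu2 yao4 while `要不要` keeps bu4, matching the two `不` cases listed in the test plan.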
Alex	f202200d1f	feat(tts/kokoro-ane): user-supplied Mandarin custom lexicon (#572 item 6) (#578) ## Summary Issue #572 item 6. Lets app developers ship a project-specific Mandarin lexicon that overrides both the bundled phrase dict and g2pW. Useful for proper nouns the bundled dict doesn't cover (brand names, technical jargon, regionalisms) and for cases where the user knows the correct reading and wants to bypass any heuristic. ## Test plan - [x] Custom lexicon entry overrides phrase dict - [x] Custom lexicon entry overrides single-char dict - [x] Empty lexicon = no-op (baseline preserved) - [x] `swift build` + `swift format lint` clean ## Independent This PR is independent of #573–#577. Land in any order.	2026-05-04 00:39:37 -04:00
Alex	0ea7c900b0	feat(tts/kokoro-ane/zh): number/date/currency verbalization (#572 item 2) (#573) ## Summary Issue #572 item 2. Pre-pass that verbalizes numerics, dates, times, percentages, fractions, and currencies into Hanzi before `MandarinG2P` segments the text. Without it, conversational input like `¥120`, `2025年5月3日`, `8:30`, `99%` either fragments into per-digit literals or gets dropped entirely by the segmenter. - Port misaki/zh/num.py rules: cardinals up to 兆 (10¹²), decimal point form, percentages with 百分之, fractions (二分之一), money (¥/$/€/￥), dates (YYYY年MM月DD日, YYYY/MM/DD), times (HH:MM[:SS]) - Hook in `MandarinG2P.phonemize` before punctuation normalization - Pure function, no new public API surface on `KokoroAneManager` ## Test plan - [x] `MandarinNumberNormalizerTests` covers cardinals, decimals, percentages, fractions, money, dates, times - [x] `MandarinG2PTests` baseline regression (no behavior change on pure-Hanzi input) - [x] `swift build` + `swift format lint` clean	2026-05-04 00:36:31 -04:00
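To give a feel for the verbalization pre-pass in 0ea7c900b0, here is a deliberately small sketch covering only cardinals from 0 to 9999 and whole-number percentages. It is not the misaki port: the function names are hypothetical, larger units (万, 亿, 兆), decimals, dates, times, and currency are omitted, and context-dependent choices such as 两 for 二 are not handled.

```swift
let hanziDigits = ["零", "一", "二", "三", "四", "五", "六", "七", "八", "九"]
let hanziUnits = ["", "十", "百", "千"]           // ones, tens, hundreds, thousands
let placeValues = [1, 10, 100, 1000]

/// Verbalizes 0...9999 as a Hanzi cardinal; returns nil outside that range.
func verbalizeCardinal(_ n: Int) -> String? {
    guard (0...9999).contains(n) else { return nil }
    if n == 0 { return "零" }
    var out = ""
    var pendingZero = false
    // Walk from the thousands place down to the ones place.
    for place in stride(from: 3, through: 0, by: -1) {
        let digit = (n / placeValues[place]) % 10
        if digit == 0 {
            // Interior zeros collapse to a single 零; trailing zeros are dropped.
            if !out.isEmpty { pendingZero = true }
            continue
        }
        if pendingZero {
            out += "零"
            pendingZero = false
        }
        out += hanziDigits[digit] + hanziUnits[place]
    }
    // 10...19 are read 十, 十一, ... rather than 一十, 一十一, ...
    if (10..<20).contains(n) { out.removeFirst() }
    return out
}

/// Verbalizes a whole-number percentage using the 百分之 prefix.
func verbalizePercent(_ n: Int) -> String? {
    verbalizeCardinal(n).map { "百分之" + $0 }
}
```

Under these rules 120 reads as 一百二十 and 99% as 百分之九十九, which is the kind of output the segmenter can then process as ordinary Hanzi.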
Benjamin Lee	e4ce919762	Finalized DiarizerTimeline segment updates no longer commit tentative segments (#568) Fixes a bug that caused the trailing diarizer segment to disappear once the person stopped talking if `minFramesOff` was nonzero.	2026-05-04 00:18:18 -04:00
Alex	98acce358a	feat(tts/kokoro-ane): add Mandarin (v1.1-zh) variant (#570 ) ## Summary Phase 1 — variant plumbing + phonemes-bypass synthesis for Kokoro-82M-v1.1-zh on the existing 7-stage CoreML chain. Callers that supply pre-computed Bopomofo (e.g. via misaki[zh] in Python or a future Swift G2P) can now synthesize Mandarin audio. Mandarin text-to-Bopomofo G2P is deferred to a separate Phase 2 PR. The 7-stage chain is language-agnostic by construction — input ids, voice slices, and per-stage I/O contracts are identical across v1.0 (English) and v1.1-zh (Mandarin). Only the embedding vocab (177 → 171), the HF subdir (`ANE/` → `ANE-zh/`), the voice-file layout (flat → `voices/<voice>.bin`), and the default voice (`af_heart` → `zf_001`) differ. ## Changes - New `Repo.kokoroAneZh` → `FluidInference/kokoro-82m-coreml/ANE-zh` with `subPath = ANE-zh`, `folderName = kokoro-82m-coreml/ANE-zh`. - `ModelNames.KokoroAne.requiredModelsZh` references `voices/zf_001.bin` so the downloader's all-files-present check resolves correctly when the file lands at `<repoDir>/voices/zf_001.bin`. - New `KokoroAneVariant` enum (`.english` / `.mandarin`) with `defaultVoice`, `useVoicesSubdir`, and `repo` accessors. - `KokoroAneResourceDownloader.ensureModels` and `ensureVoicePack` accept a `variant` param (default `.english` keeps existing callers source-compatible). Mandarin voice fetch creates the `voices/` parent directory on demand. - `KokoroAneModelStore` and `KokoroAneManager` thread the variant through to download + load. - `KokoroAneManager.synthesize(text:)` and `synthesizeDetailed(text:)` reject Mandarin with a clear error directing callers to `synthesizeFromPhonemes()`. The phonemes-bypass entry point already works for any vocab via `vocab.encode → 7-stage chain`. - CLI `--variant` flag accepts `en` / `english` / `zh` / `mandarin` for the `kokoro-ane` backend. Mandarin runs treat the input text as pre-computed Bopomofo and call `synthesizeFromPhonemesDetailed`. 
- 12 new unit tests (`KokoroAneVariantTests`): variant defaults, repo wiring, required-files set routing, manager init signatures, and Mandarin text-path rejection on both `synthesize` and `synthesizeDetailed`. End-to-end Mandarin synthesis verified against PyTorch ground truth on `zf_001` and `zm_009`. Background-noise investigation tracked separately in #569 (atan2 phase correction in upstream `CoreMLForwardSTFT`). ## Test plan - [x] `swift build` clean - [x] `swift test --filter KokoroAneVariantTests` — 12/12 pass - [x] `swift format lint` clean (only pre-existing warnings on `fastV2_1`/`balancedV2_1`/`highContextV2_1` enum cases unrelated to this PR) - [ ] After HF upload of `ANE-zh/` bundle, end-to-end smoke test: `swift run fluidaudiocli tts "ㄋㄧˇㄏㄠˇㄕˋㄐㄧㄝˋ。" --backend kokoro-ane --variant zh --voice zf_001 --output /tmp/zh.wav` - [ ] No regressions on existing English path (default-arg behavior preserved) ## Out of scope - Mandarin text-to-Bopomofo G2P — Phase 2 (separate PR). - HF upload of `ANE-zh/` bundle — handled outside this repo. - Updating `Documentation/` with Mandarin voice list — defer to Phase 2 when the path is fully usable end-to-end.	2026-05-03 22:03:27 -04:00
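The variant plumbing described in 98acce358a can be pictured as a small enum. The sketch below is hypothetical: `VariantSketch` is not the real `KokoroAneVariant`, and the `.bin` file name for the flat English layout is an assumption (the commit message only gives `voices/<voice>.bin` for Mandarin and calls the English layout "flat").

```swift
/// Hypothetical mirror of the variant described in the commit message.
enum VariantSketch: Equatable {
    case english
    case mandarin

    /// Parses the CLI spellings listed for `--variant`.
    init?(cliValue: String) {
        switch cliValue.lowercased() {
        case "en", "english": self = .english
        case "zh", "mandarin": self = .mandarin
        default: return nil
        }
    }

    /// Default voices quoted in the commit message.
    var defaultVoice: String {
        switch self {
        case .english: return "af_heart"
        case .mandarin: return "zf_001"
        }
    }

    /// Mandarin voices live under `voices/`; English voices are flat.
    var useVoicesSubdir: Bool { self == .mandarin }

    /// Relative path of a voice file inside the repo directory (assumed `.bin` suffix).
    func voiceRelativePath(for voice: String) -> String {
        useVoicesSubdir ? "voices/\(voice).bin" : "\(voice).bin"
    }
}
```

Keeping the layout and default-voice differences on the enum means downloader and manager code can stay variant-agnostic and simply ask the variant for paths.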
Benjamin Lee	821e0f97bc	Fixed an LS-EEND constructor (#567 ) The asynchronous `LSEENDDiarizer` constructor that loads the model during construction did not update the timeline config's speaker count or frame duration, unlike the two-step form: ```swift diarizer = LSEENDDiarizer() await diarizer.initialize(variant: .dihard3, stepSize: .step500ms) ```	2026-05-02 19:24:54 -04:00
Benjamin Lee	0a9aace382	Fixed short segment filter for trailing tentative segments in DiarizerTimeline (#566 ) Apparently I did it incorrectly last time. --------- Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-02 10:11:32 -07:00
Benjamin Lee	5bb84bc0b0	Fix DiarizerTimeline Short Segment Filter (#565 ) The `DiarizerTimeline` was incorrectly closing short gaps as soon as another speech frame appeared, instead of waiting for a sufficiently long speech segment with which to merge the old one. This bug fix ensures that gaps are only closed between two segments of sufficient length (at least `config.minFramesOn` frames long). Also removed an unnecessary `throws` from a non-throwing `LSEENDDiarizer` constructor. --------- Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-01 23:25:58 -04:00
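The rule from #565 (close a gap only when both neighbouring segments are at least `config.minFramesOn` frames long) can be sketched offline as below. This is a simplified batch version under stated assumptions: the real `DiarizerTimeline` works on streaming frames, and `SegmentSketch`, `closeShortGaps`, and the `maxGapFrames` parameter are hypothetical names introduced only for illustration.

```swift
/// Hypothetical speech segment in frame units; `end` is exclusive.
struct SegmentSketch: Equatable {
    var start: Int
    var end: Int
    var length: Int { end - start }
}

/// Merges neighbouring segments across a short gap, but only when BOTH
/// segments are at least `minFramesOn` frames long. `maxGapFrames` is an
/// assumed name for the largest gap that may be closed.
func closeShortGaps(
    _ segments: [SegmentSketch],
    minFramesOn: Int,
    maxGapFrames: Int
) -> [SegmentSketch] {
    var merged: [SegmentSketch] = []
    for segment in segments {
        if let last = merged.last,
            segment.start - last.end <= maxGapFrames,
            last.length >= minFramesOn,
            segment.length >= minFramesOn
        {
            // Both sides are long enough: close the gap by extending the previous segment.
            merged[merged.count - 1].end = segment.end
        } else {
            // Otherwise keep the segment separate; a single stray speech frame
            // is not enough to close the gap.
            merged.append(segment)
        }
    }
    return merged
}

let timeline = [SegmentSketch(start: 0, end: 10), SegmentSketch(start: 12, end: 30)]
print(closeShortGaps(timeline, minFramesOn: 5, maxGapFrames: 3))
```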
Alex	cad8a2b563	feat(asr/cohere): long-form transcribeLong + cold/warm docs (#564 ) ## Summary Two related changes that grew out of the cohere isolated-bench investigation: ### 1. Long-form Cohere ASR — `CoherePipeline.transcribeLong` The encoder is fixed at a 35 s window, so the prior `transcribe()` silently truncated longer audio via `padOrTruncate(... fixedFrames: 3_500)` in `Sources/FluidAudio/ASR/Cohere/CoherePipeline.swift:250`. `transcribeLong` slices audio into 35 s chunks with 5 s overlap (matches upstream `cohere-pytorch/config.json` `overlap_chunk_second: 5`) and stitches adjacent chunks via token-level longest-common-substring merge. No model changes — encoder shape stays `[1, 128, 3500]`, decoder cache shape unchanged. - Audio ≤ 35 s short-circuits to the existing single-chunk `transcribe()` path → byte-identical short-form behavior, zero perf delta on FLEURS / LibriSpeech (which are all ≤ 35 s) - Audio > 35 s: hop = 30 s, decode each chunk independently, merge token streams (drop the suffix's matched head, keep prefix as-is) - LCS window bounded to 32 tokens per seam → O(K²) merge is negligible vs. decode - Per-chunk encoder/decoder/total seconds are summed into one `TranscriptionResult` CLI rewiring: - `cohere-transcribe`, `cohere-benchmark`, `tts-benchmark` now route through `transcribeLong` - `cohere-benchmark` no longer skips files exceeding 35 s Smoke-tested on a Mandarin 81 s WAV: full 80 s now transcribes (previously cut at 35 s). 10 unit tests cover `mergeTokenStreams` correctness (empty-input, no-overlap, threshold fallback, boundary overlap, offset overlap, longest-run preference, window bounds) and chunk-config constants. ### 2. Cold-start vs warm inference docs Adds a section to `Documentation/ASR/Cohere.md` capturing the isolated single-process bench (cold ANE compile ~186 s on M2 Tahoe; warm calls 3.4–4.6 s, RTFx 1.96×–8.73×). Clarifies process reuse is what unlocks the headline FLEURS/LibriSpeech RTFx. 
## Test plan - [x] `swift test --filter CohereLongFormTests` (10 / 10 pass) - [x] `swift test --filter CohereAsrConfigTests` and `CoherePipelineMaskTests` (no regressions) - [x] `swift build` (debug) clean; `swift build -c release` clean - [x] Smoke: `cohere-transcribe /tmp/cohere_test_80s.wav --language zh` transcribes full 80 s of audio (3 chunks merged, no duplicated overlap content) - [x] `swift format lint` — no new warnings in changed files	2026-05-01 10:26:27 -04:00
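The long-form scheme in #564 (35 s window, 5 s overlap, 30 s hop, token-level longest-common-substring merge over a 32-token seam window) can be sketched as follows. This is not the `CoherePipeline` implementation: `chunkRangesSketch`, `mergeTokenStreamsSketch`, and the `minMatch` threshold value are assumptions made for illustration.

```swift
/// Splits audio into fixed windows with overlap; audio at or under one
/// window stays a single chunk (the short-form path).
func chunkRangesSketch(
    durationSeconds: Double,
    window: Double = 35,
    overlap: Double = 5
) -> [(start: Double, end: Double)] {
    guard durationSeconds > window else { return [(start: 0, end: durationSeconds)] }
    let hop = window - overlap  // 30 s with the defaults
    var ranges: [(start: Double, end: Double)] = []
    var start = 0.0
    while true {
        let end = min(start + window, durationSeconds)
        ranges.append((start: start, end: end))
        if end >= durationSeconds { break }
        start += hop
    }
    return ranges
}

/// Stitches two token streams: finds the longest common substring between
/// the tail of `left` and the head of `right` (each bounded to `window`
/// tokens), keeps `left` as-is and drops `right` through the end of the match.
/// Falls back to plain concatenation when the match is shorter than `minMatch`
/// (an assumed threshold).
func mergeTokenStreamsSketch(
    _ left: [Int],
    _ right: [Int],
    window: Int = 32,
    minMatch: Int = 2
) -> [Int] {
    let tail = Array(left.suffix(window))
    let head = Array(right.prefix(window))
    guard !tail.isEmpty, !head.isEmpty else { return left + right }

    var best = 0
    var bestEndInHead = 0
    var previous = [Int](repeating: 0, count: head.count + 1)
    for i in 1...tail.count {
        var current = [Int](repeating: 0, count: head.count + 1)
        for j in 1...head.count where tail[i - 1] == head[j - 1] {
            current[j] = previous[j - 1] + 1
            if current[j] > best {
                best = current[j]
                bestEndInHead = j
            }
        }
        previous = current
    }
    guard best >= minMatch else { return left + right }
    return left + Array(right.dropFirst(bestEndInHead))
}

print(chunkRangesSketch(durationSeconds: 81).count)
print(mergeTokenStreamsSketch([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))
```

Bounding both sides to the seam window keeps the dynamic-programming table at most 32 × 32 per seam, which is why the quadratic merge cost stays negligible next to decoding.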
Alex	7603ac6733	feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557 ) ## Summary Adds `fluidaudio tts-benchmark`, a unified harness for measuring latency × efficiency × quality across every shipping TTS backend in FluidAudio, plus the model + runtime fixes needed to actually clear all six backends end-to-end on the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set). Also tags Magpie / StyleTTS2 / CosyVoice3 as beta at the API + docs level so users get a runtime warning on `initialize()` reflecting their actual perf / quality posture. ### Backends — all green on M2 / macOS 26 \| Backend \| Corpus \| Status \| Audio out (min / p50 / max) \| RTFx \| WER \| Notes \| \|---\|---\|---\|---\|---\|---\|---\| \| Kokoro ANE \| minimax-en (100/100) \| ✅ \| 3.5 s / 8.0 s / 11.4 s \| 5.19× \| 10.8% \| one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep \| \| Kokoro \| minimax-en (100/100) \| ✅ \| 3.5 s / 6.8 s / 9.3 s \| 2.02× \| 1.3% \| one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip \| \| PocketTTS \| minimax-en (100/100) \| ✅ \| 2.8 s / 6.3 s / 9.4 s \| 0.61× \| 1.4% \| streaming @ 24 kHz, 80 ms frames; TTFT 1244 ms — RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) \| \| Magpie \| minimax-en (100/100) \| ⚠️ BETA \| 4.7 s / 10.0 s / 20.6 s \| 0.64× \| 5.6% \| streaming TTFT @ 22.05 kHz: first chunk at 9.6 s p50 vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init \| \| StyleTTS2 \| minimax-en (100/100) \| ⚠️ BETA \| 9.6 s / 22.6 s / 32.6 s \| 2.72× \| 44.0% \| one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init \| \| CosyVoice3 \| minimax-zh (100/100) \| ⚠️ BETA \| 2.2 s / 6.5 s / 16.0 s \| 0.357×† \| n/a‡ \| post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine) — longest output 16.0 s 
(was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); whisper-large-v3 CER 1.68% (macro) / 1.84% (micro) across 100/100 phrases‡; RTFx < 1, runtime warning on init \| \| CosyVoice3 \| minimax-yue (100/100) \| ⚠️ BETA \| 3.3 s / 8.0 s / 16.1 s \| 0.249× \| n/a \| post auto-chunker; truncation 80/100 → 5/100 phrases (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth \| ⚠️ BETA = `${Backend}TtsManager.initialize()` emits a `logger.warning` flagging the perf / quality posture; safe to ship in non-latency-sensitive paths but read the per-backend doc first. ‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator` whitespace-tokenizes and Mandarin has no word boundaries (word-level WER reads ~100% and is meaningless). CER is `whisper-large-v3` against the rendered WAVs from the full 100-phrase `minimax-chinese` run via `Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this PR via `--asr-backend cohere` (see [Cohere ASR backend in the harness](#cohere-asr-backend-in-the-harness) below) and agrees with whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a `MILCompilerForANE` cache failure on this M2 host that drops it to RTFx ~0.13×, so whisper is the practical source-of-truth for the full 100-phrase run. Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category) live in `Documentation/TTS/Benchmarks.md`. Corpus attribution + reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`. ### RTFx caveat — phrase length and streaming granularity both matter Aggregate RTFx (audio_duration / wall_clock) is only directly comparable between backends when both produce similar phrase lengths and yield audio at the same granularity. Two things skew the headline number on this corpus: 1. Phrase-length spread. 
StyleTTS2 emits ~22 s p50 of audio per `minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3× more audio out. That's mostly long inter-word pauses + slow speaking rate baked into the LibriTTS multi-speaker checkpoint, not a measurement artifact. A 2.72× RTFx on 22 s audio = ~8 s wall — which matches the TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus RTFx ratios alone hide this. 2. Streaming granularity. PocketTTS posts 0.61× agg-RTFx vs. Kokoro's 2.02× but it's not slower from a user perspective: PocketTTS yields its first 80 ms audio frame at TTFT 1244 ms, Kokoro's first frame at TTFT 3113 ms (full one-shot chunk). The 0.61× is the per-frame cost averaged across the streaming run; what users feel is TTFT. \| Backend \| TTFT p50 \| First yield \| Implication \| \|-------------\|----------\|------------------\|--------------------------------------------\| \| PocketTTS \| 1244 ms \| 80 ms frame \| true streaming; conversational-ready \| \| Kokoro ANE \| 1586 ms \| full ~8 s chunk \| ~1.6 s to any audio; ANE-tuned \| \| Kokoro \| 3113 ms \| full ~7 s chunk \| clean quality, slower first-byte \| \| StyleTTS2 \| 6671 ms \| full ~22 s chunk \| one-shot only; long phrase output amortizes the wall \| \| Magpie \| 9580 ms \| first chunk @ 22.05 kHz \| streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s — 36% earlier playback start \| \| CosyVoice3 \| 14091 / 35681 ms (zh / yue) \| full chunk @ 24 kHz \| one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only \| For conversational use cases, TTFT > RTFx. PocketTTS (true streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE (small one-shot chunks) are the three backends that meaningfully clear the "user feels it's responsive" bar today. 
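The wall-clock figures quoted in the RTFx caveat follow from the definition RTFx = audio_duration / wall_clock. A minimal sketch of that arithmetic (the function name is illustrative and not part of the harness):

```swift
/// Wall-clock seconds implied by an aggregate RTFx,
/// using RTFx = audio_duration / wall_clock.
func impliedWallSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    precondition(rtfx > 0, "RTFx must be positive")
    return audioSeconds / rtfx
}

// StyleTTS2: ~22 s of audio at 2.72x is roughly 8 s of wall time.
let styleTts2Wall = impliedWallSeconds(audioSeconds: 22, rtfx: 2.72)
// Kokoro: ~7 s of audio at 2.02x is roughly 3.5 s of wall time.
let kokoroWall = impliedWallSeconds(audioSeconds: 7, rtfx: 2.02)
print(styleTts2Wall, kokoroWall)
```

A higher RTFx can therefore still mean a longer wait when the backend emits about three times as much audio for the same text.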
### Beta callouts (StyleTTS2, Magpie, CosyVoice3) Three of the six shipping backends post numbers that callers should weigh against an explicit caveat: - StyleTTS2 — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%. The misaki→espeak post-pass remap closed half the gap; the remainder is BART G2P misses + diffusion-sampler formant breaks on long phrases. - Magpie — agg-RTFx 0.64× on M2 — below real-time but streaming via `synthesizeStream` so TTFT (9.6 s p50) is significantly better than full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to ~30 s. - CosyVoice3 — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token Flow input cap is now worked around at the call site by the auto-chunker (long phrases split + crossfaded), dropping cantonese truncation from 80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100 residual is the long-tail token-rate worst case; the structural fix is re-exporting Flow with a larger fixed input shape (tracked in `mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` + a `.warning`-level `LLM-Decode budget exhausted` log still surface any truncation, and the harness writes `finished_on_eos` into each phrase in the JSON report. Each manager now logs a `.warning`-level beta notice on `initialize()` (mirroring the existing CosyVoice3 pattern) so anyone wiring these into a product gets a console signal, not a silent surprise. Docs (`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md` StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same caveat at the top. ### Model + runtime fixes landed in this PR #### CosyVoice3 stateless port (`71130c9fb`) Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on HuggingFace. 
Drops ~95 LOC of state plumbing for ~30 LOC of plain `MLDictionaryFeatureProvider` prediction with explicit kv carry-forward; lowers the availability gate from macOS 15 / iOS 18 back to the package baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the rename. #### CosyVoice3 HiFT timeout fix (`267766b62`) `minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`, which let the planner place most of the graph on ANE but kept at least one op on the BNNS CPU async-dispatch path; long phrases tripped the BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of user-supplied compute-units, removing the BNNS path entirely. Verified on 100/100 zh + 100/100 yue. #### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`) The autoregressive decode loop runs ~163 steps per phrase to fill the 250-token cap. Each step takes the previous step's KV cache as `kv_k` / `kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh `kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side `MLMultiArray` allocation per step. Fix pre-allocates 4 KV back-buffers + a logits backing, rotates front/back/spare across steps via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc on first rejection (one-shot `logger.warning`). Mirrors the Magpie pattern. Result on full `minimax-chinese`: agg-RTFx 0.269 → 0.357 (+33%), TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470 MB. #### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`) The 250-token Flow input cap means a single synth pass produces at most ~6.5 s of audio regardless of input length. 
Re-exporting Flow with a larger fixed input shape is gated on upstream conversion work, so this PR works around it at the call site: long inputs are split at sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized independently, and merged with an 8 ms equal-power cosine crossfade. Splitter policy: hard enders (`. ! ? 。！？ \n`) commit always; soft enders (`，、；： ; ,` + ASCII space) commit only at-or-past budget; force-split at +30 token overshoot if no natural boundary exists. `defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap minus a typical 60–90-token speech-prompt context). Token-rate heuristic is calibrated against minimax-zh + minimax-yue runs: \| Char class \| Tokens / char \| Rationale \| \|------------\|---------------\|--------------------------------------------------------------\| \| CJK \| 7.5 \| worst-case observed in real generation; varies 5.5–9 per char \| \| ASCII \| 1.5 \| matches BPE rate on English text \| \| Other \| 2.5 \| conservative for accented Latin / non-CJK Unicode \| Validation on full `minimax-cantonese` (100 phrases, M2): \| Metric \| Pre-chunker \| Post-chunker \| Δ \| \|-------------------------------------------\|-------------\|--------------\|------------\| \| `finished_on_eos=false` (truncated) \| 80 / 100 \| 5 / 100 \| −94% \| \| Longest audio output \| 6.5 s \| 16.1 s \| +148% \| \| agg-RTFx \| 0.245× \| 0.249× \| +1.6% \| \| TTFT p50 \| 23.9 s \| 35.7 s \| +49% \| The TTFT regression is the cost of running multiple synth passes per long phrase — splitting unblocks long-form output at the price of wall-clock latency. The 5/100 residual truncation is the long-tail token-rate worst case (some chars hit ~9 tokens/char); raising the per-CJK heuristic further would over-fragment short phrases. Cleaner fix is the Flow re-export. 16-test suite covers tokenization estimates, hard/soft/force-split policy, and the crossfade arithmetic. 
Lives in `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift` + `CosyVoice3TtsManager.concatWithCrossfade`. #### Magpie streaming TTFT wire-up (`ace0bf485`) `TtsBenchmarkCommand.swift` now drives Magpie through `MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first `MagpieAudioChunk` emit instead of conflating it with full-synth wall time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 9.6 s vs full synth-p50 15.1 s — agents start playback ~36% earlier than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run benefit; fundamentals unchanged). #### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`) `text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT: tensor_buffer has known strides while the model has FlexibleShapeInfo`. The CoreML runtime rejects two access patterns on outputs from a flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element subscripts — and the original `sliceFirstAxis2D` helper used both. Fix rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling `.float32`, `.float16`, `.double`) and computes the flat index from the known `(1, leading, trailing)` row-major layout. Verified on full 100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS demo voice. #### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`) After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still landed at WER 0.581 / CER 0.476 — an order of magnitude worse than Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only --corpus` mode and disproved the silent-vocab-drop hypothesis: only 0.09% of scalars dropped on the full 100-phrase corpus (11 ASCII hyphens / 12247 scalars). Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2 share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2 LibriTTS checkpoint was trained by yl4579 on espeak-ng-phonemized LibriTTS — predating misaki by years. 
The 178-vocab accepts both forms (e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic embeddings for the misaki ligature glyphs are essentially untrained noise. Side-by-side comparison against locally-installed `espeak-ng -v en-us --ipa -q` flagged four systematic divergences: \| misaki \| espeak-ng \| example \| \|--------\|-----------\|--------------------------\| \| `ʧ` \| `tʃ` \| choice → `tʃˈɔɪs` \| \| `ʤ` \| `dʒ` \| jump → `dʒˈʌmps` \| \| `ɜɹ` \| `ɝ` \| girl → `ɡˈɝl` \| \| `əɹ` \| `ɚ` \| over → `ˈoʊvɚ` \| Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated on `.americanEnglish` and applied to the assembled phoneme string after every word has been emitted by the BART G2P. Lives alongside the existing per-piece misaki diphthong remap. Result on the same 100-phrase MiniMax-English run with the same `libritts_696` voice and same Parakeet TDT roundtrip: \| Metric \| Pre \| Post \| Δ \| \|-----------------\|-------\|-------\|--------\| \| Macro WER \| 0.581 \| 0.440 \| −24.2% \| \| Macro CER \| 0.476 \| 0.241 \| −49.5% \| \| TTFT p50 (ms) \| 8937 \| 6671 \| −25.4% \| \| Agg RTFx \| 2.36× \| 2.72× \| +15.3% \| \| Peak RSS (MB) \| 1428 \| 963 \| −32.6% \| Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice. Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster on word-level G2P misses from the BART itself (`practical → practicckles`, `separation → expiration`) and diffusion-sampler formant breaks; closing the rest of the gap to Kokoro likely needs richer espeak coverage or libespeak-ng vendor — tracked separately. #### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`) `StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit `logger.warning` beta notices mirroring the existing `CosyVoice3TtsManager.initialize` pattern. 
Backends docs (`Magpie.md` Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️ Beta / experimental` callouts so the perf / quality posture is visible at every entry point — runtime, manager docstring, doc top, PR body. #### Magpie `outputBackings` rejection fallback (`72dae8400` + `9767e1ef9`) The shipped `decoder_step.mlmodelc` reaches the user before the rebuild lands, so CoreML can reject our `outputBackings` dictionary on a name-mismatch. Latched fallback path falls back to a fresh-alloc decode so the model still runs; first rejection latches the flag for the rest of the run. ### Cohere ASR backend in the harness (`8e741e659`) Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER through the harness against [Cohere Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into `--skip-asr`. Four new flags on `tts-benchmark`: - `--asr-backend parakeet\|cohere\|none` — selects the ASR roundtrip engine. Default is `parakeet` for English-only runs and skipped for CosyVoice3. - `--cohere-model-dir <path>` — path to a directory containing `cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`, and `vocab.json`. - `--asr-language <code>` — overrides the inferred language code (covers all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh, ko, vi). - `--cohere-compute-units all\|cpu-and-gpu\|cpu-only\|all-ane` — pins `MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu` when the q8 encoder fails ANE compilation (`MILCompilerForANE error: failed to compile ANE model using ANEF`) to skip the multi-minute fallback compile on the first call. The harness logs a WER caveat for zh/ja runs flagging that whitespace-tokenized WER is meaningless and the CER column is the real signal. 
Example end-to-end: ```bash fluidaudio tts-benchmark \ --backend cosyvoice3 \ --corpus minimax-chinese \ --asr-backend cohere \ --cohere-model-dir /path/to/cohere/q8 \ --asr-language zh \ --output-json benchmark_results/cv3-zh-cohere.json \ --audio-dir benchmark_results/cv3-zh-cohere/audio ``` On this M2 host the q8 encoder hits a CoreML ANE-cache failure (`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per `Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is unaffected (same graph, same output), only latency. The full 100-phrase CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was therefore produced via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100 phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5% CER range. ### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`) Replaces the original `prose-en` / `numbers-en` / `names-en` / `prose-zh` shipped with the first cut of this PR with the [MiniMax Multilingual TTS Test Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set) (CC-BY-SA-4.0; 100 phrases × 25 languages). Same public corpus used by [MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and Gradium — numbers in this PR are paper-comparable. The 24 per-language `.txt` files used to be vendored in `Benchmarks/tts/corpus/minimax/`. Removed in this PR in favor of an on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them from the upstream HF dataset at the pinned revision and writes them to the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth (HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no hardcoded asset URLs. 
The `.txt` files now live in `.gitignore` since they're CC-BY-SA-4.0 derivative content; only `Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in `ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior `python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend language scope: \| Backend \| Languages benchmarked \| \|---\|---\| \| Kokoro / Kokoro ANE \| en (af_heart) \| \| PocketTTS \| en + de + it + pt + es + fr \| \| Magpie \| en + es + de + fr + it + vi + zh + hi \| \| StyleTTS2 \| en (LibriTTS multi-spk) \| \| CosyVoice3 \| zh + yue \| ### PocketTTS streaming TTFT (`c26f1e163`) PocketTTS now drives the harness through its `synthesizeStreaming` API so TTFT measures time-to-first-80ms-frame instead of full one-shot synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage that one-shot benchmarking previously hid. ### Reference voice dumper helper (mobius-styletts2) `mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo) wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize` consumes via `--voice`. Required because the shipped CoreML bundle doesn't include those upstream-only PyTorch encoders. 
## Test plan - [x] `swift build -c release` clean - [x] `swift format lint` clean for new files - [x] `fluidaudio tts-benchmark --help` lists all 6 backends - [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x` produces byte-identical output to the deleted Python script - [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en - [x] StyleTTS2 — full 100/100 minimax-en (verified after `sliceFirstAxis2D` fix + post-pass remap) - [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue (verified after HiFT + LLM-Decode `outputBackings` fixes) - [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green - [x] No `@unchecked Sendable`; per-backend error enums use `Error, LocalizedError` - [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on `initialize()` - [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`; cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`, `TtsBenchmarkCommand.swift` updated - [x] CosyVoice3 6.5 s output cap investigated — confirmed structural (250-token Flow input shape, 40 ms / token); surfaced via `finishedOnEos` + warning log + JSON `finished_on_eos` field. See [Decode budget cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap) - [x] CosyVoice3 auto-chunker lands in this PR as a call-site workaround. Validated on full minimax-cantonese: truncation 80/100 → 5/100, longest output 6.5 s → 16.1 s, agg-RTFx 0.245× → 0.249×. 16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3 auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker) - [x] Magpie streaming TTFT wired through `synthesizeStream` in `TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50 9.6 s (first chunk) vs full-synth-p50 15.1 s — 36% earlier playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run) - [x] Cohere ASR harness wiring (`--asr-backend cohere` + `--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`). 
Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8 macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04% — both backends agree - [x] CosyVoice3 zh CER on full corpus measured via `whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over all 100 minimax-chinese WAVs: macro CER 1.68%, micro CER 1.84%. Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote ‡)	2026-05-01 09:09:42 -04:00
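Two pieces of the CosyVoice3 auto-chunker described in #557 lend themselves to a short sketch: the per-character token-rate estimate (CJK 7.5, ASCII 1.5, other 2.5) and the 8 ms equal-power cosine crossfade. The code below is an approximation under assumptions: the function names are hypothetical, the CJK test covers only the U+4E00 to U+9FFF block, and the real `CosyVoice3TextChunker` and `concatWithCrossfade` may differ.

```swift
import Foundation

/// Estimates speech tokens from the per-character rates in the PR's table.
/// The CJK range check (CJK Unified Ideographs only) is an assumption.
func estimatedSpeechTokensSketch(for text: String) -> Double {
    var total = 0.0
    for scalar in text.unicodeScalars {
        switch scalar.value {
        case 0x4E00...0x9FFF: total += 7.5  // CJK
        case 0x00...0x7F: total += 1.5  // ASCII
        default: total += 2.5  // accented Latin and other Unicode
        }
    }
    return total
}

/// Joins two chunks with an equal-power crossfade: the outgoing gain is
/// cos(t * pi/2) and the incoming gain is sin(t * pi/2), so the squared
/// gains always sum to 1.
func concatWithCrossfadeSketch(
    _ a: [Float],
    _ b: [Float],
    sampleRate: Int = 24_000,
    fadeMilliseconds: Int = 8
) -> [Float] {
    // 8 ms at 24 kHz is 192 samples; clamp to the shorter input.
    let fade = min(sampleRate * fadeMilliseconds / 1000, a.count, b.count)
    guard fade > 0 else { return a + b }

    var out = Array(a.dropLast(fade))
    for i in 0..<fade {
        let t = (Float(i) + 0.5) / Float(fade)
        let fadeOut = cos(t * Float.pi / 2)
        let fadeIn = sin(t * Float.pi / 2)
        out.append(a[a.count - fade + i] * fadeOut + b[i] * fadeIn)
    }
    out.append(contentsOf: b.dropFirst(fade))
    return out
}

print(estimatedSpeechTokensSketch(for: "你好"))
print(concatWithCrossfadeSketch([Float](repeating: 1, count: 240), [Float](repeating: 1, count: 240)).count)
```

With a 110-token budget and the 7.5 tokens-per-character CJK rate, this estimate would commit a split after roughly 14 to 15 CJK characters unless a hard ender arrives first.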
Alex	b5d8017d1f	feat(asr/parakeet-v3): default to int4-per-channel encoder (#560 ) ## Summary Switch the Parakeet TDT v3 default encoder from the 6-bit palettized `Encoder.mlmodelc` to a new int4-per-channel `EncoderInt4.mlmodelc`. v2 and TDTJa keep the legacy 6-bit encoder; v3 is the only path that changes. ## WER / size / speed (LibriSpeech test-clean, 100 files, M2) \| variant \| WER \| disk \| RTFx \| ANE residency \| \|---\|---\|---\|---\|---\| \| baseline (6-bit palettized, current default) \| 2.64% \| 426 MB \| 36.8x \| 99.4% \| \| int4-per-channel (new default) \| 5.24% \| 285 MB \| 49.2x \| 82.0% \| \| enc-prune+int8 \| 2.57% \| 568 MB \| 19.8x \| 82.0% \| \| enc-int4-linear-per-block-32 \| 3.95% \| 319 MB \| 15.6x \| 33.3% \| \| enc-prune+int4-block \| 3.95% \| 319 MB \| 15.9x \| 33.3% \| The chosen variant trades roughly 2× LibriSpeech WER (still in the same single-digit-percent regime) for 33% less disk and the fastest RTFx of any variant tested. Per-block quants drop off ANE entirely (33%) while per-channel stays compatible (82%). ## Implementation - `ModelNames.ASR` - Add `encoderInt4 = "EncoderInt4"` and `encoderInt4File = "EncoderInt4.mlmodelc"`. - Swap `encoderFile` for `encoderInt4File` in `requiredModelsV3`. `encoderFile` stays defined and is still used by v2 / TDTJa / 110m. - `AsrModels.swift` - Extend `getModelFileNames(version:)` return tuple from `(decoder, joint, vocabulary)` to `(encoder, decoder, joint, vocabulary)`. - Thread `fileNames.encoder` through `createModelSpecs`, the v3 `load` flow, the `download` spec list, and `isModelValid`. v3 returns `Names.encoderInt4File`; v2/tdtJa return their existing `Encoder.mlmodelc`; fused (110m) is unaffected. - Tests: add `testV3UsesInt4EncoderAsDefault` and `testV2KeepsLegacyEncoder` in `ModelNamesTests`. 
## Distribution The new `EncoderInt4.mlpackage` / `EncoderInt4.mlmodelc` will be uploaded to the existing `FluidInference/parakeet-tdt-0.6b-v3-coreml` HF repo alongside the current `Encoder.mlmodelc`. Older library versions that still ask for `Encoder.mlmodelc` continue to work unchanged. ## Test plan - [x] `swift build` clean - [x] `swift test --filter ModelNamesTests` — 20/20 (2 new) - [x] `swift test --filter AsrModelsTests` — 30/30 - [x] End-to-end transcription smoke test on LibriSpeech 61-70970-0001.flac via `EncoderInt4.mlmodelc`: correct text, RTFx 29.12x. ANE cold compile 21.3s (one-time). - [x] swift-format lint clean on the modified files (only pre-existing Sortformer warnings remain in `ModelNames.swift`). - [ ] CI: tests + asr-benchmark - [ ] Verify HF download path on a clean cache once `EncoderInt4.mlmodelc` is uploaded to the v3 repo. ## Companion The mobius PR adds the conversion scripts that produced these variants (`extra_encoder_variants.py`, `analyze_fallback.py`, `compute_unit_sweep.py`).	2026-04-30 23:00:43 -04:00
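The per-version encoder selection and the headline trade-off in #560 can be checked with a few lines. This is a sketch only: `ParakeetVersionSketch` and `encoderFileNameSketch` are hypothetical stand-ins for the real `getModelFileNames(version:)` plumbing, and the numbers are copied from the table in the PR body.

```swift
/// Hypothetical stand-in for the ASR model version enum.
enum ParakeetVersionSketch {
    case v2, v3, tdtJa
}

/// v3 switches to the int4-per-channel encoder; v2 and TDTJa keep the legacy file.
func encoderFileNameSketch(for version: ParakeetVersionSketch) -> String {
    switch version {
    case .v3: return "EncoderInt4.mlmodelc"
    case .v2, .tdtJa: return "Encoder.mlmodelc"
    }
}

// Trade-off arithmetic from the benchmark table (baseline vs. int4-per-channel).
let diskSaving = (426.0 - 285.0) / 426.0  // about 0.33, the "33% less disk"
let werRatio = 5.24 / 2.64  // about 1.98, the "roughly 2x" WER
let rtfxGain = 49.2 / 36.8  // about 1.34

print(encoderFileNameSketch(for: .v3), diskSaving, werRatio, rtfxGain)
```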
Benjamin Lee	35f6ba697f	Added Back the Old LS-EEND Constructors (#563 ) I accidentally deleted the old constructor in my last PR. --------- Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-30 17:24:18 -07:00
Benjamin Lee	4065a9917e	Optimized LS-EEND API (#526 )	2026-04-30 17:49:32 -04:00
Zhongpai Gao	c4d56a5cb5	Feat/pocket tts int8 precision swap (#558 ) ### Why is this change needed? Wires up the published `flowlm_stepv2.mlmodelc` int8 variant via a `PocketTtsPrecision { .fp16, .int8 }` parameter on `PocketTtsManager`, threaded through to `PocketTtsModelStore` and `PocketTtsResourceDownloader`. Closes the loop on the `flowlm_stepv2.mlmodelc` artifact that's been published under `v2/<lang>/` for a while but didn't have a Swift loader hook. Default stays `.fp16`, no behavior change for existing callers. ## What's in this PR Code (5 files, +171/-12): - `PocketTtsPrecision.swift` — new enum `{ .fp16, .int8 }`, with docstring documenting the kyutai-labs/pocket-tts#147 recipe and preserving the per-submodel A/B data from `experiment/pocket-tts-int8` (cond_step / flowlm_step / flow_decoder / mimi_decoder safety summary) - `ModelNames.swift` — `flowlmStepV2` constant + `flowlmStepFile(precision:)` and `requiredModels(precision:)` helpers - `PocketTtsResourceDownloader.swift` — `precision:` param, precision-aware cache check, and `removeUnusedFlowlmVariant()` post-download cleanup so callers' disk usage matches the loaded models - `PocketTtsModelStore.swift` — `precision:` init param plumbed to the precision-aware filename helper - `PocketTtsManager.swift` — `precision:` init param threaded to the store Docs (1 file, +47): - `Documentation/TTS/PocketTTS.md` — new "Model Files & Precision" section: per-submodel precision/size/HF-path table, fp16-vs-int8 totals, rationale for why only `flowlm_step` is quantized ## Why default is `.fp16` I asked about the on-disk weight format before committing the rename and verified by inspecting `model.mlmodel` for both flowlm variants: the int8 variant has explicit `cast_fp16_to_fp32` op scaffolding throughout, while the default has none — indicating uniform fp16 weights. 
Combined with the 304→77 MB size ratio (~4×, consistent with fp16→int8 plus quantization scale tensors) the default file's weights are fp16 on disk. The existing `PocketTtsModelStore.swift:65-67` comment about "CPU/GPU compute in float32 matches the Python reference" is correct about runtime compute precision (CoreML upcasts fp16 weights to fp32 on `.cpuAndGPU`); it just doesn't describe disk format and reads as accurate as-is. ## Why per-submodel quantization isn't exposed The `experiment/pocket-tts-int8` branch's `PocketTtsQuantization` struct (per-submodel `PocketTtsModelPrecision`) is a richer API, but the per-submodel int8 artifacts (`cond_step_int8.mlmodelc`, etc.) aren't published on HuggingFace today. Adding the API would let callers request configurations that 404 at download time. Only `flowlm_stepv2.mlmodelc` is published, and that's what this PR wires up. The `PocketTtsPrecision` enum can grow into the experiment branch's `PocketTtsQuantization` shape mechanically if/when the per-submodel artifacts ship. ## Disk footprint (English language pack) \| \| fp16 (default) \| int8 \| \|---\|---\|---\| \| Total active files on disk \| 766.3 MB \| 549.3 MB \| \| int8 savings vs fp16 \| — \| −217 MB (28%) \| The `v2/<lang>/` HF directory ships both flowlm variants, so first download briefly holds ~857 MB before the cleanup pass deletes the unused `.mlmodelc` and `.mlpackage`. 
## Backward compatibility - `PocketTtsManager()` / `PocketTtsModelStore()` / `ensureModels()` defaults all stay `.fp16`, which loads `flowlm_step.mlmodelc` exactly as before - Existing `requiredModels` constant retained alongside new `requiredModels(precision:)` so non-precision-aware callers keep compiling ## Verification done - All 11 published language packs have both `flowlm_step.mlmodelc` and `flowlm_stepv2.mlmodelc` under `v2/<lang>/` — verified via HF tree API - Branch is exactly +2 commits on top of `main` (`00ea906 fix: remove module_map from MachTaskSelfWrapper subspec`) - Diff content is identical to `Gaozhongpai/FluidAudio:main`, just squashed from 5 iterative commits into 2 clean ones (one feat, one docs) I haven't run `swift test` locally — Bash on Windows here, no Swift toolchain. Happy to fix anything CI flags.	2026-04-29 13:57:15 -04:00
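The precision plumbing described in the int8 commit above can be pictured with a small sketch. The enum and helper names are taken from the commit message, but their bodies here are assumptions for illustration, not the FluidAudio source. The last lines recompute the English-pack savings from the figures quoted in the message.

```swift
// Illustrative sketch only: names follow the commit message, bodies are assumed.
enum PocketTtsPrecision {
    case fp16
    case int8
}

// Maps a precision to the flowlm artifact name mentioned in the commit.
func flowlmStepFile(precision: PocketTtsPrecision) -> String {
    switch precision {
    case .fp16: return "flowlm_step.mlmodelc"
    case .int8: return "flowlm_stepv2.mlmodelc"
    }
}

// Disk-footprint figures quoted in the commit (English pack, in MB).
let fp16TotalMB = 766.3
let int8TotalMB = 549.3
let savedMB = fp16TotalMB - int8TotalMB
let savedPercent = (savedMB / fp16TotalMB * 100).rounded()

print(flowlmStepFile(precision: .int8))
print(savedMB, savedPercent)
```

Keeping the filename choice in one helper means the downloader, the cache check, and the model store can all agree on which variant is active.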
dianshu	00ea906c20	fix: remove module_map from MachTaskSelfWrapper subspec (#546 ) ## Summary - Remove `mach.module_map` from the `MachTaskSelfWrapper` subspec — CocoaPods does not allow `module_map` on subspecs - Guard `import MachTaskSelfWrapper` with `#if canImport(MachTaskSelfWrapper)`, matching the existing `FastClusterWrapper` pattern - In CocoaPods builds, the C headers are already exposed via the umbrella header, so the explicit module import is only needed under SwiftPM ## Verification - `pod lib lint FluidAudio.podspec --allow-warnings` — passed - `swift build` — passed Fixes #545 Co-authored-by: dianshu <dianshu@123.com>	2026-04-29 09:25:25 -04:00
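The import guard described in the commit above follows a standard Swift conditional-compilation pattern. A minimal sketch, assuming a file that needs the wrapper only when the build system exposes it as a separate module; the `hasMachWrapperModule` flag is a hypothetical name added purely so the effect is visible:

```swift
// Sketch of the guard: import the module only where it exists as a module.
#if canImport(MachTaskSelfWrapper)
import MachTaskSelfWrapper
let hasMachWrapperModule = true   // hypothetical flag, for illustration only
#else
let hasMachWrapperModule = false  // e.g. headers arrive via an umbrella header instead
#endif

print("explicit module import available:", hasMachWrapperModule)
```

Compiled on its own, outside the FluidAudio package, the `#else` branch is taken because no such module is on the search path.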
Alex	248b76b8b6	feat(tts/styletts2): scaffold StyleTTS2 4-stage pipeline integration (#554 ) ## Summary Adds the FluidAudio host surface for the StyleTTS2 LibriTTS multi-speaker checkpoint published at `FluidInference/StyleTTS-2-coreml`, end-to-end. Covers asset download, lazy bucketed model loading, text frontend (G2P + 178-token vocab), bundle config validation, the ADPM2/Karras sampler, hard-alignment, decoder driver, and a CLI driver. `fluidaudio styletts2 "Hello world." --voice ref_s.bin --output out.wav` produces an audible 24 kHz mono WAV. ## Pipeline Per utterance (~5 ADPM2 steps default): \| Stage \| Bucket axis \| Buckets \| Precision \| Compute \| \|---\|---\|---\|---\|---\| \| `text_predictor` \| input tokens \| 32, 64, 128, 256, 512 \| fp16 \| ANE \| \| `diffusion_step` \| bert_dur frames \| 512 only (5× per utt) \| fp16 \| CPU+GPU \| \| `f0n_energy` \| dynamic (en frames) \| enumerated 256/512/1024/2048/4096 \| fp16 \| CPU \| \| `decoder` \| mel frames \| 256, 512, 1024, 2048, 4096 \| fp32 \| CPU+GPU \| The decoder is fp32 because SineGen phase saturation in fp16 produces robotic audio. The HF repo ships precompiled `compiled/.mlmodelc` bundles (skipping the cold-start `anecompilerservice` hit) plus `.mlpackage` doubles for portability — only the `.mlmodelc` bundles are fetched. `f0n_energy` is pinned to CPU and always called at the largest enumerated shape (1, 640, 4096) with zero-padding — the E5RT runtime emits a stderr "tensor_buffer has known strides while the model has FlexibleShapeInfo" warning when it sees enumerated shapes on GPU/ANE, which is non-fatal but the CPU/largest-shape path sidesteps it cleanly. 
## What's in this PR Sources: - `StyleTTS2Constants` — audio/tokenizer/model dims + sampler defaults (Karras `rho=9` to match upstream) - `StyleTTS2Error` — module-local `LocalizedError` enum - `Assets/StyleTTS2ResourceDownloader` — `DownloadUtils.downloadRepo` wrapper - `Assets/StyleTTS2Vocab` — 178-token espeak-ng IPA vocab loader; iterates Unicode scalars (not graphemes) so combining marks like U+0329 syllabic / U+0361 tie-bar look up against their own vocab entries - `Assets/StyleTTS2BundleConfig` — `config.json` Codable + `validate()` against `StyleTTS2Constants` - `Assets/StyleTTS2VoiceStyle` — parser for precomputed `ref_s.bin` (256 fp32 LE) speaker-prosody blobs (dump script lives in `mobius-styletts2/scripts/06_dump_ref_s.py`) - `Pipeline/StyleTTS2ModelStore` — actor with lazy per-bucket `MLModel` cache + lazy vocab/config caches; `f0nEnergy()` pinned `.cpuOnly` - `Pipeline/StyleTTS2Phonemizer` — `TtsTextPreprocessor` → in-tree `G2PModel` (BART, misaki IPA) for English with a small misaki→espeak-ng remap (`A→eɪ`, `I→aɪ`, `O→oʊ`, `W→aʊ`, `Y→ɔɪ`, schwa-offglide → `ə`); other languages fall back to `MultilingualG2PModel` - `Pipeline/StyleTTS2Sampler` — ADPM2 / Karras-rho noise schedule + CFG-aware sampling closure; deterministic via SplitMix64 + Box-Muller - `Pipeline/StyleTTS2Synthesizer` — full 4-stage driver. 
Float16-aware `MLMultiArray` reads (`denoised`, `F0`, `N` all ship as fp16 per schema), cumsum-of-durations → one-hot → matmul hard-alignment, decoder fan-out - `StyleTTS2Manager` — public actor; `initialize()` validates bundle config; `tokenize()` exposes the text frontend; `synthesize(text:voiceStyleURL:steps:alpha:beta:randomSeed:)` returns 24 kHz mono WAV `Data` - `Sources/FluidAudioCLI/Commands/StyleTTS2Command` — `fluidaudio styletts2 "<text>" --voice <ref_s.bin> [--output --steps --alpha --beta --seed]` - `ModelNames.StyleTTS2` + `Repo.styleTts2` wired into the central registries - `TtsBackend.styleTts2` case Tests (37/37 pass, no network or CoreML deps): - `StyleTTS2VocabTests` — load happy path, combining-grapheme handling, missing/malformed JSON, encode known/unknown/empty - `StyleTTS2BundleConfigTests` — load + validate against every constant mismatch - `StyleTTS2VoiceStyleTests` — `ref_s.bin` parsing (size, fp32 round-trip, wrong-size rejection) - `StyleTTS2SamplerTests` — Karras schedule, RNG determinism ## Verification - `fluidaudio styletts2 "Hello world. The quick brown fox jumps over the lazy dog." 
--voice /tmp/styletts2-ref_s.bin --output /tmp/out.wav --seed 42` → 4.80s @ 24 kHz, RMS 7158, 0.0009% clipping - `fluidaudio transcribe /tmp/out.wav` → `Hello world quick brown fax nomps over lazy` (most words recovered; residual gaps are BART G2P emitting reduced `ð` for "the" with no schwa, and lacking length marks `ː` on stressed long vowels) ## Test plan - [x] `swift build -c release` clean - [x] `swift test --filter StyleTTS2` → 37/37 pass - [x] `swift format lint` clean on new files - [x] End-to-end CLI synth produces audible WAV - [x] ASR roundtrip recovers most content words ## Known follow-up - Tune misaki→espeak remap for length marks `ː` and reduced function-words (would push ASR WER lower) - Voice-bank packaging story (currently the user must precompute `ref_s.bin` via `mobius-styletts2/scripts/06_dump_ref_s.py`) - StyleTTS2 benchmark suite	2026-04-29 09:24:44 -04:00
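The "cumsum-of-durations → one-hot → matmul hard-alignment" step named in the StyleTTS2 commit above can be sketched as follows. This is a plain-array illustration under assumed shapes (features as `[dim][tokens]`, durations in frames per token); the function names are hypothetical and the real synthesizer works on `MLMultiArray` buffers.

```swift
// Builds a [tokens][frames] one-hot matrix: token t owns the frames between
// the running sum of earlier durations and that sum plus its own duration.
func hardAlignment(durations: [Int]) -> [[Float]] {
    let totalFrames = durations.reduce(0) { $0 + max(0, $1) }
    var matrix = [[Float]](
        repeating: [Float](repeating: 0, count: totalFrames),
        count: durations.count
    )
    var start = 0  // cumulative sum of durations seen so far
    for (token, duration) in durations.enumerated() {
        let length = max(0, duration)
        for frame in start..<(start + length) {
            matrix[token][frame] = 1
        }
        start += length
    }
    return matrix
}

// Multiplies [dim][tokens] features by the [tokens][frames] alignment,
// which repeats each token's feature column for as many frames as it lasts.
func expand(features: [[Float]], alignment: [[Float]]) -> [[Float]] {
    let frames = alignment.first?.count ?? 0
    return features.map { row in
        (0..<frames).map { frame in
            var sum: Float = 0
            for (token, value) in row.enumerated() {
                sum += value * alignment[token][frame]
            }
            return sum
        }
    }
}

let alignment = hardAlignment(durations: [2, 1, 3])
let aligned = expand(features: [[1, 2, 3]], alignment: alignment)
print(aligned)
```

Expressing the expansion as a matrix product keeps it a single dense operation, at the cost of materializing a mostly-zero matrix.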
Alex	3d9d422202	feat(tts/magpie): add NVIDIA Magpie TTS Multilingual 357M Swift port (#541 ) ## Summary Ports the NVIDIA Magpie TTS Multilingual 357M autoregressive TTS from Python (mobius [#24](https://github.com/FluidInference/mobius/pull/24)) to Swift. Closes FluidInference/FluidAudio#49. > ⚠️ Experimental — quite slow on Apple Silicon, needs significant perf work. First synth on a fresh process is dominated by CoreML model load + first-call ANE compile (~30 s). Warm synths run at ~96 s wall for an 8-word English sentence on M-series — RTFx ≈ 0.04 (~25× slower than realtime). Whether the throughput ceiling is a model characteristic, a CoreML conversion limitation, or both is still being investigated and is expected to improve in subsequent iterations. Do not use in latency-sensitive paths. For real-time use prefer Kokoro (~20× RTFx, parallel) or PocketTTS (~1.5–2× RTFx, streaming Mimi). Magpie's value prop is multilingual coverage + 5 built-in speaker contexts, not throughput. ## Status Functional. Audio quality is perceptually clean across all 5 speakers; first synth on a fresh process is dominated by CoreML model load + first-call ANE compile (~30 s), warm synths run at ~96 s wall for an 8-word English sentence on M-series (RTFx ≈ 0.04). Quality is ASR-clean on 4/5 speakers; speaker 0 has a single trailing-word artifact ("…and") attributable to fp16 sampler-trajectory drift, not a structural bug. Not yet covered: Japanese (deferred — needs OpenJTalk XCFramework + MeCab dict), CFG performance optimization, MLX-backed LocalTransformer. - Languages (8/9): English, Spanish, German, French, Italian, Vietnamese, Mandarin, Hindi. Japanese deferred pending OpenJTalk XCFramework integration. - 5 built-in speakers (`.john`, `.sofia`, `.aria`, `.jason`, `.leo`) with 110-token (768d fp16) context embeddings. - Inline IPA override (`"Hello \| ˈ n ɛ m o ʊ \| world"`) routes `\|…\|` segments directly to the tokenizer for pronunciation control — first-class feature. 
- Streaming: `synthesizeStream(...)` yields `MagpieAudioChunk` per chunk as soon as its NanoCodec decode finishes (first chunk is a small clause-sized head ≈ 50 frames / 2.3 s for low TTFA). Each non-final chunk includes punctuation-aware trailing silence for gapless playback. - ANE warmup at init: `MagpieTtsManager.initialize()` runs an unmeasured 16-step synthesis to force `MILCompilerForANE` to compile the decoder graphs once. Without this the first user-facing `synthesize()` can fall back to GPU/CPU and run multiple× slower. - Output: 22.05 kHz mono WAV via 8-codebook NanoCodec decoder, max 11.89 s per synthesis (256 nanocodec frames). ## HF assets — live [`FluidInference/magpie-tts-multilingual-357m-coreml`](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml) is uploaded and ready (1.4 GB). Ships: - `text_encoder.{mlmodelc,mlpackage}` — both compiled and portable - `decoder_step.{mlmodelc,mlpackage}` — rank-4 split-K/V cache, 97.3% ANE residency - `decoder_prefill.{mlmodelc,mlpackage}` — fast prefill path (110-token batched) - `nanocodec_decoder.{mlmodelc,mlpackage}` — 8-codebook → 22 kHz PCM (CPU-only by export) - `constants/` — `constants.json`, `speaker_info.json`, 8 audio-codebook embeddings, 5 speaker contexts, local-transformer weights - `tokenizer/` — per-language phoneme/jieba/pypinyin lookups (lazy-downloaded) - `manifest.json` — machine-readable index (sha256, file sizes, npy shapes, model IO specs) consumed by `MagpieResourceDownloader` ## Architecture \| Stage \| Implementation \| \|---\|---\| \| Text encoder \| `text_encoder.mlmodelc` (CoreML, cpuAndNeuralEngine) \| \| Prefill \| `decoder_prefill.mlmodelc` fast path (single batched call, 110 tokens), or fallback loop \| \| AR loop \| `decoder_step.mlmodelc` with rank-4 split-K/V cache (`cache_k{i}` / `cache_v{i}`, shape `[1, 512, 12, 64]` × 12 layers; logits `var_2129`); `outputBackings` + double-buffered KV cache to keep allocations off the hot path \| \| Local transformer 
\| Pure Swift, 1-layer (256d), Accelerate (`cblas_sgemm`) + BNNS (GELU); fp32 only (fp64 path removed); vDSP-fused embed; min-heap top-K \| \| Sampling \| top-k (80) + temperature (0.6), audio-EOS mask during `minFrames`, forbidden-token mask `[2016, 2018-2023]`; `torch.topk`-faithful tie semantics (counts above-threshold + earliest-index ties up to K) \| \| Vocoder \| `nanocodec_decoder.mlmodelc` pinned to `cpuOnly` (ANE rejects the graph) — 8×N codes → float PCM → peak-normalize \| CFG is off by default (`cfgScale = 1.0`); enabling it doubles per-step decoder cost. Assets fetched lazily via `DownloadUtils`; only the languages requested in `downloadAndCreate(languages:)` are materialized. ## Public API ```swift let manager = try await MagpieTtsManager.downloadAndCreate( languages: [.english, .spanish] ) // One-shot let result = try await manager.synthesize( text: "Hello \| ˈ n ɛ m o ʊ \| from FluidAudio.", speaker: .john, language: .english ) let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate) // Streaming (chunk-level, per-chunk NanoCodec decode) for try await chunk in try await manager.synthesizeStream(text: longText) { audioPlayer.append(chunk.samples) } ``` ## CLI ``` fluidaudiocli magpie download --languages en,es fluidaudiocli magpie text --text "Bonjour." --speaker 0 --language fr --output out.wav fluidaudiocli magpie text --text "Long passage..." --stream --output stream.wav fluidaudiocli magpie bench --runs 5 --warmup 1 # in-process median RTFx ``` (Parity tooling moved to mobius — see [FluidInference/mobius#44](https://github.com/FluidInference/mobius/pull/44) for the fixture emitter / Python ground-truth path.) ## Inline IPA — verified working The `\|…\|` passthrough is native NeMo `IpaG2p` behavior (not added by us): segments inside pipes are looked up directly in `token2id.json` as whitespace-separated phonemes, bypassing G2P. ``` input: "Hello \| n ɛ m o ʊ \| from FluidAudio." G2P: həˈloʊ nɛmoʊ frʌm fluɪdaːdɪoʊ. 
← injected IPA visible mid-stream ``` Validated end-to-end with the live HF assets (Python reference): 30 tokens → 43 frames → 2.00 s @ 3.97x RTF. ## Guardrails followed - No `@unchecked Sendable`; `MagpieTtsManager`, `MagpieModelStore`, `MagpieTokenizer`, `MagpieSynthesizer` are all `actor`s. - No dummy models / synthetic data. - `AppLogger(category: "Magpie")` throughout, no `print()` (including `MagpieCommand.printUsage`). - `MagpieError: Error, LocalizedError` for all error paths. ## Test plan - [x] `swift build` — clean on macOS 14 / Swift 6 (only pre-existing `cblas_sgemm` deprecation warnings from Accelerate); iOS build also clean (Swift 6 isolation-checker workaround landed). - [x] `swift test --filter "Magpie\|NpyReader"` — 17 / 17 pass: - `MagpieConstantsTests` (4) — forbidden-token mask, shape relations, NeMo tokenizer-name parity, per-language file coverage - `MagpieIpaOverrideTests` (7) — `\|…\|` segmentation edge cases - `MagpieKvCacheTests` (3) — cache shape, `addInputs` key count, static output keys - `NpyReaderTests` (3) — fp32 parse, fp16→fp32 upcast, bad-magic rejection - [x] HF assets uploaded; Python inference parity confirmed (4.60 s plain English, 2.00 s + 11.05 s with inline IPA). - [x] End-to-end Swift validation: `magpie download` → `magpie text` produces audible 22 kHz WAV; `magpie bench` reports stable RTFx medians on M-series. - [x] Audio quality validated: ASR-clean on 4/5 speakers; speaker 0 trailing-word artifact diagnosed as fp16 sampler-trajectory drift, not structural. - [x] Streaming validated: chunk-level decode yields correct gapless playback when concatenated; first chunk arrives in ~half the wall-time of the full synthesis. - [x] Devin review feedback addressed: `--text` flag handler, `torch.topk`-faithful tie semantics, `AppLogger.info()` in `printUsage()`, stale `MagpieComputePlanCommand` removed. 
## Companion PR Conversion pipeline + parity-fixture emitter + manifest generator: [FluidInference/mobius#44](https://github.com/FluidInference/mobius/pull/44). ## Out of scope (follow-ups — perf is the headline item) - Throughput investigation — current ~0.04 RTFx is the dominant gap. Suspect surfaces: rank-4 split-K/V scatter ANE residency vs. apparent GPU fallback, NanoCodec CPU-only export, LocalTransformer per-step Accelerate path. - MLX-backed LocalTransformer — drop-in replacement for the Accelerate/BNNS forward pass to put the per-step hot loop on the GPU. - CFG perf optimization — currently doubles per-step decoder cost. - Speaker 0 fp16 sampler drift — investigate whether higher-precision logits or a small temperature schedule eliminates the trailing-word artifact. - Japanese support (OpenJTalk + MeCab dict). - Streaming NanoCodec via MLState conv-cache (current export is fixed-window batch; chunked-overlap fallback yields <15 dB SNR — unviable without proper state caching). - CI workflow `magpie-benchmark.yml`.	2026-04-28 10:54:00 -04:00
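The tie rule the Magpie commit above describes for top-K selection ("counts above-threshold + earliest-index ties up to K") can be sketched in a few lines. The function below is a hypothetical stand-alone version using a full sort for clarity; the commit mentions a min-heap in the actual Swift port, which this sketch does not reproduce.

```swift
// Returns the indices kept by top-K: every value strictly above the K-th
// largest value, then ties at that value in index order until K are chosen.
func topKIndices(_ logits: [Float], k: Int) -> [Int] {
    guard k > 0, !logits.isEmpty else { return [] }
    let limit = min(k, logits.count)
    let threshold = logits.sorted(by: >)[limit - 1]  // K-th largest value

    var selected: [Int] = []
    for (index, value) in logits.enumerated() where value > threshold {
        selected.append(index)
    }
    for (index, value) in logits.enumerated() {
        if selected.count == limit { break }
        if value == threshold { selected.append(index) }
    }
    return selected.sorted()
}

// Two values tie at the threshold (2); only the earlier index is kept.
let kept = topKIndices([1, 3, 2, 3, 2], k: 3)
print(kept)
```

Making the tie rule explicit matters for parity testing: with a fixed seed, a different tie choice changes which token is sampled and the trajectories diverge from that step on.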
Alex	b82d4f2fc8	feat(tts): CosyVoice3 Mandarin zero-shot TTS port (#536 ) ## Summary Swift port of CosyVoice3 (Mandarin zero-shot TTS) wired through the four validated CoreML mlpackages hosted at [`FluidInference/CosyVoice3-0.5B-coreml`](https://huggingface.co/FluidInference/CosyVoice3-0.5B-coreml). Delivered in two layered phases matching the existing Kokoro manager shape: - Phase 1 (parity harness): full Swift pipeline that ingests a Python frontend fixture (`.safetensors`) and produces WAV within parity of the Python reference — validates all four CoreML bindings, 24-layer Qwen2 KV-cache slicing, RAS sampler, and Flow / HiFT wiring. - Phase 2 (native frontend): pure-Swift Qwen2 BPE tokenizer + Qwen2 text embeddings + minimal Mandarin text normalizer + 24 kHz log-mel DSP so callers can synthesize directly from `String` input without a Python dependency. Conversion pipeline that produced the mlpackages lives at [FluidInference/mobius#42](https://github.com/FluidInference/mobius/pull/42). Backend documentation: [`Documentation/TTS/CosyVoice3.md`](./Documentation/TTS/CosyVoice3.md). > ⚠️ Backend ships as beta / experimental. End-to-end synthesis is > currently slow on Apple Silicon — RTFx < 1.0 typical, several seconds > of latency for short Mandarin utterances. Cause is partly the Flow CFM > stage (fp32 / CPU-or-GPU only because fp16 + ANE produces NaNs through > the fused `layer_norm`) and partly HiFT sinegen / windowing ops that > fall back to CPU. Treat as preliminary; may be a model issue, may be > recoverable via better conversion. Warnings surfaced via doc comments, > runtime `logger.warning` in `initialize()`, and CLI help text. ## What's shipped ### Public API (`Sources/FluidAudio/TTS/CosyVoice3/`) ```swift public actor CosyVoice3TtsManager { public init(directory: URL? 
= nil, computeUnits: MLComputeUnits = .cpuAndNeuralEngine) public static func downloadAndCreate(from repo: Repo = .cosyvoice3, computeUnits: MLComputeUnits = .cpuAndNeuralEngine) async throws -> CosyVoice3TtsManager public func initialize() async throws public func synthesize(text: String, promptAssets: CosyVoice3PromptAssets, options: CosyVoice3SynthesisOptions = .init(), prenormalized: Bool = false) async throws -> CosyVoice3SynthesisResult } ``` `TtsBackend` gains `case cosyvoice3`; `ModelNames` gets the `CosyVoice3` enum plus `Repo.cosyvoice3` pointing at the HF repo. ### Pipeline components \| Layer \| File \| Notes \| \|---\|---\|---\| \| Model loader \| `Assets/CosyVoice3ModelStore.swift` \| Flat + nested layout probing, `.mlmodelc` compile cache \| \| Downloader \| `Assets/CosyVoice3ResourceDownloader.swift` \| `DownloadUtils` wrapper for the 4 mlpackages + embeddings \| \| Safetensors \| `Shared/SafetensorsReader.swift` \| ~170 LoC pure-Swift mmap + fp16/fp32/i32 accessors \| \| Prefill/decode \| `Pipeline/Synthesize/CosyVoice3Synthesizer.swift` \| Actor; in-place `[24,1,2,768,64]` fp16 KV-cache passthrough \| \| Sampler \| `Pipeline/Synthesize/CosyVoice3RasSampler.swift` \| top-p / top-k / repetition mask, seed-tokens bypass \| \| Speech embed \| `Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift` \| Lazy mmap of 6761×896 fp16 table (12 MB) \| \| Frontend \| `Pipeline/Preprocess/CosyVoice3TextFrontend.swift` \| Special-token splitting + lm_input assembly \| \| Tokenizer \| `Pipeline/Preprocess/Qwen2BpeTokenizer.swift` \| tiktoken-compatible byte-level BPE, 151 936 vocab \| \| Text embed \| `Pipeline/Preprocess/CosyVoice3TextEmbeddings.swift` \| 151 936×896 fp16 mmap → row copy \| \| TN \| `Pipeline/Preprocess/CosyVoice3ChineseNormalizer.swift` \| Minimal regex-free port of `frontend_utils.py` \| \| Prompt mel \| `Pipeline/Preprocess/CosyVoice3PromptMel.swift` \| 24 kHz log-mel matching `matcha audio.py` \| ### CLI (`Sources/FluidAudioCLI/Commands/`) 
``` fluidaudio tts --backend cosyvoice3-parity --fixture … --models-dir … --output … fluidaudio tts --backend cosyvoice3 --text "希望你以后能够做的比我还好用" \ --prompt-assets … --models-dir … --output … fluidaudio tts --backend cosyvoice3-tokenizer --fixture … # BPE parity fluidaudio tts --backend cosyvoice3-frontend --text … # lm_input dump ``` `--backend` help text marks `cosyvoice3` as `[BETA — slow, RTFx < 1.0]` and the dispatcher emits a runtime `logger.warning` so users see the status without reading docs. ### Tests - `CosyVoice3ChineseNormalizerTests` — 8 cases covering `contains_chinese`, `replace_blank`, corner marks, brackets, digit spellout, trailing comma collapse, end-to-end, `is_only_punctuation`. - `CosyVoice3PromptMelTests` — 8 cases covering the matcha frame-count formula, zero-audio log floor clamp, 200 Hz sine peak in low mel bins, exact reflect-pad semantics, periodic Hann endpoints, mel-basis shape / non-zero integrals, token-ratio trimming (and the throws-if-too-short path). ### Integration - `ModelNames.swift` — `CosyVoice3` enum + `Repo.cosyvoice3` - `TtsBackend.swift` — `case cosyvoice3` - `TTSCommand.swift` — subcommand wiring - `Documentation/TTS/CosyVoice3.md` — file roster, call flow, public API, CoreML caveats, indexed from `Documentation/README.md` ## Test plan - [x] `swift build` (release) - [x] Full `swift test` on this branch: 1 435 tests, 24 skipped, 0 failures (~13 min) - [x] `--filter CosyVoice3ChineseNormalizer` — 8/8 pass - [x] `--filter CosyVoice3PromptMel` — 8/8 pass - [x] Phase 1 end-to-end parity vs `build/wavs/e2e_shipping.wav` (max\|Δ\| < 1e-3, SNR > 40 dB, CPU-only fp32 Flow) - [x] Phase 2 end-to-end round-trip: Swift output → whisper.base → expected transcript ## Non-goals / follow-ups - SpeechTokenizer and CAMPPlus remain Python-side for prompt asset preparation; both have CoreML mlpackages but the required DSPs aren't yet ported. Users pass pre-computed `promptSpeechIds` / `spkEmbedding` in `CosyVoice3PromptAssets` for now. 
- Full `wetext.ZhNormalizer` (year / currency / decimals / units) is not ported. Callers that need production-grade TN run wetext server-side and pass `prenormalized: true`. - Flow stays fp32 (1.2 GB) until CoreMLTools pins `layer_norm` fused fp16. ## Updates — Devin review + main merge Picked up `origin/main` (resolved trivial enum-case merge in `ModelNames.swift` / `TtsBackend.swift` / `TTSCommand.swift`; both branches added new cases) and addressed the 12 Devin inline findings: - Sendable hygiene — dropped `@unchecked Sendable` from 9 types. `CosyVoice3Synthesizer` is now a proper `actor` (it crosses actor boundaries from the manager); `CosyVoice3Models` is plain `: Sendable` via `@preconcurrency import CoreML` (matches the existing `TtsModels` pattern; the initial drop-to-no-Sendable broke the benchmark CI build with `non-sendable result type CosyVoice3Models cannot be sent from actor-isolated context`, since it's returned by `store.models()`). The remaining types had Sendable conformance dropped entirely since they don't escape the owning actor. - Prefill stop-token bug — if the LLM emits an EOS token at step 0 the synthesizer now throws `predictionFailed` instead of falling through into the decode loop and accumulating semantically meaningless tokens. - HiFT mel slice OOB — added bounds check on `newMelStart` against the actual mel length and clamped `validFrames` to the available window; previously a `newMelStart > totalMelFrames` would `MLMultiArray` out of range during the chunk-packed call path. - Production logging — replaced `print()` stage timings with `AppLogger.info`; added `logger.warning` calls in `initialize()` and the CLI dispatcher for the beta-status banner. - Beta marker — doc comments on `CosyVoice3TtsManager` and `TtsBackend.cosyvoice3` flag the backend as experimental; CLI help text annotates the backend label. 
- Documentation — added `Documentation/TTS/CosyVoice3.md` mirroring the Kokoro / PocketTTS doc layout (files, call flow, public API, CLI, CoreML caveats, known limits) and indexed it from `Documentation/README.md`. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-28 09:57:13 -04:00
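The HiFT mel-slice fix listed in the CosyVoice3 commit above amounts to validating the start index and clamping the frame count before reading. A minimal sketch, assuming integer frame indices; the helper name and signature are hypothetical, and only `newMelStart`, `validFrames`, and `totalMelFrames` echo names from the commit message.

```swift
// Returns the readable frame range, or nil when the start lies outside the mel.
func clampedMelWindow(newMelStart: Int, requestedFrames: Int, totalMelFrames: Int) -> Range<Int>? {
    guard newMelStart >= 0, newMelStart < totalMelFrames, requestedFrames > 0 else {
        return nil  // nothing valid to read; the caller decides how to report it
    }
    // Never read past the end of the available mel frames.
    let validFrames = min(requestedFrames, totalMelFrames - newMelStart)
    return newMelStart..<(newMelStart + validFrames)
}

print(clampedMelWindow(newMelStart: 90, requestedFrames: 20, totalMelFrames: 100) as Any)
print(clampedMelWindow(newMelStart: 120, requestedFrames: 20, totalMelFrames: 100) as Any)
```

Returning an optional range keeps the bounds decision in one place instead of scattering index arithmetic through the chunk-packing code.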
Alex	eff1752ebf	feat(tts/pocket): multi-language support (EN + 9 new packs) (#549 ) ## Summary Adds first-class support for PocketTTS language packs upstream `kyutai/pocket-tts` just published, tracking issue #49. Users pick a language at manager construction; all packs (including English) are downloaded from `v2/<lang>/` on `FluidInference/pocket-tts-coreml`. This PR replaces #540 (rebased onto current `main` from a fresh branch). ### Supported languages \| ID \| Layers \| HF subtree \| \|-----------------\|--------\|----------------------\| \| `english` \| 6 \| `v2/english` \| \| `french_24l` \| 24 \| `v2/french_24l` \| \| `german` \| 6 \| `v2/german` \| \| `german_24l` \| 24 \| `v2/german_24l` \| \| `italian` \| 6 \| `v2/italian` \| \| `italian_24l` \| 24 \| `v2/italian_24l` \| \| `portuguese` \| 6 \| `v2/portuguese` \| \| `portuguese_24l`\| 24 \| `v2/portuguese_24l` \| \| `spanish` \| 6 \| `v2/spanish` \| \| `spanish_24l` \| 24 \| `v2/spanish_24l` \| French ships 24-layer only upstream; no 6-layer French pack exists. ### Per-language artifacts shipped on HF Each `v2/<lang>/` subtree contains 5 `.mlmodelc` directories + `constants_bin/`: \| Artifact \| Precision \| Notes \| \|---------------------------\|---------------------\|-------\| \| `cond_step.mlmodelc` \| fp16 \| conditioning prefill (voice/text → KV cache) \| \| `flow_decoder.mlmodelc` \| fp16 \| flow-matching audio decoder \| \| `flowlm_step.mlmodelc` \| fp16 \| per-token transformer step (default) \| \| `flowlm_stepv2.mlmodelc` \| selective int8 \| weight-only PTQ on attn + FFN body linears (per kyutai-labs/pocket-tts#147 recipe); EOS head + input embedding stay fp32. Optional smaller variant; not currently loaded by Swift but available for client-side swap-in. \| \| `mimi_decoder.mlmodelc` \| fp16 \| Mimi neural codec decoder \| `mimi_encoder.mlmodelc` (voice cloning, language-agnostic) is fetched lazily, separately from any language pack. 
The selective int8 in `flowlm_stepv2` quantizes 4 linears per transformer layer (`attn_in_proj`, `attn_out_proj`, FFN expand, FFN contract) via `coremltools.optimize.torch.quantization.PostTrainingQuantizer` (per-channel, symmetric, weight-only). Sizes: 6L 145 MB → 74 MB; 24L 1.1 GB → 291 MB. ## Changes - `PocketTtsLanguage`: new enum (10 cases) with `repoSubdirectory` (always `"v2/<rawValue>"`) and `transformerLayers` (6 or 24). - `ModelNames.PocketTTS`: single `mimiDecoderFile = "mimi_decoder.mlmodelc"` and single `requiredModels` set covering all language packs uniformly. - `PocketTtsLayerKeys`: discovers KV-cache I/O names at runtime so 6L and 24L packs share the same inference path. `discover(...)` requires `expectedLayers: Int` (6 or 24) for early sanity-check. - `PocketTtsMimiKeys`: discovers the Mimi decoder's audio output + per-state input→output pairing dynamically (pass-through inputs first, then shape-bucket pairing in canonical order). - Voice safetensors prebakes: every language pack ships `<voice>.safetensors` containing pre-computed LM transformer KV cache snapshots (per-layer `[2, 1, seqLen, 16, 64]` F32 + I64 offset). `PocketTtsConstantsLoader.loadVoiceSnapshot` parses the safetensors header (8-byte LE u64 + JSON) and extracts per-layer cache + offset tensors. `PocketTtsSynthesizer.kvCacheStateFromSnapshot` copies K/V blocks into the runtime `[2, 1, kvCacheMaxLen, 16, 64]` state independently. Skips the per-token `cond_step` voice prefill. - `PocketTtsResourceDownloader`: `ensureModels(language:)` always fetches the requested `v2/<lang>/` subtree via `DownloadUtils.downloadSubdirectory`. `ensureVoice` downloads `<voice>.safetensors`. `ensureMimiEncoder()` lazily fetches the language-agnostic encoder for voice cloning without pulling a full language pack. - `PocketTtsModelStore` / `PocketTtsManager` / `PocketTtsSession` / `PocketTtsSynthesizer`: language threaded through load + constants + KV-cache sizing. 
Voice data is cached per `(language, voice)`. Mimi keys discovered + cached per language. - Voice cloning across languages: Mimi encoder is shared; cloned `PocketTtsVoiceData` from one language's manager can be fed to another. - CLI: `fluidaudiocli tts --backend pocket --language <id>` (default `english`). Unknown values log the supported list and fall back to English. - Docs: `Documentation/TTS/PocketTTS.md` gains a Languages section + cross-language cloning example. ## Tests - `PocketTtsLanguageTests` — pure-logic cases covering `repoSubdirectory`, `transformerLayers`, and `requiredModels`. No model download / no network. - Full PocketTTS test suite: 16/16 passing (`swift test --filter PocketTts`). ## Test plan - [x] `swift build` — clean Release build (rebased onto current `main`) - [x] `swift format lint --recursive --configuration .swift-format` — clean - [x] `swift test --filter PocketTts` — 16/16 pass - [x] Manual end-to-end via FluidAudio Swift CLI for all 10 language packs (fresh HF download → fp16 baseline → swap `flowlm_stepv2.mlmodelc` → re-synthesize → Parakeet TDT v3 ASR check on both outputs): \| Language \| fp16 ASR \| flowlm_stepv2 (int8) ASR \| \|-----------------\|---\|---\| \| english \| ✓ \| ✓ \| \| spanish \| ✓ \| ✓ \| \| spanish_24l \| ✓ \| ✓ \| \| french_24l \| ✓ \| ✓ \| \| german \| ✓ \| ✓ \| \| german_24l \| ✓ \| ✓ \| \| italian \| ✓ \| ✓ \| \| italian_24l \| ✓ \| ✓ \| \| portuguese \| ✓ \| ✓ \| \| portuguese_24l \| ✓ \| ✓ \| Selective int8 vs fp16 for `flowlm_step`: 6L 145 MB → 74 MB; 24L 1.1 GB → 291 MB. ## Non-goals - Runtime language switching on a live `PocketTtsManager` (create a new manager instead). - Auto-inferring language from text. - French 6-layer (upstream did not ship it). - Auto-loading `flowlm_stepv2` (Swift continues to load `flowlm_step.mlmodelc`/fp16 by default; the int8 variant ships in the pack so clients can opt in via cache swap, and a future PR can add a `precision: .fp16 \| .int8` selector). 
Closes #49	2026-04-27 22:21:43 -04:00
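The safetensors header layout mentioned in the multi-language commit above (an 8-byte little-endian u64 length followed by a JSON header) can be read with a short Foundation sketch. The function and error names are hypothetical, and this is not the `PocketTtsConstantsLoader` code; it only shows the header step, not tensor extraction.

```swift
import Foundation

enum SafetensorsHeaderError: Error {
    case truncated
    case invalidJSON
}

// Parses the JSON header: bytes 0..<8 hold its length as a little-endian UInt64.
func parseSafetensorsHeader(_ data: Data) throws -> [String: Any] {
    guard data.count >= 8 else { throw SafetensorsHeaderError.truncated }
    var length: UInt64 = 0
    for (offset, byte) in data.prefix(8).enumerated() {
        length |= UInt64(byte) << (8 * UInt64(offset))
    }
    guard length <= UInt64(data.count - 8) else { throw SafetensorsHeaderError.truncated }

    let start = data.startIndex + 8
    let jsonBytes = Data(data[start..<(start + Int(length))])
    guard let header = try JSONSerialization.jsonObject(with: jsonBytes) as? [String: Any] else {
        throw SafetensorsHeaderError.invalidJSON
    }
    return header
}

// Build a tiny in-memory file: length prefix, JSON header, then 4 payload bytes.
let headerJSON = #"{"cache0":{"dtype":"F32","shape":[1],"data_offsets":[0,4]}}"#.data(using: .utf8)!
var blob = Data()
var prefixLE = UInt64(headerJSON.count).littleEndian
withUnsafeBytes(of: &prefixLE) { blob.append(contentsOf: $0) }
blob.append(headerJSON)
blob.append(contentsOf: [0, 0, 0, 0])

let parsed = try! parseSafetensorsHeader(blob)
print(parsed.keys.sorted())
```

Each header entry carries `dtype`, `shape`, and `data_offsets`, which is enough to locate a tensor in the byte region that follows the header.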
Alexandre Mendonça Alvaro	982f117eb4	fix: avoid misleading confidence warning in SlidingWindowAsrManager.finish() (#548 ) ### Why is this change needed? `SlidingWindowAsrManager.finish()` reconstructs final text by calling `processTranscriptionResult(...)` with empty `timestamps` and `confidences`. That path only needs token-to-text reconstruction, but it also runs confidence calculation, which logs: `Expected token confidences but got none - this should not happen` In practice this shows up during normal finalization even though nothing is actually wrong. ### What changed? Use `convertTokensToText(accumulatedTokens)` directly in `finish()` when only the merged final text is needed. This keeps behavior the same for the returned transcription while avoiding a misleading warning during normal shutdown. ### Validation - `swift test --filter SlidingWindowAsrManagerTests` - Reproduced locally from an app integration path before the patch; warning no longer appears after the change.	2026-04-27 21:50:42 -04:00
Alex	7c115f6b4e	feat(tts/kokoro-ane): add laishere 7-stage CoreML chain (ANE-optimized) (#547 ) ## Summary Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the [laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml) 7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise → Vocoder → Tail) behind an actor-based facade, used with the upstream author's permission. Per-stage `MLComputeUnits` assignment routes Albert/PostAlbert/Alignment/Vocoder to ANE; Prosody/Noise/Tail stay on CPU+GPU for fp32/iSTFT-heavy ops. The companion mobius PR for the conversion side: https://github.com/FluidInference/mobius/pull/45 Existing `KokoroTtsManager` (single fp32 model) is untouched. Both backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo — KokoroAne lives under the `ANE/` subdirectory. ## What's added Module: `Sources/FluidAudio/TTS/KokoroAne/` - `KokoroAneManager` — actor facade: `initialize`, `synthesize(text\|phonemes)`, `synthesizeDetailed` - `KokoroAneSynthesizer` — 7-stage orchestration with fp16↔fp32 vImage boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32` helpers so each output is fetched once. - `KokoroAneModelStore` — per-stage MLModel handles + vocab + voice pack cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so partial-load failures stay retryable. 
- `KokoroAneVoicePack` — `[510, 256]` flat fp32 row indexing (timbre cols `[0:128]`, style_s cols `[128:256]`) - `KokoroAneVocab` — IPA → token IDs with BOS/EOS wrap, max 512 - `KokoroAneResourceDownloader` — HF cache management via existing `DownloadUtils`; also downloads the shared kokoro G2P assets on first init (see fix below) - G2P reuses existing `G2PModel.shared` CLI: ```bash fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json] fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json ``` The `tts-asr-verify` batch command synthesizes each phrase, transcribes with Parakeet, and emits per-phrase + macro/micro WER with stage timings. Tests (`Tests/FluidAudioTests/TTS/KokoroAne/`): - 13 unit tests (vocab, voice pack) — no model deps, run on CI - 5 E2E tests (synth + ASR roundtrip) — gated by `FLUIDAUDIO_RUN_KOKOROANE_E2E=1` Docs: - New `Documentation/TTS/KokoroAne.md` — when-to-pick decision table, CLI/Swift quick start, per-stage compute targets, voice pack layout, limits, perf numbers, source links. - Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the ANE-resident variant. - Updated `Documentation/README.md` index, `Documentation/Models.md` TTS table, `Documentation/API.md` reference, `Documentation/CLI.md` example. ## Verified end-to-end on M2 Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation). Warm load: ~300ms. \| Phrase \| Synth \| Audio \| RTFx \| ASR roundtrip \| \|---\|---\|---\|---\|---\| \| Hello world \| 0.47s \| 1.65s \| 3.5× \| "Hello world." (WER 0%) \| \| The quick brown fox… \| 0.32s \| 3.18s \| 9.9× \| dropped "The" (WER 11%) \| \| She had been waiting… \| 0.25s \| 2.80s \| 11.4× \| "Shay" misheard (WER 12.5%) \| Aggregate macro WER 7.9%, micro WER 10.5% — error is ASR-side; TTS audio is intelligible. Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert ~7-10ms each). 
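The voice pack layout described for `KokoroAneVoicePack` (`[510, 256]` flat fp32, timbre in columns `[0:128]`, `style_s` in columns `[128:256]`, row chosen as `phonemeCount - 1`) can be sketched as below. This is a minimal illustration under those stated assumptions; `VoicePackSketch` and `slices(phonemeCount:)` are hypothetical names, not the module's API.

```swift
/// Hypothetical stand-in for a `[510, 256]` voice pack stored as flat fp32.
struct VoicePackSketch {
    static let rows = 510
    static let cols = 256
    let values: [Float]

    /// Fails when the buffer does not hold exactly rows * cols floats.
    init?(values: [Float]) {
        guard values.count == Self.rows * Self.cols else { return nil }
        self.values = values
    }

    /// Returns the timbre and style_s halves of the row for a phoneme count.
    /// The row index is `phonemeCount - 1`, as quoted from the upstream script.
    func slices(phonemeCount: Int) -> (timbre: [Float], styleS: [Float])? {
        let row = phonemeCount - 1
        guard (0..<Self.rows).contains(row) else { return nil }
        let base = row * Self.cols
        let mid = base + Self.cols / 2
        return (Array(values[base..<mid]), Array(values[mid..<(base + Self.cols)]))
    }
}
```

Returning `nil` for out-of-range phoneme counts is a choice made for this sketch; the real module may clamp or throw instead.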
## Devin Review fixes addressed in this PR - 🔴 Partial model load wedged the store (`KokoroAneModelStore.loadIfNeeded`) — fixed via local `pendingModels` accumulator + atomic commit, matching `PocketTtsModelStore`. - 🐛 G2P models not downloaded standalone — `G2PModel.loadIfNeeded` only reads from `~/.cache/fluidaudio/Models/kokoro/` and never downloads. The kokoroAne download set didn't include G2P, so first-time `--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic `vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to `getRequiredModelNames(.kokoro, …)` and a new `KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs before `G2PModel.shared.ensureModelsAvailable()` in `KokoroAneManager.initialize()`. - 🟡 Voice pack off-by-one (false positive) — verified upstream `convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly matching the existing Swift `phonemeCount - 1`. No change. ## Refactor pass Internal cleanup applied across the module after the initial implementation landed: - `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11 inline `outputShape + outputArray + float16Array` patterns; F0/N shapes cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in `outputArray` error reporting. - `KokoroAneSynthesizer+Conversion`: extracted `convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers (eliminates 4× duplicated vImage buffer setup). - `KokoroAneModelStore`: folded `voicePack(_)` + `loadVoicePackIfNeeded(_)` into one method; dropped unreachable post-load guard and dead synthesized-URL throw. - `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL, String)` so a malformed top-level JSON object reports parse-failure instead of file-not-found; removed dead NSNumber bridging fallback. - `KokoroAneConstants`: dropped unused `defaultLanguage`, `voicePackTimbreSlice`, `voicePackStyleSSlice`. 
Changed `defaultSpeed` from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg sites). - `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)` — `KokoroAneVocab.encode` silently drops unknown chars per the upstream Python convention. ## Test plan - [x] `swift build` clean - [x] `swift test --filter KokoroAne` — 13 unit tests pass, 5 E2E gated - [x] With models staged at `~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`: - [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne` — all 18 pass - [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane --output /tmp/ane.wav --metrics /tmp/m.json` — produces non-silent audio + metrics with WER - [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json /tmp/r.json` — aggregate WER ≤ 0.20 ## Models `FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/` subdirectory: ``` ANE/KokoroAlbert.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroPostAlbert.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroAlignment.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroProsody.mlmodelc fp32 (CPU+GPU) ANE/KokoroNoise.mlmodelc fp32 (CPU+GPU) ANE/KokoroVocoder.mlmodelc fp16 + int8pal (CPU+ANE) ANE/KokoroTail.mlmodelc fp32 + iSTFT (CPU+GPU) ANE/vocab.json 114 IPA tokens ANE/af_heart.bin [510, 256] fp32 voice pack ``` G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`, `g2p_vocab.json`) are pulled from the same repo's root and cached at `~/.cache/fluidaudio/Models/kokoro/`, shared with the regular `KokoroTtsManager` backend. ## License Upstream (laishere) is MIT — carried forward in the mobius PR's LICENSE file. Used with the upstream author's permission.	2026-04-27 20:08:49 -04:00
Alex	d302273d49	fix(diarizer): convert SpeakerManager to actor, Speaker to struct (#528 ) (#539 ) ## Summary Fixes [#528](https://github.com/FluidInference/FluidAudio/issues/528): heap corruption (`BUG IN CLIENT OF LIBMALLOC: memory corruption of free block`) and `Potential Structural Swift Concurrency Issue: unsafeForcedSync called from Swift Concurrent context` warnings in the diarizer on iOS 26.4 when `DiarizerModels.download()` + `SpeakerManager.extractSpeakerEmbedding` are called from an async context under Swift 6 strict concurrency. Root cause - `SpeakerManager` used `DispatchQueue.sync(flags: .barrier)` → `unsafeForcedSync` warning when called from a Swift concurrent context. - `Speaker` was a reference type with mutable `[Float]` embeddings → concurrent COW mutations on the embedding buffers corrupted the heap. Fix — apply the same actor-conversion pattern used for `AsrManager` in #419: - `Speaker`: `final class` → `struct` (Sendable value type) - `SpeakerManager`: class + `DispatchQueue` → `actor` - `SpeakerOperations` extension: dropped `queue.sync` - `DiarizerManager`: async-ified methods - `SpeakerManager.upsertSpeaker(_:)` + `upsertSpeaker(id:...)`: thread the speaker's `name` through persistence (previously implicit via class-reference mutation; now required with struct value semantics). - CLI (`ProcessCommand`, `DiarizationBenchmark`) and all speaker/diarizer tests updated to `await` the actor-isolated API. - `testConcurrentAccess` rewritten from `DispatchQueue.async`/`DispatchGroup` to `withTaskGroup` for structured concurrency. 
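The conversion pattern listed above (value-type `Speaker`, actor-isolated manager, `withTaskGroup` in the concurrency test) can be sketched roughly as follows. `SpeakerRegistry` and `upsertConcurrently` are hypothetical stand-ins used only for illustration; the real `SpeakerManager` API is richer and may differ.

```swift
/// Value-type speaker: copies are independent, so no shared mutable buffers.
struct Speaker: Sendable {
    let id: String
    var name: String
    var embedding: [Float]
}

/// Hypothetical stand-in for an actor-isolated speaker store.
actor SpeakerRegistry {
    private var speakers: [String: Speaker] = [:]

    /// With struct semantics the name must be carried explicitly in the value.
    func upsert(_ speaker: Speaker) {
        speakers[speaker.id] = speaker
    }

    func speaker(id: String) -> Speaker? { speakers[id] }

    var count: Int { speakers.count }
}

/// Structured-concurrency stress helper in the spirit of a rewritten
/// concurrent-access test: many child tasks, one actor.
func upsertConcurrently(into registry: SpeakerRegistry, total: Int) async {
    await withTaskGroup(of: Void.self) { group in
        for i in 0..<total {
            group.addTask {
                let speaker = Speaker(id: "s\(i)", name: "Speaker \(i)", embedding: [Float(i)])
                await registry.upsert(speaker)
            }
        }
    }
}
```

Because every mutation goes through the actor, callers never need a `DispatchQueue.sync` barrier, which is the call that triggered the `unsafeForcedSync` warning.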
## Test plan - [x] `swift build`: clean on macOS - [x] `swift test`: 1435 tests, 0 failures (24 skipped) - [x] swift-format: no new warnings in touched files (pre-existing warnings only, unrelated to this change) - [ ] CI: build + tests + swift-format checks - [ ] Verify on reporter's iOS 26.4 repro from #528	2026-04-23 22:13:47 -04:00
Alex	2ea0727541	ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512 ) (#515 ) Fixes #512. ## TL;DR Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google kropka com" as Cyrillic (`Впиш Гугл к ком.`) because the joint decoder's top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This PR adds an opt-in script filter: when a caller passes `language: .polish` (or any other language with a declared script), the decoder rejects top-1 if it's the wrong script and walks top-K to the highest-probability candidate matching the expected script. - Opt-in: `language:` defaults to `nil` — zero behavior change for existing callers. - No acoustic-model changes — this is purely a decoder-side post-processing step over the joint logits. - Requires `JointDecisionv3.mlmodelc` (exposes top-K outputs). Auto-downloaded from HuggingFace alongside the other v3 files; falls back to standard argmax when absent. ## Empirical validation — reporter's own audio Samples pulled via `gdown --folder <link-from-issue-#512-comment>` from @tajchert's Drive folder. `JointDecisionv3.mlmodelc` is loaded in both columns — this isolates the Swift filter as the mechanism, not a model swap. \| sample \| ground truth \| `language: nil` (current) \| `language: .polish` (this PR) \| \|---\|---\|---\|---\| \| pl \| Wpisz Google kropka com \| Впиш Гугл к ком. \| Wpis Google.com. \| \| pl2 \| Wpisz Google kropka com \| Впиш Гугл крокаком. \| Wpish Google, Com. \| \| pl3 \| Wpisz Google kropka com \| Впишь куглькрабком. \| VP Kugl.com. \| \| pl4 \| Wpisz Google kropka com \| Впиш гугл к ком. \| Wpish gugl c. \| \| pl5 \| Wpisz Google kropka com \| Впиш гугл кракаком. \| Wpish Google Croca kom. \| \| pl6 \| Wpisz Google kropka com \| Впиш, гугл крокаком. \| Wpish, Google, Com. \| \| pl_complex \| Cały spichlarz jest ze spiżu \| Cały spichlarz jest ze spiżu. \| Cały spichlarz jest ze spiżu. \| 6/6 short samples flip Cyrillic → Latin. 
`pl_complex` was never broken (long context → high joint confidence → no drift) and is unchanged. ## Scope & limitations (important — please don't overclaim) *This PR fixes the script* the tokens are drawn from. It does NOT fix per-word acoustic accuracy. \| \| `language: nil` \| `language: .polish` \| \|---\|---\|---\| \| Script correct (Latin, not Cyrillic) \| ✗ \| ✓ (6/6) \| \| Word spelling matches ground truth \| ✗ \| ✗ (still 6/7 wrong on short) \| The residual errors — `Wpisz` → `Wpish`/`Wpis`, `kropka` → `Croca` / dropped — are Parakeet TDT v3 acoustic weaknesses on short Polish commands. No amount of output post-processing can turn `Wpish` into `Wpisz`; that needs better acoustic modeling, a Polish LM rescorer, or more training data. Out of scope here. What users actually get by merging: - Output is visually Polish (Latin script), not pseudo-Russian — works with locale-aware post-processing, spell-check, and UI rendering - Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin substitution - Opt-in; zero risk for callers who don't pass `language:` What users do not get: - Higher word accuracy on short Polish/Slavic Latin utterances - Support for languages outside the `Language` enum (Greek, Maltese, Hungarian, Turkish, Baltic — their characters fit the Latin Unicode ranges but aren't exposed; easy follow-up) - A meaningful FLEURS WER delta — see [Documentation/fleurs-script-filtering-comparison.md](./Documentation/fleurs-script-filtering-comparison.md); full sentences aren't in the failure regime ## Implementation ### New - `Sources/FluidAudio/Shared/ScriptDetection.swift` (new, +112) - `public enum Language` — 13 Latin (en, es, fr, de, it, pt, ro, pl, cs, sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr) - `public enum Script { case latin, cyrillic }` - `matches(_:script:)` over Unicode ranges: ASCII (0x20–0x7F), Latin-1 (0xA0–0xFF), Latin Extended-A (0x100–0x17F), Latin Extended-B (0x180–0x24F — Romanian ș/ț), Latin Extended Additional 
(0x1E00–0x1EFF — Vietnamese), Cyrillic (0x400–0x4FF). Strips SentencePiece boundary marker U+2581 before checking. - `filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) -> (tokenId, probability)?` — returns the highest-probability top-K candidate matching the target script; probability via softmax over the top-K subset with the max-logit stability trick; guarded against top-K array length mismatch. ### Changed - `TdtJointDecision` — optional `topKIds` / `topKLogits` fields (populated by JointDecisionv3 only) - `TdtDecoderV3` — script filter runs only when top-1 is already wrong script; both decode sites feed `filtered.probability` (a real [0,1]) into `TdtDurationMapping.clampProbability`, not raw logits - `AsrManager.transcribe(...)` — `language: Language? = nil` plumbed through all three overloads: `[Float]`, `URL`, `AVAudioPCMBuffer` - `AsrModels` + `ModelNames` — `requiredModelsV3` set includes `JointDecisionv3.mlmodelc` so the download utility fetches it on fresh installs and also backfills it for existing users on next `.v3` load - CLI — `fluidaudiocli transcribe <file> --language {en\|pl\|cs\|sk\|sl\|hr\|bs\|ro\|es\|fr\|de\|it\|pt\|ru\|uk\|be\|bg\|sr}` ### How to try it ```bash swift run -c release fluidaudiocli transcribe sample.wav --language pl ``` ## Model dependency `JointDecisionv3.mlmodelc` must be present in `FluidInference/parakeet-tdt-0.6b-v3-coreml` on HuggingFace. It exposes `top_k_ids` / `top_k_logits` outputs (K=64 in our export) alongside the standard argmax. When absent, `AsrModels` falls back to `JointDecision.mlmodelc` and the script filter becomes a no-op — backward compatible. Cache-upgrade verified: removed `JointDecisionv3.mlmodelc` from a populated cache, re-ran `--language pl`; the file was auto-fetched and Polish output was Latin. Existing users pick up the fix on next `.v3` load without manual intervention. 
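A rough sketch of the filter described above: Unicode-range script matching with the SentencePiece marker stripped, a length-mismatch guard, and a softmax over the top-K subset using the max-logit trick. The signatures mirror the ones quoted in this PR, but the bodies are illustrative assumptions, not the shipped code. In particular, treating a token that is empty after stripping as a non-match, and allowing only the Cyrillic block for `.cyrillic`, are guesses.

```swift
import Foundation

enum Script { case latin, cyrillic }

/// True when every scalar (ignoring U+2581) falls in the script's ranges.
func matches(_ token: String, script: Script) -> Bool {
    let ranges: [ClosedRange<UInt32>]
    switch script {
    case .latin:
        ranges = [0x20...0x7F, 0xA0...0xFF, 0x100...0x17F, 0x180...0x24F, 0x1E00...0x1EFF]
    case .cyrillic:
        ranges = [0x400...0x4FF]
    }
    let scalars = token.unicodeScalars.filter { $0.value != 0x2581 }
    // Assumption: a token that is empty after stripping matches nothing.
    guard !scalars.isEmpty else { return false }
    return scalars.allSatisfy { scalar in ranges.contains { $0.contains(scalar.value) } }
}

/// Picks the best-scoring top-K candidate in the preferred script.
func filterTopK(
    topKIds: [Int],
    topKLogits: [Float],
    vocabulary: [Int: String],
    preferredScript: Script
) -> (tokenId: Int, probability: Float)? {
    // Guard against diverging array lengths by iterating the common prefix.
    let count = min(topKIds.count, topKLogits.count)
    guard count > 0 else { return nil }
    let logits = topKLogits.prefix(count)
    // Subtracting the max logit keeps exp() numerically stable.
    let maxLogit = logits.max()!
    let denominator = logits.reduce(Float(0)) { $0 + exp($1 - maxLogit) }

    var best: (index: Int, logit: Float)?
    for i in 0..<count {
        guard let text = vocabulary[topKIds[i]], matches(text, script: preferredScript) else {
            continue
        }
        if best == nil || topKLogits[i] > best!.logit {
            best = (i, topKLogits[i])
        }
    }
    guard let pick = best else { return nil }
    return (topKIds[pick.index], exp(pick.logit - maxLogit) / denominator)
}
```

The returned probability is normalized over the K candidates only, so it is an upper bound on the full-vocabulary softmax value for that token.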
## Review notes / risky bits - Softmax over top-K subset, not the full vocab — probabilities won't exactly match a true full-softmax, but K=64 captures ~all the mass when the model is anywhere near confident. If you prefer, we can expose the raw top-K logits to callers and let them compute confidence however they want. - Top-1 escape hatch: filter is only triggered when top-1 fails `matches(_, script:)`. When top-1 is already correct, nothing is changed — so we can't regress the common case. - Length-mismatch guard in `filterTopK` uses `min(topKIds.count, topKLogits.count)`. If CoreML output arrays ever diverge, we iterate the common prefix instead of crashing. - Latin Extended-B (0x0180–0x024F) was added specifically so Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional (0x1E00–0x1EFF) was added for free — helps Vietnamese should anyone want it later. ## Tests - `ScriptDetectionTests` — 37 tests**: Unicode range coverage (Latin-1 / Extended-A / Extended-B / Extended Additional / Cyrillic), SentencePiece boundary-marker stripping, `filterTopK` happy path, length-mismatch guard, probability-range invariant, Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script rejection - Build clean; `swift format lint` clean on all touched files - A/B end-to-end run against reporter's actual Polish audio (table above) ## Checklist - [x] Builds clean (`swift build`, `swift build -c release`) - [x] `swift format lint` clean on touched files - [x] `ScriptDetectionTests` 37/37 pass - [x] A/B reproduction on #512 reporter's audio - [x] Cache-upgrade path verified (JointDecisionv3 auto-fetched on existing caches) - [x] CLI accepts all 18 language codes end-to-end - [ ] CI green ## Follow-ups (not blocking) - Expose more Latin languages in the enum (Hungarian, Turkish, Baltic, Maltese) — all character ranges already supported, just need enum cases - Add `Script.greek` for `el_gr` (separate Unicode range) - Short-utterance benchmark dataset (FLEURS is the 
wrong tool: it's all long sentences where drift doesn't happen) - Optional: publish a Polish LM rescorer to address the underlying acoustic-accuracy issue the script filter cannot fix	2026-04-23 17:43:09 -04:00
Alex	cc4e712643	feat(asr/cohere): ANE-friendly static-shape decoder (v2) (#537 ) ## Summary Adds support for a new Cohere decoder variant — `cohere_decoder_cache_external_v2` — with fully static shapes so CoreML can dispatch the decoder to the Apple Neural Engine. - `ModelNames.CohereTranscribe`: adds v2 constants, flips default `requiredModels` to v2, keeps legacy set as `requiredModelsLegacy`. - `CoherePipeline.loadModels`: prefers v2 in `decoderDir`, falls back to v1, clear error if neither present. - Decode loop already auto-detects the variant from `attention_mask` shape (shipped in #487 area) — nothing to change runtime-side. - CLI help lists both decoder filenames. v2 artifacts are published at [`FluidInference/cohere-transcribe-03-2026-coreml/q8`](https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml) (`cohere_decoder_cache_external_v2.{mlmodelc,mlpackage}`). The existing v1 decoder remains supported as a fallback. ## Why The v1 (`RangeDim(1, 108)`) decoder has a dynamic `attention_mask` length, which blocks ANE dispatch — `computeUnits = .all` silently falls back to CPU/GPU. v2 fixes the mask at `[1, 1, 1, 108]` and sources the decode position from `position_id`, letting the full decoder land on ANE. Measured with `fluidaudiocli cohere-transcribe` on the same audio (15 tokens, same q8 encoder, 3 warm runs each): \| Decoder \| Config \| Median decoder time \| \|---\|---\|---:\| \| Static (v2) \| `.all` (ANE) \| 2.58 s \| \| Dynamic (v1) \| `.all` \| 4.13 s \| \| Static (v2) \| `--cpu-gpu` \| 10.02 s \| \| Dynamic (v1) \| `--cpu-gpu` \| 4.32 s \| ~1.6× faster decoder end-to-end. The v1 `.all` ≈ v1 `--cpu-gpu` rows confirm RangeDim blocks ANE. v2 attends over the full 108 slots every step, so on pure CPU/GPU it's slower — the win is entirely from ANE residency. Transcripts are byte-identical across configs. 
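The "prefer v2, fall back to v1, clear error if neither is present" lookup can be sketched as below. The two `.mlmodelc` file names come from this PR; `resolveDecoderURL` and `DecoderLookupError` are hypothetical names for illustration and are not `CoherePipeline.loadModels` itself.

```swift
import Foundation

/// Hypothetical error for the case where no decoder variant is on disk.
enum DecoderLookupError: Error {
    case decoderNotFound(directory: URL)
}

/// Returns the static-shape v2 decoder when present, else the v1 decoder.
func resolveDecoderURL(in decoderDir: URL, fileManager: FileManager = .default) throws -> URL {
    // Order matters: the ANE-friendly v2 artifact is tried first.
    let candidates = [
        "cohere_decoder_cache_external_v2.mlmodelc",
        "cohere_decoder_cache_external.mlmodelc",
    ]
    for name in candidates {
        let url = decoderDir.appendingPathComponent(name)
        if fileManager.fileExists(atPath: url.path) {
            return url
        }
    }
    throw DecoderLookupError.decoderNotFound(directory: decoderDir)
}
```

Keeping the preference order in one list makes it easy to see which variant wins when both are staged in the same directory.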
## Test plan - [x] Smoke test v2-preferred: directory containing only `cohere_decoder_cache_external_v2.mlmodelc` transcribes `english_original.wav` correctly. - [x] Smoke test v1 fallback: directory containing only `cohere_decoder_cache_external.mlmodelc` transcribes correctly. - [x] `swift build -c release --product fluidaudiocli` clean. - [x] `swift format` clean on changed files. - [ ] Reviewer: run `fluidaudiocli cohere-transcribe <audio> --model-dir <q8 dir with v2>` to reproduce the ANE speedup. ## Related - v2 export script (mobius): `export-decoder-cache-external-static.py` (uncommitted, to land in a follow-up mobius PR). - HF repo: `FluidInference/cohere-transcribe-03-2026-coreml` now ships both decoders under `q8/`.	2026-04-23 17:42:34 -04:00
Sachin Desai	bd5ba7e1b7	fix abbreviation handling for kokoro (#538) ### Why is this change needed? This change fixes the following issues: - Common abbreviations are now sorted by longest key first so that, e.g., "etc." is matched before "etc", preventing a stray "." when the shorter match is performed first. - The trailing "\b" fails when the abbreviation ends in a non-word character: "Dr." followed by a space is a non-word→non-word transition, so there is no word boundary there. Co-authored-by: Sachin Desai <sdesai@salesforce.com>	2026-04-23 17:40:26 -04:00
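The two abbreviation issues in #538 can be illustrated with a small sketch: keys are sorted longest-first, and a `(?!\w)` lookahead replaces the trailing `\b` so that keys ending in "." still terminate correctly. This is one possible approach under those assumptions, not necessarily what the Kokoro text pipeline does; `expandAbbreviations` and the table in the usage note are hypothetical.

```swift
import Foundation

/// Expands abbreviations in one pass, preferring the longest matching key.
func expandAbbreviations(in text: String, using table: [String: String]) -> String {
    // Longest keys first so "etc." wins over "etc"; ties break alphabetically.
    let keys = table.keys.sorted { $0.count != $1.count ? $0.count > $1.count : $0 < $1 }
    guard !keys.isEmpty else { return text }
    let alternation = keys
        .map { NSRegularExpression.escapedPattern(for: $0) }
        .joined(separator: "|")
    // Lookarounds instead of \b: "Dr." followed by a space has no word
    // boundary after the dot, but it is still "not followed by a word char".
    let pattern = "(?<!\\w)(?:\(alternation))(?!\\w)"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return text }

    var result = text
    let matches = regex.matches(in: text, range: NSRange(text.startIndex..., in: text))
    // Replace from the end so earlier match ranges stay valid.
    for match in matches.reversed() {
        guard let range = Range(match.range, in: result),
              let replacement = table[String(result[range])] else { continue }
        result.replaceSubrange(range, with: replacement)
    }
    return result
}
```

For example, with the table `["Dr.": "Doctor", "etc.": "et cetera", "etc": "et cetera"]`, the input `"Dr. Smith likes tea, etc."` expands without leaving a stray period behind.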
Alex	b10bdcb51d	feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder) (#487 ) ## Summary Adds Cohere Transcribe ASR for 14 languages, shipped as an INT8 encoder + FP16 cache-external decoder hybrid (`CoherePipeline`). One CLI for single-file transcription, one CLI for dataset benchmarking (FLEURS and LibriSpeech). ## Languages English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Greek, Arabic, Japanese, Chinese (Simplified), Korean, Vietnamese. ## What's added ### Library (`Sources/FluidAudio/ASR/Cohere/`) - `CoherePipeline` — encoder + cache-external decoder runner. Allocates the K/V cache host-side (no CoreML State API; iOS 17+), applies the additive cross-attention mask, and detokenizes via SentencePiece byte fallback so CJK comes out as real characters. Accepts separate `encoderDir` / `decoderDir` to support the q8/f16 split. - `CohereAsrConfig` — per-language prompt sequences and token IDs; shared 35 s / 3500-frame audio window and 108-token decoder cache window constants. The 35 s cap traces directly to upstream `max_audio_clip_s: 35`. - `CohereMelSpectrogram` — 128-mel front-end matching the reference model (preemph, Slaney mel, CMVN). ### CLI (`Sources/FluidAudioCLI/Commands/ASR/Cohere/`) - `fluidaudiocli cohere-transcribe <audio> --language <lang>` — single-file transcription. Accepts either `--model-dir` (single dir with both encoder and decoder) or `--encoder-dir` + `--decoder-dir` for the q8/f16 split. - `fluidaudiocli cohere-benchmark` — dataset benchmark with `--dataset fleurs\|librispeech`, `--subset` for LibriSpeech splits, `--languages` for FLEURS codes, `--auto-download`, and `--checkpoint-every N` (default 100) so long runs persist partial results and survive mid-run crashes. ### `ModelNames.swift` - New `Repo.cohereTranscribeCoreml` → `FluidInference/cohere-transcribe-03-2026-coreml/q8`. 
- New `ModelNames.CohereTranscribe` enum with `encoder`, `decoderCacheExternal`, `vocab` and the corresponding `.mlmodelc` paths. ### Documentation - `Documentation/ASR/Cohere.md` — architecture, API, CLI, LibriSpeech + FLEURS results, upstream config provenance (`max_audio_clip_s`, `overlap_chunk_second`), comparison vs Cohere's Figure 4 reference numbers, caveats. ### FLEURS coverage - Extends `FleursBenchmark.supportedLanguages` with the 6 non-European Cohere languages (`pt_br`, `ar_eg`, `ja_jp`, `cmn_hans_cn`, `ko_kr`, `vi_vn`). ## LibriSpeech test-clean (Apple M2 2022, Tahoe 26.0) Full split, all 2,620 utterances, single-chunk. \| Subset \| Samples \| WER \| CER \| RTFx (per-file mean) \| RTFx (total audio/compute) \| \|---\|---:\|---:\|---:\|---:\|---:\| \| test-clean \| 2,620 \| 1.77% \| 0.60% \| 2.04× \| 1.72× \| 5h 24m audio processed in 3h 09m compute (3h 12m wall time including one-time ~6 min ANE cold-start compile). Competitive with Parakeet TDT 0.6B v3 (~1.7%) and Whisper large-v3 (~1.8%). ## FLEURS results (full splits, single-chunk) M4 Pro / Tahoe 26.0, 9,911 samples total. 
\| FLEURS code \| Language \| Samples \| WER \| CER \| RTFx \| \|---\|---\|---:\|---:\|---:\|---:\| \| en_us \| English \| 647 \| 5.63% \| 3.19% \| 2.49× \| \| fr_fr \| French \| 676 \| 6.22% \| 3.11% \| 2.21× \| \| de_de \| German \| 862 \| 5.84% \| 2.83% \| 1.98× \| \| es_419 \| Spanish (LATAM) \| 908 \| 4.53% \| 2.40% \| 1.34× \| \| it_it \| Italian \| 865 \| 4.03% \| 2.04% \| 3.15× \| \| pt_br \| Portuguese (BR) \| 919 \| 6.44% \| 3.38% \| 2.79× \| \| nl_nl \| Dutch \| 364 \| 8.07% \| 4.14% \| 2.04× \| \| pl_pl \| Polish \| 758 \| 7.49% \| 3.23% \| 1.98× \| \| el_gr \| Greek \| 650 \| 11.50% \| 5.45% \| 2.00× \| \| ar_eg \| Arabic (EG) \| 428 \| 18.46% \| 6.71% \| 2.06× \| \| ja_jp \| Japanese \| 650 \| 60.13%† \| 6.25% \| 2.23× \| \| cmn_hans_cn \| Mandarin \| 945 \| 98.52%† \| 12.01% \| 1.85× \| \| ko_kr \| Korean \| 382 \| 16.39% \| 6.67% \| 1.84× \| \| vi_vn \| Vietnamese \| 857 \| 9.55% \| 6.87% \| 1.55× \| †Japanese and Mandarin are written without word boundaries, so WER on the raw hypothesis is a tokenization artifact — CER is the real accuracy metric. Cohere's own Figure 4 uses CER for zh/ja/ko for the same reason. 
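The footnote's point about WER on unsegmented scripts can be made concrete with a small sketch. It assumes whitespace tokenization for WER and whitespace-stripped characters for CER; the benchmark's actual normalization is not shown in this PR and may differ.

```swift
/// Levenshtein distance over any sequence of equatable tokens.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            current[j] = min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
        }
        previous = current
    }
    return previous[b.count]
}

/// Word error rate with whitespace tokenization (an assumption of this sketch).
func wer(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map(String.init)
    let hyp = hypothesis.split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}

/// Character error rate, ignoring spaces (also an assumption).
func cer(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference.filter { $0 != " " })
    let hyp = Array(hypothesis.filter { $0 != " " })
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}
```

A Japanese sentence with no spaces is a single "word", so any character-level slip makes the whole utterance count as one wrong word under WER, while CER only charges for the characters that actually differ.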
## Usage ```swift let models = try await CoherePipeline.loadModels( encoderDir: q8Dir, decoderDir: q8Dir, vocabDir: q8Dir ) let pipeline = CoherePipeline() let result = try await pipeline.transcribe( audio: samples, // 16 kHz mono Float32, up to 35 s models: models, language: .english ) ``` ```bash # Single file swift run -c release fluidaudiocli cohere-transcribe audio.wav --language en # LibriSpeech swift run -c release fluidaudiocli cohere-benchmark \ --dataset librispeech --subset test-clean \ --model-dir /path/to/q8 --auto-download # FLEURS swift run -c release fluidaudiocli cohere-benchmark \ --dataset fleurs --languages en_us,fr_fr --auto-download ``` ## HuggingFace - INT8 hybrid (shipped): https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml (subdir `q8/`) - Upstream model: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026 ## Notes - 35 s single-chunk limit is baked into the upstream model (`max_audio_clip_s: 35` in `cohere-pytorch/config.json`). Upstream Python also supports >35 s via 5 s-overlap chunking (`overlap_chunk_second: 5`); this port does not implement that wrapper yet and skips longer utterances with a warning. - Cache-external decoder stays FP16: INT8 decoder quantization regresses quality significantly in testing and is not shipped. ## Test plan - [x] Library + CLI release build clean - [x] Single-file transcription via \`cohere-transcribe\` - [x] FLEURS en_us sanity (5.63% WER) - [x] Full 14-language FLEURS benchmark (9,911 samples) - [x] Full LibriSpeech test-clean benchmark (2,620 samples, WER 1.77%) - [x] CJK CER validated (word-boundary-agnostic metric for ja/zh) - [x] Checkpoint-every survives kill mid-run - [x] \`printFinalSummary\` no longer aborts on macOS 26	2026-04-23 10:59:07 -04:00
Alex	7c9be31c05	fix(benchmark): repair 3 pre-existing script/download bugs (#534 ) ## Summary Three unrelated pre-existing bugs surfaced while validating PR #515. All of them block `Scripts/parakeet_subset_benchmark.sh --download` from succeeding, but none are related to the v3 script-filtering work. Consolidating into one PR since each fix is ~1–3 lines. ### 1. Japanese TDT folder-name mismatch `Scripts/parakeet_subset_benchmark.sh` verifies the Japanese TDT model at `$MODELS_DIR/parakeet-tdt-ja/`, but the folder was renamed to `parakeet-ja` in `4ef33f0b6` (`Repo.parakeetJa.folderName = "parakeet-ja"`). Result: `verify_assets()` always reported missing assets even on a fully provisioned machine. One-line rename to match. ### 2. EOU streaming CLI writes to wrong path `ParakeetEouCommand` had a default / `--use-cache` split where the default branch produced `$CWD/Models/<chunk>/<chunk>/` (double-nested, relative to CWD) as the load path, while `downloadModels()` called `deletingLastPathComponent().deletingLastPathComponent()` then `DownloadUtils.downloadRepo(repo, to:)` which appended `folderName = "parakeet-eou-streaming/<chunk>"`. Net effect: files landed at `$CWD/Models/parakeet-eou-streaming/<chunk>/` while `loadModels()` looked at `$CWD/Models/<chunk>/<chunk>/` — model load failed silently. Unified on Application Support (matches every other CoreML model in FluidAudio). `--use-cache` retained as a no-op flag for backward compatibility. ### 3. earnings22-kws dataset 404 HuggingFace consolidated `argmaxinc/earnings22-kws-golden` into `argmaxinc/contextual-earnings22`. The old id now returns 404 from the Datasets-Server REST API (no redirect follow). The new dataset has the same feature schema (`audio`, `file_id`, `text`, `dictionary`, ...), so swapping the id is sufficient — no downstream consumer changes needed. 
## Test plan Ran `Scripts/parakeet_subset_benchmark.sh --download` end-to-end: - [x] `verify_assets` correctly resolves `parakeet-ja/` (all 5 expected files present) - [x] EOU warmup: `Models downloaded to ~/Library/Application Support/FluidAudio/Models/parakeet-eou-streaming/320ms`, 0.00% WER on warmup file - [x] earnings22-kws: 1140+ files downloaded (was 0 before), no 404 - [x] `swift build` passes Out of scope but observed (pre-existing, unrelated): - `ctc-earnings-benchmark --auto-download` does not actually auto-download CTC-110m model - THCHS-30 dataset hit HF IP rate limit (429), which is transient	2026-04-21 04:22:18 -04:00
Alex	f8badf7899	docs(diarization): update AMI offline benchmarks after #523 fix (#533 ) ## Summary Follow-up to #523. The merged fix dramatically improved offline diarization accuracy on the AMI SDM test set, but the documented benchmark numbers still reflected the pre-fix buggy pipeline. ### Benchmark impact (full 16-meeting AMI SDM test set, post-#523 merged main) \| Metric \| Before #523 (buggy) \| After #523 (fixed) \| Change \| \|---\|---:\|---:\|---:\| \| Average DER \| 20.74% \| 10.62% \| −48.8% \| \| Average JER \| 47.20% \| 17.37% \| −63.2% \| \| Average Speaker Error \| 13.80% \| 3.26% \| −76.4% \| \| Correct speaker count \| 3 / 16 \| 12 / 16 \| +9 meetings \| \| Catastrophic failures (DER > 40%) \| 2 \| 0 \| −2 \| ES2004d in particular went from 69.4% DER → 11.4% DER. ## Changes 1. `Documentation/Diarization/BenchmarkAMISubset.md` - Offline VBx 4-meeting subset table updated with post-fix numbers (12.0% avg DER, down from 21.8%) - Added full 16-meeting AMI SDM reference (10.62% DER) - Summary table Offline VBx row updated 2. `Documentation/Benchmarks.md` - Added full AMI SDM 16-meeting offline results block in the "Offline diarization pipeline" section (keeps existing VoxConverse numbers) 3. `Sources/FluidAudioCLI/Commands/DiarizationBenchmark.swift` - Added EN2002 and TS3003 series to `allMeetings` so `--dataset ami-sdm --auto-download` actually enumerates all 16 official test meetings. Previously, `DatasetDownloader.swift` downloaded 16 WAVs but `DiarizationBenchmark.swift`'s separate `allMeetings` list only included 8 of them (ES2004 + IS1009), silently skipping EN2002 and TS3003. 
## Test plan - [x] `swift build -c release` passes on branch - [x] `swift run fluidaudiocli diarization-benchmark --mode offline --dataset ami-sdm --auto-download` now enumerates all 16 meetings - [x] Verification run reproduces documented numbers: 10.62% avg DER across 16 meetings, 12/16 with correct speaker count - [ ] CI benchmark workflow picks up new numbers on merge ## Related - Implements follow-up docs for #523 - Closes #513 (original bug report): already implicitly closed by the #523 merge; this PR makes the docs reflect reality	2026-04-20 20:19:46 -04:00
thechatk	30599f9c89	Fix offline diarization pipeline producing single-speaker output (#523)

## Summary

Fixes three bugs in the offline diarization pipeline that caused it to attribute nearly all segments to a single speaker:

- `vDSP_mtrans` dimension swap: `frameCount` and `speakerCount` arguments were reversed, corrupting transposed speaker masks to be nearly identical across all speakers
- Missing activity ratio filter: the reference pyannote implementation filters speakers with <20% clean activation, but the current code only filters completely silent speakers, allowing junk embeddings through to clustering
- Soft masks vs binary masks: the reference derives per-frame speaker masks from argmax on powerset logits (binary 0/1), but the current code uses soft probabilistic masks via matrix-vector multiplication, producing blurred activations

## Test results

After all three fixes, diarization works correctly across multiple test scenarios. Testing against a 467-second 3-speaker audio file achieved 97% F1 score vs PyTorch pyannote while maintaining real-time performance (120x faster than real-time).

## Files changed

- `Sources/FluidAudio/Diarizer/Offline/Extraction/OfflineEmbeddingExtractor.swift`
- `Sources/FluidAudio/Diarizer/Offline/Segmentation/OfflineSegmentationProcessor.swift`

Fixes #513

Co-authored-by: thechatk <275578031+thechatk@users.noreply.github.com>

2026-04-20 19:35:56 -04:00
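The third fix in #523 (binary masks from an argmax over powerset logits) can be illustrated with a small, dependency-free sketch. This is not the code from `OfflineSegmentationProcessor.swift`; the function name, the `classToSpeakers` mapping, and the toy two-speaker powerset are assumptions made for the example.

```swift
/// Builds binary per-speaker masks from per-frame powerset logits.
/// - `logits[frame][cls]` holds one score per powerset class.
/// - `classToSpeakers[cls]` lists the speakers active in that class (assumed mapping).
/// Returns masks laid out as `[speaker][frame]` with values 0 or 1.
func binaryMasks(logits: [[Float]], classToSpeakers: [[Int]], speakerCount: Int) -> [[Float]] {
    var masks = [[Float]](
        repeating: [Float](repeating: 0, count: logits.count),
        count: speakerCount
    )
    for (frame, row) in logits.enumerated() {
        // Argmax over the powerset classes for this frame.
        guard let best = row.indices.max(by: { row[$0] < row[$1] }) else { continue }
        // Every speaker in the winning class is marked fully active (1), never fractional.
        for speaker in classToSpeakers[best] {
            masks[speaker][frame] = 1
        }
    }
    return masks
}

// Toy powerset for two speakers: {}, {0}, {1}, {0, 1}.
let classes: [[Int]] = [[], [0], [1], [0, 1]]
let logits: [[Float]] = [
    [0.1, 2.0, 0.3, 0.2],  // speaker 0 only
    [0.1, 0.2, 0.3, 3.0],  // both speakers overlap
    [1.5, 0.2, 0.3, 0.1],  // silence
]
let masks = binaryMasks(logits: logits, classToSpeakers: classes, speakerCount: 2)
print(masks)
```

Because the sketch writes masks speaker-major directly, it needs no transpose. A pipeline that transposes a frame-major buffer must pass the row and column counts in the right order, which is the class of bug the first fix addresses.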
Alex	b789a56609	Fix Japanese TDT model download filename mismatch (#522)

Fixes the infinite re-download loop for Japanese TDT models reported in #521.

## Problem

The `download()` function was using hardcoded `Names.decoderFile` and `Names.jointFile` for all model versions. For `.tdtJa`, this downloaded:

- `Decoder.mlmodelc`
- `JointDecision.mlmodelc`

But `modelsExist()` checks for version-specific filenames:

- `Decoderv2.mlmodelc`
- `Jointerv2.mlmodelc`

This mismatch caused the existence check to fail, triggering cache purge and re-download in an infinite loop.

## Solution

Use `getModelFileNames(version)` in the download function to get the correct filenames for each version, matching what `modelsExist()` expects.

## Testing

- [x] Build passes
- [x] Filenames now match between download and existence check

2026-04-20 17:56:10 -04:00
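The #522 fix amounts to routing both the download step and the existence check through one filename lookup. The sketch below shows that idea in isolation. The types `ModelVersion` and `ModelFileNames` and the three functions are hypothetical stand-ins, not the real `AsrModels` API; only the four `.mlmodelc` filenames come from the commit message.

```swift
// Hypothetical stand-in for the library's model version enum.
enum ModelVersion {
    case v3
    case tdtJa
}

struct ModelFileNames {
    let decoder: String
    let joint: String
}

/// Single source of truth for version-specific filenames.
func modelFileNames(for version: ModelVersion) -> ModelFileNames {
    switch version {
    case .v3:
        return ModelFileNames(decoder: "Decoder.mlmodelc", joint: "JointDecision.mlmodelc")
    case .tdtJa:
        return ModelFileNames(decoder: "Decoderv2.mlmodelc", joint: "Jointerv2.mlmodelc")
    }
}

/// The download step asks the lookup which files to fetch...
func filesToDownload(for version: ModelVersion) -> [String] {
    let names = modelFileNames(for: version)
    return [names.decoder, names.joint]
}

/// ...and the existence check asks the same lookup, so the two cannot drift apart.
func modelsExist(_ version: ModelVersion, in cachedFiles: Set<String>) -> Bool {
    let names = modelFileNames(for: version)
    return cachedFiles.contains(names.decoder) && cachedFiles.contains(names.joint)
}

let cache = Set(filesToDownload(for: .tdtJa))
print(modelsExist(.tdtJa, in: cache))
```

With hardcoded names in the download path, the cache would contain `Decoder.mlmodelc` and `JointDecision.mlmodelc`, the check for `.tdtJa` would fail, and the purge-and-retry loop described above would follow.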
Phoenix	3dc57c83b3	Fix: Lower ASR minimum audio guard from 1s to 300ms (#531)

Short single-word utterances (e.g. "yes", "no", "stop") are typically 500-700ms and were silently rejected by AsrManager with ASRError.invalidAudioData before transcription ran. Lower the guard to 300ms so these reach the model. Adds ASRConstants.minimumAudioDurationSeconds and a companion minimumRequiredSamples(forSampleRate:) helper, mirroring the existing calculateEncoderFrames(from:) pattern so the arithmetic lives next to the other ASR audio math rather than being inlined at each guard.

## Summary

- Lower the minimum audio length accepted by `AsrManager` from 1s (16,000 samples) to 300ms (4,800 samples) so short single-word utterances ("yes", "no", "stop") reach the Parakeet TDT model instead of being silently rejected with `ASRError.invalidAudioData`.
- Add `ASRConstants.minimumAudioDurationSeconds` (= 0.3) and a companion `minimumRequiredSamples(forSampleRate:)` helper, mirroring the existing `calculateEncoderFrames(from:)` pattern so the arithmetic is defined once next to the other ASR audio math.
- Update the in-memory guard in `AsrManager+Transcription.swift` and the disk-backed guard in `AsrManager.swift` to call the helper via a locally-named `minimumRequiredSamples`.
- Refresh the `ASRError.invalidAudioData` message from "at least 1 second" to "at least 300ms" so it stays truthful.

## Files changed

- `Sources/FluidAudio/Shared/ASRConstants.swift`: add `minimumAudioDurationSeconds` constant and `minimumRequiredSamples(forSampleRate:)` helper.
- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager+Transcription.swift`: in-memory guard now uses the helper.
- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager.swift`: disk-backed guard (`transcribeDiskBacked`) now uses the helper.
- `Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift`: update `ASRError.invalidAudioData` message to reflect the new threshold.
- `Tests/FluidAudioTests/ASR/Parakeet/ASRConstantsTests.swift`: add `testMinimumAudioDurationSeconds` covering both the constant and the helper.

## Why is this change needed?

https://github.com/altic-dev/FluidVoice/issues/276

https://github.com/altic-dev/FluidAudio/blob/f3dba78a23cb706d01c889bd54f7efd26871e82e/Sources/FluidAudio/ASR/Parakeet/AsrTranscription.swift#L11

2026-04-18 21:43:57 -04:00
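The sample-count arithmetic behind the new guard in #531 is small enough to sketch. The enum name `ASRAudioMath` and the `isLongEnough` function are hypothetical; the constant value (0.3 s) and the helper name `minimumRequiredSamples(forSampleRate:)` come from the commit message. The real implementation in `ASRConstants.swift` may round differently.

```swift
// Hypothetical namespace mirroring the helper described in the commit message.
enum ASRAudioMath {
    /// Minimum accepted audio duration (value taken from the commit message).
    static let minimumAudioDurationSeconds: Double = 0.3

    /// Converts the minimum duration into a sample count for a given sample rate.
    /// Rounding to the nearest sample is an assumption of this sketch.
    static func minimumRequiredSamples(forSampleRate sampleRate: Int) -> Int {
        Int((minimumAudioDurationSeconds * Double(sampleRate)).rounded())
    }
}

/// Guard in the style the commit describes: reject audio shorter than the minimum.
func isLongEnough(sampleCount: Int, sampleRate: Int) -> Bool {
    sampleCount >= ASRAudioMath.minimumRequiredSamples(forSampleRate: sampleRate)
}

// At 16 kHz the threshold is 0.3 * 16_000 = 4_800 samples.
let threshold = ASRAudioMath.minimumRequiredSamples(forSampleRate: 16_000)
print(threshold)

// A 500 ms utterance (8_000 samples) now passes; under the old 1 s guard it would not.
print(isLongEnough(sampleCount: 8_000, sampleRate: 16_000))
```

Keeping the conversion in one helper means the in-memory and disk-backed guards cannot disagree about the threshold.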
Alex	4ef33f0b64	Fix Japanese TDT models and consolidate to unified AsrModels API (#521)

## Summary

- Fixes TDT Japanese model downloads by returning union of both CTC and TDT required models
- Enables Japanese TDT models to work with `AsrModels`/`AsrManager` for timing information
- Removes all redundant Japanese-specific managers (TdtJaManager, CtcJaManager) and consolidates to unified AsrModels path

## Problems Fixed

### Problem 1: Model Download (Issue #517)

The `parakeet-ctc-0.6b-ja-coreml` repository contains both CTC models (`CtcDecoder.mlmodelc`) and TDT models (`Decoderv2.mlmodelc`, `Jointerv2.mlmodelc`). When `TdtJaModels` attempts to download from `Repo.parakeetJa`, the `getRequiredModelNames()` function was only returning `CTCJa.requiredModels`, which doesn't include the TDT-specific models. This caused TDT Japanese models to fail downloading with "Model file not found" errors.

### Problem 2: AsrModels File Name Mismatch

`AsrModels` used hardcoded file names from `ModelNames.ASR` (expecting `Decoder.mlmodelc` and `JointDecision.mlmodelc`), which didn't match Japanese TDT model files (`Decoderv2.mlmodelc`, `Jointerv2.mlmodelc`). This prevented users from loading `.tdtJa` with `AsrModels`/`AsrManager` to get timing information.

### Problem 3: Code Duplication

Japanese models had 4 specialized managers (TdtJaManager, CtcJaManager, TdtJaModels, CtcJaModels) that duplicated functionality and didn't match the pattern used by other TDT variants (v2, v3, tdtCtc110m all use AsrModels directly).

## Solution

### Fix 1: Model Downloads (ModelNames.swift)

Updated `ModelNames.swift` line 675-677 to return the union of both model sets:

```swift
case .parakeetJa:
    // Repo contains BOTH CTC and TDT models - return union of both sets
    return ModelNames.CTCJa.requiredModels.union(ModelNames.TDTJa.requiredModels)
```

This ensures all 5 models are downloaded:

- Preprocessor.mlmodelc (shared)
- Encoder.mlmodelc (shared)
- CtcDecoder.mlmodelc (CTC only)
- Decoderv2.mlmodelc (TDT only)
- Jointerv2.mlmodelc (TDT only)

### Fix 2: Version-Specific Model File Names (AsrModels.swift)

- Added `getModelFileNames()` to return version-specific decoder, joint, and vocabulary file names
- Added `getRequiredModels()` to return version-specific model sets
- Updated `load()`, `loadVocabulary()`, and `modelsExist()` to use version-specific names

### Fix 3: Remove Redundant Code

Deleted:

- `TdtJaManager.swift` - broken, redundant
- `TdtJaModels.swift` - redundant
- `CtcJaManager.swift` - redundant (TDT is superior)
- `CtcJaModels.swift` - redundant
- `AsrModelVersion.ctcJa` enum case - no longer needed
- All related tests (replaced by `AsrModelsTdtJaTests`)

Updated:

- JapaneseAsrBenchmark → uses `AsrModels` + `AsrManager`
- Removed `.ctcJa` from version labels and validation

## Result

Clean, unified API for Japanese TDT models that matches other TDT variants:

```swift
// Load Japanese TDT models
let models = try await AsrModels.load(version: .tdtJa)
let manager = AsrManager(models: models)

// Transcribe with timing info
var state = try TdtDecoderState(decoderLayers: 2)
let result = try await manager.transcribe(url, decoderState: &state)

// Access text and timing information
print(result.text)
print(result.timings) // ✅ Timing info available!
```

## Benefits

1. Timing information - Users get token timings via `AsrManager` (not available in `TdtJaManager`)
2. Consistency - Japanese TDT follows same pattern as v2/v3/tdtCtc110m
3. Less code - Removed ~1000 lines of redundant manager code
4. Single source of truth - One way to load Japanese TDT models

## Testing

- ✅ `CtcJaTests.testCtcJaTranscription` - Full CTC Japanese pipeline test
- ✅ `TdtJaTests.testTdtJaTranscription` - Full TDT Japanese pipeline test
- ✅ `AsrModelsTdtJaTests.testTdtJaWithAsrModels` - TDT Japanese loads via AsrModels
- ✅ `AsrModelsTdtJaTests.testTdtJaWithAsrManager` - TDT Japanese works with AsrManager
- ✅ Build verified with `swift build`

Fixes #517

2026-04-12 15:09:58 -04:00
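Fix 1 in #521 relies on `Set.union` to merge two overlapping lists of required files without duplicating the shared ones. The sketch below reproduces the count of five from the commit message. The split of files between the two sets is inferred from the "(shared)", "(CTC only)", and "(TDT only)" annotations and is an assumption about what `CTCJa.requiredModels` and `TDTJa.requiredModels` actually contain.

```swift
// Assumed contents of the two required-model sets (inferred from the commit message).
let ctcJaRequired: Set<String> = [
    "Preprocessor.mlmodelc",  // shared
    "Encoder.mlmodelc",       // shared
    "CtcDecoder.mlmodelc",    // CTC only
]
let tdtJaRequired: Set<String> = [
    "Preprocessor.mlmodelc",  // shared
    "Encoder.mlmodelc",       // shared
    "Decoderv2.mlmodelc",     // TDT only
    "Jointerv2.mlmodelc",     // TDT only
]

// Union keeps each shared file once, so 3 + 4 entries collapse to 5.
let requiredForRepo = ctcJaRequired.union(tdtJaRequired)

// Before the fix, only the CTC set was returned, so these files were never fetched.
let missingBeforeFix = tdtJaRequired.subtracting(ctcJaRequired)

print(requiredForRepo.sorted())
print(missingBeforeFix.sorted())
```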
Alex	044bb0bf8f	Refactor: Rename Repo.parakeetCtcJa to Repo.parakeetJa for accuracy (#520)

## Problem

The enum name `Repo.parakeetCtcJa` is misleading because it implies the repository only contains CTC models, but it actually contains both CTC and TDT models.

## Verified Repository Contents

`FluidInference/parakeet-ctc-0.6b-ja-coreml` contains:

- ✅ CTC models: `CtcDecoder.mlmodelc`
- ✅ TDT v2 models: `Decoderv2.mlmodelc` + `Jointerv2.mlmodelc`
- Shared: `Preprocessor.mlmodelc`, `Encoder.mlmodelc`, `vocab.json`

## Solution

Renamed `Repo.parakeetCtcJa` → `Repo.parakeetJa` to accurately reflect that it's the Japanese models repository containing both decoder variants.

## Changes

- ModelNames.swift: Renamed enum case from `.parakeetCtcJa` to `.parakeetJa`
- AsrModels.swift: Updated `.ctcJa` and `.tdtJa` to use `.parakeetJa`
- CtcJaModels.swift: Updated repository reference
- TdtJaModels.swift: Updated repository reference and added comment

## Testing

- ✅ Build succeeds
- ✅ Both CTC and TDT Japanese managers now use the correct repository name

## Related

- Follow-up to #516 and #519
- Addresses naming clarity issue raised by @Josscii

2026-04-12 00:21:30 -04:00
Alex	c4aaa5d018	Fix parakeet-ctc-ja download error: Prevent AsrModels from loading CTC-only models (#516)

## Problem

Issue #514 reported that downloading `parakeet-ctc-ja` models would succeed, but then fail during loading with:

```
[WARN] First load failed: Model file not found: Decoder.mlmodelc
```

### Root Cause

`AsrModels` (designed for TDT models) was incorrectly accepting `.ctcJa` and `.ctcZhCn` model versions, which use different decoder file names:

- TDT models use `Decoder.mlmodelc`
- Japanese CTC models use `CtcDecoder.mlmodelc`
- Chinese CTC models use `Decoder.mlmodelc` (but with different structure)

When users tried to load `.ctcJa` models via `AsrModels`:

1. Download succeeded (correct files downloaded: `CtcDecoder.mlmodelc`)
2. Loading failed (looking for wrong file: `Decoder.mlmodelc`)

## Solution

Added validation in `AsrModels.load()` and `AsrModels.download()` to reject CTC-only model versions with clear error messages that direct users to the correct manager classes:

- For `.ctcJa` → Use `CtcJaManager`
- For `.ctcZhCn` → Use `CtcZhCnManager`

## Changes

### Modified Files

- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrModels.swift`
  - Added validation at the start of `load()` method
  - Added validation at the start of `download()` method
  - Throws descriptive `AsrModelsError` with guidance to correct manager
- `Tests/FluidAudioTests/ASR/Parakeet/SlidingWindow/TDT/AsrModelsTests.swift`
  - Added 5 new tests for CTC-only model validation
  - Tests verify both `.ctcJa` and `.ctcZhCn` are properly rejected
  - Tests verify error messages contain correct manager class names

## Testing

All 32 tests in `AsrModelsTests` pass, including the new validation tests:

- ✅ `testCtcJaModelRejectsAsrModelsLoad()`
- ✅ `testCtcJaModelRejectsAsrModelsDownload()`
- ✅ `testCtcZhCnModelRejectsAsrModelsLoad()`
- ✅ `testCtcZhCnModelRejectsAsrModelsDownload()`
- ✅ `testCtcOnlyModelsAreMarkedCorrectly()`

## Example Error Message

Before (confusing):

```
Model file not found: Decoder.mlmodelc
```

After (clear guidance):

```
CTC-only model .ctcJa must be loaded via CtcJaManager, not AsrModels
```

Closes #514

2026-04-11 23:01:12 -04:00
Alex	421313a5b3	docs: Fix speaker diarization model references from 3.1 to community-1 (#510)

### Why is this change needed?

Clarifies that FluidAudio's speaker diarization is based on pyannote/speaker-diarization-community-1, not 3.1. Fixes confusion from issue #508.

### What changed?

- Updated code comment in `SegmentationProcessor.swift`
- Fixed `CLAUDE.md` model source reference
- Clarified `Documentation/Benchmarks.md` that both online/offline use community-1

### Context

The actual CoreML model at [FluidInference/speaker-diarization-coreml](https://huggingface.co/FluidInference/speaker-diarization-coreml) has always been based on community-1, but some documentation incorrectly referenced 3.1.

Related: #508

2026-04-10 22:45:30 -04:00
Hamza Qayyum	fcd80f1085	Parallelize chunked Parakeet batch transcription (#507)

### Why is this change needed?

This PR speeds up Parakeet batch transcription for long audio by ~2.2-2.8x by parallelizing the existing stateless chunked path. It doesn't change the streaming/live transcription path.

It adds a configurable `parallelChunkConcurrency` setting to `ASRConfig`, lets `AsrManager` create worker clones from already-loaded `AsrModels`, and updates `ChunkProcessor` to send independent chunks across that worker pool before merging the results with the existing merge logic.

The important part is that the decoding behavior for each chunk stays the same. The patch is really about scheduling chunk work in parallel so the runtime can keep more hardware busy and improve throughput on longer files.

### Validation

Benchmarked on Apple M3, using a 16 kHz 16-bit mono WAV file downloaded from [this](https://www.youtube.com/watch?v=GT_sXIUJPUo) video (~1 hour duration), with 5 runs each for current upstream vs. PR branch.

| Model | Upstream Avg Time | PR Branch Avg Time | Speedup | Upstream Avg Peak Mem | PR Branch Avg Peak Mem | Delta |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Parakeet v2 | 31.84 s | 11.25 s | 2.83x | 515.9 MiB | 537.4 MiB | +21.4 MiB |
| Parakeet v3 | 31.37 s | 12.75 s | 2.46x | 496.0 MiB | 527.0 MiB | +31.0 MiB |
| Parakeet tdt-ctc-110m | 19.89 s | 9.08 s | 2.19x | 489.6 MiB | 509.2 MiB | +19.7 MiB |

I compared the resulting transcripts and word timings before and after this change for v2, v3, and `tdt-ctc-110m`, and found no differences. So based on this one test file at least, the optimization appears safe.

Peak memory footprint was measured with macOS `/usr/bin/time -lp`. While it does increase, the measured increase is modest relative to the speedup, so I think it's reasonable to keep `parallelChunkConcurrency` set to `4` by default rather than make it opt-in.

### `parallelChunkConcurrency` Optimal Value

A default value of `4` for the chunk parallelism was chosen because higher values yielded little to no extra speedup and lower values still left speed on the table, at least on the two devices I tested: iPhone SE 3 and M3 MacBook Air.

### AI Disclosure

OpenAI Codex was used to write the code for this patch.

Co-authored-by: Alex <hanweng9@gmail.com>

2026-04-10 22:34:43 -04:00
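The scheduling idea in #507 (independent chunks, a fixed number in flight, results merged in the original order) can be sketched with Swift structured concurrency. This is not the `ChunkProcessor` implementation and does not model worker clones of `AsrModels`; `processChunks` and its parameters are hypothetical names, and the only value taken from the commit message is the default concurrency of 4.

```swift
/// Runs `transform` over `chunks` with at most `maxConcurrency` tasks in flight
/// and returns the outputs in the original chunk order.
/// Hypothetical helper for illustration; not a FluidAudio API.
func processChunks<Chunk: Sendable, Output: Sendable>(
    _ chunks: [Chunk],
    maxConcurrency: Int,
    transform: @escaping @Sendable (Chunk) async -> Output
) async -> [Output] {
    precondition(maxConcurrency > 0, "need at least one worker")
    return await withTaskGroup(of: (Int, Output).self) { group -> [Output] in
        var results = [Output?](repeating: nil, count: chunks.count)
        var next = 0

        // Starts the task for the chunk at position `next`, remembering its index.
        func startNext() {
            let index = next
            let chunk = chunks[index]
            group.addTask {
                let output = await transform(chunk)
                return (index, output)
            }
            next += 1
        }

        // Seed the group with the first `maxConcurrency` chunks.
        while next < min(maxConcurrency, chunks.count) {
            startNext()
        }
        // Each time one chunk finishes, store its output and start another.
        for await (index, output) in group {
            results[index] = output
            if next < chunks.count {
                startNext()
            }
        }
        // Every slot has been filled once the group drains.
        return results.compactMap { $0 }
    }
}
```

A call such as `await processChunks(audioChunks, maxConcurrency: 4) { chunk in ... }` would keep four chunks decoding at once. Because each result is written back to its chunk index, the merge step sees the same ordering as the sequential path, which is consistent with the commit's observation that transcripts and timings were unchanged.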
Alex	04747b3e77	Standardize model loading API across all ASR managers (#506)

## Summary

Standardizes the model loading API across all ASR managers to reduce developer cognitive load and improve consistency. This addresses [issue #457 comment #4203327648](https://github.com/FluidInference/FluidAudio/issues/457#issuecomment-4203327648).

## Problem

Each ASR manager had different model loading APIs:

- `AsrManager`: `configure(models:)` ❌
- `SlidingWindowAsrManager`: `startStreaming(models:, source:)` ❌
- `StreamingEouAsrManager`: `loadModels(modelDir:)` with inconsistent overloads ❌
- `StreamingNemotronAsrManager`: `loadModels(modelDir:)` ❌

This created developer confusion and increased documentation burden.

## Solution

Unified API pattern across all managers:

```swift
// All managers now use consistent naming:
manager.loadModels(from: URL)                   // Load from local directory
manager.loadModels(_ models: PreloadedModels)   // Use pre-loaded models
manager.loadModels(to: URL?, progressHandler:)  // Download and load (optional)
```

### Changes

- AsrManager: Added `loadModels(_:)`, deprecated `configure(models:)`
- SlidingWindowAsrManager: Separated model loading from streaming activation, added CoreML import
- StreamingEouAsrManager: Standardized to `loadModels(from:)`
- StreamingNemotronAsrManager: Standardized to `loadModels(from:)` with download support
- CLI: Updated 9 command files to use new APIs

## Verification

Ran full benchmark suite (8 models × 100 files) to verify zero regression:

| Model | Baseline | Current | Delta | Status |
|-------|----------|---------|-------|--------|
| Parakeet TDT v3 (0.6B) | 2.6% | 2.64% | +0.04% | ✅ |
| Parakeet TDT v2 (0.6B) | 3.8% | 3.79% | -0.01% | ✅ |
| CTC-TDT 110M | 3.6% | 3.56% | -0.04% | ✅ |
| CTC Earnings | 16.54% | 16.55% | +0.01% | ✅ |
| EOU 320ms (120M) | 7.11% | 7.11% | 0.00% | ✅ |
| Nemotron 1120ms (0.6B) | 1.99% | 1.99% | 0.00% | ✅ |
| TDT Japanese (0.6B) | 6.11% | 6.11% | 0.00% | ✅ |
| CTC Chinese (0.6B) | 8.37% | 8.37% | 0.00% | ✅ |

✓ No WER/CER regressions (all within 0.3% of baseline)

## Benefits

- ✅ Reduced cognitive load - single pattern across all managers
- ✅ Cleaner separation of concerns - model loading vs. streaming activation
- ✅ Consistent prepositions - all use `from:` for loading from directory
- ✅ Zero performance impact - validated with comprehensive benchmarks
- ✅ Backward compatibility - deprecated APIs still work with migration warnings

2026-04-08 14:14:45 -04:00


273 Commits