273 Commits

Author SHA1 Message Date
Alex 847a985ae4 fix(tts/pocket-tts): repair v1 voice cloning for pocket-tts 2.0.0 (#592) (#601)
## Summary

Fixes #592 — PocketTTS voice cloning produced garbled audio on macOS
after the `pocket-tts==2.0.0` upgrade. v2 (pre-baked KV snapshot) voices
were unaffected — only the v1 path (user audio → `mimi_encoder` →
`cond_step` prefill) was broken.

Two compounding bugs:

### RCA 1 — stale `mimi_encoder`
The `mimi_encoder.mlpackage` originally published on HF was traced
against pre-2.0.0 `pocket-tts` (torch 2.9.1, Float32, scalar output) and
no longer matched the runtime cond_step contract. Re-traced as
`mimi_encoderv2` from `pocket-tts==2.0.0` (torch 2.11.0, Float16, fixed
`[1, 1, 240000]` → `[1, 125, 1024]`). Both files now live at the HF repo
root (legacy file kept for backwards compat); `ModelNames.mimiEncoder`
points at the new one.

### RCA 2 — missing `bos_before_voice` prepend
`pocket-tts` 2.0.0 added a learned 1024-d `flow_lm.bos_before_voice`
buffer that has to be prepended to the audio_prompt during cond_step
prefill. Without it the FlowLM sees a different token distribution than
training. Extracted per-language as `constants_bin/bos_before_voice.bin`
(4096 bytes each, 10 packs × distinct SHA-256s, all verified
byte-for-byte against the HF upload).
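
The prepend step can be sketched as follows. This is a minimal illustration, not the shipped code: it assumes the 4096-byte file holds 1024 little-endian Float32 values (inferred from the size, not stated above), and `decodeFloat32LE` / `prependBos` are hypothetical helper names.

```swift
import Foundation

/// Decodes a raw buffer of little-endian Float32 values (assumed layout).
/// Returns nil when the byte count does not equal `expectedCount * 4`.
func decodeFloat32LE(_ data: Data, expectedCount: Int) -> [Float]? {
    guard data.count == expectedCount * 4 else { return nil }
    let bytes = [UInt8](data)
    var out = [Float]()
    out.reserveCapacity(expectedCount)
    for i in stride(from: 0, to: bytes.count, by: 4) {
        let b0 = UInt32(bytes[i])
        let b1 = UInt32(bytes[i + 1]) << 8
        let b2 = UInt32(bytes[i + 2]) << 16
        let b3 = UInt32(bytes[i + 3]) << 24
        out.append(Float(bitPattern: b0 | b1 | b2 | b3))
    }
    return out
}

/// Prepends the BOS embedding as one extra row ahead of the voice-prompt rows.
func prependBos(_ bos: [Float], to audioPrompt: [[Float]]) -> [[Float]] {
    return [bos] + audioPrompt
}
```

With a 95-row voice prompt this yields 96 rows, so 17 text tokens land the prefill at position 113, consistent with the test plan below.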

### Swift-side changes
- `PocketTtsVoiceCloner` pads/truncates input to the encoder's fixed
240,000 samples (10 s @ 24 kHz, non-flexible shape) and trims output frames
to real-audio duration so zero-padded frames don't bleed into the
prompt.
- `PocketTtsSynthesizer+KVCache.prefillKVCache` prepends
`bos_before_voice` ahead of the audio_prompt on the v1 path. v2
snapshots skip this — their pre-baked KV cache already encodes the
prefix.
- `PocketTtsResourceDownloader.ensureModels` backfills
`bos_before_voice.bin` for caches that predate this fix (per-file fetch)
instead of forcing a full language-pack re-download.
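
A rough sketch of the pad/truncate and frame-trim arithmetic described in the first bullet. The function names are hypothetical, and rounding the real frame count up is an assumption; the actual cloner may round differently.

```swift
let encoderSampleCount = 240_000  // 10 s at 24 kHz, fixed encoder input
let encoderFrameCount = 125       // encoder output frames for the full window

/// Zero-pads or truncates audio to the encoder's fixed input length.
func padOrTruncate(_ samples: [Float], to length: Int = encoderSampleCount) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}

/// Number of output frames that cover real (non-padded) audio.
/// Assumption: a partially filled frame counts as real, so we round up.
func realFrameCount(forSampleCount n: Int) -> Int {
    let samplesPerFrame = encoderSampleCount / encoderFrameCount  // 1920
    let clamped = min(max(n, 0), encoderSampleCount)
    return (clamped + samplesPerFrame - 1) / samplesPerFrame
}
```

Only the first `realFrameCount` rows of the `[1, 125, 1024]` output would then be kept for the prompt.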

Conversion artifacts and per-language SHA-256s documented in
`mobius/models/tts/pocket_tts/coreml/TRIALS.md` (Phase 7).

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter PocketTtsConstantsLoaderTests` — 3 new tests
pass
- [x] `swift format` applied
- [x] E2E v1 cloning: `am_michael.wav` (7.5 s) → 3.92 s @ 24 kHz Int16,
intelligible voice match. KV cache prefill lands at position 113 = 1 BOS
+ 95 voice + 17 text tokens (matches pocket-tts 2.0.0 layout).
- [x] v2 snapshot regression check: default `alba.safetensors` voice
still synthesizes correctly (prefill position 140, no `bos_before_voice`
involvement)
- [x] Backfill path: deleted `bos_before_voice.bin` from cache, re-ran
cloning — file auto-fetched from HF (4096 bytes) before synthesis
- [x] All 10 language packs verified on HF: SHA-256 match between local
extraction and uploaded `v2/<lang>/constants_bin/bos_before_voice.bin`
2026-05-12 08:55:44 -04:00
Benjamin Lee a0092cf163 Fixed LS-EEND Memory Leak + Updated Docs (#605)
1. LS-EEND had a memory leak since the autorelease pool was not
releasing the multiarrays properly and was allocating new ones every
chunk. Switched to backed output arrays to eliminate new allocations
2. LS-EEND docs were somewhat stale. Updated them to reflect the new API

---------
2026-05-12 08:53:59 -04:00
Alex fb8b779380 feat(tts/magpie): warmup API for cold-start mitigation (#60 Track 2) (#595) 2026-05-10 16:51:09 -04:00
Alex 2c45df3035 docs(tts): refresh Benchmarks.md per #590; wire styletts2 + --variant into tts-benchmark (#593)
## Summary

Closes the work tracked in #590: bring `Documentation/TTS/Benchmarks.md`
into agreement with what's actually shipped on `main` for CoreML TTS
backends, and add the two CLI affordances needed to benchmark the
in-scope backend × language matrix.

### Doc changes (`Documentation/TTS/Benchmarks.md`)

- Single consolidated **per-backend table** that merges basic info
(license, language+voice, footprint in **GB**, sample rate, max chunk
per pass, streaming flag) with performance metrics (TTFT p50/p95, synth
p50/p95, agg RTFx, peak RSS, WER %, CER %). Five rows: Kokoro ANE en
(`af_heart`), Kokoro ANE zh (`zf_001`), PocketTTS en (`alba` 6L), Magpie
en (`John`, batch-only on `main`), StyleTTS2 en (LibriTTS iteration_3,
zero-shot).
- Dropped from the top-line per scope decision: non-ANE Kokoro,
CosyVoice3 zh, PocketTTS 24L variants, Hindi/Cantonese rows. CosyVoice3
narrative sections (decode budget cap + auto-chunker validation) stay
verbatim.
- Refreshed Kokoro ANE per-stage breakdown (post-laishere 7-graph
chain).
- Replaced the old Magpie per-stage table with a pointer paragraph
(`MagpieSynthesisResult.timings` is still populated for callers; sub-1.5
s TTFA work referenced in #590 lives on `feat/magpie-lt-fusion`, not
`main`).
- Corrected PocketTTS footprint to `fp16 ~0.77 / int8 ~0.55 GB` (was
`~140 / ~520 MB`); enumerated all 10 packs in the corpus matrix; added
zh to the Kokoro ANE corpus row; added a StyleTTS2 row.

### CLI changes
(`Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift`)

- New `styletts2` / `style-tts2` backend wired to
`StyleTTS2Manager.synthesize(text:referenceAudioURL:)`. Requires
`--reference <wav>`; the shipped iteration_3 `ref_encoder` is fixed at
`[1, 1, 80, 231]`, so the reference must be exactly **2.875 s @ 24 kHz
mono** — the harness errors out at predict time on mismatched durations.
- New `--variant {english|mandarin}` flag for `kokoro-ane` so the
`zf_001` Mandarin voice pack can be benchmarked alongside `af_heart`.
Falls back to `english` when unset; the manager constructor now receives
the parsed `KokoroAneVariant` and the default voice is variant-aware.

### Methodology

100-phrase MiniMax-Multilingual on MacBook Air M2 (16 GB, macOS 26, on
AC), `--compute-units default`. English WER/CER via Parakeet TDT
roundtrip; Mandarin CER via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) — macro 4.01% / micro 4.14% across all 100
zh phrases. WER omitted for Mandarin because `WERCalculator` splits on
whitespace.
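
To illustrate why WER is not meaningful here and how macro and micro CER differ, a small self-contained sketch follows. It is not the `WERCalculator` or `whisper_zh_cer.py` implementation; all function names are hypothetical.

```swift
/// Levenshtein distance over arbitrary equatable elements.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    for i in 1...a.count {
        var cur = [i] + [Int](repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = cur
    }
    return prev[b.count]
}

/// Character error rate for one utterance.
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference)
    let hyp = Array(hypothesis)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}

/// Macro CER: mean of per-utterance CERs.
func macroCER(_ pairs: [(ref: String, hyp: String)]) -> Double {
    let rates = pairs.map { characterErrorRate(reference: $0.ref, hypothesis: $0.hyp) }
    return rates.reduce(0, +) / Double(rates.count)
}

/// Micro CER: total edits divided by total reference characters.
func microCER(_ pairs: [(ref: String, hyp: String)]) -> Double {
    let edits = pairs.reduce(0) { $0 + editDistance(Array($1.ref), Array($1.hyp)) }
    let chars = pairs.reduce(0) { $0 + $1.ref.count }
    return Double(edits) / Double(chars)
}

// Whitespace splitting sees an unspaced Mandarin sentence as a single "word".
let tokens = "你好世界".split(separator: " ")
```

Because `tokens` has one element, any single-character error would score as 100% WER, which is why only CER is reported for Mandarin.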

## Test plan

- [x] `swift build` clean on `main`-based branch.
- [x] `swift format lint --recursive --configuration .swift-format
Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift` clean.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend
styletts2 --reference ref.wav --corpus minimax-english --output-json
/tmp/styletts2-smoke.json` — produces a valid JSON report.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend
kokoro-ane --variant mandarin --voice zf_001 --corpus minimax-chinese
--skip-asr --output-json /tmp/kokoro-zh-smoke.json` — pulls the Mandarin
voice pack and produces audio.
- [x] Full 100-phrase runs for all five table rows produced under
`Benchmarks/tts/runs/590/` (gitignored); table numbers come straight
from those JSON reports.
- [ ] Reviewer cross-check: footnote markers (`*`, `‡`, `∥`, `¶`) in the
consolidated table all have matching paragraphs below.
2026-05-09 21:47:45 -04:00
panv-kw a400080380 Make SpeakerManager a struct and de-async DiarizerManager (#591)
### Why is this change needed?

`DiarizerManager.performCompleteDiarization` is `async`, even though no
asynchronous operations occur when running the models and processing the
results: this is plain, synchronous computation that never waits on the
network or anything similar. It is important to be able to integrate it
into other synchronous compute workflows.

The reason it had to be `async` until now is that the `SpeakerManager`
type containing the speaker database was a `class`, meaning that it was
shared mutable state. It was made an `actor` because this shared mutable
state could be mutated concurrently.

But really, there should not be concurrent mutations to the
`SpeakerManager` in the first place. The user of this type,
`DiarizerManager`, is not actually prepared for other code to be
modifying this database while it is using it, and anybody who tries
is almost certainly writing a bug because their code would be logically
racing with `DiarizerManager` and the results would be unpredictable.

This change makes `SpeakerManager` a struct. It has copy-on-write value
semantics because it wraps a `Dictionary` for its storage, and mutations
are marked by the `mutating` keyword and require exclusive ownership of
the variable -- again, just like `Dictionary`. The compiler statically
diagnoses attempts to concurrently mutate the `DiarizerManager`'s
speaker database, so the test for this can be removed (it no longer
compiles).
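
The value-semantics pattern described above can be sketched like this. `SpeakerDatabase` is a hypothetical stand-in, not the real `SpeakerManager` API.

```swift
/// A value type wrapping a Dictionary: copies are independent, and every
/// mutation goes through a `mutating` method that needs exclusive access.
struct SpeakerDatabase {
    private var embeddings: [String: [Float]] = [:]

    var count: Int { embeddings.count }

    mutating func upsert(id: String, embedding: [Float]) {
        embeddings[id] = embedding
    }

    func embedding(for id: String) -> [Float]? {
        embeddings[id]
    }
}

var original = SpeakerDatabase()
original.upsert(id: "speaker-1", embedding: [0.1, 0.2])

// Assigning creates a logical copy; storage is shared until one side mutates.
var snapshot = original
snapshot.upsert(id: "speaker-2", embedding: [0.3, 0.4])
```

After this runs, `original` still holds one speaker while `snapshot` holds two, so a caller cannot change a database that another component is reading.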

<img width="1108" height="386" alt="Screenshot 2026-05-09 at 19 10 26"
src="https://github.com/user-attachments/assets/04fc3395-7d46-42a8-b035-4d0b559cc8aa"
/>

In summary, this change significantly reduces the cognitive load of
using and maintaining this code, promotes correct usage through static
diagnostics rather than allowing unpredictable results through
concurrent mutation of the speaker database, and enables diarization to
be used in more contexts in more programs.

(BTW, `SpeakerManager` doesn't strictly _need_ to be `Sendable`, but the
previous one was by virtue of being an `actor`, so I marked this one as
being `Sendable` too in case anybody was relying on it. I don't think
the implementation of this type is going to change radically in the
future to the point where that might be a problem)
2026-05-09 17:14:07 -04:00
Alex 3ff5ae2d0c refactor(tts): async StyleTTS2 predict + drop non-native Magpie synthesizeStream (#589) 2026-05-09 12:54:07 -04:00
Alex ce59fb14b8 feat(tts): StyleTTS2 LibriTTS (iteration_3) CoreML backend (#588)
## Summary

Swift port of `mobius/models/tts/styletts2/coreml/inference.py` against
the `FluidInference/StyleTTS-2-coreml/iteration_3/compiled` mlmodelc
assets. New `StyleTTS2Manager` actor exposes the same public shape as
`MagpieTtsManager` / `KokoroSynthesizer`, plus a `--backend styletts2`
route in the CLI.

## Architecture

`StyleTTS2Manager` orchestrates four pieces:
1. `StyleTTS2ModelStore` — actor-managed lazy load of the 8 default
`.mlmodelc` stages plus 6 token-axis bucket variants (T = 64 / 128 / 256
fp16).
2. `StyleTTS2Phonemizer` — wraps shared `MultilingualG2PModel`
(CharsiuG2P) with an espeak-fallback note in the docs;
`synthesize(ipa:)` escape hatch preserves parity for callers that
already have espeak output.
3. `StyleTTS2MelExtractor` — vDSP FFT + 80-bin HTK mel filterbank with
the training-time `sample_rate=16000` quirk for the speaker-reference
path.
4. `StyleTTS2Synthesizer` — drives the 8-stage CoreML graph
(`text_encoder`, `bert`, `ref_encoder`, `fused_diffusion_sampler`,
`duration_predictor`, `fused_f0n_har_source`, `decoder_pre`,
`decoder_upsample`) and returns 24 kHz mono Float32 PCM.

Eager-glue ops (`StyleTTS2GlueOps`) bridge the stages on the CPU side:
sigmoid+round of duration logits, one-hot alignment matrix, BLAS
`cblas_sgemm` matmul, vDSP transpose, HiFi-GAN causal asr-shift, and the
alpha/beta style blend (`s_pred[:, 128:]` / `s_pred[:, :128]`).
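
Two of the glue ops, sketched under assumptions: summing the sigmoid over duration bins and clamping to at least one frame follow the upstream StyleTTS2 inference script rather than anything stated here, and the function names are hypothetical.

```swift
import Foundation

/// Per-token frame counts from duration logits (one row of bins per token).
/// Assumption: sum sigmoid over bins, round half to even, clamp to >= 1.
func predictedDurations(fromLogits logits: [[Float]]) -> [Int] {
    return logits.map { (bins: [Float]) -> Int in
        let total = bins.reduce(Float(0)) { $0 + 1 / (1 + exp(-$1)) }
        return max(1, Int(total.rounded(.toNearestOrEven)))
    }
}

/// One-hot alignment matrix of shape [tokens, totalFrames]:
/// row t is 1 over the contiguous frames assigned to token t.
func alignmentMatrix(durations: [Int]) -> [[Float]] {
    let totalFrames = durations.reduce(0, +)
    let emptyRow = [Float](repeating: 0, count: totalFrames)
    var matrix = [[Float]](repeating: emptyRow, count: durations.count)
    var frame = 0
    for (token, duration) in durations.enumerated() {
        for f in frame..<(frame + duration) {
            matrix[token][f] = 1
        }
        frame += duration
    }
    return matrix
}
```

Multiplying token-level features by this matrix (the `cblas_sgemm` step) expands them to frame rate.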

The fused diffusion sampler consumes pre-materialized noise —
`StyleTTS2DiffusionSchedule` provides the Karras sigma formula plus a
SplitMix64 + Box-Muller source so a fixed `noiseSeed` reproduces the
same audio.
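
For reference, a compact sketch of the three ingredients named above. It is not the `StyleTTS2DiffusionSchedule` source: the type and function names are hypothetical, and how the real code maps random bits to uniforms or pairs Box-Muller outputs is an assumption.

```swift
import Foundation

/// SplitMix64 generator (standard published constants).
struct SplitMix64 {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }

    /// Uniform value in (0, 1], which keeps log(u) finite.
    mutating func nextUnit() -> Double {
        return (Double(next() >> 11) + 1) / 9007199254740992.0  // 2^53
    }
}

/// Box-Muller transform: each pair of uniforms yields two standard normals.
func gaussianNoise(count: Int, seed: UInt64) -> [Float] {
    var rng = SplitMix64(seed: seed)
    var out = [Float]()
    out.reserveCapacity(count)
    while out.count < count {
        let u1 = rng.nextUnit()
        let u2 = rng.nextUnit()
        let r = (-2 * log(u1)).squareRoot()
        out.append(Float(r * cos(2 * Double.pi * u2)))
        if out.count < count {
            out.append(Float(r * sin(2 * Double.pi * u2)))
        }
    }
    return out
}

/// Karras et al. sigma schedule, descending from sigmaMax to sigmaMin.
func karrasSigmas(steps: Int, sigmaMin: Double, sigmaMax: Double, rho: Double) -> [Double] {
    precondition(steps >= 2, "need at least two steps")
    let a = pow(sigmaMax, 1 / rho)
    let b = pow(sigmaMin, 1 / rho)
    return (0..<steps).map { i in
        pow(a + Double(i) / Double(steps - 1) * (b - a), rho)
    }
}
```

The same seed always produces the same noise vector, which is what makes a fixed `noiseSeed` reproducible.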

## CLI

```
swift run fluidaudiocli tts "Hello from StyleTTS2." \
    --backend styletts2 \
    --reference path/to/speaker.wav \
    --output out.wav \
    --alpha 0.3 --beta 0.7 --seed 0
```

`--ipa` overrides the text path with a verbatim IPA string for espeak
parity.

## Test plan

- [x] `swift build` clean
- [x] `swift format lint` clean on touched files
- [x] `swift test --filter StyleTTS2` — 32 / 32 passing
- `StyleTTS2TextCleanerTests` — symbol vocab + encode round-trip +
drop-unknown
- `StyleTTS2GlueOpsTests` — duration rounding, alignment matrix, BLAS
matmul, transpose, HiFi-GAN shift, alpha/beta blend
- `StyleTTS2DiffusionScheduleTests` — Karras boundary conditions +
monotonicity, RNG determinism, Gaussian stats
- `StyleTTS2MultiArrayTests` — Float32 / Int32 round-trip,
`extractFloats` for double / int32 backings
- [ ] End-to-end smoke run via `swift run fluidaudiocli tts ...
--backend styletts2 --reference ...` against a downloaded
`iteration_3/compiled` asset bundle
2026-05-09 00:25:54 -04:00
Greg Young b3a725db3e Fix: Prevent Metal crash when targetTokens is 0 in Kokoro TTS (#586)
Adds a defensive guard against targetTokens == 0 reaching CoreML in the Kokoro TTS pipeline. A zero-length input_ids tensor causes the Metal backend to dispatch compute shaders with threadgroupsPerGrid.width(0), which is an uncatchable assertion failure:

-[MTLDebugComputeCommandEncoder dispatchThreadgroups:threadsPerThreadgroup:]:1377:  failed assertion `(threadgroupsPerGrid.width(0) * ...) must not be 0.'

Changes
1. KokoroSynthesizer.swift — synthesizeChunk() now throws a descriptive TTSError.processingFailed when targetTokens == 0, before any MLMultiArray allocation or model prediction. This converts an uncatchable Metal assertion into a recoverable Swift error.
2. KokoroModelCache.swift — Cached token lengths are clamped with max(1, inferTokenLength(...)) at all 3 caching sites (loadModelsIfNeeded, tokenLength(for:), registerPreloadedModels). Defense-in-depth: although inferTokenLength() already returns a positive value or falls back to 124, this guarantees the cache invariant is locally enforced regardless of future changes to the inference helper.
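
The two guards follow a simple pattern, sketched here with hypothetical names (`SynthesisError`, `validatedTokenCount`, `cachedTokenLength`) rather than the real `TTSError` and cache code:

```swift
enum SynthesisError: Error, Equatable {
    case processingFailed(String)
}

/// Rejects a zero (or negative) token count before any tensor is allocated,
/// turning what would be a Metal assertion into a catchable Swift error.
func validatedTokenCount(_ targetTokens: Int) throws -> Int {
    guard targetTokens > 0 else {
        throw SynthesisError.processingFailed("targetTokens must be greater than 0")
    }
    return targetTokens
}

/// Defense-in-depth clamp applied when a token length is cached.
func cachedTokenLength(inferred: Int) -> Int {
    return max(1, inferred)
}
```

Callers can then handle the failure with an ordinary `do`/`catch` instead of crashing.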


Testing
- Manual: confirmed synthesizeChunk now throws TTSError.processingFailed instead of trapping when a 0 token length is forced.
2026-05-08 17:56:13 -04:00
local 024bd8e454 chore(tts): remove StyleTTS2 backend, models, and references 2026-05-07 13:32:16 -04:00
Prakash Joshi Pax a53aff438b fix(tts): guard direct Float16 reads with #if arch(arm64) (CosyVoice3, StyleTTS2) (#582)
## Summary

`Float16` is an arm64-only Swift built-in, so any direct `Float16`
typing fails to compile in the x86_64 slice of a Universal build. Four
sites in CosyVoice3 and StyleTTS2 do raw `Float16` pointer binds with no
arch guard, which currently breaks Universal archive builds with errors
like:

```
'Float16' is unavailable in macOS
No exact matches in call to initializer
Failed to produce diagnostic for expression; please submit a bug report
```

Affected sites:
-
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift:65`
— `assumingMemoryBound(to: Float16.self)` for the fp16 safetensors
lookup table.
-
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift:554`
— fp16 branch of the Flow→HiFT mel copy in `runHiFT`.
-
`Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift:288`
— fp16 case in `sliceFirstAxis2D`.
-
`Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift:541`
— fp16 case in `readMLMultiArrayPrefix`.

This PR mirrors the package's existing pattern for fp16 reads on
non-arm64 (`ASR/Qwen3/Qwen3AsrModels.swift:310`,
`Diarizer/Sortformer/SortformerModelInference.swift:278`): wrap each
`Float16`-touching arm in `#if arch(arm64) ... #endif`.
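
A minimal sketch of the guard pattern. The arm64 branch mirrors what this PR does; the `#else` branch is a hypothetical portable alternative (manual bit decoding) that is not part of this PR, which instead throws or falls through to the existing `default:` arm. Both function names are made up for illustration.

```swift
/// Decodes one IEEE 754 half-precision bit pattern without using `Float16`.
func halfBitsToFloat(_ h: UInt16) -> Float {
    let sign = UInt32(h & 0x8000) << 16
    let exponent = UInt32(h & 0x7C00) >> 10
    let mantissa = UInt32(h & 0x03FF)
    if exponent == 0 {
        if mantissa == 0 { return Float(bitPattern: sign) }  // signed zero
        // Subnormal half: value = mantissa * 2^-24
        let magnitude = Float(mantissa) * Float(sign: .plus, exponent: -24, significand: 1)
        return sign != 0 ? -magnitude : magnitude
    }
    if exponent == 0x1F {  // infinity or NaN
        return Float(bitPattern: sign | 0x7F80_0000 | (mantissa << 13))
    }
    // Normal number: rebias exponent from 15 to 127.
    return Float(bitPattern: sign | ((exponent + 112) << 23) | (mantissa << 13))
}

/// Reads `count` fp16 values from raw memory as Float.
func readHalfs(_ raw: UnsafeRawPointer, count: Int) -> [Float] {
    #if arch(arm64)
    // Direct Float16 typing only compiles in the arm64 slice.
    let typed = raw.assumingMemoryBound(to: Float16.self)
    return (0..<count).map { Float(typed[$0]) }
    #else
    // Hypothetical fallback: treat the storage as raw UInt16 bit patterns.
    let typed = raw.assumingMemoryBound(to: UInt16.self)
    return (0..<count).map { halfBitsToFloat(typed[$0]) }
    #endif
}
```

Either branch keeps `Float16` out of the x86_64 slice, which is the property the Universal build needs.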

## Behavior on x86_64

- **CosyVoice3SpeechEmbeddings** — throws
`CosyVoice3Error.predictionFailed("requires Apple Silicon (arm64); fp16
lookup table cannot be read on x86_64")`. The `speech_embedding`
safetensors table is fp16-only on disk with no fp32 alternative, so this
matches the Qwen3-ASR posture (its embedding table is also fp16-only on
disk; it `fatalError`s on x86_64).
- **CosyVoice3Synthesizer.runHiFT** — the `case .float16:` arm is
omitted on x86_64. The `case .float32:` path is unchanged. If a Flow
variant emits fp16 at runtime on Intel, control falls into the existing
`default:` arm, which already throws `"runHiFT: unexpected Flow mel
dtype …"`.
- **StyleTTS2Synthesizer (`sliceFirstAxis2D`,
`readMLMultiArrayPrefix`)** — the `case .float16:` arm is omitted on
x86_64. fp16 arrays fall through to the existing NSNumber-bridged
`default:` arm (`arr[i].floatValue` / `fill { arr[$0].floatValue }`),
which already converts fp16 correctly. Slightly slower on Intel; no
behavior regression.

No new error types or dependencies. Diff is +13 lines across 3 files.

## Why this approach (vs. vImage byte-level conversion)

FluidAudio already uses two patterns for cross-arch fp16 handling:
- **`#if arch(arm64)` guard** — used for fp16 *reads* in
`Qwen3AsrModels.swift` and `SortformerModelInference.swift`.
- **vImage `Planar16FtoPlanarF` / `PlanarFtoPlanar16F`** — used for fp16
*writes* in `KokoroAneSynthesizer+Conversion.swift`, `TtsModels.swift`,
`KokoroSynthesizer.swift`, with the explicit comment in
`TtsModels.swift:182`: *"This avoids direct Float16 usage which isn't
available in all build configurations"*.

This PR matches the existing fp16-read precedent (Pattern A). A
follow-up could port these read paths to vImage for full Intel runtime
support (Pattern B), but that's a larger change and would need testing
on Intel hardware. The minimal goal here is unblocking the Universal
compile.

## Test plan

- [x] Universal archive build of a downstream macOS app that links
FluidAudio as a local SPM package now succeeds (failed prior to this
patch with the errors above).
- [ ] CI lint / build on the package itself.
- [ ] No CosyVoice3 / StyleTTS2 runtime regression on Apple Silicon (the
arm64 path is byte-identical to before).
2026-05-05 09:27:27 -04:00
Alex 284ce520f9 feat(tts/magpie): nanocodec v4 (fp32 + int8 palettize) precision (#581)
## Summary

Add `MagpieNanocodecPrecision.fp32Pal` selecting `nanocodec_decoder_v4`:
v3's fp32 architecture with 8-bit kmeans-palettized weights.
Acoustically transparent vs v3 at ~4× smaller on disk and ~11% lower
peak RSS. Same recipe Kokoro Noise uses for `fp32 + int8pal`.

Compute units track precision: `.fp32Pal` pins `.cpuOnly` (palettized
weights dequantize to fp32 at runtime; ANE refuses fp32 / GPU is 50%+
slower than CPU on fp32 codec).

## Bench (M2, 16 GB, .cpuOnly, T_in=24, 5 warmup + 50 timed iters)

| metric              | v3 fp32 | v4 fp32+int8pal | delta  |
|---------------------|---------|-----------------|--------|
| mlpackage on disk   | 121.0MB |          30.9MB |   -74% |
| post-load RSS delta | +59.9MB |         +61.7MB |    eq. |
| peak RSS            | 700.8MB |         621.8MB |   -11% |
| latency median      | 117.6ms |         117.1ms |    eq. |
| latency p95         | 145.9ms |         123.6ms |   -15% |
| RTFx (codec)        |   9.48× |           9.52× |    eq. |
| SNR vs v3 (AR codes)|    inf. |          33.6dB | clean* |

*User-confirmed acoustically transparent on AR-emitted speech.

## Fallback chain

Each candidate carries its own config so the fallback doesn't inherit
the primary's compute-unit selection. fp16 (v2) is only reached when
explicitly requested or when no other candidate is present, since it's
audibly noisy on voiced speech:

| Requested      | Order                    |
|----------------|--------------------------|
| `.fp32Pal`     | v4 → v3 → v2             |
| `.fp32`        | v3 → v4 → v2             |
| `.fp16`        | v2 → v4 → v3             |

If every chunked artifact is missing the loader falls through to legacy
monolithic v1 with `.cpuOnly` (audibly noisy).
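
The table translates into a small lookup. The sketch below uses hypothetical types (`NanocodecPrecision`, `NanocodecBuild`) and ignores the per-candidate compute-unit config; it only shows the ordering and the v1 fallback.

```swift
enum NanocodecPrecision { case fp32Pal, fp32, fp16 }
enum NanocodecBuild: String { case v1, v2, v3, v4 }

/// Candidate order per requested precision, taken from the table above.
func candidateOrder(for precision: NanocodecPrecision) -> [NanocodecBuild] {
    switch precision {
    case .fp32Pal: return [.v4, .v3, .v2]
    case .fp32: return [.v3, .v4, .v2]
    case .fp16: return [.v2, .v4, .v3]
    }
}

/// Picks the first available chunked build, else the legacy monolithic v1.
func resolveBuild(
    requested: NanocodecPrecision,
    available: Set<NanocodecBuild>
) -> NanocodecBuild {
    for candidate in candidateOrder(for: requested) where available.contains(candidate) {
        return candidate
    }
    return .v1
}
```

For example, requesting `.fp16` when only v3 and v4 are cached resolves to v4, so the noisy fp16 build is never picked up by accident.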

## HF artifacts

Already uploaded to
`FluidInference/magpie-tts-multilingual-357m-coreml`:
- `nanocodec_decoder_v4.mlmodelc/`
- `nanocodec_decoder_v4.mlpackage/`

## Companion PR

mobius converter: https://github.com/FluidInference/mobius/pull/54

## Test plan
- [x] `swift build` green
- [x] `swift test --filter MagpieConstantsTests` 5/5 pass
- [x] `swift format lint` clean for changed files
- [ ] End-to-end `MagpieTtsManager` synth with `.fp32Pal` once HF
artifacts propagate to user caches
2026-05-04 23:22:34 -04:00
Alex 8389c1b714 feat(tts/magpie): nanocodec v1/v2/v3 + decoder_step ANE pin + dual-precision API (#580)
## Summary

Companion to mobius PR FluidInference/mobius#53. Wires the new nanocodec
v2/v3 builds into the FluidAudio Magpie runtime, plus pins
`decoder_step` to ANE for ~2× wall speedup.

## Commits

- `5879a32b3` — `fix(tts/magpie): pin decoder_step to ANE for ~2x
speedup + correct EOS`
- `decoder_step.mlmodelc` was running CPU+GPU. Pinning to
`.cpuAndNeuralEngine` halves wall on M2.
  - EOS handling: don't emit the post-EOS frame.
- `ec7051504` — `feat(tts/magpie): chunked T=24 fp32 nanocodec +
edge-pad (Phase C v2)`
- Slide a 24-frame window with stride 8, overlap 16 (= dilated-conv
input receptive field).
- Edge-replicate context at sequence boundaries instead of zero-padding
(zero-pad produces a sharp pop in the first ~30 ms).
- `2f0aab7a7` — `feat(tts/magpie): dual fp16/fp32 nanocodec t24 builds
via MagpieNanocodecPrecision`
  - New `MagpieNanocodecPrecision` enum (`.fp16` / `.fp32`).
- Compute-unit dispatch: fp32 → `.cpuOnly` (ANE is fp16-only); fp16 →
`.cpuAndNeuralEngine` unless caller pinned CPU.
- Plumbed through `MagpieModelStore.init` and `MagpieTtsManager.init` /
`downloadAndCreate`.
- `4bd31469f` — `refactor(tts/magpie): nanocodec v1/v2/v3 versioning
(drop t24 prefix)`
- Final naming: v1 = legacy mono, v2 = chunked fp16, v3 = chunked fp32.
- `requiredModels` now lists `nanocodecDecoderV3File` so legacy v1-only
users auto-upgrade on next bulk fetch.
- Load chain: primary (precision-matched) → secondary (cross-precision
warning) → legacy v1 fallback.
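
The windowing from `ec7051504` can be sketched as index arithmetic. This is an illustration only: splitting the 16 overlap frames into 8 of left context and 8 of right context is an assumption, and `edgePaddedWindows` is a hypothetical name.

```swift
/// Frame indices for each 24-frame decoder window, advancing by `stride`.
/// Out-of-range indices are clamped, which replicates the edge frame
/// instead of feeding zeros at the sequence boundaries.
func edgePaddedWindows(frameCount: Int, window: Int = 24, stride: Int = 8) -> [[Int]] {
    guard frameCount > 0 else { return [] }
    let context = (window - stride) / 2  // assumed symmetric context
    var windows: [[Int]] = []
    var start = 0
    while start < frameCount {
        let first = start - context
        let indices = (first..<(first + window)).map { min(max($0, 0), frameCount - 1) }
        windows.append(indices)
        start += stride
    }
    return windows
}
```

For a 20-frame sequence this produces three windows; the first begins with frame 0 repeated for the left context, and the last ends with frame 19 repeated.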

## Production state

| Build | File | Precision | Shape | Selector | Audio |
|---|---|---|---|---|---|
| v1 | `nanocodec_decoder.mlmodelc` | fp16 | T=256 monolithic | legacy
fallback | noisy + slow |
| v2 | `nanocodec_decoder_v2.mlmodelc` | fp16 | T_in=24 chunked |
`MagpieNanocodecPrecision.fp16` | noisy / fast |
| v3 | `nanocodec_decoder_v3.mlmodelc` | fp32 | T_in=24 chunked |
`MagpieNanocodecPrecision.fp32` (default) | clean |

All three live on `FluidInference/magpie-tts-multilingual-357m-coreml`.

## Background

Phase F mixed-precision sweep (mobius#53) confirmed no fp16 op/location
combination recovers cleanliness — production stays on v3 (fp32) with v2
as opt-in for throughput-bound callers willing to accept the 27 dB SNR
floor.

## Test plan

- [x] `swift format` clean
- [x] `swift build` clean
- [ ] Sanity-check `swift test --filter MagpieTtsTests` (if present)
- [ ] Spot-check synthesis via CLI on default speaker
2026-05-04 22:35:22 -04:00
Alex bdbff4d88a feat(tts/kokoro-ane/zh): consolidated Mandarin G2P (erhua + jieba HMM + g2pW) (#572 items 1, 3, 4) (#579)
## Summary

Consolidates PRs #574, #575, and #576 into a single landing for Mandarin
G2P enhancements per [issue
#572](https://github.com/FluidInference/FluidAudio/issues/572). All
three features are non-overlapping and stack cleanly inside
`MandarinG2P.phonemize`:

- **Item 3 — Erhua merging** (was #574): folds trailing `儿` into the
previous syllable so `小孩儿` emits a single r-coloured token instead of a
stray `er` tail.
- **Item 4 — Jieba HMM tail** (was #575): re-segments OOV runs of
single-char fallbacks via a 4-state B/M/E/S Viterbi to recover
proper-noun boundaries (`特朗普`, `比特币`); recovered words are then retried
against the phrase dict before per-char fallback.
- **Item 1 — g2pW polyphone disambiguation** (was #576): int8 BERT-base
classifier (152 MB CoreML) picks the right reading for polyphonic Hanzi
(`行`/`长`/`重`/`朝`/…) using the full sentence as context. Best-effort:
falls back to dict-only when assets are missing.

Item 2 (number normalization) already merged via #573. Items 5 (POS
sandhi, #577) and 6 (custom lexicon, #578) remain as separate PRs.

## Pipeline order

```
text
  → MandarinNumberNormalizer.normalize        (already on main)
  → normalizeText (punctuation)
  → segment(): FMM phrases + jieba HMM tail   (NEW: item 4)
  → polyphone disambiguation via g2pW         (NEW: item 1)
  → diacritic → digit (MandarinPinyinNormalizer)
  → MandarinErhua.merge                       (NEW: item 3)
  → MandarinToneSandhi.apply
  → MandarinBopomofoMap.encode
```

## API changes

- `MandarinG2P.phonemize` is now `async throws` (g2pW disambiguation
requires async). Existing callers must add `try await`.
- `MandarinG2P.init(dict:, jiebaHmm:, g2pw:)` — both new parameters are
optional, default `nil` keeps baseline behaviour.
- New `Segment.bopomofoOverride(String)` case carries g2pW's pre-encoded
bopomofo + tone digit; bypasses sandhi.

## Asset requirements

Pulled from `huggingface.co/FluidInference/kokoro-82m-coreml`:

- `ANE-zh/g2pw/g2pw.mlmodelc/` — bulk `ensureModels` (added to
`requiredModelsZh`)
- `ANE-zh/g2pw/vocab.txt` + `POLYPHONIC_CHARS.txt` —
`ensureMandarinG2pw` lazy fetch
- `ANE-zh/assets/jieba_hmm_{start,trans,emit}.bin` —
`ensureMandarinJiebaHmm` lazy fetch

All three optional asset groups degrade gracefully: missing g2pW falls
back to dict-first reading, missing jieba HMM falls back to per-char
singles.

## Test plan

- [x] `swift build` clean
- [x] 102 tests pass across `MandarinG2PTests`, `MandarinErhuaTests`,
`MandarinJiebaHmmTests`, `MandarinPolyphoneCatalogTests`,
`MandarinBertTokenizerTests`, `MandarinNumberNormalizerTests`
- [x] Polyphone target tracking through jieba HMM resegmentation:
`flushHanziRun` carries absolute char positions so g2pW sees the right
context window
- [x] Backward-compat: `MandarinG2P(dict:)` (no jieba, no g2pW) still
passes baseline tests

## Closes

- #574 (erhua)
- #575 (jieba HMM)
- #576 (g2pW)

Refs #572.
2026-05-04 01:01:39 -04:00
Alex 684ceaf42b feat(tts/kokoro-ane/zh): POS-aware tone sandhi (#572 item 5) (#577)
## Summary

Issue #572 item 5. The baseline `MandarinToneSandhi` rules are
POS-independent and audibly misfire on three contexts:

- **一 ordinals** (`第一 dì-yī`, `一月 yī-yuè`, `一号`) keep
  tone 1; baseline promotes them to 2/4 unconditionally
- **不 reduplication** (`要不要`, `好不好`, `行不行`) keeps
  `不` at tone 4 inside `[X, 不, X]`; baseline misfires with
  bu4+tone4 → bu2
- **3+3 chains** apply within prosodic words; cross-word 3+3 only
  promotes the word-final syllable. Baseline's pure-run rule
  cascades too far left (`我也想去` → wrong `2 2 2 3` instead of
  correct `2 2 3 4`)
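
As a small illustration of the 不 reduplication carve-out only, here is a sketch over a hypothetical `Syllable` type. It does not use word ranges or POS tags and is not the `MandarinToneSandhiPOS` implementation.

```swift
struct Syllable: Equatable {
    let hanzi: Character
    var tone: Int
}

/// 不 sandhi with the reduplication carve-out:
/// inside [X, 不, X] keep tone 4; otherwise bu4 before a tone-4 syllable becomes bu2.
func applyBuSandhi(_ input: [Syllable]) -> [Syllable] {
    var output = input
    for i in input.indices where input[i].hanzi == "不" {
        guard i + 1 < input.count else { continue }
        let isReduplication = i > 0 && input[i - 1].hanzi == input[i + 1].hanzi
        if isReduplication { continue }  // tone stays as given (4)
        if input[i + 1].tone == 4 {
            output[i].tone = 2
        }
    }
    return output
}
```

With this rule `要不要` keeps tones 4-4-4, while `不要` still becomes 2-4.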

## Design

`MandarinToneSandhiPOS.apply(_:words:tags:)` — pure function, takes
the syllable buffer plus pre-computed word ranges + jieba POS tags.
Backward-compat path stays on `MandarinToneSandhi.apply` for
callers without a POS tagger (existing behavior preserved).

## Test plan

- [x] 一 ordinal carve-outs (`第一`, `一月`)
- [x] 一 contextual sandhi still fires in non-numeral words
      (`一定`, `一起`)
- [x] 不 reduplication keeps tone 4 (`要不要`)
- [x] 不 promotion still fires for non-reduplication (`不要`)
- [x] In-word 3+3 run promotes all but last
- [x] Cross-word 3+3 only promotes the boundary
- [x] Cross-word chain stops at non-3 (`我是你的`)
- [x] Backward-compat for single-word ranges
- [x] `swift build` + `swift format lint` clean
- 14 unit tests, all passing

## Out of scope (follow-up)

- **MandarinG2P routing** to `MandarinToneSandhiPOS` lands once
  PR #575 (jieba HMM + POS tagger tables) merges and the POS tagger
  is loaded by `KokoroAneModelStore.mandarinG2PPipeline`. Until
  then this module is testable in isolation via synthetic POS input.

## Depends on

- #575 — for the POS tagger Viterbi + tables that produce the
  `words`/`tags` arrays at runtime
2026-05-04 00:39:50 -04:00
Alex f202200d1f feat(tts/kokoro-ane): user-supplied Mandarin custom lexicon (#572 item 6) (#578)
## Summary

Issue #572 item 6. Lets app developers ship a project-specific
Mandarin lexicon that overrides both the bundled phrase dict and
g2pW. Useful for proper nouns the bundled dict doesn't cover
(brand names, technical jargon, regionalisms) and for cases where
the user knows the correct reading and wants to bypass any
heuristic.

## Test plan

- [x] Custom lexicon entry overrides phrase dict
- [x] Custom lexicon entry overrides single-char dict
- [x] Empty lexicon = no-op (baseline preserved)
- [x] `swift build` + `swift format lint` clean

## Independent

This PR is independent of #573–#577. Land in any order.
2026-05-04 00:39:37 -04:00
Alex 0ea7c900b0 feat(tts/kokoro-ane/zh): number/date/currency verbalization (#572 item 2) (#573)
## Summary

Issue #572 item 2. Pre-pass that verbalizes numerics, dates, times,
percentages, fractions, and currencies into Hanzi *before*
`MandarinG2P` segments the text. Without it, conversational input
like `¥120`, `2025年5月3日`, `8:30`, `99%` either fragments
into per-digit literals or gets dropped entirely by the segmenter.

- Port misaki/zh/num.py rules: cardinals up to 兆 (10¹²), decimal
  point form, percentages with 百分之, fractions (二分之一), money
  (¥/$/€/¥), dates (YYYY年MM月DD日, YYYY/MM/DD), times (HH:MM[:SS])
- Hook in `MandarinG2P.phonemize` before punctuation normalization
- Pure function, no new public API surface on `KokoroAneManager`
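
To give a feel for the cardinal and percentage rules, here is a deliberately small sketch limited to 0...9999. It is not the ported `misaki/zh/num.py` logic; the function names are hypothetical, and conventions such as dropping the leading 一 for 10 to 19 are assumptions about the expected output.

```swift
/// Verbalizes an integer in 0...9999 as Hanzi (illustrative subset only).
func verbalizeCardinal(_ n: Int) -> String {
    precondition((0...9999).contains(n), "sketch covers 0...9999 only")
    if n == 0 { return "零" }
    let digits: [String] = ["零", "一", "二", "三", "四", "五", "六", "七", "八", "九"]
    let units: [String] = ["", "十", "百", "千"]
    let places = String(n).compactMap { $0.wholeNumberValue }
    var result = ""
    var pendingZero = false
    for (i, digit) in places.enumerated() {
        if digit == 0 {
            pendingZero = true  // emit a single 零 only if a nonzero digit follows
            continue
        }
        if pendingZero {
            result += "零"
            pendingZero = false
        }
        result += digits[digit] + units[places.count - 1 - i]
    }
    // Assumed convention: 10...19 read as 十, 十一, ... rather than 一十.
    if (10...19).contains(n) { result.removeFirst() }
    return result
}

/// Percentages put 百分之 in front of the number.
func verbalizePercent(_ n: Int) -> String {
    return "百分之" + verbalizeCardinal(n)
}
```

Under these rules `99%` becomes `百分之九十九` and `120` becomes `一百二十`, so the segmenter only ever sees Hanzi.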

## Test plan

- [x] `MandarinNumberNormalizerTests` covers cardinals, decimals,
      percentages, fractions, money, dates, times
- [x] `MandarinG2PTests` baseline regression (no behavior change on
      pure-Hanzi input)
- [x] `swift build` + `swift format lint` clean
2026-05-04 00:36:31 -04:00
Benjamin Lee e4ce919762 Finalized DiarizerTimeline segment updates no longer commit tentative segments (#568)
There was a bug that would cause the trailing diarizer segment to
disappear if minFramesOff was nonzero once the person stopped talking.

---------
2026-05-04 00:18:18 -04:00
Alex 98acce358a feat(tts/kokoro-ane): add Mandarin (v1.1-zh) variant (#570)
## Summary

Phase 1 — variant plumbing + phonemes-bypass synthesis for
Kokoro-82M-v1.1-zh on the existing 7-stage CoreML chain. Callers that
supply pre-computed Bopomofo (e.g. via misaki[zh] in Python or a future
Swift G2P) can now synthesize Mandarin audio. Mandarin text-to-Bopomofo
G2P is deferred to a separate Phase 2 PR.

The 7-stage chain is **language-agnostic by construction** — input ids,
voice slices, and per-stage I/O contracts are identical across v1.0
(English) and v1.1-zh (Mandarin). Only the embedding vocab (177 → 171),
the HF subdir (`ANE/` → `ANE-zh/`), the voice-file layout (flat →
`voices/<voice>.bin`), and the default voice (`af_heart` → `zf_001`)
differ.

## Changes

- New `Repo.kokoroAneZh` → `FluidInference/kokoro-82m-coreml/ANE-zh` with
  `subPath = ANE-zh`, `folderName = kokoro-82m-coreml/ANE-zh`.
- `ModelNames.KokoroAne.requiredModelsZh` references `voices/zf_001.bin`
so the downloader's all-files-present check resolves correctly when the
  file lands at `<repoDir>/voices/zf_001.bin`.
- New `KokoroAneVariant` enum (`.english` / `.mandarin`) with
  `defaultVoice`, `useVoicesSubdir`, and `repo` accessors.
- `KokoroAneResourceDownloader.ensureModels` and `ensureVoicePack` accept a
  `variant` param (default `.english` keeps existing callers
  source-compatible). Mandarin voice fetch creates the `voices/` parent
  directory on demand.
- `KokoroAneModelStore` and `KokoroAneManager` thread the variant through
  to download + load.
- `KokoroAneManager.synthesize(text:)` and `synthesizeDetailed(text:)`
  reject Mandarin with a clear error directing callers to
  `synthesizeFromPhonemes()`. The phonemes-bypass entry point already
  works for any vocab via `vocab.encode → 7-stage chain`.
- CLI `--variant` flag accepts `en` / `english` / `zh` / `mandarin` for
  the `kokoro-ane` backend. Mandarin runs treat the input text as
  pre-computed Bopomofo and call `synthesizeFromPhonemesDetailed`.
- 12 new unit tests (`KokoroAneVariantTests`): variant defaults, repo
  wiring, required-files set routing, manager init signatures, and
  Mandarin text-path rejection on both `synthesize` and
  `synthesizeDetailed`.

End-to-end Mandarin synthesis verified against PyTorch ground truth on
`zf_001` and `zm_009`. Background-noise investigation tracked separately
in #569 (atan2 phase correction in upstream `CoreMLForwardSTFT`).

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAneVariantTests` — 12/12 pass
- [x] `swift format lint` clean (only pre-existing warnings on
      `fastV2_1`/`balancedV2_1`/`highContextV2_1` enum cases unrelated to
      this PR)
- [ ] After HF upload of `ANE-zh/` bundle, end-to-end smoke test:
      `swift run fluidaudiocli tts "ㄋㄧˇㄏㄠˇㄕˋㄐㄧㄝˋ。" --backend kokoro-ane --variant zh --voice zf_001 --output /tmp/zh.wav`
- [ ] No regressions on existing English path (default-arg behavior
      preserved)

## Out of scope

- Mandarin text-to-Bopomofo G2P — Phase 2 (separate PR).
- HF upload of `ANE-zh/` bundle — handled outside this repo.
- Updating `Documentation/` with Mandarin voice list — defer to Phase 2
  when the path is fully usable end-to-end.
2026-05-03 22:03:27 -04:00
Benjamin Lee 821e0f97bc Fixed an LS-EEND constructor (#567)
The asynchronous `LSEENDDiarizer` constructor that loads the model as
part of construction did not update the timeline config's speaker count
or frame duration, unlike the two-step path:
```swift
diarizer = LSEENDDiarizer()
await diarizer.initialize(variant: .dihard3, stepSize: .step500ms)
```

---------
2026-05-02 19:24:54 -04:00
Benjamin Lee 0a9aace382 Fixed short segment filter for trailing tentative segments in DiarizerTimeline (#566)
Apparently I did it incorrectly last time.

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-02 10:11:32 -07:00
Benjamin Lee 5bb84bc0b0 Fix DiarizerTimeline Short Segment Filter (#565)
The `DiarizerTimeline` was incorrectly closing short gaps as soon as
another speech frame appeared, instead of waiting for a sufficiently
long speech segment to merge the old one with.

This bug fix ensures that gaps are only closed between two segments of
sufficient length (at least `config.minFramesOn` frames long).
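
The rule can be sketched offline as below. This is a simplified batch version under stated assumptions, not the streaming `DiarizerTimeline` logic: `Segment`, `closeShortGaps`, and the `maxGapFrames` threshold are hypothetical names, and only `minFramesOn` comes from the description above.

```swift
// Hypothetical offline sketch: a gap is closed only when both neighbours are long enough.
struct Segment: Equatable {
    var start: Int  // first frame, inclusive
    var end: Int    // last frame, exclusive
    var length: Int { end - start }
}

func closeShortGaps(_ segments: [Segment], minFramesOn: Int, maxGapFrames: Int) -> [Segment] {
    var merged: [Segment] = []
    for segment in segments {
        if let last = merged.last,
            segment.start - last.end <= maxGapFrames,
            last.length >= minFramesOn,
            segment.length >= minFramesOn
        {
            // Both sides are real speech segments, so the short gap is absorbed.
            merged[merged.count - 1].end = segment.end
        } else {
            // A stray short blip must not close the gap on its own.
            merged.append(segment)
        }
    }
    return merged
}
```

With `minFramesOn = 5`, a one-frame blip after a gap leaves the earlier segment untouched, whereas a long follow-up segment merges with it.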

Also removed an unnecessary `throws` from a non-throwing `LSEENDDiarizer`
constructor.

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-01 23:25:58 -04:00
Alex cad8a2b563 feat(asr/cohere): long-form transcribeLong + cold/warm docs (#564)
## Summary

Two related changes that grew out of the cohere isolated-bench
investigation:

### 1. Long-form Cohere ASR — `CoherePipeline.transcribeLong`

The encoder is fixed at a 35 s window, so the prior `transcribe()`
silently truncated longer audio via `padOrTruncate(... fixedFrames:
3_500)` in `Sources/FluidAudio/ASR/Cohere/CoherePipeline.swift:250`.

`transcribeLong` slices audio into 35 s chunks with **5 s overlap**
(matches upstream `cohere-pytorch/config.json` `overlap_chunk_second:
5`) and stitches adjacent chunks via **token-level
longest-common-substring merge**. No model changes — encoder shape stays
`[1, 128, 3500]`, decoder cache shape unchanged.

- Audio ≤ 35 s short-circuits to the existing single-chunk
`transcribe()` path → byte-identical short-form behavior, zero perf
delta on FLEURS / LibriSpeech (which are all ≤ 35 s)
- Audio > 35 s: hop = 30 s, decode each chunk independently, merge token
streams (drop the suffix's matched head, keep prefix as-is)
- LCS window bounded to 32 tokens per seam → O(K²) merge is negligible
vs. decode
- Per-chunk encoder/decoder/total seconds are summed into one
`TranscriptionResult`
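
A minimal sketch of the two pieces above, chunk scheduling and the bounded token merge, is shown here. It is an illustration under assumptions rather than the `CoherePipeline` implementation: the function names, the `minMatch` threshold, and the tie-breaking between equally long runs are hypothetical.

```swift
/// Start times (seconds) of 35 s chunks with 5 s overlap, i.e. a 30 s hop.
func chunkStarts(duration: Double, chunk: Double = 35, overlap: Double = 5) -> [Double] {
    guard duration > chunk else { return [0] }  // short audio stays single-chunk
    let hop = chunk - overlap
    var starts: [Double] = []
    var start = 0.0
    while start < duration {
        starts.append(start)
        if start + chunk >= duration { break }  // this chunk already reaches the end
        start += hop
    }
    return starts
}

/// Merges two token streams by finding the longest common run between the
/// tail of `previous` and the head of `next`, both bounded to `window` tokens.
/// `minMatch` is an assumed threshold below which the streams are just concatenated.
func mergeTokenStreams(_ previous: [Int], _ next: [Int], window: Int = 32, minMatch: Int = 2) -> [Int] {
    let tail = Array(previous.suffix(window))
    let head = Array(next.prefix(window))
    guard !tail.isEmpty, !head.isEmpty else { return previous + next }
    // lengths[i][j] = length of the common run ending at tail[i-1] and head[j-1].
    var lengths = Array(repeating: Array(repeating: 0, count: head.count + 1), count: tail.count + 1)
    var best = 0
    var bestEndInHead = 0
    for i in 1...tail.count {
        for j in 1...head.count where tail[i - 1] == head[j - 1] {
            lengths[i][j] = lengths[i - 1][j - 1] + 1
            if lengths[i][j] > best {
                best = lengths[i][j]
                bestEndInHead = j
            }
        }
    }
    guard best >= minMatch else { return previous + next }
    // Keep the earlier stream as-is and drop the later stream's matched head.
    return previous + Array(next.dropFirst(bestEndInHead))
}
```

Because both windows are capped at 32 tokens, the table is at most 33 × 33 per seam, which is why the quadratic merge stays negligible next to decoding.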

CLI rewiring:
- `cohere-transcribe`, `cohere-benchmark`, `tts-benchmark` now route
through `transcribeLong`
- `cohere-benchmark` no longer skips files exceeding 35 s

Smoke-tested on a Mandarin 81 s WAV: full 80 s now transcribes
(previously cut at 35 s). 10 unit tests cover `mergeTokenStreams`
correctness (empty-input, no-overlap, threshold fallback, boundary
overlap, offset overlap, longest-run preference, window bounds) and
chunk-config constants.

### 2. Cold-start vs warm inference docs

Adds a section to `Documentation/ASR/Cohere.md` capturing the isolated
single-process bench (cold ANE compile ~186 s on M2 Tahoe; warm calls
3.4–4.6 s, RTFx 1.96×–8.73×). Clarifies process reuse is what unlocks
the headline FLEURS/LibriSpeech RTFx.

## Test plan

- [x] `swift test --filter CohereLongFormTests` (10 / 10 pass)
- [x] `swift test --filter CohereAsrConfigTests` and
`CoherePipelineMaskTests` (no regressions)
- [x] `swift build` (debug) clean; `swift build -c release` clean
- [x] Smoke: `cohere-transcribe /tmp/cohere_test_80s.wav --language zh`
transcribes full 80 s of audio (3 chunks merged, no duplicated overlap
content)
- [x] `swift format lint` — no new warnings in changed files
2026-05-01 10:26:27 -04:00
Alex 7603ac6733 feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557)
## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring
**latency × efficiency × quality** across every shipping TTS backend in
FluidAudio, plus the model + runtime fixes needed to actually clear all
six backends end-to-end on the [MiniMax Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set).
Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs
level so users get a runtime warning on `initialize()` reflecting their
actual perf / quality posture.

### Backends — all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER | Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) |  | 3.5 s / 8.0 s / 11.4 s | 5.19× | 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) |  | 3.5 s / 6.8 s / 9.3 s | 2.02× | 1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip |
| PocketTTS | minimax-en (100/100) |  | 2.8 s / 6.3 s / 9.4 s | 0.61× | 1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms; RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s | 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6 s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s / **16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine), longest output 16.0 s (was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); **whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100 phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s / **16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 → 5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a
`logger.warning` flagging the perf / quality posture; safe to ship in
non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator`
whitespace-tokenizes and Mandarin has no word boundaries (word-level WER
reads ~100% and is meaningless). CER is `whisper-large-v3` against the
rendered WAVs from the full 100-phrase `minimax-chinese` run via
`Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this
PR via `--asr-backend cohere` (see [Cohere ASR backend in the
harness](#cohere-asr-backend-in-the-harness) below) and agrees with
whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a
`MILCompilerForANE` cache failure on this M2 host that drops it to RTFx
~0.13×, so whisper is the practical source-of-truth for the full
100-phrase run.
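
The reason whitespace-tokenized WER breaks down on Mandarin can be seen with a small edit-distance sketch. This is an illustration only, not the `WERCalculator` or `Scripts/whisper_zh_cer.py` logic; the function names and the two sample sentences are made up for the example.

```swift
/// Levenshtein distance over any sequence of comparable tokens.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            current[j] = min(substitution, previous[j] + 1, current[j - 1] + 1)
        }
        previous = current
    }
    return previous[b.count]
}

/// Edit distance normalized by reference length.
func errorRate<T: Equatable>(reference: [T], hypothesis: [T]) -> Double {
    guard !reference.isEmpty else { return hypothesis.isEmpty ? 0 : 1 }
    return Double(editDistance(reference, hypothesis)) / Double(reference.count)
}

// Made-up example pair: one wrong character out of six.
let reference = "今天天气很好"
let hypothesis = "今天天汽很好"

// Whitespace tokenization sees a single "word", so one wrong character scores 100%.
let wer = errorRate(
    reference: reference.split(separator: " ").map(String.init),
    hypothesis: hypothesis.split(separator: " ").map(String.init))

// Character-level scoring counts one substitution out of six characters.
let cer = errorRate(reference: Array(reference), hypothesis: Array(hypothesis))
```

The same transcript therefore reads as a total failure under word-level scoring and as a roughly 17% error under character-level scoring, which is why CER is the usable signal here.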

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category)
live in `Documentation/TTS/Benchmarks.md`. Corpus attribution +
reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat — phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly
comparable between backends when both produce similar phrase lengths and
yield audio at the same granularity**. Two things skew the headline
number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per
`minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3×
more audio out. That's mostly long inter-word pauses + slow speaking
rate baked into the LibriTTS multi-speaker checkpoint, not a measurement
artifact. A 2.72× RTFx on 22 s audio = ~8 s wall — which matches the
TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus
RTFx ratios alone hide this.

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs.
Kokoro's 2.02× but it's **not slower from a user perspective**:
PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**,
Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The
0.61× is the per-frame cost averaged across the streaming run; what
users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |
|-------------|----------|------------------|--------------------------------------------|
| PocketTTS | 1244 ms | 80 ms frame | true streaming; conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio; ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte |
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s, 36% earlier playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz | one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only |

For conversational use cases, **TTFT matters more than RTFx**. PocketTTS
(true streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE
(small one-shot chunks) are the three backends that meaningfully clear
the "user feels it's responsive" bar today.
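
The wall-clock figures quoted in point 1 follow directly from the RTFx definition (audio duration divided by wall-clock time). A small worked sketch, with a hypothetical helper name:

```swift
/// RTFx = audio duration / wall-clock time, so wall-clock = audio / RTFx.
func wallClockSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    audioSeconds / rtfx
}

// StyleTTS2: ~22 s of audio at 2.72× takes roughly 8.1 s of wall-clock time.
let styleTTS2Wall = wallClockSeconds(audioSeconds: 22, rtfx: 2.72)

// Kokoro: ~7 s of audio at 2.02× takes roughly 3.5 s of wall-clock time.
let kokoroWall = wallClockSeconds(audioSeconds: 7, rtfx: 2.02)
```

StyleTTS2 posts the higher RTFx yet the user waits more than twice as long for the same input text, because it emits about three times as much audio.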

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should
weigh against an explicit caveat:

- **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
The misaki→espeak post-pass remap closed half the gap; the remainder is
BART G2P misses + diffusion-sampler formant breaks on long phrases.
- **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via
`synthesizeStream` so TTFT (9.6 s p50) is significantly better than
full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to
~30 s.
- **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the
longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token
Flow input cap is now worked around at the call site by the auto-chunker
(long phrases split + crossfaded), dropping cantonese truncation from
80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100
residual is the long-tail token-rate worst case; the structural fix is
re-exporting Flow with a larger fixed input shape (tracked in
`mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` +
a `.warning`-level `LLM-Decode budget exhausted` log still surface any
truncation, and the harness writes `finished_on_eos` into each phrase in
the JSON report.

Each manager now logs a `.warning`-level beta notice on `initialize()`
(mirroring the existing CosyVoice3 pattern) so anyone wiring these into
a product gets a console signal, not a silent surprise. Docs
(`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md`
StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same
caveat at the top.

### Model + runtime fixes landed in this PR

#### CosyVoice3 stateless port (`71130c9fb`)
Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the
non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on
HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain
`MLDictionaryFeatureProvider` prediction with explicit kv carry-forward;
lowers the availability gate from macOS 15 / iOS 18 back to the package
baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the
rename.

#### CosyVoice3 HiFT timeout fix (`267766b62`)
`minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async
failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has
timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`,
which let the planner place most of the graph on ANE but kept at least
one op on the BNNS CPU async-dispatch path; long phrases tripped the
BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of
user-supplied compute-units, removing the BNNS path entirely. Verified
on 100/100 zh + 100/100 yue.

#### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`)
The autoregressive decode loop runs ~163 steps per phrase to fill the
250-token cap. Each step takes the previous step's KV cache as `kv_k` /
`kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh
`kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side
`MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV
back-buffers + a logits backing, rotates front/back/spare across steps
via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc
on first rejection (one-shot `logger.warning`). Mirrors the Magpie
pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357
(+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470
MB.
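
The allocation arithmetic and the rotation idea can be sketched without Core ML. This is an assumption-laden illustration, not the shipped decode loop: `BackingRotation` and its slot naming are hypothetical, and in real code the array at `outputSlot` would be handed to the prediction through `MLPredictionOptions.outputBackings`.

```swift
// Shape of one KV tensor as quoted above: fp32 [24, 1, 2, 768, 64].
let kvShape = [24, 1, 2, 768, 64]
let kvBytes = kvShape.reduce(1, *) * MemoryLayout<Float>.size  // 9_437_184 bytes = 9 MiB
let perStepBytes = 4 * kvBytes  // kv_k, kv_v, kv_k_out, kv_v_out

/// Hypothetical index bookkeeping for a pool of preallocated buffers.
struct BackingRotation {
    let slotCount: Int
    private(set) var inputSlot = 0

    init(slotCount: Int) {
        precondition(slotCount >= 2, "need at least one input and one output slot")
        self.slotCount = slotCount
    }

    /// Slot the next prediction writes into; never the slot it is reading from.
    var outputSlot: Int { (inputSlot + 1) % slotCount }

    /// After a step, the freshly written slot becomes the next step's input.
    mutating func advance() { inputSlot = outputSlot }
}
```

Since the output slot always differs from the input slot, a step can read the previous KV cache while writing the new one, and no per-step allocation is needed once the pool exists.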

#### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`)
The 250-token Flow input cap means a single synth pass produces at most
~6.5 s of audio regardless of input length. Re-exporting Flow with a
larger fixed input shape is gated on upstream conversion work, so this
PR works around it at the call site: long inputs are split at
sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized
independently, and merged with an 8 ms equal-power cosine crossfade.

**Splitter policy**: hard enders (`. ! ? 。 ！ ？ \n`) commit always; soft
enders (`， 、 ； ： ; ,` + ASCII space) commit only at-or-past budget;
force-split at +30 token overshoot if no natural boundary exists.
`defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap
minus a typical 60–90-token speech-prompt context). Token-rate heuristic
is calibrated against minimax-zh + minimax-yue runs:

| Char class | Tokens / char | Rationale |
|------------|---------------|--------------------------------------------------------------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
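
A rough sketch of how the token-rate heuristic and the splitter policy fit together follows. It is not the `CosyVoice3TextChunker` source: the function names, the CJK range check (CJK Unified Ideographs only), and the abbreviated ender sets are assumptions made for the example.

```swift
/// Estimated speech tokens for a string, using the per-class rates in the table.
func estimateSpeechTokens(_ text: String) -> Double {
    var total = 0.0
    for scalar in text.unicodeScalars {
        switch scalar.value {
        case 0x4E00...0x9FFF: total += 7.5  // CJK Unified Ideographs (assumed range check)
        case 0x00...0x7F: total += 1.5      // ASCII
        default: total += 2.5               // everything else
        }
    }
    return total
}

/// Splits text into chunks: hard enders always commit, soft enders commit only
/// at or past the budget, and a chunk is force-split once it overshoots by `overshoot`.
func splitForSynthesis(_ text: String, budget: Double = 110, overshoot: Double = 30) -> [String] {
    let hardEnders: Set<Character> = [".", "!", "?", "。", "\n"]  // abbreviated set
    let softEnders: Set<Character> = [",", "、", ";", ":", " "]   // abbreviated set
    var chunks: [String] = []
    var current = ""
    var tokens = 0.0
    for character in text {
        current.append(character)
        tokens += estimateSpeechTokens(String(character))
        let atBudget = tokens >= budget
        if hardEnders.contains(character)
            || (atBudget && softEnders.contains(character))
            || tokens >= budget + overshoot
        {
            chunks.append(current)
            current = ""
            tokens = 0
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}
```

At 7.5 tokens per CJK character, a 110-token budget corresponds to roughly 14 to 15 characters per chunk, which is why raising the rate further would over-fragment short phrases.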

**Validation** on full `minimax-cantonese` (100 phrases, M2):

| Metric | Pre-chunker | Post-chunker | Δ |
|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |

The TTFT regression is the cost of running multiple synth passes per
long phrase — splitting unblocks long-form output at the price of
wall-clock latency. The 5/100 residual truncation is the long-tail
token-rate worst case (some chars hit ~9 tokens/char); raising the
per-CJK heuristic further would over-fragment short phrases. Cleaner fix
is the Flow re-export.

16-test suite covers tokenization estimates, hard/soft/force-split
policy, and the crossfade arithmetic. Lives in
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift`
+ `CosyVoice3TtsManager.concatWithCrossfade`.
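
An 8 ms equal-power cosine crossfade at 24 kHz spans 192 samples. The following is a minimal sketch of that arithmetic, assuming mono `Float` buffers; `crossfadeConcat` is a hypothetical free function, not the `concatWithCrossfade` method itself.

```swift
import Foundation

/// Joins two mono buffers, overlapping the end of `a` with the start of `b`.
/// Gains follow cos/sin of the same angle, so fadeOut² + fadeIn² = 1 at every sample.
func crossfadeConcat(_ a: [Float], _ b: [Float], sampleRate: Int = 24_000, fadeMs: Double = 8) -> [Float] {
    let fade = min(Int(Double(sampleRate) * fadeMs / 1000), a.count, b.count)
    guard fade > 0 else { return a + b }
    var out = Array(a.dropLast(fade))
    for i in 0..<fade {
        let t = (Double(i) + 0.5) / Double(fade)  // 0 → 1 across the overlap
        let fadeOut = Float(cos(t * .pi / 2))
        let fadeIn = Float(sin(t * .pi / 2))
        out.append(a[a.count - fade + i] * fadeOut + b[i] * fadeIn)
    }
    out.append(contentsOf: b.dropFirst(fade))
    return out
}
```

The merged buffer is shorter than plain concatenation by exactly the overlap, so two 1000-sample chunks yield 1808 samples.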

#### Magpie streaming TTFT wire-up (`ace0bf485`)
`TtsBenchmarkCommand.swift` now drives Magpie through
`MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first
`MagpieAudioChunk` emit instead of conflating it with full-synth wall
time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6
s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier
than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run
benefit; fundamentals unchanged).

#### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`)
`text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT:
tensor_buffer has known strides while the model has FlexibleShapeInfo`.
The CoreML runtime rejects two access patterns on outputs from a
flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element
subscripts — and the original `sliceFirstAxis2D` helper used both. Fix
rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling
`.float32`, `.float16`, `.double`) and computes the flat index from the
known `(1, leading, trailing)` row-major layout. Verified on full
100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS
demo voice.
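
The flat-index read can be illustrated on a plain buffer. This sketch assumes a contiguous float32 buffer and a hypothetical `sliceRow` helper; the real fix binds `arr.dataPointer` to the array's element type and handles `.float16` and `.double` as well.

```swift
/// Copies row `row` out of a contiguous row-major buffer shaped (1, leading, trailing),
/// computing the offset by hand instead of consulting strides or element subscripts.
func sliceRow(from base: UnsafePointer<Float>, leading: Int, trailing: Int, row: Int) -> [Float] {
    precondition(row >= 0 && row < leading, "row out of range")
    let start = row * trailing  // batch dimension is 1, so it contributes no offset
    return Array(UnsafeBufferPointer(start: base + start, count: trailing))
}

// Stand-in for model output with shape (1, 2, 3).
let storage: [Float] = [0, 1, 2, 10, 11, 12]
let secondRow = storage.withUnsafeBufferPointer { buffer in
    sliceRow(from: buffer.baseAddress!, leading: 2, trailing: 3, row: 1)
}
```

Deriving the index from the known shape sidesteps both rejected access patterns, at the cost of relying on the layout being dense and row-major.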

#### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`)
After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still
landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than
Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only
--corpus` mode and disproved the silent-vocab-drop hypothesis: only
**0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII
hyphens / 12247 scalars).

Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2
share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2
LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized**
LibriTTS — predating misaki by years. The 178-vocab accepts both forms
(e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic
embeddings for the misaki ligature glyphs are essentially untrained
noise.

Side-by-side comparison against locally-installed `espeak-ng -v en-us
--ipa -q` flagged four systematic divergences:

| misaki | espeak-ng | example                  |
|--------|-----------|--------------------------|
| `ʧ`    | `tʃ`      | choice → `tʃˈɔɪs`        |
| `ʤ`    | `dʒ`      | jump   → `dʒˈʌmps`       |
| `ɜɹ`   | `ɝ`       | girl   → `ɡˈɝl`          |
| `əɹ`   | `ɚ`       | over   → `ˈoʊvɚ`         |

Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish` and applied to the assembled phoneme string after
every word has been emitted by the BART G2P. Lives alongside the
existing per-piece misaki diphthong remap. Result on the same 100-phrase
MiniMax-English run with the same `libritts_696` voice and same Parakeet
TDT roundtrip:

| Metric          | Pre   | Post  | Δ      |
|-----------------|-------|-------|--------|
| Macro WER       | 0.581 | 0.440 | −24.2% |
| Macro CER       | 0.476 | 0.241 | −49.5% |
| TTFT p50 (ms)   | 8937  | 6671  | −25.4% |
| Agg RTFx        | 2.36× | 2.72× | +15.3% |
| Peak RSS (MB)   | 1428  | 963   | −32.6% |
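
The four-rule remap amounts to plain string substitution over the assembled phoneme string. A minimal sketch, assuming the rules are applied as literal replacements; `misakiToEspeak` and `remapToEspeak` are hypothetical names, and the locale gate on `.americanEnglish` is omitted.

```swift
import Foundation

// The four systematic divergences from the table above, misaki form first.
let misakiToEspeak: [(String, String)] = [
    ("ʧ", "tʃ"),
    ("ʤ", "dʒ"),
    ("ɜɹ", "ɝ"),
    ("əɹ", "ɚ"),
]

/// Applies every rule to the full phoneme string after G2P has run.
func remapToEspeak(_ phonemes: String) -> String {
    misakiToEspeak.reduce(phonemes) { result, rule in
        result.replacingOccurrences(of: rule.0, with: rule.1)
    }
}
```

The rules do not overlap, so their order does not matter; running the pass on the whole string rather than per word also catches sequences that only form once pieces are joined.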

Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice.
Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster
on word-level G2P misses from the BART itself (`practical →
practicckles`, `separation → expiration`) and diffusion-sampler formant
breaks; closing the rest of the gap to Kokoro likely needs richer espeak
coverage or libespeak-ng vendor — tracked separately.

#### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`)
`StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit
`logger.warning` beta notices mirroring the existing
`CosyVoice3TtsManager.initialize` pattern. Backend docs (`Magpie.md`
Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️
Beta / experimental` callouts so the perf / quality posture is visible
at every entry point: runtime, manager docstring, doc top, PR body.

#### Magpie `outputBackings` rejection fallback (`72dae8400` +
`9767e1ef9`)
The shipped `decoder_step.mlmodelc` reaches the user before the rebuild
lands, so CoreML can reject our `outputBackings` dictionary on a
name mismatch. On the first rejection the decoder latches a flag and
uses fresh-alloc decode for the rest of the run, so the model still
runs.
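
The latching behaviour can be sketched generically. This is an illustrative pattern under assumptions, not the Magpie decoder code: `LatchedFallback`, `BackingError`, and the closure-based shape are hypothetical.

```swift
/// Hypothetical error standing in for a rejected `outputBackings` dictionary.
enum BackingError: Error { case rejected }

/// Tries the fast path until it fails once, then sticks to the slow path.
struct LatchedFallback {
    private(set) var latched = false

    mutating func run<T>(fast: () throws -> T, slow: () throws -> T) throws -> T {
        if !latched {
            do {
                return try fast()
            } catch {
                latched = true  // first rejection disables the fast path for the rest of the run
            }
        }
        return try slow()
    }
}
```

Latching avoids paying for a failed fast-path attempt on every decode step after the first rejection.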

### Cohere ASR backend in the harness (`8e741e659`)

Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER
through the harness against [Cohere
Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into
`--skip-asr`. Four new flags on `tts-benchmark`:

- `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip
engine. Default is `parakeet` for English-only runs and skipped for
CosyVoice3.
- `--cohere-model-dir <path>` — path to a directory containing
`cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
and `vocab.json`.
- `--asr-language <code>` — overrides the inferred language code (covers
all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh,
ko, vi).
- `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins
`MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu`
when the q8 encoder fails ANE compilation (`MILCompilerForANE error:
failed to compile ANE model using ANEF`) to skip the multi-minute
fallback compile on the first call. The harness logs a WER caveat for
zh/ja runs flagging that whitespace-tokenized WER is meaningless and the
CER column is the real signal.

Example end-to-end:
```bash
fluidaudio tts-benchmark \
    --backend cosyvoice3 \
    --corpus minimax-chinese \
    --asr-backend cohere \
    --cohere-model-dir /path/to/cohere/q8 \
    --asr-language zh \
    --output-json benchmark_results/cv3-zh-cohere.json \
    --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure
(`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently
falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per
`Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is
unaffected (same graph, same output), only latency. The full 100-phrase
CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was
therefore produced via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100
phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5%
CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` /
`prose-zh` shipped with the first cut of this PR with the [MiniMax
Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
(CC-BY-SA-4.0; 100 phrases × 25 languages). Same public corpus used by
[MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and
Gradium — numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in
`Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an
on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them
from the upstream HF dataset at the pinned revision and writes them to
the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth
(HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no
hardcoded asset URLs. The `.txt` files now live in `.gitignore` since
they're CC-BY-SA-4.0 derivative content; only
`Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER
caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in
`ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior
`python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend
language scope:

| Backend | Languages benchmarked |
|---|---|
| Kokoro / Kokoro ANE | en (af_heart) |
| PocketTTS | en + de + it + pt + es + fr |
| Magpie | en + es + de + fr + it + vi + zh + hi |
| StyleTTS2 | en (LibriTTS multi-spk) |
| CosyVoice3 | zh + yue |

### PocketTTS streaming TTFT (`c26f1e163`)
PocketTTS now drives the harness through its `synthesizeStreaming` API
so TTFT measures time-to-first-80ms-frame instead of full one-shot
synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage
that one-shot benchmarking previously hid.

### Reference voice dumper helper (mobius-styletts2)
`mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo)
wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to
dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize`
consumes via `--voice`. Required because the shipped CoreML bundle
doesn't include those upstream-only PyTorch encoders.

## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x`
produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en
- [x] StyleTTS2 — full 100/100 minimax-en (verified after
`sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue
(verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error,
LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on
`initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`;
cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`,
`TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated — confirmed structural
(250-token Flow input shape, 40 ms / token); surfaced via
`finishedOnEos` + warning log + JSON `finished_on_eos` field. See
[Decode budget
cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site
workaround. Validated on full minimax-cantonese: truncation **80/100 →
5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×.
16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3
auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in
`TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50
**9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier
playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` +
`--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`).
Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8
macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this
M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04%
— both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via
`whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over
all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**.
Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote
‡)
2026-05-01 09:09:42 -04:00
Alex b5d8017d1f feat(asr/parakeet-v3): default to int4-per-channel encoder (#560)
## Summary

Switch the Parakeet TDT v3 default encoder from the 6-bit palettized
`Encoder.mlmodelc` to a new int4-per-channel `EncoderInt4.mlmodelc`. v2
and TDTJa keep the legacy 6-bit encoder; v3 is the only path that
changes.

## WER / size / speed (LibriSpeech test-clean, 100 files, M2)

| variant | WER | disk | RTFx | ANE residency |
|---|---|---|---|---|
| baseline (6-bit palettized, current default) | 2.64% | 426 MB | 36.8x | 99.4% |
| **int4-per-channel (new default)** | **5.24%** | **285 MB** | **49.2x** | 82.0% |
| enc-prune+int8 | 2.57% | 568 MB | 19.8x | 82.0% |
| enc-int4-linear-per-block-32 | 3.95% | 319 MB | 15.6x | 33.3% |
| enc-prune+int4-block | 3.95% | 319 MB | 15.9x | 33.3% |

The chosen variant trades roughly 2× LibriSpeech WER (still in the same
single-digit-percent regime) for **33% less disk** and the **fastest
RTFx** of any variant tested. Per-block quants mostly fall off the ANE
(33% residency) while per-channel stays largely compatible (82%).

## Implementation

- `ModelNames.ASR`
  - Add `encoderInt4 = "EncoderInt4"` and `encoderInt4File =
    "EncoderInt4.mlmodelc"`.
  - Swap `encoderFile` for `encoderInt4File` in `requiredModelsV3`.
    `encoderFile` stays defined and is still used by v2 / TDTJa / 110m.
- `AsrModels.swift`
  - Extend `getModelFileNames(version:)` return tuple from `(decoder,
    joint, vocabulary)` to `(encoder, decoder, joint, vocabulary)`.
  - Thread `fileNames.encoder` through `createModelSpecs`, the v3 `load`
    flow, the `download` spec list, and `isModelValid`. v3 returns
    `Names.encoderInt4File`; v2/tdtJa return their existing
    `Encoder.mlmodelc`; fused (110m) is unaffected.
- Tests: add `testV3UsesInt4EncoderAsDefault` and
  `testV2KeepsLegacyEncoder` in `ModelNamesTests`.
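
The per-version encoder selection reduces to a small switch. A minimal sketch, assuming a simplified version enum; `AsrVersionSketch` and `encoderFileName(for:)` are hypothetical stand-ins for the real `getModelFileNames(version:)` tuple, and the fused 110m case is left out.

```swift
// Hypothetical stand-in for the model version type.
enum AsrVersionSketch {
    case v2, v3, tdtJa
}

/// Only v3 switches to the int4 encoder; the other versions keep the legacy file.
func encoderFileName(for version: AsrVersionSketch) -> String {
    switch version {
    case .v3: return "EncoderInt4.mlmodelc"
    case .v2, .tdtJa: return "Encoder.mlmodelc"
    }
}
```

Returning the encoder name alongside the other file names keeps download, load, and validation on one source of truth per version.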

## Distribution

The new `EncoderInt4.mlpackage` / `EncoderInt4.mlmodelc` will be
uploaded to the existing `FluidInference/parakeet-tdt-0.6b-v3-coreml` HF
repo alongside the current `Encoder.mlmodelc`. Older library versions
that still ask for `Encoder.mlmodelc` continue to work unchanged.

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter ModelNamesTests` — 20/20 (2 new)
- [x] `swift test --filter AsrModelsTests` — 30/30
- [x] End-to-end transcription smoke test on LibriSpeech
61-70970-0001.flac via `EncoderInt4.mlmodelc`: correct text, RTFx
29.12x. ANE cold compile 21.3s (one-time).
- [x] swift-format lint clean on the modified files (only pre-existing
Sortformer warnings remain in `ModelNames.swift`).
- [ ] CI: tests + asr-benchmark
- [ ] Verify HF download path on a clean cache once
`EncoderInt4.mlmodelc` is uploaded to the v3 repo.

## Companion

The mobius PR adds the conversion scripts that produced these variants
(`extra_encoder_variants.py`, `analyze_fallback.py`,
`compute_unit_sweep.py`).
2026-04-30 23:00:43 -04:00
Benjamin Lee 35f6ba697f Added Back the Old LS-EEND Constructors (#563)
I accidentally deleted the old constructor in my last PR.

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-30 17:24:18 -07:00
Benjamin Lee 4065a9917e Optimized LS-EEND API (#526) 2026-04-30 17:49:32 -04:00
Zhongpai Gao c4d56a5cb5 Feat/pocket tts int8 precision swap (#558)
### Why is this change needed?

Wires up the published `flowlm_stepv2.mlmodelc` int8 variant via a
`PocketTtsPrecision { .fp16, .int8 }` parameter on `PocketTtsManager`,
threaded through to `PocketTtsModelStore` and
`PocketTtsResourceDownloader`.

Closes the loop on the `flowlm_stepv2.mlmodelc` artifact that's been
published under `v2/<lang>/` for a while but didn't have a Swift loader
hook. Default stays `.fp16`, no behavior change for existing callers.

## What's in this PR

**Code (5 files, +171/-12):**
- `PocketTtsPrecision.swift` — new enum `{ .fp16, .int8 }`, with
  docstring documenting the kyutai-labs/pocket-tts#147 recipe and
  preserving the per-submodel A/B data from `experiment/pocket-tts-int8`
  (cond_step / flowlm_step / flow_decoder / mimi_decoder safety summary)
- `ModelNames.swift` — `flowlmStepV2` constant +
  `flowlmStepFile(precision:)` and `requiredModels(precision:)` helpers
- `PocketTtsResourceDownloader.swift` — `precision:` param,
  precision-aware cache check, and `removeUnusedFlowlmVariant()` post-
  download cleanup so callers' disk usage matches the loaded models
- `PocketTtsModelStore.swift` — `precision:` init param plumbed to the
  precision-aware filename helper
- `PocketTtsManager.swift` — `precision:` init param threaded to the
  store

**Docs (1 file, +47):**
- `Documentation/TTS/PocketTTS.md` — new "Model Files & Precision"
  section: per-submodel precision/size/HF-path table, fp16-vs-int8
  totals, rationale for why only `flowlm_step` is quantized

## Why default is `.fp16`

Before committing the rename, I checked the on-disk weight format by
inspecting `model.mlmodel` for both flowlm variants: the int8 variant
has explicit `cast_fp16_to_fp32` op scaffolding throughout, while the
default has none, which indicates uniform fp16 weights. Combined with
the 304→77 MB size ratio (~4×, consistent with fp16→int8 plus
quantization scale tensors), this shows the default file's weights are
fp16 on disk. The existing `PocketTtsModelStore.swift:65-67` comment
about "CPU/GPU compute in float32 matches the Python reference" is
correct about runtime compute precision (CoreML upcasts fp16 weights to
fp32 on `.cpuAndGPU`); it just doesn't describe the disk format, so it
can stay as-is.

## Why per-submodel quantization isn't exposed

The `experiment/pocket-tts-int8` branch's `PocketTtsQuantization`
struct (per-submodel `PocketTtsModelPrecision`) is a richer API, but
the per-submodel int8 artifacts (`cond_step_int8.mlmodelc`, etc.)
aren't published on HuggingFace today. Adding the API would let
callers request configurations that 404 at download time. Only
`flowlm_stepv2.mlmodelc` is published, and that's what this PR wires
up. The `PocketTtsPrecision` enum can grow into the experiment
branch's `PocketTtsQuantization` shape mechanically if/when the
per-submodel artifacts ship.

## Disk footprint (English language pack)

| | fp16 (default) | int8 |
|---|---|---|
| Total active files on disk | 766.3 MB | 549.3 MB |
| **int8 savings vs fp16** | — | **−217 MB (28%)** |

The `v2/<lang>/` HF directory ships both flowlm variants, so first
download briefly holds ~857 MB before the cleanup pass deletes the
unused `.mlmodelc` and `.mlpackage`.

## Backward compatibility

- `PocketTtsManager()` / `PocketTtsModelStore()` / `ensureModels()`
  defaults all stay `.fp16`, which loads `flowlm_step.mlmodelc` exactly
  as before
- Existing `requiredModels` constant retained alongside new
  `requiredModels(precision:)` so non-precision-aware callers keep
  compiling
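
The default-preserving behavior described above can be sketched as follows. This is a minimal illustration, not the PR's code: the enum, helper names, and file names come from the description, while the function bodies and the exact contents of the required-model set are assumptions.

```swift
// Hypothetical re-creation of the precision switch; the enum and file names
// come from the PR text, the helper bodies are assumptions.
enum PocketTtsPrecision {
    case fp16
    case int8
}

/// Maps a precision to the flowlm step bundle named in the PR description.
func flowlmStepFile(precision: PocketTtsPrecision) -> String {
    switch precision {
    case .fp16: return "flowlm_step.mlmodelc"
    case .int8: return "flowlm_stepv2.mlmodelc"
    }
}

/// Assumed shape of the precision-aware model set: only the flowlm entry varies.
func requiredModels(precision: PocketTtsPrecision = .fp16) -> Set<String> {
    return [
        "cond_step.mlmodelc",
        "flow_decoder.mlmodelc",
        "mimi_decoder.mlmodelc",
        flowlmStepFile(precision: precision),
    ]
}
```

Because the parameter defaults to `.fp16`, a caller that never mentions precision still resolves to `flowlm_step.mlmodelc`, which is the compatibility property this section claims.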

## Verification done

- All 11 published language packs have both `flowlm_step.mlmodelc`
  and `flowlm_stepv2.mlmodelc` under `v2/<lang>/` — verified via HF
  tree API
- Branch is exactly +2 commits on top of `main`
  (`00ea906 fix: remove module_map from MachTaskSelfWrapper subspec`)
- Diff content is identical to `Gaozhongpai/FluidAudio:main`, just
  squashed from 5 iterative commits into 2 clean ones (one feat, one
  docs)

I haven't run `swift test` locally — Bash on Windows here, no Swift
toolchain. Happy to fix anything CI flags.

2026-04-29 13:57:15 -04:00
dianshu 00ea906c20 fix: remove module_map from MachTaskSelfWrapper subspec (#546)
## Summary

- Remove `mach.module_map` from the `MachTaskSelfWrapper` subspec —
CocoaPods does not allow `module_map` on subspecs
- Guard `import MachTaskSelfWrapper` with `#if
canImport(MachTaskSelfWrapper)`, matching the existing
`FastClusterWrapper` pattern
- In CocoaPods builds, the C headers are already exposed via the
umbrella header, so the explicit module import is only needed under
SwiftPM
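
A minimal sketch of the guarded import described above. `MachTaskSelfWrapper` is the module named in the PR; the `usesExplicitModule` flag is hypothetical and exists only to make the selected branch visible.

```swift
// MachTaskSelfWrapper is the module named in the PR; `usesExplicitModule`
// is a hypothetical flag added only to make the branch observable.
#if canImport(MachTaskSelfWrapper)
import MachTaskSelfWrapper
let usesExplicitModule = true   // SwiftPM: the C target is its own module
#else
let usesExplicitModule = false  // CocoaPods: headers arrive via the umbrella header
#endif

print("explicit module import:", usesExplicitModule)
```

The same source file then compiles under both build systems, because the import is only attempted where the module actually exists.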

## Verification

- `pod lib lint FluidAudio.podspec --allow-warnings` — **passed**
- `swift build` — **passed**

Fixes #545

Co-authored-by: dianshu <dianshu@123.com>
2026-04-29 09:25:25 -04:00
Alex 248b76b8b6 feat(tts/styletts2): scaffold StyleTTS2 4-stage pipeline integration (#554)
## Summary

Adds the FluidAudio host surface for the StyleTTS2 LibriTTS
multi-speaker checkpoint published at
`FluidInference/StyleTTS-2-coreml`, end-to-end. Covers asset download,
lazy bucketed model loading, text frontend (G2P + 178-token vocab),
bundle config validation, the ADPM2/Karras sampler, hard-alignment,
decoder driver, and a CLI driver.

`fluidaudio styletts2 "Hello world." --voice ref_s.bin --output out.wav`
produces an audible 24 kHz mono WAV.

## Pipeline

Per utterance (~5 ADPM2 steps default):

| Stage | Bucket axis | Buckets | Precision | Compute |
|---|---|---|---|---|
| `text_predictor` | input tokens | 32, 64, 128, 256, 512 | fp16 | ANE |
| `diffusion_step` | bert_dur frames | 512 only (5× per utt) | fp16 | CPU+GPU |
| `f0n_energy` | dynamic (en frames) | enumerated 256/512/1024/2048/4096 | fp16 | CPU |
| `decoder` | mel frames | 256, 512, 1024, 2048, 4096 | fp32 | CPU+GPU |

The decoder is fp32 because SineGen phase saturation in fp16 produces
robotic audio. The HF repo ships precompiled `compiled/*.mlmodelc`
bundles (skipping the cold-start `anecompilerservice` hit) plus
`.mlpackage` doubles for portability — only the `.mlmodelc` bundles are
fetched.

`f0n_energy` is pinned to CPU and always called at the largest
enumerated shape (1, 640, 4096) with zero-padding — the E5RT runtime
emits a stderr "tensor_buffer has known strides while the model has
FlexibleShapeInfo" warning when it sees enumerated shapes on GPU/ANE,
which is non-fatal but the CPU/largest-shape path sidesteps it cleanly.
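
Bucketed loading generally means picking the smallest fixed shape that fits the input and zero-padding up to it. The sketch below illustrates that idea under stated assumptions: `selectBucket` and `zeroPad` are hypothetical helpers, not functions from the PR, and only the bucket sizes are taken from the table.

```swift
/// Returns the smallest bucket that can hold `length`, or nil when the input
/// is longer than the largest bucket. `buckets` must be sorted ascending.
func selectBucket(for length: Int, in buckets: [Int]) -> Int? {
    return buckets.first(where: { $0 >= length })
}

/// Zero-pads `values` on the right up to `bucket` elements.
func zeroPad(_ values: [Float], to bucket: Int) -> [Float] {
    precondition(values.count <= bucket, "input longer than bucket")
    return values + [Float](repeating: 0, count: bucket - values.count)
}

let textBuckets = [32, 64, 128, 256, 512]       // text_predictor buckets from the table
let f0nBuckets = [256, 512, 1024, 2048, 4096]   // f0n_energy enumerated shapes

// A 40-token input lands in the 64 bucket; f0n_energy skips selection and
// always pads to the largest enumerated shape, as described above.
let textBucket = selectBucket(for: 40, in: textBuckets)
let f0nBucket = f0nBuckets.last
```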

## What's in this PR

**Sources:**
- `StyleTTS2Constants` — audio/tokenizer/model dims + sampler defaults
(Karras `rho=9` to match upstream)
- `StyleTTS2Error` — module-local `LocalizedError` enum
- `Assets/StyleTTS2ResourceDownloader` — `DownloadUtils.downloadRepo`
wrapper
- `Assets/StyleTTS2Vocab` — 178-token espeak-ng IPA vocab loader;
iterates Unicode scalars (not graphemes) so combining marks like U+0329
syllabic / U+0361 tie-bar look up against their own vocab entries
- `Assets/StyleTTS2BundleConfig` — `config.json` Codable + `validate()`
against `StyleTTS2Constants`
- `Assets/StyleTTS2VoiceStyle` — parser for precomputed `ref_s.bin` (256
fp32 LE) speaker-prosody blobs (dump script lives in
`mobius-styletts2/scripts/06_dump_ref_s.py`)
- `Pipeline/StyleTTS2ModelStore` — actor with lazy per-bucket `MLModel`
cache + lazy vocab/config caches; `f0nEnergy()` pinned `.cpuOnly`
- `Pipeline/StyleTTS2Phonemizer` — `TtsTextPreprocessor` → in-tree
`G2PModel` (BART, misaki IPA) for English with a small misaki→espeak-ng
remap (`A→eɪ`, `I→aɪ`, `O→oʊ`, `W→aʊ`, `Y→ɔɪ`, schwa-offglide → `ə`);
other languages fall back to `MultilingualG2PModel`
- `Pipeline/StyleTTS2Sampler` — ADPM2 / Karras-rho noise schedule +
CFG-aware sampling closure; deterministic via SplitMix64 + Box-Muller
- `Pipeline/StyleTTS2Synthesizer` — full 4-stage driver. Float16-aware
`MLMultiArray` reads (`denoised`, `F0`, `N` all ship as fp16 per
schema), cumsum-of-durations → one-hot → matmul hard-alignment, decoder
fan-out
- `StyleTTS2Manager` — public actor; `initialize()` validates bundle
config; `tokenize()` exposes the text frontend;
`synthesize(text:voiceStyleURL:steps:alpha:beta:randomSeed:)` returns 24
kHz mono WAV `Data`
- `Sources/FluidAudioCLI/Commands/StyleTTS2Command` — `fluidaudio
styletts2 "<text>" --voice <ref_s.bin> [--output --steps --alpha --beta
--seed]`
- `ModelNames.StyleTTS2` + `Repo.styleTts2` wired into the central
registries
- `TtsBackend.styleTts2` case
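
The "cumsum-of-durations → one-hot → matmul" hard alignment mentioned for `StyleTTS2Synthesizer` can be illustrated with plain arrays. This is a sketch of the general technique, assuming non-negative integer durations; the function names are hypothetical and the real code operates on `MLMultiArray` data.

```swift
/// Builds a [tokens x frames] one-hot alignment from integer durations.
/// Token t owns the frames between the running sums before and after it.
func alignmentMatrix(durations: [Int]) -> [[Float]] {
    let totalFrames = durations.reduce(0, +)
    var matrix = [[Float]](
        repeating: [Float](repeating: 0, count: totalFrames),
        count: durations.count
    )
    var start = 0  // running cumulative sum of durations
    for (token, duration) in durations.enumerated() {
        for frame in start..<(start + duration) {
            matrix[token][frame] = 1
        }
        start += duration
    }
    return matrix
}

/// Multiplies [channels x tokens] features by the [tokens x frames] alignment,
/// which repeats each token's feature for as many frames as its duration.
func expand(features: [[Float]], alignment: [[Float]]) -> [[Float]] {
    let frames = alignment.first?.count ?? 0
    var output = [[Float]]()
    for row in features {
        var expanded = [Float](repeating: 0, count: frames)
        for (token, value) in row.enumerated() {
            for frame in 0..<frames {
                expanded[frame] += value * alignment[token][frame]
            }
        }
        output.append(expanded)
    }
    return output
}
```

With durations `[2, 1, 3]`, a single feature channel `[10, 20, 30]` expands to six frames in which each value is repeated according to its duration.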

**Tests** (37/37 pass, no network or CoreML deps):
- `StyleTTS2VocabTests` — load happy path, combining-grapheme handling,
missing/malformed JSON, encode known/unknown/empty
- `StyleTTS2BundleConfigTests` — load + validate against every constant
mismatch
- `StyleTTS2VoiceStyleTests` — `ref_s.bin` parsing (size, fp32
round-trip, wrong-size rejection)
- `StyleTTS2SamplerTests` — Karras schedule, RNG determinism
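
The two properties the sampler tests cover can be sketched as below. The Karras schedule and the SplitMix64 and Box-Muller algorithms are the standard published forms; the type and function names are hypothetical, and the sigma bounds used in the usage line are placeholder assumptions rather than values stated in this PR (only `rho=9` is).

```swift
import Foundation

/// Karras et al. noise schedule: interpolate linearly in sigma^(1/rho) space.
func karrasSigmas(steps: Int, sigmaMin: Double, sigmaMax: Double, rho: Double) -> [Double] {
    precondition(steps >= 2, "need at least two steps")
    let minInv = pow(sigmaMin, 1 / rho)
    let maxInv = pow(sigmaMax, 1 / rho)
    return (0..<steps).map { i -> Double in
        let t = Double(i) / Double(steps - 1)
        return pow(maxInv + t * (minInv - maxInv), rho)
    }
}

/// SplitMix64: a small deterministic 64-bit generator.
struct SplitMix64 {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }

    /// Uniform sample in (0, 1], built from the top 53 bits.
    mutating func nextUnit() -> Double {
        return (Double(next() >> 11) + 1) / 9_007_199_254_740_992.0  // 2^53
    }

    /// Standard normal sample via the Box-Muller transform.
    mutating func nextGaussian() -> Double {
        let u1 = nextUnit()
        let u2 = nextUnit()
        return (-2 * log(u1)).squareRoot() * cos(2 * Double.pi * u2)
    }
}

// Placeholder bounds (assumed); rho = 9 follows the PR text.
let sigmas = karrasSigmas(steps: 5, sigmaMin: 0.0001, sigmaMax: 3.0, rho: 9)
```

Seeding two generators with the same value yields identical noise sequences, which is what makes a `--seed` run reproducible.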

## Verification

- `fluidaudio styletts2 "Hello world. The quick brown fox jumps over the
lazy dog." --voice /tmp/styletts2-ref_s.bin --output /tmp/out.wav --seed
42` → 4.80s @ 24 kHz, RMS 7158, 0.0009% clipping
- `fluidaudio transcribe /tmp/out.wav` → `Hello world quick brown fax
nomps over lazy` (most words recovered; residual gaps are BART G2P
emitting reduced `ð` for "the" with no schwa, and lacking length marks
`ː` on stressed long vowels)

## Test plan

- [x] `swift build -c release` clean
- [x] `swift test --filter StyleTTS2` → 37/37 pass
- [x] `swift format lint` clean on new files
- [x] End-to-end CLI synth produces audible WAV
- [x] ASR roundtrip recovers most content words

## Known follow-up

- Tune misaki→espeak remap for length marks `ː` and reduced
function-words (would push ASR WER lower)
- Voice-bank packaging story (currently the user must precompute
`ref_s.bin` via `mobius-styletts2/scripts/06_dump_ref_s.py`)
- StyleTTS2 benchmark suite

2026-04-29 09:24:44 -04:00
Alex 3d9d422202 feat(tts/magpie): add NVIDIA Magpie TTS Multilingual 357M Swift port (#541)
## Summary

Ports the NVIDIA Magpie TTS Multilingual 357M autoregressive TTS from
Python (mobius [#24](https://github.com/FluidInference/mobius/pull/24))
to Swift. Closes FluidInference/FluidAudio#49.

> **⚠️ Experimental — quite slow on Apple Silicon, needs significant
perf work.** First synth on a fresh process is dominated by CoreML model
load + first-call ANE compile (~30 s). Warm synths run at **~96 s wall
for an 8-word English sentence** on M-series — RTFx ≈ **0.04** (~25×
slower than realtime). Whether the throughput ceiling is a model
characteristic, a CoreML conversion limitation, or both is still being
investigated and is expected to improve in subsequent iterations. **Do
not use in latency-sensitive paths.** For real-time use prefer Kokoro
(~20× RTFx, parallel) or PocketTTS (~1.5–2× RTFx, streaming Mimi).
Magpie's value prop is multilingual coverage + 5 built-in speaker
contexts, not throughput.

## Status

Functional. Audio quality is perceptually clean across all 5 speakers;
first synth on a fresh process is dominated by CoreML model load +
first-call ANE compile (~30 s), warm synths run at ~96 s wall for an
8-word English sentence on M-series (RTFx ≈ 0.04). Quality is ASR-clean
on 4/5 speakers; speaker 0 has a single trailing-word artifact ("…and")
attributable to fp16 sampler-trajectory drift, **not a structural bug**.

Not yet covered: Japanese (deferred — needs OpenJTalk XCFramework +
MeCab dict), CFG performance optimization, MLX-backed LocalTransformer.

- **Languages (8/9):** English, Spanish, German, French, Italian,
Vietnamese, Mandarin, Hindi. Japanese deferred pending OpenJTalk
XCFramework integration.
- **5 built-in speakers** (`.john`, `.sofia`, `.aria`, `.jason`, `.leo`)
with 110-token (768d fp16) context embeddings.
- **Inline IPA override** (`"Hello | ˈ n ɛ m o ʊ | world"`) routes `|…|`
segments directly to the tokenizer for pronunciation control —
first-class feature.
- **Streaming**: `synthesizeStream(...)` yields `MagpieAudioChunk` per
chunk as soon as its NanoCodec decode finishes (first chunk is a small
clause-sized head ≈ 50 frames / 2.3 s for low TTFA). Each non-final
chunk includes punctuation-aware trailing silence for gapless playback.
- **ANE warmup at init**: `MagpieTtsManager.initialize()` runs an
unmeasured 16-step synthesis to force `MILCompilerForANE` to compile the
decoder graphs once. Without this the first user-facing `synthesize()`
can fall back to GPU/CPU and run multiple× slower.
- **Output:** 22.05 kHz mono WAV via 8-codebook NanoCodec decoder, max
11.89 s per synthesis (256 nanocodec frames).
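
The frame counts and durations quoted in this list can be related with a short calculation. The hop size of 1024 samples per NanoCodec frame is an assumption: it is not stated in the PR, and is used here only because it reproduces the quoted figures at 22.05 kHz. RTFx is read here as audio seconds per wall-clock second, which matches the "~25× slower than realtime" gloss.

```swift
// Assumption: one NanoCodec frame covers 1024 samples at 22 050 Hz. This hop
// size is not stated in the PR; it is chosen because it matches the quoted figures.
let sampleRate = 22_050.0
let assumedSamplesPerFrame = 1_024.0

/// Audio duration in seconds for a given number of codec frames.
func seconds(forFrames frames: Int) -> Double {
    return Double(frames) * assumedSamplesPerFrame / sampleRate
}

/// RTFx read as audio seconds produced per wall-clock second.
func rtfx(audioSeconds: Double, wallSeconds: Double) -> Double {
    return audioSeconds / wallSeconds
}

print(seconds(forFrames: 256))  // maximum synthesis length
print(seconds(forFrames: 50))   // first streaming chunk
```

Under that assumption, 256 frames come to about 11.89 s and 50 frames to about 2.3 s, in line with the numbers above.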

## HF assets — live


[`FluidInference/magpie-tts-multilingual-357m-coreml`](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml)
is **uploaded and ready** (1.4 GB). Ships:

- `text_encoder.{mlmodelc,mlpackage}` — both compiled and portable
- `decoder_step.{mlmodelc,mlpackage}` — rank-4 split-K/V cache, 97.3%
ANE residency
- `decoder_prefill.{mlmodelc,mlpackage}` — fast prefill path (110-token
batched)
- `nanocodec_decoder.{mlmodelc,mlpackage}` — 8-codebook → 22 kHz PCM
(CPU-only by export)
- `constants/` — `constants.json`, `speaker_info.json`, 8 audio-codebook
embeddings, 5 speaker contexts, local-transformer weights
- `tokenizer/` — per-language phoneme/jieba/pypinyin lookups
(lazy-downloaded)
- **`manifest.json`** — machine-readable index (sha256, file sizes, npy
shapes, model IO specs) consumed by `MagpieResourceDownloader`

## Architecture

| Stage | Implementation |
|---|---|
| Text encoder | `text_encoder.mlmodelc` (CoreML, cpuAndNeuralEngine) |
| Prefill | `decoder_prefill.mlmodelc` fast path (single batched call, 110 tokens), or fallback loop |
| AR loop | `decoder_step.mlmodelc` with **rank-4 split-K/V cache** (`cache_k{i}` / `cache_v{i}`, shape `[1, 512, 12, 64]` × 12 layers; logits `var_2129`); `outputBackings` + double-buffered KV cache to keep allocations off the hot path |
| Local transformer | Pure Swift, 1-layer (256d), Accelerate (`cblas_sgemm`) + BNNS (GELU); fp32 only (fp64 path removed); vDSP-fused embed; min-heap top-K |
| Sampling | top-k (80) + temperature (0.6), audio-EOS mask during `minFrames`, forbidden-token mask `[2016, 2018-2023]`; `torch.topk`-faithful tie semantics (counts above-threshold + earliest-index ties up to K) |
| Vocoder | `nanocodec_decoder.mlmodelc` pinned to `cpuOnly` (ANE rejects the graph); 8×N codes → float PCM → peak-normalize |

CFG is **off by default** (`cfgScale = 1.0`); enabling it doubles
per-step decoder cost. Assets fetched lazily via `DownloadUtils`; only
the languages requested in `downloadAndCreate(languages:)` are
materialized.
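
The Sampling row can be illustrated with a small, deterministic sketch that returns the distribution a sampler would draw from. The function names are hypothetical and the code is an assumption about how the described pieces fit together (forbidden-token mask, top-k with lowest-index tie-breaking at the threshold, then temperature softmax); it is not the PR's implementation.

```swift
import Foundation

/// Returns the indices kept by top-k. Values strictly above the k-th largest
/// are always kept; ties at that threshold are filled lowest index first.
func topKIndices(_ logits: [Float], k: Int) -> [Int] {
    let k = min(k, logits.count)
    guard k > 0 else { return [] }
    let threshold = logits.sorted(by: >)[k - 1]
    var kept = logits.indices.filter { logits[$0] > threshold }
    for index in logits.indices where logits[index] == threshold && kept.count < k {
        kept.append(index)
    }
    return kept.sorted()
}

/// Masks forbidden tokens, applies top-k and temperature, and returns a
/// probability distribution over the full vocabulary.
func samplingDistribution(
    logits: [Float], forbidden: Set<Int>, k: Int, temperature: Float
) -> [Float] {
    var masked = logits
    for token in forbidden where masked.indices.contains(token) {
        masked[token] = -.infinity
    }
    var probs = [Float](repeating: 0, count: logits.count)
    // Drop masked entries in case k exceeds the number of allowed tokens.
    let kept = topKIndices(masked, k: k).filter { masked[$0] > -.infinity }
    guard let maxLogit = kept.map({ masked[$0] }).max() else { return probs }
    for index in kept {
        probs[index] = exp((masked[index] - maxLogit) / temperature)
    }
    let total = probs.reduce(0, +)
    return probs.map { $0 / total }
}
```

With the values from the table, a call would pass `k: 80`, `temperature: 0.6`, and a forbidden set containing 2016 and 2018 through 2023.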

## Public API

```swift
let manager = try await MagpieTtsManager.downloadAndCreate(
    languages: [.english, .spanish]
)

// One-shot
let result = try await manager.synthesize(
    text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.",
    speaker: .john,
    language: .english
)
let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate)

// Streaming (chunk-level, per-chunk NanoCodec decode)
for try await chunk in try await manager.synthesizeStream(text: longText) {
    audioPlayer.append(chunk.samples)
}
```

## CLI

```
fluidaudiocli magpie download --languages en,es
fluidaudiocli magpie text --text "Bonjour." --speaker 0 --language fr --output out.wav
fluidaudiocli magpie text --text "Long passage..." --stream --output stream.wav
fluidaudiocli magpie bench --runs 5 --warmup 1   # in-process median RTFx
```

(Parity tooling moved to mobius — see
[FluidInference/mobius#44](https://github.com/FluidInference/mobius/pull/44)
for the fixture emitter / Python ground-truth path.)

## Inline IPA — verified working

The `|…|` passthrough is **native NeMo `IpaG2p` behavior** (not added by
us): segments inside pipes are looked up directly in `token2id.json` as
whitespace-separated phonemes, bypassing G2P.

```
input:  "Hello | n ɛ m o ʊ | from FluidAudio."
G2P:    həˈloʊ nɛmoʊ frʌm fluɪdaːdɪoʊ.   ← injected IPA visible mid-stream
```

Validated end-to-end with the live HF assets (Python reference): 30
tokens → 43 frames → 2.00 s @ 3.97x RTF.
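
A minimal sketch of how `|…|` segments could be separated from ordinary text before G2P. The `Segment` type and `segments(of:)` function are hypothetical, and the handling of edge cases (for example an unmatched trailing pipe) is an assumption, not the behavior verified by `MagpieIpaOverrideTests`.

```swift
import Foundation

enum Segment: Equatable {
    case text(String)        // goes through G2P
    case phonemes([String])  // looked up directly, one token per phoneme
}

/// Splits on `|`; odd-numbered pieces are treated as inline IPA.
/// Assumption: an unmatched trailing `|` leaves the remainder as IPA.
func segments(of input: String) -> [Segment] {
    let pieces = input.split(separator: "|", omittingEmptySubsequences: false)
    var result: [Segment] = []
    for (index, piece) in pieces.enumerated() {
        if index % 2 == 1 {
            // Inside pipes: whitespace-separated phonemes, no G2P.
            let phonemes = piece.split(separator: " ").map { String($0) }
            if !phonemes.isEmpty { result.append(.phonemes(phonemes)) }
        } else {
            let text = piece.trimmingCharacters(in: .whitespaces)
            if !text.isEmpty { result.append(.text(text)) }
        }
    }
    return result
}
```

For the input shown above, this yields a text segment, a five-phoneme segment, and a second text segment, in that order.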

## Guardrails followed

- No `@unchecked Sendable`; `MagpieTtsManager`, `MagpieModelStore`,
`MagpieTokenizer`, `MagpieSynthesizer` are all `actor`s.
- No dummy models / synthetic data.
- `AppLogger(category: "Magpie*")` throughout, no `print()` (including
`MagpieCommand.printUsage`).
- `MagpieError: Error, LocalizedError` for all error paths.

## Test plan

- [x] `swift build` — clean on macOS 14 / Swift 6 (only pre-existing
`cblas_sgemm` deprecation warnings from Accelerate); iOS build also
clean (Swift 6 isolation-checker workaround landed).
- [x] `swift test --filter "Magpie|NpyReader"` — 17 / 17 pass:
  - `MagpieConstantsTests` (4): forbidden-token mask, shape relations,
    NeMo tokenizer-name parity, per-language file coverage
  - `MagpieIpaOverrideTests` (7): `|…|` segmentation edge cases
  - `MagpieKvCacheTests` (3): cache shape, `addInputs` key count, static
    output keys
  - `NpyReaderTests` (3): fp32 parse, fp16→fp32 upcast, bad-magic
    rejection
- [x] HF assets uploaded; Python inference parity confirmed (4.60 s
plain English, 2.00 s + 11.05 s with inline IPA).
- [x] End-to-end Swift validation: `magpie download` → `magpie text`
produces audible 22 kHz WAV; `magpie bench` reports stable RTFx medians
on M-series.
- [x] Audio quality validated: ASR-clean on 4/5 speakers; speaker 0
trailing-word artifact diagnosed as fp16 sampler-trajectory drift, not
structural.
- [x] Streaming validated: chunk-level decode yields correct gapless
playback when concatenated; first chunk arrives in ~half the wall-time
of the full synthesis.
- [x] Devin review feedback addressed: `--text` flag handler,
`torch.topk`-faithful tie semantics, `AppLogger.info()` in
`printUsage()`, stale `MagpieComputePlanCommand` removed.

## Companion PR

Conversion pipeline + parity-fixture emitter + manifest generator:
[FluidInference/mobius#44](https://github.com/FluidInference/mobius/pull/44).

## Out of scope (follow-ups — perf is the headline item)

- **Throughput investigation** — current ~0.04 RTFx is the dominant gap.
Suspect surfaces: rank-4 split-K/V scatter ANE residency vs. apparent
GPU fallback, NanoCodec CPU-only export, LocalTransformer per-step
Accelerate path.
- **MLX-backed LocalTransformer** — drop-in replacement for the
Accelerate/BNNS forward pass to put the per-step hot loop on the GPU.
- **CFG perf optimization** — currently doubles per-step decoder cost.
- **Speaker 0 fp16 sampler drift** — investigate whether
higher-precision logits or a small temperature schedule eliminates the
trailing-word artifact.
- Japanese support (OpenJTalk + MeCab dict).
- Streaming NanoCodec via MLState conv-cache (current export is
fixed-window batch; chunked-overlap fallback yields <15 dB SNR —
unviable without proper state caching).
- CI workflow `magpie-benchmark.yml`.
2026-04-28 10:54:00 -04:00
Alex b82d4f2fc8 feat(tts): CosyVoice3 Mandarin zero-shot TTS port (#536)
## Summary

Swift port of **CosyVoice3** (Mandarin zero-shot TTS) wired through the
four validated CoreML mlpackages hosted at
[`FluidInference/CosyVoice3-0.5B-coreml`](https://huggingface.co/FluidInference/CosyVoice3-0.5B-coreml).
Delivered in two layered phases matching the existing Kokoro manager
shape:

- **Phase 1 (parity harness):** full Swift pipeline that ingests a
  Python frontend fixture (`.safetensors`) and produces WAV within parity of the
  Python reference — validates all four CoreML bindings, 24-layer Qwen2
  KV-cache slicing, RAS sampler, and Flow / HiFT wiring.
- **Phase 2 (native frontend):** pure-Swift Qwen2 BPE tokenizer + Qwen2
  text embeddings + minimal Mandarin text normalizer + 24 kHz log-mel
  DSP so callers can synthesize directly from `String` input without a
  Python dependency.

Conversion pipeline that produced the mlpackages lives at
[FluidInference/mobius#42](https://github.com/FluidInference/mobius/pull/42).
Backend documentation:
[`Documentation/TTS/CosyVoice3.md`](./Documentation/TTS/CosyVoice3.md).

> ⚠️ **Backend ships as beta / experimental.** End-to-end synthesis is
> currently slow on Apple Silicon — RTFx < 1.0 typical, several seconds
> of latency for short Mandarin utterances. Cause is partly the Flow CFM
> stage (fp32 / CPU-or-GPU only because fp16 + ANE produces NaNs through
> the fused `layer_norm`) and partly HiFT sinegen / windowing ops that
> fall back to CPU. Treat as preliminary; may be a model issue, may be
> recoverable via better conversion. Warnings surfaced via doc comments,
> runtime `logger.warning` in `initialize()`, and CLI help text.

## What's shipped

### Public API (`Sources/FluidAudio/TTS/CosyVoice3/`)

```swift
public actor CosyVoice3TtsManager {
    public init(directory: URL? = nil, computeUnits: MLComputeUnits = .cpuAndNeuralEngine)
    public static func downloadAndCreate(from repo: Repo = .cosyvoice3,
                                         computeUnits: MLComputeUnits = .cpuAndNeuralEngine)
                                         async throws -> CosyVoice3TtsManager
    public func initialize() async throws
    public func synthesize(text: String,
                           promptAssets: CosyVoice3PromptAssets,
                           options: CosyVoice3SynthesisOptions = .init(),
                           prenormalized: Bool = false) async throws -> CosyVoice3SynthesisResult
}
```

`TtsBackend` gains `case cosyvoice3`; `ModelNames` gets the
`CosyVoice3` enum plus `Repo.cosyvoice3` pointing at the HF repo.

### Pipeline components

| Layer | File | Notes |
|---|---|---|
| Model loader | `Assets/CosyVoice3ModelStore.swift` | Flat + nested layout probing, `.mlmodelc` compile cache |
| Downloader | `Assets/CosyVoice3ResourceDownloader.swift` | `DownloadUtils` wrapper for the 4 mlpackages + embeddings |
| Safetensors | `Shared/SafetensorsReader.swift` | ~170 LoC pure-Swift mmap + fp16/fp32/i32 accessors |
| Prefill/decode | `Pipeline/Synthesize/CosyVoice3Synthesizer.swift` | Actor; in-place `[24,1,2,768,64]` fp16 KV-cache passthrough |
| Sampler | `Pipeline/Synthesize/CosyVoice3RasSampler.swift` | top-p / top-k / repetition mask, seed-tokens bypass |
| Speech embed | `Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift` | Lazy mmap of 6761×896 fp16 table (12 MB) |
| Frontend | `Pipeline/Preprocess/CosyVoice3TextFrontend.swift` | Special-token splitting + lm_input assembly |
| Tokenizer | `Pipeline/Preprocess/Qwen2BpeTokenizer.swift` | tiktoken-compatible byte-level BPE, 151 936 vocab |
| Text embed | `Pipeline/Preprocess/CosyVoice3TextEmbeddings.swift` | 151 936×896 fp16 mmap → row copy |
| TN | `Pipeline/Preprocess/CosyVoice3ChineseNormalizer.swift` | Minimal regex-free port of `frontend_utils.py` |
| Prompt mel | `Pipeline/Preprocess/CosyVoice3PromptMel.swift` | 24 kHz log-mel matching `matcha audio.py` |
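
The "mmap → row copy" pattern for the two embedding tables reduces to computing a byte range per token ID. The helper below is a hypothetical sketch that assumes a row-major layout with 2 bytes per fp16 element; the actual file layout is not described in the PR.

```swift
/// Byte range of one row in a row-major fp16 table (2 bytes per element).
/// Returns nil for a row index outside the table.
func rowByteRange(row: Int, rows: Int, columns: Int) -> Range<Int>? {
    guard row >= 0, row < rows else { return nil }
    let rowBytes = columns * MemoryLayout<UInt16>.size
    let start = row * rowBytes
    return start..<(start + rowBytes)
}

// Speech-embedding table from the table above: 6761 rows of 896 fp16 values.
let lastSpeechRow = rowByteRange(row: 6760, rows: 6761, columns: 896)
```

The upper bound of the last row is the table's total size, 12 115 712 bytes, which agrees with the 12 MB figure quoted for the speech embeddings.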

### CLI (`Sources/FluidAudioCLI/Commands/`)

```
fluidaudio tts --backend cosyvoice3-parity --fixture … --models-dir … --output …
fluidaudio tts --backend cosyvoice3 --text "希望你以后能够做的比我还好用" \
               --prompt-assets … --models-dir … --output …
fluidaudio tts --backend cosyvoice3-tokenizer --fixture …     # BPE parity
fluidaudio tts --backend cosyvoice3-frontend --text …         # lm_input dump
```

`--backend` help text marks `cosyvoice3` as `[BETA — slow, RTFx < 1.0]`
and the dispatcher emits a runtime `logger.warning` so users see the
status without reading docs.

### Tests

- `CosyVoice3ChineseNormalizerTests` — 8 cases covering
  `contains_chinese`,
  `replace_blank`, corner marks, brackets, digit spellout, trailing
  comma collapse, end-to-end, `is_only_punctuation`.
- `CosyVoice3PromptMelTests` — 8 cases covering the matcha frame-count
  formula, zero-audio log floor clamp, 200 Hz sine peak in low mel bins,
  exact reflect-pad semantics, periodic Hann endpoints, mel-basis shape /
  non-zero integrals, token-ratio trimming (and the throws-if-too-short
  path).
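
Two of the DSP properties named in the prompt-mel tests have compact textbook definitions, sketched here. The function names are hypothetical, and the code shows the standard periodic Hann window and reflect padding rather than the PR's implementation.

```swift
import Foundation

/// Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2πn / N). It starts at 0 and
/// does not return to 0 at n = N - 1, unlike the symmetric variant.
func periodicHann(_ length: Int) -> [Double] {
    return (0..<length).map { n -> Double in
        0.5 - 0.5 * cos(2 * Double.pi * Double(n) / Double(length))
    }
}

/// Reflect padding that mirrors around the edge samples without repeating them.
func reflectPad(_ x: [Double], by pad: Int) -> [Double] {
    precondition(pad >= 0 && pad < x.count, "reflect padding needs 0 <= pad < input length")
    guard pad > 0 else { return x }
    let left = (1...pad).reversed().map { x[$0] }
    let right = (1...pad).map { x[x.count - 1 - $0] }
    return left + x + right
}
```

For example, padding `[1, 2, 3, 4]` by 2 gives `[3, 2, 1, 2, 3, 4, 3, 2]`: the edge samples 1 and 4 appear once each.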

### Integration

- `ModelNames.swift` — `CosyVoice3` enum + `Repo.cosyvoice3`
- `TtsBackend.swift` — `case cosyvoice3`
- `TTSCommand.swift` — subcommand wiring
- `Documentation/TTS/CosyVoice3.md` — file roster, call flow, public
  API, CoreML caveats, indexed from `Documentation/README.md`

## Test plan

- [x] `swift build` (release)
- [x] Full `swift test` on this branch: **1 435 tests, 24 skipped, 0
failures** (~13 min)
- [x] `--filter CosyVoice3ChineseNormalizer` — 8/8 pass
- [x] `--filter CosyVoice3PromptMel` — 8/8 pass
- [x] Phase 1 end-to-end parity vs `build/wavs/e2e_shipping.wav` (max|Δ|
< 1e-3, SNR > 40 dB, CPU-only fp32 Flow)
- [x] Phase 2 end-to-end round-trip: Swift output → whisper.base →
expected transcript

## Non-goals / follow-ups

- SpeechTokenizer and CAMPPlus remain Python-side for prompt asset
  preparation; both have CoreML mlpackages but the required DSPs aren't
  yet ported. Users pass pre-computed `promptSpeechIds` / `spkEmbedding`
  in `CosyVoice3PromptAssets` for now.
- Full `wetext.ZhNormalizer` (year / currency / decimals / units) is not
  ported. Callers that need production-grade TN run wetext server-side
  and pass `prenormalized: true`.
- Flow stays fp32 (1.2 GB) until CoreMLTools pins `layer_norm` fused
fp16.

## Updates — Devin review + main merge

Picked up `origin/main` (resolved trivial enum-case merge in
`ModelNames.swift` / `TtsBackend.swift` / `TTSCommand.swift`; both
branches added new cases) and addressed the 12 Devin inline findings:

- **Sendable hygiene** — dropped `@unchecked Sendable` from 9 types.
  `CosyVoice3Synthesizer` is now a proper `actor` (it crosses actor
  boundaries from the manager); `CosyVoice3Models` is plain `: Sendable`
  via `@preconcurrency import CoreML` (matches the existing `TtsModels`
  pattern; the initial drop-to-no-Sendable broke the benchmark CI build
  with `non-sendable result type CosyVoice3Models cannot be sent from
  actor-isolated context`, since it's returned by `store.models()`).
  The remaining types had Sendable conformance dropped entirely since
  they don't escape the owning actor.
- **Prefill stop-token bug** — if the LLM emits an EOS token at step 0
  the synthesizer now throws `predictionFailed` instead of falling
  through into the decode loop and accumulating semantically meaningless
  tokens.
- **HiFT mel slice OOB** — added bounds check on `newMelStart` against
  the actual mel length and clamped `validFrames` to the available
  window; previously a `newMelStart > totalMelFrames` would index the `MLMultiArray`
  out of range during the chunk-packed call path.
- **Production logging** — replaced `print()` stage timings with
  `AppLogger.info`; added `logger.warning` calls in `initialize()` and
  the CLI dispatcher for the beta-status banner.
- **Beta marker** — doc comments on `CosyVoice3TtsManager` and
  `TtsBackend.cosyvoice3` flag the backend as experimental; CLI help
  text annotates the backend label.
- **Documentation** — added `Documentation/TTS/CosyVoice3.md` mirroring
  the Kokoro / PocketTTS doc layout (files, call flow, public API, CLI,
  CoreML caveats, known limits) and indexed it from
  `Documentation/README.md`.
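
The HiFT bounds fix in the list above amounts to clamping a requested window to the frames that exist. The sketch below reuses the variable names from that bullet, but the function itself is hypothetical, and treating a start equal to `totalMelFrames` as invalid is an assumption.

```swift
/// Clamps a requested mel window to the frames that actually exist.
/// Returns nil when the window starts at or past the end of the mel.
func validMelWindow(newMelStart: Int, requestedFrames: Int, totalMelFrames: Int) -> Range<Int>? {
    guard newMelStart >= 0, newMelStart < totalMelFrames else { return nil }
    let validFrames = max(0, min(requestedFrames, totalMelFrames - newMelStart))
    return newMelStart..<(newMelStart + validFrames)
}
```

A caller can then slice only the returned range and handle `nil` explicitly instead of reading past the end of the array.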

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------
2026-04-28 09:57:13 -04:00
Alex eff1752ebf feat(tts/pocket): multi-language support (EN + 9 new packs) (#549)
## Summary

Adds first-class support for the PocketTTS language packs that upstream
`kyutai/pocket-tts` just published (tracking issue #49). Users pick a
language at manager construction; all packs (including English) are
downloaded from `v2/<lang>/` on `FluidInference/pocket-tts-coreml`.

This PR replaces #540 (rebased onto current `main` from a fresh branch).

### Supported languages

| ID              | Layers | HF subtree           |
|-----------------|--------|----------------------|
| `english`       | 6      | `v2/english`         |
| `french_24l`    | 24     | `v2/french_24l`      |
| `german`        | 6      | `v2/german`          |
| `german_24l`    | 24     | `v2/german_24l`      |
| `italian`       | 6      | `v2/italian`         |
| `italian_24l`   | 24     | `v2/italian_24l`     |
| `portuguese`    | 6      | `v2/portuguese`      |
| `portuguese_24l`| 24     | `v2/portuguese_24l`  |
| `spanish`       | 6      | `v2/spanish`         |
| `spanish_24l`   | 24     | `v2/spanish_24l`     |

French ships 24-layer only upstream; no 6-layer French pack exists.

### Per-language artifacts shipped on HF

Each `v2/<lang>/` subtree contains 5 `.mlmodelc` directories +
`constants_bin/`:

| Artifact                  | Precision           | Notes |
|---------------------------|---------------------|-------|
| `cond_step.mlmodelc` | fp16 | conditioning prefill (voice/text → KV cache) |
| `flow_decoder.mlmodelc` | fp16 | flow-matching audio decoder |
| `flowlm_step.mlmodelc` | fp16 | per-token transformer step (default) |
| `flowlm_stepv2.mlmodelc` | **selective int8** | weight-only PTQ on attn + FFN body linears (per kyutai-labs/pocket-tts#147 recipe); EOS head + input embedding stay fp32. Optional smaller variant; **not currently loaded by Swift** but available for client-side swap-in. |
| `mimi_decoder.mlmodelc` | fp16 | Mimi neural codec decoder |

`mimi_encoder.mlmodelc` (voice cloning, language-agnostic) is fetched
lazily, separately from any language pack.

The selective int8 in `flowlm_stepv2` quantizes 4 linears per
transformer layer (`attn_in_proj`, `attn_out_proj`, FFN expand, FFN
contract) via
`coremltools.optimize.torch.quantization.PostTrainingQuantizer`
(per-channel, symmetric, weight-only). Sizes: 6L 145 MB → 74 MB; 24L 1.1
GB → 291 MB.
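
To make "per-channel, symmetric, weight-only" concrete, the sketch below quantizes a single weight row with one scale. It illustrates the general technique only: the function names are hypothetical, and this is not the coremltools `PostTrainingQuantizer` implementation or its exact rounding behavior.

```swift
/// Quantizes one output channel (one weight row) to int8 with a single
/// symmetric scale, so that w ≈ scale * q with q in [-127, 127].
func quantizeRow(_ row: [Float]) -> (q: [Int8], scale: Float) {
    let maxAbs = row.map { abs($0) }.max() ?? 0
    guard maxAbs > 0 else { return ([Int8](repeating: 0, count: row.count), 1) }
    let scale = maxAbs / 127
    let q = row.map { Int8(max(-127, min(127, ($0 / scale).rounded()))) }
    return (q, scale)
}

/// Reconstructs approximate fp32 weights from int8 values and the row scale.
func dequantizeRow(_ q: [Int8], scale: Float) -> [Float] {
    return q.map { Float($0) * scale }
}
```

Each weight is stored in 1 byte instead of 2, plus one scale per row, which is in line with the roughly 2× reduction quoted for the 6-layer pack; the larger 24-layer ratio is not explained by this sketch.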

## Changes

- **`PocketTtsLanguage`**: new enum (10 cases) with `repoSubdirectory`
(always `"v2/<rawValue>"`) and `transformerLayers` (6 or 24).
- **`ModelNames.PocketTTS`**: single `mimiDecoderFile =
"mimi_decoder.mlmodelc"` and single `requiredModels` set covering all
language packs uniformly.
- **`PocketTtsLayerKeys`**: discovers KV-cache I/O names at runtime so
6L and 24L packs share the same inference path. `discover(...)` requires
`expectedLayers: Int` (6 or 24) for early sanity-check.
- **`PocketTtsMimiKeys`**: discovers the Mimi decoder's audio output +
per-state input→output pairing dynamically (pass-through inputs first,
then shape-bucket pairing in canonical order).
- **Voice safetensors prebakes**: every language pack ships
`<voice>.safetensors` containing pre-computed LM transformer KV cache
snapshots (per-layer `[2, 1, seqLen, 16, 64]` F32 + I64 offset).
`PocketTtsConstantsLoader.loadVoiceSnapshot` parses the safetensors
header (8-byte LE u64 + JSON) and extracts per-layer cache + offset
tensors. `PocketTtsSynthesizer.kvCacheStateFromSnapshot` copies K/V
blocks into the runtime `[2, 1, kvCacheMaxLen, 16, 64]` state
independently. Skips the per-token `cond_step` voice prefill.
- **`PocketTtsResourceDownloader`**: `ensureModels(language:)` always
fetches the requested `v2/<lang>/` subtree via
`DownloadUtils.downloadSubdirectory`. `ensureVoice` downloads
`<voice>.safetensors`. `ensureMimiEncoder()` lazily fetches the
language-agnostic encoder for voice cloning without pulling a full
language pack.
- **`PocketTtsModelStore` / `PocketTtsManager` / `PocketTtsSession` /
`PocketTtsSynthesizer`**: language threaded through load + constants +
KV-cache sizing. Voice data is cached per `(language, voice)`. Mimi keys
discovered + cached per language.
- **Voice cloning across languages**: Mimi encoder is shared; cloned
`PocketTtsVoiceData` from one language's manager can be fed to another.
- **CLI**: `fluidaudiocli tts --backend pocket --language <id>` (default
`english`). Unknown values log the supported list and fall back to
English.
- **Docs**: `Documentation/TTS/PocketTTS.md` gains a Languages section +
cross-language cloning example.
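
The safetensors framing mentioned above (8-byte little-endian u64 header length followed by a JSON header) can be sketched as follows. This is a minimal illustration, not the code in `PocketTtsConstantsLoader`; the function name `parseSafetensorsHeader` is hypothetical and the sketch stops at splitting header from payload.

```swift
import Foundation

/// Splits a safetensors blob into its JSON header and the raw tensor payload.
/// Layout: 8-byte little-endian u64 header length, then that many bytes of JSON.
/// (Hypothetical helper; the real loader's API is not shown in this PR text.)
func parseSafetensorsHeader(_ data: Data) -> (header: [String: Any], payload: Data)? {
    guard data.count >= 8 else { return nil }
    // Assemble the u64 byte by byte so the read is alignment- and endian-safe.
    var headerLength: UInt64 = 0
    for (i, byte) in data.prefix(8).enumerated() {
        headerLength |= UInt64(byte) << UInt64(8 * i)
    }
    // Reject lengths that would run past the end of the blob.
    guard headerLength <= UInt64(data.count - 8) else { return nil }
    let start = data.startIndex + 8
    let end = start + Int(headerLength)
    let headerBytes = Data(data[start..<end])
    guard let json = try? JSONSerialization.jsonObject(with: headerBytes),
          let header = json as? [String: Any] else { return nil }
    // Tensor `data_offsets` in the header are relative to this payload.
    return (header, Data(data[end...]))
}
```

Each header entry carries `dtype`, `shape`, and `data_offsets`, so a caller would slice `payload` per tensor after this step.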

## Tests

- `PocketTtsLanguageTests` — pure-logic cases covering
`repoSubdirectory`, `transformerLayers`, and `requiredModels`. No model
download / no network.
- Full PocketTTS test suite: 16/16 passing (`swift test --filter
PocketTts`).

## Test plan

- [x] `swift build` — clean Release build (rebased onto current `main`)
- [x] `swift format lint --recursive --configuration .swift-format` —
clean
- [x] `swift test --filter PocketTts` — 16/16 pass
- [x] Manual end-to-end via FluidAudio Swift CLI for **all 10 language
packs** (fresh HF download → fp16 baseline → swap
`flowlm_stepv2.mlmodelc` → re-synthesize → Parakeet TDT v3 ASR check on
both outputs):

| Language        | fp16 ASR | flowlm_stepv2 (int8) ASR |
|-----------------|---|---|
| english         | ✓ | ✓ |
| spanish         | ✓ | ✓ |
| spanish_24l     | ✓ | ✓ |
| french_24l      | ✓ | ✓ |
| german          | ✓ | ✓ |
| german_24l      | ✓ | ✓ |
| italian         | ✓ | ✓ |
| italian_24l     | ✓ | ✓ |
| portuguese      | ✓ | ✓ |
| portuguese_24l  | ✓ | ✓ |

Selective int8 vs fp16 for `flowlm_step`: 6L 145 MB → 74 MB; 24L 1.1 GB
→ 291 MB.

## Non-goals

- Runtime language switching on a live `PocketTtsManager` (create a new
manager instead).
- Auto-inferring language from text.
- French 6-layer (upstream did not ship it).
- Auto-loading `flowlm_stepv2` (Swift continues to load
`flowlm_step.mlmodelc`/fp16 by default; the int8 variant ships in the
pack so clients can opt in via cache swap, and a future PR can add a
`precision: .fp16 | .int8` selector).

Closes #49
2026-04-27 22:21:43 -04:00
Alexandre Mendonça Alvaro 982f117eb4 fix: avoid misleading confidence warning in SlidingWindowAsrManager.finish() (#548)
### Why is this change needed?
`SlidingWindowAsrManager.finish()` reconstructs final text by calling
`processTranscriptionResult(...)` with empty `timestamps` and
`confidences`.

That path only needs token-to-text reconstruction, but it also runs
confidence calculation, which logs:

`Expected token confidences but got none - this should not happen`

In practice this shows up during normal finalization even though nothing
is actually wrong.

### What changed?
Use `convertTokensToText(accumulatedTokens)` directly in `finish()` when
only the merged final text is needed.

This keeps behavior the same for the returned transcription while
avoiding a misleading warning during normal shutdown.

### Validation
- `swift test --filter SlidingWindowAsrManagerTests`
- Reproduced locally from an app integration path before the patch;
warning no longer appears after the change.

2026-04-27 21:50:42 -04:00
Alex 7c115f6b4e feat(tts/kokoro-ane): add laishere 7-stage CoreML chain (ANE-optimized) (#547)
## Summary

Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the
[laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml)
7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise →
Vocoder → Tail) behind an actor-based facade, used with the upstream
author's permission. Per-stage `MLComputeUnits` assignment routes
Albert/PostAlbert/Alignment/Vocoder to **ANE**; Prosody/Noise/Tail stay
on CPU+GPU for fp32/iSTFT-heavy ops.

The companion mobius PR for the conversion side:
https://github.com/FluidInference/mobius/pull/45

Existing `KokoroTtsManager` (single fp32 model) is untouched. Both
backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo —
KokoroAne lives under the `ANE/` subdirectory.

## What's added

**Module: `Sources/FluidAudio/TTS/KokoroAne/`**
- `KokoroAneManager` — actor facade: `initialize`,
`synthesize(text|phonemes)`, `synthesizeDetailed`
- `KokoroAneSynthesizer` — 7-stage orchestration with fp16↔fp32 vImage
boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32`
helpers so each output is fetched once.
- `KokoroAneModelStore` — per-stage MLModel handles + vocab + voice pack
cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so
partial-load failures stay retryable.
- `KokoroAneVoicePack` — `[510, 256]` flat fp32 row indexing (timbre
cols `[0:128]`, style_s cols `[128:256]`)
- `KokoroAneVocab` — IPA → token IDs with BOS/EOS wrap, max 512
- `KokoroAneResourceDownloader` — HF cache management via existing
`DownloadUtils`; also downloads the shared kokoro G2P assets on first
init (see fix below)
- G2P reuses existing `G2PModel.shared`
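
The atomic-commit load described for `KokoroAneModelStore` can be illustrated with a small sketch. Everything here is hypothetical (a plain class instead of the real store, and a generic `Handle` standing in for `MLModel`); it only shows why a local accumulator keeps a failed load retryable.

```swift
/// Hypothetical store; `Handle` stands in for a loaded `MLModel`.
final class StageStore<Handle> {
    private(set) var models: [String: Handle] = [:]
    var isLoaded: Bool { !models.isEmpty }

    func loadIfNeeded(stages: [String], load: (String) throws -> Handle) throws {
        guard !isLoaded else { return }
        // Accumulate locally; a throw here leaves `models` untouched,
        // so the next call starts from a clean state.
        var pending: [String: Handle] = [:]
        for stage in stages {
            pending[stage] = try load(stage)
        }
        // Commit only after every stage loaded successfully.
        models = pending
    }
}
```

Writing into `models` inside the loop instead would leave a half-populated store after a failure, and the `isLoaded` guard would then skip the retry.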

**CLI:**
```bash
fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json]
fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json
```
The `tts-asr-verify` batch command synthesizes each phrase, transcribes
with Parakeet, and emits per-phrase + macro/micro WER with stage
timings.

**Tests** (`Tests/FluidAudioTests/TTS/KokoroAne/`):
- 13 unit tests (vocab, voice pack) — no model deps, run on CI
- 5 E2E tests (synth + ASR roundtrip) — gated by
`FLUIDAUDIO_RUN_KOKOROANE_E2E=1`

**Docs:**
- New `Documentation/TTS/KokoroAne.md` — when-to-pick decision table,
CLI/Swift quick start, per-stage compute targets, voice pack layout,
limits, perf numbers, source links.
- Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the
ANE-resident variant.
- Updated `Documentation/README.md` index, `Documentation/Models.md` TTS
table, `Documentation/API.md` reference, `Documentation/CLI.md` example.

## Verified end-to-end on M2

Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation).
Warm load: ~300ms.

| Phrase | Synth | Audio | RTFx | ASR roundtrip |
|---|---|---|---|---|
| Hello world | 0.47s | 1.65s | 3.5× | "Hello world." (WER 0%) |
| The quick brown fox… | 0.32s | 3.18s | 9.9× | dropped "The" (WER 11%) |
| She had been waiting… | 0.25s | 2.80s | 11.4× | "Shay" misheard (WER 12.5%) |

Aggregate macro WER 7.9%, micro WER 10.5% — error is ASR-side; TTS audio
is intelligible.
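
The gap between the macro and micro figures comes from how phrases are weighted. A minimal sketch, assuming per-phrase word counts of 2, 9, and 8 (inferred from the percentages in the table, not stated in the PR):

```swift
/// Per-phrase error count and reference length (hypothetical type).
struct PhraseScore {
    let errors: Int
    let referenceWords: Int
    var wer: Double { Double(errors) / Double(referenceWords) }
}

/// Macro WER: mean of per-phrase WERs, every phrase weighted equally.
func macroWER(_ scores: [PhraseScore]) -> Double {
    scores.map(\.wer).reduce(0, +) / Double(scores.count)
}

/// Micro WER: total errors over total reference words, longer phrases weigh more.
func microWER(_ scores: [PhraseScore]) -> Double {
    let errors = scores.map(\.errors).reduce(0, +)
    let words = scores.map(\.referenceWords).reduce(0, +)
    return Double(errors) / Double(words)
}

// Assumed counts: 0 errors in 2 words, 1 in 9, 1 in 8.
let scores = [
    PhraseScore(errors: 0, referenceWords: 2),
    PhraseScore(errors: 1, referenceWords: 9),
    PhraseScore(errors: 1, referenceWords: 8),
]
print(macroWER(scores), microWER(scores))
```

Under those assumed counts the macro value is about 0.079 and the micro value is 2/19, about 0.105, consistent with the reported 7.9% and 10.5%.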

Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert
~7-10ms each).

## Devin Review fixes addressed in this PR

- 🔴 **Partial model load wedged the store**
(`KokoroAneModelStore.loadIfNeeded`) — fixed via local `pendingModels`
accumulator + atomic commit, matching `PocketTtsModelStore`.
- 🐛 **G2P models not downloaded standalone** — `G2PModel.loadIfNeeded`
only reads from `~/.cache/fluidaudio/Models/kokoro/` and never
downloads. The kokoroAne download set didn't include G2P, so first-time
`--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic
`vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to
`getRequiredModelNames(.kokoro, …)` and a new
`KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs
before `G2PModel.shared.ensureModelsAvailable()` in
`KokoroAneManager.initialize()`.
- 🟡 **Voice pack off-by-one (false positive)** — verified upstream
`convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly
matching the existing Swift `phonemeCount - 1`. No change.
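
For reference, the row selection discussed in the voice-pack item can be sketched like this. The type and method names are hypothetical; only the `[510, 256]` layout, the `[0:128]` / `[128:256]` column split, and the `phonemeCount - 1` row index come from the description above.

```swift
/// Sketch of reading one row from a flat, row-major `[510, 256]` fp32 voice pack.
struct VoicePackSketch {
    static let rows = 510
    static let columns = 256
    let values: [Float]  // rows * columns entries

    /// The row is chosen by phoneme count, mirroring `voice_pack[len(phonemes) - 1]`.
    func style(phonemeCount: Int) -> (timbre: ArraySlice<Float>, styleS: ArraySlice<Float>)? {
        let row = phonemeCount - 1
        guard values.count == Self.rows * Self.columns,
              (0..<Self.rows).contains(row) else { return nil }
        let base = row * Self.columns
        // Columns [0:128] are timbre, columns [128:256] are style_s.
        return (values[base..<base + 128], values[base + 128..<base + 256])
    }
}
```

Returning `nil` for out-of-range counts is a choice made for this sketch; how the real `KokoroAneVoicePack` handles that case is not stated here.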

## Refactor pass

Internal cleanup applied across the module after the initial
implementation landed:
- `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11
inline `outputShape + outputArray + float16Array` patterns; F0/N shapes
cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in
`outputArray` error reporting.
- `KokoroAneSynthesizer+Conversion`: extracted
`convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers
(eliminates 4× duplicated vImage buffer setup).
- `KokoroAneModelStore`: folded `voicePack(_)` +
`loadVoicePackIfNeeded(_)` into one method; dropped unreachable
post-load guard and dead synthesized-URL throw.
- `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL,
String)` so a malformed top-level JSON object reports parse-failure
instead of file-not-found; removed dead NSNumber bridging fallback.
- `KokoroAneConstants`: dropped unused `defaultLanguage`,
`voicePackTimbreSlice`, `voicePackStyleSSlice`. Changed `defaultSpeed`
from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg
sites).
- `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)` —
`KokoroAneVocab.encode` silently drops unknown chars per the upstream
Python convention.

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAne` — 13 unit tests pass, 5 E2E gated
- [x] With models staged at
`~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`:
- [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne` —
all 18 pass
- [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane
--output /tmp/ane.wav --metrics /tmp/m.json` — produces non-silent audio
+ metrics with WER
- [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt
--output-json /tmp/r.json` — aggregate WER ≤ 0.20

## Models

`FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/`
subdirectory:
```
ANE/KokoroAlbert.mlmodelc       fp16 + int8pal  (CPU+ANE)
ANE/KokoroPostAlbert.mlmodelc   fp16 + int8pal  (CPU+ANE)
ANE/KokoroAlignment.mlmodelc    fp16 + int8pal  (CPU+ANE)
ANE/KokoroProsody.mlmodelc      fp32             (CPU+GPU)
ANE/KokoroNoise.mlmodelc        fp32             (CPU+GPU)
ANE/KokoroVocoder.mlmodelc      fp16 + int8pal   (CPU+ANE)
ANE/KokoroTail.mlmodelc         fp32 + iSTFT     (CPU+GPU)
ANE/vocab.json                  114 IPA tokens
ANE/af_heart.bin                [510, 256] fp32 voice pack
```

G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`,
`g2p_vocab.json`) are pulled from the same repo's root and cached at
`~/.cache/fluidaudio/Models/kokoro/`, shared with the regular
`KokoroTtsManager` backend.

## License

Upstream (laishere) is MIT — carried forward in the mobius PR's LICENSE
file. Used with the upstream author's permission.
2026-04-27 20:08:49 -04:00
Alex d302273d49 fix(diarizer): convert SpeakerManager to actor, Speaker to struct (#528) (#539)
## Summary

Fixes [#528](https://github.com/FluidInference/FluidAudio/issues/528):
heap corruption (`BUG IN CLIENT OF LIBMALLOC: memory corruption of free
block`) and `Potential Structural Swift Concurrency Issue:
unsafeForcedSync called from Swift Concurrent context` warnings in the
diarizer on iOS 26.4 when `DiarizerModels.download()` +
`SpeakerManager.extractSpeakerEmbedding` are called from an async
context under Swift 6 strict concurrency.

**Root cause**
- `SpeakerManager` used `DispatchQueue.sync(flags: .barrier)` →
`unsafeForcedSync` warning when called from a Swift concurrent context.
- `Speaker` was a reference type with mutable `[Float]` embeddings
→ concurrent COW mutations on the embedding buffers corrupted the
heap.

**Fix**: apply the same actor-conversion pattern used for
`AsrManager` in #419:
- `Speaker`: `final class` → `struct` (Sendable value type)
- `SpeakerManager`: class + `DispatchQueue` → `actor`
- `SpeakerOperations` extension: dropped `queue.sync`
- `DiarizerManager`: async-ified methods
- `SpeakerManager.upsertSpeaker(_:)` + `upsertSpeaker(id:...)`: thread
the speaker's `name` through persistence (previously implicit via
class-reference mutation; now required with struct value semantics).
- CLI (`ProcessCommand`, `DiarizationBenchmark`) and all
speaker/diarizer tests updated to `await` the actor-isolated API.
- `testConcurrentAccess` rewritten from
`DispatchQueue.async`/`DispatchGroup` to `withTaskGroup` for structured
concurrency.
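
The shape of the conversion can be sketched with simplified, hypothetical types (`SpeakerSketch`, `SpeakerRegistry`); the real `Speaker` and `SpeakerManager` carry far more state. The sketch shows the value-type speaker, the actor-isolated store, and a `withTaskGroup` stress loop in the style of the rewritten `testConcurrentAccess`.

```swift
/// Value type: each copy owns its embedding, so no buffer is shared across tasks.
struct SpeakerSketch: Sendable {
    let id: String
    var name: String
    var embedding: [Float]
}

/// The actor serializes access, replacing a class guarded by `DispatchQueue.sync`.
actor SpeakerRegistry {
    private var speakers: [String: SpeakerSketch] = [:]

    func upsert(_ speaker: SpeakerSketch) {
        // With value semantics the whole struct, name included, is written back
        // explicitly; there is no shared reference to mutate in place.
        speakers[speaker.id] = speaker
    }

    func speaker(id: String) -> SpeakerSketch? { speakers[id] }
    var count: Int { speakers.count }
}

/// Structured-concurrency stress loop: n concurrent upserts against one actor.
func registerConcurrently(_ n: Int, in registry: SpeakerRegistry) async {
    await withTaskGroup(of: Void.self) { group in
        for i in 0..<n {
            group.addTask {
                let speaker = SpeakerSketch(id: "s\(i)", name: "Speaker \(i)", embedding: [Float(i)])
                await registry.upsert(speaker)
            }
        }
    }
}
```

Callers pay for this with `await` at every access, which is why the CLI and tests had to be updated alongside the library.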

## Test plan

- [x] `swift build`: clean on macOS
- [x] `swift test`: 1435 tests, 0 failures (24 skipped)
- [x] swift-format: no new warnings in touched files
(pre-existing warnings only, unrelated to this change)
- [ ] CI: build + tests + swift-format checks
- [ ] Verify on reporter's iOS 26.4 repro from #528
2026-04-23 22:13:47 -04:00
Alex 2ea0727541 ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512) (#515)
Fixes #512.

## TL;DR

Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google
kropka com" as Cyrillic (`Впиш Гугл к ком.`) because the joint decoder's
top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This
PR adds an **opt-in** script filter: when a caller passes `language:
.polish` (or any other language with a declared script), the decoder
rejects top-1 if it's the wrong script and walks top-K to the
highest-probability candidate matching the expected script.

- **Opt-in**: `language:` defaults to `nil` — zero behavior change for
existing callers.
- **No acoustic-model changes** — this is purely a decoder-side
post-processing step over the joint logits.
- **Requires `JointDecisionv3.mlmodelc`** (exposes top-K outputs).
Auto-downloaded from HuggingFace alongside the other v3 files; falls
back to standard argmax when absent.

## Empirical validation — reporter's own audio

Samples pulled via `gdown --folder <link-from-issue-#512-comment>` from
@tajchert's Drive folder. **`JointDecisionv3.mlmodelc` is loaded in both
columns** — this isolates the Swift filter as the mechanism, not a model
swap.

| sample | ground truth | `language: nil` (current) | `language: .polish` (this PR) |
|---|---|---|---|
| pl | Wpisz Google kropka com | **Впиш Гугл к ком.** | Wpis Google.com. |
| pl2 | Wpisz Google kropka com | **Впиш Гугл крокаком.** | Wpish Google, Com. |
| pl3 | Wpisz Google kropka com | **Впишь куглькрабком.** | VP Kugl.com. |
| pl4 | Wpisz Google kropka com | **Впиш гугл к ком.** | Wpish gugl c. |
| pl5 | Wpisz Google kropka com | **Впиш гугл кракаком.** | Wpish Google Croca kom. |
| pl6 | Wpisz Google kropka com | **Впиш, гугл крокаком.** | Wpish, Google, Com. |
| pl_complex | Cały spichlarz jest ze spiżu | Cały spichlarz jest ze spiżu. | Cały spichlarz jest ze spiżu. |

**6/6 short samples flip Cyrillic → Latin.** `pl_complex` was never
broken (long context → high joint confidence → no drift) and is
unchanged.

## Scope & limitations (important — please don't overclaim)

**This PR fixes the *script* the tokens are drawn from. It does NOT fix
per-word acoustic accuracy.**

| | `language: nil` | `language: .polish` |
|---|---|---|
| Script correct (Latin, not Cyrillic) | ✗ | ✓ (6/6) |
| Word spelling matches ground truth | ✗ | ✗ (6/7 samples still wrong, all of them short) |

The residual errors — `Wpisz` → `Wpish`/`Wpis`, `kropka` → `Croca` /
dropped — are **Parakeet TDT v3 acoustic weaknesses on short Polish
commands**. No amount of output post-processing can turn `Wpish` into
`Wpisz`; that needs better acoustic modeling, a Polish LM rescorer, or
more training data. Out of scope here.

What users actually get by merging:

- Output is visually Polish (Latin script), not pseudo-Russian — works
with locale-aware post-processing, spell-check, and UI rendering
- Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin
substitution
- Opt-in; zero risk for callers who don't pass `language:`

What users do **not** get:

- Higher word accuracy on short Polish/Slavic Latin utterances
- Support for languages outside the `Language` enum (Greek, Maltese,
Hungarian, Turkish, Baltic — their characters fit the Latin Unicode
ranges but aren't exposed; easy follow-up)
- A meaningful FLEURS WER delta — see
[Documentation/fleurs-script-filtering-comparison.md](./Documentation/fleurs-script-filtering-comparison.md);
full sentences aren't in the failure regime

## Implementation

### New
- `Sources/FluidAudio/Shared/ScriptDetection.swift` (new, +112)
- `public enum Language` — 13 Latin (en, es, fr, de, it, pt, ro, pl, cs,
sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr)
  - `public enum Script { case latin, cyrillic }`
- `matches(_:script:)` over Unicode ranges: ASCII (0x20–0x7F), Latin-1
(0xA0–0xFF), Latin Extended-A (0x100–0x17F), **Latin Extended-B
(0x180–0x24F — Romanian ș/ț)**, **Latin Extended Additional
(0x1E00–0x1EFF — Vietnamese)**, Cyrillic (0x400–0x4FF). Strips
SentencePiece boundary marker U+2581 before checking.
- `filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) ->
(tokenId, probability)?` — returns the highest-probability top-K
candidate matching the target script; probability via **softmax over the
top-K subset** with the max-logit stability trick; guarded against top-K
array length mismatch.
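
A compact sketch of the two helpers described above may make the mechanics easier to follow. It is an illustration under assumptions, not the contents of `ScriptDetection.swift`: `ScriptSketch` is a stand-in enum, the vocabulary is modeled as `[Int: String]`, and the Cyrillic check accepts only the Cyrillic block (the real handling of shared ASCII punctuation is not described here).

```swift
import Foundation

enum ScriptSketch { case latin, cyrillic }

/// True when every scalar of `token`, ignoring the SentencePiece marker U+2581,
/// falls inside the Unicode ranges listed for the script.
func matches(_ token: String, script: ScriptSketch) -> Bool {
    let latin: [ClosedRange<UInt32>] = [
        0x20...0x7F, 0xA0...0xFF, 0x100...0x17F, 0x180...0x24F, 0x1E00...0x1EFF,
    ]
    let cyrillic: [ClosedRange<UInt32>] = [0x400...0x4FF]
    let ranges = script == .latin ? latin : cyrillic
    let scalars = token.unicodeScalars.filter { $0.value != 0x2581 }
    return scalars.allSatisfy { s in ranges.contains { $0.contains(s.value) } }
}

/// Highest-probability top-K candidate in the preferred script, or nil if none match.
func filterTopK(
    topKIds: [Int], topKLogits: [Float], vocabulary: [Int: String],
    preferredScript: ScriptSketch
) -> (tokenId: Int, probability: Float)? {
    // Iterate only the common prefix if the two arrays ever differ in length.
    let count = min(topKIds.count, topKLogits.count)
    guard count > 0, let maxLogit = topKLogits.prefix(count).max() else { return nil }
    // Softmax over the top-K subset, shifted by the max logit for stability.
    let exps = topKLogits.prefix(count).map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    var best: (tokenId: Int, probability: Float)? = nil
    for i in 0..<count {
        guard let token = vocabulary[topKIds[i]],
              matches(token, script: preferredScript) else { continue }
        let p = exps[i] / total
        if let current = best, current.probability >= p { continue }
        best = (topKIds[i], p)
    }
    return best
}
```

Because the softmax is taken over the K candidates only, the returned probability is an upper bound on the full-vocabulary probability of the same token.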

### Changed
- `TdtJointDecision` — optional `topKIds` / `topKLogits` fields
(populated by JointDecisionv3 only)
- `TdtDecoderV3` — script filter runs **only when top-1 is already wrong
script**; both decode sites feed `filtered.probability` (a real [0,1])
into `TdtDurationMapping.clampProbability`, not raw logits
- `AsrManager.transcribe(...)` — `language: Language? = nil` plumbed
through all three overloads: `[Float]`, `URL`, `AVAudioPCMBuffer`
- `AsrModels` + `ModelNames` — `requiredModelsV3` set includes
`JointDecisionv3.mlmodelc` so the download utility fetches it on fresh
installs and also backfills it for existing users on next `.v3` load
- CLI — `fluidaudiocli transcribe <file> --language
{en|pl|cs|sk|sl|hr|bs|ro|es|fr|de|it|pt|ru|uk|be|bg|sr}`

### How to try it

```bash
swift run -c release fluidaudiocli transcribe sample.wav --language pl
```

## Model dependency

`JointDecisionv3.mlmodelc` must be present in
`FluidInference/parakeet-tdt-0.6b-v3-coreml` on HuggingFace. It exposes
`top_k_ids` / `top_k_logits` outputs (K=64 in our export) alongside the
standard argmax. When absent, `AsrModels` falls back to
`JointDecision.mlmodelc` and the script filter becomes a no-op —
backward compatible.

**Cache-upgrade verified**: removed `JointDecisionv3.mlmodelc` from a
populated cache, re-ran `--language pl`; the file was auto-fetched and
Polish output was Latin. Existing users pick up the fix on next `.v3`
load without manual intervention.

## Review notes / risky bits

- **Softmax over top-K subset, not the full vocab** — probabilities
won't exactly match a true full-softmax, but K=64 captures ~all the mass
when the model is anywhere near confident. If you prefer, we can expose
the raw top-K logits to callers and let them compute confidence however
they want.
- **Top-1 escape hatch**: filter is only triggered when top-1 fails
`matches(_, script:)`. When top-1 is already correct, nothing is changed
— so we can't regress the common case.
- **Length-mismatch guard** in `filterTopK` uses `min(topKIds.count,
topKLogits.count)`. If CoreML output arrays ever diverge, we iterate the
common prefix instead of crashing.
- **Latin Extended-B (0x0180–0x024F)** was added specifically so
Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional
(0x1E00–0x1EFF) was added for free — helps Vietnamese should anyone want
it later.

## Tests

- `ScriptDetectionTests` — **37 tests**: Unicode range coverage (Latin-1
/ Extended-A / Extended-B / Extended Additional / Cyrillic),
SentencePiece boundary-marker stripping, `filterTopK` happy path,
length-mismatch guard, probability-range invariant,
Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script
rejection
- Build clean; `swift format lint` clean on all touched files
- A/B end-to-end run against reporter's actual Polish audio (table
above)

## Checklist

- [x] Builds clean (`swift build`, `swift build -c release`)
- [x] `swift format lint` clean on touched files
- [x] `ScriptDetectionTests` 37/37 pass
- [x] A/B reproduction on #512 reporter's audio
- [x] Cache-upgrade path verified (JointDecisionv3 auto-fetched on
existing caches)
- [x] CLI accepts all 18 language codes end-to-end
- [ ] CI green

## Follow-ups (not blocking)

- Expose more Latin languages in the enum (Hungarian, Turkish, Baltic,
Maltese) — all character ranges already supported, just need enum cases
- Add `Script.greek` for `el_gr` (separate Unicode range)
- Short-utterance benchmark dataset (FLEURS is the wrong tool — it's all
long sentences where drift doesn't happen)
- Optional: publish a Polish LM rescorer to address the underlying
acoustic-accuracy issue the script filter cannot fix

2026-04-23 17:43:09 -04:00
Alex cc4e712643 feat(asr/cohere): ANE-friendly static-shape decoder (v2) (#537)
## Summary

Adds support for a new Cohere decoder variant —
`cohere_decoder_cache_external_v2` — with **fully static shapes** so
CoreML can dispatch the decoder to the Apple Neural Engine.

- `ModelNames.CohereTranscribe`: adds v2 constants, flips default
`requiredModels` to v2, keeps legacy set as `requiredModelsLegacy`.
- `CoherePipeline.loadModels`: prefers v2 in `decoderDir`, falls back to
v1, clear error if neither present.
- Decode loop already auto-detects the variant from the `attention_mask`
shape (shipped around #487), so nothing changes runtime-side.
- CLI help lists both decoder filenames.
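
The v2-preferred lookup with v1 fallback can be sketched as below. The filenames are the ones named in this PR; the enum, error type, and `resolveDecoder` function are hypothetical and do not reflect the actual `CoherePipeline.loadModels` signature.

```swift
import Foundation

/// Decoder filenames from this PR; the enum itself is a hypothetical stand-in.
enum DecoderVariant: String {
    case v2 = "cohere_decoder_cache_external_v2.mlmodelc"
    case v1 = "cohere_decoder_cache_external.mlmodelc"
}

struct DecoderNotFound: Error {
    let directory: URL
}

/// Prefers the static-shape v2 decoder, falls back to v1, throws if neither exists.
func resolveDecoder(
    in decoderDir: URL, fileManager: FileManager = .default
) throws -> (variant: DecoderVariant, url: URL) {
    for variant in [DecoderVariant.v2, .v1] {
        let url = decoderDir.appendingPathComponent(variant.rawValue)
        if fileManager.fileExists(atPath: url.path) {
            return (variant, url)
        }
    }
    throw DecoderNotFound(directory: decoderDir)
}
```

Carrying the directory in the error lets the caller print which path was searched, which is the kind of clear failure the bullet above asks for.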

v2 artifacts are published at
[`FluidInference/cohere-transcribe-03-2026-coreml/q8`](https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml)
(`cohere_decoder_cache_external_v2.{mlmodelc,mlpackage}`). The existing
v1 decoder remains supported as a fallback.

## Why

The v1 (`RangeDim(1, 108)`) decoder has a dynamic `attention_mask`
length, which blocks ANE dispatch — `computeUnits = .all` silently falls
back to CPU/GPU. v2 fixes the mask at `[1, 1, 1, 108]` and sources the
decode position from `position_id`, letting the full decoder land on
ANE.

Measured with `fluidaudiocli cohere-transcribe` on the same audio (15
tokens, same q8 encoder, 3 warm runs each):

| Decoder | Config | Median decoder time |
|---|---|---:|
| **Static (v2)** | `.all` (ANE) | **2.58 s** |
| Dynamic (v1) | `.all` | 4.13 s |
| Static (v2) | `--cpu-gpu` | 10.02 s |
| Dynamic (v1) | `--cpu-gpu` | 4.32 s |

~1.6× faster decoder end-to-end. The v1 `.all` ≈ v1 `--cpu-gpu` rows
confirm RangeDim blocks ANE. v2 attends over the full 108 slots every
step, so on pure CPU/GPU it's slower — the win is entirely from ANE
residency. Transcripts are byte-identical across configs.

## Test plan

- [x] Smoke test v2-preferred: directory containing only
`cohere_decoder_cache_external_v2.mlmodelc` transcribes
`english_original.wav` correctly.
- [x] Smoke test v1 fallback: directory containing only
`cohere_decoder_cache_external.mlmodelc` transcribes correctly.
- [x] `swift build -c release --product fluidaudiocli` clean.
- [x] `swift format` clean on changed files.
- [ ] Reviewer: run `fluidaudiocli cohere-transcribe <audio> --model-dir
<q8 dir with v2>` to reproduce the ANE speedup.

## Related

- v2 export script (mobius): `export-decoder-cache-external-static.py`
(uncommitted, to land in a follow-up mobius PR).
- HF repo: `FluidInference/cohere-transcribe-03-2026-coreml` now ships
both decoders under `q8/`.
2026-04-23 17:42:34 -04:00
Sachin Desai bd5ba7e1b7 fix abbreviation handling for kokoro (#538)
### Why is this change needed?
This change fixes the following issues:

- Sort the common abbreviations by key length, longest first, so that
e.g. "etc." is matched before "etc"; matching the shorter key first
leaves a stray ".".
- The trailing "\b" fails when the abbreviation ends in a non-word
character: in "Dr." followed by a space, the "." and the space are both
non-word characters, so there is no word boundary to match.
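
Both points can be reproduced with a small sketch. It is not the Kokoro text-normalization code: the function name and abbreviation table are hypothetical, and the negative lookahead is just one possible replacement for the trailing `\b` (the PR text does not say which construct was used).

```swift
import Foundation

/// Expands abbreviations, trying longer keys first and using a lookahead
/// instead of a trailing `\b`. Hypothetical helper for illustration only.
func expandAbbreviations(in text: String, using table: [String: String]) -> String {
    var result = text
    // Longest keys first so "etc." wins over "etc"; ties are broken alphabetically.
    let keys = table.keys.sorted { $0.count != $1.count ? $0.count > $1.count : $0 < $1 }
    for key in keys {
        guard let replacement = table[key] else { continue }
        // The leading \b works because keys start with a letter. The trailing check
        // is a lookahead because \b never matches between "." and a space.
        let pattern = "\\b" + NSRegularExpression.escapedPattern(for: key) + "(?![A-Za-z0-9_])"
        guard let regex = try? NSRegularExpression(pattern: pattern) else { continue }
        let range = NSRange(result.startIndex..., in: result)
        result = regex.stringByReplacingMatches(
            in: result, range: range,
            withTemplate: NSRegularExpression.escapedTemplate(for: replacement))
    }
    return result
}
```

With a table containing both "etc." and "etc", processing the shorter key first would turn "etc." into "et cetera." and keep the period, which is the stray "." described above.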

Co-authored-by: Sachin Desai <sdesai@salesforce.com>
2026-04-23 17:40:26 -04:00
Alex b10bdcb51d feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder) (#487)
## Summary

Adds Cohere Transcribe ASR for 14 languages, shipped as an INT8 encoder
+ FP16 cache-external decoder hybrid (`CoherePipeline`). One CLI for
single-file transcription, one CLI for dataset benchmarking (FLEURS and
LibriSpeech).

## Languages

English, French, German, Spanish, Italian, Portuguese, Dutch, Polish,
Greek, Arabic, Japanese, Chinese (Simplified), Korean, Vietnamese.

## What's added

### Library (`Sources/FluidAudio/ASR/Cohere/`)

- **`CoherePipeline`** — encoder + cache-external decoder runner.
  Allocates the K/V cache host-side (no CoreML State API; iOS 17+),
  applies the
  additive cross-attention mask, and detokenizes via SentencePiece byte
  fallback so CJK comes out as real characters. Accepts separate
  `encoderDir` / `decoderDir` to support the q8/f16 split.
- **`CohereAsrConfig`** — per-language prompt sequences and token IDs;
shared 35 s / 3500-frame audio window and 108-token decoder cache window
constants. The 35 s cap traces directly to upstream `max_audio_clip_s:
35`.
- **`CohereMelSpectrogram`** — 128-mel front-end matching the reference
  model (preemph, Slaney mel, CMVN).
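
The byte-fallback detokenization mentioned for `CoherePipeline` can be sketched as follows. This assumes the common SentencePiece conventions (`<0xNN>` pieces for raw bytes, U+2581 as the word-boundary marker); the function name is hypothetical and the real implementation may differ in details such as special-token handling.

```swift
import Foundation

/// Joins SentencePiece pieces, decoding `<0xNN>` byte-fallback pieces as UTF-8.
func detokenize(_ pieces: [String]) -> String {
    var bytes: [UInt8] = []
    for piece in pieces {
        if piece.count == 6, piece.hasPrefix("<0x"), piece.hasSuffix(">"),
           let byte = UInt8(piece.dropFirst(3).dropLast(), radix: 16) {
            // One raw byte of a character the vocabulary has no piece for.
            bytes.append(byte)
        } else {
            bytes.append(contentsOf: Array(piece.utf8))
        }
    }
    // Decode once at the end so multi-byte characters split across pieces reassemble.
    let text = String(decoding: bytes, as: UTF8.self)
    return text.replacingOccurrences(of: "\u{2581}", with: " ")
        .trimmingCharacters(in: .whitespaces)
}
```

Decoding piece by piece instead would turn each lone byte into a replacement character, which is how CJK output ends up garbled without this step.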

### CLI (`Sources/FluidAudioCLI/Commands/ASR/Cohere/`)

- `fluidaudiocli cohere-transcribe <audio> --language <lang>` —
  single-file transcription. Accepts either `--model-dir` (single dir
  with both encoder and decoder) or `--encoder-dir` + `--decoder-dir`
  for the q8/f16 split.
- `fluidaudiocli cohere-benchmark` — dataset benchmark with
  `--dataset fleurs|librispeech`, `--subset` for LibriSpeech splits,
  `--languages` for FLEURS codes, `--auto-download`, and
  `--checkpoint-every N` (default 100) so long runs persist partial
  results and survive mid-run crashes.

### `ModelNames.swift`

- New `Repo.cohereTranscribeCoreml` →
  `FluidInference/cohere-transcribe-03-2026-coreml/q8`.
- New `ModelNames.CohereTranscribe` enum with `encoder`,
`decoderCacheExternal`, `vocab` and the corresponding `.mlmodelc` paths.

### Documentation
- `Documentation/ASR/Cohere.md` — architecture, API, CLI, LibriSpeech +
  FLEURS results, upstream config provenance (`max_audio_clip_s`,
  `overlap_chunk_second`), comparison vs Cohere's Figure 4 reference
  numbers, caveats.

### FLEURS coverage
- Extends `FleursBenchmark.supportedLanguages` with the 6 non-European
  Cohere languages (`pt_br`, `ar_eg`, `ja_jp`, `cmn_hans_cn`, `ko_kr`,
  `vi_vn`).

## LibriSpeech test-clean (Apple M2 2022, Tahoe 26.0)

Full split, all 2,620 utterances, single-chunk.

| Subset | Samples | WER | CER | RTFx (per-file mean) | RTFx (total audio/compute) |
|---|---:|---:|---:|---:|---:|
| test-clean | 2,620 | **1.77%** | **0.60%** | 2.04× | 1.72× |

5h 24m audio processed in 3h 09m compute (3h 12m wall time including
one-time ~6 min ANE cold-start compile). Competitive with Parakeet TDT
0.6B v3 (~1.7%) and Whisper large-v3 (~1.8%).

## FLEURS results (full splits, single-chunk)

M4 Pro / Tahoe 26.0, 9,911 samples total.

| FLEURS code | Language | Samples | WER | CER | RTFx |
|---|---|---:|---:|---:|---:|
| en_us | English | 647 | 5.63% | 3.19% | 2.49× |
| fr_fr | French | 676 | 6.22% | 3.11% | 2.21× |
| de_de | German | 862 | 5.84% | 2.83% | 1.98× |
| es_419 | Spanish (LATAM) | 908 | 4.53% | 2.40% | 1.34× |
| it_it | Italian | 865 | **4.03%** | 2.04% | **3.15×** |
| pt_br | Portuguese (BR) | 919 | 6.44% | 3.38% | 2.79× |
| nl_nl | Dutch | 364 | 8.07% | 4.14% | 2.04× |
| pl_pl | Polish | 758 | 7.49% | 3.23% | 1.98× |
| el_gr | Greek | 650 | 11.50% | 5.45% | 2.00× |
| ar_eg | Arabic (EG) | 428 | 18.46% | 6.71% | 2.06× |
| ja_jp | Japanese | 650 | 60.13%† | 6.25% | 2.23× |
| cmn_hans_cn | Mandarin | 945 | 98.52%† | 12.01% | 1.85× |
| ko_kr | Korean | 382 | 16.39% | 6.67% | 1.84× |
| vi_vn | Vietnamese | 857 | 9.55% | 6.87% | 1.55× |

†Japanese and Mandarin are written without word boundaries, so WER on the
raw hypothesis is a tokenization artifact — **CER is the real accuracy
metric**. Cohere's own Figure 4 uses CER for zh/ja/ko for the same
reason.
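
A small sketch shows why whitespace-based WER collapses on unsegmented text while CER stays informative. The sentences are illustrative only, not benchmark data, and the functions are a plain Levenshtein implementation rather than the benchmark's scoring code.

```swift
/// Levenshtein distance over any equatable sequence (words or characters).
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    var previous = Array(0...b.count)
    for (i, x) in a.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: b.count)
        for (j, y) in b.enumerated() {
            current[j + 1] = min(
                previous[j + 1] + 1,            // deletion
                current[j] + 1,                 // insertion
                previous[j] + (x == y ? 0 : 1)  // substitution or match
            )
        }
        previous = current
    }
    return previous[b.count]
}

/// WER on whitespace-split words.
func wer(reference: String, hypothesis: String) -> Double {
    let r = reference.split(separator: " ").map(String.init)
    let h = hypothesis.split(separator: " ").map(String.init)
    return Double(editDistance(r, h)) / Double(r.count)
}

/// CER on characters, ignoring spaces.
func cer(reference: String, hypothesis: String) -> Double {
    let r = Array(reference.filter { $0 != " " })
    let h = Array(hypothesis.filter { $0 != " " })
    return Double(editDistance(r, h)) / Double(r.count)
}

// Illustrative pair: one unspaced sentence, two character edits apart.
let reference = "今日は晴れです"
let hypothesis = "今日は晴れでした"
print(wer(reference: reference, hypothesis: hypothesis))
print(cer(reference: reference, hypothesis: hypothesis))
```

The whole sentence counts as a single "word", so any difference yields a WER of 1.0, while the CER here is 2/7.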

## Usage

```swift
let models = try await CoherePipeline.loadModels(
    encoderDir: q8Dir,
    decoderDir: q8Dir,
    vocabDir: q8Dir
)
let pipeline = CoherePipeline()
let result = try await pipeline.transcribe(
    audio: samples,        // 16 kHz mono Float32, up to 35 s
    models: models,
    language: .english
)
```

```bash
# Single file
swift run -c release fluidaudiocli cohere-transcribe audio.wav --language en

# LibriSpeech
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset librispeech --subset test-clean \
    --model-dir /path/to/q8 --auto-download

# FLEURS
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset fleurs --languages en_us,fr_fr --auto-download
```

## HuggingFace

- INT8 hybrid (shipped):
  https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml
  (subdir `q8/`)
- Upstream model:
https://huggingface.co/CohereLabs/cohere-transcribe-03-2026

## Notes

- **35 s single-chunk limit** is baked into the upstream model
  (`max_audio_clip_s: 35` in `cohere-pytorch/config.json`). Upstream
  Python also supports >35 s via 5 s-overlap chunking
  (`overlap_chunk_second: 5`); this port does not implement that wrapper
  yet and skips longer utterances with a warning.
- **Cache-external decoder stays FP16**: INT8 decoder quantization
  regresses quality significantly in testing and is not shipped.

## Test plan

- [x] Library + CLI release build clean
- [x] Single-file transcription via `cohere-transcribe`
- [x] FLEURS en_us sanity (5.63% WER)
- [x] Full 14-language FLEURS benchmark (9,911 samples)
- [x] Full LibriSpeech test-clean benchmark (2,620 samples, WER 1.77%)
- [x] CJK CER validated (word-boundary-agnostic metric for ja/zh)
- [x] Checkpoint-every survives kill mid-run
- [x] `printFinalSummary` no longer aborts on macOS 26
2026-04-23 10:59:07 -04:00
Alex 7c9be31c05 fix(benchmark): repair 3 pre-existing script/download bugs (#534)
## Summary

Three unrelated pre-existing bugs surfaced while validating PR #515. All
of them block `Scripts/parakeet_subset_benchmark.sh --download` from
succeeding, but none are related to the v3 script-filtering work.
Consolidating into one PR since each fix is ~1–3 lines.

### 1. Japanese TDT folder-name mismatch

`Scripts/parakeet_subset_benchmark.sh` verifies the Japanese TDT model
at `$MODELS_DIR/parakeet-tdt-ja/`, but the folder was renamed to
`parakeet-ja` in 4ef33f0b6 (`Repo.parakeetJa.folderName =
"parakeet-ja"`). Result: `verify_assets()` always reported missing
assets even on a fully provisioned machine. One-line rename to match.

### 2. EOU streaming CLI writes to wrong path

`ParakeetEouCommand` had a default / `--use-cache` split where the
default branch produced `$CWD/Models/<chunk>/<chunk>/` (double-nested,
relative to CWD) as the load path, while `downloadModels()` called
`deletingLastPathComponent().deletingLastPathComponent()` then
`DownloadUtils.downloadRepo(repo, to:)` which appended `folderName =
"parakeet-eou-streaming/<chunk>"`. Net effect: files landed at
`$CWD/Models/parakeet-eou-streaming/<chunk>/` while `loadModels()`
looked at `$CWD/Models/<chunk>/<chunk>/` — model load failed silently.

Unified on Application Support (matches every other CoreML model in
FluidAudio). `--use-cache` retained as a no-op flag for backward
compatibility.
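
The path mismatch is easy to reproduce with plain `URL` arithmetic. In this sketch `/work` stands in for `$CWD` and `320ms` for the chunk name; the variable names are hypothetical and only the component operations mirror the description above.

```swift
import Foundation

let chunk = "320ms"
let cwd = URL(fileURLWithPath: "/work", isDirectory: true)  // stand-in for $CWD

// Load path produced by the old default branch: $CWD/Models/<chunk>/<chunk>
let loadPath = cwd
    .appendingPathComponent("Models")
    .appendingPathComponent(chunk)
    .appendingPathComponent(chunk)

// Download side: strip two components, then the repo folder name is appended.
let downloadRoot = loadPath.deletingLastPathComponent().deletingLastPathComponent()
let downloadPath = downloadRoot.appendingPathComponent("parakeet-eou-streaming/\(chunk)")

print(loadPath.path)
print(downloadPath.path)
```

The two printed paths differ, so the files are written to one location and read from another, matching the failure described in this section.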

### 3. earnings22-kws dataset 404

HuggingFace consolidated `argmaxinc/earnings22-kws-golden` into
`argmaxinc/contextual-earnings22`. The old id now returns 404 from the
Datasets-Server REST API (no redirect follow). The new dataset has the
same feature schema (`audio`, `file_id`, `text`, `dictionary`, ...), so
swapping the id is sufficient — no downstream consumer changes needed.

## Test plan

Ran `Scripts/parakeet_subset_benchmark.sh --download` end-to-end:

- [x] `verify_assets` correctly resolves `parakeet-ja/` (all 5 expected
files present)
- [x] EOU warmup: `Models downloaded to ~/Library/Application
Support/FluidAudio/Models/parakeet-eou-streaming/320ms`, 0.00% WER on
warmup file
- [x] earnings22-kws: 1140+ files downloaded (was 0 before), no 404
- [x] `swift build` passes

Out of scope but observed (pre-existing, unrelated):
- `ctc-earnings-benchmark --auto-download` does not actually
auto-download CTC-110m model
- THCHS-30 dataset hit HF IP rate limit (429) — transient
2026-04-21 04:22:18 -04:00
Alex f8badf7899 docs(diarization): update AMI offline benchmarks after #523 fix (#533)
## Summary

Follow-up to #523. The merged fix dramatically improved offline
diarization accuracy on the AMI SDM test set, but the documented
benchmark numbers still reflected the pre-fix buggy pipeline.

### Benchmark impact (full 16-meeting AMI SDM test set, post-#523 merged main)

| Metric | Before #523 (buggy) | After #523 (fixed) | Change |
|---|---:|---:|---:|
| Average DER | 20.74% | **10.62%** | −48.8% |
| Average JER | 47.20% | **17.37%** | −63.2% |
| Average Speaker Error | 13.80% | **3.26%** | −76.4% |
| Correct speaker count | 3 / 16 | **12 / 16** | +9 meetings |
| Catastrophic failures (DER > 40%) | 2 | **0** | −2 |

ES2004d in particular went from 69.4% DER → 11.4% DER.
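
The "Change" column for the percentage metrics is a relative change, which can be checked with a few lines of arithmetic. The helper names below are hypothetical and exist only for this sketch; the input numbers are the ones from the table.

```swift
// Relative change in percent; negative values mean the metric dropped.
func relativeChangePercent(before: Double, after: Double) -> Double {
    (after - before) / before * 100
}

// Round to one decimal place, matching the table's precision.
func roundedToTenth(_ x: Double) -> Double {
    (x * 10).rounded() / 10
}

let derChange = relativeChangePercent(before: 20.74, after: 10.62)
let jerChange = relativeChangePercent(before: 47.20, after: 17.37)
let speakerErrorChange = relativeChangePercent(before: 13.80, after: 3.26)

print(roundedToTenth(derChange),
      roundedToTenth(jerChange),
      roundedToTenth(speakerErrorChange))
// prints "-48.8 -63.2 -76.4"
```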

## Changes

1. **`Documentation/Diarization/BenchmarkAMISubset.md`**
- Offline VBx 4-meeting subset table updated with post-fix numbers
(12.0% avg DER, down from 21.8%)
   - Added full 16-meeting AMI SDM reference (10.62% DER)
   - Summary table Offline VBx row updated

2. **`Documentation/Benchmarks.md`**
- Added full AMI SDM 16-meeting offline results block in the "Offline
diarization pipeline" section (keeps existing VoxConverse numbers)

3. **`Sources/FluidAudioCLI/Commands/DiarizationBenchmark.swift`**
- Added EN2002 and TS3003 series to `allMeetings` so `--dataset ami-sdm
--auto-download` actually enumerates all 16 official test meetings.
Previously, `DatasetDownloader.swift` downloaded 16 WAVs but
`DiarizationBenchmark.swift`'s separate `allMeetings` list only included
8 of them (ES2004 + IS1009), silently skipping EN2002 and TS3003.

## Test plan

- [x] `swift build -c release` passes on branch
- [x] `swift run fluidaudiocli diarization-benchmark --mode offline
--dataset ami-sdm --auto-download` now enumerates all 16 meetings
- [x] Verification run reproduces documented numbers: 10.62% avg DER
across 16 meetings, 12/16 with correct speaker count
- [ ] CI benchmark workflow picks up new numbers on merge

## Related

- Implements follow-up docs for #523
- Closes #513 (original bug report) — already implicitly closed by #523
merge, this PR makes the docs reflect reality
2026-04-20 20:19:46 -04:00
thechatk 30599f9c89 Fix offline diarization pipeline producing single-speaker output (#523)
## Summary

Fixes three bugs in the offline diarization pipeline that caused it to
attribute nearly all segments to a single speaker:

- **`vDSP_mtrans` dimension swap** — `frameCount` and `speakerCount`
arguments were reversed, corrupting transposed speaker masks to be
nearly identical across all speakers
- **Missing activity ratio filter** — the reference pyannote
implementation filters speakers with <20% clean activation, but the
current code only filters completely silent speakers, allowing junk
embeddings through to clustering
- **Soft masks vs binary masks** — the reference derives per-frame
speaker masks from argmax on powerset logits (binary 0/1), but the
current code uses soft probabilistic masks via matrix-vector
multiplication, producing blurred activations
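
To illustrate the first bug: `vDSP_mtrans` takes `M` and `N` as the row and column counts of the *output* matrix, so passing them in the wrong order still writes the right number of elements but scrambles which value lands where. The sketch below uses a plain-Swift stand-in with the same index convention so it runs without Accelerate; the `transpose` helper and the tiny 3-frame, 2-speaker mask are assumptions made for illustration, not the pipeline's code.

```swift
// Plain-Swift stand-in for vDSP_mtrans(A, 1, C, 1, M, N):
// M and N are the row and column counts of the OUTPUT matrix C.
func transpose(_ a: [Float], outRows m: Int, outCols n: Int) -> [Float] {
    precondition(a.count == m * n, "input size must equal M * N")
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            c[i * n + j] = a[j * m + i]
        }
    }
    return c
}

let frameCount = 3
let speakerCount = 2

// Row-major [frame][speaker]: speaker 0 is active in frames 0 and 1,
// speaker 1 is active in frame 2.
let masks: [Float] = [1, 0,
                      1, 0,
                      0, 1]

// Correct: the output is [speaker][frame], so M = speakerCount, N = frameCount.
let correct = transpose(masks, outRows: speakerCount, outCols: frameCount)

// Buggy: swapped arguments. Same element count, wrong layout.
let swapped = transpose(masks, outRows: frameCount, outCols: speakerCount)

print(correct) // [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(swapped) // [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```

Read as two rows of three frames, the swapped result assigns speaker 0 the mask `[1, 0, 0]` and speaker 1 the mask `[0, 1, 1]`, neither of which matches the input.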

## Test results

After all three fixes, diarization works correctly across multiple test
scenarios. On a 467-second, 3-speaker audio file it achieved a
**97% F1 score** against PyTorch pyannote while running about 120x
faster than real time.

## Files changed

- `Sources/FluidAudio/Diarizer/Offline/Extraction/OfflineEmbeddingExtractor.swift`
- `Sources/FluidAudio/Diarizer/Offline/Segmentation/OfflineSegmentationProcessor.swift`

Fixes #513

---------

Co-authored-by: thechatk <275578031+thechatk@users.noreply.github.com>
2026-04-20 19:35:56 -04:00
Alex b789a56609 Fix Japanese TDT model download filename mismatch (#522)
Fixes the infinite re-download loop for Japanese TDT models reported in
#521.

## Problem
The `download()` function was using hardcoded `Names.decoderFile` and
`Names.jointFile` for all model versions. For `.tdtJa`, this downloaded:
- `Decoder.mlmodelc` 
- `JointDecision.mlmodelc`

But `modelsExist()` checks for version-specific filenames:
- `Decoderv2.mlmodelc`
- `Jointerv2.mlmodelc`

This mismatch caused the existence check to fail, triggering cache purge
and re-download in an infinite loop.

## Solution
Use `getModelFileNames(version)` in the download function to get the
correct filenames for each version, matching what `modelsExist()`
expects.
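
A minimal sketch of the idea, assuming a simplified two-case version enum: when download and existence check both derive their file names from the same per-version lookup, they cannot drift apart. The types and functions here (`SketchVersion`, `modelFileNames(for:)`, and so on) are hypothetical stand-ins, not the actual `AsrModels` API; only the file names come from the description above.

```swift
// Hypothetical, simplified version enum for illustration.
enum SketchVersion {
    case standard
    case tdtJa
}

struct ModelFileNames {
    let decoder: String
    let joint: String
}

// Single source of truth for version-specific file names.
func modelFileNames(for version: SketchVersion) -> ModelFileNames {
    switch version {
    case .standard:
        return ModelFileNames(decoder: "Decoder.mlmodelc", joint: "JointDecision.mlmodelc")
    case .tdtJa:
        return ModelFileNames(decoder: "Decoderv2.mlmodelc", joint: "Jointerv2.mlmodelc")
    }
}

// Download and existence check both go through the same lookup.
func filesToDownload(for version: SketchVersion) -> Set<String> {
    let names = modelFileNames(for: version)
    return [names.decoder, names.joint]
}

func modelsExist(for version: SketchVersion, onDisk files: Set<String>) -> Bool {
    let names = modelFileNames(for: version)
    return files.contains(names.decoder) && files.contains(names.joint)
}

let downloaded = filesToDownload(for: .tdtJa)
print(modelsExist(for: .tdtJa, onDisk: downloaded)) // true

// The old behaviour: hardcoded names were downloaded for every version.
let hardcoded = filesToDownload(for: .standard)
print(modelsExist(for: .tdtJa, onDisk: hardcoded)) // false, which triggered the re-download
```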

## Testing
- [x] Build passes
- [x] Filenames now match between download and existence check
2026-04-20 17:56:10 -04:00
Phoenix 3dc57c83b3 Fix: Lower ASR minimum audio guard from 1s to 300ms (#531)
Short single-word utterances (e.g. "yes", "no", "stop") are typically
500-700ms and were silently rejected by AsrManager with
ASRError.invalidAudioData before transcription ran. Lower the guard to
300ms so these reach the model.

Adds ASRConstants.minimumAudioDurationSeconds and a companion
minimumRequiredSamples(forSampleRate:) helper, mirroring the existing
calculateEncoderFrames(from:) pattern so the arithmetic lives next to
the other ASR audio math rather than being inlined at each guard.

## Summary
- Lower the minimum audio length accepted by `AsrManager` from 1s
(16,000 samples) to 300ms (4,800 samples) so short single-word
utterances ("yes", "no", "stop") reach the Parakeet TDT model instead of
being silently rejected with `ASRError.invalidAudioData`.
- Add `ASRConstants.minimumAudioDurationSeconds` (= 0.3) and a companion
`minimumRequiredSamples(forSampleRate:)` helper, mirroring the existing
`calculateEncoderFrames(from:)` pattern so the arithmetic is defined
once next to the other ASR audio math.
- Update the in-memory guard in `AsrManager+Transcription.swift` and the
disk-backed guard in `AsrManager.swift` to call the helper via a
locally-named `minimumRequiredSamples`.
- Refresh the `ASRError.invalidAudioData` message from "at least 1
second" to "at least 300ms" so it stays truthful.
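
As a rough sketch of the arithmetic behind the new guard: at 16 kHz, 0.3 s corresponds to 4,800 samples. The enum name `ASRConstantsSketch` and the `isLongEnough` function are hypothetical; the constant and helper names follow the description above, but the bodies are an assumption about how such a helper could be written.

```swift
enum ASRConstantsSketch {
    /// Minimum audio duration accepted for transcription, in seconds.
    static let minimumAudioDurationSeconds: Double = 0.3

    /// Number of samples that `minimumAudioDurationSeconds` spans at a given rate.
    static func minimumRequiredSamples(forSampleRate sampleRate: Int) -> Int {
        Int((Double(sampleRate) * minimumAudioDurationSeconds).rounded(.up))
    }
}

// Guard in the spirit of the one described above.
func isLongEnough(sampleCount: Int, sampleRate: Int = 16_000) -> Bool {
    sampleCount >= ASRConstantsSketch.minimumRequiredSamples(forSampleRate: sampleRate)
}

let minimumSamples = ASRConstantsSketch.minimumRequiredSamples(forSampleRate: 16_000)
print(minimumSamples)                    // 4800
print(isLongEnough(sampleCount: 8_000))  // true: a 500 ms "yes" now passes
print(isLongEnough(sampleCount: 3_200))  // false: 200 ms is still rejected
```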
  
## Files changed
- `Sources/FluidAudio/Shared/ASRConstants.swift` — add
`minimumAudioDurationSeconds` constant and
`minimumRequiredSamples(forSampleRate:)`
  helper.
- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager+Transcription.swift`
— in-memory guard now uses the helper.
- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager.swift` —
disk-backed guard (`transcribeDiskBacked`) now uses the helper.
- `Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift` — update
`ASRError.invalidAudioData` message to reflect the new threshold.
- `Tests/FluidAudioTests/ASR/Parakeet/ASRConstantsTests.swift` — add
`testMinimumAudioDurationSeconds` covering both the constant and the
  helper.

## Why is this change needed?
<img width="849" height="630" alt="image"
src="https://github.com/user-attachments/assets/a4adba61-4f37-4bd6-b341-554362cb2e32"
/>

https://github.com/altic-dev/FluidVoice/issues/276


https://github.com/altic-dev/FluidAudio/blob/f3dba78a23cb706d01c889bd54f7efd26871e82e/Sources/FluidAudio/ASR/Parakeet/AsrTranscription.swift#L11


2026-04-18 21:43:57 -04:00
Alex 4ef33f0b64 Fix Japanese TDT models and consolidate to unified AsrModels API (#521)
## Summary
- Fixes TDT Japanese model downloads by returning union of both CTC and
TDT required models
- Enables Japanese TDT models to work with `AsrModels`/`AsrManager` for
timing information
- **Removes all redundant Japanese-specific managers** (TdtJaManager,
CtcJaManager) - consolidates to unified AsrModels path

## Problems Fixed

### Problem 1: Model Download (Issue #517)
The `parakeet-ctc-0.6b-ja-coreml` repository contains both CTC models
(`CtcDecoder.mlmodelc`) and TDT models (`Decoderv2.mlmodelc`,
`Jointerv2.mlmodelc`). When `TdtJaModels` attempts to download from
`Repo.parakeetJa`, the `getRequiredModelNames()` function was only
returning `CTCJa.requiredModels`, which doesn't include the TDT-specific
models.

This caused TDT Japanese models to fail downloading with "Model file not
found" errors.

### Problem 2: AsrModels File Name Mismatch
`AsrModels` used hardcoded file names from `ModelNames.ASR` (expecting
`Decoder.mlmodelc` and `JointDecision.mlmodelc`), which didn't match
Japanese TDT model files (`Decoderv2.mlmodelc`, `Jointerv2.mlmodelc`).

This prevented users from loading `.tdtJa` with `AsrModels`/`AsrManager`
to get timing information.

### Problem 3: Code Duplication
Japanese models had 4 specialized managers (TdtJaManager, CtcJaManager,
TdtJaModels, CtcJaModels) that duplicated functionality and didn't match
the pattern used by other TDT variants (v2, v3, tdtCtc110m all use
AsrModels directly).

## Solution

### Fix 1: Model Downloads (ModelNames.swift)
Updated `ModelNames.swift` lines 675-677 to return the union of both
model sets:

```swift
case .parakeetJa:
    // Repo contains BOTH CTC and TDT models - return union of both sets
    return ModelNames.CTCJa.requiredModels.union(ModelNames.TDTJa.requiredModels)
```

This ensures all 5 models are downloaded:
- Preprocessor.mlmodelc (shared)
- Encoder.mlmodelc (shared)
- CtcDecoder.mlmodelc (CTC only)
- Decoderv2.mlmodelc (TDT only)
- Jointerv2.mlmodelc (TDT only)
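
The effect of the union can be checked in isolation. In this sketch the two sets are assumptions reconstructed from the list above (the actual contents of `CTCJa.requiredModels` and `TDTJa.requiredModels` are not shown here); shared files appear once in the result.

```swift
// Assumed contents of the two required-model sets, based on the list above.
let ctcJaRequired: Set<String> = [
    "Preprocessor.mlmodelc",
    "Encoder.mlmodelc",
    "CtcDecoder.mlmodelc",
]
let tdtJaRequired: Set<String> = [
    "Preprocessor.mlmodelc",
    "Encoder.mlmodelc",
    "Decoderv2.mlmodelc",
    "Jointerv2.mlmodelc",
]

// Set.union de-duplicates the shared preprocessor and encoder.
let required = ctcJaRequired.union(tdtJaRequired)

print(required.count) // 5
print(required.sorted())
```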

### Fix 2: Version-Specific Model File Names (AsrModels.swift)
- Added `getModelFileNames()` to return version-specific decoder, joint,
and vocabulary file names
- Added `getRequiredModels()` to return version-specific model sets
- Updated `load()`, `loadVocabulary()`, and `modelsExist()` to use
version-specific names

### Fix 3: Remove Redundant Code
**Deleted:**
- `TdtJaManager.swift` - broken, redundant
- `TdtJaModels.swift` - redundant
- `CtcJaManager.swift` - redundant (TDT is superior)
- `CtcJaModels.swift` - redundant
- `AsrModelVersion.ctcJa` enum case - no longer needed
- All related tests (replaced by `AsrModelsTdtJaTests`)

**Updated:**
- JapaneseAsrBenchmark → uses `AsrModels` + `AsrManager`
- Removed `.ctcJa` from version labels and validation

## Result

Clean, unified API for Japanese TDT models that matches other TDT
variants:

```swift
// Load Japanese TDT models
let models = try await AsrModels.load(version: .tdtJa)
let manager = AsrManager(models: models)

// Transcribe with timing info
var state = try TdtDecoderState(decoderLayers: 2)
let result = try await manager.transcribe(url, decoderState: &state)

// Access text and timing information
print(result.text)
print(result.timings)  // Timing info available!
```

## Benefits
1. **Timing information** - Users get token timings via `AsrManager`
(not available in `TdtJaManager`)
2. **Consistency** - Japanese TDT follows same pattern as
v2/v3/tdtCtc110m
3. **Less code** - Removed ~1000 lines of redundant manager code
4. **Single source of truth** - One way to load Japanese TDT models

## Testing
- `CtcJaTests.testCtcJaTranscription` - Full CTC Japanese pipeline
test
- `TdtJaTests.testTdtJaTranscription` - Full TDT Japanese pipeline
test
- `AsrModelsTdtJaTests.testTdtJaWithAsrModels` - TDT Japanese loads
via AsrModels
- `AsrModelsTdtJaTests.testTdtJaWithAsrManager` - TDT Japanese works
with AsrManager
- Build verified with `swift build`

Fixes #517
2026-04-12 15:09:58 -04:00
Alex 044bb0bf8f Refactor: Rename Repo.parakeetCtcJa to Repo.parakeetJa for accuracy (#520)
## Problem

The enum name `Repo.parakeetCtcJa` is misleading because it implies the
repository only contains CTC models, but it actually contains **both CTC
and TDT models**.

## Verified Repository Contents

**`FluidInference/parakeet-ctc-0.6b-ja-coreml`** contains:
- CTC models: `CtcDecoder.mlmodelc`
- TDT v2 models: `Decoderv2.mlmodelc` + `Jointerv2.mlmodelc`
- Shared: `Preprocessor.mlmodelc`, `Encoder.mlmodelc`, `vocab.json`

## Solution

Renamed `Repo.parakeetCtcJa` → `Repo.parakeetJa` to accurately reflect
that it's the Japanese models repository containing both decoder
variants.

## Changes

- **ModelNames.swift**: Renamed enum case from `.parakeetCtcJa` to
`.parakeetJa`
- **AsrModels.swift**: Updated `.ctcJa` and `.tdtJa` to use
`.parakeetJa`
- **CtcJaModels.swift**: Updated repository reference
- **TdtJaModels.swift**: Updated repository reference and added comment

## Testing

- Build succeeds
- Both CTC and TDT Japanese managers now use the correct repository
name

## Related

- Follow-up to #516 and #519
- Addresses naming clarity issue raised by @Josscii
2026-04-12 00:21:30 -04:00
Alex c4aaa5d018 Fix parakeet-ctc-ja download error: Prevent AsrModels from loading CTC-only models (#516)
## Problem

Issue #514 reported that downloading `parakeet-ctc-ja` models would
succeed, but then fail during loading with:
```
[WARN] First load failed: Model file not found: Decoder.mlmodelc
```

### Root Cause

`AsrModels` (designed for TDT models) was incorrectly accepting `.ctcJa`
and `.ctcZhCn` model versions, which use different decoder file names:
- **TDT models** use `Decoder.mlmodelc`
- **Japanese CTC models** use `CtcDecoder.mlmodelc`
- **Chinese CTC models** use `Decoder.mlmodelc` (but with different
structure)

When users tried to load `.ctcJa` models via `AsrModels`:
1. Download succeeded (correct files downloaded: `CtcDecoder.mlmodelc`)
2. Loading failed (looking for wrong file: `Decoder.mlmodelc`)

## Solution

Added validation in `AsrModels.load()` and `AsrModels.download()` to
reject CTC-only model versions with clear error messages that direct
users to the correct manager classes:
- For `.ctcJa` → Use `CtcJaManager`
- For `.ctcZhCn` → Use `CtcZhCnManager`

## Changes

### Modified Files
- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrModels.swift`
  - Added validation at the start of `load()` method
  - Added validation at the start of `download()` method
  - Throws descriptive `AsrModelsError` with guidance to correct manager

- `Tests/FluidAudioTests/ASR/Parakeet/SlidingWindow/TDT/AsrModelsTests.swift`
  - Added 5 new tests for CTC-only model validation
  - Tests verify both `.ctcJa` and `.ctcZhCn` are properly rejected
  - Tests verify error messages contain correct manager class names

## Testing

All 32 tests in `AsrModelsTests` pass, including the new validation
tests:
- `testCtcJaModelRejectsAsrModelsLoad()`
- `testCtcJaModelRejectsAsrModelsDownload()`
- `testCtcZhCnModelRejectsAsrModelsLoad()`
- `testCtcZhCnModelRejectsAsrModelsDownload()`
- `testCtcOnlyModelsAreMarkedCorrectly()`

## Example Error Message

Before (confusing):
```
Model file not found: Decoder.mlmodelc
```

After (clear guidance):
```
CTC-only model .ctcJa must be loaded via CtcJaManager, not AsrModels
```

Closes #514
2026-04-11 23:01:12 -04:00
Alex 421313a5b3 docs: Fix speaker diarization model references from 3.1 to community-1 (#510)
### Why is this change needed?
Clarifies that FluidAudio's speaker diarization is based on
pyannote/speaker-diarization-community-1, not 3.1. Fixes confusion from
issue #508.

### What changed?
- Updated code comment in `SegmentationProcessor.swift`
- Fixed `CLAUDE.md` model source reference
- Clarified `Documentation/Benchmarks.md` that both online/offline use
community-1

### Context
The actual CoreML model at
[FluidInference/speaker-diarization-coreml](https://huggingface.co/FluidInference/speaker-diarization-coreml)
has always been based on community-1, but some documentation incorrectly
referenced 3.1.

Related: #508
2026-04-10 22:45:30 -04:00
Hamza Qayyum fcd80f1085 Parallelize chunked Parakeet batch transcription (#507)
### Why is this change needed?
This PR speeds up Parakeet batch transcription for long audio by
~2.2-2.8x, by parallelizing the existing stateless chunked path. It
doesn't change the streaming/live transcription path.

It adds a configurable `parallelChunkConcurrency` setting to
`ASRConfig`, lets `AsrManager` create worker clones from already-loaded
`AsrModels`, and updates `ChunkProcessor` to send independent chunks
across that worker pool before merging the results with the existing
merge logic.

The important part is that the decoding behavior for each chunk stays
the same. The patch is really about scheduling chunk work in parallel so
the runtime can keep more hardware busy and improve throughput on longer
files.
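
The scheduling idea can be sketched with a Swift task group that keeps at most a fixed number of chunks in flight and stores results by chunk index so the merge order is unchanged. This is an illustration under stated assumptions, not the PR's `ChunkProcessor`: `transcribeChunk` is a hypothetical stand-in for decoding one chunk, and the worker-clone mechanism is not modelled.

```swift
// Hypothetical stand-in for decoding a single independent chunk.
func transcribeChunk(_ index: Int) async -> String {
    "chunk-\(index)"
}

// Runs up to `concurrency` chunks at once and returns results in chunk order.
func transcribeAll(chunkCount: Int, concurrency: Int) async -> [String] {
    var results = [String?](repeating: nil, count: chunkCount)
    await withTaskGroup(of: (Int, String).self) { group in
        var next = 0
        // Prime the pool with the first few chunks.
        while next < min(max(concurrency, 1), chunkCount) {
            let index = next
            group.addTask { (index, await transcribeChunk(index)) }
            next += 1
        }
        // Each finished chunk frees a slot for the next pending one.
        for await (finishedIndex, text) in group {
            results[finishedIndex] = text
            if next < chunkCount {
                let index = next
                group.addTask { (index, await transcribeChunk(index)) }
                next += 1
            }
        }
    }
    return results.compactMap { $0 }
}

let merged = await transcribeAll(chunkCount: 10, concurrency: 4)
print(merged)
```

Because results are written back by index, chunks may finish in any order without affecting the input to the merge step.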

### Validation
Benchmarked on an Apple M3 using a 16 kHz, 16-bit mono WAV file extracted
from [this](https://www.youtube.com/watch?v=GT_sXIUJPUo) video (~1 hour
duration), with 5 runs each for current upstream vs. the PR branch.

| Model | Upstream Avg Time | PR Branch Avg Time | Speedup | Upstream Avg Peak Mem | PR Branch Avg Peak Mem | Delta |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Parakeet v2 | 31.84 s | 11.25 s | 2.83x | 515.9 MiB | 537.4 MiB | +21.4 MiB |
| Parakeet v3 | 31.37 s | 12.75 s | 2.46x | 496.0 MiB | 527.0 MiB | +31.0 MiB |
| Parakeet tdt-ctc-110m | 19.89 s | 9.08 s | 2.19x | 489.6 MiB | 509.2 MiB | +19.7 MiB |

I compared the resulting transcripts and word timings before and after
this change for v2, v3, and `tdt-ctc-110m`, and found no differences. So
based on this one test file at least, the optimization appears safe.

Peak memory footprint was measured with macOS `/usr/bin/time -lp`. While
it does increase, the measured increase is modest relative to the
speedup, so I think it's reasonable to keep `parallelChunkConcurrency`
set to `4` by default rather than make it opt-in.

### `parallelChunkConcurrency` Optimal Value
A default value of `4` for chunk parallelism was chosen because higher
values yielded little to no extra speedup and lower values left speed on
the table, at least on the two devices I tested: an iPhone SE 3 and an
M3 MacBook Air.

### AI Disclosure
OpenAI Codex was used to write the code for this patch.

---------

Co-authored-by: Alex <hanweng9@gmail.com>
2026-04-10 22:34:43 -04:00
Alex 04747b3e77 Standardize model loading API across all ASR managers (#506)
## Summary

Standardizes the model loading API across all ASR managers to reduce
developer cognitive load and improve consistency. This addresses [issue
#457 comment
#4203327648](https://github.com/FluidInference/FluidAudio/issues/457#issuecomment-4203327648).

## Problem

Each ASR manager had different model loading APIs:
- `AsrManager`: `configure(models:)` 
- `SlidingWindowAsrManager`: `startStreaming(models:, source:)`   
- `StreamingEouAsrManager`: `loadModels(modelDir:)` with inconsistent
overloads 
- `StreamingNemotronAsrManager`: `loadModels(modelDir:)` 

This created developer confusion and increased documentation burden.

## Solution

Unified API pattern across all managers:
```swift
// All managers now use consistent naming:
manager.loadModels(from: URL)                    // Load from local directory
manager.loadModels(_ models: PreloadedModels)    // Use pre-loaded models
manager.loadModels(to: URL?, progressHandler:)   // Download and load (optional)
```

### Changes
- **AsrManager**: Added `loadModels(_:)`, deprecated
`configure(models:)`
- **SlidingWindowAsrManager**: Separated model loading from streaming
activation, added CoreML import
- **StreamingEouAsrManager**: Standardized to `loadModels(from:)` 
- **StreamingNemotronAsrManager**: Standardized to `loadModels(from:)`
with download support
- **CLI**: Updated 9 command files to use new APIs

## Verification

Ran full benchmark suite (8 models × 100 files) to verify zero
regression:

| Model | Baseline | Current | Delta | Status |
|-------|----------|---------|-------|--------|
| Parakeet TDT v3 (0.6B) | 2.6% | 2.64% | +0.04% |  |
| Parakeet TDT v2 (0.6B) | 3.8% | 3.79% | -0.01% |  |
| CTC-TDT 110M | 3.6% | 3.56% | -0.04% |  |
| CTC Earnings | 16.54% | 16.55% | +0.01% |  |
| EOU 320ms (120M) | 7.11% | 7.11% | 0.00% |  |
| Nemotron 1120ms (0.6B) | 1.99% | 1.99% | 0.00% |  |
| TDT Japanese (0.6B) | 6.11% | 6.11% | 0.00% |  |
| CTC Chinese (0.6B) | 8.37% | 8.37% | 0.00% |  |

**✓ No WER/CER regressions (all within 0.3% of baseline)**

## Benefits
- Reduced cognitive load - single pattern across all managers
- Cleaner separation of concerns - model loading vs. streaming
activation
- Consistent prepositions - all use `from:` for loading from directory
- Zero performance impact - validated with comprehensive benchmarks
- Backward compatibility - deprecated APIs still work with migration
warnings

2026-04-08 14:14:45 -04:00