502 Commits

Author SHA1 Message Date
Alex 847a985ae4 fix(tts/pocket-tts): repair v1 voice cloning for pocket-tts 2.0.0 (#592) (#601)
## Summary

Fixes #592 — PocketTTS voice cloning produced garbled audio on macOS
after the `pocket-tts==2.0.0` upgrade. v2 (pre-baked KV snapshot) voices
were unaffected — only the v1 path (user audio → `mimi_encoder` →
`cond_step` prefill) was broken.

Two compounding bugs:

### RCA 1 — stale `mimi_encoder`
The `mimi_encoder.mlpackage` originally published on HF was traced
against pre-2.0.0 `pocket-tts` (torch 2.9.1, Float32, scalar output) and
no longer matched the runtime cond_step contract. Re-traced as
`mimi_encoderv2` from `pocket-tts==2.0.0` (torch 2.11.0, Float16, fixed
`[1, 1, 240000]` → `[1, 125, 1024]`). Both files now live at the HF repo
root (legacy file kept for backwards compat); `ModelNames.mimiEncoder`
points at the new one.

### RCA 2 — missing `bos_before_voice` prepend
`pocket-tts` 2.0.0 added a learned 1024-d `flow_lm.bos_before_voice`
buffer that has to be prepended to the audio_prompt during cond_step
prefill. Without it the FlowLM sees a different token distribution than
training. Extracted per-language as `constants_bin/bos_before_voice.bin`
(4096 bytes each, 10 packs × distinct SHA-256s, all verified
byte-for-byte against the HF upload).
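
A minimal sketch of reading the buffer and prepending it, assuming the 4096-byte file holds 1024 little-endian Float32 values (the size is consistent with that, but the byte order is an assumption). `decodeFloat32LE` and `prependBos` are hypothetical helpers for illustration, not the shipped loader:

```swift
import Foundation

/// Decodes a raw little-endian Float32 blob; returns nil on a size mismatch.
/// Hypothetical helper: the byte order is an assumption, not stated in the PR.
func decodeFloat32LE(_ data: Data, expectedCount: Int) -> [Float]? {
    guard data.count == expectedCount * 4 else { return nil }
    var values = [Float](repeating: 0, count: expectedCount)
    for i in 0..<expectedCount {
        let base = data.startIndex + i * 4
        let b0 = UInt32(data[base])
        let b1 = UInt32(data[base + 1])
        let b2 = UInt32(data[base + 2])
        let b3 = UInt32(data[base + 3])
        values[i] = Float(bitPattern: b0 | (b1 << 8) | (b2 << 16) | (b3 << 24))
    }
    return values
}

/// Puts the BOS row ahead of the voice-prompt rows (each row is one 1024-d embedding).
func prependBos(_ bos: [Float], to prompt: [[Float]]) -> [[Float]] {
    [bos] + prompt
}
```

Rejecting any blob that is not exactly 4096 bytes keeps a truncated download from silently shifting the prompt.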

### Swift-side changes
- `PocketTtsVoiceCloner` pads/truncates input to the encoder's fixed
240,000 samples (10 s @ 24 kHz, non-flexible shape) and trims output frames
to real-audio duration so zero-padded frames don't bleed into the
prompt.
- `PocketTtsSynthesizer+KVCache.prefillKVCache` prepends
`bos_before_voice` ahead of the audio_prompt on the v1 path. v2
snapshots skip this — their pre-baked KV cache already encodes the
prefix.
- `PocketTtsResourceDownloader.ensureModels` backfills
`bos_before_voice.bin` for caches that predate this fix (per-file fetch)
instead of forcing a full language-pack re-download.
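
The pad/truncate and frame-trim step can be sketched as below. The 240,000-sample input and 125-frame output come from the shapes quoted above, which gives 1920 samples per frame. Rounding a partial final frame up is an assumption, and the function names are hypothetical:

```swift
let encoderSampleCount = 240_000  // fixed encoder input (10 s @ 24 kHz)
let encoderFrameCount = 125       // fixed encoder output frames
let samplesPerFrame = encoderSampleCount / encoderFrameCount  // 1920

/// Zero-pads or truncates audio to the encoder's fixed input length.
func fitToEncoder(_ samples: [Float]) -> [Float] {
    if samples.count >= encoderSampleCount {
        return Array(samples.prefix(encoderSampleCount))
    }
    return samples + [Float](repeating: 0, count: encoderSampleCount - samples.count)
}

/// Number of output frames that correspond to real (non-padded) audio.
/// Assumption: a partially filled final frame is kept (round up).
func realFrameCount(forSampleCount n: Int) -> Int {
    let clamped = min(max(n, 0), encoderSampleCount)
    return (clamped + samplesPerFrame - 1) / samplesPerFrame
}
```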

Conversion artifacts and per-language SHA-256s documented in
`mobius/models/tts/pocket_tts/coreml/TRIALS.md` (Phase 7).

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter PocketTtsConstantsLoaderTests` — 3 new tests
pass
- [x] `swift format` applied
- [x] E2E v1 cloning: `am_michael.wav` (7.5 s) → 3.92 s @ 24 kHz Int16,
intelligible voice match. KV cache prefill lands at position 113 = 1 BOS
+ 95 voice + 17 text tokens (matches pocket-tts 2.0.0 layout).
- [x] v2 snapshot regression check: default `alba.safetensors` voice
still synthesizes correctly (prefill position 140, no `bos_before_voice`
involvement)
- [x] Backfill path: deleted `bos_before_voice.bin` from cache, re-ran
cloning — file auto-fetched from HF (4096 bytes) before synthesis
- [x] All 10 language packs verified on HF: SHA-256 match between local
extraction and uploaded `v2/<lang>/constants_bin/bos_before_voice.bin`
2026-05-12 08:55:44 -04:00
Benjamin Lee a0092cf163 Fixed LS-EEND Memory Leak + Updated Docs (#605)
1. LS-EEND leaked memory: the autorelease pool was not releasing the
multiarrays properly, and new ones were allocated for every chunk.
Switched to backed output arrays to eliminate the per-chunk allocations.
2. The LS-EEND docs were somewhat stale. Updated them to reflect the new API.

2026-05-12 08:53:59 -04:00
Alex d9d06c731a ci: use CLAUDE_CODE_OAUTH_TOKEN for Claude Code Action (#600)
## Summary

Switches the Claude Code Action auth from `ANTHROPIC_API_KEY` to
`CLAUDE_CODE_OAUTH_TOKEN`, which uses a Claude Max/Pro subscription
instead of pay-per-token API billing.

The PR #599 workflow run failed with:
```
Environment variable validation failed:
- Either ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN is required
```

## Required setup (one-time, maintainer)

```bash
# Generate an OAuth token tied to your Claude account
claude setup-token

# Store it in repo secrets
gh secret set CLAUDE_CODE_OAUTH_TOKEN --repo FluidInference/FluidAudio
# (paste the token when prompted)
```

Verify:
```bash
gh secret list --repo FluidInference/FluidAudio
```

## Test plan

- [ ] Maintainer runs the two commands above to populate the secret
- [ ] After merge, post `@claude help` on a throwaway issue and
confirm the workflow runs without env-var errors
2026-05-11 11:15:55 -04:00
Alex ae1ef30240 ci: add Claude Code Action workflow (#599)
## Summary

Adds `.github/workflows/claude.yml` so the repo can respond to `@claude`
mentions in issues, issue comments, PR reviews, and PR review comments
via
[anthropics/claude-code-action@v1](https://github.com/anthropics/claude-code-action).

Motivation: PR #596 had a reviewer post `@claude review` and nothing
happened because no workflow was wired up. This PR fixes that for future
reviews.

## What it does

- Triggers on `issue_comment`, `pull_request_review_comment`,
`pull_request_review`, `issues` (opened/assigned)
- Job runs only when the body/title contains `@claude` (cheap filter,
prevents wasted runs)
- Uses `ANTHROPIC_API_KEY` repo secret for auth
- Minimal `read` permissions on contents/PRs/issues; `id-token: write`
for OIDC

## Required configuration (repo settings)

Before this workflow can run, a maintainer needs to:

1. Install the [Claude GitHub App](https://github.com/apps/claude) on
`FluidInference/FluidAudio`
2. Add an `ANTHROPIC_API_KEY` secret in repo Settings -> Secrets and
variables -> Actions

Without those, the workflow file is inert (no failed runs, just no-op).

## Test plan

- [ ] Maintainer installs the Claude GitHub App and sets
`ANTHROPIC_API_KEY`
- [ ] After merge, post `@claude help` on a throwaway issue and confirm
the workflow fires
- [ ] Confirm non-`@claude` comments do not trigger the job
2026-05-11 11:05:37 -04:00
TMS 6d4e09fe37 Add Resonant to showcase (#598)
## Summary
- add Resonant to the FluidAudio showcase table

## Validation
- documentation-only change; reviewed README diff
2026-05-11 10:05:36 -04:00
Alex fb8b779380 feat(tts/magpie): warmup API for cold-start mitigation (#60 Track 2) (#595) 2026-05-10 16:51:09 -04:00
Alex 2c45df3035 docs(tts): refresh Benchmarks.md per #590; wire styletts2 + --variant into tts-benchmark (#593)
## Summary

Closes the work tracked in #590: bring `Documentation/TTS/Benchmarks.md`
into agreement with what's actually shipped on `main` for CoreML TTS
backends, and add the two CLI affordances needed to benchmark the
in-scope backend × language matrix.

### Doc changes (`Documentation/TTS/Benchmarks.md`)

- Single consolidated **per-backend table** that merges basic info
(license, language+voice, footprint in **GB**, sample rate, max chunk
per pass, streaming flag) with performance metrics (TTFT p50/p95, synth
p50/p95, agg RTFx, peak RSS, WER %, CER %). Five rows: Kokoro ANE en
(`af_heart`), Kokoro ANE zh (`zf_001`), PocketTTS en (`alba` 6L), Magpie
en (`John`, batch-only on `main`), StyleTTS2 en (LibriTTS iteration_3,
zero-shot).
- Dropped from the top-line per scope decision: non-ANE Kokoro,
CosyVoice3 zh, PocketTTS 24L variants, Hindi/Cantonese rows. CosyVoice3
narrative sections (decode budget cap + auto-chunker validation) stay
verbatim.
- Refreshed Kokoro ANE per-stage breakdown (post-laishere 7-graph
chain).
- Replaced the old Magpie per-stage table with a pointer paragraph
(`MagpieSynthesisResult.timings` is still populated for callers; sub-1.5
s TTFA work referenced in #590 lives on `feat/magpie-lt-fusion`, not
`main`).
- Corrected PocketTTS footprint to `fp16 ~0.77 / int8 ~0.55 GB` (was
`~140 / ~520 MB`); enumerated all 10 packs in the corpus matrix; added
zh to the Kokoro ANE corpus row; added a StyleTTS2 row.

### CLI changes
(`Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift`)

- New `styletts2` / `style-tts2` backend wired to
`StyleTTS2Manager.synthesize(text:referenceAudioURL:)`. Requires
`--reference <wav>`; the shipped iteration_3 `ref_encoder` is fixed at
`[1, 1, 80, 231]`, so the reference must be exactly **2.875 s @ 24 kHz
mono** — the harness errors out at predict time on mismatched durations.
- New `--variant {english|mandarin}` flag for `kokoro-ane` so the
`zf_001` Mandarin voice pack can be benchmarked alongside `af_heart`.
Falls back to `english` when unset; the manager constructor now receives
the parsed `KokoroAneVariant` and the default voice is variant-aware.

### Methodology

100-phrase MiniMax-Multilingual on MacBook Air M2 (16 GB, macOS 26, on
AC), `--compute-units default`. English WER/CER via Parakeet TDT
roundtrip; Mandarin CER via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) — macro 4.01% / micro 4.14% across all 100
zh phrases. WER omitted for Mandarin because `WERCalculator` splits on
whitespace.
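
To illustrate why a character-level metric is used for Mandarin, here is a minimal CER sketch built on a plain Levenshtein distance over `Character`s. It is not the repo's `WERCalculator` or the Python script named above; the function names are hypothetical:

```swift
/// Classic two-row Levenshtein distance over any Equatable sequence.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + [Int](repeating: 0, count: b.count)
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            current[j] = min(substitution, previous[j] + 1, current[j - 1] + 1)
        }
        previous = current
    }
    return previous[b.count]
}

/// Character error rate: edit distance divided by reference length.
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference)
    let hyp = Array(hypothesis)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}
```

A Mandarin sentence contains no spaces, so a whitespace split yields a single "word" and any one-character error would score as 100% WER. The character-level rate avoids that.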

## Test plan

- [x] `swift build` clean on `main`-based branch.
- [x] `swift format lint --recursive --configuration .swift-format
Sources/FluidAudioCLI/Commands/TtsBenchmarkCommand.swift` clean.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend
styletts2 --reference ref.wav --corpus minimax-english --output-json
/tmp/styletts2-smoke.json` — produces a valid JSON report.
- [x] Smoke test: `swift run fluidaudio tts-benchmark --backend
kokoro-ane --variant mandarin --voice zf_001 --corpus minimax-chinese
--skip-asr --output-json /tmp/kokoro-zh-smoke.json` — pulls the Mandarin
voice pack and produces audio.
- [x] Full 100-phrase runs for all five table rows produced under
`Benchmarks/tts/runs/590/` (gitignored); table numbers come straight
from those JSON reports.
- [ ] Reviewer cross-check: footnote markers (`*`, `‡`, `∥`, `¶`) in the
consolidated table all have matching paragraphs below.
2026-05-09 21:47:45 -04:00
panv-kw a400080380 Make SpeakerManager a struct and de-async DiarizerManager (#591)
### Why is this change needed?

`DiarizerManager.performCompleteDiarization` is `async`, even though no
asynchronous operations occur when running the models and processing the
results - this is just plain, synchronous computation. It doesn't wait
on the network or anything like that. It is important to be able to
integrate it into other synchronous compute workflows.

The reason it had to be `async` until now is that the `SpeakerManager`
type containing the speaker database was a `class`, meaning that it was
shared mutable state. It was made an `actor` because this shared mutable
state could be mutated concurrently.

But really, there should not be concurrent mutations to the
`SpeakerManager` in the first place. The user of this type,
`DiarizerManager`, is not actually prepared for other code to be
modifying this database while it is using it, and anybody who tries
is almost certainly writing a bug because their code would be logically
racing with `DiarizerManager` and the results would be unpredictable.

This change makes `SpeakerManager` a struct. It has copy-on-write value
semantics because it wraps a `Dictionary` for its storage, and mutations
are marked by the `mutating` keyword and require exclusive ownership of
the variable -- again, just like `Dictionary`. The compiler statically
diagnoses attempts to concurrently mutate the `DiarizerManager`'s
speaker database, so the test for this can be removed (it no longer
compiles).
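
A small illustration of the value semantics described above. `SpeakerDatabase` is a hypothetical stand-in, not the actual `SpeakerManager` API:

```swift
/// Hypothetical stand-in for a struct-based speaker store backed by a Dictionary.
struct SpeakerDatabase {
    private var embeddings: [String: [Float]] = [:]

    var count: Int { embeddings.count }

    /// Mutation requires `var` ownership of the value, exactly like Dictionary.
    mutating func upsert(id: String, embedding: [Float]) {
        embeddings[id] = embedding
    }

    func embedding(for id: String) -> [Float]? { embeddings[id] }
}

var original = SpeakerDatabase()
original.upsert(id: "speaker-1", embedding: [0.1, 0.2])

// Assignment copies the value; later mutations do not leak back.
var copy = original
copy.upsert(id: "speaker-2", embedding: [0.3, 0.4])

// A `let` binding cannot call `upsert` at all: the compiler rejects it.
let frozen = original
```

Here `original.count` stays at 1 while `copy.count` becomes 2, and no synchronization is needed because each variable owns its own value.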

<img width="1108" height="386" alt="Screenshot 2026-05-09 at 19 10 26"
src="https://github.com/user-attachments/assets/04fc3395-7d46-42a8-b035-4d0b559cc8aa"
/>

In summary, this change significantly reduces the cognitive load of
using and maintaining this code, promotes correct usage through static
diagnostics rather than allowing unpredictable results through
concurrent mutation of the speaker database, and enables diarization to
be used in more contexts in more programs.

(BTW, `SpeakerManager` doesn't strictly _need_ to be `Sendable`, but the
previous one was by virtue of being an `actor`, so I marked this one as
being `Sendable` too in case anybody was relying on it. I don't think
the implementation of this type is going to change radically in the
future to the point where that might be a problem)
2026-05-09 17:14:07 -04:00
Alex 3ff5ae2d0c refactor(tts): async StyleTTS2 predict + drop non-native Magpie synthesizeStream (#589) 2026-05-09 12:54:07 -04:00
Alex ce59fb14b8 feat(tts): StyleTTS2 LibriTTS (iteration_3) CoreML backend (#588)
## Summary

Swift port of `mobius/models/tts/styletts2/coreml/inference.py` against
the `FluidInference/StyleTTS-2-coreml/iteration_3/compiled` mlmodelc
assets. New `StyleTTS2Manager` actor exposes the same public shape as
`MagpieTtsManager` / `KokoroSynthesizer`, plus a `--backend styletts2`
route in the CLI.

## Architecture

`StyleTTS2Manager` orchestrates four pieces:
1. `StyleTTS2ModelStore` — actor-managed lazy load of the 8 default
`.mlmodelc` stages plus 6 token-axis bucket variants (T = 64 / 128 / 256
fp16).
2. `StyleTTS2Phonemizer` — wraps shared `MultilingualG2PModel`
(CharsiuG2P) with an espeak-fallback note in the docs;
`synthesize(ipa:)` escape hatch preserves parity for callers that
already have espeak output.
3. `StyleTTS2MelExtractor` — vDSP FFT + 80-bin HTK mel filterbank with
the training-time `sample_rate=16000` quirk for the speaker-reference
path.
4. `StyleTTS2Synthesizer` — drives the 8-stage CoreML graph
(`text_encoder`, `bert`, `ref_encoder`, `fused_diffusion_sampler`,
`duration_predictor`, `fused_f0n_har_source`, `decoder_pre`,
`decoder_upsample`) and returns 24 kHz mono Float32 PCM.

Eager-glue ops (`StyleTTS2GlueOps`) bridge the stages on the CPU side:
sigmoid+round of duration logits, one-hot alignment matrix, BLAS
`cblas_sgemm` matmul, vDSP transpose, HiFi-GAN causal asr-shift, and the
alpha/beta style blend (`s_pred[:, 128:]` / `s_pred[:, :128]`).
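
As a rough sketch of the first two glue ops: the version below sums sigmoid activations over the duration bins, rounds, and clamps to at least one frame, then builds the one-hot alignment matrix. The sum-over-bins and the minimum of 1 follow the upstream StyleTTS2 inference recipe as I understand it and are assumptions here; the function names are hypothetical:

```swift
import Foundation

/// Per-token frame counts from duration logits shaped [tokens][bins].
/// Assumption: sum of sigmoids over bins, rounded, clamped to >= 1.
func predictedDurations(logits: [[Float]]) -> [Int] {
    logits.map { bins in
        let total = bins.reduce(Float(0)) { $0 + 1 / (1 + exp(-$1)) }
        return max(1, Int(total.rounded()))
    }
}

/// One-hot alignment matrix shaped [tokens][frames]:
/// row t is 1 on the consecutive frames assigned to token t.
func alignmentMatrix(durations: [Int]) -> [[Float]] {
    let frames = durations.reduce(0, +)
    var matrix = [[Float]](
        repeating: [Float](repeating: 0, count: frames),
        count: durations.count
    )
    var start = 0
    for (token, duration) in durations.enumerated() {
        for frame in start..<(start + duration) { matrix[token][frame] = 1 }
        start += duration
    }
    return matrix
}
```

Multiplying token-level features by this matrix (the `cblas_sgemm` step) repeats each token's feature vector across its frames.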

The fused diffusion sampler consumes pre-materialized noise —
`StyleTTS2DiffusionSchedule` provides the Karras sigma formula plus a
SplitMix64 + Box-Muller source so a fixed `noiseSeed` reproduces the
same audio.
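
The seeded-noise idea can be sketched as follows. SplitMix64 and Box-Muller are the standard algorithms. The Karras formula is the usual one, sigma_i = (sigma_max^(1/rho) + i/(N-1) * (sigma_min^(1/rho) - sigma_max^(1/rho)))^rho. The type and function names are hypothetical, and the parameter values a caller passes are not taken from the PR:

```swift
import Foundation

/// Standard SplitMix64 generator.
struct SplitMix64 {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }

    /// Uniform Double in (0, 1]; never 0, so log() below stays finite.
    mutating func nextUnit() -> Double {
        (Double(next() >> 11) + 1) / 9_007_199_254_740_992.0  // divide by 2^53
    }
}

/// Box-Muller transform: pairs of uniforms become pairs of standard normals.
func gaussianNoise(count: Int, seed: UInt64) -> [Float] {
    var rng = SplitMix64(seed: seed)
    var out: [Float] = []
    out.reserveCapacity(count)
    while out.count < count {
        let u1 = rng.nextUnit()
        let u2 = rng.nextUnit()
        let radius = sqrt(-2 * log(u1))
        out.append(Float(radius * cos(2 * Double.pi * u2)))
        if out.count < count {
            out.append(Float(radius * sin(2 * Double.pi * u2)))
        }
    }
    return out
}

/// Karras sigma schedule, descending from sigmaMax to sigmaMin.
func karrasSigmas(steps: Int, sigmaMin: Double, sigmaMax: Double, rho: Double) -> [Double] {
    precondition(steps >= 2, "need at least two steps")
    let lo = pow(sigmaMin, 1 / rho)
    let hi = pow(sigmaMax, 1 / rho)
    return (0..<steps).map { i in
        pow(hi + Double(i) / Double(steps - 1) * (lo - hi), rho)
    }
}
```

Because the generator is fully determined by its seed, two calls with the same `seed` and `count` return identical noise, which is what makes a fixed seed reproduce the same audio.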

## CLI

```
swift run fluidaudiocli tts "Hello from StyleTTS2." \
    --backend styletts2 \
    --reference path/to/speaker.wav \
    --output out.wav \
    --alpha 0.3 --beta 0.7 --seed 0
```

`--ipa` overrides the text path with a verbatim IPA string for espeak
parity.

## Test plan

- [x] `swift build` clean
- [x] `swift format lint` clean on touched files
- [x] `swift test --filter StyleTTS2` — 32 / 32 passing
  - `StyleTTS2TextCleanerTests` — symbol vocab + encode round-trip +
    drop-unknown
  - `StyleTTS2GlueOpsTests` — duration rounding, alignment matrix, BLAS
    matmul, transpose, HiFi-GAN shift, alpha/beta blend
  - `StyleTTS2DiffusionScheduleTests` — Karras boundary conditions +
    monotonicity, RNG determinism, Gaussian stats
  - `StyleTTS2MultiArrayTests` — Float32 / Int32 round-trip,
    `extractFloats` for double / int32 backings
- [ ] End-to-end smoke run via `swift run fluidaudiocli tts ...
--backend styletts2 --reference ...` against a downloaded
`iteration_3/compiled` asset bundle
v0.14.5
2026-05-09 00:25:54 -04:00
Greg Young b3a725db3e Fix: Prevent Metal crash when targetTokens is 0 in Kokoro TTS (#586)
Adds a defensive guard against targetTokens == 0 reaching CoreML in the Kokoro TTS pipeline. A zero-length input_ids tensor causes the Metal backend to dispatch compute shaders with threadgroupsPerGrid.width(0), which is an uncatchable assertion failure:

-[MTLDebugComputeCommandEncoder dispatchThreadgroups:threadsPerThreadgroup:]:1377:  failed assertion `(threadgroupsPerGrid.width(0) * ...) must not be 0.'

Changes
1. KokoroSynthesizer.swift — synthesizeChunk() now throws a descriptive TTSError.processingFailed when targetTokens == 0, before any MLMultiArray allocation or model prediction. This converts an uncatchable Metal assertion into a recoverable Swift error.
2. KokoroModelCache.swift — Cached token lengths are clamped with max(1, inferTokenLength(...)) at all 3 caching sites (loadModelsIfNeeded, tokenLength(for:), registerPreloadedModels). Defense-in-depth: although inferTokenLength() already returns a positive value or falls back to 124, this guarantees the cache invariant is locally enforced regardless of future changes to the inference helper.


Testing
- Manual: confirmed synthesizeChunk now throws TTSError.processingFailed instead of trapping when a 0 token length is forced.
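
The two guards can be sketched in isolation like this. `SynthesisError` is a hypothetical stand-in for `TTSError`, and the function names are illustrative only:

```swift
/// Hypothetical stand-in for the library's TTSError.
enum SynthesisError: Error, Equatable {
    case processingFailed(String)
}

/// Rejects a zero (or negative) token count before any tensor is allocated,
/// turning a would-be Metal assertion into a catchable Swift error.
func validatedTokenCount(_ targetTokens: Int) throws -> Int {
    guard targetTokens > 0 else {
        throw SynthesisError.processingFailed(
            "targetTokens must be > 0, got \(targetTokens)")
    }
    return targetTokens
}

/// Defense-in-depth clamp applied where token lengths are cached.
func cachedTokenLength(inferred: Int) -> Int {
    max(1, inferred)
}
```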
2026-05-08 17:56:13 -04:00
Joe Petrakovich 1a27c9de31 Add Utter app to showcase in README.md (#585)
### Why is this change needed?
Adding a showcase app to the readme so people can find it.
2026-05-08 10:58:18 -04:00
local 024bd8e454 chore(tts): remove StyleTTS2 backend, models, and references 2026-05-07 13:32:16 -04:00
Prakash Joshi Pax a53aff438b fix(tts): guard direct Float16 reads with #if arch(arm64) (CosyVoice3, StyleTTS2) (#582)
## Summary

`Float16` is an arm64-only Swift built-in, so any direct `Float16`
typing fails to compile in the x86_64 slice of a Universal build. Four
sites in CosyVoice3 and StyleTTS2 do raw `Float16` pointer binds with no
arch guard, which currently breaks Universal archive builds with errors
like:

```
'Float16' is unavailable in macOS
No exact matches in call to initializer
Failed to produce diagnostic for expression; please submit a bug report
```

Affected sites:
- `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift:65`
  — `assumingMemoryBound(to: Float16.self)` for the fp16 safetensors
  lookup table.
- `Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3Synthesizer.swift:554`
  — fp16 branch of the Flow→HiFT mel copy in `runHiFT`.
- `Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift:288`
  — fp16 case in `sliceFirstAxis2D`.
- `Sources/FluidAudio/TTS/StyleTTS2/Pipeline/StyleTTS2Synthesizer.swift:541`
  — fp16 case in `readMLMultiArrayPrefix`.

This PR mirrors the package's existing pattern for fp16 reads on
non-arm64 (`ASR/Qwen3/Qwen3AsrModels.swift:310`,
`Diarizer/Sortformer/SortformerModelInference.swift:278`): wrap each
`Float16`-touching arm in `#if arch(arm64) ... #endif`.
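
A minimal sketch of the guard pattern. The `#if arch(arm64)` arm uses the native `Float16` type. The `#else` arm here is a software half-to-float decoder chosen so the example runs on any architecture; that fallback is an illustration only and is not what this PR does on x86_64 (see the next section). Both function names are hypothetical:

```swift
/// Decodes IEEE 754 binary16 bits to Float without touching the Float16 type.
func halfBitsToFloat(_ h: UInt16) -> Float {
    let sign = UInt32(h & 0x8000) << 16
    let exponent = UInt32(h >> 10) & 0x1F
    let fraction = UInt32(h & 0x03FF)
    if exponent == 0 {
        // Zero or subnormal: value = fraction * 2^-24.
        let magnitude = Float(fraction) * Float(sign: .plus, exponent: -24, significand: 1)
        return sign == 0 ? magnitude : -magnitude
    }
    if exponent == 0x1F {
        // Infinity or NaN.
        return Float(bitPattern: sign | 0x7F80_0000 | (fraction << 13))
    }
    // Normal number: rebias the exponent from 15 to 127.
    return Float(bitPattern: sign | ((exponent + 112) << 23) | (fraction << 13))
}

/// Reads raw fp16 bit patterns as Float, guarding the Float16-typed arm.
func readHalfs(_ raw: [UInt16]) -> [Float] {
    #if arch(arm64)
    return raw.map { Float(Float16(bitPattern: $0)) }
    #else
    return raw.map(halfBitsToFloat)
    #endif
}
```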

## Behavior on x86_64

- **CosyVoice3SpeechEmbeddings** — throws
`CosyVoice3Error.predictionFailed("requires Apple Silicon (arm64); fp16
lookup table cannot be read on x86_64")`. The `speech_embedding`
safetensors table is fp16-only on disk with no fp32 alternative, so this
matches the Qwen3-ASR posture (its embedding table is also fp16-only on
disk; it `fatalError`s on x86_64).
- **CosyVoice3Synthesizer.runHiFT** — the `case .float16:` arm is
omitted on x86_64. The `case .float32:` path is unchanged. If a Flow
variant emits fp16 at runtime on Intel, control falls into the existing
`default:` arm, which already throws `"runHiFT: unexpected Flow mel
dtype …"`.
- **StyleTTS2Synthesizer (`sliceFirstAxis2D`,
`readMLMultiArrayPrefix`)** — the `case .float16:` arm is omitted on
x86_64. fp16 arrays fall through to the existing NSNumber-bridged
`default:` arm (`arr[i].floatValue` / `fill { arr[$0].floatValue }`),
which already converts fp16 correctly. Slightly slower on Intel; no
behavior regression.

No new error types or dependencies. Diff is +13 lines across 3 files.

## Why this approach (vs. vImage byte-level conversion)

FluidAudio already uses two patterns for cross-arch fp16 handling:
- **`#if arch(arm64)` guard** — used for fp16 *reads* in
`Qwen3AsrModels.swift` and `SortformerModelInference.swift`.
- **vImage `Planar16FtoPlanarF` / `PlanarFtoPlanar16F`** — used for fp16
*writes* in `KokoroAneSynthesizer+Conversion.swift`, `TtsModels.swift`,
`KokoroSynthesizer.swift`, with the explicit comment in
`TtsModels.swift:182`: *"This avoids direct Float16 usage which isn't
available in all build configurations"*.

This PR matches the existing fp16-read precedent (Pattern A). A
follow-up could port these read paths to vImage for full Intel runtime
support (Pattern B), but that's a larger change and would need testing
on Intel hardware. The minimal goal here is unblocking the Universal
compile.

## Test plan

- [x] Universal archive build of a downstream macOS app that links
FluidAudio as a local SPM package now succeeds (failed prior to this
patch with the errors above).
- [ ] CI lint / build on the package itself.
- [ ] No CosyVoice3 / StyleTTS2 runtime regression on Apple Silicon (the
arm64 path is byte-identical to before).
2026-05-05 09:27:27 -04:00
Alex 284ce520f9 feat(tts/magpie): nanocodec v4 (fp32 + int8 palettize) precision (#581)
## Summary

Add `MagpieNanocodecPrecision.fp32Pal` selecting `nanocodec_decoder_v4`:
v3's fp32 architecture with 8-bit kmeans-palettized weights.
Acoustically transparent vs v3 at ~4× smaller on disk and ~11% lower
peak RSS. Same recipe Kokoro Noise uses for `fp32 + int8pal`.

Compute units track precision: `.fp32Pal` pins `.cpuOnly` (palettized
weights dequantize to fp32 at runtime; ANE refuses fp32 / GPU is 50%+
slower than CPU on fp32 codec).

## Bench (M2, 16 GB, .cpuOnly, T_in=24, 5 warmup + 50 timed iters)

| metric              | v3 fp32 | v4 fp32+int8pal | delta  |
|---------------------|---------|-----------------|--------|
| mlpackage on disk   | 121.0MB |          30.9MB |   -74% |
| post-load RSS delta | +59.9MB |         +61.7MB |    eq. |
| peak RSS            | 700.8MB |         621.8MB |   -11% |
| latency median      | 117.6ms |         117.1ms |    eq. |
| latency p95         | 145.9ms |         123.6ms |   -15% |
| RTFx (codec)        |   9.48× |           9.52× |    eq. |
| SNR vs v3 (AR codes)|    inf. |          33.6dB | clean* |

*User-confirmed acoustically transparent on AR-emitted speech.

## Fallback chain

Each candidate carries its own config so the fallback doesn't inherit
the primary's compute-unit selection. fp16 (v2) is only reached when
explicitly requested or when no other candidate is present, since it's
audibly noisy on voiced speech:

| Requested      | Order                    |
|----------------|--------------------------|
| `.fp32Pal`     | v4 → v3 → v2             |
| `.fp32`        | v3 → v4 → v2             |
| `.fp16`        | v2 → v4 → v3             |

If every chunked artifact is missing the loader falls through to legacy
monolithic v1 with `.cpuOnly` (audibly noisy).
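
The table and the v1 fall-through can be expressed as a small lookup. This is a sketch with hypothetical type names (`NanocodecBuild`, `NanocodecPrecision`), and it leaves out the per-candidate compute-unit config:

```swift
enum NanocodecBuild: String { case v1, v2, v3, v4 }   // hypothetical names
enum NanocodecPrecision { case fp32Pal, fp32, fp16 }  // mirrors the table rows

/// Candidate order per requested precision, copied from the table above.
func loadOrder(for precision: NanocodecPrecision) -> [NanocodecBuild] {
    switch precision {
    case .fp32Pal: return [.v4, .v3, .v2]
    case .fp32: return [.v3, .v4, .v2]
    case .fp16: return [.v2, .v4, .v3]
    }
}

/// Picks the first available chunked build, else legacy v1 if present.
func selectBuild(
    for precision: NanocodecPrecision,
    available: Set<NanocodecBuild>
) -> NanocodecBuild? {
    if let hit = loadOrder(for: precision).first(where: { available.contains($0) }) {
        return hit
    }
    return available.contains(.v1) ? .v1 : nil
}
```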

## HF artifacts

Already uploaded to
`FluidInference/magpie-tts-multilingual-357m-coreml`:
- `nanocodec_decoder_v4.mlmodelc/`
- `nanocodec_decoder_v4.mlpackage/`

## Companion PR

mobius converter: https://github.com/FluidInference/mobius/pull/54

## Test plan
- [x] `swift build` green
- [x] `swift test --filter MagpieConstantsTests` 5/5 pass
- [x] `swift format lint` clean for changed files
- [ ] End-to-end `MagpieTtsManager` synth with `.fp32Pal` once HF
artifacts propagate to user caches
2026-05-04 23:22:34 -04:00
Alex 8389c1b714 feat(tts/magpie): nanocodec v1/v2/v3 + decoder_step ANE pin + dual-precision API (#580)
## Summary

Companion to mobius PR FluidInference/mobius#53. Wires the new nanocodec
v2/v3 builds into the FluidAudio Magpie runtime, plus pins
`decoder_step` to ANE for ~2× wall speedup.

## Commits

- `5879a32b3` — `fix(tts/magpie): pin decoder_step to ANE for ~2x
  speedup + correct EOS`
  - `decoder_step.mlmodelc` was running CPU+GPU. Pinning to
    `.cpuAndNeuralEngine` halves wall on M2.
  - EOS handling: don't emit the post-EOS frame.
- `ec7051504` — `feat(tts/magpie): chunked T=24 fp32 nanocodec +
  edge-pad (Phase C v2)`
  - Slide a 24-frame window with stride 8, overlap 16 (= dilated-conv
    input receptive field).
  - Edge-replicate context at sequence boundaries instead of zero-padding
    (zero-pad produces a sharp pop in the first ~30 ms).
- `2f0aab7a7` — `feat(tts/magpie): dual fp16/fp32 nanocodec t24 builds
  via MagpieNanocodecPrecision`
  - New `MagpieNanocodecPrecision` enum (`.fp16` / `.fp32`).
  - Compute-unit dispatch: fp32 → `.cpuOnly` (ANE is fp16-only); fp16 →
    `.cpuAndNeuralEngine` unless caller pinned CPU.
  - Plumbed through `MagpieModelStore.init` and `MagpieTtsManager.init` /
    `downloadAndCreate`.
- `4bd31469f` — `refactor(tts/magpie): nanocodec v1/v2/v3 versioning
  (drop t24 prefix)`
  - Final naming: v1 = legacy mono, v2 = chunked fp16, v3 = chunked fp32.
  - `requiredModels` now lists `nanocodecDecoderV3File` so legacy v1-only
    users auto-upgrade on next bulk fetch.
  - Load chain: primary (precision-matched) → secondary (cross-precision
    warning) → legacy v1 fallback.
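
The windowing from `ec7051504` can be sketched as below. The 24-frame window and stride of 8 come from the commit notes. Splitting the 16 frames of context as 8 on each side is a guess made for illustration, and both function names are hypothetical:

```swift
/// Pads by repeating the first and last frame instead of inserting zeros.
func edgePad<T>(_ frames: [T], left: Int, right: Int) -> [T] {
    guard let first = frames.first, let last = frames.last else { return [] }
    return [T](repeating: first, count: left)
        + frames
        + [T](repeating: last, count: right)
}

/// Start offsets of each window when sliding over `frameCount` frames.
func windowStarts(frameCount: Int, window: Int = 24, hop: Int = 8) -> [Int] {
    guard frameCount >= window else { return [] }
    return Array(stride(from: 0, through: frameCount - window, by: hop))
}
```

Under that 8/8 assumption, a 32-frame sequence pads to 48 frames and yields four windows, one per 8-frame hop, so each window would contribute its central 8 frames.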

## Production state

| Build | File | Precision | Shape | Selector | Audio |
|---|---|---|---|---|---|
| v1 | `nanocodec_decoder.mlmodelc` | fp16 | T=256 monolithic | legacy fallback | noisy + slow |
| v2 | `nanocodec_decoder_v2.mlmodelc` | fp16 | T_in=24 chunked | `MagpieNanocodecPrecision.fp16` | noisy / fast |
| v3 | `nanocodec_decoder_v3.mlmodelc` | fp32 | T_in=24 chunked | `MagpieNanocodecPrecision.fp32` (default) | clean |

All three live on `FluidInference/magpie-tts-multilingual-357m-coreml`.

## Background

Phase F mixed-precision sweep (mobius#53) confirmed no fp16 op/location
combination recovers cleanliness — production stays on v3 (fp32) with v2
as opt-in for throughput-bound callers willing to accept the 27 dB SNR
floor.

## Test plan

- [x] `swift format` clean
- [x] `swift build` clean
- [ ] Sanity-check `swift test --filter MagpieTtsTests` (if present)
- [ ] Spot-check synthesis via CLI on default speaker
2026-05-04 22:35:22 -04:00
Alex bdbff4d88a feat(tts/kokoro-ane/zh): consolidated Mandarin G2P (erhua + jieba HMM + g2pW) (#572 items 1, 3, 4) (#579)
## Summary

Consolidates PRs #574, #575, and #576 into a single landing for Mandarin
G2P enhancements per [issue
#572](https://github.com/FluidInference/FluidAudio/issues/572). All
three features are non-overlapping and stack cleanly inside
`MandarinG2P.phonemize`:

- **Item 3 — Erhua merging** (was #574): folds trailing `儿` into the
previous syllable so `小孩儿` emits a single r-coloured token instead of a
stray `er` tail.
- **Item 4 — Jieba HMM tail** (was #575): re-segments OOV runs of
single-char fallbacks via a 4-state B/M/E/S Viterbi to recover
proper-noun boundaries (`特朗普`, `比特币`); recovered words are then retried
against the phrase dict before per-char fallback.
- **Item 1 — g2pW polyphone disambiguation** (was #576): int8 BERT-base
classifier (152 MB CoreML) picks the right reading for polyphonic Hanzi
(`行`/`长`/`重`/`朝`/…) using the full sentence as context. Best-effort:
falls back to dict-only when assets are missing.

Item 2 (number normalization) already merged via #573. Items 5 (POS
sandhi, #577) and 6 (custom lexicon, #578) remain as separate PRs.

## Pipeline order

```
text
  → MandarinNumberNormalizer.normalize        (already on main)
  → normalizeText (punctuation)
  → segment(): FMM phrases + jieba HMM tail   (NEW: item 4)
  → polyphone disambiguation via g2pW         (NEW: item 1)
  → diacritic → digit (MandarinPinyinNormalizer)
  → MandarinErhua.merge                       (NEW: item 3)
  → MandarinToneSandhi.apply
  → MandarinBopomofoMap.encode
```

## API changes

- `MandarinG2P.phonemize` is now `async throws` (g2pW disambiguation
requires async). This is source-breaking: existing callers must add `try await`.
- `MandarinG2P.init(dict:, jiebaHmm:, g2pw:)` — both new parameters are
optional, default `nil` keeps baseline behaviour.
- New `Segment.bopomofoOverride(String)` case carries g2pW's pre-encoded
bopomofo + tone digit; bypasses sandhi.

## Asset requirements

Pulled from `huggingface.co/FluidInference/kokoro-82m-coreml`:

- `ANE-zh/g2pw/g2pw.mlmodelc/` — bulk `ensureModels` (added to
`requiredModelsZh`)
- `ANE-zh/g2pw/vocab.txt` + `POLYPHONIC_CHARS.txt` —
`ensureMandarinG2pw` lazy fetch
- `ANE-zh/assets/jieba_hmm_{start,trans,emit}.bin` —
`ensureMandarinJiebaHmm` lazy fetch

All three optional asset groups degrade gracefully: missing g2pW falls
back to dict-first reading, missing jieba HMM falls back to per-char
singles.

## Test plan

- [x] `swift build` clean
- [x] 102 tests pass across `MandarinG2PTests`, `MandarinErhuaTests`,
`MandarinJiebaHmmTests`, `MandarinPolyphoneCatalogTests`,
`MandarinBertTokenizerTests`, `MandarinNumberNormalizerTests`
- [x] Polyphone target tracking through jieba HMM resegmentation:
`flushHanziRun` carries absolute char positions so g2pW sees the right
context window
- [x] Backward-compat: `MandarinG2P(dict:)` (no jieba, no g2pW) still
passes baseline tests

## Closes

- #574 (erhua)
- #575 (jieba HMM)
- #576 (g2pW)

Refs #572.
v0.14.4
2026-05-04 01:01:39 -04:00
Alex 684ceaf42b feat(tts/kokoro-ane/zh): POS-aware tone sandhi (#572 item 5) (#577)
## Summary

Issue #572 item 5. The baseline `MandarinToneSandhi` rules are
POS-independent and audibly misfire on three contexts:

- **一 ordinals** (`第一 dì-yī`, `一月 yī-yuè`, `一号`) keep
  tone 1; baseline promotes them to 2/4 unconditionally
- **不 reduplication** (`要不要`, `好不好`, `行不行`) keeps
  `不` at tone 4 inside `[X, 不, X]`; baseline misfires with
  bu4+tone4 → bu2
- **3+3 chains** apply within prosodic words; cross-word 3+3 only
  promotes the word-final syllable. Baseline's pure-run rule
  cascades too far left (`我也想去` → wrong `2 2 2 3` instead of
  correct `2 2 3 4`)
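
The 不 carve-out can be illustrated on its own. This sketch matches the `[X, 不, X]` pattern by surface character only, whereas the real module works from word ranges and POS tags; `Syllable` and `applyBuSandhi` are hypothetical names:

```swift
/// One syllable: the source character plus its tone number.
struct Syllable: Equatable {
    var hanzi: Character
    var tone: Int
}

/// Promotes 不 (tone 4) to tone 2 before another tone-4 syllable,
/// except inside an [X, 不, X] reduplication, where it keeps tone 4.
func applyBuSandhi(_ input: [Syllable]) -> [Syllable] {
    var out = input
    for i in out.indices where out[i].hanzi == "不" && i + 1 < out.count {
        let isReduplication = i > 0 && out[i - 1].hanzi == out[i + 1].hanzi
        if !isReduplication && out[i + 1].tone == 4 {
            out[i].tone = 2
        }
    }
    return out
}
```

With this rule `不要` becomes bu2 yao4, while the 不 in `要不要` stays at tone 4.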

## Design

`MandarinToneSandhiPOS.apply(_:words:tags:)` — pure function, takes
the syllable buffer plus pre-computed word ranges + jieba POS tags.
Backward-compat path stays on `MandarinToneSandhi.apply` for
callers without a POS tagger (existing behavior preserved).

## Test plan

- [x] 一 ordinal carve-outs (`第一`, `一月`)
- [x] 一 contextual sandhi still fires in non-numeral words
      (`一定`, `一起`)
- [x] 不 reduplication keeps tone 4 (`要不要`)
- [x] 不 promotion still fires for non-reduplication (`不要`)
- [x] In-word 3+3 run promotes all but last
- [x] Cross-word 3+3 only promotes the boundary
- [x] Cross-word chain stops at non-3 (`我是你的`)
- [x] Backward-compat for single-word ranges
- [x] `swift build` + `swift format lint` clean
- 14 unit tests, all passing

## Out of scope (follow-up)

- **MandarinG2P routing** to `MandarinToneSandhiPOS` lands once
  PR #575 (jieba HMM + POS tagger tables) merges and the POS tagger
  is loaded by `KokoroAneModelStore.mandarinG2PPipeline`. Until
  then this module is testable in isolation via synthetic POS input.

## Depends on

- #575 — for the POS tagger Viterbi + tables that produce the
  `words`/`tags` arrays at runtime
2026-05-04 00:39:50 -04:00
Alex f202200d1f feat(tts/kokoro-ane): user-supplied Mandarin custom lexicon (#572 item 6) (#578)
## Summary

Issue #572 item 6. Lets app developers ship a project-specific
Mandarin lexicon that overrides both the bundled phrase dict and
g2pW. Useful for proper nouns the bundled dict doesn't cover
(brand names, technical jargon, regionalisms) and for cases where
the user knows the correct reading and wants to bypass any
heuristic.

## Test plan

- [x] Custom lexicon entry overrides phrase dict
- [x] Custom lexicon entry overrides single-char dict
- [x] Empty lexicon = no-op (baseline preserved)
- [x] `swift build` + `swift format lint` clean

## Independent

This PR is independent of #573–#577. Land in any order.
2026-05-04 00:39:37 -04:00
Alex 0ea7c900b0 feat(tts/kokoro-ane/zh): number/date/currency verbalization (#572 item 2) (#573)
## Summary

Issue #572 item 2. Pre-pass that verbalizes numerics, dates, times,
percentages, fractions, and currencies into Hanzi *before*
`MandarinG2P` segments the text. Without it, conversational input
like `¥120`, `2025年5月3日`, `8:30`, `99%` either fragments
into per-digit literals or gets dropped entirely by the segmenter.

- Port misaki/zh/num.py rules: cardinals up to 兆 (10¹²), decimal
  point form, percentages with 百分之, fractions (二分之一), money
  (¥/$/€/¥), dates (YYYY年MM月DD日, YYYY/MM/DD), times (HH:MM[:SS])
- Hook in `MandarinG2P.phonemize` before punctuation normalization
- Pure function, no new public API surface on `KokoroAneManager`
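
To make the verbalization step concrete, here is a small sketch limited to cardinals 0...9999 and whole-number percentages. It is an independent illustration, not the misaki port, and `hanziCardinal` / `verbalizePercent` are hypothetical names:

```swift
let hanziDigits = ["零", "一", "二", "三", "四", "五", "六", "七", "八", "九"]
let hanziUnits = ["", "十", "百", "千"]

/// Verbalizes 0...9999, inserting a single 零 for skipped inner places.
func hanziCardinal(_ n: Int) -> String {
    precondition((0...9999).contains(n), "sketch only covers 0...9999")
    if n == 0 { return hanziDigits[0] }
    var out = ""
    var pendingZero = false
    var rest = n
    for unitIndex in stride(from: 3, through: 0, by: -1) {
        let divisor = [1, 10, 100, 1000][unitIndex]
        let digit = rest / divisor
        rest %= divisor
        if digit == 0 {
            // Remember the gap only once a higher place has been written.
            pendingZero = !out.isEmpty
        } else {
            if pendingZero { out += hanziDigits[0] }
            pendingZero = false
            out += hanziDigits[digit] + hanziUnits[unitIndex]
        }
    }
    // 10...19 are read 十, 十一, ... rather than 一十, 一十一, ...
    if (10...19).contains(n) { out.removeFirst() }
    return out
}

/// Turns "99%" into 百分之九十九; returns nil for anything it does not cover.
func verbalizePercent(_ token: String) -> String? {
    guard token.hasSuffix("%"),
          let n = Int(token.dropLast()),
          (0...9999).contains(n)
    else { return nil }
    return "百分之" + hanziCardinal(n)
}
```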

## Test plan

- [x] `MandarinNumberNormalizerTests` covers cardinals, decimals,
      percentages, fractions, money, dates, times
- [x] `MandarinG2PTests` baseline regression (no behavior change on
      pure-Hanzi input)
- [x] `swift build` + `swift format lint` clean
2026-05-04 00:36:31 -04:00
Benjamin Lee e4ce919762 Finalized DiarizerTimeline segment updates no longer commit tentative segments (#568)
There was a bug that would cause the trailing diarizer segment to
disappear if minFramesOff was nonzero once the person stopped talking.

---------
2026-05-04 00:18:18 -04:00
Alex 98acce358a feat(tts/kokoro-ane): add Mandarin (v1.1-zh) variant (#570)
## Summary

Phase 1 — variant plumbing + phonemes-bypass synthesis for
Kokoro-82M-v1.1-zh on the existing 7-stage CoreML chain. Callers that
supply pre-computed Bopomofo (e.g. via misaki[zh] in Python or a future
Swift G2P) can now synthesize Mandarin audio. Mandarin text-to-Bopomofo
G2P is deferred to a separate Phase 2 PR.

The 7-stage chain is **language-agnostic by construction** — input ids,
voice slices, and per-stage I/O contracts are identical across v1.0
(English) and v1.1-zh (Mandarin). Only the embedding vocab (177 → 171),
the HF subdir (`ANE/` → `ANE-zh/`), the voice-file layout (flat →
`voices/<voice>.bin`), and the default voice (`af_heart` → `zf_001`)
differ.

## Changes

- New `Repo.kokoroAneZh` → `FluidInference/kokoro-82m-coreml/ANE-zh`
  with `subPath = ANE-zh`, `folderName = kokoro-82m-coreml/ANE-zh`.
- `ModelNames.KokoroAne.requiredModelsZh` references `voices/zf_001.bin`
  so the downloader's all-files-present check resolves correctly when the
  file lands at `<repoDir>/voices/zf_001.bin`.
- New `KokoroAneVariant` enum (`.english` / `.mandarin`) with
  `defaultVoice`, `useVoicesSubdir`, and `repo` accessors.
- `KokoroAneResourceDownloader.ensureModels` and `ensureVoicePack`
  accept a `variant` param (default `.english` keeps existing callers
  source-compatible). Mandarin voice fetch creates the `voices/` parent
  directory on demand.
- `KokoroAneModelStore` and `KokoroAneManager` thread the variant
  through to download + load.
- `KokoroAneManager.synthesize(text:)` and `synthesizeDetailed(text:)`
  reject Mandarin with a clear error directing callers to
  `synthesizeFromPhonemes()`. The phonemes-bypass entry point already
  works for any vocab via `vocab.encode → 7-stage chain`.
- CLI `--variant` flag accepts `en` / `english` / `zh` / `mandarin` for
  the `kokoro-ane` backend. Mandarin runs treat the input text as
  pre-computed Bopomofo and call `synthesizeFromPhonemesDetailed`.
- 12 new unit tests (`KokoroAneVariantTests`): variant defaults, repo
  wiring, required-files set routing, manager init signatures, and
  Mandarin text-path rejection on both `synthesize` and
  `synthesizeDetailed`.
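
The variant differences listed above can be sketched as a small enum. The type name, the `subPath` accessor, and the `voiceRelativePath` helper are hypothetical, and the flat English file name `<voice>.bin` is an assumption; only the cases and default values come from this PR description.

```swift
// Sketch only: not the shipped `KokoroAneVariant`; values mirror the PR text.
enum KokoroAneVariantSketch {
    case english
    case mandarin

    var defaultVoice: String {
        switch self {
        case .english: return "af_heart"
        case .mandarin: return "zf_001"
        }
    }

    /// English keeps voice files flat; Mandarin nests them under `voices/`.
    var useVoicesSubdir: Bool { self == .mandarin }

    /// Hugging Face subdirectory holding the variant's model bundle.
    var subPath: String {
        switch self {
        case .english: return "ANE"
        case .mandarin: return "ANE-zh"
        }
    }

    /// Hypothetical helper: relative path of a voice file inside the repo directory.
    func voiceRelativePath(_ voice: String? = nil) -> String {
        let name = (voice ?? defaultVoice) + ".bin"
        return useVoicesSubdir ? "voices/" + name : name
    }
}
```

Keeping the per-variant differences behind accessors is what lets the downloader and manager take a single `variant` parameter with a default.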

End-to-end Mandarin synthesis verified against PyTorch ground truth on
`zf_001` and `zm_009`. Background-noise investigation tracked separately
in #569 (atan2 phase correction in upstream `CoreMLForwardSTFT`).

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAneVariantTests` — 12/12 pass
- [x] `swift format lint` clean (only pre-existing warnings on
`fastV2_1`/`balancedV2_1`/`highContextV2_1` enum cases unrelated to
      this PR)
- [ ] After HF upload of `ANE-zh/` bundle, end-to-end smoke test:
`swift run fluidaudiocli tts "ㄋㄧˇㄏㄠˇㄕˋㄐㄧㄝˋ。" --backend kokoro-ane
--variant zh --voice zf_001 --output /tmp/zh.wav`
- [ ] No regressions on existing English path (default-arg behavior
      preserved)

## Out of scope

- Mandarin text-to-Bopomofo G2P — Phase 2 (separate PR).
- HF upload of `ANE-zh/` bundle — handled outside this repo.
- Updating `Documentation/` with Mandarin voice list — defer to Phase 2
  when the path is fully usable end-to-end.
2026-05-03 22:03:27 -04:00
Benjamin Lee 821e0f97bc Fixed an LS-EEND constructor (#567)
The asynchronous constructor for `LSEENDDiarizer` that simultaneously
loads the model did not update the timeline config's speaker count or
frame duration, as it would've if using
```swift
diarizer = LSEENDDiarizer()
await diarizer.initialize(variant: .dihard3, stepSize: .step500ms)
```

---------
2026-05-02 19:24:54 -04:00
Benjamin Lee 0a9aace382 Fixed short segment filter for trailing tentative segments in DiarizerTimeline (#566)
Apparently I did it incorrectly last time.

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-02 10:11:32 -07:00
Benjamin Lee 5bb84bc0b0 Fix DiarizerTimeline Short Segment Filter (#565)
The `DiarizerTimeline` was incorrectly closing short gaps as soon as
another speech frame appeared, instead of waiting for a sufficiently
long speech segment to merge the old one with.

This bug fix ensures that gaps are only closed between two segments of
sufficient length (at least `config.minFramesOn` frames long).
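
A rough sketch of the corrected rule, assuming segments are sorted half-open frame ranges and that a gap counts as short when it is below `minFramesOff`. The type and function names are hypothetical, not the `DiarizerTimeline` internals.

```swift
struct FrameSegment: Equatable {
    var start: Int  // inclusive frame index
    var end: Int    // exclusive frame index
    var length: Int { end - start }
}

/// Merges neighbours across a short gap only when both sides are long enough.
/// Assumption: "short gap" means fewer than `minFramesOff` silent frames.
func closeShortGaps(
    _ segments: [FrameSegment], minFramesOn: Int, minFramesOff: Int
) -> [FrameSegment] {
    var merged: [FrameSegment] = []
    for segment in segments {
        if let last = merged.last,
            segment.start - last.end < minFramesOff,
            last.length >= minFramesOn,
            segment.length >= minFramesOn
        {
            // Both neighbours qualify, so extend the earlier segment over the gap.
            merged[merged.count - 1].end = segment.end
        } else {
            merged.append(segment)
        }
    }
    return merged
}
```

With `minFramesOn = 5` and `minFramesOff = 4`, the ranges 0..<10 and 12..<30 merge, while 0..<10 followed by a 2-frame blip at 12..<14 stay separate.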

Also removed an unnecessary `throws` from a non-throwing `LSEENDDiarizer`
constructor.

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-05-01 23:25:58 -04:00
Alex cad8a2b563 feat(asr/cohere): long-form transcribeLong + cold/warm docs (#564)
## Summary

Two related changes that grew out of the cohere isolated-bench
investigation:

### 1. Long-form Cohere ASR — `CoherePipeline.transcribeLong`

The encoder is fixed at a 35 s window, so the prior `transcribe()`
silently truncated longer audio via `padOrTruncate(... fixedFrames:
3_500)` in `Sources/FluidAudio/ASR/Cohere/CoherePipeline.swift:250`.

`transcribeLong` slices audio into 35 s chunks with **5 s overlap**
(matches upstream `cohere-pytorch/config.json` `overlap_chunk_second:
5`) and stitches adjacent chunks via **token-level
longest-common-substring merge**. No model changes — encoder shape stays
`[1, 128, 3500]`, decoder cache shape unchanged.

- Audio ≤ 35 s short-circuits to the existing single-chunk
`transcribe()` path → byte-identical short-form behavior, zero perf
delta on FLEURS / LibriSpeech (which are all ≤ 35 s)
- Audio > 35 s: hop = 30 s, decode each chunk independently, merge token
streams (drop the suffix's matched head, keep prefix as-is)
- LCS window bounded to 32 tokens per seam → O(K²) merge is negligible
vs. decode
- Per-chunk encoder/decoder/total seconds are summed into one
`TranscriptionResult`
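
The windowing arithmetic above (35 s window, 5 s overlap, 30 s hop) can be sketched as follows. The function name and signature are assumptions for illustration, not the `transcribeLong` implementation.

```swift
/// Half-open [start, end) windows in seconds for long-form decoding.
func chunkWindows(
    totalSeconds: Double, window: Double = 35, overlap: Double = 5
) -> [(start: Double, end: Double)] {
    precondition(overlap < window, "overlap must be smaller than the window")
    // Short audio takes the single-chunk path.
    guard totalSeconds > window else { return [(start: 0, end: totalSeconds)] }
    let hop = window - overlap
    var windows: [(start: Double, end: Double)] = []
    var start = 0.0
    while true {
        let end = min(start + window, totalSeconds)
        windows.append((start: start, end: end))
        if end >= totalSeconds { break }
        start += hop
    }
    return windows
}
```

For an 80 s file this yields three windows starting at 0 s, 30 s, and 60 s, which matches the "3 chunks merged" smoke test in the test plan.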

CLI rewiring:
- `cohere-transcribe`, `cohere-benchmark`, `tts-benchmark` now route
through `transcribeLong`
- `cohere-benchmark` no longer skips files exceeding 35 s

Smoke-tested on a Mandarin 81 s WAV: full 80 s now transcribes
(previously cut at 35 s). 10 unit tests cover `mergeTokenStreams`
correctness (empty-input, no-overlap, threshold fallback, boundary
overlap, offset overlap, longest-run preference, window bounds) and
chunk-config constants.
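
One way to read the seam merge is sketched below: find the longest common run between the tail of the earlier chunk and the head of the later one, then drop the later chunk's tokens up to the end of that run. The signature and the `minMatch` threshold value are assumptions; the real `mergeTokenStreams` may differ.

```swift
/// Token-level longest-common-substring merge, bounded to `window` tokens per seam.
func mergeTokenStreamsSketch(
    _ earlier: [Int], _ later: [Int], window: Int = 32, minMatch: Int = 2
) -> [Int] {
    let tail = Array(earlier.suffix(window))
    let head = Array(later.prefix(window))
    guard !tail.isEmpty, !head.isEmpty else { return earlier + later }

    var best = 0           // length of the longest common run found so far
    var bestEndInHead = 0  // index just past that run inside `head`
    var previous = [Int](repeating: 0, count: head.count + 1)
    for i in 1...tail.count {
        var current = [Int](repeating: 0, count: head.count + 1)
        for j in 1...head.count where tail[i - 1] == head[j - 1] {
            current[j] = previous[j - 1] + 1
            if current[j] > best {
                best = current[j]
                bestEndInHead = j
            }
        }
        previous = current
    }

    // Threshold fallback: without a convincing overlap, concatenate as-is.
    guard best >= minMatch else { return earlier + later }
    // Keep the earlier stream untouched and drop the later stream's matched head.
    return earlier + Array(later.dropFirst(bestEndInHead))
}
```

The dynamic-programming table is at most 32 × 32 per seam, so the quadratic cost stays small next to a decode pass.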

### 2. Cold-start vs warm inference docs

Adds a section to `Documentation/ASR/Cohere.md` capturing the isolated
single-process bench (cold ANE compile ~186 s on M2 Tahoe; warm calls
3.4–4.6 s, RTFx 1.96×–8.73×). Clarifies process reuse is what unlocks
the headline FLEURS/LibriSpeech RTFx.

## Test plan

- [x] `swift test --filter CohereLongFormTests` (10 / 10 pass)
- [x] `swift test --filter CohereAsrConfigTests` and
`CoherePipelineMaskTests` (no regressions)
- [x] `swift build` (debug) clean; `swift build -c release` clean
- [x] Smoke: `cohere-transcribe /tmp/cohere_test_80s.wav --language zh`
transcribes full 80 s of audio (3 chunks merged, no duplicated overlap
content)
- [x] `swift format lint` — no new warnings in changed files
2026-05-01 10:26:27 -04:00
Alex 7603ac6733 feat(tts/benchmark): tts-benchmark CLI covering all TTS backends (#557)
## Summary

Adds `fluidaudio tts-benchmark`, a unified harness for measuring
**latency × efficiency × quality** across every shipping TTS backend in
FluidAudio, plus the model + runtime fixes needed to actually clear all
six backends end-to-end on the [MiniMax Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set).
Also tags Magpie / StyleTTS2 / CosyVoice3 as **beta** at the API + docs
level so users get a runtime warning on `initialize()` reflecting their
actual perf / quality posture.

### Backends — all green on M2 / macOS 26

| Backend | Corpus | Status | Audio out (min / p50 / max) | RTFx | WER | Notes |
|---|---|---|---|---|---|---|
| Kokoro ANE | minimax-en (100/100) |  | 3.5 s / 8.0 s / 11.4 s | 5.19× | 10.8% | one-shot @ 24 kHz, 7-graph pipeline; per-stage CU sweep |
| Kokoro | minimax-en (100/100) |  | 3.5 s / 6.8 s / 9.3 s | 2.02× | 1.3% | one-shot @ 24 kHz; multi-chunk w/ 8 ms crossfade; cleanest English ASR roundtrip |
| PocketTTS | minimax-en (100/100) |  | 2.8 s / 6.3 s / 9.4 s | 0.61× | 1.4% | **streaming** @ 24 kHz, 80 ms frames; TTFT 1244 ms — RTFx looks slow but is honest per-frame cost (see "RTFx caveat" below) |
| Magpie | minimax-en (100/100) | ⚠️ **BETA** | 4.7 s / 10.0 s / 20.6 s | 0.64× | 5.6% | **streaming TTFT** @ 22.05 kHz: first chunk at **9.6 s p50** vs full synth 15.1 s; split-K/V decoder + `outputBackings` fast path; below real-time, runtime warning on init |
| StyleTTS2 | minimax-en (100/100) | ⚠️ **BETA** | 9.6 s / 22.6 s / 32.6 s | 2.72× | 44.0% | one-shot @ 24 kHz; flex-shape fix + misaki→espeak post-pass remap (WER 58.1% → 44.0%); WER ~30× Kokoro's, runtime warning on init |
| CosyVoice3 | minimax-zh (100/100) | ⚠️ **BETA** | 2.2 s / 6.5 s / **16.0 s** | 0.357׆ | n/a‡ | post auto-chunker @ 24 kHz; long phrases now split + crossfaded (8 ms cosine) — longest output 16.0 s (was capped at ~6.5 s); HiFT `.cpuAndGPU` + LLM-Decode `outputBackings` (+33% RTFx); **whisper-large-v3 CER 1.68% (macro) / 1.84% (micro)** across 100/100 phrases‡; RTFx < 1, runtime warning on init |
| CosyVoice3 | minimax-yue (100/100) | ⚠️ **BETA** | 3.3 s / 8.0 s / **16.1 s** | 0.249× | n/a | post auto-chunker; **truncation 80/100 → 5/100 phrases** (`finished_on_eos=false` field), longest output 6.5 s → 16.1 s. TTFT-p50 climbs (24 s → 36 s) as the cost of multi-chunk synth |

⚠️ **BETA** = `${Backend}TtsManager.initialize()` emits a
`logger.warning` flagging the perf / quality posture; safe to ship in
non-latency-sensitive paths but read the per-backend doc first.

‡ CosyVoice3 zh WER stays `n/a` because `WERCalculator`
whitespace-tokenizes and Mandarin has no word boundaries (word-level WER
reads ~100% and is meaningless). CER is `whisper-large-v3` against the
rendered WAVs from the full 100-phrase `minimax-chinese` run via
`Scripts/whisper_zh_cer.py`. Cohere Transcribe q8 is also wired in this
PR via `--asr-backend cohere` (see [Cohere ASR backend in the
harness](#cohere-asr-backend-in-the-harness) below) and agrees with
whisper at the 3–5% CER range on a 10-phrase sub-sample, but hits a
`MILCompilerForANE` cache failure on this M2 host that drops it to RTFx
~0.13×, so whisper is the practical source-of-truth for the full
100-phrase run.
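
The WER-versus-CER point in footnote ‡ can be illustrated with a small edit-distance sketch. The function names and the sample sentences are made up for the example; this is not the `WERCalculator` or `Scripts/whisper_zh_cer.py` code.

```swift
/// Levenshtein distance over any sequence of comparable tokens.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + [Int](repeating: 0, count: b.count)
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            current[j] = min(substitution, previous[j] + 1, current[j - 1] + 1)
        }
        previous = current
    }
    return previous[b.count]
}

/// Whitespace-tokenized word error rate.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map(String.init)
    let hyp = hypothesis.split(separator: " ").map(String.init)
    return Double(editDistance(ref, hyp)) / Double(max(ref.count, 1))
}

/// Character error rate: the same distance computed over characters.
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference)
    let hyp = Array(hypothesis)
    return Double(editDistance(ref, hyp)) / Double(max(ref.count, 1))
}
```

A Mandarin sentence without spaces is a single "word", so one wrong character out of six gives a WER of 100% but a CER of about 17%.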

Full numbers (cold start, p50/p95 synth, peak RSS, WER/CER per category)
live in `Documentation/TTS/Benchmarks.md`. Corpus attribution +
reproduction notes live in `Documentation/TTS/MinimaxCorpus.md`.

### RTFx caveat — phrase length and streaming granularity both matter

Aggregate RTFx (audio_duration / wall_clock) is **only directly
comparable between backends when both produce similar phrase lengths and
yield audio at the same granularity**. Two things skew the headline
number on this corpus:

**1. Phrase-length spread.** StyleTTS2 emits ~22 s p50 of audio per
`minimax-english` phrase while Kokoro emits ~7 s — same input text, ~3×
more audio out. That's mostly long inter-word pauses + slow speaking
rate baked into the LibriTTS multi-speaker checkpoint, not a measurement
artifact. A 2.72× RTFx on 22 s audio = ~8 s wall — which matches the
TTFT p50 column. Kokoro's 2.02× on 7 s audio = ~3.5 s wall. Same-corpus
RTFx ratios alone hide this.

**2. Streaming granularity.** PocketTTS posts 0.61× agg-RTFx vs.
Kokoro's 2.02× but it's **not slower from a user perspective**:
PocketTTS yields its first 80 ms audio frame at TTFT **1244 ms**,
Kokoro's first frame at TTFT **3113 ms** (full one-shot chunk). The
0.61× is the per-frame cost averaged across the streaming run; what
users feel is TTFT.

| Backend | TTFT p50 | First yield | Implication |
|-------------|----------|------------------|--------------------------------------------|
| PocketTTS | 1244 ms | 80 ms frame | true streaming; conversational-ready |
| Kokoro ANE | 1586 ms | full ~8 s chunk | ~1.6 s to any audio; ANE-tuned |
| Kokoro | 3113 ms | full ~7 s chunk | clean quality, slower first-byte |
| StyleTTS2 | 6671 ms | full ~22 s chunk | one-shot only; long phrase output amortizes the wall |
| Magpie | **9580 ms** | first chunk @ 22.05 kHz | streaming via `synthesizeStream`; TTFT-p50 9.6 s vs full synth 15.1 s — 36% earlier playback start |
| CosyVoice3 | 14091 / 35681 ms (zh / yue) | full chunk @ 24 kHz | one-shot per chunk; multi-chunk phrases pay TTFT for the first chunk only |

For conversational use cases, **TTFT > RTFx**. PocketTTS (true
streaming), Magpie (streaming via `synthesizeStream`), and Kokoro ANE
(small one-shot chunks) are the three backends that meaningfully clear
the "user feels it's responsive" bar today.
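
The wall-clock figures quoted in this section follow directly from the RTFx definition. A tiny sketch, using the rounded p50 audio lengths from the text (the function name is hypothetical):

```swift
/// RTFx = audio duration / wall-clock time, so wall = audio / RTFx.
func wallSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    return audioSeconds / rtfx
}

let styleTTS2Wall = wallSeconds(audioSeconds: 22, rtfx: 2.72)  // roughly 8 s
let kokoroWall = wallSeconds(audioSeconds: 7, rtfx: 2.02)      // roughly 3.5 s
```

The higher RTFx belongs to the backend with the longer wall time, which is why the ratio alone says little about perceived latency.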

### Beta callouts (StyleTTS2, Magpie, CosyVoice3)

Three of the six shipping backends post numbers that callers should
weigh against an explicit caveat:

- **StyleTTS2** — WER 44% on `minimax-english` is ~30× Kokoro's 1.3%.
The misaki→espeak post-pass remap cut WER by about a quarter (58.1% →
44.0%) and CER by about half (47.6% → 24.1%); the remainder is
BART G2P misses + diffusion-sampler formant breaks on long phrases.
- **Magpie** — agg-RTFx 0.64× on M2 — below real-time but streaming via
`synthesizeStream` so TTFT (9.6 s p50) is significantly better than
full-synth wall (15.1 s p50). Long-tail phrases still pull p95 wall to
~30 s.
- **CosyVoice3** — agg-RTFx 0.357× on `minimax-chinese` (0.249× on the
longer-phrase `minimax-cantonese` after the auto-chunker). The 250-token
Flow input cap is now worked around at the call site by the auto-chunker
(long phrases split + crossfaded), dropping cantonese truncation from
80/100 → 5/100 and lifting longest output from 6.5 s → 16.1 s. The 5/100
residual is the long-tail token-rate worst case; the structural fix is
re-exporting Flow with a larger fixed input shape (tracked in
`mobius-cosyvoice3`). `CosyVoice3SynthesisResult.finishedOnEos: Bool` +
a `.warning`-level `LLM-Decode budget exhausted` log still surface any
truncation, and the harness writes `finished_on_eos` into each phrase in
the JSON report.

Each manager now logs a `.warning`-level beta notice on `initialize()`
(mirroring the existing CosyVoice3 pattern) so anyone wiring these into
a product gets a console signal, not a silent surprise. Docs
(`Documentation/TTS/Magpie.md`, `Documentation/TTS/Benchmarks.md`
StyleTTS2 footnote, existing `CosyVoice3.md` callout) carry the same
caveat at the top.

### Model + runtime fixes landed in this PR

#### CosyVoice3 stateless port (`71130c9fb`)
Switches LLM-Decode from the macOS 15+ stateful `MLState` path to the
non-stateful `LLM-Decode-M768-fp16` graph that's actually shipped on
HuggingFace. Drops ~95 LOC of state plumbing for ~30 LOC of plain
`MLDictionaryFeatureProvider` prediction with explicit kv carry-forward;
lowers the availability gate from macOS 15 / iOS 18 back to the package
baseline (macOS 14 / iOS 17). `CosyVoice3ModelNameTests` guard the
rename.

#### CosyVoice3 HiFT timeout fix (`267766b62`)
`minimax-chinese` runs were aborting mid-corpus with `E5RT: Submit Async
failed for [3:29] ... HiFT-T500-fp16_main__Op104_BnnsCpuInference has
timed out`. Root cause: HiFT was loaded with `.cpuAndNeuralEngine`,
which let the planner place most of the graph on ANE but kept at least
one op on the BNNS CPU async-dispatch path; long phrases tripped the
BNNS async watchdog. Fix pins HiFT to `.cpuAndGPU` regardless of
user-supplied compute-units, removing the BNNS path entirely. Verified
on 100/100 zh + 100/100 yue.

#### CosyVoice3 LLM-Decode `outputBackings` double-buffer (`248c638c6`)
The autoregressive decode loop runs ~163 steps per phrase to fill the
250-token cap. Each step takes the previous step's KV cache as `kv_k` /
`kv_v` (fp32 `[24, 1, 2, 768, 64]` = 9 MB each) and produces fresh
`kv_k_out` / `kv_v_out` plus logits — i.e. ~36 MB of host-side
`MLMultiArray` allocation **per step**. Fix pre-allocates 4 KV
back-buffers + a logits backing, rotates front/back/spare across steps
via `MLPredictionOptions.outputBackings`, and falls back to fresh-alloc
on first rejection (one-shot `logger.warning`). Mirrors the Magpie
pattern. Result on full `minimax-chinese`: agg-RTFx **0.269 → 0.357
(+33%)**, TTFT-p50 14091 ms → 9619 ms (-31%), peak RSS 3302 MB → 2470
MB.
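
The per-step allocation figure can be checked with a few lines of arithmetic. Counting four tensors per step (`kv_k`, `kv_v`, `kv_k_out`, `kv_v_out`) is my reading of the "~36 MB" figure above, not something the PR states explicitly.

```swift
// Shape of one fp32 KV tensor, as quoted in the PR text.
let kvShape = [24, 1, 2, 768, 64]
let bytesPerFloat32 = MemoryLayout<Float>.size  // 4 bytes
let bytesPerTensor = kvShape.reduce(1, *) * bytesPerFloat32

// Assumption: two KV inputs plus two KV outputs are live on each decode step.
let tensorsPerStep = 4
let mebibytesPerStep = Double(bytesPerTensor * tensorsPerStep) / 1_048_576
```

Each tensor comes to exactly 9 MiB, so reusing pre-allocated backings avoids tens of megabytes of fresh allocation on every one of the ~163 decode steps.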

#### CosyVoice3 auto-chunker (`f80e0b92e` + `fd22624b5` + `f60cccd0d`)
The 250-token Flow input cap means a single synth pass produces at most
~6.5 s of audio regardless of input length. Re-exporting Flow with a
larger fixed input shape is gated on upstream conversion work, so this
PR works around it at the call site: long inputs are split at
sentence/clause boundaries by `CosyVoice3TextChunker`, synthesized
independently, and merged with an 8 ms equal-power cosine crossfade.
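
An equal-power crossfade of the kind described can be sketched as follows. This is an illustrative version under assumed parameter names, not `CosyVoice3TtsManager.concatWithCrossfade` itself; at 24 kHz an 8 ms fade spans 192 samples.

```swift
import Foundation

/// Joins two mono buffers with an equal-power (cos/sin) crossfade.
func crossfadeJoin(
    _ a: [Float], _ b: [Float], sampleRate: Int = 24_000, fadeMs: Int = 8
) -> [Float] {
    // Never fade over more samples than either buffer holds.
    let n = min(sampleRate * fadeMs / 1000, a.count, b.count)
    guard n > 0 else { return a + b }

    var out = Array(a.dropLast(n))
    for i in 0..<n {
        // theta sweeps (0, pi/2): cos fades `a` out while sin fades `b` in,
        // and cos^2 + sin^2 = 1 keeps the summed power constant.
        let theta = Double(i + 1) / Double(n + 1) * Double.pi / 2
        let fadeOut = Float(cos(theta))
        let fadeIn = Float(sin(theta))
        out.append(a[a.count - n + i] * fadeOut + b[i] * fadeIn)
    }
    out.append(contentsOf: b.dropFirst(n))
    return out
}
```

The joined buffer is shorter than the plain concatenation by the fade length, since the overlapping samples are mixed rather than appended.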

**Splitter policy**: hard enders (`. ! ? 。 ! ? \n`) commit always; soft
enders (`, 、 ; : ; ,` + ASCII space) commit only at-or-past budget;
force-split at +30 token overshoot if no natural boundary exists.
`defaultMaxSpeechTokens` = 110 (leaves margin under the 250-token cap
minus a typical 60–90-token speech-prompt context). Token-rate heuristic
is calibrated against minimax-zh + minimax-yue runs:

| Char class | Tokens / char | Rationale |
|------------|---------------|--------------------------------------------------------------|
| CJK | 7.5 | worst-case observed in real generation; varies 5.5–9 per char |
| ASCII | 1.5 | matches BPE rate on English text |
| Other | 2.5 | conservative for accented Latin / non-CJK Unicode |
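
The heuristic and splitter policy above can be sketched together. Function names are hypothetical, the CJK test uses only the basic U+4E00–U+9FFF block, and the ender sets include full-width forms as an assumption; `CosyVoice3TextChunker` may differ on all three points.

```swift
/// Estimated speech tokens for one character, using the rates from the table.
func estimatedTokens(for character: Character) -> Double {
    guard let scalar = character.unicodeScalars.first else { return 0 }
    if (0x4E00...0x9FFF).contains(scalar.value) { return 7.5 }  // CJK (basic block only)
    if scalar.isASCII { return 1.5 }
    return 2.5
}

func estimatedTokens(in text: String) -> Double {
    return text.reduce(0) { $0 + estimatedTokens(for: $1) }
}

/// Splits text so each chunk stays near the speech-token budget.
func splitForSynthesis(_ text: String, budget: Double = 110, overshoot: Double = 30) -> [String] {
    // Assumption: both ASCII and full-width punctuation count as enders.
    let hardEnders: Set<Character> = [".", "!", "?", "。", "！", "？", "\n"]
    let softEnders: Set<Character> = [",", "、", ";", ":", "；", "，", " "]
    var chunks: [String] = []
    var current = ""
    var tokens = 0.0
    for character in text {
        current.append(character)
        tokens += estimatedTokens(for: character)
        let hard = hardEnders.contains(character)                    // always commit
        let soft = softEnders.contains(character) && tokens >= budget // commit at or past budget
        let forced = tokens >= budget + overshoot                     // no boundary in sight
        if hard || soft || forced {
            chunks.append(current)
            current = ""
            tokens = 0
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}
```

With the 7.5 tokens-per-character rate, an unpunctuated run of CJK text is force-split after 19 characters (142.5 estimated tokens, just past the 140-token limit).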

**Validation** on full `minimax-cantonese` (100 phrases, M2):

| Metric | Pre-chunker | Post-chunker | Δ |
|-------------------------------------------|-------------|--------------|------------|
| `finished_on_eos=false` (truncated) | 80 / 100 | **5 / 100** | −94% |
| Longest audio output | 6.5 s | **16.1 s** | +148% |
| agg-RTFx | 0.245× | 0.249× | +1.6% |
| TTFT p50 | 23.9 s | 35.7 s | +49% |

The TTFT regression is the cost of running multiple synth passes per
long phrase — splitting unblocks long-form output at the price of
wall-clock latency. The 5/100 residual truncation is the long-tail
token-rate worst case (some chars hit ~9 tokens/char); raising the
per-CJK heuristic further would over-fragment short phrases. Cleaner fix
is the Flow re-export.

16-test suite covers tokenization estimates, hard/soft/force-split
policy, and the crossfade arithmetic. Lives in
`Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3TextChunker.swift`
+ `CosyVoice3TtsManager.concatWithCrossfade`.

#### Magpie streaming TTFT wire-up (`ace0bf485`)
`TtsBenchmarkCommand.swift` now drives Magpie through
`MagpieTtsManager.synthesizeStream`, recording `ttft_ms` at first
`MagpieAudioChunk` emit instead of conflating it with full-synth wall
time. Result on full `minimax-english` (100 phrases, M2): TTFT-p50 **9.6
s** vs full synth-p50 **15.1 s** — agents start playback ~36% earlier
than waiting for full synth. agg-RTFx 0.41× → 0.64× (warm-cache re-run
benefit; fundamentals unchanged).

#### StyleTTS2 `FlexibleShapeInfo` fix (`c24900731` + `8f9e42fd9`)
`text_predictor.mlmodelc` aborted on long MiniMax phrases with `E5RT:
tensor_buffer has known strides while the model has FlexibleShapeInfo`.
The CoreML runtime rejects two access patterns on outputs from a
flex-shape model: `arr.strides` reads, and `arr[idx].floatValue` element
subscripts — and the original `sliceFirstAxis2D` helper used both. Fix
rewrites it to read via `arr.dataPointer.bindMemory(...)` (handling
`.float32`, `.float16`, `.double`) and computes the flat index from the
known `(1, leading, trailing)` row-major layout. Verified on full
100/100 minimax-en with a `ref_s.bin` dumped from the upstream LibriTTS
demo voice.
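
The flat-index computation for a `(1, leading, trailing)` row-major layout can be sketched without CoreML. The helper below is hypothetical and reads from a plain `Float` buffer rather than an `MLMultiArray`'s `dataPointer`.

```swift
/// Copies one row out of a buffer laid out row-major as (1, leading, trailing).
/// No stride lookup or element subscript is needed: the index is computed by hand.
func copyRow(_ buffer: UnsafePointer<Float>, row: Int, leading: Int, trailing: Int) -> [Float] {
    precondition((0..<leading).contains(row), "row out of range")
    return (0..<trailing).map { column in buffer[row * trailing + column] }
}

// Two rows of three values each, stored contiguously.
let storage: [Float] = [0, 1, 2, 10, 11, 12]
let secondRow = storage.withUnsafeBufferPointer { pointer in
    copyRow(pointer.baseAddress!, row: 1, leading: 2, trailing: 3)
}
```

Because the layout is known in advance, the read touches only raw memory, which sidesteps both access patterns the runtime rejects on flex-shape outputs.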

#### StyleTTS2 misaki → espeak post-pass remap (`ded0b9467`)
After `sliceFirstAxis2D` unblocked the full corpus, StyleTTS2 still
landed at **WER 0.581 / CER 0.476** — an order of magnitude worse than
Kokoro (0.013). Instrumented the encoder via a new `--tokenize-only
--corpus` mode and disproved the silent-vocab-drop hypothesis: only
**0.09% of scalars** dropped on the full 100-phrase corpus (11 ASCII
hyphens / 12247 scalars).

Real root cause: G2P convention mismatch. Both Kokoro and StyleTTS2
share the in-tree misaki BART G2P (`G2PModel`), but the StyleTTS2
LibriTTS checkpoint was trained by yl4579 on **espeak-ng-phonemized**
LibriTTS — predating misaki by years. The 178-vocab accepts both forms
(e.g. both `ʧ` U+02A7 and `tʃ` decomposed encode), but acoustic
embeddings for the misaki ligature glyphs are essentially untrained
noise.

Side-by-side comparison against locally-installed `espeak-ng -v en-us
--ipa -q` flagged four systematic divergences:

| misaki | espeak-ng | example                  |
|--------|-----------|--------------------------|
| `ʧ`    | `tʃ`      | choice → `tʃˈɔɪs`        |
| `ʤ`    | `dʒ`      | jump   → `dʒˈʌmps`       |
| `ɜɹ`   | `ɝ`       | girl   → `ɡˈɝl`          |
| `əɹ`   | `ɚ`       | over   → `ˈoʊvɚ`         |

Fix: a 4-rule post-pass remap in `StyleTTS2Phonemizer.phonemize`, gated
on `.americanEnglish` and applied to the assembled phoneme string after
every word has been emitted by the BART G2P. Lives alongside the
existing per-piece misaki diphthong remap. Result on the same 100-phrase
MiniMax-English run with the same `libritts_696` voice and same Parakeet
TDT roundtrip:

| Metric          | Pre   | Post  | Δ      |
|-----------------|-------|-------|--------|
| Macro WER       | 0.581 | 0.440 | −24.2% |
| Macro CER       | 0.476 | 0.241 | −49.5% |
| TTFT p50 (ms)   | 8937  | 6671  | −25.4% |
| Agg RTFx        | 2.36× | 2.72× | +15.3% |
| Peak RSS (MB)   | 1428  | 963   | −32.6% |
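
The four-rule remap can be sketched as plain string replacement over the assembled phoneme string. The table contents come from the divergence list above; the function name and rule ordering are assumptions, and the `.americanEnglish` gate is omitted.

```swift
import Foundation

/// misaki -> espeak-ng rewrites, taken from the divergence table.
let misakiToEspeakRules: [(from: String, to: String)] = [
    (from: "ɜɹ", to: "ɝ"),
    (from: "əɹ", to: "ɚ"),
    (from: "ʧ", to: "tʃ"),
    (from: "ʤ", to: "dʒ"),
]

/// Applies every rule in order to the full phoneme string.
func remapToEspeak(_ phonemes: String) -> String {
    return misakiToEspeakRules.reduce(phonemes) { partial, rule in
        partial.replacingOccurrences(of: rule.from, with: rule.to)
    }
}
```

For instance, `remapToEspeak("ʧˈɔɪs")` gives `tʃˈɔɪs`, the espeak-ng form of "choice" shown in the table.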

Phrase 1 (`"…simple choice. Get busy living…"`) went from `simple voice.
Busy dying.` (0.40 WER) to a perfect roundtrip. Remaining errors cluster
on word-level G2P misses from the BART itself (`practical →
practicckles`, `separation → expiration`) and diffusion-sampler formant
breaks; closing the rest of the gap to Kokoro likely needs richer espeak
coverage or libespeak-ng vendor — tracked separately.

#### Beta callouts on StyleTTS2 + Magpie managers (`25e2b492a`)
`StyleTTS2Manager.initialize` and `MagpieTtsManager.initialize` now emit
`logger.warning` beta notices mirroring the existing
`CosyVoice3TtsManager.initialize` pattern. Backend docs (`Magpie.md`
Status section, `Benchmarks.md` StyleTTS2 footnote) gain matching `⚠️
Beta / experimental` callouts so the perf / quality posture is visible
at every entry point — runtime, manager docstring, doc top, PR body.

#### Magpie `outputBackings` rejection fallback (`72dae8400` +
`9767e1ef9`)
Users can still be running the currently shipped `decoder_step.mlmodelc`
before the rebuilt model lands, so CoreML can reject our `outputBackings`
dictionary on a name mismatch. The first rejection latches a flag that
switches decoding to the fresh-alloc path for the rest of the run, so
the model still runs.

### Cohere ASR backend in the harness (`8e741e659`)

Lets non-English TTS runs (CosyVoice3, Magpie zh, etc.) score WER / CER
through the harness against [Cohere
Transcribe](Sources/FluidAudio/ASR/Cohere/) instead of being forced into
`--skip-asr`. Four new flags on `tts-benchmark`:

- `--asr-backend parakeet|cohere|none` — selects the ASR roundtrip
engine. Default is `parakeet` for English-only runs and skipped for
CosyVoice3.
- `--cohere-model-dir <path>` — path to a directory containing
`cohere_encoder.mlmodelc`, `cohere_decoder_cache_external_v2.mlmodelc`,
and `vocab.json`.
- `--asr-language <code>` — overrides the inferred language code (covers
all 14 Cohere languages: en, fr, de, es, it, pt, nl, pl, el, ar, ja, zh,
ko, vi).
- `--cohere-compute-units all|cpu-and-gpu|cpu-only|all-ane` — pins
`MLComputeUnits` at `CoherePipeline.loadModels` time. Use `cpu-and-gpu`
when the q8 encoder fails ANE compilation (`MILCompilerForANE error:
failed to compile ANE model using ANEF`) to skip the multi-minute
fallback compile on the first call. The harness logs a WER caveat for
zh/ja runs flagging that whitespace-tokenized WER is meaningless and the
CER column is the real signal.

Example end-to-end:
```bash
fluidaudio tts-benchmark \
    --backend cosyvoice3 \
    --corpus minimax-chinese \
    --asr-backend cohere \
    --cohere-model-dir /path/to/cohere/q8 \
    --asr-language zh \
    --output-json benchmark_results/cv3-zh-cohere.json \
    --audio-dir benchmark_results/cv3-zh-cohere/audio
```

On this M2 host the q8 encoder hits a CoreML ANE-cache failure
(`MILCompilerForANE error: ANECCompile() FAILED`) and CoreML silently
falls back to CPU+GPU, dropping Cohere from its documented RTFx ~2× (per
`Documentation/ASR/Cohere.md`) to RTFx ~0.13× — correctness is
unaffected (same graph, same output), only latency. The full 100-phrase
CosyVoice3 zh CER number reported above (1.68% macro / 1.84% micro) was
therefore produced via `whisper-large-v3` (Python CPU FP32,
`Scripts/whisper_zh_cer.py`) rather than by running Cohere over all 100
phrases. A 10-phrase Cohere sub-sample agrees with whisper at the 3–5%
CER range.

### Corpus migration (`4cc7d3111`) + on-demand fetch CLI (`8022e8384`)

Replaces the original `prose-en` / `numbers-en` / `names-en` /
`prose-zh` shipped with the first cut of this PR with the [MiniMax
Multilingual TTS Test
Set](https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set)
(CC-BY-SA-4.0; 100 phrases × 24 languages). Same public corpus used by
[MiniMax-Speech](https://arxiv.org/abs/2505.07916), seed-tts-eval, and
Gradium — numbers in this PR are paper-comparable.

The 24 per-language `.txt` files used to be vendored in
`Benchmarks/tts/corpus/minimax/`. **Removed in this PR** in favor of an
on-demand `fluidaudio minimax-corpus` CLI subcommand that fetches them
from the upstream HF dataset at the pinned revision and writes them to
the same path. Reuses `DownloadUtils.fetchHuggingFaceFile` for HF auth
(HF_TOKEN env) + retry/backoff — no `swift-transformers` dep added, no
hardcoded asset URLs. The `.txt` files now live in `.gitignore` since
they're CC-BY-SA-4.0 derivative content; only
`Documentation/TTS/MinimaxCorpus.md` (attribution + revision pin + WER
caveats — moved from `Benchmarks/tts/corpus/minimax/README.md` in
`ac21d60bf`) and the CLI subcommand are tracked. Replaces the prior
`python Scripts/fetch_minimax_tts_corpus.py` (also deleted). Per-backend
language scope:

| Backend | Languages benchmarked |
|---|---|
| Kokoro / Kokoro ANE | en (af_heart) |
| PocketTTS | en + de + it + pt + es + fr |
| Magpie | en + es + de + fr + it + vi + zh + hi |
| StyleTTS2 | en (LibriTTS multi-spk) |
| CosyVoice3 | zh + yue |

### PocketTTS streaming TTFT (`c26f1e163`)
PocketTTS now drives the harness through its `synthesizeStreaming` API
so TTFT measures time-to-first-80ms-frame instead of full one-shot
synth. TTFT 1244 ms vs. full synth 8757 ms — a 7× streaming advantage
that one-shot benchmarking previously hid.

### Reference voice dumper helper (mobius-styletts2)
`mobius-styletts2/scripts/06_dump_ref_s.py` (added in the sibling repo)
wraps `style_encoder` + `predictor_encoder` from `99_parity_check.py` to
dump a 256-fp32 LE `ref_s.bin` that `StyleTTS2Manager.synthesize`
consumes via `--voice`. Required because the shipped CoreML bundle
doesn't include those upstream-only PyTorch encoders.

## Test plan

- [x] `swift build -c release` clean
- [x] `swift format lint` clean for new files
- [x] `fluidaudio tts-benchmark --help` lists all 6 backends
- [x] `fluidaudio minimax-corpus --languages english --out-dir /tmp/x`
produces byte-identical output to the deleted Python script
- [x] Kokoro / Kokoro ANE / PocketTTS / Magpie — full 100/100 minimax-en
- [x] StyleTTS2 — full 100/100 minimax-en (verified after
`sliceFirstAxis2D` fix + post-pass remap)
- [x] CosyVoice3 — full 100/100 minimax-zh + 100/100 minimax-yue
(verified after HiFT + LLM-Decode `outputBackings` fixes)
- [x] `CosyVoice3ModelNameTests` + `TtsComputeUnitPresetTests` green
- [x] No `@unchecked Sendable`; per-backend error enums use `Error,
LocalizedError`
- [x] StyleTTS2 + Magpie + CosyVoice3 emit beta `logger.warning` on
`initialize()`
- [x] Corpus README moved to `Documentation/TTS/MinimaxCorpus.md`;
cross-refs in `Benchmarks.md`, `MinimaxCorpusCommand.swift`,
`TtsBenchmarkCommand.swift` updated
- [x] CosyVoice3 6.5 s output cap investigated — confirmed structural
(250-token Flow input shape, 40 ms / token); surfaced via
`finishedOnEos` + warning log + JSON `finished_on_eos` field. See
[Decode budget
cap](Documentation/TTS/Benchmarks.md#cosyvoice3-decode-budget-cap)
- [x] **CosyVoice3 auto-chunker** lands in this PR as a call-site
workaround. Validated on full minimax-cantonese: truncation **80/100 →
5/100**, longest output **6.5 s → 16.1 s**, agg-RTFx 0.245× → 0.249×.
16-test suite (`CosyVoice3TextChunkerTests`) green. See [CosyVoice3
auto-chunker](Documentation/TTS/Benchmarks.md#cosyvoice3-auto-chunker)
- [x] **Magpie streaming TTFT** wired through `synthesizeStream` in
`TtsBenchmarkCommand.swift`. Validated on full minimax-english: TTFT-p50
**9.6 s** (first chunk) vs full-synth-p50 **15.1 s** — 36% earlier
playback start. agg-RTFx 0.41× → 0.64× (warm-cache re-run)
- [x] **Cohere ASR harness wiring** (`--asr-backend cohere` +
`--cohere-model-dir` + `--asr-language` + `--cohere-compute-units`).
Smoke-tested on a 10-phrase `minimax-chinese` sub-sample (Cohere q8
macro CER 4.88%, hit `MILCompilerForANE` fallback, RTFx ~0.13× on this
M2 host). Whisper-large-v3 cross-check on the same WAVs: macro CER 3.04%
— both backends agree
- [x] **CosyVoice3 zh CER on full corpus** measured via
`whisper-large-v3` (Python CPU FP32, `Scripts/whisper_zh_cer.py`) over
all 100 minimax-chinese WAVs: macro CER **1.68%**, micro CER **1.84%**.
Recorded in `Documentation/TTS/Benchmarks.md` (CosyVoice3 row + footnote
‡)
2026-05-01 09:09:42 -04:00
Alex b5d8017d1f feat(asr/parakeet-v3): default to int4-per-channel encoder (#560)
## Summary

Switch the Parakeet TDT v3 default encoder from the 6-bit palettized
`Encoder.mlmodelc` to a new int4-per-channel `EncoderInt4.mlmodelc`. v2
and TDTJa keep the legacy 6-bit encoder; v3 is the only path that
changes.

## WER / size / speed (LibriSpeech test-clean, 100 files, M2)

| variant | WER | disk | RTFx | ANE residency |
|---|---|---|---|---|
| baseline (6-bit palettized, current default) | 2.64% | 426 MB | 36.8x | 99.4% |
| **int4-per-channel (new default)** | **5.24%** | **285 MB** | **49.2x** | 82.0% |
| enc-prune+int8 | 2.57% | 568 MB | 19.8x | 82.0% |
| enc-int4-linear-per-block-32 | 3.95% | 319 MB | 15.6x | 33.3% |
| enc-prune+int4-block | 3.95% | 319 MB | 15.9x | 33.3% |

The chosen variant trades roughly 2× LibriSpeech WER (still in the same
single-digit-percent regime) for **33% less disk** and the **fastest
RTFx** of any variant tested. Per-block quants fall mostly off the ANE
(33% residency) while per-channel stays largely compatible (82%).

## Implementation

- `ModelNames.ASR`
  - Add `encoderInt4 = "EncoderInt4"` and `encoderInt4File =
    "EncoderInt4.mlmodelc"`.
  - Swap `encoderFile` for `encoderInt4File` in `requiredModelsV3`.
    `encoderFile` stays defined and is still used by v2 / TDTJa / 110m.
- `AsrModels.swift`
  - Extend `getModelFileNames(version:)` return tuple from `(decoder,
    joint, vocabulary)` to `(encoder, decoder, joint, vocabulary)`.
  - Thread `fileNames.encoder` through `createModelSpecs`, the v3 `load`
    flow, the `download` spec list, and `isModelValid`. v3 returns
    `Names.encoderInt4File`; v2/tdtJa return their existing
    `Encoder.mlmodelc`; fused (110m) is unaffected.
- Tests: add `testV3UsesInt4EncoderAsDefault` and
`testV2KeepsLegacyEncoder` in `ModelNamesTests`.
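
The per-version encoder selection can be sketched as a single switch. The enum and function names are hypothetical stand-ins, and only the encoder element of the tuple is shown; the file names are the ones given in this PR.

```swift
// Hypothetical stand-in for the library's model-version type.
enum AsrModelVersionSketch {
    case v2
    case v3
    case tdtJa
}

/// v3 switches to the int4-per-channel encoder; v2 and TDTJa keep the legacy file.
func encoderFileName(for version: AsrModelVersionSketch) -> String {
    switch version {
    case .v3:
        return "EncoderInt4.mlmodelc"
    case .v2, .tdtJa:
        return "Encoder.mlmodelc"
    }
}
```

Routing the choice through one function is what lets the load, download, and validation paths agree on which encoder a given version expects.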

## Distribution

The new `EncoderInt4.mlpackage` / `EncoderInt4.mlmodelc` will be
uploaded to the existing `FluidInference/parakeet-tdt-0.6b-v3-coreml` HF
repo alongside the current `Encoder.mlmodelc`. Older library versions
that still ask for `Encoder.mlmodelc` continue to work unchanged.

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter ModelNamesTests` — 20/20 (2 new)
- [x] `swift test --filter AsrModelsTests` — 30/30
- [x] End-to-end transcription smoke test on LibriSpeech
61-70970-0001.flac via `EncoderInt4.mlmodelc`: correct text, RTFx
29.12x. ANE cold compile 21.3s (one-time).
- [x] swift-format lint clean on the modified files (only pre-existing
Sortformer warnings remain in `ModelNames.swift`).
- [ ] CI: tests + asr-benchmark
- [ ] Verify HF download path on a clean cache once
`EncoderInt4.mlmodelc` is uploaded to the v3 repo.

## Companion

The mobius PR adds the conversion scripts that produced these variants
(`extra_encoder_variants.py`, `analyze_fallback.py`,
`compute_unit_sweep.py`).
2026-04-30 23:00:43 -04:00
Benjamin Lee 35f6ba697f Added Back the Old LS-EEND Constructors (#563)
I accidentally deleted the old constructor in my last PR.

---------

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-30 17:24:18 -07:00
Benjamin Lee 4065a9917e Optimized LS-EEND API (#526) 2026-04-30 17:49:32 -04:00
Alessandro 4db4af1390 Add Dictato to showcase (#561)
Adds Dictato to the showcase section, appended at the end to keep
chronological order.

Dictato turns your voice into text, anywhere on your Mac. It runs fully
on-device, so your words never leave your machine — fast, private, and
works offline. Boost recognition for your own vocabulary (names, brands,
acronyms) and dictate in multiple languages.

Website: https://dicta.to
2026-04-30 08:10:27 -04:00
Alex 3e3ee69084 docs: add top-level architecture overview (#559)
## Summary

Adds `Documentation/Architecture.md` — a conceptual companion to the
existing per-module docs (`Documentation/{ASR,TTS,VAD,Diarization}/`)
that explains **why** the code is structured the way it is, not just
what each module does.

Covers all four runtime modules:

- **Cross-cutting patterns**: actor-based model stores (and the
libmalloc heap-corruption bug that motivated them), lazy HuggingFace
downloads, per-module error enums, AsyncStream vs state-in/state-out
choices, per-model compute-unit selection driven by measured precision,
pure-Swift vs CoreML divide.
- **ASR**: three families (Parakeet TDT, Qwen3, Cohere), why sliding
window instead of stateful streaming, why TDT exists, why Qwen3 strips
the embedding graph.
- **TTS**: why no unified `TtsBackend` protocol, PocketTTS as the
canonical multi-stage pipeline, Kokoro vs KokoroAne ANE-split tradeoffs,
G2P/SSML pure-Swift rationale.
- **VAD**: single-model stateful LSTM, why VAD deliberately doesn't
expose AsyncStream (state-in/state-out composes better with the caller's
existing loop), hysteresis state machine.
- **Diarization**: why two managers (online cosine-threshold vs offline
VBx + AHC), why C++17 for linkage, why no built-in VAD-gating.

Every claim has `file_path:line` references so future contributors can
jump to the canonical implementation. Closes the gap where the design
rationale lived in PR descriptions and tribal knowledge instead of in
the repo.

## Test plan

- [ ] Doc renders cleanly on GitHub
- [ ] All `file_path:line` references resolve
- [ ] No broken internal links
2026-04-29 15:33:09 -04:00
Zhongpai Gao c4d56a5cb5 Feat/pocket tts int8 precision swap (#558)
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Wires up the published `flowlm_stepv2.mlmodelc` int8 variant via a
`PocketTtsPrecision { .fp16, .int8 }` parameter on `PocketTtsManager`,
threaded through to `PocketTtsModelStore` and
`PocketTtsResourceDownloader`.

Closes the loop on the `flowlm_stepv2.mlmodelc` artifact that's been
published under `v2/<lang>/` for a while but didn't have a Swift loader
hook. Default stays `.fp16`, no behavior change for existing callers.

## What's in this PR

**Code (5 files, +171/-12):**
- `PocketTtsPrecision.swift` — new enum `{ .fp16, .int8 }`, with
  docstring documenting the kyutai-labs/pocket-tts#147 recipe and
  preserving the per-submodel A/B data from `experiment/pocket-tts-int8`
  (cond_step / flowlm_step / flow_decoder / mimi_decoder safety summary)
- `ModelNames.swift` — `flowlmStepV2` constant +
  `flowlmStepFile(precision:)` and `requiredModels(precision:)` helpers
- `PocketTtsResourceDownloader.swift` — `precision:` param,
  precision-aware cache check, and `removeUnusedFlowlmVariant()` post-
  download cleanup so callers' disk usage matches the loaded models
- `PocketTtsModelStore.swift` — `precision:` init param plumbed to the
  precision-aware filename helper
- `PocketTtsManager.swift` — `precision:` init param threaded to the
  store

**Docs (1 file, +47):**
- `Documentation/TTS/PocketTTS.md` — new "Model Files & Precision"
  section: per-submodel precision/size/HF-path table, fp16-vs-int8
  totals, rationale for why only `flowlm_step` is quantized

## Why default is `.fp16`

I asked about the on-disk weight format before committing the rename
and verified by inspecting `model.mlmodel` for both flowlm variants:
the int8 variant has explicit `cast_fp16_to_fp32` op scaffolding
throughout, while the default has none — indicating uniform fp16
weights. Combined with the 304→77 MB size ratio (~4×, consistent with
fp16→int8 plus quantization scale tensors) the default file's weights
are fp16 on disk. The existing `PocketTtsModelStore.swift:65-67`
comment about "CPU/GPU compute in float32 matches the Python reference"
is correct about runtime compute precision (CoreML upcasts fp16
weights to fp32 on `.cpuAndGPU`); it just doesn't describe disk format
and reads as accurate as-is.

## Why per-submodel quantization isn't exposed

The `experiment/pocket-tts-int8` branch's `PocketTtsQuantization`
struct (per-submodel `PocketTtsModelPrecision`) is a richer API, but
the per-submodel int8 artifacts (`cond_step_int8.mlmodelc`, etc.)
aren't published on HuggingFace today. Adding the API would let
callers request configurations that 404 at download time. Only
`flowlm_stepv2.mlmodelc` is published, and that's what this PR wires
up. The `PocketTtsPrecision` enum can grow into the experiment
branch's `PocketTtsQuantization` shape mechanically if/when the
per-submodel artifacts ship.

## Disk footprint (English language pack)

| | fp16 (default) | int8 |
|---|---|---|
| Total active files on disk | 766.3 MB | 549.3 MB |
| **int8 savings vs fp16** | — | **−217 MB (28%)** |

The `v2/<lang>/` HF directory ships both flowlm variants, so first
download briefly holds ~857 MB before the cleanup pass deletes the
unused `.mlmodelc` and `.mlpackage`.

## Backward compatibility

- `PocketTtsManager()` / `PocketTtsModelStore()` / `ensureModels()`
  defaults all stay `.fp16`, which loads `flowlm_step.mlmodelc` exactly
  as before
- Existing `requiredModels` constant retained alongside new
  `requiredModels(precision:)` so non-precision-aware callers keep
  compiling
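
A sketch of how a defaulted precision parameter can select the FlowLM file while keeping existing call sites unchanged. The two file names and the `{ .fp16, .int8 }` cases come from the PR text; `unusedFlowlmStepFile` and `SketchManager` are hypothetical names used only for illustration.

```swift
/// Sketch of the precision switch described in this PR.
enum PocketTtsPrecision {
    case fp16
    case int8
}

/// Maps a precision to the FlowLM step model file named in the PR.
func flowlmStepFile(precision: PocketTtsPrecision) -> String {
    switch precision {
    case .fp16: return "flowlm_step.mlmodelc"
    case .int8: return "flowlm_stepv2.mlmodelc"
    }
}

/// The variant that is not loaded, which a cleanup pass could delete.
func unusedFlowlmStepFile(precision: PocketTtsPrecision) -> String {
    flowlmStepFile(precision: precision == .fp16 ? .int8 : .fp16)
}

/// Hypothetical manager: the default argument keeps `SketchManager()` compiling
/// and loading the fp16 file exactly as before.
struct SketchManager {
    let precision: PocketTtsPrecision

    init(precision: PocketTtsPrecision = .fp16) {
        self.precision = precision
    }

    var flowlmFile: String { flowlmStepFile(precision: precision) }
}
```

Callers opt in with `SketchManager(precision: .int8)`; everything else keeps the previous behavior.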

## Verification done

- All 11 published language packs have both `flowlm_step.mlmodelc`
  and `flowlm_stepv2.mlmodelc` under `v2/<lang>/` — verified via HF
  tree API
- Branch is exactly +2 commits on top of `main`
  (`00ea906 fix: remove module_map from MachTaskSelfWrapper subspec`)
- Diff content is identical to `Gaozhongpai/FluidAudio:main`, just
  squashed from 5 iterative commits into 2 clean ones (one feat, one
  docs)

I haven't run `swift test` locally — Bash on Windows here, no Swift
toolchain. Happy to fix anything CI flags.

2026-04-29 13:57:15 -04:00
dianshu 00ea906c20 fix: remove module_map from MachTaskSelfWrapper subspec (#546)
## Summary

- Remove `mach.module_map` from the `MachTaskSelfWrapper` subspec —
CocoaPods does not allow `module_map` on subspecs
- Guard `import MachTaskSelfWrapper` with `#if
canImport(MachTaskSelfWrapper)`, matching the existing
`FastClusterWrapper` pattern
- In CocoaPods builds, the C headers are already exposed via the
umbrella header, so the explicit module import is only needed under
SwiftPM
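
The guarded import from the second bullet looks roughly like this. The `#if canImport` pattern and the module name come from the PR; the flag and the helper function are illustrative additions, not code from the repository.

```swift
#if canImport(MachTaskSelfWrapper)
// SwiftPM build: the C target is visible as its own module.
import MachTaskSelfWrapper
let hasMachTaskSelfWrapperModule = true
#else
// CocoaPods-style build: the C headers arrive through the umbrella header.
let hasMachTaskSelfWrapperModule = false
#endif

/// Reports which import path this build took (illustrative helper only).
func machWrapperImportDescription() -> String {
    hasMachTaskSelfWrapperModule
        ? "imported MachTaskSelfWrapper as a module"
        : "relying on the umbrella header"
}
```

Because `canImport` is evaluated at compile time, the same source file builds under both package managers without a separate module map on the subspec.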

## Verification

- `pod lib lint FluidAudio.podspec --allow-warnings` — **passed**
- `swift build` — **passed**

Fixes #545

Co-authored-by: dianshu <dianshu@123.com>
v0.14.3
2026-04-29 09:25:25 -04:00
Alex 248b76b8b6 feat(tts/styletts2): scaffold StyleTTS2 4-stage pipeline integration (#554)
## Summary

Adds the FluidAudio host surface for the StyleTTS2 LibriTTS
multi-speaker checkpoint published at
`FluidInference/StyleTTS-2-coreml`, end-to-end. Covers asset download,
lazy bucketed model loading, text frontend (G2P + 178-token vocab),
bundle config validation, the ADPM2/Karras sampler, hard-alignment,
decoder driver, and a CLI driver.

`fluidaudio styletts2 "Hello world." --voice ref_s.bin --output out.wav`
produces an audible 24 kHz mono WAV.

## Pipeline

Per utterance (~5 ADPM2 steps default):

| Stage | Bucket axis | Buckets | Precision | Compute |
|---|---|---|---|---|
| `text_predictor` | input tokens | 32, 64, 128, 256, 512 | fp16 | ANE |
| `diffusion_step` | bert_dur frames | 512 only (5× per utt) | fp16 | CPU+GPU |
| `f0n_energy` | dynamic (en frames) | enumerated 256/512/1024/2048/4096 | fp16 | CPU |
| `decoder` | mel frames | 256, 512, 1024, 2048, 4096 | fp32 | CPU+GPU |

The decoder is fp32 because SineGen phase saturation in fp16 produces
robotic audio. The HF repo ships precompiled `compiled/*.mlmodelc`
bundles (skipping the cold-start `anecompilerservice` hit) plus
`.mlpackage` doubles for portability — only the `.mlmodelc` bundles are
fetched.
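
For the bucketed stages in the table above, one plausible selection rule is "smallest bucket that fits, then zero-pad". The sketch below assumes that rule; the PR does not spell out the exact policy, and the function names are hypothetical.

```swift
/// Bucket sizes copied from the pipeline table.
let textPredictorBuckets = [32, 64, 128, 256, 512]
let decoderBuckets = [256, 512, 1024, 2048, 4096]

/// Returns the smallest bucket that can hold `length`, or nil if none can.
func smallestBucket(fitting length: Int, in buckets: [Int]) -> Int? {
    buckets.sorted().first { $0 >= length }
}

/// Right-pads `values` with `padding` until it has `bucket` entries.
func padded<T>(_ values: [T], to bucket: Int, with padding: T) -> [T] {
    precondition(values.count <= bucket, "input is longer than the bucket")
    return values + Array(repeating: padding, count: bucket - values.count)
}
```

For example, a 40-token input would load the 64-token `text_predictor` bucket, and an input longer than the largest bucket has to be rejected or split by the caller.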

`f0n_energy` is pinned to CPU and always called at the largest
enumerated shape (1, 640, 4096) with zero-padding — the E5RT runtime
emits a stderr "tensor_buffer has known strides while the model has
FlexibleShapeInfo" warning when it sees enumerated shapes on GPU/ANE,
which is non-fatal but the CPU/largest-shape path sidesteps it cleanly.

## What's in this PR

**Sources:**
- `StyleTTS2Constants` — audio/tokenizer/model dims + sampler defaults
(Karras `rho=9` to match upstream)
- `StyleTTS2Error` — module-local `LocalizedError` enum
- `Assets/StyleTTS2ResourceDownloader` — `DownloadUtils.downloadRepo`
wrapper
- `Assets/StyleTTS2Vocab` — 178-token espeak-ng IPA vocab loader;
iterates Unicode scalars (not graphemes) so combining marks like U+0329
syllabic / U+0361 tie-bar look up against their own vocab entries
- `Assets/StyleTTS2BundleConfig` — `config.json` Codable + `validate()`
against `StyleTTS2Constants`
- `Assets/StyleTTS2VoiceStyle` — parser for precomputed `ref_s.bin` (256
fp32 LE) speaker-prosody blobs (dump script lives in
`mobius-styletts2/scripts/06_dump_ref_s.py`)
- `Pipeline/StyleTTS2ModelStore` — actor with lazy per-bucket `MLModel`
cache + lazy vocab/config caches; `f0nEnergy()` pinned `.cpuOnly`
- `Pipeline/StyleTTS2Phonemizer` — `TtsTextPreprocessor` → in-tree
`G2PModel` (BART, misaki IPA) for English with a small misaki→espeak-ng
remap (`A→eɪ`, `I→aɪ`, `O→oʊ`, `W→aʊ`, `Y→ɔɪ`, schwa-offglide → `ə`);
other languages fall back to `MultilingualG2PModel`
- `Pipeline/StyleTTS2Sampler` — ADPM2 / Karras-rho noise schedule +
CFG-aware sampling closure; deterministic via SplitMix64 + Box-Muller
- `Pipeline/StyleTTS2Synthesizer` — full 4-stage driver. Float16-aware
`MLMultiArray` reads (`denoised`, `F0`, `N` all ship as fp16 per
schema), cumsum-of-durations → one-hot → matmul hard-alignment, decoder
fan-out
- `StyleTTS2Manager` — public actor; `initialize()` validates bundle
config; `tokenize()` exposes the text frontend;
`synthesize(text:voiceStyleURL:steps:alpha:beta:randomSeed:)` returns 24
kHz mono WAV `Data`
- `Sources/FluidAudioCLI/Commands/StyleTTS2Command` — `fluidaudio
styletts2 "<text>" --voice <ref_s.bin> [--output --steps --alpha --beta
--seed]`
- `ModelNames.StyleTTS2` + `Repo.styleTts2` wired into the central
registries
- `TtsBackend.styleTts2` case
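
The "cumsum-of-durations → one-hot → matmul hard-alignment" step named in the synthesizer entry can be sketched with plain arrays. This is an illustration of the technique under simple assumptions (integer frame durations, row-major nested arrays), not the code in `StyleTTS2Synthesizer`.

```swift
/// Builds a tokens × frames one-hot alignment matrix from per-token durations.
func hardAlignment(durations: [Int]) -> [[Float]] {
    precondition(durations.allSatisfy { $0 >= 0 }, "durations must be non-negative")
    let totalFrames = durations.reduce(0, +)
    var matrix = [[Float]](
        repeating: [Float](repeating: 0, count: totalFrames),
        count: durations.count
    )
    var start = 0
    for (token, duration) in durations.enumerated() {
        let end = start + duration  // running cumulative sum of durations
        for frame in start..<end {
            matrix[token][frame] = 1
        }
        start = end
    }
    return matrix
}

/// Expands token features (dim × tokens) to frame features (dim × frames)
/// by multiplying with the alignment matrix.
func expand(features: [[Float]], alignment: [[Float]]) -> [[Float]] {
    let frames = alignment.first?.count ?? 0
    return features.map { row in
        (0..<frames).map { frame in
            var sum: Float = 0
            for (token, value) in row.enumerated() {
                sum += value * alignment[token][frame]
            }
            return sum
        }
    }
}
```

With durations `[2, 1, 3]`, a single feature row `[10, 20, 30]` expands to `[10, 10, 20, 30, 30, 30]`: each token's value is repeated for as many frames as its predicted duration.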

**Tests** (37/37 pass, no network or CoreML deps):
- `StyleTTS2VocabTests` — load happy path, combining-grapheme handling,
missing/malformed JSON, encode known/unknown/empty
- `StyleTTS2BundleConfigTests` — load + validate against every constant
mismatch
- `StyleTTS2VoiceStyleTests` — `ref_s.bin` parsing (size, fp32
round-trip, wrong-size rejection)
- `StyleTTS2SamplerTests` — Karras schedule, RNG determinism

## Verification

- `fluidaudio styletts2 "Hello world. The quick brown fox jumps over the
lazy dog." --voice /tmp/styletts2-ref_s.bin --output /tmp/out.wav --seed
42` → 4.80s @ 24 kHz, RMS 7158, 0.0009% clipping
- `fluidaudio transcribe /tmp/out.wav` → `Hello world quick brown fax
nomps over lazy` (most words recovered; residual gaps are BART G2P
emitting reduced `ð` for "the" with no schwa, and lacking length marks
`ː` on stressed long vowels)
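
The `--seed 42` run above is reproducible because the sampler's noise is drawn from a seeded generator; the source list names SplitMix64 plus Box-Muller. The sketch below implements those two standard algorithms. How the real `StyleTTS2Sampler` seeds the generator and maps its output to tensors is not described in the PR, so `gaussianNoise(count:seed:)` is a hypothetical wrapper.

```swift
import Foundation

/// SplitMix64 pseudo-random generator (standard public-domain algorithm).
struct SplitMix64 {
    private var state: UInt64

    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }

    /// Uniform Double in [0, 1), built from the top 53 bits.
    mutating func nextUniform() -> Double {
        Double(next() >> 11) / 9007199254740992.0  // divide by 2^53
    }
}

/// Draws one standard-normal sample via the Box-Muller transform.
func nextGaussian(_ rng: inout SplitMix64) -> Double {
    let u1 = 1.0 - rng.nextUniform()  // in (0, 1], so log(u1) stays finite
    let u2 = rng.nextUniform()
    return (-2.0 * log(u1)).squareRoot() * cos(2.0 * Double.pi * u2)
}

/// Hypothetical helper: `count` Gaussian samples that depend only on `seed`.
func gaussianNoise(count: Int, seed: UInt64) -> [Double] {
    var rng = SplitMix64(seed: seed)
    return (0..<count).map { _ in nextGaussian(&rng) }
}
```

Because the generator has no hidden global state, two calls with the same seed return identical noise, which is what makes a seeded synthesis repeatable.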

## Test plan

- [x] `swift build -c release` clean
- [x] `swift test --filter StyleTTS2` → 37/37 pass
- [x] `swift format lint` clean on new files
- [x] End-to-end CLI synth produces audible WAV
- [x] ASR roundtrip recovers most content words

## Known follow-up

- Tune misaki→espeak remap for length marks `ː` and reduced
function-words (would push ASR WER lower)
- Voice-bank packaging story (currently the user must precompute
`ref_s.bin` via `mobius-styletts2/scripts/06_dump_ref_s.py`)
- StyleTTS2 benchmark suite

2026-04-29 09:24:44 -04:00
Alex e332c18b49 docs(models): fix Cohere Transcribe Model Sources link target (#553)
## Summary

Follow-up fix for [Devin Review on
#551](https://github.com/FluidInference/FluidAudio/pull/551#pullrequestreview-4192652442):

The Cohere Transcribe Model Sources row had a Markdown link whose text
included `/q8` but whose URL pointed at the repo root, so clicking
landed at the root instead of the `q8` subdirectory. Move the subdir
into a parenthetical `(variant: \`/q8\`)` suffix to match the existing
**Qwen3-ASR** and **Parakeet EOU** rows in the same table, and drop the
mismatched suffix from the link text.

```diff
-| Cohere Transcribe (INT8 hybrid, default) | [FluidInference/cohere-transcribe-03-2026-coreml/q8](https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml) |
+| Cohere Transcribe (INT8 hybrid, default) | [FluidInference/cohere-transcribe-03-2026-coreml](https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml) (variant: \`/q8\`) |
```

## Test plan

- [x] Docs-only change.
- [x] Verified the link target now matches the displayed link text and
the surrounding row pattern (Qwen3-ASR / Parakeet EOU).
- [x] No code paths touched.
2026-04-28 17:57:09 -04:00
Alex 5c16ee120e docs(models): add Cohere Transcribe + Qwen3-ASR rows (#551)
## Summary

`Documentation/Models.md` was missing two ASR backends that already ship
via the public `Repo` enum (`cohereTranscribeCoreml`, `qwen3Asr` /
`qwen3AsrInt8`) and have full integration docs under
`Documentation/ASR/`. Add them to the **Batch Transcription** table +
the **Model Sources** table at the bottom.

- **Cohere Transcribe**
([#487](https://github.com/FluidInference/FluidAudio/pull/487),
[#537](https://github.com/FluidInference/FluidAudio/pull/537)) —
14-language encoder-decoder, 48L Conformer + 8L decoder, INT8 encoder +
FP32 ANE-resident static-shape decoder (v2). Hard 35 s per-call audio
cap from upstream config; language must be passed explicitly.
- **Qwen3-ASR**
([#281](https://github.com/FluidInference/FluidAudio/pull/281),
[#312](https://github.com/FluidInference/FluidAudio/pull/312),
[#410](https://github.com/FluidInference/FluidAudio/pull/410)) —
30-language with auto-detect, 2-model pipeline (ANE-optimized encoder +
stateful 28L decoder), FP32 / INT8 variants, macOS 15 / iOS 18+, beta
(accuracy may trail PyTorch reference).

Also clarify the Parakeet EOU `Model Sources` row to surface the
per-chunk-size subdirs (`/160ms`, `/320ms`, `/1280ms`) that
`Repo.parakeetEou*` actually points at, which saves future contributors a
`grep ModelNames.swift`.

## Out of scope (flagged for follow-up)

Other Models.md staleness I noticed but did not touch in this PR:

- Sortformer row doesn't enumerate the 6 model variants
(`fastV2`/`v2_1`, `balancedV2`/`v2_1`, `highContextV2`/`v2_1`).
- LS-EEND row doesn't enumerate the 4 dataset variants (AMI / CALLHOME /
DIHARD II / DIHARD III).
- "Pyannote CoreML Pipeline" row covers both online and offline
diarization but doesn't mention that `OfflineDiarizerManager` uses an
entirely different model set (Segmentation + FBank + Embedding + PldaRho
+ plda-parameters.json).
- Nemotron Streaming row mentions chunk sizes inline but doesn't note
each is a distinct HF subdir (`/80ms`, `/160ms`, `/560ms`, `/1120ms`).

Happy to do those in a separate PR if useful.

## Test plan

- [x] Docs-only change; verified rendered tables in
`Documentation/Models.md`.
- [x] Cross-referenced PR / HF / variant info against
`Sources/FluidAudio/ModelNames.swift`, `Documentation/ASR/Cohere.md`,
and `Documentation/ASR/Qwen3-ASR.md`.
- [x] No code paths touched; CI build/test/format remain unaffected.
2026-04-28 17:50:48 -04:00
Alex e435319a2f docs(models): drop Parakeet CTC Japanese + ASR/TTS row cleanups (#552)
## Summary

`Documentation/Models.md` cleanup pass:

- **Drop Parakeet CTC Japanese.** CTC-only inference for Japanese was
deleted in 846924a1d; only the INT8 CTC-trained preprocessor + encoder
from the parakeet-0.6b-ja-coreml repo are reused as the acoustic
frontend, paired with a TDT decoder + joint (see `ModelNames.TDTJa` and
the comment on `Repo.parakeetJa` in
`Sources/FluidAudio/ModelNames.swift:11-14`). Fold the relevant detail
into the surviving **Parakeet TDT Japanese** row.
- **Fix Japanese HF path.** Was the stale `parakeet-ctc-0.6b-ja-coreml`,
now correctly points at
[`parakeet-0.6b-ja-coreml`](https://huggingface.co/FluidInference/parakeet-0.6b-ja-coreml)
— matches `Repo.parakeetJa`.
- **Rename ASR section** `Batch Transcription (Near Real-Time)` →
`Sliding-Window Transcription (Near Real-Time)` to match the actual
implementation (`SlidingWindowAsrManager` wrapping TDT/CTC chunks). Add
a short blurb contrasting it with the Streaming section so the
distinction is explicit instead of implied.
- **Parakeet EOU row:** add the missing **1280ms** chunk-size variant
(`Repo.parakeetEou1280`; `StreamingEouAsrManager.swift:7` explicitly
documents 160ms / 320ms / 1280ms support). Rephrase to highlight the
latency/accuracy spectrum.
- **Kokoro ANE row:** clarify that the variant was derived from
`laishere/kokoro-coreml` **with permission** (not just lifted).

## Test plan

- [x] Docs-only change; verified rendered tables in
`Documentation/Models.md`.
- [x] All HF paths and chunk-size variants cross-checked against
`Sources/FluidAudio/ModelNames.swift` and
`Sources/FluidAudio/ASR/Parakeet/Streaming/EOU/StreamingEouAsrManager.swift`.
- [x] No code paths touched; CI build/test/format remain unaffected.
2026-04-28 17:50:30 -04:00
Alex d89cf01ba6 docs(models): list CosyVoice3 under Not Production Ready (#550)
## Summary
- Add CosyVoice3 (Mandarin zero-shot voice cloning, #536) to the **Not
Production Ready** table in `Documentation/Models.md` alongside Magpie
TTS Multilingual.
- Mirrors the existing Magpie row format (PR / mobius / HF links +
status blurb) so contributors browsing the model index see which TTS
backends ship today but still need community perf work.
- Also adds the corresponding entry to the Model Sources table at the
bottom of the file (parallel to Magpie).

## Why now
CosyVoice3 landed via #536 with the `[BETA — slow, RTFx < 1.0]` flag in
the CLI and a beta warning in `Documentation/TTS/CosyVoice3.md`, but the
top-level `Models.md` still implied it was a fully supported TTS
backend. The dominant perf bottlenecks (Flow CFM forced fp32 /
`cpuAndGPU` because fp16+ANE NaNs through fused `layer_norm`; HiFT
sinegen / windowing falling back to CPU) are documented inline so future
PR / issue authors have shared context.

## Test plan
- [x] Docs-only change; verified rendered table in
`Documentation/Models.md` (Not Production Ready section + Model
Sources).
- [x] No code paths touched; CI build/test/format remain unaffected.
2026-04-28 17:28:48 -04:00
Alex 3d9d422202 feat(tts/magpie): add NVIDIA Magpie TTS Multilingual 357M Swift port (#541)
## Summary

Ports the NVIDIA Magpie TTS Multilingual 357M autoregressive TTS from
Python (mobius [#24](https://github.com/FluidInference/mobius/pull/24))
to Swift. Closes FluidInference/FluidAudio#49.

> **⚠️ Experimental — quite slow on Apple Silicon, needs significant
perf work.** First synth on a fresh process is dominated by CoreML model
load + first-call ANE compile (~30 s). Warm synths run at **~96 s wall
for an 8-word English sentence** on M-series — RTFx ≈ **0.04** (~25×
slower than realtime). Whether the throughput ceiling is a model
characteristic, a CoreML conversion limitation, or both is still being
investigated and is expected to improve in subsequent iterations. **Do
not use in latency-sensitive paths.** For real-time use prefer Kokoro
(~20× RTFx, parallel) or PocketTTS (~1.5–2× RTFx, streaming Mimi).
Magpie's value prop is multilingual coverage + 5 built-in speaker
contexts, not throughput.

## Status

Functional. Audio quality is perceptually clean across all 5 speakers;
first synth on a fresh process is dominated by CoreML model load +
first-call ANE compile (~30 s), warm synths run at ~96 s wall for an
8-word English sentence on M-series (RTFx ≈ 0.04). Quality is ASR-clean
on 4/5 speakers; speaker 0 has a single trailing-word artifact ("…and")
attributable to fp16 sampler-trajectory drift, **not a structural bug**.
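
RTFx here is audio seconds produced per wall-clock second, so the figures quoted above can be cross-checked with a few lines. The function names are illustrative, and the implied audio length is derived from the rounded numbers in the text rather than measured.

```swift
/// RTFx: seconds of audio produced per second of wall-clock time.
func rtfx(audioSeconds: Double, wallSeconds: Double) -> Double {
    audioSeconds / wallSeconds
}

/// How many times slower than realtime a given RTFx is (meaningful below 1).
func slowdownFactor(rtfx: Double) -> Double {
    1.0 / rtfx
}

// Figures quoted above: about 96 s of wall time at an RTFx of about 0.04.
let impliedAudioSeconds = 0.04 * 96.0        // about 3.84 s of audio
let slowdown = slowdownFactor(rtfx: 0.04)    // about 25
```

An RTFx of 0.04 therefore corresponds to roughly 25× slower than realtime, and about 3.8 s of audio for the 8-word sentence.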

Not yet covered: Japanese (deferred — needs OpenJTalk XCFramework +
MeCab dict), CFG performance optimization, MLX-backed LocalTransformer.

- **Languages (8/9):** English, Spanish, German, French, Italian,
Vietnamese, Mandarin, Hindi. Japanese deferred pending OpenJTalk
XCFramework integration.
- **5 built-in speakers** (`.john`, `.sofia`, `.aria`, `.jason`, `.leo`)
with 110-token (768d fp16) context embeddings.
- **Inline IPA override** (`"Hello | ˈ n ɛ m o ʊ | world"`) routes `|…|`
segments directly to the tokenizer for pronunciation control —
first-class feature.
- **Streaming**: `synthesizeStream(...)` yields `MagpieAudioChunk` per
chunk as soon as its NanoCodec decode finishes (first chunk is a small
clause-sized head ≈ 50 frames / 2.3 s for low TTFA). Each non-final
chunk includes punctuation-aware trailing silence for gapless playback.
- **ANE warmup at init**: `MagpieTtsManager.initialize()` runs an
unmeasured 16-step synthesis to force `MILCompilerForANE` to compile the
decoder graphs once. Without this the first user-facing `synthesize()`
can fall back to GPU/CPU and run multiple× slower.
- **Output:** 22.05 kHz mono WAV via 8-codebook NanoCodec decoder, max
11.89 s per synthesis (256 nanocodec frames).
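
The frame counts and durations in the list above are mutually consistent if one NanoCodec frame covers 1024 output samples at 22.05 kHz. That hop size is an inference from "256 frames ≈ 11.89 s" and is not stated in the PR, so treat it as an assumption.

```swift
let magpieSampleRate = 22_050.0
let assumedSamplesPerFrame = 1_024.0  // assumption inferred from 256 frames ≈ 11.89 s

/// Audio duration in seconds for a given number of codec frames.
func seconds(forFrames frames: Int) -> Double {
    Double(frames) * assumedSamplesPerFrame / magpieSampleRate
}

let maxDuration = seconds(forFrames: 256)   // about 11.89 s, the per-synthesis cap
let firstChunk = seconds(forFrames: 50)     // about 2.3 s, the streaming head chunk
```

The same conversion explains why a first chunk of about 50 frames gives a time-to-first-audio budget of a little over two seconds of speech.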

## HF assets — live


[`FluidInference/magpie-tts-multilingual-357m-coreml`](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml)
is **uploaded and ready** (1.4 GB). Ships:

- `text_encoder.{mlmodelc,mlpackage}` — both compiled and portable
- `decoder_step.{mlmodelc,mlpackage}` — rank-4 split-K/V cache, 97.3%
ANE residency
- `decoder_prefill.{mlmodelc,mlpackage}` — fast prefill path (110-token
batched)
- `nanocodec_decoder.{mlmodelc,mlpackage}` — 8-codebook → 22 kHz PCM
(CPU-only by export)
- `constants/` — `constants.json`, `speaker_info.json`, 8 audio-codebook
embeddings, 5 speaker contexts, local-transformer weights
- `tokenizer/` — per-language phoneme/jieba/pypinyin lookups
(lazy-downloaded)
- **`manifest.json`** — machine-readable index (sha256, file sizes, npy
shapes, model IO specs) consumed by `MagpieResourceDownloader`

## Architecture

| Stage | Implementation |
|---|---|
| Text encoder | `text_encoder.mlmodelc` (CoreML, cpuAndNeuralEngine) |
| Prefill | `decoder_prefill.mlmodelc` fast path (single batched call,
110 tokens), or fallback loop |
| AR loop | `decoder_step.mlmodelc` with **rank-4 split-K/V cache**
(`cache_k{i}` / `cache_v{i}`, shape `[1, 512, 12, 64]` × 12 layers;
logits `var_2129`); `outputBackings` + double-buffered KV cache to keep
allocations off the hot path |
| Local transformer | Pure Swift, 1-layer (256d), Accelerate
(`cblas_sgemm`) + BNNS (GELU); fp32 only (fp64 path removed); vDSP-fused
embed; min-heap top-K |
| Sampling | top-k (80) + temperature (0.6), audio-EOS mask during
`minFrames`, forbidden-token mask `[2016, 2018-2023]`;
`torch.topk`-faithful tie semantics (counts above-threshold +
earliest-index ties up to K) |
| Vocoder | `nanocodec_decoder.mlmodelc` pinned to `cpuOnly` (ANE
rejects the graph) — 8×N codes → float PCM → peak-normalize |

CFG is **off by default** (`cfgScale = 1.0`); enabling it doubles
per-step decoder cost. Assets fetched lazily via `DownloadUtils`; only
the languages requested in `downloadAndCreate(languages:)` are
materialized.
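
The Sampling row in the table above combines a forbidden-token mask, top-k filtering with a specific tie rule, and temperature scaling. The sketch below follows the tie rule as worded in the table (everything strictly above the k-th value, then earliest-index ties up to K). It is not the repository's implementation, and the function names and the caller-supplied `uniform` draw are assumptions.

```swift
import Foundation

/// Forbidden token ids as listed in the table: 2016 and 2018 through 2023.
let forbiddenTokens = Set([2016] + Array(2018...2023))

/// Indices kept by top-k: everything strictly above the k-th largest value,
/// then ties at that value in ascending index order until k are kept.
func topKIndices(_ logits: [Float], k: Int) -> [Int] {
    guard k > 0, !logits.isEmpty else { return [] }
    let limit = min(k, logits.count)
    let threshold = logits.sorted(by: >)[limit - 1]
    var kept = logits.indices.filter { logits[$0] > threshold }
    for index in logits.indices where logits[index] == threshold && kept.count < limit {
        kept.append(index)
    }
    return kept.sorted()
}

/// Samples one token id. `uniform` is a caller-supplied draw in [0, 1).
func sampleToken(
    logits: [Float], forbidden: Set<Int>, k: Int, temperature: Float, uniform: Float
) -> Int {
    var masked = logits
    for index in forbidden where masked.indices.contains(index) {
        masked[index] = -.infinity  // masked tokens get zero probability
    }
    let kept = topKIndices(masked, k: k)
    precondition(!kept.isEmpty, "nothing left to sample from")
    let maxLogit = kept.map { masked[$0] }.max()!
    precondition(maxLogit.isFinite, "all candidate tokens are masked")
    // Softmax with temperature over the kept logits, shifted for stability.
    let weights = kept.map { exp((masked[$0] - maxLogit) / temperature) }
    let total = weights.reduce(0, +)
    var cumulative: Float = 0
    for (weight, index) in zip(weights, kept) {
        cumulative += weight / total
        if uniform < cumulative { return index }
    }
    return kept[kept.count - 1]
}
```

With the defaults from the table this would be called with `k: 80` and `temperature: 0.6`; passing the uniform draw in from outside keeps the function deterministic and easy to test.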

## Public API

```swift
let manager = try await MagpieTtsManager.downloadAndCreate(
    languages: [.english, .spanish]
)

// One-shot
let result = try await manager.synthesize(
    text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.",
    speaker: .john,
    language: .english
)
let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate)

// Streaming (chunk-level, per-chunk NanoCodec decode)
for try await chunk in try await manager.synthesizeStream(text: longText) {
    audioPlayer.append(chunk.samples)
}
```

## CLI

```
fluidaudiocli magpie download --languages en,es
fluidaudiocli magpie text --text "Bonjour." --speaker 0 --language fr --output out.wav
fluidaudiocli magpie text --text "Long passage..." --stream --output stream.wav
fluidaudiocli magpie bench --runs 5 --warmup 1   # in-process median RTFx
```

(Parity tooling moved to mobius — see
[FluidInference/mobius#44](https://github.com/FluidInference/mobius/pull/44)
for the fixture emitter / Python ground-truth path.)

## Inline IPA — verified working

The `|…|` passthrough is **native NeMo `IpaG2p` behavior** (not added by
us): segments inside pipes are looked up directly in `token2id.json` as
whitespace-separated phonemes, bypassing G2P.

```
input:  "Hello | n ɛ m o ʊ | from FluidAudio."
G2P:    həˈloʊ nɛmoʊ frʌm fluɪdaːdɪoʊ.   ← injected IPA visible mid-stream
```

Validated end-to-end with the live HF assets (Python reference): 30
tokens → 43 frames → 2.00 s @ 3.97x RTF.
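
A minimal sketch of the `|…|` segmentation described above: text between a pair of pipes is treated as whitespace-separated phonemes, and everything else goes through G2P. The `TextSegment` type and `splitInlineIPA` are hypothetical names, and the handling of an unmatched pipe (treated as plain text here) is an assumption rather than documented behavior.

```swift
import Foundation

enum TextSegment: Equatable {
    case plain(String)   // routed through G2P
    case ipa([String])   // whitespace-separated phonemes, bypasses G2P
}

/// Splits text into plain and inline-IPA segments.
func splitInlineIPA(_ text: String) -> [TextSegment] {
    var segments: [TextSegment] = []
    // Splitting on "|" alternates plain, IPA, plain, ... for balanced pipes.
    let parts = text.split(separator: "|", omittingEmptySubsequences: false)
    for (offset, part) in parts.enumerated() {
        let trimmed = part.trimmingCharacters(in: .whitespaces)
        if trimmed.isEmpty { continue }
        // An odd-indexed part is IPA only if a closing pipe follows it.
        let isClosed = offset < parts.count - 1
        if offset % 2 == 1 && isClosed {
            let phonemes = trimmed.split(separator: " ").map { String($0) }
            segments.append(.ipa(phonemes))
        } else {
            segments.append(.plain(trimmed))
        }
    }
    return segments
}
```

For the example input above this yields a plain `Hello`, an IPA segment with the five phonemes `n ɛ m o ʊ`, and a plain `from FluidAudio.`.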

## Guardrails followed

- No `@unchecked Sendable`; `MagpieTtsManager`, `MagpieModelStore`,
`MagpieTokenizer`, `MagpieSynthesizer` are all `actor`s.
- No dummy models / synthetic data.
- `AppLogger(category: "Magpie*")` throughout, no `print()` (including
`MagpieCommand.printUsage`).
- `MagpieError: Error, LocalizedError` for all error paths.

## Test plan

- [x] `swift build` — clean on macOS 14 / Swift 6 (only pre-existing
`cblas_sgemm` deprecation warnings from Accelerate); iOS build also
clean (Swift 6 isolation-checker workaround landed).
- [x] `swift test --filter "Magpie|NpyReader"` — 17 / 17 pass:
  - `MagpieConstantsTests` (4) — forbidden-token mask, shape relations,
    NeMo tokenizer-name parity, per-language file coverage
  - `MagpieIpaOverrideTests` (7) — `|…|` segmentation edge cases
  - `MagpieKvCacheTests` (3) — cache shape, `addInputs` key count, static
    output keys
  - `NpyReaderTests` (3) — fp32 parse, fp16→fp32 upcast, bad-magic
    rejection
- [x] HF assets uploaded; Python inference parity confirmed (4.60 s
plain English, 2.00 s + 11.05 s with inline IPA).
- [x] End-to-end Swift validation: `magpie download` → `magpie text`
produces audible 22 kHz WAV; `magpie bench` reports stable RTFx medians
on M-series.
- [x] Audio quality validated: ASR-clean on 4/5 speakers; speaker 0
trailing-word artifact diagnosed as fp16 sampler-trajectory drift, not
structural.
- [x] Streaming validated: chunk-level decode yields correct gapless
playback when concatenated; first chunk arrives in ~half the wall-time
of the full synthesis.
- [x] Devin review feedback addressed: `--text` flag handler,
`torch.topk`-faithful tie semantics, `AppLogger.info()` in
`printUsage()`, stale `MagpieComputePlanCommand` removed.

## Companion PR

Conversion pipeline + parity-fixture emitter + manifest generator:
[FluidInference/mobius#44](https://github.com/FluidInference/mobius/pull/44).

## Out of scope (follow-ups — perf is the headline item)

- **Throughput investigation** — current ~0.04 RTFx is the dominant gap.
Suspect surfaces: rank-4 split-K/V scatter ANE residency vs. apparent
GPU fallback, NanoCodec CPU-only export, LocalTransformer per-step
Accelerate path.
- **MLX-backed LocalTransformer** — drop-in replacement for the
Accelerate/BNNS forward pass to put the per-step hot loop on the GPU.
- **CFG perf optimization** — currently doubles per-step decoder cost.
- **Speaker 0 fp16 sampler drift** — investigate whether
higher-precision logits or a small temperature schedule eliminates the
trailing-word artifact.
- Japanese support (OpenJTalk + MeCab dict).
- Streaming NanoCodec via MLState conv-cache (current export is
fixed-window batch; chunked-overlap fallback yields <15 dB SNR —
unviable without proper state caching).
- CI workflow `magpie-benchmark.yml`.
2026-04-28 10:54:00 -04:00
Alex b82d4f2fc8 feat(tts): CosyVoice3 Mandarin zero-shot TTS port (#536)
## Summary

Swift port of **CosyVoice3** (Mandarin zero-shot TTS) wired through the
four validated CoreML mlpackages hosted at
[`FluidInference/CosyVoice3-0.5B-coreml`](https://huggingface.co/FluidInference/CosyVoice3-0.5B-coreml).
Delivered in two layered phases matching the existing Kokoro manager
shape:

- **Phase 1 (parity harness):** full Swift pipeline that ingests a Python
  frontend fixture (`.safetensors`) and produces WAV within parity of the
  Python reference — validates all four CoreML bindings, 24-layer Qwen2
  KV-cache slicing, RAS sampler, and Flow / HiFT wiring.
- **Phase 2 (native frontend):** pure-Swift Qwen2 BPE tokenizer + Qwen2
  text embeddings + minimal Mandarin text normalizer + 24 kHz log-mel
  DSP so callers can synthesize directly from `String` input without a
  Python dependency.

Conversion pipeline that produced the mlpackages lives at
[FluidInference/mobius#42](https://github.com/FluidInference/mobius/pull/42).
Backend documentation:
[`Documentation/TTS/CosyVoice3.md`](./Documentation/TTS/CosyVoice3.md).

> ⚠️ **Backend ships as beta / experimental.** End-to-end synthesis is
> currently slow on Apple Silicon — RTFx < 1.0 typical, several seconds
> of latency for short Mandarin utterances. Cause is partly the Flow CFM
> stage (fp32 / CPU-or-GPU only because fp16 + ANE produces NaNs through
> the fused `layer_norm`) and partly HiFT sinegen / windowing ops that
> fall back to CPU. Treat as preliminary; may be a model issue, may be
> recoverable via better conversion. Warnings surfaced via doc comments,
> runtime `logger.warning` in `initialize()`, and CLI help text.

## What's shipped

### Public API (`Sources/FluidAudio/TTS/CosyVoice3/`)

```swift
public actor CosyVoice3TtsManager {
    public init(directory: URL? = nil, computeUnits: MLComputeUnits = .cpuAndNeuralEngine)
    public static func downloadAndCreate(from repo: Repo = .cosyvoice3,
                                         computeUnits: MLComputeUnits = .cpuAndNeuralEngine)
                                         async throws -> CosyVoice3TtsManager
    public func initialize() async throws
    public func synthesize(text: String,
                           promptAssets: CosyVoice3PromptAssets,
                           options: CosyVoice3SynthesisOptions = .init(),
                           prenormalized: Bool = false) async throws -> CosyVoice3SynthesisResult
}
```

`TtsBackend` gains `case cosyvoice3`; `ModelNames` gets the
`CosyVoice3` enum plus `Repo.cosyvoice3` pointing at the HF repo.

### Pipeline components

| Layer | File | Notes |
|---|---|---|
| Model loader | `Assets/CosyVoice3ModelStore.swift` | Flat + nested layout probing, `.mlmodelc` compile cache |
| Downloader | `Assets/CosyVoice3ResourceDownloader.swift` | `DownloadUtils` wrapper for the 4 mlpackages + embeddings |
| Safetensors | `Shared/SafetensorsReader.swift` | ~170 LoC pure-Swift mmap + fp16/fp32/i32 accessors |
| Prefill/decode | `Pipeline/Synthesize/CosyVoice3Synthesizer.swift` | Actor; in-place `[24,1,2,768,64]` fp16 KV-cache passthrough |
| Sampler | `Pipeline/Synthesize/CosyVoice3RasSampler.swift` | top-p / top-k / repetition mask, seed-tokens bypass |
| Speech embed | `Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift` | Lazy mmap of 6761×896 fp16 table (12 MB) |
| Frontend | `Pipeline/Preprocess/CosyVoice3TextFrontend.swift` | Special-token splitting + lm_input assembly |
| Tokenizer | `Pipeline/Preprocess/Qwen2BpeTokenizer.swift` | tiktoken-compatible byte-level BPE, 151 936 vocab |
| Text embed | `Pipeline/Preprocess/CosyVoice3TextEmbeddings.swift` | 151 936×896 fp16 mmap → row copy |
| TN | `Pipeline/Preprocess/CosyVoice3ChineseNormalizer.swift` | Minimal regex-free port of `frontend_utils.py` |
| Prompt mel | `Pipeline/Preprocess/CosyVoice3PromptMel.swift` | 24 kHz log-mel matching `matcha audio.py` |

### CLI (`Sources/FluidAudioCLI/Commands/`)

```
fluidaudio tts --backend cosyvoice3-parity --fixture … --models-dir … --output …
fluidaudio tts --backend cosyvoice3 --text "希望你以后能够做的比我还好用" \
               --prompt-assets … --models-dir … --output …
fluidaudio tts --backend cosyvoice3-tokenizer --fixture …     # BPE parity
fluidaudio tts --backend cosyvoice3-frontend --text …         # lm_input dump
```

`--backend` help text marks `cosyvoice3` as `[BETA — slow, RTFx < 1.0]`
and the dispatcher emits a runtime `logger.warning` so users see the
status without reading docs.

### Tests

- `CosyVoice3ChineseNormalizerTests` — 8 cases covering
  `contains_chinese`, `replace_blank`, corner marks, brackets, digit
  spellout, trailing comma collapse, end-to-end, `is_only_punctuation`.
- `CosyVoice3PromptMelTests` — 8 cases covering the matcha frame-count
  formula, zero-audio log floor clamp, 200 Hz sine peak in low mel bins,
  exact reflect-pad semantics, periodic Hann endpoints, mel-basis shape /
  non-zero integrals, token-ratio trimming (and the throws-if-too-short
  path).
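
Two of the DSP properties those tests pin down, the periodic Hann window and reflect padding, can be sketched in a few lines. This is an illustrative sketch under the usual definitions, not the code in `CosyVoice3PromptMel.swift`; the function names `periodicHann` and `reflectPad` are hypothetical.

```swift
import Foundation

/// Periodic Hann window of length `n`: w[k] = 0.5 * (1 - cos(2πk / n)).
/// The symmetric variant divides by (n - 1) instead, so here only the
/// first sample is exactly zero and the last one is not.
func periodicHann(_ n: Int) -> [Float] {
    (0..<n).map { k in
        Float(0.5 * (1.0 - cos(2.0 * Double.pi * Double(k) / Double(n))))
    }
}

/// Reflect padding that mirrors around the edge sample without repeating it,
/// e.g. [1, 2, 3, 4] with pad 2 becomes [3, 2, 1, 2, 3, 4, 3, 2].
func reflectPad(_ signal: [Float], pad: Int) -> [Float] {
    precondition(pad >= 0 && pad < signal.count, "pad must be smaller than the signal")
    let n = signal.count
    // Left side reads indices pad, pad-1, ..., 1.
    let left = stride(from: pad, to: 0, by: -1).map { signal[$0] }
    // Right side reads indices n-2, n-3, ..., n-1-pad.
    let right = stride(from: n - 2, to: n - 2 - pad, by: -1).map { signal[$0] }
    return left + signal + right
}
```

With `n = 4` the window is approximately `[0, 0.5, 1, 0.5]`, which shows the asymmetric endpoints that an endpoint test would check.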

### Integration

- `ModelNames.swift` — `CosyVoice3` enum + `Repo.cosyvoice3`
- `TtsBackend.swift` — `case cosyvoice3`
- `TTSCommand.swift` — subcommand wiring
- `Documentation/TTS/CosyVoice3.md` — file roster, call flow, public API,
  CoreML caveats, indexed from `Documentation/README.md`

## Test plan

- [x] `swift build` (release)
- [x] Full `swift test` on this branch: **1 435 tests, 24 skipped, 0
failures** (~13 min)
- [x] `--filter CosyVoice3ChineseNormalizer` — 8/8 pass
- [x] `--filter CosyVoice3PromptMel` — 8/8 pass
- [x] Phase 1 end-to-end parity vs `build/wavs/e2e_shipping.wav` (max|Δ|
< 1e-3, SNR > 40 dB, CPU-only fp32 Flow)
- [x] Phase 2 end-to-end round-trip: Swift output → whisper.base →
expected transcript

## Non-goals / follow-ups

- SpeechTokenizer and CAMPPlus remain Python-side for prompt asset
  preparation; both have CoreML mlpackages but the required DSPs aren't
  yet ported. Users pass pre-computed `promptSpeechIds` / `spkEmbedding`
  in `CosyVoice3PromptAssets` for now.
- Full `wetext.ZhNormalizer` (year / currency / decimals / units) is not
  ported. Callers that need production-grade TN run wetext server-side
  and pass `prenormalized: true`.
- Flow stays fp32 (1.2 GB) until coremltools can run the fused
  `layer_norm` in fp16 without producing NaNs.

## Updates — Devin review + main merge

Picked up `origin/main` (resolved trivial enum-case merge in
`ModelNames.swift` / `TtsBackend.swift` / `TTSCommand.swift`; both
branches added new cases) and addressed the 12 Devin inline findings:

- **Sendable hygiene** — dropped `@unchecked Sendable` from 9 types.
  `CosyVoice3Synthesizer` is now a proper `actor` (it crosses actor
  boundaries from the manager); `CosyVoice3Models` is plain `: Sendable`
  via `@preconcurrency import CoreML` (matches the existing `TtsModels`
  pattern; the initial drop-to-no-Sendable broke the benchmark CI build
  with `non-sendable result type CosyVoice3Models cannot be sent from
  actor-isolated context`, since it's returned by `store.models()`).
  The remaining types had Sendable conformance dropped entirely since
  they don't escape the owning actor.
- **Prefill stop-token bug** — if the LLM emits an EOS token at step 0
  the synthesizer now throws `predictionFailed` instead of falling
  through into the decode loop and accumulating semantically meaningless
  tokens.
- **HiFT mel slice OOB** — added bounds check on `newMelStart` against
  the actual mel length and clamped `validFrames` to the available
  window; previously a `newMelStart > totalMelFrames` would index the
  `MLMultiArray` out of range during the chunk-packed call path.
- **Production logging** — replaced `print()` stage timings with
  `AppLogger.info`; added `logger.warning` calls in `initialize()` and
  the CLI dispatcher for the beta-status banner.
- **Beta marker** — doc comments on `CosyVoice3TtsManager` and
  `TtsBackend.cosyvoice3` flag the backend as experimental; CLI help
  text annotates the backend label.
- **Documentation** — added `Documentation/TTS/CosyVoice3.md` mirroring
  the Kokoro / PocketTTS doc layout (files, call flow, public API, CLI,
  CoreML caveats, known limits) and indexed it from
  `Documentation/README.md`.
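
The HiFT bounds check described above amounts to validating the start index and clamping the frame count. A minimal sketch, assuming the names `newMelStart`, `validFrames`, and `totalMelFrames` from the description; `MelWindow`, `MelSliceError`, and `clampedMelWindow` are hypothetical, and whether the real code throws or returns early is not stated in this PR.

```swift
/// Window of mel frames that is safe to read.
struct MelWindow: Equatable {
    let start: Int
    let validFrames: Int
}

enum MelSliceError: Error {
    case startOutOfRange(start: Int, totalFrames: Int)
}

/// Rejects a start index outside the mel buffer and clamps the requested
/// frame count to what is actually available after `newMelStart`.
func clampedMelWindow(newMelStart: Int,
                      requestedFrames: Int,
                      totalMelFrames: Int) throws -> MelWindow {
    guard newMelStart >= 0, newMelStart <= totalMelFrames else {
        throw MelSliceError.startOutOfRange(start: newMelStart, totalFrames: totalMelFrames)
    }
    let available = totalMelFrames - newMelStart
    let validFrames = max(0, min(requestedFrames, available))
    return MelWindow(start: newMelStart, validFrames: validFrames)
}
```

Doing the check before any buffer access keeps the failure a typed Swift error rather than an out-of-range read.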


---------
v0.14.2
2026-04-28 09:57:13 -04:00
Alex eff1752ebf feat(tts/pocket): multi-language support (EN + 9 new packs) (#549)
## Summary

Adds first-class support for the PocketTTS language packs that upstream
`kyutai/pocket-tts` just published (tracking issue #49). Users pick a
language at manager construction; all packs (including English) are
downloaded from `v2/<lang>/` on `FluidInference/pocket-tts-coreml`.

This PR replaces #540 (rebased onto current `main` from a fresh branch).

### Supported languages

| ID              | Layers | HF subtree           |
|-----------------|--------|----------------------|
| `english`       | 6      | `v2/english`         |
| `french_24l`    | 24     | `v2/french_24l`      |
| `german`        | 6      | `v2/german`          |
| `german_24l`    | 24     | `v2/german_24l`      |
| `italian`       | 6      | `v2/italian`         |
| `italian_24l`   | 24     | `v2/italian_24l`     |
| `portuguese`    | 6      | `v2/portuguese`      |
| `portuguese_24l`| 24     | `v2/portuguese_24l`  |
| `spanish`       | 6      | `v2/spanish`         |
| `spanish_24l`   | 24     | `v2/spanish_24l`     |

French ships 24-layer only upstream; no 6-layer French pack exists.
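
The table maps directly onto a string-backed enum. A minimal sketch using the `PocketTtsLanguage`, `repoSubdirectory`, and `transformerLayers` names this PR introduces; the Swift case spellings and the suffix-based layer lookup are assumptions, not the shipped implementation.

```swift
/// One case per language pack; raw values are the IDs from the table.
enum PocketTtsLanguage: String, CaseIterable {
    case english
    case french24l = "french_24l"
    case german
    case german24l = "german_24l"
    case italian
    case italian24l = "italian_24l"
    case portuguese
    case portuguese24l = "portuguese_24l"
    case spanish
    case spanish24l = "spanish_24l"

    /// HF subtree holding this pack, always "v2/<rawValue>".
    var repoSubdirectory: String { "v2/\(rawValue)" }

    /// 24 for the "_24l" packs, 6 otherwise (assumed derivation).
    var transformerLayers: Int { rawValue.hasSuffix("_24l") ? 24 : 6 }
}
```

Keeping the raw value identical to the pack ID means the subtree path never needs a separate lookup table.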

### Per-language artifacts shipped on HF

Each `v2/<lang>/` subtree contains 5 `.mlmodelc` directories +
`constants_bin/`:

| Artifact                  | Precision           | Notes |
|---------------------------|---------------------|-------|
| `cond_step.mlmodelc` | fp16 | conditioning prefill (voice/text → KV cache) |
| `flow_decoder.mlmodelc` | fp16 | flow-matching audio decoder |
| `flowlm_step.mlmodelc` | fp16 | per-token transformer step (default) |
| `flowlm_stepv2.mlmodelc` | **selective int8** | weight-only PTQ on attn + FFN body linears (per kyutai-labs/pocket-tts#147 recipe); EOS head + input embedding stay fp32. Optional smaller variant; **not currently loaded by Swift** but available for client-side swap-in. |
| `mimi_decoder.mlmodelc` | fp16 | Mimi neural codec decoder |

`mimi_encoder.mlmodelc` (voice cloning, language-agnostic) is fetched
lazily, separately from any language pack.

The selective int8 in `flowlm_stepv2` quantizes 4 linears per
transformer layer (`attn_in_proj`, `attn_out_proj`, FFN expand, FFN
contract) via
`coremltools.optimize.torch.quantization.PostTrainingQuantizer`
(per-channel, symmetric, weight-only). Sizes: 6L 145 MB → 74 MB; 24L 1.1
GB → 291 MB.
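
For readers unfamiliar with the scheme, per-channel symmetric weight-only quantization can be sketched as follows. This illustrates the general technique only; it is not the coremltools `PostTrainingQuantizer` implementation, and all names below are hypothetical.

```swift
/// One output channel (row) of a linear layer after quantization.
struct QuantizedRow {
    let values: [Int8]
    let scale: Float
}

/// Symmetric quantization: a single scale per row, zero-point fixed at 0,
/// values mapped into [-127, 127].
func quantizeRowSymmetric(_ row: [Float]) -> QuantizedRow {
    let maxAbs = row.map { abs($0) }.max() ?? 0
    guard maxAbs > 0 else {
        // All-zero row: any scale works, use 1 to avoid dividing by zero.
        return QuantizedRow(values: [Int8](repeating: 0, count: row.count), scale: 1)
    }
    let scale = maxAbs / 127
    let values = row.map { w -> Int8 in
        let q = (w / scale).rounded()
        return Int8(max(-127, min(127, q)))
    }
    return QuantizedRow(values: values, scale: scale)
}

/// Per-channel: every row gets its own scale.
func quantizePerChannel(_ weight: [[Float]]) -> [QuantizedRow] {
    weight.map(quantizeRowSymmetric)
}

/// Weight-only: activations stay in floating point, so weights are
/// expanded back before (or during) the matmul.
func dequantize(_ row: QuantizedRow) -> [Float] {
    row.values.map { Float($0) * row.scale }
}
```

Each weight costs one byte plus a shared per-row scale, and the round-trip error per weight is bounded by half the row's scale.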

## Changes

- **`PocketTtsLanguage`**: new enum (10 cases) with `repoSubdirectory`
(always `"v2/<rawValue>"`) and `transformerLayers` (6 or 24).
- **`ModelNames.PocketTTS`**: single `mimiDecoderFile =
"mimi_decoder.mlmodelc"` and single `requiredModels` set covering all
language packs uniformly.
- **`PocketTtsLayerKeys`**: discovers KV-cache I/O names at runtime so
6L and 24L packs share the same inference path. `discover(...)` requires
`expectedLayers: Int` (6 or 24) for early sanity-check.
- **`PocketTtsMimiKeys`**: discovers the Mimi decoder's audio output +
per-state input→output pairing dynamically (pass-through inputs first,
then shape-bucket pairing in canonical order).
- **Voice safetensors prebakes**: every language pack ships
`<voice>.safetensors` containing pre-computed LM transformer KV cache
snapshots (per-layer `[2, 1, seqLen, 16, 64]` F32 + I64 offset).
`PocketTtsConstantsLoader.loadVoiceSnapshot` parses the safetensors
header (8-byte LE u64 + JSON) and extracts per-layer cache + offset
tensors. `PocketTtsSynthesizer.kvCacheStateFromSnapshot` copies K/V
blocks into the runtime `[2, 1, kvCacheMaxLen, 16, 64]` state
independently. Skips the per-token `cond_step` voice prefill.
- **`PocketTtsResourceDownloader`**: `ensureModels(language:)` always
fetches the requested `v2/<lang>/` subtree via
`DownloadUtils.downloadSubdirectory`. `ensureVoice` downloads
`<voice>.safetensors`. `ensureMimiEncoder()` lazily fetches the
language-agnostic encoder for voice cloning without pulling a full
language pack.
- **`PocketTtsModelStore` / `PocketTtsManager` / `PocketTtsSession` /
`PocketTtsSynthesizer`**: language threaded through load + constants +
KV-cache sizing. Voice data is cached per `(language, voice)`. Mimi keys
discovered + cached per language.
- **Voice cloning across languages**: Mimi encoder is shared; cloned
`PocketTtsVoiceData` from one language's manager can be fed to another.
- **CLI**: `fluidaudiocli tts --backend pocket --language <id>` (default
`english`). Unknown values log the supported list and fall back to
English.
- **Docs**: `Documentation/TTS/PocketTTS.md` gains a Languages section +
cross-language cloning example.
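
The safetensors layout mentioned in the voice-prebake item (8-byte little-endian length, then a JSON header, then raw tensor bytes) can be parsed roughly as below. A minimal sketch assuming the public safetensors format; `TensorEntry` and `parseSafetensorsHeader` are hypothetical names, not the `PocketTtsConstantsLoader` API.

```swift
import Foundation

struct TensorEntry {
    let dtype: String
    let shape: [Int]
    let byteRange: Range<Int>   // offsets from the start of the file
}

enum SafetensorsError: Error { case truncated, badHeader }

private struct RawEntry: Decodable {
    let dtype: String
    let shape: [Int]
    let dataOffsets: [Int]
    enum CodingKeys: String, CodingKey {
        case dtype, shape
        case dataOffsets = "data_offsets"
    }
}

/// Header values are tensor entries, except the optional "__metadata__" map.
private enum HeaderValue: Decodable {
    case tensor(RawEntry)
    case other
    init(from decoder: Decoder) throws {
        if let entry = try? RawEntry(from: decoder) { self = .tensor(entry) } else { self = .other }
    }
}

func parseSafetensorsHeader(_ file: Data) throws -> [String: TensorEntry] {
    let prefix = [UInt8](file.prefix(8))
    guard prefix.count == 8 else { throw SafetensorsError.truncated }
    // Assemble the little-endian u64 header length byte by byte.
    let headerLength = prefix.enumerated().reduce(UInt64(0)) { acc, item in
        acc | (UInt64(item.element) << UInt64(8 * item.offset))
    }
    guard headerLength <= UInt64(file.count - 8) else { throw SafetensorsError.truncated }
    let dataStart = 8 + Int(headerLength)
    let headerBytes = file.subdata(in: (file.startIndex + 8)..<(file.startIndex + dataStart))

    let raw: [String: HeaderValue]
    do { raw = try JSONDecoder().decode([String: HeaderValue].self, from: headerBytes) }
    catch { throw SafetensorsError.badHeader }

    var tensors: [String: TensorEntry] = [:]
    for (name, value) in raw {
        guard case .tensor(let entry) = value else { continue }
        // data_offsets are [begin, end) relative to the end of the header.
        guard entry.dataOffsets.count == 2,
              entry.dataOffsets[0] >= 0,
              entry.dataOffsets[0] <= entry.dataOffsets[1],
              dataStart + entry.dataOffsets[1] <= file.count
        else { throw SafetensorsError.badHeader }
        tensors[name] = TensorEntry(
            dtype: entry.dtype,
            shape: entry.shape,
            byteRange: (dataStart + entry.dataOffsets[0])..<(dataStart + entry.dataOffsets[1]))
    }
    return tensors
}
```

A loader would then read each layer's cache and offset tensors from `byteRange` and interpret the bytes according to `dtype`.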

## Tests

- `PocketTtsLanguageTests` — pure-logic cases covering
`repoSubdirectory`, `transformerLayers`, and `requiredModels`. No model
download / no network.
- Full PocketTTS test suite: 16/16 passing (`swift test --filter
PocketTts`).

## Test plan

- [x] `swift build` — clean Release build (rebased onto current `main`)
- [x] `swift format lint --recursive --configuration .swift-format` —
clean
- [x] `swift test --filter PocketTts` — 16/16 pass
- [x] Manual end-to-end via FluidAudio Swift CLI for **all 10 language
packs** (fresh HF download → fp16 baseline → swap
`flowlm_stepv2.mlmodelc` → re-synthesize → Parakeet TDT v3 ASR check on
both outputs):

| Language        | fp16 ASR | flowlm_stepv2 (int8) ASR |
|-----------------|---|---|
| english         | ✓ | ✓ |
| spanish         | ✓ | ✓ |
| spanish_24l     | ✓ | ✓ |
| french_24l      | ✓ | ✓ |
| german          | ✓ | ✓ |
| german_24l      | ✓ | ✓ |
| italian         | ✓ | ✓ |
| italian_24l     | ✓ | ✓ |
| portuguese      | ✓ | ✓ |
| portuguese_24l  | ✓ | ✓ |

Selective int8 vs fp16 for `flowlm_step`: 6L 145 MB → 74 MB; 24L 1.1 GB
→ 291 MB.

## Non-goals

- Runtime language switching on a live `PocketTtsManager` (create a new
manager instead).
- Auto-inferring language from text.
- French 6-layer (upstream did not ship it).
- Auto-loading `flowlm_stepv2` (Swift continues to load
`flowlm_step.mlmodelc`/fp16 by default; the int8 variant ships in the
pack so clients can opt in via cache swap, and a future PR can add a
`precision: .fp16 | .int8` selector).

Closes #49
2026-04-27 22:21:43 -04:00
Alexandre Mendonça Alvaro 982f117eb4 fix: avoid misleading confidence warning in SlidingWindowAsrManager.finish() (#548)
### Why is this change needed?
`SlidingWindowAsrManager.finish()` reconstructs final text by calling
`processTranscriptionResult(...)` with empty `timestamps` and
`confidences`.

That path only needs token-to-text reconstruction, but it also runs
confidence calculation, which logs:

`Expected token confidences but got none - this should not happen`

In practice this shows up during normal finalization even though nothing
is actually wrong.

### What changed?
Use `convertTokensToText(accumulatedTokens)` directly in `finish()` when
only the merged final text is needed.

This keeps behavior the same for the returned transcription while
avoiding a misleading warning during normal shutdown.

### Validation
- `swift test --filter SlidingWindowAsrManagerTests`
- Reproduced locally from an app integration path before the patch;
warning no longer appears after the change.

2026-04-27 21:50:42 -04:00
Alex 7c115f6b4e feat(tts/kokoro-ane): add laishere 7-stage CoreML chain (ANE-optimized) (#547)
## Summary

Adds a second Kokoro TTS backend (`KokoroAne`) wrapping the
[laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml)
7-stage chain (Albert → PostAlbert → Alignment → Prosody → Noise →
Vocoder → Tail) behind an actor-based facade, used with the upstream
author's permission. Per-stage `MLComputeUnits` assignment routes
Albert/PostAlbert/Alignment/Vocoder to **ANE**; Prosody/Noise/Tail stay
on CPU+GPU for fp32/iSTFT-heavy ops.

The companion mobius PR for the conversion side:
https://github.com/FluidInference/mobius/pull/45

Existing `KokoroTtsManager` (single fp32 model) is untouched. Both
backends ship from the same `FluidInference/kokoro-82m-coreml` HF repo —
KokoroAne lives under the `ANE/` subdirectory.

## What's added

**Module: `Sources/FluidAudio/TTS/KokoroAne/`**
- `KokoroAneManager` — actor facade: `initialize`,
`synthesize(text|phonemes)`, `synthesizeDetailed`
- `KokoroAneSynthesizer` — 7-stage orchestration with fp16↔fp32 vImage
boundaries (Prosody→Noise→Vocoder→Tail). Uses `rebuild16`/`rebuild32`
helpers so each output is fetched once.
- `KokoroAneModelStore` — per-stage MLModel handles + vocab + voice pack
cache. Atomic-commit load (matches `PocketTtsModelStore` pattern) so
partial-load failures stay retryable.
- `KokoroAneVoicePack` — `[510, 256]` flat fp32 row indexing (timbre
cols `[0:128]`, style_s cols `[128:256]`)
- `KokoroAneVocab` — IPA → token IDs with BOS/EOS wrap, max 512
- `KokoroAneResourceDownloader` — HF cache management via existing
`DownloadUtils`; also downloads the shared kokoro G2P assets on first
init (see fix below)
- G2P reuses existing `G2PModel.shared`
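
The atomic-commit load mentioned for `KokoroAneModelStore` can be illustrated with a small stand-in. This is a sketch under simplifying assumptions: `StageStore` and `StageLoadError` are hypothetical, `String` handles replace `MLModel`, and isolation (actor or lock) is left out for brevity.

```swift
enum StageLoadError: Error {
    case failed(stage: String)
}

final class StageStore {
    /// nil until every stage has loaded successfully.
    private var models: [String: String]?

    func loadIfNeeded(stages: [String],
                      load: (String) throws -> String) throws -> [String: String] {
        if let models = models { return models }

        // Accumulate into a local dictionary first. If any stage throws,
        // `models` is still nil and the next call retries from scratch.
        var pendingModels: [String: String] = [:]
        for stage in stages {
            pendingModels[stage] = try load(stage)
        }

        // Single commit point: the store becomes "loaded" only here.
        models = pendingModels
        return pendingModels
    }
}
```

Writing stages directly into the stored property would instead leave a half-filled store that looks loaded after a failure.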

**CLI:**
```bash
fluidaudiocli tts "Hello world" --backend kokoro-ane [--metrics m.json]
fluidaudiocli tts-asr-verify --texts-file phrases.txt --output-json results.json
```
The `tts-asr-verify` batch command synthesizes each phrase, transcribes
with Parakeet, and emits per-phrase + macro/micro WER with stage
timings.

**Tests** (`Tests/FluidAudioTests/TTS/KokoroAne/`):
- 13 unit tests (vocab, voice pack) — no model deps, run on CI
- 5 E2E tests (synth + ASR roundtrip) — gated by
`FLUIDAUDIO_RUN_KOKOROANE_E2E=1`

**Docs:**
- New `Documentation/TTS/KokoroAne.md` — when-to-pick decision table,
CLI/Swift quick start, per-stage compute targets, voice pack layout,
limits, perf numbers, source links.
- Top-of-file callout on `Documentation/TTS/Kokoro.md` linking to the
ANE-resident variant.
- Updated `Documentation/README.md` index, `Documentation/Models.md` TTS
table, `Documentation/API.md` reference, `Documentation/CLI.md` example.

## Verified end-to-end on M2

Cold model load: 20.6s (`anecompilerservice` first-run ANE compilation).
Warm load: ~300ms.

| Phrase | Synth | Audio | RTFx | ASR roundtrip |
|---|---|---|---|---|
| Hello world | 0.47s | 1.65s | 3.5× | "Hello world." (WER 0%) |
| The quick brown fox… | 0.32s | 3.18s | 9.9× | dropped "The" (WER 11%) |
| She had been waiting… | 0.25s | 2.80s | 11.4× | "Shay" misheard (WER 12.5%) |

Aggregate macro WER 7.9%, micro WER 10.5% — error is ASR-side; TTS audio
is intelligible.
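
The two aggregates differ in how they weight phrases: macro WER averages the per-phrase rates, micro WER pools errors and words. A worked sketch; the reference word counts (2, 9, 8) are assumptions chosen to be consistent with the per-phrase WERs in the table, and the type and function names are hypothetical.

```swift
struct PhraseScore {
    let errors: Int
    let referenceWords: Int
    var wer: Double { Double(errors) / Double(referenceWords) }
}

/// Mean of per-phrase WERs: every phrase counts equally.
func macroWER(_ scores: [PhraseScore]) -> Double {
    precondition(!scores.isEmpty, "need at least one phrase")
    return scores.map(\.wer).reduce(0, +) / Double(scores.count)
}

/// Total errors over total reference words: longer phrases weigh more.
func microWER(_ scores: [PhraseScore]) -> Double {
    let errors = scores.map(\.errors).reduce(0, +)
    let words = scores.map(\.referenceWords).reduce(0, +)
    precondition(words > 0, "need at least one reference word")
    return Double(errors) / Double(words)
}

// Assumed word counts; one word error in each of the last two phrases.
let phrases = [
    PhraseScore(errors: 0, referenceWords: 2),
    PhraseScore(errors: 1, referenceWords: 9),
    PhraseScore(errors: 1, referenceWords: 8),
]
let macro = macroWER(phrases)   // (0 + 1/9 + 1/8) / 3
let micro = microWER(phrases)   // 2 / 19
```

Under those assumed counts the macro value is about 7.9% and the micro value about 10.5%, matching the aggregates reported above.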

Steady-state per-stage timings confirm ANE residency (Albert/PostAlbert
~7-10ms each).

## Devin Review fixes addressed in this PR

- 🔴 **Partial model load wedged the store**
(`KokoroAneModelStore.loadIfNeeded`) — fixed via local `pendingModels`
accumulator + atomic commit, matching `PocketTtsModelStore`.
- 🐛 **G2P models not downloaded standalone** — `G2PModel.loadIfNeeded`
only reads from `~/.cache/fluidaudio/Models/kokoro/` and never
downloads. The kokoroAne download set didn't include G2P, so first-time
`--backend kokoro-ane` users (no prior `kokoro` use) hit a cryptic
`vocabLoadFailed`. Fixed by adding a `g2p-only` sentinel variant to
`getRequiredModelNames(.kokoro, …)` and a new
`KokoroAneResourceDownloader.ensureG2PAssets(directory:)` that runs
before `G2PModel.shared.ensureModelsAvailable()` in
`KokoroAneManager.initialize()`.
- 🟡 **Voice pack off-by-one (false positive)** — verified upstream
`convert-coreml.py:552` uses `voice_pack[len(phonemes) - 1]`, exactly
matching the existing Swift `phonemeCount - 1`. No change.
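
The voice-pack lookup that the last finding confirms can be sketched as a row slice. A minimal sketch assuming the `[510, 256]` flat fp32 layout and the `phonemeCount - 1` row index described in this PR; `VoicePackRow` and `voicePackRow` are hypothetical names, and whether `phonemeCount` includes BOS/EOS is not specified here.

```swift
struct VoicePackRow {
    let timbre: [Float]   // columns 0..<128
    let styleS: [Float]   // columns 128..<256
}

/// Picks the row for an utterance of `phonemeCount` phonemes from a flat,
/// row-major [510, 256] table and splits it into its two halves.
func voicePackRow(_ pack: [Float], phonemeCount: Int) -> VoicePackRow? {
    let rows = 510
    let cols = 256
    guard pack.count == rows * cols,
          phonemeCount >= 1, phonemeCount <= rows else { return nil }
    let start = (phonemeCount - 1) * cols
    let row = Array(pack[start..<(start + cols)])
    return VoicePackRow(timbre: Array(row[0..<128]),
                        styleS: Array(row[128..<256]))
}
```

Returning `nil` for an out-of-range count keeps an over-long input from indexing past the table.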

## Refactor pass

Internal cleanup applied across the module after the initial
implementation landed:
- `KokoroAneSynthesizer`: `rebuild16`/`rebuild32` helpers replace 11
inline `outputShape + outputArray + float16Array` patterns; F0/N shapes
cached once (was fetched 4×). Fixed a mislabeled `stage:` argument in
`outputArray` error reporting.
- `KokoroAneSynthesizer+Conversion`: extracted
`convertF32toF16`/`convertF16toF32`/`genericCopy` private helpers
(eliminates 4× duplicated vImage buffer setup).
- `KokoroAneModelStore`: folded `voicePack(_)` +
`loadVoicePackIfNeeded(_)` into one method; dropped unreachable
post-load guard and dead synthesized-URL throw.
- `KokoroAneVocab` / `KokoroAneError`: added `vocabParseFailed(URL,
String)` so a malformed top-level JSON object reports parse-failure
instead of file-not-found; removed dead NSNumber bridging fallback.
- `KokoroAneConstants`: dropped unused `defaultLanguage`,
`voicePackTimbreSlice`, `voicePackStyleSSlice`. Changed `defaultSpeed`
from `Float16` to `Float` (drops 4 `Float(...)` wraps at default-arg
sites).
- `KokoroAneError`: dropped unused `unsupportedPhoneme(Character)` —
`KokoroAneVocab.encode` silently drops unknown chars per the upstream
Python convention.

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter KokoroAne` — 13 unit tests pass, 5 E2E gated
- [x] With models staged at
`~/.cache/fluidaudio/Models/kokoro-82m-coreml/ANE/`:
- [x] `FLUIDAUDIO_RUN_KOKOROANE_E2E=1 swift test --filter KokoroAne` —
all 18 pass
- [x] `swift run fluidaudiocli tts "Hello world" --backend kokoro-ane
--output /tmp/ane.wav --metrics /tmp/m.json` — produces non-silent audio
+ metrics with WER
- [x] `swift run fluidaudiocli tts-asr-verify --texts-file phrases.txt
--output-json /tmp/r.json` — aggregate WER ≤ 0.20

## Models

`FluidInference/kokoro-82m-coreml` on HuggingFace, under the `ANE/`
subdirectory:
```
ANE/KokoroAlbert.mlmodelc       fp16 + int8pal  (CPU+ANE)
ANE/KokoroPostAlbert.mlmodelc   fp16 + int8pal  (CPU+ANE)
ANE/KokoroAlignment.mlmodelc    fp16 + int8pal  (CPU+ANE)
ANE/KokoroProsody.mlmodelc      fp32             (CPU+GPU)
ANE/KokoroNoise.mlmodelc        fp32             (CPU+GPU)
ANE/KokoroVocoder.mlmodelc      fp16 + int8pal   (CPU+ANE)
ANE/KokoroTail.mlmodelc         fp32 + iSTFT     (CPU+GPU)
ANE/vocab.json                  114 IPA tokens
ANE/af_heart.bin                [510, 256] fp32 voice pack
```

G2P assets (`G2PEncoder.mlmodelc`, `G2PDecoder.mlmodelc`,
`g2p_vocab.json`) are pulled from the same repo's root and cached at
`~/.cache/fluidaudio/Models/kokoro/`, shared with the regular
`KokoroTtsManager` backend.

## License

Upstream (laishere) is MIT — carried forward in the mobius PR's LICENSE
file. Used with the upstream author's permission.
2026-04-27 20:08:49 -04:00
Alex d302273d49 fix(diarizer): convert SpeakerManager to actor, Speaker to struct (#528) (#539)
## Summary

Fixes [#528](https://github.com/FluidInference/FluidAudio/issues/528):
heap corruption (`BUG IN CLIENT OF LIBMALLOC: memory corruption of free
block`) and `Potential Structural Swift Concurrency Issue:
unsafeForcedSync called from Swift Concurrent context` warnings in the
diarizer on iOS 26.4 when `DiarizerModels.download()` +
`SpeakerManager.extractSpeakerEmbedding` are called from an async
context under Swift 6 strict concurrency.

**Root cause**
- `SpeakerManager` used `DispatchQueue.sync(flags: .barrier)` →
`unsafeForcedSync` warning when called from a Swift concurrent context.
- `Speaker` was a reference type with mutable `[Float]` embeddings
→ concurrent COW mutations on the embedding buffers corrupted the
heap.

**Fix**: apply the same actor-conversion pattern used for
`AsrManager` in #419:
- `Speaker`: `final class` → `struct` (Sendable value type)
- `SpeakerManager`: class + `DispatchQueue` → `actor`
- `SpeakerOperations` extension: dropped `queue.sync`
- `DiarizerManager`: async-ified methods
- `SpeakerManager.upsertSpeaker(_:)` + `upsertSpeaker(id:...)`: thread
the speaker's `name` through persistence (previously implicit via
class-reference mutation; now required with struct value semantics).
- CLI (`ProcessCommand`, `DiarizationBenchmark`) and all
speaker/diarizer tests updated to `await` the actor-isolated API.
- `testConcurrentAccess` rewritten from
`DispatchQueue.async`/`DispatchGroup` to `withTaskGroup` for structured
concurrency.
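
The shape of the conversion can be sketched with simplified stand-ins. The type names and `upsertSpeaker(_:)` come from this PR; the stored fields, the `speaker(id:)` and `count` members, and the `registerConcurrently` helper are assumptions for illustration and do not reflect the full FluidAudio API.

```swift
/// Value type: each holder gets its own copy, so no two tasks can
/// mutate the same embedding buffer.
struct Speaker: Sendable {
    let id: String
    var name: String
    var embedding: [Float]
}

/// Actor isolation replaces the DispatchQueue barrier: all access to
/// `speakers` is serialized by the actor itself.
actor SpeakerManager {
    private var speakers: [String: Speaker] = [:]

    func upsertSpeaker(_ speaker: Speaker) {
        // With a struct, the name must be stored explicitly here;
        // there is no shared reference to mutate later.
        speakers[speaker.id] = speaker
    }

    func speaker(id: String) -> Speaker? { speakers[id] }

    var count: Int { speakers.count }
}

/// Structured-concurrency version of a concurrent-access test body.
func registerConcurrently(_ n: Int, in manager: SpeakerManager) async {
    await withTaskGroup(of: Void.self) { group in
        for i in 0..<n {
            group.addTask {
                let speaker = Speaker(id: "spk\(i)", name: "Speaker \(i)", embedding: [Float(i)])
                await manager.upsertSpeaker(speaker)
            }
        }
    }
}
```

Callers now `await` every access, which is why the CLI and tests had to be updated alongside the manager.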

## Test plan

- [x] `swift build`: clean on macOS
- [x] `swift test`: 1435 tests, 0 failures (24 skipped)
- [x] swift-format: no new warnings in touched files
(pre-existing warnings only, unrelated to this change)
- [ ] CI: build + tests + swift-format checks
- [ ] Verify on reporter's iOS 26.4 repro from #528
v0.14.1
2026-04-23 22:13:47 -04:00
Alex 2ea0727541 ASR: fix Parakeet TDT v3 emitting Cyrillic for short Latin-script utterances (#512) (#515)
Fixes #512.

## TL;DR

Parakeet TDT v3 transcribed short Polish utterances like "Wpisz Google
kropka com" as Cyrillic (`Впиш Гугл к ком.`) because the joint decoder's
top-1 pick drifts to Cyrillic tokens under low acoustic confidence. This
PR adds an **opt-in** script filter: when a caller passes `language:
.polish` (or any other language with a declared script), the decoder
rejects top-1 if it's the wrong script and walks top-K to the
highest-probability candidate matching the expected script.

- **Opt-in**: `language:` defaults to `nil` — zero behavior change for
existing callers.
- **No acoustic-model changes** — this is purely a decoder-side
post-processing step over the joint logits.
- **Requires `JointDecisionv3.mlmodelc`** (exposes top-K outputs).
Auto-downloaded from HuggingFace alongside the other v3 files; falls
back to standard argmax when absent.

## Empirical validation — reporter's own audio

Samples pulled via `gdown --folder <link-from-issue-#512-comment>` from
@tajchert's Drive folder. **`JointDecisionv3.mlmodelc` is loaded in both
columns** — this isolates the Swift filter as the mechanism, not a model
swap.

| sample | ground truth | `language: nil` (current) | `language: .polish` (this PR) |
|---|---|---|---|
| pl | Wpisz Google kropka com | **Впиш Гугл к ком.** | Wpis Google.com. |
| pl2 | Wpisz Google kropka com | **Впиш Гугл крокаком.** | Wpish Google, Com. |
| pl3 | Wpisz Google kropka com | **Впишь куглькрабком.** | VP Kugl.com. |
| pl4 | Wpisz Google kropka com | **Впиш гугл к ком.** | Wpish gugl c. |
| pl5 | Wpisz Google kropka com | **Впиш гугл кракаком.** | Wpish Google Croca kom. |
| pl6 | Wpisz Google kropka com | **Впиш, гугл крокаком.** | Wpish, Google, Com. |
| pl_complex | Cały spichlarz jest ze spiżu | Cały spichlarz jest ze spiżu. | Cały spichlarz jest ze spiżu. |

**6/6 short samples flip Cyrillic → Latin.** `pl_complex` was never
broken (long context → high joint confidence → no drift) and is
unchanged.

## Scope & limitations (important — please don't overclaim)

**This PR fixes the *script* the tokens are drawn from. It does NOT fix
per-word acoustic accuracy.**

| | `language: nil` | `language: .polish` |
|---|---|---|
| Script correct (Latin, not Cyrillic) | ✗ | ✓ (6/6) |
| Word spelling matches ground truth | ✗ | ✗ (6 of 7 samples still wrong; all 6 short ones) |

The residual errors — `Wpisz` → `Wpish`/`Wpis`, `kropka` → `Croca` /
dropped — are **Parakeet TDT v3 acoustic weaknesses on short Polish
commands**. No amount of output post-processing can turn `Wpish` into
`Wpisz`; that needs better acoustic modeling, a Polish LM rescorer, or
more training data. Out of scope here.

What users actually get by merging:

- Output is visually Polish (Latin script), not pseudo-Russian — works
with locale-aware post-processing, spell-check, and UI rendering
- Locale-strict WER evaluators no longer penalize Cyrillic-vs-Latin
substitution
- Opt-in; zero risk for callers who don't pass `language:`

What users do **not** get:

- Higher word accuracy on short Polish/Slavic Latin utterances
- Support for languages outside the `Language` enum (Greek, Maltese,
Hungarian, Turkish, Baltic — their characters fit the Latin Unicode
ranges but aren't exposed; easy follow-up)
- A meaningful FLEURS WER delta — see
[Documentation/fleurs-script-filtering-comparison.md](./Documentation/fleurs-script-filtering-comparison.md);
full sentences aren't in the failure regime

## Implementation

### New
- `Sources/FluidAudio/Shared/ScriptDetection.swift` (new, +112)
  - `public enum Language` — 13 Latin (en, es, fr, de, it, pt, ro, pl, cs,
    sk, sl, hr, bs) + 5 Cyrillic (ru, uk, be, bg, sr)
  - `public enum Script { case latin, cyrillic }`
  - `matches(_:script:)` over Unicode ranges: ASCII (0x20–0x7F), Latin-1
    (0xA0–0xFF), Latin Extended-A (0x100–0x17F), **Latin Extended-B
    (0x180–0x24F — Romanian ș/ț)**, **Latin Extended Additional
    (0x1E00–0x1EFF — Vietnamese)**, Cyrillic (0x400–0x4FF). Strips
    SentencePiece boundary marker U+2581 before checking.
  - `filterTopK(topKIds:topKLogits:vocabulary:preferredScript:) ->
    (tokenId, probability)?` — returns the highest-probability top-K
    candidate matching the target script; probability via **softmax over the
    top-K subset** with the max-logit stability trick; guarded against top-K
    array length mismatch.
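
The range check behind `matches(_:script:)` can be sketched as follows. This is an illustration built from the ranges listed above, not the contents of `ScriptDetection.swift`; in particular, treating ASCII as Latin-only and accepting a token that is empty after marker stripping are assumptions.

```swift
enum Script { case latin, cyrillic }

/// True when every scalar of `token`, after dropping the SentencePiece
/// boundary marker U+2581, falls inside the ranges allowed for `script`.
func matches(_ token: String, script: Script) -> Bool {
    let allowed: [ClosedRange<UInt32>]
    switch script {
    case .latin:
        allowed = [
            0x20...0x7F,      // ASCII
            0xA0...0xFF,      // Latin-1
            0x100...0x17F,    // Latin Extended-A
            0x180...0x24F,    // Latin Extended-B (Romanian ș/ț)
            0x1E00...0x1EFF,  // Latin Extended Additional
        ]
    case .cyrillic:
        allowed = [0x400...0x4FF]   // assumption: Cyrillic block only
    }
    return token.unicodeScalars
        .filter { $0.value != 0x2581 }
        // An empty token passes vacuously (assumed behavior).
        .allSatisfy { scalar in allowed.contains { $0.contains(scalar.value) } }
}
```

Checking scalars rather than `Character` values keeps the comparison on raw code points, which is what the listed ranges describe.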

### Changed
- `TdtJointDecision` — optional `topKIds` / `topKLogits` fields
(populated by JointDecisionv3 only)
- `TdtDecoderV3` — script filter runs **only when top-1 is already wrong
script**; both decode sites feed `filtered.probability` (a real [0,1])
into `TdtDurationMapping.clampProbability`, not raw logits
- `AsrManager.transcribe(...)` — `language: Language? = nil` plumbed
through all three overloads: `[Float]`, `URL`, `AVAudioPCMBuffer`
- `AsrModels` + `ModelNames` — `requiredModelsV3` set includes
`JointDecisionv3.mlmodelc` so the download utility fetches it on fresh
installs and also backfills it for existing users on next `.v3` load
- CLI — `fluidaudiocli transcribe <file> --language
{en|pl|cs|sk|sl|hr|bs|ro|es|fr|de|it|pt|ru|uk|be|bg|sr}`

### How to try it

```bash
swift run -c release fluidaudiocli transcribe sample.wav --language pl
```

## Model dependency

`JointDecisionv3.mlmodelc` must be present in
`FluidInference/parakeet-tdt-0.6b-v3-coreml` on HuggingFace. It exposes
`top_k_ids` / `top_k_logits` outputs (K=64 in our export) alongside the
standard argmax. When absent, `AsrModels` falls back to
`JointDecision.mlmodelc` and the script filter becomes a no-op —
backward compatible.

**Cache-upgrade verified**: removed `JointDecisionv3.mlmodelc` from a
populated cache, re-ran `--language pl`; the file was auto-fetched and
Polish output was Latin. Existing users pick up the fix on next `.v3`
load without manual intervention.

## Review notes / risky bits

- **Softmax over top-K subset, not the full vocab** — probabilities
won't exactly match a true full-softmax, but K=64 captures nearly all of
the probability mass when the model is anywhere near confident. If you
prefer, we can expose the raw top-K logits to callers and let them
compute confidence however they want.
- **Top-1 escape hatch**: filter is only triggered when top-1 fails
`matches(_, script:)`. When top-1 is already correct, nothing is changed
— so we can't regress the common case.
- **Length-mismatch guard** in `filterTopK` uses `min(topKIds.count,
topKLogits.count)`. If CoreML output arrays ever diverge, we iterate the
common prefix instead of crashing.
- **Latin Extended-B (0x0180–0x024F)** was added specifically so
Romanian ș/ț aren't rejected as non-Latin. Latin Extended Additional
(0x1E00–0x1EFF) was added for free — helps Vietnamese should anyone want
it later.
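
For reviewers who want to see the first three notes in one place, here is a minimal sketch of a top-K softmax with the max-logit trick and the common-prefix guard. It is not the shipped `filterTopK`: the `FilteredCandidate` type, the `filterTopKSketch` name, and the `isAllowed` closure (standing in for `matches(_, script:)`) are assumptions made for illustration.

```swift
import Foundation

/// Hypothetical result type; the real return type of `filterTopK` is not shown in this PR text.
struct FilteredCandidate {
    let tokenId: Int
    let probability: Float
}

/// Softmax over the top-K subset, then pick the most probable candidate accepted by `isAllowed`.
func filterTopKSketch(
    topKIds: [Int],
    topKLogits: [Float],
    isAllowed: (Int) -> Bool
) -> FilteredCandidate? {
    // Length-mismatch guard: only the common prefix of both arrays is used.
    let count = min(topKIds.count, topKLogits.count)
    guard count > 0 else { return nil }

    let logits = Array(topKLogits.prefix(count))
    // Max-logit trick: subtracting the maximum keeps exp() from overflowing.
    guard let maxLogit = logits.max() else { return nil }
    let exps = logits.map { exp($0 - maxLogit) }
    let denominator = exps.reduce(0, +)

    var best: FilteredCandidate?
    for index in 0..<count where isAllowed(topKIds[index]) {
        let probability = exps[index] / denominator
        if best == nil || probability > best!.probability {
            best = FilteredCandidate(tokenId: topKIds[index], probability: probability)
        }
    }
    return best
}
```

Because the denominator only sums over the K retained logits, the returned value is a probability within the subset, which is the approximation the first note describes.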

## Tests

- `ScriptDetectionTests` — **37 tests**: Unicode range coverage (Latin-1
/ Extended-A / Extended-B / Extended Additional / Cyrillic),
SentencePiece boundary-marker stripping, `filterTopK` happy path,
length-mismatch guard, probability-range invariant,
Czech/Slovak/Slovenian/Croatian/Romanian token coverage, cross-script
rejection
- Build clean; `swift format lint` clean on all touched files
- A/B end-to-end run against reporter's actual Polish audio (table
above)
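
The range coverage those tests exercise can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: `ScriptSketch` and `matchesSketch` are hypothetical names, only the Extended-B and Extended Additional ranges are quoted from this PR (the other ranges are the standard Unicode block boundaries), and treating non-alphabetic scalars as script-neutral is a guess at the policy.

```swift
/// Hypothetical stand-in for the PR's script type, limited to the two scripts named in the tests.
enum ScriptSketch {
    case latin, cyrillic

    /// Code-point ranges treated as belonging to the script.
    var ranges: [ClosedRange<UInt32>] {
        switch self {
        case .latin:
            return [
                0x0041...0x005A,  // Basic Latin uppercase
                0x0061...0x007A,  // Basic Latin lowercase
                0x00C0...0x00FF,  // Latin-1 Supplement
                0x0100...0x017F,  // Latin Extended-A
                0x0180...0x024F,  // Latin Extended-B (covers Romanian ș/ț)
                0x1E00...0x1EFF,  // Latin Extended Additional
            ]
        case .cyrillic:
            return [0x0400...0x04FF]  // Cyrillic
        }
    }
}

/// True when every alphabetic scalar in the token falls inside the script's ranges.
func matchesSketch(_ token: String, script: ScriptSketch) -> Bool {
    // Strip the SentencePiece word-boundary marker (U+2581) before checking.
    let scalars = token.unicodeScalars.filter { $0.value != 0x2581 }
    // Assumption: digits and punctuation are neutral and never cause a rejection.
    let letters = scalars.filter { $0.properties.isAlphabetic }
    return letters.allSatisfy { scalar in
        script.ranges.contains { $0.contains(scalar.value) }
    }
}
```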

## Checklist

- [x] Builds clean (`swift build`, `swift build -c release`)
- [x] `swift format lint` clean on touched files
- [x] `ScriptDetectionTests` 37/37 pass
- [x] A/B reproduction on #512 reporter's audio
- [x] Cache-upgrade path verified (JointDecisionv3 auto-fetched on
existing caches)
- [x] CLI accepts all 18 language codes end-to-end
- [ ] CI green

## Follow-ups (not blocking)

- Expose more Latin languages in the enum (Hungarian, Turkish, Baltic,
Maltese) — all character ranges already supported, just need enum cases
- Add `Script.greek` for `el_gr` (separate Unicode range)
- Short-utterance benchmark dataset (FLEURS is the wrong tool — it's all
long sentences where drift doesn't happen)
- Optional: publish a Polish LM rescorer to address the underlying
acoustic-accuracy issue the script filter cannot fix

---------
2026-04-23 17:43:09 -04:00
Alex cc4e712643 feat(asr/cohere): ANE-friendly static-shape decoder (v2) (#537)
## Summary

Adds support for a new Cohere decoder variant —
`cohere_decoder_cache_external_v2` — with **fully static shapes** so
CoreML can dispatch the decoder to the Apple Neural Engine.

- `ModelNames.CohereTranscribe`: adds v2 constants, flips default
`requiredModels` to v2, keeps legacy set as `requiredModelsLegacy`.
- `CoherePipeline.loadModels`: prefers v2 in `decoderDir`, falls back to
v1, clear error if neither present.
- The decode loop already auto-detects the variant from the `attention_mask`
shape (shipped around #487), so nothing changes runtime-side.
- CLI help lists both decoder filenames.
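
The lookup order described for `loadModels` (v2 first, then v1, then a clear error) can be sketched like this. Only the two `.mlmodelc` filenames come from this PR; `CohereDecoderVariant`, `DecoderNotFoundError`, and `resolveDecoder` are hypothetical names, and the real `CoherePipeline.loadModels` may be structured differently.

```swift
import Foundation

/// Hypothetical tag for which decoder was found.
enum CohereDecoderVariant { case staticV2, dynamicV1 }

/// Hypothetical error type; the real error surfaced by the library is not shown here.
struct DecoderNotFoundError: Error, CustomStringConvertible {
    let directory: URL
    var description: String {
        "No Cohere decoder (.mlmodelc, v2 or v1) found in \(directory.path)"
    }
}

/// Prefer the static-shape v2 decoder, fall back to v1, and throw if neither exists.
func resolveDecoder(in decoderDir: URL) throws -> (variant: CohereDecoderVariant, url: URL) {
    let candidates: [(CohereDecoderVariant, String)] = [
        (.staticV2, "cohere_decoder_cache_external_v2.mlmodelc"),
        (.dynamicV1, "cohere_decoder_cache_external.mlmodelc"),
    ]
    for (variant, name) in candidates {
        let url = decoderDir.appendingPathComponent(name)
        if FileManager.default.fileExists(atPath: url.path) {
            return (variant: variant, url: url)
        }
    }
    throw DecoderNotFoundError(directory: decoderDir)
}
```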

v2 artifacts are published at
[`FluidInference/cohere-transcribe-03-2026-coreml/q8`](https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml)
(`cohere_decoder_cache_external_v2.{mlmodelc,mlpackage}`). The existing
v1 decoder remains supported as a fallback.

## Why

The v1 (`RangeDim(1, 108)`) decoder has a dynamic `attention_mask`
length, which blocks ANE dispatch — `computeUnits = .all` silently falls
back to CPU/GPU. v2 fixes the mask at `[1, 1, 1, 108]` and sources the
decode position from `position_id`, letting the full decoder land on
ANE.

Measured with `fluidaudiocli cohere-transcribe` on the same audio (15
tokens, same q8 encoder, 3 warm runs each):

| Decoder | Config | Median decoder time |
|---|---|---:|
| **Static (v2)** | `.all` (ANE) | **2.58 s** |
| Dynamic (v1) | `.all` | 4.13 s |
| Static (v2) | `--cpu-gpu` | 10.02 s |
| Dynamic (v1) | `--cpu-gpu` | 4.32 s |

~1.6× faster decoder end-to-end. The v1 `.all` ≈ v1 `--cpu-gpu` rows
confirm RangeDim blocks ANE. v2 attends over the full 108 slots every
step, so on pure CPU/GPU it's slower — the win is entirely from ANE
residency. Transcripts are byte-identical across configs.
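
To make the static-shape idea concrete, here is a rough sketch of a fixed-length mask driven by the decode position. It is an assumption-heavy illustration: the additive convention (0 for visible slots, a large negative value for masked ones), the `-1e4` fill value, and the `staticAttentionMask` name are not taken from the export, and a plain `[Float]` stands in for the `[1, 1, 1, 108]` tensor.

```swift
/// Build a fixed-length additive attention mask for one decode step.
/// Assumption: slots 0...position are visible (0), later slots are masked out.
func staticAttentionMask(
    position: Int,
    window: Int = 108,
    maskedValue: Float = -1e4
) -> [Float] {
    precondition((0..<window).contains(position), "position must fall inside the cache window")
    // The array length never changes; only the contents depend on the position.
    return (0..<window).map { slot in slot <= position ? 0 : maskedValue }
}
```

The point of the sketch is that the shape stays constant across steps, which is the property the v2 decoder relies on; the cost is that every step touches all 108 slots, matching the slower `--cpu-gpu` row.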

## Test plan

- [x] Smoke test v2-preferred: directory containing only
`cohere_decoder_cache_external_v2.mlmodelc` transcribes
`english_original.wav` correctly.
- [x] Smoke test v1 fallback: directory containing only
`cohere_decoder_cache_external.mlmodelc` transcribes correctly.
- [x] `swift build -c release --product fluidaudiocli` clean.
- [x] `swift format` clean on changed files.
- [ ] Reviewer: run `fluidaudiocli cohere-transcribe <audio> --model-dir
<q8 dir with v2>` to reproduce the ANE speedup.

## Related

- v2 export script (mobius): `export-decoder-cache-external-static.py`
(uncommitted, to land in a follow-up mobius PR).
- HF repo: `FluidInference/cohere-transcribe-03-2026-coreml` now ships
both decoders under `q8/`.
2026-04-23 17:42:34 -04:00
Sachin Desai bd5ba7e1b7 fix abbreviation handling for kokoro (#538)
### Why is this change needed?
This change fixes the following issues:

- Sort the common abbreviations longest key first so that, e.g., "etc."
is matched before "etc"; matching the shorter key first leaves a stray
".".
- The trailing "\b" fails when the abbreviation ends in a non-word
character: in "Dr." followed by a space, both sides of that position are
non-word characters, so there is no word boundary to match.
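
Both issues can be reproduced with a small sketch. This is one possible way to address them, not necessarily what the Kokoro code does: `expandAbbreviations` and the sample table are hypothetical, and replacing the trailing `\b` with a `(?!\w)` lookahead is an assumption about the fix.

```swift
import Foundation

/// Expand abbreviations, trying longer keys first and using a lookahead
/// instead of a trailing \b so keys ending in "." still match.
func expandAbbreviations(in text: String, using table: [String: String]) -> String {
    var result = text
    // Longest keys first, so "etc." is tried before "etc".
    let keys = table.keys.sorted { ($0.count, $0) > ($1.count, $1) }
    for key in keys {
        guard let replacement = table[key] else { continue }
        let escaped = NSRegularExpression.escapedPattern(for: key)
        // The leading \b works because keys start with a word character.
        // The trailing check only requires that no word character follows.
        let pattern = "\\b" + escaped + "(?!\\w)"
        guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { continue }
        let range = NSRange(result.startIndex..., in: result)
        result = regex.stringByReplacingMatches(
            in: result,
            options: [],
            range: range,
            withTemplate: NSRegularExpression.escapedTemplate(for: replacement)
        )
    }
    return result
}
```

With a table containing both "etc." and "etc", the longer key consumes the period before the shorter key gets a chance to match.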

Co-authored-by: Sachin Desai <sdesai@salesforce.com>
2026-04-23 17:40:26 -04:00
Alex b10bdcb51d feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder) (#487)
## Summary

Adds Cohere Transcribe ASR for 14 languages, shipped as an INT8 encoder
+ FP16 cache-external decoder hybrid (`CoherePipeline`). One CLI for
single-file transcription, one CLI for dataset benchmarking (FLEURS and
LibriSpeech).

## Languages

English, French, German, Spanish, Italian, Portuguese, Dutch, Polish,
Greek, Arabic, Japanese, Chinese (Simplified), Korean, Vietnamese.

## What's added

### Library (`Sources/FluidAudio/ASR/Cohere/`)

- **`CoherePipeline`** — encoder + cache-external decoder runner.
  Allocates the K/V cache host-side (no CoreML State API; iOS 17+), applies the
  additive cross-attention mask, and detokenizes via SentencePiece byte
  fallback so CJK comes out as real characters. Accepts separate
  `encoderDir` / `decoderDir` to support the q8/f16 split.
- **`CohereAsrConfig`** — per-language prompt sequences and token IDs;
shared 35 s / 3500-frame audio window and 108-token decoder cache window
constants. The 35 s cap traces directly to upstream `max_audio_clip_s:
35`.
- **`CohereMelSpectrogram`** — 128-mel front-end matching the reference
  model (preemph, Slaney mel, CMVN).

### CLI (`Sources/FluidAudioCLI/Commands/ASR/Cohere/`)

- `fluidaudiocli cohere-transcribe <audio> --language <lang>` —
  single-file transcription. Accepts either `--model-dir` (single dir with both
  encoder and decoder) or `--encoder-dir` + `--decoder-dir` for the q8/f16
  split.
- `fluidaudiocli cohere-benchmark` — dataset benchmark with
  `--dataset fleurs|librispeech`, `--subset` for LibriSpeech splits,
  `--languages` for FLEURS codes, `--auto-download`, and
  `--checkpoint-every N` (default 100) so long runs persist partial
  results and survive mid-run crashes.

### `ModelNames.swift`

- New `Repo.cohereTranscribeCoreml` →
  `FluidInference/cohere-transcribe-03-2026-coreml/q8`.
- New `ModelNames.CohereTranscribe` enum with `encoder`,
`decoderCacheExternal`, `vocab` and the corresponding `.mlmodelc` paths.

### Documentation
- `Documentation/ASR/Cohere.md` — architecture, API, CLI, LibriSpeech +
  FLEURS results, upstream config provenance (`max_audio_clip_s`,
  `overlap_chunk_second`), comparison vs Cohere's Figure 4 reference
  numbers, caveats.

### FLEURS coverage
- Extends `FleursBenchmark.supportedLanguages` with the 6 non-European
  Cohere languages (`pt_br`, `ar_eg`, `ja_jp`, `cmn_hans_cn`, `ko_kr`,
  `vi_vn`).

## LibriSpeech test-clean (Apple M2 2022, Tahoe 26.0)

Full split, all 2,620 utterances, single-chunk.

| Subset | Samples | WER | CER | RTFx (per-file mean) | RTFx (total audio/compute) |
|---|---:|---:|---:|---:|---:|
| test-clean | 2,620 | **1.77%** | **0.60%** | 2.04× | 1.72× |

5h 24m audio processed in 3h 09m compute (3h 12m wall time including
one-time ~6 min ANE cold-start compile). Competitive with Parakeet TDT
0.6B v3 (~1.7%) and Whisper large-v3 (~1.8%).
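
The two RTFx columns differ only in how per-file timings are aggregated. A minimal sketch of both, assuming the column names mean what they say (`FileTiming` and the function names are hypothetical, not the benchmark's actual types):

```swift
/// Hypothetical per-file measurement.
struct FileTiming {
    let audioSeconds: Double
    let computeSeconds: Double
}

/// Mean of the per-file ratios; short, fast files weigh as much as long ones.
func rtfxPerFileMean(_ timings: [FileTiming]) -> Double {
    guard !timings.isEmpty else { return 0 }
    let ratios = timings.map { $0.audioSeconds / $0.computeSeconds }
    return ratios.reduce(0, +) / Double(ratios.count)
}

/// Total audio divided by total compute; dominated by where the time was actually spent.
func rtfxTotal(_ timings: [FileTiming]) -> Double {
    let audio = timings.reduce(0) { $0 + $1.audioSeconds }
    let compute = timings.reduce(0) { $0 + $1.computeSeconds }
    return compute > 0 ? audio / compute : 0
}
```

Plugging the totals above into the second definition gives 19,440 s of audio over 11,340 s of compute, roughly 1.71×, in line with the 1.72× in the table given that the durations are rounded to the minute.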

## FLEURS results (full splits, single-chunk)

M4 Pro / Tahoe 26.0, 9,911 samples total.

| FLEURS code | Language | Samples | WER | CER | RTFx |
|---|---|---:|---:|---:|---:|
| en_us | English | 647 | 5.63% | 3.19% | 2.49× |
| fr_fr | French | 676 | 6.22% | 3.11% | 2.21× |
| de_de | German | 862 | 5.84% | 2.83% | 1.98× |
| es_419 | Spanish (LATAM) | 908 | 4.53% | 2.40% | 1.34× |
| it_it | Italian | 865 | **4.03%** | 2.04% | **3.15×** |
| pt_br | Portuguese (BR) | 919 | 6.44% | 3.38% | 2.79× |
| nl_nl | Dutch | 364 | 8.07% | 4.14% | 2.04× |
| pl_pl | Polish | 758 | 7.49% | 3.23% | 1.98× |
| el_gr | Greek | 650 | 11.50% | 5.45% | 2.00× |
| ar_eg | Arabic (EG) | 428 | 18.46% | 6.71% | 2.06× |
| ja_jp | Japanese | 650 | 60.13%† | 6.25% | 2.23× |
| cmn_hans_cn | Mandarin | 945 | 98.52%† | 12.01% | 1.85× |
| ko_kr | Korean | 382 | 16.39% | 6.67% | 1.84× |
| vi_vn | Vietnamese | 857 | 9.55% | 6.87% | 1.55× |

†Japanese and Mandarin are written without word boundaries, so WER on the
raw hypothesis is a tokenization artifact; **CER is the real accuracy
metric**. Cohere's own Figure 4 uses CER for zh/ja/ko for the same reason.
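
A small sketch shows why the footnote holds. It assumes plain Levenshtein distance with whitespace-split words for WER and per-character comparison for CER; the benchmark's actual normalization is not shown here, and the function names are hypothetical.

```swift
/// Levenshtein distance over any sequence of comparable items.
func editDistance<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            current[j] = min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
        }
        previous = current
    }
    return previous[b.count]
}

/// WER: edit distance over whitespace-separated words.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map(String.init)
    let hyp = hypothesis.split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return 0 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}

/// CER: edit distance over characters.
func characterErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = Array(reference)
    let hyp = Array(hypothesis)
    guard !ref.isEmpty else { return 0 }
    return Double(editDistance(ref, hyp)) / Double(ref.count)
}
```

For an unsegmented sentence the whole string is a single "word", so any character-level mistake counts as a 100% word error while CER stays proportional to the actual damage.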

## Usage

```swift
let models = try await CoherePipeline.loadModels(
    encoderDir: q8Dir,
    decoderDir: q8Dir,
    vocabDir: q8Dir
)
let pipeline = CoherePipeline()
let result = try await pipeline.transcribe(
    audio: samples,        // 16 kHz mono Float32, up to 35 s
    models: models,
    language: .english
)
```

```bash
# Single file
swift run -c release fluidaudiocli cohere-transcribe audio.wav --language en

# LibriSpeech
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset librispeech --subset test-clean \
    --model-dir /path/to/q8 --auto-download

# FLEURS
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset fleurs --languages en_us,fr_fr --auto-download
```

## HuggingFace

- INT8 hybrid (shipped):
  https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml
  (subdir `q8/`)
- Upstream model:
https://huggingface.co/CohereLabs/cohere-transcribe-03-2026

## Notes

- **35 s single-chunk limit** is baked into the upstream model
  (`max_audio_clip_s: 35` in `cohere-pytorch/config.json`). Upstream
  Python also supports >35 s via 5 s-overlap chunking
  (`overlap_chunk_second: 5`); this port does not implement that wrapper
  yet and skips longer utterances with a warning.
- **Cache-external decoder stays FP16**: INT8 decoder quantization
  regresses quality significantly in testing and is not shipped.
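
For context on what the missing wrapper would need to compute, here is a sketch of overlapping window boundaries using the two upstream values quoted above (35 s window, 5 s overlap). It is purely illustrative: `chunkWindows` is a hypothetical helper, the upstream Python may split and merge differently, and nothing like it exists in this port yet.

```swift
/// Split a long recording into fixed windows that overlap by `overlapSeconds`.
func chunkWindows(
    totalSeconds: Double,
    windowSeconds: Double = 35,
    overlapSeconds: Double = 5
) -> [ClosedRange<Double>] {
    precondition(overlapSeconds >= 0 && windowSeconds > overlapSeconds)
    guard totalSeconds > 0 else { return [] }
    var windows: [ClosedRange<Double>] = []
    // Each window starts this far after the previous one.
    let stride = windowSeconds - overlapSeconds
    var start = 0.0
    while true {
        let end = min(start + windowSeconds, totalSeconds)
        windows.append(start...end)
        if end >= totalSeconds { break }
        start += stride
    }
    return windows
}
```

Merging the transcripts of overlapping windows is the harder part and is deliberately left out of the sketch.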

## Test plan

- [x] Library + CLI release build clean
- [x] Single-file transcription via `cohere-transcribe`
- [x] FLEURS en_us sanity (5.63% WER)
- [x] Full 14-language FLEURS benchmark (9,911 samples)
- [x] Full LibriSpeech test-clean benchmark (2,620 samples, WER 1.77%)
- [x] CJK CER validated (word-boundary-agnostic metric for ja/zh)
- [x] Checkpoint-every survives kill mid-run
- [x] `printFinalSummary` no longer aborts on macOS 26
v0.14.0
2026-04-23 10:59:07 -04:00
Alex 1fdae40660 docs: add git worktree guidance for multi-agent workflow (#535)
## Summary

- This repo is worked on by multiple coding agents (Claude, Codex, Devin,
  etc.) in parallel. Switching branches inside a single shared working tree
  drags unrelated WIP from whoever else is active into your build, surfaces
  Swift's "input file ... was modified during the build" errors, and makes
  it easy to accidentally sweep other agents' files into a commit.
- Document `git worktree` as the convention: shared `.git`, isolated
  working tree and `.build/`, one tree per active task.

## Test plan

- [x] This PR branch was itself created via `git worktree add` to dogfood
      the pattern, with zero interference with the concurrent CosyVoice3 WIP
      in the primary checkout.
- [ ] Reviewer: confirm the command snippet matches your preferred naming
      convention for the worktree directory.
v0.13.7
2026-04-21 12:23:44 -04:00