mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
847a985ae4
## Summary

Fixes #592: PocketTTS voice cloning produced garbled audio on macOS after the `pocket-tts==2.0.0` upgrade. v2 (pre-baked KV snapshot) voices were unaffected; only the v1 path (user audio → `mimi_encoder` → `cond_step` prefill) was broken. Two compounding bugs:

### RCA 1: stale `mimi_encoder`

The `mimi_encoder.mlpackage` originally published on HF was traced against pre-2.0.0 `pocket-tts` (torch 2.9.1, Float32, scalar output) and no longer matched the runtime `cond_step` contract. Re-traced as `mimi_encoderv2` from `pocket-tts==2.0.0` (torch 2.11.0, Float16, fixed `[1, 1, 240000]` → `[1, 125, 1024]`). Both files now live at the HF repo root (legacy file kept for backwards compat); `ModelNames.mimiEncoder` points at the new one.

### RCA 2: missing `bos_before_voice` prepend

`pocket-tts` 2.0.0 added a learned 1024-d `flow_lm.bos_before_voice` buffer that has to be prepended to the audio_prompt during `cond_step` prefill. Without it the FlowLM sees a different token distribution than it saw during training. Extracted per-language as `constants_bin/bos_before_voice.bin` (4096 bytes each, 10 packs × distinct SHA-256s, all verified byte-for-byte against the HF upload).

### Swift-side changes

- `PocketTtsVoiceCloner` pads/truncates input to the encoder's fixed 240 000 samples (10 s @ 24 kHz, non-flexible shape) and trims output frames to real-audio duration so zero-padded frames don't bleed into the prompt.
- `PocketTtsSynthesizer+KVCache.prefillKVCache` prepends `bos_before_voice` ahead of the audio_prompt on the v1 path. v2 snapshots skip this because their pre-baked KV cache already encodes the prefix.
- `PocketTtsResourceDownloader.ensureModels` backfills `bos_before_voice.bin` for caches that predate this fix (per-file fetch) instead of forcing a full language-pack re-download.

Conversion artifacts and per-language SHA-256s are documented in `mobius/models/tts/pocket_tts/coreml/TRIALS.md` (Phase 7).
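
The pad/truncate, trim, and prepend steps described under the Swift-side changes can be sketched roughly as follows. This is an illustration only, not the code in `PocketTtsVoiceCloner` or `prefillKVCache`: the function names and constants are hypothetical, plain `[Float]` arrays stand in for the Core ML tensors, and rounding the real-frame count up is an assumption. The only values taken from the description above are the `[1, 1, 240000]` → `[1, 125, 1024]` shapes, which imply 1920 samples per output frame.

```swift
import Foundation

// Hypothetical constants derived from the encoder shapes quoted above:
// 240 000 input samples map to 125 output frames.
let encoderInputSamples = 240_000
let encoderOutputFrames = 125
let samplesPerFrame = encoderInputSamples / encoderOutputFrames  // 1920

/// Zero-pads or truncates so the buffer holds exactly `encoderInputSamples` samples.
func fitToEncoderInput(_ samples: [Float]) -> [Float] {
    if samples.count >= encoderInputSamples {
        return Array(samples.prefix(encoderInputSamples))
    }
    let padding = [Float](repeating: 0, count: encoderInputSamples - samples.count)
    return samples + padding
}

/// Number of output frames that cover real (unpadded) audio.
/// Rounding up is an assumption; the actual implementation may round differently.
func realFrameCount(forSampleCount count: Int) -> Int {
    let clamped = min(max(count, 0), encoderInputSamples)
    return (clamped + samplesPerFrame - 1) / samplesPerFrame
}

/// v1 path only: drops padded frames, then puts the BOS embedding first.
func buildVoicePrompt(bos: [Float], frames: [[Float]], realFrames: Int) -> [[Float]] {
    return [bos] + Array(frames.prefix(realFrames))
}
```

Under these assumptions, a 5 s clip (120 000 samples) is padded to 240 000 samples, and only the frames that overlap real audio plus one leading BOS row end up in the prompt. The v2 snapshot path would skip `buildVoicePrompt` entirely.
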
## Test plan

- [x] `swift build` clean
- [x] `swift test --filter PocketTtsConstantsLoaderTests`: 3 new tests pass
- [x] `swift format` applied
- [x] E2E v1 cloning: `am_michael.wav` (7.5 s) → 3.92 s @ 24 kHz Int16, intelligible voice match. KV cache prefill lands at position 113 = 1 BOS + 95 voice + 17 text tokens (matches pocket-tts 2.0.0 layout).
- [x] v2 snapshot regression check: default `alba.safetensors` voice still synthesizes correctly (prefill position 140, no `bos_before_voice` involvement)
- [x] Backfill path: deleted `bos_before_voice.bin` from cache, re-ran cloning; file auto-fetched from HF (4096 bytes) before synthesis
- [x] All 10 language packs verified on HF: SHA-256 match between local extraction and uploaded `v2/<lang>/constants_bin/bos_before_voice.bin`
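
As a rough illustration of two checks in the test plan, the sketch below decodes a 4096-byte blob into 1024 values and reproduces the v1 prefill arithmetic (1 + 95 + 17 = 113). It is not the repository's constants loader: `decodeBosBeforeVoice`, `BosLoadError`, and `v1PrefillPosition` are hypothetical names. It also assumes the file is raw Float32 in native (little-endian) byte order, which is consistent with 4096 bytes for a 1024-d buffer but is not stated explicitly above.

```swift
import Foundation

// Hypothetical error type for this sketch.
enum BosLoadError: Error {
    case unexpectedSize(Int)
}

/// Decodes a raw Float32 blob into `dimension` values, rejecting any other size.
/// Assumes native byte order (little-endian on Apple platforms).
func decodeBosBeforeVoice(_ data: Data, dimension: Int = 1024) throws -> [Float] {
    let expectedBytes = dimension * MemoryLayout<Float>.size  // 4096 for 1024-d
    guard data.count == expectedBytes else {
        throw BosLoadError.unexpectedSize(data.count)
    }
    var values = [Float](repeating: 0, count: dimension)
    // Copy bytes rather than rebinding memory, so alignment of `data` does not matter.
    _ = values.withUnsafeMutableBufferPointer { data.copyBytes(to: $0) }
    return values
}

/// Expected KV cache position after a v1 prefill: one BOS row, then voice frames, then text tokens.
func v1PrefillPosition(voiceFrames: Int, textTokens: Int) -> Int {
    return 1 + voiceFrames + textTokens
}
```

With the numbers from the E2E run, `v1PrefillPosition(voiceFrames: 95, textTokens: 17)` gives 113. A size check like the one in `decodeBosBeforeVoice` is one way a loader could fail fast on a truncated download.
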