Alex 847a985ae4 fix(tts/pocket-tts): repair v1 voice cloning for pocket-tts 2.0.0 (#592) (#601)
## Summary

Fixes #592 — PocketTTS voice cloning produced garbled audio on macOS
after the `pocket-tts==2.0.0` upgrade. v2 (pre-baked KV snapshot) voices
were unaffected — only the v1 path (user audio → `mimi_encoder` →
`cond_step` prefill) was broken.

Two compounding bugs:

### RCA 1 — stale `mimi_encoder`
The `mimi_encoder.mlpackage` originally published on HF was traced
against pre-2.0.0 `pocket-tts` (torch 2.9.1, Float32, scalar output) and
no longer matched the runtime `cond_step` contract. Re-traced as
`mimi_encoderv2` from `pocket-tts==2.0.0` (torch 2.11.0, Float16, fixed
`[1, 1, 240000]` → `[1, 125, 1024]`). Both files now live at the HF repo
root (legacy file kept for backwards compat); `ModelNames.mimiEncoder`
points at the new one.
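
The fixed shapes above imply a constant frame rate, which is handy when sanity-checking a re-traced encoder. A minimal sketch of that arithmetic follows; `EncoderContract` is a hypothetical type used only for illustration and is not part of the codebase. The 24 kHz sample rate is taken from the Swift-side notes in this PR.

```swift
// Hypothetical description of the re-traced encoder's fixed I/O contract.
struct EncoderContract {
    let inputShape: [Int]   // [batch, channels, samples]
    let outputShape: [Int]  // [batch, frames, embeddingDim]
    let sampleRate: Int     // Hz

    var samples: Int { inputShape[2] }
    var frames: Int { outputShape[1] }
    // Samples consumed per output frame (integer division is exact here).
    var samplesPerFrame: Int { samples / frames }
    var framesPerSecond: Double { Double(sampleRate) / Double(samplesPerFrame) }
}

let contract = EncoderContract(
    inputShape: [1, 1, 240_000],
    outputShape: [1, 125, 1024],
    sampleRate: 24_000
)
print(contract.samplesPerFrame)  // 1920
print(contract.framesPerSecond)  // 12.5
```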

### RCA 2 — missing `bos_before_voice` prepend
`pocket-tts` 2.0.0 added a learned 1024-d `flow_lm.bos_before_voice`
buffer that has to be prepended to the audio_prompt during `cond_step`
prefill. Without it, the FlowLM sees a different token distribution than it
saw in training. Extracted per-language as `constants_bin/bos_before_voice.bin`
(4096 bytes each; the 10 language packs have distinct SHA-256s, all verified
byte-for-byte against the HF upload).
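
As a rough illustration of the prepend, the sketch below decodes a 4096-byte buffer into a 1024-element vector and places it ahead of the voice frames. It assumes the file stores raw little-endian Float32 values (4096 / 1024 = 4 bytes per element); the byte order and the names `decodeBosBeforeVoice` and `prependBos` are assumptions for illustration, not the actual loader.

```swift
import Foundation

enum BosDecodeError: Error {
    case unexpectedByteCount(Int)
}

/// Decodes raw bytes into `dim` Float32 values.
/// Assumption: little-endian Float32 with no header.
func decodeBosBeforeVoice(_ data: Data, dim: Int = 1024) throws -> [Float] {
    guard data.count == dim * 4 else {
        throw BosDecodeError.unexpectedByteCount(data.count)
    }
    let bytes = [UInt8](data)
    return (0..<dim).map { i -> Float in
        let o = i * 4
        let b0 = UInt32(bytes[o])
        let b1 = UInt32(bytes[o + 1]) << 8
        let b2 = UInt32(bytes[o + 2]) << 16
        let b3 = UInt32(bytes[o + 3]) << 24
        return Float(bitPattern: b0 | b1 | b2 | b3)
    }
}

/// Puts the BOS embedding ahead of the encoded voice frames.
func prependBos(_ bos: [Float], to audioPrompt: [[Float]]) -> [[Float]] {
    [bos] + audioPrompt
}

// Synthetic stand-in for the real file: the values 0...1023 as Float32.
var rawBytes: [UInt8] = []
for i in 0..<1024 {
    let bits = Float(i).bitPattern
    rawBytes += [
        UInt8(bits & 0xFF), UInt8((bits >> 8) & 0xFF),
        UInt8((bits >> 16) & 0xFF), UInt8(bits >> 24),
    ]
}
let raw = Data(rawBytes)

let bos = try decodeBosBeforeVoice(raw)
let voiceFrames = [[Float]](repeating: [Float](repeating: 0, count: 1024), count: 95)
let prompt = prependBos(bos, to: voiceFrames)
print(prompt.count)  // 96
```

With 95 voice frames, the prompt grows to 96 rows, consistent with the "1 BOS + 95 voice" layout reported in the test plan.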

### Swift-side changes
- `PocketTtsVoiceCloner` pads/truncates input to the encoder's fixed
240,000 samples (10 s @ 24 kHz, non-flexible shape) and trims output frames
to real-audio duration so zero-padded frames don't bleed into the
prompt.
- `PocketTtsSynthesizer+KVCache.prefillKVCache` prepends
`bos_before_voice` ahead of the audio_prompt on the v1 path. v2
snapshots skip this — their pre-baked KV cache already encodes the
prefix.
- `PocketTtsResourceDownloader.ensureModels` backfills
`bos_before_voice.bin` for caches that predate this fix (per-file fetch)
instead of forcing a full language-pack re-download.
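
The pad/truncate and frame-trim steps from the first bullet can be sketched as below. The function names are hypothetical, and rounding the real-frame count up is an assumption; the actual `PocketTtsVoiceCloner` may round differently.

```swift
let encoderSamples = 240_000  // 10 s at 24 kHz, from the fixed input shape
let encoderFrames = 125       // frames produced for the full window

/// Zero-pads or truncates audio to the encoder's fixed length.
func fitToEncoder(_ audio: [Float]) -> [Float] {
    if audio.count >= encoderSamples {
        return Array(audio.prefix(encoderSamples))
    }
    return audio + [Float](repeating: 0, count: encoderSamples - audio.count)
}

/// Number of output frames that cover real (non-padded) audio.
/// Assumption: partial frames are rounded up.
func realFrameCount(forSampleCount n: Int) -> Int {
    let samplesPerFrame = encoderSamples / encoderFrames  // 1920
    let clamped = min(n, encoderSamples)
    return (clamped + samplesPerFrame - 1) / samplesPerFrame
}

/// Drops trailing frames that only encode zero padding.
func trimFrames(_ frames: [[Float]], originalSampleCount n: Int) -> [[Float]] {
    Array(frames.prefix(realFrameCount(forSampleCount: n)))
}

// Usage: a 2 s clip is padded to 10 s, then only its 25 real frames are kept.
let clip = [Float](repeating: 0.1, count: 48_000)
let fitted = fitToEncoder(clip)
let fakeEncoderOutput = [[Float]](repeating: [Float](repeating: 0, count: 1024), count: encoderFrames)
let kept = trimFrames(fakeEncoderOutput, originalSampleCount: clip.count)
print(fitted.count, kept.count)  // 240000 25
```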

Conversion artifacts and per-language SHA-256s documented in
`mobius/models/tts/pocket_tts/coreml/TRIALS.md` (Phase 7).

## Test plan

- [x] `swift build` clean
- [x] `swift test --filter PocketTtsConstantsLoaderTests` — 3 new tests
pass
- [x] `swift format` applied
- [x] E2E v1 cloning: `am_michael.wav` (7.5 s) → 3.92 s @ 24 kHz Int16,
intelligible voice match. KV cache prefill lands at position 113 = 1 BOS
+ 95 voice + 17 text tokens (matches pocket-tts 2.0.0 layout).
- [x] v2 snapshot regression check: default `alba.safetensors` voice
still synthesizes correctly (prefill position 140, no `bos_before_voice`
involvement)
- [x] Backfill path: deleted `bos_before_voice.bin` from cache, re-ran
cloning — file auto-fetched from HF (4096 bytes) before synthesis
- [x] All 10 language packs verified on HF: SHA-256 match between local
extraction and uploaded `v2/<lang>/constants_bin/bos_before_voice.bin`
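
The backfill check exercised above boils down to finding which required files are absent from an existing cache, so that only those are fetched. A minimal sketch under that assumption follows; `missingFiles(in:required:)` is a hypothetical helper, and the network fetch itself is left out.

```swift
import Foundation

/// Returns the required relative paths that do not exist under `cacheRoot`.
func missingFiles(in cacheRoot: URL, required: [String]) -> [String] {
    required.filter { path in
        let url = cacheRoot.appendingPathComponent(path)
        return !FileManager.default.fileExists(atPath: url.path)
    }
}

// Usage with a throwaway cache directory that predates the fix.
let root = FileManager.default.temporaryDirectory
    .appendingPathComponent(UUID().uuidString)
let constantsDir = root.appendingPathComponent("constants_bin")
try FileManager.default.createDirectory(at: constantsDir, withIntermediateDirectories: true)

let required = ["constants_bin/bos_before_voice.bin"]
let before = missingFiles(in: root, required: required)
print(before)  // ["constants_bin/bos_before_voice.bin"]

// Simulate the per-file fetch by writing a 4096-byte placeholder.
try Data(count: 4096).write(to: constantsDir.appendingPathComponent("bos_before_voice.bin"))
let after = missingFiles(in: root, required: required)
print(after)   // []
```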
2026-05-12 08:55:44 -04:00