mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Files

T

Zhongpai Gao c4d56a5cb5 Feat/pocket tts int8 precision swap (#558 )

### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Wires up the published `flowlm_stepv2.mlmodelc` int8 variant via a
`PocketTtsPrecision { .fp16, .int8 }` parameter on `PocketTtsManager`,
threaded through to `PocketTtsModelStore` and
`PocketTtsResourceDownloader`.

Closes the loop on the `flowlm_stepv2.mlmodelc` artifact that's been
published under `v2/<lang>/` for a while but didn't have a Swift loader
hook. Default stays `.fp16`, no behavior change for existing callers.

## What's in this PR

**Code (5 files, +171/-12):**
- `PocketTtsPrecision.swift` — new enum `{ .fp16, .int8 }`, with
  docstring documenting the kyutai-labs/pocket-tts#147 recipe and
  preserving the per-submodel A/B data from `experiment/pocket-tts-int8`
  (cond_step / flowlm_step / flow_decoder / mimi_decoder safety summary)
- `ModelNames.swift` — `flowlmStepV2` constant +
  `flowlmStepFile(precision:)` and `requiredModels(precision:)` helpers
- `PocketTtsResourceDownloader.swift` — `precision:` param,
  precision-aware cache check, and `removeUnusedFlowlmVariant()` post-
  download cleanup so callers' disk usage matches the loaded models
- `PocketTtsModelStore.swift` — `precision:` init param plumbed to the
  precision-aware filename helper
- `PocketTtsManager.swift` — `precision:` init param threaded to the
  store

**Docs (1 file, +47):**
- `Documentation/TTS/PocketTTS.md` — new "Model Files & Precision"
  section: per-submodel precision/size/HF-path table, fp16-vs-int8
  totals, rationale for why only `flowlm_step` is quantized

## Why default is `.fp16`

I asked about the on-disk weight format before committing the rename
and verified by inspecting `model.mlmodel` for both flowlm variants:
the int8 variant has explicit `cast_fp16_to_fp32` op scaffolding
throughout, while the default has none — indicating uniform fp16
weights. Combined with the 304→77 MB size ratio (~4×, consistent with
fp16→int8 plus quantization scale tensors) the default file's weights
are fp16 on disk. The existing `PocketTtsModelStore.swift:65-67`
comment about "CPU/GPU compute in float32 matches the Python reference"
is correct about runtime compute precision (CoreML upcasts fp16
weights to fp32 on `.cpuAndGPU`); it just doesn't describe disk format
and reads as accurate as-is.

## Why per-submodel quantization isn't exposed

The `experiment/pocket-tts-int8` branch's `PocketTtsQuantization`
struct (per-submodel `PocketTtsModelPrecision`) is a richer API, but
the per-submodel int8 artifacts (`cond_step_int8.mlmodelc`, etc.)
aren't published on HuggingFace today. Adding the API would let
callers request configurations that 404 at download time. Only
`flowlm_stepv2.mlmodelc` is published, and that's what this PR wires
up. The `PocketTtsPrecision` enum can grow into the experiment
branch's `PocketTtsQuantization` shape mechanically if/when the
per-submodel artifacts ship.

## Disk footprint (English language pack)

| | fp16 (default) | int8 |
|---|---|---|
| Total active files on disk | 766.3 MB | 549.3 MB |
| **int8 savings vs fp16** | — | **−217 MB (28%)** |

The `v2/<lang>/` HF directory ships both flowlm variants, so first
download briefly holds ~857 MB before the cleanup pass deletes the
unused `.mlmodelc` and `.mlpackage`.

## Backward compatibility

- `PocketTtsManager()` / `PocketTtsModelStore()` / `ensureModels()`
  defaults all stay `.fp16`, which loads `flowlm_step.mlmodelc` exactly
  as before
- Existing `requiredModels` constant retained alongside new
  `requiredModels(precision:)` so non-precision-aware callers keep
  compiling

## Verification done

- All 11 published language packs have both `flowlm_step.mlmodelc`
  and `flowlm_stepv2.mlmodelc` under `v2/<lang>/` — verified via HF
  tree API
- Branch is exactly +2 commits on top of `main`
  (`00ea906 fix: remove module_map from MachTaskSelfWrapper subspec`)
- Diff content is identical to `Gaozhongpai/FluidAudio:main`, just
  squashed from 5 iterative commits into 2 clean ones (one feat, one
  docs)

I haven't run `swift test` locally — Bash on Windows here, no Swift
toolchain. Happy to fix anything CI flags.

<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/fluidinference/fluidaudio/pull/558"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open in Devin Review">
  </picture>
</a>
<!-- devin-review-badge-end -->

2026-04-29 13:57:15 -04:00

16 KiB

Raw Permalink Blame History

PocketTTS Swift Inference

How the Swift code generates speech from text.

Files

File	Role
`PocketTtsManager.swift`	Public API — `initialize()`, `synthesize()`, `synthesizeToFile()`, `makeSession()`, `cloneVoice()`
`PocketTtsModelStore.swift`	Loads and stores the 4 CoreML models + constants + voice data
`PocketTtsVoiceCloner.swift`	Voice cloning — converts audio to voice conditioning embeddings
`PocketTtsSynthesizer.swift`	Main synthesis loop — chunking, prefill, generation, output
`PocketTtsSession.swift`	Session actor — persistent voice KV cache, enqueue/finish/cancel API
`PocketTtsSynthesizer+KVCache.swift`	KV cache state, `prefillKVCacheVoice()`, `prefillKVCacheText()`, `cloneKVCacheState()`
`PocketTtsSynthesizer+Flow.swift`	Flow decoder loop, `denormalize()`, `quantize()`, SeededRNG
`PocketTtsSynthesizer+Mimi.swift`	Mimi decoder state, `runMimiDecoder()`, `loadMimiInitialState()`
`PocketTtsConstantsLoader.swift`	Loads binary constants (embeddings, tokenizer, quantizer weights)
`PocketTtsConstants.swift`	All numeric constants (dimensions, thresholds, etc.)

Model Files & Precision

The four CoreML submodels (plus the optional Mimi encoder) and their auxiliary asset directories. All paths are relative to FluidInference/pocket-tts-coreml on HuggingFace; sizes are for the English language pack.

File	Precision	Size	HF path	Role
`cond_step.mlmodelc`	fp16	254.3 MB	`v2/<lang>/cond_step.mlmodelc`	KV-cache prefill — runs once per chunk over voice + text tokens (~141 calls); writes the 6-layer KV cache that FlowLM consumes during generation
`flowlm_step.mlmodelc`	fp16	290.5 MB	`v2/<lang>/flowlm_step.mlmodelc`	Autoregressive transformer — runs once per audio frame during generation; outputs a `[1, 1024]` hidden state + EOS logit per step. Loaded when `precision: .fp16` (default)
`flowlm_stepv2.mlmodelc`	int8 attn + FFN linears, fp16 elsewhere	73.5 MB	`v2/<lang>/flowlm_stepv2.mlmodelc`	Drop-in replacement for `flowlm_step` when `precision: .int8` — same I/O signature, ~4× smaller. Quantization recipe per kyutai-labs/pocket-tts#147
`flow_decoder.mlmodelc`	fp16	37.3 MB	`v2/<lang>/flow_decoder.mlmodelc`	LSD flow-matching decoder — runs an 8-step Euler loop per audio frame (`latent += velocity · dt`); turns transformer output into a 32-dim audio latent
`mimi_decoder.mlmodelc`	fp16 (outputs explicitly fp16)	40.0 MB	`v2/<lang>/mimi_decoder.mlmodelc`	Mimi VAE audio decoder — runs once per audio frame, takes a 512-dim quantized vector and produces 1920 PCM samples (24 kHz). Maintains 23 streaming-state tensors fed back as next-frame input
`mimi_encoder.mlmodelc`	fp16	optional	`mimi_encoder.mlmodelc` (repo root)	Voice cloning only. Language-agnostic; lives at the repo root, not under `v2/<lang>/`. Downloaded separately on first `cloneVoice(...)` call
`constants_bin/`	binary tensors	144.2 MB	`v2/<lang>/constants_bin/`	Token embedding table, SentencePiece tokenizer, denormalize/quantize mean+std, per-voice prompts (`alba.safetensors`, etc.)
`constants/`	metadata sidecar	16.7 MB	`v2/<lang>/constants/`	Auxiliary constants referenced by `PocketTtsConstantsLoader`

<lang> is one of: english, french_24l, german, german_24l, italian, italian_24l, portuguese, portuguese_24l, spanish, spanish_24l.

Totals (English, on disk)

Configuration	Total
`precision: .fp16` (default)	766.3 MB
`precision: .int8`	549.3 MB
int8 savings vs fp16	−217 MB (28%)

The v2/<lang>/ HF directory ships both flowlm variants, so a fresh download pulls the unused one too. PocketTtsResourceDownloader deletes the unused FlowLM .mlmodelc and .mlpackage directories after download completes so only the requested precision occupies disk long-term.

Why only `flowlm_step` is quantized

The four submodels have different sensitivity to quantization. Only the FlowLM transformer is published in an int8 variant upstream:

Submodel	Quantized?	Why
`cond_step`	No	One-shot prefill; conditioning errors propagate through the entire utterance
`flowlm_step`	Yes	Per-frame transformer with causal attention; quantization error stays bounded per frame, doesn't compound. Largest file — best size-to-risk trade
`flow_decoder`	No	8-step Euler loop where each step's error feeds the next; small file (37 MB) makes savings marginal anyway
`mimi_decoder`	No	Autoregressive feedback loop where 23 streaming-state tensors carry across frames; errors compound frame-over-frame

Call Flow

PocketTtsManager.synthesize(text:)
  |
  v
PocketTtsSynthesizer.synthesize(text:voice:temperature:)
  |
  |-- chunkText()              split text into <=50 token chunks
  |-- loadMimiInitialState()   load 23 streaming state tensors from disk
  |
  |-- FOR EACH CHUNK:
  |     |
  |     |-- tokenizer.encode()     SentencePiece text → token IDs
  |     |-- embedTokens()          table lookup: token ID → [1024] vector
  |     |-- prefillKVCache()       feed 125 voice + N text tokens through cond_step
  |     |     |
  |     |     |-- emptyKVCacheState()   fresh cache (6 layers × [2,1,512,16,64])
  |     |     |-- runCondStep() × ~141  one token per call, updates cache
  |     |
  |     |-- GENERATE LOOP (until EOS or max frames):
  |     |     |
  |     |     |-- runFlowLMStep()       → transformer_out [1,1024] + eos_logit
  |     |     |-- flowDecode()          → 32-dim latent
  |     |     |     |-- randn(32) * sqrt(temperature)
  |     |     |     |-- runFlowDecoderStep() × 8 Euler steps
  |     |     |     |-- latent += velocity * dt each step
  |     |     |
  |     |     |-- denormalize()         latent * std + mean
  |     |     |-- quantize()            matmul [32] × [32,512] → [512]
  |     |     |-- runMimiDecoder()      [512] → 1920 audio samples
  |     |     |     updates 23 streaming state tensors
  |     |     |
  |     |     |-- createSequenceFromLatent()  feed latent back for next frame
  |
  |-- concatenate all frames
  |-- applyTtsPostProcessing() (optional de-essing)
  |-- AudioWAV.data()          wrap in WAV header (24kHz mono)

Key State

KV Cache (`KVCacheState`)

6 cache tensors [2, 1, 512, 16, 64] + 6 position counters
Written during prefill (voice + text tokens)
Read and extended during generation (one position per frame)
Reset per chunk — each chunk gets a fresh cache

Mimi State (`MimiState`)

23 tensors: convolution history, attention caches, overlap-add buffers
Loaded once from mimi_init_state/*.bin files via manifest.json
Updated after every runMimiDecoder() call — outputs feed back as next input
Continuous across chunks — never reset, keeps audio seamless

Text Chunking

Long text is split into chunks of <=50 tokens to fit the KV cache (512 positions, minus ~125 voice + ~25 overhead).

Splitting priority:

Sentence boundaries (.!?)
Clause boundaries (,;:)
Word boundaries (fallback)

normalizeText() also capitalizes, adds terminal punctuation, and pads short text with leading spaces for better prosody.

EOS Detection

runFlowLMStep() returns an eos_logit. When it exceeds -4.0, the code generates a few extra frames (3 for short text, 1 for long) then stops.

CoreML Details

All 4 models loaded with .cpuAndGPU compute units (ANE float16 causes artifacts in Mimi state feedback)
Models compiled from .mlpackage → .mlmodelc on first load, cached on disk
PocketTtsModelStore is an actor — thread-safe access to loaded models
Voice data cached per voice name to avoid reloading

Voice Cloning

Clone any voice from a short audio sample (1-30 seconds) using the Mimi encoder model.

How It Works

Audio is loaded and resampled to 24kHz mono using AudioConverter
The Mimi encoder converts audio to conditioning embeddings [1, num_frames, 1024]
Embeddings are used at their natural length (no padding) — the KV cache prefill processes the actual number of frames
The resulting PocketTtsVoiceData can be used directly for synthesis

Variable-length support is important: zero-padding shorter audio would corrupt voice conditioning by feeding meaningless vectors into the transformer.

Voice Cloning API

// Clone from audio file (WAV, MP3, M4A, etc.)
let voiceData = try await manager.cloneVoice(from: audioURL)

// Clone from raw samples (24kHz mono Float32)
let voiceData = try await manager.cloneVoice(from: samples)

// Use cloned voice immediately (no file I/O needed)
let audio = try await manager.synthesize(text: "Hello!", voiceData: voiceData)

// Save for later use
try manager.saveClonedVoice(voiceData, to: outputURL)

// Load previously saved voice
let savedVoice = try manager.loadClonedVoice(from: savedVoiceURL)
let audio = try await manager.synthesize(text: "Hello!", voiceData: savedVoice)

CLI Usage

# Clone voice and synthesize in one step
fluidaudio tts "Hello world" --backend pocket --clone-voice speaker.wav

# Clone, save for later, and synthesize
fluidaudio tts "Hello world" --backend pocket --clone-voice speaker.wav --save-voice my_voice.bin

# Use previously saved voice
fluidaudio tts "Hello world" --backend pocket --voice-file my_voice.bin

Requirements

Audio duration: 1-30 seconds (capped at 250 frames / ~20s to leave KV cache room)
The mimi_encoder.mlmodelc model is downloaded automatically on first use
Supports any audio format that AVFoundation can read

Cloning Across Languages

The Mimi encoder is language-agnostic — voice cloning produces a generic acoustic embedding that any language pack's cond_step model can consume. You can:

Clone a voice once and reuse the same PocketTtsVoiceData across managers configured with different languages.
Clone a voice with a Spanish-only manager without pulling in the English language pack — only the encoder subtree is downloaded.

// Clone with a Spanish manager
let esManager = PocketTtsManager(language: .spanish)
try await esManager.initialize()
let voiceData = try await esManager.cloneVoice(from: speakerAudioURL)

// Use the same cloned voice with a French manager
let frManager = PocketTtsManager(language: .french24L)
try await frManager.initialize()
let frAudio = try await frManager.synthesize(text: "Bonjour", voiceData: voiceData)

Pipeline and Pronunciation Control

text → SentencePiece tokenizer → subword tokens → PocketTTS model → audio
                                                    ↑
                                          pronunciation decisions
                                          happen inside model weights
                                          (no external control)

Unlike Kokoro which uses a CoreML G2P model to convert text to IPA phonemes before the model, PocketTTS feeds raw text tokens directly into the neural network. The model learned text→pronunciation mappings during training — there is no phoneme stage to intercept.

Feature Support

Feature	Supported	Can We Add?	Why
SSML `<phoneme>`	No	No	No IPA layer — model has no phoneme vocabulary
Custom lexicon (word → IPA)	No	No	No phoneme stage to apply mappings
Markdown `[word](/ipa/)`	No	No	Same — no phoneme input
SSML `<sub>` (text substitution)	No	Yes	Text-level, can run before tokenizer
Text preprocessing (numbers, dates)	Minimal	Yes	Text-level, can run before tokenizer

What Can Be Added

Text-level preprocessing that runs before the SentencePiece tokenizer:

Number/date/currency expansion — "123" → "one hundred twenty three"
<sub> substitution — replace abbreviations with full text before tokenization
Phonetic spelling workarounds — spelling out pronunciation ("NVIDIA" → "en-vidia"), though unreliable since the model may not pronounce phonetic spellings consistently

What Cannot Be Added (Without Retraining)

<phoneme> tags — the model has no IPA vocabulary
Custom lexicon — no phoneme stage to apply word → IPA mappings
Fine-grained pronunciation control — the model decides pronunciation from text tokens alone

See Kokoro.md if you need pronunciation control.

Session API

For streaming input or long-running low-latency sessions, makeSession() performs the voice prefill once, then each enqueued utterance only prefills text tokens. Mimi state persists across utterances for seamless audio.

let session = try await manager.makeSession(voice: "alba")
session.enqueue("Hello there.")
session.enqueue("How are you doing today?")
session.finish()
for try await frame in session.frames {
    playAudio(frame.samples)
}

Method	Description
`manager.makeSession(voice:temperature:seed:)`	Create session with named voice
`manager.makeSession(voiceData:temperature:seed:)`	Create session with cloned voice
`session.enqueue(_ text:)`	Add text (non-async, safe from any context)
`session.finish()`	End the session and complete the frames stream
`session.cancel()`	Stop generation immediately
`session.frames`	`AsyncThrowingStream<AudioFrame, Error>`

Scenario	API
One-shot synthesis	`synthesize()`
Streaming playback	`synthesizeStreaming()`
Streaming text or custom chunking	`makeSession()`

Languages

PocketTTS ships with multiple language packs converted from kyutai/pocket-tts. Pick the one that matches your input text — there is no automatic language detection.

ID	Layers	HF Path
`english`	6	repo root (legacy layout)
`german`	6	`v2/german/`
`german_24l`	24	`v2/german_24l/`
`italian`	6	`v2/italian/`
`italian_24l`	24	`v2/italian_24l/`
`portuguese`	6	`v2/portuguese/`
`portuguese_24l`	24	`v2/portuguese_24l/`
`spanish`	6	`v2/spanish/`
`spanish_24l`	24	`v2/spanish_24l/`
`french_24l`	24	`v2/french_24l/`

Notes:

French only ships a 24-layer pack upstream (no 6-layer variant).
24-layer packs are higher quality but slower and larger.
The 21 voice names (alba, anna, eve, michael, …) are shared across languages, but the underlying acoustic embeddings are per-language.
Mimi encoder weights (used for voice cloning) are language-agnostic and always live at the repo root.

Swift API

let manager = PocketTtsManager(language: .spanish)
try await manager.initialize()
let audio = try await manager.synthesize(text: "Hola mundo")

PocketTtsManager.language is immutable per instance. To support multiple languages in one app, instantiate one manager per language.

CLI Usage

# Default (English)
fluidaudio tts "Hello world" --backend pocket --output en.wav

# Spanish (6L)
fluidaudio tts "Hola mundo" --backend pocket --language spanish --output es.wav

# French (24L only)
fluidaudio tts "Bonjour" --backend pocket --language french_24l --output fr.wav

Usage

PocketTTS is part of core FluidAudio - no GPL dependencies required.

import FluidAudio

let manager = PocketTtsManager()
try await manager.initialize()

// Using built-in voices
let audioData = try await manager.synthesize(text: "Hello, world!")

// Using cloned voice
let voiceData = try await manager.cloneVoice(from: speakerAudioURL)
let audioData = try await manager.synthesize(text: "Hello, world!", voiceData: voiceData)

try await manager.synthesizeToFile(
    text: "Hello, world!",
    outputURL: URL(fileURLWithPath: "/tmp/output.wav")
)

License

PocketTTS models: CC-BY-4.0, inherited from kyutai/pocket-tts
FluidAudio SDK: Apache 2.0 licensed (no GPL dependencies)

16 KiB Raw Permalink Blame History Unescape Escape

PocketTTS Swift Inference

Files

Model Files & Precision

Totals (English, on disk)

Why only flowlm_step is quantized

Call Flow

Key State

KV Cache (KVCacheState)

Mimi State (MimiState)

Text Chunking

EOS Detection

CoreML Details

Voice Cloning

How It Works

Voice Cloning API

CLI Usage

Requirements

Cloning Across Languages

Pipeline and Pronunciation Control

Feature Support

What Can Be Added

What Cannot Be Added (Without Retraining)

Session API

Languages

Swift API

CLI Usage

Usage

License

16 KiB

Raw Permalink Blame History

Why only `flowlm_step` is quantized

KV Cache (`KVCacheState`)

Mimi State (`MimiState`)