mirror of
https://github.com/FluidInference/FluidAudio.git
synced 2026-05-12 20:20:36 +00:00
cad8a2b563
## Summary

Two related changes that grew out of the Cohere isolated-bench investigation:

### 1. Long-form Cohere ASR: `CoherePipeline.transcribeLong`

The encoder is fixed at a 35 s window, so the prior `transcribe()` silently truncated longer audio via `padOrTruncate(... fixedFrames: 3_500)` in `Sources/FluidAudio/ASR/Cohere/CoherePipeline.swift:250`. `transcribeLong` slices audio into 35 s chunks with **5 s overlap** (matches upstream `cohere-pytorch/config.json` `overlap_chunk_second: 5`) and stitches adjacent chunks via a **token-level longest-common-substring merge**. No model changes: the encoder shape stays `[1, 128, 3500]` and the decoder cache shape is unchanged.

- Audio ≤ 35 s short-circuits to the existing single-chunk `transcribe()` path → byte-identical short-form behavior, zero perf delta on FLEURS / LibriSpeech (whose clips are all ≤ 35 s)
- Audio > 35 s: hop = 30 s, decode each chunk independently, merge token streams (drop the suffix's matched head, keep prefix as-is)
- LCS (longest-common-substring) window bounded to 32 tokens per seam → O(K²) merge is negligible vs. decode
- Per-chunk encoder/decoder/total seconds are summed into one `TranscriptionResult`

CLI rewiring:

- `cohere-transcribe`, `cohere-benchmark`, `tts-benchmark` now route through `transcribeLong`
- `cohere-benchmark` no longer skips files exceeding 35 s

Smoke-tested on a Mandarin 81 s WAV: full 80 s now transcribes (previously cut at 35 s).

10 unit tests cover `mergeTokenStreams` correctness (empty-input, no-overlap, threshold fallback, boundary overlap, offset overlap, longest-run preference, window bounds) and chunk-config constants.

### 2. Cold-start vs warm inference docs

Adds a section to `Documentation/ASR/Cohere.md` capturing the isolated single-process bench (cold ANE compile ~186 s on M2 Tahoe; warm calls 3.4–4.6 s, RTFx 1.96×–8.73×). Clarifies that process reuse is what unlocks the headline FLEURS/LibriSpeech RTFx.

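
For reviewers, the seam merge described in section 1 can be sketched roughly as below. This is an illustrative sketch, not the code in this PR: the function name `mergeTokenStreamsSketch`, the `Int` token type, and the `minMatch` threshold of 3 are assumptions (the PR text only states the 32-token window and the "drop the suffix's matched head, keep prefix as-is" rule). It finds the longest common contiguous run between the last `window` tokens of the earlier chunk and the first `window` tokens of the later chunk, then drops everything in the later chunk up to the end of that run.

```swift
/// Hypothetical sketch of a token-level longest-common-substring seam merge.
/// Not the FluidAudio implementation; names and `minMatch` are assumptions.
func mergeTokenStreamsSketch(
    prefix: [Int],
    suffix: [Int],
    window: Int = 32,   // per-seam bound stated in the PR text
    minMatch: Int = 3   // assumed fallback threshold; real value not stated
) -> [Int] {
    if prefix.isEmpty { return suffix }
    if suffix.isEmpty { return prefix }

    // Only the tokens nearest the seam are compared, so the cost is O(K²).
    let tail = Array(prefix.suffix(window))
    let head = Array(suffix.prefix(window))

    // previous[j + 1] holds the length of the common run ending at
    // tail[i - 1] and head[j]; current is the same for tail[i].
    var previous = [Int](repeating: 0, count: head.count + 1)
    var bestLength = 0
    var bestHeadEnd = 0  // exclusive end index of the best run inside `head`

    for i in 0..<tail.count {
        var current = [Int](repeating: 0, count: head.count + 1)
        for j in 0..<head.count where tail[i] == head[j] {
            current[j + 1] = previous[j] + 1
            if current[j + 1] > bestLength {
                bestLength = current[j + 1]
                bestHeadEnd = j + 1
            }
        }
        previous = current
    }

    // Too little agreement: fall back to plain concatenation.
    guard bestLength >= minMatch else { return prefix + suffix }

    // Keep the prefix as-is; drop the suffix up to the end of the matched run.
    return prefix + Array(suffix.dropFirst(bestHeadEnd))
}
```

Under these assumptions, merging `[1, 2, 3, 4, 5, 6]` with `[9, 4, 5, 6, 7]` yields `[1, 2, 3, 4, 5, 6, 7]`: the run `4, 5, 6` is matched at an offset in the later chunk, and the stray leading `9` is dropped along with it.
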
## Test plan

- [x] `swift test --filter CohereLongFormTests` (10 / 10 pass)
- [x] `swift test --filter CohereAsrConfigTests` and `CoherePipelineMaskTests` (no regressions)
- [x] `swift build` (debug) clean; `swift build -c release` clean
- [x] Smoke: `cohere-transcribe /tmp/cohere_test_80s.wav --language zh` transcribes full 80 s of audio (3 chunks merged, no duplicated overlap content)
- [x] `swift format lint`: no new warnings in changed files
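
The three-chunk count in the smoke test follows from the 35 s window and the 30 s hop (35 s minus the 5 s overlap). A minimal sketch of that arithmetic, assuming the final chunk simply covers whatever audio remains; `chunkRangesSketch` and its parameter names are hypothetical and do not appear in the PR:

```swift
/// Hypothetical helper that lists chunk boundaries in seconds.
/// Assumes the last chunk is shorter than the window and is padded elsewhere.
func chunkRangesSketch(
    durationSeconds: Double,
    windowSeconds: Double = 35,
    overlapSeconds: Double = 5
) -> [(start: Double, end: Double)] {
    // Short audio stays on the single-chunk path.
    guard durationSeconds > windowSeconds else {
        return [(start: 0, end: max(durationSeconds, 0))]
    }

    let hop = windowSeconds - overlapSeconds  // 30 s with the defaults
    var ranges: [(start: Double, end: Double)] = []
    var start = 0.0

    while true {
        let end = min(start + windowSeconds, durationSeconds)
        ranges.append((start: start, end: end))
        if end >= durationSeconds { break }  // this chunk reached the end
        start += hop
    }
    return ranges
}

let ranges = chunkRangesSketch(durationSeconds: 80)
print(ranges.count)  // 3 chunks: 0–35 s, 30–65 s, 60–80 s
```
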