FluidAudio

mirror of https://github.com/FluidInference/FluidAudio.git synced 2026-05-12 20:20:36 +00:00

Alex cad8a2b563 feat(asr/cohere): long-form transcribeLong + cold/warm docs (#564)

## Summary

Two related changes that grew out of the cohere isolated-bench
investigation:

### 1. Long-form Cohere ASR — `CoherePipeline.transcribeLong`

The encoder is fixed at a 35 s window, so the prior `transcribe()`
silently truncated longer audio via `padOrTruncate(... fixedFrames:
3_500)` in `Sources/FluidAudio/ASR/Cohere/CoherePipeline.swift:250`.
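
To make the truncation concrete, here is a minimal sketch of what a pad-or-truncate step does. It is not the code in `CoherePipeline.swift`; the name `padOrTruncateSketch` and the flat `[Float]` frame layout are assumptions for illustration. The 100 frames/s rate is implied by the figures above (3,500 frames for a 35 s window).

```swift
/// Hypothetical illustration, not the FluidAudio implementation.
/// Forces a frame sequence to exactly `fixedFrames` entries:
/// shorter input is zero-padded, longer input is cut with no error raised.
func padOrTruncateSketch(_ frames: [Float], fixedFrames: Int = 3_500) -> [Float] {
    if frames.count >= fixedFrames {
        // Everything past the fixed window is dropped silently.
        return Array(frames.prefix(fixedFrames))
    }
    let padding = [Float](repeating: 0, count: fixedFrames - frames.count)
    return frames + padding
}

// 80 s of audio at the implied 100 frames/s is 8,000 frames.
let longInput = [Float](repeating: 1, count: 8_000)
let fitted = padOrTruncateSketch(longInput)
// `fitted` covers only the first 35 s; the remaining 45 s never reach the encoder.
```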

`transcribeLong` slices audio into 35 s chunks with **5 s overlap**
(matches upstream `cohere-pytorch/config.json` `overlap_chunk_second:
5`) and stitches adjacent chunks via **token-level
longest-common-substring merge**. No model changes — encoder shape stays
`[1, 128, 3500]`, decoder cache shape unchanged.
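
The windowing arithmetic can be sketched as follows. This is an illustration rather than the shipped code: `chunkRanges` is a hypothetical helper, and the 16 kHz sample rate is an assumption not stated in this PR. The hop is the window minus the overlap (35 s − 5 s = 30 s).

```swift
/// Hypothetical helper, not FluidAudio API.
/// Returns sample ranges for fixed-size windows that overlap by `overlapSeconds`.
func chunkRanges(
    sampleCount: Int,
    sampleRate: Int = 16_000,      // assumed sample rate
    windowSeconds: Int = 35,
    overlapSeconds: Int = 5
) -> [Range<Int>] {
    let window = windowSeconds * sampleRate
    let hop = (windowSeconds - overlapSeconds) * sampleRate
    // Audio that fits in one window needs no chunking.
    guard sampleCount > window else {
        return sampleCount > 0 ? [0..<sampleCount] : []
    }
    var ranges: [Range<Int>] = []
    var start = 0
    while start < sampleCount {
        let end = min(start + window, sampleCount)
        ranges.append(start..<end)
        if end == sampleCount { break }  // last window reached the end of the audio
        start += hop
    }
    return ranges
}

// An 80 s clip yields windows starting at 0 s, 30 s, and 60 s.
let ranges80 = chunkRanges(sampleCount: 80 * 16_000)
```

Under these assumptions an 80 s file produces three windows, with the last one shorter than 35 s and left for the pad step to fill.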

- Audio ≤ 35 s short-circuits to the existing single-chunk
`transcribe()` path → byte-identical short-form behavior, zero perf
delta on FLEURS / LibriSpeech (which are all ≤ 35 s)
- Audio > 35 s: hop = 30 s, decode each chunk independently, merge token
streams (drop the suffix's matched head, keep prefix as-is)
- LCS window bounded to 32 tokens per seam → O(K²) merge is negligible
vs. decode
- Per-chunk encoder/decoder/total seconds are summed into one
`TranscriptionResult`
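
The seam merge described in the bullets above can be sketched like this. It is one reading of the description, not the actual `mergeTokenStreams` code: the function name `mergeTokensSketch`, the `Int` token type, and the `minMatch` threshold of 2 are assumptions. The search runs a longest-common-substring dynamic program over at most `window` tokens on each side of the seam, which is where the O(K²) bound comes from.

```swift
/// Hypothetical sketch, not the FluidAudio implementation.
/// Finds the longest contiguous token run shared by the tail of `prefix`
/// and the head of `suffix`, then appends `suffix` minus its matched head.
func mergeTokensSketch(
    _ prefix: [Int],
    _ suffix: [Int],
    window: Int = 32,
    minMatch: Int = 2   // assumed threshold; below it we fall back to concatenation
) -> [Int] {
    if prefix.isEmpty { return suffix }
    if suffix.isEmpty { return prefix }

    let tail = Array(prefix.suffix(window))
    let head = Array(suffix.prefix(window))

    // previous[j + 1] holds the length of the common run ending at
    // tail[i - 1] and head[j]; only one previous row is needed.
    var previous = [Int](repeating: 0, count: head.count + 1)
    var bestLength = 0
    var bestEndInHead = 0  // exclusive end index of the best run inside `head`

    for i in 0..<tail.count {
        var current = [Int](repeating: 0, count: head.count + 1)
        for j in 0..<head.count where tail[i] == head[j] {
            current[j + 1] = previous[j] + 1
            if current[j + 1] > bestLength {
                bestLength = current[j + 1]
                bestEndInHead = j + 1
            }
        }
        previous = current
    }

    // Too little agreement at the seam: keep both streams untouched.
    guard bestLength >= minMatch else { return prefix + suffix }
    // Keep the prefix as-is and drop the suffix through the end of the match.
    return prefix + Array(suffix.dropFirst(bestEndInHead))
}

let merged = mergeTokensSketch([1, 2, 3, 4, 5], [4, 5, 6, 7])
```

In this sketch a strict `>` comparison means the earliest of several equally long runs wins; the real tie-breaking rule is not stated in the PR.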

CLI rewiring:
- `cohere-transcribe`, `cohere-benchmark`, `tts-benchmark` now route
through `transcribeLong`
- `cohere-benchmark` no longer skips files exceeding 35 s

Smoke-tested on a Mandarin 81 s WAV: full 80 s now transcribes
(previously cut at 35 s). 10 unit tests cover `mergeTokenStreams`
correctness (empty-input, no-overlap, threshold fallback, boundary
overlap, offset overlap, longest-run preference, window bounds) and
chunk-config constants.

### 2. Cold-start vs warm inference docs

Adds a section to `Documentation/ASR/Cohere.md` capturing the isolated
single-process bench (cold ANE compile ~186 s on an M2 running macOS Tahoe;
warm calls 3.4–4.6 s, RTFx 1.96×–8.73×). Clarifies that process reuse is what
unlocks the headline FLEURS/LibriSpeech RTFx.
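
A rough sketch of why process reuse matters, using the usual definition of RTFx as audio duration divided by processing time. The helper names and the 30 s clip length are hypothetical, and modeling the cold cost as a single up-front charge is a simplification; only the ~186 s compile and the roughly 4 s warm call echo the figures above.

```swift
/// RTFx = seconds of audio processed per second of wall-clock time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

/// Hypothetical model: one cold compile paid once, then `calls` warm inferences
/// in the same process, each on a clip of `audioSecondsPerCall`.
func amortizedRTFx(
    audioSecondsPerCall: Double,
    coldCompileSeconds: Double,
    warmSecondsPerCall: Double,
    calls: Int
) -> Double {
    let totalAudio = audioSecondsPerCall * Double(calls)
    let totalTime = coldCompileSeconds + warmSecondsPerCall * Double(calls)
    return rtfx(audioSeconds: totalAudio, processingSeconds: totalTime)
}

// Illustrative numbers: 30 s clips (assumed), ~186 s cold compile, ~4 s warm call.
let warmOnly = rtfx(audioSeconds: 30, processingSeconds: 4)
let singleShot = amortizedRTFx(
    audioSecondsPerCall: 30, coldCompileSeconds: 186, warmSecondsPerCall: 4, calls: 1)
let hundredCalls = amortizedRTFx(
    audioSecondsPerCall: 30, coldCompileSeconds: 186, warmSecondsPerCall: 4, calls: 100)
```

With these inputs a one-shot process runs well below real time, while a process that stays alive for many calls approaches the warm-only figure.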

## Test plan

- [x] `swift test --filter CohereLongFormTests` (10 / 10 pass)
- [x] `swift test --filter CohereAsrConfigTests` and
`CoherePipelineMaskTests` (no regressions)
- [x] `swift build` (debug) clean; `swift build -c release` clean
- [x] Smoke: `cohere-transcribe /tmp/cohere_test_80s.wav --language zh`
transcribes full 80 s of audio (3 chunks merged, no duplicated overlap
content)
- [x] `swift format lint` — no new warnings in changed files

2026-05-01 10:26:27 -04:00

benchmarks100.md

Fix Japanese TDT model download filename mismatch (#522)

2026-04-20 17:56:10 -04:00

Cohere.md

feat(asr/cohere): long-form transcribeLong + cold/warm docs (#564)

2026-05-01 10:26:27 -04:00

CustomPronunciation.md

docs: fix CLI name references to fluidaudiocli (#372)

2026-03-14 17:49:36 -04:00

CustomVocabulary.md

Clarify custom vocabulary model compatibility and approach selection (#469)