Files
Brandon Weng 0935593bef Fix VAD threshold overriding per segment (#155)
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

https://github.com/FluidInference/FluidAudio/pull/153 <-- from this PR,
but thought it would be easier for me to just clean it up entirely.

Rename the actor level threshold to "defaultThreshold" and actually
allow overriding per segment.

previously the end speech (negative threshold) wasn't being used either
2025-10-21 19:45:08 -04:00

4.1 KiB
Raw Permalink Blame History

Configuration fields

Configuration for turning raw VAD probabilities into stable speech segments.

This struct applies rules for minimum durations, thresholds, and hysteresis to avoid jittery cuts and to produce clean, ASR-ready segments.

public struct VadSegmentationConfig: Sendable {
    /// Minimum length of detected speech to keep as a segment (default: 0.15s).
    /// Prevents clicks or coughs from being treated as speech.
    public var minSpeechDuration: TimeInterval

    /// Minimum silence required to end a segment (default: 0.75s).
    /// Prevents early cut-offs when a speaker pauses briefly.
    public var minSilenceDuration: TimeInterval

    /// Maximum length of a single speech segment (default: 14s).
    /// Segments longer than this will be forcibly split to match ASR model limits.
    public var maxSpeechDuration: TimeInterval

    /// Extra padding (before and after) each detected speech segment (default: 0.1s).
    /// Keeps context around words so they arent clipped.
    public var speechPadding: TimeInterval

    /// Probability threshold below which audio is treated as silence (default: 0.3).
    /// Lower = stricter silence detection, higher = more tolerant.
    public var silenceThresholdForSplit: Float

    /// Explicit override for the *exit* hysteresis threshold (default: nil).
    /// If not set, the system computes it automatically from the base threshold minus `negativeThresholdOffset`.
    public var negativeThreshold: Float?

    /// How far below the base threshold the *exit* threshold should be (default: 0.15).
    /// Example: if entry = 0.5, exit = 0.35. Prevents rapid flipping on noisy inputs.
    public var negativeThresholdOffset: Float

    /// Minimum silence enforced when splitting a max-length segment (default: 0.098s).
    /// Ensures forced splits dont land mid-phoneme.
    public var minSilenceAtMaxSpeech: TimeInterval

    /// If true, try to split at the longest silence near the max duration cutoff.
    /// Produces cleaner segment boundaries compared to a hard cut.
    public var useMaxPossibleSilenceAtMaxSpeech: Bool

    public static let `default` = VadSegmentationConfig()

    public init(
        minSpeechDuration: TimeInterval = 0.15,
        minSilenceDuration: TimeInterval = 0.75,
        maxSpeechDuration: TimeInterval = 14.0,
        speechPadding: TimeInterval = 0.1,
        silenceThresholdForSplit: Float = 0.3,
        negativeThreshold: Float? = nil,
        negativeThresholdOffset: Float = 0.15,
        minSilenceAtMaxSpeech: TimeInterval = 0.098,
        useMaxPossibleSilenceAtMaxSpeech: Bool = true
    ) {
        self.minSpeechDuration = minSpeechDuration
        self.minSilenceDuration = minSilenceDuration
        self.maxSpeechDuration = maxSpeechDuration
        self.speechPadding = speechPadding
        self.silenceThresholdForSplit = silenceThresholdForSplit
        self.negativeThreshold = negativeThreshold
        self.negativeThresholdOffset = negativeThresholdOffset
        self.minSilenceAtMaxSpeech = minSilenceAtMaxSpeech
        self.useMaxPossibleSilenceAtMaxSpeech = useMaxPossibleSilenceAtMaxSpeech
    }

    /// Computes the working negative threshold for hysteresis:
    /// - If `negativeThreshold` is set, that value is used.
    /// - Otherwise, it is computed as (baseThreshold  negativeThresholdOffset).
    /// - This creates a "sticky zone" between thresholds:
    ///   - Enter speech when prob > baseThreshold
    ///   - Exit speech when prob < negativeThreshold
    ///   - Stay in current state in between
    public func effectiveNegativeThreshold(baseThreshold: Float) -> Float {
        if let override = negativeThreshold {
            return override
        }
        return max(baseThreshold - negativeThresholdOffset, 0.01)
    }
}

The entry threshold for hysteresis defaults to VadConfig.defaultThreshold, set when you construct a VadManager. If you provide a negativeThreshold, the streaming helpers derive an entry threshold by adding negativeThresholdOffset (clamped to 1.0), allowing per-request tuning without rebuilding the manager. To change the baseline entry threshold globally, create the manager with a different defaultThreshold.