
Agent speech silently dropped when interrupted before the first audio frame (resumeFalseInterruption) — port of Python #5039 #1909

Description

@enriqueespaillat-gyde

Summary

With resumeFalseInterruption: true, a brief user sound that arrives after say() is issued but before the agent's first TTS audio frame is forwarded pauses the speech and leaves firstFrameFut unresolved. The user hears no audio at all, the turn is dropped from history, and the call then sits in silence.

This is the JS counterpart of livekit/agents#5038 (Python), which was fixed by livekit/agents#5039. That fix does not appear to be ported to agents-js (the relevant code path is unchanged on main).

Environment

| Field | Value |
| --- | --- |
| @livekit/agents | 1.4.6 (relevant path identical on 1.4.7 and current main) |
| Node | 22.x |
| Turn detection | streaming STT, turnDetection: "stt" |
| Interruption config | { mode: "adaptive", minWords: 2, minDuration: 500, resumeFalseInterruption: true } |
| Transport | outbound telephony (SIP); mechanism is transport-independent |

Steps to reproduce

  1. Outbound call. On answer, the agent speaks a fixed opener: session.say(text, { allowInterruptions: true }).
  2. The TTS has a non-trivial time-to-first-byte (a few hundred ms), so the first audio frame is still in flight after say() returns.
  3. The callee makes a brief sound ("hello?") in that pre-first-frame window.

Expected

The brief false interruption pauses the speech and then resumes and plays it once the false interruption clears (per resumeFalseInterruption), or the speech is interrupted cleanly and re-attempted.

Actual

firstFrameFut never resolves; the first audio frame arrives after the segment is torn down and is discarded; the turn is dropped from history. No audio reaches the user, and the interruption is not counted (session reports zero interruptions). minWords is irrelevant — the path that fires has no word-count gate.

Root cause (JS)

Two behaviors combine.

1. onStartOfSpeech pauses a not-yet-playing speech, ungated. In voice/agent_activity.ts, onStartOfSpeech pauses the current speech as soon as user VAD-start fires, guarded only by:

agentState !== "speaking"            // agent's first audio frame hasn't played yet
&& pauseEnabled()
&& _currentSpeech.allowInterruptions  // the opener is interruptible

There is no minWords and no duration check on this branch, so any user sound in the pre-first-frame window pauses the speech.
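To make the missing gate concrete, here is a minimal sketch of the guard as a standalone predicate. It is a model of the condition quoted above, not the actual code in voice/agent_activity.ts: the names shouldPauseOnStartOfSpeech, PauseGuardInput, userWords and userSpeechMs, and the simplified AgentState union, are all hypothetical.

```typescript
// Hypothetical model of the pause guard; not the real agent_activity.ts code.
// AgentState is a simplified union used only for this illustration.
type AgentState = "listening" | "thinking" | "speaking";

interface PauseGuardInput {
  agentState: AgentState;
  pauseEnabled: boolean;
  allowInterruptions: boolean;
  // Included only to show that these values are never consulted on this branch.
  userWords: number;
  userSpeechMs: number;
}

function shouldPauseOnStartOfSpeech(input: PauseGuardInput): boolean {
  // Mirrors the three-part condition from the issue: no minWords, no minDuration.
  return (
    input.agentState !== "speaking" &&
    input.pauseEnabled &&
    input.allowInterruptions
  );
}

// A 50 ms blip with zero recognized words, arriving before the first frame plays.
const pausedByBlip = shouldPauseOnStartOfSpeech({
  agentState: "thinking",
  pauseEnabled: true,
  allowInterruptions: true,
  userWords: 0,
  userSpeechMs: 50,
});

// The same blip once the agent is already speaking does not take this branch.
const pausedWhileSpeaking = shouldPauseOnStartOfSpeech({
  agentState: "speaking",
  pauseEnabled: true,
  allowInterruptions: true,
  userWords: 0,
  userSpeechMs: 50,
});
```

In this model, pausedByBlip is true even though the sound is far below minWords: 2 and minDuration: 500, which is the behavior reported here.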

2. The pause means firstFrameFut never resolves, so audio + transcript are dropped. In voice/generation.ts, the finally block of forwardAudio rejects the future when no frame was forwarded:

if (!out.firstFrameFut.done) {
  out.firstFrameFut.reject(new Error("audio forwarding cancelled before playback started"));
}
audioOutput.flush();
if (signal?.aborted) audioOutput.clearBuffer();

Downstream, the reply task only preserves the synchronized transcript when firstFrameFut.done && !firstFrameFut.rejected, so on the rejected (no-first-frame) path the transcript is blanked and the audio is discarded — the JS analogue of the else: forwarded_text = "" overwrite called out in livekit/agents#5038.

Observable log signature on the dropped turn:

SegmentSynchronizerImpl.markPlaybackFinished called before text/audio input is done
SegmentSynchronizerImpl.onPlaybackStarted called after close
playback_finished called more times than playback segments were captured

Minimal reproduction (mechanism)

The destruction half (behavior 2 above) is deterministic when exercised against the real performAudioForwarding: when the forwarding signal is aborted before the first frame, zero frames reach the sink and firstFrameFut rejects.

| Case | Frames to output | firstFrameFut | clearBuffer |
| --- | --- | --- | --- |
| Frames flow, no cancel | 3 | resolved | 0 |
| Cancelled before first frame | 0 (silent) | rejected | 1 |
| TTS yields no frame | 0 (silent) | rejected | 0 |
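The three cases can be modelled without the library. The sketch below is a simplified, synchronous stand-in for the forwarding loop and its finally block as quoted earlier. SketchFuture, CountingSink, forwardAudioSketch and runCase are hypothetical names, and the code makes no claim about how performAudioForwarding is actually structured beyond that quoted finally block.

```typescript
// Simplified stand-ins; not the real Future or audio output from @livekit/agents.
class SketchFuture {
  done = false;
  rejected = false;
  resolve(): void {
    if (!this.done) this.done = true;
  }
  reject(): void {
    if (!this.done) {
      this.done = true;
      this.rejected = true;
    }
  }
}

class CountingSink {
  frames = 0;
  clearBufferCalls = 0;
  captureFrame(): void {
    this.frames += 1;
  }
  flush(): void {}
  clearBuffer(): void {
    this.clearBufferCalls += 1;
  }
}

interface AbortFlag {
  aborted: boolean;
}

function forwardAudioSketch(ttsFrames: number, signal: AbortFlag, sink: CountingSink): SketchFuture {
  const firstFrameFut = new SketchFuture();
  try {
    for (let i = 0; i < ttsFrames; i++) {
      if (signal.aborted) break; // cancelled before this frame is forwarded
      sink.captureFrame();
      firstFrameFut.resolve(); // no-op after the first frame
    }
  } finally {
    // Same shape as the finally block quoted from voice/generation.ts.
    if (!firstFrameFut.done) firstFrameFut.reject();
    sink.flush();
    if (signal.aborted) sink.clearBuffer();
  }
  return firstFrameFut;
}

function runCase(ttsFrames: number, abortedBeforeFirstFrame: boolean) {
  const sink = new CountingSink();
  const fut = forwardAudioSketch(ttsFrames, { aborted: abortedBeforeFirstFrame }, sink);
  return { frames: sink.frames, rejected: fut.rejected, clearBuffer: sink.clearBufferCalls };
}

const flowing = runCase(3, false); // frames flow, no cancel
const cancelled = runCase(3, true); // cancelled before first frame
const emptyTts = runCase(0, false); // TTS yields no frame
```

Under these assumptions, each case yields the frame count, future state and clearBuffer count listed in the table.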

I can attach a runnable AgentSession-level reproduction if helpful.

Relationship to the Python fix

livekit/agents#5038 describes this exact failure and was fixed by livekit/agents#5039, which relocates first_frame_fut handling to the callers and avoids blanking the generated text on the pre-first-frame path. The JS port still rejects firstFrameFut in forwardAudio and gates transcript preservation on !firstFrameFut.rejected, so the same defect is present.

One porting subtlety: in the JS Future, cancel() also sets rejected = true, so a literal "cancel instead of reject" transcription of #5039 would not change the downstream !firstFrameFut.rejected check. The JS fix likely needs to preserve the generated audio/transcript on the no-first-frame path explicitly.
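The subtlety can be illustrated with a small model. MiniFuture below is a hypothetical class that copies only the one property described above (cancel() also sets rejected = true), and keepsTranscript is a hypothetical name for the downstream firstFrameFut.done && !firstFrameFut.rejected check. Neither is the library's implementation.

```typescript
// Hypothetical minimal future; models only the done/rejected flags discussed above.
class MiniFuture {
  done = false;
  rejected = false;
  error?: Error;

  resolve(): void {
    if (!this.done) this.done = true;
  }

  reject(error: Error): void {
    if (this.done) return;
    this.done = true;
    this.rejected = true;
    this.error = error;
  }

  // Assumption taken from the issue text: cancelling is a form of rejection.
  cancel(): void {
    this.reject(new Error("cancelled"));
  }
}

// Stand-in for the downstream transcript-preservation gate.
function keepsTranscript(firstFrameFut: MiniFuture): boolean {
  return firstFrameFut.done && !firstFrameFut.rejected;
}

const resolvedFut = new MiniFuture();
resolvedFut.resolve();

const rejectedFut = new MiniFuture();
rejectedFut.reject(new Error("audio forwarding cancelled before playback started"));

const cancelledFut = new MiniFuture();
cancelledFut.cancel();

const outcomes = {
  resolved: keepsTranscript(resolvedFut),
  rejected: keepsTranscript(rejectedFut),
  cancelled: keepsTranscript(cancelledFut),
};
```

In this model the gate treats the cancelled and rejected futures identically, so swapping reject() for cancel() alone would still blank the transcript on the no-first-frame path.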

Ask

Port the #5039 fix to agents-js (or confirm the intended approach), so a brief false interruption arriving before the first audio frame no longer drops the entire turn — and ideally resumes it, per resumeFalseInterruption. Happy to open a PR with a regression test if that's welcome.
