Summary
With resumeFalseInterruption: true, a brief user sound that arrives after say() is issued but before the agent's first TTS audio frame is forwarded pauses the speech, and firstFrameFut is never resolved. The user hears no audio at all, the turn is dropped from history, and the call then sits in silence.
This is the JS counterpart of livekit/agents#5038 (Python), which was fixed by livekit/agents#5039. That fix does not appear to be ported to agents-js (the relevant code path is unchanged on main).
Environment
| Field | Value |
| --- | --- |
| @livekit/agents | 1.4.6 (relevant path identical on 1.4.7 and current main) |
| Node | 22.x |
| Turn detection | streaming STT, turnDetection: "stt" |
| Interruption config | { mode: "adaptive", minWords: 2, minDuration: 500, resumeFalseInterruption: true } |
| Transport | outbound telephony (SIP); mechanism is transport-independent |
Steps to reproduce
- Outbound call. On answer, the agent speaks a fixed opener: session.say(text, { allowInterruptions: true }).
- The TTS has a non-trivial time-to-first-byte (a few hundred ms), so the first audio frame is still in flight after say() returns.
- The callee makes a brief sound ("hello?") in that pre-first-frame window.
Expected
The brief false interruption pauses the speech and then resumes and plays it once the false interruption clears (per resumeFalseInterruption), or the speech is interrupted cleanly and re-attempted.
Actual
firstFrameFut never resolves; the first audio frame arrives after the segment is torn down and is discarded; the turn is dropped from history. No audio reaches the user, and the interruption is not counted (session reports zero interruptions). minWords is irrelevant — the path that fires has no word-count gate.
Root cause (JS)
Two behaviors combine.
1. onStartOfSpeech pauses a not-yet-playing speech, ungated. In voice/agent_activity.ts, onStartOfSpeech pauses the current speech as soon as user VAD-start fires, guarded only by:
agentState !== "speaking" // agent's first audio frame hasn't played yet
&& pauseEnabled()
&& _currentSpeech.allowInterruptions // the opener is interruptible
There is no minWords and no duration check on this branch, so any user sound in the pre-first-frame window pauses the speech.
2. The paused speech never gets a resolved firstFrameFut, so the audio and transcript are dropped. In voice/generation.ts, the finally block of forwardAudio rejects the future when no frame was forwarded:
if (!out.firstFrameFut.done) {
out.firstFrameFut.reject(new Error("audio forwarding cancelled before playback started"));
}
audioOutput.flush();
if (signal?.aborted) audioOutput.clearBuffer();
Downstream, the reply task only preserves the synchronized transcript when firstFrameFut.done && !firstFrameFut.rejected, so on the rejected (no-first-frame) path the transcript is blanked and the audio is discarded — the JS analogue of the else: forwarded_text = "" overwrite called out in livekit/agents#5038.
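The two behaviors can be modeled in isolation. The sketch below is a simplified stand-in, not the agents-js source: SpeechState, InterruptionOptions, FirstFrameState, shouldPauseOnStartOfSpeech and transcriptToCommit are hypothetical names, and only the two conditions quoted above are taken from this report.

```typescript
// Simplified model of the two behaviors described above.
// All type and function names here are hypothetical, not agents-js APIs.

interface SpeechState {
  agentState: string; // e.g. "speaking" once the first frame has played
  pauseEnabled: boolean; // stands in for pauseEnabled()
  allowInterruptions: boolean; // stands in for _currentSpeech.allowInterruptions
}

interface InterruptionOptions {
  minWords: number;
  minDuration: number;
}

// Behavior 1: the guard quoted above. The options are accepted but never
// read, mirroring the missing minWords/duration gate on this branch.
function shouldPauseOnStartOfSpeech(s: SpeechState, _opts: InterruptionOptions): boolean {
  return s.agentState !== "speaking" && s.pauseEnabled && s.allowInterruptions;
}

interface FirstFrameState {
  done: boolean;
  rejected: boolean;
}

// Behavior 2: the transcript survives only when the first frame resolved.
function transcriptToCommit(f: FirstFrameState, synchronizedText: string): string {
  return f.done && !f.rejected ? synchronizedText : "";
}

// An interruptible opener whose first frame has not played yet.
const opener: SpeechState = {
  agentState: "listening",
  pauseEnabled: true,
  allowInterruptions: true,
};
const pausedBeforeFirstFrame = shouldPauseOnStartOfSpeech(opener, {
  minWords: 2,
  minDuration: 500,
});

const openerText = "Hi, this is the opener.";
const committedAfterReject = transcriptToCommit({ done: true, rejected: true }, openerText);
const committedAfterResolve = transcriptToCommit({ done: true, rejected: false }, openerText);
```

In this model pausedBeforeFirstFrame is true regardless of minWords, and committedAfterReject is the empty string while committedAfterResolve keeps the opener text.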
Observable log signature on the dropped turn:
SegmentSynchronizerImpl.markPlaybackFinished called before text/audio input is done
SegmentSynchronizerImpl.onPlaybackStarted called after close
playback_finished called more times than playback segments were captured
Minimal reproduction (mechanism)
The destruction half is deterministic against the real performAudioForwarding: when the forwarding signal is aborted before the first frame, zero frames reach the sink and firstFrameFut rejects.
| Case | Frames to output | firstFrameFut | clearBuffer |
| --- | --- | --- | --- |
| Frames flow, no cancel | 3 | resolved | 0 |
| Cancelled before first frame | 0 (silent) | rejected | 1 |
| TTS yields no frame | 0 (silent) | rejected | 0 |
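The three cases can be reproduced against a small model of the forwarding loop. This is a sketch under stated assumptions, not the real performAudioForwarding: SimpleFuture, CountingSink, CancelSignal and forwardFrames are hypothetical stand-ins, and only the finally block follows the snippet quoted in the root-cause section. The model is synchronous, so cancellation is represented by a signal that is already aborted when forwarding starts.

```typescript
// Hypothetical synchronous model of the forwarding loop; not agents-js code.

class SimpleFuture {
  done = false;
  rejected = false;
  resolve(): void {
    if (!this.done) this.done = true;
  }
  reject(_err: Error): void {
    if (!this.done) {
      this.done = true;
      this.rejected = true;
    }
  }
}

// Counts what reaches the audio output instead of playing it.
class CountingSink {
  frames = 0;
  flushes = 0;
  clears = 0;
  captureFrame(_frame: Uint8Array): void {
    this.frames += 1;
  }
  flush(): void {
    this.flushes += 1;
  }
  clearBuffer(): void {
    this.clears += 1;
  }
}

interface CancelSignal {
  aborted: boolean;
}

function forwardFrames(ttsFrames: Uint8Array[], sink: CountingSink, signal: CancelSignal) {
  const firstFrameFut = new SimpleFuture();
  try {
    for (const frame of ttsFrames) {
      if (signal.aborted) break; // cancelled: stop before forwarding this frame
      sink.captureFrame(frame);
      firstFrameFut.resolve(); // no-op after the first frame
    }
  } finally {
    // Mirrors the finally block quoted in the root-cause section.
    if (!firstFrameFut.done) {
      firstFrameFut.reject(new Error("audio forwarding cancelled before playback started"));
    }
    sink.flush();
    if (signal.aborted) sink.clearBuffer();
  }
  return { frames: sink.frames, rejected: firstFrameFut.rejected, clears: sink.clears };
}

const threeFrames = [new Uint8Array(2), new Uint8Array(2), new Uint8Array(2)];
const flowCase = forwardFrames(threeFrames, new CountingSink(), { aborted: false });
const cancelledCase = forwardFrames(threeFrames, new CountingSink(), { aborted: true });
const emptyCase = forwardFrames([], new CountingSink(), { aborted: false });
```

Each result object corresponds to one row of the table: flowCase forwards three frames and resolves, cancelledCase forwards none, rejects and clears the buffer once, and emptyCase forwards none and rejects without clearing.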
I can attach a runnable AgentSession-level reproduction if helpful.
Relationship to the Python fix
livekit/agents#5038 describes this exact failure and was fixed by livekit/agents#5039, which relocates first_frame_fut handling to the callers and avoids blanking the generated text on the pre-first-frame path. The JS port still rejects firstFrameFut in forwardAudio and gates transcript preservation on !firstFrameFut.rejected, so the same defect is present.
One porting subtlety: in the JS Future, cancel() also sets rejected = true, so a literal "cancel instead of reject" transcription of #5039 would not change the downstream !firstFrameFut.rejected check. The JS fix likely needs to preserve the generated audio/transcript on the no-first-frame path explicitly.
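The subtlety can be illustrated with a small stand-in future. ModelFuture and keepsTranscript below are hypothetical and written for this report, not the agents-js Future; the only behavior taken from the description above is that cancel() also sets rejected = true.

```typescript
// Hypothetical stand-in for a future whose cancel() marks it as rejected,
// as described above. This is not the agents-js Future implementation.
class ModelFuture {
  done = false;
  rejected = false;
  reject(_err: Error): void {
    if (this.done) return;
    this.done = true;
    this.rejected = true;
  }
  cancel(): void {
    if (this.done) return;
    this.done = true;
    this.rejected = true; // indistinguishable from reject() for the gate below
  }
}

// The downstream gate described in this report.
function keepsTranscript(f: ModelFuture): boolean {
  return f.done && !f.rejected;
}

const rejectedFut = new ModelFuture();
rejectedFut.reject(new Error("audio forwarding cancelled before playback started"));

const cancelledFut = new ModelFuture();
cancelledFut.cancel();

// Both paths fail the gate, so swapping reject() for cancel() changes nothing.
const gateAfterReject = keepsTranscript(rejectedFut);
const gateAfterCancel = keepsTranscript(cancelledFut);
```

Both gateAfterReject and gateAfterCancel are false in this model, which is why the no-first-frame path would need to be handled by something other than the rejected flag.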
Ask
Port the #5039 fix to agents-js (or confirm the intended approach), so a brief false interruption arriving before the first audio frame no longer drops the entire turn — and ideally resumes it, per resumeFalseInterruption. Happy to open a PR with a regression test if that's welcome.