feat(core): add user transcription timeout#6182

Open
chenghao-mou wants to merge 3 commits into main from chenghao/feat/stt-transcription-timeout-AGT-3024

@chenghao-mou

Member
  • Add a new event that fires when VAD detects speech but STT fails to produce any transcript within the timeout.
from livekit.agents import AgentSession, UserTranscriptionTimeoutEvent

session = AgentSession(
    stt=...,
    llm=...,
    tts=...,
    # VAD heard speech but no transcript landed within 5s of the user stopping.
    # 5.0 is the default; set to None to disable.
    transcription_timeout=5.0,
)

@session.on("user_transcription_timeout")
def _on_transcription_timeout(ev: UserTranscriptionTimeoutEvent) -> None:
    # ev.speech_duration   -> total VAD speech (s) this turn that produced no transcript
    # ev.vad_speech_started_at -> when the user first started speaking (epoch seconds)

    # ignore very short blips (coughs, door slams) that VAD picks up as "speech"
    if ev.speech_duration < 0.7:
        return

    session.generate_reply(
        instructions="Tell the user you didn't catch that and ask them to repeat it.",
    )

@chenghao-mou requested a review from a team as a code owner June 22, 2026 13:36

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 2 potential issues.

🚩 New event not forwarded in SessionHost remote transport

The SessionHost in remote_session.py registers handlers for specific events (lines 371-379) and forwards them over the transport. The new user_transcription_timeout event is not registered or forwarded. This means remote sessions won't receive this event. This is likely acceptable as a first iteration (not all events need remote transport support immediately), but it's an inconsistency with the event being in EventTypes and AgentEvent.

(Refers to lines 366-379)

Comment on lines +1801 to +1811
def _on_transcription_timeout(self) -> None:
    self._transcription_timeout_handle = None
    if self._user_turn_start is None or self._turn_transcript_received:
        return

    if self._agent_speaking:
        return

    self._hooks.on_transcription_timeout(
        speech_duration=self._turn_speech_duration, turn_start=self._user_turn_start
    )

🚩 Timer not cancelled when agent starts speaking — relies on lazy guard

When _arm_transcription_timeout schedules the timer and the agent subsequently starts speaking (via on_start_of_agent_speech at audio_recognition.py:385), the timer handle is NOT cancelled. Instead, _on_transcription_timeout at line 1806 checks self._agent_speaking and bails out if True. This is a valid lazy-check pattern but has a subtle edge case: if the agent speaks briefly (e.g. a backchannel like "hmm") and finishes before the timer fires, the timeout event will still fire even though the agent responded. This may be intentional (the user's speech was still never transcribed), but could surprise users who expect any agent speech to suppress the event.
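
The edge case described above can be reproduced with a small standalone sketch. This is not the PR's code: `ToyTimeoutTimer`, `run_scenario`, and the `cancel_on_agent_speech` flag are hypothetical names used only to contrast the lazy guard with an eager cancel, and the delays are arbitrary.

```python
import asyncio
from typing import Optional


class ToyTimeoutTimer:
    """Hypothetical stand-in for the recognizer's timer logic (illustration only)."""

    def __init__(self, cancel_on_agent_speech: bool) -> None:
        self.cancel_on_agent_speech = cancel_on_agent_speech
        self.agent_speaking = False
        self.fired = False
        self._handle: Optional[asyncio.TimerHandle] = None

    def arm(self, delay: float) -> None:
        # Schedule the timeout callback on the running event loop.
        self._handle = asyncio.get_running_loop().call_later(delay, self._on_timeout)

    def on_start_of_agent_speech(self) -> None:
        self.agent_speaking = True
        # Eager variant: drop the pending timer as soon as the agent speaks.
        if self.cancel_on_agent_speech and self._handle is not None:
            self._handle.cancel()
            self._handle = None

    def on_end_of_agent_speech(self) -> None:
        self.agent_speaking = False

    def _on_timeout(self) -> None:
        self._handle = None
        # Lazy guard: only suppresses the event if the agent is speaking right now.
        if self.agent_speaking:
            return
        self.fired = True


async def run_scenario(cancel_on_agent_speech: bool) -> bool:
    timer = ToyTimeoutTimer(cancel_on_agent_speech)
    timer.arm(0.05)
    # The agent emits a short backchannel that ends before the timer fires.
    timer.on_start_of_agent_speech()
    await asyncio.sleep(0.01)
    timer.on_end_of_agent_speech()
    await asyncio.sleep(0.1)
    return timer.fired
```

With the lazy guard (`run_scenario(False)`) the event still fires after the backchannel ends; with the eager cancel (`run_scenario(True)`) it does not. Which behavior is preferable depends on whether a brief agent utterance should count as having handled the untranscribed speech.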

Member Author

Since there is no user content, we should still emit the signal so the agent can check.

self._user_speaking_event.clear()
self._last_speaking_time = time.time() - ev.silence_duration - ev.inference_duration

self._arm_transcription_timeout(ev.speech_duration)

Contributor

maybe skip this when stt is not set?
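
One possible shape for that guard, as a rough sketch only: `maybe_arm_transcription_timeout` and its parameters are hypothetical and do not appear in the PR. The idea is simply to skip arming when no transcript could ever arrive (no STT configured) or when the timeout is disabled.

```python
from typing import Callable, Optional


def maybe_arm_transcription_timeout(
    stt: Optional[object],
    transcription_timeout: Optional[float],
    arm: Callable[[float], None],
    speech_duration: float,
) -> bool:
    """Arm the timer only when STT is configured and the timeout is enabled.

    Returns True if the timer was armed, False if arming was skipped.
    """
    if stt is None or transcription_timeout is None:
        return False
    arm(speech_duration)
    return True
```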

@davidzhao

Member

should we have a default handler with the above implementation? any downsides of having this work automatically?

@longcw

longcw commented Jun 24, 2026

Contributor

> should we have a default handler with the above implementation? any downsides of having this work automatically?

I am wondering how many false alarms it may have; it can be risky if there is some background noise.
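
If a default handler were added, one way to limit false alarms might be to gate it on a minimum speech duration and a cooldown between prompts. The sketch below is an assumption, not part of the PR: `TimeoutGate`, its fields, and the 10-second cooldown are hypothetical; the 0.7-second threshold is borrowed from the example in the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutGate:
    """Hypothetical filter deciding whether a timeout event deserves a reply."""

    min_speech_duration: float = 0.7  # seconds; threshold from the PR description example
    cooldown: float = 10.0  # seconds between accepted events; arbitrary choice
    _last_accepted_at: Optional[float] = None

    def should_respond(self, speech_duration: float, now: float) -> bool:
        # Short blips (coughs, door slams) are treated as noise.
        if speech_duration < self.min_speech_duration:
            return False
        # Avoid prompting the user repeatedly in a noisy room.
        if self._last_accepted_at is not None and now - self._last_accepted_at < self.cooldown:
            return False
        self._last_accepted_at = now
        return True
```

A handler could call `gate.should_respond(ev.speech_duration, time.time())` before `session.generate_reply(...)`. Sustained background noise that VAD classifies as long speech would still pass this gate, so it narrows the concern rather than removing it.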
