feat(core): add user transcription timeout #6182
chenghao-mou
commented
Jun 22, 2026
- Add a new event that is emitted when VAD detects speech but STT fails to produce any transcript within the timeout.
🚩 New event not forwarded in SessionHost remote transport
The SessionHost in remote_session.py registers handlers for specific events (lines 371-379) and forwards them over the transport. The new user_transcription_timeout event is not registered or forwarded. This means remote sessions won't receive this event. This is likely acceptable as a first iteration (not all events need remote transport support immediately), but it's an inconsistency with the event being in EventTypes and AgentEvent.
(Refers to lines 366-379)
def _on_transcription_timeout(self) -> None:
    self._transcription_timeout_handle = None
    if self._user_turn_start is None or self._turn_transcript_received:
        return

    if self._agent_speaking:
        return

    self._hooks.on_transcription_timeout(
        speech_duration=self._turn_speech_duration, turn_start=self._user_turn_start
    )
🚩 Timer not cancelled when agent starts speaking — relies on lazy guard
When _arm_transcription_timeout schedules the timer and the agent subsequently starts speaking (via on_start_of_agent_speech at audio_recognition.py:385), the timer handle is NOT cancelled. Instead, _on_transcription_timeout at line 1806 checks self._agent_speaking and bails out if True. This is a valid lazy-check pattern but has a subtle edge case: if the agent speaks briefly (e.g. a backchannel like "hmm") and finishes before the timer fires, the timeout event will still fire even though the agent responded. This may be intentional (the user's speech was still never transcribed), but could surprise users who expect any agent speech to suppress the event.
Since there is no user content, we should still emit the signal so the agent can check.
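To make the edge case in this thread concrete, here is a minimal, self-contained sketch of the lazy-guard pattern. It is a toy model, not the code in audio_recognition.py: the class `TranscriptionTimeoutSketch`, the helper `run_scenario`, and the timing values are all hypothetical. It only shows that a timer which is never cancelled by agent speech still fires after a short backchannel, and is suppressed only when the agent is speaking at the moment the timer expires.

```python
from __future__ import annotations

import asyncio


class TranscriptionTimeoutSketch:
    """Toy model of a lazily guarded timer (hypothetical, for illustration)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.agent_speaking = False
        self.transcript_received = False
        self.fired = False
        self._handle: asyncio.TimerHandle | None = None

    def arm(self) -> None:
        # Replace any pending timer, then schedule the check.
        # Agent speech never cancels this handle.
        if self._handle is not None:
            self._handle.cancel()
        loop = asyncio.get_running_loop()
        self._handle = loop.call_later(self.timeout, self._on_timeout)

    def _on_timeout(self) -> None:
        self._handle = None
        # Lazy guard: only the state at firing time is inspected.
        if self.transcript_received or self.agent_speaking:
            return
        self.fired = True


async def run_scenario(
    agent_speech: tuple[float, float] | None, timeout: float = 0.1
) -> bool:
    """Run one turn; agent_speech is (start, end) in seconds after arming."""
    sketch = TranscriptionTimeoutSketch(timeout)
    sketch.arm()
    if agent_speech is not None:
        start, end = agent_speech
        await asyncio.sleep(start)
        sketch.agent_speaking = True
        await asyncio.sleep(end - start)
        sketch.agent_speaking = False
    # Wait well past the deadline so the timer has had its chance to fire.
    await asyncio.sleep(timeout * 3)
    return sketch.fired


if __name__ == "__main__":
    print("backchannel:", asyncio.run(run_scenario((0.01, 0.03))))
    print("long agent turn:", asyncio.run(run_scenario((0.01, 0.3))))
```

In this model the backchannel scenario reports a timeout and the long agent turn does not, which matches the behavior described in the review comment and the reply that the signal should still be emitted when there is no user content.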
self._user_speaking_event.clear()
self._last_speaking_time = time.time() - ev.silence_duration - ev.inference_duration

self._arm_transcription_timeout(ev.speech_duration)
maybe skip this when stt is not set?
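One possible shape for that guard, as a sketch only: the function name `should_arm_transcription_timeout` and the idea that the timeout can be `None` or non-positive to mean "disabled" are assumptions, not something this PR confirms.

```python
from __future__ import annotations

from typing import Optional


def should_arm_transcription_timeout(
    stt: Optional[object], timeout: Optional[float]
) -> bool:
    """Decide whether arming the timer makes sense (hypothetical helper)."""
    # Without an STT instance no transcript can ever arrive, so a
    # "transcript missing" timeout would fire on every user turn.
    if stt is None:
        return False
    # Assumption: None or a non-positive value disables the feature.
    return timeout is not None and timeout > 0
```

The call site would then arm the timer only when this returns True.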
Should we have a default handler with the above implementation? Any downsides to having this work automatically?
I am wondering how many false alarms it may produce; it can be risky if there is background noise.
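One way a default handler could limit false alarms is to ignore timeouts for very short speech segments, since the hook already receives `speech_duration`. The sketch below is an illustration under assumptions: the `TranscriptionTimeoutEvent` dataclass, the `make_timeout_handler` factory, and the 0.5 second threshold are hypothetical and not part of this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class TranscriptionTimeoutEvent:
    """Hypothetical payload mirroring the hook arguments in the diff."""

    speech_duration: float
    turn_start: float


def make_timeout_handler(
    on_confirmed: Callable[[TranscriptionTimeoutEvent], None],
    min_speech_duration: float = 0.5,  # assumed threshold, tune per deployment
) -> Callable[[TranscriptionTimeoutEvent], bool]:
    """Wrap a callback so that short, noise-like bursts are ignored."""

    def handler(ev: TranscriptionTimeoutEvent) -> bool:
        # Short VAD activations are more likely to be background noise.
        if ev.speech_duration < min_speech_duration:
            return False
        on_confirmed(ev)
        return True

    return handler
```

A threshold like this trades missed events on genuinely short utterances for fewer spurious ones, so whether it belongs in a default handler depends on how noisy the target environments are.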