Bug Description
Problem
When using turn_detection="stt" (e.g. with Deepgram Flux), transcription_delay is always ~0 because the END_OF_SPEECH handler overwrites _last_speaking_time:

# audio_recognition.py:446-452
elif ev.type == SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    self._speaking = False
    self._user_turn_committed = True
    self._last_speaking_time = time.time()  # overwrites the previous value

The calculation then becomes:

# line 594
transcription_delay = last_final_transcript_time - last_speaking_time
# Both are ~time.time(), so the result is ~0

Also, if a VAD is passed alongside STT turn detection, its timing is discarded by this overwrite. That is inconsistent with the normal turn detection modes (the "en" or multilingual model), where VAD timing is what drives the latency calculation.
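One way to avoid the overwrite would be to fall back to the STT event's wall-clock time only when no VAD timing is available. A minimal sketch of that idea (the self._vad check below is an assumption about the recognizer's internals, not the actual attribute):

# Sketch only, not the library's code: keep a VAD-provided _last_speaking_time
# when a VAD is configured, and only stamp wall-clock time as a fallback.
elif ev.type == SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    self._speaking = False
    self._user_turn_committed = True
    if self._vad is None:  # hypothetical guard; assumes the recognizer knows whether a VAD is attached
        self._last_speaking_time = time.time()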
Broader question
Why does the transcription_delay calculation rely on VAD events and wall-clock time.time() at all?
SpeechData already has audio timestamps:
- end_time - when speech ended in the audio stream
- words[].end_time - per-word timing
- start_time_offset - audio stream reference
These tell us exactly when speech occurred in the audio. The delay calculation
could be:
transcription_delay = now - (audio_stream_start + speech_end_in_audio)
Note: The data is already there in SpeechData, just not being used.
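For illustration, a minimal sketch of that calculation, assuming the final SpeechData for the turn and the wall-clock time at which the audio stream started are available where the metric is computed (the helper name and the audio_stream_start parameter are hypothetical, not existing APIs):

import time

def transcription_delay_from_speech_data(speech_data, audio_stream_start: float) -> float:
    # Hypothetical helper, not part of livekit-agents: derive the delay from the
    # transcript's own audio timestamps instead of VAD events and wall-clock bookkeeping.
    # audio_stream_start is the wall-clock time corresponding to offset 0 of the audio stream.
    if getattr(speech_data, "words", None):
        speech_end_in_audio = speech_data.words[-1].end_time  # per-word timing, most precise
    else:
        speech_end_in_audio = speech_data.end_time  # utterance-level end time
    # "now" minus the wall-clock moment the speech actually ended in the audio
    return time.time() - (audio_stream_start + speech_end_in_audio)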
Expected Behavior
Ideally, transcription_delay should be calculated accurately from the STT's speech data when STT-based turn detection is used.
Reproduction Steps
Use Deepgram Flux with the turn detection mode set to stt:
@server.rtc_session
async def my_agent(ctx: JobContext):
    # Logging setup
    # Add any other context you want in all log entries here
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    stt = deepgram.STTv2()
    tts = elevenlabs.TTS(
        model="eleven_turbo_v2_5", voice_id="jqcCZkN6Knx8BJ5TBdYR", auto_mode=True
    )
    llm = anthropic.LLM(model="claude-sonnet-4-5", caching="ephemeral")
    # Set up a voice AI pipeline using Deepgram, Anthropic, ElevenLabs, and the LiveKit turn detector
    session = AgentSession(
        # Speech-to-text (STT) is your agent's ears, turning the user's speech into text that the LLM can understand
        # See all available models at https://docs.livekit.io/agents/models/stt/
        stt=stt,
        # A Large Language Model (LLM) is your agent's brain, processing user input and generating a response
        # See all available models at https://docs.livekit.io/agents/models/llm/
        llm=llm,
        # Text-to-speech (TTS) is your agent's voice, turning the LLM's text into speech that the user can hear
        # See all available models as well as voice selections at https://docs.livekit.io/agents/models/tts/
        tts=tts,
        # VAD and turn detection are used to determine when the user is speaking and when the agent should respond
        # See more at https://docs.livekit.io/agents/build/turns
        turn_detection=EnglishModel(),
        vad=ctx.proc.userdata["vad"],
        # Allow the LLM to generate a response while waiting for the end of turn
        # See more at https://docs.livekit.io/agents/build/audio/#preemptive-generation
        preemptive_generation=True,
    )

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=lambda params: noise_cancellation.BVCTelephony()
                if params.participant.kind == rtc.ParticipantKind.PARTICIPANT_KIND_SIP
                else noise_cancellation.BVC(),
            ),
        ),
        record=True,
    )

    # Join the room and connect to the user
    await ctx.connect()

if __name__ == "__main__":
    cli.run_app(server)

You will notice that the EOU metrics report 0 as transcription_delay:
EOU metrics {"model_name": "en", "model_provider": "livekit", "end_of_utterance_delay": 0.81, "transcription_delay": 0.79}
Operating System
macOS
Models Used
Deepgram Flux
Package Versions
livekit==1.0.23
livekit-agents==1.3.10
livekit-api==1.1.0
livekit-blingfire==1.1.0
livekit-plugins-anthropic==1.3.10
livekit-plugins-cartesia==1.3.10
livekit-plugins-deepgram==1.3.10
livekit-plugins-elevenlabs==1.3.10
livekit-plugins-google==1.3.10
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.3.10
livekit-plugins-silero==1.3.10
livekit-plugins-turn-detector==1.3.10
livekit-protocol==1.1.1
Session/Room/Call IDs
No response
Proposed Solution
Use the speech data, or the VAD if provided, as described above (ideally the speech data, since it more accurately represents how much audio the transcription provider has processed so far).
Additional Context
No response
Screenshots and Recordings
No response