Incorrect transcription_delay when using STT turn detection mode #4388

@zaheerabbas-prodigal

Bug Description

Problem

When using turn_detection="stt" (e.g. with Deepgram Flux), transcription_delay
is always ~0 because END_OF_SPEECH overwrites _last_speaking_time:

# audio_recognition.py:446-452
elif ev.type == SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    self._speaking = False
    self._user_turn_committed = True
    self._last_speaking_time = time.time()  # overwrites previous value

Then the calculation becomes:

  # line 594
  transcription_delay = last_final_transcript_time - last_speaking_time
  # Both are ~time.time(), so result is ~0

Also, if a VAD is passed alongside STT turn detection, its timing is discarded by this overwrite. That is inconsistent with the normal turn detection modes (the en or multilingual model), where the VAD timing is what is used to calculate latency. A possible guard is sketched below.
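
For illustration, here is a minimal sketch of how the END_OF_SPEECH branch could avoid clobbering a VAD-derived timestamp; the self._vad check is an assumption for the sketch, and the actual attribute in audio_recognition.py may differ:

# sketch only, not the current implementation
elif ev.type == SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    self._speaking = False
    self._user_turn_committed = True
    if self._vad is None:  # assumed attribute: set when a VAD is attached
        # fall back to wall-clock time only when there is no VAD;
        # otherwise keep the VAD-provided _last_speaking_time
        self._last_speaking_time = time.time()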

Broader question

Why does transcription_delay rely on VAD events and wall-clock time.time() for the delay calculation?

SpeechData already has audio timestamps:

  • end_time - when speech ended in the audio stream
  • words[].end_time - per-word timing
  • start_time_offset - audio stream reference

These tell us exactly when speech occurred in the audio. The delay calculation
could be:

transcription_delay = now - (audio_stream_start + speech_end_in_audio)

Note: The data is already there in SpeechData, just not being used.
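
A minimal sketch of that calculation, assuming the recognizer also tracks the wall-clock time at which the audio stream started (audio_stream_start_wall_time is an illustrative name, not an existing field):

import time

def transcription_delay_from_speech_data(
    audio_stream_start_wall_time: float,  # wall-clock time when the audio stream began (assumed to be tracked)
    speech_end_in_audio: float,  # SpeechData.end_time or words[-1].end_time, in seconds into the stream
) -> float:
    # when the user actually stopped speaking, in wall-clock terms
    speech_end_wall_time = audio_stream_start_wall_time + speech_end_in_audio
    # delay between the real end of speech and the arrival of the final transcript (now)
    return time.time() - speech_end_wall_time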

Expected Behavior

Ideally, the transcription delay would be calculated accurately from the speech data when STT-based turn detection is used.

Reproduction Steps

Use Deepgram Flux with the turn detection mode set to stt:

@server.rtc_session
async def my_agent(ctx: JobContext):
    # Logging setup
    # Add any other context you want in all log entries here
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }
    stt = deepgram.STTv2()
    tts = elevenlabs.TTS(
        model="eleven_turbo_v2_5", voice_id="jqcCZkN6Knx8BJ5TBdYR", auto_mode=True
    )
    llm = anthropic.LLM(model="claude-sonnet-4-5", caching="ephemeral")

    # Set up a voice AI pipeline using Deepgram, Anthropic, ElevenLabs, and the LiveKit turn detector
    session = AgentSession(
        # Speech-to-text (STT) is your agent's ears, turning the user's speech into text that the LLM can understand
        # See all available models at https://docs.livekit.io/agents/models/stt/
        stt=stt,
        # A Large Language Model (LLM) is your agent's brain, processing user input and generating a response
        # See all available models at https://docs.livekit.io/agents/models/llm/
        llm=llm,
        # Text-to-speech (TTS) is your agent's voice, turning the LLM's text into speech that the user can hear
        # See all available models as well as voice selections at https://docs.livekit.io/agents/models/tts/
        tts=tts,
        # VAD and turn detection are used to determine when the user is speaking and when the agent should respond
        # See more at https://docs.livekit.io/agents/build/turns
        turn_detection=EnglishModel(),
        vad=ctx.proc.userdata["vad"],
        # allow the LLM to generate a response while waiting for the end of turn
        # See more at https://docs.livekit.io/agents/build/audio/#preemptive-generation
        preemptive_generation=True,
    )

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=lambda params: noise_cancellation.BVCTelephony()
                if params.participant.kind == rtc.ParticipantKind.PARTICIPANT_KIND_SIP
                else noise_cancellation.BVC(),
            ),
        ),
        record=True
    )

    # Join the room and connect to the user
    await ctx.connect()

if __name__ == "__main__":
    cli.run_app(server)

You will notice that the EOU metrics report 0 as the transcription_delay.

EOU metrics {"model_name": "en", "model_provider": "livekit", "end_of_utterance_delay": 0.81, "transcription_delay": 0.79}

Operating System

macOS

Models Used

Deepgram Flux

Package Versions

livekit==1.0.23
livekit-agents==1.3.10
livekit-api==1.1.0
livekit-blingfire==1.1.0
livekit-plugins-anthropic==1.3.10
livekit-plugins-cartesia==1.3.10
livekit-plugins-deepgram==1.3.10
livekit-plugins-elevenlabs==1.3.10
livekit-plugins-google==1.3.10
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.3.10
livekit-plugins-silero==1.3.10
livekit-plugins-turn-detector==1.3.10
livekit-protocol==1.1.1

Session/Room/Call IDs

No response

Proposed Solution

Use the SpeechData timestamps, or the VAD timing if one is provided, as described above (ideally the speech data, since it is a more accurate representation of how much audio the transcription provider has processed so far). A sketch of that preference order follows.
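
A rough sketch of the preference order; the names below are illustrative, not the actual internals:

def resolve_last_speaking_time(
    speech_data_end_wall_time: float | None,  # derived from SpeechData audio timestamps, if available
    vad_last_speaking_time: float | None,  # recorded from the VAD, if one is attached
    stt_event_wall_time: float,  # wall-clock time of the STT END_OF_SPEECH event (current behavior)
) -> float:
    if speech_data_end_wall_time is not None:
        # most accurate: tied to how much audio the STT has actually processed
        return speech_data_end_wall_time
    if vad_last_speaking_time is not None:
        # consistent with the non-STT turn detection modes
        return vad_last_speaking_time
    # current fallback, which makes transcription_delay ~0
    return stt_event_wall_time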

Additional Context

No response

Screenshots and Recordings

No response
