Incorrect transcription_delay when using STT turn detection mode #4388

@zaheerabbas-prodigal

Bug Description

Problem

When using turn_detection="stt" (e.g. with Deepgram Flux), transcription_delay
is always ~0 because END_OF_SPEECH overwrites _last_speaking_time:

# audio_recognition.py:446-452
elif ev.type == SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    self._speaking = False
    self._user_turn_committed = True
    self._last_speaking_time = time.time()  # overwrites previous value

Then the calculation becomes:

  # line 594
  transcription_delay = last_final_transcript_time - last_speaking_time
  # Both are ~time.time(), so result is ~0

Also, if a VAD is passed alongside STT turn detection, its timing is discarded by this overwrite. That is inconsistent with the normal turn detection modes (the en or multilingual model), where the VAD timing is what is used to calculate latency. A possible guard is sketched below.
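
For illustration, here is a minimal sketch of how the END_OF_SPEECH branch could avoid clobbering a VAD-derived timestamp; the self._vad check is an assumption for the sketch, and the actual attribute in audio_recognition.py may differ:

# sketch only, not the current implementation
elif ev.type == SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    self._speaking = False
    self._user_turn_committed = True
    if self._vad is None:  # assumed attribute: set when a VAD is attached
        # fall back to wall-clock time only when there is no VAD;
        # otherwise keep the VAD-provided _last_speaking_time
        self._last_speaking_time = time.time()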

Broader question

Why does transcription_delay rely on VAD events and wall-clock time.time() for the delay calculation?

SpeechData already has audio timestamps:

  • end_time - when speech ended in the audio stream
  • words[].end_time - per-word timing
  • start_time_offset - audio stream reference

These tell us exactly when speech occurred in the audio. The delay calculation
could be:

transcription_delay = now - (audio_stream_start + speech_end_in_audio)

Note: The data is already there in SpeechData, just not being used.
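
A minimal sketch of that calculation, assuming the recognizer also tracks the wall-clock time at which the audio stream started (audio_stream_start_wall_time is an illustrative name, not an existing field):

import time

def transcription_delay_from_speech_data(
    audio_stream_start_wall_time: float,  # wall-clock time when the audio stream began (assumed to be tracked)
    speech_end_in_audio: float,  # SpeechData.end_time or words[-1].end_time, in seconds into the stream
) -> float:
    # when the user actually stopped speaking, in wall-clock terms
    speech_end_wall_time = audio_stream_start_wall_time + speech_end_in_audio
    # delay between the real end of speech and the arrival of the final transcript (now)
    return time.time() - speech_end_wall_time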

Expected Behavior

Ideally, the transcription delay would be calculated accurately from the speech data when STT-based turn detection is used.

Reproduction Steps

Use Deepgram Flux with the turn detection mode set to stt:

@server.rtc_session
async def my_agent(ctx: JobContext):
    # Logging setup
    # Add any other context you want in all log entries here
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }
    stt = deepgram.STTv2()
    tts = elevenlabs.TTS(
        model="eleven_turbo_v2_5", voice_id="jqcCZkN6Knx8BJ5TBdYR", auto_mode=True
    )
    llm = anthropic.LLM(model="claude-sonnet-4-5", caching="ephemeral")

    # Set up a voice AI pipeline using Deepgram, Anthropic, ElevenLabs, and the LiveKit turn detector
    session = AgentSession(
        # Speech-to-text (STT) is your agent's ears, turning the user's speech into text that the LLM can understand
        # See all available models at https://docs.livekit.io/agents/models/stt/
        stt=stt,
        # A Large Language Model (LLM) is your agent's brain, processing user input and generating a response
        # See all available models at https://docs.livekit.io/agents/models/llm/
        llm=llm,
        # Text-to-speech (TTS) is your agent's voice, turning the LLM's text into speech that the user can hear
        # See all available models as well as voice selections at https://docs.livekit.io/agents/models/tts/
        tts=tts,
        # VAD and turn detection are used to determine when the user is speaking and when the agent should respond
        # See more at https://docs.livekit.io/agents/build/turns
        turn_detection=EnglishModel(),
        vad=ctx.proc.userdata["vad"],
        # allow the LLM to generate a response while waiting for the end of turn
        # See more at https://docs.livekit.io/agents/build/audio/#preemptive-generation
        preemptive_generation=True,
    )

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=lambda params: noise_cancellation.BVCTelephony()
                if params.participant.kind == rtc.ParticipantKind.PARTICIPANT_KIND_SIP
                else noise_cancellation.BVC(),
            ),
        ),
        record=True
    )

    # Join the room and connect to the user
    await ctx.connect()

if __name__ == "__main__":
    cli.run_app(server)

You will notice that the EOU metrics report 0 as the transcription_delay.

EOU metrics {"model_name": "en", "model_provider": "livekit", "end_of_utterance_delay": 0.81, "transcription_delay": 0.79}

Operating System

macOS

Models Used

Deepgram Flux

Package Versions

livekit==1.0.23
livekit-agents==1.3.10
livekit-api==1.1.0
livekit-blingfire==1.1.0
livekit-plugins-anthropic==1.3.10
livekit-plugins-cartesia==1.3.10
livekit-plugins-deepgram==1.3.10
livekit-plugins-elevenlabs==1.3.10
livekit-plugins-google==1.3.10
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.3.10
livekit-plugins-silero==1.3.10
livekit-plugins-turn-detector==1.3.10
livekit-protocol==1.1.1

Session/Room/Call IDs

No response

Proposed Solution

Use the SpeechData timestamps, or the VAD timing if one is provided, as described above (ideally the speech data, since it is a more accurate representation of how much audio the transcription provider has processed so far). A sketch of that preference order follows.
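
A rough sketch of the preference order; the names below are illustrative, not the actual internals:

def resolve_last_speaking_time(
    speech_data_end_wall_time: float | None,  # derived from SpeechData audio timestamps, if available
    vad_last_speaking_time: float | None,  # recorded from the VAD, if one is attached
    stt_event_wall_time: float,  # wall-clock time of the STT END_OF_SPEECH event (current behavior)
) -> float:
    if speech_data_end_wall_time is not None:
        # most accurate: tied to how much audio the STT has actually processed
        return speech_data_end_wall_time
    if vad_last_speaking_time is not None:
        # consistent with the non-STT turn detection modes
        return vad_last_speaking_time
    # current fallback, which makes transcription_delay ~0
    return stt_event_wall_time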

Additional Context

No response

Screenshots and Recordings

No response
