The gemini-2.5-flash-native-audio-preview-12-2025 model cannot be used with modalities text for hybrid architecture with a separate TTS plugin #4423

@sagorbrur

Bug Description

Error message

websockets.exceptions.ConnectionClosedError: received 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.
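For anyone triaging logs, the message above can be split into its WebSocket close code and the server-supplied reason. Close code 1007 is the standard "invalid frame payload data" code, so the reason string is what actually identifies the modality rejection. The helper below is a minimal sketch that works on the message text only; `parse_close_error` and `is_modality_rejection` are hypothetical names, not part of livekit-agents or websockets, and the regular expression assumes the message has the same shape as the one reported here.

```python
import re
from typing import NamedTuple, Optional


class CloseInfo(NamedTuple):
    code: int      # WebSocket close code, e.g. 1007
    label: str     # short label printed in parentheses
    reason: str    # free-form reason sent by the server


# Assumed shape: "... received <code> (<label>) <reason>"
_CLOSE_RE = re.compile(r"received (\d{4}) \(([^)]*)\)\s*(.*)")


def parse_close_error(message: str) -> Optional[CloseInfo]:
    """Extract close code, label, and reason from an error message string."""
    match = _CLOSE_RE.search(message)
    if match is None:
        return None
    return CloseInfo(int(match.group(1)), match.group(2), match.group(3).strip())


def is_modality_rejection(info: CloseInfo) -> bool:
    """True when the close frame matches the error reported in this issue."""
    return info.code == 1007 and "non-audio request" in info.reason


message = (
    "websockets.exceptions.ConnectionClosedError: received 1007 "
    "(invalid frame payload data) Cannot extract voices from a non-audio request."
)
info = parse_close_error(message)
```

If the logged message carries extra trailing text after the reason, that text ends up in `reason`; the substring check in `is_modality_rejection` still applies.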

Code to reproduce

from livekit.agents import AgentSession
from livekit.plugins import google, silero

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=["text"],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)

Expected Behavior

With modalities=[Modality.TEXT] set, the Gemini Live API should return text-only responses, so that the agent can use a separate TTS plugin for speech synthesis (half-cascade architecture).
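To make the expected data flow concrete, here is a conceptual sketch of a half-cascade pipeline: the model produces text chunks and a separate synthesizer turns each chunk into audio. This is not the LiveKit API. `TextModel`, `Synthesizer`, `EchoModel`, `FakeTTS`, and `half_cascade` are all hypothetical stand-ins, and the fake TTS returns UTF-8 bytes rather than real audio.

```python
from typing import Iterable, Iterator, Protocol


class TextModel(Protocol):
    def respond(self, user_input: str) -> Iterable[str]: ...


class Synthesizer(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class EchoModel:
    """Hypothetical stand-in for a realtime model set to text-only output."""

    def respond(self, user_input: str) -> Iterable[str]:
        # Emit the reply in two chunks, as a streaming model might.
        yield "You said: "
        yield user_input


class FakeTTS:
    """Hypothetical stand-in for a TTS plugin; encodes text instead of speaking."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def half_cascade(
    model: TextModel, tts: Synthesizer, user_input: str
) -> Iterator[bytes]:
    """Pass each text chunk from the model through the separate synthesizer."""
    for chunk in model.respond(user_input):
        yield tts.synthesize(chunk)


audio = b"".join(half_cascade(EchoModel(), FakeTTS(), "hi"))
```

The point of the split is that the model never has to produce audio itself, which is why a text-only modality is needed for this setup.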

Reproduction Steps

from livekit.agents import AgentSession
from livekit.plugins import google, silero
from livekit.plugins.google.realtime import Modality

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=[Modality.TEXT],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)

Operating System

Ubuntu 22.04

Models Used

Deepgram, Google, ElevenLabs

Package Versions

livekit-agents==1.3.10

Session/Room/Call IDs

No response

Proposed Solution

Additional Context

No response

Screenshots and Recordings

No response

Labels

bug
