Bug Description
Error message
websockets.exceptions.ConnectionClosedError: received 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.
Code to reproduce
from livekit.agents import AgentSession
from livekit.plugins import google, silero
from livekit.plugins.google.realtime import Modality

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=["text"],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # placeholder, e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)
Expected Behavior
With modalities=[Modality.TEXT], the Gemini Live API should return text-only responses, so that the agent can use a separate TTS plugin for speech synthesis (half-cascade architecture). Instead, the connection is closed with code 1007 and the error shown above.
Reproduction Steps
from livekit.agents import AgentSession
from livekit.plugins import google, silero
from livekit.plugins.google.realtime import Modality

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=[Modality.TEXT],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # placeholder, e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)
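The configuration above can be sanity-checked before the session starts. The following is a minimal sketch, assuming (this report does not confirm it) that the 1007 close is triggered by pairing a native-audio model with a text-only modality. The helper `check_modalities` and the `"native-audio"` substring heuristic are hypothetical and are not part of livekit-agents or the Google plugin.

```python
from typing import Iterable, List


def check_modalities(model: str, modalities: Iterable[str]) -> List[str]:
    """Return warnings for model/modality pairings that might be rejected.

    Hypothetical pre-flight helper; it does not call any LiveKit or Google API.
    """
    # Accept plain strings such as "text" as well as dotted names such as
    # "Modality.TEXT" by keeping only the part after the last dot.
    requested = {str(m).rsplit(".", 1)[-1].lower() for m in modalities}

    warnings: List[str] = []
    # ASSUMPTION: a model whose name contains "native-audio" only produces
    # audio output, so a request without the audio modality may be refused.
    if "native-audio" in model and "audio" not in requested:
        warnings.append(
            f"{model!r} looks like a native-audio model, but the requested "
            f"modalities are {sorted(requested)}; the server may close the "
            "connection with code 1007."
        )
    return warnings


if __name__ == "__main__":
    for message in check_modalities(
        "gemini-2.5-flash-native-audio-preview-12-2025", ["text"]
    ):
        print("warning:", message)
```

The helper returns warnings instead of raising, so a wrong heuristic cannot block a configuration that the server would actually accept.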
Operating System
Ubuntu 22.04
Models Used
Deepgram, Google, ElevenLabs
Package Versions
Session/Room/Call IDs
No response
Proposed Solution
Additional Context
No response
Screenshots and Recordings
No response