Bug Description
On my own MCP server project I kept hitting the max tool call limit until I simplified the tool list, so I was curious whether there were similar tool list issues here. That's when I noticed that tools can go dead mid-session: if the MCP server process dies, nothing notices, and every tool call after that fails for the rest of the session. On a long voice call that is going to happen eventually, because servers restart.
What actually happens: the connection task never finds out the transport is gone, initialized keeps returning True, and every tool call for the rest of the session raises a raw anyio.ClosedResourceError with no message.
Root cause (traced on main, livekit-agents/livekit/agents/llm/mcp.py):
- _run_client (~L115) holds the ClientSession open and parks on await self._closing_ev.wait(). When the server process dies the transport streams close underneath it, but nothing links stream death to that wait: the connection task stays parked forever (step 4 in the repro output, task done=False).
- Because the task never unwinds, the finally: cleanup (~L138-141, self._client = None) never runs in this death mode, so the guarded ToolError("Tool invocation failed: internal service is unavailable...") (~L171, gated on _client is None) is never reached either.
- Tool calls therefore fall through to await self._client.call_tool(...) (~L175) and raise a raw anyio.ClosedResourceError (empty message) into the agent layer, i.e. into the LLM's tool-call turn.
- initialized (~L96) is just self._client is not None, so it keeps reporting True after death.
- Nothing observes _client_task except aclose() (~L203-205), and nothing ever re-runs initialize(): no reconnect, no backoff, no health event. Tools registered on an AgentSession at startup remain permanently dead handles for the rest of the session.
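To make the parked-task problem concrete, here is a minimal toy model of it using only asyncio. It is a sketch, not the mcp.py code: ModelServer, transport_closed and _ready are names I made up for illustration, and the "server death" is just an event being set.

```python
import asyncio


class ModelServer:
    """Toy model of the lifecycle described above (hypothetical, not livekit-agents code)."""

    def __init__(self) -> None:
        self._client = None
        self._closing_ev = asyncio.Event()
        self._ready = asyncio.Event()
        self.transport_closed = asyncio.Event()  # set when the "server" dies
        self._client_task = None

    @property
    def initialized(self) -> bool:
        return self._client is not None

    async def _run_client(self) -> None:
        self._client = object()  # stands in for the ClientSession
        self._ready.set()
        try:
            # Only an explicit close wakes this up; nothing waits on transport_closed.
            await self._closing_ev.wait()
        finally:
            self._client = None

    async def initialize(self) -> None:
        self._client_task = asyncio.create_task(self._run_client())
        await self._ready.wait()

    async def aclose(self) -> None:
        self._closing_ev.set()
        await self._client_task


async def demo() -> tuple:
    server = ModelServer()
    await server.initialize()
    server.transport_closed.set()  # simulate the server process dying
    await asyncio.sleep(0.05)
    # Stale state: still "initialized", task still parked.
    stale = (server.initialized, server._client_task.done())
    await server.aclose()  # only this path ever runs the cleanup
    return (*stale, server.initialized)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (True, False, False)
```

The first two values correspond to step 4 of the repro output: the state only becomes consistent once aclose() is called explicitly.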
Expected Behavior
I expected some kind of reconnect attempt, or at least a clear error. There is an "internal service is unavailable" ToolError sitting in mcp.py, but it never triggers when the server dies. Worse, the failure is silent: initialized keeps saying True, and tool calls just throw anyio.ClosedResourceError with no message. Nothing is worse than a silent error. At minimum it should fail loudly; better yet, it should reconnect and keep working.
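As a rough illustration of what I mean by "fail loudly", here is a sketch of a call-site wrapper that swaps a message-less transport error for one with a usable message. Everything in it is a stand-in: ClosedResourceError is a local class mimicking anyio.ClosedResourceError, and ToolUnavailableError and call_loudly are hypothetical names, not part of livekit-agents.

```python
import asyncio


class ClosedResourceError(Exception):
    """Local stand-in for anyio.ClosedResourceError (raised with no message)."""


class ToolUnavailableError(Exception):
    """Hypothetical error type that carries a message the caller can act on."""


async def call_loudly(tool_name, call):
    """Await call() and replace a message-less transport error with a descriptive one."""
    try:
        return await call()
    except ClosedResourceError as exc:
        raise ToolUnavailableError(
            f"tool {tool_name!r} failed: MCP server connection is closed"
        ) from exc


async def dead_ping():
    # Mimics a tool call made after the server process has died.
    raise ClosedResourceError()


async def demo() -> str:
    try:
        await call_loudly("ping", dead_ping)
    except ToolUnavailableError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

This only improves the message at the call site. It does nothing about initialized reporting True, which is why the real fix belongs in the connection task.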
Reproduction Steps
- Save the two files below and run:
  pip install "livekit-agents[mcp]"
  python repro.py
- Watch steps 4-6 of the output: the server process is dead, initialized still says True, and tool calls raise raw anyio errors.
dying_server.py (a ping tool and a crash tool that exits the process, simulating any server death: crash, redeploy, OOM kill):
import os

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dying-server")


@mcp.tool()
def ping() -> str:
    """Returns pong."""
    return "pong"


@mcp.tool()
def crash() -> str:
    """Simulates the server process dying (crash, restart, OOM kill...)."""
    os._exit(1)


if __name__ == "__main__":
    mcp.run(transport="stdio")
repro.py:
import asyncio, sys
from pathlib import Path

from livekit.agents.llm.mcp import MCPServerStdio

SERVER = str(Path(__file__).parent / "dying_server.py")


async def main() -> None:
    server = MCPServerStdio(command=sys.executable, args=[SERVER])
    await server.initialize()
    tools = await server.list_tools()
    print(f"1) initialized={server.initialized}, tools={[t.info.name for t in tools]}")
    ping = next(t for t in tools if t.info.name == "ping")
    crash = next(t for t in tools if t.info.name == "crash")
    print("2) ping ->", await ping(raw_arguments={}))
    try:
        await crash(raw_arguments={})  # the server process exits here
    except Exception as e:
        print(f"3) crash tool raised {type(e).__name__}: {str(e)[:60]}")
    await asyncio.sleep(1.0)
    print(f"4) one second after server death: initialized={server.initialized} "
          f"(connection task done={server._client_task.done()})")
    for n in (5, 6):
        try:
            await ping(raw_arguments={})
        except Exception as e:
            print(f"{n}) ping -> {type(e).__module__}.{type(e).__name__}: {str(e)[:70]!r}")
        await asyncio.sleep(0.5)
    print("7) no reconnect was attempted; initialized still", server.initialized)


asyncio.run(main())
Output:
1) initialized=True, tools=['ping', 'crash']
2) ping -> {"type":"text","text":"pong","annotations":null,"meta":null}
3) crash tool raised McpError: Connection closed
4) one second after server death: initialized=True (connection task done=False)
5) ping -> anyio.ClosedResourceError: ''
6) ping -> anyio.ClosedResourceError: ''
7) no reconnect was attempted; initialized still True
Step 4 is the heart of it: a full second after the server process is gone, the connection task has not observed the death.
Operating System
Linux (repro is OS-independent)
Models Used
None involved (pure MCP client lifecycle)
Package Versions
livekit-agents 1.6.4 (also reproduced identically against main @ e2b2d09, 2026-07-02)
mcp: stdio transport (streamable-HTTP shares the same lifecycle path)
Python 3.10.12
Session/Room/Call IDs
No response
Proposed Solution
The fix seems simple and is partly written already; it just never runs. Something needs to watch the connection so that, when the transport dies, the connection task unwinds and the existing ToolError path is reached, which makes the failure loud. Once the death is detected, the obvious next step is to reconnect.
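Here is a rough sketch of the shape I have in mind, again as an asyncio-only toy model rather than a patch. The assumptions are loud ones: WatchedServer and _transport_dead are hypothetical, ToolError is a local stand-in with an abbreviated message, and how the real client would learn that the transport died is exactly the part I have not worked out.

```python
import asyncio


class ToolError(Exception):
    """Local stand-in for the library's ToolError."""


class WatchedServer:
    """Toy model: the connection task races 'close requested' against 'transport died'."""

    def __init__(self) -> None:
        self._client = None
        self._closing_ev = asyncio.Event()
        self._transport_dead = asyncio.Event()  # hypothetical death signal
        self._ready = asyncio.Event()
        self._client_task = None

    @property
    def initialized(self) -> bool:
        return self._client is not None

    async def _run_client(self) -> None:
        self._client = object()  # stands in for the ClientSession
        self._ready.set()
        waiters = [
            asyncio.create_task(self._closing_ev.wait()),
            asyncio.create_task(self._transport_dead.wait()),
        ]
        try:
            # Wake up on whichever happens first, not only on an explicit close.
            await asyncio.wait(waiters, return_when=asyncio.FIRST_COMPLETED)
        finally:
            for waiter in waiters:
                waiter.cancel()
            self._client = None  # cleanup now runs in the death case too

    async def initialize(self) -> None:
        self._ready.clear()
        self._transport_dead.clear()
        self._client_task = asyncio.create_task(self._run_client())
        await self._ready.wait()

    async def call_tool(self, name: str) -> str:
        if self._client is None:
            # Abbreviated stand-in for the guarded error message.
            raise ToolError("Tool invocation failed: internal service is unavailable")
        return "pong"

    async def aclose(self) -> None:
        self._closing_ev.set()
        await self._client_task


async def demo() -> tuple:
    server = WatchedServer()
    await server.initialize()
    server._transport_dead.set()  # simulate the server process dying
    await server._client_task  # the task unwinds on its own
    after_death = server.initialized
    try:
        await server.call_tool("ping")
        error = ""
    except ToolError as exc:
        error = str(exc)
    await server.initialize()  # naive reconnect: just run initialize() again
    result = await server.call_tool("ping")
    await server.aclose()
    return after_death, error, result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model initialized flips to False as soon as the death signal fires, the guarded error is raised instead of a raw transport exception, and a second initialize() brings the tool back. A real reconnect would presumably want backoff and a retry limit on top of this.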
Additional Context
No response
Screenshots and Recordings
No response