Skip to content

fix: correct misleading log, swallowed exception, and HTTP client leak in RegistryTEEConnection.reconnect()#232

Open
amathxbt wants to merge 1 commit intoOpenGradient:mainfrom
amathxbt:fix/tee-reconnect-error-handling
Open

fix: correct misleading log, swallowed exception, and HTTP client leak in RegistryTEEConnection.reconnect()#232
amathxbt wants to merge 1 commit intoOpenGradient:mainfrom
amathxbt:fix/tee-reconnect-error-handling

Conversation

@amathxbt
Copy link
Copy Markdown
Contributor

@amathxbt amathxbt commented Apr 3, 2026

Bug Fix: RegistryTEEConnection.reconnect() — misleading log, swallowed exception, resource leak

Summary

Three bugs found in RegistryTEEConnection.reconnect() in tee_connection.py:


Bug 1 — Wrong log message

Before:

async def reconnect(self) -> None:
    async with self._refresh_lock:
        try:
            self._active = self._connect()
        except Exception:
            logger.debug("Failed to close previous HTTP client during TEE refresh.", exc_info=True)

The log message says "Failed to close previous HTTP client" — but the code never tried to close anything. The actual failure was in self._connect() (registry lookup, TLS handshake, etc.). The error message was completely wrong, making debugging nearly impossible.


Bug 2 — Exception silently swallowed

After self._connect() fails, the exception is caught and discarded. The caller (including _tee_refresh_loop) has no way to know that the reconnect failed. self._active stays pointing to the old (potentially dead) TEE and the system appears healthy.


Bug 3 — Resource leak on successful reconnect

When self._connect() succeeds, the old ActiveTEE.http_client is never closed. The old httpx client and its underlying TCP connection are leaked every time a reconnect happens (every 5 minutes by the background loop).


Fix

async def reconnect(self) -> None:
    async with self._refresh_lock:
        old_active = self._active
        try:
            self._active = self._connect()
        except Exception:
            logger.warning(
                "Failed to reconnect to a new TEE from registry; keeping existing connection.",
                exc_info=True,
            )
            raise  # re-raise so callers know reconnect failed
        else:
            # Close old client only after successful reconnect
            try:
                await old_active.http_client.aclose()
            except Exception:
                logger.debug(
                    "Failed to close previous HTTP client after successful TEE refresh.",
                    exc_info=True,
                )

Impact

  • Every 5-minute background TEE health check that triggers a reconnect leaks one HTTP connection
  • Registry or TLS failures during reconnect are invisible to callers and logs
  • Debugging TEE connectivity issues is significantly harder with a misleading log message

@adambalogh @kylexqian — three bugs in one method, all related. Happy to add a unit test if that would help get this merged. 🙏

…nection.reconnect()

Three bugs fixed:
1. Wrong log message — error said 'close previous HTTP client' but the failure
   was actually in _connect(), not in closing anything.
2. Exception was silently swallowed — callers had no way to know reconnect
   failed and self._active was left pointing to a potentially stale/dead TEE.
3. Resource leak — old HTTP client was never closed on a successful reconnect,
   leaking the underlying TCP connection.

Fix: snapshot old_active before attempting _connect(), re-raise on failure
with a correct warning-level log, and close the old client in the else branch
only after a successful reconnect.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant