fix(daemon): skip unresponsive targets on attach + per-call CDP timeout#307
fix(daemon): skip unresponsive targets on attach + per-call CDP timeout#307decisionm wants to merge 2 commits intobrowser-use:mainfrom
Conversation
There was a problem hiding this comment.
2 issues found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="helpers.py">
<violation number="1" location="helpers.py:26">
P2: BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import</violation>
</file>
<file name="daemon.py">
<violation number="1" location="daemon.py:229">
P2: Stale-session recovery still awaits `attach_first_page()` without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.
| INTERNAL = ("chrome://", "chrome-untrusted://", "devtools://", "chrome-extension://", "about:") | ||
|
|
||
|
|
||
| _SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60")) |
There was a problem hiding this comment.
P2: BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers.py, line 26:
<comment>BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import</comment>
<file context>
@@ -23,16 +23,29 @@ def _load_env():
INTERNAL = ("chrome://", "chrome-untrusted://", "devtools://", "chrome-extension://", "about:")
+_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))
+
+
</file context>
| _SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60")) | |
| _send_timeout_raw = os.environ.get("BU_SEND_TIMEOUT", "60") | |
| try: | |
| _SEND_TIMEOUT = float(_send_timeout_raw) | |
| except ValueError: | |
| _SEND_TIMEOUT = 60.0 |
| if await self.attach_first_page(): | ||
| return {"result": await asyncio.wait_for( | ||
| self.cdp.send_raw(method, params, session_id=self.session), | ||
| timeout=cdp_timeout, | ||
| )} |
There was a problem hiding this comment.
P2: Stale-session recovery still awaits attach_first_page() without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon.py, line 229:
<comment>Stale-session recovery still awaits `attach_first_page()` without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.</comment>
<file context>
@@ -180,14 +209,30 @@ async def handle(self, req):
- if await self.attach_first_page():
- return {"result": await self.cdp.send_raw(method, params, session_id=self.session)}
+ try:
+ if await self.attach_first_page():
+ return {"result": await asyncio.wait_for(
+ self.cdp.send_raw(method, params, session_id=self.session),
</file context>
| if await self.attach_first_page(): | |
| return {"result": await asyncio.wait_for( | |
| self.cdp.send_raw(method, params, session_id=self.session), | |
| timeout=cdp_timeout, | |
| )} | |
| if await asyncio.wait_for(self.attach_first_page(), timeout=cdp_timeout): | |
| return {"result": await asyncio.wait_for( | |
| self.cdp.send_raw(method, params, session_id=self.session), | |
| timeout=cdp_timeout, | |
| )} |
Two bugs caused the daemon to hang silently on real-world setups (multiple heavy SPA tabs open, sleeping tabs, etc.): 1. attach_first_page picked targets[0] unconditionally and silently swallowed Page/DOM/Runtime/Network.enable timeouts. When the first real page was a wedged renderer (sleeping tab, busy SPA, frozen iframe), the daemon happily reported "attached" with a session ID, but every subsequent CDP call against that session hung forever. Now iterates real pages, abandons (and detaches) any target whose enables time out within 4s, and falls back to a fresh about:blank if every existing page is unresponsive. 2. handle() wrapped cdp.send_raw with no per-call timeout. The IPC layer's 5s read timeout protects the *calling* client, but the daemon's own event loop stayed jammed while await self.cdp.send_raw sat forever, starving every other connected client. Now bounded by BU_CDP_TIMEOUT (default 45s); times out individual calls cleanly with a recovery hint. Also wraps the stale-session re-attach path in try/except so re-attach failures surface as errors rather than propagating out of the handler. The override env var lets users bump it for legitimately slow workloads (large captureBeyondViewport screenshots, long Runtime.evaluate scripts). Repro before fix: open Chrome with heavy SPA tabs (Gmail, Cloudflare dashboard, BBC article, etc.) such that targets[0] is unresponsive, then run any helper. The CLI hung indefinitely with no output, error, or diagnostic. The daemon log only showed empty-message TimeoutErrors from the four .enable calls and then went silent. After fix, the daemon log clearly shows the iteration: enable Page on D58089BF...: TimeoutError: ... target 8F5C7B... (https://dit.example.com/) unresponsive, trying next enable Page on 2417ABE3...: TimeoutError: ... target 784EE4CB... (https://notarycouncil.org/) unresponsive, trying next attached 723889E3... (https://mail.google.com/...) session=4E670D2D... Tests: - test_attach_first_page_skips_unresponsive_targets — wedged target with hanging Page.enable; verifies the next live target wins. - test_attach_first_page_creates_blank_when_all_pages_unresponsive — every real page wedged; verifies fallback to fresh about:blank. - test_handle_times_out_wedged_cdp_call — Runtime.evaluate hangs forever; verifies BU_CDP_TIMEOUT bounds it with a clean error containing the method name and a recovery hint. All pre-existing tests still pass.
e699fc9 to
29a7b1b
Compare
attach_first_page() iterates pages with per-call timeouts that stack (5s attach + 4×4s enables per page). On the stale-session recovery path (Session with given id not found), a browser full of wedged tabs could keep one client waiting well past the BU_CDP_TIMEOUT contract. Wrap the whole recovery in asyncio.wait_for(cdp_timeout) so it surfaces a clean error with a restart_daemon() hint instead of accumulating per-tab waits. Adds a regression test covering the multi-tab wedged recovery case. Identified by cubic-dev-ai on PR browser-use#307.
|
@sauravpanda — friendly ping when you have a moment 🙏 This PR fixes two real-world hangs we've been hitting in dm-crm with multi-tab Chrome setups:
Just pushed a follow-up commit ( Tests: 11/11 passing (4 new regression tests covering wedged-target skip, all-wedged → about:blank fallback, per-call timeout, and bounded recovery). PR is Happy to address any further feedback. |
Summary
Two daemon-side bugs cause
browser-harnessto hang silently when Chrome has multiple heavy/sleeping SPA tabs open. The IPC layer's 5s read timeout already lands a clean error in the calling client, but the daemon itself stays jammed for every other client — and when the first attach picks a wedged tab, every call from then on hits the same dead session.(Originally this PR also added an IPC-side socket timeout in
helpers.py:_send, but that bug has already been fixed upstream by_ipc.connect(NAME, timeout=5.0). Rebased to drop the redundant change.)Root causes
1.
attach_first_pagepinned the daemon to a wedged tabPicked
targets[0]unconditionally, ran the fourDomain.enablecalls in parallel, and silently swallowed each timeout. Whentargets[0]was a sleeping tab, busy SPA, or frozen iframe, the daemon happily reportedattached ...in the log — but every subsequent CDP call against that session hung forever because the renderer was unresponsive.2.
handle()had no per-call CDP timeoutawait self.cdp.send_raw(method, params, session_id=sid)with nowait_for. One wedged renderer pinned the daemon's event loop forever, starving every other connected client. The IPCrecvtimeout only protected the immediate caller.Changes
_try_attach(new): attaches to a single target and runs all four.enablecalls. ReturnsNone(and detaches cleanly) if any step fails.attach_first_page: iterates real pages, abandons unresponsive ones, falls back to a freshabout:blankif every existing page is unresponsive. Raises if even the fallback fails._enable_default_domains: gainsstrict=Falseflag. Initial attach (strict=True) returns False on any failure so the caller can abandon the target.set_sessionkeeps the existing best-effort behavior since the tab switch has already committed.handle(): wrapscdp.send_rawinasyncio.wait_forwithBU_CDP_TIMEOUT(default 45s). Returns a clean error with recovery hint instead of hanging. The stale-session re-attach path is now wrapped in try/except so re-attach failures surface as errors.BU_CDP_TIMEOUTis overridable for legitimately slow workloads (largecaptureBeyondViewportscreenshots, longRuntime.evaluatescripts).Repro
Before fix: open Chrome with heavy SPA tabs (Gmail, Cloudflare dashboard, BBC article, etc.) such that
targets[0]is unresponsive. Run any helper:```bash
browser-harness -c 'print(page_info())'
```
CLI hung indefinitely. Daemon log showed only:
```
attached 8F5C7B88... (https://dit.example.com/) session=...
listening on /tmp/bu-default.sock
```
…with no clue that the four
.enablecalls had silently timed out behind the scenes.After fix the iteration is visible and
page_info()returns in <1s:```
enable Page on D58089BF...: TimeoutError:
enable DOM on D58089BF...: TimeoutError:
enable Runtime on D58089BF...: TimeoutError:
enable Network on D58089BF...: TimeoutError:
target 8F5C7B88... (https://dit.example.com/) unresponsive, trying next
enable Page on 2417ABE3...: TimeoutError:
...
target 784EE4CB... (https://notarycouncil.org/) unresponsive, trying next
...
attached 723889E3... (https://mail.google.com/...) session=4E670D2D...
```
Tests
Three new tests in
tests/unit/test_daemon.py:test_attach_first_page_skips_unresponsive_targets— wedged target with hangingPage.enable; verifies the next live target wins.test_attach_first_page_creates_blank_when_all_pages_unresponsive— every real page wedged; verifies fallback to freshabout:blank.test_handle_times_out_wedged_cdp_call—Runtime.evaluatehangs forever; verifiesBU_CDP_TIMEOUTbounds it with a clean error containing the method name and recovery hint.All 7 pre-existing daemon tests continue to pass (10/10 total).
Test plan
page_inforeturns in <1s