Skip to content

fix(daemon): skip unresponsive targets on attach + per-call CDP timeout#307

Open
decisionm wants to merge 2 commits intobrowser-use:mainfrom
decisionm:fix/daemon-resilience
Open

fix(daemon): skip unresponsive targets on attach + per-call CDP timeout#307
decisionm wants to merge 2 commits intobrowser-use:mainfrom
decisionm:fix/daemon-resilience

Conversation

@decisionm
Copy link
Copy Markdown

@decisionm decisionm commented May 6, 2026

Summary

Two daemon-side bugs cause browser-harness to hang silently when Chrome has multiple heavy/sleeping SPA tabs open. The IPC layer's 5s read timeout already lands a clean error in the calling client, but the daemon itself stays jammed for every other client — and when the first attach picks a wedged tab, every call from then on hits the same dead session.

(Originally this PR also added an IPC-side socket timeout in helpers.py:_send, but that bug has already been fixed upstream by _ipc.connect(NAME, timeout=5.0). Rebased to drop the redundant change.)

Root causes

1. attach_first_page pinned the daemon to a wedged tab

Picked targets[0] unconditionally, ran the four Domain.enable calls in parallel, and silently swallowed each timeout. When targets[0] was a sleeping tab, busy SPA, or frozen iframe, the daemon happily reported attached ... in the log — but every subsequent CDP call against that session hung forever because the renderer was unresponsive.

2. handle() had no per-call CDP timeout

await self.cdp.send_raw(method, params, session_id=sid) with no wait_for. One wedged renderer pinned the daemon's event loop forever, starving every other connected client. The IPC recv timeout only protected the immediate caller.

Changes

  • _try_attach (new): attaches to a single target and runs all four .enable calls. Returns None (and detaches cleanly) if any step fails.
  • attach_first_page: iterates real pages, abandons unresponsive ones, falls back to a fresh about:blank if every existing page is unresponsive. Raises if even the fallback fails.
  • _enable_default_domains: gains strict=False flag. Initial attach (strict=True) returns False on any failure so the caller can abandon the target. set_session keeps the existing best-effort behavior since the tab switch has already committed.
  • handle(): wraps cdp.send_raw in asyncio.wait_for with BU_CDP_TIMEOUT (default 45s). Returns a clean error with recovery hint instead of hanging. The stale-session re-attach path is now wrapped in try/except so re-attach failures surface as errors.

BU_CDP_TIMEOUT is overridable for legitimately slow workloads (large captureBeyondViewport screenshots, long Runtime.evaluate scripts).

Repro

Before fix: open Chrome with heavy SPA tabs (Gmail, Cloudflare dashboard, BBC article, etc.) such that targets[0] is unresponsive. Run any helper:

```bash
browser-harness -c 'print(page_info())'
```

CLI hung indefinitely. Daemon log showed only:
```
attached 8F5C7B88... (https://dit.example.com/) session=...
listening on /tmp/bu-default.sock
```
…with no clue that the four .enable calls had silently timed out behind the scenes.

After fix the iteration is visible and page_info() returns in <1s:
```
enable Page on D58089BF...: TimeoutError:
enable DOM on D58089BF...: TimeoutError:
enable Runtime on D58089BF...: TimeoutError:
enable Network on D58089BF...: TimeoutError:
target 8F5C7B88... (https://dit.example.com/) unresponsive, trying next
enable Page on 2417ABE3...: TimeoutError:
...
target 784EE4CB... (https://notarycouncil.org/) unresponsive, trying next
...
attached 723889E3... (https://mail.google.com/...) session=4E670D2D...
```

Tests

Three new tests in tests/unit/test_daemon.py:

  • test_attach_first_page_skips_unresponsive_targets — wedged target with hanging Page.enable; verifies the next live target wins.
  • test_attach_first_page_creates_blank_when_all_pages_unresponsive — every real page wedged; verifies fallback to fresh about:blank.
  • test_handle_times_out_wedged_cdp_callRuntime.evaluate hangs forever; verifies BU_CDP_TIMEOUT bounds it with a clean error containing the method name and recovery hint.

All 7 pre-existing daemon tests continue to pass (10/10 total).

Test plan

  • Unit tests cover wedged-target skip, all-wedged-fallback, and per-call timeout
  • E2E smoke: real Chrome session w/ ~3 wedged SPA tabs → daemon iterates, attaches to live tab, page_info returns in <1s
  • All pre-existing daemon tests still pass
  • CI green

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers.py">

<violation number="1" location="helpers.py:26">
P2: BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import</violation>
</file>

<file name="daemon.py">

<violation number="1" location="daemon.py:229">
P2: Stale-session recovery still awaits `attach_first_page()` without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

Comment thread helpers.py Outdated
INTERNAL = ("chrome://", "chrome-untrusted://", "devtools://", "chrome-extension://", "about:")


_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers.py, line 26:

<comment>BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import</comment>

<file context>
@@ -23,16 +23,29 @@ def _load_env():
 INTERNAL = ("chrome://", "chrome-untrusted://", "devtools://", "chrome-extension://", "about:")
 
 
+_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))
+
+
</file context>
Suggested change
_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))
_send_timeout_raw = os.environ.get("BU_SEND_TIMEOUT", "60")
try:
_SEND_TIMEOUT = float(_send_timeout_raw)
except ValueError:
_SEND_TIMEOUT = 60.0
Fix with Cubic

Comment thread daemon.py Outdated
Comment on lines +229 to +233
if await self.attach_first_page():
return {"result": await asyncio.wait_for(
self.cdp.send_raw(method, params, session_id=self.session),
timeout=cdp_timeout,
)}
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Stale-session recovery still awaits attach_first_page() without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At daemon.py, line 229:

<comment>Stale-session recovery still awaits `attach_first_page()` without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.</comment>

<file context>
@@ -180,14 +209,30 @@ async def handle(self, req):
-                if await self.attach_first_page():
-                    return {"result": await self.cdp.send_raw(method, params, session_id=self.session)}
+                try:
+                    if await self.attach_first_page():
+                        return {"result": await asyncio.wait_for(
+                            self.cdp.send_raw(method, params, session_id=self.session),
</file context>
Suggested change
if await self.attach_first_page():
return {"result": await asyncio.wait_for(
self.cdp.send_raw(method, params, session_id=self.session),
timeout=cdp_timeout,
)}
if await asyncio.wait_for(self.attach_first_page(), timeout=cdp_timeout):
return {"result": await asyncio.wait_for(
self.cdp.send_raw(method, params, session_id=self.session),
timeout=cdp_timeout,
)}
Fix with Cubic

Two bugs caused the daemon to hang silently on real-world setups (multiple
heavy SPA tabs open, sleeping tabs, etc.):

1. attach_first_page picked targets[0] unconditionally and silently
   swallowed Page/DOM/Runtime/Network.enable timeouts. When the first
   real page was a wedged renderer (sleeping tab, busy SPA, frozen
   iframe), the daemon happily reported "attached" with a session ID,
   but every subsequent CDP call against that session hung forever.
   Now iterates real pages, abandons (and detaches) any target whose
   enables time out within 4s, and falls back to a fresh about:blank
   if every existing page is unresponsive.

2. handle() wrapped cdp.send_raw with no per-call timeout. The IPC
   layer's 5s read timeout protects the *calling* client, but the
   daemon's own event loop stayed jammed while await self.cdp.send_raw
   sat forever, starving every other connected client. Now bounded by
   BU_CDP_TIMEOUT (default 45s); times out individual calls cleanly
   with a recovery hint. Also wraps the stale-session re-attach path
   in try/except so re-attach failures surface as errors rather than
   propagating out of the handler.

The override env var lets users bump it for legitimately slow workloads
(large captureBeyondViewport screenshots, long Runtime.evaluate scripts).

Repro before fix: open Chrome with heavy SPA tabs (Gmail, Cloudflare
dashboard, BBC article, etc.) such that targets[0] is unresponsive,
then run any helper. The CLI hung indefinitely with no output, error,
or diagnostic. The daemon log only showed empty-message TimeoutErrors
from the four .enable calls and then went silent.

After fix, the daemon log clearly shows the iteration:

  enable Page on D58089BF...: TimeoutError:
  ...
  target 8F5C7B... (https://dit.example.com/) unresponsive, trying next
  enable Page on 2417ABE3...: TimeoutError:
  ...
  target 784EE4CB... (https://notarycouncil.org/) unresponsive, trying next
  attached 723889E3... (https://mail.google.com/...) session=4E670D2D...

Tests:

- test_attach_first_page_skips_unresponsive_targets — wedged target with
  hanging Page.enable; verifies the next live target wins.
- test_attach_first_page_creates_blank_when_all_pages_unresponsive — every
  real page wedged; verifies fallback to fresh about:blank.
- test_handle_times_out_wedged_cdp_call — Runtime.evaluate hangs forever;
  verifies BU_CDP_TIMEOUT bounds it with a clean error containing the
  method name and a recovery hint.

All pre-existing tests still pass.
@decisionm decisionm force-pushed the fix/daemon-resilience branch from e699fc9 to 29a7b1b Compare May 6, 2026 04:32
@decisionm decisionm changed the title fix(daemon): resilient attach + per-call timeouts; helpers: socket timeout fix(daemon): skip unresponsive targets on attach + per-call CDP timeout May 6, 2026
attach_first_page() iterates pages with per-call timeouts that stack
(5s attach + 4×4s enables per page). On the stale-session recovery
path (Session with given id not found), a browser full of wedged tabs
could keep one client waiting well past the BU_CDP_TIMEOUT contract.

Wrap the whole recovery in asyncio.wait_for(cdp_timeout) so it surfaces
a clean error with a restart_daemon() hint instead of accumulating
per-tab waits.

Adds a regression test covering the multi-tab wedged recovery case.

Identified by cubic-dev-ai on PR browser-use#307.
@decisionm
Copy link
Copy Markdown
Author

@sauravpanda — friendly ping when you have a moment 🙏

This PR fixes two real-world hangs we've been hitting in dm-crm with multi-tab Chrome setups:

  1. Daemon pinned to a wedged tab on initial attachattach_first_page() used to pick pages[0] unconditionally and silently swallow Page.enable failures, leaving every later CDP call hanging. Now iterates real pages, abandons unresponsive ones, and falls back to a fresh about:blank if every existing page is wedged.

  2. No per-call CDP timeout in handle() — a stalled renderer (sleeping tab, infinite JS, frozen iframe) would block the daemon's event loop forever, starving every other connected client. Adds a per-call bound via BU_CDP_TIMEOUT (default 45s).

Just pushed a follow-up commit (ed94dc2) addressing the cubic bot's valid feedback: stale-session recovery is now also bounded by BU_CDP_TIMEOUT so cumulative per-tab attach attempts can't blow past the contract.

Tests: 11/11 passing (4 new regression tests covering wedged-target skip, all-wedged → about:blank fallback, per-call timeout, and bounded recovery). PR is MERGEABLE / CLEAN.

Happy to address any further feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants