fix(daemon): skip unresponsive targets on attach + per-call CDP timeout by decisionm · Pull Request #307 · browser-use/browser-harness

decisionm · 2026-05-06T04:27:53Z

Summary

Two daemon-side bugs cause browser-harness to hang silently when Chrome has multiple heavy/sleeping SPA tabs open. The IPC layer's 5s read timeout already lands a clean error in the calling client, but the daemon itself stays jammed for every other client — and when the first attach picks a wedged tab, every call from then on hits the same dead session.

(Originally this PR also added an IPC-side socket timeout in helpers.py:_send, but that bug has already been fixed upstream by _ipc.connect(NAME, timeout=5.0). Rebased to drop the redundant change.)

Root causes

1. `attach_first_page` pinned the daemon to a wedged tab

Picked targets[0] unconditionally, ran the four Domain.enable calls in parallel, and silently swallowed each timeout. When targets[0] was a sleeping tab, busy SPA, or frozen iframe, the daemon happily reported attached ... in the log — but every subsequent CDP call against that session hung forever because the renderer was unresponsive.

2. `handle()` had no per-call CDP timeout

await self.cdp.send_raw(method, params, session_id=sid) with no wait_for. One wedged renderer pinned the daemon's event loop forever, starving every other connected client. The IPC recv timeout only protected the immediate caller.

Changes

_try_attach (new): attaches to a single target and runs all four .enable calls. Returns None (and detaches cleanly) if any step fails.
attach_first_page: iterates real pages, abandons unresponsive ones, falls back to a fresh about:blank if every existing page is unresponsive. Raises if even the fallback fails.
_enable_default_domains: gains strict=False flag. Initial attach (strict=True) returns False on any failure so the caller can abandon the target. set_session keeps the existing best-effort behavior since the tab switch has already committed.
handle(): wraps cdp.send_raw in asyncio.wait_for with BU_CDP_TIMEOUT (default 45s). Returns a clean error with recovery hint instead of hanging. The stale-session re-attach path is now wrapped in try/except so re-attach failures surface as errors.

BU_CDP_TIMEOUT is overridable for legitimately slow workloads (large captureBeyondViewport screenshots, long Runtime.evaluate scripts).

Repro

Before fix: open Chrome with heavy SPA tabs (Gmail, Cloudflare dashboard, BBC article, etc.) such that targets[0] is unresponsive. Run any helper:

```bash
browser-harness -c 'print(page_info())'
```

CLI hung indefinitely. Daemon log showed only:
```
attached 8F5C7B88... (https://dit.example.com/) session=...
listening on /tmp/bu-default.sock
```
…with no clue that the four .enable calls had silently timed out behind the scenes.

After fix the iteration is visible and page_info() returns in <1s:
```
enable Page on D58089BF...: TimeoutError:
enable DOM on D58089BF...: TimeoutError:
enable Runtime on D58089BF...: TimeoutError:
enable Network on D58089BF...: TimeoutError:
target 8F5C7B88... (https://dit.example.com/) unresponsive, trying next
enable Page on 2417ABE3...: TimeoutError:
...
target 784EE4CB... (https://notarycouncil.org/) unresponsive, trying next
...
attached 723889E3... (https://mail.google.com/...) session=4E670D2D...
```

Tests

Three new tests in tests/unit/test_daemon.py:

test_attach_first_page_skips_unresponsive_targets — wedged target with hanging Page.enable; verifies the next live target wins.
test_attach_first_page_creates_blank_when_all_pages_unresponsive — every real page wedged; verifies fallback to fresh about:blank.
test_handle_times_out_wedged_cdp_call — Runtime.evaluate hangs forever; verifies BU_CDP_TIMEOUT bounds it with a clean error containing the method name and recovery hint.

All 7 pre-existing daemon tests continue to pass (10/10 total).

Test plan

Unit tests cover wedged-target skip, all-wedged-fallback, and per-call timeout
E2E smoke: real Chrome session w/ ~3 wedged SPA tabs → daemon iterates, attaches to live tab, page_info returns in <1s
All pre-existing daemon tests still pass
CI green

cubic-dev-ai

2 issues found across 2 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers.py">

<violation number="1" location="helpers.py:26">
P2: BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import</violation>
</file>

<file name="daemon.py">

<violation number="1" location="daemon.py:229">
P2: Stale-session recovery still awaits `attach_first_page()` without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.}

cubic-dev-ai · 2026-05-06T04:30:51Z

 INTERNAL = ("chrome://", "chrome-untrusted://", "devtools://", "chrome-extension://", "about:")


+_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))


P2: BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At helpers.py, line 26: <comment>BU_SEND_TIMEOUT parsed unsafely at import time; invalid env values crash module import</comment> <file context> @@ -23,16 +23,29 @@ def _load_env(): INTERNAL = ("chrome://", "chrome-untrusted://", "devtools://", "chrome-extension://", "about:") +_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60")) + + </file context>

Suggested change

_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))

_send_timeout_raw = os.environ.get("BU_SEND_TIMEOUT", "60")

try:

_SEND_TIMEOUT = float(_send_timeout_raw)

except ValueError:

_SEND_TIMEOUT = 60.0

cubic-dev-ai · 2026-05-06T04:30:51Z

+                    if await self.attach_first_page():
+                        return {"result": await asyncio.wait_for(
+                            self.cdp.send_raw(method, params, session_id=self.session),
+                            timeout=cdp_timeout,
+                        )}


P2: Stale-session recovery still awaits attach_first_page() without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At daemon.py, line 229: <comment>Stale-session recovery still awaits `attach_first_page()` without a timeout, so the new timeout does not protect reattach and a stalled CDP call can hang the request indefinitely.</comment> <file context> @@ -180,14 +209,30 @@ async def handle(self, req): - if await self.attach_first_page(): - return {"result": await self.cdp.send_raw(method, params, session_id=self.session)} + try: + if await self.attach_first_page(): + return {"result": await asyncio.wait_for( + self.cdp.send_raw(method, params, session_id=self.session), </file context>

Suggested change

if await self.attach_first_page():

return {"result": await asyncio.wait_for(

self.cdp.send_raw(method, params, session_id=self.session),

timeout=cdp_timeout,

)}

if await asyncio.wait_for(self.attach_first_page(), timeout=cdp_timeout):

return {"result": await asyncio.wait_for(

self.cdp.send_raw(method, params, session_id=self.session),

timeout=cdp_timeout,

)}

Two bugs caused the daemon to hang silently on real-world setups (multiple heavy SPA tabs open, sleeping tabs, etc.): 1. attach_first_page picked targets[0] unconditionally and silently swallowed Page/DOM/Runtime/Network.enable timeouts. When the first real page was a wedged renderer (sleeping tab, busy SPA, frozen iframe), the daemon happily reported "attached" with a session ID, but every subsequent CDP call against that session hung forever. Now iterates real pages, abandons (and detaches) any target whose enables time out within 4s, and falls back to a fresh about:blank if every existing page is unresponsive. 2. handle() wrapped cdp.send_raw with no per-call timeout. The IPC layer's 5s read timeout protects the *calling* client, but the daemon's own event loop stayed jammed while await self.cdp.send_raw sat forever, starving every other connected client. Now bounded by BU_CDP_TIMEOUT (default 45s); times out individual calls cleanly with a recovery hint. Also wraps the stale-session re-attach path in try/except so re-attach failures surface as errors rather than propagating out of the handler. The override env var lets users bump it for legitimately slow workloads (large captureBeyondViewport screenshots, long Runtime.evaluate scripts). Repro before fix: open Chrome with heavy SPA tabs (Gmail, Cloudflare dashboard, BBC article, etc.) such that targets[0] is unresponsive, then run any helper. The CLI hung indefinitely with no output, error, or diagnostic. The daemon log only showed empty-message TimeoutErrors from the four .enable calls and then went silent. After fix, the daemon log clearly shows the iteration: enable Page on D58089BF...: TimeoutError: ... target 8F5C7B... (https://dit.example.com/) unresponsive, trying next enable Page on 2417ABE3...: TimeoutError: ... target 784EE4CB... (https://notarycouncil.org/) unresponsive, trying next attached 723889E3... (https://mail.google.com/...) session=4E670D2D... Tests: - test_attach_first_page_skips_unresponsive_targets — wedged target with hanging Page.enable; verifies the next live target wins. - test_attach_first_page_creates_blank_when_all_pages_unresponsive — every real page wedged; verifies fallback to fresh about:blank. - test_handle_times_out_wedged_cdp_call — Runtime.evaluate hangs forever; verifies BU_CDP_TIMEOUT bounds it with a clean error containing the method name and a recovery hint. All pre-existing tests still pass.

attach_first_page() iterates pages with per-call timeouts that stack (5s attach + 4×4s enables per page). On the stale-session recovery path (Session with given id not found), a browser full of wedged tabs could keep one client waiting well past the BU_CDP_TIMEOUT contract. Wrap the whole recovery in asyncio.wait_for(cdp_timeout) so it surfaces a clean error with a restart_daemon() hint instead of accumulating per-tab waits. Adds a regression test covering the multi-tab wedged recovery case. Identified by cubic-dev-ai on PR browser-use#307.

decisionm · 2026-05-06T04:40:47Z

@sauravpanda — friendly ping when you have a moment 🙏

This PR fixes two real-world hangs we've been hitting in dm-crm with multi-tab Chrome setups:

Daemon pinned to a wedged tab on initial attach — attach_first_page() used to pick pages[0] unconditionally and silently swallow Page.enable failures, leaving every later CDP call hanging. Now iterates real pages, abandons unresponsive ones, and falls back to a fresh about:blank if every existing page is wedged.
No per-call CDP timeout in handle() — a stalled renderer (sleeping tab, infinite JS, frozen iframe) would block the daemon's event loop forever, starving every other connected client. Adds a per-call bound via BU_CDP_TIMEOUT (default 45s).

Just pushed a follow-up commit (ed94dc2) addressing the cubic bot's valid feedback: stale-session recovery is now also bounded by BU_CDP_TIMEOUT so cumulative per-tab attach attempts can't blow past the contract.

Tests: 11/11 passing (4 new regression tests covering wedged-target skip, all-wedged → about:blank fallback, per-call timeout, and bounded recovery). PR is MERGEABLE / CLEAN.

Happy to address any further feedback.

cubic-dev-ai Bot reviewed May 6, 2026

View reviewed changes

decisionm force-pushed the fix/daemon-resilience branch from e699fc9 to 29a7b1b Compare May 6, 2026 04:32

decisionm changed the title ~~fix(daemon): resilient attach + per-call timeouts; helpers: socket timeout~~ fix(daemon): skip unresponsive targets on attach + per-call CDP timeout May 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(daemon): skip unresponsive targets on attach + per-call CDP timeout#307

fix(daemon): skip unresponsive targets on attach + per-call CDP timeout#307
decisionm wants to merge 2 commits intobrowser-use:mainfrom
decisionm:fix/daemon-resilience

decisionm commented May 6, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot May 6, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot May 6, 2026 •

edited

Loading

Uh oh!

decisionm commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		INTERNAL = ("chrome://", "chrome-untrusted://", "devtools://", "chrome-extension://", "about:")


		_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))

-_SEND_TIMEOUT = float(os.environ.get("BU_SEND_TIMEOUT", "60"))
+_send_timeout_raw = os.environ.get("BU_SEND_TIMEOUT", "60")
+try:
+    _SEND_TIMEOUT = float(_send_timeout_raw)
+except ValueError:
+    _SEND_TIMEOUT = 60.0

Conversation

decisionm commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root causes

1. attach_first_page pinned the daemon to a wedged tab

2. handle() had no per-call CDP timeout

Changes

Repro

Tests

Test plan

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

decisionm commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

decisionm commented May 6, 2026 •

edited

Loading

1. `attach_first_page` pinned the daemon to a wedged tab

2. `handle()` had no per-call CDP timeout

cubic-dev-ai Bot May 6, 2026 •

edited

Loading

cubic-dev-ai Bot May 6, 2026 •

edited

Loading