
Prompt tweaks in an attempt to pass more eval tasks #293

Closed
lmorchard wants to merge 4 commits into develop from evals/partial/lmorchard/partial-eval-updates

@lmorchard (Collaborator) commented Feb 19, 2026

Also opening as a draft to run against partial evals; will fix up conflicts later.

Description

Improves agent robustness on dynamic web pages by fixing several prompt instructions that were causing unnecessary failures:

  • Scrolling/dynamic content: Replaced the incorrect claim that "all content is visible — no scrolling needed" with accurate guidance that dynamic pages may load additional content on scroll or interaction.
  • Autocomplete/combobox fields: Added specific guidance for handling autocomplete dropdowns (e.g., flight search origin/destination fields). click() on autocomplete suggestions frequently times out; focus() + enter() is more reliable.
  • Stale element references: Added explicit instruction to use updated refs from the next page snapshot when receiving an "Invalid element reference" error, rather than retrying old ref IDs after DOM changes.
  • Click interception: Added guidance to try focus() and keyboard navigation (Tab, Enter, arrow keys) or Escape when click() fails due to element interception.
  • done() vs abort(): Clarified that abort() should only be used when the core task is impossible. If only supplementary details are inaccessible due to site limitations, done() with what's available is preferred.
  • Academic papers: Added guidance to fall back to HTML versions (ACL Anthology, Semantic Scholar) when PDFs are inaccessible.
  • Validation feedback: Added instruction to call done() with best available answer (rather than aborting) when validator feedback cannot be addressed due to genuine site limitations.
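
To make the autocomplete, stale-ref, and click-interception guidance concrete, here is a minimal sketch of the fallback order the prompt now asks the agent to follow. The `PageActions` interface, `selectSuggestion`, and the result shape are hypothetical names invented for illustration; they are not this project's API, and the real change lives in prompt text rather than code.

```typescript
// Hypothetical action surface; these names are illustrative only.
interface PageActions {
  click(ref: string): Promise<void>;
  focus(ref: string): Promise<void>;
  pressKey(key: string): Promise<void>;
}

type SelectResult =
  | { ok: true; via: "click" | "keyboard" }
  | { ok: false; reason: string };

// Try click() first; if it fails for any reason other than a stale ref,
// fall back to focus() followed by Enter.
async function selectSuggestion(
  actions: PageActions,
  ref: string,
): Promise<SelectResult> {
  try {
    await actions.click(ref);
    return { ok: true, via: "click" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // A stale ref cannot be fixed by retrying the same ID: the caller
    // has to pick a fresh ref from the next page snapshot.
    if (message.includes("Invalid element reference")) {
      return { ok: false, reason: "stale-ref" };
    }
  }
  try {
    await actions.focus(ref);
    await actions.pressKey("Enter");
    return { ok: true, via: "keyboard" };
  } catch (err) {
    return { ok: false, reason: err instanceof Error ? err.message : String(err) };
  }
}
```

A caller that receives `reason: "stale-ref"` would re-read the snapshot and retry with the updated ref instead of repeating the old one.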

These changes were validated against the WebVoyagerX partial eval set (30 tasks), improving the pass rate from 60% to 93% (with --max-retries 2).

PR Type

  • Bug Fix

Related issues

Checklist

  • I understand the code I am submitting
  • I have tested this code locally
  • New and existing tests pass locally (pnpm test)
  • I have added tests that prove my fix/feature works (if applicable)
  • Documentation was updated where necessary
  • I have read and followed the contribution guidelines

AI Usage

  • No AI was used
  • AI was used for drafting/refactoring
  • This is fully AI-generated

lmorchard and others added 4 commits February 18, 2026 15:08
Adds support for a list of CDP endpoints tried in sequence when
connecting to a remote Chromium browser. If an endpoint fails with a
timeout or connection error, the next one is tried automatically. Once
all are exhausted, the task fails immediately rather than burning
through retry attempts.

New option: --pw-cdp-endpoints (comma-separated, takes precedence over
the existing singular --pw-cdp-endpoint which is unchanged).
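
A rough sketch of how the singular and plural options might be normalized into one list, with the plural form taking precedence. The `CdpOptions` shape and both function names are hypothetical; trimming whitespace and dropping empty entries are assumptions, not behavior confirmed by this commit.

```typescript
// Hypothetical option bag; field names mirror the getters named in this commit.
interface CdpOptions {
  pwCdpEndpoint?: string;  // existing singular option
  pwCdpEndpoints?: string; // raw comma-separated value from the CLI
}

// Split a comma-separated string into endpoints.
// Assumption: surrounding whitespace and empty entries are discarded.
function parseEndpointList(raw: string): string[] {
  return raw
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// The plural option wins when it yields at least one endpoint;
// otherwise the singular option is wrapped in a one-element list.
function resolveCdpEndpoints(opts: CdpOptions): string[] {
  const plural = opts.pwCdpEndpoints ? parseEndpointList(opts.pwCdpEndpoints) : [];
  if (plural.length > 0) {
    return plural;
  }
  return opts.pwCdpEndpoint ? [opts.pwCdpEndpoint] : [];
}
```

Normalizing to a list early means the rest of the connection code only has to handle the plural case.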

Key changes:
- configDefaults.ts: new "string[]" ConfigFieldType; pw_cdp_endpoints field
- config.ts: coerceValue() and addConfigOptions() support for "string[]"
- playwrightBrowser.ts: connectOverCDPWithFailover() loop with
  isCdpConnectionError() classification; nextStartIndex state for
  per-start() endpoint cycling; updated pwCdpEndpoint getter returns
  active endpoint; new pwCdpEndpoints getter; onCdpEndpointCycle callback
- cli/commands/run.ts + server/routes/spark.ts: normalize singular →
  plural and wire pwCdpEndpoints through
- events.ts: CDP_ENDPOINT_CYCLE event + CdpEndpointCycleEventData
- webAgent.ts: wire onCdpEndpointCycle callback to emit CDP_ENDPOINT_CYCLE;
  include pwCdpEndpoints in TASK_SETUP event
- secretsRedactor.ts: redact pwCdpEndpoints to ["(redacted)"] to hide
  both values and array length
- chalkConsole.ts: handle CDP_ENDPOINT_CYCLE event with warning log
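
The failover loop described above could look roughly like the following. This is a sketch under stated assumptions, not the code in playwrightBrowser.ts: `connectWithFailover`, `isConnectionError`, `AllEndpointsFailedError`, and the `onCycle` signature are hypothetical, and the error classification is a guess at what counts as a connection failure. The `connect` function is injected, so a caller could pass a wrapper around Playwright's `chromium.connectOverCDP`.

```typescript
type Connect<T> = (endpoint: string) => Promise<T>;

// Raised once every endpoint has been tried, so the caller can fail
// the task immediately instead of spending retry attempts.
class AllEndpointsFailedError extends Error {
  constructor(count: number) {
    super(`All ${count} CDP endpoints failed`);
    this.name = "AllEndpointsFailedError";
  }
}

// Assumption: timeouts and common socket errors count as connection failures.
function isConnectionError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  return (
    err.name === "TimeoutError" ||
    /ECONNREFUSED|ECONNRESET|ENOTFOUND/.test(err.message)
  );
}

async function connectWithFailover<T>(
  endpoints: string[],
  connect: Connect<T>,
  onCycle?: (failedIndex: number, errorName: string) => void,
): Promise<{ connection: T; index: number }> {
  for (let i = 0; i < endpoints.length; i++) {
    try {
      const connection = await connect(endpoints[i]);
      return { connection, index: i };
    } catch (err) {
      // Anything that is not a connection problem is rethrown unchanged.
      if (!isConnectionError(err)) throw err;
      // Report index and error name only, keeping URLs out of the callback.
      onCycle?.(i, (err as Error).name);
    }
  }
  throw new AllEndpointsFailedError(endpoints.length);
}
```

Returning the index alongside the connection lets the caller remember which endpoint is active without logging its URL.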

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Phase 1 plan and Phase 2 plan (mid-task disconnect handling, not yet
implemented) for the CDP endpoint failover feature.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Detects mid-task browser disconnects (TargetClosedError) and automatically
reconnects to the next configured CDP endpoint, then restarts task execution
from the original starting URL with a fresh message history while preserving
the existing plan, success criteria, and action items.

Changes:
- Add BrowserDisconnectedError (extends RecoverableError) for signalling
  mid-task disconnects
- Detect TargetClosedError in PlaywrightBrowser.performAction,
  getTreeWithRefs, and getScreenshot; rethrow as BrowserDisconnectedError
- Add WebAgent.handleBrowserDisconnect(): shuts down browser, starts on
  next endpoint, navigates to starting URL, resets message history
- Intercept BrowserDisconnectedError in runMainLoop before error counters
  so disconnects don't consume the error budget
- Add CDP_ENDPOINT_CONNECTED and BROWSER_RECONNECTED events; wire
  onCdpEndpointConnected callback through WebAgent to event system
- Remove direct console.log/warn calls from PlaywrightBrowser; all output
  now flows through callbacks → events → loggers
- Add endpoint index (never URL) to CDP events for metrics tracking
- Fix double task metrics summary: MetricsCollector now triggers only on
  TASK_COMPLETED (which always fires via buildResult), not TASK_ABORTED
- Use error.name instead of error.message in CDP_ENDPOINT_CYCLE events
  to avoid leaking credentials or vendor URLs from Playwright error strings
- Reduce CDP connection timeout from 10s to 5s
- Add cdp: prefix to event shortname stripping in chalkConsole metrics view
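
As an illustration of intercepting disconnects before the error counters, here is a simplified loop. It is a sketch only: `runLoop`, `LoopDeps`, and `isDisconnect` are hypothetical, the class here extends `Error` directly rather than `RecoverableError`, and the real runMainLoop does considerably more (plan preservation, message history reset, events).

```typescript
// Simplified stand-in; the commit's class extends RecoverableError instead.
class BrowserDisconnectedError extends Error {
  constructor(message = "Browser disconnected") {
    super(message);
    this.name = "BrowserDisconnectedError";
  }
}

// Hypothetical dependencies of the loop.
interface LoopDeps {
  step(): Promise<"done" | "continue">;
  // Assumed to restart the browser on the next endpoint and to throw
  // when no endpoint is left, which ends the loop with that error.
  reconnect(): Promise<void>;
}

interface LoopResult {
  status: "done" | "failed";
  errors: number;
  reconnects: number;
}

function isDisconnect(err: unknown): boolean {
  return err instanceof Error && err.name === "BrowserDisconnectedError";
}

async function runLoop(deps: LoopDeps, maxErrors: number): Promise<LoopResult> {
  let errors = 0;
  let reconnects = 0;
  while (errors < maxErrors) {
    try {
      if ((await deps.step()) === "done") {
        return { status: "done", errors, reconnects };
      }
    } catch (err) {
      // Disconnects are handled first, so they never touch the error budget.
      if (isDisconnect(err)) {
        await deps.reconnect();
        reconnects++;
        continue;
      }
      errors++;
    }
  }
  return { status: "failed", errors, reconnects };
}
```

Checking for the disconnect before incrementing `errors` is what keeps an unreliable remote browser from using up the budget meant for genuine task errors.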

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lmorchard lmorchard force-pushed the evals/partial/lmorchard/partial-eval-updates branch from 835e53d to 12b709b Compare February 19, 2026 01:23
@lmorchard (Collaborator, Author)

Going to close this: I'll have further prompt tweaks in a post-Pilo-based PR, along with some fixes for evals in general.

@lmorchard lmorchard closed this Feb 20, 2026