Prompt tweaks in an attempt to pass more eval tasks by lmorchard · Pull Request #293 · mozilla/pilo

lmorchard · 2026-02-19T00:30:38Z

Also opening as a draft to run against partial evals; I will fix up conflicts later.

Description

Improves agent robustness on dynamic web pages by fixing several prompt instructions that were causing unnecessary failures:

Scrolling/dynamic content: Replaced the incorrect claim that "all content is visible — no scrolling needed" with accurate guidance that dynamic pages may load additional content on scroll or interaction.
Autocomplete/combobox fields: Added specific guidance for handling autocomplete dropdowns (e.g., flight search origin/destination fields). click() on autocomplete suggestions frequently times out; focus() + enter() is more reliable.
Stale element references: Added explicit instruction to use updated refs from the next page snapshot when receiving an "Invalid element reference" error, rather than retrying old ref IDs after DOM changes.
Click interception: Added guidance to try focus() and keyboard navigation (Tab, Enter, arrow keys) or Escape when click() fails due to element interception.
done() vs abort(): Clarified that abort() should only be used when the core task is impossible — if supplementary details are inaccessible due to site limitations, done() with what's available is preferred.
Academic papers: Added guidance to fall back to HTML versions (ACL Anthology, Semantic Scholar) when PDFs are inaccessible.
Validation feedback: Added instruction to call done() with best available answer (rather than aborting) when validator feedback cannot be addressed due to genuine site limitations.
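The recovery guidance above lives in prompt text, not in code. Purely as an illustration, it can be read as a small decision table. The names below (`Failure`, `Recovery`, `recoveryFor`) are hypothetical and do not appear in this PR; the step strings paraphrase the bullet points.

```typescript
// Hypothetical failure categories, paraphrased from the prompt guidance.
type Failure =
  | "autocomplete_click_timeout"
  | "click_intercepted"
  | "invalid_element_reference";

// What the agent is advised to do next (illustrative shape only).
interface Recovery {
  useFreshSnapshot: boolean; // true when old ref IDs must be discarded
  steps: string[];           // suggested actions, in order
}

function recoveryFor(failure: Failure): Recovery {
  switch (failure) {
    case "autocomplete_click_timeout":
      // click() on suggestions often times out; focus() + enter() is preferred.
      return { useFreshSnapshot: false, steps: ["focus", "enter"] };
    case "click_intercepted":
      // Fall back to keyboard navigation, or dismiss the overlay with Escape.
      return { useFreshSnapshot: false, steps: ["focus", "Tab/Enter/arrow keys or Escape"] };
    case "invalid_element_reference":
      // The DOM changed: take refs from the next page snapshot, never retry old IDs.
      return { useFreshSnapshot: true, steps: ["re-read refs from next snapshot"] };
  }
}
```

Writing the guidance this way makes one property explicit: only the stale-reference case requires discarding refs, while the two click failures keep the same ref and change the interaction method.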

These changes were validated against the WebVoyagerX partial eval set (30 tasks), improving the pass rate from 60% → 93% (with --max-retries 2).

PR Type

Bug Fix

Related issues

Checklist

I understand the code I am submitting
I have tested this code locally
New and existing tests pass locally (pnpm test)
I have added tests that prove my fix/feature works (if applicable)
Documentation was updated where necessary
I have read and followed the contribution guidelines

AI Usage

No AI was used
AI was used for drafting/refactoring
This is fully AI-generated

Adds support for a list of CDP endpoints tried in sequence when connecting to a remote Chromium browser. If an endpoint fails with a timeout or connection error, the next one is tried automatically. Once all are exhausted, the task fails immediately rather than burning through retry attempts.

New option: --pw-cdp-endpoints (comma-separated; takes precedence over the existing singular --pw-cdp-endpoint, which is unchanged).

Key changes:
- configDefaults.ts: new "string[]" ConfigFieldType; pw_cdp_endpoints field
- config.ts: coerceValue() and addConfigOptions() support for "string[]"
- playwrightBrowser.ts: connectOverCDPWithFailover() loop with isCdpConnectionError() classification; nextStartIndex state for per-start() endpoint cycling; updated pwCdpEndpoint getter returns active endpoint; new pwCdpEndpoints getter; onCdpEndpointCycle callback
- cli/commands/run.ts + server/routes/spark.ts: normalize singular → plural and wire pwCdpEndpoints through
- events.ts: CDP_ENDPOINT_CYCLE event + CdpEndpointCycleEventData
- webAgent.ts: wire onCdpEndpointCycle callback to emit CDP_ENDPOINT_CYCLE; include pwCdpEndpoints in TASK_SETUP event
- secretsRedactor.ts: redact pwCdpEndpoints to ["(redacted)"] to hide both values and array length
- chalkConsole.ts: handle CDP_ENDPOINT_CYCLE event with warning log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
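The failover behaviour described in this commit message can be sketched roughly as follows. This is not the code from playwrightBrowser.ts: `connectWithFailover`, `isConnectionError`, and `AllEndpointsFailedError` are hypothetical names, the connection function is injected so the sketch runs without a browser, and the error classification (a `TimeoutError` name or a few socket error codes in the message) is an assumption.

```typescript
// Hypothetical error raised once every endpoint has been tried.
class AllEndpointsFailedError extends Error {
  constructor(attempts: number) {
    super(`All ${attempts} CDP endpoints failed`);
    this.name = "AllEndpointsFailedError";
  }
}

// Assumed classification: only timeouts and socket-level failures trigger failover.
function isConnectionError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  return err.name === "TimeoutError" || /ECONNREFUSED|ECONNRESET|ENOTFOUND/.test(err.message);
}

// Try each endpoint once, starting at startIndex and wrapping around.
async function connectWithFailover<T>(
  endpoints: string[],
  connect: (endpoint: string) => Promise<T>,
  onCycle?: (failedIndex: number, errorName: string) => void,
  startIndex = 0,
): Promise<{ connection: T; index: number }> {
  for (let i = 0; i < endpoints.length; i++) {
    const index = (startIndex + i) % endpoints.length;
    try {
      return { connection: await connect(endpoints[index]), index };
    } catch (err) {
      // Anything that is not a connectivity problem is rethrown unchanged.
      if (!isConnectionError(err)) throw err;
      // Report the index and error name only, never the endpoint URL.
      onCycle?.(index, (err as Error).name);
    }
  }
  // Exhausted: fail once, so callers do not spend retry attempts on a dead pool.
  throw new AllEndpointsFailedError(endpoints.length);
}
```

In practice `connect` might wrap Playwright's `chromium.connectOverCDP(endpoint, { timeout })`, and the returned `index` could feed the next start position in the spirit of the `nextStartIndex` state mentioned above.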

Phase 1 plan and Phase 2 plan (mid-task disconnect handling, not yet implemented) for the CDP endpoint failover feature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Detects mid-task browser disconnects (TargetClosedError) and automatically reconnects to the next configured CDP endpoint, then restarts task execution from the original starting URL with a fresh message history while preserving the existing plan, success criteria, and action items.

Changes:
- Add BrowserDisconnectedError (extends RecoverableError) for signalling mid-task disconnects
- Detect TargetClosedError in PlaywrightBrowser.performAction, getTreeWithRefs, and getScreenshot; rethrow as BrowserDisconnectedError
- Add WebAgent.handleBrowserDisconnect(): shuts down browser, starts on next endpoint, navigates to starting URL, resets message history
- Intercept BrowserDisconnectedError in runMainLoop before error counters so disconnects don't consume the error budget
- Add CDP_ENDPOINT_CONNECTED and BROWSER_RECONNECTED events; wire onCdpEndpointConnected callback through WebAgent to event system
- Remove direct console.log/warn calls from PlaywrightBrowser; all output now flows through callbacks → events → loggers
- Add endpoint index (never URL) to CDP events for metrics tracking
- Fix double task metrics summary: MetricsCollector now triggers only on TASK_COMPLETED (which always fires via buildResult), not TASK_ABORTED
- Use error.name instead of error.message in CDP_ENDPOINT_CYCLE events to avoid leaking credentials or vendor URLs from Playwright error strings
- Reduce CDP connection timeout from 10s to 5s
- Add cdp: prefix to event shortname stripping in chalkConsole metrics view

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
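The point about intercepting disconnects before the error counters can be sketched as below. This is a simplified stand-in, not the actual runMainLoop: `runLoopSketch`, `LoopLimits`, and `LoopOutcome` are hypothetical names, the error class bodies are assumed, and the `maxReconnects` and `maxSteps` caps are assumptions added so the sketch always terminates.

```typescript
// Stand-ins for the error types named in the commit message; bodies are assumed.
class RecoverableError extends Error {}

class BrowserDisconnectedError extends RecoverableError {
  constructor(message = "browser disconnected") {
    super(message);
    this.name = "BrowserDisconnectedError";
  }
}

interface LoopLimits {
  maxErrors: number;     // ordinary error budget
  maxReconnects: number; // assumed cap on reconnect attempts
  maxSteps: number;      // assumed cap on total iterations
}

interface LoopOutcome {
  completed: boolean;
  errorsCounted: number;
  reconnects: number;
}

async function runLoopSketch(
  step: () => Promise<boolean>,   // resolves true when the task is finished
  reconnect: () => Promise<void>, // e.g. restart on the next endpoint, reset history
  limits: LoopLimits,
): Promise<LoopOutcome> {
  let errorsCounted = 0;
  let reconnects = 0;
  for (let i = 0; i < limits.maxSteps && errorsCounted < limits.maxErrors; i++) {
    try {
      if (await step()) return { completed: true, errorsCounted, reconnects };
    } catch (err) {
      // Matching on name keeps the check independent of how the class is transpiled.
      const disconnected = err instanceof Error && err.name === "BrowserDisconnectedError";
      if (disconnected && reconnects < limits.maxReconnects) {
        reconnects++;
        await reconnect();
        continue; // handled before the counter below, so the budget is untouched
      }
      errorsCounted++;
    }
  }
  return { completed: false, errorsCounted, reconnects };
}
```

With a step sequence of one disconnect, one ordinary failure, and then success, this sketch reports one reconnect and one counted error, which is the separation the commit message describes.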

lmorchard · 2026-02-20T22:48:00Z

Going to close this: I'll have further prompt tweaks in a post-Pilo based PR along with some fixes for evals in general

lmorchard and others added 4 commits February 18, 2026 15:08

docs: add dev session notes for cdp-endpoint-failover

6162a71

Prompt tweaks in an attempt to pass more eval tasks

12b709b

lmorchard force-pushed the evals/partial/lmorchard/partial-eval-updates branch from 835e53d to 12b709b Compare February 19, 2026 01:23

lmorchard closed this Feb 20, 2026

lmorchard wants to merge 4 commits into develop from evals/partial/lmorchard/partial-eval-updates
