Prompt tweaks in an attempt to pass more eval tasks #293
Closed
Adds support for a list of CDP endpoints tried in sequence when connecting to a remote Chromium browser. If an endpoint fails with a timeout or connection error, the next one is tried automatically. Once all are exhausted, the task fails immediately rather than burning through retry attempts.

New option: --pw-cdp-endpoints (comma-separated; takes precedence over the existing singular --pw-cdp-endpoint, which is unchanged).

Key changes:
- configDefaults.ts: new "string[]" ConfigFieldType; pw_cdp_endpoints field
- config.ts: coerceValue() and addConfigOptions() support for "string[]"
- playwrightBrowser.ts: connectOverCDPWithFailover() loop with isCdpConnectionError() classification; nextStartIndex state for per-start() endpoint cycling; updated pwCdpEndpoint getter returns active endpoint; new pwCdpEndpoints getter; onCdpEndpointCycle callback
- cli/commands/run.ts + server/routes/spark.ts: normalize singular → plural and wire pwCdpEndpoints through
- events.ts: CDP_ENDPOINT_CYCLE event + CdpEndpointCycleEventData
- webAgent.ts: wire onCdpEndpointCycle callback to emit CDP_ENDPOINT_CYCLE; include pwCdpEndpoints in TASK_SETUP event
- secretsRedactor.ts: redact pwCdpEndpoints to ["(redacted)"] to hide both values and array length
- chalkConsole.ts: handle CDP_ENDPOINT_CYCLE event with warning log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
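The failover behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from playwrightBrowser.ts: the names `parseEndpoints`, `connectWithFailover`, `CdpEndpointsExhaustedError`, and `CycleInfo` are hypothetical, the error patterns in `isCdpConnectionError` are an assumed classification, and wrapping the start index around the list is a guess at what "per-start() endpoint cycling" means. The connect function is injected so the sketch does not depend on Playwright.

```typescript
// Hypothetical sketch of sequential CDP endpoint failover (names are illustrative).
type Connect<T> = (endpoint: string) => Promise<T>;

interface CycleInfo {
  fromIndex: number; // index of the endpoint that failed (never the URL)
  errorName: string; // error.name only, to avoid leaking URLs from messages
}

class CdpEndpointsExhaustedError extends Error {
  constructor(public readonly lastError: unknown) {
    super("All CDP endpoints failed");
    this.name = "CdpEndpointsExhaustedError";
  }
}

// Split a comma-separated option value into a clean list of endpoints.
function parseEndpoints(raw: string): string[] {
  return raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Assumed classification: timeouts and common socket errors trigger failover.
function isCdpConnectionError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  if (err.name === "TimeoutError") return true;
  return /ECONNREFUSED|ECONNRESET|ENOTFOUND|socket hang up/i.test(err.message);
}

// Try each endpoint once, starting at startIndex and wrapping around.
async function connectWithFailover<T>(
  endpoints: string[],
  connect: Connect<T>,
  startIndex = 0,
  onCycle?: (info: CycleInfo) => void,
): Promise<{ connection: T; index: number }> {
  if (endpoints.length === 0) throw new Error("No CDP endpoints configured");
  let lastError: unknown;
  for (let offset = 0; offset < endpoints.length; offset++) {
    const index = (startIndex + offset) % endpoints.length;
    try {
      const connection = await connect(endpoints[index]);
      return { connection, index };
    } catch (err) {
      // Anything that is not a connection problem is rethrown unchanged.
      if (!isCdpConnectionError(err)) throw err;
      lastError = err;
      onCycle?.({ fromIndex: index, errorName: (err as Error).name });
    }
  }
  // Every endpoint failed: surface one terminal error instead of retrying.
  throw new CdpEndpointsExhaustedError(lastError);
}
```

With Playwright, the injected function could be something like `(url) => chromium.connectOverCDP(url, { timeout: 5000 })`; the timeout value here is only an example.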
Phase 1 plan and Phase 2 plan (mid-task disconnect handling, not yet implemented) for the CDP endpoint failover feature.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Detects mid-task browser disconnects (TargetClosedError) and automatically reconnects to the next configured CDP endpoint, then restarts task execution from the original starting URL with a fresh message history while preserving the existing plan, success criteria, and action items.

Changes:
- Add BrowserDisconnectedError (extends RecoverableError) for signalling mid-task disconnects
- Detect TargetClosedError in PlaywrightBrowser.performAction, getTreeWithRefs, and getScreenshot; rethrow as BrowserDisconnectedError
- Add WebAgent.handleBrowserDisconnect(): shuts down browser, starts on next endpoint, navigates to starting URL, resets message history
- Intercept BrowserDisconnectedError in runMainLoop before error counters so disconnects don't consume the error budget
- Add CDP_ENDPOINT_CONNECTED and BROWSER_RECONNECTED events; wire onCdpEndpointConnected callback through WebAgent to event system
- Remove direct console.log/warn calls from PlaywrightBrowser; all output now flows through callbacks → events → loggers
- Add endpoint index (never URL) to CDP events for metrics tracking
- Fix double task metrics summary: MetricsCollector now triggers only on TASK_COMPLETED (which always fires via buildResult), not TASK_ABORTED
- Use error.name instead of error.message in CDP_ENDPOINT_CYCLE events to avoid leaking credentials or vendor URLs from Playwright error strings
- Reduce CDP connection timeout from 10s to 5s
- Add cdp: prefix to event shortname stripping in chalkConsole metrics view

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
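The "rethrow, then intercept before the error counters" flow from this commit might look roughly like the sketch below. Only the names BrowserDisconnectedError, RecoverableError, and TargetClosedError come from the commit message; `guardDisconnect`, `runLoop`, `LoopHooks`, and the default `isTargetClosed` check (matching on `error.name`) are assumptions made for illustration, and the real runMainLoop is certainly more involved.

```typescript
// Hypothetical sketch: disconnects are handled separately from ordinary errors.
class RecoverableError extends Error {}

class BrowserDisconnectedError extends RecoverableError {
  constructor(message = "Browser disconnected") {
    super(message);
    this.name = "BrowserDisconnectedError";
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, BrowserDisconnectedError.prototype);
  }
}

// Assumption: the closed-target error can be recognised by its name.
const isTargetClosed = (err: unknown): boolean =>
  err instanceof Error && err.name === "TargetClosedError";

// Wrap a browser call and translate a closed target into a disconnect signal.
async function guardDisconnect<T>(
  op: () => Promise<T>,
  detect: (err: unknown) => boolean = isTargetClosed,
): Promise<T> {
  try {
    return await op();
  } catch (err) {
    if (detect(err)) throw new BrowserDisconnectedError();
    throw err;
  }
}

interface LoopHooks {
  step: () => Promise<boolean>; // resolves true once the task is finished
  reconnect: () => Promise<void>; // e.g. restart on the next endpoint, reset history
  maxErrors: number; // error budget for ordinary failures
}

async function runLoop(
  hooks: LoopHooks,
): Promise<{ errors: number; reconnects: number }> {
  let errors = 0;
  let reconnects = 0;
  for (;;) {
    try {
      if (await hooks.step()) return { errors, reconnects };
    } catch (err) {
      if (err instanceof BrowserDisconnectedError) {
        // Checked first, so a disconnect does not consume the error budget.
        // reconnect() is expected to throw once no endpoint is reachable.
        reconnects++;
        await hooks.reconnect();
        continue;
      }
      errors++;
      if (errors >= hooks.maxErrors) throw err;
    }
  }
}
```

Checking for the disconnect before incrementing `errors` is the point of the design: a vendor-side browser drop is treated as an infrastructure event rather than as an agent mistake.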
835e53d to 12b709b
Going to close this: I'll have further prompt tweaks in a post-Pilo-based PR, along with some fixes for evals in general.
Also opening as a draft to run against partial evals; will fix up conflicts later.
Description
Improves agent robustness on dynamic web pages by fixing several prompt instructions that were causing unnecessary failures:
- `click()` on autocomplete suggestions frequently times out; `focus()` + `enter()` is more reliable.
- `focus()` and keyboard navigation (Tab, Enter, arrow keys) or Escape when `click()` fails due to element interception.
- `abort()` should only be used when the core task is impossible; if supplementary details are inaccessible due to site limitations, `done()` with what's available is preferred.
- `done()` with best available answer (rather than aborting) when validator feedback cannot be addressed due to genuine site limitations.

These changes were validated against the WebVoyagerX partial eval set (30 tasks), improving the pass rate from 60% to 93% (with `--max-retries 2`).
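The PR changes prompt wording rather than code, but the fallback order that the prompt asks the agent to follow (try `click()`, then `focus()` plus Enter) can be expressed as a small sketch. The `Activatable` interface and `activate` function are hypothetical; the interface is a structural subset of the `click`, `focus`, and `press` methods on a Playwright `Locator`, and the 3000 ms default timeout is an arbitrary choice for the example.

```typescript
// Hypothetical sketch of the click -> keyboard fallback described above.
interface Activatable {
  click(options?: { timeout?: number }): Promise<void>;
  focus(): Promise<void>;
  press(key: string): Promise<void>;
}

type ActivationMethod = "click" | "focus+enter";

// Try a normal click first; if it times out or is intercepted,
// focus the element and press Enter instead.
async function activate(
  target: Activatable,
  clickTimeoutMs = 3000,
): Promise<ActivationMethod> {
  try {
    await target.click({ timeout: clickTimeoutMs });
    return "click";
  } catch {
    await target.focus();
    await target.press("Enter");
    return "focus+enter";
  }
}
```

Returning which method succeeded makes it possible to log how often the keyboard fallback is actually needed on a given site.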