You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Once the one-shot Codex-parity comparison harness (#2952) lands in v0.8.56, automate it: run a fixed task set on a schedule (or per release tag), persist normalized results as a time series, and flag token/reward regressions between runs.
A runner that executes the suite via the v0.8.56: Build a Codex-parity token comparison harness #2952 harness against a pinned model, appends one row per task per run to a versioned results store (in-repo JSONL or similar), keyed by codewhale version + date.
A report/diff command that compares the latest run against a baseline and flags: reward drops, input-token growth beyond a threshold, output-token growth beyond a threshold.
Optional CI/nightly wiring once the runner is stable locally.
Goal
Once the one-shot Codex-parity comparison harness (#2952) lands in v0.8.56, automate it: run a fixed task set on a schedule (or per release tag), persist normalized results as a time series, and flag token/reward regressions between runs.
Current evidence
benchmark_results/cli-compare-*, but nothing tracks results across versions.benchmark_results/cli-compare-20260609) are point-in-time and easy to lose context on.Scope
prove-plus-comm,cancel-async-tasks,configure-git-webserver,fix-code-vulnerability).Non-goals
Acceptance criteria
Related: #2952, #2953, #2956, #2957, #1177
Deferred to v0.8.57 by design: 'deeper benchmark infrastructure after the first v0.8.56 harness lands'.