v0.8.57: Recurring Codex-parity benchmark runs with regression tracking

## Goal

Once the one-shot Codex-parity comparison harness (#2952) lands in v0.8.56, automate it: run a fixed task set on a schedule (or per release tag), persist normalized results as a time series, and flag token/reward regressions between runs.

## Current evidence

- #2952 builds the single-run harness with normalized JSON/Markdown output under `benchmark_results/cli-compare-*`, but nothing tracks results across versions.
- The v0.8.56 token-reduction work (#2953, #2956, #2957, #2959) needs before/after numbers per release to prove improvements stick and to catch re-inflation of prompt or transcript size in later releases.
- Existing ad hoc runs (e.g. `benchmark_results/cli-compare-20260609`) are point-in-time and easy to lose context on.

## Scope

- Define a small fixed task suite (start with the four tasks already used in #2952: `prove-plus-comm`, `cancel-async-tasks`, `configure-git-webserver`, `fix-code-vulnerability`).
- A runner that executes the suite via the #2952 harness against a pinned model, appends one row per task per run to a versioned results store (in-repo JSONL or similar), keyed by codewhale version + date.
- A report/diff command that compares the latest run against a baseline and flags: reward drops, input-token growth beyond a threshold, output-token growth beyond a threshold.
- Optional CI/nightly wiring once the runner is stable locally.

## Non-goals

- Building the first harness itself (#2952).
- Expanding to large task suites or public leaderboard claims.
- Gating releases on benchmark results (can be considered later once variance is understood).

## Acceptance criteria

- One command runs the fixed suite and appends normalized results keyed by version.
- One command diffs two runs and prints a pass/regress summary with thresholds.
- At least two recorded runs exist demonstrating the diff catching an injected regression (e.g. artificially inflated prompt).

Related: #2952, #2953, #2956, #2957, #1177

Deferred to v0.8.57 by design: 'deeper benchmark infrastructure after the first v0.8.56 harness lands'.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.8.57: Recurring Codex-parity benchmark runs with regression tracking #2962

Goal

Current evidence

Scope

Non-goals

Acceptance criteria

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

v0.8.57: Recurring Codex-parity benchmark runs with regression tracking #2962

Description

Goal

Current evidence

Scope

Non-goals

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions