v0.8.57: Recurring Codex-parity benchmark runs with regression tracking #2962

@Hmbown

Goal

Once the one-shot Codex-parity comparison harness (#2952) lands in v0.8.56, automate it: run a fixed task set on a schedule (or per release tag), persist normalized results as a time series, and flag token/reward regressions between runs.

Current evidence

Scope

  • Define a small fixed task suite (start with the four tasks already used in #2952: prove-plus-comm, cancel-async-tasks, configure-git-webserver, fix-code-vulnerability).
  • A runner that executes the suite via the #2952 harness against a pinned model and appends one row per task per run to a versioned results store (in-repo JSONL or similar), keyed by codewhale version + date.
  • A report/diff command that compares the latest run against a baseline and flags: reward drops, input-token growth beyond a threshold, output-token growth beyond a threshold.
  • Optional CI/nightly wiring once the runner is stable locally.
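The results store described in the scope could look roughly like the sketch below. It is only an illustration: the `ResultRow` field names, the `append_rows`/`load_rows` helpers, and the file layout are assumptions, not the output format of the #2952 harness.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


# Hypothetical row schema: one row per task per run.
# Field names are assumptions, not what the #2952 harness emits.
@dataclass
class ResultRow:
    codewhale_version: str
    run_date: str  # ISO date, e.g. "2025-01-31"
    model: str     # the pinned model identifier
    task: str
    reward: float
    input_tokens: int
    output_tokens: int


def append_rows(store: Path, rows: list[ResultRow]) -> None:
    """Append rows to a JSONL file, one JSON object per line."""
    store.parent.mkdir(parents=True, exist_ok=True)
    with store.open("a", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(asdict(row), sort_keys=True) + "\n")


def load_rows(store: Path) -> list[ResultRow]:
    """Read every non-empty line of the store back into ResultRow objects."""
    with store.open(encoding="utf-8") as fh:
        return [ResultRow(**json.loads(line)) for line in fh if line.strip()]
```

Append-only JSONL keeps each run as a small, reviewable diff in the repo, and grouping loaded rows by `codewhale_version` and `run_date` gives the time series.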

Non-goals

Acceptance criteria

  • One command runs the fixed suite and appends normalized results keyed by version.
  • One command diffs two runs and prints a pass/regress summary with thresholds.
  • At least two recorded runs exist demonstrating the diff catching an injected regression (e.g. artificially inflated prompt).
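A minimal sketch of the diff step follows, assuming each run has already been reduced to one result per task. The `TaskResult` and `Thresholds` types, the default 10% growth limits, and the function names are all hypothetical; the real thresholds are still to be decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    reward: float
    input_tokens: int
    output_tokens: int


# Hypothetical defaults: any reward drop regresses, token growth over 10% regresses.
@dataclass(frozen=True)
class Thresholds:
    max_reward_drop: float = 0.0
    max_input_growth: float = 0.10
    max_output_growth: float = 0.10


def growth(baseline: int, latest: int) -> float:
    """Relative growth of latest over baseline; a zero baseline is handled explicitly."""
    if baseline == 0:
        return float("inf") if latest > 0 else 0.0
    return (latest - baseline) / baseline


def diff_runs(
    baseline: dict[str, TaskResult],
    latest: dict[str, TaskResult],
    limits: Thresholds = Thresholds(),
) -> dict[str, list[str]]:
    """Map each baseline task to its regression flags; an empty list means pass."""
    report: dict[str, list[str]] = {}
    for task, base in baseline.items():
        new = latest.get(task)
        if new is None:
            report[task] = ["missing from latest run"]
            continue
        flags = []
        if base.reward - new.reward > limits.max_reward_drop:
            flags.append("reward drop")
        if growth(base.input_tokens, new.input_tokens) > limits.max_input_growth:
            flags.append("input-token growth")
        if growth(base.output_tokens, new.output_tokens) > limits.max_output_growth:
            flags.append("output-token growth")
        report[task] = flags
    return report


def summarize(report: dict[str, list[str]]) -> str:
    """Render one pass/regress line per task."""
    lines = []
    for task, flags in sorted(report.items()):
        status = "REGRESS (" + ", ".join(flags) + ")" if flags else "pass"
        lines.append(f"{task}: {status}")
    return "\n".join(lines)
```

An injected regression such as an artificially inflated prompt would show up as an `input-token growth` flag on the affected task while the other tasks stay at `pass`.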

Related: #2952, #2953, #2956, #2957, #1177

Deferred to v0.8.57 by design: 'deeper benchmark infrastructure after the first v0.8.56 harness lands'.

Metadata

Assignees: No one assigned
Labels: enhancement
Status: Backlog