Iris's empirical regression net. Runs iris eval against each fixture in fixtures/known-bugs/ and asserts the meta.json expectations — recall (must_find), precision (expected_to_NOT_find), and calibration (expected_score_range).
Behavioral correctness for an LLM-driven tool is empirical. Unit tests verify deterministic logic; cassette tests verify prompt/parse pipelines. The bench is the only thing that catches prompt drift that "looks fine" but actually drops a category of finding.
Run it nightly + on release branches.
Default bench runs use the same Codex App Server transport as normal Iris runs. Cost is therefore tracked from the provider response when available, but the bench no longer switches transports from ambient ANTHROPIC_API_KEY or a claude binary on PATH. Cost budgeting was removed from iris eval; use --timeout to bound wall time.
Use --transport claude-code or --transport api only when you intentionally want those providers for a comparison run.
pnpm buildhas been run (usespackages/cli/dist/bin.js)pnpm exec playwright install chromium(Phase 2 setup)ffmpegon PATH (optional — if missing, sliceEvidence falls back to screenshots)
# Full bench
pnpm bench
# Single fixture
pnpm bench -- --filter empty-form
# Lower per-fixture wall-time cap
pnpm bench -- --timeout 600
# Explicit provider comparison
pnpm bench -- --transport claude-codeFor each fixture's meta.json:
| Field | Assertion |
|---|---|
expected_findings[].must_find: true |
The matching finding (category + severity range + title keywords) appears in report.json. Recall. |
expected_score_range.overall: [min, max] |
report.headline.score falls in the range. Calibration. |
expected_to_NOT_find |
Nothing matching appears in report.json.findings. Precision. |
Plus:
- Per-fixture cost is tracked and logged.
- Total cost printed at the end.
- Exit code 1 if any fixture fails.
| Fixture | Bug | Expected score range |
|---|---|---|
| 01-empty-form-submit | Silent submit, no validation | 3–6 |
| 02-focus-trap-modal | No Esc handler, broken focus | 4–7 |
| 03-broken-export | "CSV" button produces JSON | 4–7 |
| 04-console-noise | 4 console errors per load | 5–8 |
| 05-bad-empty-state | Empty page with no copy | 5–8 |
| 06-keyboard-inaccessible | Critical action on <div> |
3–6 |
| 07-clean-baseline | No bugs (positive control) | 7.5–10 |
| 08-many-small-issues | 12+ copy nits, no majors | 6–8.5 |
When a fixture fails:
- Missing findings (recall): tighten the relevant rubric dimension's
common_signals.negativeto make it more sensitive, or adjust the JUDGE_SYSTEM prompt to look harder for that category. - Score out of band (calibration): adjust
weight_in_overallor dimension weights in the rubric YAML. - False positives (precision): loosen
common_signalsor improve the judge's dedup/discard prompt.
Re-run pnpm bench -- --filter <fixture> after each tweak. Iterate until the bench is consistently green.
- The bench uses real LLM calls, so it is stochastic. Even at
temperature: 0, run-to-run variance can be meaningful. Allow tolerance in score ranges (most fixtures use a 4-point window like[2, 6]). - Bench accuracy depends on rubric quality. The first calibration pass (2026-05-10) shifted ranges based on actual Iris severity behavior — Iris is generally harsher than initial guesses, especially on multi-bug pages.
| Iteration | Pass rate | Notes |
|---|---|---|
| Initial (2026-05-10) | 0/8 | Bench logic too strict + score ranges based on guesses |
| + bench logic fix | 4/8 | Allowed exit code 2 (max_turns) as valid |
| + score range tuning | 5/8 | Tuned 4 ranges based on real Iris output |
| + matcher relaxation | 5/8 | LLM stochasticity — different findings mix per run |
| + probe-nudge prompt | 7/8 | Initial user prompt now mandates console/network/axe probes |
| + final 04 matcher fix | 8/8 ✅ | Accept blocker severity for console finding |
End state: bench passes consistently, costs $4–5, takes ~12 min, all 8 known bug categories surfaced as findings with correct categories and reasonable severities.