Iris bench

Iris's empirical regression net. Runs iris eval against each fixture in fixtures/known-bugs/ and asserts the meta.json expectations — recall (must_find), precision (expected_to_NOT_find), and calibration (expected_score_range).

Why a bench

Behavioral correctness for an LLM-driven tool is empirical. Unit tests verify deterministic logic; cassette tests verify prompt/parse pipelines. The bench is the only thing that catches prompt drift that "looks fine" but actually drops a category of finding.

Run it nightly + on release branches.

Cost

Default bench runs use the same Codex App Server transport as normal Iris runs. Cost is therefore tracked from the provider response when available, but the bench no longer switches transports from ambient ANTHROPIC_API_KEY or a claude binary on PATH. Cost budgeting was removed from iris eval; use --timeout to bound wall time.

Use --transport claude-code or --transport api only when you intentionally want those providers for a comparison run.

Requirements

pnpm build has been run (uses packages/cli/dist/bin.js)
pnpm exec playwright install chromium (Phase 2 setup)
ffmpeg on PATH (optional — if missing, sliceEvidence falls back to screenshots)

Usage

# Full bench
pnpm bench

# Single fixture
pnpm bench -- --filter empty-form

# Lower per-fixture wall-time cap
pnpm bench -- --timeout 600

# Explicit provider comparison
pnpm bench -- --transport claude-code

What gets asserted

For each fixture's meta.json:

Field	Assertion
`expected_findings[].must_find: true`	The matching finding (category + severity range + title keywords) appears in `report.json`. Recall.
`expected_score_range.overall: [min, max]`	`report.headline.score` falls in the range. Calibration.
`expected_to_NOT_find`	Nothing matching appears in `report.json.findings`. Precision.

Plus:

Per-fixture cost is tracked and logged.
Total cost printed at the end.
Exit code 1 if any fixture fails.

What's in fixtures/known-bugs/

Fixture	Bug	Expected score range
01-empty-form-submit	Silent submit, no validation	3–6
02-focus-trap-modal	No Esc handler, broken focus	4–7
03-broken-export	"CSV" button produces JSON	4–7
04-console-noise	4 console errors per load	5–8
05-bad-empty-state	Empty page with no copy	5–8
06-keyboard-inaccessible	Critical action on `<div>`	3–6
07-clean-baseline	No bugs (positive control)	7.5–10
08-many-small-issues	12+ copy nits, no majors	6–8.5

Tuning rubrics from bench output

When a fixture fails:

Missing findings (recall): tighten the relevant rubric dimension's common_signals.negative to make it more sensitive, or adjust the JUDGE_SYSTEM prompt to look harder for that category.
Score out of band (calibration): adjust weight_in_overall or dimension weights in the rubric YAML.
False positives (precision): loosen common_signals or improve the judge's dedup/discard prompt.

Re-run pnpm bench -- --filter <fixture> after each tweak. Iterate until the bench is consistently green.

Limits

The bench uses real LLM calls, so it is stochastic. Even at temperature: 0, run-to-run variance can be meaningful. Allow tolerance in score ranges (most fixtures use a 4-point window like [2, 6]).
Bench accuracy depends on rubric quality. The first calibration pass (2026-05-10) shifted ranges based on actual Iris severity behavior — Iris is generally harsher than initial guesses, especially on multi-bug pages.

Validation history

Iteration	Pass rate	Notes
Initial (2026-05-10)	0/8	Bench logic too strict + score ranges based on guesses
+ bench logic fix	4/8	Allowed exit code 2 (max_turns) as valid
+ score range tuning	5/8	Tuned 4 ranges based on real Iris output
+ matcher relaxation	5/8	LLM stochasticity — different findings mix per run
+ probe-nudge prompt	7/8	Initial user prompt now mandates console/network/axe probes
+ final 04 matcher fix	8/8 ✅	Accept blocker severity for console finding

End state: bench passes consistently, costs $4–5, takes ~12 min, all 8 known bug categories surfaced as findings with correct categories and reasonable severities.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Iris bench

Why a bench

Cost

Requirements

Usage

What gets asserted

What's in fixtures/known-bugs/

Tuning rubrics from bench output

Limits

Validation history

FilesExpand file tree

bench.md

Latest commit

History

bench.md

File metadata and controls

Iris bench

Why a bench

Cost

Requirements

Usage

What gets asserted

What's in fixtures/known-bugs/

Tuning rubrics from bench output

Limits

Validation history