Skip to content

Latest commit

 

History

History
92 lines (64 loc) · 4.28 KB

File metadata and controls

92 lines (64 loc) · 4.28 KB

Iris bench

Iris's empirical regression net. Runs iris eval against each fixture in fixtures/known-bugs/ and asserts the meta.json expectations — recall (must_find), precision (expected_to_NOT_find), and calibration (expected_score_range).

Why a bench

Behavioral correctness for an LLM-driven tool is empirical. Unit tests verify deterministic logic; cassette tests verify prompt/parse pipelines. The bench is the only thing that catches prompt drift that "looks fine" but actually drops a category of finding.

Run it nightly + on release branches.

Cost

Default bench runs use the same Codex App Server transport as normal Iris runs. Cost is therefore tracked from the provider response when available, but the bench no longer switches transports from ambient ANTHROPIC_API_KEY or a claude binary on PATH. Cost budgeting was removed from iris eval; use --timeout to bound wall time.

Use --transport claude-code or --transport api only when you intentionally want those providers for a comparison run.

Requirements

  • pnpm build has been run (uses packages/cli/dist/bin.js)
  • pnpm exec playwright install chromium (Phase 2 setup)
  • ffmpeg on PATH (optional — if missing, sliceEvidence falls back to screenshots)

Usage

# Full bench
pnpm bench

# Single fixture
pnpm bench -- --filter empty-form

# Lower per-fixture wall-time cap
pnpm bench -- --timeout 600

# Explicit provider comparison
pnpm bench -- --transport claude-code

What gets asserted

For each fixture's meta.json:

Field Assertion
expected_findings[].must_find: true The matching finding (category + severity range + title keywords) appears in report.json. Recall.
expected_score_range.overall: [min, max] report.headline.score falls in the range. Calibration.
expected_to_NOT_find Nothing matching appears in report.json.findings. Precision.

Plus:

  • Per-fixture cost is tracked and logged.
  • Total cost printed at the end.
  • Exit code 1 if any fixture fails.

What's in fixtures/known-bugs/

Fixture Bug Expected score range
01-empty-form-submit Silent submit, no validation 3–6
02-focus-trap-modal No Esc handler, broken focus 4–7
03-broken-export "CSV" button produces JSON 4–7
04-console-noise 4 console errors per load 5–8
05-bad-empty-state Empty page with no copy 5–8
06-keyboard-inaccessible Critical action on <div> 3–6
07-clean-baseline No bugs (positive control) 7.5–10
08-many-small-issues 12+ copy nits, no majors 6–8.5

Tuning rubrics from bench output

When a fixture fails:

  • Missing findings (recall): tighten the relevant rubric dimension's common_signals.negative to make it more sensitive, or adjust the JUDGE_SYSTEM prompt to look harder for that category.
  • Score out of band (calibration): adjust weight_in_overall or dimension weights in the rubric YAML.
  • False positives (precision): loosen common_signals or improve the judge's dedup/discard prompt.

Re-run pnpm bench -- --filter <fixture> after each tweak. Iterate until the bench is consistently green.

Limits

  • The bench uses real LLM calls, so it is stochastic. Even at temperature: 0, run-to-run variance can be meaningful. Allow tolerance in score ranges (most fixtures use a 4-point window like [2, 6]).
  • Bench accuracy depends on rubric quality. The first calibration pass (2026-05-10) shifted ranges based on actual Iris severity behavior — Iris is generally harsher than initial guesses, especially on multi-bug pages.

Validation history

Iteration Pass rate Notes
Initial (2026-05-10) 0/8 Bench logic too strict + score ranges based on guesses
+ bench logic fix 4/8 Allowed exit code 2 (max_turns) as valid
+ score range tuning 5/8 Tuned 4 ranges based on real Iris output
+ matcher relaxation 5/8 LLM stochasticity — different findings mix per run
+ probe-nudge prompt 7/8 Initial user prompt now mandates console/network/axe probes
+ final 04 matcher fix 8/8 Accept blocker severity for console finding

End state: bench passes consistently, costs $4–5, takes ~12 min, all 8 known bug categories surfaced as findings with correct categories and reasonable severities.