Skip to content

Add 5 reform-API regression eval cases (closes #82)#90

Open
vahid-ahmadi wants to merge 1 commit into
feat/eval-harnessfrom
feat/reform-api-eval-cases
Open

Add 5 reform-API regression eval cases (closes #82)#90
vahid-ahmadi wants to merge 1 commit into
feat/eval-harnessfrom
feat/reform-api-eval-cases

Conversation

@vahid-ahmadi
Copy link
Copy Markdown
Collaborator

Summary

Adds 5 Test B scenarios (b6-b10) under evals/scenarios/ that catch silent reform-expressibility regressions — directly motivated by the 2026-05-28 live test where the agent could not express a basic-rate +1pp reform and produced no eval signal. Closes #82.

Path chosen: A (extend #52)

PR #52 (feat/eval-harness) is OPEN, mergeable, and structurally complete (SPEC, scenarios, runner, fixture builder, grader, first results writeup). The right home for #82's cases is the existing harness — this PR branches off feat/eval-harness and adds 5 new scenario YAMLs in the harness's existing shape.

Base branch: feat/eval-harness (rebase onto main once #52 merges).

What the cases assert

Per #82's constraint: direction + order-of-magnitude only — no exact-number pinning. Tax microsim numbers shift with dataset version and parameter year; exact pinning would create flaky tests.

ID Reform Direction Magnitude band
b6_basic_rate_plus_1pp basic rate 20%→21% positive (revenue raise) ~£5-10bn (midpoint £7.5bn, ±50%)
b7_pa_to_15k PA £12,570→£15,000 negative (cost) ~£15-25bn (midpoint -£20bn, ±50%)
b8_ni_primary_threshold_plus_1k NI primary threshold +£1k negative (cost) ~£2-5bn (midpoint -£3.5bn, ±60%)
b9_child_benefit_uprate_10pct Child Benefit +10% negative (cost) ~£1-2bn (midpoint -£1.3bn, ±70%)
b10_two_band_collapse basic + higher → single 25% unpinned (structural) magnitude-only check

Each case also uses anchor.must_mention to assert the chat's prose names the reform parameters (basic rate, personal allowance, child benefit, etc.) and anchor.must_not_say to catch sign-flip regressions ("revenue rise" on a costing reform, etc.).

How it slots into the harness

The grader (evals/runner/grade.py:474) already supports expected_approx in fields_to_compare as a fixture-free alternative path — when reference.fixture is omitted, the diff uses the inline expected value. No grader changes needed; the new cases use existing field paths (budget.budgetary_impact, budget.tax_revenue_impact, budget.benefit_spending_impact) that the extractor's FIELD_LABELS already covers.

Each case pins:

  • dataset: enhanced_frs_2023_24
  • parameter_year: 2025
  • chat_settings.model_backend: uk_python
  • chat_settings.num_runs: 3

evals/README.md is updated with a "Reform-API regression suite" section explaining the two reference shapes (fixture-backed vs expected_approx-only).

What's deliberately out of scope

Test plan

  • Wait for Add eval harness scaffold: spec, scenarios, fixtures dir #52 to merge, then this PR rebases onto main.
  • After rebase, runner picks up the 5 new YAMLs automatically (no registration needed).
  • First eval pass produces B_results.md entries for b6-b10; assertions should pass when the agent successfully expresses each reform via the PolicyEngine UK API.
  • Verify YAML parses: python -c "import yaml; from pathlib import Path; [yaml.safe_load(p.read_text()) for p in Path('evals/scenarios').glob('b[6-9]_*.yaml')]" ✅ (done)
  • Verify b10 (the expected_approx: 0 magnitude-only case) doesn't crash _diff_scalar — the grader's expected != 0 guard handles it; diff entry shows extracted-only as intended.

🤖 Generated with Claude Code

Adds 5 Test B scenarios (b6-b10) that catch silent reform-expressibility
regressions like the 2026-05-28 basic-rate +1pp failure. Each pins dataset
(enhanced_frs_2023_24) and parameter_year (2025), uses expected_approx with
wide tolerance_pct for direction + order-of-magnitude assertions only — no
fixture files, no exact-number pinning — and uses anchor.must_mention to
assert the chat's prose answer references the reform parameters by name.

Branched off feat/eval-harness (PR #52) since #82 explicitly depends on the
harness landing. README updated to document the two reference shapes
(fixture-backed vs expected_approx-only) and the regression sub-suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented May 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
policyengine-uk-chat Ready Ready Preview, Comment May 29, 2026 9:11am

Request Review

@github-actions
Copy link
Copy Markdown

Beta preview is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant