Add 5 reform-API regression eval cases (closes #82) by vahid-ahmadi · Pull Request #90 · PolicyEngine/policyengine-uk-chat

vahid-ahmadi · 2026-05-29T09:11:01Z

Summary

Adds 5 Test B scenarios (b6-b10) under evals/scenarios/ that catch silent reform-expressibility regressions — directly motivated by the 2026-05-28 live test where the agent could not express a basic-rate +1pp reform and produced no eval signal. Closes #82.

Path chosen: A (extend #52)

PR #52 (feat/eval-harness) is OPEN, mergeable, and structurally complete (SPEC, scenarios, runner, fixture builder, grader, first results writeup). The right home for #82's cases is the existing harness — this PR branches off feat/eval-harness and adds 5 new scenario YAMLs in the harness's existing shape.

Base branch: feat/eval-harness (rebase onto main once #52 merges).

What the cases assert

Per #82's constraint: direction + order-of-magnitude only — no exact-number pinning. Tax microsim numbers shift with dataset version and parameter year; exact pinning would create flaky tests.

ID	Reform	Direction	Magnitude band
`b6_basic_rate_plus_1pp`	basic rate 20%→21%	positive (revenue raise)	~£5-10bn (midpoint £7.5bn, ±50%)
`b7_pa_to_15k`	PA £12,570→£15,000	negative (cost)	~£15-25bn (midpoint -£20bn, ±50%)
`b8_ni_primary_threshold_plus_1k`	NI primary threshold +£1k	negative (cost)	~£2-5bn (midpoint -£3.5bn, ±60%)
`b9_child_benefit_uprate_10pct`	Child Benefit +10%	negative (cost)	~£1-2bn (midpoint -£1.3bn, ±70%)
`b10_two_band_collapse`	basic + higher → single 25%	unpinned (structural)	magnitude-only check
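
The band arithmetic behind the table can be sketched as follows. This is a minimal illustration, not the harness's grader: `band`, `within_band`, and `same_direction` are hypothetical helper names, and it assumes the tolerance is applied symmetrically around the signed midpoint (values in £bn).

```python
def band(expected_approx: float, tolerance_pct: float) -> tuple[float, float]:
    """Return (low, high) bounds for a signed expected value and a +/- percentage."""
    delta = abs(expected_approx) * tolerance_pct / 100.0
    return expected_approx - delta, expected_approx + delta


def within_band(extracted: float, expected_approx: float, tolerance_pct: float) -> bool:
    """True when the extracted value falls inside the tolerance band."""
    low, high = band(expected_approx, tolerance_pct)
    return low <= extracted <= high


def same_direction(extracted: float, expected_approx: float) -> bool:
    """True when both values share a sign (revenue raise vs. cost)."""
    return extracted * expected_approx > 0


# b6: midpoint +7.5 with a 50% tolerance
b6_low, b6_high = band(7.5, 50)
# b7: midpoint -20 with a 50% tolerance
b7_low, b7_high = band(-20.0, 50)
```

Under these assumptions the b6 band works out to £3.75bn to £11.25bn, somewhat wider than the ~£5-10bn headline range, and a sign-flipped result such as -£7.5bn for b6 fails both checks.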

Each case also uses anchor.must_mention to assert the chat's prose names the reform parameters (basic rate, personal allowance, child benefit, etc.) and anchor.must_not_say to catch sign-flip regressions ("revenue rise" on a costing reform, etc.).
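
The anchor checks can be pictured with a short sketch. It assumes case-insensitive substring matching, which may not be how the harness actually matches; `check_anchors` is a hypothetical name and the sample prose is made up for illustration.

```python
def check_anchors(prose: str, must_mention: list[str], must_not_say: list[str]) -> dict:
    """Report required phrases that are missing and forbidden phrases that appear."""
    text = prose.lower()
    missing = [term for term in must_mention if term.lower() not in text]
    forbidden = [term for term in must_not_say if term.lower() in text]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


# Illustrative b7-style check: a costing reform must not be described as a revenue rise.
result = check_anchors(
    "Raising the Personal Allowance to £15,000 would cost the Exchequer.",
    must_mention=["personal allowance"],
    must_not_say=["revenue rise"],
)
```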

How it slots into the harness

The grader (evals/runner/grade.py:474) already supports expected_approx in fields_to_compare as a fixture-free alternative path — when reference.fixture is omitted, the diff uses the inline expected value. No grader changes needed; the new cases use existing field paths (budget.budgetary_impact, budget.tax_revenue_impact, budget.benefit_spending_impact) that the extractor's FIELD_LABELS already covers.
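
A rough sketch of the fixture-free path described above. The real logic lives in evals/runner/grade.py and may differ: `resolve_expected`, `diff_scalar`, the `path` key, and the idea that a fixture is a flat mapping keyed by field path are all assumptions made for illustration.

```python
from typing import Optional


def resolve_expected(field_spec: dict, fixture: Optional[dict]) -> Optional[float]:
    """Use the fixture value when a fixture is supplied, else the inline expected_approx."""
    if fixture is not None:
        return fixture.get(field_spec["path"])
    return field_spec.get("expected_approx")


def diff_scalar(extracted: float, expected: Optional[float]) -> dict:
    """Build a diff entry; skip the percentage when expected is missing or zero."""
    entry = {"extracted": extracted}
    if expected is not None and expected != 0:
        entry["expected"] = expected
        entry["pct_diff"] = (extracted - expected) / abs(expected) * 100.0
    return entry


# No fixture supplied, so the inline value is used.
spec = {"path": "budget.budgetary_impact", "expected_approx": 7.5}
entry = diff_scalar(9.0, resolve_expected(spec, fixture=None))
```

Guarding on a non-zero expected value avoids a division by zero and leaves an entry that records only the extracted number.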

Each case pins:

dataset: enhanced_frs_2023_24
parameter_year: 2025
chat_settings.model_backend: uk_python
chat_settings.num_runs: 3
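
A small sketch of how the pinned settings could be checked once a scenario is loaded into a dict. It assumes the dotted names above map to nested YAML keys; `get_path` and `unpinned` are hypothetical helpers, not part of the harness.

```python
PINNED = {
    "dataset": "enhanced_frs_2023_24",
    "parameter_year": 2025,
    "chat_settings.model_backend": "uk_python",
    "chat_settings.num_runs": 3,
}


def get_path(data: dict, dotted: str):
    """Walk a nested dict along a dotted path; return None if any key is absent."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def unpinned(scenario: dict) -> list[str]:
    """List the pinned paths whose value is missing or differs from the expected pin."""
    return [path for path, want in PINNED.items() if get_path(scenario, path) != want]


scenario = {
    "dataset": "enhanced_frs_2023_24",
    "parameter_year": 2025,
    "chat_settings": {"model_backend": "uk_python", "num_runs": 3},
}
```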

evals/README.md is updated with a "Reform-API regression suite" section explaining the two reference shapes (fixture-backed vs expected_approx-only).

What's deliberately out of scope

Not running the cases live (per #82's "cheap to run" constraint; CI integration is a separate concern).
No changes to chatbot.py or agent_tools.py (per constraint).
No grader changes — uses existing expected_approx plumbing.
No exact-number pinning — direction + order-of-magnitude only.

Test plan

Wait for #52 to merge, then rebase this PR onto main.
After rebase, runner picks up the 5 new YAMLs automatically (no registration needed).
First eval pass produces B_results.md entries for b6-b10; assertions should pass when the agent successfully expresses each reform via the PolicyEngine UK API.
Verify YAML parses (all five files, including b10): python -c "import yaml; from pathlib import Path; d = Path('evals/scenarios'); [yaml.safe_load(p.read_text()) for pat in ('b[6-9]_*.yaml', 'b10_*.yaml') for p in d.glob(pat)]" ✅ (done)
Verify b10 (the expected_approx: 0 magnitude-only case) doesn't crash _diff_scalar — the grader's expected != 0 guard handles it; diff entry shows extracted-only as intended.
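
A note on glob coverage for the parse check: a character class such as `[6-9]` matches exactly one character, so a pattern like `b[6-9]_*.yaml` on its own would skip a two-digit ID such as b10. The sketch below shows this with `fnmatch`, assuming each scenario file is named `<id>.yaml` (the actual file names are not confirmed here).

```python
from fnmatch import fnmatch

# Assumed file names, derived from the scenario IDs in the table.
NAMES = [
    "b6_basic_rate_plus_1pp.yaml",
    "b7_pa_to_15k.yaml",
    "b8_ni_primary_threshold_plus_1k.yaml",
    "b9_child_benefit_uprate_10pct.yaml",
    "b10_two_band_collapse.yaml",
]


def matched(names: list[str], patterns: list[str]) -> list[str]:
    """Return the names that match at least one glob pattern."""
    return [name for name in names if any(fnmatch(name, pat) for pat in patterns)]


single_pattern = matched(NAMES, ["b[6-9]_*.yaml"])
both_patterns = matched(NAMES, ["b[6-9]_*.yaml", "b10_*.yaml"])
```

With the single pattern four files match; adding `b10_*.yaml` brings the count to five.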

🤖 Generated with Claude Code

Adds 5 Test B scenarios (b6-b10) that catch silent reform-expressibility regressions like the 2026-05-28 basic-rate +1pp failure. Each pins dataset (enhanced_frs_2023_24) and parameter_year (2025), uses expected_approx with wide tolerance_pct for direction + order-of-magnitude assertions only (no fixture files, no exact-number pinning), and uses anchor.must_mention to assert the chat's prose answer references the reform parameters by name.

Branched off feat/eval-harness (PR #52) since #82 explicitly depends on the harness landing. README updated to document the two reference shapes (fixture-backed vs expected_approx-only) and the regression sub-suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-29T09:11:07Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policyengine-uk-chat	Ready	Preview, Comment	May 29, 2026 9:11am

github-actions · 2026-05-29T09:11:32Z

Beta preview is ready.

Frontend: open preview
Backend: open backend

vercel Bot deployed to Preview May 29, 2026 09:11 View deployment

vahid-ahmadi wants to merge 1 commit into feat/eval-harness from feat/reform-api-eval-cases
