From f049829ccfa2be2b323bd610f5da42cf4eb8dca6 Mon Sep 17 00:00:00 2001
From: Vahid Ahmadi
Date: Fri, 29 May 2026 11:10:28 +0200
Subject: [PATCH] Add reform-API regression eval cases (closes #82)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds 5 Test B scenarios (b6-b10) that catch silent reform-expressibility
regressions like the 2026-05-28 basic-rate +1pp failure. Each pins dataset
(enhanced_frs_2023_24) and parameter_year (2025), uses expected_approx with
wide tolerance_pct for direction + order-of-magnitude assertions only — no
fixture files, no exact-number pinning — and uses anchor.must_mention to
assert the chat's prose answer references the reform parameters by name.

Branched off feat/eval-harness (PR #52) since #82 explicitly depends on the
harness landing. README updated to document the two reference shapes
(fixture-backed vs expected_approx-only) and the regression sub-suite.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 evals/README.md                                    | 27 ++++++-
 evals/scenarios/b10_two_band_collapse.yaml         | 77 +++++++++++++++++++
 evals/scenarios/b6_basic_rate_plus_1pp.yaml        | 76 ++++++++++++++++++
 evals/scenarios/b7_pa_to_15k.yaml                  | 60 +++++++++++++++
 .../b8_ni_primary_threshold_plus_1k.yaml           | 58 ++++++++++++++
 .../b9_child_benefit_uprate_10pct.yaml             | 64 +++++++++++++++
 6 files changed, 360 insertions(+), 2 deletions(-)
 create mode 100644 evals/scenarios/b10_two_band_collapse.yaml
 create mode 100644 evals/scenarios/b6_basic_rate_plus_1pp.yaml
 create mode 100644 evals/scenarios/b7_pa_to_15k.yaml
 create mode 100644 evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml
 create mode 100644 evals/scenarios/b9_child_benefit_uprate_10pct.yaml

diff --git a/evals/README.md b/evals/README.md
index 5aa500e..eaaec9f 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -12,7 +12,9 @@ evals/
   README.md                  ← this file
   scenarios/
     a1_*.yaml                ← Test A scenarios (supplement)
-    b1_*.yaml                ← Test B scenarios (alternative)
+    b1_*.yaml — b5_*.yaml    ← Test B scenarios (alternative — fixture-backed)
+    b6_*.yaml — b10_*.yaml   ← Test B reform-API regression catchers (issue #82,
+                               direction + order-of-magnitude, no fixtures)
   fixtures/
     pe_api/                  ← reference PE-API responses for Test B
                              ← populated by a follow-up PR
@@ -85,4 +87,25 @@ Pre-committed in SPEC.md so we don't rationalize ambiguous results later:
 
 Add a YAML file. That's it — the runner picks it up automatically. Keep the `id` short and stable (it appears in filenames and reports).
 
-When adding a Test B scenario, also add the reference fixture under `fixtures/pe_api/` and reference it via the relative path in the scenario YAML.
+When adding a Test B scenario, you have two options for the reference:
+
+1. **Fixture-backed** (preferred for replication tests, e.g. b1-b4): add the
+   reference JSON under `fixtures/pe_api/` and reference it via
+   `reference.fixture` in the YAML.
+2. **`expected_approx` inline** (for regression catchers — see b6-b10 below,
+   filed against [#82](https://github.com/PolicyEngine/policyengine-uk-chat/issues/82)):
+   omit `reference.fixture` and put `expected_approx` directly on each entry
+   in `fields_to_compare`. Use a wide `tolerance_pct` (50-70%) so the assertion
+   encodes *direction + order-of-magnitude* rather than exact-number pinning.
+   Tax microsim numbers shift with dataset version and parameter year; exact
+   pinning creates flaky tests.
+
+### Reform-API regression suite (b6-b10, issue #82)
+
+A small sub-suite of B scenarios exists specifically to catch silent
+regressions in reform expressibility — the failure mode where the agent
+silently fails to apply a reform and produces no eval signal. Each pins
+`dataset` and `parameter_year` at the YAML top level, uses `expected_approx`
+inline (no fixture), and relies on the anchor's `must_mention` to assert the
+reform parameters appear in the chat's prose answer. Direction and
+order-of-magnitude only — no exact-number pinning.
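
Reviewer note on the README's `tolerance_pct` guidance: a wide percentage tolerance around a signed `expected_approx` can be read as an acceptance band, which is why a tolerance below 100% also rejects a sign flip. The sketch below illustrates that reading only. The names `tolerance_band` and `within_band` are hypothetical and are not taken from the eval harness, whose actual comparison logic may differ.

```python
def tolerance_band(expected_approx: float, tolerance_pct: float) -> tuple[float, float]:
    """Return the (low, high) acceptance band implied by a percentage tolerance.

    Hypothetical helper for illustration; not part of the eval harness.
    """
    half_width = abs(expected_approx) * tolerance_pct / 100.0
    return expected_approx - half_width, expected_approx + half_width


def within_band(extracted: float, expected_approx: float, tolerance_pct: float) -> bool:
    """True when the extracted value falls inside the acceptance band."""
    low, high = tolerance_band(expected_approx, tolerance_pct)
    return low <= extracted <= high


if __name__ == "__main__":
    # b6-style assertion: midpoint £7.5bn with a 50% tolerance.
    print(tolerance_band(7_500_000_000, 50.0))  # (3750000000.0, 11250000000.0)
    # A sign flip of the same magnitude falls outside the band.
    print(within_band(-7_500_000_000, 7_500_000_000, 50.0))  # False
```

Under this reading, any tolerance under 100% keeps zero and all opposite-sign values outside the band, so the sign of `expected_approx` carries the direction assertion.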
diff --git a/evals/scenarios/b10_two_band_collapse.yaml b/evals/scenarios/b10_two_band_collapse.yaml
new file mode 100644
index 0000000..aaeb1f7
--- /dev/null
+++ b/evals/scenarios/b10_two_band_collapse.yaml
@@ -0,0 +1,77 @@
+id: b10_two_band_collapse
+test: B
+title: "Collapse basic + higher rates into a single 25% band — reform-API regression (#82)"
+what_it_tests: |
+  Structural reform — not a single parameter tweak — that asks the agent to
+  apply two changes at once (lift basic to 25%, drop higher to 25%) and
+  describe the result. The revenue sign is *deliberately not pinned* because
+  the net direction depends on the dataset and is genuinely ambiguous; what
+  matters is that the chat (a) mentions both bands by name, (b) understands
+  it's a structural reform rather than a single-parameter change, and (c)
+  produces *a* numeric budgetary impact (not silence or refusal).
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform:
+  collapse the income tax basic rate (currently 20%) and higher rate
+  (currently 40%) into a single 25% band that covers all income from the
+  personal allowance up to the additional-rate threshold (£125,140). Leave
+  the additional rate (45% above £125,140) untouched.
+
+  Use the Enhanced FRS 2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £ (state direction explicitly — gain or cost)
+  (2) average household income change in £ for each decile (1-10)
+  (3) which decile is the biggest absolute loser and biggest absolute gainer
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction is NOT pinned for this scenario — a structural reform of this
+  # shape can come out either sign depending on dataset composition. The
+  # `expected_approx` value of 0 with a very wide percentage tolerance is
+  # effectively a *magnitude* check only: the field must be extractable and
+  # in the £-billion range. Sign is asserted via the anchor's
+  # `must_mention` ("either revenue raise or revenue cost", forcing the
+  # chat to commit to one direction in prose).
+  #
+  # NOTE: tolerance_pct against expected_approx=0 produces div-by-zero
+  # gracefully (the grader's `_diff_scalar` checks `expected != 0`); the
+  # diff entry will show extracted-only and the chat answer's must_mention
+  # block does the sign-direction work instead. This is intentional.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: 0
+      tolerance_pct: 100.0
+
+anchor:
+  must_mention:
+    - "basic rate"
+    - "higher rate"
+    - "25%"
+  must_not_say:
+    - "the basic rate stays the same"  # it changes from 20% to 25%
+    - "the higher rate stays the same"  # it changes from 40% to 25%
+  ideal_finding: |
+    Collapsing basic (20%) and higher (40%) into a single 25% band is a
+    structural reform with offsetting effects: basic-rate taxpayers (the bulk
+    of D5-D8) pay more, higher-rate taxpayers (D9-D10) pay less. The net
+    budgetary direction depends on which population is larger by tax base in
+    the dataset. The chat must (a) commit to a numeric budgetary impact with
+    an explicit sign, (b) describe the distributional split across deciles
+    (mid-deciles lose, top deciles gain), and (c) reference both bands by
+    name rather than treating this as a single-parameter change.
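
Reviewer note on the `expected_approx: 0` case in b10: the scenario comment says the grader's `_diff_scalar` checks `expected != 0` before computing a percentage difference. A minimal sketch of what such a guard might look like follows. Only the `expected != 0` check comes from the comment; the function name `diff_scalar_sketch`, the returned keys, and the use of `None` for the skipped comparison are assumptions made for illustration and are not the grader's real implementation.

```python
from typing import Optional


def diff_scalar_sketch(extracted: float, expected: float, tolerance_pct: float) -> dict:
    """Compare one extracted scalar against an approximate expectation.

    Hypothetical stand-in for the grader's `_diff_scalar`; the dict layout
    here is an assumption, not the harness's actual output format.
    """
    pct_diff: Optional[float] = None
    within: Optional[bool] = None
    if expected != 0:
        # A percentage difference is only defined for a non-zero expectation.
        pct_diff = abs(extracted - expected) / abs(expected) * 100.0
        within = pct_diff <= tolerance_pct
    # With expected == 0 the entry carries the extracted value only.
    return {
        "extracted": extracted,
        "expected": expected,
        "pct_diff": pct_diff,
        "within_tolerance": within,
    }


if __name__ == "__main__":
    print(diff_scalar_sketch(-4_000_000_000, 0, 100.0))
    print(diff_scalar_sketch(9_000_000_000, 7_500_000_000, 50.0))
```

If the guard works this way, a zero expectation yields no pass or fail verdict on the number itself, which is consistent with the comment's statement that the anchor block does the sign-direction work for this scenario.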
diff --git a/evals/scenarios/b6_basic_rate_plus_1pp.yaml b/evals/scenarios/b6_basic_rate_plus_1pp.yaml
new file mode 100644
index 0000000..1562381
--- /dev/null
+++ b/evals/scenarios/b6_basic_rate_plus_1pp.yaml
@@ -0,0 +1,76 @@
+id: b6_basic_rate_plus_1pp
+test: B
+title: "Income tax basic rate +1pp — silent regression catcher (#82)"
+what_it_tests: |
+  Direct regression test for the failure observed in the 2026-05-28 live test:
+  the agent could not express a basic-rate +1pp reform via the PolicyEngine UK
+  API and produced no eval signal. This case asserts the agent (a) runs the
+  reform end-to-end, (b) reports a *positive* budgetary impact (revenue raise),
+  (c) gets the magnitude inside the right order (~£5-10bn for 2025-26 on
+  Enhanced FRS 2023-24), and (d) attributes the burden to upper deciles.
+
+  Assertions are loose by design — direction + order-of-magnitude only. Tax
+  microsim numbers shift with dataset version and parameter year; exact-number
+  pinning would create flaky tests. See evals/SPEC.md and issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
+  the income tax basic rate from 20% to 21%. Use the Enhanced FRS 2023-24
+  dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £
+  (2) income tax revenue change in £
+  (3) average household income change in £ for each decile (1-10)
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+# Pinned for reproducibility — see issue #82 constraint.
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # No fixture file — assertions are direction + order-of-magnitude only, so
+  # `expected_approx` is supplied inline and the grader skips fixture loading.
+  # The wide `tolerance_pct` deliberately encodes "within an order of magnitude"
+  # rather than a pinned exact value. Sign of `expected_approx` enforces the
+  # direction assertion (positive = revenue-raising).
+  fields_to_compare:
+    # Direction: POSITIVE revenue change. Magnitude: ~£5-10bn band.
+    # Midpoint £7.5bn with ±50% tolerance covers £3.75bn–£11.25bn — wide
+    # enough to absorb dataset/year drift, tight enough to catch a sign flip
+    # or an order-of-magnitude miss.
+    - path: budget.budgetary_impact
+      expected_approx: 7_500_000_000
+      tolerance_pct: 50.0
+    - path: budget.tax_revenue_impact
+      expected_approx: 7_500_000_000
+      tolerance_pct: 50.0
+
+anchor:
+  # Direction-of-burden and parameter-mention assertions live here. The grader
+  # runs case-insensitive substring checks on these against the chat response.
+  must_mention:
+    - "basic rate"  # the parameter under reform
+    - "21"  # the new rate
+    - "income tax"  # the tax surface
+    - "revenue"  # the magnitude language
+  must_not_say:
+    - "revenue falls"  # +1pp is a revenue raise, not a cost
+    - "tax cut"  # this is a tax rise
+    - "everyone loses equally"  # upper deciles bear most of the burden
+  ideal_finding: |
+    Raising the income tax basic rate from 20% to 21% is a revenue-raising
+    reform of order £5-10bn for 2025-26 on Enhanced FRS 2023-24. Because the
+    basic-rate band caps at the higher-rate threshold (£50,270), the £ burden
+    rises with income up to that point and then flattens — meaning upper-middle
+    and top deciles bear most of the absolute burden, while bottom deciles
+    (largely below the personal allowance) are barely affected. Relative-income
+    losses are larger in the upper deciles than the bottom.
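
Reviewer note on the b6 `anchor` block: its comment says the grader runs case-insensitive substring checks on `must_mention` and `must_not_say` against the chat response. A minimal sketch of that kind of check follows, assuming plain lowercase substring matching with no tokenisation. The function name `check_anchor` and the shape of its return value are hypothetical and do not describe the grader's real interface.

```python
def check_anchor(response: str, must_mention: list[str], must_not_say: list[str]) -> dict:
    """Case-insensitive substring checks over a chat response.

    Hypothetical illustration of the anchor checks; not the grader's API.
    """
    text = response.lower()
    # Required phrases that never appear in the response.
    missing = [phrase for phrase in must_mention if phrase.lower() not in text]
    # Forbidden phrases that do appear in the response.
    forbidden = [phrase for phrase in must_not_say if phrase.lower() in text]
    return {
        "missing": missing,
        "forbidden": forbidden,
        "passed": not missing and not forbidden,
    }


if __name__ == "__main__":
    must_mention = ["basic rate", "21", "income tax", "revenue"]
    must_not_say = ["revenue falls", "tax cut", "everyone loses equally"]
    answer = "Raising the Basic Rate of income tax to 21% raises revenue by several £bn."
    print(check_anchor(answer, must_mention, must_not_say))
```

One consequence of plain substring matching, if that is what the grader does, is that short anchors such as "21" also match unrelated text like "2021", and a `must_not_say` phrase fires even when the response uses it in a negated sentence. Longer, more specific phrases reduce both risks.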
diff --git a/evals/scenarios/b7_pa_to_15k.yaml b/evals/scenarios/b7_pa_to_15k.yaml
new file mode 100644
index 0000000..65c4397
--- /dev/null
+++ b/evals/scenarios/b7_pa_to_15k.yaml
@@ -0,0 +1,60 @@
+id: b7_pa_to_15k
+test: B
+title: "Personal allowance raised to £15,000 — reform-API regression (#82)"
+what_it_tests: |
+  Companion to b1_society_wide_pa.yaml but framed as a regression check rather
+  than a fixture replication: asserts only direction (revenue cost, not raise)
+  and order-of-magnitude (~£15-25bn). Catches a class of failure where the
+  agent silently fails to apply the reform and reports baseline numbers, or
+  inverts the sign of the budgetary impact.
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
+  the income tax personal allowance from £12,570 to £15,000. Use the Enhanced
+  FRS 2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £
+  (2) average household income change in £ for each decile (1-10)
+  (3) relative income change in % for each decile (1-10)
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction: NEGATIVE budgetary impact (cost to the exchequer).
+  # Magnitude: ~£15-25bn band. Midpoint -£20bn with ±50% tolerance covers
+  # -£10bn to -£30bn — wide enough to absorb dataset drift, tight enough to
+  # catch a sign flip.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: -20_000_000_000
+      tolerance_pct: 50.0
+
+anchor:
+  must_mention:
+    - "personal allowance"
+    - "15,000"
+    - "12,570"
+  must_not_say:
+    - "revenue rise"  # a PA raise is a cost
+    - "top decile benefits most"  # PA taper above £100k removes the gain for very high earners
+  ideal_finding: |
+    Raising the PA to £15,000 is a tax cut of order £15-25bn in 2025-26 on
+    Enhanced FRS 2023-24. Households in the bottom deciles (especially D2-D4)
+    see meaningful proportional gains because the PA bites earlier in their
+    earnings; the top decile sees a smaller proportional gain because the PA
+    taper above £100k eliminates the personal allowance for very high earners.
diff --git a/evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml b/evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml
new file mode 100644
index 0000000..29d21b2
--- /dev/null
+++ b/evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml
@@ -0,0 +1,58 @@
+id: b8_ni_primary_threshold_plus_1k
+test: B
+title: "NI primary threshold +£1,000 — reform-API regression (#82)"
+what_it_tests: |
+  Asserts the agent applies an NI-side threshold change (not a tax-side change)
+  end-to-end. Targets a different model surface than the income tax cases
+  (b6, b7): if the chat conflates NI thresholds with the income tax PA, this
+  case will catch it via the direction + must_mention checks.
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
+  the National Insurance primary threshold by £1,000 (from £12,570 to £13,570).
+  Use the Enhanced FRS 2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £
+  (2) average household income change in £ for each decile (1-10)
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction: NEGATIVE (cost). Magnitude: ~£2-5bn band. Midpoint -£3.5bn
+  # with ±60% tolerance covers -£1.4bn to -£5.6bn.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: -3_500_000_000
+      tolerance_pct: 60.0
+
+anchor:
+  must_mention:
+    - "national insurance"
+    - "primary threshold"
+    - "1,000"
+  must_not_say:
+    - "personal allowance"  # this reform is NI-side, not income-tax-side; conflation = regression
+    - "revenue rise"  # threshold lift is a cost, not revenue-raising
+  ideal_finding: |
+    Lifting the NI primary threshold by £1,000 costs the exchequer of order
+    £2-5bn in 2025-26 on Enhanced FRS 2023-24. Workers earning above the new
+    threshold (£13,570) gain a flat ~£80/year from the rate-times-band
+    arithmetic. Mid-to-low earners — those whose earnings clear the threshold
+    but who are not concentrated at the top of the distribution — benefit
+    most in proportional terms; pensioners and non-workers see no change since
+    NI is not levied on pension income.
diff --git a/evals/scenarios/b9_child_benefit_uprate_10pct.yaml b/evals/scenarios/b9_child_benefit_uprate_10pct.yaml
new file mode 100644
index 0000000..fbf3bf2
--- /dev/null
+++ b/evals/scenarios/b9_child_benefit_uprate_10pct.yaml
@@ -0,0 +1,64 @@
+id: b9_child_benefit_uprate_10pct
+test: B
+title: "Child benefit uprated by 10% — reform-API regression (#82)"
+what_it_tests: |
+  First benefit-side reform in this regression set — distinct from the
+  tax-side cases (b6, b7, b8). Asserts the agent can express a percentage
+  uprating of an existing benefit parameter (not a level change) and reports
+  the right direction (spending increase = revenue cost) at the right order
+  of magnitude. Households with children should gain.
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: uprate
+  Child Benefit by 10% (raise both the eldest-child rate and the
+  additional-child rate by 10% of their current values). Use the Enhanced FRS
+  2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £ (i.e. cost to the exchequer)
+  (2) average household income change in £ for each decile (1-10)
+  (3) the change in household income for households with at least one child
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction: NEGATIVE (cost). Magnitude: ~£1-2bn band on a benefit that
+  # currently costs ~£12-13bn/year. Midpoint -£1.3bn with ±70% tolerance.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: -1_300_000_000
+      tolerance_pct: 70.0
+    - path: budget.benefit_spending_impact
+      expected_approx: 1_300_000_000
+      tolerance_pct: 70.0
+
+anchor:
+  must_mention:
+    - "child benefit"
+    - "10%"
+    - "households with children"  # the affected population
+  must_not_say:
+    - "no households gain"
+    - "revenue rise"  # spending-side uprating is a cost, not revenue
+  ideal_finding: |
+    Uprating Child Benefit by 10% costs the exchequer of order £1-2bn in
+    2025-26. The gain concentrates in households with children, regardless of
+    income (since Child Benefit is universal at the household level, subject
+    to the High Income Child Benefit Charge withdrawal). Lower deciles see a
+    larger proportional gain because Child Benefit is a larger share of their
+    income; the High Income Child Benefit Charge taper limits the gain for
+    higher-earning parents.