From f049829ccfa2be2b323bd610f5da42cf4eb8dca6 Mon Sep 17 00:00:00 2001
From: Vahid Ahmadi <va.vahidahmadi@gmail.com>
Date: Fri, 29 May 2026 11:10:28 +0200
Subject: [PATCH] Add reform-API regression eval cases (closes #82)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds 5 Test B scenarios (b6-b10) that catch silent reform-expressibility
regressions like the 2026-05-28 basic-rate +1pp failure. Each pins dataset
(enhanced_frs_2023_24) and parameter_year (2025), uses expected_approx with
wide tolerance_pct for direction + order-of-magnitude assertions only — no
fixture files, no exact-number pinning — and uses anchor.must_mention to
assert the chat's prose answer references the reform parameters by name.

Branched off feat/eval-harness (PR #52) since #82 explicitly depends on the
harness landing. README updated to document the two reference shapes
(fixture-backed vs expected_approx-only) and the regression sub-suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 evals/README.md                               | 27 ++++++-
 evals/scenarios/b10_two_band_collapse.yaml    | 77 +++++++++++++++++++
 evals/scenarios/b6_basic_rate_plus_1pp.yaml   | 76 ++++++++++++++++++
 evals/scenarios/b7_pa_to_15k.yaml             | 60 +++++++++++++++
 .../b8_ni_primary_threshold_plus_1k.yaml      | 58 ++++++++++++++
 .../b9_child_benefit_uprate_10pct.yaml        | 64 +++++++++++++++
 6 files changed, 360 insertions(+), 2 deletions(-)
 create mode 100644 evals/scenarios/b10_two_band_collapse.yaml
 create mode 100644 evals/scenarios/b6_basic_rate_plus_1pp.yaml
 create mode 100644 evals/scenarios/b7_pa_to_15k.yaml
 create mode 100644 evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml
 create mode 100644 evals/scenarios/b9_child_benefit_uprate_10pct.yaml
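Reviewer note (illustrative only, not part of the diff): the sketch below shows how an `expected_approx` / `tolerance_pct` pair and the `must_mention` / `must_not_say` lists could be evaluated. It is a minimal sketch under stated assumptions: the function names `within_tolerance` and `check_anchor` are hypothetical, the percentage difference is assumed to be taken relative to the absolute expected value, and none of this is the harness's actual `_diff_scalar`.

```python
from typing import List, Optional, Sequence, Tuple


def within_tolerance(extracted: float, expected_approx: float,
                     tolerance_pct: float) -> Optional[bool]:
    """Hypothetical check: is `extracted` within tolerance_pct of expected_approx?

    Returns None when expected_approx is 0, because a percentage of zero is
    undefined (this mirrors the `expected != 0` guard described for b10).
    """
    if expected_approx == 0:
        return None
    # Assumption: difference is measured relative to |expected_approx|.
    pct_diff = abs(extracted - expected_approx) / abs(expected_approx) * 100.0
    return pct_diff <= tolerance_pct


def check_anchor(response: str,
                 must_mention: Sequence[str] = (),
                 must_not_say: Sequence[str] = ()) -> Tuple[List[str], List[str]]:
    """Hypothetical case-insensitive substring check on the chat response.

    Returns (missing must_mention phrases, must_not_say phrases that appeared).
    """
    text = response.lower()
    missing = [p for p in must_mention if p.lower() not in text]
    forbidden = [p for p in must_not_say if p.lower() in text]
    return missing, forbidden


if __name__ == "__main__":
    # b6-style entry: +£7.5bn expected, 50% tolerance.
    print(within_tolerance(6.2e9, 7_500_000_000, 50.0))    # True
    print(within_tolerance(-6.2e9, 7_500_000_000, 50.0))   # False (sign flip)
    print(within_tolerance(75e9, 7_500_000_000, 50.0))     # False (10x too large)
    # b10-style entry: expected 0, so no percentage comparison is possible.
    print(within_tolerance(3e9, 0, 100.0))                 # None
```

Under this formula a value with the opposite sign always differs from the expected value by more than 100%, so any `tolerance_pct` of 100 or less rejects a sign flip. That is why the sign of `expected_approx` can stand in for the direction assertion. Note also that plain substring matching is permissive: a phrase such as "21" would match inside "2021" as well.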

diff --git a/evals/README.md b/evals/README.md
index 5aa500e..eaaec9f 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -12,7 +12,9 @@ evals/
   README.md              ← this file
   scenarios/
     a1_*.yaml            ← Test A scenarios (supplement)
-    b1_*.yaml            ← Test B scenarios (alternative)
+    b1_*.yaml — b5_*.yaml ← Test B scenarios (alternative — fixture-backed)
+    b6_*.yaml — b10_*.yaml ← Test B reform-API regression catchers (issue #82,
+                              direction + order-of-magnitude, no fixtures)
   fixtures/
     pe_api/              ← reference PE-API responses for Test B
                           ← populated by a follow-up PR
@@ -85,4 +87,25 @@ Pre-committed in SPEC.md so we don't rationalize ambiguous results later:
 
 Add a YAML file. That's it — the runner picks it up automatically. Keep the `id` short and stable (it appears in filenames and reports).
 
-When adding a Test B scenario, also add the reference fixture under `fixtures/pe_api/` and reference it via the relative path in the scenario YAML.
+When adding a Test B scenario, you have two options for the reference:
+
+1. **Fixture-backed** (preferred for replication tests, e.g. b1-b4): add the
+   reference JSON under `fixtures/pe_api/` and reference it via
+   `reference.fixture` in the YAML.
+2. **`expected_approx` inline** (for regression catchers — see b6-b10 below,
+   filed against [#82](https://github.com/PolicyEngine/policyengine-uk-chat/issues/82)):
+   omit `reference.fixture` and put `expected_approx` directly on each entry
+   in `fields_to_compare`. Use a wide `tolerance_pct` (50-70%) so the assertion
+   encodes *direction + order-of-magnitude* rather than exact-number pinning.
+   Tax microsim numbers shift with dataset version and parameter year; exact
+   pinning creates flaky tests.
+
+### Reform-API regression suite (b6-b10, issue #82)
+
+A small sub-suite of B scenarios exists specifically to catch silent
+regressions in reform expressibility — the failure mode where the agent
+silently fails to apply a reform and produces no eval signal. Each pins
+`dataset` and `parameter_year` at the YAML top level, uses `expected_approx`
+inline (no fixture), and relies on the anchor's `must_mention` to assert the
+reform parameters appear in the chat's prose answer. Direction and
+order-of-magnitude only — no exact-number pinning.
diff --git a/evals/scenarios/b10_two_band_collapse.yaml b/evals/scenarios/b10_two_band_collapse.yaml
new file mode 100644
index 0000000..aaeb1f7
--- /dev/null
+++ b/evals/scenarios/b10_two_band_collapse.yaml
@@ -0,0 +1,77 @@
+id: b10_two_band_collapse
+test: B
+title: "Collapse basic + higher rates into a single 25% band — reform-API regression (#82)"
+what_it_tests: |
+  Structural reform — not a single parameter tweak — that asks the agent to
+  apply two changes at once (lift basic to 25%, drop higher to 25%) and
+  describe the result. The revenue sign is *deliberately not pinned* because
+  the net direction depends on the dataset and is genuinely ambiguous; what
+  matters is that the chat (a) mentions both bands by name, (b) understands
+  it's a structural reform rather than a single-parameter change, and (c)
+  produces *a* numeric budgetary impact (not silence or refusal).
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform:
+  collapse the income tax basic rate (currently 20%) and higher rate
+  (currently 40%) into a single 25% band that covers all income from the
+  personal allowance up to the additional-rate threshold (£125,140). Leave
+  the additional rate (45% above £125,140) untouched.
+
+  Use the Enhanced FRS 2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £ (state direction explicitly — gain or cost)
+  (2) average household income change in £ for each decile (1-10)
+  (3) which decile is the biggest absolute loser and biggest absolute gainer
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction is NOT pinned for this scenario: a structural reform of this
+  # shape can come out either sign depending on dataset composition. With
+  # `expected_approx: 0` a percentage tolerance is undefined, so this entry
+  # is effectively an *extractability* check only: the field must be
+  # extractable from the answer. No sign or magnitude is asserted here; the
+  # prompt asks the chat to state the direction explicitly (gain or cost)
+  # and `ideal_finding` requires an explicit sign.
+  #
+  # NOTE: computing tolerance_pct against expected_approx=0 would divide by
+  # zero; the grader's `_diff_scalar` guards this by checking
+  # `expected != 0`, so the diff entry will show extracted-only.
+  # This is intentional.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: 0
+      tolerance_pct: 100.0
+
+anchor:
+  must_mention:
+    - "basic rate"
+    - "higher rate"
+    - "25%"
+  must_not_say:
+    - "the basic rate stays the same"     # it changes from 20% to 25%
+    - "the higher rate stays the same"    # it changes from 40% to 25%
+  ideal_finding: |
+    Collapsing basic (20%) and higher (40%) into a single 25% band is a
+    structural reform with offsetting effects: basic-rate taxpayers (the bulk
+    of D5-D8) pay more, higher-rate taxpayers (D9-D10) pay less. The net
+    budgetary direction depends on which population is larger by tax base in
+    the dataset. The chat must (a) commit to a numeric budgetary impact with
+    an explicit sign, (b) describe the distributional split across deciles
+    (mid-deciles lose, top deciles gain), and (c) reference both bands by
+    name rather than treating this as a single-parameter change.
diff --git a/evals/scenarios/b6_basic_rate_plus_1pp.yaml b/evals/scenarios/b6_basic_rate_plus_1pp.yaml
new file mode 100644
index 0000000..1562381
--- /dev/null
+++ b/evals/scenarios/b6_basic_rate_plus_1pp.yaml
@@ -0,0 +1,76 @@
+id: b6_basic_rate_plus_1pp
+test: B
+title: "Income tax basic rate +1pp — silent regression catcher (#82)"
+what_it_tests: |
+  Direct regression test for the failure observed in the 2026-05-28 live test:
+  the agent could not express a basic-rate +1pp reform via the PolicyEngine UK
+  API and produced no eval signal. This case asserts the agent (a) runs the
+  reform end-to-end, (b) reports a *positive* budgetary impact (revenue raise),
+  (c) gets the magnitude inside the right order (~£5-10bn for 2025-26 on
+  Enhanced FRS 2023-24), and (d) attributes the burden to upper deciles.
+
+  Assertions are loose by design — direction + order-of-magnitude only. Tax
+  microsim numbers shift with dataset version and parameter year; exact-number
+  pinning would create flaky tests. See evals/SPEC.md and issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
+  the income tax basic rate from 20% to 21%. Use the Enhanced FRS 2023-24
+  dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £
+  (2) income tax revenue change in £
+  (3) average household income change in £ for each decile (1-10)
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+# Pinned for reproducibility — see issue #82 constraint.
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # No fixture file — assertions are direction + order-of-magnitude only, so
+  # `expected_approx` is supplied inline and the grader skips fixture loading.
+  # The wide `tolerance_pct` deliberately encodes "within an order of magnitude"
+  # rather than a pinned exact value. Sign of `expected_approx` enforces the
+  # direction assertion (positive = revenue-raising).
+  fields_to_compare:
+    # Direction: POSITIVE revenue change. Magnitude: ~£5-10bn band.
+    # Midpoint £7.5bn with ±50% tolerance covers £3.75bn–£11.25bn — wide
+    # enough to absorb dataset/year drift, tight enough to catch a sign flip
+    # or an order-of-magnitude miss.
+    - path: budget.budgetary_impact
+      expected_approx: 7_500_000_000
+      tolerance_pct: 50.0
+    - path: budget.tax_revenue_impact
+      expected_approx: 7_500_000_000
+      tolerance_pct: 50.0
+
+anchor:
+  # Direction-of-burden and parameter-mention assertions live here. The grader
+  # runs case-insensitive substring checks on these against the chat response.
+  must_mention:
+    - "basic rate"            # the parameter under reform
+    - "21"                    # the new rate
+    - "income tax"            # the tax surface
+    - "revenue"               # the magnitude language
+  must_not_say:
+    - "revenue falls"         # +1pp is a revenue raise, not a cost
+    - "tax cut"               # this is a tax rise
+    - "everyone loses equally"  # upper deciles bear most of the burden
+  ideal_finding: |
+    Raising the income tax basic rate from 20% to 21% is a revenue-raising
+    reform of order £5-10bn for 2025-26 on Enhanced FRS 2023-24. Because the
+    basic-rate band caps at the higher-rate threshold (£50,270), the £ burden
+    rises with income up to that point and then flattens — meaning upper-middle
+    and top deciles bear most of the absolute burden, while bottom deciles
+    (largely below the personal allowance) are barely affected. Relative-income
+    losses are larger in the upper deciles than the bottom.
diff --git a/evals/scenarios/b7_pa_to_15k.yaml b/evals/scenarios/b7_pa_to_15k.yaml
new file mode 100644
index 0000000..65c4397
--- /dev/null
+++ b/evals/scenarios/b7_pa_to_15k.yaml
@@ -0,0 +1,60 @@
+id: b7_pa_to_15k
+test: B
+title: "Personal allowance raised to £15,000 — reform-API regression (#82)"
+what_it_tests: |
+  Companion to b1_society_wide_pa.yaml but framed as a regression check rather
+  than a fixture replication: asserts only direction (revenue cost, not raise)
+  and order-of-magnitude (~£15-25bn). Catches a class of failure where the
+  agent silently fails to apply the reform and reports baseline numbers, or
+  inverts the sign of the budgetary impact.
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
+  the income tax personal allowance from £12,570 to £15,000. Use the Enhanced
+  FRS 2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £
+  (2) average household income change in £ for each decile (1-10)
+  (3) relative income change in % for each decile (1-10)
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction: NEGATIVE budgetary impact (cost to the exchequer).
+  # Magnitude: ~£15-25bn band. Midpoint -£20bn with ±50% tolerance covers
+  # -£10bn to -£30bn — wide enough to absorb dataset drift, tight enough to
+  # catch a sign flip.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: -20_000_000_000
+      tolerance_pct: 50.0
+
+anchor:
+  must_mention:
+    - "personal allowance"
+    - "15,000"
+    - "12,570"
+  must_not_say:
+    - "revenue rise"          # a PA raise is a cost
+    - "top decile benefits most"   # PA taper above £100k removes the gain for very high earners
+  ideal_finding: |
+    Raising the PA to £15,000 is a tax cut of order £15-25bn in 2025-26 on
+    Enhanced FRS 2023-24. Households in the bottom deciles (especially D2-D4)
+    see meaningful proportional gains because the PA bites earlier in their
+    earnings; the top decile sees a smaller proportional gain because the PA
+    taper above £100k eliminates the personal allowance for very high earners.
diff --git a/evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml b/evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml
new file mode 100644
index 0000000..29d21b2
--- /dev/null
+++ b/evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml
@@ -0,0 +1,58 @@
+id: b8_ni_primary_threshold_plus_1k
+test: B
+title: "NI primary threshold +£1,000 — reform-API regression (#82)"
+what_it_tests: |
+  Asserts the agent applies an NI-side threshold change (not an income-tax-side
+  change) end-to-end. Targets a different model surface than the income tax
+  cases (b6, b7): if the chat conflates NI thresholds with the income tax PA,
+  this case will catch it via the must_mention and must_not_say checks.
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
+  the National Insurance primary threshold by £1,000 (from £12,570 to £13,570).
+  Use the Enhanced FRS 2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £
+  (2) average household income change in £ for each decile (1-10)
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction: NEGATIVE (cost). Magnitude: ~£2-5bn band. Midpoint -£3.5bn
+  # with ±60% tolerance covers -£1.4bn to -£5.6bn.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: -3_500_000_000
+      tolerance_pct: 60.0
+
+anchor:
+  must_mention:
+    - "national insurance"
+    - "primary threshold"
+    - "1,000"
+  must_not_say:
+    - "personal allowance"        # this reform is NI-side, not income-tax-side; conflation = regression
+    - "revenue rise"              # threshold lift is a cost, not revenue-raising
+  ideal_finding: |
+    Lifting the NI primary threshold by £1,000 costs the exchequer roughly
+    £2-5bn in 2025-26 on Enhanced FRS 2023-24. Workers earning above the new
+    threshold (£13,570) gain a flat ~£80/year (the 8% main rate times the
+    £1,000 band). Lower-paid workers whose earnings clear the threshold
+    benefit most in proportional terms, since the flat gain is a larger
+    share of their income; pensioners and non-workers see no change since
+    NI is not levied on pension income.
diff --git a/evals/scenarios/b9_child_benefit_uprate_10pct.yaml b/evals/scenarios/b9_child_benefit_uprate_10pct.yaml
new file mode 100644
index 0000000..fbf3bf2
--- /dev/null
+++ b/evals/scenarios/b9_child_benefit_uprate_10pct.yaml
@@ -0,0 +1,64 @@
+id: b9_child_benefit_uprate_10pct
+test: B
+title: "Child benefit uprated by 10% — reform-API regression (#82)"
+what_it_tests: |
+  First benefit-side reform in this regression set — distinct from the
+  tax-side cases (b6, b7, b8). Asserts the agent can express a percentage
+  uprating of an existing benefit parameter (not a level change) and reports
+  the right direction (spending increase = revenue cost) at the right order
+  of magnitude. Households with children should gain.
+
+  Assertions are loose by design — direction + order-of-magnitude only. See
+  issue #82.
+
+source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"
+
+prompt: |
+  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: uprate
+  Child Benefit by 10% (raise both the eldest-child rate and the
+  additional-child rate by 10% of their current values). Use the Enhanced FRS
+  2023-24 dataset:
+  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+  Use Microsimulation.
+
+  Report from the model run:
+  (1) total budgetary impact in £ (i.e. cost to the exchequer)
+  (2) average household income change in £ for each decile (1-10)
+  (3) the change in household income for households with at least one child
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+dataset: enhanced_frs_2023_24
+parameter_year: 2025
+
+reference:
+  # Direction: NEGATIVE (cost). Magnitude: ~£1-2bn band on a benefit that
+  # costs ~£12-13bn/year. Central value -£1.3bn, ±70%: -£0.39bn to -£2.21bn.
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      expected_approx: -1_300_000_000
+      tolerance_pct: 70.0
+    - path: budget.benefit_spending_impact
+      expected_approx: 1_300_000_000
+      tolerance_pct: 70.0
+
+anchor:
+  must_mention:
+    - "child benefit"
+    - "10%"
+    - "households with children"   # the affected population
+  must_not_say:
+    - "no households gain"
+    - "revenue rise"               # spending-side uprating is a cost, not revenue
+  ideal_finding: |
+    Uprating Child Benefit by 10% costs the exchequer roughly £1-2bn in
+    2025-26. The gain concentrates in households with children, regardless of
+    income (since Child Benefit is universal at the household level, subject
+    to the High Income Child Benefit Charge withdrawal). Lower deciles see a
+    larger proportional gain because Child Benefit is a larger share of their
+    income; the High Income Child Benefit Charge taper limits the gain for
+    higher-earning parents.