27 changes: 25 additions & 2 deletions evals/README.md
@@ -12,7 +12,9 @@ evals/
  README.md                   ← this file
  scenarios/
    a1_*.yaml                 ← Test A scenarios (supplement)
    b1_*.yaml                 ← Test B scenarios (alternative)
    b1_*.yaml — b5_*.yaml     ← Test B scenarios (alternative — fixture-backed)
    b6_*.yaml — b10_*.yaml    ← Test B reform-API regression catchers (issue #82,
                                direction + order-of-magnitude, no fixtures)
  fixtures/
    pe_api/                   ← reference PE-API responses for Test B
                              ← populated by a follow-up PR
@@ -85,4 +87,25 @@ Pre-committed in SPEC.md so we don't rationalize ambiguous results later:

Add a YAML file. That's it — the runner picks it up automatically. Keep the `id` short and stable (it appears in filenames and reports).
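
Automatic pickup of this kind usually amounts to globbing the scenarios directory. The sketch below is an illustration only, not the runner's actual code: `discover_scenarios` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_scenarios(scenarios_dir: Path) -> List[str]:
    """Return the stems of all *.yaml files in a directory, sorted by name."""
    return sorted(p.stem for p in scenarios_dir.glob("*.yaml"))


# Demo with a throwaway directory and made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["b6_example.yaml", "a1_example.yaml", "notes.txt"]:
        (root / name).write_text("id: placeholder\n")
    print(discover_scenarios(root))
```

A plain string sort places `b10_...` before `b6_...`, which is worth bearing in mind if report ordering matters.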

When adding a Test B scenario, also add the reference fixture under `fixtures/pe_api/` and reference it via the relative path in the scenario YAML.
When adding a Test B scenario, you have two options for the reference:

1. **Fixture-backed** (preferred for replication tests, e.g. b1-b4): add the
reference JSON under `fixtures/pe_api/` and reference it via
`reference.fixture` in the YAML.
2. **`expected_approx` inline** (for regression catchers — see b6-b10 below,
filed against [#82](https://github.com/PolicyEngine/policyengine-uk-chat/issues/82)):
omit `reference.fixture` and put `expected_approx` directly on each entry
in `fields_to_compare`. Use a wide `tolerance_pct` (50-70%) so the assertion
encodes *direction + order-of-magnitude* rather than exact-number pinning.
Tax microsim numbers shift with dataset version and parameter year; exact
pinning creates flaky tests.
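
The band implied by `expected_approx` and `tolerance_pct` can be sketched as follows. This is a minimal illustration, not the grader's actual code: the function name `within_tolerance` is hypothetical, and the zero guard is an assumption modelled on the `expected != 0` check that the b10 scenario's comments attribute to `_diff_scalar`.

```python
from typing import Optional


def within_tolerance(
    extracted: float, expected: float, tolerance_pct: float
) -> Optional[bool]:
    """Return True/False for a percentage-tolerance check, or None if undefined.

    Hypothetical helper: a percentage difference has no meaning when the
    expected value is zero, so that case is reported as None (not comparable).
    """
    if expected == 0:
        return None
    diff_pct = abs(extracted - expected) / abs(expected) * 100.0
    return diff_pct <= tolerance_pct


# Midpoint £7.5bn with a 50% tolerance accepts £3.75bn to £11.25bn.
print(within_tolerance(5_000_000_000, 7_500_000_000, 50.0))   # inside the band
print(within_tolerance(-5_000_000_000, 7_500_000_000, 50.0))  # sign flip
print(within_tolerance(75_000_000_000, 7_500_000_000, 50.0))  # order-of-magnitude miss
```

Under this formula, any tolerance below 100% rejects a sign flip, because a value of the opposite sign always differs from the expected value by more than 100%.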

### Reform-API regression suite (b6-b10, issue #82)

A small sub-suite of B scenarios exists specifically to catch silent
regressions in reform expressibility — the failure mode where the agent
silently fails to apply a reform and produces no eval signal. Each pins
`dataset` and `parameter_year` at the YAML top level, uses `expected_approx`
inline (no fixture), and relies on the anchor's `must_mention` to assert the
reform parameters appear in the chat's prose answer. Direction and
order-of-magnitude only — no exact-number pinning.
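
The anchor checks can be pictured as case-insensitive substring tests against the chat response, which is how the b6 scenario's comments describe them. The sketch below is an assumption-laden illustration rather than the grader's implementation: `check_anchor` and its return shape are hypothetical.

```python
from typing import Dict, List


def check_anchor(
    response: str, must_mention: List[str], must_not_say: List[str]
) -> Dict[str, List[str]]:
    """Case-insensitive substring checks against a chat response.

    Returns the required phrases that are missing and the forbidden phrases
    that are present; two empty lists mean the anchor checks pass.
    """
    text = response.lower()
    missing = [p for p in must_mention if p.lower() not in text]
    forbidden = [p for p in must_not_say if p.lower() in text]
    return {"missing": missing, "forbidden": forbidden}


response = (
    "Raising the basic rate of Income Tax from 20% to 21% increases "
    "revenue by several billion pounds."
)
result = check_anchor(
    response,
    must_mention=["basic rate", "21", "income tax", "revenue"],
    must_not_say=["revenue falls", "tax cut"],
)
print(result)
```

Substring matching is blunt: a short token such as `"21"` would also match `"2021"`, so longer, more specific phrases give a stronger signal.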
77 changes: 77 additions & 0 deletions evals/scenarios/b10_two_band_collapse.yaml
@@ -0,0 +1,77 @@
id: b10_two_band_collapse
test: B
title: "Collapse basic + higher rates into a single 25% band — reform-API regression (#82)"
what_it_tests: |
  Structural reform — not a single parameter tweak — that asks the agent to
  apply two changes at once (lift basic to 25%, drop higher to 25%) and
  describe the result. The revenue sign is *deliberately not pinned* because
  the net direction depends on the dataset and is genuinely ambiguous; what
  matters is that the chat (a) mentions both bands by name, (b) understands
  it's a structural reform rather than a single-parameter change, and (c)
  produces *a* numeric budgetary impact (not silence or refusal).

  Assertions are loose by design — direction + order-of-magnitude only. See
  issue #82.

source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"

prompt: |
  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform:
  collapse the income tax basic rate (currently 20%) and higher rate
  (currently 40%) into a single 25% band that covers all income from the
  personal allowance up to the additional-rate threshold (£125,140). Leave
  the additional rate (45% above £125,140) untouched.

  Use the Enhanced FRS 2023-24 dataset:
  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
  Use Microsimulation.

  Report from the model run:
  (1) total budgetary impact in £ (state direction explicitly — gain or cost)
  (2) average household income change in £ for each decile (1-10)
  (3) which decile is the biggest absolute loser and biggest absolute gainer

  British English. Every figure from the model run, not memory.

chat_settings:
  model_backend: uk_python
  num_runs: 3

dataset: enhanced_frs_2023_24
parameter_year: 2025

reference:
  # Direction is NOT pinned for this scenario: a structural reform of this
  # shape can come out either sign depending on dataset composition. With
  # `expected_approx: 0`, the entry is effectively an extractability check
  # rather than a comparison: the field must be present in the output. Sign
  # is handled through the `anchor` block instead (see `ideal_finding`),
  # which requires the chat to commit to one direction, revenue raise or
  # revenue cost, in prose.
  #
  # NOTE: a percentage tolerance against `expected_approx: 0` would divide
  # by zero. The grader's `_diff_scalar` guards this with an `expected != 0`
  # check, so the diff entry shows the extracted value only. This is
  # intentional.
  fields_to_compare:
    - path: budget.budgetary_impact
      expected_approx: 0
      tolerance_pct: 100.0

anchor:
  must_mention:
    - "basic rate"
    - "higher rate"
    - "25%"
  must_not_say:
    - "the basic rate stays the same" # it changes from 20% to 25%
    - "the higher rate stays the same" # it changes from 40% to 25%
  ideal_finding: |
    Collapsing basic (20%) and higher (40%) into a single 25% band is a
    structural reform with offsetting effects: basic-rate taxpayers (the bulk
    of D5-D8) pay more, higher-rate taxpayers (D9-D10) pay less. The net
    budgetary direction depends on which population is larger by tax base in
    the dataset. The chat must (a) commit to a numeric budgetary impact with
    an explicit sign, (b) describe the distributional split across deciles
    (mid-deciles lose, top deciles gain), and (c) reference both bands by
    name rather than treating this as a single-parameter change.
76 changes: 76 additions & 0 deletions evals/scenarios/b6_basic_rate_plus_1pp.yaml
@@ -0,0 +1,76 @@
id: b6_basic_rate_plus_1pp
test: B
title: "Income tax basic rate +1pp — silent regression catcher (#82)"
what_it_tests: |
  Direct regression test for the failure observed in the 2026-05-28 live test:
  the agent could not express a basic-rate +1pp reform via the PolicyEngine UK
  API and produced no eval signal. This case asserts the agent (a) runs the
  reform end-to-end, (b) reports a *positive* budgetary impact (revenue raise),
  (c) gets the magnitude inside the right order (~£5-10bn for 2025-26 on
  Enhanced FRS 2023-24), and (d) attributes the burden to upper deciles.

  Assertions are loose by design — direction + order-of-magnitude only. Tax
  microsim numbers shift with dataset version and parameter year; exact-number
  pinning would create flaky tests. See evals/SPEC.md and issue #82.

source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"

prompt: |
  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
  the income tax basic rate from 20% to 21%. Use the Enhanced FRS 2023-24
  dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
  Use Microsimulation.

  Report from the model run:
  (1) total budgetary impact in £
  (2) income tax revenue change in £
  (3) average household income change in £ for each decile (1-10)

  British English. Every figure from the model run, not memory.

chat_settings:
  model_backend: uk_python
  num_runs: 3

# Pinned for reproducibility — see issue #82 constraint.
dataset: enhanced_frs_2023_24
parameter_year: 2025

reference:
  # No fixture file — assertions are direction + order-of-magnitude only, so
  # `expected_approx` is supplied inline and the grader skips fixture loading.
  # The wide `tolerance_pct` deliberately encodes "within an order of magnitude"
  # rather than a pinned exact value. Sign of `expected_approx` enforces the
  # direction assertion (positive = revenue-raising).
  fields_to_compare:
    # Direction: POSITIVE revenue change. Magnitude: ~£5-10bn band.
    # Midpoint £7.5bn with ±50% tolerance covers £3.75bn–£11.25bn — wide
    # enough to absorb dataset/year drift, tight enough to catch a sign flip
    # or an order-of-magnitude miss.
    - path: budget.budgetary_impact
      expected_approx: 7_500_000_000
      tolerance_pct: 50.0
    - path: budget.tax_revenue_impact
      expected_approx: 7_500_000_000
      tolerance_pct: 50.0

anchor:
  # Direction-of-burden and parameter-mention assertions live here. The grader
  # runs case-insensitive substring checks on these against the chat response.
  must_mention:
    - "basic rate" # the parameter under reform
    - "21" # the new rate
    - "income tax" # the tax surface
    - "revenue" # the magnitude language
  must_not_say:
    - "revenue falls" # +1pp is a revenue raise, not a cost
    - "tax cut" # this is a tax rise
    - "everyone loses equally" # upper deciles bear most of the burden
  ideal_finding: |
    Raising the income tax basic rate from 20% to 21% is a revenue-raising
    reform of order £5-10bn for 2025-26 on Enhanced FRS 2023-24. Because the
    basic-rate band caps at the higher-rate threshold (£50,270), the £ burden
    rises with income up to that point and then flattens — meaning upper-middle
    and top deciles bear most of the absolute burden, while bottom deciles
    (largely below the personal allowance) are barely affected. Relative-income
    losses are larger in the upper deciles than the bottom.
60 changes: 60 additions & 0 deletions evals/scenarios/b7_pa_to_15k.yaml
@@ -0,0 +1,60 @@
id: b7_pa_to_15k
test: B
title: "Personal allowance raised to £15,000 — reform-API regression (#82)"
what_it_tests: |
  Companion to b1_society_wide_pa.yaml but framed as a regression check rather
  than a fixture replication: asserts only direction (revenue cost, not raise)
  and order-of-magnitude (~£15-25bn). Catches a class of failure where the
  agent silently fails to apply the reform and reports baseline numbers, or
  inverts the sign of the budgetary impact.

  Assertions are loose by design — direction + order-of-magnitude only. See
  issue #82.

source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"

prompt: |
  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
  the income tax personal allowance from £12,570 to £15,000. Use the Enhanced
  FRS 2023-24 dataset:
  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
  Use Microsimulation.

  Report from the model run:
  (1) total budgetary impact in £
  (2) average household income change in £ for each decile (1-10)
  (3) relative income change in % for each decile (1-10)

  British English. Every figure from the model run, not memory.

chat_settings:
  model_backend: uk_python
  num_runs: 3

dataset: enhanced_frs_2023_24
parameter_year: 2025

reference:
  # Direction: NEGATIVE budgetary impact (cost to the exchequer).
  # Magnitude: ~£15-25bn band. Midpoint -£20bn with ±50% tolerance covers
  # -£10bn to -£30bn — wide enough to absorb dataset drift, tight enough to
  # catch a sign flip.
  fields_to_compare:
    - path: budget.budgetary_impact
      expected_approx: -20_000_000_000
      tolerance_pct: 50.0

anchor:
  must_mention:
    - "personal allowance"
    - "15,000"
    - "12,570"
  must_not_say:
    - "revenue rise" # a PA raise is a cost
    - "top decile benefits most" # PA taper above £100k removes the gain for very high earners
  ideal_finding: |
    Raising the PA to £15,000 is a tax cut of order £15-25bn in 2025-26 on
    Enhanced FRS 2023-24. Households in the bottom deciles (especially D2-D4)
    see meaningful proportional gains because the PA bites earlier in their
    earnings; the top decile sees a smaller proportional gain because the PA
    taper above £100k eliminates the personal allowance for very high earners.
58 changes: 58 additions & 0 deletions evals/scenarios/b8_ni_primary_threshold_plus_1k.yaml
@@ -0,0 +1,58 @@
id: b8_ni_primary_threshold_plus_1k
test: B
title: "NI primary threshold +£1,000 — reform-API regression (#82)"
what_it_tests: |
  Asserts the agent applies an NI-side threshold change (not a tax-side change)
  end-to-end. Targets a different model surface than the income tax cases
  (b6, b7): if the chat conflates NI thresholds with the income tax PA, this
  case will catch it via the direction + must_mention checks.

  Assertions are loose by design — direction + order-of-magnitude only. See
  issue #82.

source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"

prompt: |
  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: raise
  the National Insurance primary threshold by £1,000 (from £12,570 to £13,570).
  Use the Enhanced FRS 2023-24 dataset:
  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
  Use Microsimulation.

  Report from the model run:
  (1) total budgetary impact in £
  (2) average household income change in £ for each decile (1-10)

  British English. Every figure from the model run, not memory.

chat_settings:
  model_backend: uk_python
  num_runs: 3

dataset: enhanced_frs_2023_24
parameter_year: 2025

reference:
  # Direction: NEGATIVE (cost). Magnitude: ~£2-5bn band. Midpoint -£3.5bn
  # with ±60% tolerance covers -£1.4bn to -£5.6bn.
  fields_to_compare:
    - path: budget.budgetary_impact
      expected_approx: -3_500_000_000
      tolerance_pct: 60.0

anchor:
  must_mention:
    - "national insurance"
    - "primary threshold"
    - "1,000"
  must_not_say:
    - "personal allowance" # this reform is NI-side, not income-tax-side; conflation = regression
    - "revenue rise" # threshold lift is a cost, not revenue-raising
  ideal_finding: |
    Lifting the NI primary threshold by £1,000 costs the exchequer on the
    order of £2-5bn in 2025-26 on Enhanced FRS 2023-24. Workers earning above
    the new threshold (£13,570) gain a flat ~£80/year (the 8% main rate ×
    £1,000). Low-to-middle earners whose earnings clear the new threshold
    benefit most in proportional terms; pensioners and non-workers see no
    change since NI is not levied on pension income.
64 changes: 64 additions & 0 deletions evals/scenarios/b9_child_benefit_uprate_10pct.yaml
@@ -0,0 +1,64 @@
id: b9_child_benefit_uprate_10pct
test: B
title: "Child benefit uprated by 10% — reform-API regression (#82)"
what_it_tests: |
  First benefit-side reform in this regression set — distinct from the
  tax-side cases (b6, b7, b8). Asserts the agent can express a percentage
  uprating of an existing benefit parameter (not a level change) and reports
  the right direction (spending increase = revenue cost) at the right order
  of magnitude. Households with children should gain.

  Assertions are loose by design — direction + order-of-magnitude only. See
  issue #82.

source_issue: "https://github.com/PolicyEngine/policyengine-uk-chat/issues/82"

prompt: |
  Run a UK 2025 economy-wide comparison. Baseline: current law. Reform: uprate
  Child Benefit by 10% (raise both the eldest-child rate and the
  additional-child rate by 10% of their current values). Use the Enhanced FRS
  2023-24 dataset:
  hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
  Use Microsimulation.

  Report from the model run:
  (1) total budgetary impact in £ (i.e. cost to the exchequer)
  (2) average household income change in £ for each decile (1-10)
  (3) the change in household income for households with at least one child

  British English. Every figure from the model run, not memory.

chat_settings:
  model_backend: uk_python
  num_runs: 3

dataset: enhanced_frs_2023_24
parameter_year: 2025

reference:
  # Direction: NEGATIVE (cost). Magnitude: ~£1-2bn band on a benefit that
  # currently costs ~£12-13bn/year. Midpoint -£1.3bn with ±70% tolerance.
  fields_to_compare:
    - path: budget.budgetary_impact
      expected_approx: -1_300_000_000
      tolerance_pct: 70.0
    - path: budget.benefit_spending_impact
      expected_approx: 1_300_000_000
      tolerance_pct: 70.0

anchor:
  must_mention:
    - "child benefit"
    - "10%"
    - "households with children" # the affected population
  must_not_say:
    - "no households gain"
    - "revenue rise" # spending-side uprating is a cost, not revenue
  ideal_finding: |
    Uprating Child Benefit by 10% costs the exchequer on the order of £1-2bn
    in 2025-26. The gain concentrates in households with children, regardless
    of income (since Child Benefit is universal at the household level,
    subject to the High Income Child Benefit Charge withdrawal). Lower deciles
    see a larger proportional gain because Child Benefit is a larger share of
    their income; the High Income Child Benefit Charge taper limits the gain
    for higher-earning parents.