Background
Today's live test (2026-05-28) exposed the agent's inability to express a basic-rate +1pp reform via the PolicyEngine UK API. The failure was silent — no test, no eval signal — and only surfaced because a human watched the stream.
PR #52 (feat/eval-harness) is the right home for systematic regression coverage like this. This issue tracks adding a concrete set of reform-API cases once #52 lands.
Cases to add
Each case pairs a user prompt with the expected behavior:
- Basic rate +1pp: "What's the distributional impact of raising the income tax basic rate from 20% to 21%?" — expect: positive revenue change ~£6–8bn, top deciles bearing the burden, optional chart.
- Personal allowance to £15k: "What if the personal allowance were £15,000?" — expect: revenue drop, bottom-quintile gain.
- NI primary threshold change: "What if the NI primary threshold went up by £1,000?" — expect: low-to-mid earners benefit, revenue cost.
- Child benefit uprating: "Uprate child benefit by 10%." — expect: lower-decile gain, modest revenue cost.
- Two-band income tax: "What if we collapsed basic and higher rates to a single 25% band?" — expect: a sensible reform structure, not a guessed one.
For each: assert the direction of revenue change is correct, the magnitude is within an order of magnitude of a known answer, and the agent's response either (a) emits a chart when chartsMode=on, or (b) cites the typed tool used.
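The assertions above could be sketched as two small helpers. This is only an illustration, not the harness from #52: `ReformCase`, `check_revenue`, and `response_ok` are hypothetical names, the 7.0 reference is simply the midpoint of the £6–8bn range quoted for the basic-rate case, and the chart/tool rule is one reading of the sentence above (a chart counts only when chartsMode is on; citing the typed tool always counts).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReformCase:
    """Hypothetical case record; field names are assumptions, not the #52 schema."""
    name: str
    prompt: str
    expected_sign: int             # +1 if the reform should raise revenue, -1 if it should cost revenue
    reference_bn: Optional[float]  # known answer in GBP bn, or None when only direction is checked


def check_revenue(case: ReformCase, observed_bn: float) -> bool:
    """Direction must match; magnitude must be within 10x of the reference, if one is given."""
    if observed_bn == 0:
        return False
    if (observed_bn > 0) != (case.expected_sign > 0):
        return False
    if case.reference_bn is None:
        return True
    ratio = abs(observed_bn) / abs(case.reference_bn)
    return 0.1 <= ratio <= 10.0


def response_ok(charts_mode_on: bool, emitted_chart: bool, cited_typed_tool: bool) -> bool:
    """Pass if a chart was emitted with charts mode on, or the typed tool was cited."""
    return (charts_mode_on and emitted_chart) or cited_typed_tool


basic_rate = ReformCase(
    name="basic_rate_plus_1pp",
    prompt="What's the distributional impact of raising the income tax basic rate from 20% to 21%?",
    expected_sign=+1,
    reference_bn=7.0,  # assumed midpoint of the ~6-8bn range stated in the case list
)
```

With this shape, `check_revenue(basic_rate, 6.5)` passes, while a wrong-signed result or one off by more than a factor of ten fails. Cases without a reference figure, such as the personal allowance one, would set `reference_bn=None` and be checked on direction only.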
Why this matters
Without these eval cases, fixes to system-prompt recipes (#filed-companion), typed tools (#filed-companion), and model selection (#filed-companion) can regress silently. Each PR that touches reform handling should run these cases.
Depends on
PR #52 landing (eval harness scaffolding).
Constraints
- Cases should be cheap to run — single household / minimal microsim where possible.
- Pin a dataset version (e.g. EFRS 2025/26) so case results are reproducible.
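One way to make the pinning constraint enforceable, sketched under assumptions: stored expected results are keyed by the run configuration, so a result recorded against one dataset is never compared with a run against another. `EvalRunConfig` and `baseline_key` are hypothetical names, and the dataset string is just the label used in this issue, not a verified PolicyEngine identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Hypothetical run settings; not part of the #52 harness."""
    dataset: str  # pinned dataset label, copied from this issue
    mode: str     # "household" (cheap) or "microsim" (full population)


def baseline_key(config: EvalRunConfig, case_name: str) -> str:
    """Stable key for a stored expected result; changes whenever the dataset or mode changes."""
    payload = json.dumps({"case": case_name, **asdict(config)}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


pinned = EvalRunConfig(dataset="EFRS 2025/26", mode="household")
key = baseline_key(pinned, "basic_rate_plus_1pp")
```

Because the key is derived from the dataset label, bumping the dataset forces expected values to be regenerated rather than silently reused.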