Add reform-API eval cases to catch silent regressions like the 1pp basic-rate failure #82

## Background

Today's live test (2026-05-28) showed that the agent cannot express a basic-rate +1pp reform via the PolicyEngine UK API. The failure was silent: no test or eval signal flagged it, and it surfaced only because a human was watching the stream.

PR #52 (`feat/eval-harness`) is the right home for systematic regression coverage like this. This issue tracks adding a concrete set of reform-API cases once #52 lands.

## Cases to add

Each case = a user prompt + expected behavior:

1. **Basic rate +1pp**: *"What's the distributional impact of raising the income tax basic rate from 20% to 21%?"* Expect: a revenue gain of roughly £6–8bn, top deciles bearing the burden, optional chart.
2. **Personal allowance to £15k**: *"What if the personal allowance were £15,000?"* Expect: a revenue drop and a bottom-quintile gain.
3. **NI primary threshold change**: *"What if the NI primary threshold went up by £1,000?"* Expect: low-to-mid earners benefit, at a revenue cost.
4. **Child benefit uprating**: *"Uprate child benefit by 10%."* Expect: a lower-decile gain and a modest revenue cost.
5. **Two-band income tax**: *"What if we collapsed basic and higher rates to a single 25% band?"* Expect: a structurally valid reform that sets both the basic and higher rates to 25%, not a guessed structure.
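The list above could be captured as data so the harness can iterate over it. The following is a minimal sketch, not the format #52 defines: `ReformCase`, its fields, and the case names are all hypothetical, and only the prompts and expected directions are taken from this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReformCase:
    """One eval case: a user prompt plus the expectations listed in this issue."""
    name: str                         # hypothetical identifier, not defined by #52
    prompt: str                       # user prompt, copied from the issue
    revenue_direction: Optional[int]  # +1 raises revenue, -1 costs revenue, None if unstated
    notes: str                        # remaining expectations, kept as free text


CASES = [
    ReformCase(
        "basic_rate_plus_1pp",
        "What's the distributional impact of raising the income tax basic rate from 20% to 21%?",
        +1,
        "roughly £6-8bn; top deciles bear the burden; optional chart",
    ),
    ReformCase(
        "personal_allowance_15k",
        "What if the personal allowance were £15,000?",
        -1,
        "bottom-quintile gain",
    ),
    ReformCase(
        "ni_primary_threshold_plus_1k",
        "What if the NI primary threshold went up by £1,000?",
        -1,
        "low-to-mid earners benefit",
    ),
    ReformCase(
        "child_benefit_uprating_10pct",
        "Uprate child benefit by 10%.",
        -1,
        "lower-decile gain; modest revenue cost",
    ),
    ReformCase(
        "single_25pct_band",
        "What if we collapsed basic and higher rates to a single 25% band?",
        None,  # the issue states no revenue direction for this case
        "reform structure must be valid, not guessed",
    ),
]
```

Case 5 carries `None` for the direction because the issue only asks for a sensible reform structure there; a harness would need a separate structural check for it.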

For each case, assert that the **direction** of the revenue change is correct, that the magnitude is within an order of magnitude of a known answer, and that the agent's response either (a) emits a chart when `chartsMode=on` or (b) cites the typed tool used.
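The three assertions might look something like the sketch below. `AgentResult` and `check_case` are hypothetical names, and the shape of the agent's output is an assumption; the real harness in #52 may expose results differently. "Within an order of magnitude" is read here as a ratio between 0.1 and 10, and the chart/tool condition is read as "a chart when charts mode is on, otherwise a cited typed tool".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentResult:
    """Hypothetical summary of one agent response; not the harness's real type."""
    revenue_change_bn: float              # signed revenue change in £bn
    emitted_chart: bool                   # whether the response included a chart
    tools_cited: List[str] = field(default_factory=list)  # typed tools named in the response


def check_case(
    result: AgentResult,
    expected_sign: int,             # +1 for a revenue gain, -1 for a revenue cost
    reference_bn: Optional[float],  # known answer in £bn, or None if there is none
    charts_mode_on: bool,
) -> List[str]:
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []

    # 1. Direction: the sign of the revenue change must match the expectation.
    actual_sign = (result.revenue_change_bn > 0) - (result.revenue_change_bn < 0)
    if actual_sign != expected_sign:
        failures.append("revenue change has the wrong direction")
    # 2. Magnitude: only meaningful once the direction is right.
    elif reference_bn:
        ratio = abs(result.revenue_change_bn) / abs(reference_bn)
        if not 0.1 <= ratio <= 10:
            failures.append("revenue change is more than 10x off the reference")

    # 3. Evidence: a chart (when charts mode is on) or a cited typed tool.
    has_chart = charts_mode_on and result.emitted_chart
    if not (has_chart or result.tools_cited):
        failures.append("response has neither a chart nor a cited typed tool")

    return failures


# Usage, taking 7.0 (midpoint of the £6-8bn range) as an assumed reference value.
good = check_case(AgentResult(6.5, True), +1, 7.0, charts_mode_on=True)
bad = check_case(AgentResult(-6.5, False), +1, 7.0, charts_mode_on=True)
```

Returning messages rather than raising on the first failure lets a single run report every broken expectation for a case.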

## Why this matters

Without these eval cases, fixes to system-prompt recipes (#filed-companion), typed tools (#filed-companion), and model selection (#filed-companion) can regress silently. Each PR that touches reform handling should run these cases.

## Depends on

PR #52 landing (eval harness scaffolding).

## Constraints

- Cases should be cheap to run — single household / minimal microsim where possible.
- Pin a dataset version (e.g. EFRS 2025/26) so case results are reproducible.
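One way to make the pinning constraint enforceable is to record the dataset label alongside each stored baseline and refuse to compare across labels. This is a rough sketch under assumptions: `Baseline`, `comparable`, and `PINNED_DATASET` are hypothetical, and the string below is just the label from this issue, not a real dataset identifier.

```python
from dataclasses import dataclass

# Label copied from the issue text; the actual dataset identifier is an assumption.
PINNED_DATASET = "EFRS 2025/26"


@dataclass(frozen=True)
class Baseline:
    """A stored known answer for one case, tagged with the dataset that produced it."""
    case_name: str
    dataset: str
    revenue_change_bn: float


def comparable(baseline: Baseline, run_dataset: str = PINNED_DATASET) -> bool:
    """A run may only be checked against a baseline produced on the same dataset."""
    return baseline.dataset == run_dataset
```

A harness could skip or fail any case whose baseline is not `comparable`, so a dataset bump shows up as an explicit baseline refresh rather than as a spurious regression.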
