Add reform-API eval cases to catch silent regressions like the 1pp basic-rate failure #82

@vahid-ahmadi

Background

Today's live test (2026-05-28) exposed the agent's inability to express a basic-rate +1pp reform via the PolicyEngine UK API. The failure was silent — no test, no eval signal — and only surfaced because a human watched the stream.

PR #52 (feat/eval-harness) is the right home for systematic regression coverage like this. This issue tracks adding a concrete set of reform-API cases once #52 lands.

Cases to add

Each case = a user prompt + expected behavior:

  1. Basic rate +1pp: "What's the distributional impact of raising the income tax basic rate from 20% to 21%?" — expect: positive revenue change ~£6–8bn, top deciles bearing the burden, optional chart.
  2. Personal allowance to £15k: "What if the personal allowance were £15,000?" — expect: revenue drop, bottom-quintile gain.
  3. NI primary threshold change: "What if the NI primary threshold went up by £1,000?" — expect: low-to-mid earners benefit, revenue cost.
  4. Child benefit uprating: "Uprate child benefit by 10%." — expect: lower-decile gain, modest revenue cost.
  5. Two-band income tax: "What if we collapsed basic and higher rates to a single 25% band?" — expect: a sensible reform structure, not a guessed one.

For each case, assert that the direction of the revenue change is correct, that the magnitude is within an order of magnitude of a known answer, and that the agent's response either (a) emits a chart when chartsMode=on or (b) cites the typed tool used.
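A minimal sketch of how these three assertions could be written, independent of whatever interface the harness in #52 ends up exposing. `ReformCase`, `AgentResult`, and `check_case` are hypothetical names, not part of the harness or the PolicyEngine UK API. The 7.0 reference value is just the midpoint of the ~£6–8bn range quoted for case 1. The sketch reads the chart/tool condition as "chart emitted while charts mode is on, or a typed tool cited", which is an assumption about the intended semantics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReformCase:
    """One eval case: a user prompt plus the expected revenue behavior (hypothetical shape)."""
    name: str
    prompt: str
    revenue_sign: int                      # +1 = raises revenue, -1 = costs revenue
    reference_bn: Optional[float] = None   # known answer in GBP bn, if one exists


@dataclass(frozen=True)
class AgentResult:
    """What the harness is assumed to extract from the agent's response."""
    revenue_change_bn: float
    emitted_chart: bool
    cited_tool: Optional[str]


def check_case(case: ReformCase, result: AgentResult, charts_mode_on: bool) -> List[str]:
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    observed = result.revenue_change_bn

    # 1. Direction: the observed change must have the expected sign.
    if observed * case.revenue_sign <= 0:
        failures.append("wrong revenue direction")

    # 2. Magnitude: within a factor of 10 of the reference, when one is known.
    if case.reference_bn is not None and observed != 0:
        ratio = abs(observed) / abs(case.reference_bn)
        if not 0.1 <= ratio <= 10:
            failures.append("magnitude off by more than 10x")

    # 3. Presentation: a chart (only counts when charts mode is on) or a cited tool.
    has_chart = charts_mode_on and result.emitted_chart
    if not (has_chart or result.cited_tool):
        failures.append("no chart and no typed tool cited")

    return failures


basic_rate = ReformCase(
    name="basic_rate_plus_1pp",
    prompt="What's the distributional impact of raising the income tax "
           "basic rate from 20% to 21%?",
    revenue_sign=+1,
    reference_bn=7.0,  # midpoint of the ~6-8bn range in case 1; illustrative only
)
```

Returning a list of failures rather than raising on the first one lets a single run report every way a case regressed, which is useful when a prompt change breaks direction and presentation at once.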

Why this matters

Without these eval cases, fixes to system-prompt recipes (#filed-companion), typed tools (#filed-companion), and model selection (#filed-companion) can regress silently. Each PR that touches reform handling should run these cases.

Depends on

PR #52 landing (eval harness scaffolding).

Constraints

  • Cases should be cheap to run — single household / minimal microsim where possible.
  • Pin a dataset version (e.g. EFRS 2025/26) so case results are reproducible.
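One way to make the dataset pin enforceable is to fold it into the key under which each case result is stored, so results produced against different dataset versions are never compared. The sketch below assumes that approach. `PINNED_DATASET` and `result_key` are hypothetical names, and the string is only the label used in this issue, not a real dataset identifier.

```python
import hashlib
import json

# Label taken from this issue; the identifier the harness actually uses is an assumption.
PINNED_DATASET = "EFRS 2025/26"


def result_key(case_name: str, dataset: str = PINNED_DATASET) -> str:
    """Derive a stable key from the case name and the dataset pin."""
    payload = json.dumps({"case": case_name, "dataset": dataset}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Changing the pin changes every key, so stale baselines fall out of comparison automatically instead of producing spurious regressions.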
