diff --git a/.gitignore b/.gitignore
index b543491..b259d0e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,6 +4,7 @@ __pycache__/
 .next/
 node_modules/
 .venv/
+.venv-*/
 *.egg-info/
 dist/
@@ -11,3 +12,6 @@ dist/
 # Both backend/Dockerfile and modal_app.py regenerate this against the
 # installed policyengine-uk-compiled version, so it never lives in git.
 backend/reference.md
+
+# Eval run outputs — generated per run, not source
+evals/runs/
diff --git a/evals/README.md b/evals/README.md
new file mode 100644
index 0000000..5aa500e
--- /dev/null
+++ b/evals/README.md
@@ -0,0 +1,88 @@
+# uk-chat evals
+
+Evaluation harness for the chat. Tests two positionings — **supplement** to app-v2 reports and **alternative** to them — against pre-committed thresholds.
+
+The full design (rationale, thresholds, scenario descriptions) lives in [SPEC.md](./SPEC.md). This README covers structure and how to extend.
+
+## What's here today
+
+```
+evals/
+  SPEC.md          ← the design doc — read this first
+  README.md        ← this file
+  scenarios/
+    a1_*.yaml      ← Test A scenarios (supplement)
+    b1_*.yaml      ← Test B scenarios (alternative)
+  fixtures/
+    pe_api/        ← reference PE-API responses for Test B
+                   ← populated by a follow-up PR
+  runs/            ← raw eval-run output (gitignored)
+                   ← populated by the runner in a follow-up PR
+```
+
+## Scenario file shape
+
+Every scenario is a YAML file in `scenarios/`. The fields are the same across A and B; some are populated only on one side.
+
+```yaml
+id: a1_mechanism        # short stable identifier — also the filename prefix
+test: A                 # A (supplement) or B (alternative)
+title: "Mechanism explanation"
+what_it_tests: "One sentence on why this scenario exists."
+
+# Sent to the chat as ?scenario_context= (Test A only). Mirrors what the
+# app-v2 drawer would prepend when opening from a report.
+scenario_context: |
+  ...
+
+# The user's actual message.
+prompt: |
+  ...
+
+chat_settings:
+  model_backend: uk_python    # uk_compiled | uk_python
+  num_runs: 3                 # how many fresh sessions to run per scenario
+
+# Test A — qualitative grading rubric.
+rubric:
+  relevance: "1-5 — chat answers the actual follow-up"
+  methodology: "1-5 — chat states dataset/year/assumptions"
+  reasonableness: "1-5 — numbers in plausible range, internally consistent"
+  consistency_with_report: "1-5 — agrees with report on shared facts"
+  honesty: "1-5 — refuses cleanly vs fabricates"
+
+# Test B — numeric comparison against a fixture.
+reference:
+  fixture: pe_api/b1_society_wide_pa.json
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      tolerance_pct: 1.0
+    - path: decile.relative
+      tolerance_pct: 1.0
+
+# Both A and B — golden-path guidance the grader uses (and an LLM-judge will use later).
+anchor:
+  must_mention: ["personal allowance taper", "60%", "£100,000"]
+  must_not_say: ["top decile pays more tax than the 8th"]
+  ideal_explanation: |
+    Multi-line prose describing what a Vahid-quality answer would look like.
+    Used by the human grader as a reference; will become input to an
+    automated LLM-judge in v2.
+```
+
+Where anchors come from: PolicyEngine's published research blog where one matches the scenario (Vahid's UK posts under `app/src/data/posts/articles/` in the app-v2 repo), otherwise UK tax-rule knowledge. See `SPEC.md` § "Anchors" for the per-scenario sourcing.
+
+The runner (added in a follow-up PR) reads every YAML in `scenarios/`, POSTs to a configured chat backend, and writes outputs under `runs///`. See SPEC.md "Roadmap" for what's coming next.
+
+## Decision thresholds
+
+Pre-committed in SPEC.md so we don't rationalize ambiguous results later:
+
+- **Test A** passes if mean rubric score ≥ 4.0 across all responses, no individual score < 2 on Reasonableness or Honesty, and ≤ 1-in-5 fabrication rate.
+- **Test B** passes if field-level accuracy ≥ 95%, self-consistency SD < 0.5% of mean, methodology drift in ≤ 1 of 5 scenarios, failure rate < 10%.
+
+## Adding scenarios
+
+Add a YAML file. That's it — the runner picks it up automatically. Keep the `id` short and stable (it appears in filenames and reports).
+
+When adding a Test B scenario, also add the reference fixture under `fixtures/pe_api/` and reference it via the relative path in the scenario YAML.
diff --git a/evals/RESULTS-2026-05-27.md b/evals/RESULTS-2026-05-27.md
new file mode 100644
index 0000000..aecc27c
--- /dev/null
+++ b/evals/RESULTS-2026-05-27.md
@@ -0,0 +1,138 @@
+# uk-chat eval — first run results
+
+**Run date:** 2026-05-27
+**Scenarios:** 9 (5 Test A, 4 Test B after dropping B5 for baseline drift)
+**Runs per scenario:** 3 fresh sessions
+**Total conversations:** 27
+**Backend under test:** `policyengine--peukchat-feat-model-backend-selector-web.modal.run` (PR 51 preview)
+**`model_backend`:** `uk_python` (`policyengine_uk 2.88.20`)
+
+## Headline
+
+**Both tests fail. Test B fails on timeouts and accuracy; Test A fails on rubric quality, trust-killers, and fabrication rate.**
+
+| Test | Threshold | Result | Verdict |
+|---|---|---|---|
+| B field accuracy | ≥ 95% within 1-15% tol | **75%** | ❌ FAIL |
+| B self-consistency SD | < 0.5% of mean | 2 violations | ❌ FAIL |
+| B methodology drift | ≤ 1 of 4 scenarios | 0 | ✓ |
+| B failure rate | < 10% | **67%** | ❌ FAIL hard |
+| A mean rubric score | ≥ 4.0 | **3.09** | ❌ FAIL |
+| A trust-killer < 2 | none allowed | **10 dim-scores < 2** across 4 responses | ❌ FAIL |
+| A fabrication rate | ≤ 20% | **27%** (4 of 15) | ❌ FAIL |
+
+## The clearest finding
+
+**The deployed chat cannot reliably complete population-level reform questions within its 10-minute timeout.**
+
+- B1 (society-wide PA reform): **3 of 3 runs hit the 600s HTTP timeout.** Failure rate 100%.
+- B2 (NI/IT stacked layers): **2 of 3 runs timed out** at ~600s; the surviving run took 547s.
+- B3 (household calc, no microdata): completed in 213-365s. Only B scenario that succeeded on all runs.
+- B4 (MTR schedule, no microdata): completed in 76-240s. Tool counts 8-12.
+
+**Both B scenarios that *don't* need microdata succeed (B3, B4). Both B scenarios that *do* need microdata fail (B1, B2).** Pattern: when the chat has to load EFRS and run a microsim through `policyengine_uk` by writing arbitrary Python, it runs out of time long before producing an answer.
+
+This isn't methodology drift, hallucination, or wrong numbers — the **chat literally couldn't finish**. 600s requests carrying 12-34 tool calls each, then the Modal worker hangs up.
+
+## Other findings worth surfacing
+
+**A4 out-of-scope is the clean win.** All 3 runs refused cleanly in 7-9 seconds with 0-1 tool calls. Of all 9 scenarios, only A4 looked unambiguously correct on every run.
+
+**Variable-name drift confirmed.** B3 had three runs:
+- Run 1: 4/4 fields correct
+- Run 2: 2/4 correct (national_insurance came back as 2.00 — likely the chat reporting 2% NI rate where we expected the £-amount; extractor edge case OR chat output drift)
+- Run 3: 3/4 correct (`marginal_tax_rate` extracted as `2025.00` — almost certainly the chat reporting "2025-26 fiscal year" being grabbed as the rate)
+
+B3 self-consistency SD: 166% (driven by these extraction artefacts on top of any real chat variance).
+
+This mirrors the **B4 smoke-test finding** where one of three runs used `gov_tax` and `ni_class_1_employee_primary` instead of `income_tax` and `national_insurance`, silently producing wrong numbers without errors. Same class of failure: Claude picks marginally different code each run, sometimes wrong, no errors thrown.
+
+**A5 (factual lookup) timed out once.** Should be a parameter-tree lookup — instead 23 tool calls, 600s, dead. The chat tried to *simulate* what should have been a one-line `get_parameter` call. Confirms the anchor's warning: "could waste tool calls trying to 'calculate' what's just a parameter lookup."
+
+**A2 (subset slice) and A3 (multi-param what-if) thrash heavily.** 17-33 tool calls per run, 96-449s for A2, 321-610s for A3. One A3 run timed out. These are the population-level scenarios with subset filtering and reform composition — the exact shapes prior memories flagged as Claude's weakest area.
+
+## Per-scenario summary
+
+| Scenario | Runs | Errors | Median time | Median tool calls | Notes |
+|---|---|---|---|---|---|
+| a1_mechanism | 3 | 0 | 30s | 2 | Substantive prose, varies on whether the £100k PA-taper mechanism is identified |
+| a2_subset_slice | 3 | 0 | 428s | 25 | Heavy thrashing; chat searched for filtering API |
+| a3_multiparam_what_if | 3 | 1 | 333s | 26 | One timeout; the surviving runs are long |
+| a4_out_of_scope | 3 | 0 | 8s | 0 | Clean refusals every time |
+| a5_factual_lookup | 3 | 1 | 56s | 8 | Surprising timeout; chat tried to simulate vs look up |
+| b1_society_wide_pa | 3 | 3 | 600s | 28 | All 3 timed out |
+| b2_ni_it_stacked | 3 | 2 | 601s | 25 | 2 of 3 timed out |
+| b3_household_calc | 3 | 0 | 266s | 26 | Completes but mixed extractor + content correctness |
+| b4_mtr_schedule | 3 | 0 | 215s | 12 | 2/3 runs correct, 1 used wrong variable names |
+
+## Cost
+
+Combined over both runs (27 conversations):
+
+| Metric | Value |
+|---|---|
+| Input tokens | 447,265 |
+| Output tokens | 171,620 |
+| Cache reads | 8,288,775 (≈free) |
+| Cache creates | 179,344 |
+| Wall time (full 21-run) | 33 minutes at concurrency=4 |
+| Wall time (smoke 6-run) | 10 minutes sequential |
+| User-balance billed | £0 (runner doesn't send `user_id`) |
+| Org Anthropic-account billed | ≈$2-3 estimated |
+
+## Interpretation against pre-committed positioning
+
+**Per `SPEC.md`:**
+
+> "If thresholds met: alternative positioning is viable. If not: supplement-only is the right framing."
+
+Test B failed on three of four pre-committed thresholds, including the simplest one, failure rate (67% against a < 10% threshold). Test A failed on all three thresholds (mean 3.09 vs 4.0; 4 of 15 responses fabricated; 10 dimension-scores below the trust-killer line concentrated in A3 and A5 run 2).
+
+**Both the alternative and supplement positionings fall short of the pre-committed thresholds.** Alternative positioning is clearly off the table. Supplement positioning is still defensible only because A4 (out-of-scope refusal) and A1 (mechanism explanation) score acceptably — those are the question shapes the chat *should* own in v1.
+
+## Test A by the numbers
+
+| Scenario | Mean (across 5 dims × 3 runs) | Notable |
+|---|---|---|
+| **a4_out_of_scope** | **5.00** | All 3 runs clean refusals; only scenario with zero trust-killers and zero fabrication |
+| a1_mechanism | 2.93 | Run 1 misses PA-taper entirely; run 2 places taper at £125,140 not £100k; run 3 names taper correctly but contradicts itself on D10 income range |
+| a2_subset_slice | 2.93 | Run 1 produces a table (D1 gain of £159 implausibly high); run 2 flat distribution that's qualitatively wrong; run 3 gives up after 17 tool calls |
+| a5_factual_lookup | 2.67 | Run 1 supplied values "from official records" (training-data fallback after the chat admitted it couldn't query the parameter tree); run 2 timed out at 600s with no answer; run 3 returned a clean table, also without showing the parameter path |
+| **a3_multiparam_what_if** | **1.93** | Run 1 truncated mid-tool-use, no answer. Run 2 used synthetic households and quoted £0.66m / £1.08m budgetary impacts — off by ~4 orders of magnitude. Run 3 got the directional conclusion right but quoted £6-7bn vs reality ~£20bn for PA-only |
+
+**Where the fabrication concentrates:** A3 runs 2 and 3 (made-up budgetary magnitudes), A5 run 1 (training-data values dressed up as model output), and A1 run 1 (invented £972/£1,094 saving figures that don't reflect the actual PA-raise arithmetic).
+
+**Pattern.** The chat is reliable for two shapes only: clean refusals (A4) and short mechanism prose where it doesn't have to run anything (A1, qualitatively, though even there it gets details wrong). Anything requiring it to produce a number from the engine either thrashes, times out, or fabricates.
+
+## What this implies — and what the data points at as a fix
+
+The 27-run dataset turns the prior "structured tools would help" hypothesis (memory: `project_structured_tools_evidence.md`) into a quantified claim:
+
+- Every B scenario that ran free-form Python through `policyengine_uk` either timed out or produced wrong numbers in at least one of three runs.
+- B3 and B4 (small situations) completed but still showed 1-in-3-ish content drift from variable-name selection.
+- A4 (refusal) had zero variance — no tool use to drift on.
+
+**Three concrete next steps:**
+
+1. **Add structured tools** for the common shapes — `calculate_household`, `run_economy_comparison`, `get_baseline_parameters` — so the LLM fills in JSON forms instead of writing variable-picking Python. The dormant tools in `agent_tools.py` cover most of this.
+2. **If structured tools are deferred**, position chat as **supplement only**. The "ask a follow-up" drawer integration in `policyengine-app-v2#1036` is already aligned with this framing.
+3. **Re-run the eval after structured-tool changes** to measure the lift. The harness and fixtures are reusable.
+
+## Caveats and limits of this run
+
+- **n=3 per scenario is statistically thin.** Findings are directionally strong (especially the timeout pattern) but individual numbers shouldn't be over-interpreted.
+- **B3 extractor showed false negatives** on prose-embedded numbers. Some "failures" may be extractor weakness rather than chat error — worth manually reviewing before publishing the per-field accuracy numbers.
+- **Test A grading is one grader's judgement.** A second grader on the same 15 responses would shift individual scores; the aggregate verdict (fail on all three thresholds) is robust to ±1 per dimension but the per-scenario means are not.
+- **Vahid-published figures couldn't be the fixture** for B2's full set because `policyengine_uk 2.88.20` baseline has incorporated the Autumn Budget 2025 changes; B5 was dropped entirely for the same reason. See `drift_report.md` alongside the fixtures.
+
+## Artifacts
+
+Everything reproducible from this repo:
+
+- Scenarios: `evals/scenarios/*.yaml`
+- Fixtures + drift report: `evals/fixtures/`
+- Run logs (SSE + extracted text + meta JSON):
+  - `evals/runs/2026-05-27_060403/` (smoke: A1 + B4)
+  - `evals/runs/2026-05-27_075425/` (full: 7 remaining)
+- Grader output: `evals/runs//B_results.md`, `A_grading.md`
+- This writeup: `evals/RESULTS-2026-05-27.md`
diff --git a/evals/SPEC.md b/evals/SPEC.md
new file mode 100644
index 0000000..76f50a8
--- /dev/null
+++ b/evals/SPEC.md
@@ -0,0 +1,185 @@
+# uk-chat evaluation spec
+
+> Defines how we test uk-chat output to decide whether the chat works better as a **supplement** to app-v2 reports (follow-up questions reports can't answer) or as an **alternative** producing the same answers reports do. Two tests, two pre-committed thresholds.
+
+Related external context:
+- [policyengine-app-v2#1036](https://github.com/PolicyEngine/policyengine-app-v2/pull/1036) — report chat drawer (the supplement integration)
+- [policyengine-uk-chat#51](https://github.com/PolicyEngine/policyengine-uk-chat/pull/51) — backend selector + `scenario_context` (chat changes that enable both positionings)
+- [policyengine-api](https://github.com/PolicyEngine/policyengine-api) — produces the reference numbers used in Test B
+
+## Problem statement
+
+Decide between two product positionings of uk-chat: **supplement** (a follow-up affordance on app-v2 reports for questions reports can't answer) or **alternative** (chat as a primary way to get the same answers reports give, in addition to or instead of the report UI). The decision needs evidence, not opinions. This spec defines the evaluation that produces that evidence.
+
+## Two tests, two positionings, two thresholds
+
+### Test A — Chat as supplement
+
+**Question:** Does the chat usefully extend reports for the long-tail of follow-up questions?
+
+**Test setup:**
+- 5 scenarios of `(report shown, follow-up question)` pairs.
+- Each scenario run 3× in fresh sessions to capture variance.
+- Grade per-rubric (below), no PE-API comparison needed (these are questions reports can't answer).
+
+**Rubric (per response, scored 1-5):**
+- **Relevance.** Did the chat answer the actual follow-up, or pivot to something unrelated?
+- **Methodology disclosure.** Did the chat say what dataset/year/assumption it used?
+- **Reasonableness.** Are the numbers in a plausible range? Do they internally agree (sum-to-totals, decile orderings, sign of effect)?
+- **Consistency with report.** Where the chat references something the report already showed, do they agree?
+- **Honesty about limits.** When the chat can't answer, does it say so cleanly, or fabricate?
+
+**Decision threshold (pre-committed):**
+- Mean rubric score ≥ 4.0 across all responses.
+- No single response scoring < 2 on Reasonableness or Honesty (those are the trust-killer failures).
+- No more than 1 in 5 responses where the chat fabricates a figure not derivable from a run.
+
+### Test B — Chat as alternative
+
+**Question:** Can the chat produce the same answers app-v2 reports do, with comparable trustworthiness?
+
+**Test setup:**
+- 5 scenarios chosen to match shapes app-v2 already answers (economy-wide reform, household calc, MTR, etc).
+- For each: run app-v2 report (or equivalent PE-API call) → record reference numbers. Run chat 3× per scenario → record numeric output.
+- Compare numerically per scenario field.
+
+**Metrics:**
+- **Field-level accuracy:** % of numeric fields within 1% of PE-API.
+- **Self-consistency:** standard deviation of each numeric field across 3 chat runs as a % of mean. Should be < 0.5% for the chat to be considered deterministic enough.
+- **Methodology drift:** count of runs where chat picks a different methodology (e.g. poverty BHC vs AHC, dataset year, decile definition) than PE-API.
+- **Failure rate:** % of runs that fail to produce comparable numbers (timeouts, errors, LLM-thrash episodes).
+
+**Decision threshold (pre-committed):**
+- Field-level accuracy ≥ 95%.
+- Self-consistency SD < 0.5% of mean for every numeric field.
+- Methodology drift in ≤ 1 of 5 scenarios.
+- Failure rate < 10%.
+
+**If thresholds met:** alternative positioning is viable. **If not:** supplement-only is the right framing.
+
+## Test scenarios
+
+Each scenario is a tuple of `(scenario_id, question shape, what's being tested, fixtures)`. All UK 2025, baseline = current law unless stated. All chat runs use the `uk_python` backend pinned to `enhanced_frs_2023_24.h5` (the same dataset PE-API uses).
+
+### Test A — supplement scenarios
+
+These all assume the user has just viewed a specific app-v2 report and asks a follow-up.
+
+**A1 — Mechanism explanation**
+- **Report shown:** UK PA-raise reform 2025 (current £12,570 → £15,000), economy-wide.
+- **Follow-up question:** "The report says the top decile gains less in % terms than the 8th decile — why? Walk me through the mechanism."
+- **What it tests:** Can the chat reason about *why* a result looks the way it does, beyond just quoting the numbers?
+- **Doesn't fit a report** because the report shows what, not why. This is the long-tail question shape supplements should serve.
+
+**A2 — Subset breakdown not in the report**
+- **Report shown:** UK PA-raise reform 2025, economy-wide.
+- **Follow-up question:** "How does this reform affect single parents with two children specifically? Breakdown by income decile."
+- **What it tests:** Can the chat slice the population in a way the canonical report doesn't, computing fresh from the model?
+- **Stress test** because this combines applying a reform *and* filtering the population — two operations that are individually non-trivial for the LLM via the Python backend's API.
+
+**A3 — Comparative scenario the user invented**
+- **Report shown:** UK PA-raise reform 2025.
+- **Follow-up question:** "What if we'd also raised the higher rate threshold from £50,270 to £55,000 alongside the PA raise — would that be more or less progressive?"
+- **What it tests:** Multi-parameter ad-hoc reform comparison. Users couldn't construct this in app-v2 without building a new report.
+- **High-value supplement** if it works because it lets users iterate without leaving the page.
+
+**A4 — Out-of-scope question, polite refusal**
+- **Report shown:** UK PA-raise reform 2025.
+- **Follow-up question:** "How would this same reform affect inflation forecasts?"
+- **What it tests:** Honest scope refusal vs fabrication. PolicyEngine doesn't model macro feedback effects; the chat should say so cleanly.
+- **The "honesty under pressure" test.** Easy to fail by confidently making something up.
+
+**A5 — Historical/factual question, no simulation needed**
+- **Report shown:** UK PA-raise reform 2025.
+- **Follow-up question:** "How has the UK personal allowance changed over the last 15 years? Just the figures."
+- **What it tests:** Whether the chat handles factual-lookup questions (which need no simulation) without unnecessary tool use, and whether it knows where the data lives.
+- **Edge case** because this is information the underlying packages have but the chat may not surface cleanly — could waste tool calls trying to "calculate" what's just a parameter lookup.
+
+### Test B — alternative scenarios
+
+These are questions app-v2 already answers via reports. The chat must match.
+
+**B1 — Society-wide reform, single-parameter**
+- **Question:** "Run a UK economy-wide comparison for 2025: baseline current law, reform raises the income tax personal allowance from £12,570 to £15,000. Report total budgetary impact, decile income changes (both £ and %), and BHC poverty rates for all/child/working-age/senior."
+- **PE-API reference:** Generated against the live PE-API by the fixture-build step and saved to `evals/fixtures/pe_api/b1_society_wide_pa.json`.
+- **What it tests:** Baseline replication. If the chat can't match here, it can't match anywhere.
+
+**B2 — Stacked NI + IT + threshold-freeze reform (Reeves 2025 pre-Budget package)**
+- **Question:** Three changes stacked — extend the IT threshold freeze to 2029-30, reduce NI main rate 8%→6%, increase IT basic 20%→22% and higher 40%→42%. Report combined and per-layer revenue, per-reform decile impacts, and the example household (£60k earner + £10k pension) figures.
+- **Reference:** PolicyEngine's published analysis by Vahid Ahmadi (Nov 2025) — `app/src/data/posts/articles/uk-income-tax-ni-reforms-2025.md` — gives canonical per-layer figures (£3.5bn freeze, £11.7bn NI cut, £18.6bn IT increase, £6.9bn combined in 2026-27) and per-decile percentages.
+- **What it tests:** Multi-parameter reform with stacking. Does the chat understand and apply the stacking methodology correctly, and does it reproduce PolicyEngine's *own published* numbers?
+
+**B3 — Household calc (no microdata needed)**
+- **Question:** "Single adult, age 35, employment income £45,000 in UK 2025, no dependents, England. Compute: net income, income tax, employee NI, marginal tax rate at this income point."
+- **PE-API reference:** to be generated via `/uk/household` endpoint.
+- **What it tests:** Deterministic household calculations — single-household rule application without microdata or aggregation. The chat should be at its best here.
+
+**B4 — MTR schedule**
+- **Question:** "Compute the combined IT+NI marginal rate at gross income levels £10k, £20k, £30k, £50k, £75k, £100k, £125k, £150k for a single UK adult in 2025."
+- **Ground truth:** Generated by directly calling `policyengine_uk` at fixture-build time. No microdata or PE-API involved.
+- **What it tests:** Schedule lookup against rule-driven ground truth. Acts as a sanity check on the test infrastructure — if this scenario fails, the runner or extractor is broken before we draw conclusions about anything else.
+
+**B5 — Remove the two-child benefit limit (Autumn Budget 2025)**
+- **Question:** UK 2026-27 economy-wide — remove the two-child limit on UC and CTC. Report cost, decile impacts, child-poverty change, Gini change, illustrative household.
+- **Reference:** PolicyEngine's published analysis by Vahid Ahmadi (Oct 2025) — `app/src/data/posts/articles/uk-two-child-limit.md` — gives £2.9bn cost in 2026-27, -13.5% absolute child poverty BHC, -0.55% Gini, D2 sees the largest relative gain.
+- **What it tests:** Benefit-side reform (UC/CTC), not tax — different model surface from the other B scenarios. Tests reproduction of canonical PolicyEngine numbers on a recent published analysis.
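The two numeric Test B metrics defined under "Metrics" (field-level accuracy and self-consistency) could be computed roughly as follows. This is only a sketch: the function names are hypothetical, the spec does not say whether the SD is a population or sample SD (population SD is assumed here), and the example runs are invented for illustration. Only the two reference values come from the B3 fixture.

```python
import math
from statistics import mean, pstdev


def within_tolerance(value, reference, tolerance_pct=1.0):
    """True if value is within tolerance_pct percent of the reference."""
    if reference == 0:
        return value == 0
    return abs(value - reference) / abs(reference) * 100.0 <= tolerance_pct


def field_accuracy(runs, reference, tolerance_pct=1.0):
    """Share of (run, field) pairs within tolerance; a missing field counts as a miss."""
    checks = [
        field in run and within_tolerance(run[field], expected, tolerance_pct)
        for run in runs
        for field, expected in reference.items()
    ]
    return sum(checks) / len(checks)


def self_consistency_pct(values):
    """Population SD across runs as a percentage of the mean (assumed definition)."""
    centre = mean(values)
    spread = pstdev(values)
    if centre == 0:
        return 0.0 if spread == 0 else math.inf
    return spread / abs(centre) * 100.0


# Reference values taken from the B3 fixture; the three runs are made up.
reference = {"income_tax": 6486.00, "national_insurance": 2594.40}
runs = [
    {"income_tax": 6486.0, "national_insurance": 2594.4},
    {"income_tax": 6486.0, "national_insurance": 2.0},  # a rate extracted in place of an amount
    {"income_tax": 6490.0, "national_insurance": 2594.4},
]

print(f"field accuracy: {field_accuracy(runs, reference):.0%}")
for field in reference:
    values = [run[field] for run in runs]
    print(f"{field}: SD is {self_consistency_pct(values):.2f}% of mean")
```

Counting a missing field as a miss keeps extractor gaps visible in the accuracy figure instead of silently shrinking the denominator.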
+
+## Anchors — golden-path guidance per scenario
+
+Every scenario YAML has an `anchor` block that captures what an *ideal* response would look like — drawn either from PolicyEngine's own published research blog (for the scenarios we have one for) or from UK tax-rule knowledge. The anchor has three parts:
+
+- **`must_mention`** — phrases or facts a good answer must include (regex-matchable).
+- **`must_not_say`** — claims that would be wrong or misleading (also regex-matchable).
+- **`ideal_explanation`** / **`ideal_finding`** — a prose sketch of what a Vahid-quality answer would look like, for human-grader reference.
+
+In v1, anchors are **grader aids** — the human grader reads them before scoring so the rubric is calibrated to PolicyEngine's house standard. In v2 they become *inputs* to an LLM-judge that scores automatically.
+
+Where anchors come from:
+
+| Scenario | Anchor source |
+|---|---|
+| A1 — Mechanism | UK PA-taper rule + `uk-income-tax-ni-reforms-2025.md` discussion of the same dynamic |
+| A2 — Subset slice | UK rule knowledge + reasoning about how PA changes affect single parents |
+| A3 — Multi-param what-if | `uk-income-tax-ni-reforms-2025.md` stacking methodology + standard progressivity definitions |
+| A4 — Out-of-scope | PolicyEngine scope (microsim, no GE/macro) |
+| A5 — Historical lookup | UK PA parameter history |
+| B1 — Society-wide PA | PE-API output (the fixture) + UK reporting norms |
+| B2 — Stacked NI/IT/freeze | `uk-income-tax-ni-reforms-2025.md` published numbers |
+| B3 — Household calc | UK tax rules at the £45k income point |
+| B4 — MTR schedule | UK tax rule schedule (deterministic) |
+| B5 — Two-child limit | `uk-two-child-limit.md` published numbers |
+
+If we add scenarios later, the strong preference is to **anchor each one against a published PolicyEngine post** when there's one that matches. This makes the eval into "does the chat reproduce PolicyEngine's published analyses?" which is a much stronger framing than "does the chat match a one-off API call." It also makes the eval defensible externally — every reference number has a paper trail.
+
+## Proposed solutions & tradeoffs
+
+### Approach 1: hand-authored small scenarios, manual grading
+
+**Description:** As above — 10 scenarios total (5 A + 5 B), 3 runs each = 30 conversations. Grader is a human (initially Sakshi).
+**Pros:** Fast to set up. Grading captures qualitative judgment (methodology drift, hallucination) that automated metrics miss. Small enough that mistakes are recoverable.
+**Cons:** N=5 per test is statistically thin. Manual grading is the bottleneck. Doesn't scale to ongoing monitoring.
+**Verdict:** accepted for v1.
+
+### Approach 2: larger automated test set
+
+**Description:** ~50 scenarios per test, parsed numerics extracted via regex/LLM, compared automatically.
+**Pros:** Better statistical power. Re-runnable on every chat change.
+**Cons:** Significant up-front cost to build the harness and reliable extractors. Probably premature — we don't know yet what the failure modes are.
+**Verdict:** deferred. Revisit after v1 results.
+
+### Approach 3: live A/B with real users
+
+**Description:** Ship both supplement and standalone-chat to a fraction of users, instrument usage, decide based on real behaviour.
+**Pros:** Most ecologically valid signal.
+**Cons:** Slow to gather data, exposes possibly-bad output to users, hard to attribute outcomes cleanly.
+**Verdict:** rejected for the initial decision. Could be a follow-up after v1 confirms the basic story.
+
+## Roadmap
+
+This PR lands the spec, the 10 anchored scenarios, and the runner. Still to come (in this same PR):
+
+1. **B fixtures.** Generate reference outputs for B1-B5 into `evals/fixtures/pe_api/` — PE-API calls for the society-wide scenarios, direct `policyengine_uk` calls for the household / MTR scenarios.
+2. **Grader.** For Test A: a markdown grading sheet (one row per response) with the rubric and anchor columns. Manual fill for v1. For Test B: an extractor that pulls numerics from SSE responses and diffs against fixtures.
+3. **Findings writeup.** Once 30 conversations are graded, a `RESULTS-YYYY-MM-DD.md` report. Per-scenario verdicts, threshold check results, recommended positioning, new failure modes if any.
+
+The chat backend URL the runner targets is configurable via env var so the eval can run against either a preview deploy or production.
diff --git a/evals/fixtures/drift_report.md b/evals/fixtures/drift_report.md
new file mode 100644
index 0000000..37b0ae6
--- /dev/null
+++ b/evals/fixtures/drift_report.md
@@ -0,0 +1,72 @@
+# Test B fixture drift report
+
+Generated by `build_fixtures.py`. Drift threshold: 10.0%.
+
+For each scenario, fields whose locally-computed value drifted more than
+the threshold from the published reference are dropped from the fixture.
+Dropped fields indicate `policyengine_uk` baseline has moved since the
+post was published — not a bug, just model evolution.
+
+---
+
+## b1_society_wide_pa
+
+No published reference (B1 is an author-defined scenario, not from a blog post).
+Local computation is the canonical fixture.
+
+- budgetary_impact: £-20,452,307,295
+- tax_revenue_impact: £-21,084,668,787
+- benefit_spending_impact: £-632,361,491
+
+## b2_ni_it_stacked
+
+Drift threshold: 10.0%
+
+| field | published | ours | drift | kept? |
+|---|---|---|---|---|
+| `freeze_layer.budgetary_impact` | £+3,500,000,000 | £+0 | -100.0% | ✗ |
+| `ni_layer.budgetary_impact` | £-11,700,000,000 | £-11,839,866,703 | -1.2% | ✓ |
+| `it_layer.budgetary_impact` | £+18,600,000,000 | £+16,424,706,760 | -11.7% | ✗ |
+| `combined.budgetary_impact` | £+6,900,000,000 | £+4,584,840,056 | -33.6% | ✗ |
+
+Per-decile distributions saved for ni_alone and it_alone
+(freeze_layer + combined distributions dropped due to baseline drift).
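The drift and "kept?" columns in the b2 table can be reproduced with a few lines of arithmetic. The sketch below assumes drift is measured as `(ours - published) / |published|`, which matches every row of the table; the helper names are hypothetical and `build_fixtures.py` may well implement this differently.

```python
DRIFT_THRESHOLD_PCT = 10.0  # threshold quoted in this report


def drift_pct(published: float, ours: float) -> float:
    """Signed drift of the local value, relative to the published magnitude."""
    return (ours - published) / abs(published) * 100.0


def keep_field(published: float, ours: float,
               threshold_pct: float = DRIFT_THRESHOLD_PCT) -> bool:
    """A field stays in the fixture only if its absolute drift is within the threshold."""
    return abs(drift_pct(published, ours)) <= threshold_pct


# (published, ours) pairs copied from the b2 table, in pounds.
B2_ROWS = {
    "freeze_layer.budgetary_impact": (3_500_000_000, 0),
    "ni_layer.budgetary_impact": (-11_700_000_000, -11_839_866_703),
    "it_layer.budgetary_impact": (18_600_000_000, 16_424_706_760),
    "combined.budgetary_impact": (6_900_000_000, 4_584_840_056),
}

for field, (published, ours) in B2_ROWS.items():
    kept = "kept" if keep_field(published, ours) else "dropped"
    print(f"{field}: {drift_pct(published, ours):+.1f}% ({kept})")
```

Dividing by the absolute published value keeps the sign convention consistent for negative impacts, which is why the NI layer shows -1.2% even though both figures are negative.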
+
+## b3_household_calc
+
+No published reference. Local computation is the canonical fixture.
+
+- household_net_income: £35,760.55
+- income_tax: £6,486.00
+- national_insurance: £2,594.40
+- marginal_tax_rate: 28.00%
+
+## b4_mtr_schedule
+
+No published reference. Local computation is the canonical fixture.
+
+Combined IT+NI MTR by gross income:
+  £ 10,000: 0.0%
+  £ 20,000: 28.0%
+  £ 30,000: 28.0%
+  £ 50,000: 28.0%
+  £ 75,000: 42.0%
+  £100,000: 62.0%
+  £125,000: 62.0%
+  £150,000: 47.0%
+
+## b5_two_child_limit
+
+**Dropped — model baseline drift.**
+
+Vahid's reform 93219 sets `child_count` cap to 100/102 effective 2025+,
+which was meaningful when the post was published (pre-Autumn Budget 2025).
+policyengine_uk 2.88.20 now has the cap at `inf` from 2026 onward as
+current law (the Autumn Budget 2025 removal is baked into baseline).
+
+Result: reform vs current law is a zero-delta no-op. £0 budgetary impact,
+0pp poverty change, 0% Gini change.
+
+Re-enabling requires either (a) replacing with a different reform Vahid
+wrote that is still counterfactual today, or (b) pinning to an older
+policyengine_uk version that pre-dates the baseline update.
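Because the b3 and b4 fixtures are rule-driven, their income tax, NI and marginal-rate values can be cross-checked by hand. The sketch below hard-codes the 2025-26 income tax and employee Class 1 NI schedule for a taxpayer outside Scotland and uses a £1 finite difference for the marginal rate. It is an independent sanity check under those assumptions, not how `build_fixtures.py` or `policyengine_uk` computes the values, and it makes no attempt to reproduce `household_net_income`.

```python
# 2025-26 statutory figures (non-Scottish taxpayer), entered by hand.
PERSONAL_ALLOWANCE = 12_570
TAPER_START = 100_000           # allowance falls by £1 per £2 of income above this
BASIC_BAND = 37_700             # width of the 20% band on taxable income
ADDITIONAL_THRESHOLD = 125_140  # 45% applies to taxable income above this
NI_PRIMARY_THRESHOLD = 12_570
NI_UPPER_LIMIT = 50_270


def income_tax(gross: float) -> float:
    allowance = max(0.0, PERSONAL_ALLOWANCE - max(0.0, gross - TAPER_START) / 2)
    taxable = max(0.0, gross - allowance)
    basic = min(taxable, BASIC_BAND)
    higher = min(max(taxable - BASIC_BAND, 0.0), ADDITIONAL_THRESHOLD - BASIC_BAND)
    additional = max(taxable - ADDITIONAL_THRESHOLD, 0.0)
    return 0.20 * basic + 0.40 * higher + 0.45 * additional


def employee_ni(gross: float) -> float:
    main = min(max(gross - NI_PRIMARY_THRESHOLD, 0.0), NI_UPPER_LIMIT - NI_PRIMARY_THRESHOLD)
    upper = max(gross - NI_UPPER_LIMIT, 0.0)
    return 0.08 * main + 0.02 * upper


def combined_mtr(gross: float, delta: float = 1.0) -> float:
    """Extra income tax plus NI on the next `delta` pounds, as a fraction."""
    before = income_tax(gross) + employee_ni(gross)
    after = income_tax(gross + delta) + employee_ni(gross + delta)
    return round((after - before) / delta, 4)


# b4 schedule as listed in this report, in percent.
B4_FIXTURE = {10_000: 0.0, 20_000: 28.0, 30_000: 28.0, 50_000: 28.0,
              75_000: 42.0, 100_000: 62.0, 125_000: 62.0, 150_000: 47.0}

for gross, expected in B4_FIXTURE.items():
    computed = round(combined_mtr(gross) * 100, 1)
    print(f"£{gross:>7,}: computed {computed}%, fixture {expected}%")

print(f"b3 income tax: £{income_tax(45_000):,.2f}, employee NI: £{employee_ni(45_000):,.2f}")
```

The 62% rows follow from the allowance taper: each extra £1 between £100,000 and £125,140 exposes £1.50 to the 40% rate, and 2% NI applies on top.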
diff --git a/evals/fixtures/pe_api/.gitkeep b/evals/fixtures/pe_api/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/evals/fixtures/pe_api/b1_society_wide_pa.json b/evals/fixtures/pe_api/b1_society_wide_pa.json
new file mode 100644
index 0000000..6cf4244
--- /dev/null
+++ b/evals/fixtures/pe_api/b1_society_wide_pa.json
@@ -0,0 +1,90 @@
+{
+  "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20",
+  "_scenario": "PA \u00a312,570 \u2192 \u00a315,000, UK 2025, EFRS 2023-24",
+  "budget": {
+    "budgetary_impact": -20452307295.159668,
+    "tax_revenue_impact": -21084668786.5271,
+    "state_tax_revenue_impact": 0.0,
+    "benefit_spending_impact": -632361491.3674316,
+    "households": 31953971.416512378,
+    "baseline_net_income": 1728049930241.7156
+  },
+  "decile": {
+    "relative": {
+      "1": 0.004968200636804344,
+      "2": 0.009309003581381547,
+      "3": 0.01057564733683128,
+      "4": 0.011099952336793551,
+      "5": 0.011816873603277435,
+      "6": 0.013743440778813563,
+      "7": 0.015384707949647889,
+      "8": 0.015108006406911852,
+      "9": 0.014794885960264402,
+      "10": 0.008800007786865901
+    },
+    "average": {
+      "1": 106.93962289058543,
+      "2": 280.666902638706,
+      "3": 367.8910420501796,
+      "4": 422.9463876951294,
+      "5": 491.038892727913,
+      "6": 728.3004353254806,
+      "7": 836.7740461027877,
+      "8": 1014.2588495254175,
+      "9": 1110.2858561823086,
+      "10": 1214.8397640254018
+    }
+  },
+  "poverty": {
+    "poverty": {
+      "child": {
+        "baseline": 0.14061780050646178,
+        "reform": 0.13959329465582693
+      },
+      "adult": {
+        "baseline": 0.08759545897508306,
+        "reform": 0.08643590796613664
+      },
+      "senior": {
+        "baseline": 0.12155488270743689,
+        "reform": 0.1195026821747089
+      },
+      "all": {
+        "baseline": 0.10503115545324755,
+        "reform": 0.10373596289984098
+      }
+    },
+    "deep_poverty": {
+      "child": {
+        "baseline": 0.015249819569817049,
+        "reform": 0.01524948638393475
+      },
+      "adult": {
+        "baseline": 0.018463754678802034,
+        "reform": 0.018463433332651054
+      },
+      "senior": {
+        "baseline": 0.006305741937669883,
+        "reform": 0.006305741937669883
+      },
+      "all": {
+        "baseline": 0.015549737414820214,
+        "reform": 0.015549472660392407
+      }
+    }
+  },
+  "inequality": {
+    "gini": {
+      "baseline": 0.29337889671381984,
+      "reform": 0.29322557596654897
+    },
+    "top_10_pct_share": {
+      "baseline": 0.2588464823506093,
+      "reform": 0.2578490048888376
+    },
+    "top_1_pct_share": {
+      "baseline": 0.09261611050166191,
+      "reform": 0.09151029069322303
+    }
+  }
+}
\ No newline at end of file
diff --git a/evals/fixtures/pe_api/b2_ni_it_stacked.json b/evals/fixtures/pe_api/b2_ni_it_stacked.json
new file mode 100644
index 0000000..a68f2d3
--- /dev/null
+++ b/evals/fixtures/pe_api/b2_ni_it_stacked.json
@@ -0,0 +1,36 @@
+{
+  "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20",
+  "_scenario": "Reeves Nov-2025 NI/IT/freeze package (Vahid blog uk-income-tax-ni-reforms-2025.md)",
+  "_published": "https://policyengine.org/uk/research/uk-income-tax-ni-reforms-2025",
+  "ni_layer": {
+    "budgetary_impact": -11839866703.482544
+  },
+  "ni_alone": {
+    "decile_relative": {
+      "1": 0.0008486258305459847,
+      "2": 0.002296899587035953,
+      "3": 0.0027957413474279155,
+      "4": 0.003954396513231289,
+      "5": 0.005013721686535896,
+      "6": 0.00709450049310447,
+      "7": 0.00879407626384442,
+      "8": 0.009820447295388779,
+      "9": 0.010440596409482483,
+      "10": 0.0064798144839921835
+    }
+  },
+  "it_alone": {
+    "decile_relative": {
+      "1": -0.0012350961563262956,
+      "2": -0.003188850565802245,
+      "3": -0.00336280444689441,
+      "4": -0.0046718474048025125,
+      "5": -0.005979446122827847,
+      "6": -0.008056753415299716,
+      "7": -0.010267850479079306,
+      "8": -0.01122766177011029,
+      "9": -0.012261005277499282,
+      "10": -0.013557914306514462
+    }
+  }
+}
\ No newline at end of file
diff --git a/evals/fixtures/pe_api/b3_household_calc.json b/evals/fixtures/pe_api/b3_household_calc.json
new file mode 100644
index 0000000..472e4f4
--- /dev/null
+++ b/evals/fixtures/pe_api/b3_household_calc.json
@@ -0,0 +1,10 @@
+{
+  "_source": "local policyengine_uk 2.88.20",
+ "_scenario": "single adult age 35, gross \u00a345,000, UK 2025, no microdata", + "result": { + "household_net_income": 35760.55078125, + "income_tax": 6486.0, + "national_insurance": 2594.4033203125, + "marginal_tax_rate": 28.000000000000004 + } +} \ No newline at end of file diff --git a/evals/fixtures/pe_api/b4_mtr_schedule.json b/evals/fixtures/pe_api/b4_mtr_schedule.json new file mode 100644 index 0000000..0586f10 --- /dev/null +++ b/evals/fixtures/pe_api/b4_mtr_schedule.json @@ -0,0 +1,54 @@ +{ + "_source": "local policyengine_uk 2.88.20", + "_scenario": "single adult MTR schedule, UK 2025, finite-difference", + "rows": [ + { + "gross": 10000, + "it_mtr": 0.0, + "ni_mtr": 0.0, + "combined_mtr": 0.0 + }, + { + "gross": 20000, + "it_mtr": 20.0, + "ni_mtr": 8.0, + "combined_mtr": 28.0 + }, + { + "gross": 30000, + "it_mtr": 20.0, + "ni_mtr": 8.0, + "combined_mtr": 28.0 + }, + { + "gross": 50000, + "it_mtr": 20.0, + "ni_mtr": 8.0, + "combined_mtr": 28.0 + }, + { + "gross": 75000, + "it_mtr": 40.0, + "ni_mtr": 2.0, + "combined_mtr": 42.0 + }, + { + "gross": 100000, + "it_mtr": 60.0, + "ni_mtr": 2.0, + "combined_mtr": 62.0 + }, + { + "gross": 125000, + "it_mtr": 60.0, + "ni_mtr": 2.0, + "combined_mtr": 62.0 + }, + { + "gross": 150000, + "it_mtr": 45.0, + "ni_mtr": 2.0, + "combined_mtr": 47.0 + } + ] +} \ No newline at end of file diff --git a/evals/runner/README.md b/evals/runner/README.md new file mode 100644 index 0000000..d3d0186 --- /dev/null +++ b/evals/runner/README.md @@ -0,0 +1,107 @@ +# eval runner + grader + +Two scripts: + +- `run.py` — POSTs each scenario in `evals/scenarios/*.yaml` to a chat backend, saves raw SSE + extracted text + meta JSON per run under `evals/runs//`. +- `grade.py` — reads a finished run dir. For Test A scenarios, emits a markdown grading sheet the human fills in. For Test B, runs automated numeric extraction + anchor checks against fixtures. + +## Run + +```sh +# All scenarios, default backend (PR 51 preview). 
+python evals/runner/run.py + +# Just some. +python evals/runner/run.py a1_mechanism b1_society_wide_pa + +# Preview what would run. +python evals/runner/run.py --dry-run + +# Point at a different backend. +python evals/runner/run.py --backend-url https://policyengine-uk-chat.vercel.app +# or +UK_CHAT_BACKEND_URL=https://... python evals/runner/run.py +``` + +## Vercel preview deployments + +If the backend is a Vercel preview behind deployment protection, set the bypass token from the chat project's "Protection Bypass for Automation" setting: + +```sh +UK_CHAT_BYPASS_TOKEN=... python evals/runner/run.py +``` + +The token is appended to the request URL as `?x-vercel-protection-bypass=...`. It's redacted in the saved meta JSON so the artifact is safe to share. + +## Output + +``` +evals/runs// + manifest.json # one row per (scenario, run) + / + scenario.yaml # frozen copy of what was run + run-1.sse # raw SSE stream + run-1.txt # concatenated chunk deltas (final answer) + run-1.meta.json # event counts, timing, errors, redacted URL + run-2.sse / run-2.txt / run-2.meta.json + ... +``` + +`evals/runs/` is gitignored — these are artifacts of a specific run, not source. + +## Dependencies + +```sh +python -m venv .venv && source .venv/bin/activate +pip install -r evals/runner/requirements.txt +``` + +`httpx` for the streaming POST, `pyyaml` for scenario loading. Both stdlib-adjacent — no LLM frameworks, no Anthropic SDK on this side. + +## What the runner is not + +- It is **not** parallel. Sequential by design for clean logs and to avoid hitting backend rate limits during long economy-wide runs. If/when we move to ~50 scenarios, add a `--parallel N` flag. +- It does **not** mutate the scenarios or fixtures dir. + +--- + +## Grade + +```sh +# Generate A_grading.md (human sheet) + B_results.md (automated diffs). +python evals/runner/grade.py evals/runs/2026-05-15_120000 + +# Just one path. 
+python evals/runner/grade.py --test A +python evals/runner/grade.py --test B + +# After A_grading.md has been filled in by a human, apply Test A thresholds. +python evals/runner/grade.py --threshold-check +``` + +### Test A flow + +`grade.py --test A` walks the run dir and produces `A_grading.md`. Each A response gets a section with: + +- Prompt and scenario_context (collapsible) +- The anchor (must_mention / must_not_say with regex hit/miss, plus `ideal_explanation`) +- The chat response +- Empty score fields for each rubric dimension + +The grader (you) opens the file in an editor, replaces each ⬜ with a 1-5 score, and marks the fabrication question yes/no. + +Then `--threshold-check` parses the filled sheet and applies the SPEC.md thresholds: mean rubric ≥ 4.0, no individual < 2 on reasonableness/honesty, fabrication rate ≤ 20%. Output goes to `threshold_check.md` and `.json`. + +### Test B flow + +Fully automated. For each B scenario: + +- Loads the fixture from `evals/fixtures/pe_api/`. +- For each run, extracts numeric values from the response prose using per-field label regexes (heuristic). +- Diffs against the fixture with per-field `tolerance_pct`. +- Computes self-consistency (SD across runs as % of mean). +- Runs the anchor's `must_mention` / `must_not_say` regex checks. + +Output: `B_results.json` (machine-readable) and `B_results.md` (human-readable per-scenario diffs + threshold verdict). + +The extractor is best-effort regex over prose, so some fields legitimately come back as `⏭ no expected / couldn't extract`. Those are diagnostics for the grader, not failure verdicts. diff --git a/evals/runner/build_fixtures.py b/evals/runner/build_fixtures.py new file mode 100644 index 0000000..57b6286 --- /dev/null +++ b/evals/runner/build_fixtures.py @@ -0,0 +1,520 @@ +#!/usr/bin/env python3 +""" +Build reference fixtures for Test B scenarios. + +For each blog-grounded scenario: + 1. 
Fetch reform JSONs from PE-API's /uk/policy/ endpoint (this works + even when the /economy endpoint is down — it's a DB read). + 2. Run the reform locally via `policyengine.Simulation + + calculate_economy_comparison` against EFRS 2023-24 (downloaded from HF). + 3. Compare against the published figures in Vahid's blog post. Apply a + drift threshold: keep fields whose locally-computed value is within + tolerance of the published figure, drop the rest. + 4. Write a fixture JSON containing only the kept fields + a sibling + drift_report.md listing kept/dropped/why for human review. + +For scenarios with no published source (B1 PA reform, B3 household calc, +B4 MTR schedule) the local computation IS the fixture — we record what the +engine produces and use it as the reference. + +Generated fixtures are committed to git so the grader doesn't refetch on +every CI run. Re-run this script when scenarios change or to bump engine +versions; expect dropped fields to change as PolicyEngine UK's current-law +baseline drifts. + +Usage: + python evals/runner/build_fixtures.py # all scenarios + python evals/runner/build_fixtures.py b1 b3 # just these + python evals/runner/build_fixtures.py --validate-only # don't rebuild, + # check that + # scenario YAML + # paths resolve + # in existing + # fixtures +""" + +from __future__ import annotations + +import argparse +import json +import os +import sys +import time +from pathlib import Path +from typing import Any + +import httpx +import yaml + + +# Paths +EVALS_DIR = Path(__file__).resolve().parent.parent +SCENARIOS_DIR = EVALS_DIR / "scenarios" +FIXTURES_DIR = EVALS_DIR / "fixtures" / "pe_api" + +PE_API_BASE = "https://api.policyengine.org" +HF_DATASET = "policyengine/policyengine-uk-data-private" +HF_FILE = "enhanced_frs_2023_24.h5" + +# Drift threshold: when comparing our locally-computed numbers to Vahid's +# published figures, accept fields whose magnitude is within 10% of the +# published value. 
Larger drift = engine baseline has moved since publication, +# field is dropped from the fixture with a note. +DRIFT_TOLERANCE_PCT = 10.0 + + +# --------------------------------------------------------------------------- +# Lazy heavy imports (so --validate-only and --help don't pay for them) +# --------------------------------------------------------------------------- + +def _import_pe(): + """Import the policyengine stack. Only needed when actually building.""" + from policyengine import Simulation + from policyengine.outputs.macro.comparison.calculate_economy_comparison import ( + calculate_economy_comparison, + ) + from policyengine_core.tools.hugging_face import download_huggingface_dataset + return Simulation, calculate_economy_comparison, download_huggingface_dataset + + +_DATASET_PATH_CACHE: str | None = None + + +def get_dataset_path() -> str: + global _DATASET_PATH_CACHE + if _DATASET_PATH_CACHE is None: + _, _, download = _import_pe() + _DATASET_PATH_CACHE = download(repo=HF_DATASET, repo_filename=HF_FILE) + return _DATASET_PATH_CACHE + + +# --------------------------------------------------------------------------- +# PE-API helpers +# --------------------------------------------------------------------------- + +def fetch_reform_json(reform_id: int) -> dict[str, Any]: + """Pull a reform's policy_json from PE-API's policy endpoint. + + /uk/policy/ is a DB read — works even when /economy is broken. 
+ """ + r = httpx.get(f"{PE_API_BASE}/uk/policy/{reform_id}", timeout=30.0) + r.raise_for_status() + body = r.json() + if body.get("status") != "ok": + raise RuntimeError(f"policy fetch failed for {reform_id}: {body}") + return body["result"]["policy_json"] or {} + + +# --------------------------------------------------------------------------- +# Local sim runner +# --------------------------------------------------------------------------- + +def run_economy( + *, + reform: dict[str, Any] | None, + baseline: dict[str, Any] | None, + time_period: int, +) -> dict[str, Any]: + """Run reform-vs-baseline through the policyengine package and return the + EconomyComparison output as a dict.""" + Simulation, calculate_economy_comparison, _ = _import_pe() + sim = Simulation( + country="uk", + scope="macro", + data=get_dataset_path(), + time_period=time_period, + region="uk", + reform=reform, + baseline=baseline, + ) + return calculate_economy_comparison(sim).model_dump() + + +def run_household(situation: dict[str, Any], year: int) -> dict[str, float]: + """Compute single-household figures via policyengine_uk directly.""" + from policyengine_uk import Simulation as UKSimulation + sim = UKSimulation(situation=situation) + return { + "household_net_income": float(sim.calculate("household_net_income", year)[0]), + "income_tax": float(sim.calculate("income_tax", year)[0]), + "national_insurance": float(sim.calculate("national_insurance", year)[0]), + } + + +def mtr_at(year: int, gross_income: int) -> dict[str, float]: + """Combined IT + NI marginal tax rate at a single income point. + + Computed by finite difference: tax at (gross + £100) − tax at (gross). + With a £100 step, the difference in pounds is numerically the rate in pp.
+ """ + def at(income: int) -> tuple[float, float]: + from policyengine_uk import Simulation as UKSimulation + sit = { + "people": {"p": {"age": 35, "employment_income": income}}, + "benunits": {"b": {"members": ["p"]}}, + "households": {"h": {"members": ["p"]}}, + } + sim = UKSimulation(situation=sit) + return ( + float(sim.calculate("income_tax", year)[0]), + float(sim.calculate("national_insurance", year)[0]), + ) + + it_a, ni_a = at(gross_income) + it_b, ni_b = at(gross_income + 100) + return { + "gross": gross_income, + "it_mtr": round(it_b - it_a, 2), + "ni_mtr": round(ni_b - ni_a, 2), + "combined_mtr": round((it_b - it_a) + (ni_b - ni_a), 2), + } + + +# --------------------------------------------------------------------------- +# Drift comparison +# --------------------------------------------------------------------------- + +def within_tolerance(ours: float, published: float, pct: float = DRIFT_TOLERANCE_PCT) -> bool: + if published == 0: + return ours == 0 + return abs(ours - published) / abs(published) * 100 <= pct + + +def drift_pct(ours: float, published: float) -> float: + if published == 0: + return float("inf") if ours != 0 else 0.0 + return (ours - published) / abs(published) * 100 + + +# --------------------------------------------------------------------------- +# Per-scenario builders. Each returns (fixture_dict, drift_report_lines). +# --------------------------------------------------------------------------- + +def build_b1() -> tuple[dict[str, Any], list[str]]: + """B1 — PA raise £12,570 → £15,000, UK 2025. No blog reference; the local + computation IS the fixture.""" + print(" fetching/preparing reform...") + # B1 isn't a Vahid scenario, so we author the reform JSON inline. Small, + # single-parameter, no risk of drift since the parameter isn't in baseline. 
+ reform = { + "gov.hmrc.income_tax.allowances.personal_allowance.amount": { + "2025-01-01.2025-12-31": 15000, + }, + } + print(" running locally...") + t = time.time() + result = run_economy(reform=reform, baseline=None, time_period=2025) + print(f" ({time.time()-t:.0f}s)") + + fixture = { + "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20", + "_scenario": "PA £12,570 → £15,000, UK 2025, EFRS 2023-24", + "budget": result["budget"], + "decile": result["decile"], + "poverty": result["poverty"], + "inequality": result["inequality"], + } + drift = [ + "## b1_society_wide_pa", + "", + "No published reference (B1 is an author-defined scenario, not from a blog post).", + "Local computation is the canonical fixture.", + "", + f"- budgetary_impact: £{result['budget']['budgetary_impact']:+,.0f}", + f"- tax_revenue_impact: £{result['budget']['tax_revenue_impact']:+,.0f}", + f"- benefit_spending_impact: £{result['budget']['benefit_spending_impact']:+,.0f}", + ] + return fixture, drift + + +def build_b2() -> tuple[dict[str, Any], list[str]]: + """B2 — stacked NI/IT/freeze (Vahid Nov-2025 post). 
Filter against + published figures per the drift threshold.""" + print(" fetching reform JSONs from PE-API...") + reforms = { + "freeze": fetch_reform_json(83092), + "ni_alone": fetch_reform_json(94906), + "it_alone": fetch_reform_json(94910), + "ni_layer": fetch_reform_json(94938), + "combined": fetch_reform_json(94911), + } + for name, rj in reforms.items(): + print(f" {name}: {len(rj)} parameter(s)") + + print(" running scenarios...") + runs = {} + t = time.time() + runs["freeze"] = run_economy(reform=reforms["freeze"], baseline=None, time_period=2028) + print(f" freeze done ({time.time()-t:.0f}s)"); t = time.time() + runs["ni_alone"] = run_economy(reform=reforms["ni_alone"], baseline=None, time_period=2026) + print(f" ni_alone done ({time.time()-t:.0f}s)"); t = time.time() + runs["it_alone"] = run_economy(reform=reforms["it_alone"], baseline=None, time_period=2026) + print(f" it_alone done ({time.time()-t:.0f}s)"); t = time.time() + runs["ni_layer"] = run_economy(reform=reforms["ni_layer"], baseline=reforms["freeze"], time_period=2026) + print(f" ni_layer done ({time.time()-t:.0f}s)"); t = time.time() + runs["it_layer"] = run_economy(reform=reforms["combined"], baseline=reforms["ni_layer"], time_period=2026) + print(f" it_layer done ({time.time()-t:.0f}s)"); t = time.time() + runs["combined"] = run_economy(reform=reforms["combined"], baseline=None, time_period=2026) + print(f" combined done ({time.time()-t:.0f}s)") + + # Vahid's published figures (uk-income-tax-ni-reforms-2025.md, Nov 2025). 
+ PUBLISHED = { + "freeze_layer.budgetary_impact": 3_500_000_000, # £3.5bn in 2028-29 + "ni_layer.budgetary_impact": -11_700_000_000, # -£11.7bn in 2026-27 + "it_layer.budgetary_impact": 18_600_000_000, # +£18.6bn in 2026-27 + "combined.budgetary_impact": 6_900_000_000, # +£6.9bn in 2026-27 + } + OURS = { + "freeze_layer.budgetary_impact": runs["freeze"]["budget"]["budgetary_impact"], + "ni_layer.budgetary_impact": runs["ni_layer"]["budget"]["budgetary_impact"], + "it_layer.budgetary_impact": runs["it_layer"]["budget"]["budgetary_impact"], + "combined.budgetary_impact": runs["combined"]["budget"]["budgetary_impact"], + } + + drift = ["## b2_ni_it_stacked", ""] + drift.append(f"Drift threshold: {DRIFT_TOLERANCE_PCT}%") + drift.append("") + drift.append("| field | published | ours | drift | kept? |") + drift.append("|---|---|---|---|---|") + fixture: dict[str, Any] = { + "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20", + "_scenario": "Reeves Nov-2025 NI/IT/freeze package (Vahid blog uk-income-tax-ni-reforms-2025.md)", + "_published": "https://policyengine.org/uk/research/uk-income-tax-ni-reforms-2025", + } + for key, pub in PUBLISHED.items(): + ours = OURS[key] + d = drift_pct(ours, pub) + kept = within_tolerance(ours, pub) + drift.append(f"| `{key}` | £{pub:+,.0f} | £{ours:+,.0f} | {d:+.1f}% | {'✓' if kept else '✗'} |") + if kept: + section, field = key.split(".") + fixture.setdefault(section, {})[field] = ours + drift.append("") + + # Per-decile patterns: NI cut alone + IT increase alone reproduced cleanly + # in our trial. Save those distributions to the fixture. Freeze distribution + # is dropped because the layer itself drops out under drift. 
+ fixture["ni_alone"] = {"decile_relative": runs["ni_alone"]["decile"]["relative"]} + fixture["it_alone"] = {"decile_relative": runs["it_alone"]["decile"]["relative"]} + + drift.append("Per-decile distributions saved for ni_alone and it_alone") + drift.append("(freeze_layer + combined distributions dropped due to baseline drift).") + return fixture, drift + + +def build_b3() -> tuple[dict[str, Any], list[str]]: + """B3 — household calc, deterministic. No blog reference. Local = fixture.""" + print(" running single-household calculation...") + situation = { + "people": {"p": {"age": 35, "employment_income": 45000}}, + "benunits": {"b": {"members": ["p"]}}, + "households": {"h": {"members": ["p"]}}, + } + base = run_household(situation, 2025) + # MTR by finite difference + bumped = { + "people": {"p": {"age": 35, "employment_income": 45100}}, + "benunits": {"b": {"members": ["p"]}}, + "households": {"h": {"members": ["p"]}}, + } + bumped_net = run_household(bumped, 2025) + mtr = (1 - (bumped_net["household_net_income"] - base["household_net_income"]) / 100) * 100 + + fixture = { + "_source": "local policyengine_uk 2.88.20", + "_scenario": "single adult age 35, gross £45,000, UK 2025, no microdata", + "result": {**base, "marginal_tax_rate": mtr}, + } + drift = [ + "## b3_household_calc", + "", + "No published reference. 
Local computation is the canonical fixture.", + "", + f"- household_net_income: £{base['household_net_income']:,.2f}", + f"- income_tax: £{base['income_tax']:,.2f}", + f"- national_insurance: £{base['national_insurance']:,.2f}", + f"- marginal_tax_rate: {mtr:.2f}%", + ] + return fixture, drift + + +def build_b4() -> tuple[dict[str, Any], list[str]]: + """B4 — MTR schedule at 8 income points, local-computed.""" + print(" computing MTR schedule (8 income points)...") + rows = [mtr_at(2025, inc) for inc in (10000, 20000, 30000, 50000, 75000, 100000, 125000, 150000)] + fixture = { + "_source": "local policyengine_uk 2.88.20", + "_scenario": "single adult MTR schedule, UK 2025, finite-difference", + "rows": rows, + } + drift = [ + "## b4_mtr_schedule", + "", + "No published reference. Local computation is the canonical fixture.", + "", + "Combined IT+NI MTR by gross income:", + ] + drift.extend(f" £{r['gross']:>7,}: {r['combined_mtr']:5.1f}%" for r in rows) + return fixture, drift + + +def build_b5_dropped() -> tuple[dict[str, Any] | None, list[str]]: + """B5 — two-child limit removal. Dropped: the reform is a no-op against + current policyengine_uk 2.88.20 (which incorporates the Autumn Budget 2025 + removal as baseline). Documented here for the drift report.""" + print(" (scenario marked dropped — no-op vs current baseline)") + return None, [ + "## b5_two_child_limit", + "", + "**Dropped — model baseline drift.**", + "", + "Vahid's reform 93219 sets `child_count` cap to 100/102 effective 2025+,", + "which was meaningful when the post was published (pre-Autumn Budget 2025).", + "policyengine_uk 2.88.20 now has the cap at `inf` from 2026 onward as", + "current law (the Autumn Budget 2025 removal is baked into baseline).", + "", + "Result: reform vs current law is a zero-delta no-op. 
£0 budgetary impact,", + "0pp poverty change, 0% Gini change.", + "", + "Re-enabling requires either (a) replacing with a different reform Vahid", + "wrote that is still counterfactual today, or (b) pinning to an older", + "policyengine_uk version that pre-dates the baseline update.", + ] + + +# --------------------------------------------------------------------------- +# Validation (read-only) path +# --------------------------------------------------------------------------- + +def resolve_path(node: Any, dotted_path: str) -> tuple[bool, Any]: + cur = node + for part in dotted_path.split("."): + if isinstance(cur, dict) and part in cur: + cur = cur[part] + elif isinstance(cur, list) and part.isdigit(): + idx = int(part) + if 0 <= idx < len(cur): + cur = cur[idx] + else: + return False, None + else: + return False, None + return True, cur + + +def validate_fixture_paths(scenario_id: str, scenario: dict[str, Any], fixture: dict[str, Any]) -> list[str]: + ref = scenario.get("reference") or {} + misses = [] + for fc in ref.get("fields_to_compare") or []: + path = fc["path"] + if fc.get("expected_approx") is not None: + continue + ok, value = resolve_path(fixture, path) + if not ok: + misses.append(path) + return misses + + +# --------------------------------------------------------------------------- +# Orchestration +# --------------------------------------------------------------------------- + +BUILDERS = { + "b1": ("b1_society_wide_pa.json", build_b1), + "b2": ("b2_ni_it_stacked.json", build_b2), + "b3": ("b3_household_calc.json", build_b3), + "b4": ("b4_mtr_schedule.json", build_b4), + "b5": ("b5_two_child_limit.json", build_b5_dropped), +} + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("scenarios", nargs="*", + help="Scenario shorthand (b1, b2, ...). 
Empty = all.") + parser.add_argument("--validate-only", action="store_true", + help="Don't rebuild; check scenario YAML paths resolve in existing fixtures.") + args = parser.parse_args() + + keys = args.scenarios or sorted(BUILDERS.keys()) + bad = [k for k in keys if k not in BUILDERS] + if bad: + raise SystemExit(f"Unknown scenarios: {bad}. Known: {sorted(BUILDERS)}") + + FIXTURES_DIR.mkdir(parents=True, exist_ok=True) + + # Surface HF token so policyengine_core can download the dataset. + if not os.environ.get("HUGGING_FACE_TOKEN"): + env_file = Path(__file__).resolve().parents[3] / ".env" + if env_file.exists(): + for line in env_file.read_text().splitlines(): + if line.startswith("POLICYENGINE_UK_DATA_TOKEN="): + os.environ["HUGGING_FACE_TOKEN"] = line.split("=", 1)[1].strip() + break + + drift_report = [ + "# Test B fixture drift report", + "", + f"Generated by `build_fixtures.py`. Drift threshold: {DRIFT_TOLERANCE_PCT}%.", + "", + "For each scenario, fields whose locally-computed value drifted more than", + "the threshold from the published reference are dropped from the fixture.", + "Dropped fields indicate `policyengine_uk` baseline has moved since the", + "post was published — not a bug, just model evolution.", + "", + "---", + "", + ] + + all_misses: list[tuple[str, list[str]]] = [] + for key in keys: + filename, builder = BUILDERS[key] + fixture_path = FIXTURES_DIR / filename + print(f"\n=== {key} → {filename} ===") + + if args.validate_only: + if not fixture_path.exists(): + print(f" no fixture at {fixture_path} (skipping)") + continue + fixture = json.loads(fixture_path.read_text()) + else: + fixture, drift = builder() + drift_report.extend(drift) + drift_report.append("") + if fixture is None: + # Dropped scenario — remove any stale fixture file + if fixture_path.exists(): + fixture_path.unlink() + print(f" removed stale fixture {fixture_path}") + continue + fixture_path.write_text(json.dumps(fixture, indent=2, default=str)) + print(f" wrote 
{fixture_path}") + + # Validate the scenario YAML paths resolve in this fixture + scenario_files = list(SCENARIOS_DIR.glob(f"{key}_*.yaml")) + if scenario_files: + scenario = yaml.safe_load(scenario_files[0].read_text()) + misses = validate_fixture_paths(scenario["id"], scenario, fixture) + if misses: + all_misses.append((scenario["id"], misses)) + print(f" ⚠ {len(misses)} field path(s) didn't resolve in fixture:") + for m in misses: + print(f" - {m}") + else: + print(f" ✓ all field paths resolve") + + if not args.validate_only: + drift_path = FIXTURES_DIR.parent / "drift_report.md" + drift_path.write_text("\n".join(drift_report)) + print(f"\nwrote {drift_path}") + + if all_misses: + print("\n=== validation: ISSUES ===") + for sid, misses in all_misses: + print(f" {sid}: {misses}") + return 1 + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/evals/runner/grade.py b/evals/runner/grade.py new file mode 100644 index 0000000..30f3776 --- /dev/null +++ b/evals/runner/grade.py @@ -0,0 +1,1014 @@ +#!/usr/bin/env python3 +""" +Grader for a finished eval run. + +Two paths: +- Test A scenarios → emit a markdown grading sheet (one section per response) + with the prompt, the anchor (golden path), and the chat response side by + side. A human fills in rubric scores 1-5. Lightweight automated anchor + checks (must_mention / must_not_say substring scans) are pre-populated as + grader hints — not authoritative. +- Test B scenarios → load the matching fixture under evals/fixtures/pe_api/, + extract numerics from the chat response, diff against fixture per + fields_to_compare with tolerance. Also runs the must_mention / must_not_say + anchor checks. 
+ +After both paths, the script writes: + /A_grading.md ← human grading sheet (edit in place) + /B_results.json ← machine-readable B verdicts + /B_results.md ← human-readable B summary + /threshold_check.md ← regenerated after A_grading.md is filled + +Usage: + python evals/runner/grade.py # both A + B + python evals/runner/grade.py --test A # just A sheet + python evals/runner/grade.py --test B # just B diff + python evals/runner/grade.py --threshold-check # after grading A +""" + +from __future__ import annotations + +import argparse +import json +import re +import sys +from dataclasses import dataclass, field +from pathlib import Path +from statistics import mean, stdev +from typing import Any + +import yaml + + +# Paths +EVALS_DIR = Path(__file__).resolve().parent.parent +FIXTURES_DIR = EVALS_DIR / "fixtures" / "pe_api" + +# Test A rubric dimensions — must match the rubric block in scenario YAMLs. +A_RUBRIC_DIMENSIONS = ( + "relevance", + "methodology", + "reasonableness", + "consistency", + "honesty", +) + +# Test A thresholds from SPEC.md. +A_MEAN_THRESHOLD = 4.0 +A_TRUST_KILLER_MIN = 2 # no response < 2 on Reasonableness or Honesty +A_TRUST_KILLERS = ("reasonableness", "honesty") +A_FABRICATION_RATE_LIMIT = 0.2 # at most 1 in 5 responses with fabricated figures + +# Test B thresholds from SPEC.md. 
+B_FIELD_ACCURACY_THRESHOLD = 0.95 # ≥95% of fields within tolerance +B_SELF_CONSISTENCY_SD_LIMIT = 0.005 # SD < 0.5% of mean +B_METHODOLOGY_DRIFT_LIMIT = 1 # ≤ 1 of 5 scenarios +B_FAILURE_RATE_LIMIT = 0.10 # < 10% failure rate + + +# --------------------------------------------------------------------------- +# Loading & shared helpers +# --------------------------------------------------------------------------- + +@dataclass +class RunResponse: + scenario_id: str + run_index: int + answer_text: str + meta: dict[str, Any] + scenario: dict[str, Any] # frozen YAML copy alongside the run + + +def load_run(run_dir: Path) -> list[RunResponse]: + """Walk a run directory and load every response the runner produced.""" + responses = [] + for scenario_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()): + scenario_path = scenario_dir / "scenario.yaml" + if not scenario_path.exists(): + continue + scenario = yaml.safe_load(scenario_path.read_text()) + for txt_path in sorted(scenario_dir.glob("run-*.txt")): + # filename like run-2.txt + run_index = int(txt_path.stem.split("-")[1]) + answer = txt_path.read_text() + meta_path = scenario_dir / f"run-{run_index}.meta.json" + meta = json.loads(meta_path.read_text()) if meta_path.exists() else {} + responses.append(RunResponse( + scenario_id=scenario["id"], + run_index=run_index, + answer_text=answer, + meta=meta, + scenario=scenario, + )) + return responses + + +def anchor_check(answer: str, anchor: dict[str, Any] | None) -> dict[str, Any]: + """Run the lightweight substring checks the anchor defines. + + These are grader hints, not authoritative scores. A human can override. + Substrings are matched case-insensitively; no other normalisation is + applied, so e.g. '£100,000' does not match '100k'.
+ """ + if not anchor: + return {"checked": False, "must_mention": [], "must_not_say": []} + + norm_answer = answer.lower() + + def hit(phrase: str) -> bool: + return phrase.lower() in norm_answer + + must_mention = [ + {"phrase": p, "found": hit(p)} + for p in (anchor.get("must_mention") or []) + ] + must_not_say = [ + {"phrase": p, "found": hit(p)} + for p in (anchor.get("must_not_say") or []) + ] + return { + "checked": True, + "must_mention": must_mention, + "must_not_say": must_not_say, + "must_mention_pass_rate": ( + sum(1 for m in must_mention if m["found"]) / len(must_mention) + if must_mention else None + ), + "must_not_say_violations": [ + m["phrase"] for m in must_not_say if m["found"] + ], + } + + +# --------------------------------------------------------------------------- +# Test A — grading sheet generator +# --------------------------------------------------------------------------- + +def render_anchor_hints(check: dict[str, Any]) -> str: + """Render the regex check result as a short markdown block.""" + if not check.get("checked"): + return "_(no anchor)_" + lines = [] + rate = check.get("must_mention_pass_rate") + if rate is not None: + lines.append(f"`must_mention` substring matches: **{rate:.0%}**") + for m in check["must_mention"]: + mark = "✓" if m["found"] else "✗" + lines.append(f" - {mark} `{m['phrase']}`") + violations = check.get("must_not_say_violations") or [] + if violations: + lines.append(f"`must_not_say` **VIOLATIONS** ({len(violations)}):") + for v in violations: + lines.append(f" - ✗ `{v}`") + elif check.get("must_not_say"): + lines.append(f"`must_not_say`: clean ✓") + return "\n".join(lines) + + +def render_a_sheet(responses: list[RunResponse]) -> str: + """Emit a markdown sheet for human grading of Test A responses.""" + a_responses = [r for r in responses if r.scenario["test"] == "A"] + lines = [ + "# Test A — grading sheet", + "", + "Fill in 1-5 scores for each rubric dimension under every response.", + "The `must_mention` / 
`must_not_say` lines are grader hints from " + "automated substring scans — not authoritative. Use the ideal " + "explanation as your reference for what a Vahid-quality answer " + "looks like.", + "", + f"_{len(a_responses)} responses to grade._", + "", + ] + for r in a_responses: + anchor = r.scenario.get("anchor") or {} + check = anchor_check(r.answer_text, anchor) + rubric = r.scenario.get("rubric") or {} + ideal = anchor.get("ideal_explanation") or anchor.get("ideal_finding") or "_(none)_" + meta_summary = r.meta.get("summary", {}) + + lines.extend([ + "---", + f"## {r.scenario_id} — run {r.run_index}", + "", + f"**Title:** {r.scenario['title']}", + f"**Tool calls:** {meta_summary.get('tool_call_count', '?')} · " + f"**Elapsed:** {r.meta.get('elapsed_seconds', '?')}s · " + f"**Errors:** {meta_summary.get('error_count', '?')}", + "", + "### Prompt (user message)", + "```", + r.scenario["prompt"].strip(), + "```", + "", + ]) + if r.scenario.get("scenario_context"): + lines.extend([ + "### Scenario context (system-prompt prefix)", + "<details><summary>show</summary>", + "", + "```", + r.scenario["scenario_context"].strip(), + "```", + "</details>", + "", + ]) + lines.extend([ + "### Golden path", + "<details><summary>anchor</summary>", + "", + render_anchor_hints(check), + "", + "**Ideal explanation:**", + "", + ideal, + "</details>", + "", + "### Chat response", + "<details><summary>response (run-N.txt)</summary>", + "", + "> " + r.answer_text.replace("\n", "\n> "), + "</details>", + "", + "### Scores (fill in 1-5)", + ]) + for dim in A_RUBRIC_DIMENSIONS: + criterion = rubric.get(dim, "") + lines.append(f"- **{dim.title()}**: ⬜ _{criterion}_") + lines.extend([ + "- **Fabricated a figure not derivable from a model run?** ⬜ yes / no", + "", + "### Notes", + "_(optional)_", + "", + ]) + return "\n".join(lines) + + +# --------------------------------------------------------------------------- +# Test A — threshold check (run after grading sheet is filled in) +# --------------------------------------------------------------------------- + +SCORE_RE = re.compile( + r"^\s*-\s+\*\*(?P<dim>\w+)\*\*:\s*(?P<score>[1-5])", + re.MULTILINE, +) +FAB_RE = re.compile( + r"\*\*Fabricated a figure not derivable from a model run\?\*\*\s*" + r"(?:⬜\s*)?\s*(?P<answer>yes|no)\b", + re.IGNORECASE, +) +HEADER_RE = re.compile(r"^##\s+(?P<id>\S+)\s+—\s+run\s+(?P<n>\d+)", re.MULTILINE) + + +def parse_a_sheet(sheet_text: str) -> list[dict[str, Any]]: + """Pull filled-in scores out of the grading markdown.""" + # Split by ## headers — each section is one response. 
+ sections = re.split(r"(?=^##\s+\S+\s+—\s+run\s+\d+)", sheet_text, flags=re.MULTILINE) + parsed = [] + for sec in sections: + m_header = HEADER_RE.search(sec) + if not m_header: + continue + scores: dict[str, int] = {} + for m in SCORE_RE.finditer(sec): + dim = m.group("dim").lower() + if dim in A_RUBRIC_DIMENSIONS: + scores[dim] = int(m.group("score")) + fab_m = FAB_RE.search(sec) + fabricated = (fab_m.group("answer").lower() == "yes") if fab_m else None + parsed.append({ + "scenario_id": m_header.group("id"), + "run_index": int(m_header.group("n")), + "scores": scores, + "fabricated": fabricated, + }) + return parsed + + +def a_threshold_check(graded: list[dict[str, Any]]) -> dict[str, Any]: + """Apply the SPEC.md Test A thresholds to filled scores.""" + fully_graded = [ + g for g in graded + if all(d in g["scores"] for d in A_RUBRIC_DIMENSIONS) + ] + incomplete = [ + g for g in graded if g not in fully_graded + ] + + all_scores: list[int] = [] + trust_killer_failures: list[dict[str, Any]] = [] + fabrication_count = 0 + fabrication_total = 0 + + for g in fully_graded: + for dim, score in g["scores"].items(): + all_scores.append(score) + if dim in A_TRUST_KILLERS and score < A_TRUST_KILLER_MIN: + trust_killer_failures.append({ + "scenario_id": g["scenario_id"], + "run_index": g["run_index"], + "dimension": dim, + "score": score, + }) + if g["fabricated"] is True: + fabrication_count += 1 + if g["fabricated"] is not None: + fabrication_total += 1 + + mean_score = mean(all_scores) if all_scores else None + fab_rate = ( + fabrication_count / fabrication_total + if fabrication_total > 0 else None + ) + + passes = ( + mean_score is not None + and mean_score >= A_MEAN_THRESHOLD + and not trust_killer_failures + and (fab_rate is None or fab_rate <= A_FABRICATION_RATE_LIMIT) + ) + + return { + "mean_score": mean_score, + "mean_threshold": A_MEAN_THRESHOLD, + "trust_killer_failures": trust_killer_failures, + "trust_killer_threshold": A_TRUST_KILLER_MIN, + 
"fabrication_count": fabrication_count, + "fabrication_total": fabrication_total, + "fabrication_rate": fab_rate, + "fabrication_rate_limit": A_FABRICATION_RATE_LIMIT, + "fully_graded_count": len(fully_graded), + "incomplete_count": len(incomplete), + "passes": bool(passes), + } + + +# --------------------------------------------------------------------------- +# Test B — numeric extraction + diff +# --------------------------------------------------------------------------- + +NUMBER_RE = re.compile( + r"(?P<sign>[-+])?\s*£?\s*" + r"(?P<num>\d[\d,]*(?:\.\d+)?)" + r"\s*(?P<unit>bn|billion|m|million|k|thousand|pp|%)?", + re.IGNORECASE, +) + + +def parse_number_near(text: str, label_regex: str) -> float | None: + """Find a labelled financial figure in prose. + + Best-effort regex extraction. For each label match, we scan up to ~200 + chars ahead and prefer numbers that have an explicit £ prefix and a + bn/m unit — those are the figures the chat is reporting as results, + rather than reform parameters that happened to appear nearby (£15,000, + £100,000). Falls back to the first number found if no scaled value + appears in range. + """ + label_pat = re.compile(label_regex, re.IGNORECASE) + scaled_re = re.compile( + r"(?P<sign>[-+])?\s*£?\s*" + r"(?P<num>\d[\d,]*(?:\.\d+)?)" + r"\s*(?P<unit>bn|billion|m|million)\b", + re.IGNORECASE, + ) + + def to_float(num_m: re.Match) -> float | None: + raw = num_m.group("num").replace(",", "") + try: + n = float(raw) + except ValueError: + return None + unit = (num_m.group("unit") or "").lower() + if unit in ("bn", "billion"): + n *= 1_000_000_000 + elif unit in ("m", "million"): + n *= 1_000_000 + elif unit in ("k", "thousand"): + n *= 1_000 + if num_m.group("sign") == "-": + n = -n + return n + + for label_m in label_pat.finditer(text): + # Look in a wider window for a scaled (bn/m) figure first. 
+ tail = text[label_m.end():label_m.end() + 200] + scaled_m = scaled_re.search(tail) + if scaled_m: + n = to_float(scaled_m) + if n is not None: + # If the surrounding prose suggests this is a decrease but the + # number didn't carry a minus sign, flip it. + surrounding = text[label_m.start():label_m.end() + scaled_m.end()].lower() + if ( + n > 0 + and re.search(r"\b(reduc|cut|fall|decrease|cost|forg(o|on)e|less)", surrounding) + and not re.search(r"\bincrease|rais|gain|more\b", surrounding) + ): + n = -n + return n + # Fall back to any number in the closer window. + num_m = NUMBER_RE.search(text[label_m.end():label_m.end() + 120]) + if num_m: + return to_float(num_m) + return None + + +# Heuristic mapping from anchor field-path → label regex used to find the number +# in the chat's prose answer. Adding new B scenarios with new field paths needs +# either a matching label here or a per-scenario extractor. +FIELD_LABELS = { + "budget.budgetary_impact": r"budgetary impact", + "budget.tax_revenue_impact": r"(income\s*tax\s*revenue|tax\s*revenue\s*(impact|change))", + "budget.benefit_spending_impact": r"benefit\s*spending", + "combined.budgetary_impact_2026_27": r"combined.*(budgetary impact|revenue).*2026", + "layers.freeze_extension.budgetary_impact_2028_29": r"(threshold\s*freeze|freeze\s*extension).*(2028|£3.5)", + "layers.ni_cut.budgetary_impact_2026_27": r"(national\s*insurance|NI).*(cost|reduction|cut).*£", + "layers.it_increase.budgetary_impact_2026_27": r"(income\s*tax|IT).*(increase|raise|rise).*£", + "example_household.net_change": r"(£60[,]?000|example household|illustrative).*net", + "example_household.ni_change": r"(£60[,]?000|example|illustrative).*NI", + "example_household.it_change": r"(£60[,]?000|example|illustrative).*income\s*tax", + "budget.cost_2026_27": r"(cost|spending).*(2026|£2\.\d)", + "poverty.absolute_child_bhc.relative_change": r"(absolute\s*child\s*poverty|child poverty.*BHC).*(\-|fall|reduc)", + "inequality.gini.relative_change": 
r"gini", + "result.household_net_income": r"(household\s*net\s*income|net\s*income)", + "result.income_tax": r"income\s*tax", + "result.national_insurance": r"national\s*insurance", + "result.marginal_tax_rate": r"(marginal\s*tax\s*rate|combined\s*marginal)", +} + + +def extract_b_value(answer: str, field_path: str) -> float | None: + """Look up a heuristic label for this field path and pull a number.""" + label = FIELD_LABELS.get(field_path) + if not label: + return None + return parse_number_near(answer, label) + + +def _resolve_path(node: Any, dotted_path: str) -> Any | None: + """Walk a dotted path into nested dicts/lists; return None if missing. + + Integer-looking parts index into lists. So `rows.0.combined_mtr` works. + """ + cur = node + for part in dotted_path.split("."): + if isinstance(cur, dict) and part in cur: + cur = cur[part] + elif isinstance(cur, list) and part.isdigit(): + idx = int(part) + if 0 <= idx < len(cur): + cur = cur[idx] + else: + return None + else: + return None + return cur + + +def _diff_scalar(fc: dict[str, Any], fixture: dict[str, Any] | None, answer: str) -> dict[str, Any]: + path = fc["path"] + tolerance_pct = fc.get("tolerance_pct", 1.0) + extracted = extract_b_value(answer, path) + expected = fc.get("expected_approx") + if expected is None and fixture is not None: + value = _resolve_path(fixture, path) + if isinstance(value, (int, float)): + expected = float(value) + within = None + pct_off = None + if extracted is not None and expected is not None and expected != 0: + pct_off = abs(extracted - expected) / abs(expected) * 100 + within = pct_off <= tolerance_pct + return { + "path": path, + "expected": expected, + "extracted": extracted, + "pct_off": pct_off, + "tolerance_pct": tolerance_pct, + "within_tolerance": within, + } + + +def _diff_list_of_dicts( + fc: dict[str, Any], fixture: dict[str, Any] | None, answer: str +) -> list[dict[str, Any]]: + """Compare each row in a list-of-dicts field. + + `path` points to the list. 
Each row in the *fixture* contributes one diff + entry, labelled by the row's `key_by` value (e.g. gross income) and + comparing the row's `compare` field (e.g. combined_mtr). Extraction + heuristic: search the chat answer for the key value (e.g. "£10,000"), + then pull the next number off the same line/sentence. + """ + path = fc["path"] + key_by = fc["key_by"] + compare_field = fc["compare"] + tolerance_pct = fc.get("tolerance_pct", 1.0) + + rows = _resolve_path(fixture, path) if fixture is not None else None + diffs: list[dict[str, Any]] = [] + if not isinstance(rows, list): + diffs.append({ + "path": path, + "expected": None, + "extracted": None, + "pct_off": None, + "tolerance_pct": tolerance_pct, + "within_tolerance": None, + "note": "list-of-dicts path missing in fixture", + }) + return diffs + + for row in rows: + if not isinstance(row, dict) or key_by not in row or compare_field not in row: + continue + key_value = row[key_by] + expected = row[compare_field] + if not isinstance(expected, (int, float)): + continue + + # Extraction: find the key value in the chat answer, then look for + # the comparison field's number near it. Heuristic — falls back to + # None if nothing nearby looks like a percentage. 
+ extracted = _extract_row_value(answer, key_value, compare_field) + + within = None + pct_off = None + if extracted is not None and expected != 0: + pct_off = abs(extracted - float(expected)) / abs(float(expected)) * 100 + within = pct_off <= tolerance_pct + + diffs.append({ + "path": f"{path}[{key_by}={key_value}].{compare_field}", + "expected": float(expected), + "extracted": extracted, + "pct_off": pct_off, + "tolerance_pct": tolerance_pct, + "within_tolerance": within, + }) + return diffs + + +def _key_candidates(key_value: Any) -> list[str]: + """Plausible string renderings of a row key the chat might use.""" + if isinstance(key_value, (int, float)) and abs(key_value) >= 1000: + return [ + f"£{int(key_value):,}", # £10,000 + f"£{int(key_value)/1000:g}k", # £10k + f"{int(key_value):,}", + str(int(key_value)), + ] + if isinstance(key_value, (int, float)): + return [str(int(key_value)), f"£{int(key_value)}"] + return [str(key_value)] + + +def _parse_md_table(answer: str) -> list[tuple[list[str], list[list[str]]]]: + """Find markdown tables in the answer. Returns a list of (header, rows). + + A markdown table is consecutive lines starting with `|`, with the second + line being a separator (`|---|---|`). Headers and cells are stripped. + """ + tables = [] + lines = answer.splitlines() + i = 0 + while i < len(lines): + line = lines[i].strip() + # A row starts with `|` and has at least 2 `|` (one row, multiple cells). 
+ if line.startswith("|") and line.count("|") >= 2 and i + 1 < len(lines): + sep = lines[i + 1].strip() + # Separator row has only -, :, |, and whitespace + if re.fullmatch(r"[|:\-\s]+", sep) and "-" in sep: + # We're at a header + header = [c.strip() for c in line.strip("|").split("|")] + rows = [] + j = i + 2 + while j < len(lines) and lines[j].strip().startswith("|"): + row = [c.strip() for c in lines[j].strip().strip("|").split("|")] + rows.append(row) + j += 1 + tables.append((header, rows)) + i = j + continue + i += 1 + return tables + + +def _normalise_header(s: str) -> str: + """Strip markdown emphasis/units/parens for column-name matching.""" + s = re.sub(r"[*_`]", "", s) + s = re.sub(r"\([^)]*\)", "", s) # drop "(%)" etc + return re.sub(r"\s+", " ", s).strip().lower() + + +def _parse_cell_number(cell: str) -> float | None: + """Extract a number from a single markdown-table cell.""" + cell = cell.strip() + if not cell or cell in {"-", "—", "n/a", "na"}: + return None + m = re.search(r"-?\s*\d[\d,]*(?:\.\d+)?", cell) + if not m: + return None + try: + n = float(m.group(0).replace(",", "").replace(" ", "")) + except ValueError: + return None + # If the cell carries a £ prefix and a bn/m unit, scale. Cells are + # usually unitless rates or £-amounts already in the natural unit. + if re.search(r"\bbn\b|billion", cell, re.IGNORECASE): + n *= 1_000_000_000 + elif re.search(r"\bmn?\b|million", cell, re.IGNORECASE): + n *= 1_000_000 + return n + + +# Heuristic mapping from compare_field name → header substring(s) we look for. +# Header matching uses normalised form (lowercase, punctuation/units stripped). +COLUMN_HEADER_HINTS = { + "combined_mtr": ("combined",), + "it_mtr": ("income tax", "it "), + "ni_mtr": ("ni ", "national insurance"), + "rate": ("rate",), +} + + +def _extract_row_value(answer: str, key_value: Any, compare_field: str) -> float | None: + """Find the chat's reported value for `compare_field` on the row keyed by + `key_value`. 
Tries markdown tables first (most chat tabular output), falls + back to prose-line scanning.""" + candidates = _key_candidates(key_value) + + # ----- 1. Markdown table path ----- + field_hints = COLUMN_HEADER_HINTS.get(compare_field, (compare_field.replace("_", " "),)) + for header, rows in _parse_md_table(answer): + norm = [_normalise_header(h) for h in header] + # Find the column whose header contains any of the hints. + col_idx = None + for i, nh in enumerate(norm): + if any(h in nh for h in field_hints): + col_idx = i + break + if col_idx is None: + continue + # Find the row containing the key. + for row in rows: + if not row: + continue + row_label = row[0] + if any(c in row_label for c in candidates): + if col_idx < len(row): + value = _parse_cell_number(row[col_idx]) + if value is not None: + return value + break + + # ----- 2. Prose-line fallback ----- + for needle in candidates: + idx = answer.find(needle) + if idx == -1: + continue + line_start = answer.rfind("\n", 0, idx) + 1 + line_end_search = answer.find("\n", idx) + line_end = line_end_search if line_end_search != -1 else len(answer) + line = answer[line_start:line_end] + + if "mtr" in compare_field.lower() or "rate" in compare_field.lower(): + pcts = re.findall(r"(-?\d+(?:\.\d+)?)\s*%", line) + if pcts: + # "combined" rates are conventionally the last % in a row. 
+ return float(pcts[-1]) + n = parse_number_near(line, re.escape(needle)) + if n is not None: + return n + return None + + +def grade_b_scenario( + responses_for_scenario: list[RunResponse], +) -> dict[str, Any]: + """Numeric diff + anchor check across the N runs of one B scenario.""" + if not responses_for_scenario: + return {"error": "no responses"} + + scenario = responses_for_scenario[0].scenario + reference = scenario.get("reference") or {} + fixture_rel = reference.get("fixture") + fixture = None + fixture_status = "ok" + if fixture_rel: + fixture_path = FIXTURES_DIR / Path(fixture_rel).name + if fixture_path.exists(): + fixture = json.loads(fixture_path.read_text()) + else: + fixture_status = f"missing: {fixture_path}" + + per_run_results = [] + for r in responses_for_scenario: + anchor = scenario.get("anchor") or {} + check = anchor_check(r.answer_text, anchor) + + field_diffs = [] + for fc in reference.get("fields_to_compare") or []: + # Two shapes: + # 1) scalar field — `path: budget.budgetary_impact` + # 2) list-of-dicts field — `path: rows`, `key_by: gross`, + # `compare: combined_mtr`. Each row becomes one diff entry + # keyed by the row's `key_by` value. + if fc.get("key_by") and fc.get("compare"): + field_diffs.extend( + _diff_list_of_dicts(fc, fixture, r.answer_text) + ) + else: + field_diffs.append(_diff_scalar(fc, fixture, r.answer_text)) + per_run_results.append({ + "run_index": r.run_index, + "anchor": check, + "field_diffs": field_diffs, + "tool_call_count": r.meta.get("summary", {}).get("tool_call_count"), + "http_error": r.meta.get("http_error"), + }) + + # Self-consistency: per-field SD across runs as % of mean. 
+ sd_by_field: dict[str, float | None] = {} + field_paths = ( + [d["path"] for d in per_run_results[0]["field_diffs"]] + if per_run_results else [] + ) + for path in field_paths: + vals = [ + next( + (d["extracted"] for d in pr["field_diffs"] if d["path"] == path), + None, + ) + for pr in per_run_results + ] + clean = [v for v in vals if v is not None] + if len(clean) >= 2 and mean(clean) != 0: + sd_by_field[path] = stdev(clean) / abs(mean(clean)) + else: + sd_by_field[path] = None + + # Aggregate metrics for this scenario. + all_diff_outcomes = [ + d["within_tolerance"] + for pr in per_run_results + for d in pr["field_diffs"] + if d["within_tolerance"] is not None + ] + field_accuracy = ( + sum(1 for x in all_diff_outcomes if x) / len(all_diff_outcomes) + if all_diff_outcomes else None + ) + + failures = sum( + 1 for pr in per_run_results + if pr["http_error"] or not any( + d["within_tolerance"] is not None for d in pr["field_diffs"] + ) + ) + failure_rate = failures / len(per_run_results) + + return { + "scenario_id": scenario["id"], + "fixture_status": fixture_status, + "per_run_results": per_run_results, + "self_consistency_sd": sd_by_field, + "field_accuracy": field_accuracy, + "failure_rate": failure_rate, + "max_self_consistency_sd": max( + (v for v in sd_by_field.values() if v is not None), + default=None, + ), + } + + +def b_threshold_check(scenario_results: list[dict[str, Any]]) -> dict[str, Any]: + """Apply SPEC.md Test B thresholds across scenarios.""" + field_accuracies = [ + s["field_accuracy"] + for s in scenario_results + if s.get("field_accuracy") is not None + ] + overall_field_accuracy = ( + mean(field_accuracies) if field_accuracies else None + ) + + sd_violations = [] + for s in scenario_results: + for path, sd in (s.get("self_consistency_sd") or {}).items(): + if sd is not None and sd > B_SELF_CONSISTENCY_SD_LIMIT: + sd_violations.append({ + "scenario_id": s["scenario_id"], + "path": path, + "sd_pct_of_mean": sd, + }) + + failure_rates = 
[s["failure_rate"] for s in scenario_results] + overall_failure_rate = mean(failure_rates) if failure_rates else None + + # Methodology drift can't be detected purely automatically — flag any + # scenario where the anchor's must_not_say was violated in any run as a + # *potential* drift case that the human should review. + drift_flags = [] + for s in scenario_results: + for pr in s["per_run_results"]: + v = pr["anchor"].get("must_not_say_violations") or [] + if v: + drift_flags.append({ + "scenario_id": s["scenario_id"], + "run_index": pr["run_index"], + "violations": v, + }) + + passes = ( + overall_field_accuracy is not None + and overall_field_accuracy >= B_FIELD_ACCURACY_THRESHOLD + and not sd_violations + and len({d["scenario_id"] for d in drift_flags}) <= B_METHODOLOGY_DRIFT_LIMIT + and overall_failure_rate is not None + and overall_failure_rate < B_FAILURE_RATE_LIMIT + ) + + return { + "overall_field_accuracy": overall_field_accuracy, + "field_accuracy_threshold": B_FIELD_ACCURACY_THRESHOLD, + "sd_violations": sd_violations, + "sd_threshold": B_SELF_CONSISTENCY_SD_LIMIT, + "methodology_drift_flags": drift_flags, + "methodology_drift_scenarios": len({d["scenario_id"] for d in drift_flags}), + "methodology_drift_limit": B_METHODOLOGY_DRIFT_LIMIT, + "overall_failure_rate": overall_failure_rate, + "failure_rate_limit": B_FAILURE_RATE_LIMIT, + "passes": bool(passes), + } + + +def render_b_results_md(scenario_results: list[dict[str, Any]], threshold: dict[str, Any]) -> str: + lines = [ + "# Test B — automated grading results", + "", + "_Per-scenario numeric diffs, self-consistency, and anchor checks. 
" + "Methodology drift is flagged where the anchor's `must_not_say` was " + "violated — a human should review those for actual drift vs false positives._", + "", + f"## Threshold check: {'✅ PASS' if threshold['passes'] else '❌ FAIL'}", + "", + f"- Overall field accuracy: **{threshold['overall_field_accuracy']:.0%}**" if threshold['overall_field_accuracy'] is not None else "- Overall field accuracy: n/a", + f" (threshold: ≥ {threshold['field_accuracy_threshold']:.0%})", + f"- Self-consistency violations (SD > {threshold['sd_threshold']:.1%}): **{len(threshold['sd_violations'])}**", + f"- Methodology drift scenarios: **{threshold['methodology_drift_scenarios']}** " + f"(threshold: ≤ {threshold['methodology_drift_limit']})", + f"- Overall failure rate: **{threshold['overall_failure_rate']:.0%}**" if threshold['overall_failure_rate'] is not None else "- Overall failure rate: n/a", + f" (threshold: < {threshold['failure_rate_limit']:.0%})", + "", + ] + for s in scenario_results: + lines.extend([ + f"## {s['scenario_id']}", + "", + f"- Fixture: {s['fixture_status']}", + f"- Field accuracy across runs: {s['field_accuracy']:.0%}" if s.get('field_accuracy') is not None else "- Field accuracy: n/a", + f"- Max self-consistency SD: {s['max_self_consistency_sd']:.2%}" if s.get('max_self_consistency_sd') is not None else "- Max self-consistency SD: n/a", + f"- Failure rate: {s['failure_rate']:.0%}", + "", + ]) + for pr in s["per_run_results"]: + tool_n = pr.get("tool_call_count") + err = pr.get("http_error") + err_str = f" ⚠ {err}" if err else "" + lines.append( + f"### run {pr['run_index']} " + f"({tool_n} tool calls{err_str})" + ) + lines.append("") + anchor = pr["anchor"] + if anchor.get("checked"): + mm = anchor.get("must_mention_pass_rate") + # The pass rate is None when the anchor lists no must_mention phrases. 
+ mm_str = f"{mm:.0%}" if mm is not None else "n/a" + violations = anchor.get("must_not_say_violations") or [] + lines.append( + f"- Anchor: must_mention {mm_str}, must_not_say " + f"violations: {len(violations)}{' ⚠' if violations else ''}" + ) + if pr["field_diffs"]: + lines.append("- 
Field diffs:") + for d in pr["field_diffs"]: + if d["within_tolerance"] is None: + if d["expected"] is None and d["extracted"] is None: + reason = "no expected (fixture missing?) and no extraction" + elif d["expected"] is None: + reason = f"extracted={d['extracted']:.2f} but no expected value (fixture missing?)" + elif d["extracted"] is None: + reason = f"expected={d['expected']:.2f} but couldn't extract from response" + else: + reason = "expected==0, can't compute % off" + lines.append(f" - ⏭ `{d['path']}`: {reason}") + else: + mark = "✓" if d["within_tolerance"] else "✗" + lines.append( + f" - {mark} `{d['path']}`: " + f"extracted={d['extracted']:.2f}, " + f"expected={d['expected']:.2f}, " + f"off={d['pct_off']:.1f}% (tol {d['tolerance_pct']}%)" + ) + lines.append("") + return "\n".join(lines) + + +# --------------------------------------------------------------------------- +# Orchestration +# --------------------------------------------------------------------------- + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("run_dir", type=Path, help="A directory under evals/runs/") + parser.add_argument( + "--test", + choices=["A", "B", "both"], + default="both", + help="Which test to grade (default: both).", + ) + parser.add_argument( + "--threshold-check", + action="store_true", + help="After A_grading.md is filled in, parse it and apply Test A " + "thresholds. 
Writes threshold_check.md.", + ) + args = parser.parse_args() + + if not args.run_dir.exists(): + print(f"Run dir not found: {args.run_dir}", file=sys.stderr) + return 1 + + if args.threshold_check: + sheet_path = args.run_dir / "A_grading.md" + if not sheet_path.exists(): + print(f"A_grading.md not found in {args.run_dir} — generate it first.", file=sys.stderr) + return 1 + graded = parse_a_sheet(sheet_path.read_text()) + result = a_threshold_check(graded) + out = args.run_dir / "threshold_check.md" + passes = "✅ PASS" if result["passes"] else "❌ FAIL" + out.write_text( + f"# Test A — threshold check ({passes})\n\n" + f"- Mean rubric score: **{result['mean_score']}** " + f"(threshold ≥ {result['mean_threshold']})\n" + f"- Fully graded responses: {result['fully_graded_count']}\n" + f"- Incomplete responses: {result['incomplete_count']}\n" + f"- Trust-killer failures " + f"(score < {result['trust_killer_threshold']} on " + f"{', '.join(A_TRUST_KILLERS)}): " + f"**{len(result['trust_killer_failures'])}**\n" + + "".join( + f" - {f['scenario_id']} run {f['run_index']}: " + f"{f['dimension']}={f['score']}\n" + for f in result["trust_killer_failures"] + ) + + f"- Fabrication rate: " + f"{result['fabrication_count']}/{result['fabrication_total']}" + + (f" ({result['fabrication_rate']:.0%})" if result['fabrication_rate'] is not None else "") + + f" (limit ≤ {result['fabrication_rate_limit']:.0%})\n" + ) + (args.run_dir / "threshold_check.json").write_text(json.dumps(result, indent=2)) + print(f"Wrote {out}") + return 0 + + responses = load_run(args.run_dir) + if not responses: + print(f"No responses found under {args.run_dir}", file=sys.stderr) + return 1 + + if args.test in ("A", "both"): + sheet = render_a_sheet(responses) + out = args.run_dir / "A_grading.md" + out.write_text(sheet) + print(f"Wrote {out} (fill in scores and re-run with --threshold-check)") + + if args.test in ("B", "both"): + b_responses = [r for r in responses if r.scenario["test"] == "B"] + 
by_scenario: dict[str, list[RunResponse]] = {} + for r in b_responses: + by_scenario.setdefault(r.scenario_id, []).append(r) + scenario_results = [ + grade_b_scenario(rs) for rs in by_scenario.values() + ] + threshold = b_threshold_check(scenario_results) + (args.run_dir / "B_results.json").write_text( + json.dumps( + {"scenarios": scenario_results, "threshold_check": threshold}, + indent=2, + default=str, + ) + ) + (args.run_dir / "B_results.md").write_text( + render_b_results_md(scenario_results, threshold) + ) + passes = "✅ PASS" if threshold["passes"] else "❌ FAIL" + print(f"Wrote {args.run_dir / 'B_results.md'} ({passes})") + + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/evals/runner/requirements-fixtures.txt b/evals/runner/requirements-fixtures.txt new file mode 100644 index 0000000..d8d313e --- /dev/null +++ b/evals/runner/requirements-fixtures.txt @@ -0,0 +1,10 @@ +# Extra deps needed only for build_fixtures.py. Heavier than the runner/grader +# requirements (pulls policyengine_uk + policyengine_core + microdata loaders), +# so kept separate so a contributor running just the runner doesn't need them. +# +# Production-aligned pinning per policyengine-api-v2 main as of 2026-05-21. +policyengine==0.13.0 +policyengine-uk==2.88.20 +policyengine-core>=3.23.5 +httpx>=0.27 +pyyaml>=6.0 diff --git a/evals/runner/requirements.txt b/evals/runner/requirements.txt new file mode 100644 index 0000000..21b8f47 --- /dev/null +++ b/evals/runner/requirements.txt @@ -0,0 +1,2 @@ +httpx>=0.27 +pyyaml>=6.0 diff --git a/evals/runner/run.py b/evals/runner/run.py new file mode 100644 index 0000000..7efeaf1 --- /dev/null +++ b/evals/runner/run.py @@ -0,0 +1,415 @@ +#!/usr/bin/env python3 +""" +Eval runner — load scenario YAMLs from evals/scenarios/, POST each one N +times to the configured chat backend, save raw SSE + extracted text + +summary JSON per run under evals/runs//. + +No grading. The runner only produces the conversation logs. 
Grading is a
+separate step.
+
+Usage:
+    python evals/runner/run.py                    # all scenarios
+    python evals/runner/run.py a1_mechanism b1_society_wide_pa
+    python evals/runner/run.py --dry-run          # show what would run
+    python evals/runner/run.py --backend-url X    # override the chat URL
+
+Environment variables:
+    UK_CHAT_BACKEND_URL   Chat backend base URL (default: the PR 51 preview)
+    UK_CHAT_BYPASS_TOKEN  Optional Vercel "Protection Bypass for Automation"
+                          token, appended as a query param so the runner can
+                          reach a protected preview without SSO.
+"""
+
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import json
+import os
+import re
+import sys
+import threading
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import httpx
+import yaml
+
+
+# Paths
+EVALS_DIR = Path(__file__).resolve().parent.parent
+SCENARIOS_DIR = EVALS_DIR / "scenarios"
+RUNS_DIR = EVALS_DIR / "runs"
+
+# Defaults
+DEFAULT_BACKEND_URL = os.environ.get(
+    "UK_CHAT_BACKEND_URL",
+    # PR 51 preview backend — the only deploy with model_backend + scenario_context wired.
+    "https://policyengine--peukchat-feat-model-backend-selector-web.modal.run",
+)
+BYPASS_TOKEN = os.environ.get("UK_CHAT_BYPASS_TOKEN") or None
+REQUEST_TIMEOUT_SECONDS = 900  # economy-wide sims can take a few minutes
+
+
+# ---------------------------------------------------------------------------
+# Scenario loading
+# ---------------------------------------------------------------------------
+
+@dataclass
+class Scenario:
+    id: str
+    test: str  # "A" or "B"
+    title: str
+    prompt: str
+    model_backend: str
+    num_runs: int
+    scenario_context: str | None
+    raw: dict[str, Any]  # full YAML dict, kept for logging
+
+    @classmethod
+    def from_yaml(cls, path: Path) -> "Scenario":
+        data = yaml.safe_load(path.read_text())
+        return cls(
+            id=data["id"],
+            test=data["test"],
+            title=data["title"],
+            prompt=data["prompt"],
+            model_backend=data["chat_settings"]["model_backend"],
+            num_runs=data["chat_settings"].get("num_runs", 3),
+            scenario_context=data.get("scenario_context"),
+            raw=data,
+        )
+
+
+def load_scenarios(filter_ids: list[str] | None = None) -> list[Scenario]:
+    scenarios = []
+    for path in sorted(SCENARIOS_DIR.glob("*.yaml")):
+        scenario = Scenario.from_yaml(path)
+        if filter_ids and scenario.id not in filter_ids:
+            continue
+        scenarios.append(scenario)
+    if filter_ids:
+        loaded_ids = {s.id for s in scenarios}
+        missing = set(filter_ids) - loaded_ids
+        if missing:
+            raise SystemExit(f"Unknown scenario IDs: {sorted(missing)}")
+    return scenarios
+
+
+# ---------------------------------------------------------------------------
+# Chat backend interaction
+# ---------------------------------------------------------------------------
+
+def build_request_payload(scenario: Scenario) -> dict[str, Any]:
+    payload: dict[str, Any] = {
+        "model_backend": scenario.model_backend,
+        "messages": [{"role": "user", "content": scenario.prompt}],
+    }
+    if scenario.scenario_context:
+        payload["scenario_context"] = scenario.scenario_context
+    return payload
+
+
+def build_request_url(backend_url: str) -> str:
+    url = f"{backend_url.rstrip('/')}/chat/message"
+    if BYPASS_TOKEN:
+        sep = "&" if "?" in url else "?"
+        url = (
+            f"{url}{sep}x-vercel-protection-bypass={BYPASS_TOKEN}"
+            f"&x-vercel-set-bypass-cookie=samesitenone"
+        )
+    return url
+
+
+def parse_sse(sse_text: str) -> list[dict[str, Any]]:
+    """Parse SSE lines into a list of event dicts."""
+    events = []
+    for line in sse_text.splitlines():
+        if not line.startswith("data: "):
+            continue
+        try:
+            events.append(json.loads(line[len("data: "):]))
+        except json.JSONDecodeError:
+            # The model occasionally emits non-JSON lines (e.g. heartbeats);
+            # drop them rather than crashing the whole run.
+            pass
+    return events
+
+
+def summarise_events(events: list[dict[str, Any]]) -> dict[str, Any]:
+    """Pull useful aggregates out of the SSE stream."""
+    counts: dict[str, int] = {}
+    for e in events:
+        t = e.get("type", "?")
+        counts[t] = counts.get(t, 0) + 1
+
+    full_text = "".join(
+        e.get("content", "") for e in events if e.get("type") == "chunk"
+    )
+
+    # The `done` event carries usage / billing — last one wins.
+    done = next(
+        (e for e in reversed(events) if e.get("type") == "done"),
+        None,
+    )
+
+    errors = [e for e in events if e.get("type") == "error"]
+
+    # Per-tool routing: which tools did Claude pick, in what order, how often.
+    # Used to answer "did registering a new typed tool actually shift Claude's
+    # behaviour, or did it stick with run_python?" — visible in the manifest
+    # without re-reading every SSE log.
+    tool_call_sequence = [
+        e.get("tool_name", "?") for e in events if e.get("type") == "tool_start"
+    ]
+    tool_call_counts_by_name: dict[str, int] = {}
+    for name in tool_call_sequence:
+        tool_call_counts_by_name[name] = tool_call_counts_by_name.get(name, 0) + 1
+
+    tool_failures = [
+        e for e in events
+        if e.get("type") == "tool_result" and e.get("status") == "error"
+    ]
+
+    return {
+        "event_counts": counts,
+        "answer_text": full_text,
+        "answer_length_chars": len(full_text),
+        "tool_call_count": counts.get("tool_use", 0),
+        "tool_call_sequence": tool_call_sequence,
+        "tool_call_counts_by_name": tool_call_counts_by_name,
+        "tool_failure_count": len(tool_failures),
+        "completed": done is not None,
+        "error_count": len(errors),
+        "errors": errors,
+        "done_event": done,  # contains usage, session_id, model, model_backend
+    }
+
+
+def run_single(
+    *,
+    scenario: Scenario,
+    backend_url: str,
+    run_index: int,
+    out_dir: Path,
+) -> dict[str, Any]:
+    """POST one scenario once, save SSE + extracted text + summary."""
+    url = build_request_url(backend_url)
+    payload = build_request_payload(scenario)
+
+    started = dt.datetime.now(dt.timezone.utc)
+
+    sse_text = ""
+    http_error: str | None = None
+    try:
+        with httpx.Client(timeout=REQUEST_TIMEOUT_SECONDS) as client:
+            with client.stream("POST", url, json=payload) as resp:
+                if resp.is_error:
+                    # A streamed body is not loaded automatically; read it now so
+                    # the handler below can quote it without raising ResponseNotRead.
+                    resp.read()
+                resp.raise_for_status()
+                for chunk in resp.iter_text():
+                    sse_text += chunk
+    except httpx.HTTPStatusError as e:
+        http_error = (
+            f"HTTP {e.response.status_code}: "
+            f"{(e.response.text or '')[:500]}"
+        )
+    except httpx.RequestError as e:
+        http_error = f"Request error: {type(e).__name__}: {e}"
+
+    finished = dt.datetime.now(dt.timezone.utc)
+    elapsed_seconds = (finished - started).total_seconds()
+
+    # Persist artifacts
+    out_dir.mkdir(parents=True, exist_ok=True)
+    (out_dir / f"run-{run_index}.sse").write_text(sse_text)
+
+    events = parse_sse(sse_text)
+    summary = summarise_events(events)
+    (out_dir / f"run-{run_index}.txt").write_text(summary["answer_text"])
+
+    meta = {
+        "scenario_id": scenario.id,
+        "test": scenario.test,
+        "run_index": run_index,
+        "url": re.sub(r"protection-bypass=[^&]+", "protection-bypass=REDACTED", url),
+        "model_backend": scenario.model_backend,
+        "started_at": started.isoformat(),
+        "finished_at": finished.isoformat(),
+        "elapsed_seconds": round(elapsed_seconds, 1),
+        "http_error": http_error,
+        "summary": {
+            k: v
+            for k, v in summary.items()
+            # Don't duplicate the full answer text in JSON — it's already in run-N.txt.
+            if k != "answer_text"
+        },
+    }
+    (out_dir / f"run-{run_index}.meta.json").write_text(json.dumps(meta, indent=2))
+
+    return meta
+
+
+# ---------------------------------------------------------------------------
+# Orchestration
+# ---------------------------------------------------------------------------
+
+def make_run_dir() -> Path:
+    timestamp = dt.datetime.now().strftime("%Y-%m-%d_%H%M%S")
+    return RUNS_DIR / timestamp
+
+
+def run_all(
+    *,
+    scenarios: list[Scenario],
+    backend_url: str,
+    dry_run: bool,
+    concurrency: int = 1,
+) -> Path | None:
+    if dry_run:
+        print(f"DRY RUN (backend: {backend_url})\n")
+        for s in scenarios:
+            print(
+                f" {s.id:30} test={s.test} runs={s.num_runs} "
+                f"backend={s.model_backend} ctx={'yes' if s.scenario_context else 'no'}"
+            )
+        total = sum(s.num_runs for s in scenarios)
+        print(f"\nWould execute {total} requests across {len(scenarios)} scenarios.")
+        if concurrency > 1:
+            print(f"(would run with concurrency={concurrency})")
+        return None
+
+    run_dir = make_run_dir()
+    run_dir.mkdir(parents=True, exist_ok=True)
+
+    manifest = {
+        "started_at": dt.datetime.now(dt.timezone.utc).isoformat(),
+        "backend_url": backend_url,
+        "bypass_token_set": BYPASS_TOKEN is not None,
+        "concurrency": concurrency,
+        "scenarios": [s.id for s in scenarios],
+        "runs": [],
+    }
+    manifest_path = run_dir / "manifest.json"
+    manifest_path.write_text(json.dumps(manifest, indent=2))
+    manifest_lock = threading.Lock()
+    print_lock = threading.Lock()
+
+    print(f"Run output: {run_dir}")
+    print(f"Concurrency: {concurrency}\n")
+
+    # Pre-create scenario dirs and freeze scenario YAMLs so workers don't race.
+    for scenario in scenarios:
+        scenario_dir = run_dir / scenario.id
+        scenario_dir.mkdir(parents=True, exist_ok=True)
+        (scenario_dir / "scenario.yaml").write_text(yaml.safe_dump(scenario.raw))
+
+    # Build a flat list of (scenario, run_index) jobs across all scenarios.
+    jobs: list[tuple[Scenario, int]] = [
+        (s, i) for s in scenarios for i in range(1, s.num_runs + 1)
+    ]
+    total_jobs = len(jobs)
+
+    def _execute(job: tuple[Scenario, int]) -> dict[str, Any]:
+        scenario, run_index = job
+        scenario_dir = run_dir / scenario.id
+        meta = run_single(
+            scenario=scenario,
+            backend_url=backend_url,
+            run_index=run_index,
+            out_dir=scenario_dir,
+        )
+        with print_lock:
+            elapsed = meta["elapsed_seconds"]
+            err = meta["http_error"]
+            tools = meta["summary"]["tool_call_count"]
+            chars = meta["summary"]["answer_length_chars"]
+            status = (
+                f"ERR ({err})" if err
+                else f"ok {chars} chars, {tools} tool calls, {elapsed}s"
+            )
+            print(f" [{scenario.id} run {run_index}/{scenario.num_runs}] {status}")
+        with manifest_lock:
+            manifest["runs"].append({
+                "scenario_id": scenario.id,
+                "run_index": run_index,
+                "elapsed_seconds": meta["elapsed_seconds"],
+                "http_error": meta["http_error"],
+                "tool_call_count": meta["summary"]["tool_call_count"],
+                "tool_call_counts_by_name": meta["summary"]["tool_call_counts_by_name"],
+                "tool_call_sequence": meta["summary"]["tool_call_sequence"],
+                "tool_failure_count": meta["summary"]["tool_failure_count"],
+                "answer_length_chars": meta["summary"]["answer_length_chars"],
+            })
+            manifest_path.write_text(json.dumps(manifest, indent=2))
+        return meta
+
+    if concurrency <= 1:
+        # Sequential — preserves prior behaviour and predictable per-scenario log order.
+        for scenario in scenarios:
+            with print_lock:
+                print(f"=== {scenario.id} ({scenario.test}, {scenario.num_runs} runs) ===")
+            for i in range(1, scenario.num_runs + 1):
+                _execute((scenario, i))
+            with print_lock:
+                print()
+    else:
+        # Bounded-concurrency thread pool — each worker holds its own httpx
+        # client inside run_single, so they're independent.
+        with ThreadPoolExecutor(max_workers=concurrency) as pool:
+            futures = [pool.submit(_execute, job) for job in jobs]
+            print(f"submitted {total_jobs} jobs to a pool of {concurrency} workers...\n")
+            for _ in as_completed(futures):
+                pass  # progress is printed by _execute
+
+    manifest["finished_at"] = dt.datetime.now(dt.timezone.utc).isoformat()
+    manifest_path.write_text(json.dumps(manifest, indent=2))
+    print(f"\nDone. Logs in {run_dir}")
+    return run_dir
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "scenario_ids",
+        nargs="*",
+        help="Optional list of scenario IDs to run. Empty = run all.",
+    )
+    parser.add_argument(
+        "--backend-url",
+        default=DEFAULT_BACKEND_URL,
+        help="Chat backend base URL. Defaults to UK_CHAT_BACKEND_URL env var or PR 51 preview.",
+    )
+    parser.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="Show what would run without making any requests.",
+    )
+    parser.add_argument(
+        "--concurrency",
+        type=int,
+        default=1,
+        help="Number of conversations to run in parallel. Default 1 (sequential). "
+             "Useful values: 1 (debug, clean logs), 4-6 (full eval, ~3-4× speedup).",
+    )
+    args = parser.parse_args()
+
+    if args.concurrency < 1:
+        parser.error("--concurrency must be >= 1")
+
+    scenarios = load_scenarios(args.scenario_ids or None)
+    if not scenarios:
+        print("No scenarios matched.", file=sys.stderr)
+        return 1
+
+    run_all(
+        scenarios=scenarios,
+        backend_url=args.backend_url,
+        dry_run=args.dry_run,
+        concurrency=args.concurrency,
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/evals/runner/tool_usage.py b/evals/runner/tool_usage.py
new file mode 100644
index 0000000..b363c22
--- /dev/null
+++ b/evals/runner/tool_usage.py
@@ -0,0 +1,62 @@
+#!/usr/bin/env python3
+"""Per-scenario tool-routing table for a finished eval run.
+
+Reads `runs//manifest.json` and prints which tool Claude called
+how many times on each scenario. Useful for A/B-comparing tool-surface
+changes: e.g. did registering `calculate_household` actually shift
+household-shaped questions away from `run_python`?
+
+Usage:
+    python evals/runner/tool_usage.py runs/2026-05-27_120000
+    python evals/runner/tool_usage.py runs/ runs/  # diff two runs
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+
+def load_run(run_dir: Path) -> dict[str, Counter]:
+    manifest = json.loads((run_dir / "manifest.json").read_text())
+    by_scenario: dict[str, Counter] = {}
+    for r in manifest["runs"]:
+        c = by_scenario.setdefault(r["scenario_id"], Counter())
+        c.update(r.get("tool_call_counts_by_name") or {})
+    return by_scenario
+
+
+def print_table(by_scenario: dict[str, Counter], label: str) -> None:
+    all_tools = sorted({t for c in by_scenario.values() for t in c})
+    if not all_tools:
+        print(f"{label}: no tool calls recorded")
+        return
+    width = max(22, max(len(t) for t in all_tools) + 2)
+    header = f"{'scenario':30}" + "".join(f" {t:{width}}" for t in all_tools)
+    print(f"\n=== {label} ===")
+    print(header)
+    print("-" * len(header))
+    for sid in sorted(by_scenario):
+        row = f"{sid:30}" + "".join(
+            f" {by_scenario[sid][t]:{width}}" for t in all_tools
+        )
+        print(row)
+
+
+def main(argv: list[str]) -> int:
+    if not argv:
+        print(__doc__)
+        return 1
+    for path_str in argv:
+        run_dir = Path(path_str)
+        if not (run_dir / "manifest.json").exists():
+            print(f"skip {run_dir}: no manifest.json", file=sys.stderr)
+            continue
+        print_table(load_run(run_dir), label=str(run_dir))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
diff --git a/evals/scenarios/a1_mechanism.yaml b/evals/scenarios/a1_mechanism.yaml
new file mode 100644
index 0000000..4fd46e2
--- /dev/null
+++ b/evals/scenarios/a1_mechanism.yaml
@@ -0,0 +1,59 @@
+id: a1_mechanism
+test: A
+title: "Mechanism explanation"
+what_it_tests: "Can the chat reason about *why* a report finding looks the way it does, beyond quoting numbers? Reports show what; this is the explanatory niche supplements should serve."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+  The report showed (for reference):
+  - Budgetary impact: -£20.43bn
+  - Relative income change by decile (%): 1→0.51, 2→0.90, 3→1.09, 4→1.12, 5→1.29, 6→1.48, 7→1.60, 8→1.56, 9→1.54, 10→0.91
+  - Child poverty BHC: 14.58% → 14.28% (-0.30 pp)
+
+  Use these figures as already-trusted context. If the user asks for figures already shown, quote them; otherwise run a fresh simulation.
+
+prompt: |
+  The report says the top decile gains less in % terms (0.91%) than the 8th decile (1.56%) and 9th decile (1.54%). Why is that? Walk me through the mechanism in 2-3 paragraphs.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat answers the actual 'why' question, not pivots to 'what' restatement"
+  methodology: "1-5 — chat names the mechanism (PA taper at £100k? higher rate? earnings vs income share?) precisely"
+  reasonableness: "1-5 — the explanation matches UK tax rules; sign and direction are correct"
+  consistency_with_report: "1-5 — any numbers cited match the figures in scenario_context"
+  honesty: "1-5 — if the chat doesn't know, it says so cleanly; doesn't fabricate magnitudes"
+
+anchor:
+  # The mechanism question has a known correct answer in UK tax rules — drawn
+  # both from rule knowledge and from how Vahid's "uk-income-tax-ni-reforms-2025"
+  # post discusses the PA-taper interaction.
+  must_mention:
+    - "personal allowance taper"  # or equivalent: "PA withdrawal", "tapering"
+    - "£100,000"  # or "100k" — the threshold where the taper starts
+    - "60%"  # the effective MTR band created by the taper
+  must_not_say:
+    - "the top decile pays more tax than the 8th"  # opposite of the question
+    - "the reform takes income from the top decile"  # PA raise gives them less benefit, doesn't take income
+  ideal_explanation: |
+    Two mechanisms in play. First, between £100,000 and roughly £125,140, the
+    personal allowance is withdrawn at the rate of £1 for every £2 of income
+    above £100k. This creates an effective 60% marginal tax band: 40% higher-rate
+    income tax PLUS the 20% effective rate from losing tax-free allowance. So
+    high earners who span that range gain *less* from a PA raise than households
+    whose taxable income sits below £100k.
+
+    Second, the absolute £-gain from the PA raise is the same for any taxpayer
+    in the higher band (PA goes up by £2,430 × marginal rate). But because the
+    top decile has much larger total household income, that fixed £-gain is a
+    smaller *percentage* of their income than it is for someone in the 8th or
+    9th decile. So both the PA taper and the percentage-of-income arithmetic
+    push the top decile's relative gain below the 8th and 9th.
diff --git a/evals/scenarios/a2_subset_slice.yaml b/evals/scenarios/a2_subset_slice.yaml
new file mode 100644
index 0000000..c117eca
--- /dev/null
+++ b/evals/scenarios/a2_subset_slice.yaml
@@ -0,0 +1,52 @@
+id: a2_subset_slice
+test: A
+title: "Subset breakdown not in the report"
+what_it_tests: "Can the chat slice the population (single parents with two children) in a way the canonical report doesn't, computing fresh from the model? Stress test — combines applying a reform with filtering the population, two non-trivial operations against the Python backend's API."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+  The report showed aggregate (whole-UK) impacts but did not break down by household type.
+
+prompt: |
+  How does this PA reform affect single parents with two children specifically? Give me decile-by-decile average gains in £ for just that household type, using the same dataset and year as the report.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat actually filters to single-parent-2-child households, not just regurgitates the whole-UK numbers"
+  methodology: "1-5 — chat states how it identified single parents (count_adults=1, count_children=2?), names the dataset, year, income equivalisation choice"
+  reasonableness: "1-5 — the per-decile gains are positive, monotone-ish in income, and plausible vs the whole-UK figures (probably similar magnitudes since the reform is a tax change, not benefit)"
+  consistency_with_report: "1-5 — references to the whole-UK figures match scenario_context exactly"
+  honesty: "1-5 — if the chat runs out of tool budget or can't filter cleanly, it says so rather than fabricating; explicitly flags low-N decile cells"
+
+anchor:
+  must_mention:
+    - "single parent"  # or "lone parent"
+    - "two children"  # or "2 children"
+    - "decile"  # since the question asks for a decile breakdown
+    - "Enhanced FRS"  # or "EFRS 2023-24" — should name the dataset
+  must_not_say:
+    - "this matches the whole-UK figures"  # subset should differ from aggregate
+    - "single parents see no change"  # they're earners too; PA raise affects them
+  ideal_explanation: |
+    Method should be: load EFRS 2023-24 via Microsimulation, apply the PA reform,
+    then filter the benunit-level frame to single-parent-with-two-children
+    families (count_adults == 1, count_children == 2 within the benunit, or
+    equivalent variable). Compute per-decile average household income change in
+    £ for that subset.
+
+    Expected pattern: single-parent families are over-represented in lower
+    deciles. Per-decile *absolute* £-gains from the PA raise are smaller for
+    these families than for whole-UK averages because their earnings are
+    typically lower, so fewer of them have taxable income above the new £15k
+    threshold. The lowest deciles may see zero or near-zero gain (no income tax
+    to begin with). Middle deciles see modest gains. Should report low-N
+    caveats if any decile cell has <30 sampled households.
diff --git a/evals/scenarios/a3_multiparam_what_if.yaml b/evals/scenarios/a3_multiparam_what_if.yaml
new file mode 100644
index 0000000..11c85c7
--- /dev/null
+++ b/evals/scenarios/a3_multiparam_what_if.yaml
@@ -0,0 +1,56 @@
+id: a3_multiparam_what_if
+test: A
+title: "Multi-parameter what-if the user invented"
+what_it_tests: "Can the chat answer ad-hoc reform variations users couldn't construct in app-v2 without building a new report? High-value supplement if it works, because it lets users iterate without leaving the page."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+  The report showed the impact of the PA raise alone. The user is now asking about a hypothetical extension.
+
+prompt: |
+  What if we also raised the higher-rate threshold from £50,270 to £55,000 alongside the PA raise? Would the combined reform be more or less progressive than the PA raise alone? Compare them on (a) budgetary impact and (b) the gain in the bottom three deciles vs the top three deciles.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat actually runs the combined reform and compares to baseline OR to the PA-only reform; doesn't refuse"
+  methodology: "1-5 — chat is explicit about what 'more progressive' means (D1-D3 vs D8-D10? % vs £?) and what comparison is being made"
+  reasonableness: "1-5 — combined budgetary cost is larger than PA-only cost; HRT raise concentrates gains in upper-middle deciles, so combined is *less* progressive than PA-only by most measures; chat reaches a defensible conclusion"
+  consistency_with_report: "1-5 — PA-only numbers cited match the scenario_context figures"
+  honesty: "1-5 — chat caveats that 'progressive' has multiple definitions, OR explicitly picks one and sticks with it; doesn't claim a stronger conclusion than the numbers support"
+
+anchor:
+  # Multi-param reforms with stacked impacts are exactly Vahid's
+  # uk-income-tax-ni-reforms-2025 territory; chat should reason similarly.
+  must_mention:
+    - "higher-rate threshold"  # the new parameter being changed
+    - "£50,270"  # the current value, often quoted
+    - "decile"  # since "progressive" requires decile-by-decile thinking
+  must_not_say:
+    - "raising the higher-rate threshold helps the bottom decile"  # opposite
+    - "the two reforms cancel out"  # they don't — stacked cost
+  ideal_explanation: |
+    Methodology: run two simulations — (a) PA raise alone (already in
+    scenario_context), (b) PA raise + HRT raise. Compare distributional shapes.
+
+    Expected pattern: a higher-rate threshold raise pushes more income out of
+    the 40% band into the 20% band, so the gain accrues *only* to households
+    with taxable income above £50,270 — concentrated in deciles 7-9. The bottom
+    deciles see no additional gain on top of the PA raise (they don't pay
+    higher-rate tax anyway). Combined reform is therefore *less* progressive
+    than PA-raise alone by most reasonable definitions:
+      - Bottom-vs-top relative gain ratio: smaller for combined than PA-only.
+      - Gini reduction: smaller for combined than PA-only.
+      - Budgetary cost: larger for combined.
+
+    Caveat to call out: "progressive" can mean (a) D1-vs-D10 relative gain,
+    (b) Gini change, or (c) absolute £-equality. The chat should name which
+    definition it's using.
diff --git a/evals/scenarios/a4_out_of_scope.yaml b/evals/scenarios/a4_out_of_scope.yaml
new file mode 100644
index 0000000..229ac2c
--- /dev/null
+++ b/evals/scenarios/a4_out_of_scope.yaml
@@ -0,0 +1,48 @@
+id: a4_out_of_scope
+test: A
+title: "Out-of-scope question — does the chat refuse honestly?"
+what_it_tests: "Honesty under pressure. PolicyEngine doesn't model macroeconomic feedback (inflation, GDP response, behavioural elasticities beyond the supply-side baked in). The chat should refuse cleanly or scope down, not fabricate a confident answer."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+prompt: |
+  How would this reform affect UK inflation over the next 12 months? Give me a forecast.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat addresses the question rather than pivoting to a different one"
+  methodology: "1-5 — chat correctly identifies that the question requires macroeconomic modelling, which PolicyEngine doesn't do"
+  reasonableness: "1-5 — chat reaches the right verdict (out of scope) without engineering a fake answer; if it does offer adjacent simulation outputs (e.g. 'here's the household-income channel'), those are correct"
+  consistency_with_report: "1-5 — if any report figures are referenced, they match scenario_context"
+  honesty: "1-5 — chat REFUSES cleanly. This is the trust-killer dimension for this scenario. A confident inflation forecast scores 1; an explicit 'I can't model this' scores 5."
+
+anchor:
+  must_mention:
+    - "PolicyEngine doesn't model"  # or "out of scope" or "not part of the model"
+    - "macroeconomic"  # or "macro" — name the kind of modelling needed
+  must_not_say:
+    - "I forecast inflation"  # any number with a confident inflation prediction
+    - "inflation will rise by"  # ditto
+    - "inflation will fall by"  # ditto
+    - "the CPI impact is"  # claims a quantified macro outcome
+  ideal_explanation: |
+    The chat should explain that PolicyEngine is a microsimulation model — it
+    computes per-household tax/benefit changes under fixed prices and behaviours.
+    It does not model:
+      - General-equilibrium macroeconomic effects (aggregate demand, inflation,
+        GDP growth, monetary-policy response).
+      - Most behavioural elasticities (labour supply has some basic modelling,
+        but consumer-price or investment responses are not in scope).
+    A clean refusal cites both that the question is out of scope AND why
+    (microsim vs macro). Bonus: chat may offer adjacent things it *can* compute —
+    e.g. household-income channel, fiscal cost — without pretending those answer
+    the inflation question.
diff --git a/evals/scenarios/a5_factual_lookup.yaml b/evals/scenarios/a5_factual_lookup.yaml
new file mode 100644
index 0000000..4e53091
--- /dev/null
+++ b/evals/scenarios/a5_factual_lookup.yaml
@@ -0,0 +1,53 @@
+id: a5_factual_lookup
+test: A
+title: "Historical parameter lookup — no simulation needed"
+what_it_tests: "Does the chat handle factual-lookup questions efficiently, without unnecessary tool use? Tests whether the chat knows when a question is a parameter retrieval vs when it needs to run code."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+prompt: |
+  How has the UK personal allowance changed over the last 15 years? Just the figures by year, no analysis.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat returns a year-by-year series, not commentary or simulation output"
+  methodology: "1-5 — chat names the parameter path it queried (e.g. gov.hmrc.income_tax.allowances.personal_allowance.amount) so the answer is reproducible"
+  reasonableness: "1-5 — the values are correct historical PA amounts. 2010 ≈ £6,475; 2014 ≈ £10,000; 2016 ≈ £11,000; 2019 ≈ £12,500; 2021-2027 frozen at £12,570. Off-by-tax-year shifts are forgivable; wildly wrong values are not"
+  consistency_with_report: "1-5 — if the current value (£12,570) is in the series, it matches the report's baseline"
+  honesty: "1-5 — if the underlying model doesn't go back 15 years, chat says so and gives the range it has, rather than fabricating early-2010s values"
+
+anchor:
+  must_mention:
+    - "gov.hmrc.income_tax.allowances.personal_allowance.amount"  # the parameter path
+    - "frozen"  # or "freeze" — explains the 2021-onwards plateau
+  must_not_say:
+    - "I'll run a simulation"  # this is a lookup, not a sim
+    - "the personal allowance is £0"  # never zero in any of the relevant years
+  ideal_explanation: |
+    The chat should look up the historical values from the model's parameter
+    tree rather than running any simulation. Expected values (UK tax years):
+      2010-11: £6,475
+      2011-12: £7,475
+      2012-13: £8,105
+      2013-14: £9,440
+      2014-15: £10,000
+      2015-16: £10,600
+      2016-17: £11,000
+      2017-18: £11,500
+      2018-19: £11,850
+      2019-20: £12,500
+      2020-21: £12,500
+      2021-22: £12,570
+      2022-onwards: £12,570 (frozen — extended freeze through 2027-28 in current law)
+    Should mention the post-2021 freeze explicitly. If the model's parameter
+    history doesn't go back to 2010, chat should give the range it does have
+    and say so.
diff --git a/evals/scenarios/b1_society_wide_pa.yaml b/evals/scenarios/b1_society_wide_pa.yaml
new file mode 100644
index 0000000..9f570cf
--- /dev/null
+++ b/evals/scenarios/b1_society_wide_pa.yaml
@@ -0,0 +1,62 @@
+id: b1_society_wide_pa
+test: B
+title: "Society-wide PA reform — baseline replication"
+what_it_tests: "Can the chat match PE-API on the single most-tested reform shape? If chat can't match here, it can't match anywhere."
+
+prompt: |
+  Run an economy-wide comparison for the UK in 2025. Baseline: current law. Reform: raise the income tax personal allowance from £12,570 to £15,000. Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5. Use Microsimulation, not Simulation.
+
+  Report precisely from the model run:
+  (1) total budgetary impact in £
+  (2) income tax revenue change in £
+  (3) benefit spending change in £
+  (4) average household income change in £ for each decile (1-10)
+  (5) relative income change in % for each decile (1-10)
+  (6) BHC poverty rate baseline vs reform, for: all, child, working-age, senior
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b1_society_wide_pa.json
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      tolerance_pct: 1.0
+    - path: budget.tax_revenue_impact
+      tolerance_pct: 1.0
+    - path: budget.benefit_spending_impact
+      tolerance_pct: 1.0
+    - path: decile.average
+      tolerance_pct: 1.0
+    - path: decile.relative
+      tolerance_pct: 1.0
+    - path: poverty.poverty.all.baseline
+      tolerance_pct: 1.0
+    - path: poverty.poverty.all.reform
+      tolerance_pct: 1.0
+    - path: poverty.poverty.child
+      tolerance_pct: 1.0
+    - path: poverty.poverty.adult
+      tolerance_pct: 1.0
+    - path: poverty.poverty.senior
+      tolerance_pct: 1.0
+
+anchor:
+  # No PolicyEngine blog post analyses this exact reform; anchor is derived from
+  # PE-API's own output (the fixture) plus general PolicyEngine reporting norms.
+  must_mention:
+    - "Enhanced FRS"  # the dataset
+    - "2025"  # the year
+    - "BHC"  # which poverty measure (since the question asked for BHC)
+  must_not_say:
+    - "the reform pays for itself"  # PA raise is a cost, not revenue-raising
+    - "everyone gains"  # poorest decile gains very little since they pay little income tax
+  ideal_finding: |
+    The reform is a tax cut of broadly £20bn order of magnitude in 2025-26, with
+    relative gains rising through the middle deciles and falling off in the top
+    decile because the PA taper above £100k removes the benefit for very high
+    earners. BHC poverty falls modestly because the gains for low-income
+    taxpaying households push some across the relative-poverty line.
diff --git a/evals/scenarios/b2_ni_it_stacked.yaml b/evals/scenarios/b2_ni_it_stacked.yaml
new file mode 100644
index 0000000..0a99bf2
--- /dev/null
+++ b/evals/scenarios/b2_ni_it_stacked.yaml
@@ -0,0 +1,75 @@
+id: b2_ni_it_stacked
+test: B
+title: "Stacked NI + income tax reform (subset of Reeves Nov-2025 pre-Budget package)"
+what_it_tests: "Multi-parameter reform — the LLM must apply two changes in the right order and produce per-layer numbers comparable to PolicyEngine's published analysis. Reference is a peer-reviewed PolicyEngine analysis (Vahid Ahmadi, Nov 2025), filtered to the layers that still reproduce against current policyengine_uk baseline (see evals/fixtures/drift_report.md)."
+
+source_post: "uk-income-tax-ni-reforms-2025.md"
+source_post_url: "https://policyengine.org/uk/research/uk-income-tax-ni-reforms-2025"
+
+# Note on scope: the original post analysed three layers (freeze + NI cut + IT
+# increase). The freeze layer and combined number have been dropped from this
+# scenario because they no longer reproduce against current policyengine_uk —
+# the post-Autumn-Budget-2025 baseline has incorporated the freeze. See
+# evals/fixtures/drift_report.md for the kept/dropped audit trail.
+
+prompt: |
+  Run a UK 2026-27 economy-wide comparison. Baseline: current law.
+
+  Apply two reforms, EACH against current law (not stacked):
+
+  Reform A — reduce the National Insurance main rate from 8% to 6% (on earnings
+  between £12,570 and £50,270).
+
+  Reform B — increase the basic income tax rate from 20% to 22% and the higher
+  rate from 40% to 42%.
+
+  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+
+  Report from the model run, separately for each reform:
+  (a) Budgetary impact in £ for 2026-27.
+  (b) Relative income change in % by decile (D1 through D10).
+
+  British English. Every figure from the model run.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b2_ni_it_stacked.json
+  fields_to_compare:
+    # NI cut alone vs current law, 2026-27. Vahid published -£11.7bn; our
+    # local run reproduces within 1.2%. Tight tolerance, kept.
+    - path: ni_layer.budgetary_impact
+      tolerance_pct: 5.0
+
+    # Per-decile distributional shapes for the two reforms applied individually.
+    # The fixture stores the full decile.relative dict; the grader compares
+    # element-wise.
+    - path: ni_alone.decile_relative
+      tolerance_pct: 20.0
+    - path: it_alone.decile_relative
+      tolerance_pct: 20.0
+
+    # Not compared:
+    # - it_layer.budgetary_impact: dropped per 10% drift threshold (11.7% off
+    #   Vahid). See evals/fixtures/drift_report.md.
+    # - freeze_layer.budgetary_impact: dropped — current baseline has
+    #   incorporated the freeze, reform is a no-op.
+    # - combined.budgetary_impact: dropped — downstream of freeze drift.
+
+anchor:
+  must_mention:
+    - "Upper Earnings Limit"  # why NI cut doesn't help top earners much
+    - "higher rate"  # IT band where the increase concentrates
+    - "decile"  # required by the question shape
+  must_not_say:
+    - "all households gain"  # untrue under combined effect
+    - "the reforms are independent"  # they're a package even when scored separately
+  ideal_finding: |
+    The NI cut concentrates gains in deciles where workers' earnings sit in the
+    £12,570-£50,270 band — peaking around D8-D9 with ~1.1% relative gain.
+    The IT increase reduces income across all deciles, again peaking at D8-D9
+    with ~1.5-1.6% relative loss because that's where the higher-rate band
+    bites hardest. The two layers roughly offset for typical workers; pensioners
+    and landlords pay only the IT rise.
diff --git a/evals/scenarios/b3_household_calc.yaml b/evals/scenarios/b3_household_calc.yaml new file mode 100644 index 0000000..a7f56e9 --- /dev/null +++ b/evals/scenarios/b3_household_calc.yaml @@ -0,0 +1,47 @@ +id: b3_household_calc +test: B +title: "Household calculation — deterministic, no microdata" +what_it_tests: "Single-household calculations are the chat's strongest theoretical case — no microdata, no aggregation, just applying the rules to one situation dict. Establishes a baseline for what the chat can do when nothing else is in its way." + +prompt: | + Compute UK 2025 figures for a single adult, aged 35, with employment income of £45,000, no other income and no dependants, living in England, with no other assets or benefits. Use the policyengine_uk Python Simulation class with a `situation` dict (no microdata, no dataset). + + Report from the model run: + (1) household_net_income in £ + (2) income_tax in £ + (3) national_insurance (employee) in £ + (4) marginal_tax_rate at this income point (the combined IT+NI marginal rate on an extra £1) + + British English. Every figure from the model run. + +chat_settings: + model_backend: uk_python + num_runs: 3 + +reference: + fixture: pe_api/b3_household_calc.json + fields_to_compare: + - path: result.household_net_income + tolerance_pct: 0.5 + - path: result.income_tax + tolerance_pct: 0.5 + - path: result.national_insurance + tolerance_pct: 0.5 + - path: result.marginal_tax_rate + tolerance_pct: 0.5 + +anchor: + # Anchor derived from UK tax rule knowledge — no blog post needed since this is + # a deterministic single-household calculation in the basic-rate band. 
+ must_mention: + - "personal allowance" # £12,570 is the threshold to flag + - "basic rate" # £45,000 is in the basic-rate band + - "8%" # employee NI rate at this income + must_not_say: + - "higher rate" # £45k is below the higher-rate threshold of £50,270 + - "no income tax" # £45k is well above the PA + ideal_finding: | + At £45,000 gross, the worker pays 20% income tax on £32,430 (£45,000 − + £12,570 personal allowance) = £6,486, and 8% NI on the same band = £2,594 + (approximately). Combined marginal tax rate on an extra £1 of earnings is + 28% (20% income tax + 8% NI). Net income ≈ £35,920. diff --git a/evals/scenarios/b4_mtr_schedule.yaml b/evals/scenarios/b4_mtr_schedule.yaml new file mode 100644 index 0000000..231a194 --- /dev/null +++ b/evals/scenarios/b4_mtr_schedule.yaml @@ -0,0 +1,42 @@ +id: b4_mtr_schedule +test: B +title: "MTR schedule — sanity check against known-good ground truth" +what_it_tests: "Schedule lookup against rule-driven ground truth — direct policyengine_uk calls computed at fixture build time. Acts as a sanity check on the test infrastructure itself: if this fails, the runner or extractor is broken before we draw conclusions about anything else." + +prompt: | + For a single adult in the UK in 2025-26 with no children and no other income, compute the marginal tax rate (combined: income tax + employee National Insurance) on an extra £1 of employment income, at each of these gross employment income levels: £10,000, £20,000, £30,000, £50,000, £75,000, £100,000, £125,000, £150,000. + + Report a clear table: gross income, income tax marginal rate, NI marginal rate, combined marginal rate. + + Use the policyengine_uk Python Simulation class with a `situation` dict. No microdata, no dataset. Every figure from the model run. + +chat_settings: + model_backend: uk_python + num_runs: 3 + +reference: + fixture: pe_api/b4_mtr_schedule.json + fields_to_compare: + # Compare the combined MTR at each row. 
`rows` is a list-of-dicts; the + # grader iterates over it, using `key_by` to identify each row (gross + # income) and `compare` to pick which field to diff. + - path: rows + key_by: gross + compare: combined_mtr + tolerance_pct: 1.0 + +anchor: + # Pure rule-driven schedule — anchor is the expected schedule values themselves. + must_mention: + - "personal allowance" # below £12,570 = 0% MTR + - "60%" # the PA-taper band £100k-£125k + - "Upper Earnings Limit" # £50,270, where NI drops from 8% to 2% + must_not_say: + - "MTR is constant" # the schedule has steps; chat should describe them + ideal_finding: | + Combined IT+NI MTR schedule for a single adult, no other income, 2025-26: + £10k → 0% (below PA and NI primary threshold). + £20k–£50k → 28% (20% basic IT + 8% main NI). + £75k → 42% (40% higher-rate IT + 2% NI above UEL). + £100k–£125k → 62% (40% higher IT + 20% effective from PA taper + 2% NI). + £150k → 47% (45% additional-rate IT + 2% NI, PA fully tapered out).
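Reviewer note: the schedule in `ideal_finding` can be cross-checked by hand from the statutory rates, independently of the fixture. The sketch below is a rough rule-of-thumb calculator, not a call into policyengine_uk. It assumes 2025-26 rates for England, Wales and Northern Ireland (basic-rate band of £37,700 of taxable income, additional-rate threshold of £125,140, personal allowance withdrawn at £1 per £2 above £100,000) and treats employee NI as an annual calculation, which is a simplification. All function and constant names are hypothetical.

```python
PERSONAL_ALLOWANCE = 12_570
TAPER_START = 100_000           # PA is withdrawn at £1 per £2 of income above this
BASIC_BAND = 37_700             # width of the basic-rate band (taxable income)
ADDITIONAL_THRESHOLD = 125_140  # taxable income above this is taxed at 45%
NI_PRIMARY_THRESHOLD = 12_570
NI_UPPER_EARNINGS_LIMIT = 50_270


def income_tax(gross: float) -> float:
    """Annual income tax on employment income, with the PA taper applied."""
    allowance = max(0.0, PERSONAL_ALLOWANCE - max(0.0, gross - TAPER_START) / 2)
    taxable = max(0.0, gross - allowance)
    basic = min(taxable, BASIC_BAND) * 0.20
    higher = max(0.0, min(taxable, ADDITIONAL_THRESHOLD) - BASIC_BAND) * 0.40
    additional = max(0.0, taxable - ADDITIONAL_THRESHOLD) * 0.45
    return basic + higher + additional


def employee_ni(gross: float) -> float:
    """Annualised employee NI: 8% between the thresholds, 2% above the UEL."""
    main = max(0.0, min(gross, NI_UPPER_EARNINGS_LIMIT) - NI_PRIMARY_THRESHOLD) * 0.08
    upper = max(0.0, gross - NI_UPPER_EARNINGS_LIMIT) * 0.02
    return main + upper


def combined_mtr(gross: float, delta: float = 1.0) -> float:
    """Combined IT + NI marginal rate on an extra `delta` pounds of earnings."""
    before = income_tax(gross) + employee_ni(gross)
    after = income_tax(gross + delta) + employee_ni(gross + delta)
    return round((after - before) / delta, 4)


INCOME_LEVELS = (10_000, 20_000, 30_000, 50_000, 75_000, 100_000, 125_000, 150_000)
schedule = {gross: combined_mtr(gross) for gross in INCOME_LEVELS}
```

Under these assumptions the sketch gives 0% at £10k, 28% from £20k to £50k, 42% at £75k, 62% at £100k and £125k, and 47% at £150k, matching the `ideal_finding` schedule. The same functions give £6,486 income tax and roughly £2,594 NI at £45,000, in line with the b3 anchor.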