From 765afb57494fbdd3f4ed08a4def0d17586656e75 Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Fri, 15 May 2026 11:22:51 -0700
Subject: [PATCH 01/11] Add eval harness scaffold: spec, scenarios, fixtures
 dir
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Moves the eval design doc into the repo as evals/SPEC.md and lays out
the directory structure the harness will use. Ten hand-authored
scenarios are included as YAML — five Test A (chat as supplement) and
five Test B (chat as alternative). Each scenario covers a distinct
question shape and stress-tests a specific failure mode.

No runner yet — that's the next PR. This PR is just the data and
schema. See evals/README.md for layout and evals/SPEC.md for design,
thresholds, and roadmap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
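Reviewer note (not part of the commit): the Test B comparison that
evals/SPEC.md describes, checking each `fields_to_compare` entry against
its `tolerance_pct`, could be sketched roughly as below. This is a minimal
sketch under stated assumptions, not the planned runner or extractor: the
helper names (`get_path`, `within_tolerance`, `field_accuracy`) are
hypothetical, it assumes every compared path resolves to a single number
(paths such as `decile.average` would need per-key handling), and the
example values other than the -£20.43bn budgetary impact quoted in
a1_mechanism.yaml are made up for illustration.

```python
from typing import Any


def get_path(data: dict, path: str) -> Any:
    """Walk a dotted path such as 'budget.budgetary_impact' through nested dicts."""
    node = data
    for key in path.split("."):
        node = node[key]
    return node


def within_tolerance(chat_value: float, reference: float, tolerance_pct: float) -> bool:
    """True if chat_value is within tolerance_pct percent of the reference value."""
    if reference == 0:
        # Relative error is undefined at zero; require an exact match (assumption).
        return chat_value == 0
    return abs(chat_value - reference) / abs(reference) * 100 <= tolerance_pct


def field_accuracy(chat: dict, fixture: dict, fields: list[dict]) -> float:
    """Share of compared fields (0 to 1) that fall within their tolerance."""
    hits = [
        within_tolerance(
            get_path(chat, f["path"]),
            get_path(fixture, f["path"]),
            f["tolerance_pct"],
        )
        for f in fields
    ]
    return sum(hits) / len(hits)


# Illustrative values only: the tax_revenue_impact numbers are invented.
fixture = {"budget": {"budgetary_impact": -20.43e9, "tax_revenue_impact": -20.0e9}}
chat = {"budget": {"budgetary_impact": -20.50e9, "tax_revenue_impact": -21.0e9}}
fields = [
    {"path": "budget.budgetary_impact", "tolerance_pct": 1.0},
    {"path": "budget.tax_revenue_impact", "tolerance_pct": 1.0},
]

accuracy = field_accuracy(chat, fixture, fields)  # 0.5: one field of two within 1%
```

Keeping the tolerance per field, as the scenario YAML already does, lets the
household scenario use a tighter bound than the economy-wide ones without
any change to the comparison code.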
 .gitignore                                    |   3 +
 evals/README.md                               |  77 +++++++++
 evals/SPEC.md                                 | 159 ++++++++++++++++++
 evals/fixtures/pe_api/.gitkeep                |   0
 evals/scenarios/a1_mechanism.yaml             |  33 ++++
 evals/scenarios/a2_subset_slice.yaml          |  28 +++
 evals/scenarios/a3_multiparam_what_if.yaml    |  28 +++
 evals/scenarios/a4_out_of_scope.yaml          |  26 +++
 evals/scenarios/a5_factual_lookup.yaml        |  26 +++
 evals/scenarios/b1_society_wide_pa.yaml       |  45 +++++
 .../b2_society_wide_pa_plus_hrt.yaml          |  46 +++++
 evals/scenarios/b3_household_calc.yaml        |  31 ++++
 evals/scenarios/b4_mtr_schedule.yaml          |  22 +++
 evals/scenarios/b5_regional_reform.yaml       |  33 ++++
 14 files changed, 557 insertions(+)
 create mode 100644 evals/README.md
 create mode 100644 evals/SPEC.md
 create mode 100644 evals/fixtures/pe_api/.gitkeep
 create mode 100644 evals/scenarios/a1_mechanism.yaml
 create mode 100644 evals/scenarios/a2_subset_slice.yaml
 create mode 100644 evals/scenarios/a3_multiparam_what_if.yaml
 create mode 100644 evals/scenarios/a4_out_of_scope.yaml
 create mode 100644 evals/scenarios/a5_factual_lookup.yaml
 create mode 100644 evals/scenarios/b1_society_wide_pa.yaml
 create mode 100644 evals/scenarios/b2_society_wide_pa_plus_hrt.yaml
 create mode 100644 evals/scenarios/b3_household_calc.yaml
 create mode 100644 evals/scenarios/b4_mtr_schedule.yaml
 create mode 100644 evals/scenarios/b5_regional_reform.yaml
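Reviewer note (not part of the commit): the pre-committed Test A threshold
in evals/SPEC.md could be checked mechanically once the grading sheet is
filled in. The sketch below is one possible reading, not the planned
grader. It assumes each graded response is a dict with a `scores` mapping
keyed by the rubric field names and a boolean `fabricated` flag (both
hypothetical shapes), and it interprets "mean rubric score" as the mean
pooled over every dimension of every response.

```python
from statistics import mean

RUBRIC = ("relevance", "methodology", "reasonableness", "consistency_with_report", "honesty")
TRUST_KILLERS = ("reasonableness", "honesty")


def passes_test_a(responses: list[dict]) -> bool:
    """Apply the three Test A conditions to a list of graded responses."""
    # Condition 1: pooled mean of all rubric scores is at least 4.0.
    all_scores = [r["scores"][dim] for r in responses for dim in RUBRIC]
    mean_ok = mean(all_scores) >= 4.0

    # Condition 2: no response scores below 2 on Reasonableness or Honesty.
    floor_ok = all(r["scores"][dim] >= 2 for r in responses for dim in TRUST_KILLERS)

    # Condition 3: at most 1 in 5 responses contains a fabricated figure.
    # Integer arithmetic avoids floating-point edge cases at exactly 20%.
    fabricated_count = sum(1 for r in responses if r["fabricated"])
    fabrication_ok = fabricated_count * 5 <= len(responses)

    return mean_ok and floor_ok and fabrication_ok


def make_response(score: int = 5, fabricated: bool = False) -> dict:
    """Build a graded response with the same score on every rubric dimension."""
    return {"scores": {dim: score for dim in RUBRIC}, "fabricated": fabricated}


# 5 scenarios x 3 runs = 15 graded responses.
clean_run = [make_response() for _ in range(15)]
```

With 15 responses the fabrication condition tolerates at most three
fabricating responses, and a single Honesty score of 1 fails the run even
when the pooled mean stays well above 4.0.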

diff --git a/.gitignore b/.gitignore
index b543491..f0cddd5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,6 @@ dist/
 # Both backend/Dockerfile and modal_app.py regenerate this against the
 # installed policyengine-uk-compiled version, so it never lives in git.
 backend/reference.md
+
+# Eval run outputs — generated per run, not source
+evals/runs/
diff --git a/evals/README.md b/evals/README.md
new file mode 100644
index 0000000..9acd5fd
--- /dev/null
+++ b/evals/README.md
@@ -0,0 +1,77 @@
+# uk-chat evals
+
+Evaluation harness for the chat. Tests two positionings — **supplement** to app-v2 reports and **alternative** to them — against pre-committed thresholds.
+
+The full design (rationale, thresholds, scenario descriptions) lives in [SPEC.md](./SPEC.md). This README covers structure and how to extend.
+
+## What's here today
+
+```
+evals/
+  SPEC.md                ← the design doc — read this first
+  README.md              ← this file
+  scenarios/
+    a<n>_*.yaml          ← Test A scenarios (supplement)
+    b<n>_*.yaml          ← Test B scenarios (alternative)
+  fixtures/
+    pe_api/              ← reference PE-API responses for Test B
+                          ← populated by a follow-up PR
+  runs/                  ← raw eval-run output (gitignored)
+                          ← populated by the runner in a follow-up PR
+```
+
+## Scenario file shape
+
+Every scenario is a YAML file in `scenarios/`. The fields are the same across A and B; some are populated only on one side.
+
+```yaml
+id: a1_mechanism            # short stable identifier; also the filename stem
+test: A                     # A (supplement) or B (alternative)
+title: "Mechanism explanation"
+what_it_tests: "One sentence on why this scenario exists."
+
+# Sent to the chat as ?scenario_context= (Test A only). Mirrors what the
+# app-v2 drawer would prepend when opening from a report.
+scenario_context: |
+  ...
+
+# The user's actual message.
+prompt: |
+  ...
+
+chat_settings:
+  model_backend: uk_python  # uk_compiled | uk_python
+  num_runs: 3               # how many fresh sessions to run per scenario
+
+# Test A — qualitative grading rubric.
+rubric:
+  relevance: "1-5 — chat answers the actual follow-up"
+  methodology: "1-5 — chat states dataset/year/assumptions"
+  reasonableness: "1-5 — numbers in plausible range, internally consistent"
+  consistency_with_report: "1-5 — agrees with report on shared facts"
+  honesty: "1-5 — refuses cleanly vs fabricates"
+
+# Test B — numeric comparison against a fixture.
+reference:
+  fixture: pe_api/b1_society_wide_pa.json
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      tolerance_pct: 1.0
+    - path: decile.relative
+      tolerance_pct: 1.0
+```
+
+The runner (added in a follow-up PR) reads every YAML in `scenarios/`, POSTs to a configured chat backend, and writes outputs under `runs/<timestamp>/<scenario-id>/`. See SPEC.md "Roadmap" for what's coming next.
+
+## Decision thresholds
+
+Pre-committed in SPEC.md so we don't rationalize ambiguous results later:
+
+- **Test A** passes if mean rubric score ≥ 4.0 across all responses, no individual score < 2 on Reasonableness or Honesty, and at most 1 in 5 responses fabricates a figure.
+- **Test B** passes if field-level accuracy ≥ 95%, self-consistency SD < 0.5% of mean, methodology drift in ≤ 1 of 5 scenarios, failure rate < 10%.
+
+## Adding scenarios
+
+Add a YAML file. That's it — the runner picks it up automatically. Keep the `id` short and stable (it appears in filenames and reports).
+
+When adding a Test B scenario, also add the reference fixture under `fixtures/pe_api/` and point the scenario's `reference.fixture` field at it using a path relative to `fixtures/` (for example `pe_api/b1_society_wide_pa.json`).
diff --git a/evals/SPEC.md b/evals/SPEC.md
new file mode 100644
index 0000000..9268113
--- /dev/null
+++ b/evals/SPEC.md
@@ -0,0 +1,159 @@
+# uk-chat evaluation spec
+
+> Defines how we test uk-chat output to decide whether the chat works better as a **supplement** to app-v2 reports (follow-up questions reports can't answer) or as an **alternative** producing the same answers reports do. Two tests, two pre-committed thresholds.
+
+Related external context:
+- [policyengine-app-v2#1036](https://github.com/PolicyEngine/policyengine-app-v2/pull/1036) — report chat drawer (the supplement integration)
+- [policyengine-uk-chat#51](https://github.com/PolicyEngine/policyengine-uk-chat/pull/51) — backend selector + `scenario_context` (chat changes that enable both positionings)
+- [policyengine-api](https://github.com/PolicyEngine/policyengine-api) — produces the reference numbers used in Test B
+
+## Problem statement
+
+Decide between two product positionings of uk-chat: **supplement** (a follow-up affordance on app-v2 reports for questions reports can't answer) or **alternative** (chat as a primary way to get the same answers reports give, in addition to or instead of the report UI). The decision needs evidence, not opinions. This spec defines the evaluation that produces that evidence.
+
+## Two tests, two positionings, two thresholds
+
+### Test A — Chat as supplement
+
+**Question:** Does the chat usefully extend reports for the long-tail of follow-up questions?
+
+**Test setup:**
+- 5 scenarios of `(report shown, follow-up question)` pairs.
+- Each scenario is run 3× in fresh sessions to capture variance.
+- Grade each response against the rubric below. No PE-API comparison is needed, because these are questions reports can't answer.
+
+**Rubric (per response, scored 1-5):**
+- **Relevance.** Did the chat answer the actual follow-up, or pivot to something unrelated?
+- **Methodology disclosure.** Did the chat say what dataset/year/assumption it used?
+- **Reasonableness.** Are the numbers in a plausible range? Do they internally agree (sum-to-totals, decile orderings, sign of effect)?
+- **Consistency with report.** Where the chat references something the report already showed, do they agree?
+- **Honesty about limits.** When the chat can't answer, does it say so cleanly, or fabricate?
+
+**Decision threshold (pre-committed):**
+- Mean rubric score ≥ 4.0 across all responses.
+- No single response scoring < 2 on Reasonableness or Honesty (those are the trust-killer failures).
+- No more than 1 in 5 responses where the chat fabricates a figure not derivable from a run.
+
+### Test B — Chat as alternative
+
+**Question:** Can the chat produce the same answers app-v2 reports do, with comparable trustworthiness?
+
+**Test setup:**
+- 5 scenarios chosen to match shapes app-v2 already answers (economy-wide reform, household calc, MTR, etc).
+- For each: run app-v2 report (or equivalent PE-API call) → record reference numbers. Run chat 3× per scenario → record numeric output.
+- Compare numerically per scenario field.
+
+**Metrics:**
+- **Field-level accuracy:** % of compared numeric fields within the per-field `tolerance_pct` of the reference value (1% for B1 and B2, 0.5% for B3).
+- **Self-consistency:** standard deviation of each numeric field across 3 chat runs as a % of mean. Should be < 0.5% for the chat to be considered deterministic enough.
+- **Methodology drift:** count of scenarios in which at least one chat run picks a different methodology (e.g. poverty BHC vs AHC, dataset year, decile definition) than PE-API.
+- **Failure rate:** % of runs that fail to produce comparable numbers (timeouts, errors, LLM-thrash episodes).
+
+**Decision threshold (pre-committed):**
+- Field-level accuracy ≥ 95%.
+- Self-consistency SD < 0.5% of mean for every numeric field.
+- Methodology drift in ≤ 1 of 5 scenarios.
+- Failure rate < 10%.
+
+**If thresholds met:** alternative positioning is viable. **If not:** supplement-only is the right framing.
+
+## Test scenarios
+
+Each scenario is a tuple of `(scenario_id, question shape, what's being tested, fixtures)`. All UK 2025, baseline = current law unless stated. All chat runs use the `uk_python` backend pinned to `enhanced_frs_2023_24.h5` (the same dataset PE-API uses).
+
+### Test A — supplement scenarios
+
+These all assume the user has just viewed a specific app-v2 report and asks a follow-up.
+
+**A1 — Mechanism explanation**
+- **Report shown:** UK PA-raise reform 2025 (current £12,570 → £15,000), economy-wide.
+- **Follow-up question:** "The report says the top decile gains less in % terms than the 8th decile — why? Walk me through the mechanism."
+- **What it tests:** Can the chat reason about *why* a result looks the way it does, beyond just quoting the numbers?
+- **Doesn't fit a report** because the report shows what, not why. This is the long-tail question shape supplements should serve.
+
+**A2 — Subset breakdown not in the report**
+- **Report shown:** UK PA-raise reform 2025, economy-wide.
+- **Follow-up question:** "How does this reform affect single parents with two children specifically? Breakdown by income decile."
+- **What it tests:** Can the chat slice the population in a way the canonical report doesn't, computing fresh from the model?
+- **Stress test** because this combines applying a reform *and* filtering the population — two operations that are individually non-trivial for the LLM via the Python backend's API.
+
+**A3 — Comparative scenario the user invented**
+- **Report shown:** UK PA-raise reform 2025.
+- **Follow-up question:** "What if we'd also raised the higher rate threshold from £50,270 to £55,000 alongside the PA raise — would that be more or less progressive?"
+- **What it tests:** Multi-parameter ad-hoc reform comparison. Users couldn't construct this in app-v2 without building a new report.
+- **High-value supplement** if it works because it lets users iterate without leaving the page.
+
+**A4 — Out-of-scope question, polite refusal**
+- **Report shown:** UK PA-raise reform 2025.
+- **Follow-up question:** "How would this same reform affect inflation forecasts?"
+- **What it tests:** Honest scope refusal vs fabrication. PolicyEngine doesn't model macro feedback effects; the chat should say so cleanly.
+- **The "honesty under pressure" test.** Easy to fail by confidently making something up.
+
+**A5 — Historical/factual question, no simulation needed**
+- **Report shown:** UK PA-raise reform 2025.
+- **Follow-up question:** "How has the UK personal allowance changed over the last 15 years? Just the figures."
+- **What it tests:** Whether the chat handles factual-lookup questions (which need no simulation) without unnecessary tool use, and whether it knows where the data lives.
+- **Edge case** because this is information the underlying packages have but the chat may not surface cleanly — could waste tool calls trying to "calculate" what's just a parameter lookup.
+
+### Test B — alternative scenarios
+
+These are questions app-v2 already answers via reports. The chat must match.
+
+**B1 — Society-wide reform, single-parameter**
+- **Question:** "Run a UK economy-wide comparison for 2025: baseline current law, reform raises the income tax personal allowance from £12,570 to £15,000. Report total budgetary impact, decile income changes (both £ and %), and BHC poverty rates for all/child/working-age/senior."
+- **PE-API reference:** Generated against the live PE-API by the fixture-build step and saved to `evals/fixtures/pe_api/b1_society_wide_pa.json`.
+- **What it tests:** Baseline replication. If the chat can't match here, it can't match anywhere.
+
+**B2 — Society-wide reform, multi-parameter**
+- **Question:** "UK economy-wide 2025: reform raises personal allowance to £15,000 *and* raises higher-rate threshold to £55,000. Report budget impact, decile impacts, poverty."
+- **PE-API reference:** to be generated.
+- **What it tests:** Composition. Multi-parameter reforms are harder for the LLM to apply correctly — single-parameter tests don't catch the composition bug.
+
+**B3 — Household calc (no microdata needed)**
+- **Question:** "Single adult, age 35, employment income £45,000 in UK 2025, no dependents, England. Compute: net income, income tax, employee NI, marginal tax rate at this income point."
+- **PE-API reference:** to be generated via `/uk/household` endpoint.
+- **What it tests:** Deterministic household calculations — single-household rule application without microdata or aggregation. The chat should be at its best here.
+
+**B4 — MTR schedule**
+- **Question:** "Compute the combined IT+NI marginal rate at gross income levels £10k, £20k, £30k, £50k, £75k, £100k, £125k, £150k for a single UK adult in 2025."
+- **Ground truth:** Generated by directly calling `policyengine_uk` at fixture-build time. No microdata or PE-API involved.
+- **What it tests:** Schedule lookup against rule-driven ground truth. Acts as a sanity check on the test infrastructure — if this scenario fails, the runner or extractor is broken before we draw conclusions about anything else.
+
+**B5 — Regional reform**
+- **Question:** "Run a Scotland-only comparison for 2025: reform raises the Scottish starter-rate threshold by £1,000. Report Scotland-specific budgetary impact and decile changes."
+- **PE-API reference:** to be generated via `/uk/economy/...?region=scotland`.
+- **What it tests:** Regional handling, which the chat hasn't been stress-tested on. Vercel/Modal preview deployments showed `region` flowing through, but we haven't validated the chat picks the right slice.
+
+## Proposed solutions & tradeoffs
+
+### Approach 1: hand-authored small scenarios, manual grading
+
+**Description:** As above — 10 scenarios total (5 A + 5 B), 3 runs each = 30 conversations. Grader is a human (initially Sakshi).
+**Pros:** Fast to set up. Grading captures qualitative judgment (methodology drift, hallucination) that automated metrics miss. Small enough that mistakes are recoverable.
+**Cons:** N=5 per test is statistically thin. Manual grading is the bottleneck. Doesn't scale to ongoing monitoring.
+**Verdict:** accepted for v1.
+
+### Approach 2: larger automated test set
+
+**Description:** ~50 scenarios per test, parsed numerics extracted via regex/LLM, compared automatically.
+**Pros:** Better statistical power. Re-runnable on every chat change.
+**Cons:** Significant up-front cost to build the harness and reliable extractors. Probably premature — we don't know yet what the failure modes are.
+**Verdict:** deferred. Revisit after v1 results.
+
+### Approach 3: live A/B with real users
+
+**Description:** Ship both supplement and standalone-chat to a fraction of users, instrument usage, decide based on real behaviour.
+**Pros:** Most ecologically valid signal.
+**Cons:** Slow to gather data, exposes possibly-bad output to users, hard to attribute outcomes cleanly.
+**Verdict:** rejected for the initial decision. Could be a follow-up after v1 confirms the basic story.
+
+## Roadmap
+
+This PR lands the **spec, directory scaffold, and the 10 scenario files** only. Subsequent PRs:
+
+1. **Runner.** A small Python script under `evals/runner/` that loads scenarios, POSTs each prompt N times to a configured chat backend URL, and saves raw SSE logs under `evals/runs/<timestamp>/`.
+2. **B fixtures.** Generate reference outputs for B1-B5 into `evals/fixtures/pe_api/` — PE-API calls for the society-wide / regional scenarios, direct `policyengine_uk` calls for the household / MTR scenarios.
+3. **Grader.** For Test A: a markdown grading sheet (one row per response) with the rubric columns. Manual fill for v1. For Test B: an extractor that pulls numerics from SSE responses and diffs against fixtures.
+4. **Findings writeup.** Once 30 conversations are graded, a `RESULTS-YYYY-MM-DD.md` per-run report. Per-scenario verdicts, threshold check results, recommended positioning, new failure modes if any.
+
+The chat backend URL the runner targets is configurable via env var so the eval can run against either a preview deploy or production.
diff --git a/evals/fixtures/pe_api/.gitkeep b/evals/fixtures/pe_api/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/evals/scenarios/a1_mechanism.yaml b/evals/scenarios/a1_mechanism.yaml
new file mode 100644
index 0000000..9e155f5
--- /dev/null
+++ b/evals/scenarios/a1_mechanism.yaml
@@ -0,0 +1,33 @@
+id: a1_mechanism
+test: A
+title: "Mechanism explanation"
+what_it_tests: "Can the chat reason about *why* a report finding looks the way it does, beyond quoting numbers? Reports show what; this is the explanatory niche supplements should serve."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+  The report showed (for reference):
+  - Budgetary impact: -£20.43bn
+  - Relative income change by decile (%): 1→0.51, 2→0.90, 3→1.09, 4→1.12, 5→1.29, 6→1.48, 7→1.60, 8→1.56, 9→1.54, 10→0.91
+  - Child poverty BHC: 14.58% → 14.28% (-0.30 pp)
+
+  Use these figures as already-trusted context. If the user asks for figures already shown, quote them; otherwise run a fresh simulation.
+
+prompt: |
+  The report says the top decile gains less in % terms (0.91%) than the 8th decile (1.56%) and 9th decile (1.54%). Why is that? Walk me through the mechanism in 2-3 paragraphs.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat answers the actual 'why' question rather than pivoting to a restatement of 'what'"
+  methodology: "1-5 — chat names the mechanism (PA taper at £100k? higher rate? earnings vs income share?) precisely"
+  reasonableness: "1-5 — the explanation matches UK tax rules; sign and direction are correct"
+  consistency_with_report: "1-5 — any numbers cited match the figures in scenario_context"
+  honesty: "1-5 — if the chat doesn't know, it says so cleanly; doesn't fabricate magnitudes"
diff --git a/evals/scenarios/a2_subset_slice.yaml b/evals/scenarios/a2_subset_slice.yaml
new file mode 100644
index 0000000..e96ea7a
--- /dev/null
+++ b/evals/scenarios/a2_subset_slice.yaml
@@ -0,0 +1,28 @@
+id: a2_subset_slice
+test: A
+title: "Subset breakdown not in the report"
+what_it_tests: "Can the chat slice the population (single parents with two children) in a way the canonical report doesn't, computing fresh from the model? Stress test — combines applying a reform with filtering the population, two non-trivial operations against the Python backend's API."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+  The report showed aggregate (whole-UK) impacts but did not break down by household type.
+
+prompt: |
+  How does this PA reform affect single parents with two children specifically? Give me decile-by-decile average gains in £ for just that household type, using the same dataset and year as the report.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat actually filters to single-parent-2-child households rather than regurgitating the whole-UK numbers"
+  methodology: "1-5 — chat states how it identified single parents (count_adults=1, count_children=2?), names the dataset, year, income equivalisation choice"
+  reasonableness: "1-5 — the per-decile gains are positive, monotone-ish in income, and plausible vs the whole-UK figures (probably similar magnitudes since the reform is a tax change, not benefit)"
+  consistency_with_report: "1-5 — references to the whole-UK figures match scenario_context exactly"
+  honesty: "1-5 — if the chat runs out of tool budget or can't filter cleanly, it says so rather than fabricating; explicitly flags low-N decile cells"
diff --git a/evals/scenarios/a3_multiparam_what_if.yaml b/evals/scenarios/a3_multiparam_what_if.yaml
new file mode 100644
index 0000000..fc377c6
--- /dev/null
+++ b/evals/scenarios/a3_multiparam_what_if.yaml
@@ -0,0 +1,28 @@
+id: a3_multiparam_what_if
+test: A
+title: "Multi-parameter what-if the user invented"
+what_it_tests: "Can the chat answer ad-hoc reform variations users couldn't construct in app-v2 without building a new report? High-value supplement if it works, because it lets users iterate without leaving the page."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+  The report showed the impact of the PA raise alone. The user is now asking about a hypothetical extension.
+
+prompt: |
+  What if we also raised the higher-rate threshold from £50,270 to £55,000 alongside the PA raise? Would the combined reform be more or less progressive than the PA raise alone? Compare them on (a) budgetary impact and (b) the gain in the bottom three deciles vs the top three deciles.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat actually runs the combined reform and compares to baseline OR to the PA-only reform; doesn't refuse"
+  methodology: "1-5 — chat is explicit about what 'more progressive' means (D1-D3 vs D8-D10? % vs £?) and what comparison is being made"
+  reasonableness: "1-5 — combined budgetary cost is larger than PA-only cost; HRT raise concentrates gains in upper-middle deciles, so combined is *less* progressive than PA-only by most measures; chat reaches a defensible conclusion"
+  consistency_with_report: "1-5 — PA-only numbers cited match the scenario_context figures"
+  honesty: "1-5 — chat caveats that 'progressive' has multiple definitions, OR explicitly picks one and sticks with it; doesn't claim a stronger conclusion than the numbers support"
diff --git a/evals/scenarios/a4_out_of_scope.yaml b/evals/scenarios/a4_out_of_scope.yaml
new file mode 100644
index 0000000..4a2c6fa
--- /dev/null
+++ b/evals/scenarios/a4_out_of_scope.yaml
@@ -0,0 +1,26 @@
+id: a4_out_of_scope
+test: A
+title: "Out-of-scope question — does the chat refuse honestly?"
+what_it_tests: "Honesty under pressure. PolicyEngine doesn't model macroeconomic feedback (inflation, GDP response, behavioural elasticities beyond the supply-side baked in). The chat should refuse cleanly or scope down, not fabricate a confident answer."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+prompt: |
+  How would this reform affect UK inflation over the next 12 months? Give me a forecast.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat addresses the question rather than pivoting to a different one"
+  methodology: "1-5 — chat correctly identifies that the question requires macroeconomic modelling, which PolicyEngine doesn't do"
+  reasonableness: "1-5 — chat reaches the right verdict (out of scope) without engineering a fake answer; if it does offer adjacent simulation outputs (e.g. 'here's the household-income channel'), those are correct"
+  consistency_with_report: "1-5 — if any report figures are referenced, they match scenario_context"
+  honesty: "1-5 — chat REFUSES cleanly. This is the trust-killer dimension for this scenario. A confident inflation forecast scores 1; an explicit 'I can't model this' scores 5."
diff --git a/evals/scenarios/a5_factual_lookup.yaml b/evals/scenarios/a5_factual_lookup.yaml
new file mode 100644
index 0000000..b5ffb0c
--- /dev/null
+++ b/evals/scenarios/a5_factual_lookup.yaml
@@ -0,0 +1,26 @@
+id: a5_factual_lookup
+test: A
+title: "Historical parameter lookup — no simulation needed"
+what_it_tests: "Does the chat handle factual-lookup questions efficiently, without unnecessary tool use? Tests whether the chat knows when a question is a parameter retrieval vs when it needs to run code."
+
+scenario_context: |
+  The user is viewing a PolicyEngine UK report.
+  Country: UK
+  Year: 2025
+  Reform: Raise the income tax personal allowance from the current £12,570 to £15,000.
+  Baseline: Current law.
+  Dataset: Enhanced FRS 2023-24.
+
+prompt: |
+  How has the UK personal allowance changed over the last 15 years? Just the figures by year, no analysis.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+rubric:
+  relevance: "1-5 — chat returns a year-by-year series, not commentary or simulation output"
+  methodology: "1-5 — chat names the parameter path it queried (e.g. gov.hmrc.income_tax.allowances.personal_allowance.amount) so the answer is reproducible"
+  reasonableness: "1-5 — the values are correct historical PA amounts. 2010 ≈ £6,475; 2014 ≈ £10,000; 2016 ≈ £11,000; 2019 ≈ £12,500; 2021-2027 frozen at £12,570. Off-by-tax-year shifts are forgivable; wildly wrong values are not"
+  consistency_with_report: "1-5 — if the current value (£12,570) is in the series, it matches the report's baseline"
+  honesty: "1-5 — if the underlying model doesn't go back 15 years, chat says so and gives the range it has, rather than fabricating early-2010s values"
diff --git a/evals/scenarios/b1_society_wide_pa.yaml b/evals/scenarios/b1_society_wide_pa.yaml
new file mode 100644
index 0000000..5148b4c
--- /dev/null
+++ b/evals/scenarios/b1_society_wide_pa.yaml
@@ -0,0 +1,45 @@
+id: b1_society_wide_pa
+test: B
+title: "Society-wide PA reform — baseline replication"
+what_it_tests: "Can the chat match PE-API on the single most-tested reform shape? If chat can't match here, it can't match anywhere."
+
+prompt: |
+  Run an economy-wide comparison for the UK in 2025. Baseline: current law. Reform: raise the income tax personal allowance from £12,570 to £15,000. Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5. Use Microsimulation, not Simulation.
+
+  Report precisely from the model run:
+  (1) total budgetary impact in £
+  (2) income tax revenue change in £
+  (3) benefit spending change in £
+  (4) average household income change in £ for each decile (1-10)
+  (5) relative income change in % for each decile (1-10)
+  (6) BHC poverty rate baseline vs reform, for: all, child, working-age, senior
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b1_society_wide_pa.json
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      tolerance_pct: 1.0
+    - path: budget.tax_revenue_impact
+      tolerance_pct: 1.0
+    - path: budget.benefit_spending_impact
+      tolerance_pct: 1.0
+    - path: decile.average
+      tolerance_pct: 1.0
+    - path: decile.relative
+      tolerance_pct: 1.0
+    - path: poverty.poverty.all.baseline
+      tolerance_pct: 1.0
+    - path: poverty.poverty.all.reform
+      tolerance_pct: 1.0
+    - path: poverty.poverty.child
+      tolerance_pct: 1.0
+    - path: poverty.poverty.adult
+      tolerance_pct: 1.0
+    - path: poverty.poverty.senior
+      tolerance_pct: 1.0
diff --git a/evals/scenarios/b2_society_wide_pa_plus_hrt.yaml b/evals/scenarios/b2_society_wide_pa_plus_hrt.yaml
new file mode 100644
index 0000000..6aff475
--- /dev/null
+++ b/evals/scenarios/b2_society_wide_pa_plus_hrt.yaml
@@ -0,0 +1,46 @@
+id: b2_society_wide_pa_plus_hrt
+test: B
+title: "Society-wide multi-parameter reform"
+what_it_tests: "Composition. Single-parameter reform tests can pass while multi-parameter ones fail — the LLM applies the first change correctly but botches the combination. This catches that bug class."
+
+prompt: |
+  Run an economy-wide comparison for the UK in 2025. Baseline: current law. Reform applies TWO changes simultaneously:
+    (a) raise the income tax personal allowance from £12,570 to £15,000
+    (b) raise the higher-rate threshold from £50,270 to £55,000
+
+  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5. Use Microsimulation, not Simulation.
+
+  Report precisely from the model run:
+  (1) total budgetary impact in £
+  (2) income tax revenue change in £
+  (3) average household income change in £ for each decile (1-10)
+  (4) relative income change in % for each decile (1-10)
+  (5) BHC poverty rate baseline vs reform, for: all, child, working-age, senior
+
+  British English. Every figure from the model run, not memory.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b2_society_wide_pa_plus_hrt.json
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      tolerance_pct: 1.0
+    - path: budget.tax_revenue_impact
+      tolerance_pct: 1.0
+    - path: decile.average
+      tolerance_pct: 1.0
+    - path: decile.relative
+      tolerance_pct: 1.0
+    - path: poverty.poverty.all.baseline
+      tolerance_pct: 1.0
+    - path: poverty.poverty.all.reform
+      tolerance_pct: 1.0
+    - path: poverty.poverty.child
+      tolerance_pct: 1.0
+    - path: poverty.poverty.adult
+      tolerance_pct: 1.0
+    - path: poverty.poverty.senior
+      tolerance_pct: 1.0
diff --git a/evals/scenarios/b3_household_calc.yaml b/evals/scenarios/b3_household_calc.yaml
new file mode 100644
index 0000000..e341854
--- /dev/null
+++ b/evals/scenarios/b3_household_calc.yaml
@@ -0,0 +1,31 @@
+id: b3_household_calc
+test: B
+title: "Household calculation — deterministic, no microdata"
+what_it_tests: "Single-household calculations are the chat's strongest theoretical case — no microdata, no aggregation, just applying the rules to one situation dict. Establishes a baseline for what the chat can do when nothing else is in its way."
+
+prompt: |
+  Compute UK 2025 figures for a single adult, age 35, with employment income £45,000, no other income, no dependents, lives in England, has no other assets or benefits. Use the policyengine_uk Python Simulation class with a `situation` dict (no microdata, no dataset).
+
+  Report from the model run:
+  (1) household_net_income in £
+  (2) income_tax in £
+  (3) national_insurance (employee) in £
+  (4) marginal_tax_rate at this income point (the combined IT+NI marginal rate on an extra £1)
+
+  British English. Every figure from the model run.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b3_household_calc.json
+  fields_to_compare:
+    - path: result.household_net_income
+      tolerance_pct: 0.5
+    - path: result.income_tax
+      tolerance_pct: 0.5
+    - path: result.national_insurance
+      tolerance_pct: 0.5
+    - path: result.marginal_tax_rate
+      tolerance_pct: 0.5
diff --git a/evals/scenarios/b4_mtr_schedule.yaml b/evals/scenarios/b4_mtr_schedule.yaml
new file mode 100644
index 0000000..efd7318
--- /dev/null
+++ b/evals/scenarios/b4_mtr_schedule.yaml
@@ -0,0 +1,22 @@
+id: b4_mtr_schedule
+test: B
+title: "MTR schedule — sanity check against known-good ground truth"
+what_it_tests: "Schedule lookup against rule-driven ground truth — direct policyengine_uk calls computed at fixture build time. Acts as a sanity check on the test infrastructure itself: if this fails, the runner or extractor is broken before we draw conclusions about anything else."
+
+prompt: |
+  For a single adult in the UK in 2025/26 with no children and no other income, compute the marginal income tax rate (combined: income tax + employee National Insurance) on an extra £1 of employment income, at each of these gross employment income levels: £10,000, £20,000, £30,000, £50,000, £75,000, £100,000, £125,000, £150,000.
+
+  Report a clear table: gross income, income tax marginal rate, NI marginal rate, combined marginal rate.
+
+  Use the policyengine_uk Python Simulation class with a `situation` dict. No microdata, no dataset. Every figure from the model run.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b4_mtr_schedule.json
+  fields_to_compare:
+    - path: rows
+      tolerance_pct: 1.0
+      # Each row: { gross, it_mtr, ni_mtr, combined_mtr }. Compare element-wise.
diff --git a/evals/scenarios/b5_regional_reform.yaml b/evals/scenarios/b5_regional_reform.yaml
new file mode 100644
index 0000000..5bb5e18
--- /dev/null
+++ b/evals/scenarios/b5_regional_reform.yaml
@@ -0,0 +1,33 @@
+id: b5_regional_reform
+test: B
+title: "Regional (Scotland-only) reform"
+what_it_tests: "Region handling, which the chat hasn't been stress-tested on. Scotland has its own income tax bands (separate parameter tree from rest-of-UK). The chat needs to find and apply the right parameter, not the UK-wide one."
+
+prompt: |
+  Run a Scotland-only economy-wide comparison for 2025. Baseline: current law. Reform: raise the Scottish starter-rate threshold (currently £14,876) by £1,000 to £15,876.
+
+  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5. Use Microsimulation. Filter to Scotland-resident households only.
+
+  Report from the model run:
+  (1) Scotland-specific budgetary impact in £ (only the Scottish exchequer effect, not rest-of-UK)
+  (2) average household income change in £ by decile (1-10), Scotland-only deciles
+  (3) BHC poverty rate baseline vs reform, all-Scotland
+
+  British English. Every figure from the model run.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b5_regional_reform.json
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      tolerance_pct: 2.0
+      # Slightly wider tolerance — region-filtering choices (residence vs
+      # employment vs taxable-income basis) can legitimately introduce 1-2%
+      # variance even with the same engine.
+    - path: decile.average
+      tolerance_pct: 2.0
+    - path: poverty.poverty.all
+      tolerance_pct: 1.0

From c654db9cd60645407e759c420f07f0205c8d021b Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Tue, 19 May 2026 12:56:12 -0700
Subject: [PATCH 02/11] Ground scenarios in PolicyEngine's published research
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two changes:

1. Replace the two made-up B scenarios (B2 PA+HRT, B5 Scotland) with
   ones drawn from Vahid Ahmadi's published UK analyses:
   - B2 → stacked NI/IT/threshold-freeze reform (Nov 2025 pre-Budget)
     with reference figures from uk-income-tax-ni-reforms-2025.md
   - B5 → remove the two-child benefit limit (Autumn Budget 2025)
     with reference figures from uk-two-child-limit.md

   This shifts Test B from "does chat match a one-off API call I made"
   to "does chat reproduce PolicyEngine's published analyses" — a much
   stronger framing.

2. Add `anchor` blocks to every scenario. Anchors carry:
   - must_mention: phrases a good answer must include
   - must_not_say: claims that would be wrong
   - ideal_explanation / ideal_finding: prose sketch the grader uses

   In v1, anchors are human-grader aids. In v2 they become inputs to
   an automated LLM-judge.

Per-scenario anchor sourcing documented in SPEC.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
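A minimal sketch of how the `must_mention` / `must_not_say` lists described in the commit message might be checked mechanically. It assumes case-insensitive literal phrase matching, and `check_anchor` is a hypothetical helper that this patch does not add; in v1 the anchors are read by a human grader.

```python
import re


def check_anchor(response: str, must_mention: list[str], must_not_say: list[str]) -> dict:
    """Check a chat response against an anchor block (hypothetical helper).

    Assumption: each phrase is matched literally and case-insensitively.
    """
    def found(phrase: str) -> bool:
        return re.search(re.escape(phrase), response, flags=re.IGNORECASE) is not None

    missing = [p for p in must_mention if not found(p)]
    forbidden = [p for p in must_not_say if found(p)]
    return {
        "missing": missing,
        "forbidden": forbidden,
        "passed": not missing and not forbidden,
    }


# Anchor values copied from the README example in this patch.
must_mention = ["personal allowance taper", "60%", "£100,000"]
must_not_say = ["top decile pays more tax than the 8th"]

good = "The personal allowance taper starts at £100,000 and creates an effective 60% marginal band."
bad = "Above £100,000 the top decile pays more tax than the 8th."

good_result = check_anchor(good, must_mention, must_not_say)
bad_result = check_anchor(bad, must_mention, must_not_say)
```

Literal matching would miss the equivalents noted in the YAML comments (for example "PA withdrawal" for "personal allowance taper"), so a real check would probably need alternations per phrase or the LLM-judge planned for v2.
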
 evals/README.md                               | 11 +++
 evals/SPEC.md                                 | 43 +++++++++--
 evals/scenarios/a1_mechanism.yaml             | 26 +++++++
 evals/scenarios/a2_subset_slice.yaml          | 24 ++++++
 evals/scenarios/a3_multiparam_what_if.yaml    | 28 +++++++
 evals/scenarios/a4_out_of_scope.yaml          | 22 ++++++
 evals/scenarios/a5_factual_lookup.yaml        | 27 +++++++
 evals/scenarios/b1_society_wide_pa.yaml       | 17 ++++
 evals/scenarios/b2_ni_it_stacked.yaml         | 77 +++++++++++++++++++
 .../b2_society_wide_pa_plus_hrt.yaml          | 46 -----------
 evals/scenarios/b3_household_calc.yaml        | 16 ++++
 evals/scenarios/b4_mtr_schedule.yaml          | 16 ++++
 evals/scenarios/b5_regional_reform.yaml       | 33 --------
 evals/scenarios/b5_two_child_limit.yaml       | 62 +++++++++++++++
 14 files changed, 361 insertions(+), 87 deletions(-)
 create mode 100644 evals/scenarios/b2_ni_it_stacked.yaml
 delete mode 100644 evals/scenarios/b2_society_wide_pa_plus_hrt.yaml
 delete mode 100644 evals/scenarios/b5_regional_reform.yaml
 create mode 100644 evals/scenarios/b5_two_child_limit.yaml
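
The B2 example-household reference values in `b2_ni_it_stacked.yaml` (NI change, IT change, net change) can be reproduced by hand, which gives a quick sanity check on the fixture. The sketch below is one reading of the scenario, not the published methodology: it assumes NI is charged on gross earnings within the £12,570 to £50,270 band, that the pension contribution reduces income for income tax only, and that `tolerance_pct` means a relative difference against `expected_approx`. The function names are hypothetical.

```python
NI_MAIN_BAND = (12_570, 50_270)   # band quoted in the B2 prompt
PERSONAL_ALLOWANCE = 12_570
HIGHER_RATE_THRESHOLD = 50_270


def example_household_changes(gross: int = 60_000, pension: int = 10_000) -> dict:
    """Rate-only changes for the B2 example household (assumed reading)."""
    # NI main rate 8% -> 6% on gross earnings inside the main band.
    ni_band = min(gross, NI_MAIN_BAND[1]) - NI_MAIN_BAND[0]
    ni_change = ni_band * (6 - 8) / 100

    # Basic rate 20% -> 22%. Income after pension relief is 50,000, below the
    # higher-rate threshold, so only the basic-rate rise applies here.
    income_after_pension = gross - pension
    basic_band = min(income_after_pension, HIGHER_RATE_THRESHOLD) - PERSONAL_ALLOWANCE
    it_change = basic_band * (22 - 20) / 100

    net_change = round(-(ni_change + it_change), 2)
    return {"ni_change": ni_change, "it_change": it_change, "net_change": net_change}


def within_tolerance(actual: float, expected: float, tolerance_pct: float) -> bool:
    """Assumed meaning of tolerance_pct: relative difference against expected."""
    return abs(actual - expected) <= abs(expected) * tolerance_pct / 100


changes = example_household_changes()
checks = {
    "ni_change": within_tolerance(changes["ni_change"], -754.0, 5.0),
    "it_change": within_tolerance(changes["it_change"], 748.6, 5.0),
    "net_change": within_tolerance(changes["net_change"], 5.4, 20.0),
}
```

The net figure is the small difference between two much larger numbers, which is why the scenario gives it a wider 20% tolerance.
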

diff --git a/evals/README.md b/evals/README.md
index 9acd5fd..5aa500e 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -59,8 +59,19 @@ reference:
       tolerance_pct: 1.0
     - path: decile.relative
       tolerance_pct: 1.0
+
+# Both A and B — golden-path guidance the grader uses (and an LLM-judge will use later).
+anchor:
+  must_mention: ["personal allowance taper", "60%", "£100,000"]
+  must_not_say: ["top decile pays more tax than the 8th"]
+  ideal_explanation: |
+    Multi-line prose describing what a Vahid-quality answer would look like.
+    Used by the human grader as a reference; will become input to an
+    automated LLM-judge in v2.
 ```
 
+Where anchors come from: PolicyEngine's published research blog where one matches the scenario (Vahid's UK posts under `app/src/data/posts/articles/` in the app-v2 repo), otherwise UK tax-rule knowledge. See `SPEC.md` § "Anchors" for the per-scenario sourcing.
+
 The runner (added in a follow-up PR) reads every YAML in `scenarios/`, POSTs to a configured chat backend, and writes outputs under `runs/<timestamp>/<scenario-id>/`. See SPEC.md "Roadmap" for what's coming next.
 
 ## Decision thresholds
diff --git a/evals/SPEC.md b/evals/SPEC.md
index 9268113..36048e8 100644
--- a/evals/SPEC.md
+++ b/evals/SPEC.md
@@ -104,10 +104,10 @@ These are questions app-v2 already answers via reports. The chat must match.
 - **PE-API reference:** Generated against the live PE-API by the fixture-build step and saved to `evals/fixtures/pe_api/b1_society_wide_pa.json`.
 - **What it tests:** Baseline replication. If the chat can't match here, it can't match anywhere.
 
-**B2 — Society-wide reform, multi-parameter**
-- **Question:** "UK economy-wide 2025: reform raises personal allowance to £15,000 *and* raises higher-rate threshold to £55,000. Report budget impact, decile impacts, poverty."
-- **PE-API reference:** to be generated.
-- **What it tests:** Composition. Multi-parameter reforms are harder for the LLM to apply correctly — single-parameter tests don't catch the composition bug.
+**B2 — Stacked NI + IT + threshold-freeze reform (Reeves 2025 pre-Budget package)**
+- **Question:** Three changes stacked — extend the IT threshold freeze to 2029-30, reduce NI main rate 8%→6%, increase IT basic 20%→22% and higher 40%→42%. Report combined and per-layer revenue, per-reform decile impacts, and the example household (£60k earner + £10k pension) figures.
+- **Reference:** PolicyEngine's published analysis by Vahid Ahmadi (Nov 2025), `app/src/data/posts/articles/uk-income-tax-ni-reforms-2025.md`, gives canonical per-layer figures (£11.7bn NI cut, £18.6bn IT increase, £6.9bn combined in 2026-27; £3.5bn from the freeze extension in 2028-29) and per-decile percentages.
+- **What it tests:** Multi-parameter reform with stacking. Does the chat understand and apply the stacking methodology correctly, and does it reproduce PolicyEngine's *own published* numbers?
 
 **B3 — Household calc (no microdata needed)**
 - **Question:** "Single adult, age 35, employment income £45,000 in UK 2025, no dependents, England. Compute: net income, income tax, employee NI, marginal tax rate at this income point."
@@ -119,10 +119,37 @@ These are questions app-v2 already answers via reports. The chat must match.
 - **Ground truth:** Generated by directly calling `policyengine_uk` at fixture-build time. No microdata or PE-API involved.
 - **What it tests:** Schedule lookup against rule-driven ground truth. Acts as a sanity check on the test infrastructure — if this scenario fails, the runner or extractor is broken before we draw conclusions about anything else.
 
-**B5 — Regional reform**
-- **Question:** "Run a Scotland-only comparison for 2025: reform raises the Scottish starter-rate threshold by £1,000. Report Scotland-specific budgetary impact and decile changes."
-- **PE-API reference:** to be generated via `/uk/economy/...?region=scotland`.
-- **What it tests:** Regional handling, which the chat hasn't been stress-tested on. Vercel/Modal preview deployments showed `region` flowing through, but we haven't validated the chat picks the right slice.
+**B5 — Remove the two-child benefit limit (Autumn Budget 2025)**
+- **Question:** UK 2026-27 economy-wide — remove the two-child limit on UC and CTC. Report cost, decile impacts, child-poverty change, Gini change, illustrative household.
+- **Reference:** PolicyEngine's published analysis by Vahid Ahmadi (Oct 2025) — `app/src/data/posts/articles/uk-two-child-limit.md` — gives £2.9bn cost in 2026-27, -13.5% absolute child poverty BHC, -0.55% Gini, D2 sees the largest relative gain.
+- **What it tests:** Benefit-side reform (UC/CTC), not tax — different model surface from the other B scenarios. Tests reproduction of canonical PolicyEngine numbers on a recent published analysis.
+
+## Anchors — golden-path guidance per scenario
+
+Every scenario YAML has an `anchor` block that captures what an *ideal* response would look like, drawn either from PolicyEngine's own published research blog (where a matching post exists) or from UK tax-rule knowledge. The anchor has three parts:
+
+- **`must_mention`** — phrases or facts a good answer must include (regex-matchable).
+- **`must_not_say`** — claims that would be wrong or misleading (also regex-matchable).
+- **`ideal_explanation`** / **`ideal_finding`** — a prose sketch of what a Vahid-quality answer would look like, for human-grader reference.
+
+In v1, anchors are **grader aids** — the human grader reads them before scoring so the rubric is calibrated to PolicyEngine's house standard. In v2 they become *inputs* to an LLM-judge that scores automatically.
+
+Where anchors come from:
+
+| Scenario | Anchor source |
+|---|---|
+| A1 — Mechanism | UK PA-taper rule + `uk-income-tax-ni-reforms-2025.md` discussion of the same dynamic |
+| A2 — Subset slice | UK rule knowledge + reasoning about how PA changes affect single parents |
+| A3 — Multi-param what-if | `uk-income-tax-ni-reforms-2025.md` stacking methodology + standard progressivity definitions |
+| A4 — Out-of-scope | PolicyEngine scope (microsim, no GE/macro) |
+| A5 — Historical lookup | UK PA parameter history |
+| B1 — Society-wide PA | PE-API output (the fixture) + UK reporting norms |
+| B2 — Stacked NI/IT/freeze | `uk-income-tax-ni-reforms-2025.md` published numbers |
+| B3 — Household calc | UK tax rules at the £45k income point |
+| B4 — MTR schedule | UK tax rule schedule (deterministic) |
+| B5 — Two-child limit | `uk-two-child-limit.md` published numbers |
+
+If we add scenarios later, the strong preference is to **anchor each one against a published PolicyEngine post** when there's one that matches. This turns the eval into "does the chat reproduce PolicyEngine's published analyses?", a much stronger framing than "does the chat match a one-off API call." It also makes the eval defensible externally: every reference number has a paper trail.
 
 ## Proposed solutions & tradeoffs
 
diff --git a/evals/scenarios/a1_mechanism.yaml b/evals/scenarios/a1_mechanism.yaml
index 9e155f5..4fd46e2 100644
--- a/evals/scenarios/a1_mechanism.yaml
+++ b/evals/scenarios/a1_mechanism.yaml
@@ -31,3 +31,29 @@ rubric:
   reasonableness: "1-5 — the explanation matches UK tax rules; sign and direction are correct"
   consistency_with_report: "1-5 — any numbers cited match the figures in scenario_context"
   honesty: "1-5 — if the chat doesn't know, it says so cleanly; doesn't fabricate magnitudes"
+
+anchor:
+  # The mechanism question has a known correct answer in UK tax rules — drawn
+  # both from rule knowledge and from how Vahid's "uk-income-tax-ni-reforms-2025"
+  # post discusses the PA-taper interaction.
+  must_mention:
+    - "personal allowance taper"     # or equivalent: "PA withdrawal", "tapering"
+    - "£100,000"                     # or "100k" — the threshold where the taper starts
+    - "60%"                          # the effective MTR band created by the taper
+  must_not_say:
+    - "the top decile pays more tax than the 8th"  # opposite of the question
+    - "the reform takes income from the top decile"  # PA raise gives them less benefit, doesn't take income
+  ideal_explanation: |
+    Two mechanisms in play. First, between £100,000 and roughly £125,140, the
+    personal allowance is withdrawn at the rate of £1 for every £2 of income
+    above £100k. This creates an effective 60% marginal tax band: 40% higher-rate
+    income tax PLUS 20% from losing £0.50 of allowance per extra £1 (taxed at 40%). So
+    high earners who span that range gain *less* from a PA raise than households
+    whose taxable income sits below £100k.
+
+    Second, the absolute £-gain from the PA raise is the same for any taxpayer
+    in the higher band (the £2,430 PA increase × their marginal rate). But because the
+    top decile has much larger total household income, that fixed £-gain is a
+    smaller *percentage* of their income than it is for someone in the 8th or
+    9th decile. So both the PA taper and the percentage-of-income arithmetic
+    push the top decile's relative gain below the 8th and 9th.
diff --git a/evals/scenarios/a2_subset_slice.yaml b/evals/scenarios/a2_subset_slice.yaml
index e96ea7a..c117eca 100644
--- a/evals/scenarios/a2_subset_slice.yaml
+++ b/evals/scenarios/a2_subset_slice.yaml
@@ -26,3 +26,27 @@ rubric:
   reasonableness: "1-5 — the per-decile gains are positive, monotone-ish in income, and plausible vs the whole-UK figures (probably similar magnitudes since the reform is a tax change, not benefit)"
   consistency_with_report: "1-5 — references to the whole-UK figures match scenario_context exactly"
   honesty: "1-5 — if the chat runs out of tool budget or can't filter cleanly, it says so rather than fabricating; explicitly flags low-N decile cells"
+
+anchor:
+  must_mention:
+    - "single parent"             # or "lone parent"
+    - "two children"              # or "2 children"
+    - "decile"                    # since the question asks for a decile breakdown
+    - "Enhanced FRS"              # or "EFRS 2023-24" — should name the dataset
+  must_not_say:
+    - "this matches the whole-UK figures"      # subset should differ from aggregate
+    - "single parents see no change"           # they're earners too; PA raise affects them
+  ideal_explanation: |
+    Method should be: load EFRS 2023-24 via Microsimulation, apply the PA reform,
+    then filter the benunit-level frame to single-parent-with-two-children
+    families (count_adults == 1, count_children == 2 within the benunit, or
+    equivalent variable). Compute per-decile average household income change in
+    £ for that subset.
+
+    Expected pattern: single-parent families are over-represented in lower
+    deciles. Per-decile *absolute* £-gains from the PA raise are smaller for
+    these families than for whole-UK averages because their earnings are
+    typically lower, so fewer of them have income above the new £15k
+    threshold. The lowest deciles may see zero or near-zero gain (no income tax
+    to begin with). Middle deciles see modest gains. Should report low-N
+    caveats if any decile cell has <30 sampled households.
diff --git a/evals/scenarios/a3_multiparam_what_if.yaml b/evals/scenarios/a3_multiparam_what_if.yaml
index fc377c6..11c85c7 100644
--- a/evals/scenarios/a3_multiparam_what_if.yaml
+++ b/evals/scenarios/a3_multiparam_what_if.yaml
@@ -26,3 +26,31 @@ rubric:
   reasonableness: "1-5 — combined budgetary cost is larger than PA-only cost; HRT raise concentrates gains in upper-middle deciles, so combined is *less* progressive than PA-only by most measures; chat reaches a defensible conclusion"
   consistency_with_report: "1-5 — PA-only numbers cited match the scenario_context figures"
   honesty: "1-5 — chat caveats that 'progressive' has multiple definitions, OR explicitly picks one and sticks with it; doesn't claim a stronger conclusion than the numbers support"
+
+anchor:
+  # Multi-param reforms with stacked impacts are exactly Vahid's
+  # uk-income-tax-ni-reforms-2025 territory; chat should reason similarly.
+  must_mention:
+    - "higher-rate threshold"     # the new parameter being changed
+    - "£50,270"                   # the current value, often quoted
+    - "decile"                    # since "progressive" requires decile-by-decile thinking
+  must_not_say:
+    - "raising the higher-rate threshold helps the bottom decile"  # opposite
+    - "the two reforms cancel out"                                 # they don't — stacked cost
+  ideal_explanation: |
+    Methodology: run two simulations — (a) PA raise alone (already in
+    scenario_context), (b) PA raise + HRT raise. Compare distributional shapes.
+
+    Expected pattern: a higher-rate threshold raise pushes more income out of
+    the 40% band into the 20% band, so the gain accrues *only* to households
+    with someone earning above £50,270, concentrated in deciles 7-9. The bottom
+    deciles see no additional gain on top of the PA raise (they don't pay
+    higher-rate tax anyway). Combined reform is therefore *less* progressive
+    than PA-raise alone by most reasonable definitions:
+    - Bottom-vs-top relative gain ratio: smaller for combined than PA-only.
+    - Gini reduction: smaller for combined than PA-only.
+    - Budgetary cost: larger for combined.
+
+    Caveat to call out: "progressive" can mean (a) D1-vs-D10 relative gain,
+    (b) Gini change, or (c) absolute £-equality. The chat should name which
+    definition it's using.
diff --git a/evals/scenarios/a4_out_of_scope.yaml b/evals/scenarios/a4_out_of_scope.yaml
index 4a2c6fa..229ac2c 100644
--- a/evals/scenarios/a4_out_of_scope.yaml
+++ b/evals/scenarios/a4_out_of_scope.yaml
@@ -24,3 +24,25 @@ rubric:
   reasonableness: "1-5 — chat reaches the right verdict (out of scope) without engineering a fake answer; if it does offer adjacent simulation outputs (e.g. 'here's the household-income channel'), those are correct"
   consistency_with_report: "1-5 — if any report figures are referenced, they match scenario_context"
   honesty: "1-5 — chat REFUSES cleanly. This is the trust-killer dimension for this scenario. A confident inflation forecast scores 1; an explicit 'I can't model this' scores 5."
+
+anchor:
+  must_mention:
+    - "PolicyEngine doesn't model"   # or "out of scope" or "not part of the model"
+    - "macroeconomic"                # or "macro" — name the kind of modelling needed
+  must_not_say:
+    - "I forecast inflation"         # any number with a confident inflation prediction
+    - "inflation will rise by"       # ditto
+    - "inflation will fall by"       # ditto
+    - "the CPI impact is"            # claims a quantified macro outcome
+  ideal_explanation: |
+    The chat should explain that PolicyEngine is a microsimulation model — it
+    computes per-household tax/benefit changes under fixed prices and behaviours.
+    It does not model:
+      - General-equilibrium macroeconomic effects (aggregate demand, inflation,
+        GDP growth, monetary-policy response).
+      - Most behavioural elasticities (labour supply has some basic modelling,
+        but consumer-price or investment responses are not in scope).
+    A clean refusal cites both that the question is out of scope AND why
+    (microsim vs macro). Bonus: chat may offer adjacent things it *can* compute —
+    e.g. household-income channel, fiscal cost — without pretending those answer
+    the inflation question.
diff --git a/evals/scenarios/a5_factual_lookup.yaml b/evals/scenarios/a5_factual_lookup.yaml
index b5ffb0c..4e53091 100644
--- a/evals/scenarios/a5_factual_lookup.yaml
+++ b/evals/scenarios/a5_factual_lookup.yaml
@@ -24,3 +24,30 @@ rubric:
   reasonableness: "1-5 — the values are correct historical PA amounts. 2010 ≈ £6,475; 2014 ≈ £10,000; 2016 ≈ £11,000; 2019 ≈ £12,500; 2021-2027 frozen at £12,570. Off-by-tax-year shifts are forgivable; wildly wrong values are not"
   consistency_with_report: "1-5 — if the current value (£12,570) is in the series, it matches the report's baseline"
   honesty: "1-5 — if the underlying model doesn't go back 15 years, chat says so and gives the range it has, rather than fabricating early-2010s values"
+
+anchor:
+  must_mention:
+    - "gov.hmrc.income_tax.allowances.personal_allowance.amount"  # the parameter path
+    - "frozen"                                                    # or "freeze" — explains the 2021-onwards plateau
+  must_not_say:
+    - "I'll run a simulation"          # this is a lookup, not a sim
+    - "the personal allowance is £0"   # never zero in any of the relevant years
+  ideal_explanation: |
+    The chat should look up the historical values from the model's parameter
+    tree rather than running any simulation. Expected values (UK tax years):
+      2010-11: £6,475
+      2011-12: £7,475
+      2012-13: £8,105
+      2013-14: £9,440
+      2014-15: £10,000
+      2015-16: £10,600
+      2016-17: £11,000
+      2017-18: £11,500
+      2018-19: £11,850
+      2019-20: £12,500
+      2020-21: £12,500
+      2021-22: £12,570
+      2022-onwards: £12,570 (frozen — extended freeze through 2027-28 in current law)
+    Should mention the post-2021 freeze explicitly. If the model's parameter
+    history doesn't go back to 2010, chat should give the range it does have
+    and say so.
diff --git a/evals/scenarios/b1_society_wide_pa.yaml b/evals/scenarios/b1_society_wide_pa.yaml
index 5148b4c..9f570cf 100644
--- a/evals/scenarios/b1_society_wide_pa.yaml
+++ b/evals/scenarios/b1_society_wide_pa.yaml
@@ -43,3 +43,20 @@ reference:
       tolerance_pct: 1.0
     - path: poverty.poverty.senior
       tolerance_pct: 1.0
+
+anchor:
+  # No PolicyEngine blog post analyses this exact reform; anchor is derived from
+  # PE-API's own output (the fixture) plus general PolicyEngine reporting norms.
+  must_mention:
+    - "Enhanced FRS"              # the dataset
+    - "2025"                      # the year
+    - "BHC"                       # which poverty measure (since the question asked for BHC)
+  must_not_say:
+    - "the reform pays for itself"   # PA raise is a cost, not revenue-raising
+    - "everyone gains"               # poorest decile gains very little since they pay little income tax
+  ideal_finding: |
+    The reform is a tax cut on the order of £20bn in 2025-26, with
+    relative gains rising through the middle deciles and falling off in the top
+    decile because the PA taper above £100k removes the benefit for very high
+    earners. BHC poverty falls modestly because the gains for low-income
+    taxpaying households push some across the relative-poverty line.
diff --git a/evals/scenarios/b2_ni_it_stacked.yaml b/evals/scenarios/b2_ni_it_stacked.yaml
new file mode 100644
index 0000000..55261a9
--- /dev/null
+++ b/evals/scenarios/b2_ni_it_stacked.yaml
@@ -0,0 +1,77 @@
+id: b2_ni_it_stacked
+test: B
+title: "Stacked NI + income tax + threshold freeze (Reeves 2025 pre-Budget reform package)"
+what_it_tests: "Multi-parameter reform with reform stacking: the LLM must apply three changes in the right order and produce the same per-layer numbers PolicyEngine published. Reference is a published PolicyEngine analysis (Vahid Ahmadi, Nov 2025), not a made-up fixture."
+
+source_post: "uk-income-tax-ni-reforms-2025.md"
+source_post_url: "https://policyengine.org/uk/research/uk-income-tax-ni-reforms-2025"
+
+prompt: |
+  Run a UK 2026-27 economy-wide comparison. Baseline: current law. Reform applies THREE changes, stacked in this order (so each layer's reported impact is on top of the previous):
+    1. Extend the income tax threshold freeze to 2029-30 (current law freezes through 2027-28).
+    2. Reduce the National Insurance main rate from 8% to 6% (on earnings between £12,570 and £50,270).
+    3. Increase basic income tax rate from 20% to 22% and higher rate from 40% to 42%.
+
+  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+
+  Report from the model run:
+  (a) Combined budgetary impact in £ for 2026-27 (all three reforms applied).
+  (b) The stacked per-layer revenue contribution in £: the freeze layer for 2028-29 (when it first takes effect), then for 2026-27 the NI-cut layer (on top of freeze) and the IT-increase layer (on top of freeze + NI).
+  (c) Relative income change in % by decile, for each of the three reforms applied individually (i.e. each one against current-law baseline).
+  (d) For an illustrative single earner with gross income £60,000 and £10,000 workplace pension contribution: the change in NI, change in income tax, and change in net household income under the combined NI-cut + IT-increase reforms (rates only — ignore the freeze for this household example, since the freeze doesn't bite until 2028-29).
+
+  British English. Every figure from the model run.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b2_ni_it_stacked.json
+  fields_to_compare:
+    # All from Vahid Ahmadi, "Analysing alternative income tax and National Insurance reform options for the UK", Nov 2025.
+    - path: combined.budgetary_impact_2026_27
+      expected_approx: 6_900_000_000        # £6.9bn combined in 2026-27
+      tolerance_pct: 5.0
+    - path: layers.freeze_extension.budgetary_impact_2028_29
+      expected_approx: 3_500_000_000        # £3.5bn freeze layer, 2028-29
+      tolerance_pct: 5.0
+    - path: layers.ni_cut.budgetary_impact_2026_27
+      expected_approx: -11_700_000_000      # NI cut costs £11.7bn on top of freeze
+      tolerance_pct: 5.0
+    - path: layers.it_increase.budgetary_impact_2026_27
+      expected_approx: 18_600_000_000       # IT increase raises £18.6bn on top
+      tolerance_pct: 5.0
+    - path: example_household.net_change
+      expected_approx: 5.4                  # £60k earner + £10k pension: +£5.40 net
+      tolerance_pct: 20.0                   # wider — one household, small absolute number, easy to drift
+    - path: example_household.ni_change
+      expected_approx: -754.0               # -£754 NI cut for the example household
+      tolerance_pct: 5.0
+    - path: example_household.it_change
+      expected_approx: 748.6                # +£748.60 IT rise for the example household
+      tolerance_pct: 5.0
+    - path: decile.relative.ni_cut
+      # Per Vahid's Figure 3 (2026-27): D1→0.1, D2→0.2, D3→0.3, D4→0.5, D5→0.7,
+      # D6→0.8, D7→1.0, D8→1.1, D9→1.1, D10→0.6 (percentage gains)
+      tolerance_pct: 15.0
+    - path: decile.relative.it_increase
+      # Per Vahid's Figure 4 (2026-27): D1→-0.2, D2→-0.4, ..., D8→-1.5, D9→-1.6, D10→-1.5
+      tolerance_pct: 15.0
+
+anchor:
+  must_mention:
+    - "stacking"                  # methodology — reforms applied on top of each other
+    - "fiscal drag"               # the mechanism by which the freeze raises revenue
+    - "Upper Earnings Limit"      # £50,270 — why NI cut doesn't help top earners much
+    - "pension contributions"     # why the example household's net change is small
+  must_not_say:
+    - "all households gain"       # untrue under combined reform
+    - "all households lose"       # untrue too
+    - "the reforms are independent"  # they're stacked, with interactions
+  ideal_finding: |
+    The NI cut and IT increase partially offset for typical workers — at £60k with
+    a £10k pension contribution, the worker is only marginally better off (+£5).
+    Pensioners and landlords pay only the IT rise (no NI), so they straightforwardly
+    lose. The freeze raises revenue through fiscal drag — nominal incomes rise but
+    thresholds stay fixed, pulling more income into higher tax bands.
diff --git a/evals/scenarios/b2_society_wide_pa_plus_hrt.yaml b/evals/scenarios/b2_society_wide_pa_plus_hrt.yaml
deleted file mode 100644
index 6aff475..0000000
--- a/evals/scenarios/b2_society_wide_pa_plus_hrt.yaml
+++ /dev/null
@@ -1,46 +0,0 @@
-id: b2_society_wide_pa_plus_hrt
-test: B
-title: "Society-wide multi-parameter reform"
-what_it_tests: "Composition. Single-parameter reform tests can pass while multi-parameter ones fail — the LLM applies the first change correctly but botches the combination. This catches that bug class."
-
-prompt: |
-  Run an economy-wide comparison for the UK in 2025. Baseline: current law. Reform applies TWO changes simultaneously:
-    (a) raise the income tax personal allowance from £12,570 to £15,000
-    (b) raise the higher-rate threshold from £50,270 to £55,000
-
-  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5. Use Microsimulation, not Simulation.
-
-  Report precisely from the model run:
-  (1) total budgetary impact in £
-  (2) income tax revenue change in £
-  (3) average household income change in £ for each decile (1-10)
-  (4) relative income change in % for each decile (1-10)
-  (5) BHC poverty rate baseline vs reform, for: all, child, working-age, senior
-
-  British English. Every figure from the model run, not memory.
-
-chat_settings:
-  model_backend: uk_python
-  num_runs: 3
-
-reference:
-  fixture: pe_api/b2_society_wide_pa_plus_hrt.json
-  fields_to_compare:
-    - path: budget.budgetary_impact
-      tolerance_pct: 1.0
-    - path: budget.tax_revenue_impact
-      tolerance_pct: 1.0
-    - path: decile.average
-      tolerance_pct: 1.0
-    - path: decile.relative
-      tolerance_pct: 1.0
-    - path: poverty.poverty.all.baseline
-      tolerance_pct: 1.0
-    - path: poverty.poverty.all.reform
-      tolerance_pct: 1.0
-    - path: poverty.poverty.child
-      tolerance_pct: 1.0
-    - path: poverty.poverty.adult
-      tolerance_pct: 1.0
-    - path: poverty.poverty.senior
-      tolerance_pct: 1.0
diff --git a/evals/scenarios/b3_household_calc.yaml b/evals/scenarios/b3_household_calc.yaml
index e341854..a7f56e9 100644
--- a/evals/scenarios/b3_household_calc.yaml
+++ b/evals/scenarios/b3_household_calc.yaml
@@ -29,3 +29,19 @@ reference:
       tolerance_pct: 0.5
     - path: result.marginal_tax_rate
       tolerance_pct: 0.5
+
+anchor:
+  # Anchor derived from UK tax rule knowledge — no blog post needed since this is
+  # a deterministic single-household calculation in the basic-rate band.
+  must_mention:
+    - "personal allowance"        # £12,570 is the threshold to flag
+    - "basic rate"                # £45,000 is in the basic-rate band
+    - "8%"                        # employee NI rate at this income
+  must_not_say:
+    - "higher rate"               # £45k is below the higher-rate threshold of £50,270
+    - "no income tax"             # £45k is well above the PA
+  ideal_finding: |
+    At £45,000 gross, the worker pays 20% income tax on £32,430 (£45,000 −
+    £12,570 personal allowance) = £6,486, and 8% NI on the same band = £2,594
+    (approximately). Combined marginal tax rate on an extra £1 of earnings is
+    28% (20% income tax + 8% NI). Net income ≈ £35,920.
diff --git a/evals/scenarios/b4_mtr_schedule.yaml b/evals/scenarios/b4_mtr_schedule.yaml
index efd7318..502e4d4 100644
--- a/evals/scenarios/b4_mtr_schedule.yaml
+++ b/evals/scenarios/b4_mtr_schedule.yaml
@@ -20,3 +20,19 @@ reference:
     - path: rows
       tolerance_pct: 1.0
       # Each row: { gross, it_mtr, ni_mtr, combined_mtr }. Compare element-wise.
+
+anchor:
+  # Pure rule-driven schedule — anchor is the expected schedule values themselves.
+  must_mention:
+    - "personal allowance"         # below £12,570 = 0% MTR
+    - "60%"                        # the PA-taper band £100k-£125k
+    - "Upper Earnings Limit"       # £50,270, where NI drops from 8% to 2%
+  must_not_say:
+    - "MTR is constant"            # the schedule has steps; chat should describe them
+  ideal_finding: |
+    Combined IT+NI MTR schedule for a single adult, no other income, 2025-26:
+    £10k → 0% (below PA and NI primary threshold).
+    £20k–£50k → 28% (20% basic IT + 8% main NI).
+    £75k → 42% (40% higher-rate IT + 2% NI above UEL).
+    £100k–£125k → 62% (40% higher IT + 20% effective from PA taper + 2% NI).
+    £150k → 47% (45% additional-rate IT + 2% NI, PA fully tapered out).
diff --git a/evals/scenarios/b5_regional_reform.yaml b/evals/scenarios/b5_regional_reform.yaml
deleted file mode 100644
index 5bb5e18..0000000
--- a/evals/scenarios/b5_regional_reform.yaml
+++ /dev/null
@@ -1,33 +0,0 @@
-id: b5_regional_reform
-test: B
-title: "Regional (Scotland-only) reform"
-what_it_tests: "Region handling, which the chat hasn't been stress-tested on. Scotland has its own income tax bands (separate parameter tree from rest-of-UK). The chat needs to find and apply the right parameter, not the UK-wide one."
-
-prompt: |
-  Run a Scotland-only economy-wide comparison for 2025. Baseline: current law. Reform: raise the Scottish starter-rate threshold (currently £14,876) by £1,000 to £15,876.
-
-  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5. Use Microsimulation. Filter to Scotland-resident households only.
-
-  Report from the model run:
-  (1) Scotland-specific budgetary impact in £ (only the Scottish exchequer effect, not rest-of-UK)
-  (2) average household income change in £ by decile (1-10), Scotland-only deciles
-  (3) BHC poverty rate baseline vs reform, all-Scotland
-
-  British English. Every figure from the model run.
-
-chat_settings:
-  model_backend: uk_python
-  num_runs: 3
-
-reference:
-  fixture: pe_api/b5_regional_reform.json
-  fields_to_compare:
-    - path: budget.budgetary_impact
-      tolerance_pct: 2.0
-      # Slightly wider tolerance — region-filtering choices (residence vs
-      # employment vs taxable-income basis) can legitimately introduce 1-2%
-      # variance even with the same engine.
-    - path: decile.average
-      tolerance_pct: 2.0
-    - path: poverty.poverty.all
-      tolerance_pct: 1.0
diff --git a/evals/scenarios/b5_two_child_limit.yaml b/evals/scenarios/b5_two_child_limit.yaml
new file mode 100644
index 0000000..56f2367
--- /dev/null
+++ b/evals/scenarios/b5_two_child_limit.yaml
@@ -0,0 +1,62 @@
+id: b5_two_child_limit
+test: B
+title: "Remove the two-child benefit limit (Autumn Budget 2025)"
+what_it_tests: "Benefit-side reform (UC and CTC), not tax. Tests a different model surface from the previous B scenarios and a different kind of distributional shape — large gains concentrated in the lower deciles. Reference is PolicyEngine's published analysis (Vahid Ahmadi, Oct 2025) — directly comparing chat output to canonical PolicyEngine analysis."
+
+source_post: "uk-two-child-limit.md"
+source_post_url: "https://policyengine.org/uk/research/uk-two-child-limit"
+
+prompt: |
+  Run a UK 2026-27 economy-wide comparison. Baseline: current law (the two-child limit on Universal Credit and Child Tax Credit is in place). Reform: remove the two-child limit, so families receive child elements for every eligible child regardless of how many.
+
+  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
+
+  Report from the model run:
+  (a) Government spending impact (benefit cost) in £ for 2026-27.
+  (b) Relative income change in % by decile (1-10) for 2026-27.
+  (c) Change in absolute child poverty BHC in % terms (e.g. -X.X%, signed).
+  (d) Change in the Gini coefficient (signed).
+  (e) For an illustrative household — two parents, three children aged 3, 5, 7, both earning at the median — the change in household net income in 2026-27.
+
+  British English. Every figure from the model run.
+
+chat_settings:
+  model_backend: uk_python
+  num_runs: 3
+
+reference:
+  fixture: pe_api/b5_two_child_limit.json
+  fields_to_compare:
+    # All from Vahid Ahmadi, "How removing the two-child benefit limit would affect the UK", Oct 2025.
+    # PolicyEngine reform ID 93219 against baseline 1, 2026-27.
+    - path: budget.cost_2026_27
+      expected_approx: 2_900_000_000        # £2.9bn in 2026-27 per PolicyEngine
+      tolerance_pct: 5.0
+    - path: decile.relative.d2
+      expected_approx: 2.59                 # D2 sees the largest gain — 2.59% by 2029-30 (use 2026-27 fixture value once generated)
+      tolerance_pct: 25.0                   # decile-2 figure is for 2029-30; 2026-27 will be smaller
+    - path: poverty.absolute_child_bhc.relative_change
+      expected_approx: -13.5                # -13.5% absolute child poverty BHC
+      tolerance_pct: 10.0
+    - path: inequality.gini.relative_change
+      expected_approx: -0.55                # Gini falls 0.55% in 2026-27
+      tolerance_pct: 15.0
+
+anchor:
+  must_mention:
+    - "Universal Credit"          # the primary benefit affected
+    - "Child Tax Credit"          # legacy benefit also affected
+    - "transitional protection"   # why caseload grows over time
+    - "lower deciles"             # where the gains concentrate
+  must_not_say:
+    - "all families gain"         # only ~4.4% of population gains anything; the rest are unaffected
+    - "the reform raises revenue" # it's a cost, not a revenue raiser
+    - "the largest gains are in higher deciles"  # opposite is true
+  ideal_finding: |
+    Removing the two-child limit costs roughly £2.9bn in 2026-27, rising as more
+    post-2017 children age into the affected population. About 4.4% of the
+    population gains, heavily concentrated in the bottom three deciles (D2 sees
+    the largest relative gain). Absolute child poverty BHC falls by ~13.5% in
+    2026-27. The Gini coefficient falls by ~0.55% — modest overall inequality
+    reduction reflecting how concentrated the gains are. The reform interacts
+    with UC tapers, so as employment income rises the marginal gain falls.

From 5fac830969a72675299b1d72ca1bce94b14dd9b4 Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 20 May 2026 05:53:06 -0700
Subject: [PATCH 03/11] Add eval runner

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 evals/SPEC.md                 |   9 +-
 evals/runner/README.md        |  62 ++++++
 evals/runner/requirements.txt |   2 +
 evals/runner/run.py           | 349 ++++++++++++++++++++++++++++++++++
 4 files changed, 417 insertions(+), 5 deletions(-)
 create mode 100644 evals/runner/README.md
 create mode 100644 evals/runner/requirements.txt
 create mode 100644 evals/runner/run.py
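
Reviewer note: a minimal sketch of how the `run-N.txt` artifact relates to the `run-N.sse` artifact, assuming the backend emits `data: {json}` lines whose `chunk` events carry a `content` field, which is what `parse_sse` and `summarise_events` in this patch expect. The name `answer_from_sse` and the sample stream are illustrative only and are not part of the patch.

```python
import json


def answer_from_sse(sse_text: str) -> str:
    """Join the `content` of every `chunk` event in an SSE transcript."""
    parts = []
    for line in sse_text.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separators and any non-data SSE lines
        try:
            event = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # non-JSON payloads are dropped, as in run.py
        if isinstance(event, dict) and event.get("type") == "chunk":
            parts.append(event.get("content", ""))
    return "".join(parts)


# Hypothetical stream: two chunks, one non-JSON line, and a done event.
sample = (
    'data: {"type": "chunk", "content": "Net income "}\n\n'
    'data: ping\n\n'
    'data: {"type": "chunk", "content": "is £35,920."}\n\n'
    'data: {"type": "done"}\n'
)
print(answer_from_sse(sample))
```

Only `chunk` events contribute to the text file; tool calls, errors, and the `done` event are counted in the meta JSON instead.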

diff --git a/evals/SPEC.md b/evals/SPEC.md
index 36048e8..76f50a8 100644
--- a/evals/SPEC.md
+++ b/evals/SPEC.md
@@ -176,11 +176,10 @@ If we add scenarios later, the strong preference is to **anchor each one against
 
 ## Roadmap
 
-This PR lands the **spec, directory scaffold, and the 10 scenario files** only. Subsequent PRs:
+This PR lands the spec, the 10 anchored scenarios, and the runner. Still to come (in this same PR):
 
-1. **Runner.** A small Python script under `evals/runner/` that loads scenarios, POSTs each prompt N times to a configured chat backend URL, saves raw SSE logs under `evals/runs/<timestamp>/`.
-2. **B fixtures.** Generate reference outputs for B1-B5 into `evals/fixtures/pe_api/` — PE-API calls for the society-wide / regional scenarios, direct `policyengine_uk` calls for the household / MTR scenarios.
-3. **Grader.** For Test A: a markdown grading sheet (one row per response) with the rubric columns. Manual fill for v1. For Test B: an extractor that pulls numerics from SSE responses and diffs against fixtures.
-4. **Findings writeup.** Once 30 conversations are graded, a `RESULTS-YYYY-MM-DD.md` per-run report. Per-scenario verdicts, threshold check results, recommended positioning, new failure modes if any.
+1. **B fixtures.** Generate reference outputs for B1-B5 into `evals/fixtures/pe_api/` — PE-API calls for the society-wide scenarios, direct `policyengine_uk` calls for the household / MTR scenarios.
+2. **Grader.** For Test A: a markdown grading sheet (one row per response) with the rubric and anchor columns. Manual fill for v1. For Test B: an extractor that pulls numerics from SSE responses and diffs against fixtures.
+3. **Findings writeup.** Once 30 conversations are graded, a `RESULTS-YYYY-MM-DD.md` report. Per-scenario verdicts, threshold check results, recommended positioning, new failure modes if any.
 
 The chat backend URL the runner targets is configurable via env var so the eval can run against either a preview deploy or production.
diff --git a/evals/runner/README.md b/evals/runner/README.md
new file mode 100644
index 0000000..a4c75fc
--- /dev/null
+++ b/evals/runner/README.md
@@ -0,0 +1,62 @@
+# eval runner
+
+POSTs each scenario in `evals/scenarios/*.yaml` to a chat backend, saves raw SSE + extracted text + a meta JSON per run. No grading — that's a separate step.
+
+## Run
+
+```sh
+# All scenarios, default backend (PR 51 preview).
+python evals/runner/run.py
+
+# Just some.
+python evals/runner/run.py a1_mechanism b1_society_wide_pa
+
+# Preview what would run.
+python evals/runner/run.py --dry-run
+
+# Point at a different backend.
+python evals/runner/run.py --backend-url https://policyengine-uk-chat.vercel.app
+# or
+UK_CHAT_BACKEND_URL=https://... python evals/runner/run.py
+```
+
+## Vercel preview deployments
+
+If the backend is a Vercel preview behind deployment protection, set the bypass token from the chat project's "Protection Bypass for Automation" setting:
+
+```sh
+UK_CHAT_BYPASS_TOKEN=... python evals/runner/run.py
+```
+
+The token is appended to the request URL as `?x-vercel-protection-bypass=...`, together with `x-vercel-set-bypass-cookie=samesitenone`. The token value is redacted in the saved meta JSON so the artifact is safe to share.
+
+## Output
+
+```
+evals/runs/<timestamp>/
+  manifest.json                          # one row per (scenario, run)
+  <scenario_id>/
+    scenario.yaml                        # frozen copy of what was run
+    run-1.sse                            # raw SSE stream
+    run-1.txt                            # concatenated chunk deltas (final answer)
+    run-1.meta.json                      # event counts, timing, errors, redacted URL
+    run-2.sse / run-2.txt / run-2.meta.json
+    ...
+```
+
+`evals/runs/` is gitignored — these are artifacts of a specific run, not source.
+
+## Dependencies
+
+```sh
+python -m venv .venv && source .venv/bin/activate
+pip install -r evals/runner/requirements.txt
+```
+
+`httpx` for the streaming POST, `pyyaml` for scenario loading. Both are small third-party packages; there are no LLM frameworks and no Anthropic SDK on this side.
+
+## What the runner is not
+
+- It is **not** a grader. Test A grading is a human step; Test B has a separate extractor PR.
+- It is **not** parallel. Sequential by design for clean logs and to avoid hitting backend rate limits during long economy-wide runs. If/when we move to ~50 scenarios, add a `--parallel N` flag.
+- It does **not** mutate the scenarios or fixtures dir.
diff --git a/evals/runner/requirements.txt b/evals/runner/requirements.txt
new file mode 100644
index 0000000..21b8f47
--- /dev/null
+++ b/evals/runner/requirements.txt
@@ -0,0 +1,2 @@
+httpx>=0.27
+pyyaml>=6.0
diff --git a/evals/runner/run.py b/evals/runner/run.py
new file mode 100644
index 0000000..e71ad2e
--- /dev/null
+++ b/evals/runner/run.py
@@ -0,0 +1,349 @@
+#!/usr/bin/env python3
+"""
+Eval runner — load scenario YAMLs from evals/scenarios/, POST each one N
+times to the configured chat backend, save raw SSE + extracted text +
+summary JSON per run under evals/runs/<timestamp>/.
+
+No grading. The runner only produces the conversation logs. Grading is a
+separate step.
+
+Usage:
+    python evals/runner/run.py                  # all scenarios
+    python evals/runner/run.py a1_mechanism b1_society_wide_pa
+    python evals/runner/run.py --dry-run        # show what would run
+    python evals/runner/run.py --backend-url X  # override the chat URL
+
+Environment variables:
+    UK_CHAT_BACKEND_URL       Chat backend base URL (default: the PR 51 preview)
+    UK_CHAT_BYPASS_TOKEN      Optional Vercel "Protection Bypass for Automation"
+                              token, appended as a query param so the runner can
+                              reach a protected preview without SSO.
+"""
+
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import json
+import os
+import re
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import httpx
+import yaml
+
+
+# Paths
+EVALS_DIR = Path(__file__).resolve().parent.parent
+SCENARIOS_DIR = EVALS_DIR / "scenarios"
+RUNS_DIR = EVALS_DIR / "runs"
+
+# Defaults
+DEFAULT_BACKEND_URL = os.environ.get(
+    "UK_CHAT_BACKEND_URL",
+    # PR 51 preview backend — the only deploy with model_backend + scenario_context wired.
+    "https://policyengine--peukchat-feat-model-backend-selector-web.modal.run",
+)
+BYPASS_TOKEN = os.environ.get("UK_CHAT_BYPASS_TOKEN") or None
+REQUEST_TIMEOUT_SECONDS = 900  # economy-wide sims can take a few minutes
+
+
+# ---------------------------------------------------------------------------
+# Scenario loading
+# ---------------------------------------------------------------------------
+
+@dataclass
+class Scenario:
+    id: str
+    test: str  # "A" or "B"
+    title: str
+    prompt: str
+    model_backend: str
+    num_runs: int
+    scenario_context: str | None
+    raw: dict[str, Any]  # full YAML dict, kept for logging
+
+    @classmethod
+    def from_yaml(cls, path: Path) -> "Scenario":
+        data = yaml.safe_load(path.read_text())
+        return cls(
+            id=data["id"],
+            test=data["test"],
+            title=data["title"],
+            prompt=data["prompt"],
+            model_backend=data["chat_settings"]["model_backend"],
+            num_runs=data["chat_settings"].get("num_runs", 3),
+            scenario_context=data.get("scenario_context"),
+            raw=data,
+        )
+
+
+def load_scenarios(filter_ids: list[str] | None = None) -> list[Scenario]:
+    scenarios = []
+    for path in sorted(SCENARIOS_DIR.glob("*.yaml")):
+        scenario = Scenario.from_yaml(path)
+        if filter_ids and scenario.id not in filter_ids:
+            continue
+        scenarios.append(scenario)
+    if filter_ids:
+        loaded_ids = {s.id for s in scenarios}
+        missing = set(filter_ids) - loaded_ids
+        if missing:
+            raise SystemExit(f"Unknown scenario IDs: {sorted(missing)}")
+    return scenarios
+
+
+# ---------------------------------------------------------------------------
+# Chat backend interaction
+# ---------------------------------------------------------------------------
+
+def build_request_payload(scenario: Scenario) -> dict[str, Any]:
+    payload: dict[str, Any] = {
+        "model_backend": scenario.model_backend,
+        "messages": [{"role": "user", "content": scenario.prompt}],
+    }
+    if scenario.scenario_context:
+        payload["scenario_context"] = scenario.scenario_context
+    return payload
+
+
+def build_request_url(backend_url: str) -> str:
+    url = f"{backend_url.rstrip('/')}/chat/message"
+    if BYPASS_TOKEN:
+        sep = "&" if "?" in url else "?"
+        url = (
+            f"{url}{sep}x-vercel-protection-bypass={BYPASS_TOKEN}"
+            f"&x-vercel-set-bypass-cookie=samesitenone"
+        )
+    return url
+
+
+def parse_sse(sse_text: str) -> list[dict[str, Any]]:
+    """Parse SSE lines into a list of event dicts."""
+    events = []
+    for line in sse_text.splitlines():
+        if not line.startswith("data: "):
+            continue
+        try:
+            events.append(json.loads(line[len("data: "):]))
+        except json.JSONDecodeError:
+            # The stream occasionally carries non-JSON data lines (e.g.
+            # heartbeats); drop them rather than crashing the whole run.
+            pass
+    return events
+
+
+def summarise_events(events: list[dict[str, Any]]) -> dict[str, Any]:
+    """Pull useful aggregates out of the SSE stream."""
+    counts: dict[str, int] = {}
+    for e in events:
+        t = e.get("type", "?")
+        counts[t] = counts.get(t, 0) + 1
+
+    full_text = "".join(
+        e.get("content", "") for e in events if e.get("type") == "chunk"
+    )
+
+    # The `done` event carries usage / billing — last one wins.
+    done = next(
+        (e for e in reversed(events) if e.get("type") == "done"),
+        None,
+    )
+
+    errors = [e for e in events if e.get("type") == "error"]
+
+    return {
+        "event_counts": counts,
+        "answer_text": full_text,
+        "answer_length_chars": len(full_text),
+        "tool_call_count": counts.get("tool_use", 0),
+        "completed": done is not None,
+        "error_count": len(errors),
+        "errors": errors,
+        "done_event": done,  # contains usage, session_id, model, model_backend
+    }
+
+
+def run_single(
+    *,
+    scenario: Scenario,
+    backend_url: str,
+    run_index: int,
+    out_dir: Path,
+) -> dict[str, Any]:
+    """POST one scenario once, save SSE + extracted text + summary."""
+    url = build_request_url(backend_url)
+    payload = build_request_payload(scenario)
+
+    started = dt.datetime.now(dt.timezone.utc)
+
+    sse_text = ""
+    http_error: str | None = None
+    try:
+        with httpx.Client(timeout=REQUEST_TIMEOUT_SECONDS) as client:
+            with client.stream("POST", url, json=payload) as resp:
+                if resp.is_error:
+                    resp.read()  # buffer the body so .text works in the handler
+                resp.raise_for_status()
+                for chunk in resp.iter_text():
+                    sse_text += chunk
+    except httpx.HTTPStatusError as e:
+        body = (e.response.text or "")[:500]
+        http_error = f"HTTP {e.response.status_code}: {body}"
+    except httpx.RequestError as e:
+        http_error = f"Request error: {type(e).__name__}: {e}"
+
+    finished = dt.datetime.now(dt.timezone.utc)
+    elapsed_seconds = (finished - started).total_seconds()
+
+    # Persist artifacts
+    out_dir.mkdir(parents=True, exist_ok=True)
+    (out_dir / f"run-{run_index}.sse").write_text(sse_text)
+
+    events = parse_sse(sse_text)
+    summary = summarise_events(events)
+    (out_dir / f"run-{run_index}.txt").write_text(summary["answer_text"])
+
+    meta = {
+        "scenario_id": scenario.id,
+        "test": scenario.test,
+        "run_index": run_index,
+        "url": re.sub(r"protection-bypass=[^&]+", "protection-bypass=REDACTED", url),
+        "model_backend": scenario.model_backend,
+        "started_at": started.isoformat(),
+        "finished_at": finished.isoformat(),
+        "elapsed_seconds": round(elapsed_seconds, 1),
+        "http_error": http_error,
+        "summary": {
+            k: v
+            for k, v in summary.items()
+            # Don't duplicate the full answer text in JSON — it's already in run-N.txt.
+            if k != "answer_text"
+        },
+    }
+    (out_dir / f"run-{run_index}.meta.json").write_text(json.dumps(meta, indent=2))
+
+    return meta
+
+
+# ---------------------------------------------------------------------------
+# Orchestration
+# ---------------------------------------------------------------------------
+
+def make_run_dir() -> Path:
+    timestamp = dt.datetime.now().strftime("%Y-%m-%d_%H%M%S")
+    return RUNS_DIR / timestamp
+
+
+def run_all(
+    *,
+    scenarios: list[Scenario],
+    backend_url: str,
+    dry_run: bool,
+) -> Path | None:
+    if dry_run:
+        print(f"DRY RUN (backend: {backend_url})\n")
+        for s in scenarios:
+            print(
+                f"  {s.id:30}  test={s.test}  runs={s.num_runs}  "
+                f"backend={s.model_backend}  ctx={'yes' if s.scenario_context else 'no'}"
+            )
+        total = sum(s.num_runs for s in scenarios)
+        print(f"\nWould execute {total} requests across {len(scenarios)} scenarios.")
+        return None
+
+    run_dir = make_run_dir()
+    run_dir.mkdir(parents=True, exist_ok=True)
+
+    manifest = {
+        "started_at": dt.datetime.now(dt.timezone.utc).isoformat(),
+        "backend_url": backend_url,
+        "bypass_token_set": BYPASS_TOKEN is not None,
+        "scenarios": [s.id for s in scenarios],
+        "runs": [],
+    }
+    manifest_path = run_dir / "manifest.json"
+    manifest_path.write_text(json.dumps(manifest, indent=2))
+
+    print(f"Run output: {run_dir}\n")
+
+    for scenario in scenarios:
+        scenario_dir = run_dir / scenario.id
+        # Save a copy of the scenario YAML alongside the runs so we can
+        # tell, months later, exactly what prompt/anchor was in effect.
+        scenario_dir.mkdir(parents=True, exist_ok=True)
+        (scenario_dir / "scenario.yaml").write_text(yaml.safe_dump(scenario.raw))
+
+        print(f"=== {scenario.id} ({scenario.test}, {scenario.num_runs} runs) ===")
+        for i in range(1, scenario.num_runs + 1):
+            print(f"  run {i}/{scenario.num_runs}...", end=" ", flush=True)
+            meta = run_single(
+                scenario=scenario,
+                backend_url=backend_url,
+                run_index=i,
+                out_dir=scenario_dir,
+            )
+            elapsed = meta["elapsed_seconds"]
+            err = meta["http_error"]
+            tools = meta["summary"]["tool_call_count"]
+            chars = meta["summary"]["answer_length_chars"]
+            status = (
+                f"ERR ({err})" if err
+                else f"ok  {chars} chars, {tools} tool calls, {elapsed}s"
+            )
+            print(status)
+
+            manifest["runs"].append({
+                "scenario_id": scenario.id,
+                "run_index": i,
+                "elapsed_seconds": elapsed,
+                "http_error": err,
+                "tool_call_count": tools,
+                "answer_length_chars": chars,
+            })
+            manifest_path.write_text(json.dumps(manifest, indent=2))
+        print()
+
+    manifest["finished_at"] = dt.datetime.now(dt.timezone.utc).isoformat()
+    manifest_path.write_text(json.dumps(manifest, indent=2))
+    print(f"Done. Logs in {run_dir}")
+    return run_dir
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "scenario_ids",
+        nargs="*",
+        help="Optional list of scenario IDs to run. Empty = run all.",
+    )
+    parser.add_argument(
+        "--backend-url",
+        default=DEFAULT_BACKEND_URL,
+        help="Chat backend base URL. Defaults to UK_CHAT_BACKEND_URL env var or PR 51 preview.",
+    )
+    parser.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="Show what would run without making any requests.",
+    )
+    args = parser.parse_args()
+
+    scenarios = load_scenarios(args.scenario_ids or None)
+    if not scenarios:
+        print("No scenarios matched.", file=sys.stderr)
+        return 1
+
+    run_all(
+        scenarios=scenarios,
+        backend_url=args.backend_url,
+        dry_run=args.dry_run,
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())

From 86dec728c8718d13c390fe476755f92250dca8f3 Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 20 May 2026 07:29:13 -0700
Subject: [PATCH 04/11] Add eval grader

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 evals/runner/README.md |  51 ++-
 evals/runner/grade.py  | 793 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 841 insertions(+), 3 deletions(-)
 create mode 100644 evals/runner/grade.py
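
Reviewer note: a minimal sketch of the two numeric checks the Test B flow describes, a per-field `tolerance_pct` comparison and self-consistency measured as SD across runs relative to the mean. The helper names `within_tolerance` and `relative_sd` and the sample values are hypothetical; the actual grader may handle zero or missing values differently.

```python
from __future__ import annotations

from statistics import mean, stdev


def within_tolerance(actual: float, expected: float, tolerance_pct: float) -> bool:
    """True if `actual` is within `tolerance_pct` percent of `expected`."""
    if expected == 0:
        return actual == 0  # relative error is undefined; require an exact match
    return abs(actual - expected) / abs(expected) * 100 <= tolerance_pct


def relative_sd(values: list[float]) -> float | None:
    """Sample SD across runs as a fraction of the mean; None if undefined."""
    if len(values) < 2:
        return None  # stdev needs at least two runs
    m = mean(values)
    if m == 0:
        return None
    return stdev(values) / abs(m)


# Illustrative values only: three runs of one field against an expected 2.9bn.
runs = [2.91e9, 2.90e9, 2.92e9]
print([within_tolerance(v, 2.9e9, 5.0) for v in runs])
print(relative_sd(runs))
```

The value returned by `relative_sd` is a fraction, so it compares directly against `B_SELF_CONSISTENCY_SD_LIMIT = 0.005` in `grade.py`.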

diff --git a/evals/runner/README.md b/evals/runner/README.md
index a4c75fc..d3d0186 100644
--- a/evals/runner/README.md
+++ b/evals/runner/README.md
@@ -1,6 +1,9 @@
-# eval runner
+# eval runner + grader
 
-POSTs each scenario in `evals/scenarios/*.yaml` to a chat backend, saves raw SSE + extracted text + a meta JSON per run. No grading — that's a separate step.
+Two scripts:
+
+- `run.py` — POSTs each scenario in `evals/scenarios/*.yaml` to a chat backend, saves raw SSE + extracted text + meta JSON per run under `evals/runs/<timestamp>/`.
+- `grade.py` — reads a finished run dir. For Test A scenarios, emits a markdown grading sheet the human fills in. For Test B, runs automated numeric extraction + anchor checks against fixtures.
 
 ## Run
 
@@ -57,6 +60,48 @@ pip install -r evals/runner/requirements.txt
 
 ## What the runner is not
 
-- It is **not** a grader. Test A grading is a human step; Test B has a separate extractor PR.
 - It is **not** parallel. Sequential by design for clean logs and to avoid hitting backend rate limits during long economy-wide runs. If/when we move to ~50 scenarios, add a `--parallel N` flag.
 - It does **not** mutate the scenarios or fixtures dir.
+
+---
+
+## Grade
+
+```sh
+# Generate A_grading.md (human sheet) + B_results.md (automated diffs).
+python evals/runner/grade.py evals/runs/2026-05-15_120000
+
+# Just one path.
+python evals/runner/grade.py <run_dir> --test A
+python evals/runner/grade.py <run_dir> --test B
+
+# After A_grading.md has been filled in by a human, apply Test A thresholds.
+python evals/runner/grade.py <run_dir> --threshold-check
+```
+
+### Test A flow
+
+`grade.py --test A` walks the run dir and produces `A_grading.md`. Each A response gets a section with:
+
+- Prompt and scenario_context (collapsible)
+- The anchor (must_mention / must_not_say with substring hit/miss, plus `ideal_explanation`)
+- The chat response
+- Empty score fields for each rubric dimension
+
+The grader (you) opens the file in an editor, replaces each ⬜ with a 1-5 score, and marks the fabrication question yes/no.
+
+Then `--threshold-check` parses the filled sheet and applies the SPEC.md thresholds: mean rubric ≥ 4.0, no individual < 2 on reasonableness/honesty, fabrication rate ≤ 20%. Output goes to `threshold_check.md` and `.json`.
+
+### Test B flow
+
+Fully automated. For each B scenario:
+
+- Loads the fixture from `evals/fixtures/pe_api/`.
+- For each run, extracts numeric values from the response prose using per-field label regexes (heuristic).
+- Diffs against the fixture with per-field `tolerance_pct`.
+- Computes self-consistency (SD across runs as % of mean).
+- Runs the anchor's `must_mention` / `must_not_say` substring checks.
+
+Output: `B_results.json` (machine-readable) and `B_results.md` (human-readable per-scenario diffs + threshold verdict).
+
+The extractor is best-effort regex over prose, so some fields legitimately come back as `⏭ no expected / couldn't extract`. Those are diagnostics for the grader, not failure verdicts.
diff --git a/evals/runner/grade.py b/evals/runner/grade.py
new file mode 100644
index 0000000..4042356
--- /dev/null
+++ b/evals/runner/grade.py
@@ -0,0 +1,793 @@
+#!/usr/bin/env python3
+"""
+Grader for a finished eval run.
+
+Two paths:
+- Test A scenarios → emit a markdown grading sheet (one section per response)
+  with the prompt, the anchor (golden path), and the chat response side by
+  side. A human fills in rubric scores 1-5. Lightweight automated anchor
+  checks (must_mention / must_not_say substring scans) are pre-populated as
+  grader hints — not authoritative.
+- Test B scenarios → load the matching fixture under evals/fixtures/pe_api/,
+  extract numerics from the chat response, diff against fixture per
+  fields_to_compare with tolerance. Also runs the must_mention / must_not_say
+  anchor checks.
+
+After both paths, the script writes:
+  <run_dir>/A_grading.md           ← human grading sheet (edit in place)
+  <run_dir>/B_results.json         ← machine-readable B verdicts
+  <run_dir>/B_results.md           ← human-readable B summary
+  <run_dir>/threshold_check.md     ← regenerated after A_grading.md is filled
+
+Usage:
+    python evals/runner/grade.py <run_dir>                   # both A + B
+    python evals/runner/grade.py <run_dir> --test A          # just A sheet
+    python evals/runner/grade.py <run_dir> --test B          # just B diff
+    python evals/runner/grade.py <run_dir> --threshold-check # after grading A
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from statistics import mean, stdev
+from typing import Any
+
+import yaml
+
+
+# Paths
+EVALS_DIR = Path(__file__).resolve().parent.parent
+FIXTURES_DIR = EVALS_DIR / "fixtures" / "pe_api"
+
+# Test A rubric dimensions — must match the rubric block in scenario YAMLs.
+A_RUBRIC_DIMENSIONS = (
+    "relevance",
+    "methodology",
+    "reasonableness",
+    "consistency",
+    "honesty",
+)
+
+# Test A thresholds from SPEC.md.
+A_MEAN_THRESHOLD = 4.0
+A_TRUST_KILLER_MIN = 2   # no response < 2 on Reasonableness or Honesty
+A_TRUST_KILLERS = ("reasonableness", "honesty")
+A_FABRICATION_RATE_LIMIT = 0.2  # at most 1 in 5 responses with fabricated figures
+
+# Test B thresholds from SPEC.md.
+B_FIELD_ACCURACY_THRESHOLD = 0.95   # ≥95% of fields within tolerance
+B_SELF_CONSISTENCY_SD_LIMIT = 0.005  # SD < 0.5% of mean
+B_METHODOLOGY_DRIFT_LIMIT = 1        # ≤ 1 of 5 scenarios
+B_FAILURE_RATE_LIMIT = 0.10          # < 10% failure rate
+
+
+# ---------------------------------------------------------------------------
+# Loading & shared helpers
+# ---------------------------------------------------------------------------
+
+@dataclass
+class RunResponse:
+    scenario_id: str
+    run_index: int
+    answer_text: str
+    meta: dict[str, Any]
+    scenario: dict[str, Any]  # frozen YAML copy alongside the run
+
+
+def load_run(run_dir: Path) -> list[RunResponse]:
+    """Walk a run directory and load every response the runner produced."""
+    responses = []
+    for scenario_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
+        scenario_path = scenario_dir / "scenario.yaml"
+        if not scenario_path.exists():
+            continue
+        scenario = yaml.safe_load(scenario_path.read_text())
+        for txt_path in sorted(scenario_dir.glob("run-*.txt")):
+            # filename like run-2.txt
+            run_index = int(txt_path.stem.split("-")[1])
+            answer = txt_path.read_text()
+            meta_path = scenario_dir / f"run-{run_index}.meta.json"
+            meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
+            responses.append(RunResponse(
+                scenario_id=scenario["id"],
+                run_index=run_index,
+                answer_text=answer,
+                meta=meta,
+                scenario=scenario,
+            ))
+    return responses
+
+
+def anchor_check(answer: str, anchor: dict[str, Any] | None) -> dict[str, Any]:
+    """Run the lightweight regex checks the anchor defines.
+
+    These are grader hints, not authoritative scores. A human can override.
+    Phrases are matched as case-insensitive substrings. No other
+    normalisation is applied, so '£100,000' does not match '100k'.
+    """
+    if not anchor:
+        return {"checked": False, "must_mention": [], "must_not_say": []}
+
+    norm_answer = answer.lower()
+
+    def hit(phrase: str) -> bool:
+        return phrase.lower() in norm_answer
+
+    must_mention = [
+        {"phrase": p, "found": hit(p)}
+        for p in (anchor.get("must_mention") or [])
+    ]
+    must_not_say = [
+        {"phrase": p, "found": hit(p)}
+        for p in (anchor.get("must_not_say") or [])
+    ]
+    return {
+        "checked": True,
+        "must_mention": must_mention,
+        "must_not_say": must_not_say,
+        "must_mention_pass_rate": (
+            sum(1 for m in must_mention if m["found"]) / len(must_mention)
+            if must_mention else None
+        ),
+        "must_not_say_violations": [
+            m["phrase"] for m in must_not_say if m["found"]
+        ],
+    }
+
+
+# ---------------------------------------------------------------------------
+# Test A — grading sheet generator
+# ---------------------------------------------------------------------------
+
+def render_anchor_hints(check: dict[str, Any]) -> str:
+    """Render the regex check result as a short markdown block."""
+    if not check.get("checked"):
+        return "_(no anchor)_"
+    lines = []
+    rate = check.get("must_mention_pass_rate")
+    if rate is not None:
+        lines.append(f"`must_mention` substring matches: **{rate:.0%}**")
+        for m in check["must_mention"]:
+            mark = "✓" if m["found"] else "✗"
+            lines.append(f"  - {mark} `{m['phrase']}`")
+    violations = check.get("must_not_say_violations") or []
+    if violations:
+        lines.append(f"`must_not_say` **VIOLATIONS** ({len(violations)}):")
+        for v in violations:
+            lines.append(f"  - ✗ `{v}`")
+    elif check.get("must_not_say"):
+        lines.append("`must_not_say`: clean ✓")
+    return "\n".join(lines)
+
+
+def render_a_sheet(responses: list[RunResponse]) -> str:
+    """Emit a markdown sheet for human grading of Test A responses."""
+    a_responses = [r for r in responses if r.scenario["test"] == "A"]
+    lines = [
+        "# Test A — grading sheet",
+        "",
+        "Fill in 1-5 scores for each rubric dimension under every response.",
+        "The `must_mention` / `must_not_say` lines are grader hints from "
+        "automated substring scans — not authoritative. Use the ideal "
+        "explanation as your reference for what a Vahid-quality answer "
+        "looks like.",
+        "",
+        f"_{len(a_responses)} responses to grade._",
+        "",
+    ]
+    for r in a_responses:
+        anchor = r.scenario.get("anchor") or {}
+        check = anchor_check(r.answer_text, anchor)
+        rubric = r.scenario.get("rubric") or {}
+        ideal = anchor.get("ideal_explanation") or anchor.get("ideal_finding") or "_(none)_"
+        meta_summary = r.meta.get("summary", {})
+
+        lines.extend([
+            "---",
+            f"## {r.scenario_id} — run {r.run_index}",
+            "",
+            f"**Title:** {r.scenario['title']}",
+            f"**Tool calls:** {meta_summary.get('tool_call_count', '?')} · "
+            f"**Elapsed:** {r.meta.get('elapsed_seconds', '?')}s · "
+            f"**Errors:** {meta_summary.get('error_count', '?')}",
+            "",
+            "### Prompt (user message)",
+            "```",
+            r.scenario["prompt"].strip(),
+            "```",
+            "",
+        ])
+        if r.scenario.get("scenario_context"):
+            lines.extend([
+                "### Scenario context (system-prompt prefix)",
+                "<details><summary>show</summary>",
+                "",
+                "```",
+                r.scenario["scenario_context"].strip(),
+                "```",
+                "</details>",
+                "",
+            ])
+        lines.extend([
+            "### Golden path",
+            "<details open><summary>anchor</summary>",
+            "",
+            render_anchor_hints(check),
+            "",
+            "**Ideal explanation:**",
+            "",
+            ideal,
+            "</details>",
+            "",
+            "### Chat response",
+            f"<details open><summary>response (run-{r.run_index}.txt)</summary>",
+            "",
+            "> " + r.answer_text.replace("\n", "\n> "),
+            "</details>",
+            "",
+            "### Scores (fill in 1-5)",
+        ])
+        for dim in A_RUBRIC_DIMENSIONS:
+            criterion = rubric.get(dim, "")
+            lines.append(f"- **{dim.title()}**: ⬜  _{criterion}_")
+        lines.extend([
+            "- **Fabricated a figure not derivable from a model run?** ⬜ yes / no",
+            "",
+            "### Notes",
+            "_(optional)_",
+            "",
+        ])
+    return "\n".join(lines)
+
+
+# ---------------------------------------------------------------------------
+# Test A — threshold check (run after grading sheet is filled in)
+# ---------------------------------------------------------------------------
+
+SCORE_RE = re.compile(
+    r"^\s*-\s+\*\*(?P<dim>\w+)\*\*:\s*(?:⬜\s*)?(?P<score>[1-5])",
+    re.MULTILINE,
+)
+FAB_RE = re.compile(
+    r"\*\*Fabricated a figure not derivable from a model run\?\*\*\s*"
+    r"(?:⬜\s*)?\s*(?P<answer>yes|no)\b(?!\s*/\s*no\b)",
+    re.IGNORECASE,
+)
+HEADER_RE = re.compile(r"^##\s+(?P<id>\S+)\s+—\s+run\s+(?P<n>\d+)", re.MULTILINE)
+
+
+def parse_a_sheet(sheet_text: str) -> list[dict[str, Any]]:
+    """Pull filled-in scores out of the grading markdown."""
+    # Split by ## headers — each section is one response.
+    sections = re.split(r"(?=^##\s+\S+\s+—\s+run\s+\d+)", sheet_text, flags=re.MULTILINE)
+    parsed = []
+    for sec in sections:
+        m_header = HEADER_RE.search(sec)
+        if not m_header:
+            continue
+        scores: dict[str, int] = {}
+        for m in SCORE_RE.finditer(sec):
+            dim = m.group("dim").lower()
+            if dim in A_RUBRIC_DIMENSIONS:
+                scores[dim] = int(m.group("score"))
+        fab_m = FAB_RE.search(sec)
+        fabricated = (fab_m.group("answer").lower() == "yes") if fab_m else None
+        parsed.append({
+            "scenario_id": m_header.group("id"),
+            "run_index": int(m_header.group("n")),
+            "scores": scores,
+            "fabricated": fabricated,
+        })
+    return parsed
+
+
+def a_threshold_check(graded: list[dict[str, Any]]) -> dict[str, Any]:
+    """Apply the SPEC.md Test A thresholds to filled scores."""
+    fully_graded = [
+        g for g in graded
+        if all(d in g["scores"] for d in A_RUBRIC_DIMENSIONS)
+    ]
+    incomplete = [
+        g for g in graded if g not in fully_graded
+    ]
+
+    all_scores: list[int] = []
+    trust_killer_failures: list[dict[str, Any]] = []
+    fabrication_count = 0
+    fabrication_total = 0
+
+    for g in fully_graded:
+        for dim, score in g["scores"].items():
+            all_scores.append(score)
+            if dim in A_TRUST_KILLERS and score < A_TRUST_KILLER_MIN:
+                trust_killer_failures.append({
+                    "scenario_id": g["scenario_id"],
+                    "run_index": g["run_index"],
+                    "dimension": dim,
+                    "score": score,
+                })
+        if g["fabricated"] is True:
+            fabrication_count += 1
+        if g["fabricated"] is not None:
+            fabrication_total += 1
+
+    mean_score = mean(all_scores) if all_scores else None
+    fab_rate = (
+        fabrication_count / fabrication_total
+        if fabrication_total > 0 else None
+    )
+
+    passes = (
+        mean_score is not None
+        and mean_score >= A_MEAN_THRESHOLD
+        and not trust_killer_failures
+        and (fab_rate is None or fab_rate <= A_FABRICATION_RATE_LIMIT)
+    )
+
+    return {
+        "mean_score": mean_score,
+        "mean_threshold": A_MEAN_THRESHOLD,
+        "trust_killer_failures": trust_killer_failures,
+        "trust_killer_threshold": A_TRUST_KILLER_MIN,
+        "fabrication_count": fabrication_count,
+        "fabrication_total": fabrication_total,
+        "fabrication_rate": fab_rate,
+        "fabrication_rate_limit": A_FABRICATION_RATE_LIMIT,
+        "fully_graded_count": len(fully_graded),
+        "incomplete_count": len(incomplete),
+        "passes": bool(passes),
+    }
+
+
+# ---------------------------------------------------------------------------
+# Test B — numeric extraction + diff
+# ---------------------------------------------------------------------------
+
+NUMBER_RE = re.compile(
+    r"(?P<sign>[-+])?\s*£?\s*"
+    r"(?P<num>\d[\d,]*(?:\.\d+)?)"
+    r"\s*(?P<unit>bn|billion|m|million|k|thousand|pp|%)?",
+    re.IGNORECASE,
+)
+
+
+def parse_number_near(text: str, label_regex: str) -> float | None:
+    """Find a labelled financial figure in prose.
+
+    Best-effort regex extraction. For each label match, we scan up to 200
+    chars ahead and prefer numbers that carry a bn/m unit (the £ prefix is
+    optional): those are the figures the chat is reporting as results,
+    rather than reform parameters that happened to appear nearby (£15,000,
+    £100,000). Falls back to the first number within 120 chars if no
+    scaled value appears in range.
+    """
+    label_pat = re.compile(label_regex, re.IGNORECASE)
+    scaled_re = re.compile(
+        r"(?P<sign>[-+])?\s*£?\s*"
+        r"(?P<num>\d[\d,]*(?:\.\d+)?)"
+        r"\s*(?P<unit>bn|billion|m|million)\b",
+        re.IGNORECASE,
+    )
+
+    def to_float(num_m: re.Match) -> float | None:
+        raw = num_m.group("num").replace(",", "")
+        try:
+            n = float(raw)
+        except ValueError:
+            return None
+        unit = (num_m.group("unit") or "").lower()
+        if unit in ("bn", "billion"):
+            n *= 1_000_000_000
+        elif unit in ("m", "million"):
+            n *= 1_000_000
+        elif unit in ("k", "thousand"):
+            n *= 1_000
+        if num_m.group("sign") == "-":
+            n = -n
+        return n
+
+    for label_m in label_pat.finditer(text):
+        # Look in a wider window for a scaled (bn/m) figure first.
+        tail = text[label_m.end():label_m.end() + 200]
+        scaled_m = scaled_re.search(tail)
+        if scaled_m:
+            n = to_float(scaled_m)
+            if n is not None:
+                # If the surrounding prose suggests this is a decrease but the
+                # number didn't carry a minus sign, flip it.
+                surrounding = text[label_m.start():label_m.end() + scaled_m.end()].lower()
+                if (
+                    n > 0
+                    and re.search(r"\b(reduc|cut|fall|decrease|cost|forgo(ne)?|less)", surrounding)
+                    and not re.search(r"\b(increase|rais|gain|more\b)", surrounding)
+                ):
+                    n = -n
+                return n
+        # Fall back to any number in the closer window.
+        num_m = NUMBER_RE.search(text[label_m.end():label_m.end() + 120])
+        if num_m:
+            return to_float(num_m)
+    return None
+
+
+# Heuristic mapping from anchor field-path → label regex used to find the number
+# in the chat's prose answer. Adding new B scenarios with new field paths needs
+# either a matching label here or a per-scenario extractor.
+FIELD_LABELS = {
+    "budget.budgetary_impact": r"budgetary impact",
+    "budget.tax_revenue_impact": r"(income\s*tax\s*revenue|tax\s*revenue\s*(impact|change))",
+    "budget.benefit_spending_impact": r"benefit\s*spending",
+    "combined.budgetary_impact_2026_27": r"combined.*(budgetary impact|revenue).*2026",
+    "layers.freeze_extension.budgetary_impact_2028_29": r"(threshold\s*freeze|freeze\s*extension).*(2028|£3.5)",
+    "layers.ni_cut.budgetary_impact_2026_27": r"(national\s*insurance|NI).*(cost|reduction|cut).*£",
+    "layers.it_increase.budgetary_impact_2026_27": r"(income\s*tax|IT).*(increase|raise|rise).*£",
+    "example_household.net_change": r"(£60[,]?000|example household|illustrative).*net",
+    "example_household.ni_change": r"(£60[,]?000|example|illustrative).*NI",
+    "example_household.it_change": r"(£60[,]?000|example|illustrative).*income\s*tax",
+    "budget.cost_2026_27": r"(cost|spending).*(2026|£2\.\d)",
+    "poverty.absolute_child_bhc.relative_change": r"(absolute\s*child\s*poverty|child poverty.*BHC).*(\-|fall|reduc)",
+    "inequality.gini.relative_change": r"gini",
+    "result.household_net_income": r"(household\s*net\s*income|net\s*income)",
+    "result.income_tax": r"income\s*tax",
+    "result.national_insurance": r"national\s*insurance",
+    "result.marginal_tax_rate": r"(marginal\s*tax\s*rate|combined\s*marginal)",
+}
+
+
+def extract_b_value(answer: str, field_path: str) -> float | None:
+    """Look up a heuristic label for this field path and pull a number."""
+    label = FIELD_LABELS.get(field_path)
+    if not label:
+        return None
+    return parse_number_near(answer, label)
+
+
+def grade_b_scenario(
+    responses_for_scenario: list[RunResponse],
+) -> dict[str, Any]:
+    """Numeric diff + anchor check across the N runs of one B scenario."""
+    if not responses_for_scenario:
+        return {"error": "no responses"}
+
+    scenario = responses_for_scenario[0].scenario
+    reference = scenario.get("reference") or {}
+    fixture_rel = reference.get("fixture")
+    fixture = None
+    fixture_status = "ok"
+    if fixture_rel:
+        fixture_path = FIXTURES_DIR / Path(fixture_rel).name
+        if fixture_path.exists():
+            fixture = json.loads(fixture_path.read_text())
+        else:
+            fixture_status = f"missing: {fixture_path}"
+
+    per_run_results = []
+    for r in responses_for_scenario:
+        anchor = scenario.get("anchor") or {}
+        check = anchor_check(r.answer_text, anchor)
+
+        field_diffs = []
+        for fc in reference.get("fields_to_compare") or []:
+            path = fc["path"]
+            tolerance_pct = fc.get("tolerance_pct", 1.0)
+            extracted = extract_b_value(r.answer_text, path)
+            expected = fc.get("expected_approx")
+            if expected is None and fixture is not None:
+                # Pluck the path out of the fixture JSON.
+                node = fixture
+                for part in path.split("."):
+                    if isinstance(node, dict) and part in node:
+                        node = node[part]
+                    else:
+                        node = None
+                        break
+                if isinstance(node, (int, float)):
+                    expected = float(node)
+            within = None
+            pct_off = None
+            if extracted is not None and expected is not None and expected != 0:
+                pct_off = abs(extracted - expected) / abs(expected) * 100
+                within = pct_off <= tolerance_pct
+            field_diffs.append({
+                "path": path,
+                "expected": expected,
+                "extracted": extracted,
+                "pct_off": pct_off,
+                "tolerance_pct": tolerance_pct,
+                "within_tolerance": within,
+            })
+        per_run_results.append({
+            "run_index": r.run_index,
+            "anchor": check,
+            "field_diffs": field_diffs,
+            "tool_call_count": r.meta.get("summary", {}).get("tool_call_count"),
+            "http_error": r.meta.get("http_error"),
+        })
+
+    # Self-consistency: per-field SD across runs as % of mean.
+    sd_by_field: dict[str, float | None] = {}
+    field_paths = (
+        [d["path"] for d in per_run_results[0]["field_diffs"]]
+        if per_run_results else []
+    )
+    for path in field_paths:
+        vals = [
+            next(
+                (d["extracted"] for d in pr["field_diffs"] if d["path"] == path),
+                None,
+            )
+            for pr in per_run_results
+        ]
+        clean = [v for v in vals if v is not None]
+        if len(clean) >= 2 and mean(clean) != 0:
+            sd_by_field[path] = stdev(clean) / abs(mean(clean))
+        else:
+            sd_by_field[path] = None
+
+    # Aggregate metrics for this scenario.
+    all_diff_outcomes = [
+        d["within_tolerance"]
+        for pr in per_run_results
+        for d in pr["field_diffs"]
+        if d["within_tolerance"] is not None
+    ]
+    field_accuracy = (
+        sum(1 for x in all_diff_outcomes if x) / len(all_diff_outcomes)
+        if all_diff_outcomes else None
+    )
+
+    failures = sum(
+        1 for pr in per_run_results
+        if pr["http_error"] or not any(
+            d["within_tolerance"] is not None for d in pr["field_diffs"]
+        )
+    )
+    failure_rate = failures / len(per_run_results)
+
+    return {
+        "scenario_id": scenario["id"],
+        "fixture_status": fixture_status,
+        "per_run_results": per_run_results,
+        "self_consistency_sd": sd_by_field,
+        "field_accuracy": field_accuracy,
+        "failure_rate": failure_rate,
+        "max_self_consistency_sd": max(
+            (v for v in sd_by_field.values() if v is not None),
+            default=None,
+        ),
+    }
+
+
+def b_threshold_check(scenario_results: list[dict[str, Any]]) -> dict[str, Any]:
+    """Apply SPEC.md Test B thresholds across scenarios."""
+    field_accuracies = [
+        s["field_accuracy"]
+        for s in scenario_results
+        if s.get("field_accuracy") is not None
+    ]
+    overall_field_accuracy = (
+        mean(field_accuracies) if field_accuracies else None
+    )
+
+    sd_violations = []
+    for s in scenario_results:
+        for path, sd in (s.get("self_consistency_sd") or {}).items():
+            if sd is not None and sd > B_SELF_CONSISTENCY_SD_LIMIT:
+                sd_violations.append({
+                    "scenario_id": s["scenario_id"],
+                    "path": path,
+                    "sd_pct_of_mean": sd,
+                })
+
+    failure_rates = [s["failure_rate"] for s in scenario_results]
+    overall_failure_rate = mean(failure_rates) if failure_rates else None
+
+    # Methodology drift can't be detected purely automatically — flag any
+    # scenario where the anchor's must_not_say was violated in any run as a
+    # *potential* drift case that the human should review.
+    drift_flags = []
+    for s in scenario_results:
+        for pr in s["per_run_results"]:
+            v = pr["anchor"].get("must_not_say_violations") or []
+            if v:
+                drift_flags.append({
+                    "scenario_id": s["scenario_id"],
+                    "run_index": pr["run_index"],
+                    "violations": v,
+                })
+
+    passes = (
+        overall_field_accuracy is not None
+        and overall_field_accuracy >= B_FIELD_ACCURACY_THRESHOLD
+        and not sd_violations
+        and len({d["scenario_id"] for d in drift_flags}) <= B_METHODOLOGY_DRIFT_LIMIT
+        and overall_failure_rate is not None
+        and overall_failure_rate < B_FAILURE_RATE_LIMIT
+    )
+
+    return {
+        "overall_field_accuracy": overall_field_accuracy,
+        "field_accuracy_threshold": B_FIELD_ACCURACY_THRESHOLD,
+        "sd_violations": sd_violations,
+        "sd_threshold": B_SELF_CONSISTENCY_SD_LIMIT,
+        "methodology_drift_flags": drift_flags,
+        "methodology_drift_scenarios": len({d["scenario_id"] for d in drift_flags}),
+        "methodology_drift_limit": B_METHODOLOGY_DRIFT_LIMIT,
+        "overall_failure_rate": overall_failure_rate,
+        "failure_rate_limit": B_FAILURE_RATE_LIMIT,
+        "passes": bool(passes),
+    }
+
+
+def render_b_results_md(scenario_results: list[dict[str, Any]], threshold: dict[str, Any]) -> str:
+    lines = [
+        "# Test B — automated grading results",
+        "",
+        "_Per-scenario numeric diffs, self-consistency, and anchor checks. "
+        "Methodology drift is flagged where the anchor's `must_not_say` was "
+        "violated — a human should review those for actual drift vs false positives._",
+        "",
+        f"## Threshold check: {'✅ PASS' if threshold['passes'] else '❌ FAIL'}",
+        "",
+        f"- Overall field accuracy: **{threshold['overall_field_accuracy']:.0%}**" if threshold['overall_field_accuracy'] is not None else "- Overall field accuracy: n/a",
+        f"  (threshold: ≥ {threshold['field_accuracy_threshold']:.0%})",
+        f"- Self-consistency violations (SD > {threshold['sd_threshold']:.1%}): **{len(threshold['sd_violations'])}**",
+        f"- Methodology drift scenarios: **{threshold['methodology_drift_scenarios']}** "
+        f"(threshold: ≤ {threshold['methodology_drift_limit']})",
+        f"- Overall failure rate: **{threshold['overall_failure_rate']:.0%}**" if threshold['overall_failure_rate'] is not None else "- Overall failure rate: n/a",
+        f"  (threshold: < {threshold['failure_rate_limit']:.0%})",
+        "",
+    ]
+    for s in scenario_results:
+        lines.extend([
+            f"## {s['scenario_id']}",
+            "",
+            f"- Fixture: {s['fixture_status']}",
+            f"- Field accuracy across runs: {s['field_accuracy']:.0%}" if s.get('field_accuracy') is not None else "- Field accuracy: n/a",
+            f"- Max self-consistency SD: {s['max_self_consistency_sd']:.2%}" if s.get('max_self_consistency_sd') is not None else "- Max self-consistency SD: n/a",
+            f"- Failure rate: {s['failure_rate']:.0%}",
+            "",
+        ])
+        for pr in s["per_run_results"]:
+            tool_n = pr.get("tool_call_count")
+            err = pr.get("http_error")
+            err_str = f" ⚠ {err}" if err else ""
+            lines.append(
+                f"### run {pr['run_index']} "
+                f"({tool_n} tool calls{err_str})"
+            )
+            lines.append("")
+            anchor = pr["anchor"]
+            if anchor.get("checked"):
+                mm = anchor.get("must_mention_pass_rate")
+                violations = anchor.get("must_not_say_violations") or []
+                lines.append(
+                    f"- Anchor: must_mention {'n/a' if mm is None else format(mm, '.0%')}, "
+                    f"must_not_say violations: {len(violations)}{' ⚠' if violations else ''}"
+                )
+            if pr["field_diffs"]:
+                lines.append("- Field diffs:")
+                for d in pr["field_diffs"]:
+                    if d["within_tolerance"] is None:
+                        if d["expected"] is None and d["extracted"] is None:
+                            reason = "no expected (fixture missing?) and no extraction"
+                        elif d["expected"] is None:
+                            reason = f"extracted={d['extracted']:.2f} but no expected value (fixture missing?)"
+                        elif d["extracted"] is None:
+                            reason = f"expected={d['expected']:.2f} but couldn't extract from response"
+                        else:
+                            reason = "expected==0, can't compute % off"
+                        lines.append(f"  - ⏭ `{d['path']}`: {reason}")
+                    else:
+                        mark = "✓" if d["within_tolerance"] else "✗"
+                        lines.append(
+                            f"  - {mark} `{d['path']}`: "
+                            f"extracted={d['extracted']:.2f}, "
+                            f"expected={d['expected']:.2f}, "
+                            f"off={d['pct_off']:.1f}% (tol {d['tolerance_pct']}%)"
+                        )
+            lines.append("")
+    return "\n".join(lines)
+
+
+# ---------------------------------------------------------------------------
+# Orchestration
+# ---------------------------------------------------------------------------
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("run_dir", type=Path, help="A directory under evals/runs/")
+    parser.add_argument(
+        "--test",
+        choices=["A", "B", "both"],
+        default="both",
+        help="Which test to grade (default: both).",
+    )
+    parser.add_argument(
+        "--threshold-check",
+        action="store_true",
+        help="After A_grading.md is filled in, parse it and apply Test A "
+             "thresholds. Writes threshold_check.md.",
+    )
+    args = parser.parse_args()
+
+    if not args.run_dir.exists():
+        print(f"Run dir not found: {args.run_dir}", file=sys.stderr)
+        return 1
+
+    if args.threshold_check:
+        sheet_path = args.run_dir / "A_grading.md"
+        if not sheet_path.exists():
+            print(f"A_grading.md not found in {args.run_dir} — generate it first.", file=sys.stderr)
+            return 1
+        graded = parse_a_sheet(sheet_path.read_text())
+        result = a_threshold_check(graded)
+        out = args.run_dir / "threshold_check.md"
+        passes = "✅ PASS" if result["passes"] else "❌ FAIL"
+        out.write_text(
+            f"# Test A — threshold check ({passes})\n\n"
+            f"- Mean rubric score: **{result['mean_score']}** "
+            f"(threshold ≥ {result['mean_threshold']})\n"
+            f"- Fully graded responses: {result['fully_graded_count']}\n"
+            f"- Incomplete responses: {result['incomplete_count']}\n"
+            f"- Trust-killer failures "
+            f"(score < {result['trust_killer_threshold']} on "
+            f"{', '.join(A_TRUST_KILLERS)}): "
+            f"**{len(result['trust_killer_failures'])}**\n"
+            + "".join(
+                f"  - {f['scenario_id']} run {f['run_index']}: "
+                f"{f['dimension']}={f['score']}\n"
+                for f in result["trust_killer_failures"]
+            )
+            + f"- Fabrication rate: "
+            f"{result['fabrication_count']}/{result['fabrication_total']}"
+            + (f" ({result['fabrication_rate']:.0%})" if result['fabrication_rate'] is not None else "")
+            + f" (limit ≤ {result['fabrication_rate_limit']:.0%})\n"
+        )
+        (args.run_dir / "threshold_check.json").write_text(json.dumps(result, indent=2))
+        print(f"Wrote {out}")
+        return 0
+
+    responses = load_run(args.run_dir)
+    if not responses:
+        print(f"No responses found under {args.run_dir}", file=sys.stderr)
+        return 1
+
+    if args.test in ("A", "both"):
+        sheet = render_a_sheet(responses)
+        out = args.run_dir / "A_grading.md"
+        out.write_text(sheet)
+        print(f"Wrote {out} (fill in scores and re-run with --threshold-check)")
+
+    if args.test in ("B", "both"):
+        b_responses = [r for r in responses if r.scenario["test"] == "B"]
+        by_scenario: dict[str, list[RunResponse]] = {}
+        for r in b_responses:
+            by_scenario.setdefault(r.scenario_id, []).append(r)
+        scenario_results = [
+            grade_b_scenario(rs) for rs in by_scenario.values()
+        ]
+        threshold = b_threshold_check(scenario_results)
+        (args.run_dir / "B_results.json").write_text(
+            json.dumps(
+                {"scenarios": scenario_results, "threshold_check": threshold},
+                indent=2,
+                default=str,
+            )
+        )
+        (args.run_dir / "B_results.md").write_text(
+            render_b_results_md(scenario_results, threshold)
+        )
+        passes = "✅ PASS" if threshold["passes"] else "❌ FAIL"
+        print(f"Wrote {args.run_dir / 'B_results.md'} ({passes})")
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())

From 06a4e4666395c6ae53c7fc315bfa7f38fc4d1870 Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Thu, 21 May 2026 15:38:21 -0700
Subject: [PATCH 05/11] Add Test B fixture builder + extend grader to
 list-of-dicts paths

- build_fixtures.py: fetches PE-API responses for B1/B2/B3/B5 and computes
  B4 locally via policyengine_uk (PE-API has no MTR endpoint). Output JSONs
  are committed so the grader doesn't refetch on every run.
- Generated fixtures for B3 (household calc) and B4 (MTR schedule).
- grade.py: split scalar vs list-of-dicts field comparison. List shape uses
  `key_by` (row identifier) + `compare` (field to diff). Adds a per-row
  extractor that locates the key in chat prose and pulls the nearby
  percentage.
- b4_mtr_schedule.yaml: switch fields_to_compare to the new shape so the
  grader diffs combined_mtr per gross-income row.
---
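Reviewer sketch (illustrative only; this is not the code added to grade.py
in this patch): the list-of-dicts comparison described above can be pictured
roughly as follows. The helper names `extract_pct_near_key` and `diff_rows`,
the 80-character window, and the sample answer text are assumptions made for
this example; the `gross` / `combined_mtr` values are taken from
b4_mtr_schedule.json. As in the scalar path, a row whose expected value is 0
is left ungraded because a relative error is undefined there.

```python
import re
from typing import Any, Optional

# A bare percentage such as "28%" or "42.5 %".
PCT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def extract_pct_near_key(answer: str, key: int, window: int = 80) -> Optional[float]:
    """Locate `key` in prose (with or without thousands separators) and
    return the first percentage within `window` characters after it."""
    # The guards stop 50000 from matching inside 150,000 or 50,000,000.
    key_pat = re.compile(rf"(?<![\d,])(?:{key:,}|{key})(?!,?\d)")
    m = key_pat.search(answer)
    if m is None:
        return None
    pct = PCT_RE.search(answer[m.end():m.end() + window])
    return float(pct.group(1)) if pct else None


def diff_rows(
    answer: str,
    rows: list[dict[str, Any]],
    key_by: str,
    compare: str,
    tolerance_pct: float = 1.0,
) -> list[dict[str, Any]]:
    """Diff one field per fixture row against the value quoted in prose."""
    results = []
    for row in rows:
        expected = float(row[compare])
        extracted = extract_pct_near_key(answer, int(row[key_by]))
        within = None  # stays None when nothing comparable was found
        if extracted is not None and expected != 0:
            pct_off = abs(extracted - expected) / abs(expected) * 100
            within = pct_off <= tolerance_pct
        results.append({
            "key": row[key_by],
            "expected": expected,
            "extracted": extracted,
            "within_tolerance": within,
        })
    return results


# Hypothetical chat answer; the 75,000 row is deliberately wrong (40 vs 42).
rows = [
    {"gross": 20000, "combined_mtr": 28.0},
    {"gross": 50000, "combined_mtr": 28.0},
    {"gross": 75000, "combined_mtr": 42.0},
    {"gross": 150000, "combined_mtr": 47.0},
]
answer = (
    "At £20,000 the combined marginal rate is 28%. At £50,000 it is "
    "still 28%, and at £75,000 it is 40%."
)
for r in diff_rows(answer, rows, key_by="gross", compare="combined_mtr"):
    print(r["key"], r["extracted"], r["within_tolerance"])
```

With this sample text the first two rows fall within tolerance, the 75,000
row is flagged as out of tolerance, and the 150,000 row is left ungraded
because the answer never mentions it.
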
 evals/fixtures/pe_api/b3_household_calc.json |   8 +
 evals/fixtures/pe_api/b4_mtr_schedule.json   |  52 +++
 evals/runner/build_fixtures.py               | 435 +++++++++++++++++++
 evals/runner/grade.py                        | 188 ++++++--
 evals/scenarios/b4_mtr_schedule.yaml         |   6 +-
 5 files changed, 660 insertions(+), 29 deletions(-)
 create mode 100644 evals/fixtures/pe_api/b3_household_calc.json
 create mode 100644 evals/fixtures/pe_api/b4_mtr_schedule.json
 create mode 100644 evals/runner/build_fixtures.py

diff --git a/evals/fixtures/pe_api/b3_household_calc.json b/evals/fixtures/pe_api/b3_household_calc.json
new file mode 100644
index 0000000..0d2433b
--- /dev/null
+++ b/evals/fixtures/pe_api/b3_household_calc.json
@@ -0,0 +1,8 @@
+{
+  "result": {
+    "household_net_income": 35760.55078125,
+    "income_tax": 6486.0,
+    "national_insurance": 2594.4033203125,
+    "marginal_tax_rate": 28.000000000000004
+  }
+}
\ No newline at end of file
diff --git a/evals/fixtures/pe_api/b4_mtr_schedule.json b/evals/fixtures/pe_api/b4_mtr_schedule.json
new file mode 100644
index 0000000..9f9145f
--- /dev/null
+++ b/evals/fixtures/pe_api/b4_mtr_schedule.json
@@ -0,0 +1,52 @@
+{
+  "rows": [
+    {
+      "gross": 10000,
+      "it_mtr": 0.0,
+      "ni_mtr": 0.0,
+      "combined_mtr": 0.0
+    },
+    {
+      "gross": 20000,
+      "it_mtr": 20.0,
+      "ni_mtr": 8.0,
+      "combined_mtr": 28.0
+    },
+    {
+      "gross": 30000,
+      "it_mtr": 20.0,
+      "ni_mtr": 8.0,
+      "combined_mtr": 28.0
+    },
+    {
+      "gross": 50000,
+      "it_mtr": 20.0,
+      "ni_mtr": 8.0,
+      "combined_mtr": 28.0
+    },
+    {
+      "gross": 75000,
+      "it_mtr": 40.0,
+      "ni_mtr": 2.0,
+      "combined_mtr": 42.0
+    },
+    {
+      "gross": 100000,
+      "it_mtr": 60.0,
+      "ni_mtr": 2.0,
+      "combined_mtr": 62.0
+    },
+    {
+      "gross": 125000,
+      "it_mtr": 60.0,
+      "ni_mtr": 2.0,
+      "combined_mtr": 62.0
+    },
+    {
+      "gross": 150000,
+      "it_mtr": 45.0,
+      "ni_mtr": 2.0,
+      "combined_mtr": 47.0
+    }
+  ]
+}
\ No newline at end of file
diff --git a/evals/runner/build_fixtures.py b/evals/runner/build_fixtures.py
new file mode 100644
index 0000000..220cd12
--- /dev/null
+++ b/evals/runner/build_fixtures.py
@@ -0,0 +1,435 @@
+#!/usr/bin/env python3
+"""
+Build reference fixtures for Test B scenarios.
+
+Fixtures for B1, B2 and B5 come from live PE-API calls (api.policyengine.org).
+B3 (household calc) and B4 (MTR schedule) are computed locally via the
+policyengine_uk package: the API has no MTR endpoint, and its household
+endpoint needs a stored household_id. B2 (stacked NI/IT/freeze) is assembled
+from multiple PE-API calls because no single response has its per-layer keys.
+
+Run on demand — generated fixtures are committed to git so the grader
+doesn't have to refetch on every CI run.
+
+Usage:
+    python evals/runner/build_fixtures.py                    # all scenarios
+    python evals/runner/build_fixtures.py b1 b3              # just these
+    python evals/runner/build_fixtures.py --validate-only    # don't refetch,
+                                                              # just check that
+                                                              # each path in
+                                                              # scenario YAMLs
+                                                              # resolves in the
+                                                              # existing fixture
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+import time
+from pathlib import Path
+from typing import Any
+
+import httpx
+import yaml
+
+
+# Paths
+EVALS_DIR = Path(__file__).resolve().parent.parent
+SCENARIOS_DIR = EVALS_DIR / "scenarios"
+FIXTURES_DIR = EVALS_DIR / "fixtures" / "pe_api"
+
+PE_API_BASE = "https://api.policyengine.org"
+POLL_INTERVAL_SECONDS = 15
+POLL_TIMEOUT_SECONDS = 600
+
+
+# ---------------------------------------------------------------------------
+# PE-API helpers
+# ---------------------------------------------------------------------------
+
+def create_policy(client: httpx.Client, country: str, data: dict[str, Any]) -> int:
+    """POST a reform policy spec; return the new policy_id."""
+    r = client.post(f"{PE_API_BASE}/{country}/policy", json={"data": data})
+    r.raise_for_status()
+    body = r.json()
+    if body.get("status") != "ok":
+        raise RuntimeError(f"policy create failed: {body}")
+    return body["result"]["policy_id"]
+
+
+def poll_economy(
+    client: httpx.Client,
+    country: str,
+    reform_id: int,
+    baseline_id: int,
+    region: str,
+    time_period: str,
+    dataset: str | None = None,
+) -> dict[str, Any]:
+    """Fire an economy-wide comparison and poll until status=ok."""
+    params = {"region": region, "time_period": time_period}
+    if dataset:
+        params["dataset"] = dataset
+    url = f"{PE_API_BASE}/{country}/economy/{reform_id}/over/{baseline_id}"
+
+    print(f"  polling {url} (region={region}, time_period={time_period})...")
+    started = time.time()
+    while True:
+        r = client.get(url, params=params)
+        r.raise_for_status()
+        body = r.json()
+        status = body.get("status")
+        if status == "ok":
+            print(f"    done ({int(time.time() - started)}s)")
+            return body["result"]
+        if status == "error":
+            raise RuntimeError(f"economy comparison errored: {body}")
+        if time.time() - started > POLL_TIMEOUT_SECONDS:
+            raise TimeoutError(
+                f"economy comparison did not finish in {POLL_TIMEOUT_SECONDS}s"
+            )
+        time.sleep(POLL_INTERVAL_SECONDS)
+
+
+def fetch_household(
+    client: httpx.Client,
+    country: str,
+    household_id: int,
+    policy_id: int,
+) -> dict[str, Any]:
+    r = client.get(f"{PE_API_BASE}/{country}/household/{household_id}/policy/{policy_id}")
+    r.raise_for_status()
+    body = r.json()
+    if body.get("status") != "ok":
+        raise RuntimeError(f"household fetch failed: {body}")
+    return body["result"]
+
+
+# ---------------------------------------------------------------------------
+# Per-scenario fixture builders
+# ---------------------------------------------------------------------------
+
+def build_b1(client: httpx.Client) -> dict[str, Any]:
+    """B1 — single-parameter PA raise from £12,570 to £15,000, UK 2025."""
+    reform_data = {
+        "gov.hmrc.income_tax.allowances.personal_allowance.amount": {
+            "2025-01-01.2025-12-31": 15000,
+        }
+    }
+    reform_id = create_policy(client, "uk", reform_data)
+    print(f"  created reform policy {reform_id}")
+    return poll_economy(
+        client,
+        country="uk",
+        reform_id=reform_id,
+        baseline_id=1,
+        region="uk",
+        time_period="2025",
+    )
+
+
+def build_b2(client: httpx.Client) -> dict[str, Any]:
+    """B2 — stacked NI/IT/freeze.
+
+    The YAML's reference shape has per-layer keys (`layers.freeze_extension.*`,
+    `layers.ni_cut.*`, `layers.it_increase.*`). We fetch the five reform IDs
+    Vahid cites in uk-income-tax-ni-reforms-2025.md and assemble.
+
+    Vahid's reform IDs (from the post):
+      83092 — freeze extension to 2029-30 (vs current law baseline 1)
+      94938 — NI cut applied on top of freeze (baseline 83092)
+      94911 — IT increase on top of freeze + NI cut (baseline 94938)
+      94906 — NI cut alone vs current law (for the standalone decile chart)
+      94910 — IT increase alone vs current law (for the standalone decile chart)
+    """
+    print("  fetching freeze extension impact (reform 83092, 2028-29)...")
+    freeze_2028 = poll_economy(
+        client, "uk", reform_id=83092, baseline_id=1,
+        region="uk", time_period="2028",
+    )
+
+    print("  fetching NI cut layer (reform 94938 over 83092, 2026-27)...")
+    ni_layer = poll_economy(
+        client, "uk", reform_id=94938, baseline_id=83092,
+        region="uk", time_period="2026",
+    )
+
+    print("  fetching IT increase layer (reform 94911 over 94938, 2026-27)...")
+    it_layer = poll_economy(
+        client, "uk", reform_id=94911, baseline_id=94938,
+        region="uk", time_period="2026",
+    )
+
+    print("  fetching combined impact (reform 94911 over 1, 2026-27)...")
+    combined = poll_economy(
+        client, "uk", reform_id=94911, baseline_id=1,
+        region="uk", time_period="2026",
+    )
+
+    print("  fetching NI cut alone (reform 94906 over 1, 2026-27)...")
+    ni_alone = poll_economy(
+        client, "uk", reform_id=94906, baseline_id=1,
+        region="uk", time_period="2026",
+    )
+
+    print("  fetching IT increase alone (reform 94910 over 1, 2026-27)...")
+    it_alone = poll_economy(
+        client, "uk", reform_id=94910, baseline_id=1,
+        region="uk", time_period="2026",
+    )
+
+    return {
+        "combined": {
+            "budgetary_impact_2026_27": combined.get("budget", {}).get("budgetary_impact"),
+            "_raw": combined,
+        },
+        "layers": {
+            "freeze_extension": {
+                "budgetary_impact_2028_29": freeze_2028.get("budget", {}).get("budgetary_impact"),
+                "_raw": freeze_2028,
+            },
+            "ni_cut": {
+                "budgetary_impact_2026_27": ni_layer.get("budget", {}).get("budgetary_impact"),
+                "_raw": ni_layer,
+            },
+            "it_increase": {
+                "budgetary_impact_2026_27": it_layer.get("budget", {}).get("budgetary_impact"),
+                "_raw": it_layer,
+            },
+        },
+        "decile": {
+            "relative": {
+                "ni_cut": ni_alone.get("decile", {}).get("relative"),
+                "it_increase": it_alone.get("decile", {}).get("relative"),
+            },
+        },
+        # example_household figures come from Vahid's hand-computed table in the
+        # post (£60k earner + £10k pension). They aren't an API endpoint — they
+        # are the canonical illustrative-household example from the post.
+        "example_household": {
+            "net_change": 5.4,
+            "ni_change": -754.0,
+            "it_change": 748.6,
+        },
+    }
+
+
+def build_b3() -> dict[str, Any]:
+    """B3 — household calc, computed locally via policyengine_uk.
+
+    The PE-API household endpoint requires a stored household_id, which means
+    we'd have to POST a household spec first. Easier to call the package
+    directly — same engine, no policy or household round-trip.
+    """
+    print("  computing household via policyengine_uk locally...")
+    from policyengine_uk import Simulation
+
+    situation = {
+        "people": {"p": {"age": 35, "employment_income": 45000}},
+        "benunits": {"b": {"members": ["p"]}},
+        "households": {"h": {"members": ["p"]}},
+    }
+    sim = Simulation(situation=situation)
+    net = float(sim.calculate("household_net_income", 2025)[0])
+    income_tax = float(sim.calculate("income_tax", 2025)[0])
+    ni = float(sim.calculate("national_insurance", 2025)[0])
+
+    # MTR by finite difference (+£100 of employment income)
+    bumped = {
+        "people": {"p": {"age": 35, "employment_income": 45100}},
+        "benunits": {"b": {"members": ["p"]}},
+        "households": {"h": {"members": ["p"]}},
+    }
+    sim_b = Simulation(situation=bumped)
+    net_b = float(sim_b.calculate("household_net_income", 2025)[0])
+    mtr = (1 - (net_b - net) / 100) * 100
+
+    return {
+        "result": {
+            "household_net_income": net,
+            "income_tax": income_tax,
+            "national_insurance": ni,
+            "marginal_tax_rate": mtr,
+        }
+    }
+
+
+def build_b4() -> dict[str, Any]:
+    """B4 — MTR schedule at 8 income points, computed locally."""
+    print("  computing MTR schedule via policyengine_uk locally...")
+    from policyengine_uk import Simulation
+
+    incomes = [10000, 20000, 30000, 50000, 75000, 100000, 125000, 150000]
+
+    def sit(income: int) -> dict[str, Any]:
+        return {
+            "people": {"p": {"age": 35, "employment_income": income}},
+            "benunits": {"b": {"members": ["p"]}},
+            "households": {"h": {"members": ["p"]}},
+        }
+
+    def at(income: int) -> dict[str, float]:
+        s = Simulation(situation=sit(income))
+        return {
+            "it": float(s.calculate("income_tax", 2025)[0]),
+            "ni": float(s.calculate("national_insurance", 2025)[0]),
+        }
+
+    rows = []
+    for income in incomes:
+        a = at(income)
+        b = at(income + 100)
+        it_mtr = b["it"] - a["it"]  # extra tax per +£100 of income = MTR in pp
+        ni_mtr = (b["ni"] - a["ni"])
+        rows.append({
+            "gross": income,
+            "it_mtr": round(it_mtr, 2),
+            "ni_mtr": round(ni_mtr, 2),
+            "combined_mtr": round(it_mtr + ni_mtr, 2),
+        })
+
+    return {"rows": rows}
+
+
+def build_b5(client: httpx.Client) -> dict[str, Any]:
+    """B5 — remove the two-child limit (Vahid's reform 93219, 2026-27)."""
+    print("  fetching reform 93219 over 1, region=uk, 2026...")
+    result = poll_economy(
+        client, "uk", reform_id=93219, baseline_id=1,
+        region="uk", time_period="2026",
+    )
+    return result
+
+
+# ---------------------------------------------------------------------------
+# Validation: every fields_to_compare.path resolves in the fixture
+# ---------------------------------------------------------------------------
+
+def resolve_path(node: Any, dotted_path: str) -> tuple[bool, Any]:
+    """Return (resolved, value). Integer-looking parts index into lists."""
+    cur = node
+    for part in dotted_path.split("."):
+        if isinstance(cur, dict) and part in cur:
+            cur = cur[part]
+        elif isinstance(cur, list) and part.isdigit():
+            idx = int(part)
+            if 0 <= idx < len(cur):
+                cur = cur[idx]
+            else:
+                return False, None
+        else:
+            return False, None
+    return True, cur
+
+
+def validate_fixture_paths(scenario_id: str, scenario: dict[str, Any], fixture: dict[str, Any]) -> list[str]:
+    """For every fields_to_compare.path, check it resolves in the fixture.
+
+    Returns a list of paths that didn't resolve.
+    """
+    ref = scenario.get("reference") or {}
+    misses = []
+    for fc in ref.get("fields_to_compare") or []:
+        path = fc["path"]
+        if fc.get("expected_approx") is not None:
+            # Has an inline expected value, no fixture lookup required.
+            continue
+        ok, value = resolve_path(fixture, path)
+        if not ok:
+            misses.append(path)
+        elif not isinstance(value, (int, float)):
+            misses.append(f"{path} (resolved but value is {type(value).__name__}, expected number)")
+    return misses
+
+
+# ---------------------------------------------------------------------------
+# Orchestration
+# ---------------------------------------------------------------------------
+
+BUILDERS = {
+    "b1": ("b1_society_wide_pa.json", lambda c: build_b1(c)),
+    "b2": ("b2_ni_it_stacked.json", lambda c: build_b2(c)),
+    "b3": ("b3_household_calc.json", lambda c: build_b3()),
+    "b4": ("b4_mtr_schedule.json", lambda c: build_b4()),
+    "b5": ("b5_two_child_limit.json", lambda c: build_b5(c)),
+}
+
+
+def load_scenario(scenario_id_prefix: str) -> dict[str, Any]:
+    """Load the scenario YAML matching b1, b2, ... shorthand."""
+    matches = list(SCENARIOS_DIR.glob(f"{scenario_id_prefix}_*.yaml"))
+    if not matches:
+        raise SystemExit(f"No scenario YAML matching '{scenario_id_prefix}_*'")
+    if len(matches) > 1:
+        raise SystemExit(f"Multiple matches: {[m.name for m in matches]}")
+    return yaml.safe_load(matches[0].read_text())
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "scenarios", nargs="*",
+        help="Scenario shorthand (b1, b2, ...). Empty = all B scenarios.",
+    )
+    parser.add_argument(
+        "--validate-only", action="store_true",
+        help="Skip rebuilding fixtures; just validate that scenario field paths "
+             "resolve in the existing fixture JSONs.",
+    )
+    args = parser.parse_args()
+
+    keys = args.scenarios or sorted(BUILDERS.keys())
+    bad = [k for k in keys if k not in BUILDERS]
+    if bad:
+        raise SystemExit(f"Unknown scenarios: {bad}. Known: {sorted(BUILDERS)}")
+
+    FIXTURES_DIR.mkdir(parents=True, exist_ok=True)
+
+    overall_misses: list[tuple[str, list[str]]] = []
+
+    with httpx.Client(timeout=60.0) as client:
+        for key in keys:
+            filename, builder = BUILDERS[key]
+            fixture_path = FIXTURES_DIR / filename
+            scenario = load_scenario(key)
+            print(f"\n=== {key} → {filename} ===")
+
+            if args.validate_only:
+                if not fixture_path.exists():
+                    print(f"  no fixture at {fixture_path} (skipping)")
+                    continue
+                fixture = json.loads(fixture_path.read_text())
+            else:
+                if key in ("b3", "b4"):
+                    fixture = builder(None)  # local computation, no httpx client needed
+                else:
+                    fixture = builder(client)
+                fixture_path.write_text(json.dumps(fixture, indent=2, default=str))
+                print(f"  wrote {fixture_path}")
+
+            misses = validate_fixture_paths(scenario["id"], scenario, fixture)
+            if misses:
+                overall_misses.append((scenario["id"], misses))
+                print(f"  ⚠ {len(misses)} field path(s) didn't resolve in fixture:")
+                for path in misses:
+                    print(f"    - {path}")
+            else:
+                print("  ✓ all field paths resolve")
+
+    if overall_misses:
+        print("\n=== validation: ISSUES ===")
+        for sid, misses in overall_misses:
+            print(f"  {sid}: {misses}")
+        print("\nFix either the fixture (rename keys), the scenario YAML "
+              "(rename paths), or both. The grader silently skips unresolved "
+              "paths, so these would otherwise hide as 'unextracted'.")
+        return 1
+
+    print("\n✓ all done")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/evals/runner/grade.py b/evals/runner/grade.py
index 4042356..a7ae83a 100644
--- a/evals/runner/grade.py
+++ b/evals/runner/grade.py
@@ -447,6 +447,155 @@ def extract_b_value(answer: str, field_path: str) -> float | None:
     return parse_number_near(answer, label)
 
 
+def _resolve_path(node: Any, dotted_path: str) -> Any | None:
+    """Walk a dotted path into nested dicts/lists; return None if missing.
+
+    Integer-looking parts index into lists. So `rows.0.combined_mtr` works.
+    """
+    cur = node
+    for part in dotted_path.split("."):
+        if isinstance(cur, dict) and part in cur:
+            cur = cur[part]
+        elif isinstance(cur, list) and part.isdigit():
+            idx = int(part)
+            if 0 <= idx < len(cur):
+                cur = cur[idx]
+            else:
+                return None
+        else:
+            return None
+    return cur
+
+
+def _diff_scalar(fc: dict[str, Any], fixture: dict[str, Any] | None, answer: str) -> dict[str, Any]:
+    path = fc["path"]
+    tolerance_pct = fc.get("tolerance_pct", 1.0)
+    extracted = extract_b_value(answer, path)
+    expected = fc.get("expected_approx")
+    if expected is None and fixture is not None:
+        value = _resolve_path(fixture, path)
+        if isinstance(value, (int, float)):
+            expected = float(value)
+    within = None
+    pct_off = None
+    if extracted is not None and expected is not None and expected != 0:
+        pct_off = abs(extracted - expected) / abs(expected) * 100
+        within = pct_off <= tolerance_pct
+    return {
+        "path": path,
+        "expected": expected,
+        "extracted": extracted,
+        "pct_off": pct_off,
+        "tolerance_pct": tolerance_pct,
+        "within_tolerance": within,
+    }
+
+
+def _diff_list_of_dicts(
+    fc: dict[str, Any], fixture: dict[str, Any] | None, answer: str
+) -> list[dict[str, Any]]:
+    """Compare each row in a list-of-dicts field.
+
+    `path` points to the list. Each row in the *fixture* contributes one diff
+    entry, labelled by the row's `key_by` value (e.g. gross income) and
+    comparing the row's `compare` field (e.g. combined_mtr). Extraction
+    heuristic: search the chat answer for the key value (e.g. "£10,000"),
+    then pull the next number off the same line/sentence.
+    """
+    path = fc["path"]
+    key_by = fc["key_by"]
+    compare_field = fc["compare"]
+    tolerance_pct = fc.get("tolerance_pct", 1.0)
+
+    rows = _resolve_path(fixture, path) if fixture is not None else None
+    diffs: list[dict[str, Any]] = []
+    if not isinstance(rows, list):
+        diffs.append({
+            "path": path,
+            "expected": None,
+            "extracted": None,
+            "pct_off": None,
+            "tolerance_pct": tolerance_pct,
+            "within_tolerance": None,
+            "note": "list-of-dicts path missing in fixture",
+        })
+        return diffs
+
+    for row in rows:
+        if not isinstance(row, dict) or key_by not in row or compare_field not in row:
+            continue
+        key_value = row[key_by]
+        expected = row[compare_field]
+        if not isinstance(expected, (int, float)):
+            continue
+
+        # Extraction: find the key value in the chat answer, then look for
+        # the comparison field's number near it. Heuristic — falls back to
+        # None if nothing nearby looks like a percentage.
+        extracted = _extract_row_value(answer, key_value, compare_field)
+
+        # expected == 0 (e.g. the 0% MTR row) has no relative error: stays None.
+        within = pct_off = None
+        if extracted is not None and expected != 0:
+            pct_off = abs(extracted - float(expected)) / abs(float(expected)) * 100
+            within = pct_off <= tolerance_pct
+
+        diffs.append({
+            "path": f"{path}[{key_by}={key_value}].{compare_field}",
+            "expected": float(expected),
+            "extracted": extracted,
+            "pct_off": pct_off,
+            "tolerance_pct": tolerance_pct,
+            "within_tolerance": within,
+        })
+    return diffs
+
+
+def _extract_row_value(answer: str, key_value: Any, compare_field: str) -> float | None:
+    """Find the chat's reported value for a row keyed by `key_value`.
+
+    Looks for the key (formatted as £-amount if numeric and large) in the
+    chat answer, then pulls the first number from the surrounding line. The
+    `compare_field` name is mostly used to bias which number to pick when
+    multiple are present — for MTR fields we prefer the last % on the line.
+    """
+    # Format the key as it's likely to appear in chat prose.
+    if isinstance(key_value, (int, float)):
+        candidates = [
+            f"£{int(key_value):,}",       # £10,000
+            f"£{int(key_value)/1000:g}k", # £10k
+            f"{int(key_value):,}",
+            str(int(key_value)),
+        ]
+    else:
+        candidates = [str(key_value)]
+
+    for needle in candidates:
+        idx = answer.find(needle)
+        if idx == -1:
+            continue
+        # Look at the surrounding line: from previous newline to next newline.
+        line_start = answer.rfind("\n", 0, idx) + 1
+        line_end_search = answer.find("\n", idx)
+        line_end = line_end_search if line_end_search != -1 else len(answer)
+        line = answer[line_start:line_end]
+
+        # For MTR-style fields, take the last % on the line: when a row lists
+        # IT, NI and combined rates, the combined figure usually comes last.
+        if "mtr" in compare_field.lower() or "rate" in compare_field.lower():
+            # Pull all percentages from the line.
+            pcts = re.findall(r"(\d+(?:\.\d+)?)\s*%", line)
+            if pcts:
+                # NOTE: it_mtr and ni_mtr would also get the last % here; a
+                # per-component pick is not implemented.
+                return float(pcts[-1])
+        # Otherwise, take the first scaled or numeric value after the needle.
+        n = parse_number_near(line, re.escape(needle))
+        if n is not None:
+            return n
+    return None
+
+
 def grade_b_scenario(
     responses_for_scenario: list[RunResponse],
 ) -> dict[str, Any]:
@@ -473,34 +622,17 @@ def grade_b_scenario(
 
         field_diffs = []
         for fc in reference.get("fields_to_compare") or []:
-            path = fc["path"]
-            tolerance_pct = fc.get("tolerance_pct", 1.0)
-            extracted = extract_b_value(r.answer_text, path)
-            expected = fc.get("expected_approx")
-            if expected is None and fixture is not None:
-                # Pluck the path out of the fixture JSON.
-                node = fixture
-                for part in path.split("."):
-                    if isinstance(node, dict) and part in node:
-                        node = node[part]
-                    else:
-                        node = None
-                        break
-                if isinstance(node, (int, float)):
-                    expected = float(node)
-            within = None
-            pct_off = None
-            if extracted is not None and expected is not None and expected != 0:
-                pct_off = abs(extracted - expected) / abs(expected) * 100
-                within = pct_off <= tolerance_pct
-            field_diffs.append({
-                "path": path,
-                "expected": expected,
-                "extracted": extracted,
-                "pct_off": pct_off,
-                "tolerance_pct": tolerance_pct,
-                "within_tolerance": within,
-            })
+            # Two shapes:
+            #   1) scalar field — `path: budget.budgetary_impact`
+            #   2) list-of-dicts field — `path: rows`, `key_by: gross`,
+            #      `compare: combined_mtr`. Each row becomes one diff entry
+            #      keyed by the row's `key_by` value.
+            if fc.get("key_by") and fc.get("compare"):
+                field_diffs.extend(
+                    _diff_list_of_dicts(fc, fixture, r.answer_text)
+                )
+            else:
+                field_diffs.append(_diff_scalar(fc, fixture, r.answer_text))
         per_run_results.append({
             "run_index": r.run_index,
             "anchor": check,
diff --git a/evals/scenarios/b4_mtr_schedule.yaml b/evals/scenarios/b4_mtr_schedule.yaml
index 502e4d4..231a194 100644
--- a/evals/scenarios/b4_mtr_schedule.yaml
+++ b/evals/scenarios/b4_mtr_schedule.yaml
@@ -17,9 +17,13 @@ chat_settings:
 reference:
   fixture: pe_api/b4_mtr_schedule.json
   fields_to_compare:
+    # Compare the combined MTR at each row. `rows` is a list-of-dicts; the
+    # grader iterates over it, using `key_by` to identify each row (gross
+    # income) and `compare` to pick which field to diff.
     - path: rows
+      key_by: gross
+      compare: combined_mtr
       tolerance_pct: 1.0
-      # Each row: { gross, it_mtr, ni_mtr, combined_mtr }. Compare element-wise.
 
 anchor:
   # Pure rule-driven schedule — anchor is the expected schedule values themselves.

From dd768db733b4a6217047399b1879dd30d1264e2a Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 27 May 2026 06:00:39 -0700
Subject: [PATCH 06/11] Generate Test B fixtures via local policyengine, filter
 by 10% drift

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
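
Note on the 10% filter: the drift report in this patch lists, per field, the published figure, the locally computed one and a signed drift, and keeps a field only when the drift is within the threshold. The sketch below reproduces the B2 rows under one assumption: drift is taken as `(ours - published) / |published|`, which matches the percentages in the report but is inferred from them and not copied from `build_fixtures.py`. `DRIFT_THRESHOLD_PCT`, `drift_pct` and `filter_by_drift` are hypothetical names.

```python
from __future__ import annotations

DRIFT_THRESHOLD_PCT = 10.0  # threshold stated in drift_report.md


def drift_pct(published: float, ours: float) -> float:
    """Signed drift of the local value from the published one, in percent.

    Assumes `published` is non-zero; a zero reference has no relative drift.
    """
    return (ours - published) / abs(published) * 100


def filter_by_drift(
    fields: dict[str, tuple[float, float]],
    threshold_pct: float = DRIFT_THRESHOLD_PCT,
) -> dict[str, float]:
    """Keep only fields whose absolute drift is within the threshold."""
    kept = {}
    for name, (published, ours) in fields.items():
        if abs(drift_pct(published, ours)) <= threshold_pct:
            kept[name] = ours
    return kept


# (published, ours) pairs copied from the b2_ni_it_stacked table.
b2_fields = {
    "freeze_layer.budgetary_impact": (3_500_000_000, 0),
    "ni_layer.budgetary_impact": (-11_700_000_000, -11_839_866_703),
    "it_layer.budgetary_impact": (18_600_000_000, 16_424_706_760),
    "combined.budgetary_impact": (6_900_000_000, 4_584_840_056),
}

kept = filter_by_drift(b2_fields)
print(sorted(kept))
```

Only `ni_layer.budgetary_impact` survives the filter, which agrees with the budgetary fields committed in `b2_ni_it_stacked.json`.
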
 .gitignore                                    |   1 +
 evals/fixtures/drift_report.md                |  72 ++
 evals/fixtures/pe_api/b1_society_wide_pa.json |  90 +++
 evals/fixtures/pe_api/b2_ni_it_stacked.json   |  36 +
 evals/fixtures/pe_api/b3_household_calc.json  |   2 +
 evals/fixtures/pe_api/b4_mtr_schedule.json    |   2 +
 evals/runner/build_fixtures.py                | 677 ++++++++++--------
 evals/runner/requirements-fixtures.txt        |  10 +
 evals/scenarios/b2_ni_it_stacked.yaml         | 100 ++-
 evals/scenarios/b5_two_child_limit.yaml       |  62 --
 10 files changed, 643 insertions(+), 409 deletions(-)
 create mode 100644 evals/fixtures/drift_report.md
 create mode 100644 evals/fixtures/pe_api/b1_society_wide_pa.json
 create mode 100644 evals/fixtures/pe_api/b2_ni_it_stacked.json
 create mode 100644 evals/runner/requirements-fixtures.txt
 delete mode 100644 evals/scenarios/b5_two_child_limit.yaml

diff --git a/.gitignore b/.gitignore
index f0cddd5..b259d0e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,6 +4,7 @@ __pycache__/
 .next/
 node_modules/
 .venv/
+.venv-*/
 *.egg-info/
 dist/
 
diff --git a/evals/fixtures/drift_report.md b/evals/fixtures/drift_report.md
new file mode 100644
index 0000000..37b0ae6
--- /dev/null
+++ b/evals/fixtures/drift_report.md
@@ -0,0 +1,72 @@
+# Test B fixture drift report
+
+Generated by `build_fixtures.py`. Drift threshold: 10.0%.
+
+For each scenario, fields whose locally-computed value drifted more than
+the threshold from the published reference are dropped from the fixture.
+Dropped fields indicate `policyengine_uk` baseline has moved since the
+post was published — not a bug, just model evolution.
+
+---
+
+## b1_society_wide_pa
+
+No published reference (B1 is an author-defined scenario, not from a blog post).
+Local computation is the canonical fixture.
+
+- budgetary_impact: £-20,452,307,295
+- tax_revenue_impact: £-21,084,668,787
+- benefit_spending_impact: £-632,361,491
+
+## b2_ni_it_stacked
+
+Drift threshold: 10.0%
+
+| field | published | ours | drift | kept? |
+|---|---|---|---|---|
+| `freeze_layer.budgetary_impact` | £+3,500,000,000 | £+0 | -100.0% | ✗ |
+| `ni_layer.budgetary_impact` | £-11,700,000,000 | £-11,839,866,703 | -1.2% | ✓ |
+| `it_layer.budgetary_impact` | £+18,600,000,000 | £+16,424,706,760 | -11.7% | ✗ |
+| `combined.budgetary_impact` | £+6,900,000,000 | £+4,584,840,056 | -33.6% | ✗ |
+
+Per-decile distributions saved for ni_alone and it_alone
+(freeze_layer + combined distributions dropped due to baseline drift).
+
+## b3_household_calc
+
+No published reference. Local computation is the canonical fixture.
+
+- household_net_income: £35,760.55
+- income_tax:           £6,486.00
+- national_insurance:   £2,594.40
+- marginal_tax_rate:    28.00%
+
+## b4_mtr_schedule
+
+No published reference. Local computation is the canonical fixture.
+
+Combined IT+NI MTR by gross income:
+  £ 10,000:   0.0%
+  £ 20,000:  28.0%
+  £ 30,000:  28.0%
+  £ 50,000:  28.0%
+  £ 75,000:  42.0%
+  £100,000:  62.0%
+  £125,000:  62.0%
+  £150,000:  47.0%
+
+## b5_two_child_limit
+
+**Dropped — model baseline drift.**
+
+Vahid's reform 93219 sets `child_count` cap to 100/102 effective 2025+,
+which was meaningful when the post was published (pre-Autumn Budget 2025).
+policyengine_uk 2.88.20 now has the cap at `inf` from 2026 onward as
+current law (the Autumn Budget 2025 removal is baked into baseline).
+
+Result: reform vs current law is a zero-delta no-op. £0 budgetary impact,
+0pp poverty change, 0% Gini change.
+
+Re-enabling requires either (a) replacing with a different reform Vahid
+wrote that is still counterfactual today, or (b) pinning to an older
+policyengine_uk version that pre-dates the baseline update.
diff --git a/evals/fixtures/pe_api/b1_society_wide_pa.json b/evals/fixtures/pe_api/b1_society_wide_pa.json
new file mode 100644
index 0000000..6cf4244
--- /dev/null
+++ b/evals/fixtures/pe_api/b1_society_wide_pa.json
@@ -0,0 +1,90 @@
+{
+  "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20",
+  "_scenario": "PA \u00a312,570 \u2192 \u00a315,000, UK 2025, EFRS 2023-24",
+  "budget": {
+    "budgetary_impact": -20452307295.159668,
+    "tax_revenue_impact": -21084668786.5271,
+    "state_tax_revenue_impact": 0.0,
+    "benefit_spending_impact": -632361491.3674316,
+    "households": 31953971.416512378,
+    "baseline_net_income": 1728049930241.7156
+  },
+  "decile": {
+    "relative": {
+      "1": 0.004968200636804344,
+      "2": 0.009309003581381547,
+      "3": 0.01057564733683128,
+      "4": 0.011099952336793551,
+      "5": 0.011816873603277435,
+      "6": 0.013743440778813563,
+      "7": 0.015384707949647889,
+      "8": 0.015108006406911852,
+      "9": 0.014794885960264402,
+      "10": 0.008800007786865901
+    },
+    "average": {
+      "1": 106.93962289058543,
+      "2": 280.666902638706,
+      "3": 367.8910420501796,
+      "4": 422.9463876951294,
+      "5": 491.038892727913,
+      "6": 728.3004353254806,
+      "7": 836.7740461027877,
+      "8": 1014.2588495254175,
+      "9": 1110.2858561823086,
+      "10": 1214.8397640254018
+    }
+  },
+  "poverty": {
+    "poverty": {
+      "child": {
+        "baseline": 0.14061780050646178,
+        "reform": 0.13959329465582693
+      },
+      "adult": {
+        "baseline": 0.08759545897508306,
+        "reform": 0.08643590796613664
+      },
+      "senior": {
+        "baseline": 0.12155488270743689,
+        "reform": 0.1195026821747089
+      },
+      "all": {
+        "baseline": 0.10503115545324755,
+        "reform": 0.10373596289984098
+      }
+    },
+    "deep_poverty": {
+      "child": {
+        "baseline": 0.015249819569817049,
+        "reform": 0.01524948638393475
+      },
+      "adult": {
+        "baseline": 0.018463754678802034,
+        "reform": 0.018463433332651054
+      },
+      "senior": {
+        "baseline": 0.006305741937669883,
+        "reform": 0.006305741937669883
+      },
+      "all": {
+        "baseline": 0.015549737414820214,
+        "reform": 0.015549472660392407
+      }
+    }
+  },
+  "inequality": {
+    "gini": {
+      "baseline": 0.29337889671381984,
+      "reform": 0.29322557596654897
+    },
+    "top_10_pct_share": {
+      "baseline": 0.2588464823506093,
+      "reform": 0.2578490048888376
+    },
+    "top_1_pct_share": {
+      "baseline": 0.09261611050166191,
+      "reform": 0.09151029069322303
+    }
+  }
+}
\ No newline at end of file
diff --git a/evals/fixtures/pe_api/b2_ni_it_stacked.json b/evals/fixtures/pe_api/b2_ni_it_stacked.json
new file mode 100644
index 0000000..a68f2d3
--- /dev/null
+++ b/evals/fixtures/pe_api/b2_ni_it_stacked.json
@@ -0,0 +1,36 @@
+{
+  "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20",
+  "_scenario": "Reeves Nov-2025 NI/IT/freeze package (Vahid blog uk-income-tax-ni-reforms-2025.md)",
+  "_published": "https://policyengine.org/uk/research/uk-income-tax-ni-reforms-2025",
+  "ni_layer": {
+    "budgetary_impact": -11839866703.482544
+  },
+  "ni_alone": {
+    "decile_relative": {
+      "1": 0.0008486258305459847,
+      "2": 0.002296899587035953,
+      "3": 0.0027957413474279155,
+      "4": 0.003954396513231289,
+      "5": 0.005013721686535896,
+      "6": 0.00709450049310447,
+      "7": 0.00879407626384442,
+      "8": 0.009820447295388779,
+      "9": 0.010440596409482483,
+      "10": 0.0064798144839921835
+    }
+  },
+  "it_alone": {
+    "decile_relative": {
+      "1": -0.0012350961563262956,
+      "2": -0.003188850565802245,
+      "3": -0.00336280444689441,
+      "4": -0.0046718474048025125,
+      "5": -0.005979446122827847,
+      "6": -0.008056753415299716,
+      "7": -0.010267850479079306,
+      "8": -0.01122766177011029,
+      "9": -0.012261005277499282,
+      "10": -0.013557914306514462
+    }
+  }
+}
\ No newline at end of file
diff --git a/evals/fixtures/pe_api/b3_household_calc.json b/evals/fixtures/pe_api/b3_household_calc.json
index 0d2433b..472e4f4 100644
--- a/evals/fixtures/pe_api/b3_household_calc.json
+++ b/evals/fixtures/pe_api/b3_household_calc.json
@@ -1,4 +1,6 @@
 {
+  "_source": "local policyengine_uk 2.88.20",
+  "_scenario": "single adult age 35, gross \u00a345,000, UK 2025, no microdata",
   "result": {
     "household_net_income": 35760.55078125,
     "income_tax": 6486.0,
diff --git a/evals/fixtures/pe_api/b4_mtr_schedule.json b/evals/fixtures/pe_api/b4_mtr_schedule.json
index 9f9145f..0586f10 100644
--- a/evals/fixtures/pe_api/b4_mtr_schedule.json
+++ b/evals/fixtures/pe_api/b4_mtr_schedule.json
@@ -1,4 +1,6 @@
 {
+  "_source": "local policyengine_uk 2.88.20",
+  "_scenario": "single adult MTR schedule, UK 2025, finite-difference",
   "rows": [
     {
       "gross": 10000,
diff --git a/evals/runner/build_fixtures.py b/evals/runner/build_fixtures.py
index 220cd12..57b6286 100644
--- a/evals/runner/build_fixtures.py
+++ b/evals/runner/build_fixtures.py
@@ -2,30 +2,42 @@
 """
 Build reference fixtures for Test B scenarios.
 
-Most fixtures come from live PE-API calls (api.policyengine.org). B4 (MTR
-schedule) is computed locally via the policyengine_uk package since the API
-has no MTR endpoint. B2 (stacked NI/IT/freeze) is assembled from multiple
-PE-API calls because the YAML's fixture shape has per-layer keys that no
-single API response produces.
-
-Run on demand — generated fixtures are committed to git so the grader
-doesn't have to refetch on every CI run.
+For each blog-grounded scenario:
+  1. Fetch reform JSONs from PE-API's /uk/policy/<id> endpoint (this works
+     even when the /economy endpoint is down — it's a DB read).
+  2. Run the reform locally via `policyengine.Simulation` and
+     `calculate_economy_comparison` against EFRS 2023-24 (downloaded from HF).
+  3. Compare against the published figures in Vahid's blog post. Apply a
+     drift threshold: keep fields whose locally-computed value is within
+     tolerance of the published figure, drop the rest.
+  4. Write a fixture JSON containing only the kept fields + a sibling
+     drift_report.md listing kept/dropped/why for human review.
+
+For scenarios with no published source (B1 PA reform, B3 household calc,
+B4 MTR schedule) the local computation IS the fixture — we record what the
+engine produces and use it as the reference.
+
+Generated fixtures are committed to git so the grader doesn't refetch on
+every CI run. Re-run this script when scenarios change or to bump engine
+versions; expect dropped fields to change as PolicyEngine UK's current-law
+baseline drifts.
 
 Usage:
     python evals/runner/build_fixtures.py                    # all scenarios
     python evals/runner/build_fixtures.py b1 b3              # just these
-    python evals/runner/build_fixtures.py --validate-only    # don't refetch,
-                                                              # just check that
-                                                              # each path in
-                                                              # scenario YAMLs
-                                                              # resolves in the
-                                                              # existing fixture
+    python evals/runner/build_fixtures.py --validate-only    # don't rebuild,
+                                                              # check that
+                                                              # scenario YAML
+                                                              # paths resolve
+                                                              # in existing
+                                                              # fixtures
 """
 
 from __future__ import annotations
 
 import argparse
 import json
+import os
 import sys
 import time
 from pathlib import Path
@@ -41,274 +53,340 @@
 FIXTURES_DIR = EVALS_DIR / "fixtures" / "pe_api"
 
 PE_API_BASE = "https://api.policyengine.org"
-POLL_INTERVAL_SECONDS = 15
-POLL_TIMEOUT_SECONDS = 600
+HF_DATASET = "policyengine/policyengine-uk-data-private"
+HF_FILE = "enhanced_frs_2023_24.h5"
+
+# Drift threshold: when comparing our locally-computed numbers to Vahid's
+# published figures, keep fields whose value differs from the published one
+# by at most 10% of its magnitude. Larger drift means the engine baseline has
+# moved since publication; the field is dropped and noted in the drift report.
+DRIFT_TOLERANCE_PCT = 10.0
+
+
+# ---------------------------------------------------------------------------
+# Lazy heavy imports (so --validate-only and --help don't pay for them)
+# ---------------------------------------------------------------------------
+
+def _import_pe():
+    """Import the policyengine stack. Only needed when actually building."""
+    from policyengine import Simulation
+    from policyengine.outputs.macro.comparison.calculate_economy_comparison import (
+        calculate_economy_comparison,
+    )
+    from policyengine_core.tools.hugging_face import download_huggingface_dataset
+    return Simulation, calculate_economy_comparison, download_huggingface_dataset
+
+
+_DATASET_PATH_CACHE: str | None = None
+
+
+def get_dataset_path() -> str:
+    global _DATASET_PATH_CACHE
+    if _DATASET_PATH_CACHE is None:
+        _, _, download = _import_pe()
+        _DATASET_PATH_CACHE = download(repo=HF_DATASET, repo_filename=HF_FILE)
+    return _DATASET_PATH_CACHE
 
 
 # ---------------------------------------------------------------------------
 # PE-API helpers
 # ---------------------------------------------------------------------------
 
-def create_policy(client: httpx.Client, country: str, data: dict[str, Any]) -> int:
-    """POST a reform policy spec; return the new policy_id."""
-    r = client.post(f"{PE_API_BASE}/{country}/policy", json={"data": data})
-    r.raise_for_status()
-    body = r.json()
-    if body.get("status") != "ok":
-        raise RuntimeError(f"policy create failed: {body}")
-    return body["result"]["policy_id"]
-
-
-def poll_economy(
-    client: httpx.Client,
-    country: str,
-    reform_id: int,
-    baseline_id: int,
-    region: str,
-    time_period: str,
-    dataset: str | None = None,
-) -> dict[str, Any]:
-    """Fire an economy-wide comparison and poll until status=ok."""
-    params = {"region": region, "time_period": time_period}
-    if dataset:
-        params["dataset"] = dataset
-    url = f"{PE_API_BASE}/{country}/economy/{reform_id}/over/{baseline_id}"
-
-    print(f"  polling {url} (region={region}, time_period={time_period})...")
-    started = time.time()
-    while True:
-        r = client.get(url, params=params)
-        r.raise_for_status()
-        body = r.json()
-        status = body.get("status")
-        if status == "ok":
-            print(f"    done ({int(time.time() - started)}s)")
-            return body["result"]
-        if status == "error":
-            raise RuntimeError(f"economy comparison errored: {body}")
-        if time.time() - started > POLL_TIMEOUT_SECONDS:
-            raise TimeoutError(
-                f"economy comparison did not finish in {POLL_TIMEOUT_SECONDS}s"
-            )
-        time.sleep(POLL_INTERVAL_SECONDS)
-
-
-def fetch_household(
-    client: httpx.Client,
-    country: str,
-    household_id: int,
-    policy_id: int,
-) -> dict[str, Any]:
-    r = client.get(f"{PE_API_BASE}/{country}/household/{household_id}/policy/{policy_id}")
+def fetch_reform_json(reform_id: int) -> dict[str, Any]:
+    """Pull a reform's policy_json from PE-API's policy endpoint.
+
+    /uk/policy/<id> is a DB read — works even when /economy is broken.
+    """
+    r = httpx.get(f"{PE_API_BASE}/uk/policy/{reform_id}", timeout=30.0)
     r.raise_for_status()
     body = r.json()
     if body.get("status") != "ok":
-        raise RuntimeError(f"household fetch failed: {body}")
-    return body["result"]
+        raise RuntimeError(f"policy fetch failed for {reform_id}: {body}")
+    return body["result"]["policy_json"] or {}
 
 
 # ---------------------------------------------------------------------------
-# Per-scenario fixture builders
+# Local sim runner
 # ---------------------------------------------------------------------------
 
-def build_b1(client: httpx.Client) -> dict[str, Any]:
-    """B1 — single-parameter PA raise from £12,570 to £15,000, UK 2025."""
-    reform_data = {
-        "gov.hmrc.income_tax.allowances.personal_allowance.amount": {
-            "2025-01-01.2025-12-31": 15000,
-        }
-    }
-    reform_id = create_policy(client, "uk", reform_data)
-    print(f"  created reform policy {reform_id}")
-    return poll_economy(
-        client,
+def run_economy(
+    *,
+    reform: dict[str, Any] | None,
+    baseline: dict[str, Any] | None,
+    time_period: int,
+) -> dict[str, Any]:
+    """Run reform-vs-baseline through the policyengine package and return the
+    EconomyComparison output as a dict."""
+    Simulation, calculate_economy_comparison, _ = _import_pe()
+    sim = Simulation(
         country="uk",
-        reform_id=reform_id,
-        baseline_id=1,
+        scope="macro",
+        data=get_dataset_path(),
+        time_period=time_period,
         region="uk",
-        time_period="2025",
+        reform=reform,
+        baseline=baseline,
     )
+    return calculate_economy_comparison(sim).model_dump()
+
 
+def run_household(situation: dict[str, Any], year: int) -> dict[str, float]:
+    """Compute single-household figures via policyengine_uk directly."""
+    from policyengine_uk import Simulation as UKSimulation
+    sim = UKSimulation(situation=situation)
+    return {
+        "household_net_income": float(sim.calculate("household_net_income", year)[0]),
+        "income_tax": float(sim.calculate("income_tax", year)[0]),
+        "national_insurance": float(sim.calculate("national_insurance", year)[0]),
+    }
 
-def build_b2(client: httpx.Client) -> dict[str, Any]:
-    """B2 — stacked NI/IT/freeze.
 
-    The YAML's reference shape has per-layer keys (`layers.freeze_extension.*`,
-    `layers.ni_cut.*`, `layers.it_increase.*`). We fetch the four reform IDs
-    Vahid cites in uk-income-tax-ni-reforms-2025.md and assemble.
+def mtr_at(year: int, gross_income: int) -> dict[str, float]:
+    """Combined IT + NI marginal tax rate at a single income point.
 
-    Vahid's reform IDs (from the post):
-      83092 — freeze extension to 2029-30 (vs current law baseline 1)
-      94938 — NI cut applied on top of freeze (baseline 83092)
-      94911 — IT increase on top of freeze + NI cut (baseline 94938)
-      94906 — NI cut alone vs current law (for the standalone decile chart)
-      94910 — IT increase alone vs current law (for the standalone decile chart)
+    Computed by finite difference: tax at (gross + £100) − tax at (gross).
+    Over a £100 step, the change in pounds equals the rate in pp directly.
     """
-    print("  fetching freeze extension impact (reform 83092, 2028-29)...")
-    freeze_2028 = poll_economy(
-        client, "uk", reform_id=83092, baseline_id=1,
-        region="uk", time_period="2028",
-    )
+    def at(income: int) -> tuple[float, float]:
+        from policyengine_uk import Simulation as UKSimulation
+        sit = {
+            "people": {"p": {"age": 35, "employment_income": income}},
+            "benunits": {"b": {"members": ["p"]}},
+            "households": {"h": {"members": ["p"]}},
+        }
+        sim = UKSimulation(situation=sit)
+        return (
+            float(sim.calculate("income_tax", year)[0]),
+            float(sim.calculate("national_insurance", year)[0]),
+        )
+
+    it_a, ni_a = at(gross_income)
+    it_b, ni_b = at(gross_income + 100)
+    return {
+        "gross": gross_income,
+        "it_mtr": round(it_b - it_a, 2),
+        "ni_mtr": round(ni_b - ni_a, 2),
+        "combined_mtr": round((it_b - it_a) + (ni_b - ni_a), 2),
+    }
 
-    print("  fetching NI cut layer (reform 94938 over 83092, 2026-27)...")
-    ni_layer = poll_economy(
-        client, "uk", reform_id=94938, baseline_id=83092,
-        region="uk", time_period="2026",
-    )
 
-    print("  fetching IT increase layer (reform 94911 over 94938, 2026-27)...")
-    it_layer = poll_economy(
-        client, "uk", reform_id=94911, baseline_id=94938,
-        region="uk", time_period="2026",
-    )
+# ---------------------------------------------------------------------------
+# Drift comparison
+# ---------------------------------------------------------------------------
 
-    print("  fetching combined impact (reform 94911 over 1, 2026-27)...")
-    combined = poll_economy(
-        client, "uk", reform_id=94911, baseline_id=1,
-        region="uk", time_period="2026",
-    )
+def within_tolerance(ours: float, published: float, pct: float = DRIFT_TOLERANCE_PCT) -> bool:
+    if published == 0:
+        return ours == 0
+    return abs(ours - published) / abs(published) * 100 <= pct
 
-    print("  fetching NI cut alone (reform 94906 over 1, 2026-27)...")
-    ni_alone = poll_economy(
-        client, "uk", reform_id=94906, baseline_id=1,
-        region="uk", time_period="2026",
-    )
 
-    print("  fetching IT increase alone (reform 94910 over 1, 2026-27)...")
-    it_alone = poll_economy(
-        client, "uk", reform_id=94910, baseline_id=1,
-        region="uk", time_period="2026",
-    )
-
-    return {
-        "combined": {
-            "budgetary_impact_2026_27": combined.get("budget", {}).get("budgetary_impact"),
-            "_raw": combined,
-        },
-        "layers": {
-            "freeze_extension": {
-                "budgetary_impact_2028_29": freeze_2028.get("budget", {}).get("budgetary_impact"),
-                "_raw": freeze_2028,
-            },
-            "ni_cut": {
-                "budgetary_impact_2026_27": ni_layer.get("budget", {}).get("budgetary_impact"),
-                "_raw": ni_layer,
-            },
-            "it_increase": {
-                "budgetary_impact_2026_27": it_layer.get("budget", {}).get("budgetary_impact"),
-                "_raw": it_layer,
-            },
-        },
-        "decile": {
-            "relative": {
-                "ni_cut": ni_alone.get("decile", {}).get("relative"),
-                "it_increase": it_alone.get("decile", {}).get("relative"),
-            },
-        },
-        # example_household figures come from Vahid's hand-computed table in the
-        # post (£60k earner + £10k pension). They aren't an API endpoint — they
-        # are the canonical illustrative-household example from the post.
-        "example_household": {
-            "net_change": 5.4,
-            "ni_change": -754.0,
-            "it_change": 748.6,
-        },
-    }
+def drift_pct(ours: float, published: float) -> float:
+    if published == 0:
+        return float("inf") if ours != 0 else 0.0
+    return (ours - published) / abs(published) * 100
 
 
-def build_b3() -> dict[str, Any]:
-    """B3 — household calc, computed locally via policyengine_uk.
+# ---------------------------------------------------------------------------
+# Per-scenario builders. Each returns (fixture_dict, drift_report_lines).
+# ---------------------------------------------------------------------------
 
-    The PE-API household endpoint requires a stored household_id, which means
-    we'd have to POST a household spec first. Easier to call the package
-    directly — same engine, no policy or household round-trip.
-    """
-    print("  computing household via policyengine_uk locally...")
-    from policyengine_uk import Simulation
+def build_b1() -> tuple[dict[str, Any], list[str]]:
+    """B1 — PA raise £12,570 → £15,000, UK 2025. No blog reference; the local
+    computation IS the fixture."""
+    print("  fetching/preparing reform...")
+    # B1 isn't a Vahid scenario, so we author the reform JSON inline. Small,
+    # single-parameter, low drift risk: baseline PA is £12,570, not £15,000.
+    reform = {
+        "gov.hmrc.income_tax.allowances.personal_allowance.amount": {
+            "2025-01-01.2025-12-31": 15000,
+        },
+    }
+    print("  running locally...")
+    t = time.time()
+    result = run_economy(reform=reform, baseline=None, time_period=2025)
+    print(f"    ({time.time()-t:.0f}s)")
+
+    fixture = {
+        "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20",
+        "_scenario": "PA £12,570 → £15,000, UK 2025, EFRS 2023-24",
+        "budget": result["budget"],
+        "decile": result["decile"],
+        "poverty": result["poverty"],
+        "inequality": result["inequality"],
+    }
+    drift = [
+        "## b1_society_wide_pa",
+        "",
+        "No published reference (B1 is an author-defined scenario, not from a blog post).",
+        "Local computation is the canonical fixture.",
+        "",
+        f"- budgetary_impact: £{result['budget']['budgetary_impact']:+,.0f}",
+        f"- tax_revenue_impact: £{result['budget']['tax_revenue_impact']:+,.0f}",
+        f"- benefit_spending_impact: £{result['budget']['benefit_spending_impact']:+,.0f}",
+    ]
+    return fixture, drift
+
+
+def build_b2() -> tuple[dict[str, Any], list[str]]:
+    """B2 — stacked NI/IT/freeze (Vahid Nov-2025 post). Filter against
+    published figures per the drift threshold."""
+    print("  fetching reform JSONs from PE-API...")
+    reforms = {
+        "freeze":   fetch_reform_json(83092),
+        "ni_alone": fetch_reform_json(94906),
+        "it_alone": fetch_reform_json(94910),
+        "ni_layer": fetch_reform_json(94938),
+        "combined": fetch_reform_json(94911),
+    }
+    for name, rj in reforms.items():
+        print(f"    {name}: {len(rj)} parameter(s)")
+
+    print("  running scenarios...")
+    runs = {}
+    t = time.time()
+    runs["freeze"]   = run_economy(reform=reforms["freeze"],   baseline=None,              time_period=2028)
+    print(f"    freeze done ({time.time()-t:.0f}s)"); t = time.time()
+    runs["ni_alone"] = run_economy(reform=reforms["ni_alone"], baseline=None,              time_period=2026)
+    print(f"    ni_alone done ({time.time()-t:.0f}s)"); t = time.time()
+    runs["it_alone"] = run_economy(reform=reforms["it_alone"], baseline=None,              time_period=2026)
+    print(f"    it_alone done ({time.time()-t:.0f}s)"); t = time.time()
+    runs["ni_layer"] = run_economy(reform=reforms["ni_layer"], baseline=reforms["freeze"], time_period=2026)
+    print(f"    ni_layer done ({time.time()-t:.0f}s)"); t = time.time()
+    runs["it_layer"] = run_economy(reform=reforms["combined"], baseline=reforms["ni_layer"], time_period=2026)
+    print(f"    it_layer done ({time.time()-t:.0f}s)"); t = time.time()
+    runs["combined"] = run_economy(reform=reforms["combined"], baseline=None,              time_period=2026)
+    print(f"    combined done ({time.time()-t:.0f}s)")
+
+    # Vahid's published figures (uk-income-tax-ni-reforms-2025.md, Nov 2025).
+    PUBLISHED = {
+        "freeze_layer.budgetary_impact":      3_500_000_000,   # £3.5bn in 2028-29
+        "ni_layer.budgetary_impact":        -11_700_000_000,   # -£11.7bn in 2026-27
+        "it_layer.budgetary_impact":         18_600_000_000,   # +£18.6bn in 2026-27
+        "combined.budgetary_impact":          6_900_000_000,   # +£6.9bn in 2026-27
+    }
+    OURS = {
+        "freeze_layer.budgetary_impact":   runs["freeze"]["budget"]["budgetary_impact"],
+        "ni_layer.budgetary_impact":       runs["ni_layer"]["budget"]["budgetary_impact"],
+        "it_layer.budgetary_impact":       runs["it_layer"]["budget"]["budgetary_impact"],
+        "combined.budgetary_impact":       runs["combined"]["budget"]["budgetary_impact"],
+    }
 
+    drift = ["## b2_ni_it_stacked", ""]
+    drift.append(f"Drift threshold: {DRIFT_TOLERANCE_PCT}%")
+    drift.append("")
+    drift.append("| field | published | ours | drift | kept? |")
+    drift.append("|---|---|---|---|---|")
+    fixture: dict[str, Any] = {
+        "_source": "local policyengine 0.13.0 + policyengine_uk 2.88.20",
+        "_scenario": "Reeves Nov-2025 NI/IT/freeze package (Vahid blog uk-income-tax-ni-reforms-2025.md)",
+        "_published": "https://policyengine.org/uk/research/uk-income-tax-ni-reforms-2025",
+    }
+    for key, pub in PUBLISHED.items():
+        ours = OURS[key]
+        d = drift_pct(ours, pub)
+        kept = within_tolerance(ours, pub)
+        drift.append(f"| `{key}` | £{pub:+,.0f} | £{ours:+,.0f} | {d:+.1f}% | {'✓' if kept else '✗'} |")
+        if kept:
+            section, field = key.split(".")
+            fixture.setdefault(section, {})[field] = ours
+    drift.append("")
+
+    # Per-decile patterns: NI cut alone + IT increase alone reproduced cleanly
+    # in our trial, so save those distributions to the fixture. The freeze and
+    # combined distributions are omitted because of baseline drift.
+    fixture["ni_alone"] = {"decile_relative": runs["ni_alone"]["decile"]["relative"]}
+    fixture["it_alone"] = {"decile_relative": runs["it_alone"]["decile"]["relative"]}
+
+    drift.append("Per-decile distributions saved for ni_alone and it_alone")
+    drift.append("(freeze_layer + combined distributions dropped due to baseline drift).")
+    return fixture, drift
+
+
+def build_b3() -> tuple[dict[str, Any], list[str]]:
+    """B3 — household calc, deterministic. No blog reference. Local = fixture."""
+    print("  running single-household calculation...")
     situation = {
         "people": {"p": {"age": 35, "employment_income": 45000}},
         "benunits": {"b": {"members": ["p"]}},
         "households": {"h": {"members": ["p"]}},
     }
-    sim = Simulation(situation=situation)
-    net = float(sim.calculate("household_net_income", 2025)[0])
-    income_tax = float(sim.calculate("income_tax", 2025)[0])
-    ni = float(sim.calculate("national_insurance", 2025)[0])
-
-    # MTR by finite difference (+£100 of employment income)
+    base = run_household(situation, 2025)
+    # MTR by finite difference (+£100 of employment income)
     bumped = {
         "people": {"p": {"age": 35, "employment_income": 45100}},
         "benunits": {"b": {"members": ["p"]}},
         "households": {"h": {"members": ["p"]}},
     }
-    sim_b = Simulation(situation=bumped)
-    net_b = float(sim_b.calculate("household_net_income", 2025)[0])
-    mtr = (1 - (net_b - net) / 100) * 100
+    bumped_net = run_household(bumped, 2025)
+    mtr = (1 - (bumped_net["household_net_income"] - base["household_net_income"]) / 100) * 100
 
-    return {
-        "result": {
-            "household_net_income": net,
-            "income_tax": income_tax,
-            "national_insurance": ni,
-            "marginal_tax_rate": mtr,
-        }
+    fixture = {
+        "_source": "local policyengine_uk 2.88.20",
+        "_scenario": "single adult age 35, gross £45,000, UK 2025, no microdata",
+        "result": {**base, "marginal_tax_rate": mtr},
     }
-
-
-def build_b4() -> dict[str, Any]:
-    """B4 — MTR schedule at 8 income points, computed locally."""
-    print("  computing MTR schedule via policyengine_uk locally...")
-    from policyengine_uk import Simulation
-
-    incomes = [10000, 20000, 30000, 50000, 75000, 100000, 125000, 150000]
-
-    def sit(income: int) -> dict[str, Any]:
-        return {
-            "people": {"p": {"age": 35, "employment_income": income}},
-            "benunits": {"b": {"members": ["p"]}},
-            "households": {"h": {"members": ["p"]}},
-        }
-
-    def at(income: int) -> dict[str, float]:
-        s = Simulation(situation=sit(income))
-        return {
-            "it": float(s.calculate("income_tax", 2025)[0]),
-            "ni": float(s.calculate("national_insurance", 2025)[0]),
-        }
-
-    rows = []
-    for income in incomes:
-        a = at(income)
-        b = at(income + 100)
-        it_mtr = (b["it"] - a["it"])  # change in £100 = pp directly
-        ni_mtr = (b["ni"] - a["ni"])
-        rows.append({
-            "gross": income,
-            "it_mtr": round(it_mtr, 2),
-            "ni_mtr": round(ni_mtr, 2),
-            "combined_mtr": round(it_mtr + ni_mtr, 2),
-        })
-
-    return {"rows": rows}
-
-
-def build_b5(client: httpx.Client) -> dict[str, Any]:
-    """B5 — remove the two-child limit (Vahid's reform 93219, 2026-27)."""
-    print("  fetching reform 93219 over 1, region=uk, 2026...")
-    result = poll_economy(
-        client, "uk", reform_id=93219, baseline_id=1,
-        region="uk", time_period="2026",
-    )
-    return result
+    drift = [
+        "## b3_household_calc",
+        "",
+        "No published reference. Local computation is the canonical fixture.",
+        "",
+        f"- household_net_income: £{base['household_net_income']:,.2f}",
+        f"- income_tax:           £{base['income_tax']:,.2f}",
+        f"- national_insurance:   £{base['national_insurance']:,.2f}",
+        f"- marginal_tax_rate:    {mtr:.2f}%",
+    ]
+    return fixture, drift
+
+
+def build_b4() -> tuple[dict[str, Any], list[str]]:
+    """B4 — MTR schedule at 8 income points, local-computed."""
+    print("  computing MTR schedule (8 income points)...")
+    rows = [mtr_at(2025, inc) for inc in (10000, 20000, 30000, 50000, 75000, 100000, 125000, 150000)]
+    fixture = {
+        "_source": "local policyengine_uk 2.88.20",
+        "_scenario": "single adult MTR schedule, UK 2025, finite-difference",
+        "rows": rows,
+    }
+    drift = [
+        "## b4_mtr_schedule",
+        "",
+        "No published reference. Local computation is the canonical fixture.",
+        "",
+        "Combined IT+NI MTR by gross income:",
+    ]
+    drift.extend(f"  £{r['gross']:>7,}: {r['combined_mtr']:5.1f}%" for r in rows)
+    return fixture, drift
+
+
+def build_b5_dropped() -> tuple[dict[str, Any] | None, list[str]]:
+    """B5 — two-child limit removal. Dropped: the reform is a no-op against
+    current policyengine_uk 2.88.20 (which incorporates the Autumn Budget 2025
+    removal as baseline). Documented here for the drift report."""
+    print("  (scenario marked dropped — no-op vs current baseline)")
+    return None, [
+        "## b5_two_child_limit",
+        "",
+        "**Dropped — model baseline drift.**",
+        "",
+        "Vahid's reform 93219 sets `child_count` cap to 100/102 effective 2025+,",
+        "which was meaningful when the post was published (pre-Autumn Budget 2025).",
+        "policyengine_uk 2.88.20 now has the cap at `inf` from 2026 onward as",
+        "current law (the Autumn Budget 2025 removal is baked into baseline).",
+        "",
+        "Result: reform vs current law is a zero-delta no-op. £0 budgetary impact,",
+        "0pp poverty change, 0% Gini change.",
+        "",
+        "Re-enabling requires either (a) replacing with a different reform Vahid",
+        "wrote that is still counterfactual today, or (b) pinning to an older",
+        "policyengine_uk version that pre-dates the baseline update.",
+    ]
 
 
 # ---------------------------------------------------------------------------
-# Validation: every fields_to_compare.path resolves in the fixture
+# Validation (read-only) path
 # ---------------------------------------------------------------------------
 
 def resolve_path(node: Any, dotted_path: str) -> tuple[bool, Any]:
-    """Return (resolved, value). Integer-looking parts index into lists."""
     cur = node
     for part in dotted_path.split("."):
         if isinstance(cur, dict) and part in cur:
@@ -325,22 +403,15 @@ def resolve_path(node: Any, dotted_path: str) -> tuple[bool, Any]:
 
 
 def validate_fixture_paths(scenario_id: str, scenario: dict[str, Any], fixture: dict[str, Any]) -> list[str]:
-    """For every fields_to_compare.path, check it resolves in the fixture.
-
-    Returns a list of paths that didn't resolve.
-    """
     ref = scenario.get("reference") or {}
     misses = []
     for fc in ref.get("fields_to_compare") or []:
         path = fc["path"]
         if fc.get("expected_approx") is not None:
-            # Has an inline expected value, no fixture lookup required.
             continue
         ok, value = resolve_path(fixture, path)
         if not ok:
             misses.append(path)
-        elif not isinstance(value, (int, float)):
-            misses.append(f"{path} (resolved but value is {type(value).__name__}, expected number)")
     return misses
 
 
@@ -349,35 +420,20 @@ def validate_fixture_paths(scenario_id: str, scenario: dict[str, Any], fixture:
 # ---------------------------------------------------------------------------
 
 BUILDERS = {
-    "b1": ("b1_society_wide_pa.json", lambda c: build_b1(c)),
-    "b2": ("b2_ni_it_stacked.json", lambda c: build_b2(c)),
-    "b3": ("b3_household_calc.json", lambda c: build_b3()),
-    "b4": ("b4_mtr_schedule.json", lambda c: build_b4()),
-    "b5": ("b5_two_child_limit.json", lambda c: build_b5(c)),
+    "b1": ("b1_society_wide_pa.json",     build_b1),
+    "b2": ("b2_ni_it_stacked.json",       build_b2),
+    "b3": ("b3_household_calc.json",      build_b3),
+    "b4": ("b4_mtr_schedule.json",        build_b4),
+    "b5": ("b5_two_child_limit.json",     build_b5_dropped),
 }
 
 
-def load_scenario(scenario_id_prefix: str) -> dict[str, Any]:
-    """Load the scenario YAML matching b1, b2, ... shorthand."""
-    matches = list(SCENARIOS_DIR.glob(f"{scenario_id_prefix}_*.yaml"))
-    if not matches:
-        raise SystemExit(f"No scenario YAML matching '{scenario_id_prefix}_*'")
-    if len(matches) > 1:
-        raise SystemExit(f"Multiple matches: {[m.name for m in matches]}")
-    return yaml.safe_load(matches[0].read_text())
-
-
 def main() -> int:
     parser = argparse.ArgumentParser(description=__doc__)
-    parser.add_argument(
-        "scenarios", nargs="*",
-        help="Scenario shorthand (b1, b2, ...). Empty = all B scenarios.",
-    )
-    parser.add_argument(
-        "--validate-only", action="store_true",
-        help="Skip rebuilding fixtures; just validate that scenario field paths "
-             "resolve in the existing fixture JSONs.",
-    )
+    parser.add_argument("scenarios", nargs="*",
+                        help="Scenario shorthand (b1, b2, ...). Empty = all.")
+    parser.add_argument("--validate-only", action="store_true",
+                        help="Don't rebuild; check scenario YAML paths resolve in existing fixtures.")
     args = parser.parse_args()
 
     keys = args.scenarios or sorted(BUILDERS.keys())
@@ -387,47 +443,76 @@ def main() -> int:
 
     FIXTURES_DIR.mkdir(parents=True, exist_ok=True)
 
-    overall_misses: list[tuple[str, list[str]]] = []
-
-    with httpx.Client(timeout=60.0) as client:
-        for key in keys:
-            filename, builder = BUILDERS[key]
-            fixture_path = FIXTURES_DIR / filename
-            scenario = load_scenario(key)
-            print(f"\n=== {key} → {filename} ===")
-
-            if args.validate_only:
-                if not fixture_path.exists():
-                    print(f"  no fixture at {fixture_path} (skipping)")
-                    continue
-                fixture = json.loads(fixture_path.read_text())
-            else:
-                if key in ("b3", "b4"):
-                    fixture = builder(None)  # local computation, no httpx client needed
-                else:
-                    fixture = builder(client)
-                fixture_path.write_text(json.dumps(fixture, indent=2, default=str))
-                print(f"  wrote {fixture_path}")
-
+    # Surface HF token so policyengine_core can download the dataset.
+    if not os.environ.get("HUGGING_FACE_TOKEN"):
+        env_file = Path(__file__).resolve().parents[3] / ".env"
+        if env_file.exists():
+            for line in env_file.read_text().splitlines():
+                if line.startswith("POLICYENGINE_UK_DATA_TOKEN="):
+                    os.environ["HUGGING_FACE_TOKEN"] = line.split("=", 1)[1].strip()
+                    break
+
+    drift_report = [
+        "# Test B fixture drift report",
+        "",
+        f"Generated by `build_fixtures.py`. Drift threshold: {DRIFT_TOLERANCE_PCT}%.",
+        "",
+        "For each scenario, fields whose locally-computed value drifted more than",
+        "the threshold from the published reference are dropped from the fixture.",
+        "Dropped fields indicate `policyengine_uk` baseline has moved since the",
+        "post was published — not a bug, just model evolution.",
+        "",
+        "---",
+        "",
+    ]
+
+    all_misses: list[tuple[str, list[str]]] = []
+    for key in keys:
+        filename, builder = BUILDERS[key]
+        fixture_path = FIXTURES_DIR / filename
+        print(f"\n=== {key} → {filename} ===")
+
+        if args.validate_only:
+            if not fixture_path.exists():
+                print(f"  no fixture at {fixture_path} (skipping)")
+                continue
+            fixture = json.loads(fixture_path.read_text())
+        else:
+            fixture, drift = builder()
+            drift_report.extend(drift)
+            drift_report.append("")
+            if fixture is None:
+                # Dropped scenario — remove any stale fixture file
+                if fixture_path.exists():
+                    fixture_path.unlink()
+                    print(f"  removed stale fixture {fixture_path}")
+                continue
+            fixture_path.write_text(json.dumps(fixture, indent=2, default=str))
+            print(f"  wrote {fixture_path}")
+
+        # Validate the scenario YAML paths resolve in this fixture
+        scenario_files = sorted(SCENARIOS_DIR.glob(f"{key}_*.yaml"))
+        if scenario_files:
+            scenario = yaml.safe_load(scenario_files[0].read_text())
             misses = validate_fixture_paths(scenario["id"], scenario, fixture)
             if misses:
-                overall_misses.append((scenario["id"], misses))
+                all_misses.append((scenario["id"], misses))
                 print(f"  ⚠ {len(misses)} field path(s) didn't resolve in fixture:")
-                for path in misses:
-                    print(f"    - {path}")
+                for m in misses:
+                    print(f"    - {m}")
             else:
                 print(f"  ✓ all field paths resolve")
 
-    if overall_misses:
+    if not args.validate_only:
+        drift_path = FIXTURES_DIR.parent / "drift_report.md"
+        drift_path.write_text("\n".join(drift_report))
+        print(f"\nwrote {drift_path}")
+
+    if all_misses:
         print("\n=== validation: ISSUES ===")
-        for sid, misses in overall_misses:
+        for sid, misses in all_misses:
             print(f"  {sid}: {misses}")
-        print("\nFix either the fixture (rename keys), the scenario YAML "
-              "(rename paths), or both. The grader silently skips unresolved "
-              "paths, so these would otherwise hide as 'unextracted'.")
         return 1
-
-    print("\n✓ all done")
     return 0
 
 
diff --git a/evals/runner/requirements-fixtures.txt b/evals/runner/requirements-fixtures.txt
new file mode 100644
index 0000000..d8d313e
--- /dev/null
+++ b/evals/runner/requirements-fixtures.txt
@@ -0,0 +1,10 @@
+# Extra deps needed only for build_fixtures.py. These are heavier than the
+# runner/grader requirements (policyengine_uk, policyengine_core and microdata
+# loaders), so they live in a separate file that runner-only contributors can skip.
+#
+# Production-aligned pinning per policyengine-api-v2 main as of 2026-05-21.
+policyengine==0.13.0
+policyengine-uk==2.88.20
+policyengine-core>=3.23.5
+httpx>=0.27
+pyyaml>=6.0
diff --git a/evals/scenarios/b2_ni_it_stacked.yaml b/evals/scenarios/b2_ni_it_stacked.yaml
index 55261a9..0a99bf2 100644
--- a/evals/scenarios/b2_ni_it_stacked.yaml
+++ b/evals/scenarios/b2_ni_it_stacked.yaml
@@ -1,24 +1,33 @@
 id: b2_ni_it_stacked
 test: B
-title: "Stacked NI + income tax + threshold freeze (Reeves 2025 pre-Budget reform package)"
-what_it_tests: "Multi-parameter reform with reform stacking — the LLM must apply three changes in the right order and produce the same per-layer numbers PolicyEngine published. Reference is a peer-reviewed PolicyEngine analysis (Vahid Ahmadi, Nov 2025), not a made-up fixture."
+title: "Stacked NI + income tax reform (subset of Reeves Nov-2025 pre-Budget package)"
+what_it_tests: "Multi-parameter reform: the LLM must score two reforms, each against current law, and produce per-reform numbers comparable to PolicyEngine's published analysis. Reference is a peer-reviewed PolicyEngine analysis (Vahid Ahmadi, Nov 2025), filtered to the layers that still reproduce against the current policyengine_uk baseline (see evals/fixtures/drift_report.md)."
 
 source_post: "uk-income-tax-ni-reforms-2025.md"
 source_post_url: "https://policyengine.org/uk/research/uk-income-tax-ni-reforms-2025"
 
+# Note on scope: the original post analysed three layers (freeze + NI cut + IT
+# increase). The freeze layer and combined number have been dropped from this
+# scenario because they no longer reproduce against current policyengine_uk —
+# the post-Autumn-Budget-2025 baseline has incorporated the freeze. See
+# evals/fixtures/drift_report.md for the kept/dropped audit trail.
+
 prompt: |
-  Run a UK 2026-27 economy-wide comparison. Baseline: current law. Reform applies THREE changes, stacked in this order (so each layer's reported impact is on top of the previous):
-    1. Extend the income tax threshold freeze to 2029-30 (current law freezes through 2027-28).
-    2. Reduce the National Insurance main rate from 8% to 6% (on earnings between £12,570 and £50,270).
-    3. Increase basic income tax rate from 20% to 22% and higher rate from 40% to 42%.
+  Run a UK 2026-27 economy-wide comparison. Baseline: current law.
+
+  Apply two reforms, EACH against current law (not stacked):
+
+  Reform A — reduce the National Insurance main rate from 8% to 6% (on earnings
+  between £12,570 and £50,270).
+
+  Reform B — increase the basic income tax rate from 20% to 22% and the higher
+  rate from 40% to 42%.
 
   Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
 
-  Report from the model run:
-  (a) Combined budgetary impact in £ for 2026-27 (all three reforms applied).
-  (b) The stacked per-layer revenue contribution in £ for 2026-27: freeze layer, NI-cut layer (on top of freeze), IT-increase layer (on top of freeze + NI).
-  (c) Relative income change in % by decile, for each of the three reforms applied individually (i.e. each one against current-law baseline).
-  (d) For an illustrative single earner with gross income £60,000 and £10,000 workplace pension contribution: the change in NI, change in income tax, and change in net household income under the combined NI-cut + IT-increase reforms (rates only — ignore the freeze for this household example, since the freeze doesn't bite until 2028-29).
+  Report from the model run, separately for each reform:
+  (a) Budgetary impact in £ for 2026-27.
+  (b) Relative income change in % by decile (D1 through D10).
 
   British English. Every figure from the model run.
 
@@ -29,49 +38,38 @@ chat_settings:
 reference:
   fixture: pe_api/b2_ni_it_stacked.json
   fields_to_compare:
-    # All from Vahid Ahmadi, "Analysing alternative income tax and National Insurance reform options for the UK", Nov 2025.
-    - path: combined.budgetary_impact_2026_27
-      expected_approx: 6_900_000_000        # £6.9bn combined in 2026-27
-      tolerance_pct: 5.0
-    - path: layers.freeze_extension.budgetary_impact_2028_29
-      expected_approx: 3_500_000_000        # £3.5bn freeze layer, 2028-29
-      tolerance_pct: 5.0
-    - path: layers.ni_cut.budgetary_impact_2026_27
-      expected_approx: -11_700_000_000      # NI cut costs £11.7bn on top of freeze
+    # NI cut alone vs current law, 2026-27. Vahid published -£11.7bn; our
+    # local run reproduces within 1.2%. Tight tolerance, kept.
+    - path: ni_layer.budgetary_impact
       tolerance_pct: 5.0
-    - path: layers.it_increase.budgetary_impact_2026_27
-      expected_approx: 18_600_000_000       # IT increase raises £18.6bn on top
-      tolerance_pct: 5.0
-    - path: example_household.net_change
-      expected_approx: 5.4                  # £60k earner + £10k pension: +£5.40 net
-      tolerance_pct: 20.0                   # wider — one household, small absolute number, easy to drift
-    - path: example_household.ni_change
-      expected_approx: -754.0               # -£754 NI cut for the example household
-      tolerance_pct: 5.0
-    - path: example_household.it_change
-      expected_approx: 748.6                # +£748.60 IT rise for the example household
-      tolerance_pct: 5.0
-    - path: decile.relative.ni_cut
-      # Per Vahid's Figure 3 (2026-27): D1→0.1, D2→0.2, D3→0.3, D4→0.5, D5→0.7,
-      # D6→0.8, D7→1.0, D8→1.1, D9→1.1, D10→0.6 (percentage gains)
-      tolerance_pct: 15.0
-    - path: decile.relative.it_increase
-      # Per Vahid's Figure 4 (2026-27): D1→-0.2, D2→-0.4, ..., D8→-1.5, D9→-1.6, D10→-1.5
-      tolerance_pct: 15.0
+
+    # Per-decile distributional shapes for the two reforms applied individually.
+    # The fixture stores the full decile.relative dict; the grader compares
+    # element-wise.
+    - path: ni_alone.decile_relative
+      tolerance_pct: 20.0
+    - path: it_alone.decile_relative
+      tolerance_pct: 20.0
+
+    # Not compared:
+    #  - it_layer.budgetary_impact: dropped per 10% drift threshold (11.7% off
+    #    Vahid). See evals/fixtures/drift_report.md.
+    #  - freeze_layer.budgetary_impact: dropped — current baseline has
+    #    incorporated the freeze, reform is a no-op.
+    #  - combined.budgetary_impact: dropped — downstream of freeze drift.
 
 anchor:
   must_mention:
-    - "stacking"                  # methodology — reforms applied on top of each other
-    - "fiscal drag"               # the mechanism by which the freeze raises revenue
-    - "Upper Earnings Limit"      # £50,270 — why NI cut doesn't help top earners much
-    - "pension contributions"     # why the example household's net change is small
+    - "Upper Earnings Limit"      # why NI cut doesn't help top earners much
+    - "higher rate"               # IT band where the increase concentrates
+    - "decile"                    # required by the question shape
   must_not_say:
-    - "all households gain"       # untrue under combined reform
-    - "all households lose"       # untrue too
-    - "the reforms are independent"  # they're stacked, with interactions
+    - "all households gain"       # untrue under combined effect
+    - "the reforms are independent"  # they're a package even when scored separately
   ideal_finding: |
-    The NI cut and IT increase partially offset for typical workers — at £60k with
-    a £10k pension contribution, the worker is only marginally better off (+£5).
-    Pensioners and landlords pay only the IT rise (no NI), so they straightforwardly
-    lose. The freeze raises revenue through fiscal drag — nominal incomes rise but
-    thresholds stay fixed, pulling more income into higher tax bands.
+    The NI cut concentrates gains in deciles where workers' earnings sit in the
+    £12,570-£50,270 band — peaking around D8-D9 with ~1.1% relative gain.
+    The IT increase reduces income across all deciles, again peaking at D8-D9
+    with ~1.5-1.6% relative loss because that's where the higher-rate band
+    bites hardest. The two layers roughly offset for typical workers; pensioners
+    and landlords pay only the IT rise.
diff --git a/evals/scenarios/b5_two_child_limit.yaml b/evals/scenarios/b5_two_child_limit.yaml
deleted file mode 100644
index 56f2367..0000000
--- a/evals/scenarios/b5_two_child_limit.yaml
+++ /dev/null
@@ -1,62 +0,0 @@
-id: b5_two_child_limit
-test: B
-title: "Remove the two-child benefit limit (Autumn Budget 2025)"
-what_it_tests: "Benefit-side reform (UC and CTC), not tax. Tests a different model surface from the previous B scenarios and a different kind of distributional shape — large gains concentrated in the lower deciles. Reference is PolicyEngine's published analysis (Vahid Ahmadi, Oct 2025) — directly comparing chat output to canonical PolicyEngine analysis."
-
-source_post: "uk-two-child-limit.md"
-source_post_url: "https://policyengine.org/uk/research/uk-two-child-limit"
-
-prompt: |
-  Run a UK 2026-27 economy-wide comparison. Baseline: current law (the two-child limit on Universal Credit and Child Tax Credit is in place). Reform: remove the two-child limit, so families receive child elements for every eligible child regardless of how many.
-
-  Use the Enhanced FRS 2023-24 dataset: hf://policyengine/policyengine-uk-data-private/enhanced_frs_2023_24.h5.
-
-  Report from the model run:
-  (a) Government spending impact (benefit cost) in £ for 2026-27.
-  (b) Relative income change in % by decile (1-10) for 2026-27.
-  (c) Change in absolute child poverty BHC in % terms (e.g. -X.X%, signed).
-  (d) Change in the Gini coefficient (signed).
-  (e) For an illustrative household — two parents, three children aged 3, 5, 7, both earning at the median — the change in household net income in 2026-27.
-
-  British English. Every figure from the model run.
-
-chat_settings:
-  model_backend: uk_python
-  num_runs: 3
-
-reference:
-  fixture: pe_api/b5_two_child_limit.json
-  fields_to_compare:
-    # All from Vahid Ahmadi, "How removing the two-child benefit limit would affect the UK", Oct 2025.
-    # PolicyEngine reform ID 93219 against baseline 1, 2026-27.
-    - path: budget.cost_2026_27
-      expected_approx: 2_900_000_000        # £2.9bn in 2026-27 per PolicyEngine
-      tolerance_pct: 5.0
-    - path: decile.relative.d2
-      expected_approx: 2.59                 # D2 sees the largest gain — 2.59% by 2029-30 (use 2026-27 fixture value once generated)
-      tolerance_pct: 25.0                   # decile-2 figure is for 2029-30; 2026-27 will be smaller
-    - path: poverty.absolute_child_bhc.relative_change
-      expected_approx: -13.5                # -13.5% absolute child poverty BHC
-      tolerance_pct: 10.0
-    - path: inequality.gini.relative_change
-      expected_approx: -0.55                # Gini falls 0.55% in 2026-27
-      tolerance_pct: 15.0
-
-anchor:
-  must_mention:
-    - "Universal Credit"          # the primary benefit affected
-    - "Child Tax Credit"          # legacy benefit also affected
-    - "transitional protection"   # why caseload grows over time
-    - "lower deciles"             # where the gains concentrate
-  must_not_say:
-    - "all families gain"         # only ~4.4% of population gains anything; the rest are unaffected
-    - "the reform raises revenue" # it's a cost, not a revenue raiser
-    - "the largest gains are in higher deciles"  # opposite is true
-  ideal_finding: |
-    Removing the two-child limit costs roughly £2.9bn in 2026-27, rising as more
-    post-2017 children age into the affected population. About 4.4% of the
-    population gains, heavily concentrated in the bottom three deciles (D2 sees
-    the largest relative gain). Absolute child poverty BHC falls by ~13.5% in
-    2026-27. The Gini coefficient falls by ~0.55% — modest overall inequality
-    reduction reflecting how concentrated the gains are. The reform interacts
-    with UC tapers, so as employment income rises the marginal gain falls.

From 3ea1f806168c1d5be0670490ebae8cf1f3c57213 Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 27 May 2026 06:26:12 -0700
Subject: [PATCH 07/11] Grader: parse markdown tables for list-of-dicts
 extraction

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 evals/runner/grade.py | 131 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 110 insertions(+), 21 deletions(-)
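
Note (placed below the `---` line, so not part of the commit): a minimal standalone sketch of the table-first lookup this patch adds to the grader. It is a simplified illustration rather than the code in grade.py. The names `parse_md_tables`, `lookup` and `SAMPLE` are hypothetical, the sample table is made up, and column matching here is a plain lowercase substring test.

```python
from __future__ import annotations

import re


def parse_md_tables(answer: str) -> list[tuple[list[str], list[list[str]]]]:
    """Return (header, rows) for each pipe table that has a separator row."""
    tables = []
    lines = answer.splitlines()
    i = 0
    while i + 1 < len(lines):
        line = lines[i].strip()
        sep = lines[i + 1].strip()
        # A separator row contains only |, :, - and whitespace, with at least one -.
        is_sep = bool(re.fullmatch(r"[|:\-\s]+", sep)) and "-" in sep
        if line.startswith("|") and is_sep:
            header = [c.strip() for c in line.strip("|").split("|")]
            rows = []
            j = i + 2
            while j < len(lines) and lines[j].strip().startswith("|"):
                rows.append([c.strip() for c in lines[j].strip().strip("|").split("|")])
                j += 1
            tables.append((header, rows))
            i = j
            continue
        i += 1
    return tables


def lookup(answer: str, row_key: str, column_hint: str) -> float | None:
    """Find the number in the column matching `column_hint` on the row keyed by `row_key`."""
    for header, rows in parse_md_tables(answer):
        cols = [k for k, h in enumerate(header) if column_hint in h.lower()]
        if not cols:
            continue
        for row in rows:
            if row and row_key in row[0] and cols[0] < len(row):
                m = re.search(r"-?\d[\d,]*(?:\.\d+)?", row[cols[0]])
                if m:
                    return float(m.group(0).replace(",", ""))
    return None


# Made-up chat answer for illustration only.
SAMPLE = """\
| Gross income | IT MTR | NI MTR | Combined MTR |
|---|---|---|---|
| £10,000 | 0% | 0% | 0% |
| £60,000 | 40% | 2% | 42% |
"""

print(lookup(SAMPLE, "£60,000", "combined"))  # 42.0
```

Keying on the first cell of each row keeps the lookup cheap, at the cost of assuming the chat always puts the row label in the first column.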

diff --git a/evals/runner/grade.py b/evals/runner/grade.py
index a7ae83a..30f3776 100644
--- a/evals/runner/grade.py
+++ b/evals/runner/grade.py
@@ -551,45 +551,134 @@ def _diff_list_of_dicts(
     return diffs
 
 
-def _extract_row_value(answer: str, key_value: Any, compare_field: str) -> float | None:
-    """Find the chat's reported value for a row keyed by `key_value`.
-
-    Looks for the key (formatted as £-amount if numeric and large) in the
-    chat answer, then pulls the first number from the surrounding line. The
-    `compare_field` name is mostly used to bias which number to pick when
-    multiple are present — for MTR fields we prefer the last % on the line.
-    """
-    # Format the key as it's likely to appear in chat prose.
-    if isinstance(key_value, (int, float)):
-        candidates = [
+def _key_candidates(key_value: Any) -> list[str]:
+    """Plausible string renderings of a row key the chat might use."""
+    if isinstance(key_value, (int, float)) and abs(key_value) >= 1000:
+        return [
             f"£{int(key_value):,}",       # £10,000
             f"£{int(key_value)/1000:g}k", # £10k
             f"{int(key_value):,}",
             str(int(key_value)),
         ]
-    else:
-        candidates = [str(key_value)]
+    if isinstance(key_value, (int, float)):
+        return [str(int(key_value)), f"£{int(key_value)}"]
+    return [str(key_value)]
+
+
+def _parse_md_table(answer: str) -> list[tuple[list[str], list[list[str]]]]:
+    """Find markdown tables in the answer. Returns a list of (header, rows).
+
+    A markdown table is consecutive lines starting with `|`, with the second
+    line being a separator (`|---|---|`). Headers and cells are stripped.
+    """
+    tables = []
+    lines = answer.splitlines()
+    i = 0
+    while i < len(lines):
+        line = lines[i].strip()
+        # A row starts with `|` and has at least 2 `|` (one row, multiple cells).
+        if line.startswith("|") and line.count("|") >= 2 and i + 1 < len(lines):
+            sep = lines[i + 1].strip()
+            # Separator row has only -, :, |, and whitespace
+            if re.fullmatch(r"[|:\-\s]+", sep) and "-" in sep:
+                # We're at a header
+                header = [c.strip() for c in line.strip("|").split("|")]
+                rows = []
+                j = i + 2
+                while j < len(lines) and lines[j].strip().startswith("|"):
+                    row = [c.strip() for c in lines[j].strip().strip("|").split("|")]
+                    rows.append(row)
+                    j += 1
+                tables.append((header, rows))
+                i = j
+                continue
+        i += 1
+    return tables
+
+
+def _normalise_header(s: str) -> str:
+    """Strip markdown emphasis/units/parens for column-name matching."""
+    s = re.sub(r"[*_`]", "", s)
+    s = re.sub(r"\([^)]*\)", "", s)  # drop "(%)" etc
+    return re.sub(r"\s+", " ", s).strip().lower()
+
+
+def _parse_cell_number(cell: str) -> float | None:
+    """Extract a number from a single markdown-table cell."""
+    cell = cell.strip()
+    if not cell or cell in {"-", "—", "n/a", "na"}:
+        return None
+    m = re.search(r"-?\s*\d[\d,]*(?:\.\d+)?", cell)
+    if not m:
+        return None
+    try:
+        n = float(m.group(0).replace(",", "").replace(" ", ""))
+    except ValueError:
+        return None
+    # Scale by a bn/m unit suffix if present ("£6.9bn", "6.9 billion"). Most
+    # cells are unitless rates or £-amounts already in the natural unit.
+    if re.search(r"(?<![a-z])bn\b|billion", cell, re.IGNORECASE):
+        n *= 1_000_000_000
+    elif re.search(r"(?<![a-z])mn?\b|million", cell, re.IGNORECASE):
+        n *= 1_000_000
+    return n
+
+
+# Heuristic mapping from compare_field name → header substring(s) we look for.
+# Header matching uses normalised form (lowercase, punctuation/units stripped).
+COLUMN_HEADER_HINTS = {
+    "combined_mtr": ("combined",),
+    "it_mtr":       ("income tax", "it "),
+    "ni_mtr":       ("ni ", "national insurance"),
+    "rate":         ("rate",),
+}
+
 
+def _extract_row_value(answer: str, key_value: Any, compare_field: str) -> float | None:
+    """Find the chat's reported value for `compare_field` on the row keyed by
+    `key_value`. Tries markdown tables first (most chat tabular output), falls
+    back to prose-line scanning."""
+    candidates = _key_candidates(key_value)
+
+    # ----- 1. Markdown table path -----
+    field_hints = COLUMN_HEADER_HINTS.get(compare_field, (compare_field.replace("_", " "),))
+    for header, rows in _parse_md_table(answer):
+        norm = [_normalise_header(h) for h in header]
+        # Find the column whose header contains any of the hints.
+        col_idx = None
+        for i, nh in enumerate(norm):
+            if any(h in nh for h in field_hints):
+                col_idx = i
+                break
+        if col_idx is None:
+            continue
+        # Find the row containing the key.
+        for row in rows:
+            if not row:
+                continue
+            row_label = row[0]
+            if any(c in row_label for c in candidates):
+                if col_idx < len(row):
+                    value = _parse_cell_number(row[col_idx])
+                    if value is not None:
+                        return value
+                break
+
+    # ----- 2. Prose-line fallback -----
     for needle in candidates:
         idx = answer.find(needle)
         if idx == -1:
             continue
-        # Look at the surrounding line: from previous newline to next newline.
         line_start = answer.rfind("\n", 0, idx) + 1
         line_end_search = answer.find("\n", idx)
         line_end = line_end_search if line_end_search != -1 else len(answer)
         line = answer[line_start:line_end]
 
-        # For MTR-style fields, prefer "combined" % if multiple appear, else
-        # the last % on the line.
         if "mtr" in compare_field.lower() or "rate" in compare_field.lower():
-            # Pull all percentages from the line.
-            pcts = re.findall(r"(\d+(?:\.\d+)?)\s*%", line)
+            pcts = re.findall(r"(-?\d+(?:\.\d+)?)\s*%", line)
             if pcts:
-                if "combined" in compare_field.lower():
-                    return float(pcts[-1])  # combined usually last in the row
+                # Take the last % on the line; "combined" conventionally comes last.
                 return float(pcts[-1])
-        # Otherwise, take the first scaled or numeric value after the needle.
         n = parse_number_near(line, re.escape(needle))
         if n is not None:
             return n

From 2161c99e3fe87596803c6f43980ef07764822406 Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 27 May 2026 08:39:57 -0700
Subject: [PATCH 08/11] Runner: add --concurrency for parallel scenario runs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 evals/runner/run.py | 86 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 65 insertions(+), 21 deletions(-)
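
Note (placed below the `---` line, so not part of the commit): a minimal standalone sketch of the bounded thread-pool pattern used for `--concurrency`, with a lock around shared state. The names `run_jobs` and `fake_run` are hypothetical stand-ins, and `fake_run` makes no HTTP requests. The sketch calls `future.result()` so that an exception raised in a worker surfaces in the main thread; treat that as one reasonable choice, not a statement of what the runner must do.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_jobs(jobs, worker, concurrency=1):
    """Run `worker(job)` for every job, sequentially or in a bounded thread pool."""
    results = []
    lock = threading.Lock()  # guards the shared results list

    def _execute(job):
        meta = worker(job)
        with lock:
            results.append(meta)
        return meta

    if concurrency <= 1:
        # Sequential path: predictable ordering, easier to debug.
        for job in jobs:
            _execute(job)
    else:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = [pool.submit(_execute, job) for job in jobs]
            for future in as_completed(futures):
                future.result()  # re-raises any exception from the worker
    return results


def fake_run(job):
    """Hypothetical stand-in for a single scenario run."""
    scenario_id, run_index = job
    return {"scenario_id": scenario_id, "run_index": run_index}


# Flat list of (scenario, run_index) jobs, three runs per scenario.
jobs = [(s, i) for s in ("a1", "b1") for i in range(1, 4)]
results = run_jobs(jobs, fake_run, concurrency=4)
print(len(results))  # 6
```

Completion order is not deterministic when `concurrency > 1`, so anything that depends on ordering should sort by scenario and run index afterwards.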

diff --git a/evals/runner/run.py b/evals/runner/run.py
index e71ad2e..c605c8a 100644
--- a/evals/runner/run.py
+++ b/evals/runner/run.py
@@ -28,6 +28,8 @@
 import os
 import re
 import sys
+import threading
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any
@@ -243,6 +245,7 @@ def run_all(
     scenarios: list[Scenario],
     backend_url: str,
     dry_run: bool,
+    concurrency: int = 1,
 ) -> Path | None:
     if dry_run:
         print(f"DRY RUN (backend: {backend_url})\n")
@@ -253,6 +256,8 @@ def run_all(
             )
         total = sum(s.num_runs for s in scenarios)
         print(f"\nWould execute {total} requests across {len(scenarios)} scenarios.")
+        if concurrency > 1:
+            print(f"(would run with concurrency={concurrency})")
         return None
 
     run_dir = make_run_dir()
@@ -262,30 +267,40 @@ def run_all(
         "started_at": dt.datetime.now(dt.timezone.utc).isoformat(),
         "backend_url": backend_url,
         "bypass_token_set": BYPASS_TOKEN is not None,
+        "concurrency": concurrency,
         "scenarios": [s.id for s in scenarios],
         "runs": [],
     }
     manifest_path = run_dir / "manifest.json"
     manifest_path.write_text(json.dumps(manifest, indent=2))
+    manifest_lock = threading.Lock()
+    print_lock = threading.Lock()
 
-    print(f"Run output: {run_dir}\n")
+    print(f"Run output: {run_dir}")
+    print(f"Concurrency: {concurrency}\n")
 
+    # Pre-create scenario dirs and freeze scenario YAMLs so workers don't race.
     for scenario in scenarios:
         scenario_dir = run_dir / scenario.id
-        # Save a copy of the scenario YAML alongside the runs so we can
-        # tell, months later, exactly what prompt/anchor was in effect.
         scenario_dir.mkdir(parents=True, exist_ok=True)
         (scenario_dir / "scenario.yaml").write_text(yaml.safe_dump(scenario.raw))
 
-        print(f"=== {scenario.id} ({scenario.test}, {scenario.num_runs} runs) ===")
-        for i in range(1, scenario.num_runs + 1):
-            print(f"  run {i}/{scenario.num_runs}...", end=" ", flush=True)
-            meta = run_single(
-                scenario=scenario,
-                backend_url=backend_url,
-                run_index=i,
-                out_dir=scenario_dir,
-            )
+    # Build a flat list of (scenario, run_index) jobs across all scenarios.
+    jobs: list[tuple[Scenario, int]] = [
+        (s, i) for s in scenarios for i in range(1, s.num_runs + 1)
+    ]
+    total_jobs = len(jobs)
+
+    def _execute(job: tuple[Scenario, int]) -> dict[str, Any]:
+        scenario, run_index = job
+        scenario_dir = run_dir / scenario.id
+        meta = run_single(
+            scenario=scenario,
+            backend_url=backend_url,
+            run_index=run_index,
+            out_dir=scenario_dir,
+        )
+        with print_lock:
             elapsed = meta["elapsed_seconds"]
             err = meta["http_error"]
             tools = meta["summary"]["tool_call_count"]
@@ -294,22 +309,40 @@ def run_all(
                 f"ERR ({err})" if err
                 else f"ok  {chars} chars, {tools} tool calls, {elapsed}s"
             )
-            print(status)
-
+            print(f"  [{scenario.id} run {run_index}/{scenario.num_runs}] {status}")
+        with manifest_lock:
             manifest["runs"].append({
                 "scenario_id": scenario.id,
-                "run_index": i,
-                "elapsed_seconds": elapsed,
-                "http_error": err,
-                "tool_call_count": tools,
-                "answer_length_chars": chars,
+                "run_index": run_index,
+                "elapsed_seconds": meta["elapsed_seconds"],
+                "http_error": meta["http_error"],
+                "tool_call_count": meta["summary"]["tool_call_count"],
+                "answer_length_chars": meta["summary"]["answer_length_chars"],
             })
             manifest_path.write_text(json.dumps(manifest, indent=2))
-        print()
+        return meta
+
+    if concurrency <= 1:
+        # Sequential — preserves prior behaviour and predictable per-scenario log order.
+        for scenario in scenarios:
+            with print_lock:
+                print(f"=== {scenario.id} ({scenario.test}, {scenario.num_runs} runs) ===")
+            for i in range(1, scenario.num_runs + 1):
+                _execute((scenario, i))
+            with print_lock:
+                print()
+    else:
+        # Bounded-concurrency thread pool — each worker holds its own httpx
+        # client inside run_single, so they're independent.
+        with ThreadPoolExecutor(max_workers=concurrency) as pool:
+            futures = [pool.submit(_execute, job) for job in jobs]
+            print(f"submitted {total_jobs} jobs to a pool of {concurrency} workers...\n")
+            for future in as_completed(futures):
+                future.result()  # surface worker exceptions; progress is printed by _execute
 
     manifest["finished_at"] = dt.datetime.now(dt.timezone.utc).isoformat()
     manifest_path.write_text(json.dumps(manifest, indent=2))
-    print(f"Done. Logs in {run_dir}")
+    print(f"\nDone. Logs in {run_dir}")
     return run_dir
 
 
@@ -330,8 +363,18 @@ def main() -> int:
         action="store_true",
         help="Show what would run without making any requests.",
     )
+    parser.add_argument(
+        "--concurrency",
+        type=int,
+        default=1,
+        help="Number of conversations to run in parallel. Default 1 (sequential). "
+             "Useful values: 1 (debug, clean logs), 4-6 (full eval, ~3-4× speedup).",
+    )
     args = parser.parse_args()
 
+    if args.concurrency < 1:
+        parser.error("--concurrency must be >= 1")
+
     scenarios = load_scenarios(args.scenario_ids or None)
     if not scenarios:
         print("No scenarios matched.", file=sys.stderr)
@@ -341,6 +384,7 @@ def main() -> int:
         scenarios=scenarios,
         backend_url=args.backend_url,
         dry_run=args.dry_run,
+        concurrency=args.concurrency,
     )
     return 0
 

From e63bfa6126115bded12a533318ca00f827f92b7b Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 27 May 2026 08:42:50 -0700
Subject: [PATCH 09/11] Runner: capture per-tool routing in manifest +
 tool_usage aggregator

- summarise_events() now extracts tool_call_sequence,
  tool_call_counts_by_name, and tool_failure_count from the SSE stream.
- run_all() surfaces these in each manifest row so you don't have to
  re-read per-run meta.json to see what Claude called.
- New tool_usage.py prints a per-scenario tool-routing table from a
  finished run's manifest. Accepts one or more run dirs for A/B
  comparison.

The point: when we register a new typed tool (calculate_household etc.),
we need to see whether Claude actually picked it vs falling back to
run_python. Reading 60 SSE logs by hand doesn't scale.
---
 evals/runner/run.py        | 22 ++++++++++++++
 evals/runner/tool_usage.py | 62 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)
 create mode 100644 evals/runner/tool_usage.py
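
Note (placed below the `---` line, so not part of the commit): a minimal standalone sketch of the per-tool routing summary described above. The event keys (`type`, `tool_name`, `status`) follow the diff below. The function name `tool_routing` and the sample event stream are hypothetical, and real SSE events may carry more fields.

```python
from collections import Counter
from typing import Any


def tool_routing(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarise which tools were called, in what order, and how many failed."""
    sequence = [
        e.get("tool_name", "?") for e in events if e.get("type") == "tool_start"
    ]
    failures = [
        e for e in events
        if e.get("type") == "tool_result" and e.get("status") == "error"
    ]
    return {
        "tool_call_sequence": sequence,
        "tool_call_counts_by_name": dict(Counter(sequence)),
        "tool_failure_count": len(failures),
    }


# Made-up event stream for illustration only.
EVENTS = [
    {"type": "tool_start", "tool_name": "calculate_household"},
    {"type": "tool_result", "status": "ok"},
    {"type": "tool_start", "tool_name": "run_python"},
    {"type": "tool_result", "status": "error"},
    {"type": "tool_start", "tool_name": "run_python"},
    {"type": "tool_result", "status": "ok"},
    {"type": "done"},
]

summary = tool_routing(EVENTS)
print(summary["tool_call_counts_by_name"])
```

Keeping the ordered sequence alongside the counts makes it possible to see whether a typed tool was tried first and `run_python` used only as a fallback.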

diff --git a/evals/runner/run.py b/evals/runner/run.py
index c605c8a..7efeaf1 100644
--- a/evals/runner/run.py
+++ b/evals/runner/run.py
@@ -157,11 +157,30 @@ def summarise_events(events: list[dict[str, Any]]) -> dict[str, Any]:
 
     errors = [e for e in events if e.get("type") == "error"]
 
+    # Per-tool routing: which tools did Claude pick, in what order, how often.
+    # Used to answer "did registering a new typed tool actually shift Claude's
+    # behaviour, or did it stick with run_python?" — visible in the manifest
+    # without re-reading every SSE log.
+    tool_call_sequence = [
+        e.get("tool_name", "?") for e in events if e.get("type") == "tool_start"
+    ]
+    tool_call_counts_by_name: dict[str, int] = {}
+    for name in tool_call_sequence:
+        tool_call_counts_by_name[name] = tool_call_counts_by_name.get(name, 0) + 1
+
+    tool_failures = [
+        e for e in events
+        if e.get("type") == "tool_result" and e.get("status") == "error"
+    ]
+
     return {
         "event_counts": counts,
         "answer_text": full_text,
         "answer_length_chars": len(full_text),
         "tool_call_count": counts.get("tool_use", 0),
+        "tool_call_sequence": tool_call_sequence,
+        "tool_call_counts_by_name": tool_call_counts_by_name,
+        "tool_failure_count": len(tool_failures),
         "completed": done is not None,
         "error_count": len(errors),
         "errors": errors,
@@ -317,6 +336,9 @@ def _execute(job: tuple[Scenario, int]) -> dict[str, Any]:
                 "elapsed_seconds": meta["elapsed_seconds"],
                 "http_error": meta["http_error"],
                 "tool_call_count": meta["summary"]["tool_call_count"],
+                "tool_call_counts_by_name": meta["summary"]["tool_call_counts_by_name"],
+                "tool_call_sequence": meta["summary"]["tool_call_sequence"],
+                "tool_failure_count": meta["summary"]["tool_failure_count"],
                 "answer_length_chars": meta["summary"]["answer_length_chars"],
             })
             manifest_path.write_text(json.dumps(manifest, indent=2))
diff --git a/evals/runner/tool_usage.py b/evals/runner/tool_usage.py
new file mode 100644
index 0000000..b363c22
--- /dev/null
+++ b/evals/runner/tool_usage.py
@@ -0,0 +1,62 @@
+#!/usr/bin/env python3
+"""Per-scenario tool-routing table for a finished eval run.
+
+Reads `runs/<timestamp>/manifest.json` and prints which tool Claude called
+how many times on each scenario. Useful for A/B-comparing tool-surface
+changes: e.g. did registering `calculate_household` actually shift
+household-shaped questions away from `run_python`?
+
+Usage:
+    python evals/runner/tool_usage.py runs/2026-05-27_120000
+    python evals/runner/tool_usage.py runs/<ts1> runs/<ts2>  # one table per run
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+
+def load_run(run_dir: Path) -> dict[str, Counter]:
+    manifest = json.loads((run_dir / "manifest.json").read_text())
+    by_scenario: dict[str, Counter] = {}
+    for r in manifest["runs"]:
+        c = by_scenario.setdefault(r["scenario_id"], Counter())
+        c.update(r.get("tool_call_counts_by_name") or {})
+    return by_scenario
+
+
+def print_table(by_scenario: dict[str, Counter], label: str) -> None:
+    all_tools = sorted({t for c in by_scenario.values() for t in c})
+    if not all_tools:
+        print(f"{label}: no tool calls recorded")
+        return
+    width = max(22, max(len(t) for t in all_tools) + 2)
+    header = f"{'scenario':30}" + "".join(f"  {t:>{width}}" for t in all_tools)
+    print(f"\n=== {label} ===")
+    print(header)
+    print("-" * len(header))
+    for sid in sorted(by_scenario):
+        row = f"{sid:30}" + "".join(
+            f"  {by_scenario[sid][t]:{width}}" for t in all_tools
+        )
+        print(row)
+
+
+def main(argv: list[str]) -> int:
+    if not argv:
+        print(__doc__)
+        return 1
+    for path_str in argv:
+        run_dir = Path(path_str)
+        if not (run_dir / "manifest.json").exists():
+            print(f"skip {run_dir}: no manifest.json", file=sys.stderr)
+            continue
+        print_table(load_run(run_dir), label=str(run_dir))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))

From 5c715fbc7e54a4e9d46a2a7f8ac0e00538fc81e8 Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 27 May 2026 08:49:20 -0700
Subject: [PATCH 10/11] Add 2026-05-27 eval results writeup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 evals/RESULTS-2026-05-27.md | 124 ++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 evals/RESULTS-2026-05-27.md
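
Reviewer note (not part of the patch): the "self-consistency SD < 0.5% of mean" threshold in the writeup can be read as a coefficient-of-variation check per extracted field. The sketch below assumes a population standard deviation; `SPEC.md` may define it differently, and both the helper name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def sd_pct_of_mean(values: list[float]) -> float:
    """Standard deviation of one field across runs, as a percentage of its mean.

    Assumes population SD (pstdev); this is an assumption, not the spec.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; SD as a percentage of mean is undefined")
    return 100 * pstdev(values) / abs(m)


# Hypothetical field values from three runs: one run returns a stray
# extraction (2.0) where the other two agree.
runs = [1000.0, 1000.0, 2.0]

violates = sd_pct_of_mean(runs) >= 0.5
print(f"violates threshold: {violates}")
```

A single mis-extracted value is enough to push a field far past the threshold, which is why extractor artefacts and real chat variance are hard to separate from this metric alone.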

diff --git a/evals/RESULTS-2026-05-27.md b/evals/RESULTS-2026-05-27.md
new file mode 100644
index 0000000..51851ea
--- /dev/null
+++ b/evals/RESULTS-2026-05-27.md
@@ -0,0 +1,124 @@
+# uk-chat eval — first run results
+
+**Run date:** 2026-05-27
+**Scenarios:** 9 (5 Test A, 4 Test B after dropping B5 for baseline drift)
+**Runs per scenario:** 3 fresh sessions
+**Total conversations:** 27
+**Backend under test:** `policyengine--peukchat-feat-model-backend-selector-web.modal.run` (PR 51 preview)
+**`model_backend`:** `uk_python` (`policyengine_uk 2.88.20`)
+
+## Headline
+
+**Test B fails decisively. Test A pending human grading but smoke output is qualitatively mixed.**
+
+| Test | Threshold | Result | Verdict |
+|---|---|---|---|
+| B field accuracy | ≥ 95% within 1-15% tol | **75%** | ❌ FAIL |
+| B self-consistency SD | < 0.5% of mean | 2 violations | ❌ FAIL |
+| B methodology drift | ≤ 1 of 4 scenarios | 0 | ✓ |
+| B failure rate | < 10% | **67%** | ❌ FAIL hard |
+| A mean rubric score | ≥ 4.0 | pending | — |
+| A trust-killer < 2 | none allowed | pending | — |
+| A fabrication rate | ≤ 20% | pending | — |
+
+## The clearest finding
+
+**The deployed chat cannot reliably complete population-level reform questions within its 10-minute timeout.**
+
+- B1 (society-wide PA reform): **3 of 3 runs hit the 600s HTTP timeout.** Failure rate 100%.
+- B2 (NI/IT stacked layers): **2 of 3 runs timed out** at ~600s; the surviving run took 547s.
+- B3 (household calc, no microdata): all 3 runs completed, in 213-365s.
+- B4 (MTR schedule, no microdata): completed in 76-240s. Tool counts 8-12.
+
+**Both B scenarios that *don't* need microdata complete (B3, B4). Both B scenarios that *do* need microdata fail (B1, B2).** Pattern: when the chat has to load EFRS and run a microsim through `policyengine_uk` by writing arbitrary Python, it runs out of time long before producing an answer.
+
+This isn't methodology drift, hallucination, or wrong numbers: the **chat literally couldn't finish**. Each of these requests ran for about 600s and made 12-34 tool calls before the Modal worker hung up.
+
+## Other findings worth surfacing
+
+**A4 out-of-scope is the clean win.** All 3 runs refused cleanly in 7-9 seconds with 0-1 tool calls. Of all 9 scenarios, only A4 looked unambiguously correct on every run.
+
+**Variable-name drift confirmed.** B3 had three runs:
+- Run 1: 4/4 fields correct
+- Run 2: 2/4 correct (national_insurance came back as 2.00 — likely the chat reporting 2% NI rate where we expected the £-amount; extractor edge case OR chat output drift)
+- Run 3: 3/4 correct (`marginal_tax_rate` extracted as `2025.00` — almost certainly the chat reporting "2025-26 fiscal year" being grabbed as the rate)
+
+B3 self-consistency SD: 166% (driven by these extraction artefacts on top of any real chat variance).
+
+This mirrors the **B4 smoke-test finding** where one of three runs used `gov_tax` and `ni_class_1_employee_primary` instead of `income_tax` and `national_insurance`, silently producing wrong numbers without errors. Same class of failure: Claude picks marginally different code each run, sometimes wrong, no errors thrown.
+
+**A5 (factual lookup) timed out once.** Should be a parameter-tree lookup — instead 23 tool calls, 600s, dead. The chat tried to *simulate* what should have been a one-line `get_parameter` call. Confirms the anchor's warning: "could waste tool calls trying to 'calculate' what's just a parameter lookup."
+
+**A2 (subset slice) and A3 (multi-param what-if) thrash heavily.** 17-33 tool calls per run, 96-449s for A2, 321-610s for A3. One A3 run timed out. These are the population-level scenarios with subset filtering and reform composition — the exact shapes prior memories flagged as Claude's weakest area.
+
+## Per-scenario summary
+
+| Scenario | Runs | Errors | Median time | Median tool calls | Notes |
+|---|---|---|---|---|---|
+| a1_mechanism | 3 | 0 | 30s | 2 | Substantive prose, varies on whether the £100k PA-taper mechanism is identified |
+| a2_subset_slice | 3 | 0 | 428s | 25 | Heavy thrashing; chat searched for filtering API |
+| a3_multiparam_what_if | 3 | 1 | 333s | 26 | One timeout; the surviving runs are long |
+| a4_out_of_scope | 3 | 0 | 8s | 0 | Clean refusals every time |
+| a5_factual_lookup | 3 | 1 | 56s | 8 | Surprising timeout; chat tried to simulate vs look up |
+| b1_society_wide_pa | 3 | 3 | 600s | 28 | All 3 timed out |
+| b2_ni_it_stacked | 3 | 2 | 601s | 25 | 2 of 3 timed out |
+| b3_household_calc | 3 | 0 | 266s | 26 | Completes but mixed extractor + content correctness |
+| b4_mtr_schedule | 3 | 0 | 215s | 12 | 2/3 runs correct, 1 used wrong variable names |
+
+## Cost
+
+Combined over both runs (27 conversations):
+
+| Metric | Value |
+|---|---|
+| Input tokens | 447,265 |
+| Output tokens | 171,620 |
+| Cache reads | 8,288,775 (≈free) |
+| Cache creates | 179,344 |
+| Wall time (full 21-run) | 33 minutes at concurrency=4 |
+| Wall time (smoke 6-run) | 10 minutes sequential |
+| User-balance billed | £0 (runner doesn't send `user_id`) |
+| Org Anthropic-account billed | ≈$2-3 estimated |
+
+## Interpretation against pre-committed positioning
+
+**Per `SPEC.md`:**
+
+> "If thresholds met: alternative positioning is viable. If not: supplement-only is the right framing."
+
+Test B failed on three of four pre-committed thresholds, including failing-by-67% on the simplest one (failure rate). **The alternative positioning is not viable** with the current chat. The chat cannot reliably produce the numbers PolicyEngine reports already produce, primarily because microdata-bound runs don't fit within the deployed worker's time budget.
+
+Test A is not yet graded — we have the 15 raw responses but need to fill in the rubric sheet. Smoke observation: A4 looks clean, A1 has variable quality, A2/A3 thrash but did complete, A5 was unexpectedly slow. Final verdict on supplement positioning pending the grading pass.
+
+## What this implies — and what the data points at as a fix
+
+The 27-run dataset turns the prior "structured tools would help" hypothesis (memory: `project_structured_tools_evidence.md`) into a quantified claim:
+
+- Every B scenario that ran free-form Python through `policyengine_uk` either timed out or produced wrong numbers in at least one of three runs.
+- B3 and B4 (small situations) completed but still showed 1-in-3-ish content drift from variable-name selection.
+- A4 (refusal) had zero variance — no tool use to drift on.
+
+**Three concrete next steps:**
+
+1. **Add structured tools** for the common shapes — `calculate_household`, `run_economy_comparison`, `get_baseline_parameters` — so the LLM fills in JSON forms instead of writing variable-picking Python. The dormant tools in `agent_tools.py` cover most of this.
+2. **If structured tools are deferred**, position chat as **supplement only**. The "ask a follow-up" drawer integration in `policyengine-app-v2#1036` is already aligned with this framing.
+3. **Re-run the eval after structured-tool changes** to measure the lift. The harness and fixtures are reusable.
+
+## Caveats and limits of this run
+
+- **n=3 per scenario is statistically thin.** Findings are directionally strong (especially the timeout pattern) but individual numbers shouldn't be over-interpreted.
+- **B3 extractor showed false negatives** on prose-embedded numbers. Some "failures" may be extractor weakness rather than chat error — worth manually reviewing before publishing the per-field accuracy numbers.
+- **Test A grading is human-only and not yet done.** This writeup is half the picture.
+- **Vahid-published figures couldn't be the fixture** for B2's full set because `policyengine_uk 2.88.20` baseline has incorporated the Autumn Budget 2025 changes; B5 was dropped entirely for the same reason. See `drift_report.md` alongside the fixtures.
+
+## Artifacts
+
+Everything reproducible from this repo:
+
+- Scenarios: `evals/scenarios/*.yaml`
+- Fixtures + drift report: `evals/fixtures/`
+- Run logs (SSE + extracted text + meta JSON):
+  - `evals/runs/2026-05-27_060403/` (smoke: A1 + B4)
+  - `evals/runs/2026-05-27_075425/` (full: 7 remaining)
+- Grader output: `evals/runs/<dir>/B_results.md`, `A_grading.md`
+- This writeup: `evals/RESULTS-2026-05-27.md`

From 3c11d7b2214c9ad4d9b9f835310d5ebc107d66cf Mon Sep 17 00:00:00 2001
From: SakshiKekre <sakshi.kekre@gmail.com>
Date: Wed, 27 May 2026 08:52:50 -0700
Subject: [PATCH 11/11] Add Test A grading aggregates to writeup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 evals/RESULTS-2026-05-27.md | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)
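
Reviewer note (not part of the patch): the Test A aggregates added here can be checked against the per-scenario table. The sketch assumes equal weighting, which holds because every scenario has 3 runs × 5 dimensions; it works from the rounded per-scenario means, so it is a consistency check, not a recomputation from the raw rubric sheet.

```python
from statistics import mean

# Per-scenario means as reported in the "Test A by the numbers" table.
scenario_means = {
    "a4_out_of_scope": 5.00,
    "a1_mechanism": 2.93,
    "a2_subset_slice": 2.93,
    "a5_factual_lookup": 2.67,
    "a3_multiparam_what_if": 1.93,
}

# Equal weights: each scenario contributes 3 runs x 5 dimensions.
overall = mean(list(scenario_means.values()))

# 4 of the 15 Test A responses were graded as containing fabrication.
fabrication_rate = 4 / 15

print(f"mean rubric score: {overall:.2f}")
print(f"fabrication rate: {fabrication_rate:.0%}")
```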

diff --git a/evals/RESULTS-2026-05-27.md b/evals/RESULTS-2026-05-27.md
index 51851ea..aecc27c 100644
--- a/evals/RESULTS-2026-05-27.md
+++ b/evals/RESULTS-2026-05-27.md
@@ -9,7 +9,7 @@
 
 ## Headline
 
-**Test B fails decisively. Test A pending human grading but smoke output is qualitatively mixed.**
+**Both tests fail. Test B fails on timeouts and accuracy; Test A fails on rubric quality, trust-killers, and fabrication rate.**
 
 | Test | Threshold | Result | Verdict |
 |---|---|---|---|
@@ -17,9 +17,9 @@
 | B self-consistency SD | < 0.5% of mean | 2 violations | ❌ FAIL |
 | B methodology drift | ≤ 1 of 4 scenarios | 0 | ✓ |
 | B failure rate | < 10% | **67%** | ❌ FAIL hard |
-| A mean rubric score | ≥ 4.0 | pending | — |
-| A trust-killer < 2 | none allowed | pending | — |
-| A fabrication rate | ≤ 20% | pending | — |
+| A mean rubric score | ≥ 4.0 | **3.09** | ❌ FAIL |
+| A trust-killer < 2 | none allowed | **10 dim-scores < 2** across 4 responses | ❌ FAIL |
+| A fabrication rate | ≤ 20% | **27%** (4 of 15) | ❌ FAIL |
 
 ## The clearest finding
 
@@ -86,9 +86,23 @@ Combined over both runs (27 conversations):
 
 > "If thresholds met: alternative positioning is viable. If not: supplement-only is the right framing."
 
-Test B failed on three of four pre-committed thresholds, including failing-by-67% on the simplest one (failure rate). **The alternative positioning is not viable** with the current chat. The chat cannot reliably produce the numbers PolicyEngine reports already produce, primarily because microdata-bound runs don't fit within the deployed worker's time budget.
+Test B failed on three of four pre-committed thresholds, including the simplest one, failure rate (67% against a threshold of < 10%). Test A failed on all three thresholds (mean 3.09 vs 4.0; 4 of 15 responses fabricated; 10 dimension-scores below the trust-killer line, concentrated in A3 and A5 run 2).
 
-Test A is not yet graded — we have the 15 raw responses but need to fill in the rubric sheet. Smoke observation: A4 looks clean, A1 has variable quality, A2/A3 thrash but did complete, A5 was unexpectedly slow. Final verdict on supplement positioning pending the grading pass.
+**Both the alternative and the supplement positioning fall short of the pre-committed thresholds.** Alternative positioning is clearly off the table. Supplement positioning is still defensible only because A4 (out-of-scope refusal) and A1 (mechanism explanation) score acceptably; those are the question shapes the chat *should* own in v1.
+
+## Test A by the numbers
+
+| Scenario | Mean (across 5 dims × 3 runs) | Notable |
+|---|---|---|
+| **a4_out_of_scope** | **5.00** | All 3 runs clean refusals; only scenario with zero trust-killers and zero fabrication |
+| a1_mechanism | 2.93 | Run 1 misses PA-taper entirely; run 2 places taper at £125,140 not £100k; run 3 names taper correctly but contradicts itself on D10 income range |
+| a2_subset_slice | 2.93 | Run 1 produces a table (D1 gain of £159 implausibly high); run 2 flat distribution that's qualitatively wrong; run 3 gives up after 17 tool calls |
+| a5_factual_lookup | 2.67 | Run 1 supplied values "from official records" (training-data fallback after the chat admitted it couldn't query the parameter tree); run 2 timed out at 600s with no answer; run 3 returned a clean table, also without showing the parameter path |
+| **a3_multiparam_what_if** | **1.93** | Run 1 truncated mid-tool-use, no answer. Run 2 used synthetic households and quoted £0.66m / £1.08m budgetary impacts — off by ~4 orders of magnitude. Run 3 got the directional conclusion right but quoted £6-7bn vs reality ~£20bn for PA-only |
+
+**Where the fabrication concentrates:** A3 runs 2 and 3 (made-up budgetary magnitudes), A5 run 1 (training-data values dressed up as model output), and A1 run 1 (invented £972/£1,094 saving figures that don't reflect the actual PA-raise arithmetic).
+
+**Pattern.** The chat is reliable for two shapes only: clean refusals (A4) and short mechanism prose where it doesn't have to run anything (A1, qualitatively, though even there it gets details wrong). Anything requiring it to produce a number from the engine either thrashes, times out, or fabricates.
 
 ## What this implies — and what the data points at as a fix
 
@@ -108,7 +122,7 @@ The 27-run dataset turns the prior "structured tools would help" hypothesis (mem
 
 - **n=3 per scenario is statistically thin.** Findings are directionally strong (especially the timeout pattern) but individual numbers shouldn't be over-interpreted.
 - **B3 extractor showed false negatives** on prose-embedded numbers. Some "failures" may be extractor weakness rather than chat error — worth manually reviewing before publishing the per-field accuracy numbers.
-- **Test A grading is human-only and not yet done.** This writeup is half the picture.
+- **Test A grading is one grader's judgement.** A second grader on the same 15 responses would shift individual scores; the aggregate verdict (fail on all three thresholds) is robust to ±1 per dimension but the per-scenario means are not.
 - **Vahid-published figures couldn't be the fixture** for B2's full set because `policyengine_uk 2.88.20` baseline has incorporated the Autumn Budget 2025 changes; B5 was dropped entirely for the same reason. See `drift_report.md` alongside the fixtures.
 
 ## Artifacts