PolicyEngine · SakshiKekre · May 15, 2026 · May 19, 2026 · May 20, 2026 · May 20, 2026
diff --git a/.gitignore b/.gitignore
@@ -4,10 +4,14 @@ __pycache__/
 .next/
 node_modules/
 .venv/
+.venv-*/
 *.egg-info/
 dist/
 
 # Generated at image build time by backend/scripts/build_reference.py.
 # Both backend/Dockerfile and modal_app.py regenerate this against the
 # installed policyengine-uk-compiled version, so it never lives in git.
 backend/reference.md
+
+# Eval run outputs — generated per run, not source
+evals/runs/
diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,88 @@
+# uk-chat evals
+
+Evaluation harness for the chat. Tests two positionings — **supplement** to app-v2 reports and **alternative** to them — against pre-committed thresholds.
+
+The full design (rationale, thresholds, scenario descriptions) lives in [SPEC.md](./SPEC.md). This README covers structure and how to extend.
+
+## What's here today
+
+```
+evals/
+  SPEC.md                ← the design doc — read this first
+  README.md              ← this file
+  scenarios/
+    a1_*.yaml            ← Test A scenarios (supplement)
+    b1_*.yaml            ← Test B scenarios (alternative)
+  fixtures/
+    pe_api/              ← reference PE-API responses for Test B
+                          ← populated by a follow-up PR
+  runs/                  ← raw eval-run output (gitignored)
+                          ← populated by the runner in a follow-up PR
+```
+
+## Scenario file shape
+
+Every scenario is a YAML file in `scenarios/`. The fields are the same across A and B; some are populated only on one side.
+
+```yaml
+id: a1_mechanism            # short stable identifier — also the filename prefix
+test: A                     # A (supplement) or B (alternative)
+title: "Mechanism explanation"
+what_it_tests: "One sentence on why this scenario exists."
+
+# Sent to the chat as ?scenario_context= (Test A only). Mirrors what the
+# app-v2 drawer would prepend when opening from a report.
+scenario_context: |
+  ...
+
+# The user's actual message.
+prompt: |
+  ...
+
+chat_settings:
+  model_backend: uk_python  # uk_compiled | uk_python
+  num_runs: 3               # how many fresh sessions to run per scenario
+
+# Test A — qualitative grading rubric.
+rubric:
+  relevance: "1-5 — chat answers the actual follow-up"
+  methodology: "1-5 — chat states dataset/year/assumptions"
+  reasonableness: "1-5 — numbers in plausible range, internally consistent"
+  consistency_with_report: "1-5 — agrees with report on shared facts"
+  honesty: "1-5 — refuses cleanly vs fabricates"
+
+# Test B — numeric comparison against a fixture.
+reference:
+  fixture: pe_api/b1_society_wide_pa.json
+  fields_to_compare:
+    - path: budget.budgetary_impact
+      tolerance_pct: 1.0
+    - path: decile.relative
+      tolerance_pct: 1.0
+
+# Both A and B — golden-path guidance the grader uses (and an LLM-judge will use later).
+anchor:
+  must_mention: ["personal allowance taper", "60%", "£100,000"]
+  must_not_say: ["top decile pays more tax than the 8th"]
+  ideal_explanation: |
+    Multi-line prose describing what a Vahid-quality answer would look like.
+    Used by the human grader as a reference; will become input to an
+    automated LLM-judge in v2.
+```
+
+Where anchors come from: PolicyEngine's published research blog where one matches the scenario (Vahid's UK posts under `app/src/data/posts/articles/` in the app-v2 repo), otherwise UK tax-rule knowledge. See `SPEC.md` § "Anchors" for the per-scenario sourcing.
+
+The runner (added in a follow-up PR) reads every YAML in `scenarios/`, POSTs to a configured chat backend, and writes outputs under `runs/<timestamp>/<scenario-id>/`. See `SPEC.md` § "Roadmap" for what's coming next.
+
+## Decision thresholds
+
+Pre-committed in SPEC.md so we don't rationalize ambiguous results later:
+
+- **Test A** passes if mean rubric score ≥ 4.0 across all responses, no individual score < 2 on Reasonableness or Honesty, and ≤ 1-in-5 fabrication rate.
+- **Test B** passes if field-level accuracy ≥ 95%, self-consistency SD < 0.5% of mean, methodology drift in ≤ 1 of 5 scenarios, failure rate < 10%.
+
+## Adding scenarios
+
+Add a YAML file. That's it — the runner picks it up automatically. Keep the `id` short and stable (it appears in filenames and reports).
+
+When adding a Test B scenario, also add the reference fixture under `fixtures/pe_api/` and reference it via the relative path in the scenario YAML.
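
The `reference.fields_to_compare` block in the scenario shape implies a per-field relative-tolerance check against the fixture. A minimal sketch of what that comparison might look like, assuming each `path` is a dotted key into nested JSON and resolves to a single number. The helper names `lookup`, `within_tolerance`, and `compare_fields` are hypothetical and not part of this PR, and the numbers are illustrative only:

```python
from typing import Any


def lookup(data: dict[str, Any], dotted_path: str) -> Any:
    """Walk a nested dict using a dotted path such as 'budget.budgetary_impact'."""
    node: Any = data
    for key in dotted_path.split("."):
        node = node[key]
    return node


def within_tolerance(actual: float, expected: float, tolerance_pct: float) -> bool:
    """True when the relative error of actual vs expected is <= tolerance_pct percent."""
    if expected == 0:
        # Relative error is undefined at zero, so fall back to exact match.
        return actual == 0
    return abs(actual - expected) / abs(expected) * 100.0 <= tolerance_pct


def compare_fields(
    chat_values: dict[str, Any],
    fixture: dict[str, Any],
    fields_to_compare: list[dict[str, Any]],
) -> dict[str, bool]:
    """Return a pass/fail flag per compared path."""
    return {
        field["path"]: within_tolerance(
            lookup(chat_values, field["path"]),
            lookup(fixture, field["path"]),
            field["tolerance_pct"],
        )
        for field in fields_to_compare
    }


# Illustrative numbers only, not taken from any real fixture or run.
fixture = {"budget": {"budgetary_impact": -20.0}}
chat_values = {"budget": {"budgetary_impact": -20.1}}
fields = [{"path": "budget.budgetary_impact", "tolerance_pct": 1.0}]
print(compare_fields(chat_values, fixture, fields))
```

If a path such as `decile.relative` resolves to a mapping rather than a scalar in the real fixtures, the comparison would need to iterate over its entries; that case is left out of the sketch.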
diff --git a/evals/RESULTS-2026-05-27.md b/evals/RESULTS-2026-05-27.md
@@ -0,0 +1,138 @@
+# uk-chat eval — first run results
+
+**Run date:** 2026-05-27
+**Scenarios:** 9 (5 Test A, 4 Test B after dropping B5 for baseline drift)
+**Runs per scenario:** 3 fresh sessions
+**Total conversations:** 27
+**Backend under test:** `policyengine--peukchat-feat-model-backend-selector-web.modal.run` (PR 51 preview)
+**`model_backend`:** `uk_python` (`policyengine_uk 2.88.20`)
+
+## Headline
+
+**Both tests fail. Test B fails on timeouts and accuracy; Test A fails on rubric quality, trust-killers, and fabrication rate.**
+
+| Test | Threshold | Result | Verdict |
+|---|---|---|---|
+| B field accuracy | ≥ 95% within 1-15% tol | **75%** | ❌ FAIL |
+| B self-consistency SD | < 0.5% of mean | 2 violations | ❌ FAIL |
+| B methodology drift | ≤ 1 of 4 scenarios | 0 | ✓ |
+| B failure rate | < 10% | **67%** | ❌ FAIL hard |
+| A mean rubric score | ≥ 4.0 | **3.09** | ❌ FAIL |
+| A trust-killer < 2 | none allowed | **10 dim-scores < 2** across 4 responses | ❌ FAIL |
+| A fabrication rate | ≤ 20% | **27%** (4 of 15) | ❌ FAIL |
+
+## The clearest finding
+
+**The deployed chat cannot reliably complete population-level reform questions within its 10-minute timeout.**
+
+- B1 (society-wide PA reform): **3 of 3 runs hit the 600s HTTP timeout.** Failure rate 100%.
+- B2 (NI/IT stacked layers): **2 of 3 runs timed out** at ~600s; the surviving run took 547s.
+- B3 (household calc, no microdata): completed on all 3 runs in 213-365s.
+- B4 (MTR schedule, no microdata): completed in 76-240s. Tool counts 8-12.
+
+**Both B scenarios that *don't* need microdata complete (B3, B4). Both B scenarios that *do* need microdata fail (B1, B2: 5 of 6 runs timed out).** Pattern: when the chat has to load EFRS and run a microsim through `policyengine_uk` by writing arbitrary Python, it runs out of time before producing an answer.
+
+This isn't methodology drift, hallucination, or wrong numbers: **the chat could not finish**. Each timed-out request ran for 600s and made 12-34 tool calls before the Modal worker hung up.
+
+## Other findings worth surfacing
+
+**A4 out-of-scope is the clean win.** All 3 runs refused cleanly in 7-9 seconds with 0-1 tool calls. Of all 9 scenarios, only A4 looked unambiguously correct on every run.
+
+**Variable-name drift confirmed.** B3 had three runs:
+- Run 1: 4/4 fields correct
+- Run 2: 2/4 correct (`national_insurance` came back as `2.00`, likely the chat's 2% NI rate picked up where the £ amount was expected; either an extractor edge case or chat output drift)
+- Run 3: 3/4 correct (`marginal_tax_rate` extracted as `2025.00`, almost certainly the "2025-26" fiscal-year label grabbed as the rate)
+
+B3 self-consistency SD: 166% (driven by these extraction artefacts on top of any real chat variance).
+
+This mirrors the **B4 smoke-test finding** where one of three runs used `gov_tax` and `ni_class_1_employee_primary` instead of `income_tax` and `national_insurance`, silently producing wrong numbers without errors. Same class of failure: Claude picks marginally different code each run, sometimes wrong, no errors thrown.
+
+**A5 (factual lookup) timed out once.** Should be a parameter-tree lookup — instead 23 tool calls, 600s, dead. The chat tried to *simulate* what should have been a one-line `get_parameter` call. Confirms the anchor's warning: "could waste tool calls trying to 'calculate' what's just a parameter lookup."
+
+**A2 (subset slice) and A3 (multi-param what-if) thrash heavily.** 17-33 tool calls per run, 96-449s for A2, 321-610s for A3. One A3 run timed out. These are the population-level scenarios with subset filtering and reform composition — the exact shapes prior memories flagged as Claude's weakest area.
+
+## Per-scenario summary
+
+| Scenario | Runs | Errors | Median time | Median tool calls | Notes |
+|---|---|---|---|---|---|
+| a1_mechanism | 3 | 0 | 30s | 2 | Substantive prose, varies on whether the £100k PA-taper mechanism is identified |
+| a2_subset_slice | 3 | 0 | 428s | 25 | Heavy thrashing; chat searched for filtering API |
+| a3_multiparam_what_if | 3 | 1 | 333s | 26 | One timeout; the surviving runs are long |
+| a4_out_of_scope | 3 | 0 | 8s | 0 | Clean refusals every time |
+| a5_factual_lookup | 3 | 1 | 56s | 8 | Surprising timeout; chat tried to simulate vs look up |
+| b1_society_wide_pa | 3 | 3 | 600s | 28 | All 3 timed out |
+| b2_ni_it_stacked | 3 | 2 | 601s | 25 | 2 of 3 timed out |
+| b3_household_calc | 3 | 0 | 266s | 26 | Completes but mixed extractor + content correctness |
+| b4_mtr_schedule | 3 | 0 | 215s | 12 | 2/3 runs correct, 1 used wrong variable names |
+
+## Cost
+
+Combined over both runs (27 conversations):
+
+| Metric | Value |
+|---|---|
+| Input tokens | 447,265 |
+| Output tokens | 171,620 |
+| Cache reads | 8,288,775 (≈free) |
+| Cache creates | 179,344 |
+| Wall time (full 21-run) | 33 minutes at concurrency=4 |
+| Wall time (smoke 6-run) | 10 minutes sequential |
+| User-balance billed | £0 (runner doesn't send `user_id`) |
+| Org Anthropic-account billed | ≈$2-3 estimated |
+
+## Interpretation against pre-committed positioning
+
+**Per `SPEC.md`:**
+
+> "If thresholds met: alternative positioning is viable. If not: supplement-only is the right framing."
+
+Test B failed on three of four pre-committed thresholds, including the simplest one: a 67% failure rate against a < 10% limit. Test A failed on all three thresholds (mean 3.09 vs 4.0; 4 of 15 responses fabricated; 10 dimension-scores below the trust-killer line, concentrated in A3 and A5 run 2).
+
+**Both alternative and supplement positioning fall short of the pre-committed thresholds.** Alternative positioning is clearly off the table. Supplement positioning remains defensible only because A4 (out-of-scope refusal) and A1 (mechanism explanation) score acceptably; those are the question shapes the chat *should* own in v1.
+
+## Test A by the numbers
+
+| Scenario | Mean (across 5 dims × 3 runs) | Notable |
+|---|---|---|
+| **a4_out_of_scope** | **5.00** | All 3 runs clean refusals; only scenario with zero trust-killers and zero fabrication |
+| a1_mechanism | 2.93 | Run 1 misses PA-taper entirely; run 2 places taper at £125,140 not £100k; run 3 names taper correctly but contradicts itself on D10 income range |
+| a2_subset_slice | 2.93 | Run 1 produces a table (D1 gain of £159 implausibly high); run 2 flat distribution that's qualitatively wrong; run 3 gives up after 17 tool calls |
+| a5_factual_lookup | 2.67 | Run 1 supplied values "from official records" (training-data fallback after the chat admitted it couldn't query the parameter tree); run 2 timed out at 600s with no answer; run 3 returned a clean table, also without showing the parameter path |
+| **a3_multiparam_what_if** | **1.93** | Run 1 truncated mid-tool-use, no answer. Run 2 used synthetic households and quoted £0.66m / £1.08m budgetary impacts — off by ~4 orders of magnitude. Run 3 got the directional conclusion right but quoted £6-7bn vs reality ~£20bn for PA-only |
+
+**Where the fabrication concentrates:** A3 runs 2 and 3 (made-up budgetary magnitudes), A5 run 1 (training-data values dressed up as model output), and A1 run 1 (invented £972/£1,094 saving figures that don't reflect the actual PA-raise arithmetic).
+
+**Pattern.** The chat is reliable for two shapes only: clean refusals (A4) and short mechanism prose where it doesn't have to run anything (A1, qualitatively, though even there it gets details wrong). Anything requiring it to produce a number from the engine either thrashes, times out, or fabricates.
+
+## What this implies — and what the data points at as a fix
+
+The 27-run dataset turns the prior "structured tools would help" hypothesis (memory: `project_structured_tools_evidence.md`) into a quantified claim:
+
+- Every B scenario that ran free-form Python through `policyengine_uk` either timed out or produced wrong numbers in at least one of three runs.
+- B3 and B4 (small situations) completed but still showed 1-in-3-ish content drift from variable-name selection.
+- A4 (refusal) had zero variance — no tool use to drift on.
+
+**Three concrete next steps:**
+
+1. **Add structured tools** for the common shapes — `calculate_household`, `run_economy_comparison`, `get_baseline_parameters` — so the LLM fills in JSON forms instead of writing variable-picking Python. The dormant tools in `agent_tools.py` cover most of this.
+2. **If structured tools are deferred**, position chat as **supplement only**. The "ask a follow-up" drawer integration in `policyengine-app-v2#1036` is already aligned with this framing.
+3. **Re-run the eval after structured-tool changes** to measure the lift. The harness and fixtures are reusable.
+
+## Caveats and limits of this run
+
+- **n=3 per scenario is statistically thin.** Findings are directionally strong (especially the timeout pattern) but individual numbers shouldn't be over-interpreted.
+- **B3 extractor showed false negatives** on prose-embedded numbers. Some "failures" may be extractor weakness rather than chat error — worth manually reviewing before publishing the per-field accuracy numbers.
+- **Test A grading is one grader's judgement.** A second grader on the same 15 responses would shift individual scores; the aggregate verdict (fail on all three thresholds) is robust to ±1 per dimension but the per-scenario means are not.
+- **Vahid-published figures couldn't be the fixture** for B2's full set because `policyengine_uk 2.88.20` baseline has incorporated the Autumn Budget 2025 changes; B5 was dropped entirely for the same reason. See `drift_report.md` alongside the fixtures.
+
+## Artifacts
+
+Everything reproducible from this repo:
+
+- Scenarios: `evals/scenarios/*.yaml`
+- Fixtures + drift report: `evals/fixtures/`
+- Run logs (SSE + extracted text + meta JSON):
+  - `evals/runs/2026-05-27_060403/` (smoke: A1 + B4)
+  - `evals/runs/2026-05-27_075425/` (full: 7 remaining)
+- Grader output: `evals/runs/<dir>/B_results.md`, `A_grading.md`
+- This writeup: `evals/RESULTS-2026-05-27.md`