4 changes: 4 additions & 0 deletions .gitignore
@@ -4,10 +4,14 @@ __pycache__/
.next/
node_modules/
.venv/
.venv-*/
*.egg-info/
dist/

# Generated at image build time by backend/scripts/build_reference.py.
# Both backend/Dockerfile and modal_app.py regenerate this against the
# installed policyengine-uk-compiled version, so it never lives in git.
backend/reference.md

# Eval run outputs — generated per run, not source
evals/runs/
88 changes: 88 additions & 0 deletions evals/README.md
@@ -0,0 +1,88 @@
# uk-chat evals

Evaluation harness for the chat. Tests two positionings — **supplement** to app-v2 reports and **alternative** to them — against pre-committed thresholds.

The full design (rationale, thresholds, scenario descriptions) lives in [SPEC.md](./SPEC.md). This README covers structure and how to extend.

## What's here today

```
evals/
  SPEC.md            ← the design doc, read this first
  README.md          ← this file
  scenarios/
    a1_*.yaml        ← Test A scenarios (supplement)
    b1_*.yaml        ← Test B scenarios (alternative)
  fixtures/
    pe_api/          ← reference PE-API responses for Test B
                     ← populated by a follow-up PR
  runs/              ← raw eval-run output (gitignored)
                     ← populated by the runner in a follow-up PR
```

## Scenario file shape

Every scenario is a YAML file in `scenarios/`. The fields are the same across A and B; some are populated only on one side.

```yaml
id: a1_mechanism          # short stable identifier, also the filename prefix
test: A                   # A (supplement) or B (alternative)
title: "Mechanism explanation"
what_it_tests: "One sentence on why this scenario exists."

# Sent to the chat as ?scenario_context= (Test A only). Mirrors what the
# app-v2 drawer would prepend when opening from a report.
scenario_context: |
  ...

# The user's actual message.
prompt: |
  ...

chat_settings:
  model_backend: uk_python   # uk_compiled | uk_python
  num_runs: 3                # how many fresh sessions to run per scenario

# Test A: qualitative grading rubric.
rubric:
  relevance: "1-5: chat answers the actual follow-up"
  methodology: "1-5: chat states dataset/year/assumptions"
  reasonableness: "1-5: numbers in plausible range, internally consistent"
  consistency_with_report: "1-5: agrees with report on shared facts"
  honesty: "1-5: refuses cleanly vs fabricates"

# Test B: numeric comparison against a fixture.
reference:
  fixture: pe_api/b1_society_wide_pa.json
  fields_to_compare:
    - path: budget.budgetary_impact
      tolerance_pct: 1.0
    - path: decile.relative
      tolerance_pct: 1.0

# Both A and B: golden-path guidance the grader uses (and an LLM-judge will use later).
anchor:
  must_mention: ["personal allowance taper", "60%", "£100,000"]
  must_not_say: ["top decile pays more tax than the 8th"]
  ideal_explanation: |
    Multi-line prose describing what a Vahid-quality answer would look like.
    Used by the human grader as a reference; will become input to an
    automated LLM-judge in v2.
```
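
A loader could sanity-check each scenario before running it. The sketch below is illustrative only: it assumes the YAML has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`), and the function name `check_scenario`, the list of required fields, and the A/B rules are assumptions drawn from the shape above, not code from the harness.

```python
# Hypothetical shape check; the required-field list is an assumption.
REQUIRED = ("id", "test", "title", "what_it_tests", "prompt")


def check_scenario(scenario: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in scenario]

    test = scenario.get("test")
    if test not in ("A", "B"):
        problems.append(f"test must be 'A' or 'B', got {test!r}")

    # Test A is graded against a rubric, Test B against a reference fixture.
    if test == "A" and "rubric" not in scenario:
        problems.append("Test A scenario needs a rubric")
    if test == "B" and "reference" not in scenario:
        problems.append("Test B scenario needs a reference")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single pass report everything wrong with a new scenario file.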

Where anchors come from: PolicyEngine's published research blog where one matches the scenario (Vahid's UK posts under `app/src/data/posts/articles/` in the app-v2 repo), otherwise UK tax-rule knowledge. See `SPEC.md` § "Anchors" for the per-scenario sourcing.

The runner (added in a follow-up PR) reads every YAML in `scenarios/`, POSTs to a configured chat backend, and writes outputs under `runs/<timestamp>/<scenario-id>/`. See `SPEC.md` § "Roadmap" for what's coming next.
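
Since the runner itself is not in this PR, the following is only a sketch of the two filesystem steps described above: discovering scenario files and deriving the per-scenario output directory. The function names and the timestamp format are assumptions; the HTTP call to the chat backend is deliberately left out.

```python
from datetime import datetime
from pathlib import Path


def scenario_files(scenarios_dir: Path) -> list[Path]:
    """Every YAML file directly inside scenarios/, in a stable sorted order."""
    return sorted(scenarios_dir.glob("*.yaml"))


def output_dir(runs_root: Path, started_at: datetime, scenario_id: str) -> Path:
    """Build runs/<timestamp>/<scenario-id>/ for one scenario of one eval run."""
    # The timestamp format is an assumption, not something the README specifies.
    stamp = started_at.strftime("%Y-%m-%d_%H%M%S")
    return runs_root / stamp / scenario_id
```

Using one `started_at` value for the whole run keeps every scenario's output under the same timestamped directory.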

## Decision thresholds

Pre-committed in SPEC.md so we don't rationalize ambiguous results later:

- **Test A** passes if mean rubric score ≥ 4.0 across all responses, no individual score < 2 on Reasonableness or Honesty, and ≤ 1-in-5 fabrication rate.
- **Test B** passes if field-level accuracy ≥ 95%, self-consistency SD < 0.5% of mean, methodology drift in ≤ 1 of 5 scenarios, failure rate < 10%.
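
The two pass rules can be written down as plain predicates. This is a minimal sketch, not the grader's code: the function names and input shapes are hypothetical, and it assumes the self-consistency rule is applied per scenario (every relative SD must be under 0.5%).

```python
from statistics import mean


def passes_test_a(scores: list[dict[str, int]], fabricated: list[bool]) -> bool:
    """scores: one dict per response, mapping rubric dimension to a 1-5 score.
    fabricated: one flag per response, True if the grader saw fabrication."""
    all_scores = [s for response in scores for s in response.values()]
    trust_killer = any(
        response[dim] < 2
        for response in scores
        for dim in ("reasonableness", "honesty")
    )
    fabrication_rate = sum(fabricated) / len(fabricated)
    return mean(all_scores) >= 4.0 and not trust_killer and fabrication_rate <= 0.2


def passes_test_b(
    field_accuracy: float,      # fraction of compared fields within tolerance
    relative_sds: list[float],  # per-scenario SD / mean, as fractions
    drifted_scenarios: int,     # scenarios showing methodology drift
    failure_rate: float,        # fraction of runs that errored or timed out
) -> bool:
    return (
        field_accuracy >= 0.95
        and all(sd < 0.005 for sd in relative_sds)
        and drifted_scenarios <= 1
        and failure_rate < 0.10
    )
```

Writing the thresholds as code before any results exist is one way to keep them pre-committed: the predicates cannot be tuned to fit an ambiguous run without a visible diff.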

## Adding scenarios

Add a YAML file. That's it — the runner picks it up automatically. Keep the `id` short and stable (it appears in filenames and reports).

When adding a Test B scenario, also add the reference fixture under `fixtures/pe_api/` and reference it via the relative path in the scenario YAML.
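
To make the `fields_to_compare` entries concrete, here is one way a grader might resolve a dotted `path` in a fixture and apply `tolerance_pct`. It is a sketch under stated assumptions: the helper names are hypothetical, the fixture is assumed to be nested JSON objects, and only scalar fields are handled (a field such as `decile.relative` may well be a mapping that needs element-wise comparison).

```python
def get_path(data: dict, dotted: str):
    """Follow a dotted path such as 'budget.budgetary_impact' into nested dicts."""
    for key in dotted.split("."):
        data = data[key]
    return data


def within_tolerance(actual: float, expected: float, tolerance_pct: float) -> bool:
    """True when actual is within tolerance_pct percent of expected."""
    if expected == 0:
        # A relative tolerance is undefined at zero; require an exact match.
        return actual == 0
    return abs(actual - expected) / abs(expected) * 100 <= tolerance_pct
```

Dividing by `abs(expected)` keeps the check symmetric for negative values, which matters for budgetary impacts that can be costs or savings.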
138 changes: 138 additions & 0 deletions evals/RESULTS-2026-05-27.md
@@ -0,0 +1,138 @@
# uk-chat eval — first run results

**Run date:** 2026-05-27
**Scenarios:** 9 (5 Test A, 4 Test B after dropping B5 for baseline drift)
**Runs per scenario:** 3 fresh sessions
**Total conversations:** 27
**Backend under test:** `policyengine--peukchat-feat-model-backend-selector-web.modal.run` (PR 51 preview)
**`model_backend`:** `uk_python` (`policyengine_uk 2.88.20`)

## Headline

**Both tests fail. Test B fails on timeouts and accuracy; Test A fails on rubric quality, trust-killers, and fabrication rate.**

| Test | Threshold | Result | Verdict |
|---|---|---|---|
| B field accuracy | ≥ 95% within 1-15% tol | **75%** | ❌ FAIL |
| B self-consistency SD | < 0.5% of mean | 2 violations | ❌ FAIL |
| B methodology drift | ≤ 1 of 4 scenarios | 0 | ✓ |
| B failure rate | < 10% | **67%** | ❌ FAIL hard |
| A mean rubric score | ≥ 4.0 | **3.09** | ❌ FAIL |
| A trust-killer < 2 | none allowed | **10 dim-scores < 2** across 4 responses | ❌ FAIL |
| A fabrication rate | ≤ 20% | **27%** (4 of 15) | ❌ FAIL |

## The clearest finding

**The deployed chat cannot reliably complete population-level reform questions within its 10-minute timeout.**

- B1 (society-wide PA reform): **3 of 3 runs hit the 600s HTTP timeout.** Failure rate 100%.
- B2 (NI/IT stacked layers): **2 of 3 runs timed out** at ~600s; the surviving run took 547s.
- B3 (household calc, no microdata): completed in 213-365s. Only B scenario that succeeded on all runs.
- B4 (MTR schedule, no microdata): completed in 76-240s. Tool counts 8-12.

**Both B scenarios that *don't* need microdata complete on every run (B3, B4). Both that *do* need microdata time out on most or all runs (B1, B2).** Pattern: when the chat has to load EFRS and run a microsim through `policyengine_uk` by writing arbitrary Python, it runs out of time long before producing an answer.

This isn't methodology drift, hallucination, or wrong numbers: **the chat simply could not finish**. Each timed-out request ran for 600s and made 12-34 tool calls before the Modal worker hung up.

## Other findings worth surfacing

**A4 out-of-scope is the clean win.** All 3 runs refused cleanly in 7-9 seconds with 0-1 tool calls. Of all 9 scenarios, only A4 looked unambiguously correct on every run.

**Variable-name drift confirmed.** B3 had three runs:
- Run 1: 4/4 fields correct
- Run 2: 2/4 correct (`national_insurance` came back as `2.00`, likely the chat reporting the 2% NI rate where we expected the £ amount; either an extractor edge case or chat output drift)
- Run 3: 3/4 correct (`marginal_tax_rate` extracted as `2025.00`, almost certainly the "2025-26" fiscal-year label being grabbed as the rate)

B3 self-consistency SD: 166% (driven by these extraction artefacts on top of any real chat variance).
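
A single extraction artefact is enough to push a relative SD past 100%. The sketch below shows the arithmetic with made-up values, not figures from the run logs. Whether the grader uses population or sample SD is not stated here, so the choice of `pstdev` is an assumption.

```python
from statistics import mean, pstdev


def relative_sd_pct(values: list[float]) -> float:
    """Standard deviation as a percentage of the mean (population SD assumed)."""
    return pstdev(values) / abs(mean(values)) * 100


# Illustrative values only: three runs agreeing on a rate, then the same
# three runs with one value replaced by a mis-extracted year.
clean = [0.42, 0.42, 0.42]
with_artefact = [0.42, 0.42, 2025.00]

print(relative_sd_pct(clean))          # identical runs give 0.0
print(relative_sd_pct(with_artefact))  # well over 100%
```

This is why the B3 figure should be read as an upper bound on real chat variance until the extractor artefacts are filtered out.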

This mirrors the **B4 smoke-test finding** where one of three runs used `gov_tax` and `ni_class_1_employee_primary` instead of `income_tax` and `national_insurance`, silently producing wrong numbers without errors. Same class of failure: Claude picks marginally different code each run, sometimes wrong, no errors thrown.

**A5 (factual lookup) timed out once.** Should be a parameter-tree lookup — instead 23 tool calls, 600s, dead. The chat tried to *simulate* what should have been a one-line `get_parameter` call. Confirms the anchor's warning: "could waste tool calls trying to 'calculate' what's just a parameter lookup."

**A2 (subset slice) and A3 (multi-param what-if) thrash heavily.** 17-33 tool calls per run, 96-449s for A2, 321-610s for A3. One A3 run timed out. These are the population-level scenarios with subset filtering and reform composition — the exact shapes prior memories flagged as Claude's weakest area.

## Per-scenario summary

| Scenario | Runs | Errors | Median time | Median tool calls | Notes |
|---|---|---|---|---|---|
| a1_mechanism | 3 | 0 | 30s | 2 | Substantive prose, varies on whether the £100k PA-taper mechanism is identified |
| a2_subset_slice | 3 | 0 | 428s | 25 | Heavy thrashing; chat searched for filtering API |
| a3_multiparam_what_if | 3 | 1 | 333s | 26 | One timeout; the surviving runs are long |
| a4_out_of_scope | 3 | 0 | 8s | 0 | Clean refusals every time |
| a5_factual_lookup | 3 | 1 | 56s | 8 | Surprising timeout; chat tried to simulate vs look up |
| b1_society_wide_pa | 3 | 3 | 600s | 28 | All 3 timed out |
| b2_ni_it_stacked | 3 | 2 | 601s | 25 | 2 of 3 timed out |
| b3_household_calc | 3 | 0 | 266s | 26 | Completes but mixed extractor + content correctness |
| b4_mtr_schedule | 3 | 0 | 215s | 12 | 2/3 runs correct, 1 used wrong variable names |

## Cost

Combined over both runs (27 conversations):

| Metric | Value |
|---|---|
| Input tokens | 447,265 |
| Output tokens | 171,620 |
| Cache reads | 8,288,775 (≈free) |
| Cache creates | 179,344 |
| Wall time (full 21-run) | 33 minutes at concurrency=4 |
| Wall time (smoke 6-run) | 10 minutes sequential |
| User-balance billed | £0 (runner doesn't send `user_id`) |
| Org Anthropic-account billed | ≈$2-3 estimated |

## Interpretation against pre-committed positioning

**Per `SPEC.md`:**

> "If thresholds met: alternative positioning is viable. If not: supplement-only is the right framing."

Test B failed three of its four pre-committed thresholds, including the simplest one: a 67% failure rate against a limit of < 10%. Test A failed all three of its thresholds (mean 3.09 vs 4.0; 4 of 15 responses fabricated; 10 dimension-scores below the trust-killer line, concentrated in A3 and A5 run 2).

**Both positionings fall short of the pre-committed thresholds.** Alternative positioning is clearly off the table. Supplement positioning remains defensible only because A4 (out-of-scope refusal) and A1 (mechanism explanation) score acceptably; those are the question shapes the chat *should* own in v1.

## Test A by the numbers

| Scenario | Mean (across 5 dims × 3 runs) | Notable |
|---|---|---|
| **a4_out_of_scope** | **5.00** | All 3 runs clean refusals; only scenario with zero trust-killers and zero fabrication |
| a1_mechanism | 2.93 | Run 1 misses PA-taper entirely; run 2 places taper at £125,140 not £100k; run 3 names taper correctly but contradicts itself on D10 income range |
| a2_subset_slice | 2.93 | Run 1 produces a table (D1 gain of £159 implausibly high); run 2 flat distribution that's qualitatively wrong; run 3 gives up after 17 tool calls |
| a5_factual_lookup | 2.67 | Run 1 supplied values "from official records" (training-data fallback after the chat admitted it couldn't query the parameter tree); run 2 timed out at 600s with no answer; run 3 returned a clean table, also without showing the parameter path |
| **a3_multiparam_what_if** | **1.93** | Run 1 truncated mid-tool-use, no answer. Run 2 used synthetic households and quoted £0.66m / £1.08m budgetary impacts — off by ~4 orders of magnitude. Run 3 got the directional conclusion right but quoted £6-7bn vs reality ~£20bn for PA-only |
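
Because every scenario contributes the same number of scores (5 dimensions × 3 runs), the headline Test A mean is just the average of the per-scenario means in the table above. A quick check, using the rounded table values:

```python
from statistics import mean

# Per-scenario means copied from the table above (already rounded to 2 dp).
scenario_means = {
    "a4_out_of_scope": 5.00,
    "a1_mechanism": 2.93,
    "a2_subset_slice": 2.93,
    "a5_factual_lookup": 2.67,
    "a3_multiparam_what_if": 1.93,
}

overall = mean(scenario_means.values())
print(round(overall, 2))  # 3.09, matching the headline table
```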

**Where the fabrication concentrates:** A3 runs 2 and 3 (made-up budgetary magnitudes), A5 run 1 (training-data values dressed up as model output), and A1 run 1 (invented £972/£1,094 saving figures that don't reflect the actual PA-raise arithmetic).

**Pattern.** The chat is reliable for two shapes only: clean refusals (A4) and short mechanism prose where it doesn't have to run anything (A1, qualitatively, though even there it gets details wrong). Anything requiring it to produce a number from the engine either thrashes, times out, or fabricates.

## What this implies — and what the data points at as a fix

The 27-run dataset turns the prior "structured tools would help" hypothesis (memory: `project_structured_tools_evidence.md`) into a quantified claim:

- Every B scenario that ran free-form Python through `policyengine_uk` either timed out or produced wrong numbers in at least one of three runs.
- B3 and B4 (small situations) completed but still showed 1-in-3-ish content drift from variable-name selection.
- A4 (refusal) had zero variance — no tool use to drift on.

**Three concrete next steps:**

1. **Add structured tools** for the common shapes — `calculate_household`, `run_economy_comparison`, `get_baseline_parameters` — so the LLM fills in JSON forms instead of writing variable-picking Python. The dormant tools in `agent_tools.py` cover most of this.
2. **If structured tools are deferred**, position chat as **supplement only**. The "ask a follow-up" drawer integration in `policyengine-app-v2#1036` is already aligned with this framing.
3. **Re-run the eval after structured-tool changes** to measure the lift. The harness and fixtures are reusable.
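
To illustrate why a structured tool would remove the variable-name drift seen in B4, here is a hypothetical request validator for a `calculate_household`-style tool. Nothing here reflects the actual contents of `agent_tools.py`: the request shape, the function name, and the allow-list are assumptions, with the two variable names taken from the expected outputs named earlier in this writeup.

```python
# Hypothetical allow-list: the output variables the eval expected.
ALLOWED_OUTPUTS = {"income_tax", "national_insurance"}


def validate_household_request(request: dict) -> dict:
    """Reject a request that names an output variable outside the allow-list."""
    outputs = request.get("outputs", [])
    if not outputs:
        raise ValueError("request must name at least one output variable")
    unknown = set(outputs) - ALLOWED_OUTPUTS
    if unknown:
        raise ValueError(f"unknown output variables: {sorted(unknown)}")
    return request
```

With a check like this, a run that asks for `gov_tax` fails loudly at the tool boundary instead of silently returning a different quantity.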

## Caveats and limits of this run

- **n=3 per scenario is statistically thin.** Findings are directionally strong (especially the timeout pattern) but individual numbers shouldn't be over-interpreted.
- **B3 extractor showed false negatives** on prose-embedded numbers. Some "failures" may be extractor weakness rather than chat error — worth manually reviewing before publishing the per-field accuracy numbers.
- **Test A grading is one grader's judgement.** A second grader on the same 15 responses would shift individual scores; the aggregate verdict (fail on all three thresholds) is robust to ±1 per dimension but the per-scenario means are not.
- **Vahid-published figures couldn't be the fixture** for B2's full set because the `policyengine_uk 2.88.20` baseline already incorporates the Autumn Budget 2025 changes; B5 was dropped entirely for the same reason. See `drift_report.md` alongside the fixtures.

## Artifacts

Everything is reproducible from this repo:

- Scenarios: `evals/scenarios/*.yaml`
- Fixtures + drift report: `evals/fixtures/`
- Run logs (SSE + extracted text + meta JSON):
  - `evals/runs/2026-05-27_060403/` (smoke: A1 + B4)
  - `evals/runs/2026-05-27_075425/` (full: 7 remaining)
- Grader output: `evals/runs/<dir>/B_results.md`, `A_grading.md`
- This writeup: `evals/RESULTS-2026-05-27.md`