Add uk_python typed tools (run_economy_simulation) to fix B1 reform failures by SakshiKekre · Pull Request #97 · PolicyEngine/policyengine-uk-chat

SakshiKekre · 2026-06-02T18:05:28Z

Summary

Adds a run_economy_simulation typed tool that calls policyengine_uk.Microsimulation(reform=...) directly, replacing the LLM's trial-and-error run_python thrashing that currently fails B1 in 3 of 4 ways.

Stacked on #51 (model-backend-selector). Imports typed-tool infrastructure from @anth-volk's #94 (tooling/ package, tool_definitions.py) and wires a new uk_python engine adapter on top — leaves Anthony's compiled-engine path untouched.

Why this PR exists

This morning's B1 eval (feat/eval-harness, model_backend=uk_python) found four distinct failure modes of run_python against the B1 prompt (PA £12,570 → £15,000, economy-wide):

| Run | Outcome | Budget |
|---|---|---|
| 1 | Got close after 39 tool calls, wrong poverty measure | -£21.7bn (PE-API: -£20.45bn, 6.3% off) |
| 2 | Self-aware bail after 5 failed attempts at reform application | No number |
| 3 | Clean refusal: "uk_python doesn't support reforms" | No number |
| Manual | Silently fabricated by hand-arithmetic | -£5.45bn (4× too small) |

All four fail because Claude is re-discovering how to apply a parameter reform via the Python engine every run, and the API surface is non-obvious. Anthony's #94 introduces typed tools to remove this surface for the compiled (Rust) engine; this PR extends them to the Python engine.

End-to-end result (local in-process)

Adapter against B1, single call, 12 seconds wall time:

| Field | PE-API v1 | uk_python adapter | Δ |
|---|---|---|---|
| Budget impact | -£20,452,307,209 | -£20,452,307,295 | £86 (0.0000004%) |
| Tax revenue change | -£21,084,668,700 | -£21,084,668,787 | £87 |
| Benefit spending | -£632,361,491 | -£632,361,491 | exact |
| D1-D10 avg & relative | (all 20 values) | (all 20 values) | exact to nearest £/0.001% |
| Child / adult / senior / all poverty BHC | 14.06 / 8.76 / 12.16 / 10.50 | 14.06 / 8.76 / 12.16 / 10.50 | exact |

Same engine + same dataset + same MicroSeries formulas → numbers match PE-API by construction.
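The gaps in the table can be recomputed from the quoted figures. The sketch below is only a sanity check on the arithmetic; it assumes that budget impact equals the tax revenue change minus the benefit spending change, an identity inferred from the PE-API column rather than stated in this PR.

```python
# Figures copied from the comparison table above, in pounds.
pe_api = {"budget": -20_452_307_209, "tax": -21_084_668_700, "benefits": -632_361_491}
adapter = {"budget": -20_452_307_295, "tax": -21_084_668_787, "benefits": -632_361_491}

# Assumed identity: budget impact = tax revenue change - benefit spending change.
# It holds exactly for the PE-API column.
assert pe_api["tax"] - pe_api["benefits"] == pe_api["budget"]

# Absolute gaps between the two columns.
budget_gap = abs(adapter["budget"] - pe_api["budget"])
tax_gap = abs(adapter["tax"] - pe_api["tax"])

# Budget gap as a percentage of the PE-API budget impact.
relative_gap_pct = 100 * budget_gap / abs(pe_api["budget"])

print(budget_gap, tax_gap)         # 86 87
print(f"{relative_gap_pct:.7f}%")  # 0.0000004%
```

Under the same assumed identity the adapter column agrees to within £1, which is consistent with the displayed values being rounded to whole pounds.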

Architecture

backend/
  agent_tools.py                 ← wires dispatch on backend_id
  model_backends.py              ← uk_python prompt now mentions run_economy_simulation
  tool_definitions.py            ← typed tool JSON-Schemas (from #94)
  tooling/                       ← engine-agnostic helpers (from #94)
    {households,microdata,reforms,sandbox,serialization,simulations}.py
    backends/                    ← NEW: per-engine implementations
      __init__.py                ← EngineBackend protocol + registry
      uk_python.py               ← policyengine_uk implementation + reform translator
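One way to picture the `EngineBackend` protocol + registry named in the tree is the sketch below. It is not the code in this PR: the method signature, `register_backend`, `get_backend`, and the stub class are all hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Protocol


class EngineBackend(Protocol):
    """Interface each engine adapter is assumed to satisfy (hypothetical)."""

    backend_id: str

    def run_economy_simulation(self, reform: Dict[str, Any], year: int) -> Dict[str, Any]:
        ...


# Registry keyed by backend_id, filled in at import time by each adapter.
_REGISTRY: Dict[str, EngineBackend] = {}


def register_backend(backend: EngineBackend) -> None:
    _REGISTRY[backend.backend_id] = backend


def get_backend(backend_id: str) -> EngineBackend:
    try:
        return _REGISTRY[backend_id]
    except KeyError:
        known = sorted(_REGISTRY)
        raise KeyError(f"No engine backend registered for {backend_id!r}; known: {known}") from None


class StubUKPythonBackend:
    """Stand-in for the real adapter; it echoes its inputs instead of simulating."""

    backend_id = "uk_python"

    def run_economy_simulation(self, reform: Dict[str, Any], year: int) -> Dict[str, Any]:
        return {"backend": self.backend_id, "year": year, "reform": reform}


register_backend(StubUKPythonBackend())
```

Keeping the registry separate from the adapters means a second engine could register itself without touching the dispatch code.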

Commit walk

1a4b347 Import tooling/ + tool_definitions.py verbatim from #94 ([codex] Expose standalone typed tools), no wiring
5ef5ded Add tooling/backends/ subpackage with uk_python adapter + reform translator
8fdfc1e Dispatch run_economy_simulation in agent_tools.execute_tool for uk_python only
2adc5c1 [TEMP] Mirror PE-API's calculate_economy_comparison methodology line-by-line (see commit message for why we can't import the orchestrator today)
ea11775 Update uk_python system prompt to prefer the typed tool

Known tech debt — see TODO in `tooling/backends/uk_python.py`

run_economy_simulation inlines ~50 lines of glue that already exist as calculate_economy_comparison in the policyengine orchestrator package. We can't import it today because pip install policyengine==0.13.0 transitively requires policyengine_us, which pins policyengine_core>=3.26.0 and so conflicts with PE-API's pinned policyengine_core==3.25.3. Drop the inlined glue when the orchestrator imports cleanly.
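The conflict comes down to one pin sitting below another package's lower bound. A minimal sketch using plain tuple comparison, with the version strings taken from the paragraph above (a real resolver such as pip handles many more cases than this):

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted release string such as '3.25.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


pinned_core = parse_version("3.25.3")       # PE-API's pin, per the text above
required_minimum = parse_version("3.26.0")  # lower bound from policyengine_us

# The pin satisfies the lower bound only if it is at least the minimum.
pin_satisfies_bound = pinned_core >= required_minimum
print(pin_satisfies_bound)  # False
```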

Test plan

- Local in-process: agent_tools.execute_tool("run_economy_simulation", ..., backend_id="uk_python") returns PE-API-matching B1 numbers in 12s
- get_tool_definitions("uk_python") includes the new tool; get_tool_definitions("uk_compiled") does not
- A bad reform field returns a clean error listing the valid fields and doesn't crash the engine
- Deploy the preview, run evals/runner/run.py b1_society_wide_pa --backend-url <preview> 3× and confirm chat picks run_economy_simulation over run_python
- B2, b6-b10 reform scenarios pass against the same preview

Brings backend/tooling/ + backend/tool_definitions.py from anth-volk's codex/standalone-typed-tools branch as-is. No wiring yet; this just lands the package so a follow-up commit can add backend_id dispatch and a uk_python engine adapter.

Files imported verbatim from PR #94 commits b7853bf + 38e4e69:

- backend/tooling/__init__.py
- backend/tooling/serialization.py
- backend/tooling/simulations.py
- backend/tooling/reforms.py
- backend/tooling/households.py
- backend/tooling/microdata.py
- backend/tooling/sandbox.py
- backend/tool_definitions.py

Author: Anthony Volk (anth-volk)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

backend/tooling/backends/{__init__.py,uk_python.py} introduces an EngineBackend protocol and one implementation calling policyengine_uk's Microsimulation(reform=...) directly. Reform translator converts the typed-tool programme/field shape into the dotted-path-with-period dict policyengine_uk accepts — the same shape PE-API stores in /uk/policy/<id>. Mapping covers the parameters touched by the eval B-suite (PA amount, IT band rates, NI main rate + primary threshold, child benefit amounts). Verified against CountryTaxBenefitSystem.parameters.get_child(...) at authoring time. Tools aren't wired into agent_tools.execute_tool yet — that comes in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
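A reform translator of the kind described above might look roughly like this. Everything in the sketch is an assumption made for illustration: the nested programme/field input shape, the two parameter paths (which should be checked against the policyengine_uk parameter tree), the period key format, and the names `FIELD_TO_PARAMETER`, `UnknownReformField`, and `translate_reform`.

```python
from typing import Any, Dict, Tuple

# Hypothetical (programme, field) -> dotted parameter path mapping.
# The paths are illustrative and not verified against policyengine_uk.
FIELD_TO_PARAMETER: Dict[Tuple[str, str], str] = {
    ("income_tax", "personal_allowance"):
        "gov.hmrc.income_tax.allowances.personal_allowance.amount",
    ("national_insurance", "main_rate"):
        "gov.hmrc.national_insurance.class_1.rates.employee.main",
}


class UnknownReformField(ValueError):
    """Raised for a programme/field pair that has no mapping."""


def translate_reform(
    reform: Dict[str, Dict[str, Any]],
    period: str = "2025-01-01.2100-12-31",  # assumed "start.end" period key
) -> Dict[str, Dict[str, Any]]:
    """Convert {programme: {field: value}} into {dotted.path: {period: value}}."""
    translated: Dict[str, Dict[str, Any]] = {}
    for programme, fields in reform.items():
        for field, value in fields.items():
            path = FIELD_TO_PARAMETER.get((programme, field))
            if path is None:
                valid = ", ".join(f"{p}.{f}" for p, f in sorted(FIELD_TO_PARAMETER))
                raise UnknownReformField(
                    f"Unknown reform field {programme}.{field}; valid fields: {valid}"
                )
            translated[path] = {period: value}
    return translated


b1_reform = translate_reform({"income_tax": {"personal_allowance": 15_000}})
```

Raising a dedicated error that lists the valid fields is one way to get the "clean error" behaviour mentioned in the test plan without the engine ever seeing a bad path.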

execute_tool now dispatches run_economy_simulation to the registered backend (currently uk_python). get_tool_definitions exposes the tool to the LLM only when the chat session selected uk_python — keeps uk_compiled paths untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
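The gating described in this commit can be sketched as follows. This is a simplified stand-in, not the code in agent_tools.py: the tool schemas are reduced to names, the baseline tool list is a placeholder, and the `handlers` argument is a hypothetical substitute for the backend registry.

```python
from typing import Any, Callable, Dict, List

# Placeholder tool definitions; the real ones are full JSON Schemas.
RUN_PYTHON = {"name": "run_python"}
RUN_ECONOMY_SIMULATION = {"name": "run_economy_simulation"}

# Backends whose sessions are shown the typed tool (currently one).
TYPED_TOOL_BACKENDS = {"uk_python"}


def get_tool_definitions(backend_id: str) -> List[Dict[str, Any]]:
    """Expose run_economy_simulation only to sessions on a typed-tool backend."""
    tools = [RUN_PYTHON]
    if backend_id in TYPED_TOOL_BACKENDS:
        tools.append(RUN_ECONOMY_SIMULATION)
    return tools


def execute_tool(
    name: str,
    arguments: Dict[str, Any],
    backend_id: str,
    handlers: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]],
) -> Dict[str, Any]:
    """Route the typed tool to the handler registered for backend_id."""
    if name == "run_economy_simulation":
        handler = handlers.get(backend_id)
        if handler is None:
            return {"error": f"run_economy_simulation is not available for backend {backend_id!r}"}
        return handler(arguments)
    return {"error": f"Unknown tool {name!r}"}


# Stub handler that echoes the reform instead of running a simulation.
handlers = {"uk_python": lambda args: {"ok": True, "reform": args.get("reform")}}
result = execute_tool("run_economy_simulation", {"reform": {}}, "uk_python", handlers)
```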

Inlines ~50 lines of glue that already exist as calculate_economy_comparison() in the policyengine orchestrator package. We can't import that package today because a pip dependency conflict (policyengine_us pulled in transitively wants policyengine_core>=3.26, the orchestrator at 0.13.0 was built against ==3.25.3) makes the deps unresolvable from the chat backend's lockfile.

Choice rationale: rather than spend hours pinning four packages across deploy environments and validating Modal still builds, we replicate the 50 lines so we can keep iterating on the typed-tool eval today. The underlying engine calls are identical, so numbers match PE-API by construction.

End-to-end against B1 (PA £12,570 -> £15,000) reproduces PE-API to the second decimal across budget, all 10 deciles, and all 4 poverty groups. £86 budget gap on £20bn is float-order-of-operations noise.

TODO marked at the top of run_economy_simulation: drop this block when the orchestrator imports cleanly (PolicyEngine release with coherent deps, or [uk]-extras for country backends).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Top of UKPolicyEnginePythonBackend.prompt_context() now opens with a PREFERRED TOOL block instructing the LLM to call run_economy_simulation for any society-wide reform question instead of writing Python through run_python. Includes the programme/field reform shape with worked examples for the four B-suite scenarios covered today (PA, basic rate, NI main rate, stacked). Without this prompt change the LLM defaults to run_python on uk_python sessions and the typed-tool gain disappears (it still sees run_economy_simulation in the tool list, just doesn't pick it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-06-02T18:05:35Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| policyengine-uk-chat | Ready | Preview, Comment | Jun 2, 2026 8:30pm |

github-actions · 2026-06-02T18:06:33Z

Beta preview is ready.

Frontend: open preview
Backend: open backend

…referred

Previous wording ('PREFERRED TOOL') let the LLM read it as a suggestion and call run_python first to 'explore the API' (4-18 wasted calls per run on the B1 preview test before eventually reaching for the typed tool).

Restructured as a TOOL ROUTING decision tree at the very top of the context:

- 'Society-wide reform question? CALL run_economy_simulation ON YOUR FIRST TURN. Do not call run_python first. Do not call capabilities().'
- 'Reform expressible in the mapped fields? run_economy_simulation is REQUIRED. Falling back to run_python is a bug.'
- 'Otherwise, use run_python.'

Mirrors Anthony's structure in PR 94's prompts.py but with a stricter injunction tuned to what reliably routes Claude on first turn: explicit 'Do not call X' and 'REQUIRED' instead of 'prefer'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
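Read as ordinary control flow, the routing rule amounts to something like the sketch below. This is a paraphrase for illustration only: in the PR the routing is done by the LLM from prompt text, and `MAPPED_FIELDS`, its entries, and `choose_tool` are hypothetical names.

```python
from typing import Iterable, Set

# Hypothetical identifiers for the fields the reform translator can map.
MAPPED_FIELDS: Set[str] = {
    "income_tax.personal_allowance",
    "income_tax.basic_rate",
    "national_insurance.main_rate",
}


def choose_tool(is_society_wide_reform: bool, reform_fields: Iterable[str]) -> str:
    """Return the tool the decision tree would pick on the first turn."""
    fields = set(reform_fields)
    # A society-wide reform whose fields are all mapped must use the typed tool.
    if is_society_wide_reform and fields and fields <= MAPPED_FIELDS:
        return "run_economy_simulation"
    # Everything else falls back to free-form Python.
    return "run_python"


print(choose_tool(True, ["income_tax.personal_allowance"]))  # run_economy_simulation
print(choose_tool(True, ["council_tax.band_d"]))             # run_python
```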

SakshiKekre and others added 5 commits June 2, 2026 09:44

vercel Bot deployed to Preview June 2, 2026 18:05 View deployment

vercel Bot deployed to Preview June 2, 2026 20:30 View deployment

SakshiKekre wants to merge 6 commits into feat/model-backend-selector from feat/uk-python-typed-tools
