Add uk_python typed tools (run_economy_simulation) to fix B1 reform failures#97

Draft
SakshiKekre wants to merge 6 commits into feat/model-backend-selector from feat/uk-python-typed-tools


@SakshiKekre
Collaborator

Summary

Adds a run_economy_simulation typed tool that calls policyengine_uk.Microsimulation(reform=...) directly, replacing the LLM's trial-and-error run_python thrashing that currently fails B1 in 3 of 4 ways.

Stacked on #51 (model-backend-selector). Imports typed-tool infrastructure from @anth-volk's #94 (tooling/ package, tool_definitions.py) and wires a new uk_python engine adapter on top — leaves Anthony's compiled-engine path untouched.

Why this PR exists

This morning's B1 eval (feat/eval-harness, model_backend=uk_python) found four distinct failure modes of run_python against the B1 prompt (PA £12,570 → £15,000, economy-wide):

| Run | Outcome | Budget |
| --- | --- | --- |
| 1 | Got close after 39 tool calls, wrong poverty measure | -£21.7bn (PE-API: -£20.45bn, 6.3% off) |
| 2 | Self-aware bail after 5 failed attempts at reform application | No number |
| 3 | Clean refusal: "uk_python doesn't support reforms" | No number |
| Manual | Silently fabricated by hand-arithmetic | -£5.45bn (4× too small) |

All four fail because Claude is re-discovering how to apply a parameter reform via the Python engine every run, and the API surface is non-obvious. Anthony's #94 introduces typed tools to remove this surface for the compiled (Rust) engine; this PR extends them to the Python engine.

End-to-end result (local in-process)

Adapter against B1, single call, 12 seconds wall time:

| Field | PE-API v1 | uk_python adapter | Δ |
| --- | --- | --- | --- |
| Budget impact | -£20,452,307,209 | -£20,452,307,295 | £86 (0.0000004%) |
| Tax revenue change | -£21,084,668,700 | -£21,084,668,787 | £87 |
| Benefit spending | -£632,361,491 | -£632,361,491 | exact |
| D1-D10 avg & relative | (all 20 values) | (all 20 values) | exact to nearest £/0.001% |
| Child / adult / senior / all poverty BHC | 14.06 / 8.76 / 12.16 / 10.50 | 14.06 / 8.76 / 12.16 / 10.50 | exact |

Same engine + same dataset + same MicroSeries formulas → numbers match PE-API by construction.
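The PE-API column is consistent with budget impact = tax revenue change minus benefit spending change. That identity is inferred from the reported numbers, not taken from the adapter's code. A quick arithmetic check, which also reproduces the size of the £86 gap:

```python
# Figures copied from the results table (pounds).
pe_api_tax_change = -21_084_668_700
pe_api_benefit_change = -632_361_491
pe_api_budget = -20_452_307_209
adapter_budget = -20_452_307_295

# Inferred identity: budget impact = tax revenue change - benefit spending change.
derived_budget = pe_api_tax_change - pe_api_benefit_change

# Absolute and relative gap between the adapter and PE-API.
gap = abs(adapter_budget - pe_api_budget)
gap_pct = 100 * gap / abs(pe_api_budget)

print(derived_budget == pe_api_budget)
print(gap, f"{gap_pct:.7f}%")
```

Applying the same identity to the adapter column lands within £1 of the reported adapter budget figure, which fits rounding of the displayed components.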

Architecture

backend/
  agent_tools.py                 ← wires dispatch on backend_id
  model_backends.py              ← uk_python prompt now mentions run_economy_simulation
  tool_definitions.py            ← typed tool JSON-Schemas (from #94)
  tooling/                       ← engine-agnostic helpers (from #94)
    {households,microdata,reforms,sandbox,serialization,simulations}.py
    backends/                    ← NEW: per-engine implementations
      __init__.py                ← EngineBackend protocol + registry
      uk_python.py               ← policyengine_uk implementation + reform translator
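A minimal sketch of what the EngineBackend protocol and registry in tooling/backends/__init__.py could look like. The method signature, the `register_backend`/`get_backend` helpers, and the fake adapter are all assumptions for illustration. The real module may differ.

```python
from typing import Any, Dict, Protocol


class EngineBackend(Protocol):
    """Interface each per-engine adapter is assumed to implement."""

    backend_id: str

    def run_economy_simulation(self, reform: Dict[str, Any], year: int) -> Dict[str, Any]:
        ...


# Registry keyed by backend_id (hypothetical name).
_BACKENDS: Dict[str, EngineBackend] = {}


def register_backend(backend: EngineBackend) -> None:
    _BACKENDS[backend.backend_id] = backend


def get_backend(backend_id: str) -> EngineBackend:
    try:
        return _BACKENDS[backend_id]
    except KeyError:
        raise ValueError(
            f"No typed-tool backend registered for {backend_id!r}; "
            f"known backends: {sorted(_BACKENDS)}"
        ) from None


class FakeUKPythonBackend:
    """Stand-in so the sketch runs without policyengine_uk installed."""

    backend_id = "uk_python"

    def run_economy_simulation(self, reform, year):
        # The real adapter would run the baseline and reformed simulations here.
        return {"backend": self.backend_id, "reform": reform, "year": year}


register_backend(FakeUKPythonBackend())
result = get_backend("uk_python").run_economy_simulation(
    {"income_tax": {"personal_allowance": 15000}}, 2026
)
```

Keeping the registry separate from the adapters means a backend with no typed-tool implementation fails with an explicit error rather than falling through to another engine.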

Commit walk

  1. 1a4b347 Import tooling/ + tool_definitions.py verbatim from #94 ([codex] Expose standalone typed tools); no wiring
  2. 5ef5ded Add tooling/backends/ subpackage with uk_python adapter + reform translator
  3. 8fdfc1e Dispatch run_economy_simulation in agent_tools.execute_tool for uk_python only
  4. 2adc5c1 [TEMP] Mirror PE-API's calculate_economy_comparison methodology line-by-line (see commit message for why we can't import the orchestrator today)
  5. ea11775 Update uk_python system prompt to prefer the typed tool

Known tech debt — see TODO in tooling/backends/uk_python.py

run_economy_simulation inlines ~50 lines of glue that already exist as calculate_economy_comparison in the policyengine orchestrator package. We can't import it today: pip install policyengine==0.13.0 transitively requires policyengine_us, which requires policyengine_core>=3.26.0, and that conflicts with PE-API's pin of policyengine_core==3.25.3. Drop the inlined glue when the orchestrator imports cleanly.
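The conflict reduces to an exact pin that sits below a minimum-version floor. A small sketch of that check, using plain tuples rather than pip's actual resolver:

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '3.25.3' into (3, 25, 3) so versions compare in order."""
    return tuple(int(part) for part in version.split("."))


# Constraints quoted in the paragraph above.
core_floor = parse("3.26.0")   # policyengine_core>=3.26.0 (via policyengine_us)
core_pin = parse("3.25.3")     # policyengine_core==3.25.3 (PE-API pin)

# An exact pin admits exactly one version, so both constraints hold
# only if that single version also clears the floor.
resolvable = core_pin >= core_floor
print(resolvable)
```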

Test plan

  • Local in-process: agent_tools.execute_tool("run_economy_simulation", ..., backend_id="uk_python") returns PE-API-matching B1 numbers in 12s
  • get_tool_definitions("uk_python") includes new tool; ("uk_compiled") does not
  • Bad reform field returns clean error listing valid fields, doesn't crash engine
  • Deploy preview, run evals/runner/run.py b1_society_wide_pa --backend-url <preview> 3× and confirm chat picks run_economy_simulation over run_python
  • B2, b6-b10 reform scenarios pass against the same preview
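The "bad reform field" item could be satisfied by validating the payload before it reaches the engine. A sketch under assumptions: the field names, the `validate_reform` helper, and the error payload shape are hypothetical, not the adapter's actual ones.

```python
from typing import Optional

# Hypothetical set of mapped reform fields; the real list lives in the adapter.
VALID_FIELDS = {
    "income_tax": {"personal_allowance", "basic_rate"},
    "national_insurance": {"main_rate", "primary_threshold"},
}


def validate_reform(reform: dict) -> Optional[dict]:
    """Return an error payload for the first unknown programme or field, else None."""
    for programme, fields in reform.items():
        if programme not in VALID_FIELDS:
            return {
                "error": f"Unknown programme {programme!r}",
                "valid_programmes": sorted(VALID_FIELDS),
            }
        for field in fields:
            if field not in VALID_FIELDS[programme]:
                return {
                    "error": f"Unknown field {field!r} for programme {programme!r}",
                    "valid_fields": sorted(VALID_FIELDS[programme]),
                }
    return None


# A misspelled field yields an error listing the valid ones instead of raising.
bad = validate_reform({"income_tax": {"personal_allowence": 15000}})
ok = validate_reform({"income_tax": {"personal_allowance": 15000}})
```

Returning the valid field list in the error gives the LLM something to correct against on its next call.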

SakshiKekre and others added 5 commits June 2, 2026 09:44
Brings backend/tooling/ + backend/tool_definitions.py from
anth-volk's codex/standalone-typed-tools branch as-is. No
wiring yet — this just lands the package so a follow-up commit
can add backend_id dispatch and a uk_python engine adapter.

Files imported verbatim from PR #94 commits b7853bf + 38e4e69:
- backend/tooling/__init__.py
- backend/tooling/serialization.py
- backend/tooling/simulations.py
- backend/tooling/reforms.py
- backend/tooling/households.py
- backend/tooling/microdata.py
- backend/tooling/sandbox.py
- backend/tool_definitions.py

Author: Anthony Volk (anth-volk)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backend/tooling/backends/{__init__.py,uk_python.py} introduces an
EngineBackend protocol and one implementation calling policyengine_uk's
Microsimulation(reform=...) directly.

Reform translator converts the typed-tool programme/field shape into the
dotted-path-with-period dict policyengine_uk accepts — the same shape
PE-API stores in /uk/policy/<id>. Mapping covers the parameters touched
by the eval B-suite (PA amount, IT band rates, NI main rate + primary
threshold, child benefit amounts). Verified against
CountryTaxBenefitSystem.parameters.get_child(...) at authoring time.

Tools aren't wired into agent_tools.execute_tool yet — that comes in
the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
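A sketch of the reform translator described in this commit message. The programme and field names, the parameter paths, and the "start.end" period key are assumptions modelled on policyengine_uk's naming style. None of them are verified against the actual mapping.

```python
# Hypothetical programme/field -> parameter path mapping (paths are assumptions).
FIELD_TO_PARAMETER = {
    ("income_tax", "personal_allowance"):
        "gov.hmrc.income_tax.allowances.personal_allowance.amount",
    ("income_tax", "basic_rate"):
        "gov.hmrc.income_tax.rates.uk[0].rate",
    ("national_insurance", "main_rate"):
        "gov.hmrc.national_insurance.class_1.rates.employee.main",
}


def translate_reform(reform: dict, year: int) -> dict:
    """Convert {programme: {field: value}} into {parameter_path: {period: value}}."""
    period = f"{year}-01-01.{year}-12-31"  # assumed period key format
    translated = {}
    for programme, fields in reform.items():
        for field, value in fields.items():
            # Unmapped fields raise KeyError here; a caller would validate first.
            path = FIELD_TO_PARAMETER[(programme, field)]
            translated[path] = {period: value}
    return translated


# B1-style reform: personal allowance raised to £15,000 (year is illustrative).
b1 = translate_reform({"income_tax": {"personal_allowance": 15000}}, 2026)
```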
execute_tool now dispatches run_economy_simulation to the registered
backend (currently uk_python). get_tool_definitions exposes the tool
to the LLM only when the chat session selected uk_python — keeps
uk_compiled paths untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
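A sketch of how tool exposure could be gated on the session's backend. The schema contents, the `TYPED_TOOLS_BY_BACKEND` table, and the assumption that run_python is shared by every backend are illustrative only. The real definitions come from tool_definitions.py.

```python
# Assumed shared tool; whether every backend really exposes it is not stated.
RUN_PYTHON = {
    "name": "run_python",
    "description": "Run Python in the sandbox.",
    "input_schema": {
        "type": "object",
        "properties": {"code": {"type": "string"}},
        "required": ["code"],
    },
}

# Hypothetical schema for the typed tool.
RUN_ECONOMY_SIMULATION = {
    "name": "run_economy_simulation",
    "description": "Run a society-wide reform and return budget, decile and poverty impacts.",
    "input_schema": {
        "type": "object",
        "properties": {
            "reform": {"type": "object"},
            "year": {"type": "integer"},
        },
        "required": ["reform"],
    },
}

# Typed tools are exposed per backend, so uk_compiled sessions are unaffected.
TYPED_TOOLS_BY_BACKEND = {"uk_python": [RUN_ECONOMY_SIMULATION]}


def get_tool_definitions(backend_id: str) -> list:
    """Return the tools the LLM should see for this session's backend."""
    return [RUN_PYTHON] + TYPED_TOOLS_BY_BACKEND.get(backend_id, [])


uk_python_names = [t["name"] for t in get_tool_definitions("uk_python")]
uk_compiled_names = [t["name"] for t in get_tool_definitions("uk_compiled")]
```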
Inlines ~50 lines of glue that already exist as
calculate_economy_comparison() in the policyengine orchestrator package.
We can't import that package today because a pip dependency conflict
(policyengine_us pulled in transitively wants policyengine_core>=3.26,
the orchestrator at 0.13.0 was built against ==3.25.3) makes the deps
unresolvable from the chat backend's lockfile.

Choice rationale: rather than spend hours pinning four packages across
deploy environments and validating Modal still builds, we replicate the
50 lines so we can keep iterating on the typed-tool eval today. The
underlying engine calls are identical, so numbers match PE-API by
construction.

End-to-end against B1 (PA £12,570 -> £15,000) reproduces PE-API to
the second decimal across budget, all 10 deciles, and all 4 poverty
groups. £86 budget gap on £20bn is float-order-of-operations noise.

TODO marked at the top of run_economy_simulation: drop this block when
the orchestrator imports cleanly (PolicyEngine release with coherent
deps, or [uk]-extras for country backends).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
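For orientation, the kind of comparison this glue performs can be sketched with plain lists standing in for the engine's weighted series. The data below is a toy example and the helper names are hypothetical. The real code uses the engine's MicroSeries and PE-API's exact formulas.

```python
def weighted_total(values, weights):
    """Population total: per-household values times survey weights, summed."""
    return sum(v * w for v, w in zip(values, weights))


def weighted_rate(flags, weights):
    """Weighted share of the population for which the flag is true."""
    return weighted_total([1.0 if f else 0.0 for f in flags], weights) / sum(weights)


# Toy data for three households (not real survey values).
weights = [1000.0, 2000.0, 1000.0]
baseline_tax = [0.0, 3000.0, 10000.0]
reform_tax = [0.0, 2500.0, 9500.0]
baseline_poor = [True, False, False]
reform_poor = [True, False, False]

# Reform minus baseline, each aggregated with the same weights.
tax_revenue_change = weighted_total(reform_tax, weights) - weighted_total(baseline_tax, weights)
poverty_rate_change = weighted_rate(reform_poor, weights) - weighted_rate(baseline_poor, weights)
```

Because both sides of the comparison run the same aggregation over the same weights, any remaining difference from PE-API would come from summation order, not from methodology.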
Top of UKPolicyEnginePythonBackend.prompt_context() now opens with a
PREFERRED TOOL block instructing the LLM to call
run_economy_simulation for any society-wide reform question instead
of writing Python through run_python. Includes the programme/field
reform shape with worked examples for the four B-suite scenarios
covered today (PA, basic rate, NI main rate, stacked).

Without this prompt change the LLM defaults to run_python on
uk_python sessions and the typed-tool gain disappears (it still
sees run_economy_simulation in the tool list, just doesn't pick it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 2, 2026


| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policyengine-uk-chat | Ready | Preview, Comment | Jun 2, 2026 8:30pm |


@github-actions

github-actions Bot commented Jun 2, 2026

Beta preview is ready.

…referred

Previous wording ('PREFERRED TOOL') let the LLM read it as a suggestion
and call run_python first to 'explore the API' (4-18 wasted calls per
run on the B1 preview test before eventually reaching for the typed
tool).

Restructured as a TOOL ROUTING decision tree at the very top of the
context:

- 'Society-wide reform question? CALL run_economy_simulation ON YOUR
   FIRST TURN. Do not call run_python first. Do not call capabilities().'
- 'Reform expressible in the mapped fields? run_economy_simulation is
   REQUIRED. Falling back to run_python is a bug.'
- 'Otherwise, use run_python.'

Mirrors Anthony's structure in PR 94's prompts.py but with a stricter
injunction tuned to what reliably routes Claude on first turn:
explicit 'Do not call X' and 'REQUIRED' instead of 'prefer'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
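The routing itself is done by the LLM reading the prompt, not by code. Still, one reading of the decision tree can be written as a function to make the intended behaviour explicit. The `MAPPED_FIELDS` set and the `route_tool` helper are hypothetical.

```python
# Hypothetical set of reform fields the typed tool can express.
MAPPED_FIELDS = {"personal_allowance", "basic_rate", "ni_main_rate", "primary_threshold"}


def route_tool(is_society_wide: bool, reform_fields: set) -> str:
    """Pick the first tool to call, following one reading of the routing tree."""
    expressible = bool(reform_fields) and reform_fields <= MAPPED_FIELDS
    if is_society_wide and expressible:
        # Required on the first turn; falling back to run_python is treated as a bug.
        return "run_economy_simulation"
    return "run_python"


first_call_b1 = route_tool(True, {"personal_allowance"})
first_call_unmapped = route_tool(True, {"some_unmapped_parameter"})
```

The tree leaves open what happens for a society-wide question whose reform is not expressible in the mapped fields. This sketch sends that case to run_python.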