Add uk_python typed tools (run_economy_simulation) to fix B1 reform failures#97
Draft
SakshiKekre wants to merge 6 commits into
Brings backend/tooling/ + backend/tool_definitions.py from anth-volk's codex/standalone-typed-tools branch as-is. No wiring yet; this just lands the package so a follow-up commit can add backend_id dispatch and a uk_python engine adapter.
Files imported verbatim from PR #94 commits b7853bf + 38e4e69:
- backend/tooling/__init__.py
- backend/tooling/serialization.py
- backend/tooling/simulations.py
- backend/tooling/reforms.py
- backend/tooling/households.py
- backend/tooling/microdata.py
- backend/tooling/sandbox.py
- backend/tool_definitions.py
Author: Anthony Volk (anth-volk)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backend/tooling/backends/{__init__.py,uk_python.py} introduces an
EngineBackend protocol and one implementation calling policyengine_uk's
Microsimulation(reform=...) directly.
Reform translator converts the typed-tool programme/field shape into the
dotted-path-with-period dict policyengine_uk accepts — the same shape
PE-API stores in /uk/policy/<id>. Mapping covers the parameters touched
by the eval B-suite (PA amount, IT band rates, NI main rate + primary
threshold, child benefit amounts). Verified against
CountryTaxBenefitSystem.parameters.get_child(...) at authoring time.
Tools aren't wired into agent_tools.execute_tool yet — that comes in
the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
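A minimal sketch of the reform translation described in the commit above, for readers unfamiliar with the two shapes. The typed input shape (`{programme: {field: value}}`), the function name `translate_reform`, the mapping table, and the period string are all assumptions made for illustration; the dotted parameter paths are likewise assumed and should be checked against the installed policyengine_uk parameter tree before use.

```python
from typing import Dict, Tuple, Union

Number = Union[int, float]

# Hypothetical (programme, field) -> dotted parameter path table.
# These paths are assumptions for illustration, not copied from the PR.
FIELD_TO_PARAMETER: Dict[Tuple[str, str], str] = {
    ("income_tax", "personal_allowance"):
        "gov.hmrc.income_tax.allowances.personal_allowance.amount",
    ("income_tax", "basic_rate"):
        "gov.hmrc.income_tax.rates.uk[0].rate",
    ("national_insurance", "main_rate"):
        "gov.hmrc.national_insurance.class_1.rates.employee.main",
}

# Assumed "start.end" period key; the real adapter may derive it differently.
DEFAULT_PERIOD = "2025-01-01.2100-12-31"


def translate_reform(
    typed_reform: Dict[str, Dict[str, Number]],
    period: str = DEFAULT_PERIOD,
) -> Dict[str, Dict[str, Number]]:
    """Convert {programme: {field: value}} into {dotted.path: {period: value}}."""
    reform: Dict[str, Dict[str, Number]] = {}
    for programme, fields in typed_reform.items():
        for field, value in fields.items():
            key = (programme, field)
            if key not in FIELD_TO_PARAMETER:
                # Fail loudly so unmapped fields never silently become no-ops.
                raise KeyError(f"Unmapped reform field: {programme}.{field}")
            reform[FIELD_TO_PARAMETER[key]] = {period: value}
    return reform
```

Raising on an unmapped field, rather than skipping it, keeps a partially translated reform from producing plausible-looking but wrong numbers.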
execute_tool now dispatches run_economy_simulation to the registered backend (currently uk_python). get_tool_definitions exposes the tool to the LLM only when the chat session selected uk_python, which keeps uk_compiled paths untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
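The dispatch and gating described in this commit could look roughly like the sketch below. Only the names `execute_tool`, `get_tool_definitions`, `run_economy_simulation`, and the `backend_id` values come from the PR text; the registry dict, the stub backend class, and the return shapes are hypothetical stand-ins, and other tools are omitted.

```python
from typing import Any, Dict, List


class StubUKPythonBackend:
    """Hypothetical stand-in for the real uk_python engine adapter."""

    def run_economy_simulation(self, reform: Dict[str, Any]) -> Dict[str, Any]:
        # The real adapter would run the microsimulation; this just echoes.
        return {"backend": "uk_python", "reform": reform}


# Hypothetical registry keyed by backend_id.
BACKENDS: Dict[str, Any] = {"uk_python": StubUKPythonBackend()}

RUN_ECONOMY_SIMULATION_DEF = {
    "name": "run_economy_simulation",
    "description": "Run a society-wide reform simulation (illustrative text).",
}


def get_tool_definitions(backend_id: str) -> List[Dict[str, str]]:
    """Expose the typed tool only for sessions that selected uk_python."""
    tools: List[Dict[str, str]] = []  # other tools omitted in this sketch
    if backend_id == "uk_python":
        tools.append(RUN_ECONOMY_SIMULATION_DEF)
    return tools


def execute_tool(name: str, arguments: Dict[str, Any], backend_id: str) -> Dict[str, Any]:
    """Route run_economy_simulation to the backend registered for backend_id."""
    if name == "run_economy_simulation":
        backend = BACKENDS.get(backend_id)
        if backend is None:
            raise ValueError(f"No typed-tool backend registered for {backend_id!r}")
        return backend.run_economy_simulation(arguments["reform"])
    raise NotImplementedError(f"Tool {name!r} is not handled in this sketch")
```

Gating at definition time as well as at dispatch time means a uk_compiled session never sees the tool and cannot reach the new code path by accident.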
Inlines ~50 lines of glue that already exist as calculate_economy_comparison() in the policyengine orchestrator package. We can't import that package today because a pip dependency conflict (policyengine_us pulled in transitively wants policyengine_core>=3.26, the orchestrator at 0.13.0 was built against ==3.25.3) makes the deps unresolvable from the chat backend's lockfile.
Choice rationale: rather than spend hours pinning four packages across deploy environments and validating Modal still builds, we replicate the 50 lines so we can keep iterating on the typed-tool eval today. The underlying engine calls are identical, so numbers match PE-API by construction.
End-to-end against B1 (PA £12,570 -> £15,000) reproduces PE-API to the second decimal across budget, all 10 deciles, and all 4 poverty groups. £86 budget gap on £20bn is float-order-of-operations noise.
TODO marked at the top of run_economy_simulation: drop this block when the orchestrator imports cleanly (PolicyEngine release with coherent deps, or [uk]-extras for country backends).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
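For orientation, the kind of baseline-versus-reform glue this commit refers to can be sketched in plain Python. This is not the code from calculate_economy_comparison: the function name `compare_economy`, its inputs, and the simplified formulas (budget impact as the weighted change in tax only, decile impact as weighted income change over weighted baseline income) are assumptions chosen to keep the example small.

```python
from collections import defaultdict
from typing import Dict, Sequence


def compare_economy(
    baseline_tax: Sequence[float],
    reform_tax: Sequence[float],
    baseline_income: Sequence[float],
    reform_income: Sequence[float],
    weights: Sequence[float],
    deciles: Sequence[int],
) -> Dict[str, object]:
    """Simplified household-level comparison of a baseline and a reform run."""
    # Weighted change in tax paid; benefits are ignored in this sketch.
    budget_impact = sum(
        w * (r - b) for w, b, r in zip(weights, baseline_tax, reform_tax)
    )

    gain: Dict[int, float] = defaultdict(float)
    base: Dict[int, float] = defaultdict(float)
    for w, b, r, d in zip(weights, baseline_income, reform_income, deciles):
        gain[d] += w * (r - b)  # weighted income change in this decile
        base[d] += w * b        # weighted baseline income in this decile

    relative = {d: gain[d] / base[d] for d in sorted(base) if base[d] != 0}
    return {"budget_impact": budget_impact, "decile_relative_change": relative}
```

Because both runs share the same weights and decile assignment, any difference in the outputs comes from the reform alone.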
Top of UKPolicyEnginePythonBackend.prompt_context() now opens with a PREFERRED TOOL block instructing the LLM to call run_economy_simulation for any society-wide reform question instead of writing Python through run_python. Includes the programme/field reform shape with worked examples for the four B-suite scenarios covered today (PA, basic rate, NI main rate, stacked).
Without this prompt change the LLM defaults to run_python on uk_python sessions and the typed-tool gain disappears (it still sees run_economy_simulation in the tool list, just doesn't pick it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…referred
Previous wording ('PREFERRED TOOL') let the LLM read it as a suggestion
and call run_python first to 'explore the API' (4-18 wasted calls per
run on the B1 preview test before eventually reaching for the typed
tool).
Restructured as a TOOL ROUTING decision tree at the very top of the
context:
- 'Society-wide reform question? CALL run_economy_simulation ON YOUR
FIRST TURN. Do not call run_python first. Do not call capabilities().'
- 'Reform expressible in the mapped fields? run_economy_simulation is
REQUIRED. Falling back to run_python is a bug.'
- 'Otherwise, use run_python.'
Mirrors Anthony's structure in PR 94's prompts.py but with a stricter
injunction tuned to what reliably routes Claude on first turn:
explicit 'Do not call X' and 'REQUIRED' instead of 'prefer'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
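One way to picture the restructuring is a prompt builder that always emits the routing block before anything else. The three quoted rules below are taken from the commit message above; the constant name `TOOL_ROUTING`, the heading line, and the free function `prompt_context` (standing in for the backend method) are assumptions for illustration.

```python
# Routing rules quoted from the commit message; the heading line is assumed.
TOOL_ROUTING = "\n".join(
    [
        "TOOL ROUTING",
        "- Society-wide reform question? CALL run_economy_simulation ON YOUR "
        "FIRST TURN. Do not call run_python first. Do not call capabilities().",
        "- Reform expressible in the mapped fields? run_economy_simulation is "
        "REQUIRED. Falling back to run_python is a bug.",
        "- Otherwise, use run_python.",
    ]
)


def prompt_context(rest_of_context: str) -> str:
    """Return the backend context with the routing block at the very top."""
    return TOOL_ROUTING + "\n\n" + rest_of_context
```

Keeping the block as a single constant makes it easy to assert in a test that nothing is ever prepended ahead of it.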
Summary
Adds a run_economy_simulation typed tool that calls policyengine_uk.Microsimulation(reform=...) directly, replacing the LLM's trial-and-error run_python thrashing that currently fails B1 in 3 of 4 ways.
Stacked on #51 (model-backend-selector). Imports typed-tool infrastructure from @anth-volk's #94 (tooling/ package, tool_definitions.py) and wires a new uk_python engine adapter on top; leaves Anthony's compiled-engine path untouched.
Why this PR exists
This morning's B1 eval (feat/eval-harness, model_backend=uk_python) found four distinct failure modes of run_python against the B1 prompt (PA £12,570 → £15,000, economy-wide):
All four fail because Claude is re-discovering how to apply a parameter reform via the Python engine every run, and the API surface is non-obvious. Anthony's #94 introduces typed tools to remove this surface for the compiled (Rust) engine; this PR extends them to the Python engine.
End-to-end result (local in-process)
Adapter against B1, single call, 12 seconds wall time:
Same engine + same dataset + same MicroSeries formulas → numbers match PE-API by construction.
Architecture
Commit walk
- 1a4b347 Import [codex] Expose standalone typed tools #94's tooling/ + tool_definitions.py verbatim (no wiring)
- 5ef5ded Add tooling/backends/ subpackage with uk_python adapter + reform translator
- 8fdfc1e Dispatch run_economy_simulation in agent_tools.execute_tool for uk_python only
- 2adc5c1 [TEMP] Mirror PE-API's calculate_economy_comparison methodology line-by-line (see commit message for why we can't import the orchestrator today)
- ea11775 Update uk_python system prompt to prefer the typed tool
Known tech debt: see TODO in tooling/backends/uk_python.py
run_economy_simulation inlines ~50 lines of glue that already exist as calculate_economy_comparison in the policyengine orchestrator package. We can't import it today because pip install policyengine==0.13.0 transitively requires policyengine_us, which pins policyengine_core>=3.26.0, conflicting with PE-API's pinned policyengine_core==3.25.3. Drop the inlined glue when the orchestrator imports cleanly.
Test plan
- agent_tools.execute_tool("run_economy_simulation", ..., backend_id="uk_python") returns PE-API-matching B1 numbers in 12s
- get_tool_definitions("uk_python") includes new tool; ("uk_compiled") does not
- evals/runner/run.py b1_society_wide_pa --backend-url <preview> 3× and confirm chat picks run_economy_simulation over run_python