Register the three dormant typed tools (UK) by SakshiKekre · Pull Request #55 · PolicyEngine/policyengine-uk-chat

SakshiKekre · 2026-05-27T15:45:47Z

What

Re-exposes `calculate_household`, `run_economy_simulation`, and `analyse_microdata` to the LLM. These have existed in `backend/agent_tools.py` since the early days but were quietly removed from `get_tool_definitions()` in PR #11's `compute → run_python` rewrite. No documented rationale for the removal — tests still cover them, the dispatch dict still handles them; only the LLM-facing registration was missing.

Why now

The eval harness in PR #52 makes this empirically testable. We can A/B:

Control: main with `run_python` only
Treatment: this branch with all four tools registered
Run PR Add eval harness scaffold: spec, scenarios, fixtures dir #52's 10 scenarios × 3 = 60 conversations against each preview
Use the new `tool_usage.py` aggregator (also in PR Add eval harness scaffold: spec, scenarios, fixtures dir #52) to see whether Claude actually picks the typed tools

The hypothesis: methodology drift on society-wide questions (the ~6-10% divergence seen in the May 12 benchmark) comes from Claude picking BHC vs AHC, dataset year, decile definition inside the Python it writes. Typed tools pin those choices in the engine objects, so the drift should shrink — at least on questions that route to typed tools.

Stacked on PR #51

Base = `feat/model-backend-selector`. The backend-selector + scenario_context plumbing is needed in the preview for the eval harness to target this build.

What changed

`agent_tools.py`: 3 entries added to `execute_tool()` dispatch + 4 entries in `get_tool_definitions()` (3 typed + updated `run_python` framed as fallback). Shared `REFORM_SCHEMA`.
`chatbot.py`: system prompt rewritten from "one execution tool" to a four-tool description with "prefer typed tools first".
`ChatPage.tsx`: label map so the SSE tool indicator shows "household sim" / "economy sim" / "microdata analysis" instead of raw snake_case.

Deliberately NOT in scope

`structural_reform` arg on `run_economy_simulation` stays unexposed (Python-as-string inside JSON is awkward; structural reforms route to `run_python`)
No feature flag — eval IS the gate. If A/B shows regression, this PR doesn't merge.

Test plan

Wait for Modal preview deploy to land
Run PR Add eval harness scaffold: spec, scenarios, fixtures dir #52 eval against PR Add model backend selector and Claude Agent SDK runner #51's preview (control)
Run PR Add eval harness scaffold: spec, scenarios, fixtures dir #52 eval against this PR's preview (treatment)
Compare manifests with `tool_usage.py`; compare grader output for field accuracy + methodology drift
Write findings as a PR comment

Draft until eval evidence in.

Re-exposes calculate_household, run_economy_simulation, and analyse_microdata to the LLM via get_tool_definitions(). These functions have existed since the early agent_tools.py but were quietly removed from the registry in PR #11 (compute → run_python rewrite) — no documented rationale. Tests still cover them, the dispatch dict still handles them; only the LLM-facing registration was missing. Why register them now: methodology drift on society-wide reform questions comes largely from Claude picking BHC vs AHC, dataset year, decile definition, etc. inside the Python code it writes. The typed tools pin those choices inside the engine objects. The dormant tools already work and the eval harness in PR #52 can A/B them — gate the change on the eval, not opinion. What changed: - agent_tools.py: add three entries to execute_tool() dispatch + four entries to get_tool_definitions() (the three typed tools, plus an updated run_python description that frames it as the fallback). Shared REFORM_SCHEMA describes the reform input shape. - chatbot.py: system prompt rewritten from "one execution tool" to a four-tool description with "prefer typed tools first" guidance. - ChatPage.tsx: label map so the SSE tool indicator reads "household sim" / "economy sim" / "microdata analysis" instead of raw snake_case (better for screenshots; better signal about what Claude is actually doing). Deliberately not in scope: - structural_reform argument on run_economy_simulation stays unexposed (Python-as-string inside JSON tool input is awkward; defer structural reforms to run_python where Python is natural) - no feature flag — the eval harness IS the gate

vercel · 2026-05-27T15:45:53Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policyengine-uk-chat	Ready	Preview, Comment	May 27, 2026 3:46pm

github-actions · 2026-05-27T15:46:34Z

Beta preview is ready.

Frontend: open preview
Backend: open backend

vahid-ahmadi

Review: functional spec check + verdict

Walked through this against main to verify the spec.

Functional check — passes.

All three target functions exist on main (calculate_household L218, run_economy_simulation L374, analyse_microdata L430) and are kwargs-callable, so the tools[tool_name](**tool_input) dispatch path works as-is — no wrapper needed.
The three new input_schema blocks line up with the function signatures: required matches the positional args (person/benunit/household for the household sim; entity/operation for microdata), defaults match (year=2025, dataset='frs', n=5), and the enums for entity/operation/dataset match the in-function lookup tables / dataset_labels.
Dropping structural_reform from the LLM-facing schemas (for both run_economy_simulation and analyse_microdata) is the right call — those reforms are Python objects, not JSON, so routing them through run_python keeps the schema honest. The system prompt already says so.
REFORM_SCHEMA deferring validation to _build_compiled_policy via additionalProperties: True is the pragmatic choice — programme names change, and the engine already raises clean errors.
System-prompt rewrite is well-ordered (most-specific → fallback) and the run_python description carries a matching FALLBACK note, so the agent gets the same signal from both directions.
Frontend label map is minimal and isolated.

Re: today's 262s timeout on "+1pp basic rate distributional". This PR is the fix. That question routes straight to run_economy_simulation(reform={\"income_tax\": {\"basic_rate\": 0.21}}) — one tool call returning decile_impacts, winners_losers, budgetary_impact in a single round trip, vs. the current path of free-forming the API in run_python and looping until it parses.

Verdict: ready to merge once eval evidence lands. Draft status is gated only on PR #52's A/B numbers per the PR body — there's no code blocker. CI green, MERGEABLE/CLEAN, base is PR #51 (stacked, also open). Suggest leaving in draft until eval data is posted, then flipping to ready. Closes #81 once merged.

vercel Bot deployed to Preview May 27, 2026 15:46 View deployment

SakshiKekre mentioned this pull request May 27, 2026

Add eval harness scaffold: spec, scenarios, fixtures dir #52

Open

This was referenced May 28, 2026

Land typed reform tools to short-circuit reform-API guessing (track PR #55) #81

Open

Conversation memory layer: persist households/scenarios across turns #86

Open

vahid-ahmadi reviewed May 29, 2026

View reviewed changes

SakshiKekre mentioned this pull request Jun 2, 2026

Enable multizone embedding via prefixed asset URLs #96

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Register the three dormant typed tools (UK)#55

Register the three dormant typed tools (UK)#55
SakshiKekre wants to merge 1 commit into
feat/model-backend-selectorfrom
feat/register-typed-tools

SakshiKekre commented May 27, 2026

Uh oh!

vercel Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

vahid-ahmadi left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SakshiKekre commented May 27, 2026

What

Why now

Stacked on PR #51

What changed

Deliberately NOT in scope

Test plan

Uh oh!

vercel Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

vahid-ahmadi left a comment

Choose a reason for hiding this comment

Review: functional spec check + verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel Bot commented May 27, 2026 •

edited

Loading