Skip to content

Register the three dormant typed tools (UK)#55

Draft
SakshiKekre wants to merge 1 commit into
feat/model-backend-selectorfrom
feat/register-typed-tools
Draft

Register the three dormant typed tools (UK)#55
SakshiKekre wants to merge 1 commit into
feat/model-backend-selectorfrom
feat/register-typed-tools

Conversation

@SakshiKekre
Copy link
Copy Markdown
Collaborator

What

Re-exposes `calculate_household`, `run_economy_simulation`, and `analyse_microdata` to the LLM. These have existed in `backend/agent_tools.py` since the early days but were quietly removed from `get_tool_definitions()` in PR #11's `compute → run_python` rewrite. No documented rationale for the removal — tests still cover them, the dispatch dict still handles them; only the LLM-facing registration was missing.

Why now

The eval harness in PR #52 makes this empirically testable. We can A/B:

The hypothesis: methodology drift on society-wide questions (the ~6-10% divergence seen in the May 12 benchmark) comes from Claude picking BHC vs AHC, dataset year, decile definition inside the Python it writes. Typed tools pin those choices in the engine objects, so the drift should shrink — at least on questions that route to typed tools.

Stacked on PR #51

Base = `feat/model-backend-selector`. The backend-selector + scenario_context plumbing is needed in the preview for the eval harness to target this build.

What changed

  • `agent_tools.py`: 3 entries added to `execute_tool()` dispatch + 4 entries in `get_tool_definitions()` (3 typed + updated `run_python` framed as fallback). Shared `REFORM_SCHEMA`.
  • `chatbot.py`: system prompt rewritten from "one execution tool" to a four-tool description with "prefer typed tools first".
  • `ChatPage.tsx`: label map so the SSE tool indicator shows "household sim" / "economy sim" / "microdata analysis" instead of raw snake_case.

Deliberately NOT in scope

  • `structural_reform` arg on `run_economy_simulation` stays unexposed (Python-as-string inside JSON is awkward; structural reforms route to `run_python`)
  • No feature flag — eval IS the gate. If A/B shows regression, this PR doesn't merge.

Test plan

Draft until eval evidence in.

Re-exposes calculate_household, run_economy_simulation, and analyse_microdata
to the LLM via get_tool_definitions(). These functions have existed since
the early agent_tools.py but were quietly removed from the registry in PR
#11 (compute → run_python rewrite) — no documented rationale. Tests still
cover them, the dispatch dict still handles them; only the LLM-facing
registration was missing.

Why register them now: methodology drift on society-wide reform questions
comes largely from Claude picking BHC vs AHC, dataset year, decile
definition, etc. inside the Python code it writes. The typed tools pin
those choices inside the engine objects. The dormant tools already work
and the eval harness in PR #52 can A/B them — gate the change on the
eval, not opinion.

What changed:
- agent_tools.py: add three entries to execute_tool() dispatch + four
  entries to get_tool_definitions() (the three typed tools, plus an
  updated run_python description that frames it as the fallback).
  Shared REFORM_SCHEMA describes the reform input shape.
- chatbot.py: system prompt rewritten from "one execution tool" to a
  four-tool description with "prefer typed tools first" guidance.
- ChatPage.tsx: label map so the SSE tool indicator reads
  "household sim" / "economy sim" / "microdata analysis" instead of
  raw snake_case (better for screenshots; better signal about what
  Claude is actually doing).

Deliberately not in scope:
- structural_reform argument on run_economy_simulation stays unexposed
  (Python-as-string inside JSON tool input is awkward; defer structural
  reforms to run_python where Python is natural)
- no feature flag — the eval harness IS the gate
@vercel
Copy link
Copy Markdown

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
policyengine-uk-chat Ready Ready Preview, Comment May 27, 2026 3:46pm

Request Review

@github-actions
Copy link
Copy Markdown

Beta preview is ready.

Copy link
Copy Markdown
Collaborator

@vahid-ahmadi vahid-ahmadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: functional spec check + verdict

Walked through this against main to verify the spec.

Functional check — passes.

  • All three target functions exist on main (calculate_household L218, run_economy_simulation L374, analyse_microdata L430) and are kwargs-callable, so the tools[tool_name](**tool_input) dispatch path works as-is — no wrapper needed.
  • The three new input_schema blocks line up with the function signatures: required matches the positional args (person/benunit/household for the household sim; entity/operation for microdata), defaults match (year=2025, dataset='frs', n=5), and the enums for entity/operation/dataset match the in-function lookup tables / dataset_labels.
  • Dropping structural_reform from the LLM-facing schemas (for both run_economy_simulation and analyse_microdata) is the right call — those reforms are Python objects, not JSON, so routing them through run_python keeps the schema honest. The system prompt already says so.
  • REFORM_SCHEMA deferring validation to _build_compiled_policy via additionalProperties: True is the pragmatic choice — programme names change, and the engine already raises clean errors.
  • System-prompt rewrite is well-ordered (most-specific → fallback) and the run_python description carries a matching FALLBACK note, so the agent gets the same signal from both directions.
  • Frontend label map is minimal and isolated.

Re: today's 262s timeout on "+1pp basic rate distributional". This PR is the fix. That question routes straight to run_economy_simulation(reform={\"income_tax\": {\"basic_rate\": 0.21}}) — one tool call returning decile_impacts, winners_losers, budgetary_impact in a single round trip, vs. the current path of free-forming the API in run_python and looping until it parses.

Verdict: ready to merge once eval evidence lands. Draft status is gated only on PR #52's A/B numbers per the PR body — there's no code blocker. CI green, MERGEABLE/CLEAN, base is PR #51 (stacked, also open). Suggest leaving in draft until eval data is posted, then flipping to ready. Closes #81 once merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants