Register the three dormant typed tools (UK)#55
Conversation
Re-exposes calculate_household, run_economy_simulation, and analyse_microdata to the LLM via get_tool_definitions(). These functions have existed since the early agent_tools.py but were quietly removed from the registry in PR #11 (compute → run_python rewrite) — no documented rationale. Tests still cover them, the dispatch dict still handles them; only the LLM-facing registration was missing. Why register them now: methodology drift on society-wide reform questions comes largely from Claude picking BHC vs AHC, dataset year, decile definition, etc. inside the Python code it writes. The typed tools pin those choices inside the engine objects. The dormant tools already work and the eval harness in PR #52 can A/B them — gate the change on the eval, not opinion. What changed: - agent_tools.py: add three entries to execute_tool() dispatch + four entries to get_tool_definitions() (the three typed tools, plus an updated run_python description that frames it as the fallback). Shared REFORM_SCHEMA describes the reform input shape. - chatbot.py: system prompt rewritten from "one execution tool" to a four-tool description with "prefer typed tools first" guidance. - ChatPage.tsx: label map so the SSE tool indicator reads "household sim" / "economy sim" / "microdata analysis" instead of raw snake_case (better for screenshots; better signal about what Claude is actually doing). Deliberately not in scope: - structural_reform argument on run_economy_simulation stays unexposed (Python-as-string inside JSON tool input is awkward; defer structural reforms to run_python where Python is natural) - no feature flag — the eval harness IS the gate
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Beta preview is ready.
|
vahid-ahmadi
left a comment
There was a problem hiding this comment.
Review: functional spec check + verdict
Walked through this against main to verify the spec.
Functional check — passes.
- All three target functions exist on
main(calculate_householdL218,run_economy_simulationL374,analyse_microdataL430) and are kwargs-callable, so thetools[tool_name](**tool_input)dispatch path works as-is — no wrapper needed. - The three new
input_schemablocks line up with the function signatures:requiredmatches the positional args (person/benunit/householdfor the household sim;entity/operationfor microdata), defaults match (year=2025,dataset='frs',n=5), and the enums forentity/operation/datasetmatch the in-function lookup tables /dataset_labels. - Dropping
structural_reformfrom the LLM-facing schemas (for bothrun_economy_simulationandanalyse_microdata) is the right call — those reforms are Python objects, not JSON, so routing them throughrun_pythonkeeps the schema honest. The system prompt already says so. REFORM_SCHEMAdeferring validation to_build_compiled_policyviaadditionalProperties: Trueis the pragmatic choice — programme names change, and the engine already raises clean errors.- System-prompt rewrite is well-ordered (most-specific → fallback) and the
run_pythondescription carries a matching FALLBACK note, so the agent gets the same signal from both directions. - Frontend label map is minimal and isolated.
Re: today's 262s timeout on "+1pp basic rate distributional". This PR is the fix. That question routes straight to run_economy_simulation(reform={\"income_tax\": {\"basic_rate\": 0.21}}) — one tool call returning decile_impacts, winners_losers, budgetary_impact in a single round trip, vs. the current path of free-forming the API in run_python and looping until it parses.
Verdict: ready to merge once eval evidence lands. Draft status is gated only on PR #52's A/B numbers per the PR body — there's no code blocker. CI green, MERGEABLE/CLEAN, base is PR #51 (stacked, also open). Suggest leaving in draft until eval data is posted, then flipping to ready. Closes #81 once merged.
What
Re-exposes `calculate_household`, `run_economy_simulation`, and `analyse_microdata` to the LLM. These have existed in `backend/agent_tools.py` since the early days but were quietly removed from `get_tool_definitions()` in PR #11's `compute → run_python` rewrite. No documented rationale for the removal — tests still cover them, the dispatch dict still handles them; only the LLM-facing registration was missing.
Why now
The eval harness in PR #52 makes this empirically testable. We can A/B:
The hypothesis: methodology drift on society-wide questions (the ~6-10% divergence seen in the May 12 benchmark) comes from Claude picking BHC vs AHC, dataset year, decile definition inside the Python it writes. Typed tools pin those choices in the engine objects, so the drift should shrink — at least on questions that route to typed tools.
Stacked on PR #51
Base = `feat/model-backend-selector`. The backend-selector + scenario_context plumbing is needed in the preview for the eval harness to target this build.
What changed
Deliberately NOT in scope
Test plan
Draft until eval evidence in.