Symptom (live test 2026-05-28)
Asked the agent to compute the decile impact of "increasing the income tax basic rate by 1 percentage point". The agent ran run_python 15+ times trying to construct the reform, each attempt producing nonsense results (revenue dropping when the rate goes up), never converged, never produced a chart. Several minutes of compute wasted on API guessing.
The agent's own commentary during the run:
"There's clearly a bug in how I'm specifying the tax brackets - they're causing revenue to DROP, not rise."
There's no bug. The agent guessed the wrong shape for Parameters(income_tax={...}): most likely a bracket override that wipes the rest of the schedule instead of an additive change to the basic rate.
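To make the suspected failure mode concrete, here is a toy sketch of why replacing the whole schedule can make revenue fall even though the basic rate rises. It does not use the real library: the (threshold, rate) schedule layout, the helper names, the thresholds and the three sample incomes are all assumptions made up for illustration.

```python
# Toy model only. The schedule layout and helpers are hypothetical and are
# NOT the real Parameters(income_tax=...) API.
BASELINE = [
    (0, 0.20),        # basic rate
    (37_700, 0.40),   # higher rate (illustrative threshold)
    (125_140, 0.45),  # additional rate (illustrative threshold)
]


def tax_due(taxable_income, schedule):
    """Tax under a marginal-rate schedule of (lower_threshold, rate) pairs."""
    total = 0.0
    for i, (lower, rate) in enumerate(schedule):
        upper = schedule[i + 1][0] if i + 1 < len(schedule) else float("inf")
        if taxable_income > lower:
            total += (min(taxable_income, upper) - lower) * rate
    return total


def replace_schedule(baseline, new_brackets):
    """Full replacement: brackets not listed in new_brackets are dropped."""
    return list(new_brackets)


def patch_rate(baseline, index, delta):
    """Partial reform: change one rate, keep every other bracket as-is."""
    patched = list(baseline)
    lower, rate = patched[index]
    patched[index] = (lower, rate + delta)
    return patched


incomes = [20_000, 60_000, 150_000]  # made-up taxable incomes


def revenue(schedule):
    return sum(tax_due(income, schedule) for income in incomes)


baseline_revenue = revenue(BASELINE)
# Wrong shape: only the basic bracket is supplied, so 40% and 45% vanish.
wiped_revenue = revenue(replace_schedule(BASELINE, [(0, 0.21)]))
# Intended shape: basic rate +1pp, rest of the schedule untouched.
partial_revenue = revenue(patch_rate(BASELINE, 0, 0.01))

print(baseline_revenue, wiped_revenue, partial_revenue)
```

In this toy, the full replacement raises less than the baseline because the higher and additional bands no longer exist, while the partial patch raises more. Whether the real library behaves this way is exactly what the recipes need to pin down.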
Diagnosis
The cached library reference passed via _build_system_blocks (backend/routes/chatbot.py:170) apparently doesn't include concrete reform examples. Without them, the model has to re-derive the API on every cold session and often fails.
Fix
Add 3–5 canonical worked examples to the library reference (whatever string is being cached) — exact Python the model can copy:
- Basic rate +1pp (the failure case from today)
- Personal allowance increase (e.g. £12,570 → £15,000)
- NI primary threshold change
- Child benefit uprating
- Marriage allowance toggle
Each recipe should show the precise Parameters(income_tax=...) / Parameters(national_insurance=...) etc. shape that produces a partial reform (not a full schedule replacement).
~50 lines of system-prompt content. One-time cost. Cached.
Constraints
- Don't add new tools.
- Don't change _build_system_blocks or the run_python sandbox.
- Examples should be copy-pasteable Python.
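One cheap way to guard the "copy-pasteable" constraint is to check that every recipe at least parses as Python before it goes into the cached reference. The sketch below uses only the standard-library ast module. The RECIPES dict, its keys and the syntax_errors helper are hypothetical, and the `...` arguments are placeholders for the real shapes, which this issue does not specify.

```python
import ast

# Hypothetical recipe snippets. The `...` arguments are placeholders,
# not the real argument shapes for Parameters.
RECIPES = {
    "basic_rate_plus_1pp": "reform = Parameters(income_tax=...)",
    "ni_primary_threshold": "reform = Parameters(national_insurance=...)",
}


def syntax_errors(recipes):
    """Return {name: message} for every recipe that fails to parse as Python."""
    errors = {}
    for name, source in recipes.items():
        try:
            ast.parse(source)
        except SyntaxError as exc:
            errors[name] = f"line {exc.lineno}: {exc.msg}"
    return errors


print(syntax_errors(RECIPES))
```

This only catches syntax problems. It does not prove a recipe produces a partial reform rather than a full schedule replacement; that still needs a run against the real library.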