WIP: US Python backend (latency spike — do not merge) by SakshiKekre · Pull Request #54 · PolicyEngine/policyengine-uk-chat

SakshiKekre · 2026-05-27T12:38:29Z

Purpose

Spike to measure how long US simulations take in the chat interface. Not for merge — exists to deploy a preview where we can run timed prompts against the US backend.

Branches off PR #51 (feat/model-backend-selector) so it includes the backend-selector + scenario_context plumbing already.

What's in the diff

USPolicyEnginePythonBackend in backend/model_backends.py, mirrors the UK Python backend
policyengine_us added to backend/requirements.txt
Frontend label map: UK (Compiled) / UK (Python) / US (Python)

Known gaps deferred for the latency test

reference.md is UK-compiled-only, so Claude sees UK API docs when the US backend is selected. It will write some wrong code on the first attempt; that's part of what we want to measure (recovery latency).
System prompt still says "British English" and the title-generation route still calls itself "a UK tax and benefit policy assistant."
Modal region is eu — US response latency will reflect the transatlantic hop, not a US-optimised deploy.
No preloaded US dataset. Microsimulation questions will be very slow on first call.

Latency numbers to capture

Cold start (first message after Modal app spin-up)
Warm response on a simple US household question (e.g. CA single earner, federal EITC)
Warm response on a multi-state comparison
Warm response on a microsimulation question (expected slow — no preloaded dataset)
UK baseline for comparison: same warm household question on uk_python
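
A small timing helper is one way to capture these numbers consistently. The sketch below is hypothetical and not part of this PR: `time_calls` and `send_prompt` are illustrative names, and the workload is a placeholder to be replaced with a real request to the preview backend. It only shows how to keep the first (cold) call separate from the warm calls that follow.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(
    fn: Callable[[], object],
    warm_runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time one cold call, then `warm_runs` (>= 1) warm calls, in seconds."""
    start = clock()
    fn()  # first call: includes any spin-up cost
    cold = clock() - start

    warm: List[float] = []
    for _ in range(warm_runs):
        start = clock()
        fn()
        warm.append(clock() - start)

    return {"cold_s": cold, "warm_median_s": statistics.median(warm)}


if __name__ == "__main__":
    # Placeholder workload (assumption); swap in a real prompt round-trip.
    def send_prompt() -> None:
        time.sleep(0.01)

    print(time_calls(send_prompt))
```

Reporting the warm median rather than the mean keeps a single slow outlier from dominating the comparison against the uk_python baseline.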

What this PR does NOT do

Open the chat product up to US users
Make the US backend usable as a real product (gaps above need fixing first)
Add US eval scenarios to the harness from PR #52 (Add eval harness scaffold: spec, scenarios, fixtures dir)

- USPolicyEnginePythonBackend in model_backends.py: mirrors the UK Python backend, swaps to policyengine_us, capabilities() lists US variables and parameter roots, prompt notes the state_code requirement.
- Add policyengine_us to backend/requirements.txt.
- Frontend label map: UK (Compiled) / UK (Python) / US (Python).

Known gaps deferred:
- reference.md is still UK-compiled-only, so Claude sees UK API docs on the US backend. Acceptable for the initial latency smoke test.
- System prompt still says "British English" and the title-generation route still calls itself "a UK tax and benefit policy assistant".
- Modal region is "eu", so US response latency will reflect the transatlantic hop, not a US-optimised deploy.

vercel · 2026-05-27T12:38:37Z

Project	Deployment	Actions	Updated (UTC)
policyengine-uk-chat	Ready	Preview, Comment	May 27, 2026 12:57pm

github-actions · 2026-05-27T12:59:53Z

Beta preview is ready.

Frontend: open preview
Backend: open backend

SakshiKekre · 2026-05-27T13:38:50Z

Smoke test findings

Tested on preview: https://policyengine-uk-chat-git-feat-us-backend-policy-engine.vercel.app
Backend: https://policyengine--peukchat-feat-us-backend-web.modal.run

What works ✅

policyengine_us installs and imports on the Modal image (build job: 2m58s)
Backend selector routes correctly to the US adapter
State code (CA) plumbs through Simulation.situation correctly
Federal tax calc returns the right number

Verification — tight prompt:

"Single adult, age 30, $30,000 employment income in 2024, California, no dependents. What's the federal income tax? Just the number."

Returned $1,616.00, which matches hand calc exactly (standard deduction $14,600 → taxable $15,400 → 10% × $11,600 + 12% × $3,800).
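
The hand calculation can be reproduced in a few lines. This is only a sketch of the arithmetic, not how policyengine_us computes the tax: the function name is made up, the standard deduction and 10% bracket top are the figures quoted above, and the 12% bracket top ($47,150 for a 2024 single filer) is not stated in this PR and is included as an assumption to bound where the sketch is valid.

```python
STANDARD_DEDUCTION_2024_SINGLE = 14_600  # quoted in the comment above
BRACKET_10_TOP = 11_600                  # quoted in the comment above
BRACKET_12_TOP = 47_150                  # assumption: not stated in this PR


def federal_income_tax_2024_single(employment_income: int) -> float:
    """Hand-calc check covering only the 10% and 12% brackets."""
    taxable = max(employment_income - STANDARD_DEDUCTION_2024_SINGLE, 0)
    if taxable > BRACKET_12_TOP:
        raise ValueError("sketch only covers the 10% and 12% brackets")
    in_10 = min(taxable, BRACKET_10_TOP)
    in_12 = max(taxable - BRACKET_10_TOP, 0)
    # Integer percentages avoid float rounding noise in the comparison.
    return (10 * in_10 + 12 * in_12) / 100


print(federal_income_tax_2024_single(30_000))
```

The same function gives 4016.0 for $50,000 of employment income, which is a quick way to sanity-check other single-filer prompts that stay within these two brackets.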

What doesn't work yet ⚠️

Open-ended prompts trigger 30 or more tool calls of dir(pe) / guess-variable-name / retry behaviour before Claude converges. It does eventually converge on a sensible answer (verified: $50k CA single, no kids, returns $4,016 federal income tax and $0 EITC, both correct), but the UX is poor.
Root cause: reference.md is generated only from policyengine_uk_compiled. Claude has no API cheat sheet for policyengine_us, so it explores via tool calls instead. The UK backends short-circuit this because their reference doc lists every variable, parameter class, and method up front.

Other known gaps (deferred for the latency test, still present)

System prompt still says "British English" — confirmed in responses
Title-generation route still calls itself "a UK tax and benefit policy assistant"
Modal region is eu — transatlantic latency on every call
No preloaded US dataset → microsimulation questions will be very slow on first call

Suggested next steps (priority order, not yet started)

Generalise build_reference.py to take a package name and emit reference_uk.md / reference_us.md. Biggest single unlock — should eliminate the tool-call wandering.
Conditionally load the matching reference doc based on model_backend in routes/chatbot.py.
Fix UK-hardcoded prompt language (system prompt + title generator) so US responses don't read as British.
Defer everything else (Modal region, US engine preload, US eval scenarios) until we decide whether this is a real product direction.
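
The first two steps above could look roughly like the following. Everything here is hypothetical and not code in this branch: the helper names are invented, the `reference_uk.md` / `reference_us.md` file names come from the proposal, and of the backend keys only `uk_python` appears in this PR's text (the other two, and the choice to point both UK backends at the same file, are assumptions).

```python
from pathlib import Path

# Assumed mapping from backend key to reference file.
# Only "uk_python" is named in this PR; the other keys are illustrative.
REFERENCE_FILES = {
    "uk_compiled": "reference_uk.md",
    "uk_python": "reference_uk.md",
    "us_python": "reference_us.md",
}


def reference_path(model_backend: str, docs_dir: Path) -> Path:
    """Return the reference doc path for a backend, failing loudly if unknown."""
    try:
        return docs_dir / REFERENCE_FILES[model_backend]
    except KeyError:
        raise ValueError(f"no reference doc for backend {model_backend!r}") from None


def load_reference(model_backend: str, docs_dir: Path) -> str:
    """Read the matching reference doc so it can be added to the prompt."""
    return reference_path(model_backend, docs_dir).read_text(encoding="utf-8")
```

Raising on an unknown backend, rather than falling back to the UK doc, is a deliberate choice in this sketch: a silent fallback would reproduce the tool-call wandering described above.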

PR remains draft / not for merge.

SakshiKekre · 2026-06-02T21:45:36Z

Closing — this was a latency-test spike to measure how long US sims take in the chat interface. Confirmed end-to-end working ($1,616 result on tight prompt matched hand calc), but the reference.md is UK-only so open-ended US prompts cause tool-call wandering. Findings are preserved in the PR comments. Branch feat/us-backend is kept for reference if US becomes a real product direction; reviving it would mean: generalising scripts/build_reference.py for policyengine_us, fixing UK-flavoured system prompt + title route, extending modal_app.py pre-warm for US, and adding US eval scenarios.

SakshiKekre changed the base branch from main to feat/model-backend-selector May 27, 2026 12:40

SakshiKekre changed the base branch from feat/model-backend-selector to main May 27, 2026 12:53

SakshiKekre marked this pull request as ready for review May 27, 2026 12:54

SakshiKekre changed the base branch from main to feat/model-backend-selector May 27, 2026 12:55

Trigger CI

ad0fb28

vercel Bot deployed to Preview May 27, 2026 12:57

This was referenced May 27, 2026

Add UK chat integration (drawer on reports + standalone page) PolicyEngine/policyengine-app-v2#1036

Open

Cut wasted tokens (off-topic gate) and cold-start latency (/chat/backends warmup) #95

Draft

Enable multizone embedding via prefixed asset URLs #96

Open

SakshiKekre closed this Jun 2, 2026

SakshiKekre wants to merge 2 commits into feat/model-backend-selector from feat/us-backend
