Problem
bounded_global_variable_weights in policybench/analysis.py:187-224 averages each output's |reference| / max(|net_income|, Σ|references|) share over just the 100 sampled scenarios in the benchmark, then renormalizes so the weights sum to 1. If a program's reference happens to be £0 in all 100 sampled households, even if the program is policy-important at the population level, its weight is exactly 0 and it cannot contribute to any model's score under the household-impact view.
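To make the failure mode concrete, here is a minimal sketch of the weighting as described above. It is reconstructed from the description only and is not the code in policybench/analysis.py; the function name `bounded_share_weights` and the scenario layout (a dict with `net_income` and a `references` mapping) are assumptions made for illustration.

```python
def bounded_share_weights(scenarios):
    """Average each output's bounded share across scenarios, then renormalize.

    Hypothetical input layout: each scenario is a dict with "net_income"
    (a number) and "references" (output name -> reference amount).
    """
    outputs = sorted({name for s in scenarios for name in s["references"]})
    mean_share = {}
    for name in outputs:
        total = 0.0
        for s in scenarios:
            refs = s["references"]
            # Denominator: max(|net_income|, sum of |references|)
            denom = max(abs(s["net_income"]), sum(abs(v) for v in refs.values()))
            if denom > 0:
                total += abs(refs.get(name, 0.0)) / denom
        mean_share[name] = total / len(scenarios)
    norm = sum(mean_share.values())
    # Renormalize so the weights sum to 1 (all zeros if nothing has any share).
    return {name: (share / norm if norm else 0.0) for name, share in mean_share.items()}


# Illustrative amounts only: "pip" is zero in every sampled scenario.
scenarios = [
    {"net_income": 30000, "references": {"income_tax": 4000, "pip": 0}},
    {"net_income": 20000, "references": {"income_tax": 1000, "pip": 0}},
]
weights = bounded_share_weights(scenarios)
```

In this toy input the `pip` weight comes out as exactly 0 because its reference is zero everywhere in the sample, regardless of how common the program is in the wider population.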
Concrete evidence
UK Personal Independence Payment (pip):
DWP-published claimant total: 3.8M recipients (England + Wales, January 2026)
~15.2% of UK adults aged 16+ have nonzero PIP at the population level
In the current benchmark sample: 0 of 100 UK scenarios have PIP > £0
Resulting household-impact weight: 0.0000
Resulting effect on the leaderboard: PIP is one of the 7 UK benchmark outputs, but no model can earn or lose any household-impact score on it. It's a free pass — a model that always says "£0 PIP" gets exactly the same household-impact score as one that always gets PIP right (or wildly wrong).
This propagates into the equal-output-group sensitivity view too (every output gets 1/N weight only if it's in the sample's effective output set), and into the program-filter weight rescaling.
There's a related but separate policyengine-uk-data data-pipeline bug (pip_dl_category is NONE for every person in the transfer dataset, so PE-UK computes £0 even for households that should have PIP — to be tracked in a policyengine/policyengine-uk-data issue). Even after that bug is fixed, the structural problem here remains: any benchmark output whose population prevalence is below ~1/100 will frequently get 0 weight just from sampling variance.
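The sampling-variance point can be quantified with a short calculation. The sketch below assumes the 100 scenarios are independent draws and that each one has a nonzero reference with probability equal to the output's prevalence in the sampled population; both are simplifying assumptions, and `prob_all_zero` is a hypothetical helper name.

```python
def prob_all_zero(prevalence, n=100):
    """Probability that none of n independent draws has a nonzero reference."""
    return (1.0 - prevalence) ** n


# Chance of a zero weight purely from sampling, for a few prevalence levels.
for p in (0.001, 0.01, 0.05, 0.152):
    print(f"prevalence {p:.3f}: P(all zero in 100 draws) = {prob_all_zero(p):.2e}")
```

Under these assumptions an output with 1% prevalence is missed entirely in roughly a third of 100-scenario draws, while an output at the ~15.2% prevalence quoted for PIP would almost never be missed by chance. That is consistent with the current 0-of-100 PIP result coming from the upstream data bug rather than from sampling alone.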
Proposed fix
Compute bounded_global_variable_weights from a population-representative source rather than the 100-scenario benchmark draw. Options, increasing in effort:
1. Larger sample for weighting only. Draw, say, 10,000 households from the same eligibility filter, compute references for all benchmark outputs, derive weights once, then commit the resulting per-output weights as a static lookup. The 100-scenario benchmark continues to be the evaluation set; weighting is decoupled from the sample.
2. Published aggregates as targets. For each output, use the matching public source's caseload × award-level totals (DWP/HMRC/OBR for UK; SSA/IRS/USDA for US) to derive a population-level expected-amount share. Same downstream effect as option 1, but anchored to authoritative aggregates that the upstream data pipelines are already calibrating against.
3. Both, side-by-side. Use option 1 for the score and option 2 as a sensitivity check; flag when they diverge.
Option 1 is the smallest defensible change. Option 2 is more rigorous but requires per-program source-target plumbing.
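For the side-by-side option, the divergence flag could be as simple as the sketch below. The function name `flag_divergent_weights`, the 0.05 absolute tolerance, the `universal_credit` output name, and all weight values are placeholders for illustration, not real benchmark outputs or numbers.

```python
def flag_divergent_weights(sample_weights, aggregate_weights, tolerance=0.05):
    """Return outputs whose two weight estimates differ by more than `tolerance`.

    Both arguments map output name -> weight. An output missing from one
    mapping is treated as having weight 0 there.
    """
    flagged = {}
    for name in sorted(set(sample_weights) | set(aggregate_weights)):
        a = sample_weights.get(name, 0.0)
        b = aggregate_weights.get(name, 0.0)
        if abs(a - b) > tolerance:
            flagged[name] = (a, b)
    return flagged


# Placeholder weights for illustration only.
sample_weights = {"income_tax": 0.70, "universal_credit": 0.28, "pip": 0.02}
aggregate_weights = {"income_tax": 0.60, "universal_credit": 0.27, "pip": 0.13}
flagged = flag_divergent_weights(sample_weights, aggregate_weights)
```

An absolute tolerance is the simplest choice; a relative tolerance may suit small-weight outputs better, since a program with a 2% weight can be off by a large factor without ever crossing an absolute threshold.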
Renormalization implication
If a benchmark output gets non-zero population weight but is absent from the active program filter (e.g., user filters down to "Federal income tax only"), the existing weight-rescale-over-active-set behavior should continue to apply — weightedProgramScore already renormalizes over the variables actually scored. No change needed there.
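The rescale-over-active-set behavior referred to here amounts to the following. This is a Python sketch of the idea only, not the weightedProgramScore implementation, and `renormalize_over_active` is a hypothetical name.

```python
def renormalize_over_active(weights, active):
    """Keep only the active outputs and rescale their weights to sum to 1."""
    kept = {name: w for name, w in weights.items() if name in active}
    total = sum(kept.values())
    if total == 0:
        # Nothing scoreable in the active set; return zeros rather than divide by 0.
        return {name: 0.0 for name in kept}
    return {name: w / total for name, w in kept.items()}


# Placeholder weights; filtering out "b" rescales "a" and "c" proportionally.
rescaled = renormalize_over_active({"a": 0.5, "b": 0.3, "c": 0.2}, {"a", "c"})
```

Because the rescaling is proportional, swapping sample-derived weights for population-derived ones changes the inputs but not this step.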
Out of scope
The PIP-categories-are-NONE bug in policyengine-uk-data's transfer pipeline. That's an upstream issue; this PR/issue is about the weight formula on the policybench side.
The child-1-is-oldest convention in scenarios.py — separate concern, separate fix.
Affected files (preview)
policybench/analysis.py:187 — bounded_global_variable_weights signature changes to take a wider ground-truth frame or accept a precomputed weights mapping
policybench/analysis.py:1076 — caller in analyze_no_tools
policybench/analysis.py:1956 — caller in build_dashboard_payload
New policybench/weight_targets.py (or similar) — owns the population-derived weights computation, with sources documented
Tests asserting weights match a snapshot for the current benchmark sample and a no-zero invariant for benchmark-resident outputs
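The no-zero invariant from the last item might be checked along these lines. The helper name `check_weight_invariants` and the idea of passing a plain precomputed mapping are assumptions; the real test would depend on whichever signature is chosen for bounded_global_variable_weights.

```python
def check_weight_invariants(weights, benchmark_outputs, tol=1e-9):
    """Raise AssertionError if any benchmark output has no positive weight
    or if the weights do not sum to 1 (within `tol`)."""
    zero = [name for name in benchmark_outputs if weights.get(name, 0.0) <= 0.0]
    if zero:
        raise AssertionError(f"outputs with zero or missing weight: {zero}")
    total = sum(weights[name] for name in benchmark_outputs)
    if abs(total - 1.0) > tol:
        raise AssertionError(f"weights sum to {total}, expected 1.0")


# Placeholder mapping for illustration; passes both invariants.
check_weight_invariants({"income_tax": 0.9, "pip": 0.1}, ["income_tax", "pip"])
```

A snapshot test can sit alongside this by comparing the committed mapping to a stored copy, so that any change to the weights appears as an explicit diff.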