Compute global variable weights from population, not the 100-scenario sample #49

@MaxGhenis

Problem

bounded_global_variable_weights in policybench/analysis.py:187-224 averages each output's share, |reference| / max(|net_income|, Σ|references|), over only the 100 sampled scenarios in the benchmark, then renormalizes so the weights sum to 1. If a program's reference happens to be £0 in all 100 sampled households, its weight is exactly 0, even if the program matters at the population level, and it cannot contribute to any model's score under the household-impact view.
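
To make the failure mode concrete, the computation described above can be sketched roughly as follows. This is a reconstruction from the description in this issue, not the code in policybench/analysis.py: the function name `sketch_global_weights`, the scenario dict layout, the handling of all-zero scenarios, and the toy amounts are all assumptions for illustration.

```python
from typing import Dict, List


def sketch_global_weights(scenarios: List[dict], outputs: List[str]) -> Dict[str, float]:
    """Average each output's bounded share over scenarios, then renormalize.

    Hypothetical stand-in for bounded_global_variable_weights. Each scenario
    is assumed to look like {"net_income": float, "references": {output: float}}.
    """
    if not scenarios:
        raise ValueError("need at least one scenario")
    share_sums = {name: 0.0 for name in outputs}
    for scenario in scenarios:
        refs = scenario["references"]
        abs_refs = {name: abs(refs.get(name, 0.0)) for name in outputs}
        # Bounded share: the denominator is at least the sum of |references|,
        # so no single share can exceed 1.
        denom = max(abs(scenario["net_income"]), sum(abs_refs.values()))
        if denom == 0:
            continue  # assumption: an all-zero scenario adds no share
        for name in outputs:
            share_sums[name] += abs_refs[name] / denom

    mean_shares = {name: total / len(scenarios) for name, total in share_sums.items()}
    norm = sum(mean_shares.values())
    if norm == 0:
        return {name: 0.0 for name in outputs}
    return {name: share / norm for name, share in mean_shares.items()}


# Toy sample with made-up amounts: pip is £0 in every sampled household.
sample = [
    {"net_income": 20_000, "references": {"income_tax": 2_000, "universal_credit": 0, "pip": 0}},
    {"net_income": 10_000, "references": {"income_tax": 0, "universal_credit": 5_000, "pip": 0}},
]
weights = sketch_global_weights(sample, ["income_tax", "universal_credit", "pip"])
print(weights)
```

In this toy sample the `pip` weight comes out as exactly 0 after renormalization, so no amount of error on that output could move a weighted score.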

Concrete evidence

UK Personal Independence Payment (pip):

  • DWP-published claimant total: 3.8M recipients (England + Wales, January 2026)
  • ~15.2% of UK adults age 16+ have nonzero PIP in the population
  • In the current benchmark sample: 0 of 100 UK scenarios have PIP > £0
  • Resulting household-impact weight: 0.0000
  • Resulting effect on the leaderboard: PIP is one of the 7 UK benchmark outputs, but no model can earn or lose any household-impact score on it. It's a free pass — a model that always says "£0 PIP" gets exactly the same household-impact score as one that always gets PIP right (or wildly wrong).

This propagates into the equal-output-group sensitivity view too (every output gets 1/N weight only if it's in the sample's effective output set), and into the program-filter weight rescaling.

There's a related but separate data-pipeline bug in policyengine-uk-data: pip_dl_category is NONE for every person in the transfer dataset, so PE-UK computes £0 even for households that should have PIP (to be tracked in a policyengine/policyengine-uk-data issue). Even after that bug is fixed, the structural problem here remains: any benchmark output whose population prevalence is below ~1/100 will frequently get 0 weight just from sampling variance.
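
The sampling-variance point can be checked with a quick calculation. The sketch below assumes the 100 scenarios behave like independent draws with a common prevalence, which the real benchmark sampler (with its eligibility filter) may not satisfy exactly; `prob_all_zero` is a hypothetical helper name.

```python
def prob_all_zero(prevalence: float, n_scenarios: int = 100) -> float:
    """Chance that none of n independent draws has a nonzero reference."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    return (1.0 - prevalence) ** n_scenarios


# Print the zero-hit probability for a few illustrative prevalences.
for p in (0.001, 0.01, 0.05, 0.152):
    print(f"prevalence={p:.3f}  P(zero in 100)={prob_all_zero(p):.3g}")
```

Under these assumptions, an output with 1% prevalence is absent from a 100-scenario sample roughly 37% of the time. At the ~15.2% prevalence quoted for PIP, an all-zero sample would be extremely unlikely, which is consistent with the current zero coming from the upstream data bug rather than from sampling variance alone.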

Proposed fix

Compute bounded_global_variable_weights from a population-representative source rather than the 100-scenario benchmark draw. Options, increasing in effort:

  1. Larger sample for weighting only. Draw, say, 10,000 households from the same eligibility filter, compute references for all benchmark outputs, derive weights once, then commit the resulting per-output weights as a static lookup. The 100-scenario benchmark continues to be the evaluation set; weighting is decoupled from the sample.
  2. Published aggregates as targets. For each output, use the matching public source's caseload × award-level totals (DWP/HMRC/OBR for UK; SSA/IRS/USDA for US) to derive a population-level expected-amount share. Same downstream effect as option 1, but anchored to authoritative aggregates that the upstream data pipelines are already calibrating against.
  3. Both, side-by-side. Use option 1 for the score and option 2 as a sensitivity check; flag when they diverge.

Option 1 is the smallest defensible change. Option 2 is more rigorous but requires per-program source-target plumbing.
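
A minimal sketch of how option 1's static lookup and option 3's divergence flag might fit together. Everything here is hypothetical: the helper names, the JSON file name, the 0.05 tolerance, and the example weights are placeholders, not existing policybench code or real estimates.

```python
import json
import math
import tempfile
from pathlib import Path
from typing import Dict, List


def freeze_weights(weights: Dict[str, float], path: Path) -> None:
    """Validate per-output weights and write them as a static lookup file."""
    if any(w <= 0 for w in weights.values()):
        raise ValueError("every benchmark output should carry positive weight")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    path.write_text(json.dumps(weights, indent=2, sort_keys=True) + "\n")


def load_weights(path: Path) -> Dict[str, float]:
    """Read the committed lookup back; no scenario sample is involved."""
    return json.loads(path.read_text())


def diverging_outputs(
    sample_w: Dict[str, float], aggregate_w: Dict[str, float], tol: float = 0.05
) -> List[str]:
    """Option 3: outputs whose two weight estimates differ by more than tol."""
    names = sorted(set(sample_w) | set(aggregate_w))
    return [n for n in names if abs(sample_w.get(n, 0.0) - aggregate_w.get(n, 0.0)) > tol]


# Illustrative values only, standing in for weights from a 10,000-household draw.
with tempfile.TemporaryDirectory() as tmp:
    lookup = Path(tmp) / "global_weights_uk.json"  # hypothetical file name
    freeze_weights({"income_tax": 0.6, "universal_credit": 0.3, "pip": 0.1}, lookup)
    frozen = load_weights(lookup)

# Compare against made-up aggregate-derived weights (option 2).
flagged = diverging_outputs(frozen, {"income_tax": 0.5, "universal_credit": 0.3, "pip": 0.2})
print(flagged)
```

Validating at write time means a zero or unnormalized weight fails when the lookup is regenerated, not silently at scoring time.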

Renormalization implication

If a benchmark output gets non-zero population weight but is absent from the active program filter (e.g., user filters down to "Federal income tax only"), the existing weight-rescale-over-active-set behavior should continue to apply — weightedProgramScore already renormalizes over the variables actually scored. No change needed there.
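
For reference, the rescale-over-active-set behavior described above amounts to something like the following Python sketch. It is not the weightedProgramScore implementation, which is not shown in this issue; `renormalize_over_active` and the example weights are assumptions.

```python
from typing import Dict, Iterable


def renormalize_over_active(weights: Dict[str, float], active: Iterable[str]) -> Dict[str, float]:
    """Keep only the active outputs and rescale their weights to sum to 1."""
    active_set = set(active)
    kept = {name: w for name, w in weights.items() if name in active_set}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("active set has no positive weight to renormalize")
    return {name: w / total for name, w in kept.items()}


# Made-up population weights; the filter drops pip from the active set.
global_weights = {"income_tax": 0.5, "universal_credit": 0.3, "pip": 0.2}
active_weights = renormalize_over_active(global_weights, ["income_tax", "universal_credit"])
print(active_weights)
```

Because the rescale only uses the weights of the outputs actually scored, swapping in population-derived weights changes the inputs but not this step.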

Out of scope

  • The PIP-categories-are-NONE bug in policyengine-uk-data's transfer pipeline. That's an upstream issue; this issue is about the weight formula on the policybench side.
  • The child-1-is-oldest convention in scenarios.py — separate concern, separate fix.
  • Switching the leaderboard headline metric away from Within 1%. Already done in #47 ("Within 1% as headline metric, household-facts modal, add DeepSeek V4").

Affected files (preview)

  • policybench/analysis.py:187: bounded_global_variable_weights signature changes to take a wider ground-truth frame or accept a precomputed weights mapping
  • policybench/analysis.py:1076: caller in analyze_no_tools
  • policybench/analysis.py:1956: caller in build_dashboard_payload
  • New policybench/weight_targets.py (or similar): owns the population-derived weights computation, with sources documented
  • Tests asserting weights match a snapshot for the current benchmark sample and a no-zero invariant for benchmark-resident outputs
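
The no-zero invariant in the last bullet could be expressed along these lines. This is only a sketch: `zero_weight_outputs` is a hypothetical helper, and the output names and weights are placeholders, not the actual benchmark output list.

```python
from typing import Dict, Iterable, List


def zero_weight_outputs(weights: Dict[str, float], benchmark_outputs: Iterable[str]) -> List[str]:
    """Return benchmark outputs that are missing or carry non-positive weight."""
    return sorted(name for name in benchmark_outputs if weights.get(name, 0.0) <= 0.0)


def check_no_zero_weights(weights: Dict[str, float], benchmark_outputs: Iterable[str]) -> None:
    """Raise if any benchmark-resident output would be unscoreable."""
    offenders = zero_weight_outputs(weights, benchmark_outputs)
    if offenders:
        raise AssertionError(f"zero-weight benchmark outputs: {offenders}")


# Placeholder weights mimicking the current situation, where pip ends up at 0.
sample_based = {"income_tax": 0.7, "universal_credit": 0.3, "pip": 0.0}
offenders = zero_weight_outputs(sample_based, ["income_tax", "universal_credit", "pip"])
print(offenders)
```

A test built on `check_no_zero_weights` would fail on sample-derived weights like these and pass once every benchmark output has positive population-derived weight.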
