Compute global variable weights from population, not the 100-scenario sample

## Problem

`bounded_global_variable_weights` in `policybench/analysis.py:187-224` averages each output's `|reference| / max(|net_income|, Σ|references|)` share over **just the 100 sampled scenarios** in the benchmark, then renormalizes so the weights sum to 1. If a program's reference happens to be £0 in all 100 sampled households, its weight is exactly 0, even if the program matters at population scale, and it can't contribute to any model's score under the household-impact view.
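To make the failure mode concrete, here is a minimal sketch of the share-and-renormalize formula as described above. It is not the code in `policybench/analysis.py`; the function name `sample_share_weights`, its arguments, and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def sample_share_weights(references, net_income):
    """Illustrative re-statement of the formula described in this issue.

    references: dict mapping output name -> per-scenario reference amounts.
    net_income: per-scenario net income, same length as each reference list.
    """
    names = list(references)
    # Shape (outputs, scenarios): absolute reference amount per output.
    abs_refs = np.abs(np.array([references[n] for n in names], dtype=float))
    # Per-scenario denominator: max(|net_income|, sum of |references|).
    denom = np.maximum(np.abs(np.asarray(net_income, dtype=float)),
                       abs_refs.sum(axis=0))
    # Scenarios with a zero denominator contribute a share of 0.
    shares = np.divide(abs_refs, denom,
                       out=np.zeros_like(abs_refs), where=denom > 0)
    mean_share = shares.mean(axis=1)
    total = mean_share.sum()
    weights = mean_share / total if total > 0 else mean_share
    return dict(zip(names, weights.tolist()))


# Toy sample in which `pip` is £0 for every scenario.
weights = sample_share_weights(
    {"income_tax": [2000, 0, 5000], "pip": [0, 0, 0]},
    net_income=[20000, 10000, 30000],
)
```

In this toy sample `pip` ends up with weight 0 and `income_tax` absorbs the full weight after renormalization, which is the behavior the issue describes.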

## Concrete evidence

UK Personal Independence Payment (`pip`):

- DWP-published claimant total: 3.8M recipients (England + Wales, January 2026)
- ~15.2% of UK adults age 16+ have nonzero PIP at population
- In the current benchmark sample: **0 of 100** UK scenarios have PIP > £0
- Resulting household-impact weight: **0.0000**
- Resulting effect on the leaderboard: PIP is one of the 7 UK benchmark outputs, but no model can earn or lose any household-impact score on it. It's a free pass — a model that always says "£0 PIP" gets exactly the same household-impact score as one that always gets PIP right (or wildly wrong).

This propagates into the equal-output-group sensitivity view too (every output gets 1/N weight only if it's *in* the sample's effective output set), and into the program-filter weight rescaling.

There's a related but separate data-pipeline bug in `policyengine-uk-data`: `pip_dl_category` is `NONE` for every person in the transfer dataset, so PE-UK computes £0 even for households that should have PIP. That bug is to be tracked in a `policyengine/policyengine-uk-data` issue. Even after it is fixed, the structural problem here remains: any benchmark output whose population prevalence is below ~1/100 will frequently get 0 weight just from sampling variance. At exactly 1% prevalence, 100 independent draws contain no recipient about 37% of the time.
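The sampling-variance claim can be checked with a one-line calculation. The sketch below assumes scenarios are drawn independently and without survey weighting, which may not match how the benchmark sample is actually constructed; `prob_absent` is a hypothetical helper name.

```python
def prob_absent(prevalence, n_scenarios=100):
    """Probability that no sampled scenario has a nonzero amount.

    Assumes independent, unweighted draws, each with the given
    population prevalence of a nonzero reference value.
    """
    return (1.0 - prevalence) ** n_scenarios


# Chance of an all-zero sample at a few prevalence levels.
for p in (0.001, 0.005, 0.01, 0.05):
    print(f"prevalence {p:.1%}: P(absent from 100 draws) = {prob_absent(p):.3f}")
```

Under these assumptions, a program with 1% prevalence is missing from the sample roughly 37% of the time, and one with 0.1% prevalence is missing roughly 90% of the time.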

## Proposed fix

Compute `bounded_global_variable_weights` from a **population-representative source** rather than the 100-scenario benchmark draw. Options, increasing in effort:

1. **Larger sample for weighting only.** Draw, say, 10,000 households from the same eligibility filter, compute references for all benchmark outputs, derive weights once, then commit the resulting per-output weights as a static lookup. The 100-scenario benchmark continues to be the evaluation set; weighting is decoupled from the sample.
2. **Published aggregates as targets.** For each output, use the matching public source's caseload × award-level totals (DWP/HMRC/OBR for UK; SSA/IRS/USDA for US) to derive a population-level expected-amount share. Same downstream effect as option 1, but anchored to authoritative aggregates that the upstream data pipelines are already calibrating against.
3. **Both, side-by-side.** Use option 1 for the score and option 2 as a sensitivity check; flag when they diverge.

Option 1 is the smallest defensible change. Option 2 is more rigorous but requires per-program source-target plumbing.
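As a rough sketch of how options 1 and 3 could fit together: weights derived from the larger draw are normalized and serialized once so they can be committed as a static lookup, and a small check flags outputs where the sample-derived and aggregate-derived weights disagree. All function names, the JSON format, and the 0.05 tolerance are assumptions, not existing policybench code.

```python
import json


def serialize_weights(weights):
    """Normalize weights to sum to 1 and render them as stable JSON text.

    The returned string is what would be committed as the static lookup.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    normalized = {name: weights[name] / total for name in sorted(weights)}
    return json.dumps(normalized, indent=2)


def parse_weights(text):
    """Load a committed lookup back into a plain dict."""
    return json.loads(text)


def diverging_outputs(sample_weights, aggregate_weights, tolerance=0.05):
    """Return outputs whose two weights differ by more than `tolerance`.

    `tolerance` is an arbitrary illustrative threshold on absolute weight.
    """
    flagged = {}
    for name in sorted(set(sample_weights) | set(aggregate_weights)):
        a = sample_weights.get(name, 0.0)
        b = aggregate_weights.get(name, 0.0)
        if abs(a - b) > tolerance:
            flagged[name] = (a, b)
    return flagged


# Hypothetical numbers: raw mean shares from a large draw.
lookup_text = serialize_weights({"pip": 1.0, "income_tax": 3.0})
frozen = parse_weights(lookup_text)
```

Keeping the lookup as sorted, normalized JSON makes diffs reviewable whenever the weights are regenerated.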

## Renormalization implication

If a benchmark output gets non-zero population weight but is absent from the active program filter (e.g., user filters down to "Federal income tax only"), the existing weight-rescale-over-active-set behavior should continue to apply — `weightedProgramScore` already renormalizes over the variables actually scored. No change needed there.
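For reference, the rescale-over-active-set behavior amounts to something like the following. This is an illustrative Python sketch, not the actual `weightedProgramScore` implementation; the function name and the example weights are hypothetical.

```python
def renormalize_over_active(weights, active_outputs):
    """Keep only the actively scored outputs and rescale them to sum to 1."""
    active = {name: w for name, w in weights.items() if name in active_outputs}
    total = sum(active.values())
    if total <= 0:
        # Nothing scorable in the active set; return zeros rather than divide by 0.
        return {name: 0.0 for name in active}
    return {name: w / total for name, w in active.items()}


# Hypothetical population-derived weights, filtered to two outputs.
population_weights = {"income_tax": 0.6, "pip": 0.1, "universal_credit": 0.3}
rescaled = renormalize_over_active(population_weights, {"income_tax", "pip"})
```

With population-derived inputs, `pip` keeps a nonzero share whenever it is in the active set, and filtering it out simply redistributes its weight across the remaining outputs.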

## Out of scope

- The PIP-categories-are-NONE bug in `policyengine-uk-data`'s transfer pipeline. That's an upstream issue; this PR/issue is about the weight formula on the policybench side.
- The child-1-is-oldest convention in `scenarios.py` — separate concern, separate fix.
- Switching the leaderboard headline metric away from Within 1% — already done in #47.

## Affected files (preview)

- `policybench/analysis.py:187` — `bounded_global_variable_weights` signature changes to take a wider ground-truth frame or accept a precomputed weights mapping
- `policybench/analysis.py:1076` — caller in `analyze_no_tools`
- `policybench/analysis.py:1956` — caller in `build_dashboard_payload`
- New `policybench/weight_targets.py` (or similar) — owns the population-derived weights computation, with sources documented
- Tests asserting weights match a snapshot for the current benchmark sample and a no-zero invariant for benchmark-resident outputs
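The no-zero invariant could be expressed roughly as below. The helper name, the output names other than `pip`, and the weight values are placeholders; the real test would read the benchmark's output list and the committed weights.

```python
def zero_weight_outputs(weights, benchmark_outputs):
    """Return benchmark outputs whose weight is missing or not strictly positive."""
    return sorted(
        name for name in benchmark_outputs
        if not weights.get(name, 0.0) > 0.0
    )


def test_no_zero_weights_for_benchmark_outputs():
    # Placeholder values standing in for the committed population weights.
    weights = {"income_tax": 0.6, "pip": 0.1, "universal_credit": 0.3}
    benchmark_outputs = ["income_tax", "pip", "universal_credit"]
    assert zero_weight_outputs(weights, benchmark_outputs) == []


# The current sample-derived weights would fail the invariant on `pip`.
failing = zero_weight_outputs({"income_tax": 1.0, "pip": 0.0},
                              ["income_tax", "pip"])
```

Returning the offending names, rather than a bare boolean, keeps the failure message informative when a new low-prevalence output is added.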

