Fix NaN in genotype LD stats with missing data in small populations by andrewkern · Pull Request #124 · kr-colab/pg_gpu

andrewkern · 2026-06-13T00:23:40Z

Summary

Fixes the NaN reported in #123, where pg_gpu emitted a single non-finite value
(Dz_0_0_1, one r-bin) on a region that moments handled fine. The root cause is
a class of unguarded denominators in the genotype LD path that go to zero when
missing data shrinks a population's per-pair (or per-site) valid sample count.

Reported by @mufernando with a clean, self-contained MWE (region BO_0093, a
68-sample VCF with a 14-sample focal population).

Root cause

The genotype LD estimators normalize by falling factorials of the per-pair valid
sample count n (individuals genotyped at both loci of a pair):

between-pop Dz: n*(n-1)*(n-2) (and n of the partner pop)
normalized pi2 frequencies: 1/n
heterozygosity: n*(n-1) (per site, n = non-missing individuals)

Missing data can drive n below the order of its factorial -- e.g. on the MWE,
the focal population has only 2 individuals jointly typed at the variant pair
19,049,574 x 19,137,724, so n*(n-1)*(n-2) = 0. The estimator is then 0/0,
which is undefined in moments too (its per-pair formula raises / yields nan;
moments and pg_gpu count per-pair samples identically). The between-pop Dz
kernels, the pi2 normalized-frequency kernels (alldiff/iikl/ijkk/shared), and
the heterozygosity sum divided without a guard, so one degenerate pair/site
produced a NaN that poisoned the whole r-bin sum.

The union-level biallelic filter only requires the union of populations to have

= 2 valid samples, so it never protects an individual small population.

Fix

Guard each denominator so a degenerate pair/site contributes 0 -- excluding
the undefined term -- using the same cp.maximum(denom, 1) / (d>0)?...:0
idiom the single-pop Dz (ns<4 -> 0) and every pi2 kernel already use. On
well-sampled data the guards are exact no-ops.

genotype_kernels.py: the four between-pop Dz kernels + the pi2 inv_n
moments_ld.py: _compute_heterozygosity

Correctness

pg_gpu's genotype formulas match moments' Python reference
(stats_from_genotype_counts, which carries the correct -20 pi2(i,i;i,i)
coefficient from MomentsLD/moments#258), applying moments' own index
symmetrization, to machine precision (~3e-17) across all 45 DD/Dz/pi2 stats.
Note: the comparison must use moments' Python path -- the Cython path in moments
<= 1.5.0 still has the #258 +20 sign bug, so pg_gpu is the correct one there.

Tests

New tests/test_ld_missing_data_degenerate.py:

test_genotype_stats_match_moments_python (moments env) -- machine-precision
parity with moments' Python reference for every DD/Dz/pi2 case.
three robustness regressions (between-pop Dz degenerate pair, pi2
zero-sample population, heterozygosity zero-sample site) -- each is NaN before
this change and finite after; GPU-only, no moments needed.

All 54 moments-env LD parity/missing-data tests pass (no regression); lint clean.

Closes #123

…e counts Missing data can drop a population's per-pair valid sample count below the order of a statistic's denominator (n*(n-1)*(n-2) for between-pop Dz, 1/n for the normalized pi2 frequencies, n*(n-1) for heterozygosity). The denominator is then zero and the unbiased estimator is 0/0, which is undefined in moments too. The between-pop Dz kernels, the pi2 normalized-frequency path, and the heterozygosity sum divided without guarding, so one degenerate pair or site emitted a NaN that poisoned the whole r-bin sum. Guard each denominator so a degenerate pair/site contributes 0 -- excluding the undefined term -- matching the single-pop Dz and pi2 kernels that already guard the same way. On well-sampled data the guards are exact no-ops; pg_gpu still matches moments' Python reference (the PR #258 -20 pi2 coefficient) to machine precision across every DD/Dz/pi2 case.

andrewkern merged commit 145660f into main Jun 13, 2026
1 check passed

andrewkern deleted the fix/dz-genotype-missing-data-nan branch June 13, 2026 00:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix NaN in genotype LD stats with missing data in small populations#124

Fix NaN in genotype LD stats with missing data in small populations#124
andrewkern merged 1 commit into
mainfrom
fix/dz-genotype-missing-data-nan

andrewkern commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andrewkern commented Jun 13, 2026

Summary

Root cause

Fix

Correctness

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant