Skip to content

Fix NaN in genotype LD stats with missing data in small populations#124

Merged
andrewkern merged 1 commit into
mainfrom
fix/dz-genotype-missing-data-nan
Jun 13, 2026
Merged

Fix NaN in genotype LD stats with missing data in small populations#124
andrewkern merged 1 commit into
mainfrom
fix/dz-genotype-missing-data-nan

Conversation

@andrewkern

Copy link
Copy Markdown
Member

Summary

Fixes the NaN reported in #123, where pg_gpu emitted a single non-finite value
(Dz_0_0_1, one r-bin) on a region that moments handled fine. The root cause is
a class of unguarded denominators in the genotype LD path that go to zero when
missing data shrinks a population's per-pair (or per-site) valid sample count.

Reported by @mufernando with a clean, self-contained MWE (region BO_0093, a
68-sample VCF with a 14-sample focal population).

Root cause

The genotype LD estimators normalize by falling factorials of the per-pair valid
sample count n (individuals genotyped at both loci of a pair):

  • between-pop Dz: n*(n-1)*(n-2) (and n of the partner pop)
  • normalized pi2 frequencies: 1/n
  • heterozygosity: n*(n-1) (per site, n = non-missing individuals)

Missing data can drive n below the order of its factorial -- e.g. on the MWE,
the focal population has only 2 individuals jointly typed at the variant pair
19,049,574 x 19,137,724, so n*(n-1)*(n-2) = 0. The estimator is then 0/0,
which is undefined in moments too (its per-pair formula raises / yields nan;
moments and pg_gpu count per-pair samples identically). The between-pop Dz
kernels, the pi2 normalized-frequency kernels (alldiff/iikl/ijkk/shared), and
the heterozygosity sum divided without a guard, so one degenerate pair/site
produced a NaN that poisoned the whole r-bin sum.

The union-level biallelic filter only requires the union of populations to have

= 2 valid samples, so it never protects an individual small population.

Fix

Guard each denominator so a degenerate pair/site contributes 0 -- excluding
the undefined term -- using the same cp.maximum(denom, 1) / (d>0)?...:0
idiom the single-pop Dz (ns<4 -> 0) and every pi2 kernel already use. On
well-sampled data the guards are exact no-ops.

  • genotype_kernels.py: the four between-pop Dz kernels + the pi2 inv_n
  • moments_ld.py: _compute_heterozygosity

Correctness

pg_gpu's genotype formulas match moments' Python reference
(stats_from_genotype_counts, which carries the correct -20 pi2(i,i;i,i)
coefficient from MomentsLD/moments#258), applying moments' own index
symmetrization, to machine precision (~3e-17) across all 45 DD/Dz/pi2 stats.
Note: the comparison must use moments' Python path -- the Cython path in moments
<= 1.5.0 still has the #258 +20 sign bug, so pg_gpu is the correct one there.

Tests

New tests/test_ld_missing_data_degenerate.py:

  • test_genotype_stats_match_moments_python (moments env) -- machine-precision
    parity with moments' Python reference for every DD/Dz/pi2 case.
  • three robustness regressions (between-pop Dz degenerate pair, pi2
    zero-sample population, heterozygosity zero-sample site) -- each is NaN before
    this change and finite after; GPU-only, no moments needed.

All 54 moments-env LD parity/missing-data tests pass (no regression); lint clean.

Closes #123

…e counts

Missing data can drop a population's per-pair valid sample count below the
order of a statistic's denominator (n*(n-1)*(n-2) for between-pop Dz, 1/n for
the normalized pi2 frequencies, n*(n-1) for heterozygosity). The denominator
is then zero and the unbiased estimator is 0/0, which is undefined in moments
too. The between-pop Dz kernels, the pi2 normalized-frequency path, and the
heterozygosity sum divided without guarding, so one degenerate pair or site
emitted a NaN that poisoned the whole r-bin sum.

Guard each denominator so a degenerate pair/site contributes 0 -- excluding the
undefined term -- matching the single-pop Dz and pi2 kernels that already guard
the same way. On well-sampled data the guards are exact no-ops; pg_gpu still
matches moments' Python reference (the PR #258 -20 pi2 coefficient) to machine
precision across every DD/Dz/pi2 case.
@andrewkern andrewkern merged commit 145660f into main Jun 13, 2026
1 check passed
@andrewkern andrewkern deleted the fix/dz-genotype-missing-data-nan branch June 13, 2026 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: pg-gpu returning nan where momentsLD does not

1 participant