Fix NaN in genotype LD stats with missing data in small populations#124
Merged
Conversation
…e counts Missing data can drop a population's per-pair valid sample count below the order of a statistic's denominator (n*(n-1)*(n-2) for between-pop Dz, 1/n for the normalized pi2 frequencies, n*(n-1) for heterozygosity). The denominator is then zero and the unbiased estimator is 0/0, which is undefined in moments too. The between-pop Dz kernels, the pi2 normalized-frequency path, and the heterozygosity sum divided without guarding, so one degenerate pair or site emitted a NaN that poisoned the whole r-bin sum. Guard each denominator so a degenerate pair/site contributes 0 -- excluding the undefined term -- matching the single-pop Dz and pi2 kernels that already guard the same way. On well-sampled data the guards are exact no-ops; pg_gpu still matches moments' Python reference (the PR #258 -20 pi2 coefficient) to machine precision across every DD/Dz/pi2 case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the NaN reported in #123, where pg_gpu emitted a single non-finite value
(
Dz_0_0_1, one r-bin) on a region that moments handled fine. The root cause isa class of unguarded denominators in the genotype LD path that go to zero when
missing data shrinks a population's per-pair (or per-site) valid sample count.
Reported by @mufernando with a clean, self-contained MWE (region BO_0093, a
68-sample VCF with a 14-sample focal population).
Root cause
The genotype LD estimators normalize by falling factorials of the per-pair valid
sample count
n(individuals genotyped at both loci of a pair):Dz:n*(n-1)*(n-2)(andnof the partner pop)1/nn*(n-1)(per site,n= non-missing individuals)Missing data can drive
nbelow the order of its factorial -- e.g. on the MWE,the focal population has only 2 individuals jointly typed at the variant pair
19,049,574 x 19,137,724, so
n*(n-1)*(n-2) = 0. The estimator is then0/0,which is undefined in moments too (its per-pair formula raises / yields nan;
moments and pg_gpu count per-pair samples identically). The between-pop
Dzkernels, the pi2 normalized-frequency kernels (alldiff/iikl/ijkk/shared), and
the heterozygosity sum divided without a guard, so one degenerate pair/site
produced a NaN that poisoned the whole r-bin sum.
The union-level biallelic filter only requires the union of populations to have
Fix
Guard each denominator so a degenerate pair/site contributes 0 -- excluding
the undefined term -- using the same
cp.maximum(denom, 1)/(d>0)?...:0idiom the single-pop
Dz(ns<4 -> 0) and every pi2 kernel already use. Onwell-sampled data the guards are exact no-ops.
genotype_kernels.py: the four between-popDzkernels + the pi2inv_nmoments_ld.py:_compute_heterozygosityCorrectness
pg_gpu's genotype formulas match moments' Python reference
(
stats_from_genotype_counts, which carries the correct-20pi2(i,i;i,i)coefficient from MomentsLD/moments#258), applying moments' own index
symmetrization, to machine precision (~3e-17) across all 45 DD/Dz/pi2 stats.
Note: the comparison must use moments' Python path -- the Cython path in moments
<= 1.5.0 still has the #258
+20sign bug, so pg_gpu is the correct one there.Tests
New
tests/test_ld_missing_data_degenerate.py:test_genotype_stats_match_moments_python(moments env) -- machine-precisionparity with moments' Python reference for every DD/Dz/pi2 case.
Dzdegenerate pair, pi2zero-sample population, heterozygosity zero-sample site) -- each is NaN before
this change and finite after; GPU-only, no moments needed.
All 54 moments-env LD parity/missing-data tests pass (no regression); lint clean.
Closes #123