-
Notifications
You must be signed in to change notification settings - Fork 24
Description
Hi modkit authors,
I’m running haplotype-aware DMR analysis and I’m seeing a reproducible crash in modkit dmr pair when the BedMethyl inputs were produced by haplotype-aware pileup (--partition-tag in v0.4.4 or --phased in v0.6.1). I’d like to ask whether this is expected / a known limitation, and what the recommended workflow is.
My workflow:
Basecalling model: hac@v5.0.0 with modified base model 5mCG_5hmCG
Reference: GRCh38/hg38 (e.g. GCA_000001405.15_GRCh38_no_alt_analysis_set.fa)
BAM is HP-tagged (e.g. from WhatsHap phasing)
Option A (modkit v0.4.4)
I run pileup with --partition-tag HP to produce haplotype-partitioned BedMethyl:
modkit pileup phased.bam out.bed.gz
--ref /path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
--base C --threads 16
--partition-tag HP
Option B (modkit v0.6.1)
I run pileup with --phased:
modkit pileup phased.bam out.bed.gz
--ref /path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
--base C --threads 16
--phased
Then I run dmr pair on the haplotype BedMethyl files:
modkit dmr pair
-a hp1.bed.gz
-b hp2.bed.gz
-o test.dmr
--segment test.segment
--header
--ref /path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
--base C
--threads 16
Observed error
dmr pair fails with:
"batch failed: invalid data, valid coverage (62) is not equal to the sum of canonical and modified counts (61),
[BedMethylLine { chrom: "chr1", interval: Interval { start: 94017, stop: 94018, ... },
mod_code: Code('m'), strand: Both,
count_methylated: 48, valid_coverage: 62,
count_canonical: 13, count_other: 1, count_fail: 2, count_nocall: 2 }]
Error! invalid-bedmethyl-data"
From the failing line, it looks like:
valid_coverage = 62
count_canonical + count_methylated = 61
but count_other = 1 (so 13 + 48 + 1 = 62)
So it seems dmr pair is validating coverage using only canonical + methylated and ignoring count_other.
However, if I do not use --partition-tag/--phased in pileup, but instead split the BAM by HP tag first, then run pileup separately on each haplotype BAM, dmr pair completes and does not crash (it just reports regions with fail counts).
Example:
split BAM by HP tag
samtools view -h phased.bam | awk '/^@/ || $0 ~ /HP:i:1/' | samtools view -b -o hap1.bam -
samtools view -h phased.bam | awk '/^@/ || $0 ~ /HP:i:2/' | samtools view -b -o hap2.bam -
samtools index hap1.bam
samtools index hap2.bam
pileup per haplotype (no --partition-tag/--phased)
modkit pileup hap1.bam hap1.bed.gz --ref /path/to/ref.fa --base C --threads 16
modkit pileup hap2.bam hap2.bed.gz --ref /path/to/ref.fa --base C --threads 16
dmr pair works
modkit dmr pair -a hap1.bed.gz -b hap2.bed.gz -o test.dmr --segment test.segment --header --ref /path/to/ref.fa --base C --threads 16
So the Questions below for the authors:
1:Is dmr pair supposed to accept BedMethyl generated by pileup --partition-tag / pileup --phased? If not, is the recommended approach to always split BAMs by HP first?
2:Is count_other expected when using the 5mCG_5hmCG model (e.g. 5hmC contributing to count_other)? If yes, should dmr pair include count_other in the “modified counts” sum for coverage validation, or should we filter BedMethyl to 5mC-only before running dmr pair?
3:If the correct workflow is to run DMR only for one mod type (5mC), what is the recommended modkit command/flag to ensure pileup outputs only that type (and avoids non-zero count_other)?
Looking forward to your feedback,Thanks