Skip to content

Question: modkit dmr pair throws invalid-bedmethyl-data when BedMethyl comes from pileup --partition-tag/--phased (HP-tagged BAM); splitting BAM by HP avoids the crash #579

@ChenyuLi0627

Description

@ChenyuLi0627

Hi modkit authors,

I’m running haplotype-aware DMR analysis and I’m seeing a reproducible crash in modkit dmr pair when the BedMethyl inputs were produced by haplotype-aware pileup (--partition-tag in v0.4.4 or --phased in v0.6.1). I’d like to ask whether this is expected / a known limitation, and what the recommended workflow is.

My workflow:
Basecalling model: hac@v5.0.0 with modified base model 5mCG_5hmCG
Reference: GRCh38/hg38 (e.g. GCA_000001405.15_GRCh38_no_alt_analysis_set.fa)
BAM is HP-tagged (e.g. from WhatsHap phasing)
Option A (modkit v0.4.4)
I run pileup with --partition-tag HP to produce haplotype-partitioned BedMethyl:
modkit pileup phased.bam out.bed.gz
--ref /path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
--base C --threads 16
--partition-tag HP

Option B (modkit v0.6.1)
I run pileup with --phased:
modkit pileup phased.bam out.bed.gz
--ref /path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
--base C --threads 16
--phased

Then I run dmr pair on the haplotype BedMethyl files:

modkit dmr pair
-a hp1.bed.gz
-b hp2.bed.gz
-o test.dmr
--segment test.segment
--header
--ref /path/to/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
--base C
--threads 16

Observed error
dmr pair fails with:
"batch failed: invalid data, valid coverage (62) is not equal to the sum of canonical and modified counts (61),
[BedMethylLine { chrom: "chr1", interval: Interval { start: 94017, stop: 94018, ... },
mod_code: Code('m'), strand: Both,
count_methylated: 48, valid_coverage: 62,
count_canonical: 13, count_other: 1, count_fail: 2, count_nocall: 2 }]
Error! invalid-bedmethyl-data"

From the failing line, it looks like:
valid_coverage = 62
count_canonical + count_methylated = 61
but count_other = 1 (so 13 + 48 + 1 = 62)
So it seems dmr pair is validating coverage using only canonical + methylated and ignoring count_other.

However, if I do not use --partition-tag/--phased in pileup, but instead split the BAM by HP tag first, then run pileup separately on each haplotype BAM, dmr pair completes and does not crash (it just reports regions with fail counts).
Example:
split BAM by HP tag
samtools view -h phased.bam | awk '/^@/ || $0 ~ /HP:i:1/' | samtools view -b -o hap1.bam -
samtools view -h phased.bam | awk '/^@/ || $0 ~ /HP:i:2/' | samtools view -b -o hap2.bam -
samtools index hap1.bam
samtools index hap2.bam

pileup per haplotype (no --partition-tag/--phased)
modkit pileup hap1.bam hap1.bed.gz --ref /path/to/ref.fa --base C --threads 16
modkit pileup hap2.bam hap2.bed.gz --ref /path/to/ref.fa --base C --threads 16

dmr pair works
modkit dmr pair -a hap1.bed.gz -b hap2.bed.gz -o test.dmr --segment test.segment --header --ref /path/to/ref.fa --base C --threads 16

So the Questions below for the authors:
1:Is dmr pair supposed to accept BedMethyl generated by pileup --partition-tag / pileup --phased? If not, is the recommended approach to always split BAMs by HP first?
2:Is count_other expected when using the 5mCG_5hmCG model (e.g. 5hmC contributing to count_other)? If yes, should dmr pair include count_other in the “modified counts” sum for coverage validation, or should we filter BedMethyl to 5mC-only before running dmr pair?
3:If the correct workflow is to run DMR only for one mod type (5mC), what is the recommended modkit command/flag to ensure pileup outputs only that type (and avoids non-zero count_other)?

Looking forward to your feedback,Thanks

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionLooking for clarification on inputs and/or outputstroubleshootingworkflow and data preparation questions

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions