I've recently been working with GATK's Mutect2 --mitochondria-mode for calling variants on chrM. I am testing the WDL Best Practices Pipeline on terra which employs a double alignment strat, along with a reverse engineered command line version for a single alignment strat.
Recently tested on a WES exome cram, as well as gatk's provided sample WGS bam, both aligned on broad's hg38 referenced genome. Both samples' seq probes are designed to capture mito content effectively.
Coverage per base data was collected using GATK's CollectHsMetrics tool, which is separate from Mutect2. I do not believe CollectHsMetrics can account for NuMTs like Mutect2 mitomode does, so no filters have really been applied to these datasets. This is simply bam/cram file and bait/target interval input with coverage per base output.
I noticed some odd quirks in the coverage tsvs and was hoping some of you could provide insight.
The gatk sample data, run on terra, shows a sharp dip in coverage at position 3106:
The terra exome data and the gatk command line exome data both show the same dip at the same position, along with an unusual spike immediately proceeding it.
A colleague has suggested that the spike in our exome data can probably be attributed to the 16s rRNA gene, which has copies in the nuclear genome and which our seq kit has separate probes for, which inflates coverage.
But I'm still wondering what is causing this odd dip at 3106. If it was an error with the double alignment process, why is it showing up in the comline version?
Something to do with an inaccuracy in the hg38 reference genome maybe? Or perhaps a highly conserved deletion? Maybe a CollectHsMetrics tool error?