Very high coverage Nanopore alignment Hg38
1
0
Entering edit mode
19 days ago
pablo ▴ 230

Hi,

I aligned my first Nanopore reads on the hg38 reference. Then, I used bedtools genomecov to get an idea about the mean coverage over the genome.

I noticed some bases have a very high coverage >20,000X , whereas the mean coverage is ~10X . I extracted the corresponding alignments (20,000X) from the bam file, and used IGV to have a look.

As you can see, the very-high covered region has many insertions (>100bp). We are not in a telomeric region. Why all the reads has this profile, and why so many reads mapped here ? I also checked the flags of the alignments here :

  number  flag
2532   2064
2488   2048
7367   272
7479   256
936   16
1023   0


There are many not primary alignments / secondary / supplementary alignments. But also, 1023 + 936 primary alignments, which gives a very high coverage anyway.

We have the same profile here, but in a telomeric region, with many repetitive motifs.

IGV nanopore alignment • 323 views
0
Entering edit mode

It seems that this is telomeric region of the chromosome and all the reads of clipped reads (I think its hard-clipped). I have also observed the same in my nanopore data. These are mapping artifacts in repeatative regions of the genome. You either need to identify these kind of regions in the BAM file and remove them prior to variant calling. Or, you can easily identify these regions from the resulting VCF file as well. In a VCF file, you need to look at the site mean depth (either using vcftools or bcftools) and then filter those sites having unusual depth of coverage.

Check out the way to do it from here: https://speciationgenomics.github.io/filtering_vcfs/ (check out these two sections: Calculate mean depth per site section and Variant mean depth)

Additionally, most variant callers only consider primary alignments, so you are also safe by not removing them as most of the reads are secondary. But there are other regions in the genome that may such artifacts, so its better to filter them off.

0
Entering edit mode
10 days ago

This will likely be an artifact, as you know. Check the Mapping quality MQ of these reads and use a MQ filter of 30 for example to avoid repeat regions.

This is also why people use MQ and maxDepth in their SNP calling. That said, I've never seen a region as bad as that.