Reads with highest MAPQ values from SAM files are showing mismatches to reference sequence and IGV classified them as supplementary reads
Entering edit mode
3 months ago
Mohd ▴ 40

Hi all,

I am expressing a GFP synonymous variant library in human cells and sequencing its RNA on the nanopore and I am having some trouble analysing the data. Initially, I basecalled all the fast5 files using the super accuracy model in the guppy basecaller, then I discarded the reads having phred quality of less than 10 using nanofilt (filtered reads on the basis of read quality). Then I mapped all the reads to the reference database using the minimap.

After this, I retained all the primary alignments from the SAM file using the flags 0 & 16 (filtered reads on the basis of alignment type). Then I extracted reads from the SAM file on the basis of mapping quality using the MAPQ scores. Minimap SAM files have MAPQ values from 0 to 60 in the fifth column. While 0 indicates the reads that can possibly map to many locations and have the lowest quality, 60 indicates the highest quality and unique reads. I was losing a lot of reads on MAPQ 60, so thought to check a range of MAPQ values: 10, 30, 31 and 60. Then I visualised all these BAM files in IGV. I was thinking that I will get rid of the supplementary, secondary alignments & mismatches as I am doing all these filtering.

Unexpectedly, in the reads with MAPQ >31, I am having some alignments with mismatches:

enter image description here

As you see in the image, upon clicking the reads, IGV says these reads are having supplementary alignments to the other GFP variants. I am so confused why this is happing as I have retained the reads only with the primary alignments using 0 & 16 flags and selected reads with MAPQ>31. I even tried the read with highest value of MAPQ (60), and I am having the same problem there as well.

Can someone please explain what could be wrong or if there is something else that I could do to get rid of these mismatched alignments? In my experiments I am quantifying the RNA expression levels of the variants and such mismatched alignments will introduce bias to my data, that is why I am trying to get the unique reads (one read matching to only one variant).

Thanks, Mo

minimap Nanopore IGV • 700 views
Entering edit mode

With long nanopore reads you are going to get alignments that will have "errors" since that is partly due to characteristic of how nanopore sequencing works. There are large parts of the read in the example that you show that are soft-clipped and thus not visible (you need to turn a preference setting on in IGV to show these bases).

If you need perfect matches then you will need to filter the alignments further or change initial minimap2 alignment parameters so the alignments become strict. You may end up with few/no reads if you insist on having "perfect" matches depending on what your data looks like.


Login before adding your answer.

Traffic: 1747 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6