I am expressing a GFP synonymous variant library in human cells and sequencing its RNA on the nanopore and I am having some trouble analysing the data. Initially, I basecalled all the fast5 files using the super accuracy model in the guppy basecaller, then I discarded the reads having phred quality of less than 10 using nanofilt (filtered reads on the basis of read quality). Then I mapped all the reads to the reference database using the minimap.
After this, I retained all the primary alignments from the SAM file using the flags 0 & 16 (filtered reads on the basis of alignment type). Then I extracted reads from the SAM file on the basis of mapping quality using the MAPQ scores. Minimap SAM files have MAPQ values from 0 to 60 in the fifth column. While 0 indicates the reads that can possibly map to many locations and have the lowest quality, 60 indicates the highest quality and unique reads. I was losing a lot of reads on MAPQ 60, so thought to check a range of MAPQ values: 10, 30, 31 and 60. Then I visualised all these BAM files in IGV. I was thinking that I will get rid of the supplementary, secondary alignments & mismatches as I am doing all these filtering.
Unexpectedly, in the reads with MAPQ >31, I am having some alignments with mismatches:
As you see in the image, upon clicking the reads, IGV says these reads are having supplementary alignments to the other GFP variants. I am so confused why this is happing as I have retained the reads only with the primary alignments using 0 & 16 flags and selected reads with MAPQ>31. I even tried the read with highest value of MAPQ (60), and I am having the same problem there as well.
Can someone please explain what could be wrong or if there is something else that I could do to get rid of these mismatched alignments? In my experiments I am quantifying the RNA expression levels of the variants and such mismatched alignments will introduce bias to my data, that is why I am trying to get the unique reads (one read matching to only one variant).