I am interested in extracting uniquely mapped reads from a bam file by filtering based on the MAPQ field.
I ran some tests to figure out what MAPQ threshold should be used but am having trouble understanding the results.
I ran bowtie2 on paired-end data and saw that when the reads are unique and map perfectly to the genome (no mismatches), the pair gets a MAPQ of 42.
When one mismatch is introduced in each read of the pair, the MAPQ drops to 24.
When both reads of the pair align to a repetitive region of the genome (no mismatches), the MAPQ is 32.
Why would the MAPQ of a pair that aligns to many regions in the genome (which I would like to filter out because it is not uniquely mapped) be higher than a pair that contains two mismatches (one in each read) but maps uniquely?
How should a threshold for uniquely mapped reads be determined?
Any help would be greatly appreciated.