I have a few 3' RNA sequencing data sets that have mixed human and mouse samples. As I am trying to split these reads apart, during the alignment (transcript region only) step, I noticed that there are about 3-5% of reads that can be mapped to both references - with VERY similar map-able base pair counts.
One example shows a 66bp (in blat) exact match for both human (CIGAR string: 14S66M) and mouse (CIGAR string: 66M14S) :
(read A in human）
QB5019:183:GJGBGX7:1:111:246:99 16 17 1344539 60 14S66M * 0 0 TTTTTTTTTTTTTTTTTGGCATTTTTAATTTAGGTTTGTTTTATTTAAGTTTAATGTTAATTCCATGCTGTGTTTCAGTA EE/E6E/EAEA/EAEAE//EEEEEEAEEEEEE<EEEEEAE/EEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEAAAAA AS:i:-14 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:66 YT:Z:UU NH:i:1
(read A in mouse)
QB5019:183:GJGBGX7:1:111:246:99 0 11 75765513 60 66M14S * 0 0 TACTGAAACACAGCATGGAATTAACATTAAACTTAAATAAAACAAACCTAAATTAAAAATGCCAAAAAAAAAAAAAAAAA AAAAAEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEE/EAEEEEE<EEEEEEAEEEEEE//EAEAE/AEAE/E6E/EE AS:i:-14 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:66 YT:Z:UU NH:i:1
Another example shows an 80bp (in blat) exact match for human (CIGAR string: 80M), 76bp(in blat) for mouse (CIGAR string still showed 80M using default setting of HISAT2).
(read B in human)
QB501519:1283:GJGBGX7:1:111:2486:1800 0 10 11333759 60 80M * 0 0 GTTTTATTGATGCATTTGGATTTTGTTGTTTGATGGAATTTGAGCCAAAAAAAAAATACGCAGGCTTTCCTATTTCTACA AAAAA6EEEAEEEEEEEEEEEEEEEEEEEEEEEEEEA/EEEEEE//EEEEEEEEEA<EEE/EAE<AEEEEE<E<EAEAE/ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:80 YT:Z:UU NH:i:1
(read B in mouse)
QB501519:1283:GJGBGX7:1:111:2486:1800 16 2 6542373 60 80M * 0 0 TGTAGAAATAGGAAAGCCTGCGTATTTTTTTTTTGGCTCAAATTCCATCAAACAACAAAATCCAAATGCATCAATAAAAC /EAEAE<E<EEEEEA<EAE/EEE<AEEEEEEEEE//EEEEEE/AEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE6AAAAA AS:i:-14 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:21A1T3G52 YT:Z:UU NH:i:1
For cases like read A, I cannot think of an easy way to distinguish them. As for cases like read B, I guess I could try a different aligner and/or change threshold and so the 76bps matches to give a lower MAPQ and/or different CIGAR string... but since it is 80 vs 76, I don't know how sensitive the software can be.
The reason I worry about it is because 1. it counts for 3-5% of the total reads, and 2. Blat showed that they fall into highly conserved gene regions. Ignoring them may as well mean ignoring all conservative genes. Therefore, I would also like to learn if I simply delete these reads from my downstream differential expression and pathway analysis, would it influence the final outcome badly?
Please let me know what you think. Any advice is appreciate it!