Question: How to handle RNASeq reads that can map to both human and mouse ref (e.g. conserved gene regions)?
0
gravatar for DVA
11 weeks ago by
DVA490
United States
DVA490 wrote:

Hello,

I have a few 3' RNA sequencing data sets that have mixed human and mouse samples. As I am trying to split these reads apart, during the alignment (transcript region only) step, I noticed that there are about 3-5% of reads that can be mapped to both references - with VERY similar map-able base pair counts.

One example shows a 66bp (in blat) exact match for both human (CIGAR string: 14S66M) and mouse (CIGAR string: 66M14S) :

(read A in human)

QB5019:183:GJGBGX7:1:111:246:99       16      17      1344539 60     14S66M   *       0       0       TTTTTTTTTTTTTTTTTGGCATTTTTAATTTAGGTTTGTTTTATTTAAGTTTAATGTTAATTCCATGCTGTGTTTCAGTA        EE/E6E/EAEA/EAEAE//EEEEEEAEEEEEE<EEEEEAE/EEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEAAAAA        AS:i:-14        XN:i:0  XM:i:0 XO:i:0   XG:i:0  NM:i:0  MD:Z:66 YT:Z:UU NH:i:1

(read A in mouse)

QB5019:183:GJGBGX7:1:111:246:99       0       11      75765513       60       66M14S  *       0       0       TACTGAAACACAGCATGGAATTAACATTAAACTTAAATAAAACAAACCTAAATTAAAAATGCCAAAAAAAAAAAAAAAAA        AAAAAEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEE/EAEEEEE<EEEEEEAEEEEEE//EAEAE/AEAE/E6E/EE        AS:i:-14        XN:i:0 XM:i:0   XO:i:0  XG:i:0  NM:i:0  MD:Z:66 YT:Z:UU NH:i:1

Another example shows an 80bp (in blat) exact match for human (CIGAR string: 80M), 76bp(in blat) for mouse (CIGAR string still showed 80M using default setting of HISAT2).

(read B in human)

QB501519:1283:GJGBGX7:1:111:2486:1800       0       10      11333759       60       80M     *       0       0       GTTTTATTGATGCATTTGGATTTTGTTGTTTGATGGAATTTGAGCCAAAAAAAAAATACGCAGGCTTTCCTATTTCTACA        AAAAA6EEEAEEEEEEEEEEEEEEEEEEEEEEEEEEA/EEEEEE//EEEEEEEEEA<EEE/EAE<AEEEEE<E<EAEAE/        AS:i:0  XN:i:0  XM:i:0 XO:i:0   XG:i:0  NM:i:0  MD:Z:80 YT:Z:UU NH:i:1

(read B in mouse)

QB501519:1283:GJGBGX7:1:111:2486:1800       16      2       6542373 60     80M      *       0       0       TGTAGAAATAGGAAAGCCTGCGTATTTTTTTTTTGGCTCAAATTCCATCAAACAACAAAATCCAAATGCATCAATAAAAC        /EAEAE<E<EEEEEA<EAE/EEE<AEEEEEEEEE//EEEEEE/AEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE6AAAAA        AS:i:-14        XN:i:0  XM:i:3 XO:i:0   XG:i:0  NM:i:3  MD:Z:21A1T3G52  YT:Z:UU NH:i:1

For cases like read A, I cannot think of an easy way to distinguish them. As for cases like read B, I guess I could try a different aligner and/or change threshold and so the 76bps matches to give a lower MAPQ and/or different CIGAR string... but since it is 80 vs 76, I don't know how sensitive the software can be.

The reason I worry about it is because 1. it counts for 3-5% of the total reads, and 2. Blat showed that they fall into highly conserved gene regions. Ignoring them may as well mean ignoring all conservative genes. Therefore, I would also like to learn if I simply delete these reads from my downstream differential expression and pathway analysis, would it influence the final outcome badly?

Please let me know what you think. Any advice is appreciate it!

hisat2 alignment • 121 views
ADD COMMENTlink modified 11 weeks ago by JC7.4k • written 11 weeks ago by DVA490
1
gravatar for JC
11 weeks ago by
JC7.4k
Mexico
JC7.4k wrote:

Checking the coordinates for your first pair seems like the human hit is more reliable because it is at the end of a gene, meanwhile the mouse coordinates are in an intergenic region. So, you can filter if the hit clearly comes from a possible expressed region.

ADD COMMENTlink written 11 weeks ago by JC7.4k

Hi @JC, thanks a lot for the reply! This is an interesting point - for the human case, it falls at the end with only one of the isoforms it seems like. I guess to use this design, that would require a very thorough annotation.

ADD REPLYlink written 11 weeks ago by DVA490
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2150 users visited in the last hour