I am confused about my RNA-seq mapping results from tophat. It's pair-end sequencing data, 101 bp length from Hiseq 2000.
Here is the command I used:
tophat -p 8 -g 1 -G $geneInfo -o $examp_thout $bowtie2in $fastqP1 $fastqP2
"-g 1" specified that only uniquely mapped reads are kept; "$geneInfo" is the gtf files I downloaded from tophat website; "$bowtie2in" is the index files from tophat website.
tophat used bowtie 18.104.22.168 and samtools 0.1.18.0.
As an example, I have 95604894 reads in total.
According to "the accepted_hits.bam", there are 88040486 uniquely reads mapped (all of them have tag : "NH:i:1"), and there are 7564408 reads on "unmapped.bam".
So, 92.1% of reads are uniquely mapped on the genome! I have similar results for other samples.
Is it too good to be true? How about your data?
And there are some log files under "log" folder, one of them is "bowtie.leftkeptreads.log"
47730953 reads; of these: 47730953 (100.00%) were unpaired; of these: 10052350 (21.06%) aligned 0 times 20708981 (43.39%) aligned exactly 1 time 16969622 (35.55%) aligned >1 times 78.94% overall alignment rate
It seems that lots of reads are mapped in multiple location, how could I get high uniquely mapped reads percentage?