Question: Tophat Mapping Rate With Rna-Seq.
gravatar for C Shao
6.8 years ago by
C Shao130
C Shao130 wrote:

I am confused about my RNA-seq mapping results from tophat. It's pair-end sequencing data, 101 bp length from Hiseq 2000.

Here is the command I used:

tophat -p 8 -g 1 -G $geneInfo -o $examp_thout $bowtie2in $fastqP1 $fastqP2

"-g 1" specified that only uniquely mapped reads are kept; "$geneInfo" is the gtf files I downloaded from tophat website; "$bowtie2in" is the index files from tophat website.

tophat used bowtie and samtools

As an example, I have 95604894 reads in total.

According to "the accepted_hits.bam", there are 88040486 uniquely reads mapped (all of them have tag : "NH:i:1"), and there are 7564408 reads on "unmapped.bam".

So, 92.1% of reads are uniquely mapped on the genome! I have similar results for other samples.

Is it too good to be true? How about your data?

And there are some log files under "log" folder, one of them is "bowtie.leftkeptreads.log"

47730953 reads; of these:
    47730953 (100.00%) were unpaired; of these:
    10052350 (21.06%) aligned 0 times
    20708981 (43.39%) aligned exactly 1 time
    16969622 (35.55%) aligned >1 times
78.94% overall alignment rate

It seems that lots of reads are mapped in multiple location, how could I get high uniquely mapped reads percentage?

Many thanks!

tophat rna-seq • 9.0k views
ADD COMMENTlink modified 6.8 years ago by Istvan Albert ♦♦ 81k • written 6.8 years ago by C Shao130
gravatar for Istvan Albert
6.8 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

The manual is a little ambiguous on the matter but the way I read it the -g 1 is a reporting option not a filtering option.

In your case this means that the software will report only one alignment for reads that map to multiple locations and not that it removes reads that map to more than one location.

So the accepted_hits.bam is not a file of uniquely aligned reads, it just contains each read only once. It is still impressive that you are able to align 92% of data but does not contradict the second observation.

ADD COMMENTlink written 6.8 years ago by Istvan Albert ♦♦ 81k

I want to get the read counts for each genes, then analyze them with linear model under poisson distribution. That is why I want the uniquely mapped reads. How do you work with tophat? Do you know any way to get the uniquely mapped reads?

ADD REPLYlink written 6.8 years ago by C Shao130

There could be multiple way to do it. For example there is an optional field in the SAM file called:

NH i Number of reported alignments that contains the query in the current record

You will need to allow reporting of more than one read and then you will want to select lines that contains NH:i:1

Also look here Counting repeat and unique reads of tophat output

ADD REPLYlink written 6.8 years ago by Istvan Albert ♦♦ 81k

According to your explanation, I set "-g 1" to let tophat report only one mapped reads even if reads are mapped in multiple position. Then, for example, I set "-g 2" , and selecte reas with "NH i 1", which will give the uniquely mapped reads. Is it a paradox?

I also see many people in seqanswers said that "-g 1" will give the uniquly mapped reads. It is so confusing.

ADD REPLYlink written 6.8 years ago by C Shao130

I would recommend to try to avoid setting too many limits during alignment phase, instead filter the resulting SAM file.

For example note what the manual says about -g:

If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Istvan Albert ♦♦ 81k

Thanks, I will try with default value, then see how it happens.

ADD REPLYlink written 6.8 years ago by C Shao130
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1273 users visited in the last hour