Question

Tophat Mapping Rate With Rna-Seq.

2

Entering edit mode

11.3 years ago

C Shao ▴ 140

I am confused about my RNA-seq mapping results from tophat. It's pair-end sequencing data, 101 bp length from Hiseq 2000.

Here is the command I used:

tophat -p 8 -g 1 -G $geneInfo -o $examp_thout $bowtie2in $fastqP1 $fastqP2

"-g 1" specified that only uniquely mapped reads are kept; "$geneInfo" is the gtf files I downloaded from tophat website; "$bowtie2in" is the index files from tophat website.

tophat used bowtie 2.0.0.7 and samtools 0.1.18.0.

As an example, I have 95604894 reads in total.

According to "the accepted_hits.bam", there are 88040486 uniquely reads mapped (all of them have tag : "NH:i:1"), and there are 7564408 reads on "unmapped.bam".

So, 92.1% of reads are uniquely mapped on the genome! I have similar results for other samples.

Is it too good to be true? How about your data?

And there are some log files under "log" folder, one of them is "bowtie.leftkeptreads.log"

47730953 reads; of these:
    47730953 (100.00%) were unpaired; of these:
    10052350 (21.06%) aligned 0 times
    20708981 (43.39%) aligned exactly 1 time
    16969622 (35.55%) aligned >1 times
78.94% overall alignment rate

It seems that lots of reads are mapped in multiple location, how could I get high uniquely mapped reads percentage?

Many thanks!

tophat rna-seq • 10k views

ADD COMMENT • link updated 11.3 years ago by Istvan Albert 100k • written 11.3 years ago by C Shao ▴ 140

score 1 · Answer 1 · 2013-01-24

1

Entering edit mode

11.3 years ago

Istvan Albert 100k

The manual is a little ambiguous on the matter but the way I read it the -g 1 is a reporting option not a filtering option.

In your case this means that the software will report only one alignment for reads that map to multiple locations and not that it removes reads that map to more than one location.

So the accepted_hits.bam is not a file of uniquely aligned reads, it just contains each read only once. It is still impressive that you are able to align 92% of data but does not contradict the second observation.

ADD COMMENT • link 11.3 years ago by Istvan Albert 100k

0

Entering edit mode

I want to get the read counts for each genes, then analyze them with linear model under poisson distribution. That is why I want the uniquely mapped reads. How do you work with tophat? Do you know any way to get the uniquely mapped reads?

ADD REPLY • link 11.3 years ago by C Shao ▴ 140

0

Entering edit mode

There could be multiple way to do it. For example there is an optional field in the SAM file called:

NH i Number of reported alignments that contains the query in the current record

You will need to allow reporting of more than one read and then you will want to select lines that contains NH:i:1

Also look here Counting repeat and unique reads of tophat output

ADD REPLY • link 11.3 years ago by Istvan Albert 100k

0

Entering edit mode

According to your explanation, I set "-g 1" to let tophat report only one mapped reads even if reads are mapped in multiple position. Then, for example, I set "-g 2" , and selecte reas with "NH i 1", which will give the uniquely mapped reads. Is it a paradox?

I also see many people in seqanswers said that "-g 1" will give the uniquly mapped reads. It is so confusing.

ADD REPLY • link 11.3 years ago by C Shao ▴ 140

0

Entering edit mode

I would recommend to try to avoid setting too many limits during alignment phase, instead filter the resulting SAM file.

For example note what the manual says about -g:

If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments