Discrepancy between number of reads in tophat
7.3 years ago
mbio.kyle ▴ 380

Hello,

I am using tophat to align 100bp single-end RNA-seq reads to the human transcriptome (using hg19). I have noticed a large difference between the number of reads reported in the prep_reads step and in the align_summary step.

As an example, here is the prep_reads.info file from one of my samples:

min_read_len=101
max_read_len=101
reads_in =22536887
reads_out=22535224

And here is the align summary:

Reads:
          Input     :   2599620
           Mapped   :   1557662 (59.9% of input)
            of these:    125871 ( 8.1%) have multiple alignments (80 have >20)
59.9% overall read mapping rate.

Why is the number of reads going in (reads_in) so much higher than the number of reads listed as Input when calculating the alignment rate? My understanding is that the prep_reads step is the one that filters out reads.

Thanks,
Kyle

RNA-Seq tophat software-error • 2.3k views

This is weird. reads_in should be the same as Input. Somewhere on this forum I read that multi-threading may cause this problem, and that running without the -p parameter should resolve it. But they couldn't figure out why it is happening.

I did some more investigating into this. All the samples that ran in my Python pipeline script (multi-threaded) showed this discrepancy. One sample failed for other reasons and I had to rerun it manually, and for that sample reads_in matched Input. So this must be the issue.

Thanks!
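For anyone who wants to run the same check across samples, here is a minimal sketch that compares reads_out from prep_reads.info with the Input count from align_summary.txt. It assumes the single-end file layouts shown in the question; the function names are made up for illustration, and paired-end output may use different field names.

```python
import re


def parse_prep_reads_info(text):
    """Parse key=value lines (e.g. 'reads_out=22535224') into a dict of ints."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        # Skip lines without '=' and values that are not plain integers.
        if sep and value.strip().isdigit():
            values[key.strip()] = int(value.strip())
    return values


def parse_align_summary_input(text):
    """Return the 'Input' read count from align_summary.txt text."""
    match = re.search(r"Input\s*:\s*(\d+)", text)
    if match is None:
        raise ValueError("no 'Input' line found in align summary")
    return int(match.group(1))


def missing_reads(prep_text, summary_text):
    """Reads that passed prep_reads but are not counted as alignment input."""
    reads_out = parse_prep_reads_info(prep_text)["reads_out"]
    return reads_out - parse_align_summary_input(summary_text)


# Example data copied from the question (assumed single-end layout).
prep_example = """min_read_len=101
max_read_len=101
reads_in =22536887
reads_out=22535224
"""
summary_example = """Reads:
          Input     :   2599620
           Mapped   :   1557662 (59.9% of input)
"""

print(missing_reads(prep_example, summary_example))
```

A non-zero result for a sample means reads went missing between the two steps. In a real pipeline you would read the two files from each sample's tophat output directory instead of using inline strings.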

Could you link me to the original thread by chance? I am quite interested in this now.

Found it, but not sure how much it will help: Tophat - Understated Number Of Reads In The "Align_Summary.Txt" File

Excellent, thank you very much. I have rerun my alignments without multi-threading and the results are quite shocking.

This is without the -p flag:

Reads:
          Input     :  18821606
           Mapped   :  17816811 (94.7% of input)
            of these:   1851740 (10.4%) have multiple alignments (2006 have >20)
94.7% overall read mapping rate.

And this is with it:

Reads:
          Input     :    881549
           Mapped   :    832270 (94.4% of input)
            of these:     87588 (10.5%) have multiple alignments (85 have >20)
94.4% overall read mapping rate.

I double-checked to see if it was just a reporting issue, but it is not: the single-threaded BAM file is almost a GB in size, while the threaded one is 53M.

Thank you so much for clearing this up for me. I hope this gets fixed soon.
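To put a number on the loss, here is a small sketch that computes what fraction of the single-threaded Input count the threaded run reported. The two counts are copied from the summaries above; the helper name is hypothetical.

```python
def retained_fraction(threaded_input, single_input):
    """Fraction of the single-threaded Input count seen by the threaded run."""
    if single_input <= 0:
        raise ValueError("single_input must be positive")
    return threaded_input / single_input


# Input counts copied from the two align summaries in this post.
fraction = retained_fraction(881549, 18821606)
print(f"threaded run reported {fraction:.1%} of the single-threaded input")
```

By this measure the threaded run saw under 5% of the reads, which is in line with the difference in BAM file sizes.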

Here is a GitHub issue that was opened a few days ago: https://github.com/infphilo/tophat/issues/18

The suggestion is that the issue should be fixed in the new tophat version (2.1.0). I am rerunning with the updated version.

Thanks for the follow-up. This is a pretty common issue with most bioinformatics tools: errors come and go between releases. Normally most of the problems can be resolved by using the latest version, or by going one version back if the error is in the latest version.

I tried searching for the post but couldn't find it. I don't think the post explained the reason behind the discrepancy. I will search again and post the link if I am successful.
