Discrepancy between number of reads in tophat
0
1
7.3 years ago
mbio.kyle ▴ 380

Hello,

I am using tophat to align 100 bp single-end RNA-seq reads to the human transcriptome (using hg19). I have noticed a large difference between the number of reads reported in the prep_reads step and in the align_summary step.

As an example, here is the prep_reads.info file from one of my samples:

min_read_len=101


And here is the align summary:

Reads:
Input     :   2599620
Mapped   :   1557662 (59.9% of input)
of these:    125871 ( 8.1%) have multiple alignments (80 have >20)


Why is the number of reads in (reads_in) much higher than the number of reads listed as Input when calculating the alignment rate? My understanding is that the prep_reads step is the one that filters out reads.

Thanks,
Kyle

RNA-Seq tophat software-error • 2.3k views
2

This is weird. reads_in should be the same as Input. Somewhere on this forum I read that multi-threading may cause this problem, and that running without the -p parameter should resolve it. But they couldn't figure out why it is happening.

1

I did some more investigating into this. All the samples that ran through my pipeline Python script (multi-threaded) showed this discrepancy. One sample failed for other reasons and I had to rerun it manually, and for that sample reads_in matched Input. So this must be the issue.
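
For anyone who wants to run the same check across many samples, here is a minimal sketch of the comparison. It assumes each tophat output directory contains a prep_reads.info file made of key=value lines (including a reads_in entry) and an align_summary.txt with an "Input :" line as shown above. The function names are my own, not part of tophat.

```python
import re
from pathlib import Path


def read_prep_info(path):
    """Parse key=value lines (e.g. min_read_len=101) into a dict of ints."""
    info = {}
    for line in Path(path).read_text().splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and value.strip().isdigit():
            info[key.strip()] = int(value)
    return info


def read_align_input(path):
    """Return the integer after 'Input :' in an align summary, or None."""
    match = re.search(r"Input\s*:\s*(\d+)", Path(path).read_text())
    return int(match.group(1)) if match else None


def check_sample(sample_dir):
    """Compare reads_in with Input for one output directory.

    Assumed file names: prep_reads.info and align_summary.txt.
    Returns (reads_in, input_reads, counts_match).
    """
    sample_dir = Path(sample_dir)
    reads_in = read_prep_info(sample_dir / "prep_reads.info").get("reads_in")
    input_reads = read_align_input(sample_dir / "align_summary.txt")
    return reads_in, input_reads, reads_in == input_reads
```

Calling check_sample on each output directory and flagging the ones where the last value is False should be enough to spot affected samples.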

Thanks!

0

Could you link me to the original thread by chance? I am quite interested in this now.

2

Found it, but not sure how much it will help: Tophat - Understated Number Of Reads In The "Align_Summary.Txt" File

0

Excellent, thank you very much. I have rerun my alignments without multi-threading and the results are quite shocking.

This is without the -p flag:

Reads:
Input     :  18821606
Mapped   :  17816811 (94.7% of input)
of these:   1851740 (10.4%) have multiple alignments (2006 have >20)


And this is with it:

Reads:
Input     :    881549
Mapped   :    832270 (94.4% of input)
of these:     87588 (10.5%) have multiple alignments (85 have >20)


I double-checked to see whether it was just a reporting issue, but it is not: the single-threaded BAM file is almost a GB in size, while the multi-threaded one is 53M.
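
As a rough sanity check, the read counts and the file sizes tell the same story. The sketch below uses the counts from the two summaries above and treats "almost a GB" as about 1000 MB, which is only an approximation on my part.

```python
# Counts copied from the two align summaries above.
single = {"input": 18821606, "mapped": 17816811}
threaded = {"input": 881549, "mapped": 832270}


def fraction_kept(threaded_count, single_count):
    """Fraction of the single-threaded count that the threaded run retained."""
    return threaded_count / single_count


for key in ("input", "mapped"):
    pct = 100 * fraction_kept(threaded[key], single[key])
    print(f"{key}: threaded run has {pct:.1f}% of the single-threaded count")

# Approximate BAM size ratio: 53 MB versus roughly 1000 MB (assumed).
size_pct = 100 * 53 / 1000
print(f"BAM size: threaded file is about {size_pct:.1f}% of the single-threaded one")
```

Both ratios come out at around 5%, so the small BAM file is consistent with reads actually being lost rather than just miscounted.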

Thank you so much for clearing this up for me. I hope this gets fixed soon.

1

Here is a github issue which was opened a few days ago: https://github.com/infphilo/tophat/issues/18

The suggestion is that the issue should be fixed in the new tophat version (2.1.0). I am rerunning with the updated version.

0

Thanks for the follow-up. This is a pretty common issue with bioinformatics tools: errors come and go. Most problems can usually be resolved by using the latest version, or by going one version back if the error is in the latest version.

0

I tried searching for the post but couldn't find it. I don't think the post explained the reason behind the discrepancy. I will search again and post the link if I am successful.