This really bothers me.....
I used tophats and htseq-count to get count information for a RNAseq projects (50 bp reads)
I then used OSA from Omicsoft to do the same thing and get gene count numbers
Well, I was ready to accept some difference, actually, the total mapped reads are very much similar from the two pipelines (tophat2 qc report)
my htseq-count was used as:
htseq-count -i gene_name bamfile annotation > outfile
However, at the very end, I noticed for multiple genes, the count data between arraystudio and tophat2 are dramatically different, for example, for many genes tophat2 outputed 1 or 0 for particular gene, arraystudio aligner give counts ranges from several to hundreds of reads...
For all that I checked, for some highly expressed gene, tophat2 pipeline reported 100 times more reads than arraystudio, seriously I actually believe arraystudio reads are more accurate, since there are certain genes we know should show decent expression are essentially reported as not expressed by tophat2....
Well, I have two questions:
1. Anyone every seen this dramatic different different aligne/quantifications?
2. As tophat2 is widely used, does this mean I must have done something with the pipeline? -- given the high similar reported overall aligned pairs, I would say the two aligners should have done similarly
3. Then can htseq-count be the cause of such difference? I used gene_name, not gene_ID for quantification of counts
4. How do I qc this? can I extract a particular gene from my tophat2 output .bam files and look at them?
I understand this is complicated but really appreciate your time and any input is welcome, this is really shocking for the difference I observed and I am almost sure something must have gone wrong after the tophat2...