Question: Very significant difference in gene counts by different aligner
gravatar for colonppg
4.7 years ago by
United States
colonppg110 wrote:

This really bothers me.....

I used tophats and htseq-count to get count information for a RNAseq projects (50 bp reads)

I then used OSA from Omicsoft to do the same thing and get gene count numbers

Well, I was ready to accept some difference, actually, the total mapped reads are very much similar from the two pipelines (tophat2 qc report)

my htseq-count was used as:

htseq-count -i gene_name bamfile annotation > outfile

However, at the very end, I noticed for multiple genes, the count data between arraystudio and tophat2 are dramatically different, for example, for many genes tophat2 outputed 1 or 0 for particular gene, arraystudio aligner give counts ranges from several to hundreds of reads...

For all that I checked, for some highly expressed gene, tophat2 pipeline reported 100 times more reads than arraystudio, seriously I actually believe arraystudio reads are more accurate, since there are certain genes we know should show decent expression are essentially reported as not expressed by tophat2....

Well, I have two questions:

1. Anyone every seen this dramatic different different aligne/quantifications?

2. As tophat2 is widely used, does this mean I must have done something with the pipeline? -- given the high similar reported overall aligned pairs, I would say the two aligners should have done similarly

3. Then can htseq-count be the cause of such difference? I used gene_name, not gene_ID for quantification of counts

4. How do I qc this? can I extract a particular gene from my tophat2 output .bam files and look at them?

I understand this is complicated but really appreciate your time and any input is welcome, this is really shocking for the difference I observed and I am almost sure something must have gone wrong after the tophat2...



rna-seq aligner • 1.5k views
ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by colonppg110

Can you get a BAM or SAM file from OSA? If so, run that through htseq-count. If the number are similar to what you get with tophat2 then you know OSA is screwing up the counting. If they're not, then you know that the alignment is being done poorly by tophat2. I should note that htseq-count is working correctly, if something disagrees with it substantially then that thing is either wrong or working under completely different assumptions (e.g., using an EM method).

ADD REPLYlink written 4.7 years ago by Devon Ryan96k

that's a very good suggestion... just did that... however I think the bam file from arraystudio has a different format... and htseq-count complained:

Error occured when reading first line of sam file.
Error: ('SAM line does not contain at least 11 tab-delimited fields.')

well, it fixed by first samtools index thebamfile

I actually extracted the region of interest from bam files from two different aligners, I do not see a huge difference...

the problem may lay after bam generation...

ADD REPLYlink modified 8 months ago by RamRS30k • written 4.7 years ago by colonppg110

looks like htseq-count may have issue:

convert bam file to raw read count file for DESeq or EdgeR

someone may have seen similar issues

ADD REPLYlink written 4.7 years ago by colonppg110

That's a two year old thread and pretty much any bug reported then will have gotten fixed by now.

ADD REPLYlink written 4.7 years ago by Devon Ryan96k
gravatar for colonppg
4.7 years ago by
United States
colonppg110 wrote:

I figured out the issue, the way I ran htseq-count was wrong:

-s no needs to be specified otherwise it will be taken as stranded read... I rerun one and everything looks fine now....


ADD COMMENTlink modified 8 months ago by RamRS30k • written 4.7 years ago by colonppg110
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1959 users visited in the last hour