Question

Very significant difference in gene counts by different aligner

0

Entering edit mode

8.2 years ago

colonppg ▴ 120

This really bothers me.

I used tophats and htseq-count to get count information for a RNAseq projects (50 bp reads). I then used OSA from Omicsoft to do the same thing and get gene count numbers.

Well, I was ready to accept some difference, actually, the total mapped reads are very much similar from the two pipelines (tophat2 qc report)

My htseq-count was used as:

htseq-count -i gene_name bamfile annotation > outfile

However, at the very end, I noticed for multiple genes, the count data between arraystudio and tophat2 are dramatically different, for example, for many genes tophat2 outputed 1 or 0 for particular gene, arraystudio aligner give counts ranges from several to hundreds of reads.

For all that I checked, for some highly expressed gene, tophat2 pipeline reported 100 times more reads than arraystudio, seriously I actually believe arraystudio reads are more accurate, since there are certain genes we know should show decent expression are essentially reported as not expressed by tophat2.

Well, I have two questions:

Anyone every seen this dramatic different different aligne/quantifications?
As tophat2 is widely used, does this mean I must have done something with the pipeline? -- given the high similar reported overall aligned pairs, I would say the two aligners should have done similarly
Then can htseq-count be the cause of such difference? I used gene_name, not gene_ID for quantification of counts
How do I qc this? can I extract a particular gene from my tophat2 output .bam files and look at them?

I understand this is complicated but really appreciate your time and any input is welcome, this is really shocking for the difference I observed and I am almost sure something must have gone wrong after the tophat2.

Thanks

RNA-Seq aligner • 2.2k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by colonppg ▴ 120

0

Entering edit mode

Can you get a BAM or SAM file from OSA? If so, run that through htseq-count. If the number are similar to what you get with tophat2 then you know OSA is screwing up the counting. If they're not, then you know that the alignment is being done poorly by tophat2. I should note that htseq-count is working correctly, if something disagrees with it substantially then that thing is either wrong or working under completely different assumptions (e.g., using an EM method).

ADD REPLY • link 8.2 years ago by Devon Ryan 104k

0

Entering edit mode

that's a very good suggestion... just did that... however I think the bam file from arraystudio has a different format... and htseq-count complained:

Error occured when reading first line of sam file.
Error: ('SAM line does not contain at least 11 tab-delimited fields.')

well, it fixed by first samtools index thebamfile

I actually extracted the region of interest from bam files from two different aligners, I do not see a huge difference...

the problem may lay after bam generation...

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by colonppg ▴ 120

0

Entering edit mode

looks like htseq-count may have issue:

convert bam file to raw read count file for DESeq or EdgeR

someone may have seen similar issues

ADD REPLY • link 8.2 years ago by colonppg ▴ 120

0

Entering edit mode

That's a two year old thread and pretty much any bug reported then will have gotten fixed by now.

ADD REPLY • link 8.2 years ago by Devon Ryan 104k

Ram · Answer 1 · 2016-02-03

1

Entering edit mode

8.2 years ago

colonppg ▴ 120

I figured out the issue, the way I ran htseq-count was wrong:

-s no needs to be specified otherwise it will be taken as stranded read... I rerun one and everything looks fine now....

Thanks

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by colonppg ▴ 120