Question: htseq-count gives wrong read counts
ariesliu324 wrote (5.0 years ago, United Kingdom):

Hello all,

This has puzzled me for days and I couldn't find any explanation on the internet.
I noticed htseq-count gives a read count of 0 for a gene. But when I view the accepted_hits.bam from the tophat2 alignment in IGV, there are hundreds to thousands of reads aligned in that gene region (across different samples).

This is how I ran htseq-count from HTSeq/0.6.1p1 with samtools/1.1:
samtools sort -T tmp -o accepted_hits_nsort.bam -n accepted_hits.bam 
htseq-count --format=bam --strand=no --order=name -a 5 --mode=union accepted_hits_nsort.bam Homo_sapiens.GRCh37.72.gtf > htseq_count_0.6.txt

However, when I subset accepted_hits.bam to just the chromosome containing the gene, htseq-count gives the correct read counts for that smaller BAM file.

My accepted_hits.bam contains 60,845,216 alignments according to samtools view -c accepted_hits.bam. I wouldn't think a 4.5 GB paired-end RNA-seq file would be a problem for samtools or htseq-count to handle?

My suspicions are:
1. samtools name sort problem
2. htseq-count bug
3. out of memory

But from the trial and error I have done so far, I cannot figure out what has gone wrong here.

Could anyone give a hint here? Any suggestion would be appreciated. Thanks.

rna-seq • 3.8k views
modified 4.9 years ago by Biostar • written 5.0 years ago by ariesliu324
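When debugging cases like this, a useful first step is to look at the special counters htseq-count prints at the bottom of its output (__no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique): they say where the "missing" reads went. A minimal Python sketch for separating them from the gene counts (the sample table and its numbers below are invented):

```python
# Made-up excerpt of an htseq-count output table; real output is
# tab-separated with the __special buckets appended at the end.
sample = """\
ENSG00000000003\t0
ENSG00000000005\t12
__no_feature\t1500000
__ambiguous\t30000000
__too_low_aQual\t0
__not_aligned\t0
__alignment_not_unique\t2000000
"""

def split_counts(htseq_output):
    """Separate per-gene counts from the __special diagnostic buckets."""
    genes, special = {}, {}
    for line in htseq_output.strip().splitlines():
        name, count = line.split("\t")
        (special if name.startswith("__") else genes)[name] = int(count)
    return genes, special

genes, special = split_counts(sample)
print(special)  # a huge __ambiguous or __no_feature points at the annotation
```

If __no_feature or __ambiguous is on the order of your library size, the problem is in how reads are being matched to the annotation, not in the alignment itself.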

Isn't it --stranded instead of --strand? The default value for --stranded is yes, so htseq-count is probably assuming your reads are stranded. Not sure if that could be the issue, though.

written 5.0 years ago by komal.rathi

Hello ariesliu324!

It appears that your post has been cross-posted to another site: SEQanswers.

This is typically not recommended as it runs the risk of annoying people in both communities.

written 5.0 years ago by Devon Ryan

If it were a memory issue, I would expect it to fail for all genes, not just one.

modified 5.0 years ago • written 5.0 years ago by geek_y


I also ran into this when I was using different versions of the reference FASTA and GTF files.

Make sure the reference FASTA used for mapping and the GTF/GFF used for feature counting come from the same source and a matching version.

As komal.rathi mentioned, the option is --stranded (or -s), not --strand.


written 5.0 years ago by Manvendra Singh
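One concrete way a reference/annotation mismatch shows up is chromosome naming: UCSC-style references use "chr1" while Ensembl GTFs (like the GRCh37.72 one above) use "1", and htseq-count silently returns 0 for every feature on a chromosome name it never sees in the BAM. A quick sanity check sketched in Python (the name sets here are illustrative; in practice you would collect them from samtools view -H and from column 1 of the GTF):

```python
# Hypothetical chromosome-name sets for a BAM and a GTF that came
# from different sources. Replace with names pulled from your files.
bam_chroms = {"chr1", "chr2", "chrX", "chrM"}  # UCSC-style naming
gtf_chroms = {"1", "2", "X", "MT"}             # Ensembl-style naming

shared = bam_chroms & gtf_chroms
missing = gtf_chroms - bam_chroms
print("chromosomes in common:", sorted(shared))
print("GTF chromosomes absent from BAM:", sorted(missing))
# An empty intersection guarantees a count of 0 for every gene.
```

A partial overlap (e.g. main chromosomes match but patches/scaffolds do not) would instead zero out only the genes on the mismatched sequences.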

I am having the same problem. @ariesliu324, did you manage to resolve this issue?

written 3.8 years ago by bojingjia

I have seen this before with polycistronic genes (more than one protein from one transcript). htseq-count does not assign a read when it overlaps two "genes". To htseq-count, a gene is simply the id attribute field (gene_id by default), and polycistronic genes are generally given a different gene_id for each transcript/protein, so htseq-count does not know where to assign the read. To check for this, run bedtools intersect on your GTF/GFF against the region of interest: if there is more than one gene_id in that region, this is the likely culprit. Not sure if this is your problem, but it has bitten me a couple of times. Hope that helps!

Aside: there is no really good solution for assigning these reads. If the reads were assigned to every "gene" at a polycistronic locus and you then ran a DE analysis and identified one of those genes, you would get all of them; a subsequent GO enrichment analysis would then certainly enrich for any terms specific to that locus. By not counting these reads (the default here) you at least avoid false-positive enrichment, but you could miss important biology by failing to identify a DE gene. It's really an annotation issue: when analyzing RNA data, how should one handle annotation done at the protein level?

written 3.7 years ago by mstoiber
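The overlapping-gene_id scenario described above is easy to see in a toy model of union-mode assignment (a simplified sketch, not HTSeq's actual implementation; gene coordinates and read positions are invented):

```python
# Two hypothetical gene_ids with overlapping coordinates on one chromosome.
features = {"geneA": (100, 500), "geneB": (400, 900)}  # gene_id -> (start, end)

def union_assign(read_start, read_end):
    """Union-mode style assignment: a read hitting >1 gene_id is ambiguous."""
    hits = {g for g, (s, e) in features.items()
            if read_start < e and read_end > s}
    if len(hits) == 1:
        return hits.pop()
    return "__ambiguous" if hits else "__no_feature"

print(union_assign(150, 250))  # inside geneA only
print(union_assign(420, 480))  # inside the geneA/geneB overlap
```

A read landing in the 400-500 overlap is assigned to neither gene, so a gene whose entire body is shadowed by another gene_id can end up with a count of 0 even though IGV shows plenty of coverage.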
Powered by Biostar version 2.3.0