This has puzzled me for days and I couldnt find any explanation on the internet.
I noticed HTSeq-count gives the reads counts of a gene as 0. But when viewed in IGV with the accepted_hits.bam from tophat2 alignment, there are hundreds to thousands reads aligned in that gene region (from different samples).
This is how I used HTSeq-count from HTSeq/0.6.1p1 and samtools/1.1
samtools sort -T tmp -o accepted_hits_nsort.bam -n accepted_hits.bam
htseq-count --format=bam --strand=no --order=name -a 5 --mode=union accepted_hits_nsort.bam Homo_sapiens.GRCh37.72.gtf > htseq_count_0.6.txt
However, when I break down the accepted_hits.bam and extract the chromosome where the gene is, HTSeq-count gives the right reads counts for this BAM file.
My accepted_hits.bam contains 60,845,216 alginemnt using samtools view -c accepted_hits.bam. I dont think for PE RNA-seq, file at size of 4.5G would be a problem for samtools or HTSeq-count to deal with?
My suspicions are:
1. samtools name sort problem
2. htseq-count bug
3. out of memory
But from the try-error I have done so far, I cannot really figure out what has gone wrong here.
Could any one give any hint here? Any suggestion would be appreciated. Thanks.