I'm working with single-end Illumina RNA-seq data. After producing BAM files with TopHat against hg19, I'm running Cufflinks (against hg19 knownGene) and then htseq-count, in order to generate count data for use in DESeq2.
Each BAM file contains 35-40M aligned reads per sample (>90% of total reads in each case), and the alignments look good against the reference.
However, htseq-count is reporting only around 300,000 to 600,000 counted reads (representing ~7000 transcripts). This is far below the total number of aligned reads, and far below what I would expect from viewing the BAM against the hg19 reference with gene annotations, where the coverage looks visually acceptable.
In contrast, Cufflinks is producing FPKM values for ~21,000 transcripts, which makes me think that either htseq-count is missing something or I have not configured it correctly.
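One way to see where the remaining reads end up is to tally the special counters that htseq-count appends to the end of its output (no_feature, ambiguous, too_low_aQual, not_aligned, alignment_not_unique; newer versions prefix them with two underscores). Below is a minimal sketch, assuming the usual two-column tab-separated output. The function name `summarize_htseq_counts` and the toy numbers are hypothetical and made up for illustration; they are not real results.

```python
# Special counters htseq-count appends after the per-feature rows.
# Newer versions write them with a leading "__", older ones without.
SPECIAL = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def summarize_htseq_counts(lines):
    """Tally assigned reads versus the special counters.

    `lines` is any iterable of "name<TAB>count" strings, e.g. an open file.
    (Hypothetical helper, not part of HTSeq.)
    """
    assigned = 0          # reads counted towards some feature
    nonzero_features = 0  # features with at least one read
    special = {}          # reads that were not assigned, by reason
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, count = line.split("\t")
        count = int(count)
        key = name.lstrip("_")
        if key in SPECIAL:
            special[key] = count
        else:
            assigned += count
            if count > 0:
                nonzero_features += 1
    total = assigned + sum(special.values())
    return {
        "assigned": assigned,
        "nonzero_features": nonzero_features,
        "special": special,
        "fraction_assigned": assigned / total if total else 0.0,
    }


# Toy input with made-up numbers, only to show the shape of the output.
toy = [
    "GENE_A\t10",
    "GENE_B\t0",
    "GENE_C\t5",
    "__no_feature\t80",
    "__ambiguous\t3",
    "__too_low_aQual\t0",
    "__not_aligned\t0",
    "__alignment_not_unique\t2",
]
summary = summarize_htseq_counts(toy)
print(summary)
```

For a real file, the same function could be called as `summarize_htseq_counts(open("sample.counts"))`, where `sample.counts` is a placeholder name. If most reads land in no_feature, that would point at the annotation (for example chromosome naming in the GTF versus the BAM) or at the strandedness setting (`--stranded`, which defaults to `yes`) rather than at the alignments themselves.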
Why are my htseq counts so low?
Any help appreciated.