I have some RNAseq data from rat neurons which was processed using STAR and HTseq. I noticed earlier today that for all samples, there were fewer than 20 copies of GAPDH, which should be highly expressed, in the HTSeq count files. There may be other high count genes missing, but I haven't been able to find any other anomalies so far and there are genes such as Map2, Calm1/2 and Nrgn with counts in the tens of thousands range.
I've been trying to trace back the source of this error, starting with one of the samples as a test case, and found in the STAR BAM file (converted from SAM) that there were ~15000 alignments to the GAPDH co-ordinates, trying two slightly different sets of co-ordinates for GAPDH in the rat genome from Ensembl and NCBI.
samtools view s10.bam 4:157676336-157680322 | wc -l
Yet for the same sample, the count of uniquely aligned reads in HTSeq is just 9.
python -m HTSeq.scripts.count -s no -a 10 ~/Star_cleanreads/S10.sam Rattus_norvegicus.Rnor_6.0.89.gtf > s10.count grep ENSRNOG00000018630 s10.count
I'm assuming that the issue is the GAPDH transcripts are for some reason aligning to multiple locations and have tried to confirm this by running HTSeq with the "--nonunique all" option, but get the error that this option doesn't exist. I can't find the version number for HTSeq but assume I must be running an old version. I've also had a look at GAPDH in the gtf file and can't see an obvious problem there.
Has anyone else run into this issue and/or can advise how to proceed?
Many thanks