I'm trying to figure out if my dataset was collected from mRNA, or tRNA. My pipeline is TopHat2 for alignment and htseq-count for read counting/generation of DESeq2 input. Previously, I had aligned and counted the data using the genes.gtf file from the GRCh37 iGenomes build: http://support.illumina.com/sequencing/sequencing_software/igenome.html . As I understood it, this should represent all mRNA-encoding regions of the genome.
However, I'm seeing something strange - although alignment/TopHat2 output is fine (90%+ alignment), 95% of those aligned reads are listed as 'no feature' by HTSeq, and not included in the final read tallies. As a result, the total number of reads in HTSeq's output for a sample is close to 1 million, not 30 million as I would expect.
Right now, I'm trying to figure out if the company that did the sequencing accidentally sequenced total RNA, rather than mRNA - so of course most of the reads would have 'no feature' when I count reads using the genes.gtf file, because maybe most of those RNAs are ribosomal RNA or something.
So, I'd like to re-try counting the aligned reads using a .gtf file that represents all RNA-encoding regions of the genome, not just the mRNA-encoding regions. Is this the WholeGenomeFasta.gff that comes with the GRCh37 iGenome package? Or is there another place to get such a file?
Thank you for your help!