Question

Which .gtf file should I use when aligning/counting total RNA?

0

Entering edit mode

8.6 years ago

Kristin Muench ▴ 620

Hello,

I'm trying to figure out if my dataset was collected from mRNA, or tRNA. My pipeline is TopHat2 for alignment and htseq-count for read counting/generation of DESeq2 input. Previously, I had aligned and counted the data using the genes.gtf file from the GRCh37 iGenomes build: http://support.illumina.com/sequencing/sequencing_software/igenome.html. As I understood it, this should represent all mRNA-encoding regions of the genome.

However, I'm seeing something strange - although alignment/TopHat2 output is fine (90%+ alignment), 95% of those aligned reads are listed as 'no feature' by HTSeq, and not included in the final read tallies. As a result, the total number of reads in HTSeq's output for a sample is close to 1 million, not 30 million as I would expect.

Right now, I'm trying to figure out if the company that did the sequencing accidentally sequenced total RNA, rather than mRNA - so of course most of the reads would have 'no feature' when I count reads using the genes.gtf file, because maybe most of those RNAs are ribosomal RNA or something.

So, I'd like to re-try counting the aligned reads using a .gtf file that represents all RNA-encoding regions of the genome, not just the mRNA-encoding regions. Is this the WholeGenomeFasta.gff that comes with the GRCh37 iGenome package? Or is there another place to get such a file?

Thank you for your help!

RNA-Seq • 2.9k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.6 years ago by Kristin Muench ▴ 620

0

Entering edit mode

Make sure you are using the same version of genome and annotations. One more thing you could try is to just align the data to transcriptome and see the mapping statistics.

ADD REPLY • link 8.6 years ago by GouthamAtla 12k

0

Entering edit mode

Thank you! All the files I'm using (genome and annotations) are from the same iGenome GRCh37 files, so I think they should be the same. Could you explain what you mean by aligning the data to the transcriptome -i.e., how is that different from what TopHat is doing? Do you mean making the data into a BigWig file and looking at that in a genome browser?

ADD REPLY • link 8.6 years ago by Kristin Muench ▴ 620

0

Entering edit mode

The annotation and fasta reference should not only be from the same genome version, but also from the same place - you should not use UCSC reference with an Ensembl annotation, even if they are from the same genome version.

ADD REPLY • link 8.6 years ago by h.mon 35k

0

Entering edit mode

I mean to say align the reads to transcriptome using bowtie2 or bwa, instead of genome. This helps in finding if the reads arises from mRNA. You could get the transcriptome from utilities like gffread. Tophat also builds a transcriptome.fa while aligning the reads, which can be saved.

ADD REPLY • link 8.6 years ago by GouthamAtla 12k