Question: Which .gtf file should I use when aligning/counting total RNA?
0
gravatar for Kristin Muench
4.2 years ago by
United States
Kristin Muench470 wrote:

Hello,

I'm trying to figure out if my dataset was collected from mRNA, or tRNA. My pipeline is TopHat2 for alignment and htseq-count for read counting/generation of DESeq2 input. Previously, I had aligned and counted the data using the genes.gtf file from the GRCh37 iGenomes build: http://support.illumina.com/sequencing/sequencing_software/igenome.html . As I understood it, this should represent all mRNA-encoding regions of the genome.

However, I'm seeing something strange - although alignment/TopHat2 output is fine  (90%+ alignment), 95% of those aligned reads are listed as 'no feature' by HTSeq, and not included in the final read tallies. As a result, the total number of reads in HTSeq's output for a sample is close to 1 million, not 30 million as I would expect.

Right now, I'm trying to figure out if the company that did the sequencing accidentally sequenced total RNA, rather than mRNA - so of course most of the reads would have 'no feature' when I count reads using the genes.gtf file, because maybe most of those RNAs are ribosomal RNA or something.

So, I'd like to re-try counting the aligned reads using a .gtf file that represents all RNA-encoding regions of the genome, not just the mRNA-encoding regions. Is this the WholeGenomeFasta.gff that comes with the GRCh37 iGenome package? Or is there another place to get such a file?

Thank you for your help!

 

rna-seq • 1.4k views
ADD COMMENTlink written 4.2 years ago by Kristin Muench470

Make sure you are using the same version of genome and annotations. One more thing you could try is to just align the data to transcriptome and see the mapping statistics.

ADD REPLYlink written 4.2 years ago by geek_y10k

Thank you! All the files I'm using (genome and annotations) are from the same iGenome GRCh37 files, so I think they should be the same. Could you explain what you mean by aligning the data to the transcriptome -i.e., how is that different from what TopHat is doing? Do you mean making the data into a BigWig file and looking at that in a genome browser?

ADD REPLYlink written 4.2 years ago by Kristin Muench470

The annotation and fasta reference should not only be from the same genome version, but also from the same place - you should not use UCSC reference with an Ensembl annotation, even if they are from the same genome version.

ADD REPLYlink written 4.2 years ago by h.mon28k

I mean to say align the reads to transcriptome using bowtie2 or bwa, instead of genome. This helps in finding if the reads arises from mRNA. You could get the transcriptome from utilities like gffread. Tophat also builds a transcriptome.fa while aligning the reads, which can be saved.

ADD REPLYlink written 4.2 years ago by geek_y10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 770 users visited in the last hour