Question

Crossing Wires with GTF files

22 months ago

joseph.landry ▴ 50

Hi All,

I have a question about using GTF files in RNA-Seq analysis. I built an index for STAR using the mm39 files provided by the UCSC Genome Browser. The GTF from that bundle is the following: ...mm39_RefGene.gtf

I use that index to align my paired-end (PE) reads and get sorted BAM files.

When I run featureCounts, do I need to use the same GTF file that I used to build the index for mapping? Sometimes I would like to get counts against the RefGene data set, so I use the following GTF file from NCBI: .....mm39.ncbiRefSeq.gtf

Will I get incorrect counts if I do that? If I want counts against mm39.ncbiRefSeq.gtf, do I need to map using a STAR index created with mm39.ncbiRefSeq.gtf? Or can I use my existing STAR genome index that was made with mm39_RefGene.gtf?

Best,

Joe

featureCounts STAR GTF • 839 views

ADD COMMENT • link updated 22 months ago by i.sudbery 19k • written 22 months ago by joseph.landry ▴ 50

You need to use the same reference for all downstream analysis since there will be important differences such as gene models and syntax (like chromosome names) that make them incompatible.
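One quick way to check the chromosome-name point is to compare the sequence names (column 1) used by the two annotations. The following is only a minimal sketch: the helper name `gtf_seqnames` and the two tiny in-memory GTF snippets are made up for illustration, and in practice you would pass open file handles for your own GTF files instead.

```python
import io


def gtf_seqnames(handle):
    """Return the set of sequence names (column 1) used in a GTF stream."""
    names = set()
    for line in handle:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


# Hypothetical stand-ins for two annotation files with different naming styles.
index_gtf = io.StringIO(
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n'
    'chr2\tsrc\texon\t300\t400\t.\t-\t.\tgene_id "B"; transcript_id "B.1";\n'
)
count_gtf = io.StringIO(
    '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n'
    '2\tsrc\texon\t300\t400\t.\t-\t.\tgene_id "B"; transcript_id "B.1";\n'
)

index_names = gtf_seqnames(index_gtf)
count_names = gtf_seqnames(count_gtf)
shared = index_names & count_names
only_in_count = count_names - index_names

print("shared sequence names:", sorted(shared))
print("only in counting GTF:", sorted(only_in_count))
```

If the shared set is empty or much smaller than expected, the counting GTF will not line up with reads aligned against the other reference.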

As a side note, if all you care about is gene-level quantification, consider Salmon, since it provides more accurate abundance estimates.

ADD REPLY • link 22 months ago by rpolicastro 13k

score 1 · Answer 1 · 2022-06-08

Using one GTF to build the STAR index and a different one for quantification will run fine and without errors. It will mostly be accurate, but you will bias away from assigning reads to genes in the GTF used for featureCounts that contain exon/intron junctions not found in the GTF used to build the STAR index. This is because STAR requires less evidence to map a read spliced across two exons if that junction is present in the annotation given to it at index construction. Reads that map across junctions not in the annotation can still be mapped, but they require more evidence. Thus, you are more likely to map reads to transcripts that are in the STAR annotation, which will lead to higher read counts for those genes when you run featureCounts, whereas transcripts that are in your featureCounts set but not your STAR set will get slightly fewer reads assigned to them.
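To see which genes could be affected, you could list the junctions implied by the counting GTF that are absent from the GTF used for the index. This is a rough sketch rather than a tested tool: `junctions_by_gene`, `gtf_line` and the toy annotations are hypothetical, and it assumes each exon line carries `gene_id` and `transcript_id` attributes.

```python
import io
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def junctions_by_gene(handle):
    """Map gene_id -> set of (chrom, strand, intron_start, intron_end)."""
    exons = defaultdict(list)  # one exon list per transcript
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        attrs = dict(ATTR.findall(f[8]))
        key = (attrs.get("gene_id"), attrs.get("transcript_id"), f[0], f[6])
        exons[key].append((int(f[3]), int(f[4])))
    result = defaultdict(set)
    for (gene, _tx, chrom, strand), spans in exons.items():
        spans.sort()
        # An intron runs from one exon's end + 1 to the next exon's start - 1.
        for (_, end), (start, _) in zip(spans, spans[1:]):
            result[gene].add((chrom, strand, end + 1, start - 1))
    return result


def gtf_line(chrom, start, end, gene, tx):
    """Build one toy exon line (illustration only)."""
    return (f'{chrom}\tsrc\texon\t{start}\t{end}\t.\t+\t.\t'
            f'gene_id "{gene}"; transcript_id "{tx}";\n')


# Toy annotation used for the index: one two-exon transcript.
index_gtf = io.StringIO(
    gtf_line("chr1", 100, 200, "GeneA", "A.1")
    + gtf_line("chr1", 300, 400, "GeneA", "A.1")
)
# Toy annotation used for counting: adds an isoform with a new junction.
count_gtf = io.StringIO(
    gtf_line("chr1", 100, 200, "GeneA", "A.1")
    + gtf_line("chr1", 300, 400, "GeneA", "A.1")
    + gtf_line("chr1", 100, 200, "GeneA", "A.2")
    + gtf_line("chr1", 500, 600, "GeneA", "A.2")
    + gtf_line("chr1", 900, 950, "GeneB", "B.1")
)

index_junctions = set().union(*junctions_by_gene(index_gtf).values())
unannotated = {
    gene: sorted(juncs - index_junctions)
    for gene, juncs in junctions_by_gene(count_gtf).items()
    if juncs - index_junctions
}
print(unannotated)
```

Genes that appear in `unannotated` are the ones where reads spanning the listed junctions would need more evidence to be mapped with the existing index.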

This may not matter if your intention is to carry out differential gene expression analysis: as long as you do the same for test and treatment samples, the bias should be the same in both conditions and therefore cancel out in the comparison, although you will have less power to detect differences in genes where the annotations differ. However, I wouldn't recommend it if you are interested in comparing the expression of two genes within samples, in splicing analysis, or in any analysis that relies either on distinguishing splice isoforms or on knowing absolute expression levels.

Just as a side note, you might like to compare mm39_RefGene.gtf to mm39.ncbiRefSeq.gtf. I'm not entirely convinced they are not the same thing.
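A simple way to make that comparison is to reduce each GTF to its set of exon coordinates and look at the overlap, which ignores differences in the source column or attribute formatting. A minimal sketch, with a hypothetical helper `exon_set` and toy in-memory data standing in for the two real files:

```python
import io


def exon_set(handle):
    """Collect (chrom, start, end, strand) for every exon line in a GTF stream."""
    exons = set()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) >= 9 and f[2] == "exon":
            exons.add((f[0], int(f[3]), int(f[4]), f[6]))
    return exons


# Toy stand-ins for the two annotations (not real data).
gtf_a = io.StringIO(
    'chr1\tsrcA\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n'
    'chr1\tsrcA\texon\t300\t400\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n'
)
gtf_b = io.StringIO(
    'chr1\tsrcB\texon\t100\t200\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n'
    'chr1\tsrcB\texon\t300\t400\t.\t+\t.\tgene_id "A"; transcript_id "A.1";\n'
    'chr1\tsrcB\texon\t500\t600\t.\t+\t.\tgene_id "A"; transcript_id "A.2";\n'
)

exons_a = exon_set(gtf_a)
exons_b = exon_set(gtf_b)
shared = exons_a & exons_b
only_a = exons_a - exons_b
only_b = exons_b - exons_a

print("shared exons:", len(shared))
print("only in A:", len(only_a))
print("only in B:", len(only_b))
```

If both "only in" counts are zero for your real files, the two annotations describe the same exons and the choice between them should not affect counting.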