Recently I've been using Cufflinks to analyze some mouse RNA-seq data. However, when trying to decide which GTF file to use, I ran into a problem: the available files all differ from one another! For example, the latest RefGene GTF from UCSC has 267,796 lines, while the one from NCBI has 1,259,220. A difference of about 1 million entries seems very likely to affect the Cufflinks results.
So my question is: what do these different files include, and where is each of them explained? I've never been able to find a site that explains what is included in each annotation (i.e. protein-coding genes, non-coding genes, pseudogenes, etc.). Also, where can I find a GTF file that reports only manually curated genes (Vega?)?
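Since raw line counts don't say much about what a file contains, one way to compare annotations is to tally each file directly. Here is a minimal Python sketch, not a definitive tool. It assumes the standard 9-column tab-separated GTF layout, and it assumes the biotype is stored in a `gene_biotype` attribute (Ensembl style) or a `gene_type` attribute (GENCODE style). Some files may carry neither, in which case every gene is reported as `unknown`. The function name `summarize_gtf` is just something made up for illustration.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def summarize_gtf(lines):
    """Tally feature types, distinct gene_ids, and gene biotypes in GTF lines."""
    features = Counter()   # column 3: gene, transcript, exon, CDS, ...
    biotypes = Counter()   # one count per distinct gene_id
    genes = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        features[fields[2]] += 1
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is not None and gene_id not in genes:
            genes.add(gene_id)
            # Assumption: biotype lives in one of these two attributes, if at all
            biotype = attrs.get("gene_biotype") or attrs.get("gene_type") or "unknown"
            biotypes[biotype] += 1
    return features, len(genes), biotypes
```

To use it, pass an open file handle, e.g. `with open("genes.gtf") as fh: features, n_genes, biotypes = summarize_gtf(fh)` (the filename is a placeholder). Comparing the number of distinct genes and the feature breakdown between two files is usually more informative than comparing line counts, because one file may list only exons while another also lists CDS, start/stop codon, and transcript records for the same genes.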
Also, in your experience, what would be the best GTF file to use?
Thanks! Any comments will be highly appreciated.
Updated link for tophat iGenomes, now: