22 months ago
nlehmann ▴ 130

I could not find any information on 1) how the annotation files in the UCSC database are built and 2) what the differences are between the annotation files they offer.

Does anyone know whether they have their own pipeline to build an annotation? Can we find instructions on how they're built?

For example, when trying to download chicken annotation data, I go to https://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/ and then open the "genes/" folder. There I have a choice of three annotations:

  • galGal6.ensGene.gtf.gz
  • galGal6.ncbiRefSeq.gtf.gz
  • galGal6.refGene.gtf.gz

I am not sure which one I should use. They are very different; even the number of features (the number of lines in each file) varies widely:

  • 833601 galGal6.ensGene.gtf
  • 1768359 galGal6.ncbiRefSeq.gtf
  • 163989 galGal6.refGene.gtf

Any advice on that would be welcome.

22 months ago

Hi NathL,

Those files represent three annotation tables that make up the similarly named UCSC Genome Browser tracks, converted directly into GTF with the genePredToGtf command. These files were created because of an error in the Table Browser, where the geneId and transcriptId are identical, which produces improper GTF files.
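If you want to check whether a given GTF has the identical-ID problem, here is a minimal Python sketch. It assumes the standard nine-column, tab-separated GTF layout with `key "value";` attribute pairs. The function names and the two demo records are made up for illustration and are not taken from any UCSC file.

```python
import re

# Matches `key "value"` pairs in the ninth (attribute) column of a GTF line.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return the GTF attribute column as a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def count_identical_ids(lines):
    """Return (records where gene_id == transcript_id, records checked)."""
    total = 0
    same = 0
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        total += 1
        if attrs["gene_id"] == attrs["transcript_id"]:
            same += 1
    return same, total


# Two made-up records: the second repeats the transcript ID as the gene ID.
demo = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "GENE1"; transcript_id "NM_000001.1";\n',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "NM_000002.1"; transcript_id "NM_000002.1";\n',
]
print(count_identical_ids(demo))  # (1, 2)
```

To run it on a downloaded file, pass an open handle instead of the demo list, for example `gzip.open("galGal6.refGene.gtf.gz", "rt")` from the standard `gzip` module.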

Ensembl and NCBI both release independent gene annotations with different quality criteria. The annotations for the refGene (UCSC RefSeq) track were generated from UCSC's realignment of RefSeq RNAs with NM and NR accessions, ignoring the XM and XR accessions, which belong to the unvalidated gene prediction category. You can look at these specific datasets in the Genome Browser. The choice of gene set depends on your application and your tolerance of false positives versus potentially missing annotations. You can read more about their differences in the FAQ pages or the track description pages.
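To see how much of a RefSeq-based file comes from predicted rather than curated records, you could tally distinct transcripts by accession prefix. The sketch below is only an illustration: it assumes the transcript IDs in the GTF are RefSeq accessions (NM_/NR_ for curated, XM_/XR_ for predicted), and the function names and demo records are made up.

```python
import re
from collections import Counter

# Pulls the transcript_id value out of a GTF attribute column.
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')

CURATED = ("NM_", "NR_")    # curated mRNA / non-coding RNA accessions
PREDICTED = ("XM_", "XR_")  # predicted (model) mRNA / non-coding RNA accessions


def classify(transcript_id):
    """Label an accession as 'curated', 'predicted', or 'other'."""
    if transcript_id.startswith(CURATED):
        return "curated"
    if transcript_id.startswith(PREDICTED):
        return "predicted"
    return "other"


def count_transcript_classes(lines):
    """Count distinct transcripts per class (not lines per class)."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            continue
        match = TRANSCRIPT_RE.search(line)
        if match:
            seen.add(match.group(1))
    return Counter(classify(t) for t in seen)


# Made-up records: one curated transcript with two exons, two predicted ones.
demo = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "NM_000001.1";\n',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "NM_000001.1";\n',
    'chr1\tdemo\texon\t500\t600\t.\t+\t.\tgene_id "G2"; transcript_id "XM_000002.1";\n',
    'chr1\tdemo\texon\t700\t800\t.\t-\t.\tgene_id "G3"; transcript_id "XR_000003.1";\n',
]
counts = count_transcript_classes(demo)
print(counts["curated"], counts["predicted"], counts["other"])  # 1 2 0
```

Counting distinct transcripts rather than lines matters here, because the raw line count also reflects how many exon, CDS, and codon records each transcript has.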

Hope that helps!

Daniel Schmelter

UCSC Genome Browser Support Team

Thanks a lot, very helpful!


