I am trying to do a comparison of gene annotations for the latest release of the human genome with annotations from the previous release. I have rarely worked with human data before, so I wasn't sure where to start. I found this thread which provides a link for downloading some custom-generated GFF3 for the hg19 release (this seems to be the latest "official" release).

Getting data for the hg18 release hasn't been so easy. I checked out UCSC's download site, but found it very difficult to navigate. So then I tried Ensembl's FTP site and found the data ordered by date and organism (not labels like "hg18" or "hg19"). UCSC's site lists dates of the human genome releases, so I guess I could just download the annotations for the closest following Ensembl release...but then again, the dates on UCSC's site aren't exact and I'm not sure how quickly these data are integrated into the Ensembl data bank.

Does anyone have any tips for obtaining gene annotations for different releases of the human genome? Is there some simple documentation I'm missing, or is everything really as complicated as it seems?

A few pointers:

  • hg18 = NCBI36, hg19 = GRCh37
  • You can find out which Ensembl release is based on which genome assembly by clicking the 'View in Archive site' link at the bottom of this page
  • Note that a new Ensembl genebuild for human is done regularly (so, not only when there is a new assembly!) and that even between genebuilds the gene set is updated / patched. Therefore, almost every release has a different gene set.
  • Note also that the way Ensembl annotates genes is different from UCSC, and that the Ensembl automatic annotation is merged with manual annotation from the Havana group at the Sanger Institute. You can find a basic outline of the annotation process here. There are separate annotation strategies for immunoglobulin and T-cell receptor genes and for non-coding RNA genes.

So, I am afraid that things are probably more complicated than you had hoped for ....

Hope this helps.

In the UCSC Table Browser you can download BED files for the various human releases, with the exons either encoded in the extended BED columns or listed each as a separate row. I think this would be the easiest place to start. GFF makes things more complicated.
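
If you take the extended-column (12-column BED) download, the exons sit in the blockCount, blockSizes and blockStarts columns, with blockStarts given relative to chromStart. Here is a minimal Python sketch of how those columns could be expanded into one interval per exon. The function name `bed12_exons` and the example line are made up for illustration, and the code assumes tab-separated input with no header or track lines.

```python
from typing import List, Tuple

Exon = Tuple[str, int, int, str, str]  # (chrom, start, end, name, strand)


def bed12_exons(line: str) -> List[Exon]:
    """Expand one 12-column BED line into one interval per exon (block).

    Coordinates stay in BED convention: 0-based start, exclusive end.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated BED columns")

    chrom = fields[0]
    chrom_start = int(fields[1])
    name = fields[3]
    strand = fields[5]
    block_count = int(fields[9])

    # blockSizes and blockStarts are comma-separated lists that may carry
    # a trailing comma, so empty items are skipped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("blockCount does not match blockSizes/blockStarts")

    # blockStarts are offsets from chromStart, not absolute positions.
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size, name, strand)
        for offset, size in zip(starts, sizes)
    ]


if __name__ == "__main__":
    # Hypothetical two-exon transcript, not real annotation data.
    example = "chr1\t1000\t5000\tgeneA\t0\t+\t1200\t4800\t0\t2\t500,1000,\t0,3000,"
    for exon in bed12_exons(example):
        print(*exon, sep="\t")
```

Running the same function over the hg18 and hg19 downloads would give you two exon tables in the same shape, which is probably easier to work with than two differently structured GFF files.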

