Question: Human Genome Annotations
Daniel Standage (Davis, California, USA) wrote, 8.2 years ago:

I am trying to do a comparison of gene annotations for the latest release of the human genome with annotations from the previous release. I have rarely worked with human data before, so I wasn't sure where to start. I found this thread which provides a link for downloading some custom-generated GFF3 for the hg19 release (this seems to be the latest "official" release).

Getting data for the hg18 release hasn't been so easy. I checked out UCSC's download site, but found it very difficult to navigate. So then I tried Ensembl's FTP site and found the data ordered by date and organism (not labels like "hg18" or "hg19"). UCSC's site lists dates of the human genome releases, so I guess I could just download the annotations for the closest following Ensembl release...but then again, the dates on UCSC's site aren't exact and I'm not sure how quickly these data are integrated into the Ensembl data bank.

Does anyone have any tips for obtaining gene annotations for different releases of the human genome? Is there some simple documentation I'm missing, or is everything really as complicated as it seems?

gene annotation gff human genome • 6.1k views

What source/format are your current annotations in for comparison?

Reply written 8.2 years ago by Pi

@pi All I currently have is the GFF3 file of the hg19 release.

Reply written 8.2 years ago by Daniel Standage

Bert Overduin (Edinburgh Genomics, The University of Edinburgh) wrote, 8.2 years ago:

A few pointers:

  • hg18 = NCBI36, hg19 = GRCh37
  • You can find which Ensembl release is based on which genome assembly by clicking the 'View in Archive site' link at the bottom of this page
  • Note that a new Ensembl genebuild is done for human regularly (so, not only when there is a new assembly!) and that even between genebuilds the gene set is updated / patched. Therefore, almost every release has a different gene set.
  • Note also that the way Ensembl annotates genes is different from UCSC, and that the Ensembl automatic annotation is merged with manual annotation from the Havana group at the Sanger Institute. You can find a basic outline of the annotation process here. There are separate annotation strategies for immunoglobulin and T-cell receptor genes and for non-coding RNA genes.
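
The naming correspondence in the first bullet can be written down as a small lookup, which may help when matching UCSC downloads against Ensembl archive releases. This is only a minimal sketch: the dictionary and function names are made up for illustration, and it covers just the two assemblies mentioned above.

```python
# Assembly name equivalences stated above (UCSC name -> NCBI/GRC name).
UCSC_TO_GRC = {
    "hg18": "NCBI36",
    "hg19": "GRCh37",
}
# Reverse lookup (NCBI/GRC name -> UCSC name).
GRC_TO_UCSC = {grc: ucsc for ucsc, grc in UCSC_TO_GRC.items()}


def same_assembly(name_a, name_b):
    """Return True if two labels refer to the same assembly (hypothetical helper)."""
    # Normalise UCSC labels to the NCBI/GRC label; other labels pass through unchanged.
    canon_a = UCSC_TO_GRC.get(name_a, name_a)
    canon_b = UCSC_TO_GRC.get(name_b, name_b)
    return canon_a == canon_b


print(same_assembly("hg18", "NCBI36"))  # True
print(same_assembly("hg19", "NCBI36"))  # False
```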

So, I am afraid that things are probably more complicated than you had hoped for ....

Hope this helps.

brentp (Salt Lake City, UT) wrote, 8.2 years ago:

At the UCSC Table Browser you can download the various human releases in BED format, with the exons either packed into the extended BED columns or each as a separate row. I think this would be the easiest place to start. GFF makes things more complicated.
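
To make the "exons in the extended BED columns" option concrete, here is a rough sketch of expanding one 12-column BED line into one interval per exon. It assumes the standard BED12 layout (blockCount, blockSizes and blockStarts in columns 10 to 12); the function name and the example record are invented for illustration.

```python
def bed12_exons(line):
    """Yield (chrom, start, end, name, strand) for each block of a BED12 line.

    Coordinates stay 0-based and half-open, as in BED.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BED columns")
    chrom, chrom_start = fields[0], int(fields[1])
    name, strand = fields[3], fields[5]
    # blockSizes / blockStarts are comma-separated and often end with a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    for size, rel_start in zip(sizes, starts):
        start = chrom_start + rel_start  # blockStarts are relative to chromStart
        yield chrom, start, start + size, name, strand


# Made-up record for illustration only (not a real transcript).
example = "chr1\t1000\t5000\ttx1\t0\t+\t1200\t4800\t0\t3\t100,200,300,\t0,1500,3700,"
for exon in bed12_exons(example):
    print(exon)
```

Keeping the exons as plain tuples makes it easy to load two releases into sets or sorted lists for comparison later.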


@brentp Do you mean that GFF is a more complicated format (than BED), or that it's more complicated to obtain GFF3 for human (than BED)?

Reply written 8.2 years ago by Daniel Standage

@Daniel, both. Though you can get a nice GTF file from Ensembl for GRCh37. If you download the whole-gene BED from UCSC, it will likely have everything you need.
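
If you do go the GTF route, a first-pass comparison between two releases could look something like the sketch below. It assumes both files carry a `gene_id "..."` entry in the attribute column (column 9), as GTF files normally do; the function names and the toy records are made up for illustration. It compares by identifier only, since coordinates on NCBI36 and GRCh37 are not directly comparable.

```python
import re

# Matches the gene_id entry inside the GTF attribute column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_ids(gtf_lines):
    """Collect gene_id values from the attribute column (column 9) of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_gene_sets(old_lines, new_lines):
    """Split gene IDs into shared, old-only and new-only sets."""
    old, new = gene_ids(old_lines), gene_ids(new_lines)
    return {"shared": old & new, "only_old": old - new, "only_new": new - old}


# Toy records for illustration only (not real annotations).
old_release = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t500\t600\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
]
new_release = [
    'chr1\tsrc\texon\t150\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr2\tsrc\texon\t10\t90\t.\t+\t.\tgene_id "G3"; transcript_id "T3";',
]
result = compare_gene_sets(old_release, new_release)
print(sorted(result["shared"]), sorted(result["only_old"]), sorted(result["only_new"]))
```

With real data you would pass two open file handles instead of the toy lists. This only tells you which identifiers appear or disappear between releases, not whether a gene's structure changed.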

Reply written 8.2 years ago by brentp