Question: Gencode Annotations For 44 Regions Of The Human Genome
Daniel Standage (Davis, California, USA) wrote 8.6 years ago:

As part of the ENCODE pilot project, 44 regions representing about 1% of the human genome were selected for a community annotation effort. This paper describes how the community was involved in a sort of annotation contest and how the submitted annotations were compared against a high-quality reference: annotations generated by GENCODE with extensive manual curation and experimental validation.

It has been less than 5 years since this was published, and yet I am having serious issues tracking down the data associated with this pilot project. The paper refers the reader to this website, but I cannot find the data there. The page says it is no longer maintained and that the project has moved to Sanger. So... I poked around on Sanger's website for a while and finally found a link to this website, which seems to be the current home for GENCODE. Unfortunately, all the links I've followed (the Sanger FTP site and the UCSC ENCODE page) have taken me to data for the current (production) phase, not the pilot phase, which is what I'm looking for. I got my hopes up momentarily when I saw a link on the UCSC ENCODE page for "Pilot Project." The page described exactly what I was looking for, but I cannot seem to find anywhere that will allow me to download the data.

Can anyone point me to where I can download these data? In particular, I am looking for:

  • Nucleotide sequences for the 44 genomic regions released during the ENCODE pilot project and as part of the EGASP project/competition
  • The GENCODE annotations for these 44 regions that were used as a standard reference in the EGASP project/competition (hopefully in some common text tab-delimited format...or a format with at least some documentation)

Thanks!

Edit: After some suggestions and poking around, I was able to find the data I was looking for. The nucleotide sequences are at ftp://genome.imim.es/pub/projects/gencode/data/seqs/44_ENCODE_regions_NCBI35.fa and annotations for protein-coding genes can be found at ftp://genome.imim.es/pub/projects/gencode/data/havana-encode/version00.2_29apr05/ENCODE_coord/genes_with_cds/44regions_coding.gff.gz.
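
For anyone picking up the annotation file, a minimal Python sketch for reading a gzipped GFF might look like the following. It assumes the usual tab-delimited GFF layout (seqname, source, feature, start, end, score, strand, frame, optional attributes), which has not been checked against this particular file; the function names are hypothetical.

```python
import gzip
from collections import Counter


def parse_gff(lines):
    """Yield one dict per feature line of a tab-delimited GFF stream."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a standard feature line; ignore it
        yield {
            "seqname": fields[0],
            "source": fields[1],
            "feature": fields[2],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "strand": fields[6],
            # The attributes column is optional in older GFF versions.
            "attributes": fields[8] if len(fields) > 8 else "",
        }


def count_features_by_sequence(lines):
    """Count feature lines per sequence name."""
    return Counter(rec["seqname"] for rec in parse_gff(lines))


def count_features_in_file(path):
    """Same tally, reading from a local copy of a .gff.gz file."""
    with gzip.open(path, "rt") as handle:
        return count_features_by_sequence(handle)
```

Calling `count_features_in_file` on a local copy of `44regions_coding.gff.gz` would give a rough per-sequence tally to sanity-check against the 44 regions.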

Bert Overduin (Edinburgh Genomics, The University of Edinburgh) wrote 8.6 years ago:

Isn't this what you're after?

ftp://genome.imim.es/pub/projects/gencode/data/havana-encode/version03.1_mar07/

Cheers, Bert


@Bert This was indeed very close. I poked around in these directories and was able to find what I'm looking for. Thanks!

Reply written 8.6 years ago by Daniel Standage
Gjain (Munich, Germany) wrote 8.6 years ago:

Hi, you can go to the UCSC test browser (http://genome-test.cse.ucsc.edu/) and then go to the Table Browser. Under the group "genes and gene prediction tracks" you can choose "Gencode V3, V4 and V7 genes annotation". The assembly should be hg19. For the 44 ENCODE regions you need to click on the "define regions" button and then paste the coordinates of all 44 regions in the format "chrXYZ:start-end".
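
The region strings for the "define regions" box can be generated rather than typed by hand. A small Python sketch, assuming the coordinates are already 1-based and inclusive and held as (chromosome, start, end) tuples; the function name is hypothetical and the example coordinates are placeholders, not actual ENCODE region coordinates:

```python
def to_region_string(chrom, start, end):
    """Format one region as 'chrN:start-end'."""
    # Accept chromosome names with or without the 'chr' prefix.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    if start < 1 or end < start:
        raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
    return f"{chrom}:{start}-{end}"


# Placeholder coordinates for illustration only.
regions = [("chr1", 1000, 2000), ("X", 500, 1500)]
region_lines = [to_region_string(c, s, e) for c, s, e in regions]
print("\n".join(region_lines))
```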

There are many more tracks that are not public yet, but the Gencode annotations are available in this format.

Hope this helps.


Here are the coordinates for the 44 regions: http://genome.ucsc.edu/ENCODE/encode.hg18.html. These are in hg18. You can use the liftOver utility (the web tool, or the executables at http://hgdownload.cse.ucsc.edu/admin/exe/) to get the coordinates in hg19.
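
As a rough sketch of the command-line route, the Python below wraps the `liftOver oldFile map.chain newFile unMapped` invocation. It assumes the `liftOver` executable is on the PATH and that an hg18-to-hg19 chain file has already been downloaded from UCSC; the function names and all file names are hypothetical.

```python
import shutil
import subprocess


def build_liftover_command(bed_in, chain_file, bed_out, unmapped_out, exe="liftOver"):
    """Return the argument list for: liftOver oldFile map.chain newFile unMapped."""
    return [exe, bed_in, chain_file, bed_out, unmapped_out]


def run_liftover(bed_in, chain_file, bed_out, unmapped_out):
    """Run liftOver, failing early if the executable cannot be found."""
    cmd = build_liftover_command(bed_in, chain_file, bed_out, unmapped_out)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("liftOver executable not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True)
```

Regions that cannot be converted are written to the unmapped file, so it is worth checking that file is empty before using the lifted coordinates.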

Reply written 8.6 years ago by Gjain

@Gjain I actually do want to keep the hg18 coordinates since that is the assembly that was used during the EGASP project. I went to the table browser, selected the hg18 assembly, and this is looking promising. Under "region" there is an option for the ENCODE pilot regions, and under "track" there is an option for GENCODE genes. This may be what I'm looking for!

Reply written 8.6 years ago by Daniel Standage

Uh oh, it looks like those same options are available in the live UCSC genome browser. Oops...

Reply written 8.6 years ago by Daniel Standage

Oh, that's nice... yes, for hg18 all the options are available.

Reply written 8.6 years ago by Gjain

Please note that these versions (3, 4, and 7) are NOT the pilot phase data sets. The pilot phase website is here: http://www.sanger.ac.uk/resources/databases/encode/pilot.html and the data can be visualized by doing a track search for GENCODE or EGASP on the UCSC hg18 assembly. The website for the main phase of GENCODE is http://www.gencodegenes.org, as was mentioned.

Reply written 8.6 years ago by Felix

And I also discovered that hg17 was the assembly used for EGASP, not hg18.

Reply written 8.6 years ago by Daniel Standage