Gencode Annotations For 44 Regions Of The Human Genome
12.7 years ago

As part of the ENCODE pilot project, 44 regions representing about 1% of the human genome were selected for a community annotation effort. This paper describes how the community was involved in a sort of annotation contest and how the submitted annotations were compared against a high-quality reference: annotations generated by GENCODE with extensive manual curation and experimental validation.

It has been less than 5 years since this was published, and yet I am having serious issues tracking down the data associated with this pilot project. The paper refers the reader to this website, but I cannot find the data there. The page says it is no longer maintained and that the project has moved to the Sanger. So...I poked around on Sanger's website for a while and finally found a link to this website, which seems to be the current home for GENCODE. Unfortunately, all the links I've followed (the Sanger FTP site and the UCSC ENCODE page) have taken me to data for the current (production) phase, not the pilot phase, which is what I'm looking for. I got my hopes up momentarily when I saw a link on the UCSC ENCODE page for "Pilot Project." The page described exactly what I was looking for, but I cannot find anywhere to download the data.

Can anyone point me to where I can download these data? In particular, I am looking for:

  • Nucleotide sequences for the 44 genomic regions released during the ENCODE pilot project and as part of the EGASP project/competition
  • The GENCODE annotations for these 44 regions that were used as a standard reference in the EGASP project/competition (hopefully in some common tab-delimited text format...or a format with at least some documentation)

Thanks!

Edit: After some suggestions and poking around, I was able to find the data I was looking for. The nucleotide sequences are at ftp://genome.imim.es/pub/projects/gencode/data/seqs/44_ENCODE_regions_NCBI35.fa and annotations for protein-coding genes can be found at ftp://genome.imim.es/pub/projects/gencode/data/havana-encode/version00.2_29apr05/ENCODE_coord/genes_with_cds/44regions_coding.gff.gz.
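For anyone who downloads the annotation file, here is a minimal sketch of how it might be summarized in Python. It assumes the file follows the usual tab-delimited GFF layout (seqid, source, feature, start, end, score, strand, frame, attributes); I have not checked the actual contents, so the column meanings and the sample records below are assumptions made up for illustration.

```python
from collections import Counter


def parse_gff_line(line):
    """Split one GFF record into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        # Not a well-formed GFF record under the assumed layout
        return None
    return {
        "seqid": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "score": fields[5],
        "strand": fields[6],
        "frame": fields[7],
        "attributes": fields[8] if len(fields) > 8 else "",
    }


def count_features(lines):
    """Count records per (seqid, feature type) pair."""
    counts = Counter()
    for line in lines:
        record = parse_gff_line(line)
        if record is not None:
            counts[(record["seqid"], record["feature"])] += 1
    return counts


# Hypothetical records, NOT taken from the real file
example_lines = [
    "# made-up example\n",
    "region1\tsrc\tCDS\t100\t200\t.\t+\t0\tgene_id g1\n",
    "region1\tsrc\tCDS\t300\t400\t.\t+\t1\tgene_id g1\n",
    "region2\tsrc\texon\t50\t90\t.\t-\t.\tgene_id g2\n",
]
example_counts = count_features(example_lines)
```

To run this on the downloaded file, the lines can be read with `gzip.open(path, "rt")` from the standard library and passed to `count_features` directly.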

human encode ucsc • 3.7k views
12.7 years ago
Bert Overduin ★ 3.7k

Isn't this what you're after?

ftp://genome.imim.es/pub/projects/gencode/data/havana-encode/version03.1_mar07/

Cheers, Bert


@Bert This was indeed very close. I poked around in these directories and was able to find what I'm looking for. Thanks!

12.7 years ago
Gjain 5.8k

Hi, you can go to the UCSC test browser (http://genome-test.cse.ucsc.edu/) and then go to the Table Browser. Under the group "genes and gene prediction tracks" you can choose "Gencode V3, V4 and V7 genes annotation". The assembly should be hg19. For the 44 ENCODE regions, you need to click on the "define regions" button and then paste the coordinates of all 44 regions in the format "chrXYZ:start-end".
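If the region coordinates are available as a list, the "chrXYZ:start-end" strings can be generated rather than typed by hand. A minimal Python sketch follows; the function names and the two coordinates are placeholders made up for illustration, not real ENCODE regions.

```python
def format_region(chrom, start, end):
    """Return a region string in the "chrXYZ:start-end" form described above."""
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return f"{chrom}:{start}-{end}"


def format_regions(regions):
    """Join several (chrom, start, end) tuples, one region per line, for pasting."""
    return "\n".join(format_region(chrom, start, end) for chrom, start, end in regions)


# Placeholder coordinates, NOT actual ENCODE pilot regions
placeholder_regions = [
    ("chr1", 1000, 2000),
    ("chrX", 500, 900),
]
pasteable_text = format_regions(placeholder_regions)
```

The resulting `pasteable_text` holds one region per line and can be pasted into the "define regions" box.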

There are many more tracks which are not public yet. But the Gencode annotations are available in this format.

Hope this helps.


Here are the coordinates for the 44 regions, which are in hg18: http://genome.ucsc.edu/ENCODE/encode.hg18.html. You can use the liftOver utility (http://hgdownload.cse.ucsc.edu/admin/exe/) to get the coordinates in hg19.
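A rough Python sketch of preparing input for the command-line liftOver tool, which is invoked as `liftOver oldFile map.chain newFile unMapped`. It assumes the regions are 1-based "chr:start-end" positions, so the start is shifted by one to give BED's 0-based start. All file names are hypothetical, the chain file name is my assumption based on UCSC's usual naming, and the coordinates are placeholders.

```python
def region_to_bed_line(chrom, start, end):
    """Convert a 1-based inclusive region to a tab-delimited BED line (0-based start)."""
    if start < 1 or start > end:
        raise ValueError(f"invalid 1-based region {chrom}:{start}-{end}")
    return f"{chrom}\t{start - 1}\t{end}"


def liftover_command(bed_in, chain_file, bed_out, unmapped):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", bed_in, chain_file, bed_out, unmapped]


# Placeholder coordinates, NOT actual ENCODE pilot regions
bed_text = "\n".join(
    region_to_bed_line(chrom, start, end)
    for chrom, start, end in [("chr1", 1000, 2000), ("chrX", 500, 900)]
)

# Hypothetical file names; the chain file name is an assumption
command = liftover_command(
    "regions_hg18.bed",
    "hg18ToHg19.over.chain.gz",
    "regions_hg19.bed",
    "unmapped.bed",
)
```

After writing `bed_text` to the input file, the list in `command` could be passed to `subprocess.run`, provided the liftOver binary and the chain file have been downloaded.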


@Gjain I actually do want to keep the hg18 coordinates since that is the assembly that was used during the EGASP project. I went to the table browser, selected the hg18 assembly, and this is looking promising. Under "region" there is an option for the ENCODE pilot regions, and under "track" there is an option for GENCODE genes. This may be what I'm looking for!


Uh oh, it looks like those same options are available in the live UCSC genome browser. Oops...


Oh, that's nice... yes, for hg18 all the options are available.


Please note that versions 3, 4, and 7 are NOT the pilot phase data sets. The pilot phase website is here: http://www.sanger.ac.uk/resources/databases/encode/pilot.html and the data can be visualized by doing a track search for GENCODE or EGASP on the UCSC hg18 assembly. The website for the main phase of GENCODE is http://www.gencodegenes.org, as was mentioned.


And I also discovered that hg17 was the assembly used for EGASP, not hg18.

