Where to find a gene annotation file for 1000 Genomes phase 3 data (hs37d5 / b37 assembly)?
6.8 years ago
William ★ 5.1k

I would like to filter variants / genotypes from the 1000 Genomes phase 3 data using a gene annotation model that matches the reference used for that data.

The 1000 genomes data I am working with is from: s3://1000genomes/release/20130502/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

In the header of the 1000 genomes vcf file the assembly is named hs37d5.fa.gz

The first contig has this information

##contig=<ID=1,assembly=b37,length=249250621>

which lists b37 as the assembly.
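For checking which assembly a VCF was called against, the contig header lines can be read programmatically. The following is a minimal sketch: `parse_contig_line` is a hypothetical helper name, and it assumes simple `key=value` fields with no quoted values containing commas, which holds for the line shown above but not necessarily for every VCF.

```python
import re

# Matches a VCF header line of the form ##contig=<key=value,key=value,...>
CONTIG_RE = re.compile(r"^##contig=<(.+)>$")


def parse_contig_line(line):
    """Return the fields of a ##contig header line as a dict, or None.

    Assumption: values do not contain commas or quotes.
    """
    match = CONTIG_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    for pair in match.group(1).split(","):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields


example = "##contig=<ID=1,assembly=b37,length=249250621>"
info = parse_contig_line(example)
print(info["ID"], info["assembly"], info["length"])
```

The contig ID here is `1` rather than `chr1`, which is the naming that any annotation file has to match.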

Some searching online led me to

http://www.gencodegenes.org/releases/19.html

This seems to be a matching gene annotation, except that it has "chr" in front of every chromosome name, which is luckily fixable with sed.
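As an alternative to sed, the renaming can be sketched in Python. The function names below are hypothetical, and the mapping of `chrM` to `MT` is an assumption about how the mitochondrial contig is named in the two files; it is worth checking against the contig lines in the VCF header before relying on it.

```python
def strip_chr_prefix(gtf_line):
    """Rename the sequence in column 1 of a GTF line from 'chrN' to 'N'.

    Comment/header lines (starting with '#') are returned unchanged.
    Assumption: the mitochondrial contig is 'chrM' in the GTF and 'MT'
    in the b37 / hs37d5 VCF.
    """
    if gtf_line.startswith("#"):
        return gtf_line
    seqname, sep, rest = gtf_line.partition("\t")
    if seqname == "chrM":
        seqname = "MT"
    elif seqname.startswith("chr"):
        seqname = seqname[3:]
    return seqname + sep + rest


def convert_gtf(in_path, out_path):
    """Write a copy of the GTF at in_path with renamed sequences."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_chr_prefix(line))


# Example call (the output file name is illustrative):
# convert_gtf("gencode.v19.annotation.gtf", "gencode.v19.annotation.nochr.gtf")
```

Only the first column is touched, so attribute values that happen to contain the text "chr" are left alone, which a blanket search-and-replace would not guarantee.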

Another issue is that this file (gencode.v19.annotation.gtf) has 2,619,449 lines for the roughly 20,000 human genes, so it contains records not just for genes but also for finer-grained features such as transcripts and exons.

Are there other publicly available gene annotations that can be used with the 1000 Genomes data / assembly (hs37d5 / b37), preferably ones that do not need sed to strip the leading "chr" from chromosome names?

And are there gene annotations that are less detailed, i.e. just one record per gene? Or is there a way to convert the detailed file to that form?
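One possible way to get one record per gene is to keep only the GTF lines whose feature type (third column) is `gene`. The sketch below assumes the GTF marks gene-level records this way and stores `gene_id` and `gene_name` as quoted attributes in the ninth column; `gene_records` is a hypothetical helper, and the sample lines are illustrative rather than copied from the real file.

```python
import re

# Captures quoted GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def gene_records(lines):
    """Yield one dict per GTF line whose feature type is 'gene'."""
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        yield {
            "chrom": cols[0],
            "start": int(cols[3]),  # GTF coordinates are 1-based, inclusive
            "end": int(cols[4]),
            "strand": cols[6],
            "gene_id": attrs.get("gene_id"),
            "gene_name": attrs.get("gene_name"),
        }


# Illustrative sample lines in GTF layout (not real file content).
sample = [
    "##description: example\n",
    'chr1\tHAVANA\tgene\t11869\t14412\t.\t+\t.\tgene_id "ENSG00000223972.4"; gene_name "DDX11L1"; level 2;\n',
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972.4"; gene_name "DDX11L1"; level 2;\n',
]
genes = list(gene_records(sample))
print(genes)
```

To process the real file, pass an open file handle instead of the sample list, for example `gene_records(open("gencode.v19.annotation.gtf"))`.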
