I would like to filter variants / genotypes from the 1000 Genomes phase 3 data using a gene annotation model that matches the reference the data was called against.
The 1000 genomes data I am working with is from: s3://1000genomes/release/20130502/ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
In the header of the 1000 Genomes VCF file the reference is named hs37d5.fa.gz.
The first contig line has this information:
##contig=<ID=1,assembly=b37,length=249250621>
which lists b37 as the assembly.
Some searching online led me to GENCODE release 19:
http://www.gencodegenes.org/releases/19.html
This seems to be a matching gene annotation, except that it has "chr" in front of every chromosome name, which is luckily fixable with sed.
Another thing is that this file (gencode.v19.annotation.gtf) has 2,619,449 lines for the roughly 20,000 human genes, so it contains records not just for genes but also at a more detailed level (transcripts, exons, CDS and so on).
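For reference, this is roughly the renaming I have in mind, written as a small Python sketch rather than sed. The function names are my own, and it assumes that the only differences are the leading "chr" and the mitochondrial contig, which I believe is called MT in b37 but chrM in the GENCODE file. I have not checked it against the whole file.

```python
def strip_chr_prefix(line):
    """Return one GTF line with the 'chr' prefix removed from the seqname column."""
    # Comment/header lines start with '#' and are passed through unchanged.
    if line.startswith("#"):
        return line
    # Only the first tab-separated column (seqname) is touched.
    seqname, sep, rest = line.partition("\t")
    if seqname.startswith("chr"):
        seqname = seqname[3:]
    # Assumption: b37 names the mitochondrial contig MT, GENCODE uses chrM.
    if seqname == "M":
        seqname = "MT"
    return seqname + sep + rest


def convert(src, dst):
    """Copy every line from src to dst, renaming the chromosome column."""
    for line in src:
        dst.write(strip_chr_prefix(line))
```

It would be used with two open text files, e.g. `convert(open("gencode.v19.annotation.gtf"), open("gencode.v19.b37names.gtf", "w"))`, where the output file name is just a placeholder.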
Are there other publicly available gene annotations that can be used with the 1000 Genomes data / assembly (hs37d5 / b37)? (Preferably without needing sed to remove the leading "chr" from the chromosome names.)
And are there gene annotations that are not that detailed, i.e. just one record per gene? Or is there a way to convert the GENCODE file to this?
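One conversion I can think of is to keep only the lines whose feature type (the third GTF column) is "gene". A minimal sketch, assuming the file has exactly one such line per gene, which I have not verified; `gene_records` is just a name I made up:

```python
def gene_records(lines):
    """Yield only the GTF lines whose feature column (3rd column) is 'gene'."""
    for line in lines:
        # Skip the '#' header lines at the top of the file.
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 holds the feature type: gene, transcript, exon, ...
        if len(fields) >= 3 and fields[2] == "gene":
            yield line
```

Writing `gene_records(open("gencode.v19.annotation.gtf"))` to a new file would then give a gene-level GTF, but I would still prefer an annotation that is already distributed in that form.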