I need to extract polymorphism data from the 1000 Genomes data for about 85 coding genes.
In particular, what I need for each gene is (1) silent polymorphisms, (2) amino acid changing polymorphisms, (3) stop-inducing polymorphisms and (4) global allele frequencies (I guess, for n = 1029).
I know I can get this information through the official 1000 Genomes browser or the Ensembl website. But is there any way to automate this process and do it all in one go?
I thought of the following strategy, but perhaps you can suggest a cleaner and faster way.
- Get chromosomal position data for each gene (i.e., exon start / end, +/- strand) [from UCSC perhaps?!]
- Download the genotype files (VCF) for each chromosome from 1000 Genomes [phase 1, release v3, March 2011 calls, or should I just stick to high coverage data?]
- Pick the VCF for Chr 1; check whether each SNP falls within an exon of one of my genes, and if it does, note it down.
- Among the noted SNPs, parse allele variant and allele frequency (AF)
- Determine the amino acid position that corresponds to each SNP
- Check the resulting amino acid state when the variant is introduced (be careful about +/- strand)
- Classify the polymorphisms and report the corresponding AF
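
To make the last few steps concrete, here is a minimal Python sketch of how I imagine the position mapping and classification could work. It is only an illustration under several assumptions: the function names (`parse_af`, `cds_offset`, `classify`) are made up, the coding blocks are given as 1-based inclusive (start, end) pairs sorted by genomic position, the coding sequence is already spliced and in transcript orientation, each variant is a single-base SNP with its alternate allele reported on the + strand, and the standard genetic code applies. Loading the real exon coordinates and VCF files is not shown.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_af(info):
    """Return the AF value from a VCF INFO string, or None if absent.

    For multi-allelic records only the first AF value is returned.
    """
    for field in info.split(";"):
        if field.startswith("AF="):
            return float(field[3:].split(",")[0])
    return None


def cds_offset(pos, cds_blocks, strand):
    """Map a 1-based genomic position to a 0-based offset in the spliced CDS.

    cds_blocks: (start, end) pairs, 1-based inclusive, ascending genomic order.
    Returns None when the position is outside every coding block.
    """
    # On the - strand the transcript starts at the last genomic block.
    blocks = cds_blocks if strand == "+" else list(reversed(cds_blocks))
    offset = 0
    for start, end in blocks:
        if start <= pos <= end:
            return offset + (pos - start if strand == "+" else end - pos)
        offset += end - start + 1
    return None


def classify(cds_seq, offset, alt, strand):
    """Classify a SNP; returns (class, amino acid position, ref aa, alt aa).

    cds_seq is in transcript orientation; alt is the + strand allele, so it
    is complemented for genes on the - strand.
    """
    if strand == "-":
        alt = alt.translate(COMPLEMENT)
    codon_start = offset - offset % 3
    ref_codon = cds_seq[codon_start:codon_start + 3]
    i = offset % 3
    alt_codon = ref_codon[:i] + alt + ref_codon[i + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "stop-inducing"
    else:
        kind = "amino acid changing"  # also covers loss of a stop codon
    return kind, codon_start // 3 + 1, ref_aa, alt_aa


# Toy example (made-up coordinates): a 9 bp CDS split over two blocks.
blocks = [(101, 106), (201, 203)]
off = cds_offset(106, blocks, "+")
print(classify("ATGTGGTAA", off, "A", "+"), parse_af("AC=2;AF=0.01;AN=200"))
```

In the toy example, position 106 is the third base of the second codon (TGG), and the alternate allele A turns it into TGA, so the SNP is reported as stop-inducing at amino acid position 2.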