Estimating site information from 1000 genomes vcf data
3.7 years ago
spiral01 ▴ 100

Hi. I am working with the phase 3 1000 Genomes VCF data (available here: ) and need to estimate the number of synonymous and non-synonymous sites.

For my analysis, if I knew what type of site each variant falls on (e.g. a non-synonymous change at a 0-fold site, or a synonymous change at a 2-fold site), I could restrict my analysis to 0-fold and 4-fold sites and simply count those sites and the number of polymorphisms at them.

However, I do not have complete codon information. The VCF files provide the reference allele and the alternative allele, but not the codon in which the allele is located (which I would need to work out whether a site is 0-fold etc.). Is there any way of obtaining this information? I know UCSC has this data, but their set of alleles seems to be incomplete compared with the data taken directly from 1000 Genomes.

If this is not possible, I would be grateful for any other suggested methods that might work.

3.7 years ago

Can't you just annotate the VCF files using VEP or SnpEff? That will give you the amino acid substitution and the predicted impact of each mutation.
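In case it helps, here is a rough sketch of pulling the codon change out of a VEP-annotated VCF record. It assumes VEP was run with VCF output, so that the annotations sit in a `CSQ` INFO entry whose sub-field order is listed after "Format: " in the header, and that a `Codons` sub-field is present. The header and INFO strings below are made up for illustration, and the function names are my own, not part of VEP.

```python
def parse_csq_format(header_line):
    """Return the CSQ sub-field names listed in the ##INFO=<ID=CSQ,...> header line."""
    # The sub-field order follows "Format: " inside the Description string.
    fmt = header_line.split("Format: ", 1)[1]
    fmt = fmt.rstrip().rstrip(">").rstrip('"')
    return fmt.split("|")


def codon_changes(info_field, csq_fields):
    """Yield (consequence, ref_codon, alt_codon) for each annotated transcript."""
    for entry in info_field.split(";"):
        if not entry.startswith("CSQ="):
            continue
        # One comma-separated record per transcript/allele combination.
        for record in entry[len("CSQ="):].split(","):
            values = dict(zip(csq_fields, record.split("|")))
            codons = values.get("Codons", "")
            if "/" in codons:  # empty for non-coding consequences
                ref_codon, alt_codon = codons.split("/", 1)
                yield values.get("Consequence", ""), ref_codon, alt_codon


# Hypothetical header and INFO strings, made up for illustration only.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: '
          'Allele|Consequence|Codons|Amino_acids">')
info = "AC=5;CSQ=T|missense_variant|aCg/aTg|T/M,T|synonymous_variant|gcC/gcT|A/A"

fields = parse_csq_format(header)
results = list(codon_changes(info, fields))
print(results)
```

A real file will have many more sub-fields than this toy header, which is why the sketch reads the order from the header rather than hard-coding positions.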


Hi, I've just seen that nestled in the annotation is the codon information as you suggested. Many thanks, and apologies for the unnecessary question!
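For anyone finding this later: once the codon is known, the fold-degeneracy of a site can be worked out from the genetic code. Below is a minimal sketch assuming the standard (nuclear) genetic code; the `degeneracy` function and its convention of reporting a site as 0-, 2-, 3- or 4-fold are illustrative choices, not taken from any annotation tool.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def degeneracy(codon, pos):
    """Return the fold-degeneracy (0, 2, 3 or 4) of position `pos` (0-2) in `codon`."""
    codon = codon.upper()  # annotation tools may mix case to mark the changed base
    aa = CODON_TABLE[codon]
    # Count how many of the four bases at this position encode the same amino acid
    # (the original base is included in the count).
    same = sum(
        CODON_TABLE[codon[:pos] + base + codon[pos + 1:]] == aa
        for base in BASES
    )
    # Only the original base works -> every change is non-synonymous -> 0-fold.
    return 0 if same == 1 else same


print(degeneracy("GCC", 2))  # third position of an alanine codon
print(degeneracy("aCg", 1))  # second position of a threonine codon
```

Restricting an analysis to 0-fold and 4-fold sites then amounts to keeping positions where this function returns 0 or 4.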

