I would like to get the alternate allele counts (AC) and the total allele counts (AN) for any variant in each of the five 1000 Genomes super-populations (AFR, AMR, EAS, EUR, SAS) as well as the global population (ALL).
1000 Genomes offers an Allele Frequency Calculator, which gives output for the global population (ALL) and each sub-population (ACB, ASW, BEB, etc.) like the following:
    CHR  POS    ID  REF  ALT  ALL_POP_TOTAL_CNT  ALL_POP_ALT_CNT  ALL_POP_FRQ  ...
    1    10177  .   A    AC   5008               2130             0.43         ...
This gives me exactly what I need, but ideally I would like a solution that I can implement in a pipeline (i.e., independent of the online interface), perhaps using vcftools or bcftools. I know I can sum the values for the sub-populations to get the values for each respective super-population, but I wonder whether there is a simpler or faster way that I'm missing.
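For reference, the summing approach I have in mind would look roughly like the sketch below. The function name `aggregate_counts` and the example counts are hypothetical (the numbers are made up for illustration, not taken from real data); the population-to-super-population mapping follows the standard 1000 Genomes Phase 3 grouping.

```python
from collections import defaultdict

# 1000 Genomes Phase 3 population -> super-population
SUPER_POP = {
    "ACB": "AFR", "ASW": "AFR", "ESN": "AFR", "GWD": "AFR",
    "LWK": "AFR", "MSL": "AFR", "YRI": "AFR",
    "CLM": "AMR", "MXL": "AMR", "PEL": "AMR", "PUR": "AMR",
    "CDX": "EAS", "CHB": "EAS", "CHS": "EAS", "JPT": "EAS", "KHV": "EAS",
    "CEU": "EUR", "FIN": "EUR", "GBR": "EUR", "IBS": "EUR", "TSI": "EUR",
    "BEB": "SAS", "GIH": "SAS", "ITU": "SAS", "PJL": "SAS", "STU": "SAS",
}


def aggregate_counts(sub_counts):
    """Sum per-population (AC, AN) pairs into super-population and ALL totals.

    sub_counts maps a population code to an (AC, AN) tuple.
    Returns a dict mapping each group to (AC, AN, AF); AF is None when AN is 0.
    """
    totals = defaultdict(lambda: [0, 0])
    for pop, (ac, an) in sub_counts.items():
        # Every population contributes to its super-population and to ALL.
        for key in (SUPER_POP[pop], "ALL"):
            totals[key][0] += ac
            totals[key][1] += an
    return {
        key: (ac, an, ac / an if an else None)
        for key, (ac, an) in totals.items()
    }


# Made-up counts for three populations, for illustration only.
example = {"ACB": (10, 192), "ASW": (5, 122), "CEU": (20, 198)}
result = aggregate_counts(example)
for group, (ac, an, af) in sorted(result.items()):
    print(group, ac, an, af)
```

Computing AF from the summed AC and AN, rather than averaging the sub-population frequencies, avoids the rounding problem and keeps AN available even when AC is zero.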
What I've tried already:
- I can easily get AF for the global and super-populations using ANNOVAR, but I still need AC and AN.
- I can get AC and AF from dbNSFP 2, but this limits the variants to non-synonymous SNPs only. Technically, I could calculate AN by dividing AC by AF, but this introduces rounding errors because AF has been truncated. Additionally, if AC and AF are zero, then I won't be able to calculate AN at all.
- I've considered tallying the genotypes (e.g. 0|0, 0|1, 1|1, etc.) directly from the VCF/BCF files, but I was hoping to avoid this if possible.
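To make the last option concrete, here is a minimal sketch of what tallying genotypes would involve. It assumes the GT strings for one variant have already been extracted in sample order, and that each sample's super-population is known (for example from the 1000 Genomes sample panel file). The function name `tally_genotypes` and the example genotypes are hypothetical.

```python
import re
from collections import defaultdict


def tally_genotypes(genotypes, sample_groups, alt_index=1):
    """Count AC and AN per group from VCF GT strings for a single variant.

    genotypes     : GT strings such as "0|1", "1/1", ".|." or "1" (haploid)
    sample_groups : group label for each sample, in the same order
    alt_index     : which ALT allele to count (1 = first ALT)
    Returns a dict mapping each group to (AC, AN, AF). Groups in which every
    call is missing do not appear in the result.
    """
    counts = defaultdict(lambda: [0, 0])
    for gt, group in zip(genotypes, sample_groups):
        # Split on phased ("|") or unphased ("/") separators.
        for allele in re.split(r"[|/]", gt):
            if allele == ".":
                continue  # missing calls contribute to neither AC nor AN
            for key in (group, "ALL"):
                counts[key][1] += 1
                if int(allele) == alt_index:
                    counts[key][0] += 1
    return {
        key: (ac, an, ac / an if an else None)
        for key, (ac, an) in counts.items()
    }


# Made-up genotypes for five samples, for illustration only.
gts = ["0|0", "0|1", "1|1", ".|.", "1"]
groups = ["AFR", "AFR", "EUR", "EUR", "EAS"]
tallies = tally_genotypes(gts, groups)
print(tallies)
```

This works, but it means parsing every genotype column for every variant, which is why I would prefer a tool that does the grouping for me.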
How can I get the AC, AN, and AF of any variant for each of the five 1000 Genomes super-populations as well as the global population? Can I do this without first calculating the sub-populations?
NOTE: I know AF is included in the 1000 Genomes VCF/BCF files, but if someone knows how to get AC, AN, and AF in one fell swoop (similar to Allele Frequency Calculator) then it would be greatly appreciated.