I have a VCF file with 200 samples (mitochondrial genome of Plasmodium falciparum). Here is a pic to take a look at:
And a few relevant lines from the actual file:
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FP0008-C FP0009-C Pf_M76611 19 . A T 33196.8 MissingVQSLOD;Mitochondrion AC=9;AF=0.0003019;AN=29816;ANN=T|intragenic_variant|MODIFIER|PF3D7_MIT00100|PF3D7_MIT00100|gene_variant|PF3D7_MIT00100|||n.19A>T||||||;AS_InbreedingCoeff=0.0437;AS_QD=20.96;BaseQRankSum=0.141;C DS;DP=559138;ExcessHet=0.0026;FS=2.892;InbreedingCoeff=0.52;MLEAC=8;MLEAF=0.0002683;MQ=60;MQR ankSum=0;QD=20.96;ReadPosRankSum=0.856;RegionType=Mitochondrion;SOR=0.789;set=genotypegvcfs GT:AD:DP:GQ:PL 0/0:52,0:52:99:0,120,1800 0/0:54,0:54:99:0,120,1800
A small subset of samples I am working on can be found here: https://www.mediafire.com/file/6dkct6guk5zx83m/sample.vcf/file
I am trying to calculate allele frequency (AF) for each variant in the dataset. The population is assumed to be all the samples in the file.
In other words, even though the VCF file contains AF statistics from population data, I would like to know how to recalculate AF only from the VCF file. I believe I need to know how AC and AN are calculated first.
AF is calculated this way: allele count (AC) / allele number (AN)
I tried to calculate them from the information from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples), but then I learned that this calculation is not suitable for mithocondrial data.
All lines in the VCF file contain an AF statistic:
$ grep -v 'AF=' sample.vcf | grep -v '^#' <No output>
The information is provided under INFO column, but how were they calculated? If I do not have AC and AN to begin with, what should I do? Is there a way to calculate AC and AN from the other information provided in the dataset? such as the data mentioned in the FORMAT column? GT: genotype AD: unfiltered allele depth DP : read depth at this position for this sample (Integer) GQ : conditional genotype quality, encoded as a phred quality PL : the phred-scaled genotype likelihoods rounded to the closest integer