How to Calulate Allele Frequency from a VCF File?
0
0
Entering edit mode
7 weeks ago
Emoji • 0

I have a VCF file with 200 samples (mitochondrial genome of Plasmodium falciparum). Here is a pic to take a look at:

And a few relevant lines from the actual file:

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the
same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order
as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  FP0008-C   FP0009-C
Pf_M76611       19      .       A       T       33196.8 MissingVQSLOD;Mitochondrion
AC=9;AF=0.0003019;AN=29816;ANN=T|intragenic_variant|MODIFIER|PF3D7_MIT00100|PF3D7_MIT00100|gene_variant|PF3D7_MIT00100|||n.19A>T||||||;AS_InbreedingCoeff=0.0437;AS_QD=20.96;BaseQRankSum=0.141;C
DS;DP=559138;ExcessHet=0.0026;FS=2.892;InbreedingCoeff=0.52;MLEAC=8;MLEAF=0.0002683;MQ=60;MQR


A small subset of samples I am working on can be found here: https://www.mediafire.com/file/6dkct6guk5zx83m/sample.vcf/file

I am trying to calculate allele frequency (AF) for each variant in the dataset. The population is assumed to be all the samples in the file.

In other words, even though the VCF file contains AF statistics from population data, I would like to know how to recalculate AF only from the VCF file. I believe I need to know how AC and AN are calculated first.

AF is calculated this way: allele count (AC) / allele number (AN)

I tried to calculate them from the information from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples), but then I learned that this calculation is not suitable for mithocondrial data.

All lines in the VCF file contain an AF statistic:

\$ grep -v 'AF=' sample.vcf | grep -v '^#'
<No output>


The information is provided under INFO column, but how were they calculated? If I do not have AC and AN to begin with, what should I do? Is there a way to calculate AC and AN from the other information provided in the dataset? such as the data mentioned in the FORMAT column? GT: genotype AD: unfiltered allele depth DP : read depth at this position for this sample (Integer) GQ : conditional genotype quality, encoded as a phred quality PL : the phred-scaled genotype likelihoods rounded to the closest integer

Thanks

frequency allele VCF • 1.1k views
0
Entering edit mode

pretty much all vcftools will enable this for you; vcftools, plink2 etc

0
Entering edit mode

for mithocondrial data ?

0
Entering edit mode

Thanks for the reply. I am not just looking for a tool to quickly calculate something. I checked the documentation for vfctools, but it just described how to use it to get the result. Plus, i am not sure if it covers mithocondrial genome. I would like to know about the calculation. How can i do this manually?

1
Entering edit mode
bcftools +fill-tags in.vcf -- -t AN,AC,AF


AC= COUNT(ALT-ALLELE)

AN= COUNT(ALT-ALLELE)+COUNT(REF-ALLELE)

AF=AC/AN

0
Entering edit mode

Thanks, but it looks like when i tried to calculate them from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples). I do not think it works that way, at least not for the mithocondrial genome.

1
Entering edit mode

That's correct. You don't have biallelic alleles in mitochondria, so you can use either the Allele depth (AD) or depth of coverage (DP) for each individual. DP should be a bit more raw than AD.

0
Entering edit mode

Thanks, but how do i do that?

1
Entering edit mode

for AD it's that second element 52,0 for FP0008-C then 54,0 for FP0009-C. When you have more subjects it should be interesting. You can add those between individuals and divide each allele by the total.

0
Entering edit mode

Thanks, aplogy for follow up questions, as i am new to this field. I am just trying to clarify your explanation. What you are saying is to add numbers, for these two samples, 52 and 54 which would be the total and then divide each allele count (counts for 0/0, 0/1 and 1/1) by that total?

1
Entering edit mode

Imagine it was more heteroplasmy in these two, so like 42,10 and 50,4 then the frequencies would be 92/106=.868 and 14/106=.132

0
Entering edit mode

Awesome! I get it now. However, there is still a small problem. All the ADs are like this: some number (between 1 and 60):0. So, the number on the right side of any AD read is always 0. That would mean that my fractions are alway 0/total and total/total whih would result in 0 and 1. How is this possible?

1
Entering edit mode

ok, what do the DP's look like? ever more than one number?

0
Entering edit mode

For each sample, there is only one number.

0
Entering edit mode

Hmm i would expect some heteroplasmy to show up in AD or DP at some positions