I have a vcf file with about 800 individuals (diploids) and millions of SNPs. The individuals can be divided in 15 to 25 populations. I would like to calculate the allele frequencies for each SNP on each population. Has someone got a R script doing this? Thank you
I found BGT is a very convenient tool for slicing and querying genotypes from large VCF files. With the sliced genotypes (either by regions or by samples, such as by individuals in different populations), it should be straightforward to calculate allele frequencies for any variants in each population.
Or you could directly tap into the VCF file using pyvcf and fetch sample and genotype information for your allele frequency calculations.
plink2 --vcf [VCF path] --freq --pheno [population-file path] --loop-cats [name of population col.]
(https://www.cog-genomics.org/plink/2.0/ ) should work, if your population file has sample IDs in the first column and population labels in the second.
If the population file has no header line, use "PHENO1" as the --loop-cats argument. If there is a header line, it may be necessary to change the header for the sample ID column to "#IID".