I have a vcf file with about 800 individuals (diploids) and millions of SNPs. The individuals can be divided in 15 to 25 populations. I would like to calculate the allele frequencies for each SNP on each population. Has someone got a R script doing this? Thank you
If your population file has IDs in the first column and population labels in the second, and you edit/add an "#IID population" header line to it,
plink2 --vcf <VCF path> --freq --pheno <population-file path> --loop-cats population
(https://www.cog-genomics.org/plink/2.0/ ) should work.
I found BGT is a very convenient tool for slicing and querying genotypes from large VCF files. With the sliced genotypes (either by regions or by samples, such as by individuals in different populations), it should be straightforward to calculate allele frequencies for any variants in each population.
Or you could directly tap into the VCF file using pyvcf and fetch sample and genotype information for your allele frequency calculations.