Calculate population allele frequencies from a VCF file including multiple populations
2.7 years ago

I have a VCF file with about 800 diploid individuals and millions of SNPs. The individuals can be divided into 15 to 25 populations. I would like to calculate the allele frequencies of each SNP in each population. Does anyone have an R script that does this? Thank you

R SNP • 4.5k views

With millions of SNPs, it is better to use bcftools.

2.7 years ago

If your population file has sample IDs in the first column and population labels in the second, add an "#IID population" header line to it (or edit the existing header to read that way), then run:

plink2 --vcf <VCF path> --freq --pheno <population-file path> --loop-cats population
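For a two-column file that has no header yet, the header can be added with a few lines of Python. This is only a minimal sketch: the function name `add_pheno_header` and the file names in the usage note are hypothetical, and it assumes whitespace-separated columns and that a tab-delimited output is acceptable to plink2.

```python
from pathlib import Path


def add_pheno_header(src, dst, column="population"):
    """Copy a two-column ID/population file, prepending a '#IID <column>' header.

    Hypothetical helper: returns the number of sample rows written.
    """
    rows = []
    for line in Path(src).read_text().splitlines():
        fields = line.split()
        if not fields or line.startswith("#"):
            continue  # skip blank lines and any existing header line
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got {len(fields)}: {line!r}")
        rows.append(fields)
    with open(dst, "w") as out:
        out.write(f"#IID\t{column}\n")
        for iid, pop in rows:
            out.write(f"{iid}\t{pop}\n")
    return len(rows)
```

For example, `add_pheno_header("poplist", "poplist.with_header")` would write a new file that can be passed to --pheno, with `population` as the name given to --loop-cats.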


Hi, I am trying to use exactly this suggestion, but I'm getting the following error: "Error: Line 1 of poplist has fewer tokens than expected."

poplist is a file in which every row looks like this, for example: AltaiNea Neanderthal

Do you have any idea on how to fix it?

Thank you

Hmm, it's necessary to add an "#IID pop" header line in this case (with the second column name matching the name passed to --loop-cats), since otherwise plink2 assumes two-part IDs (for backward compatibility with plink 1.x). I'll edit my original answer accordingly.

Since this situation isn't that rare, I added an "iid-only" modifier to --pheno today; this removes the need to add a header line (--pheno iid-only <population-file path>).

Hi, chrchang523. Thanks for this post. I am trying to run this and I get the following error: "Error: --loop-cats phenotype 'population' not loaded." Any idea why?

Thank you very much!

What is the top line of your --pheno file?

My first individual's ID and its population ID. Should I write "population" there?
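A quick way to check the top line of the file before rerunning plink2 is sketched below. The function `pheno_header_problem` is hypothetical, and the set of accepted first tokens (#IID, IID, #FID, FID) reflects my understanding of plink2's header conventions rather than anything stated in this thread.

```python
def pheno_header_problem(first_line, column="population"):
    """Describe what is wrong with a --pheno header line, or return None if it looks fine.

    Hypothetical helper; the accepted first tokens are an assumption.
    """
    fields = first_line.split()
    if not fields:
        return "file starts with an empty line"
    if fields[0] not in ("#IID", "IID", "#FID", "FID"):
        return f"no header line: first token is {fields[0]!r}, not #IID or #FID"
    if column not in fields[1:]:
        return f"header has no column named {column!r}"
    return None
```

If the first line is a sample row such as "AltaiNea Neanderthal", this reports a missing header, which is consistent with the "phenotype 'population' not loaded" error: without a header, no phenotype called population is defined.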

2.7 years ago
Vitis ★ 2.5k

I have found BGT to be a very convenient tool for slicing and querying genotypes from large VCF files. Once the genotypes are sliced (by region or by sample, for example by the individuals in each population), it should be straightforward to calculate allele frequencies for any variant in each population.

https://github.com/lh3/bgt

Or you could read the VCF file directly with PyVCF and fetch the sample and genotype information for your allele frequency calculations.
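To illustrate the calculation itself, here is a minimal sketch that parses plain VCF text with the standard library only (it does not use PyVCF, to stay self-contained). The function name `alt_freq_by_pop` and the `sample_to_pop` mapping are hypothetical; the sketch assumes biallelic sites, GT as the first FORMAT field, and it leaves missing alleles out of the denominator.

```python
from collections import defaultdict


def alt_freq_by_pop(vcf_lines, sample_to_pop):
    """Yield (chrom, pos, {population: ALT allele frequency}) per biallelic record.

    Assumes GT is the first FORMAT field; missing alleles ('.') are not counted.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        if "," in fields[4]:
            continue  # skip multiallelic sites in this sketch
        alt = defaultdict(int)    # ALT allele count per population
        total = defaultdict(int)  # called allele count per population
        for sample, call in zip(samples, fields[9:]):
            pop = sample_to_pop.get(sample)
            if pop is None:
                continue  # sample not assigned to any population
            gt = call.split(":")[0].replace("|", "/")
            for allele in gt.split("/"):
                if allele == ".":
                    continue
                total[pop] += 1
                alt[pop] += allele == "1"
        yield fields[0], int(fields[1]), {p: alt[p] / total[p] for p in total}
```

Because the function is a generator over lines, an open file handle (for example from `gzip.open(path, "rt")`) can be passed in so that a file with millions of SNPs is streamed rather than loaded into memory. Populations with no called genotypes at a site are simply absent from that site's dictionary.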