Calculate population allele frequencies from a vcf file including multiple populations
5.2 years ago

I have a VCF file with about 800 diploid individuals and millions of SNPs. The individuals can be divided into 15 to 25 populations. I would like to calculate the allele frequency of each SNP in each population. Does anyone have an R script that does this? Thank you.


With millions of SNPs, it is better to use bcftools.

5.2 years ago

If your population file has sample IDs in the first column and population labels in the second, and it starts with an "#IID population" header line (add one, or edit the existing first line, if needed),

plink2 --vcf <VCF path> --freq --pheno <population-file path> --loop-cats population

(https://www.cog-genomics.org/plink/2.0/ ) should work.
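If the population file has no header yet, the preparation step could be sketched in Python roughly as follows. This is only an illustration: write_pheno_file and plink2_command are hypothetical helper names, the output is written tab-delimited by assumption, and the command is only assembled as a list (mirroring the one above), not executed.

```python
from pathlib import Path


def write_pheno_file(src, dst, column="population"):
    """Copy a two-column 'ID population' file to dst with an '#IID <column>' header.

    Returns the number of sample rows written.
    """
    rows = []
    for line in Path(src).read_text().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any header that is already there
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got {len(fields)}: {line!r}")
        rows.append(fields)
    with open(dst, "w") as out:
        out.write(f"#IID\t{column}\n")
        for sample_id, pop in rows:
            out.write(f"{sample_id}\t{pop}\n")
    return len(rows)


def plink2_command(vcf, pheno, column="population"):
    """Assemble the command from the answer; the --loop-cats name matches the header."""
    return ["plink2", "--vcf", vcf, "--freq", "--pheno", pheno, "--loop-cats", column]
```

The resulting list can be passed to subprocess.run if plink2 is on the PATH. Keeping the column name in a single argument means the header and the --loop-cats name cannot drift apart.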


Hi, I am trying to use exactly this suggestion, but I'm getting the following error: "Error: Line 1 of poplist has fewer tokens than expected."

poplist is a file in which every row looks like this, for example: AltaiNea Neanderthal

Do you have any idea how to fix it?

Thank you


Hmm, it's necessary to add an "#IID population" header line in this case (the column name has to match the one passed to --loop-cats), since otherwise plink2 assumes two-part IDs (for backward compatibility with plink 1.x). I'll edit my original answer accordingly.


Since this situation isn't that rare, I added an "iid-only" modifier to --pheno today; this removes the need to add a header line (--pheno iid-only <population-file path>).


Hi, chrchang523. Thanks for this post. I am trying to run this and I get the following error: "Error: --loop-cats phenotype 'population' not loaded." Any idea why?

Thank you very much!


What is the top line of your --pheno file?
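A quick way to inspect that line could look like the sketch below. It is only a rough check, assuming the "#IID population" header convention used earlier in this thread; check_pheno_header is a hypothetical helper, and any other header forms plink2 may accept are not handled here.

```python
def check_pheno_header(path, phenotype="population"):
    """Report whether the first line is an '#IID' header that names the phenotype column."""
    with open(path) as handle:
        tokens = handle.readline().split()
    if not tokens or tokens[0] != "#IID":
        # The first line is probably a sample row rather than a header.
        return "no '#IID' header line found"
    if phenotype not in tokens[1:]:
        return f"header found, but it has no column named '{phenotype}'"
    return "ok"
```

If the first case is reported, the file most likely starts directly with a sample row, so no phenotype called "population" was ever defined for --loop-cats to use.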


My first individual and its population ID. Should I write "population" there?


Please read the --pheno documentation.

5.2 years ago
Vitis ★ 2.5k

I have found BGT to be a very convenient tool for slicing and querying genotypes from large VCF files. Once the genotypes are sliced (by region or by sample, for example by the individuals in each population), it should be straightforward to calculate the allele frequency of any variant in each population.

https://github.com/lh3/bgt

Or you could directly tap into the VCF file using pyvcf and fetch sample and genotype information for your allele frequency calculations.

https://pyvcf.readthedocs.io/en/latest/
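To make the calculation itself concrete, here is a minimal sketch of per-population allele frequencies. It deliberately does not use the pyvcf API: it parses uncompressed VCF text with the standard library only, and the function names (read_populations, population_allele_frequencies) are made up for illustration. It assumes a GT field is present in FORMAT and treats "." alleles as missing.

```python
import re
from collections import defaultdict


def read_populations(lines):
    """Map sample ID -> population from 'ID population' rows (header/blank lines skipped)."""
    pops = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        sample, pop = line.split()[:2]
        pops[sample] = pop
    return pops


def population_allele_frequencies(vcf_lines, sample_to_pop):
    """Yield (chrom, pos, ref, alts, {pop: [frequency of each ALT allele]}) per record."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        alts = alt.split(",")
        gt_index = fields[8].split(":").index("GT")
        # One counter per population: index 0 is REF, 1..n are the ALT alleles.
        counts = defaultdict(lambda: [0] * (len(alts) + 1))
        for sample, entry in zip(samples, fields[9:]):
            pop = sample_to_pop.get(sample)
            if pop is None:
                continue  # sample not assigned to any population
            pop_counts = counts[pop]
            for allele in re.split(r"[/|]", entry.split(":")[gt_index]):
                if allele != ".":
                    pop_counts[int(allele)] += 1
        freqs = {}
        for pop, c in counts.items():
            total = sum(c)
            # None marks a population with no called alleles at this site.
            freqs[pop] = [n / total for n in c[1:]] if total else None
        yield chrom, int(pos), ref, alts, freqs
```

Because it is a generator over lines, the whole file never has to be in memory, but with millions of SNPs pure Python will be slow, so the dedicated tools mentioned in this thread are likely the better choice for the full dataset.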
