Question: Calculate population allele frequencies from a vcf file including multiple populations
Asked 7 months ago by gwenola.tosser:

I have a VCF file with about 800 diploid individuals and millions of SNPs. The individuals can be divided into 15 to 25 populations. I would like to calculate the allele frequencies of each SNP in each population. Does anyone have an R script that does this? Thank you


With millions of SNPs, it is better to use bcftools.

Comment written 7 months ago by zx8754
Answer by chrchang523 (United States), 7 months ago:

If your population file has sample IDs in the first column and population labels in the second, add an "#IID population" header line to it (or edit the existing header to match). Then this plink2 (https://www.cog-genomics.org/plink/2.0/) command should work:

plink2 --vcf <VCF path> --freq --pheno <population-file path> --loop-cats population

Answer modified 9 days ago • written 7 months ago by chrchang523
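
The population file described in the answer above can be prepared with a few lines of Python. This is only a minimal sketch, assuming a whitespace-delimited file with exactly two columns (sample ID, population label); the function name `with_iid_header` and the sample and population names in the usage note are hypothetical and are not part of plink2.

```python
def with_iid_header(rows, label="population"):
    """Return two-column population rows, tab-delimited, with an '#IID <label>' header.

    `rows` is any iterable of text lines (e.g. an open file). Blank lines are
    skipped. If the first non-blank line already starts with '#IID', it is kept
    as the header and no new header is added.
    """
    out = []
    for row in rows:
        tokens = row.split()
        if not tokens:
            continue  # skip blank lines
        if len(tokens) != 2:
            raise ValueError(f"expected 2 columns, got {len(tokens)}: {row!r}")
        out.append("\t".join(tokens))
    if out and out[0].startswith("#IID"):
        return out
    return [f"#IID\t{label}"] + out
```

For example, `with_iid_header(["sample1 popA", "sample2 popB"])` returns the header line followed by the two rows; writing the result with `"\n".join(...)` gives a file whose second column is named `population`, matching the name passed to --loop-cats in the command above.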

Hi, I am trying to use exactly this suggestion, but I am getting the following error: "Line 1 of poplist has fewer tokens than expected."

poplist is a file in which every row looks like this, for example: AltaiNea Neanderthal

Do you have any idea how to fix it?

Thank you

Comment written 9 days ago by Earendil

Hmm, in this case it's necessary to add an "#IID population" header line, since otherwise plink2 assumes two-part IDs (FID and IID, for backward compatibility with plink 1.x) and therefore expects more than two tokens per line. I'll edit my original answer accordingly.

Comment written 9 days ago by chrchang523

Since this situation isn't that rare, I added an "iid-only" modifier to --pheno today; this removes the need to add a header line (--pheno iid-only <population-file path>).

Comment written 7 days ago by chrchang523

Answer by Vitis (New York), 7 months ago:

I have found BGT to be a very convenient tool for slicing and querying genotypes from large VCF files. Once the genotypes are sliced (by region or by sample, for example by the individuals in each population), it should be straightforward to calculate allele frequencies for any variant in each population.

https://github.com/lh3/bgt

Alternatively, you could read the VCF file directly with PyVCF and fetch the sample and genotype information needed for your allele frequency calculations.

https://pyvcf.readthedocs.io/en/latest/

Answer modified 7 months ago • written 7 months ago by Vitis
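
To make the "read the VCF directly" route concrete, here is a rough sketch in plain Python that needs no third-party library. It assumes uncompressed VCF text lines, a GT entry in the FORMAT column, and a dictionary mapping sample IDs to population labels; the function name `population_allele_freqs` and all sample and population names are hypothetical. It treats any non-reference allele as "alt" and skips missing alleles, so multiallelic sites would need extra handling. With millions of SNPs, a dedicated tool such as plink2 or bcftools will be much faster.

```python
from collections import defaultdict


def population_allele_freqs(vcf_lines, sample_to_pop):
    """Yield (chrom, pos, ref, alt, {population: non-reference allele frequency}).

    Samples missing from `sample_to_pop` are ignored. A population whose
    genotypes are all missing at a site is left out of that site's dictionary.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_idx = fields[8].split(":").index("GT")
        alt_count = defaultdict(int)
        total = defaultdict(int)
        for sample, entry in zip(samples, fields[9:]):
            pop = sample_to_pop.get(sample)
            if pop is None:
                continue
            genotype = entry.split(":")[gt_idx]
            # Treat phased ("|") and unphased ("/") genotypes the same way.
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":
                    continue  # missing allele call
                total[pop] += 1
                if allele != "0":
                    alt_count[pop] += 1
        freqs = {pop: alt_count[pop] / total[pop] for pop in total}
        yield chrom, int(pos), ref, alt, freqs
```

A typical call would pass an open file handle, e.g. `for site in population_allele_freqs(open(path), sample_to_pop): ...`, where `path` and `sample_to_pop` are supplied by the user.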