Question: Calculate population allele frequencies from a vcf file including multiple populations
17 months ago by
gwenola.tosser10 wrote:

I have a VCF file with about 800 diploid individuals and millions of SNPs. The individuals can be divided into 15 to 25 populations. I would like to calculate the allele frequency of each SNP in each population. Does anyone have an R script that does this? Thank you.

snp R • 2.1k views
modified 17 months ago by chrchang5237.1k • written 17 months ago by gwenola.tosser10

With millions of SNPs, it is better to use bcftools.

written 17 months ago by zx87549.4k
17 months ago by
chrchang5237.1k
United States
chrchang5237.1k wrote:

If your population file has sample IDs in the first column and population labels in the second, add an "#IID population" header line to it (or edit the existing header to match). Then

plink2 --vcf <VCF path> --freq --pheno <population-file path> --loop-cats population

should work (https://www.cog-genomics.org/plink/2.0/).
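The header step can be scripted. Below is a minimal Python sketch, assuming a whitespace-separated two-column file (sample ID, population label) with or without an existing "#IID" header. The function name `add_iid_header` and the file names are hypothetical, not part of plink2. The column name written to the header is the name that is later passed to --loop-cats, so the two have to match.

```python
from pathlib import Path


def add_iid_header(src, dst, column_name="population"):
    """Write a copy of a two-column population file with an '#IID <column_name>' header.

    src: path to a file with one 'sampleID populationLabel' pair per line.
    dst: path of the tab-separated file to write.
    Returns the number of samples written.
    """
    # Drop blank lines so they do not trigger the column check below.
    lines = [ln for ln in Path(src).read_text().splitlines() if ln.strip()]

    # If a header is already present, replace it rather than duplicating it.
    body = lines[1:] if lines and lines[0].startswith("#IID") else lines

    for n, ln in enumerate(body, start=1):
        if len(ln.split()) != 2:
            raise ValueError(f"data line {n}: expected 2 columns, got {ln!r}")

    out = [f"#IID\t{column_name}"] + ["\t".join(ln.split()) for ln in body]
    Path(dst).write_text("\n".join(out) + "\n")
    return len(body)


# Hypothetical usage:
# add_iid_header("poplist", "poplist.with_header.txt")
# then pass poplist.with_header.txt to --pheno and "population" to --loop-cats.
```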

modified 10 months ago • written 17 months ago by chrchang5237.1k

Hi, I am trying to use exactly this suggestion, but I get the following error: "Error: Line 1 of poplist has fewer tokens than expected."

poplist is a file in which every row looks like this, for example: AltaiNea Neanderthal

Do you have any idea how to fix it?

Thank you

written 10 months ago by Earendil20

Hmm, it's necessary to add an "#IID pop" header line in this case, since otherwise plink2 assumes two-part IDs (for backward compatibility with plink 1.x). I'll edit my original answer accordingly.

written 10 months ago by chrchang5237.1k

Since this situation isn't that rare, I added an "iid-only" modifier to --pheno today; this removes the need to add a header line (--pheno iid-only <population-file path>).

written 10 months ago by chrchang5237.1k
17 months ago by
Vitis2.4k
New York
Vitis2.4k wrote:

I have found BGT to be a very convenient tool for slicing and querying genotypes from large VCF files. Once the genotypes are sliced (by region or by sample, for example by the individuals in each population), it should be straightforward to calculate the allele frequency of any variant in each population.

https://github.com/lh3/bgt

Or you could read the VCF file directly with PyVCF and fetch the sample and genotype information for your allele frequency calculations.

https://pyvcf.readthedocs.io/en/latest/
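To illustrate the calculation itself, here is a minimal sketch in plain Python. It does not use PyVCF; it parses uncompressed VCF text lines directly so that it stays self-contained. The function name `population_allele_freqs` and the `sample_to_pop` mapping are assumptions made for illustration. For each site it returns the ALT allele frequency per population, computed as the number of non-reference alleles divided by the number of called alleles. With millions of SNPs a pure-Python loop like this will be slow, so it is meant to show the logic rather than replace the dedicated tools mentioned in this thread.

```python
from collections import defaultdict


def population_allele_freqs(vcf_lines, sample_to_pop):
    """Yield (chrom, pos, ref, alt, {population: ALT allele frequency}) per site.

    vcf_lines: iterable of uncompressed VCF text lines.
    sample_to_pop: dict mapping sample ID -> population label.
    Samples missing from the mapping are skipped. Populations with no
    called genotypes at a site are left out of that site's dict.
    At multiallelic sites all non-reference alleles are pooled together.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the FORMAT column
            continue

        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")

        alt_count = defaultdict(int)  # non-reference alleles per population
        called = defaultdict(int)     # called (non-missing) alleles per population
        for sample, entry in zip(samples, fields[9:]):
            pop = sample_to_pop.get(sample)
            if pop is None:
                continue
            genotype = entry.split(":")[gt_index]
            # Treat phased (|) and unphased (/) genotypes the same way.
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":
                    continue  # missing call
                called[pop] += 1
                if allele != "0":
                    alt_count[pop] += 1

        freqs = {pop: alt_count[pop] / called[pop] for pop in called}
        yield chrom, int(pos), ref, alt, freqs


# Hypothetical usage with an uncompressed VCF and a sample -> population dict:
# with open("input.vcf") as handle:
#     for site in population_allele_freqs(handle, sample_to_pop):
#         print(site)
```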

modified 17 months ago • written 17 months ago by Vitis2.4k