Question

More rare variants than common variants

5 months ago

RT ▴ 20

Hi there!

I have a joint-called VCF with 10,000+ samples (all 30x-coverage WGS, from multiple ancestries) and intend to carry out a GWAS. I used PLINK to convert the VCF to BED and applied the initial QC (--mind and --geno), after which I had a good genotyping rate of ~95% with 180M variants. But when I applied the minor allele frequency filter (--maf 0.01), more than 95% of the variants were removed and only 900K variants were retained. My questions are:

Is it normal to have more rare variants (MAF<0.01) than common variants in a large dataset like this?
As we don't know the self-reported ancestry for all samples, I am going to use somalier for ancestry prediction. Should I use this to apply the MAF filter separately for each ancestry? I appreciate your input and suggestions.

Thank you!

PLINK GWAS • 646 views

updated 5 months ago by LChart 5.1k • written 5 months ago by RT ▴ 20

score 1 · Answer 1 · 2025-05-15

Is it normal to have more rare variants (MAF<0.01) than common variants in a large dataset like this?

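Yes, this is the site frequency spectrum in action: the number of variant sites falls off steeply with allele frequency, so with 10,000+ WGS samples most of what you call will be rare, much of it singletons and doubletons.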
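To put a rough number on the site frequency spectrum point, here is a minimal sketch. It assumes the textbook neutral, constant-population-size model, in which the expected number of sites with derived allele count i among n chromosomes is proportional to 1/i. That model is an assumption for illustration, not a description of this dataset, and the function name `rare_fraction` is made up for the example.

```python
def rare_fraction(n_chromosomes, maf_cutoff=0.01):
    """Expected fraction of segregating sites with MAF below maf_cutoff.

    Assumes the standard neutral model: the expected number of sites with
    derived allele count i (1 <= i <= n-1) is proportional to 1/i.
    """
    n = n_chromosomes
    total = 0.0
    rare = 0.0
    for i in range(1, n):
        weight = 1.0 / i              # relative expected number of sites at count i
        total += weight
        maf = min(i, n - i) / n       # fold the spectrum: minor allele frequency
        if maf < maf_cutoff:
            rare += weight
    return rare / total


if __name__ == "__main__":
    # 10,000 diploid samples -> 20,000 chromosomes
    print(f"{rare_fraction(20_000):.3f}")
```

Even under this idealised model, a bit over half of all variant sites in 10,000 diploid samples are expected to fall below MAF 0.01. Real human cohorts typically show a larger rare fraction still, because recent population growth adds an excess of very rare alleles.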

As we don't know the self-reported ancestry for all samples, I am going to use somalier for ancestry prediction. Should I use this to apply the MAF filter separately for each ancestry? I appreciate your input and suggestions.

This is a circular question. If you don't have ancestry labels, you can't apply a filter within each ancestry group. The only thing you can do is apply the filter globally and then run Somalier. Edit: You may mean you have partial ancestry information. In general, ancestry inference works fine with a hard MAF threshold that keeps only common variants, as there are plenty of frequency-divergent SNPs among them.
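To make the global versus per-ancestry distinction concrete, here is a small toy sketch. The labels "A" and "B" stand in for predicted ancestry groups and the dosages are invented; the helper names `maf` and `per_group_maf` are hypothetical and not part of PLINK or Somalier.

```python
from collections import defaultdict


def maf(dosages):
    """Minor allele frequency from 0/1/2 alt-allele dosages (None = missing call)."""
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def per_group_maf(dosages, labels):
    """MAF of one variant computed separately within each label group."""
    groups = defaultdict(list)
    for dosage, label in zip(dosages, labels):
        groups[label].append(dosage)
    return {label: maf(values) for label, values in groups.items()}


# Toy variant: 10 samples in group "A" (two heterozygotes), 190 in group "B" (none).
labels = ["A"] * 10 + ["B"] * 190
dosages = [1, 1] + [0] * 8 + [0] * 190

global_maf = maf(dosages)                  # 2 alt alleles out of 400
by_group = per_group_maf(dosages, labels)  # 2 out of 20 in "A", 0 in "B"

print(global_maf, by_group)
```

In this toy case the variant fails a global 0.01 cutoff but is common within group "A", which is the situation a per-group filter is meant to catch. It only becomes possible once group labels exist, so the global filter has to come first.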