Question

Why remove SNPs with MAF < 5% for Fst calculation?

0

6.5 years ago

Mr Locuace ▴ 160

I have a very naive question. Let's say SNP X has an allele A with a frequency of 0.52 in population 1 and 0.002 in population 2. In some papers I have read that people remove SNPs with MAF < 5% in either of the populations when calculating Fst. These frequencies suggest that A is strongly differentiated between pop1 and pop2. Indeed, I calculated Fst for SNP X and it has a value of ~0.9. But if I use the MAF > 5% criterion, I would remove this strong signal of population differentiation. This does not make much sense to me. I would very much appreciate some feedback. Thanks!

snp maf Fst • 4.1k views

ADD COMMENT • link updated 6.5 years ago by Kevin Blighe 87k • written 6.5 years ago by Mr Locuace ▴ 160

score 2 · Accepted Answer · 2017-10-19

2

6.5 years ago

Kevin Blighe 87k

The authors of this paper, published in Genome Research, address exactly this issue of allele frequency when calculating Fst: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3759727/

Their results show just what you have implied, i.e., that Fst depends on allele frequency, but they also indicate that sample size is important. On that note, rare variants, being rare, are naturally encountered less often in a sample, and only in recent years have we accumulated sequencing data on thousands of individuals, so that we can actually begin to analyse rare variants with various metrics, including Fst.
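To make the dependence on allele frequency and sample size concrete, here is a minimal sketch. It assumes a Hudson-style Fst estimator with a sample-size correction; the function name `hudson_fst` and the sample sizes are hypothetical choices for illustration, and the allele frequencies are simply treated as observed values. Different estimators give different values, so the numbers printed need not match the ~0.9 quoted in the question.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson-style Fst for one biallelic SNP (illustrative sketch).

    p1, p2: sample allele frequencies in populations 1 and 2.
    n1, n2: number of sampled chromosomes per population (assumed > 1).
    """
    # Squared frequency difference, minus the part expected from sampling noise alone.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    # Probability that two alleles drawn from different populations differ.
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den > 0 else float("nan")


# The SNP from the question, then a rare variant, at increasing sample sizes.
for label, p1, p2 in (("SNP X", 0.52, 0.002), ("rare SNP", 0.03, 0.01)):
    for n in (20, 200, 2000):
        print(label, n, round(hudson_fst(p1, p2, n, n), 4))
```

For the rare variant, the sampling correction is as large as the signal when only a few chromosomes are sampled, so the estimate can even turn negative. That is one way to see why small studies struggle with low-frequency variants.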

ADD COMMENT • link 6.5 years ago by Kevin Blighe 87k

1

Thanks very much Kevin

ADD REPLY • link 6.5 years ago by Mr Locuace ▴ 160

0

You're welcome, my friend!

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Hi Kevin,

Given your response, rare variants cannot be considered for population differentiation because they arose only in recent years, yes? However, variants with an allele frequency < 5% are not rare; they are just not common. By removing variants with AF < 5%, we assess population differentiation only in terms of common variants, while these variants cannot have a significant role in the trait of interest, and the populations may differ at low-frequency variants rather than at common ones. Could you please correct me where I'm wrong and explain a bit more about removing variants with AF < 5% for Fst calculation, which still does not make sense to me?

ADD REPLY • link 5.4 years ago by seta ★ 1.9k

1

In my answer, I only state that the authors noted a difference when calculating Fst for 'low frequency' variants (MAF <= 0.05) versus 'most common' variants (0.45 < MAF <= 0.5). The title of this question is misleading because it implies that everybody should filter out variants with MAF <= 0.05 when calculating Fst.
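As a rough sketch of what comparing frequency classes can look like in practice (not the authors' code): the snippet below bins SNPs by pooled MAF and reports one Fst per bin as a ratio of summed numerators to summed denominators. The Hudson-style terms, the bin edges, the pooled-MAF definition, the toy frequencies, and the names `hudson_terms` and `fst_by_maf_bin` are all assumptions made for illustration.

```python
def hudson_terms(p1, p2, n1, n2):
    """Numerator and denominator of a Hudson-style Fst for one SNP."""
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def fst_by_maf_bin(snps, n1, n2, edges=(0.0, 0.05, 0.5)):
    """Fst per MAF bin (lo, hi], combining SNPs as a ratio of sums.

    snps: iterable of (p1, p2) sample allele frequencies.
    n1, n2: sampled chromosomes per population (assumed constant across SNPs).
    """
    sums = {(lo, hi): [0.0, 0.0] for lo, hi in zip(edges, edges[1:])}
    for p1, p2 in snps:
        # Pooled MAF across both samples (an assumed binning rule).
        pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
        maf = min(pooled, 1 - pooled)
        for (lo, hi), acc in sums.items():
            if lo < maf <= hi:  # monomorphic sites (MAF == 0) fall in no bin
                num, den = hudson_terms(p1, p2, n1, n2)
                acc[0] += num
                acc[1] += den
                break
    return {b: (num / den if den > 0 else float("nan"))
            for b, (num, den) in sums.items()}


# Toy data: two common and two low-frequency SNPs, 200 chromosomes per population.
toy = [(0.52, 0.002), (0.03, 0.01), (0.50, 0.40), (0.04, 0.0)]
for bin_range, fst in fst_by_maf_bin(toy, 200, 200).items():
    print(bin_range, round(fst, 3))
```

Reporting the bins side by side keeps the low-frequency variants in the analysis while making it visible that their estimates behave differently from those of common variants.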

Common variants can have a big role in disease. It is incorrect to assume that only rare variants contribute to complex disease phenotypes.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Thanks a lot for your explanation. So, in your opinion, is it better to calculate Fst separately for lower-frequency and common variants, rather than removing some variants?

I agree with you about common variants and disease; thanks for correcting me.

In this paper, the authors mention that Fst analysis is not appropriate for detecting genetic risk differentiation among populations, and that the Genetic Risk Variation (GRV) method they developed can overcome the problems of Fst in this situation. They showed its strength for detecting genetic risk differentiation in type 2 diabetes. However, I couldn't find any script or tool to run the GRV method. Could you please share your thoughts on it?

ADD REPLY • link 5.4 years ago by seta ★ 1.9k

0

So, in your opinion, is it better to calculate Fst separately for lower-frequency and common variants, rather than removing some variants?

I am not in the best position to advise on that. It is more a question for a statistician, or at least a bioinformatician who has worked in this area for a number of years. I will say that the literature frequently contradicts itself. Also, the authors' method (GRV) likely will not work in other situations or diseases. You may find more information by looking through Cross Validated / Stack Exchange.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k