Negative Fst values in Lositan
8.0 years ago

Hi there,

I am using Lositan to detect outlier SNPs from a set of 556 SNPs. When I first upload my dataset, I get an overall Fst value of -0.025. After running the simulation (with default settings), 40% of all SNPs are listed as outliers. When I exclude the candidate outliers I get almost the same Fst value (-0.026).

Does anyone know why I could be getting negative Fst values? Also, is it normal to have almost half of the SNPs listed as outliers?

Cecilia

SNP Fst Lositan • 7.8k views
8.0 years ago

This problem of negative Fst scores is not limited to Lositan; it also happens with BioPerl, vcftools, and others.

In principle, negative Fst scores are not impossible: they mean that the estimated variation within populations is larger than the variation between the two populations compared. In general, I believe it is common practice to set all negative Fst scores to 0 and treat them as loci with no population differentiation.
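The "set negatives to 0" convention mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name and the example values are made up, and undefined (NaN) estimates are deliberately left alone rather than turned into 0.

```python
import math

def clamp_negative_fst(values):
    """Return a copy of `values` with negative estimates set to 0.0.

    NaN entries (loci where the estimator is undefined) are kept as NaN,
    so they are not silently counted as "no differentiation".
    """
    clamped = []
    for v in values:
        if math.isnan(v):
            clamped.append(v)
        else:
            clamped.append(max(0.0, v))
    return clamped

# Hypothetical per-locus estimates, for illustration only
per_locus = [-0.025, 0.0, 0.12, float("nan"), -0.004]
print(clamp_negative_fst(per_locus))
```

Whether clamping is appropriate depends on the downstream use: averaging clamped values will bias the mean upward, so it may be safer to clamp only for reporting.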

Regarding the problem of too many outliers, I am not certain which demographic model is implemented in Lositan, or which types of simulations it runs. I would plot the site frequency spectra of both the simulations and the real data, and make sure they do not differ significantly (e.g. that they have the same shape), especially for the SNPs at low frequency.
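As a rough sketch of the site frequency spectrum comparison, the following Python computes a folded spectrum from per-SNP alternate-allele counts and reports the largest difference between two spectra. The input format (one alternate-allele count per SNP, a fixed number of sampled chromosomes) and the example counts are assumptions for illustration; they are not what Lositan exports.

```python
from collections import Counter

def folded_sfs(alt_counts, n_chromosomes):
    """Proportion of polymorphic SNPs in each minor-allele-count class 1..n//2."""
    minor = [min(c, n_chromosomes - c) for c in alt_counts]
    minor = [m for m in minor if m > 0]  # drop monomorphic sites
    total = len(minor)
    classes = range(1, n_chromosomes // 2 + 1)
    if total == 0:
        return [0.0 for _ in classes]
    tally = Counter(minor)
    return [tally.get(k, 0) / total for k in classes]

def max_abs_diff(sfs_a, sfs_b):
    """Largest absolute difference between two spectra of equal length."""
    return max(abs(x - y) for x, y in zip(sfs_a, sfs_b))

# Hypothetical alternate-allele counts among 10 sampled chromosomes
observed = folded_sfs([1, 1, 2, 9, 5, 3], 10)
simulated = folded_sfs([1, 2, 2, 8, 5, 4], 10)
print(observed)
print(max_abs_diff(observed, simulated))
```

A large gap in the first one or two classes would point at the low-frequency SNPs as the place where simulated and real data disagree.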

EDIT: I just discovered that, when you calculate Fst using vcftools between a population and itself, it returns some negative Fst scores:

$ vcftools --weir-fst-pop ACB.pop --weir-fst-pop ACB.pop --gzvcf (1000genomes phase3 data)

CHROM   POS     WEIR_AND_COCKERHAM_FST
11      61395   -0.00465518
11      73015   nan
11      73048   nan
11      77250   nan
11      87150   nan
11      87203   nan
11      87209   -0.00512243
11      87268   -0.00574944
11      87293   nan
11      87341   -0.0052356
11      90692   nan
11      90697   nan
11      90964   nan
11      102905  -0.00794515
11      103253  -0.00704929
11      103365  -0.00517962
11      103367  -0.00517962
11      103368  -0.00517962
11      103604  nan


This basically tells us that you can't take negative Fst scores at face value: the true value here is 0 by construction, so you should treat them as artifacts of the calculation (rounding or something else) rather than as real signal.
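To see how many loci fall into each category, a per-site table like the one above can be tallied with a short Python sketch. It assumes whitespace-separated columns in the order CHROM, POS, WEIR_AND_COCKERHAM_FST, as in the output shown; the function name is made up, and the sample rows are copied from the listing above.

```python
import math

def summarize_weir_fst(text):
    """Count nan, negative and non-negative per-site Fst values in a text table."""
    counts = {"nan": 0, "negative": 0, "non_negative": 0}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "CHROM":
            continue  # skip the header and any malformed line
        value = float(fields[2])  # float() accepts the string "nan"
        if math.isnan(value):
            counts["nan"] += 1
        elif value < 0:
            counts["negative"] += 1
        else:
            counts["non_negative"] += 1
    return counts

sample = """CHROM   POS     WEIR_AND_COCKERHAM_FST
11      61395   -0.00465518
11      73015   nan
11      87209   -0.00512243
"""
print(summarize_weir_fst(sample))
```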


This is useful. Have you published on this? I'm looking for a citation of an example that has been through peer review.

8.0 years ago

As Giovanni M Dall'Olio pointed out, negative values are possible and common with the Weir and Cockerham (1984) estimator (equations A, B and C).
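To make this concrete, here is a minimal Python sketch of the Weir and Cockerham (1984) variance components a, b and c for one allele at a single biallelic locus. It is my own transcription of the published formulas, not code from Lositan or vcftools, and the function name and input layout (sample size, allele frequency and observed heterozygote frequency per population) are assumptions for illustration.

```python
def weir_cockerham_theta(samples):
    """Single-locus theta = a / (a + b + c) after Weir & Cockerham (1984).

    samples: list of (n, p, h) per population, where n is the number of
    individuals, p the allele frequency and h the observed proportion of
    heterozygotes carrying the allele.
    """
    r = len(samples)
    n_total = sum(n for n, _, _ in samples)
    n_bar = n_total / r
    n_c = (n_total - sum(n * n for n, _, _ in samples) / n_total) / (r - 1)
    p_bar = sum(n * p for n, p, _ in samples) / n_total
    s2 = sum(n * (p - p_bar) ** 2 for n, p, _ in samples) / ((r - 1) * n_bar)
    h_bar = sum(n * h for n, _, h in samples) / n_total

    core = p_bar * (1 - p_bar) - (r - 1) / r * s2
    a = (n_bar / n_c) * (s2 - (core - h_bar / 4) / (n_bar - 1))  # between populations
    b = (n_bar / (n_bar - 1)) * (core - (2 * n_bar - 1) / (4 * n_bar) * h_bar)  # between individuals
    c = h_bar / 2  # within individuals
    return a / (a + b + c)

# Two samples with identical allele frequencies: the estimate is below zero
same = weir_cockerham_theta([(50, 0.5, 0.5), (50, 0.5, 0.5)])
# Two clearly differentiated samples: the estimate is positive
different = weir_cockerham_theta([(50, 0.2, 0.32), (50, 0.8, 0.32)])
print(same, different)
```

When the sample allele frequencies are identical, the between-population variance s2 is 0 and the sample-size correction makes a negative, so the ratio is negative. That is the behaviour of the estimator itself, which is why a small negative value is usually read as "no detectable differentiation".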

To avoid excessive outliers, try removing very rare variants and sites where there are many missing genotypes.
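A filter along these lines might look like the Python sketch below. The genotype encoding (0, 1 or 2 alternate alleles per individual, None for a missing call) and both thresholds are hypothetical choices for illustration, not recommended values.

```python
def keep_site(genotypes, min_maf=0.05, max_missing=0.2):
    """Decide whether a SNP passes minor-allele-frequency and missingness filters.

    genotypes: alternate-allele counts per individual (0, 1 or 2),
    with None marking a missing genotype.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called or missing_rate > max_missing:
        return False
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf

print(keep_site([0, 1, 2, 1, 0, 1, 0, 2, 1, 1]))                 # common variant, no missing data
print(keep_site([0] * 19 + [1]))                                 # very rare variant
print(keep_site([None, None, None, 0, 1, 2, 1, 0, 1, 1]))        # too many missing genotypes
```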

If you are looking for an alternative tool, I've written a suite for association testing that includes Fst:

https://github.com/jewmanchue/vcflib/wiki/Association-testing-with-GPAT

and smoothing:

https://github.com/jewmanchue/vcflib/wiki/Smoothing-with-GPAT


A tool to calculate Fst taking into account the genotype likelihood directly from the VCF file. That's wonderful! :-)

8.0 years ago
confusedious ▴ 450

Folks on Biostars in the past helped me with some similar questions.

Wright's Fst And Weir & Cockerham's Fst Estimator - Simple Explanation Of The Difference

21 months ago
andemexoax • 0

I ran into this using the SNPRelate package in R to calculate FST values. The references for the SNPRelate function that computes FST are:

1. Weir BS, Hill WG. Estimating F-statistics. Annu Rev Genet. 2002;36:721-50.
2. Buckleton J, Curran J, Goudet J, Taylor D, Thiery A, Weir BS. Population-specific FST values for forensic STR markers: A worldwide survey. Forensic Sci Int Genet. 2016 Jul;23:91-100. doi:10.1016/j.fsigen.2016.03.004

A discussion of FST methods:

Willing EM, Dreyer C, van Oosterhout C. Estimates of genetic differentiation measured by F(ST) do not necessarily require large sample sizes when using many SNP markers. PLoS One. 2012;7(8):e42649. doi:10.1371/journal.pone.0042649

The 2012 Willing et al. article compares three different FST estimators. The authors note that Wright "assumed infinite sample sizes in his definition, but population size is finite in real datasets. The absence of negative FST values in Wright’s (1951) definition can lead to an overestimation of FST, particularly when the populations are only weakly or not differentiated. Cockerham and Weir (1984) proposed an unbiased estimator that can also have negative FST estimates and that has been widely used [9]."

SNPRelate uses a more recent method derived from Weir's work. "The estimates can also have negative values which do not have a biological meaning [19], but they can compensate for overestimates especially at low levels of genetic differentiation."