Question

Wright's Fst and Weir & Cockerham's Fst Estimator - Simple Explanation of the Difference

12

10.7 years ago

confusedious ▴ 470

Hello everyone,

I have been on the hunt for a simple explanation of how Wright's Fst and Weir & Cockerham's Fst estimator differ. I have gone to the original papers, but with a weak statistical background I have not made much headway.

In short, I understand that Wright's Fst assumed infinite population sizes, and that this could create biased Fst estimates if samples were small or unevenly sized, but would this be an upward or downward bias? I have made the assumption that it would be an upward bias, as Weir & Cockerham's Fst estimator is supposed to correct for Wright's lack of explicit handling of sample size, and is capable of producing negative Fst values, where Wright's Fst produced values of only zero to one. Is this correct?

Also, in simple terms, could someone explain the difference in how these are calculated? I have searched high and low for a straightforward example but found none. I have been using software which calculates Weir & Cockerham's, and would like to be able to explain the difference should the need arise.

Thank you for your time.

NOTE: I am also aware that Jost and others have criticised the use of this indicator for its production of exceedingly low values at highly polymorphic markers. Fortunately, I am using it only with biallelic markers.

fst population-genetics • 33k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 10.7 years ago by confusedious ▴ 470

1

hi, I've edited my answer, after finding a bit more time to dedicate to it.

ADD REPLY • link 10.7 years ago by Giovanni M Dall'Olio 28k

score 12 · Answer 1 · 2013-07-25

EDIT: I have expanded this answer now that I have a little more time.

When Wright introduced the Fst index in 1951, there were no techniques to study the "genetic" components of a trait, and the whole field was based on the observation of the phenotypes. All the studies were based on counting characters that could be observed externally, like the height of the plant or the color of a flower. Population genetics was a theoretical field, and none of the theories proposed by Wright could be verified at the genetic level.

This changed dramatically with the advent of protein electrophoresis in the late 1960s and 1970s. For the first time, electrophoresis made it possible to observe the genetic variation underlying a trait: to determine how many protein isoforms produced the same phenotype, and how many isoforms were present in a population. This was a drastic change for population genetics, because many principles that had only been theorized could finally be tested - and some of them turned out to disagree with what was observed. The most important consequence was Kimura's neutral theory: Kimura observed that the number of protein polymorphisms was too high to be explained by Darwinian selection alone, and proposed that most evolution at the molecular level is governed by genetic drift.

So, the main difference between Wright's Fst and W&C's Fst is the advent of protein electrophoresis. W&C's Fst was designed to be calculated from protein electrophoresis data. As Tiago Antao said, Wright's Fst is a theoretical index, while W&C's Fst is an estimator of it.

Multiple alleles

Before the advent of electrophoresis, the common view was that most loci were bi-allelic. For historical reasons, it was believed that each gene could have only two alleles, e.g. A and a, as in Mendel's study. Wright himself developed a framework to study multi-allelic loci (the infinite-alleles model), but this was not used until the 70s and Kimura.

Thus, W&C introduced a way to estimate Fst on multi-allelic loci. When they published the paper in 1984, the existence of multi-allelic loci was acknowledged, as electrophoresis had demonstrated that a protein could have more than one isoform.

Sample Size

In their 1984 paper, W&C dedicated a section to the problem of sample size. I guess that this is due to the need to calculate the Fst index on real data from protein electrophoresis. Probably one of the first issues at the time was how to compare two or more populations sampled with different sample sizes.

Moreover, in 1984 they were in the middle of the debate about the Neutral/Nearly Neutral Theory of evolution. In particular, Tomoko Ohta proposed that the strength of genetic drift depends on the effective size of a population - if the effective population size is large enough, then the effects of genetic drift are weaker. Thus, the concept of effective population size was very important at the time - so I imagine that it was important to determine whether the number of samples in a study was enough to describe the genetics of the whole population.

Formulas (previous answer)

As written in the same 1984 article, there were multiple definitions of Wright's Fst statistic. This was the most common:

[Image: formula for Wright's Fst; not recovered]

While Cockerham's Fst is:

[Image: formula for Cockerham's Fst; not recovered]
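For reference, the two forms usually quoted are sketched below. That they match the original images is an assumption. Here \sigma_p^2 is the variance of the allele frequency among populations, \bar{p} is the mean allele frequency, and a_u, b_u, c_u are the estimated variance components for allele u between populations, between individuals within populations, and between gametes within individuals.

```latex
% Wright's Fst for one biallelic locus, in terms of (unknown) population quantities
F_{ST} = \frac{\sigma_p^2}{\bar{p}\,(1 - \bar{p})}

% Weir & Cockerham's (1984) estimator, combining alleles u (and loci) as a ratio of sums
\hat{\theta} = \frac{\sum_u a_u}{\sum_u \left(a_u + b_u + c_u\right)}
```

The first expression is a ratio of parameters, so it lies between zero and one. The second is built from sample estimates of variance components, and the estimate of a can be negative, which is why the estimator can fall below zero.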

score 8 · Answer 2 · 2013-07-25

8

10.7 years ago

tiagoantao ▴ 690

I suppose you might have come across "Genetics in geographically structured populations: defining, estimating and interpreting FST" in Nat Rev Genet?

I think the rough intuition that should not be forgotten is that W&C is, in practice (though in theory this can be discussed), an estimator of Wright's Fst.

I do not have the original Wright formulation here, but you will notice that it assumes that you can know the allele frequency of each population precisely. This is, in practice, impossible (unless you sample all individuals from all populations!), so the allele frequency from real data will itself be an estimate. You will want your estimator of Fst to be, as far as possible, robust to sample size effects (which the original formulation was never meant to be, being constructed in an idealized theoretical model). And what goes for sample size effects also goes for number of alleles, population size, number of populations sampled (you do not normally sample all populations), ...
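To make the sample size effect concrete, here is a minimal simulation sketch. It is not taken from either paper. It assumes two populations that share the same true allele frequency (so the true Fst is 0), draws 2n alleles per population, and plugs the sample frequencies into Wright's ratio as if they were exact. The function names, the seed and the sample sizes are all illustrative choices.

```python
import random

def plug_in_fst(freqs):
    """Wright-style Fst with sample frequencies plugged in as if they were exact."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    if p_bar in (0.0, 1.0):
        return None  # locus is monomorphic in the sample, so Fst is undefined
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k
    return var_p / (p_bar * (1.0 - p_bar))

def mean_plug_in_fst(n_diploid, true_p=0.5, n_pops=2, reps=500, seed=1):
    """Average plug-in Fst when every population has the SAME true frequency."""
    rng = random.Random(seed)
    n_alleles = 2 * n_diploid  # each diploid individual contributes two alleles
    values = []
    for _ in range(reps):
        # Binomial sampling of allele copies in each population
        freqs = [
            sum(rng.random() < true_p for _ in range(n_alleles)) / n_alleles
            for _ in range(n_pops)
        ]
        fst = plug_in_fst(freqs)
        if fst is not None:
            values.append(fst)
    return sum(values) / len(values)

small = mean_plug_in_fst(n_diploid=10)
large = mean_plug_in_fst(n_diploid=500)
print(f"n = 10 per population:  mean plug-in Fst = {small:.4f}")
print(f"n = 500 per population: mean plug-in Fst = {large:.4f}")
```

Because the plug-in ratio can never be negative, sampling noise can only push it above the true value of 0. In this setting the bias is therefore upward, and it shrinks as the sample size grows.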

Wright's Fst works in an idealized situation; then you have a plethora of estimators (of which W&C is by far the most used/famous) that try to estimate Fst in non-ideal conditions (a.k.a. the real world). The choice of estimator depends mostly on the match between the conditions of your dataset and the properties of each estimator (each will be robust to different departures from the ideal model).
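As a worked example of one such estimator, the sketch below computes W&C's theta for a single biallelic locus. It uses the variance components a, b and c as I understand them from the 1984 paper. The function name and the toy inputs are hypothetical: n holds the number of diploid individuals sampled per population, p the sample allele frequencies, and h the observed heterozygote proportions. It does not handle a locus that is monomorphic across all samples, where the denominator is zero.

```python
def weir_cockerham_theta(n, p, h):
    """Single-locus, single-allele W&C (1984) theta from per-population summaries."""
    r = len(n)                       # number of populations sampled
    n_total = sum(n)
    n_bar = n_total / r              # average sample size
    n_c = (n_total - sum(x * x for x in n) / n_total) / (r - 1)
    p_bar = sum(ni * pi for ni, pi in zip(n, p)) / n_total
    # Sample variance of allele frequency over populations, weighted by n_i
    s2 = sum(ni * (pi - p_bar) ** 2 for ni, pi in zip(n, p)) / ((r - 1) * n_bar)
    h_bar = sum(ni * hi for ni, hi in zip(n, h)) / n_total
    core = p_bar * (1 - p_bar) - (r - 1) / r * s2
    # a: between populations, b: between individuals within populations,
    # c: between gametes within individuals
    a = (n_bar / n_c) * (s2 - (core - h_bar / 4) / (n_bar - 1))
    b = (n_bar / (n_bar - 1)) * (core - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
    c = h_bar / 2
    return a / (a + b + c)

# Two samples of 10 individuals with identical frequencies: no differentiation
same = weir_cockerham_theta(n=[10, 10], p=[0.5, 0.5], h=[0.5, 0.5])
# Two samples fixed for different alleles: complete differentiation
fixed = weir_cockerham_theta(n=[10, 10], p=[0.0, 1.0], h=[0.0, 0.0])
print(same, fixed)
```

With identical sample frequencies the estimate comes out slightly negative, because the correction subtracts the differentiation expected from sampling alone. With fixed differences it equals one.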

Again, I do not have the original document from Wright here, but I think this covers the general intuition...

ADD COMMENT • link 10.7 years ago by tiagoantao ▴ 690

1

This was very helpful also, thank you.

So in this case is W&C's estimator not biased by number of alleles? If it isn't I am not sure what all the fuss was with Jost's critique of the technique... Unless he was commenting on Wright's Fst, which no one seems to use in the literature anyhow.

ADD REPLY • link 10.7 years ago by confusedious ▴ 470

3

The world of "Fst analogues" can be a confusing place. Jost was really making his own point about what "differentiation" should mean, and developed a statistic that he considers to be a "true" measure of differentiation in allele frequencies. Earlier, Hedrick and Meirmans had pointed out that another analogue to Fst, Nei's Gst, is necessarily low when there is lots of diversity. I wrote a little about this problem, and some solutions, here, if that's a help to you.

ADD REPLY • link 10.7 years ago by David W 4.9k

2

You have probably seen "The Relationship Between FST and the Frequency of the Most Frequent Allele"? Using Gst as a proxy for Fst (I did not raise this before precisely because of the confusion you refer to).

ADD REPLY • link 10.7 years ago by tiagoantao ▴ 690

0

I just discovered this paper last night - it has demystified things a good deal.

ADD REPLY • link 10.7 years ago by confusedious ▴ 470

0

Sorry for the misunderstanding. I was making a general comment: estimators are developed to correct for some of these artifacts (or to generalize, as in the case of multiple alleles). W&C generalizes to multiple alleles. That does not mean it performs well under all circumstances. Out of laziness I never read the Jost paper (other than the abstract).

For example (this is from memory), while W&C is concerned with sample size, there is a paper that shows that the estimator needs large sample sizes if the real Fst is low. I.e., while the estimator was made to be concerned with sample size, it does not mean that it works well with all sample sizes...

An interesting observation on how some estimators behave (this is with Fst, but also with LD and surely many others) is that the sample sizes needed for precise and unbiased estimation of the parameter depend on the value of the parameter. E.g. with Fst estimators, you normally need a bigger sample size if the real field value is low. This creates a chicken and egg problem: when you are creating your experimental design, the number of samples needed to estimate a parameter will depend on the parameter value (which you are trying to calculate in the first place)...

ADD REPLY • link 10.7 years ago by tiagoantao ▴ 690