Question: Selection from population data
6.1 years ago by
Adrian Pelin2.4k wrote:


I am interested in traces of positive selection in my population data. I have been able to calculate Watterson's Pi and Theta for synonymous and non-synonymous sites for every gene in my genome.

The problem is that I am a bit lost as to how to look for positive/negative selection. I do not really understand what these values are, Pi and Theta. I have seen literature where Pi(A) is divided by Pi(S), that's sort of like dN/dS, and if the ratio is bigger than 1, then we can infer positive selection?

Thanks for any help,


selection theta pi watterson popgen • 4.8k views
modified 6.0 years ago by Chrispin Chaguza260 • written 6.1 years ago by Adrian Pelin2.4k
6.1 years ago by
United States
Zev.Kronenberg11k wrote:

Population genetics can be difficult to break into, but worth it!  I found a recent review that provides a decent overview of the current methods.  It might not directly answer your question, but it is a good place to start.


"Detecting Natural Selection in Genomic Data"

written 6.1 years ago by Zev.Kronenberg11k

This is very interesting, thanks for pointing me in the right direction.

I understand that measuring Fst is a powerful way of identifying genes under diversifying selection. After computing Fst values, is there any way to determine which genes are significantly affected? Is there a statistical test that can determine which genes are evolving significantly quicker than others? I have 9 population samples, so Fst is computed pairwise between any 2 populations.

modified 6.1 years ago • written 6.1 years ago by Adrian Pelin2.4k

I may be a bit late to be of help here, but the way I have done this in the past is to calculate Fst on a SNP-by-SNP basis between each possible population pair, and then see which SNPs fall into the upper tail of the distribution of Fst scores (say the highest 1%, which are the strongest candidates for positive selection). From there you can work out which genes these SNPs fall into fairly easily using an R package called NCBI2R. If you want to look for functional trends, you can then run the resulting gene list through a GO term overrepresentation test like GOrilla (web-based and free; it also uses FDR instead of the overly conservative Bonferroni correction).
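As a rough illustration of the outlier approach described above, here is a minimal Python sketch. It assumes you already have per-SNP allele frequencies for each population (which pooled sequencing can estimate), stored in a dictionary layout I made up for this example. It uses the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T with equal population weights, not a sample-size-corrected estimator, so treat it as a starting point only.

```python
from itertools import combinations
from math import ceil


def pairwise_fst(p1, p2):
    """Simple Fst for one biallelic SNP from two allele frequencies."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                              # SNP is fixed in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def fst_outliers(freqs, fraction=0.01):
    """Return {(pop_a, pop_b): [SNP ids in the top `fraction` of Fst scores]}.

    freqs: {snp_id: {population_name: allele_frequency}} (hypothetical layout)
    """
    pops = sorted(next(iter(freqs.values())))
    outliers = {}
    for a, b in combinations(pops, 2):
        scores = {snp: pairwise_fst(f[a], f[b]) for snp, f in freqs.items()}
        k = max(1, ceil(fraction * len(scores)))
        ranked = sorted(scores, key=scores.get, reverse=True)
        outliers[(a, b)] = ranked[:k]
    return outliers


# Made-up frequencies: 99 weakly differentiated SNPs plus one strong outlier.
example = {f"snp{i}": {"popA": 0.5, "popB": 0.5 + 0.001 * i} for i in range(99)}
example["outlier"] = {"popA": 0.05, "popB": 0.95}
result = fst_outliers(example, fraction=0.01)
```

With 9 populations the same loop runs over all 36 pairs, and SNPs that turn up in the tail for several pairs are probably the more interesting candidates.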

written 6.0 years ago by confusedious420
6.1 years ago by
New Zealand
David W4.8k wrote:

The paper Zev links to provides a very good intro to this field.

I thought I'd just add that the specific statistics you mention, Tajima's estimator (pi) and Watterson's estimator of theta, form the basis of Tajima's D.

Briefly, the idea is that if a gene has been subject to directional selection (i.e. positive or negative selection), those variants that are present will be at low frequency, so nucleotide diversity (pi) will be low relative to Watterson's theta (which is based only on the number of segregating sites), giving a negative D. A positive value for D would suggest balancing selection (maintaining an excess of medium-frequency alleles). BUT, Tajima's D is also affected by demography, since population expansion also leads to an excess of rare alleles.
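To make the comparison between pi and Watterson's theta concrete, here is a minimal Python sketch of Tajima's D using the standard formula. It assumes that for each segregating site you have the count of one allele among n sampled chromosomes; the function name and input layout are mine, for illustration only. Pi and theta are per-locus totals here (not per site), and nothing in it corrects for pooled sequencing.

```python
from math import sqrt


def tajimas_d(allele_counts, n):
    """Return (pi, theta_w, D) for one locus, or None if D is undefined.

    allele_counts: count of one allele at each site (hypothetical input layout)
    n: number of sampled chromosomes
    """
    # Keep only sites that are polymorphic in the sample.
    counts = [k for k in allele_counts if 0 < k < n]
    s = len(counts)                      # number of segregating sites
    if n < 4 or s == 0:
        return None                      # the variance term is zero for n < 4

    # pi: average number of pairwise differences, summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in counts)

    # Watterson's theta: segregating sites scaled by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = s / a1

    # Constants for the variance of (pi - theta_w).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    d = (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return pi, theta_w, d


# Toy inputs: five singletons vs. five intermediate-frequency sites, n = 10.
rare = tajimas_d([1, 1, 1, 1, 1], 10)
common = tajimas_d([5, 5, 5, 5, 5], 10)
```

In the toy inputs, the all-singleton locus gives a negative D (excess of rare alleles) and the intermediate-frequency locus gives a positive D, matching the interpretation above.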

As Zev's paper describes, there are a whole suite of measures that are more or less sensitive to different demographic and population genetic processes.

I'm not aware of a test that compares Pi_non-syn with Pi_syn, though some tests like McDonald-Kreitman include those values along with divergence stats.
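For reference, the McDonald-Kreitman test boils down to a 2x2 table of non-synonymous and synonymous counts, split into fixed differences (divergence from an outgroup) and polymorphisms within the population. A minimal sketch, assuming you have already counted those four numbers for a gene; the function names are illustrative, and the Fisher's exact test is hand-rolled with the standard library to keep the example self-contained:

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality_index, alpha, p_value) from the four MK counts.

    dn, ds: fixed non-synonymous / synonymous differences
    pn, ps: polymorphic non-synonymous / synonymous sites
    """
    if min(dn, ds, ps) == 0:
        ni = alpha = None                # ratio undefined with a zero count
    else:
        ni = (pn / ps) / (dn / ds)       # neutrality index
        alpha = 1 - ni                   # estimated fraction of adaptive substitutions
    return ni, alpha, fisher_exact_two_sided(dn, ds, pn, ps)


# Made-up counts for illustration only.
ni, alpha, p_value = mk_test(dn=20, ds=10, pn=5, ps=40)
```

A neutrality index below 1 (alpha above 0) points to an excess of non-synonymous fixed differences, which is the pattern expected under positive selection. Note that this needs an outgroup sequence for the divergence counts, which polymorphism data alone does not provide.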


modified 6.1 years ago • written 6.1 years ago by David W4.8k
6.1 years ago by
confusedious420 wrote:

You could consider using the integrated haplotype score (iHS) if you have adequate data and are interested in relatively recent selection-driven change. This is a reasonably straightforward way of looking for signals of positive selection.

This has been used often in studies of recent human evolution.


If your data include sub-populations of reasonable sizes, you could also consider using Fst to find variants that might indicate selection-driven differences between sub-populations.

modified 6.1 years ago • written 6.1 years ago by confusedious420
6.0 years ago by
Wellcome Sanger Institute
Chrispin Chaguza260 wrote:

As you probably already know, interpreting these values can be very tricky. For example, how do you test whether a Tajima's D estimate is significant? The Wikipedia link shows a table that summarizes how to interpret the results, and it also provides a 'rule of thumb' suggesting that values less than -2 or greater than +2 are generally significant (though these do not represent critical values).

There is also a paper that provides a method for constructing critical values for Tajima's D (and similar statistics).

written 6.0 years ago by Chrispin Chaguza260

Just a comment here: if you want a quick and easy way to calculate Tajima's D, download MEGA. It's free and calculates this statistic from an alignment file, which is very simple.

You're on your own though on figuring out whether the D value is significant.

modified 6.0 years ago • written 6.0 years ago by confusedious420

The problem is that I have .vcf/SNP data, not alignments.

written 6.0 years ago by Adrian Pelin2.4k

Ah, well if that's the case then methods like Fst outlier analysis or integrated haplotype score would be the way to go.

written 6.0 years ago by confusedious420

A shameless plug, but try out my GPAT suite of tools for selection:

written 6.0 years ago by Zev.Kronenberg11k

Forgot to mention, I am working on spores, and since you can't sequence single spores efficiently, my samples consist of populations of spores, so in a way it is a pooled sample. Very hard to call SNPs and phase data.


written 6.0 years ago by Adrian Pelin2.4k

That does make things a bit harder. One approach you could take, though somewhat speculative, would be to calculate a diversity score of some kind at each locus and then look at the distribution of this score. Scores at the low end of the spectrum might indicate loci that have undergone a selective sweep or purifying selection, and scores at the high end may be examples of loci under balancing selection. This isn't an iron-clad way of doing things, but it lets you say something about the data rather than not much.
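A minimal Python sketch of that idea, assuming pooled allele frequency estimates per SNP and a known length for each locus. The diversity score used here (sum of 2p(1-p) per site) is just one possible choice, the 5% tail cutoff is arbitrary, and all names and numbers are made up for illustration:

```python
def locus_diversity(allele_freqs, locus_length):
    """Expected heterozygosity summed over SNPs, divided by locus length."""
    return sum(2 * p * (1 - p) for p in allele_freqs) / locus_length


def flag_tails(scores, tail=0.05):
    """Return (low_loci, high_loci): the loci in each tail of the distribution.

    scores: {locus_name: diversity_score}
    """
    ranked = sorted(scores, key=scores.get)     # lowest diversity first
    k = max(1, int(len(ranked) * tail))
    return ranked[:k], ranked[-k:]


# Made-up data: 18 ordinary loci, one with very low and one with high diversity.
scores = {
    f"gene{i}": locus_diversity([0.1 + 0.01 * i, 0.2], 1000) for i in range(18)
}
scores["sweep_like"] = locus_diversity([0.01], 1000)
scores["balanced_like"] = locus_diversity([0.5, 0.5, 0.5, 0.5], 1000)

low, high = flag_tails(scores, tail=0.05)
```

Loci in `low` are candidates for sweeps or purifying selection and loci in `high` for balancing selection, with the caveat that low coverage or mapping problems in pooled data can produce the same patterns.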

I second Zev's recommendation to look at GPAT. I just took a look at the GitHub site and it does look very useful. I wish I had known about it back when I was doing my Master's thesis and calculating pairwise Fst at ~3,000,000 individual SNPs between three populations (I did it in R, and it took forever).

written 6.0 years ago by confusedious420