Question: How To Simulate The Null Distribution For Tajima'S D?
gravatar for David W
7.8 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

I know there are a few populations genetics/molecular evolution types in the Biostar community, so hopefully someone has already run into this problem.

I've calculated Tajima's D statistic for a couple of DNA alignments using the R package PEGAS, which gives a p-value using a Beta distribution. It happens that the Beta distribution doesn't perfectly model the distribution of D under neutral evolution, so many authors simulate data under neutrality to make a null distribution. Unfortunately, all the methods for these papers tend to say about this procedure is "we used ms to simulate the null...".

Does anyone know how to do this properly. My understanding was you simulated data with a matching number of segregating sites, so it you had 100 samples and 20 segregating sites you'd do

ms 100 1000 -s 20

giving 1000 simulations. However when I calculate D from those simulations I get a skewed distribution with a mean near to 1 (I would have though it would b 0) but a p-value pretty close to the one using the Beta distribution. It's entirely possible I messed up somewhere in the calculations, (you can see my function here) but I thought I'd make sure I'm actually simulating the distribution correctly before I pick that apart.

population sequence • 4.3k views
ADD COMMENTlink written 7.8 years ago by David W4.7k

Yeah! Love population genetics questions!

ADD REPLYlink written 7.8 years ago by Jarretinha3.3k
gravatar for Jarretinha
7.8 years ago by
São Paulo, Brazil
Jarretinha3.3k wrote:

Aha! Found the problem! After checking you code carefully for mistakes (found none) I've noted that your interpretation (and expectation) of the test could be you problem. Tajima's D isn't normally distributed and is naturally skewed. By using a fixed number of segregating sites, 4Nu/2Nu will vary between samples to enforce the restriction. So, you can obtain a positively skewed distribution even in a neutral setup. I'm not aware of the overall sensitivity of Tajima's D in this situation. You can try to generate a much larger number of samples and verify if the skeweness still persists. If PEGAS is correct, it should decrease.

ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by Jarretinha3.3k

Cheers Jarretinha, looks like I might need to do some digging into pop. gen. or just use Arlequin and move on :)

ADD REPLYlink written 7.8 years ago by David W4.7k
gravatar for Casey Bergman
7.8 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

You are on the right track and Jarretinha provides additional (correct) insight into the problem. If you need to check whether your solution in R is giving good results, you can run a test alignment through your script and also through DNAsp, which provides methods to compute the confidence interval on Tajima's D via standard neutral coalescent simulations.

ADD COMMENTlink written 7.8 years ago by Casey Bergman18k
gravatar for David W
7.8 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

(not sure if this should be an answer, a comment or an edit to the original question - let me know if I'm doing it wrong...)

So, it turns out there was something funny going on in the way I was calculating D_ from my simulated data (I suspect the polymorphism counting step, but having dissected that yet). Thankfully ms comes with another utility sample_stats that calculates some of the standard diviersity indices including _D for each replicate so

ms 20 100 -s 20 | sample_stat > sims.tsv

Creates a text file with D_ in the sixth column. To call this in R and get the distribution of _D I've used

cmd <- paste("ms", n, nreps, "-s", S, "| sample_stats")
sims <- system(cmd, intern=TRUE)
get.D <- function(x) unlist(strsplit(x, "\t"))[6]
Ds <- as.numeric(unlist(lapply(sims, get.D)))

Which gives a result close to that of Arlequin (haven't checked DNAsp).

Thanks Casey and Jarretinha - both helpful answers, Jarrenthinha get's the Big Green Tick since getting to the bottom of what I was actually expecting from the simulations is what helped me get this answer.

ADD COMMENTlink written 7.8 years ago by David W4.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1629 users visited in the last hour