Question: How To Simulate The Null Distribution For Tajima'S D?
3
7.8 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

I know there are a few populations genetics/molecular evolution types in the Biostar community, so hopefully someone has already run into this problem.

I've calculated Tajima's D statistic for a couple of DNA alignments using the R package PEGAS, which gives a p-value using a Beta distribution. It happens that the Beta distribution doesn't perfectly model the distribution of D under neutral evolution, so many authors simulate data under neutrality to make a null distribution. Unfortunately, all the methods for these papers tend to say about this procedure is "we used `ms` to simulate the null...".

Does anyone know how to do this properly. My understanding was you simulated data with a matching number of segregating sites, so it you had 100 samples and 20 segregating sites you'd do

``````ms 100 1000 -s 20
``````

giving 1000 simulations. However when I calculate D from those simulations I get a skewed distribution with a mean near to 1 (I would have though it would b 0) but a p-value pretty close to the one using the Beta distribution. It's entirely possible I messed up somewhere in the calculations, (you can see my function here) but I thought I'd make sure I'm actually simulating the distribution correctly before I pick that apart.

population sequence • 4.3k views
written 7.8 years ago by David W4.7k

Yeah! Love population genetics questions!

4
7.8 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

Aha! Found the problem! After checking you code carefully for mistakes (found none) I've noted that your interpretation (and expectation) of the test could be you problem. Tajima's D isn't normally distributed and is naturally skewed. By using a fixed number of segregating sites, 4Nu/2Nu will vary between samples to enforce the restriction. So, you can obtain a positively skewed distribution even in a neutral setup. I'm not aware of the overall sensitivity of Tajima's D in this situation. You can try to generate a much larger number of samples and verify if the skeweness still persists. If PEGAS is correct, it should decrease.

Cheers Jarretinha, looks like I might need to do some digging into pop. gen. or just use Arlequin and move on :)

3
7.8 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

You are on the right track and Jarretinha provides additional (correct) insight into the problem. If you need to check whether your solution in R is giving good results, you can run a test alignment through your script and also through DNAsp, which provides methods to compute the confidence interval on Tajima's D via standard neutral coalescent simulations.

3
7.8 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

(not sure if this should be an answer, a comment or an edit to the original question - let me know if I'm doing it wrong...)

So, it turns out there was something funny going on in the way I was calculating D_ from my simulated data (I suspect the polymorphism counting step, but having dissected that yet). Thankfully `ms` comes with another utility `sample_stats` that calculates some of the standard diviersity indices including _D for each replicate so

``````ms 20 100 -s 20 | sample_stat > sims.tsv
``````

Creates a text file with D_ in the sixth column. To call this in R and get the distribution of _D I've used

``````cmd <- paste("ms", n, nreps, "-s", S, "| sample_stats")
sims <- system(cmd, intern=TRUE)
get.D <- function(x) unlist(strsplit(x, "\t"))[6]
Ds <- as.numeric(unlist(lapply(sims, get.D)))
``````

Which gives a result close to that of Arlequin (haven't checked DNAsp).

Thanks Casey and Jarretinha - both helpful answers, Jarrenthinha get's the Big Green Tick since getting to the bottom of what I was actually expecting from the simulations is what helped me get this answer.