Estimating Purifying Selection With Codeml (Paml)
1
4
Entering edit mode
11.2 years ago
mgalactus ▴ 760

I've seen a number of tutorials on how to estimate positive selection with codeml (for instance by using the model M1a Vs. M2a and then using the lnL and the BEB method, as indicated here): What if I would like to highlight genes under purifying selection (dN/dS < 1)?

Should I run the analysis again using different models or should I parse the analysis I already have differently?

codeml evolution paml • 10.0k views
6
Entering edit mode
11.2 years ago
fransua ▴ 390

Using the output of your best model M1a or M2a you can check how many sites belong significantly to the first class of sites (w<1), this may be a solution.

However if you want an answer for the full gene, you may do a branch test with: H0: wanted branched fixed (w=1) H1: wanted branch free

If H1 has a better fit and an estimated omega <1 than dN/dS is significantly lower than 1.

note that you can either use one estimte of omega for all the background, let all background branches independent (perhaps a bad idea if you have big tree)

EDIT: tipically at the end of your main output file you will find a summary of the proportion and estimated values of the class of sites (M1a model has 2 class of sites the first: omega<1, and the second: omega=1):

dN/dS (w) for site classes (K=2)

p:   0.94075  0.05925
w:   0.07025  1.00000


thus here we would have 94% of the sites in our alignment belonging to a class of site with omega=0.07 and 6% of the sites belonging to a second class of sites with omega =1

these values are means for the whole alignment. In order to know the values of significance to belonging to conserved class of site, for each site, you may look at an other output file, the rst.

sample of rst file:

dN/dS (w) for site classes (K=2)

p:   0.94075  0.05925
w:   0.07025  1.00000

Naive Empirical Bayes (NEB) probabilities for 2 classes & postmean_w
(amino acids refer to 1st sequence: Gorilla_gorilla)

1 M   0.97088 0.02912 ( 1)  0.097
2 A   0.95138 0.04862 ( 1)  0.115
3 R   0.97919 0.02081 ( 1)  0.090
4 Y   0.98049 0.01951 ( 1)  0.088
5 R   0.97381 0.02619 ( 1)  0.095
....


here for example first site has a probability of belonging to the first class of site of 97%, and its estimated value of omega is 0.097

this output is for model M1a, where only NEB probabilities are computed, in M2a model BEB probabilities are also computed and these values are though to be more accurate (see PAML doc.)

0
Entering edit mode

When you talk about "sites belong significantly to the first class of sites", what do you mean exactly? Which parameters have to be extrapolated from the output file?

1
Entering edit mode

M1a model assumes 2 classes of sites, the first with omega<1 and the second with omega=1. (I edit to complete the answer)

0
Entering edit mode

I think you pretty much answered my question, i'll try it out and let you now, many thanks!

0
Entering edit mode

I've read the PAML doc for a while and I think that the NEB and BEB are used for the purpose of find positively selected genes: i've started to think that one reasonable way to find genes undergoing purifying selection is to take those genes with the proportion of sites with omega < 1 over a certain threshold (say 95% of the sites). I think i'll use model M2a (or should I look at the lnL to decide which of the two models is better?)

1
Entering edit mode

you should use model paasing LRT, however as there is no BEB under M1 model, and your are not loknig for sites under positive selection, it might not be too bad to use M2a BEB results, for a descriptive analysis of the strngth of selection along the sequence. the methodology you propose may not allow you to say that a gene is under negative selection, but that a gene is quite conserved in 95% of its sites (positive selection may occur in the 5% remaining). it depends on what you need to answer, but to be abble to say that a gene is undergoing purofying selection (with some statistical power) a good thing would be to contrast it with the rest of the phylogeny, and thus, use a branch model. note that if you are reticent about using it because of computation time, you can estimate all background branch together (something like this: (speciesofinterest#2,b#1,(c#1,d|#1)#1);). This may be quite fast in comparison to M2a for example.

0
Entering edit mode

Thanks for this really detailed answer: I ended up using SLR, which has a really nicely parsable output file, and as far as I have seen seems to yield comparable results.