Detect Positively Selected Sites Among A Lineage In Paml
3
3
Entering edit mode
11.2 years ago

Dear All, I have a lineage that is all parasitic plant sequences, I want to examine inside the sequences, if there are sites evolving under positive selection.

Now I have a tree file, which is a parasitic-plant lineage and some other sequences (from several legume species). This is a part of a tree reflecting horizontal gene transfer. We detected that our parasitic plants got this gene from its host Medicago(non parasitic plants). We are very interested in identifying if sequences in the parasites have some sites that evolve some new function. However, because this sequences are very similar, so on the whole sequence level, Ks/Ks < 1, suggesting it evolves some negative selection. So I was wondering if there are particular sites under positive selection.

I use the following major parameters in my .ctl file fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = .3 * initial or fixed kappa fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate omega = 1.3 * initial or fixed omega, for codons or codon-based AAs ncatG = 10 * # of categories in the dG or AdG models of rates

I am not sure if my settings are correct. If anyone can give me some suggestions on how to do this in paml or give me some suggestions, I would appreciate it a lot. I run this in paml it seems that there is no positively selected sites, yet I am not sure.

Anyone has some experience in identifying sites under positive selection for a particular lineage? Suggestions are welcome!

Many thanks!

Best, Jenny

paml • 11k views
ADD COMMENT
2
Entering edit mode
11.1 years ago
Rahul Sharma ▴ 660

Hi

1) You can calculate the degree of freedom by subtracting the number of parameters (np) of H0 codeml output file (model=0 in this case) from H1 codeml output file (model=1 in this case). Number of parameters is specified by np variable in the output file.

2) In this case I would do the codeml Branch-site model, it will out the branch and sites under selection. And will pick only those genes which are having at least one site with BEB confidence >95%. This test will let you know which species gene is under selection pressure along with the sites under selection. In further analysis you can check if the selected sites are present in the functional domain of the protein, by visualizing its structure using PDB database (If its structure has been reported).

3) In case of Branch-site model one has to make ctr file like this:

For H0: model=2, NSsite=2, fix_omege=1, omega=1.

For H1: model=2, NSsite=2, fix_omege=0, omega=1.

2∆l=lnL1-lnL0

d.f = np1-np0

I would do LRT, BC and FDR at 5% and 1% level of significance, and consider the sites >95% BEB confidence with FDR 1% level of significance.

Hope this will be helpful,

Best wishes, Rahul

ADD COMMENT
1
Entering edit mode
11.2 years ago
Rahul Sharma ▴ 660

Hi

Finding positively selected sites with great accuracy is very tricky. To find positively selected sites within certain lineage, you have to mark that lineage as foreground in the tree by labeling it with "#1" or "$1". Selected genes can be predicted using LRT test and multiple hypothesis testing using BC or FDR @ 5 or 1% level of significance. Further positively selected sites you can pick from the output file, if the BEB confidence is >95% or 99%, will depend on your choice.

Best, Rahul

ADD COMMENT
0
Entering edit mode
11.2 years ago

Thanks a lot, Rahul! If we were to see if there is positive selection among a certain branch, we conduct test for cases where we choose model = 0 (fixed rate of change for all branches) and choose model = 1(allow the branch that we specified to change freely). We can get the likelihood ratio for both tests, and conduct a chi-square test to see if there is significant difference between the w of our branch vs. the background w. I wonder what the degree of freedom is, how we can estimate that.

The test that involves choosing two models (model =0 vs. model =1) makes sense and allows us to determine if there is a significant difference between our foreground and background. Also it gives us w of our branch of interest and background branch. If when model =1 (free model), we found that our branch w has higher value compared to the background w(when model =0, i.e. all branches have the same w), what can we conclude? Can we say that our branch has relaxed evolution of rate compared to the background branch? Or our branch has faster evolutionary rate?

However, when examining if there is sites that evolve under positive selection, the parameters that people set makes me puzzled. They use branch-site model, so they set model =0, NSsites = 0 1 2. The output result shows that all branches have the same omega, why is this? I know that NSsites = 0, 1, 2 are testing cases of neutral evolution, positive and negative selection. Is it because they are allowing sites to be changed, so the whole branch keeps fixed? And that's the model that they detect if there is positive selected sites?

I run this in paml where I want to test if there is some site that undergo positive selection for my branch(which contains about 15 seqs and I labeled this lineage as #1), I tested no sites that are significant under 5% or 1% significance level. And the result gives me the same w for all the branches. At that time, it puzzled me. Now if my understanding is right, that makes sense right? Anyone can confirm if I am right? (My understanding is that when testing sites of positive selection among a lineage, model =0, assuming all branches have the same w, but allowing different sites to evolve differently?)

Then it comes to me another question, will there be a difference if I set the background of my tree differently ? Will it gives me different results? When running paml, what is the criteria of choosing background taxa or seqs if we want to get significant results?

Thank all for your attention! Jenny

ADD COMMENT
0
Entering edit mode

It is a 20 months old thread, but I am adding some answers to the queries posted by yangzhenzhen1988 above, which may benefit others. Please read the query completely in the above post to get a clear idea of what I am answering.

1. I wonder what the degree of freedom is, how we can estimate that.

If we look into the result file, we will see a line similar to as follows:

lnL(ntime: 11  np: 12):  -2788.563776      +0.000000

Here np is the number of parameters

So, our degree of freedom = (np of the alternate model - np of Null model )

2. If when model =1 (free model), we found that our branch w has higher value compared to the background w(when model =0, i.e. all branches have the same w), what can we conclude? Can we say that our branch has relaxed evolution of rate compared to the background branch? Or our branch has faster evolutionary rate?

To my understanding, free model only says that different branches have different rate of evolution. It dose not confirm if any of the branch have positive selection. To test if a branch have positive selection, we have to use model=2 and also would have to use the tree file with specifying the branch (for which we are testing if there is positive selection) with the tag #1.

3. Is it because they are allowing sites to be changed, so the whole branch keeps fixed? And that's the model that they detect if there is positive selected sites? (read the complete question from the above post)

Yes, This is the to find out if there are positively selected sites, and this is called site model (not branch-site model as assumed in the questions).

4. Then it comes to me another question, will there be a difference if I set the background of my tree differently ? Will it gives me different results? When running paml, what is the criteria of choosing background taxa or seqs if we want to get significant results?

I do not understand the question completely, but I would add a note. Running codeml with different starting tree topology gives different results.

ADD REPLY

Login before adding your answer.

Traffic: 2453 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6