Question

Number of species for positive selection analysis

0

5.1 years ago

sammy.ich17 ▴ 20

I have read several papers that use PAML for positive selection analysis among different species. However, the number of species used in these analyses varies widely. For instance some papers used single copy ortholog alignment of 3 species while some used >35 species. My question is how much the results (the signal of positive selection) depend on the number of species used in the analysis. Further, what would be an appropriate number of species for the analysis?

evolution positive selection PAML alignment genome • 2.6k views

ADD COMMENT • link updated 5.1 years ago by pltbiotech_tkarthi ▴ 180 • written 5.1 years ago by sammy.ich17 ▴ 20

0

Hello rprog008, do you have a reference suggesting that one should not use fewer than 10 species for the Maximum Likelihood method?

ADD REPLY • link 5.1 years ago by pltbiotech_tkarthi ▴ 180

0

As I said at the beginning of my answer, I can't remember where I read this, so at present there is no reference :) We strictly follow this rule because most of the time reviewers ask for analysis of a large dataset. If you are lucky, the reviewer will not comment on a smaller dataset, but if you are unlucky, the reviewer will ask you to redo the analysis. Hence, why take the chance?

ADD REPLY • link 5.1 years ago by rprog008 ▴ 70

0

I know that one should use larger datasets rather than smaller ones. That is not my question. How can you claim that one should not use fewer than 10 species, as you have written here? You can even see above that sammy.ich17 stated "For instance some papers used single copy ortholog alignment of 3 species while some used >35 species". If you have a reference for your answer, it is acceptable; otherwise, without a reference, your answer is not acceptable.

ADD REPLY • link 5.1 years ago by pltbiotech_tkarthi ▴ 180

1

Read this discussion on ResearchGate, where one member has written:

4 species is on the smaller scale. If you do not have >10 sequences, you just don't have much power for detecting specific sites under positive selection (Murrell et al, 2010 - considering site classes are considered fixed effects).

https://www.researchgate.net/post/Any_advice_for_using_PAML_to_estimate_dn_ds_for_4_sequences_only

https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002764

For more information, please read in detail. Hope this helps !!!!

By the way, my answer is valid and acceptable since 2012 :p

ADD REPLY • link 5.1 years ago by rprog008 ▴ 70

0

The ResearchGate conversation was about the number of sequences. If you go through that conversation again, someone does mention the use of even 4 species for a smaller-scale analysis, and the same person suggested using more than 10 sequences. That means one can use fewer than 10 species, but needs to add more sequences from accessions/genotypes within each species. Nobody said not to use fewer than 10 species for the analysis. The PLOS Genetics article you provided was about the number of samples, not the number of species. So please go through the ResearchGate conversation again: did anyone argue against using fewer than 10 species for the analysis? I hope this helps you understand that a small number of species can be used, even for a smaller-scale analysis.

ADD REPLY • link 5.1 years ago by pltbiotech_tkarthi ▴ 180

1

Leave it !!! You are not understanding your own question. If you just want to prove that others are wrong, then you have to prove it. Sometimes we need to use our brains too.

Have you ever thought about what it means when we take a single-copy ortholog from each species? Here each sequence represents a species. Did you understand this basic point?

( For your reference read the question "For instance some papers used single copy ortholog alignment of 3 species while some used >35 species")

Moreover, read the ResearchGate thread again: due to insufficient sequences/species, the professor changed the research topic. Here the respected professor knows the reality and is not stubborn like others.

I know you will come up with a new idea of blame game. If you want, you can do a PAML analysis with a single sequence too. You must be wondering how, but you are clever enough; one day you will do that too. Thus, I quit, because discussing with you is like banging my head against the wall. :)

Research is not always about looking for reference. We have to use our brain and knowledge often to come to a conclusion.

Don't always look for a reference, and don't jump to conclusions too fast. They can be misleading sometimes.

It's research, not a marathon. :p Have you ever performed a codeml analysis (I don't think so)? If not, try one; you will understand more about its parameters and models.

This is one good tutorial for your reference :p

https://www.ncbi.nlm.nih.gov/pubmed/25388108

Hope this helps. Best wishes for your research ..:)

ADD REPLY • link 5.1 years ago by rprog008 ▴ 70

1

This is getting out of hand, risks no longer being civil, and doesn’t appear to be getting anyone any closer to an answer.

I’m sure 10 sequences is a rule of thumb not an absolute lower bound, so, as with everything, take it advisedly. In short, if you have the computational resources, do as much as you can.

If there are references for ideal numbers, then please provide them, and the same goes for the other side of the argument. I would reason that another handwavy discussion from researchgate et al. doesn’t really qualify as a reference.

If there are relevant passages in the linked papers, I would ask that the participants organise the thoughts in to an orderly manner and post an answer accordingly, with real consideration as to whether it answers the OP.

ADD REPLY • link 5.1 years ago by Joe 21k

0

I agree with rprog008, but I would like to add some points copied from

https://groups.google.com/forum/#!topic/pamlsoftware/7mylvrvmWVU

I do think the number of species is too small. FAQ page 9 of PAML (Specifically to @pltbiotech_tkarthi)

How many species are needed?

I suppose the absolute minimum is 4 or 5 if the sequence divergence is optimal. 10 would be good, while 20 would be much better. This will depend on how divergent the sequences are.

Thus, here another user also emphasized that when we use 10 or more sequences/species, we get more accurate results.

I too would suggest that @pltbiotech_tkarthi read more about PAML and codeml before jumping to conclusions.

ADD REPLY • link 5.1 years ago by aloke205 ▴ 40

0

Thanks @aloke205 for this reference. I remember now where I read this. :) To @pltbiotech_tkarthi: if you read my first answer below carefully (for your kind reference), I wrote "With fewer than 10 sequences, results obtained from PAML are sometimes questionable". Now you know why they are sometimes questionable !!! It's a free world; you can do the analysis with any number of sequences you want. But while doing research we have to be very careful :p

Best wishes

ADD REPLY • link 5.1 years ago by rprog008 ▴ 70

0

Hi rprog008, nobody is trying to prove that your answer is wrong, but you said "I know you will come up with a new idea of blame game". Should I repeat the same question and statements to you? Of course, without any man-made procedure and human input, how would the computer do the analysis by itself?

"Even if we use a small number of species or a single ortholog, if you include different geographical isolates or accessions within each species, you would probably obtain a result of positive selection, if the sequence variation of a particular ortholog is high even within species." I posted this explanation at the beginning of the discussion. Did you understand this basic point yourself?

In scientific research, we can't accept anything without proper references; you must have had the experience of reviewers asking for references when you revise a manuscript.
Both experience and empirical evidence play a major role in research.

Moreover, please properly read the suggestions of @jrj.healey

" I’m sure 10 sequences is a rule of thumb not an absolute lower bound, so, as with everything, take it advisedly. In short, if you have the computational resources, do as much as you can.

If there are references for ideal numbers, then please provide them, and the same goes for the other side of the argument. I would reason that another handwavy discussion from researchgate et al. doesn’t really qualify as a reference"

Thanks @jrj.healey for your suggestions.

It is not that scientific references are sometimes misleading; if we have proper references, they guide us in specific aspects of research. Please don't draw conclusions about others when you don't know them. How can you conclude whether I have run a codeml analysis or not? Yes, research is not a marathon, but that does not mean we have to run. You can run or walk, it is up to you, but we should at least walk the path of research if we want to reach the goal. Let us end this discussion here, and I wish you all the best with your research!

ADD REPLY • link 5.1 years ago by pltbiotech_tkarthi ▴ 180

1

Read my first comment again... 😑☺ I said fewer than 10 may be questionable sometimes; I never said always. Others too are saying that going for 10 sequences is a rule of thumb. One of the reasons was already stated by aloke and others. My intention was to say: why not play it safe rather than repeating your experiment again and again? Yes, if you don't have the computational resources, you can go for fewer sequences too. codeml will definitely give some result. It's up to us to infer whether we are getting a correct result or not. I'm still on track, by the way 😂..

ADD REPLY • link 5.1 years ago by rprog008 ▴ 70

score 0 · Answer 1 · 2019-03-26

Maximum Likelihood is one of the best methods for selection and phylogenetic analysis. Increasing the number of species would obviously increase the chance of having higher divergence for a particular ortholog, since you would have a higher chance of observing the non-synonymous and synonymous variation from which Ka/Ks (the ratio of non-synonymous to synonymous substitution rates) is estimated. A Ka/Ks > 1 indicates positive selection. So detecting positive selection does not depend only on how many species you select; it also depends on whether the selected species are closely or distantly related. Even if you select different geographic isolates/accessions of the same species for a particular gene, you would probably get a result of positive selection, since variation and Ka/Ks can be higher even within a species for a particular gene when the isolates come from different geographical regions. So with more species you can certainly expect a better chance of detecting positive selection; also try to increase the number of accessions/genotypes/geographical isolates within each species to see whether you obtain positive selection with higher polymorphism.
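
For readers following the Ka/Ks argument in this answer, here is a minimal Python sketch of how the ratio is conventionally read. The function name `classify_omega` and the `tolerance` band around 1 are illustrative assumptions, not part of PAML. In practice codeml estimates the ratio by maximum likelihood, and a point estimate on its own is not a significance test.

```python
def classify_omega(ka, ks, tolerance=0.05):
    """Return (omega, label) for a Ka/Ks point estimate.

    ka: non-synonymous substitution rate (Ka, also written dN)
    ks: synonymous substitution rate (Ks, also written dS)
    tolerance: hypothetical band around 1 treated as "neutral";
               chosen here only for illustration.
    """
    if ks <= 0:
        # The ratio is undefined when no synonymous change is observed.
        raise ValueError("Ks must be positive to compute Ka/Ks")
    omega = ka / ks
    if omega > 1 + tolerance:
        return omega, "positive selection"
    if omega < 1 - tolerance:
        return omega, "purifying selection"
    return omega, "neutral"


if __name__ == "__main__":
    for ka, ks in [(0.30, 0.10), (0.05, 0.50), (0.20, 0.20)]:
        omega, label = classify_omega(ka, ks)
        print(f"Ka={ka} Ks={ks} omega={omega:.2f} {label}")
```

The sketch only labels a ratio that has already been estimated; it says nothing about how many species are needed to estimate it reliably, which is the actual question in this thread.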

score 0 · Answer 2 · 2019-03-27

Kind notice: At present I can't remember where I read this, but I would like to share the information below.

Generally, the number of species should not be less than 10. With fewer than 10 sequences, results obtained from PAML are sometimes questionable. So we should always perform PAML with at least 10 species.

If you have more than 100 sequences, the analysis becomes very computationally demanding. Hence, in that case, it is wise to randomly sample 100 sequences and then perform the PAML analysis on those 100 sequences.

I use the same approach for my analysis.

Hope this helps.
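
The random sampling suggested in this answer could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the helper names `parse_fasta` and `subsample`, the default cap of 100, and the fixed seed are all hypothetical choices, and the input is assumed to be plain FASTA text. It does not call PAML itself.

```python
import random


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample(records, max_sequences=100, seed=0):
    """Return at most max_sequences records, drawn without replacement.

    A fixed seed (an assumption for reproducibility) makes the same
    subset come back on every run.
    """
    if len(records) <= max_sequences:
        return list(records)
    rng = random.Random(seed)
    return rng.sample(records, max_sequences)


if __name__ == "__main__":
    demo = "".join(f">seq{i}\nATGGCC\n" for i in range(250))
    picked = subsample(parse_fasta(demo))
    print(len(picked), "sequences kept")
```

Note that codeml expects the tree to contain the same taxa as the alignment, so after subsampling the tree would also need to be pruned or re-inferred for the retained sequences.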