I have read several papers that use PAML for positive selection analysis across different species. However, the number of species used in these analyses varies widely. For instance, some papers used single-copy ortholog alignments of 3 species, while others used more than 35 species. My question is: how much do the results (the signal of positive selection) depend on the number of species used in the analysis? And what would be an appropriate number of species?
Maximum likelihood is one of the best methods for selection and phylogenetic analysis. Increasing the number of species increases the chance of capturing more divergence for a particular ortholog, since more sequences give more opportunity to observe non-synonymous and synonymous substitutions, from which the Ka/Ks ratio (non-synonymous rate over synonymous rate) is estimated. A Ka/Ks > 1 indicates positive selection. So the result does not depend only on how many species you select; it also depends on whether the selected species are closely or distantly related. Even if you use different geographic isolates/accessions of the same species for a particular gene, you may well get a result of positive selection, since variation and Ka/Ks can be higher even within a species when the samples come from different geographical regions. With more species you have a better chance of detecting positive selection, so also try increasing the number of accessions/genotypes/geographical isolates within species to see whether positive selection shows up with the higher polymorphism.
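To make the Ka/Ks reading above concrete, here is a minimal Python sketch. The function name `classify_omega` and the tolerance value are my own assumptions for illustration and are not part of PAML. Keep in mind that a point estimate above 1 is only suggestive; PAML's codeml assesses significance with likelihood ratio tests between nested models.

```python
def classify_omega(ka: float, ks: float, tol: float = 1e-9) -> str:
    """Interpret a Ka/Ks (dN/dS) point estimate.

    ka  : non-synonymous substitution rate (hypothetical input)
    ks  : synonymous substitution rate (hypothetical input)
    tol : assumed tolerance for treating the ratio as exactly 1
    """
    if ks <= 0:
        # The ratio is undefined when no synonymous change is observed.
        raise ValueError("Ks must be positive to compute Ka/Ks")
    omega = ka / ks
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"


if __name__ == "__main__":
    print(classify_omega(0.02, 0.01))  # ratio 2.0
    print(classify_omega(0.01, 0.05))  # ratio 0.2
```

Very closely related sequences (for example isolates of one species) often have Ks near zero, which makes the ratio unstable; that is one reason the choice of taxa matters as much as their number.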
Note: I can't remember where I read this, but I would like to share the information below.
Generally, the number of species should not be fewer than 10. With fewer than 10 sequences, the results obtained from PAML are sometimes questionable, so PAML should always be run with at least 10 species.
If you have more than 100 sequences, the analysis becomes very computationally demanding. In that case it is wise to randomly sample 100 sequences and run the PAML analysis on that subset.
I use the same approach for my analysis.
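A minimal sketch of that random sampling step in Python, assuming the alignment is already loaded as a dict mapping sequence names to aligned sequences. The function name `subsample_sequences` and its defaults (`max_n=100`, `min_n=10`) are hypothetical and simply mirror the thresholds mentioned above; nothing here calls PAML itself.

```python
import random
import warnings


def subsample_sequences(records, max_n=100, min_n=10, seed=0):
    """Return at most max_n randomly chosen sequences from records.

    records : dict of {sequence_name: aligned_sequence} (assumed input format)
    max_n   : assumed upper limit before the analysis becomes too slow
    min_n   : assumed lower limit below which results may be questionable
    seed    : fixed seed so the same subset can be reproduced
    """
    if len(records) < min_n:
        warnings.warn(
            f"only {len(records)} sequences; fewer than {min_n} may give "
            "questionable results"
        )
    if len(records) <= max_n:
        return dict(records)  # nothing to subsample
    rng = random.Random(seed)
    # Sort names first so the draw does not depend on dict insertion order.
    chosen = rng.sample(sorted(records), max_n)
    return {name: records[name] for name in chosen}


if __name__ == "__main__":
    alignment = {f"sp{i}": "ATGGCC" for i in range(250)}  # toy data
    subset = subsample_sequences(alignment)
    print(len(subset))
```

The tree supplied with the alignment would need to be pruned to the same set of taxa, and repeating the draw with a few different seeds is a simple way to check that the signal does not depend on one particular subset.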
Hope this helps.