There really is no better advice than what was posted above: use PAML!
To expand a bit: PAML is the accepted standard for codon-level selection analyses and is used in a sizable proportion of publications that mention either of the words 'evolution' or 'selection'. The codeml sub-program is what you'd want to use: it supports nearly everything you might want to test when estimating dN/dS or detecting and quantifying purifying or positive selection in genes using codon alignments.
Common tasks performed with codeml:
- Estimating a single dN/dS ratio (roughly equivalent to Ka/Ks) for a single alignment of coding sequences. This gives a ballpark figure for the amount of protein-level selective constraint experienced by a gene. But be warned: most genes do not evolve homogeneously along their length, and different domains (extracellular vs. intracellular, for example) may experience drastically different levels of constraint.
- Identifying different dN/dS ratios in different branches of a phylogenetic tree
- Identifying genes with individual sites or regions under positive selection, sometimes in specific branches of a tree.
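As a rough illustration of how these three kinds of analysis map onto codeml settings, here is a small Python sketch that builds minimal control files. The option names (seqfile, treefile, outfile, runmode, seqtype, CodonFreq, model, NSsites, fix_omega, omega) are codeml control-file variables, but the file names, the starting values, and the make_ctl helper are assumptions made up for this example. A real run will usually need more options, so check everything against the PAML documentation first.

```python
# Sketch only: build minimal codeml control files for three common analyses.
# File names and this helper are hypothetical; verify option values in the PAML manual.

ANALYSES = {
    # one dN/dS ratio shared by all branches and sites
    "one_ratio": {"model": "0", "NSsites": "0"},
    # different ratios for branch groups labelled (#1, ...) in the tree file
    "branch": {"model": "2", "NSsites": "0"},
    # several site models run together, to look for positively selected sites
    "sites": {"model": "0", "NSsites": "0 1 2 7 8"},
}


def make_ctl(analysis, seqfile="alignment.phy", treefile="tree.nwk"):
    """Return the text of a codeml control file for the named analysis."""
    options = {
        "seqfile": seqfile,      # codon alignment (assumed file name)
        "treefile": treefile,    # tree file (assumed file name)
        "outfile": f"{analysis}.out",
        "runmode": "0",          # use the user-supplied tree
        "seqtype": "1",          # codon sequences
        "CodonFreq": "2",        # F3x4 codon frequencies
        "fix_omega": "0",        # estimate omega (dN/dS) instead of fixing it
        "omega": "0.5",          # starting value for omega
    }
    options.update(ANALYSES[analysis])
    lines = [f"{key:>10} = {value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_ctl("one_ratio"))
```

The text returned for each analysis would be saved to a file and handed to codeml as its control file. Keeping the shared options in one place and overriding only model and NSsites makes it easier to see what actually differs between the analyses.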
PAML and codeml are complicated (and sometimes finicky) programs, with dozens of analyses available depending on how you configure them. To use them well, you'll really want to do at least one of the following:
- Read the documentation
- Read one or two of Ziheng Yang's many publications about the models PAML implements and the applications it enables
- Look at the examples in the PAML download package.
Regarding the 'true relation between conservation and selection': the codon models implemented by PAML are more sensible than amino acid-based measures for quantifying selection pressures acting on proteins because they offer a built-in 'correction' for the synonymous mutation rate (as long as synonymous sites are evolving neutrally, which seems to be largely the case with smallish population sizes).
Amino acid-based measures of evolutionary conservation are not so well normalized, because the rate of amino acid change reflects a combination of mutation rate and selection for conservation of protein structure and function. Codon models, on the other hand, use DNA alignments of coding regions and an explicit model incorporating the genetic code to estimate the amount of natural selection for or against protein-level changes. To paraphrase Ziheng Yang's Computational Molecular Evolution: dN/dS represents the ratio of the rate at which a protein is evolving to the rate at which it would be evolving were it a non-structured, non-functional string of DNA characters.
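To make the role of the genetic code a bit more concrete, here is a toy Python sketch that sorts the codon differences between two aligned coding sequences into synonymous (same amino acid) and nonsynonymous (different amino acid). This is not how codeml estimates dN/dS: it ignores the number of synonymous and nonsynonymous sites, multiple substitutions, and the phylogeny. The function name and the example sequences are made up for illustration.

```python
# Sketch only: classify codon differences using the standard genetic code.
# This is a crude count, not a dN/dS estimate.
from itertools import product

BASES = "TCAG"
# Amino acids for the standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in GENETIC_CODE or codon_b not in GENETIC_CODE:
            continue  # skip codons with gaps or ambiguous bases
        if GENETIC_CODE[codon_a] == GENETIC_CODE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Made-up sequences: ATG CTT AAA GGT vs. ATG CTC AGA GGT
print(count_codon_differences("ATGCTTAAAGGT", "ATGCTCAGAGGT"))  # prints (1, 1)
```

In the example, CTT to CTC leaves leucine unchanged (synonymous), while AAA to AGA swaps lysine for arginine (nonsynonymous). An amino acid alignment would only ever show the second change, which is why it cannot separate mutation rate from selection.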