dN/dS (also known as Ka/Ks) analysis does indeed provide a way to infer conservation of protein sequences. The value of dN/dS ranges from 0 to infinity, and under the null hypothesis of neutral evolution its expected value is 1. If you are only interested in the conservation signal, you can focus on the dN/dS values between 0 and 1.
Conclusions one may draw from dN/dS ratios:
Neutral Protein Evolution: a dN/dS ratio of 1 implies that synonymous changes (DNA substitutions that do not affect the protein sequence) and non-synonymous changes (DNA substitutions that do affect the protein sequence) have accumulated at equal rates, each measured per available site, in the time between the ancestral and the modern versions of the protein.
Positive Evolution (adaptive evolution): a dN/dS ratio > 1 implies there have been more non-synonymous changes than synonymous changes. There has been evolutionary pressure to escape from the ancestral state, i.e. positive selection pressure. This can occur, for example, in paralogues that are required to serve a novel function, or in proteins of parasites that need to escape host immune recognition (e.g. changes to avoid MHC class I binding to evade T-cell attack).
Negative Evolution (conservation): a dN/dS ratio < 1 implies there have been more synonymous changes than non-synonymous changes. There has been evolutionary pressure to conserve the ancestral state, i.e. negative selection pressure. This can occur, for example, in orthologues that are required to maintain (conserve) some function encoded in the protein sequence, since changes from this state would lead to disruption of function.
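To make the three cases above concrete, here is a minimal sketch of a pairwise dN/dS estimate in Python. It is a simplified counting approach in the style of Nei and Gojobori, not the algorithm of any particular tool mentioned in this answer. It assumes two aligned, in-frame DNA sequences (A, C, G, T only) and the standard genetic code. The function names (`syn_sites`, `count_diffs`, `jukes_cantor`, `dn_ds`) are made up for illustration.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (NCBI table 1).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Synonymous sites in a codon: per position, the fraction of the three
    possible base changes that leave the amino acid unchanged."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    total += 1 / 3
    return total


def count_diffs(c1, c2):
    """Average (synonymous, non-synonymous) changes over all shortest
    mutational paths from c1 to c2, skipping paths through stop codons."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    if not diff:
        return 0.0, 0.0
    paths = []
    for order in permutations(diff):
        cur, syn, non = c1, 0, 0
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == "*":
                break  # discard this path
            if CODE[nxt] == CODE[cur]:
                syn += 1
            else:
                non += 1
            cur = nxt
        else:
            paths.append((syn, non))
    if not paths:  # assumption: if every path hits a stop, count all as non-synonymous
        return 0.0, float(len(diff))
    return (sum(p[0] for p in paths) / len(paths),
            sum(p[1] for p in paths) / len(paths))


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple substitutions."""
    if p >= 0.75:
        return float("nan")  # saturated: the correction is undefined
    return -0.75 * log(1 - 4 * p / 3)


def dn_ds(seq1, seq2):
    """Return (dN, dS, dN/dS) for two aligned, in-frame coding sequences."""
    S = Sd = Nd = 0.0
    n_codons = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        # Skip codons with gaps or ambiguous bases, and stop codons.
        if CODE.get(c1, "*") == "*" or CODE.get(c2, "*") == "*":
            continue
        S += (syn_sites(c1) + syn_sites(c2)) / 2
        s, n = count_diffs(c1, c2)
        Sd += s
        Nd += n
        n_codons += 1
    N = 3 * n_codons - S
    dS = jukes_cantor(Sd / S) if S else float("nan")
    dN = jukes_cantor(Nd / N) if N else float("nan")
    # dN/dS is left undefined (nan) when dS is zero or saturated.
    ratio = dN / dS if dS and dS == dS else float("nan")
    return dN, dS, ratio


# Toy example: one synonymous (GCT -> GCC) and one non-synonymous (AAA -> AGA) change.
dN, dS, omega = dn_ds("ATGGCTAAA", "ATGGCCAGA")
print(f"dN={dN:.3f} dS={dS:.3f} dN/dS={omega:.3f}")
```

In the toy example the ratio comes out well below 1, which would be read as conservation, although three codons are far too few to conclude anything. Real analyses should use an established tool and a proper alignment.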
- Algorithms can run either on multiple sequences or on just a pair of sequences. In either case, the input sequences used to derive a dN/dS ratio must share ancestry: too divergent and multiple substitutions at the same site become a problem; too recent and you will not have enough observed changes to draw conclusions from.
- dN/dS can be used to compare whole proteins or regions within proteins (a sliding dN/dS value across the protein)
- A dN/dS ratio calculated for a whole protein is often an underestimate (lower than it should be) because each protein is made up of a variety of domains, and strongly conserved regions can mask faster-evolving ones. For instance, an alpha-helix structure may always be required in a set of proteins that perform a variety of different functions.
- The only sequence changes considered are substitutions (not duplications, inversions, etc.)
- Significance of a given dN/dS ratio can be assessed using Fisher's exact test: read this
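As a rough illustration of the Fisher's exact test mentioned in the last point, here is a small Python sketch that uses only the standard library. It assumes the counts are arranged as a 2x2 table with non-synonymous and synonymous sites as rows and changed and unchanged sites as columns. That layout is one common way to set the test up and is an assumption here, not necessarily what any given tool does. The counts in the example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every possible table that is no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts, rounded to integers:
# rows = non-synonymous / synonymous, columns = changed / unchanged sites.
nonsyn_changed, nonsyn_unchanged = 5, 295
syn_changed, syn_unchanged = 20, 80
p_value = fisher_exact_two_sided(
    nonsyn_changed, nonsyn_unchanged, syn_changed, syn_unchanged
)
print(f"p = {p_value:.3g}")
```

A small p-value suggests that the proportions of changed non-synonymous and synonymous sites differ by more than chance alone would explain. If SciPy is available, `scipy.stats.fisher_exact` does the same job.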
Here are my recommendations for software ordered by how flexible they are:
- MATLAB's Bioinformatics Toolbox: Here you have the greatest variety of alternative algorithms, operating system compatibility, sliding vs. whole-protein analysis, an API to GenBank, etc. (here's a great tutorial for using their dN/dS tool). Just remember that MATLAB is not free.
- KaKs Calculator: If you only care about whole-protein dN/dS, many options are available with KaKs Calculator, which also computes statistical significance using Fisher's exact test. I can also provide an R script that generates error bars from the output, just ask.
- PAML: If you have more than two sequences per protein that you wish to get a dN/dS value from, then many options are available with PAML. It is often used in published papers, but it's not recommended if you only have a pair of sequences per protein.
modified 5.7 years ago
5.7 years ago by
a1ultima • 720