Question: Best Practices/Softwares To Calculate Ka/Ks Ratio
gravatar for Jarretinha
6.5 years ago by
São Paulo, Brazil
Jarretinha3.2k wrote:

Hi all again,

I've noticed that most programs used to predict the effect of non-synonymous amino acids substitutions simply don't analyze the action of selection of any kind. Given that conservation in alignment does not mean necessarily purifying selection, I'm wondering about the true relation between conservation and selection. In order to test this idea, I really need some guidelines to calculate Ka/Ks in close and distantly related species. Any software able to deal with such situation is much welcome too!

-- Edit --

I know a lot of programs able to calculate Ka/Ks. What I really want is an expert opinion of what to do and which one to use.

software selection • 29k views
ADD COMMENTlink modified 21 months ago by Philipp Bayer4.6k • written 6.5 years ago by Jarretinha3.2k
gravatar for a1ultima
3.5 years ago by
a1ultima610 wrote:

Conclusions one may draw from dN/dS ratios (aka. ~Ka/Ks):

Neutral Evolution (drift): dN/dS ratio = 1 implies there has been equal numbers of synonymous (dna substitutions that do not affect the protein sequence) and non-synonymous changes (dna substitutions that do affect the protein sequence) during the time between ancestral to the modern versions of the protein.

Positive Selection (adaptive evolution): dN/dS ratio > 1 implies there has been more non-synonymous changes than synonymous changes. There has been evolutionary pressure to escape from the ancestral state - i.e. positive selection pressure. This can occur for example in paralogues that are required to serve a novel function, or in proteins of parasites that need to escape host immune recognition (e.g. changes to avoid MHC-1 binding to evade T-cell attack).

Negative Selection (conservation): dN/dS ratio < 1 implies there has been more synonymous changes than non-synonymous changes. There has been evolutionary pressure to conserve the ancestral state - i.e. negative selection pressure. This can occur for example in orthologues that are required to maintain (conserve) some function encoded in the protein sequence, since changes from this state would lead to disruption of function.

Useful Tips:

  • Algorithms can either run on multiple sequences, or just a pair of sequences. In either case the input sequences used to derive a dN/dS ratio must share ancestry - too divergent and there is a problem with multiple substitutions, too recent and you will not have sufficient enough observed changes to draw conclusions from.
  • dN/dS can be used to compare whole proteins or regions within proteins (a sliding dN/dS value across the protein.)
  • A dN/dS ratio calculated for a whole protein is often an underestimate (lower than it should be) due to the variety of domains that constitute each protein, for instance a alpha-helix structure may always be required in a set of proteins that perform a variety of different functions.
  • The only sequence changes considered are substitutions (not duplications, or inversions etc.)
  • Significance of a given dN/dS ratio can be assessed using Fishers exact test: read this.


Here are my recommendations for software ordered by how flexible they are:

  • MATLAB's Bioinformatics Toolbox: Here you have the greatest variety of alternative algorithms, operating system compatibility, sliding vs. whole protein analysis, API to Genbank, etc (Here's a great tutorial for using their dN/dS tool). Just remember MATLAB is not free.
  • KaKs Calculator: If you only care about whole protein dN/dS, many options are available with the Ka/Ks calculator - they also compute statistical significance using Fisher's exact test. Here's an R script I made to help you out, it plots a bar chart (with error bars), from the KaKs Calculator output files.
  • PAML: If you have >2 sequences per protein that you wish to get a dN/dS value for, then many options are available with PAML. This is often used in published papers, but it's not recommended if you only have a pair of sequences per protein.
  • VectorBase dN/dS app.: If you you just want to quickly visualize a sliding window dN/dS curve, without installing anything, me and a friend have finished a website, for you to do just that. It allows for simple pairwise, sliding window dN/dS analysis using the Nei-Gojobori method. Accuracy is comparable to MATLAB's Bioinformatics Toolbox, and performance is optimized for phone. So far the GUI can only accept VectorBase gene ids, but the script itself can take any input pair of proteins. Performance is heavily optimized, e.g. a pair of 3000bp long proteins in <1s. When the site is finally hosted I'll post a link, code:

enter image description here

ADD COMMENTlink modified 5 months ago • written 3.5 years ago by a1ultima610

Also.. If you wish to run dN/dS on 100-1000s of proteins me and a friend have created our own Python script that implements the the Nei-Gojobori algorithm (but uses approximate multiple-substitutions correction rather than exact) on large-scale protein sequence pair data (i.e. one pair of sequences per protein). it implements both whole-protein and sliding-window modes. Given two million base-pairs of sequence data it will finish the job in <15 seconds with ~1% error vs. MATLAB. If you ask, I can properly document it and publish on GitHub.

Specifically you need:

For more details, either read this, or see the comment below.

ADD REPLYlink modified 6 months ago • written 3.5 years ago by a1ultima610

Hi, can you send me that script too? Or publish on Github

ADD REPLYlink written 2.1 years ago by latticetower10

Hi! Have you published the python script on GitHub? Or could you please send me the python? Thank you so much!

ADD REPLYlink modified 21 months ago • written 21 months ago by urgaa.mgl10

Thank you for the detailed answer. It helped me a lot. Could you send me the code, too, please? My email is I will appreciate it.

ADD REPLYlink modified 16 months ago • written 16 months ago by Karyo10

Hi @a1ultima, I know I am little late to join this ticket, but did you ever make a github page for your program. If not, could you please send it to me through email, Much appreciated.

ADD REPLYlink written 16 months ago by salamzader10

Hi a1ultima, I know im very late to ask you the python script. Could you send me the link of GitHub or via email to I will appreciate it. Thank you.

ADD REPLYlink written 16 months ago by vharshavardhanan30

For all who have asked for the dN/dS Python file, it is finally on GitHub, I really do apologize for how long I have taken to get back to some of you. This Python file I am sharing is part of a bigger pipeline, but I hope I have documented it well enough now so that you can just take this part and play with it yourself:

Specifically you need:


  • We've made a GUI (in the form of a webpage: vg-genes.html), that works with a local Python server (, so that means you don't have to mess around with the script itself. I will post a URL to the GUI as soon as it is hosted, if you cannot wait, then read the to set up the local python server yourself and also how to run the GUI. Though, it currently has restrictions: can only accept VectorBase gene ids as input (it needs to do a REST call to the to retrieve sequence data). But later we'll try and make it more flexible. If you want to analyze pairs of protein sequences outside of VectorBase, then focus only on (above).

  • For very fast computation needs to work with, in that it imports in the same directory. But you don't need to really worry about it, since we decided the pre-pickled computations data was worth uploading to GitHub anyway, i.e. you will only ever have to read the code if you decide to delete, the pre-pickled computations.

  • If your pair of input sequences that are not already aligned, you will be faced with a performance bottleneck in the alignment process, a step that is necessary to calculate dN/dS via By default, will try to align the inputs by importing, but I have tried to write informative comments in to help you turn this step off. The catch is that you would then need to have the inputs codon aligned prior to running Apart from boosted performance, another issue with keeping the step running is that it needs to import an external Python module called Biopython (import Bio). If you do not want to install Biopython, then make sure you align the inputs yourself, turn off the alignment part in, and run

I am not sure this message will be seen by any of those who asked for it, will this pop up in their notifications? I will for now assume it will.

ADD REPLYlink modified 6 months ago • written 6 months ago by a1ultima610

Hi  a1ultima,

    Could you please send me the R script for error rate calculation for KaKs Calculator? My email id is

ADD REPLYlink written 3.1 years ago by rpraveenkumarbio10

Hello, thanks for your tutorial. Could you kindly send me the R code as well, and current link for KaKs calculator. Email: Thanks, Evans

ADD REPLYlink written 12 months ago by ronoevan10

I am really sorry about the embarassingly long wait before giving the R script for calculating error-bars for your dN/dS analyses.

For anybody who still needs it, I've put the script on GitHub with instructions on what you need to do to get it to work in the

ADD REPLYlink written 8 months ago by a1ultima610

Assalm o alikum everyone,

I'm using ka ks calculator for ka ks ratio. But i'm too confused either it is working for multiple sequence or only for a pair of sequences ????

ADD REPLYlink written 10 days ago by adeenahassan7720
gravatar for Gjuggler
6.5 years ago by
Cambridge, UK
Gjuggler150 wrote:

There really is no better advice than as posted above: use PAML!

To expand a bit: PAML is the accepted standard for codon-level selection analyses and is used in a sizable proportion of publications that mention either the words 'evolution' or 'selection'. The codeml sub-program is what you'd want to use: it supports nearly everything you might want to test with respect to estimating dN/dS or detecting / quantifying purifying or positive selection in genes using codon alignments.

Common tasks performed with codeml involve:

  • Estimating a single dN/dS ratio (roughly equivalent to Ka/Ks) for a single alignment of coding sequences. This gives a ballpark figure for the amount of protein-level selective constraint experienced by a gene. But be warned, most genes do not evolve homogenously along their length -- different domains (extracellular vs. intracellular, for example) may experience drastically different levels of constraint.
  • Identifying different dN/dS ratios in different branches of a phylogenetic tree
  • Identifying genes with individual sites or regions under positive selection, sometimes in specific branches of a tree.

PAML and codeml are complicated (and sometimes finicky) programs, with dozens of analyses available depending on how you configure them. In order to use it well, you'll really want to either: - Read the documentation - Read one or two of Ziheng Yang's many publications about the models PAML implements and the applications it enables - Look at the examples in the PAML download package.

Regarding the 'true relation between conservation and selection': the codon models implemented by PAML are more sensible than amino acid-based measures for measuring selection pressures acting on proteins because they offer built-in 'correction' for synonymous mutation rates (as long as synonymous sites are evolving neutrally, which seems to be largely the case with smallish population sizes).

Amino-acid based measures of evolutionary conservation are not so well normalized, because the rate of amino acid change is the result of a combination of mutation rate plus selection for conservation of protein structure and function. Codon models, on the other hand, use DNA alignments of coding regions and an explicit model incorporating the genetic code to estimate the amount of natural selection for or against protein-level changes. To paraphrase Ziheng Yang's Computational Molecular Evolution: dN/dS represents the ratio between the rate at which a protein is evolving to the rate at which it would be evolving were it a non-structured, non-functional string of DNA characters.

ADD COMMENTlink written 6.5 years ago by Gjuggler150

I do use PAML already. And HyPhy too. I have a problem with distance. PAML is effective for distantly related species. HyPhy can be used to same species ration calculation when mutation rates are large. As I want to calculate it for certain human proto-oncogenes, I'm not certain on how to proceed.

ADD REPLYlink written 6.5 years ago by Jarretinha3.2k

The entire justification behind dN/dS calculations basically requires fixed differences between species resulting from long periods of natural selection. I'm not sure that in the case of observed mutations within proto-oncogenes or intra-species variability you will ever get sensible results from this type of a measure. I enjoyed this analysis of the issue:

If you're interested in quantifying the amount of 'functional change' resulting from certain mutations, perhaps Grantham scores would be a viable alternative...

ADD REPLYlink written 6.5 years ago by Gjuggler150

In fact, this ratio is only applicable when a significant amount of fixations by selection occur (which depends on a lot of effects). I'm aware of Plotkin's paper. Distance is a relative measure. Most RNA viral populations show intraspecific distances much higher than any large group of vertebrates. So, it's feasible to use it in their case. But, the question how much distance you need to call lineages distant? I wasn't aware of Grantham matrices. Anyway, the are quite similar to SDM/Andante approach (which I already use).

ADD REPLYlink written 6.5 years ago by Jarretinha3.2k
gravatar for Botond Sipos
6.5 years ago by
Botond Sipos1.5k
United Kingdom
Botond Sipos1.5k wrote:

Check out the PAML package.

ADD COMMENTlink written 6.5 years ago by Botond Sipos1.5k
gravatar for Philipp Bayer
21 months ago by
Philipp Bayer4.6k
Philipp Bayer4.6k wrote:

A few weeks ago POTION was published:

>We present POTION, an open source, modular and end-to-end software for genome-scale detection of positive Darwinian selection in groups of homologous coding sequences. Our software represents a key step towards genome-scale, automated detection of positive selection, from predicted coding sequences and their homology relationships to high-quality groups of positively selected genes. POTION reduces false positives through several sophisticated sequence and group filters based on numeric, phylogenetic, quality and conservation criteria to remove spurious data and through multiple hypothesis corrections, and considerably reduces computation time thanks to a parallelized design.


Internally it runs PAML's codeml together with a few other tools. I've tested it with one test data-set and it's much nicer to use compared to running codeml directly.

ADD COMMENTlink written 21 months ago by Philipp Bayer4.6k

Quite interesting. Any idea if a dN/dS values can be recovered from the output files? I could see such output file in the README.txt

ADD REPLYlink written 19 months ago by dago2.3k
gravatar for Chrispin Chaguza
2.9 years ago by
University of Liverpool, UK
Chrispin Chaguza220 wrote:

There is also an online tool can could be used for that analysis in addition to what others have recommended (

ADD COMMENTlink written 2.9 years ago by Chrispin Chaguza220
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 677 users visited in the last hour