Question

CAI in python: SeqUtils.CodonUsage vs CAI package

1

6.3 years ago

grassostefan ▴ 10

Hi, I am trying to calculate the CAI (codon adaptation index) for a list of genes of interest. I decided to use the RSCU values from the Codon Bias Database (CBDB) for the relevant organism (B. subtilis 168): http://homepages.luc.edu/~cputonti/cbdb/genera/bacillus.html#s30 .

I implemented a small Python script to do so. I couldn't decide between two methods, so I tested them both: one is from Biopython SeqUtils (module CodonUsage, method cai_for_gene) and the other is from the 'CAI' package (method CAI) (https://pypi.python.org/pypi/CAI). Since I was using the same reference RSCU values, and both claim to implement the method from "Sharp & Li, 1987, NAR (15)", I was expecting both packages to return the same value for each gene. This did not happen, and the difference is obvious at first sight: the results from SeqUtils do not all fall between 0 and 1 (some are >1), which contradicts the definition of CAI, while the results from the 'CAI' package were within that range.

I investigated further by regenerating the RSCU values and looking at the source code of both packages. The first thing I noticed is that from the same FASTA file (i.e. the same collection of genes, also downloaded from CBDB) I got two different sets of RSCU values from the two packages: the values generated by CAI matched those posted online, while the values generated by SeqUtils did not. I assumed that the dictionary used by SeqUtils did not contain RSCU values, so I used the SeqUtils-generated values to calculate the CAI for my genes, but the results were still outside the valid range. Then, looking at the source code, I noticed that the SeqUtils implementation differs a lot from the paper of Sharp and Li, unless there is a mathematical equivalence I do not know of or understand. The approach implemented in the CAI package, on the other hand, is identical to the paper.
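For readers following along, here is a minimal sketch of how RSCU and the relative adaptiveness w can be computed from a reference set, as I read the definitions in Sharp & Li (1987). It is not the code of either package. The helper names (`count_codons`, `rscu`, `relative_adaptiveness`) and the one-gene reference are made up for illustration, it assumes the standard genetic code, and it does not apply any correction for codons that never occur in the reference.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the usual TCAG ordering (assumption:
# the organism uses the standard table for the codons of interest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

# Group codons into synonymous families, keyed by amino acid.
FAMILIES = defaultdict(list)
for codon, aa in zip(CODONS, AMINO):
    FAMILIES[aa].append(codon)


def count_codons(sequences):
    """Count in-frame codons over all reference sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def rscu(sequences):
    """RSCU = observed count / count expected if synonyms were used equally."""
    counts = count_codons(sequences)
    values = {}
    for codons in FAMILIES.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the reference set
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


def relative_adaptiveness(rscu_values):
    """w = RSCU of a codon / largest RSCU within its synonymous family."""
    weights = {}
    for codons in FAMILIES.values():
        present = [c for c in codons if c in rscu_values]
        if not present:
            continue
        best = max(rscu_values[c] for c in present)
        for c in present:
            weights[c] = rscu_values[c] / best
    return weights


reference = ["AAAAAAAAAAAG"]  # made-up reference: AAA x3, AAG x1 (lysine)
r = rscu(reference)
w = relative_adaptiveness(r)
print(r["AAA"], r["AAG"])  # RSCU values: 1.5 and 0.5
print(w["AAA"], w["AAG"])  # w values: 1.0 and about 0.33
```

Note that RSCU values can be larger than 1, while w never is, so it matters which of the two a given function expects as input.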

This given, I assume the CAI package is returning the correct values.

My question is thus: what is SeqUtils calculating, then? I could not find much information about this package. Could there be an error? Why do the two algorithms differ? And why does the one from SeqUtils also differ from the paper? Is anyone able to explain this?

P.S. I am of course assuming I used both methods correctly. Since I followed the respective guides, I feel reasonably safe in assuming so.

CAI codon biopython sequtils • 4.3k views

ADD COMMENT • link updated 14 months ago by schlogl ▴ 160 • written 6.3 years ago by grassostefan ▴ 10

0

Benjamin,

Could you let us know what was wrong with the CAI calculation in SeqUtils?

Thanks!

ADD REPLY • link 5.5 years ago by jblumens • 0

0

Hi, I am trying to estimate CAI for some genes of interest using the CAI package, but I am getting an error. Could you please share some insight into how you were able to calculate CAI?

Thanks a lot

ADD REPLY • link 4.6 years ago by rthapa ▴ 90

0

4.0 years ago

abdulwrs7 ▴ 10

The Codon Adaptation Index (CAI) was first introduced by Sharp and Li to measure synonymous codon usage bias for a DNA or RNA sequence. It measures the resemblance between the synonymous codon usage of a gene and the synonymous codon frequencies of a reference set. CAI was originally proposed to provide an estimate that can be used across genes and species, ranging from 0 to 1.

If a gene always uses the most frequently used synonymous codon in the reference set, then CAI=1. If a gene always uses the least frequently used synonymous codon in the reference set, CAI takes its lowest value, which is close to 0 when those codons are rare in the reference set.
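To make the definition concrete, here is a rough sketch of CAI as the geometric mean of the relative adaptiveness values w of the codons in a gene, which is how I understand Sharp & Li (1987). The function name `cai` and both weight tables are invented for illustration and are not taken from Biopython or the CAI package. The last line shows, purely as an arithmetic illustration, that the same formula is no longer bounded by 1 if raw RSCU values are passed where w is expected; whether that is what happened in the question is not established in this thread.

```python
import math


def cai(sequence, weights):
    """Geometric mean of the weights of the scored codons in `sequence`."""
    seq = sequence.upper()
    logs = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        w = weights.get(seq[i:i + 3])
        if w is None:
            continue  # codon not scored (e.g. ATG, TGG, stop codons)
        if w <= 0:
            raise ValueError("weights must be positive to take logarithms")
        logs.append(math.log(w))
    if not logs:
        raise ValueError("no scored codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical w values for lysine (AAA/AAG) and glutamate (GAA/GAG).
w = {"AAA": 1.0, "AAG": 0.25, "GAA": 1.0, "GAG": 0.6}
# Hypothetical RSCU values with the same within-family ratios.
rscu_like = {"AAA": 1.6, "AAG": 0.4, "GAA": 1.25, "GAG": 0.75}

print(cai("AAAGAA", w))          # only the preferred codons: 1.0
print(cai("AAGGAG", w))          # only the rarer codons: below 1
print(cai("AAAGAA", rscu_like))  # RSCU used as weights: above 1
```

Working in log space avoids underflow for long genes, since a plain product of many values below 1 quickly becomes very small.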

Follow this link to explore more: Implementation Of Codon Adaptation Index (CAI) Using Biopython

ADD COMMENT • link 4.0 years ago by abdulwrs7 ▴ 10

score 4 · Accepted Answer · 2018-06-22

4

5.8 years ago

benjamindlee ▴ 50

SeqUtils is actually not a correct implementation of the CAI metric (as of 6/22/18). I wrote the CAI package specifically out of frustration with SeqUtils.CodonUsage. In addition to being correct, the CAI package is faster and supports multiple genetic codes.

For more information, take a look at the preprint of the CAI software package paper.

Hope this helps and be sure to contact me if you run into any problems!

Benjamin Lee

ADD COMMENT • link 5.6 years ago by benjamindlee ▴ 50

0

Thanks a lot Benjamin!

This explains a lot! I am only wondering why a "broken" package is in the official set of Biopython packages. It took me a while to realize that, as I was assuming the package was right and I was doing something wrong. I had to go through the code and compare it to "Sharp & Li, 1987, NAR (15)" before I started to doubt it. Anyway, it should be fixed, or better, replaced with yours.

Your package works great and is also very easy to use. I'll cite your preprint (or, most likely soon, the article) when I use it in my papers.

People like you really improve the bioinformatics community!

ADD REPLY • link 5.7 years ago by grassostefan ▴ 10

0

@benjamindlee I saw that CAI has an option -g. Do you have a reference listing the different genetic codes that can be passed to this flag? Thank you.

ADD REPLY • link 14 months ago by schlogl ▴ 160