Question: CAI in python: SeqUtils.CodonUsage vs CAI package
14 months ago by
grassostefan0 wrote:

Hi, I am trying to calculate the CAI (codon adaptation index) for a list of genes of interest. I decided to use the RSCU (relative synonymous codon usage) values from the Codon Bias Database (CBDB) for the relevant organism (B. subtilis 168): http://homepages.luc.edu/~cputonti/cbdb/genera/bacillus.html#s30 .

I wrote a small Python script to do so. I couldn't decide between two methods, so I tested them both: one is from Biopython's SeqUtils (module CodonUsage, method cai_for_gene) and the other is from the 'CAI' package (method CAI) (https://pypi.python.org/pypi/CAI). Since I was using the same reference RSCU values, and both claim to implement the method from "Sharp & Li, 1987, NAR (15)", I expected both packages to return the same value for each gene. They did not, and the discrepancy is obvious at first sight: the results from SeqUtils do not all fall between 0 and 1 (some are >1), which contradicts the definition of CAI, whereas the results from the 'CAI' package were within that range.
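
To make the range argument concrete, here is a minimal sketch of why a CAI value cannot exceed 1 when it is computed as a geometric mean of relative adaptiveness weights (each RSCU divided by the largest RSCU among its synonymous codons), and how a score above 1 can appear if raw RSCU values are averaged instead. The RSCU numbers and the toy gene are made up for illustration, and this is not a claim about what either package does internally.

```python
import math

def geometric_mean(values):
    """Geometric mean computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical RSCU values for the two lysine codons (they sum to 2).
rscu = {"AAA": 1.5, "AAG": 0.5}

# Relative adaptiveness: divide by the best synonymous codon, so w <= 1.
best = max(rscu.values())
w = {codon: value / best for codon, value in rscu.items()}

# Toy gene, already split into codons.
gene = ["AAA", "AAA", "AAA", "AAG"]

cai_from_w = geometric_mean([w[c] for c in gene])
score_from_rscu = geometric_mean([rscu[c] for c in gene])

print(cai_from_w)       # bounded by 1 because every weight is <= 1
print(score_from_rscu)  # can exceed 1 because RSCU values can exceed 1
```

So a result above 1 suggests that, somewhere along the way, the values being averaged are not normalized weights.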

I investigated further by regenerating the RSCU values and reading the source code of both packages. The first thing I noticed is that, from the same FASTA file (i.e. the same collection of genes, also downloaded from CBDB), the two packages gave different RSCU values: the CAI-generated values matched those posted online, while the SeqUtils-generated values did not. I assumed that the dictionary used by SeqUtils did not hold RSCU values, so I used the SeqUtils-generated values to calculate the CAI for my genes, but the results were still outside the allowed range. Then, looking at the source code, I noticed that the SeqUtils implementation differs a lot from the paper by Sharp and Li, unless the two are mathematically equivalent in a way I do not know or understand. The approach implemented in the CAI package, on the other hand, is identical to the paper's.
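
For reference when comparing against either source code, the calculation can be sketched in plain Python as three steps: RSCU from a reference set, relative adaptiveness w, and CAI as the geometric mean of w over a gene's codons. This is only a sketch under stated assumptions, not the code of either package: the function names are made up, it uses the standard genetic code, it skips Met, Trp and stop codons, and it replaces zero codon counts with 0.5 to avoid zero weights.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group synonymous codons; Met, Trp and stop codons are left out.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa not in "MW*":
        SYNONYMS[aa].append(codon)

def split_codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def rscu(reference_genes):
    """RSCU: observed codon count divided by the mean count of its synonymous family."""
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    values = {}
    for family in SYNONYMS.values():
        # Assumption: unseen codons get a count of 0.5 so no weight is zero.
        adjusted = {c: counts[c] or 0.5 for c in family}
        mean = sum(adjusted.values()) / len(family)
        for c in family:
            values[c] = adjusted[c] / mean
    return values

def relative_adaptiveness(rscu_values):
    """w: RSCU divided by the largest RSCU among the synonymous codons."""
    weights = {}
    for family in SYNONYMS.values():
        best = max(rscu_values[c] for c in family)
        for c in family:
            weights[c] = rscu_values[c] / best
    return weights

def cai(gene, weights):
    """Geometric mean of w over the gene's codons that have a weight."""
    logs = [math.log(weights[c]) for c in split_codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs))
```

With a reference list of coding sequences, `cai(gene, relative_adaptiveness(rscu(reference)))` always lands in the interval (0, 1], which gives a quick check on the output of any implementation. Note that `cai` raises a ZeroDivisionError if the gene contains no scorable codon.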

Given this, I assume the CAI package is returning the correct values.

My question is thus: what is SeqUtils actually calculating? I could not find much information about this module. Could there be an error? Why do the two algorithms differ? And why does the SeqUtils one also differ from the paper? Is anyone able to explain this?

P.S. I am of course assuming that I used both methods correctly. Since I followed the respective guides, I feel reasonably safe in assuming so.

cai biopython codon sequtils • 584 views
modified 4 months ago by jblumens0 • written 14 months ago by grassostefan0

Benjamin,

Could you let us know what was wrong with the CAI calculation in SeqUtils?

Thanks!

written 4 months ago by jblumens0
7 months ago by
benjamindlee30
Harvard University
benjamindlee30 wrote:

SeqUtils is actually not a correct implementation of the CAI metric (as of 6/22/18). I wrote the CAI package specifically out of frustration with SeqUtils.CodonUsage. In addition to being correct, the CAI package is faster and supports multiple genetic codes.

For more information, take a look at the preprint of the CAI software package paper.

Hope this helps and be sure to contact me if you run into any problems!

Benjamin Lee

modified 5 months ago • written 7 months ago by benjamindlee30

Thanks a lot Benjamin!

This explains a lot! I only wonder why a "broken" package is in the official set of Biopython packages. It took me a while to realize that, as I was assuming the package was right and I was doing something wrong. I had to go through the code and compare it to "Sharp & Li, 1987, NAR (15)" before I began to doubt it. Anyway, it should be fixed, or better, replaced with yours.

Your package works great and is also very easy to use. I'll cite your preprint (or, most likely soon, the article) when I use it for my papers.

People like you really improve the bioinformatics community!

written 6 months ago by grassostefan0