We need to optimize a gene for heterologous expression in a GC-rich organism; so far the gene does not yield any detectable protein there. This is a summary of my workflow, with some remarks on how I got there. I hope it is useful for others, and I'd be grateful for any hints on the remaining issues.
From the literature (e.g. Allert 2010) I understand that the main problem is hindrance of translation initiation by mRNA secondary structure in the region between the ribosome binding site and about +60. Regarding codon usage bias, I take it that it only matters that no rare codons are used, i.e. that the CAI does not fall below a certain threshold of, say, 0.4.
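To make the CAI threshold concrete, here is a minimal Python sketch of how a CAI can be computed from a reference set of ORFs. Everything specific in it is an assumption for illustration: the one-gene toy reference set, the function names, and the pseudocount of 0.5 for codons that never occur in the reference. Met, Trp and stop codons are left out of the average, which is a common convention but not necessarily what any particular tool does.

```python
import math
from collections import Counter

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codons(seq):
    """Split a sequence into complete codons (DNA alphabet, upper case)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_orfs, pseudocount=0.5):
    """w(codon) = count / highest count among its synonymous codons.

    Assumption: codons absent from the reference get `pseudocount`
    instead of zero, so that a single unseen codon does not force CAI to 0.
    """
    counts = Counter()
    for orf in reference_orfs:
        counts.update(codons(orf))
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:  # skip stop, Met, Trp
            continue
        adjusted = {c: max(counts[c], pseudocount) for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights

def cai(orf, weights):
    """Geometric mean of the weights of all scored codons in the ORF."""
    logs = [math.log(weights[c]) for c in codons(orf) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

# Toy reference set (hypothetical); in practice: highly expressed host genes.
reference = ["ATGGCCGCCGCGCTGCTGCTCTAA"]
w = relative_adaptiveness(reference)
for name, orf in [("good", "ATGGCCCTGTAA"), ("poor", "ATGGCTTTATAA")]:
    score = cai(orf, w)
    print(name, round(score, 2), "OK" if score >= 0.4 else "below threshold")
```

With a real reference set, the same `cai` function can be run on each candidate ORF and compared against the 0.4 cutoff mentioned above.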
Sequence optimization:
- GeneGA package for R, in combination with RStudio (very useful as an environment). GeneGA lets you use an objective function that optimizes RNA structure, CAI, or both.
- Indicate the 5'UTR using the frontseq parameter, since that is where the RBS is and it folds together with the ORF.
- The ramp option produces an error message for me. I'm also not sure whether the "ramp" (Tuller 2010, Science) should be considered for heterologous expression at all: I doubt that my gene products could ever reach transcript and translation levels at which ribosome traffic jams actually reduce expression rate and general fitness. What do you think?
Manual rechecking of GeneGA results:
a) Check that no rare codons are left (see below).
b) Structure: GeneGA seems to consider only the "minimal energy" from the ViennaRNA package. Apart from the fact that I can't reproduce the "free energy" value, the "minimal energy" structure is unlikely to be the relevant one. It seems more important to me to consider both the positional entropy and the base-pair probability of the "centroid" structure shown when submitting the sequence here: http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi The two display options, "color by base-pairing probability" and "color by positional entropy", seem useful for confirming that there is not much stable structure in the relevant region, -30 to +60. I find the help file quite limited beyond a basic description. Any hints on interpreting all the different RNAfold outputs?
c) To reduce dependency on a single RNA prediction method, I also use ContextFold. AveRNA would have been perfect for comparing several RNA prediction methods, but unfortunately its final predictions seem to contain "unbalanced brackets", and I cannot find the publication or the mentioned "manuscript" that would help me understand how to use it. However, when running RNAeval, it is easy to see how heterogeneous the results of the compared algorithms are! AveRNA is still useful because it ships precompiled binaries of various tools, which often failed to compile for me on Ubuntu 12.04 LTS/64-bit. These tools often return dot-bracket notation, which can be evaluated with RNAeval. http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::rnaeval is helpful here, since the command-line tool is strange to use (an example command would be welcome!). However, RNAeval only does what its description says: "calculate energy of RNA sequences". Hmm, well! Could somebody point me to a way to calculate base-pair probabilities and positional entropies with non-ViennaRNA tools?
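On the positional entropy part: once any tool gives base-pair probabilities, the per-position values can be computed independently of ViennaRNA. Below is a minimal Python sketch. The definition used is an assumption: for each position, the Shannon entropy (in bits) over all its possible pairing partners plus the unpaired state. I believe this is close to what the RNAfold server colors by, but I have not verified that. The function names and the toy probabilities are made up for illustration.

```python
import math

def positional_entropy(pair_probs, length):
    """Return (paired_prob, entropy_bits), one value per sequence position.

    pair_probs maps 1-based (i, j) tuples to base-pairing probabilities,
    as they might be exported from any partition-function based tool.
    """
    partners = [[] for _ in range(length)]
    for (i, j), p in pair_probs.items():
        partners[i - 1].append(p)
        partners[j - 1].append(p)
    paired, entropy = [], []
    for probs in partners:
        p_paired = min(sum(probs), 1.0)    # guard against rounding above 1
        states = probs + [1.0 - p_paired]  # all partners plus "unpaired"
        h = -sum(p * math.log2(p) for p in states if p > 0)
        paired.append(p_paired)
        entropy.append(h)
    return paired, entropy

def window(values, atg_index, upstream=30, downstream=60):
    """Slice per-position values to the -30..+60 region around the start codon.

    atg_index is the 0-based index of the A of the start codon (position +1).
    """
    lo = max(atg_index - upstream, 0)
    return values[lo:atg_index + downstream]

# Toy input (hypothetical probabilities for a 6-nt sequence).
probs = {(1, 6): 0.5, (2, 5): 0.25, (1, 5): 0.25}
paired, entropy = positional_entropy(probs, 6)
for pos, (p, h) in enumerate(zip(paired, entropy), start=1):
    print(pos, round(p, 2), round(h, 2))
```

Low paired probability together with low entropy across the `window` slice would then mean "confidently unstructured", which is what one wants around the RBS and start codon; high entropy only means the prediction is uncertain there.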
Codon usage:
- I tried to get a "local tAI" as in Tuller 2010 instead of the CAI. However, the tAI requires the tRNA gene copy numbers, and I doubt that reliable information is available for my organism (http://gtrnadb.ucsc.edu gives me 53 tRNA genes, but it is based on a very old algorithm; http://trna.nagahama-i-bio.ac.jp gives me 13 tRNA genes and >100 candidate tRNA genes). How do you handle this uncertainty in tRNA gene prediction?
- So I am falling back to calculating the CAI on the basis of highly expressed genes using the JCat tool, and manually confirming that no rare codons, as indicated in the Kazusa DB and http://gtrnadb.ucsc.edu, are present in the optimized ORF.
- Don't forget to double-check for restriction sites.
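The two manual checks at the end (rare codons and restriction sites) are easy to script. A minimal Python sketch follows. The rare-codon set is a placeholder, not real data for any organism, and would have to be filled in from the codon usage table of the host. The enzyme list is just an example; use whichever enzymes matter for your cloning strategy. The function names are my own.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_rare_codons(orf, rare):
    """Return (codon_number, codon) for every in-frame codon in `rare`."""
    orf = orf.upper()
    return [
        (i // 3 + 1, orf[i:i + 3])
        for i in range(0, len(orf) - len(orf) % 3, 3)
        if orf[i:i + 3] in rare
    ]

def find_sites(seq, sites):
    """Return (enzyme, 1-based position) for each recognition site found.

    Both strands are covered by also searching the reverse complement of
    each site; for palindromic sites the two motifs are identical.
    """
    seq = seq.upper()
    hits = []
    for name, site in sites.items():
        for motif in {site, reverse_complement(site)}:
            start = seq.find(motif)
            while start != -1:
                hits.append((name, start + 1))
                start = seq.find(motif, start + 1)
    return sorted(hits, key=lambda hit: hit[1])

# Placeholder inputs: replace with the host's rare codons and your enzymes.
RARE = {"AGA", "TTA"}
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}

orf = "ATGGAATTCAGATTATAA"
print(find_rare_codons(orf, RARE))  # [(4, 'AGA'), (5, 'TTA')]
print(find_sites(orf, SITES))       # [('EcoRI', 4)]
```

Running both checks on every candidate that comes out of the optimizer catches cases where the optimization itself has introduced a restriction site or left a rare codon in place.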