Question

minimal free energy and RNA folding

0

4.1 years ago

Bogdan ★ 1.4k

Dear all,

I'd appreciate your advice on RNA secondary structures:

Considering software that predicts RNA secondary structure (e.g. http://rna.tbi.univie.ac.at/), what are the optimal numerical values of MFE (minimum free energy) for considering an RNA molecule well structured? Thanks a lot!

-- bogdan

MFE RNA • 3.0k views

ADD COMMENT • link 4.1 years ago by Bogdan ★ 1.4k

3

4.1 years ago

andrew.uzilov ▴ 60

To add some detail:

The dinucleotide shuffling algorithm is referred to as the "Altschul-Erickson shuffle". It is important because the MFE model that RNAz and many other tools use is based on dinucleotide stacking of RNA bases. So you can't just pick some MFE cutoff. You have to control for the dinucleotide distribution of the specific RNA sequence that you are assessing.
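To make "dinucleotide distribution" concrete, here is a minimal sketch that counts the overlapping dinucleotides in a sequence. The function name `dinucleotide_counts` is just an illustrative choice, and treating T as U is an assumption about how the input is written.

```python
from collections import Counter
from itertools import product


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides in an RNA sequence (T is read as U)."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Report all 16 possible dinucleotides, including those that never occur.
    # Pairs containing other symbols (e.g. N) are left out of the report.
    return {a + b: counts.get(a + b, 0) for a, b in product("ACGU", repeat=2)}


if __name__ == "__main__":
    for dinuc, n in dinucleotide_counts("GGGAAAUCCC").items():
        if n:
            print(dinuc, n)
```

Two sequences with the same length and base composition can still differ in these 16 counts, which is why a plain mononucleotide shuffle is not an adequate control for a stacking-based energy model.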

So in short, you would be calculating a p-value or a Z score or some such with respect to each sequence's control distribution. I don't know how many sequences you're trying to assess, but you could have a huge false positive problem on your hands, so you want to be careful with that.
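As a rough sketch of that calculation: given the MFE of the original sequence and the MFEs of its shuffled controls (however those were obtained), a Z score and an empirical p-value could be computed as below. The function names are hypothetical, and the `+1` pseudocount in the p-value is one common convention rather than the only option.

```python
import statistics


def mfe_z_score(observed_mfe, control_mfes):
    """Z score of the observed MFE relative to the shuffled-control MFEs."""
    mean = statistics.fmean(control_mfes)
    sd = statistics.stdev(control_mfes)  # sample standard deviation
    if sd == 0:
        raise ValueError("control MFEs have zero variance")
    return (observed_mfe - mean) / sd


def empirical_p_value(observed_mfe, control_mfes):
    """Fraction of controls with an MFE at least as low as the observed one."""
    hits = sum(1 for m in control_mfes if m <= observed_mfe)
    # The +1 in numerator and denominator keeps the estimate away from zero.
    return (hits + 1) / (len(control_mfes) + 1)


if __name__ == "__main__":
    controls = [-10.0, -12.0, -8.0, -11.0, -9.0]  # made-up values for illustration
    print(mfe_z_score(-15.0, controls))
    print(empirical_p_value(-15.0, controls))
```

A strongly negative Z score means the native sequence folds more stably than its own dinucleotide-matched controls. With many sequences tested, the resulting p-values still need a multiple-testing correction.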

Back in the day when I did this (and it's been a while, around the time that RNAz came out, see Uzilov 2005), I used Peter Clote's implementation of the AE Shuffle. Remarkably, the webserver is still up! And the Python scripts that I used are still available for download from that page.

You can see my paper, and the papers I cite, for the state of the field of this problem back in 2005, for both single-sequence and multi-sequence approaches. Unfortunately I've fallen out of the field, but last I checked, there are a lot of gotchas to finding whether something is "well-structured". Using multi-sequence data and looking for evidence of evolutionary conservation of structure is, in my experience, a much more robust approach than single-sequence analysis. I don't know the state of the art in this field, but I would advise you to follow the work of Jakob Pedersen (of EvoFold), David Mathews (of Dynalign and other software), Sean Eddy, and other people they cite -- the list is too large. I also wouldn't get attached to just MFE-based methods.

ADD COMMENT • link 4.1 years ago by andrew.uzilov ▴ 60

0

Dear Andrew, many thanks for your comments and suggestions.

if I may add a question (it is coming from a novice in RNA secondary structure prediction and interpretation) :

what is the typical MFE value range for a set of well-structured or less-structured RNAs ?

for a set of RNAs of the same length, if we make a histogram of their MFE values and it has a peak at MFE = -250, are the RNAs very well structured? They look well structured based on the hairpin/stem loops we get with RNAfold.

ADD REPLY • link 4.1 years ago by Bogdan ★ 1.4k

2

Quite frankly, ANY sufficiently long RNA sequence, even a randomly generated one, will be foldable into something that has some secondary structure and contains hairpin loops. Give it a try -- randomly generate some sequences and put them into any structure prediction algorithm -- you will get a structure no matter what. It is actually hard to find sequences that LACK structure, unless they are rich in low complexity/homopolymer sequence.
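If you want to try this, here is a small sketch that generates random sequences and prints them as FASTA, which can be pasted into a folding web server. The helper names are made up for illustration, and uniform base composition is an assumption (real transcripts are usually not uniform). The optional folding step assumes the ViennaRNA Python bindings (module `RNA`) are installed; if they are not, it simply returns `None`.

```python
import random


def random_rna(length, rng):
    """Return a uniformly random RNA sequence of the given length."""
    return "".join(rng.choice("ACGU") for _ in range(length))


def to_fasta(seqs, prefix="random"):
    """Format sequences as FASTA text, one record per sequence."""
    return "".join(f">{prefix}_{i}\n{s}\n" for i, s in enumerate(seqs, start=1))


def fold_if_available(seq):
    """Return (structure, mfe) from ViennaRNA, or None if it is not installed."""
    try:
        import RNA  # ViennaRNA Python bindings; assumed to be installed separately
    except ImportError:
        return None
    structure, mfe = RNA.fold(seq)
    return structure, mfe


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    seqs = [random_rna(100, rng) for _ in range(3)]
    print(to_fasta(seqs), end="")
    for s in seqs:
        print(fold_if_available(s))
```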

Regarding an MFE range, there are two things that are important -- the length of the sequence and the dinucleotide frequency distribution (i.e. how many AA, AC, AG, AU, CA, ... -- all 16 possibilities). Longer sequences will have lower (more negative) MFE since the deltaG free energy terms in the dinucleotide stacking energy model are additive. Now, in your study, at least all the sequences are the same length, so that variable is controlled. But the dinucleotide frequencies are not controlled. So it's hard to give an MFE range -- because that range would be conditioned on the dinuc distribution of some particular sequence. That is why I am advising that, for each sequence, you generate a set of shuffled sequences using the Altschul-Erickson shuffle, score their MFEs, and get an MFE range that would occur by chance conditioned on those dinuc frequencies.
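For reference, here is a minimal sketch of a dinucleotide-preserving shuffle in the spirit of the Altschul-Erickson approach. It treats the sequence as a walk through a graph whose edges are the adjacent base pairs, then draws a new random walk that uses exactly the same edges. This is an illustrative re-implementation, not Peter Clote's script or any package's code, and the function names are my own. Each shuffled sequence would then be folded with the same program as the original to build the control MFE distribution.

```python
import random
from collections import Counter


def _reaches(vertex, target, last_edge):
    """True if following the chosen last edges from vertex leads to target."""
    seen = set()
    while vertex != target:
        if vertex in seen or vertex not in last_edge:
            return False
        seen.add(vertex)
        vertex = last_edge[vertex]
    return True


def dinucleotide_shuffle(seq, rng=random):
    """Shuffle seq while keeping its exact dinucleotide counts and its ends."""
    if len(seq) < 3:
        return seq
    # Outgoing edges: for every base, the list of bases that follow it.
    edges = {}
    for a, b in zip(seq, seq[1:]):
        edges.setdefault(a, []).append(b)
    final = seq[-1]
    # Pick one "last exit" per base until the exits all lead to the final base.
    while True:
        last_edge = {v: rng.choice(out) for v, out in edges.items() if v != final}
        if all(_reaches(v, final, last_edge) for v in last_edge):
            break
    # Shuffle the remaining exits of each base; the chosen last exit goes last.
    order = {}
    for v, out in edges.items():
        out = list(out)
        if v in last_edge:
            out.remove(last_edge[v])
            rng.shuffle(out)
            out.append(last_edge[v])
        else:
            rng.shuffle(out)
        order[v] = out
    # Walk the graph from the first base, consuming each exit list in order.
    used = {v: 0 for v in order}
    current = seq[0]
    result = [current]
    for _ in range(len(seq) - 1):
        nxt = order[current][used[current]]
        used[current] += 1
        result.append(nxt)
        current = nxt
    return "".join(result)


if __name__ == "__main__":
    original = "GGGAAAUCCCGAUUACGCGCUAUAGGCAUC"
    shuffled = dinucleotide_shuffle(original, random.Random(0))
    same = Counter(zip(original, original[1:])) == Counter(zip(shuffled, shuffled[1:]))
    print(shuffled, "dinucleotide counts preserved:", same)
```

Very short or low-complexity sequences have few distinct shuffles, so the control distribution for them can be narrow.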

ADD REPLY • link 4.1 years ago by andrew.uzilov ▴ 60

0

Thanks a lot, Andrew, for your comments and suggestions; I am very grateful. They have been very helpful.

ADD REPLY • link 4.1 years ago by Bogdan ★ 1.4k

0

4.1 years ago

Bogdan ★ 1.4k

Dear all, many thanks for your time, replies, comments, and suggestions ! Stay healthy, and be safe ;) !

If I may add a question though : some of my colleagues are asking about the measures of the complexity of RNA secondary structures (that may take into consideration the number of hairpins, stem loops, possibly MFE ?).

I would appreciate having your insights ! Many thanks !

ADD COMMENT • link 4.1 years ago by Bogdan ★ 1.4k

score 5 · Accepted Answer · 2020-03-19

5

4.1 years ago

Asaf 10k

You can try and shuffle the sequence keeping di-nucleotide distribution constant and compare the original delta G to the distribution of shuffled sequences. I have some old code that does the shuffling, I can dig it up.

ADD COMMENT • link 4.1 years ago by Asaf 10k

1

This is a well-known randomization (shuffling) procedure to assess the relevance of a calculated MFE. If Asaf can't find his code, I am pretty sure that the ViennaRNA package has programs/scripts that will do it. I seem to remember that Sean Eddy's Easel library had a shuffling program as well.

ADD REPLY • link 4.1 years ago by Mensur Dlakic ★ 27k

0

Thank you gentlemen.

As I am just learning about RNA secondary structure prediction, I thought I should ask: should I use MEA (maximum expected accuracy) instead of MFE to assess the predicted structures, and to divide the RNA structures into 2 categories: a) more structured, and b) less structured?

many thanks !

ADD REPLY • link 4.1 years ago by Bogdan ★ 1.4k

1

Sorry, I didn't realize in my previous post that you gave some info on your study:

we have a set of 1000 RNA sequences that we compare and all of them have the same length (500 nt)

Are these human sequences? Do they have homologs if you BLAT or BLAST them against nearest related species? Do you know them to be transcribed? Do they belong to known genes, or are they intergenic transcripts of unknown function, or somewhere in between?

If you really can't incorporate related homologs, and you are forced to do single-sequence analysis, I guess you could just compute a Z score for each sequence by shuffling it using the Altschul-Erickson shuffle, and rank them by that... but depending on your study, there is a lot more that would have to be done to assign significance to any findings. Without knowing more about the study design, I'm not sure how to advise.

I'm also curious how you wound up with exactly 1000 sequences of all exactly the same length, unless what you're giving is an approximation?

To give you an idea of the difficulty of the problem, the FDR for screens for well-structured elements is 10% even in recent work by people who have been in this field for a LONG time, though what's published there is still an improvement in the FDR of one of my old studies. And those studies were using structure conservation evolutionary modeling, not single-sequence analysis. So this isn't a clean problem.

ADD REPLY • link 4.1 years ago by andrew.uzilov ▴ 60

score 4 · Accepted Answer · 2020-03-19

4

4.1 years ago

Mensur Dlakic ★ 27k

Maybe I misunderstand your question, but to my knowledge there is no such thing as optimal MFE. These energies are like raw BLAST scores, which tend to be larger for alignments of longer proteins without actually implying relationships. Similarly, larger RNA molecules are more likely to have lower energy than smaller molecules, but that doesn't mean they are folded better. A covariation pattern in multiple sequence alignments, along with low energy, is usually what is needed in a well-structured RNA.

If you look at Figure 2 of this paper, you'll see that they use both quantities I mentioned above to classify whether an RNA is likely to be structured or not. Z-score serves as a proxy for MFE, while structural conservation index (SCI) is dependent on sequence covariation.

ADD COMMENT • link 4.1 years ago by Mensur Dlakic ★ 27k

0

Dear Mensur, thank you very much for your comments and suggestions.

If I may add please : we have a set of 1000 RNA sequences that we compare and all of them have the same length (500 nt). Within this dataset, could I use MFE in order to gain an insight into more stable or less stable RNA secondary structures ?

About RNAz : hmmm, we may not be able to use it, for 2 reasons :

-- the RNA sequences that we look at are not very evolutionarily conserved. My understanding is that RNAz works best on evolutionarily conserved sequences, correct?

-- RNAz code seems to be a bit outdated, and the RNAz server does not work (http://rna.tbi.univie.ac.at/cgi-bin/RNAz/RNAz.cgi). There are RNAz tracks that are released for UCSC genome browser, however, those do not overlap the set of RNA that we are interested in ...

Any comments/suggestions would be very helpful. Thanks !

ADD REPLY • link 4.1 years ago by Bogdan ★ 1.4k

1

Generally speaking, lower energies mean more stable structures. However, it does matter how MFE is calculated, and what the differences are between molecules. Consider this: if the difference is 0.1, is that due to calculation error or because of true difference in stability?

You are correct that RNAz needs multiple alignments, but it is actually better if the sequences are not very similar in terms of simple identity. This is where covariation kicks in: one can get the same structure and similar energies from very different sequences, as long as the covariation pattern is preserved. So when you say that your sequences are not evolutionarily conserved, I assume you mean in terms of sequence identities. That would be fine. If you are talking about the lack of structural conservation, comparing energies of sequences of the same length that fold differently may not be appropriate.

Finally, I want to point out something even though it will likely complicate your life. There is a matter of RNA sequence recognition in biology that comes on top of RNA folding. If there is a protein recognizing some combination of RNA sequence and its folded structure, you may have a higher-energy RNA be more biologically relevant than a molecule with lower energy but with biologically unimportant structure.

ADD REPLY • link 4.1 years ago by Mensur Dlakic ★ 27k

0

Dear Mensur, thank you for your comments and insights.

Talking about the common/similar RNA folds in a set of RNA sequences (of the same length): besides RNAz (which does not compile on Ubuntu 18 in our hands, and whose webserver does not work), what other algorithms would you recommend to accomplish the same task?

RNAshapes ? RNAstructure ? any other algorithm/webserver ? thank you ..!

ADD REPLY • link 4.1 years ago by Bogdan ★ 1.4k

1

I like the ViennaRNA package, so my preference for single sequences would be RNAfold. You can download source or binaries from here. They also have a web server.

ADD REPLY • link 4.1 years ago by Mensur Dlakic ★ 27k

0

Thank you Mensur; yes, RNAfold has worked well in our hands.

If I may add a question though (from someone who has just started reading the literature on RNA folding): how could I use the MEA (maximum expected accuracy) to assess the predicted structures? Thanks!

ADD REPLY • link 4.1 years ago by Bogdan ★ 1.4k