6.0 years ago
Manu Madhavan
Hi all, for coding/non-coding RNA identification (using a machine learning classifier), I would like to add features extracted from RNA secondary structure. I used RNAfold to get the secondary structure from the primary sequence (in dot-bracket representation). Now I want to identify loops, stems, bulges, etc. in the structure and represent them as a feature vector (with numerical values).
- Is there any tool for this purpose?
- How can I identify the structural elements from dot-bracket notation?
- Is there any better numerical/vector representation for RNA secondary structure for machine learning applications?
For your first point, I have code for identifying secondary structure elements from base-pair lists (https://github.com/cschu/biolib, mdg_dt.py, don't judge me on the code :P) and I am quite certain I also have code lying around somewhere to convert bracket notation into such a list (just need to find it)... Alternatively, if you can program, you could write it yourself:
One option would be to iterate through the string, pushing the position of each opening bracket onto a stack. When a closing bracket is encountered, you pop the top element from the stack and store it, together with the position of the closing bracket, as a pair (position_open, position_close) in a list. The last step is to sort the list by position_open and process it with my code.
For your last point, have a look at this: https://www.ncbi.nlm.nih.gov/pubmed/19339518 (they used graph properties derived from the secondary structure as input for support vector machines).
ADDENDUM (edit): The processing above assumes a pure secondary structure without crossing or touching edges (RNAfold produces such a structure, I just wanted to mention it.)
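The stack-based conversion described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the biolib repository; the function name dot_bracket_to_pairs is made up for this example, positions are 0-based, and it assumes a pure nested structure (no pseudoknots) that uses only "(", ")" and ".".

```python
def dot_bracket_to_pairs(structure):
    """Convert a dot-bracket string into a sorted list of
    (position_open, position_close) pairs (0-based positions).

    Hypothetical helper for illustration; assumes a nested structure
    written only with '(', ')' and '.'.
    """
    stack = []   # positions of opening brackets not yet closed
    pairs = []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            # the most recent unclosed '(' is the partner of this ')'
            pairs.append((stack.pop(), pos))
        elif char != ".":
            raise ValueError(f"unexpected character {char!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    pairs.sort()  # sort by position_open
    return pairs


if __name__ == "__main__":
    print(dot_bracket_to_pairs("((..))"))
```

The explicit errors for unbalanced brackets are a design choice of this sketch; RNAfold output should always be balanced, so they mainly guard against copy/paste mistakes.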
Thank you for your reply and for sharing your insights. When I ran your code "mdg_dt.py", I got the following output: [Output]
Could you please help me to interpret this output?
Yea, of course, could you post your secondary structure, please? This code is quite old so I need to get back into it...
OK, please pull the code from GitHub again. I have modified it so that you can run it on a bracket string.
Thank you for your effort. Let me ask some basic questions (I am sorry if they are too elementary).
Can you suggest some materials to understand both the biological and computational perspectives of RNA secondary structure?
You're welcome.
((((...))((...))((...))))
Right now, I don't have a source at hand. You might find some insight in the introduction parts of my master's/bachelor's theses here: http://bioinf.darkjade.net/thesis/ (or in the cited references). If more comes to mind, I'll get back to you.
Thank you for the detailed reply.
I started reading your master's thesis, and it is helping me get a clearer understanding of secondary structure. I would like to know more about the definition and identification of the secondary structure motifs mentioned in the second chapter. If I can represent the secondary structure as fragments of structural motifs, this could serve as a feature vector for classification (my study is about identifying coding/non-coding RNA with machine learning techniques).
Hi,
Thanks a lot!! Your support and suggestions are very useful and motivational.
I mean, what is the algorithm for identifying structural fragments (as in Fig. 2.7 in your master's thesis)? What I understand is that a fragment may contain more than a single motif (stem, hairpin, multiloop, ...). Am I right?
n-multiloops also seem interesting, but I couldn't find any reference to understand them. Could you please give me any suggestions?
Hi,
I don't have the algorithm written out formally, but you can find the idea of it in the assemble() and find_all_motifs() functions in mdg_dt.py. While the output of the latter is different from what you see in my master's thesis, the general concept still applies. The reason for the difference in output is that mdg_dt.py was developed during my PhD, where I focused on loop structures.
My definitions from the master's work are a bit weak.
In general, the idea is that each RNA structure can, on the secondary structure level, be broken into paired (stems) and unpaired regions (loops). The traditional RNA secondary structure motifs* are composed of stems and loops. These are:
- hairpin: 1 stem terminated by 1 loop
- internal loop: 2 stems separated by 1 or 2 unpaired regions (the former is the bulge, which is a special case of an internal loop)
- n-branched (or n-) multiloop: (n > 2) + 1 stems connected by [0, n + 1] unpaired regions
*) And here it gets a bit complicated, as - depending on the source you use - a stem is counted as a motif. In that case, a fragment could contain more than one motif. However, if you view a stem as something more basic, then a structural fragment will only contain one secondary structure motif.
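To make the loop definitions above a bit more concrete, here is a rough Python sketch that labels each loop in a dot-bracket string by looking at what a base pair directly encloses. It is not the algorithm from mdg_dt.py; the names pair_table and classify_loops are invented for this example, positions are 0-based, and the input is assumed to be a balanced, pseudoknot-free structure. Any loop closed by one pair and enclosing two or more further pairs is simply called a "multiloop" here, which is one common convention and may count branches differently from the definition above. Unpaired bases outside all base pairs (the exterior) are ignored.

```python
def pair_table(structure):
    """partner[i] is the position paired with i, or -1 if i is unpaired.
    Assumes balanced brackets (hypothetical helper for illustration)."""
    partner = [-1] * len(structure)
    stack = []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            open_pos = stack.pop()
            partner[open_pos] = pos
            partner[pos] = open_pos
    return partner


def classify_loops(structure):
    """Return a list of (kind, open_pos, close_pos, n_unpaired), one entry
    per loop, where (open_pos, close_pos) is the pair closing the loop."""
    partner = pair_table(structure)
    loops = []
    for i, j in enumerate(partner):
        if j <= i:
            continue  # unpaired position, or the closing side of a pair
        # Walk the interior of (i, j), jumping over directly enclosed pairs.
        children, unpaired, k = [], 0, i + 1
        while k < j:
            if partner[k] > k:
                children.append((k, partner[k]))
                k = partner[k] + 1
            else:
                unpaired += 1
                k += 1
        if not children:
            kind = "hairpin"
        elif len(children) == 1:
            left = children[0][0] - i - 1    # unpaired bases on the 5' side
            right = j - children[0][1] - 1   # unpaired bases on the 3' side
            if left == 0 and right == 0:
                continue  # stacked pair: part of a stem, not a loop
            kind = "bulge" if left == 0 or right == 0 else "internal"
        else:
            kind = "multiloop"
        loops.append((kind, i, j, unpaired))
    return loops


if __name__ == "__main__":
    for loop in classify_loops("((((...))((...))((...))))"):
        print(loop)
```

For the example structure posted earlier in this thread, ((((...))((...))((...)))), this reports one multiloop closed by the pair (1, 23) with three enclosed branches, and three hairpins with three unpaired bases each.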
Only when you move on to tertiary structure will you deal with so-called composite motifs, which are two (or more) secondary structure motifs connected by a set of tertiary interactions (base pairs or base-backbone interactions).
Hope this makes sense.
Thank you for your clarification. This discussion has helped me get more insight into my problem. My study asks whether there are distinctive structural motif patterns in protein-coding and non-coding RNAs that we can use as features for identifying them with machine learning. Based on our discussion, I think the properties of loops/stems (length, nucleotide composition, ...) can be used as such features. Thank you.
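As a rough illustration of that idea, stem and loop properties could be turned into a small numerical feature vector along these lines. This is only a sketch: the function name structure_features and the particular features (paired fraction, stem and unpaired-region counts and mean lengths, GC content of paired versus unpaired positions) are assumptions chosen for the example, not something taken from the thread or the cited paper. It assumes a balanced, pseudoknot-free dot-bracket string of the same length as the sequence. Unpaired regions here are all runs of dots, so dangling ends count too.

```python
import re


def structure_features(sequence, structure):
    """Compute a few simple structure-derived features (illustrative only)."""
    if len(sequence) != len(structure) or not structure:
        raise ValueError("sequence and structure must be non-empty and of equal length")

    # Map each opening position to its closing partner (stack-based parse).
    partner, stack = {}, []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            partner[stack.pop()] = pos

    # A stem is a maximal run of directly stacked pairs (i, j), (i+1, j-1), ...
    stem_lengths = []
    for i, j in partner.items():
        if partner.get(i - 1) == j + 1:
            continue  # this pair extends a stem that starts further out
        length = 1
        while partner.get(i + length) == j - length:
            length += 1
        stem_lengths.append(length)

    # Unpaired regions are maximal runs of dots.
    unpaired_lengths = [len(run) for run in re.findall(r"\.+", structure)]

    seq = sequence.upper()
    paired = [b for b, s in zip(seq, structure) if s != "."]
    unpaired = [b for b, s in zip(seq, structure) if s == "."]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    def gc_fraction(bases):
        return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0

    return {
        "paired_fraction": len(paired) / len(structure),
        "n_stems": len(stem_lengths),
        "mean_stem_length": mean(stem_lengths),
        "n_unpaired_regions": len(unpaired_lengths),
        "mean_unpaired_length": mean(unpaired_lengths),
        "gc_paired": gc_fraction(paired),
        "gc_unpaired": gc_fraction(unpaired),
    }


if __name__ == "__main__":
    print(structure_features("GGGAAACCC", "(((...)))"))
```

The dictionary values can be put into a fixed order to form the vector for a classifier; which features are actually informative for coding versus non-coding RNA would have to be tested on real data.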
You're very welcome. It is known that there is a difference in "structuredness" between coding and non-coding RNAs, but any approach to quantify it more thoroughly would surely be of interest. You may also be interested to know that UTRs and introns may contain complex RNA structure motifs such as riboswitches.