Question: Folding RNA on python
3
gravatar for gprezza
17 months ago by
gprezza30
gprezza30 wrote:

Hi all,

I'd like to write a script in python that, given a set of RNA sequences, finds the ones with the strongest predicted folding. A quick and dirty way like folding them and picking the ones with a - delta G below a set threshold would suffice in my case.

I was looking at two "classical" folding packages (mfold and RNAfold from the Vienna package), that I can run on command line from python. However, neither of them has an options in which just the delta G is output. They produce a series or files for each RNA, and it would be complicated to dive through each one just to extract the delta G value. RNA fold has an output that would fit better my needs, but is still not the ideal.

Does anyone have any suggestion on how to proceed?

EDIT: I'll add some examples, as some people suggested. I'm a bit of a python novice, so my apologies if I am missing any obvious ways to solve my issue.

the best I managed to get is through RNAfold. The output looks like this, repeated for each sequence:

>seq1
AUGGCUGUUCGCCAUUUAAAG
(((((.....)))))...... ( -5.20)

It's not too bad, and I could extract the delta G value (-5.20 here) by telling python to extract the value between the brackets. However, I'm not sure that's the most elegant way, and I was wondering if anyone knew of packages that output a tab delimited file, or of ways to have RNAfold (or mfold) do that.

Another "problem" is that both mfold and RNAfold also produce one file with an image of the folding for each sequence (which I don't need), so then I would have to delete them via python, and again I'm not sure this is ideal.

folding rna python • 1.4k views
ADD COMMENTlink modified 8 months ago by jtimmons30 • written 17 months ago by gprezza30
1

Are they just text files? Have you tried parsing the text output? Python is usually very good at processing text.

ADD REPLYlink written 17 months ago by curious430

Thanks for the comment, I updated my question.

ADD REPLYlink written 17 months ago by gprezza30
1

Have you looked at RNAsketch? This is a Python wrapper for the Vienna Package, which might give you just what you want.

ADD REPLYlink written 17 months ago by cschu1812.5k

Without examples of your output it’s impossible for us to say how to proceed

ADD REPLYlink written 17 months ago by Joe17k

Thanks for the comment, I updated my question.

ADD REPLYlink written 17 months ago by gprezza30

Your approach you described is a perfectly valid approach. Pull out a number between brackets via a regular expression (it wouldn’t be enough to look for brackets alone, since the dot bracket notation would mess it up.

See my answer below.

ADD REPLYlink modified 17 months ago • written 17 months ago by Joe17k

awk '{print $1"\t"$n}' fill in <n> as whatever the column the value is in, assuming it is a tab deliminated file. $1 is the sample name if not... substitute that for wherever the sample name is on that file.

like Brett said... can't really help you if we don't know what file type you are looking at.

ADD REPLYlink written 17 months ago by ggman80
3
gravatar for jtimmons
8 months ago by
jtimmons30
Boston
jtimmons30 wrote:

I created a Python library that does what you want. It uses the same DP algorithm as mfold. There are small differences in free energy estimates between seqfold (this tool) and mfold/Unafold because of things like multi-loop energy calculation, but you can look at the project's examples directory (below) to see compared dg estimates for a list of DNA and RNA sequences.

Documentation

https://github.com/Lattice-Automation/seqfold

Examples

https://github.com/Lattice-Automation/seqfold/tree/master/examples

Python

from seqfold import dg, fold, Struct

# just returns minimum free energy
dg("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC", temp = 37.0)  # -12.94

# returns a list of `seqfold.Struct` from the minimum free energy structure
structs: List[Struct] = fold("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")
print(sum(s.e for s in structs))  # -12.94; same as dg()
for struct in structs:
    print(struct) # prints the i, j, dg, and description of each structure

CLI

$ seqfold GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC -t 32
-17.1
ADD COMMENTlink modified 8 months ago • written 8 months ago by jtimmons30

Since you are providing an answer this is great.

ADD REPLYlink written 8 months ago by genomax89k
0
gravatar for Joe
17 months ago by
Joe17k
United Kingdom
Joe17k wrote:

Here’s one approach:

import sys
import re

regex = re.compile(r'\( -?\d\.\d*\)|\( -?0\)')

with open(sys.argv[1], 'r') as fh:
    num = regex.findall("".join(fh.readlines()))[0].lstrip('( ').rstrip(')')
    print(num)

This is quite a cautious approach, as I pick up the ( ) as well as the leading space in your provided example (assuming thats not a typo), so consequently they need stripping off after to get the number.

I’ve also assumed that the minus symbol may be optional, and that 0 is a valid value (which may or may not have a minus sign).

This will break if, for some reason, your sequence name had a similarly formed decimal in it.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Joe17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 819 users visited in the last hour