Question: Using Biopython to present ORF lengths by identifier
0
gravatar for hamaisa
4.5 years ago by
hamaisa10
Australia
hamaisa10 wrote:

Hi,

I am new to computational biology and Python and lately, I've been fiddling with Python, specifically Biopython, to parse FASTA files and present summary information of the sequences in the file in an elegant manner.

I understand that ORFs can be identified and translated via Biopython using this tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc51

In my case though, I don't wish to translate my ORFs. I instead wish to compute the lengths of these ORFs and present them in a table next to information about the sequence they were found in (i.e. identifier).

I've tried the following messy format with very little success. I've highlighted where I've tried to input the ORF finder:

DNA = open("file.fasta").read()
import re
from Bio import SeqIO
for index, record in enumerate(SeqIO.parse(open("file.fasta"), "fasta")):
    for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
        for frame in range(3):
            length = 3 * ((len(record)-frame) // 3) #Multiple of three
        for orf in DNA:
            orf = re.search(r"ATG([ATGC]{:}TAA|TAG|TGA)", DNA).group()

            print "ID = %s, length %i, frame %i, strand %i" \
                  % record.id, len(orf), frame, strand)

The reason I've opened my FASTA file twice is because I can't enter 'record.seq' into the 're.search' expression without an error. I'd also like to insert a bit of code to find the start position of my ORFs but I'm having trouble as it is inserting the ORF finder.

Any advice on how to improve my code above is much appreciated!

 

programming biopython python • 4.5k views
ADD COMMENTlink modified 4.5 years ago by skbrimer620 • written 4.5 years ago by hamaisa10
1

Instead of opening your file twice, you should call the str() function on your record.seq object, and pass that to re.search() - it shouldn't throw an error because you are passing it a string argument (which is what re.search() expects)

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by James Ashmore2.8k

Thank you for the advice!

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by hamaisa10

Please don't delete posts after someone has helped you - imagine what the site would be like if everyone deleted their posts after someone took their time to assist.

ADD REPLYlink written 4.5 years ago by Istvan Albert ♦♦ 83k
0
gravatar for skbrimer
4.5 years ago by
skbrimer620
United States
skbrimer620 wrote:

I'm not sure if this will help you or not and I'm not a strong coder which is why I have been working for some practice but maybe this will help. the following code does NOT work.

from Bio import SeqIO
from Bio import Seq
import argparse
import re

parser = argparse.ArgumentParser()
parser.add_argument("-n","--name", type=str, help="name of fasta file")
args = parser.parse_args()

##read in fasta file
seq = SeqIO.readargs.name,"fasta")

##look for start:end ORF per strand (+1/-1)
def find_orfs(seq):
    answer = []
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            seq_str = str(seq)
            pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))', re.I)
            match = pattern.findall(seq_str)
            start = match.start()
            end = match.end()
            answer.append(start,end,strand,frame)
    answer.sort()
    return answer

##sepperated output of fasta id, orf length, orf start:stop

orf_list = find_orfs(seq.seq)

for start, end, frame, strand in orf_list:
    print("%s, has ORF on frame %s strand %s, %s:%s" \
            %seq.id, frame, strand, start, end))

I'm having issues with getting the regex to read the containers that biopython is making. I feel like I'm on the right track but it is not correct

You could also just you the code in the tutorial that does translate everything looking for the frames. It could be less efficent like they say but I bet it still works fast enough and you could just omit the sequence print out to collect the information you are after. That way you will have a solution until you can figure out the more effecent script.

ADD COMMENTlink modified 4 months ago by RamRS26k • written 4.5 years ago by skbrimer620
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1448 users visited in the last hour