Question

Using Biopython to present ORF lengths by identifier

0

Entering edit mode

8.5 years ago

hamaisa ▴ 10

Hi,

I am new to computational biology and Python and lately, I've been fiddling with Python, specifically Biopython, to parse FASTA files and present summary information of the sequences in the file in an elegant manner.

I understand that ORFs can be identified and translated via Biopython using this tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc51

In my case though, I don't wish to translate my ORFs. I instead wish to compute the lengths of these ORFs and present them in a table next to information about the sequence they were found in (i.e. identifier).

I've tried the following messy format with very little success. I've highlighted where I've tried to input the ORF finder:

DNA = open("file.fasta").read()
import re
from Bio import SeqIO
for index, record in enumerate(SeqIO.parse(open("file.fasta"), "fasta")):
    for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
        for frame in range(3):
            length = 3 * ((len(record)-frame) // 3) #Multiple of three
        **for orf in DNA:
            orf = re.search(r"ATG([ATGC]{:}TAA|TAG|TGA)", DNA).group()**
            print "ID = %s, length %i, frame %i, strand %i" \
                  % (record.id, len(orf), frame, strand)

The reason I've opened my FASTA file twice is because I can't enter record.seq into the re.search expression without an error. I'd also like to insert a bit of code to find the start position of my ORFs but I'm having trouble as it is inserting the ORF finder.

Any advice on how to improve my code above is much appreciated!

python biopython • 8.4k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 8.5 years ago by hamaisa ▴ 10

1

Entering edit mode

Instead of opening your file twice, you should call the str() function on your record.seq object, and pass that to re.search() - it shouldn't throw an error because you are passing it a string argument (which is what re.search() expects)

ADD REPLY • link 8.5 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

Thank you for the advice!

ADD REPLY • link 8.5 years ago by hamaisa ▴ 10

0

Entering edit mode

Please don't delete posts after someone has helped you - imagine what the site would be like if everyone deleted their posts after someone took their time to assist.

ADD REPLY • link 8.5 years ago by Istvan Albert 100k

Ram · Answer 1 · 2015-10-19

I'm not sure if this will help you or not and I'm not a strong coder which is why I have been working for some practice but maybe this will help. the following code does NOT work.

from Bio import SeqIO
from Bio import Seq
import argparse
import re

parser = argparse.ArgumentParser()
parser.add_argument("-n","--name", type=str, help="name of fasta file")
args = parser.parse_args()

##read in fasta file
seq = SeqIO.readargs.name,"fasta")

##look for start:end ORF per strand (+1/-1)
def find_orfs(seq):
    answer = []
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            seq_str = str(seq)
            pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))', re.I)
            match = pattern.findall(seq_str)
            start = match.start()
            end = match.end()
            answer.append(start,end,strand,frame)
    answer.sort()
    return answer

##sepperated output of fasta id, orf length, orf start:stop

orf_list = find_orfs(seq.seq)

for start, end, frame, strand in orf_list:
    print("%s, has ORF on frame %s strand %s, %s:%s" \
            %seq.id, frame, strand, start, end))

I'm having issues with getting the regex to read the containers that biopython is making. I feel like I'm on the right track but it is not correct

You could also just you the code in the tutorial that does translate everything looking for the frames. It could be less efficent like they say but I bet it still works fast enough and you could just omit the sequence print out to collect the information you are after. That way you will have a solution until you can figure out the more effecent script.