Question: Using Biopython to present ORF lengths by identifier
gravatar for hamaisa
5.3 years ago by
hamaisa10 wrote:


I am new to computational biology and Python and lately, I've been fiddling with Python, specifically Biopython, to parse FASTA files and present summary information of the sequences in the file in an elegant manner.

I understand that ORFs can be identified and translated via Biopython using this tutorial:

In my case though, I don't wish to translate my ORFs. I instead wish to compute the lengths of these ORFs and present them in a table next to information about the sequence they were found in (i.e. identifier).

I've tried the following messy format with very little success. I've highlighted where I've tried to input the ORF finder:

DNA = open("file.fasta").read()
import re
from Bio import SeqIO
for index, record in enumerate(SeqIO.parse(open("file.fasta"), "fasta")):
    for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
        for frame in range(3):
            length = 3 * ((len(record)-frame) // 3) #Multiple of three
        for orf in DNA:
            orf ="ATG([ATGC]{:}TAA|TAG|TGA)", DNA).group()

            print "ID = %s, length %i, frame %i, strand %i" \
                  %, len(orf), frame, strand)

The reason I've opened my FASTA file twice is because I can't enter 'record.seq' into the '' expression without an error. I'd also like to insert a bit of code to find the start position of my ORFs but I'm having trouble as it is inserting the ORF finder.

Any advice on how to improve my code above is much appreciated!


programming biopython python • 5.5k views
ADD COMMENTlink modified 5.3 years ago by skbrimer650 • written 5.3 years ago by hamaisa10

Instead of opening your file twice, you should call the str() function on your record.seq object, and pass that to - it shouldn't throw an error because you are passing it a string argument (which is what expects)

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by James Ashmore3.0k

Thank you for the advice!

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by hamaisa10

Please don't delete posts after someone has helped you - imagine what the site would be like if everyone deleted their posts after someone took their time to assist.

ADD REPLYlink written 5.3 years ago by Istvan Albert ♦♦ 86k
gravatar for skbrimer
5.3 years ago by
United States
skbrimer650 wrote:

I'm not sure if this will help you or not and I'm not a strong coder which is why I have been working for some practice but maybe this will help. the following code does NOT work.

from Bio import SeqIO
from Bio import Seq
import argparse
import re

parser = argparse.ArgumentParser()
parser.add_argument("-n","--name", type=str, help="name of fasta file")
args = parser.parse_args()

##read in fasta file
seq =,"fasta")

##look for start:end ORF per strand (+1/-1)
def find_orfs(seq):
    answer = []
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            seq_str = str(seq)
            pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))', re.I)
            match = pattern.findall(seq_str)
            start = match.start()
            end = match.end()
    return answer

##sepperated output of fasta id, orf length, orf start:stop

orf_list = find_orfs(seq.seq)

for start, end, frame, strand in orf_list:
    print("%s, has ORF on frame %s strand %s, %s:%s" \
  , frame, strand, start, end))

I'm having issues with getting the regex to read the containers that biopython is making. I feel like I'm on the right track but it is not correct

You could also just you the code in the tutorial that does translate everything looking for the frames. It could be less efficent like they say but I bet it still works fast enough and you could just omit the sequence print out to collect the information you are after. That way you will have a solution until you can figure out the more effecent script.

ADD COMMENTlink modified 14 months ago by _r_am32k • written 5.3 years ago by skbrimer650
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1532 users visited in the last hour