Python script to find ORFs not returning expected results for final frame
1
3
Entering edit mode
7.8 years ago
njmcdonald ▴ 40

I am using this as a learning exercise, I realize there are existing tools to determine the end result. The goal is to gain experience with Python and nuances of syntax.

Given that I have a fasta file I am parsing, and want to find non-overlapping ORFs in all possible frames (0,1,2). My code I have written thus far is below:

def getORF(seq, treshold, start, stop):
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
start_codon_index = 0
end_codon_index = 0
start_codon_found = False
for frame in range(3):
    q=frame
    for i in range(q, len(seq), 3):
        current_codon = seq[i:i+3]                            
        if current_codon in start and not start_codon_found:
            start_codon_found = True
            start_codon_index = i
            e=start_codon_index+1                
        if current_codon in stop and start_codon_found:
            end_codon_index = i
            d=end_codon_index+3
            length = end_codon_index - start_codon_index + 3
            if length >= treshold * 3:        #treshold is protein length but length is nuc
                b=length//3 + 3
                start_codon_found = False
                print("aa %i, nuc %i, frame %i, coord %i:%i\n" % (b, length, q+1, e, d))

However, when I run the following (myfile contains a single sequence record, but goal is to use multiple fasta record) it only returns results for the first 2 frames.

for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
seq=record.seq
name=record.id
print("%s\n" % name)
getORF(seq,2,start,stop)

I know there are ORFs in frame 3, and I can find start and stop codons using the following scripts:

def test(seq, start, stop):
    start = ["ATG"]
    stop = ["TAA","TAG","TGA"]
    start_codon_index = 0
    end_codon_index = 0
    q=2
    for i in range(q, len(seq), 3):
        current_codon = seq[i:i+3]                            
        if current_codon in start:
            start_codon_index = i
            print("Start at %i " % start_codon_index)

def test2(seq, start, stop):
    start = ["ATG"]
    stop = ["TAA","TAG","TGA"]
    start_codon_index = 0
    end_codon_index = 0
    q=2
    for i in range(q, len(seq), 3):
        current_codon = seq[i:i+3]                            
        if current_codon in stop:
            stop_codon_index = i
            print("Stop at %i " % stop_codon_index)

for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
    seq=record.seq
    name=record.id
    print("%s\n" % name)
    test(seq,start,stop)

for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
    seq=record.seq
    name=record.id
    print("%s\n" % name)
    test2(seq,start,stop)

I can't figure out why I am getting no results for frame 2. I have even tried throwing in 4 as my range for frame, and instead I get a result for a non existent frame 4 and no results for frame 3.... Clearly I am missing something obvious! Thanks in advance for any help!

sequence Python ORF • 2.6k views
ADD COMMENT
4
Entering edit mode
7.8 years ago
Markus ▴ 320

You have to reset your start_codon_found variable to False in each frame loop. Otherwise it can be that you find a start codon in frame 2 but no more stop codon. Then, in the next frame loop, your start_codon_found varible is still true. Try this:

...
for frame in range(3):
    start_codon_index = 0
    end_codon_index = 0
    start_codon_found = False
    for i in range(frame, len(seq), 3):
    ....

This makes it working for me.

ADD COMMENT
1
Entering edit mode

Many thanks! I knew I had to be overlooking something simple!

ADD REPLY

Login before adding your answer.

Traffic: 1831 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6