Question

Python script to find ORFs not returning expected results for final frame

3

Entering edit mode

7.8 years ago

njmcdonald ▴ 40

I am using this as a learning exercise, I realize there are existing tools to determine the end result. The goal is to gain experience with Python and nuances of syntax.

Given that I have a fasta file I am parsing, and want to find non-overlapping ORFs in all possible frames (0,1,2). My code I have written thus far is below:

def getORF(seq, treshold, start, stop):
start = ["ATG"]
stop = ["TAA","TAG","TGA"]
start_codon_index = 0
end_codon_index = 0
start_codon_found = False
for frame in range(3):
    q=frame
    for i in range(q, len(seq), 3):
        current_codon = seq[i:i+3]                            
        if current_codon in start and not start_codon_found:
            start_codon_found = True
            start_codon_index = i
            e=start_codon_index+1                
        if current_codon in stop and start_codon_found:
            end_codon_index = i
            d=end_codon_index+3
            length = end_codon_index - start_codon_index + 3
            if length >= treshold * 3:        #treshold is protein length but length is nuc
                b=length//3 + 3
                start_codon_found = False
                print("aa %i, nuc %i, frame %i, coord %i:%i\n" % (b, length, q+1, e, d))

However, when I run the following (myfile contains a single sequence record, but goal is to use multiple fasta record) it only returns results for the first 2 frames.

for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
seq=record.seq
name=record.id
print("%s\n" % name)
getORF(seq,2,start,stop)

I know there are ORFs in frame 3, and I can find start and stop codons using the following scripts:

def test(seq, start, stop):
    start = ["ATG"]
    stop = ["TAA","TAG","TGA"]
    start_codon_index = 0
    end_codon_index = 0
    q=2
    for i in range(q, len(seq), 3):
        current_codon = seq[i:i+3]                            
        if current_codon in start:
            start_codon_index = i
            print("Start at %i " % start_codon_index)

def test2(seq, start, stop):
    start = ["ATG"]
    stop = ["TAA","TAG","TGA"]
    start_codon_index = 0
    end_codon_index = 0
    q=2
    for i in range(q, len(seq), 3):
        current_codon = seq[i:i+3]                            
        if current_codon in stop:
            stop_codon_index = i
            print("Stop at %i " % stop_codon_index)

for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
    seq=record.seq
    name=record.id
    print("%s\n" % name)
    test(seq,start,stop)

for record in SeqIO.parse(myfile,'fasta',IUPAC.unambiguous_dna):
    seq=record.seq
    name=record.id
    print("%s\n" % name)
    test2(seq,start,stop)

I can't figure out why I am getting no results for frame 2. I have even tried throwing in 4 as my range for frame, and instead I get a result for a non existent frame 4 and no results for frame 3.... Clearly I am missing something obvious! Thanks in advance for any help!

sequence Python ORF • 2.6k views

ADD COMMENT • link updated 4.1 years ago by Biostar 20 • written 7.8 years ago by njmcdonald ▴ 40

score 4 · Accepted Answer · 2016-06-26

4

Entering edit mode

7.8 years ago

Markus ▴ 320

You have to reset your start_codon_found variable to False in each frame loop. Otherwise it can be that you find a start codon in frame 2 but no more stop codon. Then, in the next frame loop, your start_codon_found varible is still true. Try this:

...
for frame in range(3):
    start_codon_index = 0
    end_codon_index = 0
    start_codon_found = False
    for i in range(frame, len(seq), 3):
    ....

This makes it working for me.

ADD COMMENT • link 7.8 years ago by Markus ▴ 320

1

Entering edit mode

Many thanks! I knew I had to be overlooking something simple!

ADD REPLY • link 7.8 years ago by njmcdonald ▴ 40