Question: ORF reading in Python
0
gravatar for a.lathifbt
5 months ago by
a.lathifbt0
Chennai, India
a.lathifbt0 wrote:

I am trying to read ORF from a list of sequences and identify the one with maximum length and correspond it to the sequence and the reading frame it belongs to as the output. I wrote the code and it's not quite working properly.

def startstop_codon(dna,frame):
    """ TO CHECK THE PRESENCE OF START CODON AND ITS POSITION"""
    start_codons=['ATG','atg'] ; stop_codons=['TAA','TAG','TGA','taa','tag','tga'] #referece list of start and stop codons
    for i in range(frame,len(dna),3):
        codon1=dna[i:i+3]
        if codon1 in start_codons:
            position1=dna.index(codon1)#getting the index of the start codon
            ORF_start=position1+1
            for j in range(position1,len(dna),3):
                codon2=dna[j:j+3]
                if codon2 in stop_codons:
                    position2=dna.index(codon2)#getting the index of the stop codon
                    ORF_stop=position2+1
                    break #terminating the loop when a stop codon is found
             break
    try:
        return len(dna[position1:position2])+2
    except UnboundLocalError:
        None 

def finding_ORF(frame):
    """TO IDENTIFY THE ORF IN THE READING FRAMES AND 
    IDENTIFY THE LONGEST ORF IN EACH SEQUENCE AND THE POSITION OF THE START CODON"""
    sequences=[reads.seq for reads in records] #creating alist of all the sequence reads as elements of the list
    ORF_lengths={}
    positions={}
    max_lenth_ORFs=[]
    for read in range(len(records)): #loop for accessing each sequence read from the list
        seq=str(sequences[read])
        a=startstop_codon(seq,frame) #calling the codon reading function
        if a== None: #assigning value for reads with no codons
            a = 0
        ORF_lengths[records[read].id]=a #creating a dictionary of ORF lengths(value) and sequence IDs(key)
    max_len=max(ORF_lengths,key=ORF_lengths.get) #getting the ID of the Sequence read containing the maximum length ORF
    #printing the result
    print("Longest ORF in Frame %d is of length %d and its ID is %s." % (frame+1,ORF_lengths[max_len],max_len))

"I have only given the functions that do this job from the whole script"

A sample output I am getting is:

Longest ORF in Frame 1 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 2 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 3 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.

The lengths are not correct and for somereason, it gets only stuck in a single read as in read 229 here. I spent hours looking for the mistake, but could'nt. can someone please look it up and tell me how I can refine it. THANKS IN ADVANCE!

sequence gene • 545 views
ADD COMMENTlink modified 4 months ago by a.zielezinski8.9k • written 5 months ago by a.lathifbt0
1

The problem is with position1 = dna.index(codon1) and position2 = dna.index(codon2). If the same codon is present more than once in dna, index() method always returns its smallest/first position. That is why, you get the same ORF length all the time. You can fix this by assigning position1 = i and position2 = j.

ADD REPLYlink written 4 months ago by a.zielezinski8.9k

Thank you. It is working now. But the issue I am facing now is it's reading only one orf per sequence and if it finds the first one, it stops checking for more. presence of more than ORF per sequence and the second on being longer than the first is complicating the problem. any idea how to overcome this?

ADD REPLYlink written 4 months ago by a.lathifbt0

Please see my answer.

ADD REPLYlink written 4 months ago by a.zielezinski8.9k
  1. Please format your code properly using the code formatting options
  2. Make your life easier using the getorf program, then filter the output and only retain the longest ORF per sequence
ADD REPLYlink written 5 months ago by Michael Dondrup46k

At a quick glance you may need to show us more of the script, where these functions are actually called. Based on your output I’d guess that your loop over your file is incorrect, rather than your ORF finding.

ADD REPLYlink written 4 months ago by Joe14k
3
gravatar for a.zielezinski
4 months ago by
a.zielezinski8.9k
a.zielezinski8.9k wrote:

I edited your startstop_codon() function to handle multiple ORFs in a single DNA sequence.

def startstop_codon(dna, frame):
    dna = dna.upper()
    for i in range(frame, len(dna), 3):
        codon1 = dna[i:i+3]
        if codon1 == 'ATG':
            position1 = i
            for j in range(position1, len(dna), 3):
                codon2 = dna[j:j+3]
                if codon2 in ['TAA', 'TAG', 'TGA']:
                    position2 = j
                    yield (position2-position1+3, dna[position1:position2+3])
                    break

For example, running the function the following way:

for orflen, orf in startstop_codon('ATGabcabcTGAabcATGabcTAG', 0):
    print(orflen, orf)

will give you:

(12, 'ATGABCABCTGA')
(9, 'ATGABCTAG')

or you can directly translate the above output to a list:

print(list(startstop_codon('ATGabcabcTGAabcATGabcTAG', 0)))

which will give you:

[(12, 'ATGABCABCTGA'), (9, 'ATGABCTAG')]
ADD COMMENTlink modified 4 months ago • written 4 months ago by a.zielezinski8.9k

Thank you so much. I wasn't familiar with yield function. It's helpful :)

ADD REPLYlink written 4 months ago by a.lathifbt0

Glad it helped. Does this answer your question?

ADD REPLYlink written 4 months ago by a.zielezinski8.9k

Yes, it's helpful and it answer the question.

ADD REPLYlink written 4 months ago by a.lathifbt0

Don't forget to accept and upvote the answers which were helpful to you :)

ADD REPLYlink written 4 months ago by a.zielezinski8.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1855 users visited in the last hour