1
1
Entering edit mode
2.5 years ago
a.lathifbt ▴ 30

I am trying to read ORF from a list of sequences and identify the one with maximum length and correspond it to the sequence and the reading frame it belongs to as the output. I wrote the code and it's not quite working properly.

def startstop_codon(dna,frame):
""" TO CHECK THE PRESENCE OF START CODON AND ITS POSITION"""
start_codons=['ATG','atg'] ; stop_codons=['TAA','TAG','TGA','taa','tag','tga'] #referece list of start and stop codons
for i in range(frame,len(dna),3):
codon1=dna[i:i+3]
if codon1 in start_codons:
position1=dna.index(codon1)#getting the index of the start codon
ORF_start=position1+1
for j in range(position1,len(dna),3):
codon2=dna[j:j+3]
if codon2 in stop_codons:
position2=dna.index(codon2)#getting the index of the stop codon
ORF_stop=position2+1
break #terminating the loop when a stop codon is found
break
try:
return len(dna[position1:position2])+2
except UnboundLocalError:
None

def finding_ORF(frame):
"""TO IDENTIFY THE ORF IN THE READING FRAMES AND
IDENTIFY THE LONGEST ORF IN EACH SEQUENCE AND THE POSITION OF THE START CODON"""
sequences=[reads.seq for reads in records] #creating alist of all the sequence reads as elements of the list
ORF_lengths={}
positions={}
max_lenth_ORFs=[]
for read in range(len(records)): #loop for accessing each sequence read from the list
a=startstop_codon(seq,frame) #calling the codon reading function
if a== None: #assigning value for reads with no codons
a = 0
ORF_lengths[records[read].id]=a #creating a dictionary of ORF lengths(value) and sequence IDs(key)
max_len=max(ORF_lengths,key=ORF_lengths.get) #getting the ID of the Sequence read containing the maximum length ORF
#printing the result
print("Longest ORF in Frame %d is of length %d and its ID is %s." % (frame+1,ORF_lengths[max_len],max_len))


"I have only given the functions that do this job from the whole script"

A sample output I am getting is:

Longest ORF in Frame 1 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 2 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.
Longest ORF in Frame 3 is of length 1310 and its ID is gi|142022655|gb|EQ086233.1|229.


The lengths are not correct and for somereason, it gets only stuck in a single read as in read 229 here. I spent hours looking for the mistake, but could'nt. can someone please look it up and tell me how I can refine it. THANKS IN ADVANCE!

gene sequence • 5.6k views
1
Entering edit mode

The problem is with position1 = dna.index(codon1) and position2 = dna.index(codon2). If the same codon is present more than once in dna, index() method always returns its smallest/first position. That is why, you get the same ORF length all the time. You can fix this by assigning position1 = i and position2 = j.

0
Entering edit mode

Thank you. It is working now. But the issue I am facing now is it's reading only one orf per sequence and if it finds the first one, it stops checking for more. presence of more than ORF per sequence and the second on being longer than the first is complicating the problem. any idea how to overcome this?

0
Entering edit mode

0
Entering edit mode
2. Make your life easier using the getorf program, then filter the output and only retain the longest ORF per sequence
0
Entering edit mode

At a quick glance you may need to show us more of the script, where these functions are actually called. Based on your output I’d guess that your loop over your file is incorrect, rather than your ORF finding.

0
Entering edit mode

Since I can run a "Fasta" file, I'm entering it:

for orflen, orf in startstop_codon ('dna.example.fasta', 0):           print (orflen, orf)

And the command is not reading the ORFs

1
Entering edit mode

This function reads just the sequence and not fasta files directly. You need to parse the fasta file and extract only the sequence before you can run the function. Hope it helps.

0
Entering edit mode

7
Entering edit mode
2.5 years ago

I edited your startstop_codon() function to handle multiple ORFs in a single DNA sequence.

def startstop_codon(dna, frame):
dna = dna.upper()
for i in range(frame, len(dna), 3):
codon1 = dna[i:i+3]
if codon1 == 'ATG':
position1 = i
for j in range(position1, len(dna), 3):
codon2 = dna[j:j+3]
if codon2 in ['TAA', 'TAG', 'TGA']:
position2 = j
yield (position2-position1+3, dna[position1:position2+3])
break


For example, running the function the following way:

for orflen, orf in startstop_codon('ATGabcabcTGAabcATGabcTAG', 0):
print(orflen, orf)


will give you:

(12, 'ATGABCABCTGA')
(9, 'ATGABCTAG')


or you can directly translate the above output to a list:

print(list(startstop_codon('ATGabcabcTGAabcATGabcTAG', 0)))


which will give you:

[(12, 'ATGABCABCTGA'), (9, 'ATGABCTAG')]

0
Entering edit mode

Thank you so much. I wasn't familiar with yield function. It's helpful :)

0
Entering edit mode

0
Entering edit mode

0
Entering edit mode

Don't forget to accept and upvote the answers which were helpful to you :)