Hello all, I have been working on a practice problem from Rosalind, and I think I am 99% done, but there is one issue: my code seems to output an extra reading frame that Rosalind does not accept as correct. I have no clue why it does that, but I assume it has to do with my use of regex. I compared my code and output with someone else's code I found on GitHub, but since they used a different method to find the ORFs, the comparison did not help me pin down the problem in my code (bear in mind I am very new to programming). I will include my code and its output, and the other person's code and output.
Dataset:
>Rosalind_4682
CACTAGTGTGCCCAAAGGTTGCCAGACGGTATCCTACGGTGTCTTGGGTTAGGCCTGAAA
ACTAAAGCGGCTCCGAGATAACCGTCGCGCGTTCAGGCCCGATATTAAATCCAAGGATCG
CGTCCACGGCTACAGCCATAACTGAGGGTCTGGCCGGGCTTTTTTATTTCCTTCGTCCGA
TAGTAACCCCTTTCGGAGCGAAATGGCCAAATTGATTATTCGGGCTCACTTGAAATCAGT
GTACTACTGCTGAGGTTACGGGACTGGCGTTAATCGGCGGGGCCGGCTAGCAGGTTCTTG
AAAGAAGCTCCTGCGTACTATATTTGGTATGTATCATGACATAGGGATGATACCGTCCCC
GGTCAACATAGGCGACATCGTAGTTTATTTACGGTCGGCAGGTCGGCTTGGATGCCCCAA
TGACCATCAATTGAAGGTCCGATCGTATATAGCTATATACGATCGGACCTTCAATTGATG
GTCATCTGTTTTATAACCTACGGCGGTACCCCATTCGATCTGCCAAACCGTAGCCATGTC
CGTATGCTGAAACATGGTATACAGGATTTGGGACAAGCCCCCCAGCCCATCCGTGGTGCT
CCGGGTGCGAACCTCAGGATGCGATTTGACAATACCGGCACTAAAGCCACGGATAGCCGT
CTCCGACCTCGAGGCTTGGGCACCAAGCTGAGCCCATGCTTTCTATGTCGTTTCATTCGA
ACCGAGCAGAGGGTTCAGTGGCTGTAGAACGCAACGGTTGAGTAAAGGTCTTGACTTAGA
ATCGTATCTCCAACTAATGATGTTATCCCGAAGGCCCCCACCGCAGCGCTTATCAACGGA
TCAACCCGACATCACAGAAGATAACGCGTGTAGGCCAGCCAGAACTACGATTCTGACTTT
CCGT
My code:
# problem with code: it outputs extra sequences - not sure why yet
import re
from Bio import SeqIO
record = SeqIO.read("/Users/danielpintard/downloads/rosalind_orf (1).txt", 'fasta')
string = record.seq
rna_seq = str(string.replace('T', 'U'))
reverse_comp = (rna_seq[::-1].replace("A", "u").replace("U", "a").replace("G", "c").replace('C', 'g')).upper()
amino_acid_codon_dict = {'UUU':'F', 'UUC':'F', 'UUU':'F', 'UUC':'F', 'UUA':'L', 'UUG':'L', 'UCU':'S', 'UCC':'S', 'UCA':'S', 'UCG':'S', 'UAU':'Y', 'UAC':'Y', 'UAA':'Stop', 'UAG':'Stop', 'UGU':'C', 'UGC':'C', 'UGA':'Stop', 'UGG':'W',
'CUU':'L', 'CUC':'L', 'CUA':'L', 'CUG':'L', 'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', 'CAU':'H','CAC':'H', 'CAA':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG':'R',
'AUU':'I', 'AUC':'I','AUA':'I','AUG':'M','ACU':'T','ACC':'T','ACA':'T','ACG':'T','AAU':'N','AAC':'N','AAA':'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R','GUU':'V','GUC':'V',
'GUA':'V','GUG':'V','GCU':'A','GCC':'A','GCA':'A','GCG':'A','GAU':'D','GAC':'D','GAA':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG':'G'}
pattern = re.compile(r'(?=(AUG(?:...)*?)(?=UAA|UAG|UGA))')
frags = []
for s in re.findall(pattern, rna_seq):
    frags.append(s)
for s in re.findall(pattern, reverse_comp):
    frags.append(s)
for sequence in frags:
    split_codons = []
    prot_seq = []
    for i in range(0, len(sequence), 3):
        codon = sequence[i:i+3]
        split_codons.append(codon)
    for s in split_codons:
        prot_seq.append(amino_acid_codon_dict[s])
    for i in prot_seq:
        print(''.join(prot_seq))
        break
My output for one of the datasets given is:
MAKLIIRAHLKSVYYC
MYHDIGMIPSPVNIGDIVVYLRSAGRLGCPNDHQLKVRSYIAIYDRTFN
MT
MIPSPVNIGDIVVYLRSAGRLGCPNDHQLKVRSYIAIYDRTFN
MPQ
MTIN
MVICFITYGGTPFDLPNRSHVRMLKHGIQDLGQAPQPIRGAPGANLRMRFDNTGTKATDSRLRPRGLGTKLSPCFLCRFIRTEQRVQWL
MSVC
MLKHGIQDLGQAPQPIRGAPGANLRMRFDNTGTKATDSRLRPRGLGTKLSPCFLCRFIRTEQRVQWL
MVYRIWDKPPSPSVVLRVRTSGCDLTIPALKPRIAVSDLEAWAPS
MRFDNTGTKATDSRLRPRGLGTKLSPCFLCRFIRTEQRVQWL
MLSMSFHSNRAEGSVAVERNG
MSFHSNRAEGSVAVERNG
MSG
MKRHRKHGLSLVPKPRGRRRLSVALVPVLSNRILRFAPGAPRMGWGACPKSCIPCFSIRTWLRFGRSNGVPP
MGSAWCPSLEVGDGYPWL
MGWGACPKSCIPCFSIRTWLRFGRSNGVPP
MFQHTDMATVWQIEWGTAVGYKTDDHQLKVRSYIAIYDRTFN
MATVWQIEWGTAVGYKTDDHQLKVRSYIAIYDRTFN
MGYRRRL
MTIN
MVIGASKPTCRP
MSPMLTGDGIIPMS
MLTGDGIIPMS
MS
MIHTKYSTQELLSRTC
MAVAVDAILGFNIGPERATVISEPL
The other person's code:
with open("/Users/danielpintard/downloads/rosalind_orf (1).txt","r") as f:
    f.readline()
    dna=''
    for line in f:
        dna+=line.strip()
dna=list(dna)
rnareverse=[]
for i in range(len(dna)):
    ind=len(dna)-1-i
    if dna[ind]=="T":
        dna[ind]="U"
        rnareverse.append("A")
    else:
        if dna[ind] == "A":
            rnareverse.append("U")
        else:
            if dna[ind]=="C":
                rnareverse.append("G")
            else:
                rnareverse.append("C")
rna=''.join(dna)
rnareverse=''.join(rnareverse)
codon={"UUU": "F", "CUU": "L", "AUU": "I", "GUU": "V", "UUC": "F", "CUC": "L", "AUC": "I", "GUC": "V", "UUA": "L",
"CUA": "L", "AUA": "I", "GUA": "V", "UUG": "L", "CUG": "L", "AUG": "M", "GUG": "V", "UCU": "S", "CCU": "P",
"ACU": "T", "GCU": "A", "UCC": "S", "CCC": "P", "ACC": "T", "GCC": "A", "UCA": "S", "CCA": "P", "ACA": "T",
"GCA": "A", "UCG": "S", "CCG": "P", "ACG": "T", "GCG": "A", "UAU": "Y", "CAU": "H", "AAU": "N", "GAU": "D",
"UAC": "Y", "CAC": "H", "AAC": "N", "GAC": "D", "UAA": "Stop", "CAA": "Q", "AAA": "K", "GAA": "E",
"UAG": "Stop", "CAG": "Q", "AAG": "K", "GAG": "E", "UGU": "C", "CGU": "R", "AGU": "S", "GGU": "G",
"UGC": "C", "CGC": "R", "AGC": "S", "GGC": "G", "UGA": "Stop", "CGA": "R", "AGA": "R", "GGA": "G",
"UGG": "W", "CGG": "R", "AGG": "R", "GGG": "G"}
answer=[]
for i in range(len(rna)-2):
    if rna[i:i+3]=='AUG':
        j=i
        prot=''
        letter='AUG'
        while codon[letter]!="Stop":
            prot+=codon[letter]
            j+=3
            if j>len(rna)-3:
                break
            letter=rna[j:j+3]
        if codon[letter]=="Stop" and prot not in answer:
            answer.append(prot)
for i in range(len(rnareverse)-2):
    if rnareverse[i:i+3]=='AUG':
        j=i
        prot=''
        letter='AUG'
        while codon[letter]!="Stop":
            prot+=codon[letter]
            j+=3
            if j>len(rnareverse)-3:
                break
            letter=rnareverse[j:j+3]
        if codon[letter]=="Stop" and prot not in answer:
            answer.append(prot)
for i in answer:
    print(i)
Their output for given dataset:
MAKLIIRAHLKSVYYC
MYHDIGMIPSPVNIGDIVVYLRSAGRLGCPNDHQLKVRSYIAIYDRTFN
MT
MIPSPVNIGDIVVYLRSAGRLGCPNDHQLKVRSYIAIYDRTFN
MPQ
MTIN
MVICFITYGGTPFDLPNRSHVRMLKHGIQDLGQAPQPIRGAPGANLRMRFDNTGTKATDSRLRPRGLGTKLSPCFLCRFIRTEQRVQWL
MSVC
MLKHGIQDLGQAPQPIRGAPGANLRMRFDNTGTKATDSRLRPRGLGTKLSPCFLCRFIRTEQRVQWL
MVYRIWDKPPSPSVVLRVRTSGCDLTIPALKPRIAVSDLEAWAPS
MRFDNTGTKATDSRLRPRGLGTKLSPCFLCRFIRTEQRVQWL
MLSMSFHSNRAEGSVAVERNG
MSFHSNRAEGSVAVERNG
MSG
MKRHRKHGLSLVPKPRGRRRLSVALVPVLSNRILRFAPGAPRMGWGACPKSCIPCFSIRTWLRFGRSNGVPP
MGSAWCPSLEVGDGYPWL
MGWGACPKSCIPCFSIRTWLRFGRSNGVPP
MFQHTDMATVWQIEWGTAVGYKTDDHQLKVRSYIAIYDRTFN
MATVWQIEWGTAVGYKTDDHQLKVRSYIAIYDRTFN
MGYRRRL
MVIGASKPTCRP
MSPMLTGDGIIPMS
MLTGDGIIPMS
MS
MIHTKYSTQELLSRTC
MAVAVDAILGFNIGPERATVISEPL
For this dataset, the main difference I noticed between our outputs is that my code prints one extra protein sequence, "MTIN". I assume my code may be taking too many ORFs into account, but I really am not sure. Thank you.
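Comparing the two outputs line by line suggests a likely cause: "MTIN" appears twice in the regex version's output and only once in the other, and the other code guards each append with `prot not in answer`. In other words, the extra line looks like a duplicate rather than a wrongly detected ORF, which would matter if the grader expects each distinct protein string once. Below is a minimal sketch of that idea, not the exact fix from the thread. It reuses the same regex, and the helper names (`translate`, `distinct_proteins`), the compact codon table, and the toy RNA string are made up for illustration.

```python
import re
from itertools import product

# Standard genetic code built compactly (assumption: same table as the
# dictionaries above, with '*' standing in for 'Stop').
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Same pattern as in the question: the outer lookahead allows overlapping
# matches, and the lazy group stops before the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(?=(AUG(?:...)*?)(?=UAA|UAG|UGA))")


def translate(fragment):
    """Translate an RNA fragment whose length is a multiple of 3."""
    return "".join(CODON_TABLE[fragment[i:i + 3]]
                   for i in range(0, len(fragment), 3))


def distinct_proteins(rna_strands):
    """Collect proteins from every strand, dropping repeats but keeping order."""
    proteins = []
    for strand in rna_strands:
        for fragment in ORF_PATTERN.findall(strand):
            proteins.append(translate(fragment))
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(proteins))


# Hypothetical toy strand containing the same ORF at two different positions.
toy = "AUGACUAUUAAUUAGCCAUGACUAUUAAUUGA"
all_hits = [translate(f) for f in ORF_PATTERN.findall(toy)]
print(all_hits)                  # the same protein is found twice
print(distinct_proteins([toy]))  # repeats removed
```

With the real data, the two strands would be passed together, for example `distinct_proteins([rna_seq, reverse_comp])`, so that a protein found on both strands is also reported only once.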
Omg, thank you so much. I feel like an idiot considering it was such a simple fix lol, but I guess that often occurs in programming - it's always nice to have another set of eyes.