I am new to bioinformatics and am trying to use Python and Biopython, specifically SeqIO, to find all the ORFs in several FASTA files. My code is simple: I can easily find all the ORFs in an example string, and I can use SeqIO to go through all the FASTA records, but when I put the two together I run into a problem. Some seemingly random sequence(s) are printed at the beginning, before the sequences I want. After that, I only get some of the FASTA records, and which records and ORFs I get seems to depend on what I print beforehand: the sequence, the sequence ID, nothing at all, etc.
It does seem to work (I think) if I only consider one reading frame and therefore don't add an extra for loop.
This doesn't make sense to me at all and, after much struggle, I can't find a solution. Is what I am trying to do just not possible with SeqIO, or is there something I am not understanding?
Any help would be much appreciated.
# Return the ORFs (open reading frames) found in a DNA string, sequence, for a given reading frame.
# First find the start codon ATG, then continue from there to find a stop codon, all in the given reading frame
# (i.e. start at a given index and increment by 3).
# If a start codon is not followed by a stop codon in the same frame, nothing is returned for it.
def findORF(readingFrame, sequence):
    results = []
    i = readingFrame
    while i < len(sequence) - 2:
        if sequence[i:i+3] == "ATG":
            j = i
            while j < len(sequence) - 2:
                if sequence[j:j+3] == "TAG" or sequence[j:j+3] == "TAA" or sequence[j:j+3] == "TGA":
                    # could also make a list of stop codons and ask whether the codon is in the list
                    results.append(sequence[i:j+3])
                    break
                j = j + 3
        i = i + 3
    return results

# Return the reverse complement of a DNA string; any base other than A, C, G, T becomes N
def revComp(sequence):
    result = ""
    sequence = sequence.upper()
    i = len(sequence) - 1
    while i >= 0:
        if sequence[i] == "T":
            result = result + "A"
        elif sequence[i] == "A":
            result = result + "T"
        elif sequence[i] == "G":
            result = result + "C"
        elif sequence[i] == "C":
            result = result + "G"
        else:
            result = result + "N"
        i = i - 1
    return result

from Bio import SeqIO

# parse all FASTA records in the file
for record in SeqIO.parse("dna.example.fasta", "fasta"):
    # work on a plain uppercase str so that both strands have the same type
    sequence = str(record.seq).upper()
    # include forward and reverse strand
    sequences = [sequence, revComp(sequence)]
    for seq in sequences:
        # go through all three reading frames
        for i in range(3):
            print("Sequence is")
            # print(record.id)  # the ID belongs to the record, not to the sequence
            print(seq)
            print("Reading frame is %d" % i)
            result = findORF(i, seq)
            if len(result) > 0:
                print("number of ORFs: %d" % len(result))
                # print(*result, sep="\n \n")
                for j in range(len(result)):
                    print("%d: %s" % (j + 1, result[j]))
            else:
                print("no ORFS found \n")
Can you show the error msg or the input and the expected output? That would be helpful. Is this homework?
I have no error message. For this code I get the following printout, and I cannot figure out where all the first lines come from:
Sequence is 'GATGATTTTCAGCGTCACGCCGCGCTTGTTCGCCAGCGTGTACCGGCTCACCGGCTGCCCGGCCGCGGTCGTTCCGTCATCCGCGCGGCTGATCGATACGTCGGGCGCCGCCGATGCGAAGTCTGCGTGCAGCGCGAGCGCCCCGAACGCGCAGATCGCGAGCAGCGGGCAGCGCGGCCGGATTCGTTTCATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAAGCACCCTGCGCACGCCGCTTCCGCGCCCGCCTCGCACTCCGCCCAATCGCGCCGGTCAGGATGCATACGCTCCGATACCGAACGACGAAGCGAACCTCACCTACCCGGCCCCGCATCACGACGATCAGGTATTGTCGGCGCGCCAGCGGCGAGGGCTGACGCCGGTCAGGCGCGTGAAGGTCCGCGAGAAGTGGCTCTGATCGGCGAAACCGCACGCGTCGGCGATCATGCTCAACGGCAGGCCCGAATTGCGCATCCACTCCTTAGCCCGTTCGACGCGCTGCACGATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGATGAAACGCCTTCACGAAATAGCTGCGAGACAGACCGCAGGCGCTCGCCACGTCGGCCAGCCCGAGGTTGCCGTCGAGATGCTCGAGCAGGAGCTCCTTCGCCCGGCGCGCCTGCGACGGCGTGAGTTTCCCGTACGTCTTTTCCCGTCGCTC' Reading frame is 0 number of ORFs: 5 1: ATGCTTCGCGCATCGCCCGCCTGA 2: ATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGA 3: ATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGA 4: ATGCTCAACGGCAGGCCCGAATTGCGCATCCACTCCTTAGCCCGTTCGACGCGCTGCACGATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGA 5: 
ATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGA Sequence is 'GATGATTTTCAGCGTCACGCCGCGCTTGTTCGCCAGCGTGTACCGGCTCACCGGCTGCCCGGCCGCGGTCGTTCCGTCATCCGCGCGGCTGATCGATACGTCGGGCGCCGCCGATGCGAAGTCTGCGTGCAGCGCGAGCGCCCCGAACGCGCAGATCGCGAGCAGCGGGCAGCGCGGCCGGATTCGTTTCATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAAGCACCCTGCGCACGCCGCTTCCGCGCCCGCCTCGCACTCCGCCCAATCGCGCCGGTCAGGATGCATACGCTCCGATACCGAACGACGAAGCGAACCTCACCTACCCGGCCCCGCATCACGACGATCAGGTATTGTCGGCGCGCCAGCGGCGAGGGCTGACGCCGGTCAGGCGCGTGAAGGTCCGCGAGAAGTGGCTCTGATCGGCGAAACCGCACGCGTCGGCGATCATGCTCAACGGCAGGCCCGAATTGCGCATCCACTCCTTAGCCCGTTCGACGCGCTGCACGATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGATGAAACGCCTTCACGAAATAGCTGCGAGACAGACCGCAGGCGCTCGCCACGTCGGCCAGCCCGAGGTTGCCGTCGAGATGCTCGAGCAGGAGCTCCTTCGCCCGGCGCGCCTGCGACGGCGTGAGTTTCCCGTACGTCTTTTCCCGTCGCTC' Reading frame is 1 number of ORFs: 5 1: 
ATGATTTTCAGCGTCACGCCGCGCTTGTTCGCCAGCGTGTACCGGCTCACCGGCTGCCCGGCCGCGGTCGTTCCGTCATCCGCGCGGCTGATCGATACGTCGGGCGCCGCCGATGCGAAGTCTGCGTGCAGCGCGAGCGCCCCGAACGCGCAGATCGCGAGCAGCGGGCAGCGCGGCCGGATTCGTTTCATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAA 2: ATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAA 3: 
ATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAA 4: ATGCTGATCGGCTTCTACGCATAA 5: ATGCATACGCTCCGATACCGAACGACGAAGCGAACCTCACCTACCCGGCCCCGCATCACGACGATCAGGTATTGTCGGCGCGCCAGCGGCGAGGGCTGA
etc. etc.
I know I am not going through all the files because, when I print seq.id, I do not see all of them (and so far I can only print it when I remove the reverse complement).
I don't have an easy way to write an expected output. Is there a way to share FASTA files? It shouldn't really matter which ones I use. I wish this were homework. My school never had classes like this :( I am just trying to learn as a personal project, but I took the files from a Coursera site.
Thanks for sharing. Here are a few small tips about Python.
Python naming styles
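You don't have to follow the style guide offered by the Python community (PEP 8), but it is good practice to do so. PEP 8 recommends snake_case for function and variable names. Whichever convention you choose, use it consistently rather than mixing camelCase and snake_case, e.g. revComp would become: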
rev_comp(args)
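As a rough sketch of what the renaming could look like, here is the ORF function from the question with snake_case names. The names `find_orf`, `reading_frame` and `STOP_CODONS` are my own choices, not anything required by Biopython, and I swapped the `while` loops for `for` loops over `range`; the results should be the same as with the original function.

```python
# snake_case sketch of the question's findORF (names are illustrative only)
STOP_CODONS = ("TAG", "TAA", "TGA")


def find_orf(reading_frame, sequence):
    results = []
    # step through the sequence codon by codon, starting at the frame offset
    for i in range(reading_frame, len(sequence) - 2, 3):
        if sequence[i:i + 3] != "ATG":
            continue
        # walk forward in the same frame until the first stop codon
        for j in range(i, len(sequence) - 2, 3):
            if sequence[j:j + 3] in STOP_CODONS:
                results.append(sequence[i:j + 3])
                break
    return results


print(find_orf(0, "ATGAAATAGCC"))  # prints ['ATGAAATAG']
```

The toy string is made up for the example; it is handy to test on something this short before running on the real FASTA data.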
Python f-string
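f-strings (available since Python 3.6) are an improved string formatting syntax and are usually easier to read than the % operator used in your print statements.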
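To make the comparison concrete, here is a small sketch that formats the same lines both ways. The list `result` and the value of `frame` are made-up placeholders standing in for what the loop in the question produces.

```python
# made-up ORFs and frame number, only for illustration
result = ["ATGAAATAG", "ATGTGA"]
frame = 0

# %-formatting, as used in the question
lines_old = ["%d: %s" % (j + 1, result[j]) for j in range(len(result))]

# the same lines as f-strings; enumerate() supplies the 1-based counter
lines_new = [f"{n}: {orf}" for n, orf in enumerate(result, start=1)]

# expressions such as len(result) can go directly inside the braces
header = f"Reading frame is {frame}, number of ORFs: {len(result)}"

print(header)
print("\n".join(lines_new))
```

Both lists contain identical strings, so switching styles does not change the output, only how the code reads.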
Python type hint
Type hints annotate the expected types of a function's arguments and return value, which is useful when you debug or reread your code. I hope these are useful to you.
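For instance, a type-hinted reverse complement might look like the sketch below. The name `rev_comp` and the `COMPLEMENT` lookup table are my own; the behaviour is meant to mirror revComp from the question (uppercase first, unknown bases become N), just written with a dictionary instead of an if/elif chain.

```python
from typing import Dict

# lookup table for the four bases; anything else is mapped to "N" below
COMPLEMENT: Dict[str, str] = {"A": "T", "T": "A", "G": "C", "C": "G"}


def rev_comp(sequence: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(sequence.upper()))


print(rev_comp("ATGC"))  # prints GCAT
```

Python does not enforce the hints at runtime, but they document that the function expects a plain `str`, and a static checker such as mypy can flag calls that pass a mismatched type.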