Question

Biopython first project help

0

Entering edit mode

2.6 years ago

Eli • 0

I am new to bioinformatics and trying to use python and biopython specifically SeqIO to find all the ORFs in several FASTA files. My code is simple and I can easily find all the ORFs in an example string and I can use SeqIO to go through all the FASTA files but when I put the two together I get a problem. I get some seemingly random sequence(s) in the beginning before accessing and printing the sequences I want. Then, I only get some of the FASTA files and what files I access and what ORFs seems to be dependent on what print statements I use beforehand, either printing the sequence, printing the sequence ID, printing nothing etc.

It does seem to work (I think) if I only consider one reading frame and therefore don't add an extra for loop.

This doesn't make sense to me at all and I can't after much struggle find a solution for this. Is what I am trying to do just not possible with SeqIO or is there something I am not understanding?

Any help would be much appreciated.

      #function to return ORF (open reading frame) given a reading frame and a DNA string, sequence
     #first find the start codon ATG and then continue from there to find a stop codon all in a given reading frame 
     #(i.e start at a given index and increment +3 
     #if both the stop and start codon are not found nothing is returned or printed

    def findORF(readingFrame, sequence):


        results=[]
        i=readingFrame


        while i<len(sequence)-2:
            if sequence[i:i+3] =="ATG":
                j=i
                while j<len(sequence)-2:
                    if sequence[j:j+3]=="TAG" or  sequence[j:j+3]=="TAA" or sequence[j:j+3]=="TGA": 
                      #could also make a list of stop codons and ask is sequence in list
                        results.append(sequence[i:j+3])

                        break
                    j=j+3    
            i=i+3  
        return results  

    def revComp(sequence):
        result=""
        sequence=sequence.upper()
        i=len(sequence)-1
        while i>=0:
            if sequence[i]=="T":
              result=result+"A"
            elif sequence[i]=="A":
                   result=result+"T"
            elif sequence[i]=="G":
                   result=result+"C" 
            elif  sequence[i]=="C":
                   result=result+"G" 
            else:
                result=result+"N"
            i=i-1    
        return result  


from Bio import SeqIO
  #parse all fasta sequences
for file in SeqIO.parse("dna.example.fasta", "fasta"):
  sequence=file.seq
  sequences=[sequence,revComp(sequence)]
  for seq in sequences:  
        #include forward and reverse strand
    for i in range(3):
            #print out all reading frames
        print("Sequence is")
        #print (seq.id)
        print(seq)
        print("Reading frame is %d"% i )

        result=findORF(i,seq)
        if len(result)>0: 
            print ("number of ORFs: %d" %len(result))
            #print(*result, sep="\n \n")
            for j in range(len(result)):
                print ( "%d: %s" %(j+1,result[j]))


        else:  
            print("no ORFS found \n")

Biopython • 2.0k views

ADD COMMENT • link 2.6 years ago by Eli • 0

0

Entering edit mode

Can you show the error msg or the input and the expected output? That would be helpful. Is this homework?

ADD REPLY • link 2.6 years ago by jkim ▴ 180

0

Entering edit mode

I have no error message. For the following code i get the following printout and I cannot figure out what all the first lines are from:

2: ATGGCCGGCCATAGCCGGTCCGCACGATACGGCGCGAGGTGCGCGAACAGTTCCGCCAGATGCGCGAGCGTGTGA

Sequence is 'GATGATTTTCAGCGTCACGCCGCGCTTGTTCGCCAGCGTGTACCGGCTCACCGGCTGCCCGGCCGCGGTCGTTCCGTCATCCGCGCGGCTGATCGATACGTCGGGCGCCGCCGATGCGAAGTCTGCGTGCAGCGCGAGCGCCCCGAACGCGCAGATCGCGAGCAGCGGGCAGCGCGGCCGGATTCGTTTCATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAAGCACCCTGCGCACGCCGCTTCCGCGCCCGCCTCGCACTCCGCCCAATCGCGCCGGTCAGGATGCATACGCTCCGATACCGAACGACGAAGCGAACCTCACCTACCCGGCCCCGCATCACGACGATCAGGTATTGTCGGCGCGCCAGCGGCGAGGGCTGACGCCGGTCAGGCGCGTGAAGGTCCGCGAGAAGTGGCTCTGATCGGCGAAACCGCACGCGTCGGCGATCATGCTCAACGGCAGGCCCGAATTGCGCATCCACTCCTTAGCCCGTTCGACGCGCTGCACGATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGATGAAACGCCTTCACGAAATAGCTGCGAGACAGACCGCAGGCGCTCGCCACGTCGGCCAGCCCGAGGTTGCCGTCGAGATGCTCGAGCAGGAGCTCCTTCGCCCGGCGCGCCTGCGACGGCGTGAGTTTCCCGTACGTCTTTTCCCGTCGCTC' Reading frame is 0 number of ORFs: 5 1: ATGCTTCGCGCATCGCCCGCCTGA 2: ATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGA 3: ATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGA 4: ATGCTCAACGGCAGGCCCGAATTGCGCATCCACTCCTTAGCCCGTTCGACGCGCTGCACGATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGA 5: ATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGA Sequence is 'GATGATTTTCAGCGTCACGCCGCGCTTGTTCGCCAGCGTGTACCGGCTCACCGGCTGCCCGGCCGCGGTCGTTCCGTCATCCGCGCGGCTGATCGATACGTCGGGCGCCGCCGATGCGAAGTCTGCGTGCAGCGCGAGCGCCCCGAACGCGCAGATCGCGAGCAGCGGGCAGCGCGGCCGGATTCGTTTCATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAAGCACCCTGCGCACGCCGCTTCCGCGCCCGCCTCGCACTCCGCCCAATCGCGCCGGTCAGGATGCATACGCTCCGATACCGAACGACGAAGCGAACCTCACCTACCCGGCCCCGCATCACGACGATCAGGTATTGTCGGCGCGCCAGCGGCGAGGGCTGACGCCGGTCAGGCGCGTGAAGGTCCGCGAGAAGTGGCTCTGATCGGCGAAACCGCACGCGTCGGCGATCATGCTCAACGGCAGGCCCGAATTGCGCATCCACTCCTTAGCCCGTTCGACGCGCTGCACGATGAGCCAGCGATGCGGCGGCAGCCCGGTCGTCTGATGAAACGCCTTCACGAAATAGCTGCGAGACAGACCGCAGGCGCTCGCCACGTCGGCCAGCCCGAGGTTGCCGTCGAGATGCTCGAGCAGGAGCTCCTTCGCCCGGCGCGCCTGCGACGGCGTGAGTTTCCCGTACGTCTTTTCCCGTCGCTC' Reading frame is 1 number of ORFs: 5 1: ATGATTTTCAGCGTCACGCCGCGCTTGTTCGCCAGCGTGTACCGGCTCACCGGCTGCCCGGCCGCGGTCGTTCCGTCATCCGCGCGGCTGATCGATACGTCGGGCGCCGCCGATGCGAAGTCTGCGTGCAGCGCGAGCGCCCCGAACGCGCAGATCGCGAGCAGCGGGCAGCGCGGCCGGATTCGTTTCATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAA 2: ATGGCAACCTCCGATACGGTGCGATCGGTACGCGTGCGCAAGCGGCATGCCTGCTCGCGGCCACGTTTGCTATGCTTCGCGCATCGCCCGCCTGAGCCGATCGCCGCCATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAA 3: ATGAACGCCCTCGATTCCGATATCGCGCGCACGCTGCGCGCCGCCTGCGACGCGTGCTTCGGCACGACGACCGTGTGGCCGCTCGTCGAGCGCGCGTACGGCGAGCCGCAGCGCTTCTATCACACGCTCGCGCATCTGGCGGAACTGTTCGCGCACCTCGCGCCGTATCGTGCGGACCGGCTATGGCCGGCCATCGAGCTCGCCGTGTGGGCGCACGATGTCGTCTATGCGACGACACTGCCGGATTATGCGGACAACGAAGCGCTCAGCGCGCAATGGCTCGCGCAGGTCGCGCACGAACATTGCGACGCAGCCTGGTTGCACGCGCATGCATCGCACGTGTCCGTTGCCCGCGACCTGGTGCTGGCGACGAAGTCGCACCGGCTGCCTGACGGGTTCGCCGACGATGCGGAATTGCAGCGCGCCGCGCAAATTTTCCTCGACGCCGATCTCGCGATCCTCGCGGCGGCACCCGACCGGCTCCGCGAATACGACCGCGCGATCGCGCGCGAGTGGGCGCAGGATCCCGATGCACCGTCGGCAGCCTTCCGCGCCGGCCGCAGGCAGGCGCTCGAGCATCTGCGCGCGCAGGCCCCGTTGTTTCGATCGGCGGAATTCGCGCCGCTCGAGCAGCACGCGCAACGCAATCTCGAGATGCTGATCGGCTTCTACGCATAA 4: ATGCTGATCGGCTTCTACGCATAA 5: ATGCATACGCTCCGATACCGAACGACGAAGCGAACCTCACCTACCCGGCCCCGCATCACGACGATCAGGTATTGTCGGCGCGCCAGCGGCGAGGGCTGA

etc. etc.

I know I am not going through all the files because when I print seq.id I do not see all of them (which I can only do so far when I remove reverse complement).

I don't have an easy way to write an expected output. Is there a way to share FASTA files? It shouldn't really matter which ones I use. I wish this was homework. My school never had classes like this:( I am just trying to learn as a personal project but I took the files from a coursera site.

ADD REPLY • link 2.6 years ago by Eli • 0

0

Entering edit mode

Thanks for sharing. Here is my small tip about Python.

Python naming styles

You don't have to follow the style guide offered by Python community but it would be nice to do that. If you stick to the camelCase naming then you may not want to use the snake_case naming. e.g. rev_comp(args)

Python f-string

This is an improved string formatting syntax.

Python type hint

This syntax would be useful when you debug/reread your code. Hope they are useful to you.

ADD REPLY • link 2.6 years ago by jkim ▴ 180

score 1 · Answer 1 · 2022-03-25

1

Entering edit mode

2.6 years ago

Mensur Dlakic ★ 28k

Your code actually works fine. The only thing I changed is the way to make a reverse complement, and I added a couple of features such as minimum ORF length and different output formatting. If you want to make this proper, you might want to consider eliminating ORFs that are a subset of a larger ORF. This happens when there are multiple Met residues, and the largest ORF is likely to be most relevant.

# Minimum ORF length, not counting the stop codon ; 90 bases is 30 amino-acids, which man people consider a minimum ORF threshold
min_ORF = 90

def findORF(readingFrame, sequence):
    results=[]
    coords=[]
    i=readingFrame
    while i<len(sequence)-2:
        if sequence[i:i+3] =="ATG":
            j=i
            while j<len(sequence)-2:
                if (sequence[j:j+3]=="TAG" or  sequence[j:j+3]=="TAA" or sequence[j:j+3]=="TGA") and ((j-i) >= min_ORF):
                    results.append(sequence[i:j+3])
                    coords.append([i+1, j+4])
                    break
                j=j+3
        i=i+3
    return results, coords

from Bio import SeqIO
for file in SeqIO.parse("dna.example.fasta", "fasta"):
    sequence=file.seq
    # include forward and reverse strand
    sequences=[sequence, sequence.reverse_complement()]
    reverse = False
    for seq in sequences:
        for i in range(3):
            if reverse:
                print("Reading frame is reverse %d" % (i+1) )
            else:
                print("Reading frame is %d" % (i+1) )

            result, coords = findORF(i,seq)
            if len(result)>0:
                print ("number of ORFs: %d" %len(result))
                for j in range(len(result)):
                    if reverse:
                        print ( ">%s.%d  reverse nt %d..%d\n%s" % ( str(file.id), j+1, coords[j][1], coords[j][0], result[j]))
                    else:
                        print ( ">%s.%d  nt %d..%d\n%s" % ( str(file.id), j+1, coords[j][0], coords[j][1], result[j]))
            else:
                print("no ORFs found \n")
        reverse = True

ADD COMMENT • link 2.6 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

Thank you so much for your advice and including information I did not even know to think about. I did not know there was a minimum ORF length. From my understanding the changes you made which didn't change the overall logic of the code were as follows. I think you just added minimum ORF length (which is nice, thank you!), used a built in reverse complement function and included and printed out the coordinates for the nucleotides in an ORF (which makes sense and is a nice touch, thanks!). However, I do not think this fixes the problem which I believe I can now better define. I believe neither of our codes are printing results from all the FASTA files. I can show this by examining file IDs. With the following code I can print all file IDs out.

from Bio import SeqIO
for file in SeqIO.parse("dna.example.fasta", "fasta"):
    print(file.id)

This is the result:

gi|142022655|gb|EQ086233.1|43
gi|142022655|gb|EQ086233.1|160
gi|142022655|gb|EQ086233.1|41
gi|142022655|gb|EQ086233.1|221
gi|142022655|gb|EQ086233.1|294
gi|142022655|gb|EQ086233.1|323
gi|142022655|gb|EQ086233.1|564
gi|142022655|gb|EQ086233.1|521
gi|142022655|gb|EQ086233.1|455
gi|142022655|gb|EQ086233.1|229
gi|142022655|gb|EQ086233.1|422
gi|142022655|gb|EQ086233.1|384
gi|142022655|gb|EQ086233.1|280
gi|142022655|gb|EQ086233.1|158
gi|142022655|gb|EQ086233.1|59
gi|142022655|gb|EQ086233.1|319
gi|142022655|gb|EQ086233.1|438
gi|142022655|gb|EQ086233.1|210
gi|142022655|gb|EQ086233.1|237
gi|142022655|gb|EQ086233.1|507
gi|142022655|gb|EQ086233.1|350
gi|142022655|gb|EQ086233.1|245
gi|142022655|gb|EQ086233.1|279
gi|142022655|gb|EQ086233.1|378
gi|142022655|gb|EQ086233.1|101

However, when I run your code for example I see only the file IDs from 237 to 101, the last seven IDs shown above. Do you notice the same thing?

ADD REPLY • link 2.6 years ago by Eli • 0

0

Entering edit mode

I have no idea why you are seeing this as I don't have your sequences. When I run the code on a file with 20 short DNA sequence (7-8 kB each), all of them are read in correctly and ORFs are found in all of them. I suspect something is off with your sequences, not the code. Maybe you can figure it out by testing the code on some random DNA sequences:

https://www.bioinformatics.org/sms2/random_dna.html

ADD REPLY • link 2.6 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

It is good to know it works well for you. That helps to isolate the variables. It seems either a problem with my files or my computer system or version. I tested out the code on orchid.fasta found here with similar results as my fasta files. This file should be formatted correctly as I found it off biopython and simply downloaded and saved the fasta file. However, I only see the results for the last several sequences. I tried the random DNA sequences but despite testing that I am able to successfully parse the file it can find no ORFs for 10 random sequences of 8kb which seems unlikely.

ADD REPLY • link 2.6 years ago by Eli • 0

0

Entering edit mode

I figured it out! I was using Spyder as my IDE and it automatically limits the console to 500 lines. That is why I was only observing the last of the results. I was able to change this default setting and it works fine. Seems pretty stupid to me to have such a low limit. Has this not caused trouble with others?

ADD REPLY • link 2.6 years ago by Eli • 0

score 0 · Answer 2 · 2022-03-24

0

Entering edit mode

2.6 years ago

Mensur Dlakic ★ 28k

It is difficult to follow your code because of inconsistent indentation. Also, if you add the right keywords there will be some solutions offered on the right-hand side of this page. Besides, you can do a search on your own for solutions to similar problems on BioStars. Here, I did it for you, as there is no need to reinvent the wheel:

ADD COMMENT • link 2.6 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

Thanks for your advice! I will definitely try to better understand these as they seem to draw on more built in biopython functions I did not know about and are much more elegant. However, in order to learn better I still want to understand why my solution cannot work as it seems by every bit of logic and understanding that I have and have read about that it should.

ADD REPLY • link 2.6 years ago by Eli • 0