Question: Trouble Finding ORFs in DNA Sequence
0
gravatar for kmussar
10 months ago by
kmussar0
kmussar0 wrote:

Hi,

I'm trying to find the longest ORFs in a fasta file containing a few genomic sequences. I've been working with the function in the biopython documents, but am not sure I quite understand the code. Any clarification would be helpful!

My questions:

  • Do I need to worry about iterating through nucleotides in groups of 3? Since this function translates into proteins, I don't think I do?
  • For the standard code, translation table 1), why is the amino acid L being marked as a start codon? (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1) I was under the impression only M (AUG/ATG) is a start codon.
  • The code given in the python documents (http://biopython.org/DIST/docs/tutorial/Tutorial.pdf) does not include any statements indicating that the ORF should start at a Met/AUG/ATG. Why is this?
  • I believe I am finding the longest ORF by finding the first M and going until reaching a *. I may be missing shorter reading frames in this example by not looking for subsequent M downstream of the first one. But I think that is okay given the question - to find the longest ORFs. Is this reasoning correct?

Code:

table = 1Output of code listed below as image. 
min_pro_len = 100
ORF_f2 = []

def find_orfs(filename):    
    for record in SeqIO.parse(filename, "fasta"):
        for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
            for frame in range(3):
                # Splits on stop codons
                for pro in nuc[frame:].translate(table).split("*"):
                    # Truncate sequence to start with the start codon
                    pro = pro[pro.find('M'):]
                    if frame == 2: 
                        # add length of sequence to list if in frame 2
                        ORF_f2.append(len(pro))
                        if len(pro) >= min_pro_len:
                            print("%s...%s - length %i, strand %i, frame %i" \
                                % (pro[:30], pro[-3:], len(pro), strand, frame))

    print(max(ORF_f2))

find_orfs('dna.example.fasta')

Output of code:

MSTFNRLHLIRQVDLFTLKLFVSAVEEQQI...PRD - length 305, strand 1, frame 2
MQPIVLGRLEIQKVVEMETSLPIEVLFPGV...TQV - length 294, strand 1, frame 2
MASSQGTVDFIVEQMAAAGTVSARKMFGEY...RRR - length 107, strand 1, frame 2
MSSFLCSRRMRDGRAHVSQNTPGRSATWHA...PAS - length 247, strand -1, frame 2
MAVLVQNVTLSLMKVSDRGTQNFSFSRPEN...IRR - length 128, strand -1, frame 2
MHRATGFERTPATHPVGGDRERIVQRPIPV...DDQ - length 535, strand 1, frame 2
MPTDDASAASIDAGGTRNAWRPGLGAHEPD...RTG - length 299, strand 1, frame 2
MRLSHERLEKDLLSKPTTLRDSITELRRLS...VHV - length 314, strand -1, frame 2

(truncated)

biopython python • 544 views
ADD COMMENTlink written 10 months ago by kmussar0

GTG (V) and TTG (L) are alternative start cordons. The most common one is ATG (M), the other two are rare in eukaryotes.

You don't need to group by 3 because of the translate function as you mentioned.

This link might also help:

How to extract the longest orf?

ADD REPLYlink modified 10 months ago • written 10 months ago by Fatima890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2518 users visited in the last hour
_