Question

Open Reading Frame in Biopython

4.6 years ago

antonio.mauceri87 • 0

Hi everyone, I recently started using Python with Biopython. As practice, I'm trying to get the translated ORF of this gene, taken from GenBank as input: NM_100684.3

However, my output does not show the correct ORF: the amino acid sequence I get differs from the expected one in both composition and length.

What am I doing wrong?

This is the script I used:

from Bio import SeqIO

record = SeqIO.read("sequence.fasta", "fasta")
table = 1  # NCBI translation table 1 (standard genetic code)
min_pro_len = 100  # minimum peptide length, in amino acids

# Scan three frames on the forward strand and three on its reverse complement
for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record) - frame) // 3)  # multiple of three
        for pro in nuc[frame:frame + length].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i"
                      % (pro[:30], pro[-3:], len(pro), strand, frame))



YSDIDQINLNQISNLQRNLKYFITMGDSTG...NNV - length 554, strand 1, frame 2
SSPGDKGHNCKGGSASSLCPHREEHHSHNG...ILT - length 162, strand -1, frame 1
IEHQDSHDDVQPTGYKEGDPPGREGCGTAA...HNW - length 216, strand -1, frame 1
TKVTGNVQATIITPIHVSPCSVVKCEVEKK...SDA - length 122, strand -1, frame 2

The above is my output, but it is not correct and none of the peptides starts with a methionine. In GenBank the correct protein is 530 a.a. long and starts with "MGDSTGEPGSSMHGVTGREQ ..."

sequence gene • 6.2k views

updated 3.4 years ago by schagas • written 4.6 years ago by antonio.mauceri87

From the docs you're following:

A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop codons – an ORF is just a region of nucleotides with no in frame stop codons.

Of course, to find a gene you would also need to worry about locating a start codon, possible promoters – and in Eukaryotes there are introns to worry about too. However, this approach is still useful in viruses and Prokaryotes.

As it stands, all you've assessed is regions uninterrupted by stop codons; you haven't gone on to the next step of identifying start codons and so on.
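To make that next step concrete, here is a minimal sketch (not from the Biopython docs) of a search that requires an ATG start and ends at the first in-frame stop codon. It uses only the standard library, so `reverse_complement` and `find_orfs` are hypothetical helper names rather than Biopython functions. It assumes the standard stop codons TAA, TAG and TGA and a plain A/C/G/T sequence; stretches containing N are simply not matched.

```python
import re

# ATG, then codons up to the first in-frame stop. The lookahead lets
# overlapping candidates (other frames, internal ATGs) be reported too.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=100):
    """Yield (strand, start, orf) for ATG-to-stop ORFs on both strands.

    start is 0-based on the strand being scanned, and min_codons counts
    coding codons only (the stop codon is excluded).
    """
    dna = dna.upper()
    for strand, seq in ((+1, dna), (-1, reverse_complement(dna))):
        for match in ORF_PATTERN.finditer(seq):
            orf = match.group(1)
            if len(orf) // 3 - 1 >= min_codons:
                yield strand, match.start(), orf
```

Every in-frame ATG is reported, so a long ORF also yields shorter ones that share its stop codon. Keeping only the longest per stop codon is a design choice left out of this sketch.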

4.6 years ago by Joe 21k

score 0 · Answer 1 · 2019-09-11

As far as I can see, your code does not check for a starting methionine; it retrieves every stop-free stretch that translates to a peptide at least 100 amino acids long.

And I am pretty sure the 530 a.a. GenBank protein lies within your 554 a.a. prediction. Look at your first result from the 25th amino acid onward, where it reads MGDSTG...
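The arithmetic can be checked directly on the 30-residue prefix printed in the question. This is a small sketch using plain Python strings; the variable names are only illustrative.

```python
# First 30 residues of the 554 a.a. hit reported in the question.
prefix = "YSDIDQINLNQISNLQRNLKYFITMGDSTG"
reported_length = 554

first_met = prefix.find("M")  # 0-based index of the first methionine
trimmed_length = reported_length - first_met

print(first_met + 1)       # 1-based position of the methionine: 25
print(prefix[first_met:])  # MGDSTG
print(trimmed_length)      # 530, the length of the GenBank protein
```

Dropping the 24 residues before the first methionine leaves exactly the 530 a.a. expected from GenBank.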

score 0 · Answer 2 · 2020-11-18

I was trying something similar; try this:

from Bio import SeqIO

record = SeqIO.read("NC_005816.1.fna", "fasta")
table = 11  # NCBI translation table 11 (bacterial, archaeal and plant plastid code)
min_pro_len = 50
x = 0  # number of ORFs reported

for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record) - frame) // 3)  # multiple of three
        for pro in nuc[frame:frame + length].translate(table).split("*"):
            splitlocal = pro.find("M")  # first methionine; -1 if there is none
            if splitlocal == -1:
                continue  # no start codon in this stop-free stretch
            seq_final = pro[splitlocal:]
            if len(seq_final) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i"
                      % (seq_final[:10], pro[-3:], len(seq_final), strand, frame))
                x = x + 1
print("Number of ORFs:", x)

MVAHRFTCSL...MTI - length 54, strand 1, frame 0
MLKQPTATVC...KIW - length 58, strand 1, frame 0
MLWMPTSRRP...SGF - length 66, strand 1, frame 0
MPHRCVRRTC...KNM - length 56, strand 1, frame 0
MKKSSIVATI...YRF - length 312, strand 1, frame 0
MMELQHQRLM...NPE - length 259, strand 1, frame 1