Open Reading Frame in Biopython
4.6 years ago

Hi everyone, I recently started using Python with Biopython. As practice, I'm trying to get the translated ORF of this gene taken from GenBank: NM_100684.3

However, my output does not show me the correct ORF and I get a different amino acid sequence both in composition and length.

What am I doing wrong?

This is the script I used:

from Bio import SeqIO

record = SeqIO.read("sequence.fasta", "fasta")
table = 1
min_pro_len = 100

for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record) - frame) // 3)  # Multiple of three
        for pro in nuc[frame:frame + length].translate(table).split("*"):
            if len(pro) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i"
                      % (pro[:30], pro[-3:], len(pro), strand, frame))



YSDIDQINLNQISNLQRNLKYFITMGDSTG...NNV - length 554, strand 1, frame 2
SSPGDKGHNCKGGSASSLCPHREEHHSHNG...ILT - length 162, strand -1, frame 1
IEHQDSHDDVQPTGYKEGDPPGREGCGTAA...HNW - length 216, strand -1, frame 1
TKVTGNVQATIITPIHVSPCSVVKCEVEKK...SDA - length 122, strand -1, frame 2

The above is my output, but it is not correct: none of the peptides starts with methionine. In GenBank, the correct protein has 530 a.a. and starts with "MGDSTGEPGSSMHGVTGREQ ..."

sequence gene • 6.2k views

From the docs you're following:

A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop codons – an ORF is just a region of nucleotides with no in frame stop codons.

Of course, to find a gene you would also need to worry about locating a start codon, possible promoters – and in Eukaryotes there are introns to worry about too. However, this approach is still useful in viruses and Prokaryotes.

As it stands, all you've assessed is regions uninterrupted by stop codons. You haven't gone on to the next step of identifying start codons etc.
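To make that next step concrete, here is a minimal sketch of one way to keep only peptides that begin with a methionine. It assumes you already have one translated frame as a plain string with `*` marking stops (which is what `str(nuc[frame:frame+length].translate(table))` would give you). The helper name `orfs_with_start` and the toy sequence are made up for illustration, and taking the first M in each stop-free segment is a simplification, not a real gene finder.

```python
def orfs_with_start(protein_frame, min_len=100):
    """Yield peptides that begin at the first M of each stop-free segment."""
    for segment in protein_frame.split("*"):
        start = segment.find("M")  # -1 when the segment has no methionine
        if start == -1:
            continue  # no start codon in this segment, skip it
        peptide = segment[start:]
        if len(peptide) >= min_len:
            yield peptide


# Toy translated frame (made-up residues, not taken from NM_100684.3).
toy_frame = "YSDI*KLTMGDSTGEPG*AAAA*QMKK"
print(list(orfs_with_start(toy_frame, min_len=5)))  # ['MGDSTGEPG']
```

Note that the last segment of a frame is not necessarily followed by a stop codon, so a peptide found there may be truncated.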

4.6 years ago
Haci ▴ 680

As far as I can see, your code does not check for the starting methionine. You are retrieving every stop-free stretch that translates to a peptide at least 100 amino acids long.

I am also pretty sure the 530 a.a. long GenBank peptide lies within your 554 a.a. long prediction. Look at your first result from the 25th amino acid onward: it reads MGDSTG...
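A quick way to check this, using only the 30-residue prefix and the length printed in the question's output (the variable names are just for illustration):

```python
# First 30 residues and total length of the first hit in the question's output.
prefix = "YSDIDQINLNQISNLQRNLKYFITMGDSTG"
predicted_length = 554

first_met = prefix.find("M")          # 0-based index of the first methionine
print(first_met + 1)                  # 25, the 1-based position of that M
print(predicted_length - first_met)   # 530, the length once trimmed to that M
```

Trimming the 554 a.a. prediction at its first M therefore leaves 530 residues, which matches the length GenBank reports.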

3.4 years ago
schagas • 0

I was trying something similar. Try this:

from Bio import SeqIO

record = SeqIO.read("NC_005816.1.fna", "fasta")
table = 11  # NCBI table 11: bacterial, archaeal and plant plastid code
min_pro_len = 50
orf_count = 0

for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
    for frame in range(3):
        length = 3 * ((len(record) - frame) // 3)  # Multiple of three
        for pro in nuc[frame:frame + length].translate(table).split("*"):
            splitlocal = pro.find("M")  # index of the first methionine
            if splitlocal == -1:
                continue  # no methionine in this stop-free segment
            seq_final = pro[splitlocal:]  # trim to start at that methionine
            if len(seq_final) >= min_pro_len:
                print("%s...%s - length %i, strand %i, frame %i"
                      % (seq_final[:10], pro[-3:], len(seq_final), strand, frame))
                orf_count += 1

print("Number of ORFs:", orf_count)

MVAHRFTCSL...MTI - length 54, strand 1, frame 0
MLKQPTATVC...KIW - length 58, strand 1, frame 0
MLWMPTSRRP...SGF - length 66, strand 1, frame 0
MPHRCVRRTC...KNM - length 56, strand 1, frame 0
MKKSSIVATI...YRF - length 312, strand 1, frame 0
MMELQHQRLM...NPE - length 259, strand 1, frame 1