Question: Open Reading Frame in Biopython
12 days ago, antonio.mauceri87 wrote:

Hi everyone, I recently started using Python with Biopython. As practice, I'm trying to get the translated ORF of this gene, taken from GenBank as input: NM_100684.3

However, my output does not show me the correct ORF and I get a different amino acid sequence both in composition and length.

What am I doing wrong?

This is the script I used:

>>> from Bio import SeqIO
>>> record = SeqIO.read("sequence.fasta", "fasta")
>>> table = 1  # standard genetic code
>>> min_pro_len = 100
>>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
...     for frame in range(3):
...         length = 3 * ((len(record) - frame) // 3)  # multiple of three
...         for pro in nuc[frame:frame + length].translate(table).split("*"):
...             if len(pro) >= min_pro_len:
...                 print("%s...%s - length %i, strand %i, frame %i"
...                       % (pro[:30], pro[-3:], len(pro), strand, frame))
...

YSDIDQINLNQISNLQRNLKYFITMGDSTG...NNV - length 554, strand 1, frame 2
SSPGDKGHNCKGGSASSLCPHREEHHSHNG...ILT - length 162, strand -1, frame 1
IEHQDSHDDVQPTGYKEGDPPGREGCGTAA...HNW - length 216, strand -1, frame 1
TKVTGNVQATIITPIHVSPCSVVKCEVEKK...SDA - length 122, strand -1, frame 2

This is my output, but it isn't correct: the sequences do not start with methionine. In GenBank, the correct protein has 530 a.a. and starts with "MGDSTGEPGSSMHGVTGREQ ..."


From the docs you're following:

A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop codons – an ORF is just a region of nucleotides with no in frame stop codons.

Of course, to find a gene you would also need to worry about locating a start codon, possible promoters – and in Eukaryotes there are introns to worry about too. However, this approach is still useful in viruses and Prokaryotes.

As it stands, all you've assessed is regions uninterrupted by stop codons; you haven't gone on to the next step of identifying start codons, etc.
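That next step could be sketched roughly as below. This is only a minimal illustration, not part of the Biopython tutorial: the helper name `orfs_with_start` is hypothetical, it works on a plain string (for example `str(nuc[frame:frame+length].translate(table))` from the loop in the question), and it assumes that only ATG/methionine counts as a start and that a usable ORF must be closed by a stop codon.

```python
def orfs_with_start(translated_frame, min_pro_len=100):
    """Return peptides that start at the first M and end at a stop codon.

    translated_frame: the translation of one reading frame as a plain
    string, with '*' marking stop codons.
    """
    peptides = []
    segments = translated_frame.split("*")
    # Assumption: skip the final piece, because no stop codon follows it.
    for segment in segments[:-1]:
        start = segment.find("M")  # first methionine in this stop-free stretch
        if start == -1:
            continue  # no start codon here, so not a candidate ORF
        peptide = segment[start:]
        if len(peptide) >= min_pro_len:
            peptides.append(peptide)
    return peptides


# Toy input: only the first stretch has an M and is closed by a stop.
print(orfs_with_start("AAMKK*QQ*MAAA", min_pro_len=3))  # ['MKK']
```

Taking the first M gives the longest possible peptide in each stretch. The real translation start may be a later methionine, so the result is still only a candidate.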

written 12 days ago by Joe

12 days ago, Haci wrote:

As far as I can see, your code does not check for the starting methionine; you are retrieving every stop-free stretch that yields a peptide at least 100 amino acids long.

And I am pretty sure the 530 a.a. "GenBank peptide" is within your 554 a.a. peptide prediction. Look at your first result from the 25th amino acid onward: it reads MGDSTG...
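A quick way to check this is sketched below, using only the 30-residue preview and the length printed in the question (the variable names are just for illustration). It shows the lengths are consistent; it does not prove the two full sequences are identical.

```python
# First 30 residues and total length of the strand +1, frame 2 hit,
# copied from the output posted in the question.
preview = "YSDIDQINLNQISNLQRNLKYFITMGDSTG"
predicted_len = 554

first_met = preview.find("M")     # 0-based index of the first methionine
print(first_met + 1)              # 25  (1-based position)
print(predicted_len - first_met)  # 530 (length from that M to the end)
print(preview[first_met:])        # MGDSTG
```

Dropping the 24 residues before the first M leaves 530 amino acids starting with MGDSTG, which matches the length and the beginning of the GenBank protein.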
