Question: Open Reading Frame in Biopython
11 months ago, antonio.mauceri87 wrote:

Hi everyone, I recently started using Python with Biopython. As practice, I'm trying to get the translated ORF of this gene, taken from GenBank, as input: NM_100684.3

However, my output does not show me the correct ORF and I get a different amino acid sequence both in composition and length.

What am I doing wrong?

This is the script I used:

>>> from Bio import SeqIO
>>> # Read the single FASTA record downloaded from GenBank
>>> record = SeqIO.read("sequence.fasta", "fasta")
>>> table = 1  # NCBI standard genetic code
>>> min_pro_len = 100
>>> # Scan all six frames: three on each strand
>>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
...     for frame in range(3):
...         length = 3 * ((len(record) - frame) // 3)  # Multiple of three
...         for pro in nuc[frame:frame + length].translate(table).split("*"):
...             if len(pro) >= min_pro_len:
...                 print("%s...%s - length %i, strand %i, frame %i"
...                       % (pro[:30], pro[-3:], len(pro), strand, frame))
...

YSDIDQINLNQISNLQRNLKYFITMGDSTG...NNV - length 554, strand 1, frame 2
SSPGDKGHNCKGGSASSLCPHREEHHSHNG...ILT - length 162, strand -1, frame 1
IEHQDSHDDVQPTGYKEGDPPGREGCGTAA...HNW - length 216, strand -1, frame 1
TKVTGNVQATIITPIHVSPCSVVKCEVEKK...SDA - length 122, strand -1, frame 2

The above is my output, but it is not correct: the peptides do not start with methionine. In GenBank the correct protein is 530 a.a. long and starts with "MGDSTGEPGSSMHGVTGREQ ..."


From the docs you're following:

A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop codons – an ORF is just a region of nucleotides with no in frame stop codons.

Of course, to find a gene you would also need to worry about locating a start codon, possible promoters – and in Eukaryotes there are introns to worry about too. However, this approach is still useful in viruses and Prokaryotes.

As it stands, all you've assessed is regions uninterrupted by stop codons; you haven't gone on to the next step of identifying start codons, etc.
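As a rough illustration of that next step, here is a minimal sketch that trims each stop-free stretch to its first methionine before applying the length cutoff. The helper name `orfs_with_start` and the toy input are made up for this example and are not part of Biopython. It also assumes the first M in a stretch is the real start, which is only a heuristic.

```python
def orfs_with_start(protein, min_len=100):
    """Yield (offset, peptide) for each stop-free stretch, trimmed to its first M.

    `protein` is one translated frame as a plain string, with "*" marking stops.
    `offset` is the 0-based position of that M within `protein`.
    (Hypothetical helper for illustration, not a Biopython function.)
    """
    pos = 0  # index in `protein` where the current stretch begins
    for stretch in protein.split("*"):
        start = stretch.find("M")
        if start != -1:  # skip stretches that contain no methionine at all
            peptide = stretch[start:]
            if len(peptide) >= min_len:
                yield pos + start, peptide
        pos += len(stretch) + 1  # +1 steps over the "*" itself


# Toy translated frame: two stops, and one stretch without any M.
frame = "KLMAAA*QQQ*TTMGDSTG"
found = list(orfs_with_start(frame, min_len=3))
print(found)  # [(2, 'MAAA'), (13, 'MGDSTG')]
```

In the loop from the question, the translated frame could be passed in as `str(nuc[frame:frame + length].translate(table))`, leaving the rest of the six-frame scan unchanged.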

written 11 months ago by Joe
11 months ago, Haci wrote:

As far as I can see, your code does not check for the starting methionine; you are retrieving every ORF that yields a peptide at least 100 amino acids long.

And I am pretty sure the 530 a.a. "GenBank peptide" lies within your 554 a.a. prediction. Look at your first result from the 25th amino acid onward: it reads MGDSTG... Dropping the 24 residues before that M leaves 554 - 24 = 530, which matches the GenBank length.
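A quick way to check this, using only the 30-residue prefix and the length printed in the question's output (the variable names are just for illustration):

```python
# First 30 residues of the 554 a.a. hit, copied from the output in the question.
prefix = "YSDIDQINLNQISNLQRNLKYFITMGDSTG"
predicted_len = 554

first_met = prefix.index("M")            # 0-based index of the first methionine
trimmed_len = predicted_len - first_met  # length once the leading residues are dropped

print(first_met + 1)        # 25 (1-based position of the M)
print(prefix[first_met:])   # MGDSTG
print(trimmed_len)          # 530
```

This only shows that the lengths and the first residues agree; comparing the full trimmed peptide against the GenBank translation would be the real confirmation.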
