Question: Ensembl Perl API translateable_seq Returns Sequences That Aren't Multiples Of 3 Nucleotides Long
Jeff Hussmann (United States) wrote, 8.3 years ago:

I am using Ensembl's Perl API to retrieve (nucleotide) coding sequences from the Ensembl databases. The relevant part of my Perl code is:

my $gene_adaptor = $registry->get_adaptor($species, "core", "gene");
my $genes = $gene_adaptor->fetch_all();
my $succeeded = 0;    # number of sequences written out
for my $gene (@$genes) {
    my $transcript = $gene->canonical_transcript;
    # Only transcripts with a translation are protein coding.
    if ($transcript->translation) {
        my $sequence = $transcript->translateable_seq;
        # Sanity check 1: the length should be a whole number of codons.
        unless (length($sequence) % 3 == 0) {
            print $gene->stable_id . " translateable sequence not divisible by 3\n";
            next;
        }
        # Sanity check 2: only T, C, A and G are expected.
        if ($sequence =~ /[^TCAG]/) {
            print $gene->stable_id . " translateable sequence has a non-TCAG character\n";
            next;
        }
        # ENS is an output filehandle (assumed to be opened earlier in the script).
        print ENS $gene->stable_id . "\t" . $transcript->stable_id . "\n";
        print ENS $sequence . "\n";
        $succeeded++;
    }
}

The two sanity checks on the sequences returned by translateable_seq (that they are a multiple of 3 nucleotides long and contain no non-TCAG characters) are each triggered a substantial number of times when the script is run on the human or mouse genomes. If I correctly understand what translateable_seq claims to return, this should not be the case. Is there a problem with my understanding, or is there a problem with Ensembl's API and/or databases?
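To see how the failures break down, it can help to record how far off each sequence is rather than only that it failed. Here is a minimal sketch in plain Perl with no Ensembl dependency. The name `classify_cds` is a hypothetical helper, not part of the Ensembl API, and the input is assumed to be an upper-case string like the one translateable_seq returns:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): summarise why a
# coding sequence fails the two sanity checks.
sub classify_cds {
    my ($sequence) = @_;
    my $remainder = length($sequence) % 3;    # 0 means whole codons only
    my %unexpected;
    # Count every character that is not T, C, A or G (for example N).
    $unexpected{$_}++ for $sequence =~ /[^TCAG]/g;
    return {
        remainder  => $remainder,
        unexpected => \%unexpected,
    };
}

# Toy sequence, made up for illustration.
my $report = classify_cds("ATGGCNTAAT");
printf "leftover nucleotides: %d\n", $report->{remainder};
printf "unexpected %s x%d\n", $_, $report->{unexpected}{$_}
    for sort keys %{ $report->{unexpected} };
```

Printing the remainder and the offending characters next to each stable ID should make it easier to tell partial codons apart from ambiguous bases.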

Tags: perl, ensembl, api

Which gene is it?

Reply written 8.3 years ago by Pierre Lindenbaum

Many, many genes - ~10000 out of the ~30000 for mouse, for example.

Reply written 8.3 years ago by Jeff Hussmann

Can you give one example, please?

Reply written 8.3 years ago by Pierre Lindenbaum

ENSMUSG00000064363, canonical transcript ENSMUST00000082414, returns a translateable_seq that is 1378 nt long.

Reply written 8.3 years ago by Jeff Hussmann
Giulietta - Ensembl Helpdesk (Cambridge, UK) wrote, 8.2 years ago:

Hi Jeff,

A portion of Ensembl genes and transcripts come from manual annotation (by the VEGA/Havana project). The manual annotators use EST and cDNA evidence to determine the transcript set, and they do annotate partial codons. Basically, they try for the longest sequence they can: if the cDNA is not complete, they will annotate the cDNA through a partial codon rather than leaving off the codon altogether. It sounds like most of your examples are manual annotation by Havana. The Ensembl pipeline does not allow partial codons, so transcripts coming from the automatic annotation pipeline will not have partial codons.

As for the documentation saying "all defined RNA edits", these would be selenocysteines and other non-standard amino acids; this does not include extending a codon to three nucleotides.

The particular mRNA being discussed comes from the mouse mitochondrial genome (mitochondrial genes are also manually annotated):

http://www.ensembl.org/Mus_musculus/Transcript/Exons?db=core;g=ENSMUSG00000064363;r=MT:10167-11544;t=ENSMUST00000082414

The partial codon comes straight from the original record pointed out by Pierre:

http://www.ncbi.nlm.nih.gov/nuccore/34538597

Though two "A" nucleotides are expected to complete the last codon (and form a stop codon), Ensembl is only able to show the first T. The reason is that Ensembl translates cDNAs off the genome itself, and the genome tells us that the last T in the coding sequence is followed by a G and a T, not two A's. You can see this in the Transcript/Exons view (link above).
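The arithmetic for this transcript fits that explanation: a 459-residue protein needs 459 × 3 = 1377 nucleotides, and the 1378th is the lone T of the incomplete stop codon. Below is a minimal sketch in plain Perl of one way such cases might be handled downstream. Both helper names are hypothetical and not part of the Ensembl API. The toy sequence is made up, and treating a trailing T or TA as a stop codon completed by polyadenylation is an assumption that only makes sense for mitochondrial transcripts like this one:

```perl
use strict;
use warnings;

# Hypothetical helper: split a coding sequence into its whole codons and
# any trailing partial codon (0, 1 or 2 nucleotides).
sub split_partial_codon {
    my ($sequence) = @_;
    my $extra = length($sequence) % 3;
    my $whole = substr($sequence, 0, length($sequence) - $extra);
    my $tail  = substr($sequence, length($sequence) - $extra);
    return ($whole, $tail);
}

# Assumption: a trailing T or TA is a stop codon that is completed to TAA
# by the addition of 3' A residues to the mRNA. Anything else is returned
# unchanged.
sub complete_stop_codon {
    my ($sequence) = @_;
    my ($whole, $tail) = split_partial_codon($sequence);
    return $sequence unless $tail eq 'T' || $tail eq 'TA';
    return $whole . 'TAA';
}

print 1378 - 459 * 3, "\n";                   # leftover nucleotides
print complete_stop_codon("ATGGCCT"), "\n";   # toy sequence
```

Partial codons from incomplete cDNA evidence are a different case and should not be padded this way, so a real script would apply this only to transcripts known to be mitochondrial.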

I hope this helps.

By the way, these types of questions can either be sent to helpdesk@ensembl.org or raised on the dev mailing list:

http://www.ensembl.org/info/about/contact/mailing.html

There is quite a lot of API discussion on the dev list.

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 8.3 years ago:

In the case of your mRNA ENSMUSG00000064363, as far as I can see, it is the Ensembl transcript for NP_904337 (459 aa), where it is said that

"TAA stop codon is completed by the addition of 3' A residues to the mRNA"

(see also the sequence of the mitochondrial genome with the same comment: http://www.ncbi.nlm.nih.gov/nuccore/34538597).

I don't know why there is an error in GenBank (a sequencing error?), but the Ensembl API just used this information without modifying it.


The translateable_seq documentation says that it applies "all defined RNA edits" to sequences before returning them - so essentially in the case of this (randomly chosen and of no special interest to me) gene there appears to be a known RNA edit that hasn't made its way into the EnsEMBL database. Do I conclude that there are 10,000 other cases of this and lose faith in the EnsEMBL database's ability to get these sequences right?

Reply written 8.3 years ago by Jeff Hussmann
Ian Longden wrote, 8.2 years ago:

In addition to the answer given:

If your script is running very slowly, you may want to change it in one of the following ways. The script in the original question uses a lot of memory because all the genes are loaded at once, and the transcripts and sequences fetched for them are then kept for the whole run of the script. To reduce the memory overhead, do either of the following:

1)

while (my $gene = shift @$genes) { ... }

Each gene is taken off the array as it is processed, so after each pass through the loop the gene, its transcript and its sequence are removed.

2)

my $gene_ids = $gene_adaptor->list_dbIDs();
foreach my $gene_id (@$gene_ids) {
    my $gene = $gene_adaptor->fetch_by_dbID($gene_id);
    ...
}

Only one gene object exists at a time, and we just hold an array of the internal identifiers.
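The effect of the first option can be seen without a database connection. This is a minimal sketch in plain Perl. The hash references stand in for gene objects and are made up for illustration. In the real script the array would come from $gene_adaptor->fetch_all():

```perl
use strict;
use warnings;

# Stand-in records (assumption: real code would hold Ensembl gene objects).
my $genes = [ map { +{ id => "gene$_" } } 1 .. 5 ];

my $processed = 0;
while (my $gene = shift @$genes) {
    # ... fetch the transcript and sequence for $gene here ...
    $processed++;
    # $gene goes out of scope at the end of each iteration, and the array
    # no longer refers to it, so Perl can free it and anything it holds.
}

printf "processed %d, still held in array: %d\n", $processed, scalar @$genes;
```

The loop stops when shift returns a false value, which is safe here because object references are always true.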

-Ian.
