Question

How To Handle Transcripts Which Have No Peptide Sequence In Ensembl Perl Api?

1

Entering edit mode

12.5 years ago

abhishekniroula7 ▴ 50

Hello everyone, I am new to perl API for EnsEMBL. I am even new to perl but doing some readings on both of these, I could write a small code to extract the protein and cDNA sequences of canonical transcript for given Ensembl gene ids. I have pasted my code below:

#!/usr/bin/env perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db( -host =>'ensembldb.ensembl.org', -user => 'anonymous' );
my $gene_adaptor  = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my @gene_ids=("ENSG00000001629","ENSG00000001630","ENSG00000001631","ENSG00000002016");

foreach my $gene_id (@gene_ids) {

my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
my $cdsseq = $gene->canonical_transcript()->translateable_seq();
my $protseq = $gene->canonical_transcript()->translate()->seq();
print $gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n";
}

But, what I really want to do is to get all the translatable transcripts (having protein sequence) for a gene including the ensembl protein id and transcript id. (ENSG..............|ENSP...................|ENST...............\n peptide sequence) in a fasta file. And similary, cDNA sequences in another fasta file. Also, my script breaks when there is no translatable sequence for a gene, how do i set an if clause for that? Thank you for your help!!

Abhishek

ensembl perl api • 4.9k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 12.5 years ago by abhishekniroula7 ▴ 50

1

Entering edit mode

12.4 years ago

Emily 24k

Hi Abhishek

If you want every coding transcript for each gene you can create an if loop to select only protein coding transcripts. Also, you can use the coding_region_start and coding_region_end calls on the transcript to get these.

while (my $gene_id = shift @gene_ids) {

my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);

my @transcripts = @{$gene->get_all_Transcripts};

while (my $transcript = shift @transcripts){

    if ($transcript->biotype eq 'protein_coding' {
        my $cdsseq = $transcript->translateable_seq;
        my $cdsstart = $transcript->coding_region_start;
        my $cdsend = $transcript->coding_region_end
        my $protseq = $transcript->translate->seq;
        print #stuff
        }
    }
}

See http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#a29a5039bad42ce6939b42cfeef1e3b4e

ADD COMMENT • link 12.4 years ago by Emily 24k

1

Entering edit mode

10.4 years ago

jpascualanaya ▴ 10

So, I know this is old, but haven't found another thread more related to my problem... I have being trying to obtain a fasta file of either cDNA or CDS sequences of only protein coding genes and only of the canonical transcript. For that, I was checking if the translatable sequence was defined for a given gene, and then print the corresponding sequence. But it doesn't matter how or where I set the if (defined ....), that the result is always a list with a canonical transcript for ALL genes of a given species. I was trying to use my script to retrieve the sequences for example of Nematostella vectensis. The command I run was:

perl script.pl nematostella_vectensis 1 cdna

Where with 1, I am setting 1 bp as the min length to retrieve a seq (i.e., all of protein coding ones) and with "cdna", the whole transcript.

And the script's code:

#!/usr/bin/perl -w

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $species=$ARGV[0];
my $minlength=$ARGV[1];
my $seqtype=$ARGV[2];
if ($seqtype ne "cds" && $seqtype ne "cdna"){
    print "Select the type of sequence you want to retrieve: cdna or cds. \"$seqtype\" is not either.\nUsage: perl retrieve_protcod_canonicaltranscript speciesname mintranscriptlength cdna/cds\nExample: perl retrieve_all_canonicaltranscript pelodiscus_sinensis 200 cdna\n\n";
    exit;
}

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_multiple_dbs(
    {-host => 'mysql-eg-publicsql.ebi.ac.uk',
     -port => 4157, 
     -user => 'anonymous'
    },
    {-host => 'ensembldb.ensembl.org',
     -port => 5306,
     -user    => 'anonymous'
    }
);

unless (open(OUTPUT, ">${species}_protcod_canonicaltranscript$seqtype.fa")){
    print "Cannot create an output file\n\n";
    exit;
}

my $gene_adaptor  = $registry->get_adaptor( $species, 'Core', 'Gene' );
my $transcript_adaptor = $registry->get_adaptor( $species, 'Core', 'Transcript');
my @gene_ids= @{$gene_adaptor->list_stable_ids()};

while (my $gene_id = shift @gene_ids) {
    my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
    if (defined ( my $cds = $gene->canonical_transcript()->translateable_seq())){
        my $cdna = $gene->canonical_transcript()->spliced_seq();
        if (length $cdna >= $minlength && $seqtype eq 'cdna') {
            print OUTPUT ">$gene_id\n$cdna\n";
        } elsif (length $cds >= $minlength && $seqtype eq 'cds') {
            print OUTPUT ">$gene_id\n$cds\n";
        } else {
            next;
        }
    }
}
exit;

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.4 years ago by jpascualanaya ▴ 10

0

Entering edit mode

translateable_seq returns a string, whether the transcript translates or not.

http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#ab32d5612d5f3b1d33189d5ba306d1561

If it does not translate, it returns an empty string, which will still be defined.

You could replace it with a call to translate instead. This will return a Bio::Seq object if there is a translation, undefined otherwise.

The other solution would be to have the check in two steps:

if (defined ($gene->canonical_transcript()->translate)) {
  my $cds = $gene->canonical_transcript->translateable_seq();
  my $cdna = $gene->canonical_transcript->spliced_seq();
}

Hope that helps,
Magali

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Magali_Ensembl ▴ 160

score 2 · Accepted Answer · 2013-01-10

2

Entering edit mode

12.5 years ago

Nolwenn Lavielle ▴ 90

Hello, you may probably check if the translatable sequence is define before your print. If the translatetable sequence is not define, you will not print values and go to next turn. Or if the translatable sequence is define, you print your values but do nothing if not.

For exemple:

#!/usr/bin/env perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db( -host =>'ensembldb.ensembl.org', -user => 'anonymous' );
my $gene_adaptor  = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my @gene_ids=("ENSG00000001629","ENSG00000001630","ENSG00000001631","ENSG00000002016");

foreach my $gene_id (@gene_ids) {
    if ( defined ( $gene->canonical_transcript()->translateable_seq() ) ) {
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
        my $cdsseq = $gene->canonical_transcript()->translateable_seq();
        my $protseq = $gene->canonical_transcript()->translate()->seq();
        print $gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n";
    }
}

I didn't test this short script but it may works.

Nolwenn

ADD COMMENT • link 12.5 years ago by Nolwenn Lavielle ▴ 90

0

Entering edit mode

Thank you Nolwenn Lavielle!! It looks simple, I didnt know what condition to use. I am still waiting if somebody shows me way how to get all the translatable transcripts for each gene.

ADD REPLY • link 12.5 years ago by abhishekniroula7 ▴ 50

0

Entering edit mode

Small fix: you need to set $gene first, since it is used to get the value for $cdsseq:

#!/usr/bin/env perl
...
foreach my $gene_id (@gene_ids) {
    my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
    if ( defined ( my $cdsseq = $gene->canonical_transcript()->translateable_seq() ) ) {
        my $protseq = $gene->canonical_transcript()->translate()->seq();
        print $gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n";
    }
}

ADD REPLY • link 12.5 years ago by Chris Maloney ▴ 360

0

Entering edit mode

I didn't notice this detail. Thanks for your fix!

ADD REPLY • link 12.5 years ago by Nolwenn Lavielle ▴ 90

0

Entering edit mode

thank you for the correction!

ADD REPLY • link 12.5 years ago by abhishekniroula7 ▴ 50

0

Entering edit mode

One more question, is it possible to extract the start and end positions of translatable cdna sequence? I tried to extract the positions but I could only extract start and end positions of transcript which might include non-translatable part as well.

ADD REPLY • link 12.4 years ago by abhishekniroula7 ▴ 50