Question: How To Handle Transcripts Which Have No Peptide Sequence In Ensembl Perl Api?
1
gravatar for abhishekniroula7
6.5 years ago by
Sweden
abhishekniroula750 wrote:

Hello everyone, I am new to perl API for EnsEMBL. I am even new to perl but doing some readings on both of these, I could write a small code to extract the protein and cDNA sequences of canonical transcript for given Ensembl gene ids. I have pasted my code below:

#!/usr/bin/env perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db( -host =>'ensembldb.ensembl.org', -user => 'anonymous' );
my $gene_adaptor  = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my @gene_ids=("ENSG00000001629","ENSG00000001630","ENSG00000001631","ENSG00000002016");

foreach my $gene_id (@gene_ids) {

my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
my $cdsseq = $gene->canonical_transcript()->translateable_seq();
my $protseq = $gene->canonical_transcript()->translate()->seq();
print $gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n";
}

But, what I really want to do is to get all the translatable transcripts (having protein sequence) for a gene including the ensembl protein id and transcript id. (ENSG..............|ENSP...................|ENST...............\n peptide sequence) in a fasta file. And similary, cDNA sequences in another fasta file. Also, my script breaks when there is no translatable sequence for a gene, how do i set an if clause for that? Thank you for your help!!

Abhishek

perl ensembl api • 2.9k views
ADD COMMENTlink modified 4.3 years ago by jpascualanaya10 • written 6.5 years ago by abhishekniroula750
2
gravatar for Nolwenn Lavielle
6.5 years ago by
Paris (France)
Nolwenn Lavielle90 wrote:

Hello, you may probably check if the translatable sequence is define before your print. If the translatetable sequence is not define, you will not print values and go to next turn. Or if the translatable sequence is define, you print your values but do nothing if not.

For exemple:

#!/usr/bin/env perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_db( -host =>'ensembldb.ensembl.org', -user => 'anonymous' );
my $gene_adaptor  = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my @gene_ids=("ENSG00000001629","ENSG00000001630","ENSG00000001631","ENSG00000002016");

foreach my $gene_id (@gene_ids) {
    if ( defined ( $gene->canonical_transcript()->translateable_seq() ) ) {
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
        my $cdsseq = $gene->canonical_transcript()->translateable_seq();
        my $protseq = $gene->canonical_transcript()->translate()->seq();
        print $gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n";
    }
}

I didn't test this short script but it may works.

Nolwenn

ADD COMMENTlink written 6.5 years ago by Nolwenn Lavielle90

Thank you Nolwenn Lavielle!! It looks simple, I didnt know what condition to use. I am still waiting if somebody shows me way how to get all the translatable transcripts for each gene.

ADD REPLYlink written 6.5 years ago by abhishekniroula750

Small fix: you need to set $gene first, since it is used to get the value for $cdsseq:

#!/usr/bin/env perl
...
foreach my $gene_id (@gene_ids) {
    my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
    if ( defined ( my $cdsseq = $gene->canonical_transcript()->translateable_seq() ) ) {
        my $protseq = $gene->canonical_transcript()->translate()->seq();
        print $gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n";
    }
}
ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by Chris Maloney330

I didn't notice this detail. Thanks for your fix!

ADD REPLYlink written 6.5 years ago by Nolwenn Lavielle90

thank you for the correction!

ADD REPLYlink written 6.5 years ago by abhishekniroula750

One more question, is it possible to extract the start and end positions of translatable cdna sequence? I tried to extract the positions but I could only extract start and end positions of transcript which might include non-translatable part as well.

ADD REPLYlink written 6.4 years ago by abhishekniroula750
1
gravatar for Emily_Ensembl
6.3 years ago by
Emily_Ensembl18k
EMBL-EBI
Emily_Ensembl18k wrote:

Hi Abhishek

If you want every coding transcript for each gene you can create an if loop to select only protein coding transcripts. Also, you can use the coding_region_start and coding_region_end calls on the transcript to get these.

while (my $gene_id = shift @gene_ids) {

my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);

my @transcripts = @{$gene->get_all_Transcripts};

while (my $transcript = shift @transcripts){

    if ($transcript->biotype eq 'protein_coding' {
        my $cdsseq = $transcript->translateable_seq;
        my $cdsstart = $transcript->coding_region_start;
        my $cdsend = $transcript->coding_region_end
        my $protseq = $transcript->translate->seq;
        print #stuff
        }
    }
}

See http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#a29a5039bad42ce6939b42cfeef1e3b4e

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Emily_Ensembl18k
1
gravatar for jpascualanaya
4.3 years ago by
Japan
jpascualanaya10 wrote:

So, I know this is old, but haven't found another thread more related to my problem... I have being trying to obtain a fasta file of either cDNA or CDS sequences of only protein coding genes and only of the canonical transcript. For that, I was checking if the translateable sequence was defined for a given gene, and then print the corresponding sequence. But it doesn't matter how or where I set the if (defined ....), that the result is always a list with a canonical transcript fo ALL genes of a given species. I was trying to use my script to retrieve the sequences for example of Nematostella vectensis. The command I run was:

perl script.pl nematostella_vectensis 1 cdna

Where with 1, I am setting 1 bp as the min length to retrieve a seq (ie, all of protein coding ones) and with "cdna", the whole transcrirpt.

And the script's code:

#!/usr/bin/perl -w

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $species=$ARGV[0];
my $minlength=$ARGV[1];
my $seqtype=$ARGV[2];
if ($seqtype ne "cds" && $seqtype ne "cdna"){
    print "Select the type of sequence you want to retrieve: cdna or cds. \"$seqtype\" is not either.\nUsage: perl retrieve_protcod_canonicaltranscript speciesname mintranscriptlength cdna/cds\nExample: perl retrieve_all_canonicaltranscript pelodiscus_sinensis 200 cdna\n\n";
    exit;
}

my $registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_multiple_dbs(
    {-host => 'mysql-eg-publicsql.ebi.ac.uk',
     -port => 4157, 
     -user => 'anonymous'
    },
    {-host => 'ensembldb.ensembl.org',
     -port => 5306,
     -user    => 'anonymous'
    }
);

unless (open(OUTPUT, ">${species}_protcod_canonicaltranscript$seqtype.fa")){
    print "Cannot create an output file\n\n";
    exit;
}

my $gene_adaptor  = $registry->get_adaptor( $species, 'Core', 'Gene' );
my $transcript_adaptor = $registry->get_adaptor( $species, 'Core', 'Transcript');
my @gene_ids= @{$gene_adaptor->list_stable_ids()};

while (my $gene_id = shift @gene_ids) {
    my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
    if (defined ( my $cds = $gene->canonical_transcript()->translateable_seq())){
        my $cdna = $gene->canonical_transcript()->spliced_seq();
        if (length $cdna >= $minlength && $seqtype eq 'cdna') {
            print OUTPUT ">$gene_id\n$cdna\n";
        } elsif (length $cds >= $minlength && $seqtype eq 'cds') {
            print OUTPUT ">$gene_id\n$cds\n";
        } else {
            next;
        }
    }
}
exit;
 
ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by jpascualanaya10

translateable_seq returns a string, whether the transcript translates or not.
http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#ab32d5612d5f3b1d33189d5ba306d1561

If it does not translate, it returns an empty string, which will still be defined.
You could replace it with a call to translate instead. This will return a Bio::Seq object if there is a translation, undefined otherwise.

The other solution would be to have the check in two steps:
if (defined ($gene->canonical_transcript()->translate)) {
  my $cds = $gene->canonical_transcript->translateable_seq();
  my $cdna = $gene->canonical_transcript->spliced_seq();
}

Hope that helps,
Magali

ADD REPLYlink written 4.3 years ago by Magali_Ensembl130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 830 users visited in the last hour