How To Handle Transcripts Which Have No Peptide Sequence In Ensembl Perl Api?
3
1
Entering edit mode
8.9 years ago

Hello everyone, I am new to perl API for EnsEMBL. I am even new to perl but doing some readings on both of these, I could write a small code to extract the protein and cDNA sequences of canonical transcript for given Ensembl gene ids. I have pasted my code below:

#!/usr/bin/env perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $registry = "Bio::EnsEMBL::Registry";$registry->load_registry_from_db( -host =>'ensembldb.ensembl.org', -user => 'anonymous' );
my $gene_adaptor =$registry->get_adaptor( 'Human', 'Core', 'Gene' );
my @gene_ids=("ENSG00000001629","ENSG00000001630","ENSG00000001631","ENSG00000002016");

foreach my $gene_id (@gene_ids) { my$gene = $gene_adaptor->fetch_by_stable_id($gene_id);
my $cdsseq =$gene->canonical_transcript()->translateable_seq();
my $protseq =$gene->canonical_transcript()->translate()->seq();
print $gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n";
}


But, what I really want to do is to get all the translatable transcripts (having protein sequence) for a gene including the ensembl protein id and transcript id. (ENSG..............|ENSP...................|ENST...............\n peptide sequence) in a fasta file. And similary, cDNA sequences in another fasta file. Also, my script breaks when there is no translatable sequence for a gene, how do i set an if clause for that? Thank you for your help!!

Abhishek

ensembl perl api • 3.4k views
2
Entering edit mode
8.9 years ago

Hello, you may probably check if the translatable sequence is define before your print. If the translatetable sequence is not define, you will not print values and go to next turn. Or if the translatable sequence is define, you print your values but do nothing if not.

For exemple:

#!/usr/bin/env perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $registry = "Bio::EnsEMBL::Registry";$registry->load_registry_from_db( -host =>'ensembldb.ensembl.org', -user => 'anonymous' );
my $gene_adaptor =$registry->get_adaptor( 'Human', 'Core', 'Gene' );
my @gene_ids=("ENSG00000001629","ENSG00000001630","ENSG00000001631","ENSG00000002016");

foreach my $gene_id (@gene_ids) { if ( defined ($gene->canonical_transcript()->translateable_seq() ) ) {
my $gene =$gene_adaptor->fetch_by_stable_id($gene_id); my$cdsseq = $gene->canonical_transcript()->translateable_seq(); my$protseq = $gene->canonical_transcript()->translate()->seq(); print$gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n"; } }  I didn't test this short script but it may works. Nolwenn ADD COMMENT 0 Entering edit mode Thank you Nolwenn Lavielle!! It looks simple, I didnt know what condition to use. I am still waiting if somebody shows me way how to get all the translatable transcripts for each gene. ADD REPLY 0 Entering edit mode Small fix: you need to set$gene first, since it is used to get the value for $cdsseq: #!/usr/bin/env perl ... foreach my$gene_id (@gene_ids) {
my $gene =$gene_adaptor->fetch_by_stable_id($gene_id); if ( defined ( my$cdsseq = $gene->canonical_transcript()->translateable_seq() ) ) { my$protseq = $gene->canonical_transcript()->translate()->seq(); print$gene_id,"_cDNA\n",$cdsseq,"\n",$gene_id,"_protein\n",$protseq,"\n"; } }  ADD REPLY 0 Entering edit mode I didn't notice this detail. Thanks for your fix! ADD REPLY 0 Entering edit mode thank you for the correction! ADD REPLY 0 Entering edit mode One more question, is it possible to extract the start and end positions of translatable cdna sequence? I tried to extract the positions but I could only extract start and end positions of transcript which might include non-translatable part as well. ADD REPLY 1 Entering edit mode 8.8 years ago Emily 23k Hi Abhishek If you want every coding transcript for each gene you can create an if loop to select only protein coding transcripts. Also, you can use the coding_region_start and coding_region_end calls on the transcript to get these. while (my$gene_id = shift @gene_ids) {

my $gene =$gene_adaptor->fetch_by_stable_id($gene_id); my @transcripts = @{$gene->get_all_Transcripts};

while (my $transcript = shift @transcripts){ if ($transcript->biotype eq 'protein_coding' {
my $cdsseq =$transcript->translateable_seq;
my $cdsstart =$transcript->coding_region_start;
my $cdsend =$transcript->coding_region_end
my $protseq =$transcript->translate->seq;
print #stuff
}
}
}

1
Entering edit mode
6.8 years ago

So, I know this is old, but haven't found another thread more related to my problem... I have being trying to obtain a fasta file of either cDNA or CDS sequences of only protein coding genes and only of the canonical transcript. For that, I was checking if the translateable sequence was defined for a given gene, and then print the corresponding sequence. But it doesn't matter how or where I set the if (defined ....), that the result is always a list with a canonical transcript fo ALL genes of a given species. I was trying to use my script to retrieve the sequences for example of Nematostella vectensis. The command I run was:

perl script.pl nematostella_vectensis 1 cdna

Where with 1, I am setting 1 bp as the min length to retrieve a seq (ie, all of protein coding ones) and with "cdna", the whole transcrirpt.

And the script's code:

#!/usr/bin/perl -w

use strict;
use warnings;

use Bio::EnsEMBL::Registry;
use Data::Dumper;

my $species=$ARGV[0];
my $minlength=$ARGV[1];
my $seqtype=$ARGV[2];
if ($seqtype ne "cds" &&$seqtype ne "cdna"){
print "Select the type of sequence you want to retrieve: cdna or cds. \"$seqtype\" is not either.\nUsage: perl retrieve_protcod_canonicaltranscript speciesname mintranscriptlength cdna/cds\nExample: perl retrieve_all_canonicaltranscript pelodiscus_sinensis 200 cdna\n\n"; exit; } my$registry = "Bio::EnsEMBL::Registry";
$registry->load_registry_from_multiple_dbs( {-host => 'mysql-eg-publicsql.ebi.ac.uk', -port => 4157, -user => 'anonymous' }, {-host => 'ensembldb.ensembl.org', -port => 5306, -user => 'anonymous' } ); unless (open(OUTPUT, ">${species}_protcod_canonicaltranscript$seqtype.fa")){ print "Cannot create an output file\n\n"; exit; } my$gene_adaptor  = $registry->get_adaptor($species, 'Core', 'Gene' );
my $transcript_adaptor =$registry->get_adaptor( $species, 'Core', 'Transcript'); my @gene_ids= @{$gene_adaptor->list_stable_ids()};

while (my $gene_id = shift @gene_ids) { my$gene = $gene_adaptor->fetch_by_stable_id($gene_id);
if (defined ( my $cds =$gene->canonical_transcript()->translateable_seq())){
my $cdna =$gene->canonical_transcript()->spliced_seq();
if (length $cdna >=$minlength && $seqtype eq 'cdna') { print OUTPUT ">$gene_id\n$cdna\n"; } elsif (length$cds >= $minlength &&$seqtype eq 'cds') {
print OUTPUT ">$gene_id\n$cds\n";
} else {
next;
}
}
}
exit;

0
Entering edit mode

translateable_seq returns a string, whether the transcript translates or not.
http://www.ensembl.org/info/docs/Doxygen/core-api/classBio_1_1EnsEMBL_1_1Transcript.html#ab32d5612d5f3b1d33189d5ba306d1561

If it does not translate, it returns an empty string, which will still be defined.
You could replace it with a call to translate instead. This will return a Bio::Seq object if there is a translation, undefined otherwise.

The other solution would be to have the check in two steps:
if (defined ($gene->canonical_transcript()->translate)) { my$cds = $gene->canonical_transcript->translateable_seq(); my$cdna = \$gene->canonical_transcript->spliced_seq();
}

Hope that helps,
Magali