Hi, I have a list of Ensembl gene IDs and I need to get their corresponding protein sequences using Perl. Kindly suggest how to achieve this using the Ensembl API.
It's clearly stated in the Ensembl core API tutorial that you can get the protein sequence from a Transcript object.
Translation objects and protein sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object, not from a Translation object as might be expected. The following example obtains the protein sequence of a Transcript and the Translation's stable identifier:
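use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' );
my $stable_id = 'ENST00000044768';
my $transcript_adaptor =
$registry->get_adaptor( 'Human', 'Core', 'Transcript' );
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);
# translation() is undefined for non-coding transcripts, so check before using it
if ( my $translation = $transcript->translation() ) {
print $translation->stable_id(), "\n";
print $transcript->translate()->seq(), "\n";
}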
Is it that hard to go through the documentation?
In addition, all objects in the Ensembl core API and their methods can be found here: http://www.ensembl.org/info/docs/Pdoc/ensembl/index.html.
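The tutorial example starts from a transcript ID, while the question starts from gene IDs. Here is a minimal sketch of the extra step, assuming the genes are human and that the public server at ensembldb.ensembl.org is reachable. The sub name protein_fasta_for_gene and the FASTA header layout are my own choices, not part of the Ensembl API; the API calls themselves (fetch_by_stable_id, get_all_Transcripts, translation, translate) are the ones described in the core API documentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one FASTA record per protein-coding transcript of a gene.
# $gene_adaptor is an Ensembl gene adaptor (or any object that offers
# a compatible fetch_by_stable_id method).
sub protein_fasta_for_gene {
    my ( $gene_adaptor, $gene_id ) = @_;
    my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
    unless ($gene) {
        warn "No gene found for $gene_id\n";
        return;
    }
    my @records;
    foreach my $transcript ( @{ $gene->get_all_Transcripts() } ) {
        # Non-coding transcripts have no Translation object; skip them.
        my $translation = $transcript->translation() or next;
        my $header = join ' ',
            '>' . $translation->stable_id(),
            "gene:$gene_id",
            'transcript:' . $transcript->stable_id();
        push @records,
            $header . "\n" . $transcript->translate()->seq() . "\n";
    }
    return @records;
}

# Run only when a file of gene IDs (one per line) is given.
if (@ARGV) {
    require Bio::EnsEMBL::Registry;
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    # 'Human' is an assumption; use the species your IDs belong to.
    my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

    open( my $in, '<', $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";
    while ( my $gene_id = <$in> ) {
        $gene_id =~ s/^\s+|\s+$//g;    # trim whitespace and line endings
        next unless length $gene_id;
        print protein_fasta_for_gene( $gene_adaptor, $gene_id );
    }
    close $in;
}
```

Usage would be something like: perl SCRIPT_NAME.pl gene_ids.txt > proteins.fa. A gene can yield several records, one per coding transcript, so decide afterwards whether you want all of them or only one per gene.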
This is in the Ensembl documentation, as has been pointed out. You say you need to go through the Perl API, but this would actually be easier in BioMart. If that's an option for you, watch this tutorial video.
There is a BioMart web interface you can use. The Filters would be your IDs, and the Attributes would be the protein sequences, found on the Sequences page.
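If you want the BioMart route but still from Perl, the same query can be sent as XML to the BioMart web service. This is only a sketch: the endpoint URL, the dataset name hsapiens_gene_ensembl and the attribute names (peptide, ensembl_gene_id, ensembl_peptide_id) are assumptions based on what the web interface usually exports, so build your query in the web interface first and use its XML button to confirm them. The sub name biomart_peptide_query is mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a BioMart XML query that asks for peptide sequences (FASTA)
# of the given gene IDs. Dataset, filter and attribute names are
# assumptions; confirm them with the XML export of the web interface.
sub biomart_peptide_query {
    my ( $dataset, @gene_ids ) = @_;
    my $id_list = join ',', @gene_ids;
    return join '',
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<!DOCTYPE Query>',
        '<Query virtualSchemaName="default" formatter="FASTA" ',
        'header="0" uniqueRows="1" datasetConfigVersion="0.6">',
        qq{<Dataset name="$dataset" interface="default">},
        qq{<Filter name="ensembl_gene_id" value="$id_list"/>},
        '<Attribute name="peptide"/>',
        '<Attribute name="ensembl_gene_id"/>',
        '<Attribute name="ensembl_peptide_id"/>',
        '</Dataset></Query>';
}

# Run only when a file of gene IDs (one per line) is given.
if (@ARGV) {
    require LWP::UserAgent;

    open( my $in, '<', $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";
    my @gene_ids = grep { length } map { s/^\s+|\s+$//gr } <$in>;
    close $in;

    my $xml = biomart_peptide_query( 'hsapiens_gene_ensembl', @gene_ids );

    # Assumed endpoint of the Ensembl BioMart web service.
    my $url = 'https://www.ensembl.org/biomart/martservice';
    my $response = LWP::UserAgent->new->post( $url, { query => $xml } );
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    print $response->decoded_content;
}
```

For long ID lists it is probably kinder to the server to send the IDs in batches of a few hundred rather than in one request.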
I believe you could use the Perl API that Ensembl provides on their site. It allows Perl programs to access their database.
If your question is more specific, it would be nice to know.
Here is a Perl script using the LWP::Simple module to retrieve any kind of sequence linked to an Ensembl transcript ID. It worked for me; I hope others can use it with simple modifications.
Usage: perl SCRIPT_NAME.pl FILE_CONTAINING_ENSEMBL_ID
You can edit the script to fetch specific annotations, such as CDS sequence, cDNA, peptide, exons or introns.
+++++++++++++++++++++PERL CODE+++++++++++++++++++++
### Script to retrieve Ensembl sequences using Ensembl transcript IDs
use strict;
use warnings;
use LWP::UserAgent;
use LWP::Simple;
use HTTP::Cookies;
my $input_file = shift || die "Insufficient parameters!\n Usage: perl $0 <FILE_CONTAINING_ENSEMBL_IDS>\n File must have one ID per line.\n";
open(my $in, '<', $input_file) or die "$! $input_file\n";
chomp(my @inputs = <$in>);
close $in;
@inputs = grep { /\S/ } @inputs; ## skip blank lines
print STDERR "You have entered ".scalar(@inputs)." IDs\n\n";
my $ensmbl_ids = join "\t", @inputs; ## tab-separated ID list for the UniProt query
#print "$ensmbl_ids\n";
my $flank3_display=0; ##upstream, downstream
my $flank5_display=0;
my $strand='strand'; ## 1 (forward) or -1 (reverse)
my $output='fasta'; ## output format, bed,csv,tab, gtf, gff, gff3, embl, genbank
my $fasta_genomic='off'; #unmasked,soft_masked, hard_masked, 5_flanking, 3_flanking, 5_3_flanking
########################EDIT TYPE OF SEQUENCE TO FETCH######################################
#use 0 to turn off and 1 to turn on; default all 'ON'
my $cdna='1';
my $coding='1';
my $peptide='1';
my $utr5='1';
my $utr3='1';
my $exon='1';
my $intron='1';
#############################################################################################
#===================UNIPROT BOT=====================source: UniProt site
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';
my $params = {
to => 'ACC',
from => 'ENSEMBL_TRS_ID',
format => 'tab',
query => $ensmbl_ids,
};
my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';
my $response = $agent->post("$base/$tool/", $params);
while (my $wait = $response->header('Retry-After')) {
print STDERR "Waiting ($wait)...\n";
sleep $wait;
$response = $agent->get($response->base);
}
my %ensemle_id_acc_id;
if ($response->is_success) {
    # Each line of the tab-separated reply holds an Ensembl ID and a UniProt accession
    foreach my $line (split /\n/, $response->content) {
        my ($k, $v) = split /\s+/, $line;
        $ensemle_id_acc_id{$k} = $v if $k ne 'From'; ## skip the header line
    }
}
else {
    die 'Failed, got ' . $response->status_line . ' for ' . $response->request->uri . "\n";
}
foreach(sort keys %ensemle_id_acc_id)
{
print "#Ensembl_ID=$_\tUniprot_ACC_ID: $ensemle_id_acc_id{$_}\n";
my $uniprot_url='http://www.uniprot.org/uniprot/'.$ensemle_id_acc_id{$_}.'.txt';
my $content_uniprot = get($uniprot_url) // ''; ## get() returns undef on failure
my $org_code;
if($content_uniprot=~ m/^OS\s+(\S+)\s+(\S+)/m) { ## the OS line of the UniProt entry holds the species name
$org_code=lc($1)."_".lc($2); ## Fetching organism name from UniProt
print "#Uniprot Organism: $1 $2\n";
if($org_code)
{
## constructing ensembl URL
my $ensembl_url='http://www.ensembl.org/'.$org_code.'/Export/Output/Transcript?db=core;'.'flank3_display='.$flank3_display.';flank5_display='.$flank5_display.';output='.$output.';strand='.$strand.';t='.$_.';';
$ensembl_url.="param=cdna;" if($cdna);
$ensembl_url.="param=coding;" if $coding;
$ensembl_url.= "param=peptide;" if $peptide;
$ensembl_url.="param=utr5;" if $utr5;
$ensembl_url.="param=utr3;" if $utr3;
$ensembl_url.="param=exon;" if $exon;
$ensembl_url.="param=intron;" if $intron;
$ensembl_url.='genomic=off;_format=Text';
#print "$ensembl_url\n";
my $content_ensembl_seq = get $ensembl_url;
print "$content_ensembl_seq\n";
}
}
else {
print "!!!ORG CODE ERROR!!! : $_\n";
}
print "//\n";
}
Answered largely here.
Have you tried anything so far? Is there a specific thing you are stuck on?
I suggest reading the documentation and trying the examples. Then ask again if you have difficulty.
I can say from my own experience with Ensembl that this task isn't as easy as it is normally perceived to be. A gene ID can be linked to many transcripts, and each of those can be linked to a protein ID. But many protein IDs represent exactly the same protein, with the differences only at the transcript level. Many genes don't have the is_canonical attribute set to a value. So I suggest a more precise specification of your question. Otherwise, go to BioMart, paste your gene ID list as a filter and download the data as a CSV file.
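To illustrate the point about several protein IDs sharing one sequence, here is a small sketch that collapses identical peptides after you have fetched them. It is plain Perl and independent of how the sequences were obtained. The sub name collapse_identical_proteins and the IDs and sequences in the example are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collapse records that share the same peptide sequence.
# Input:  list of [ $protein_id, $sequence ] pairs.
# Output: list of [ \@protein_ids, $sequence ], in first-seen order.
sub collapse_identical_proteins {
    my @pairs = @_;
    my ( %ids_for, @order );
    foreach my $pair (@pairs) {
        my ( $id, $seq ) = @$pair;
        push @order, $seq unless exists $ids_for{$seq};
        push @{ $ids_for{$seq} }, $id;
    }
    return map { [ $ids_for{$_}, $_ ] } @order;
}

# Hypothetical example data: PROT_A and PROT_C encode the same peptide.
my @example = (
    [ 'PROT_A', 'MKTAYIAK' ],
    [ 'PROT_B', 'MKTAYLLK' ],
    [ 'PROT_C', 'MKTAYIAK' ],
);

# Print one FASTA record per distinct sequence, listing all its IDs.
foreach my $group ( collapse_identical_proteins(@example) ) {
    my ( $ids, $seq ) = @$group;
    print '>', join( '|', @$ids ), "\n", $seq, "\n";
}
```

Whether you then keep all IDs in the header or pick a single representative per gene depends on what the sequences are for, which is why a more precise question helps.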