How To Retrieve A Protein Sequence Given An Ensembl Gene Id Using Perl
13.1 years ago

Hi, I have a list of Ensembl gene IDs and I need to get their corresponding protein sequences using Perl. Kindly suggest how to achieve this using the Ensembl API.

perl ensembl homework • 11k views

Answered largely here.


Have you tried anything so far? Is there a specific thing you are stuck on?


I suggest reading the documentation and trying the examples. Then ask again if you have difficulty.


I suggest reading the documentation and trying the examples ;-)


I can say from my own experience with Ensembl that this task isn't as easy as it is normally perceived to be. A gene ID can be linked to many transcripts, and each transcript can be linked to a protein ID. However, many protein IDs represent exactly the same protein, with the differences only at the transcript level. Many genes also don't have the is_canonical attribute set. So I suggest specifying your question more precisely, for example whether you want one protein per gene or all of them. Otherwise, go to BioMart, paste your gene ID list as a filter, and download the data as a CSV file.
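To illustrate the point about several protein IDs sharing one sequence, here is a minimal sketch that collects every translation of a gene and collapses identical peptides. It assumes the Ensembl Perl API is installed and that the public server `ensembldb.ensembl.org` is reachable; the helper names `unique_proteins` and `proteins_for_gene` are made up for this example and are not part of the API.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Collapse [translation ID, peptide] pairs so each distinct peptide is kept once.
# Returns a list of [peptide, [translation IDs]] in first-seen order.
sub unique_proteins {
    my @pairs = @_;
    my (%ids_for, @order);
    for my $pair (@pairs) {
        my ($id, $seq) = @$pair;
        push @order, $seq unless exists $ids_for{$seq};
        push @{ $ids_for{$seq} }, $id;
    }
    return map { [ $_, $ids_for{$_} ] } @order;
}

# Fetch [translation ID, peptide] pairs for all coding transcripts of one gene.
# (Hypothetical helper; it connects to the public Ensembl database.)
sub proteins_for_gene {
    my ($gene_id, $species) = @_;
    require Bio::EnsEMBL::Registry;    # loaded lazily; needs the Ensembl Perl API
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $gene_adaptor = $registry->get_adaptor($species, 'Core', 'Gene');
    my $gene = $gene_adaptor->fetch_by_stable_id($gene_id)
        or die "Gene $gene_id not found\n";
    my @pairs;
    for my $transcript (@{ $gene->get_all_Transcripts }) {
        my $translation = $transcript->translation or next;    # skip non-coding
        push @pairs, [ $translation->stable_id, $transcript->translate->seq ];
    }
    return @pairs;
}

# Run only when a gene ID is given on the command line.
if (@ARGV) {
    my ($gene_id) = @ARGV;
    for my $entry (unique_proteins(proteins_for_gene($gene_id, 'Human'))) {
        my ($seq, $ids) = @$entry;
        print '>', join('|', @$ids), "\n$seq\n";
    }
}
```

Run it as `perl script.pl ENSG00000139618` (any human gene ID). Each FASTA header lists all translation IDs that share that peptide.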

13.1 years ago
Thaman ★ 3.3k

It's clearly stated in the Ensembl core API tutorial that you can get the protein sequence from a Transcript object:

Translation objects and protein sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object, not a Translation object as might be expected. The following example obtains the protein sequence of a Transcript and the Translation's stable identifier:

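use strict;
use Bio::EnsEMBL::Registry;

# Connect to the public Ensembl database (this setup is assumed by the snippet)
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' );

my $stable_id = 'ENST00000044768';

my $transcript_adaptor =
  $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);

print $transcript->translation()->stable_id(), "\n";
print $transcript->translate()->seq(),         "\n";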

Is it that hard to go through the documentation?
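Since the question starts from gene IDs rather than transcript IDs, here is a rough sketch of the extra step: read gene IDs from a file, take each gene's canonical transcript, and print its peptide as FASTA. It assumes the Ensembl Perl API is installed and that picking the canonical transcript is acceptable for your purpose. `to_fasta`, `clean_ids` and `main` are names invented for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Format a sequence as a FASTA record wrapped at 60 columns.
sub to_fasta {
    my ($header, $seq) = @_;
    (my $wrapped = $seq) =~ s/(.{1,60})/$1\n/g;
    return ">$header\n$wrapped";
}

# Strip surrounding whitespace from each line and drop empty lines.
sub clean_ids {
    return grep { length } map { my $l = $_; $l =~ s/^\s+|\s+$//g; $l } @_;
}

sub main {
    my ($id_file) = @_;
    require Bio::EnsEMBL::Registry;    # loaded lazily; needs the Ensembl Perl API
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $gene_adaptor = $registry->get_adaptor('Human', 'Core', 'Gene');

    open my $in, '<', $id_file or die "Cannot open $id_file: $!\n";
    for my $gene_id (clean_ids(<$in>)) {
        my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
        unless ($gene) {
            warn "$gene_id: not found\n";
            next;
        }
        # One representative transcript per gene; it may be non-coding.
        my $transcript = $gene->canonical_transcript;
        unless ($transcript && $transcript->translation) {
            warn "$gene_id: no translation\n";
            next;
        }
        my $header = $gene_id . '|' . $transcript->translation->stable_id;
        print to_fasta($header, $transcript->translate->seq);
    }
    close $in;
}

# Run only when an ID file is given on the command line.
main(@ARGV) if @ARGV;
```

Usage would be `perl script.pl gene_ids.txt`, with one gene ID per line. Genes that are missing or have no coding canonical transcript are reported on STDERR and skipped.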


In addition, all objects in the Ensembl core API and their methods can be found here: http://www.ensembl.org/info/docs/Pdoc/ensembl/index.html.


Thanks for the additional mentoring!

13.1 years ago

This is in the Ensembl documentation, as has been pointed out. You say you need to go through the Perl API, but this would actually be easier in BioMart. If that's an option for you, watch this tutorial video.

There is a BioMart web interface you can use. The Filters would be your gene IDs, and under Attributes you would choose the Sequences page and select the protein (peptide) sequences.
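If you still want to drive this from Perl, the same filter and attribute selection can be sent to BioMart as an XML query. The sketch below is only a starting point: the service URL, the dataset name `hsapiens_gene_ensembl`, and the attribute names `peptide`, `ensembl_gene_id` and `ensembl_peptide_id` are assumptions based on BioMart conventions, so check them against the XML that the web interface generates for your own query. The sub names are invented for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a BioMart XML query: gene IDs as the filter, peptide sequence as attribute.
sub biomart_query_xml {
    my ($dataset, @gene_ids) = @_;
    my $ids = join ',', @gene_ids;
    return <<"XML";
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="FASTA" header="0" uniqueRows="1" datasetConfigVersion="0.6">
  <Dataset name="$dataset" interface="default">
    <Filter name="ensembl_gene_id" value="$ids"/>
    <Attribute name="peptide"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_peptide_id"/>
  </Dataset>
</Query>
XML
}

# POST the query to the BioMart service and return the FASTA text.
# The URL is an assumption; adjust it if your mirror differs.
sub fetch_peptides {
    my ($xml) = @_;
    require LWP::UserAgent;    # loaded lazily; needs libwww-perl
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->post(
        'https://www.ensembl.org/biomart/martservice',
        { query => $xml },
    );
    die 'BioMart request failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Run only when gene IDs are given on the command line.
print fetch_peptides(biomart_query_xml('hsapiens_gene_ensembl', @ARGV)) if @ARGV;
```

Called as `perl script.pl ENSG00000139618 ENSG00000141510`, it should print one FASTA record per translation, provided the names above match the current BioMart configuration.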


BioMart works well! Additionally, you can use an R package called biomaRt to achieve this!

13.1 years ago

I believe you could use the Ensembl Perl API, which is provided by Ensembl and can be found on their site. It allows Perl programs to access their databases.

If your question is more specific than that, it would help to know the details.

10.7 years ago

Here is a Perl script using the LWP::Simple module to retrieve any kind of sequence linked to an Ensembl transcript ID. It worked for me, and I hope others can use it with simple modifications.

Usage: perl SCRIPT_NAME.pl FILE_CONTAINING_ENSEMBL_ID


You can edit the script to fetch specific sequence types, such as CDS, cDNA, peptide, exons or introns.

+++++++++++++++++++++PERL CODE+++++++++++++++++++++

### Script to retrieve Ensembl sequences using Ensembl transcript IDs

use strict;
use LWP::UserAgent;
use LWP::Simple;
use HTTP::Cookies;


my $input_file = shift or die "Insufficient parameters!\n Usage: perl $0 <FILE_CONTAINING_ENSEMBL_IDS>\n File must have one ID per line.\n";

open(my $in, '<', $input_file) or die "$! $input_file\n";
my @inputs=<$in>;
close $in;
print STDERR "You have entered ".scalar @inputs." IDs\n\n";
my $ensmbl_ids=join "",@inputs;

## The IDs are sent as one tab-separated string
$ensmbl_ids=~s/\n/\t/g;
#print "$ensmbl_ids\n";

my $flank3_display=0;            ## length of 3' (downstream) flanking sequence
my $flank5_display=0;            ## length of 5' (upstream) flanking sequence
my $strand='strand';                ## 1 (forward) or -1 (reverse)
my $output='fasta';                ## output format: fasta, bed, csv, tab, gtf, gff, gff3, embl, genbank

my $fasta_genomic='off';        ## unmasked, soft_masked, hard_masked, 5_flanking, 3_flanking, 5_3_flanking

########################EDIT TYPE OF SEQUENCE TO FETCH######################################
#use 0 to turn off and 1 to turn on; default all 'ON'
my $cdna='1';
my $coding='1';
my $peptide='1';
my $utr5='1';
my $utr3='1';
my $exon='1';
my $intron='1';
#############################################################################################




#===================UNIPROT BOT=====================source: UniProt site
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';

my $params = {
  to => 'ACC',
  from => 'ENSEMBL_TRS_ID',                    
  format => 'tab',
  query =>  $ensmbl_ids,
};

my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/", $params);

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}


my %ensemle_id_acc_id;
if ($response->is_success) {
    ## Parse the mapping table (Ensembl ID, UniProt accession), skipping the header row
    foreach my $l (split /\n/, $response->content) {
        my ($k, $v) = split /\s+/, $l;
        $ensemle_id_acc_id{$k} = $v if $k ne 'From';
    }
}
else {
    die 'Failed, got ' . $response->status_line . ' for ' . $response->request->uri . "\n";
}


foreach my $id (sort keys %ensemle_id_acc_id)
{
    print "#Ensembl_ID=$id\tUniprot_ACC_ID: $ensemle_id_acc_id{$id}\n";
    my $uniprot_url='http://www.uniprot.org/uniprot/'.$ensemle_id_acc_id{$id}.'.txt';
    my $content_uniprot = get $uniprot_url;

    ## Fetching organism name (genus and species) from the OS line of the UniProt entry
    if(defined $content_uniprot and $content_uniprot=~ m/^OS\s+(\S+)\s+(\S+)/m) {
        my $org_code=lc($1)."_".lc($2);
        print "#Uniprot Organism: $1 $2\n";

        ## constructing ensembl URL
        my $ensembl_url='http://www.ensembl.org/'.$org_code.'/Export/Output/Transcript?db=core;'.'flank3_display='.$flank3_display.';flank5_display='.$flank5_display.';output='.$output.';strand='.$strand.';t='.$id.';';

        $ensembl_url.="param=cdna;" if $cdna;
        $ensembl_url.="param=coding;" if $coding;
        $ensembl_url.="param=peptide;" if $peptide;
        $ensembl_url.="param=utr5;" if $utr5;
        $ensembl_url.="param=utr3;" if $utr3;
        $ensembl_url.="param=exon;" if $exon;
        $ensembl_url.="param=intron;" if $intron;

        $ensembl_url.='genomic=off;_format=Text';
        #print "$ensembl_url\n";
        my $content_ensembl_seq = get $ensembl_url;
        if(defined $content_ensembl_seq) {
            print "$content_ensembl_seq\n";
        }
        else {
            print STDERR "No sequence returned for $id\n";
        }
    }
    else {
        print "!!!ORG CODE ERROR!!! : $id\n";
    }
    print "//\n";
}