Question: From Ensembl protein ID to sequence
5.3 years ago by
Joseph Hughes2.8k
Scotland, UK
Joseph Hughes2.8k wrote:

Hi,

 

I have a list of protein IDs from Ensembl:

ENSMUSP00000137272
ENSMUSP00000137602
ENSRNOP00000057248
ENSRNOP00000057253
ENSMICP00000006596
ENSMICP00000013787
ENSTBEP00000002813
ENSTBEP00000003741
ENSTBEP00000004212

From Uniprot, you can do it using the following URL:

http://www.uniprot.org/uniprot/?query=ENSMUSP00000137272&format=fasta

Is there a way to retrieve the corresponding protein sequences from Ensembl without knowing which species they come from? Or using Biomart?

Thanks

protein sequence ensembl • 2.2k views

The simplest approach might be to work out which species they come from. That is fairly easy, because these IDs always have the form ENS<Species Code>P<ID> (where P stands for protein). So you can tokenize your list, find out which species it contains, and then use BioMart to download the sequences for each of those species.

Some examples are:

MUS = mouse

RNO = Rat

TBE = Tupaia belangeri (Tree Shrew)

You can find the information here
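
The tokenizing step described above can be sketched in Python as follows. This is only a minimal illustration: the function names are made up for this example, and the pattern assumes the usual Ensembl layout of "ENS", an optional species code, "P", and 11 digits (human IDs have no species code, so they come back with an empty code).

```python
import re
from collections import defaultdict

# Assumed layout: ENS + optional species code + P + 11 digits.
# Human protein IDs (ENSP...) have no species code.
PROTEIN_ID = re.compile(r"^ENS([A-Z]*)P(\d{11})$")


def species_code(protein_id):
    """Return the species code embedded in an Ensembl protein ID ('' for human)."""
    match = PROTEIN_ID.match(protein_id.strip())
    if match is None:
        raise ValueError(f"Not an Ensembl protein ID: {protein_id!r}")
    return match.group(1)


def group_by_species(protein_ids):
    """Group protein IDs by species code, keeping the input order within each group."""
    groups = defaultdict(list)
    for protein_id in protein_ids:
        groups[species_code(protein_id)].append(protein_id.strip())
    return dict(groups)


ids = [
    "ENSMUSP00000137272",
    "ENSMUSP00000137602",
    "ENSRNOP00000057248",
    "ENSMICP00000006596",
    "ENSTBEP00000002813",
]
for code, members in group_by_species(ids).items():
    print(code, len(members))
```

Each group can then be submitted to BioMart against the dataset for that species.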

written 5.3 years ago by Sam3.0k
5.3 years ago by
Emily_Ensembl21k
EMBL-EBI
Emily_Ensembl21k wrote:

Have you tried the REST API? The GET sequence/id endpoint pulls out a sequence with just the ID. For example http://rest.ensembl.org/sequence/id/ENSMUSP00000137272?content-type=text/x-fasta;type=protein.
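
For a whole list of IDs, the same endpoint can be called in a loop. Below is a rough Python sketch using only the standard library; the function names are hypothetical, and the URL is simply the one above with https in place of http.

```python
import urllib.request

SERVER = "https://rest.ensembl.org"


def sequence_url(protein_id):
    """Build the GET sequence/id URL for a protein, asking for FASTA output."""
    return f"{SERVER}/sequence/id/{protein_id}?content-type=text/x-fasta;type=protein"


def fetch_protein_fasta(protein_id, timeout=30.0):
    """Download one protein as FASTA text; raises urllib.error.HTTPError on a bad ID."""
    with urllib.request.urlopen(sequence_url(protein_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (needs network access):
# for protein_id in ["ENSMUSP00000137272", "ENSRNOP00000057248"]:
#     print(fetch_protein_fasta(protein_id), end="")
```

No species information is needed here, because the server resolves the stable ID itself.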

5.3 years ago by
Tariq Daouda210
IRIC | Institute for Research in Immunology and Cancer
Tariq Daouda210 wrote:

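pyGeno is also your friend. It does not require access to a REST API, so it is more reliable and faster if you have a lot of proteins.

from pyGeno.Genome import *

# Note: GRCh37.75 is a human reference genome. A mouse ID such as
# ENSMUSP00000137272 needs the matching mouse genome to be imported
# into pyGeno and loaded here instead.
ref = Genome(name = 'GRCh37.75')

prot = ref.get(Protein, id = 'ENSMUSP00000137272')[0]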

And you also get all the information supplied by Ensembl for free:

prot.gene.biotype, prot.transcript.sequence, prot.transcript.exons etc...

 

5.3 years ago by
Joseph Hughes2.8k
Scotland, UK
Joseph Hughes2.8k wrote:

Actually, once you know which adaptor to use, it is quite simple. Here's a perl script that does it for an input text file with protein identifiers on each line:

 

 

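#!/usr/bin/env perl

###########################################################################
# Script to download the protein sequences for a list of Ensembl protein
# identifiers (one per line), using the Compara SeqMember adaptor.
# Output is written to <input file>_out.fa

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::ApiVersion;
printf( "The API version used is %s\n", software_version() );

my $list = $ARGV[0] or die "Usage: $0 <file with one protein ID per line>\n";
print "Parsing IDs from $list\n";
open( my $in, '<', $list ) or die "Can't open $list: $!\n";
my @IDs;
while ( my $line = <$in> ) {
  chomp $line;
  $line =~ s/^\s+|\s+$//g;             # trim whitespace, including a trailing \r
  push @IDs, $line if length $line;    # skip blank lines
}
close $in;

# Load the registry automatically
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous',
);

# The Compara adaptor covers all species, so it only needs to be fetched once
my $seqmember_adaptor = $registry->get_adaptor( 'Multi', 'compara', 'SeqMember' );

my $out_file = "${list}_out.fa";
open( my $out, '>', $out_file ) or die "Can't open $out_file: $!\n";

foreach my $ID (@IDs) {
  # fetch a Member; skip IDs that are not found instead of crashing
  my $seqmember = $seqmember_adaptor->fetch_by_stable_id($ID);
  unless ($seqmember) {
    warn "No sequence member found for $ID\n";
    next;
  }
  print $out ">$ID\n", $seqmember->sequence(), "\n";
}
close $out;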

 
