Question: How To Get Ensembl Id (Gene, Transcript, Protein) Mapping Information?
8.4 years ago by
Lisbon, Portugal
Unode170 wrote:

The goal is to map Ensembl identifiers (ENSG, ENST, ENSP) between each other in a programmatic way.

I've searched in the Ensembl website but I couldn't find straightforward instructions to achieve this.

I've also tried to use the public MySQL server, but I couldn't make sense of the schema and the API documentation is overwhelming and quite Perl centered.

Edit: If possible, I'd like to use only the website, so that I don't have to worry about whether I'm using the latest version.

ensembl mapping identifiers • 57k views
ADD COMMENTlink modified 3.9 years ago by Reece250 • written 8.4 years ago by Unode170

Just to point out that the Ensembl site has a BioMart interface and recommends it for data mining. You don't need to worry about versions; BioMart accesses the latest.

ADD REPLYlink written 8.4 years ago by Neilfws48k

Ah that's great

ADD REPLYlink written 8.4 years ago by Unode170

This question exposes the downsides of the complexity of the Ensembl schema. Ironically, it is easier to achieve this result via SQL from UCSC (see Pierre's solution) than from Ensembl (see Fred's solution), which justifies the existence of BioMart.

ADD REPLYlink written 7.7 years ago by Casey Bergman18k
8.4 years ago by
Sydney, Australia
Neilfws48k wrote:

BioMart is very useful for mapping identifiers. You can use it at the website and there are also libraries for several languages; for example, biomaRt for R/Bioconductor.

Here's an example. Say you have the human transcript ENST00000296026 and you want gene (ENSG) and protein (ENSP). You'd do the following at the BioMart website:

  1. Click MARTVIEW (top menu)
  2. Choose ENSEMBL GENES 59 (SANGER UK) for database and Homo sapiens genes GRCh37 for dataset
  3. Click "Filters" (left menu) and expand GENE
  4. Choose "Ensembl Transcript ID(s)" and paste your ID(s) or upload a file of IDs
  5. Click "Attributes" (left menu) and expand GENE
  6. Check Ensembl Gene ID, Transcript ID and Protein ID
  7. Click "Results" (top left menu)

This should return ENSG00000163734, ENST00000296026 and ENSP00000296026. Note that you can export results and map many kinds of IDs.

Here's the same thing using R and biomaRt:

# load the biomaRt package (from Bioconductor)
library(biomaRt)
# define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id"),
                 filters = "ensembl_transcript_id", values = "ENST00000296026",
                 mart = mart)
#   ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
# 1 ENSG00000163734       ENST00000296026    ENSP00000296026
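
For Python users, the same request can be written as a BioMart XML query, the format the BioMart web interface can export. The sketch below only builds the query string and does not contact any server. It assumes the usual BioMart XML layout (Query, Dataset, Filter and Attribute elements); the helper name build_biomart_query is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart XML query string (hypothetical helper, for illustration only)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One Filter element per filter name; multiple values are comma-separated
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    # One Attribute element per requested output column
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


xml_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"ensembl_transcript_id": ["ENST00000296026"]},
    attributes=["ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id"],
)
print(xml_query)
```

The resulting string would typically be sent as the query parameter of a BioMart martservice endpoint; which endpoint to use depends on the mirror and release, so it is left out here.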
ADD COMMENTlink written 8.4 years ago by Neilfws48k

+1 for a simple solution. However, it suffers from the fact that it uses release 59 when release 60 is the current version. I've edited the question to reflect this point.

ADD REPLYlink written 8.4 years ago by Unode170

In addition, although Python is not a requirement, a quick web search didn't turn up any decent Python interface to BioMart. I can still use rpy, but I would like to keep the task from growing more complex than necessary.

ADD REPLYlink written 8.4 years ago by Unode170

Maybe it is too late but there is a python package for biomart

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Lluís R.810

Use PyCogent! It isn't version specific and has a very simple API! I'd say the best available for Python coders!

ADD REPLYlink written 8.4 years ago by Steve Moss2.2k

Also, BioMart has EnsEMBL Genes 60 available!?

ADD REPLYlink written 8.4 years ago by Steve Moss2.2k

See this link

ADD REPLYlink written 8.4 years ago by Steve Moss2.2k

@gawbul it was updated in the meantime, thanks

ADD REPLYlink written 8.4 years ago by Unode170
8.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

you can also use the tables ensGene and ensGtp in the UCSC mysql server:

mysql --user=genome -A -D hg18

mysql> select * from ensGene as G,ensGtp as T
where G.name=T.transcript and T.gene="ENSG00000215719" limit 1\G
*************************** 1. row ***************************
         bin: 585
        name: ENST00000369958
       chrom: chr1_random
      strand: -
     txStart: 35366
       txEnd: 37336
    cdsStart: 35366
      cdsEnd: 37336
   exonCount: 3
  exonStarts: 35366,36978,37308,
    exonEnds: 35537,37076,37336,
       score: 0
       name2: ENSG00000215719
cdsStartStat: cmpl
  cdsEndStat: incmpl
  exonFrames: 0,1,0,
        gene: ENSG00000215719
  transcript: ENST00000369958
     protein: ENSP00000358974
1 row in set (0.20 sec)
ADD COMMENTlink written 8.4 years ago by Pierre Lindenbaum118k
8.4 years ago by
Fred Fleche4.3k
Paris, France
Fred Fleche4.3k wrote:

You can get a nice explanation of the Ensembl Database Schema at the following url.

Then you can get the information you are looking for with the SQL query in the PHP script below.


//Connect to the server
if (!$connectionServer = mysql_connect('', 'anonymous', '')) die('Could not connect: ' . mysql_error());

//Connect to the database
if (!$database = mysql_select_db('homo_sapiens_core_47_36i', $connectionServer)) die('Could not select the database');

//Get the Ensembl IDs for Genes, Transcripts, and Proteins
$result = mysql_query("
SELECT
gsi.stable_id as geneid,
tsi.stable_id as transcriptid,
tlsi.stable_id as translationid

FROM
gene g,
gene_stable_id gsi,
transcript t,
transcript_stable_id tsi,
translation tl,
translation_stable_id tlsi

WHERE g.gene_id = gsi.gene_id 
AND g.gene_id = t.gene_id 
AND t.transcript_id = tsi.transcript_id
AND t.transcript_id = tl.transcript_id
AND tl.translation_id = tlsi.translation_id

LIMIT 10");

if (!$result) {
    die('Invalid query: ' . mysql_error());
}

//Print each gene - transcript - translation triplet
while ($row = mysql_fetch_assoc($result)) {
    print $row['geneid'] . " - " . $row['transcriptid'] . " - " . $row['translationid'] . "<br />";
}



ENSG00000146556 - ENST00000326632 - ENSP00000317668
ENSG00000197194 - ENST00000379481 - ENSP00000368794
ENSG00000197490 - ENST00000359752 - ENSP00000352790
ENSG00000215918 - ENST00000401099 - ENSP00000383878
ENSG00000177757 - ENST00000326734 - ENSP00000317958
ENSG00000188405 - ENST00000338633 - ENSP00000342867
ENSG00000187642 - ENST00000341290 - ENSP00000343864
ENSG00000215917 - ENST00000401098 - ENSP00000383877
ENSG00000215916 - ENST00000379325 - ENSP00000368629
ENSG00000205231 - ENST00000379317 - ENSP00000368621
ADD COMMENTlink written 8.4 years ago by Fred Fleche4.3k

Your example is useful, but the way the query is constructed means it returns only complete triplets. If a transcript has no translation, as with the pseudogene transcripts of ENSG00000146556, it won't appear in the results (at least in homo_sapiens_core_60_37e). Thanks for the schema info though.
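
To illustrate the difference, here is a small self-contained sketch using Python's sqlite3 module. The two tables are a simplified stand-in for the Ensembl schema and the GENE_A/TX/PROT identifiers are invented, not real Ensembl data. An inner join drops the transcript that has no translation, while a LEFT JOIN keeps it with an empty protein column.

```python
import sqlite3

# Toy in-memory database: TX_2 deliberately has no matching translation row
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE transcript (transcript_id INTEGER, gene TEXT, stable_id TEXT);
CREATE TABLE translation (transcript_id INTEGER, stable_id TEXT);
INSERT INTO transcript VALUES (1, 'GENE_A', 'TX_1'), (2, 'GENE_A', 'TX_2');
INSERT INTO translation VALUES (1, 'PROT_1');
""")

# Inner join: only transcripts that have a translation are returned
inner = con.execute("""
SELECT t.gene, t.stable_id, tl.stable_id
FROM transcript t
JOIN translation tl ON tl.transcript_id = t.transcript_id
ORDER BY t.transcript_id
""").fetchall()

# Left join: every transcript is returned; the protein is NULL when absent
outer = con.execute("""
SELECT t.gene, t.stable_id, tl.stable_id
FROM transcript t
LEFT JOIN translation tl ON tl.transcript_id = t.transcript_id
ORDER BY t.transcript_id
""").fetchall()

print(inner)
print(outer)
```

Applying the same idea to the real query would presumably mean turning the joins on the translation tables into LEFT JOINs, but I haven't tested that against the Ensembl server.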

ADD REPLYlink written 8.4 years ago by Unode170

Thanks Fred, this saved a lot of time for me today.

ADD REPLYlink written 7.7 years ago by Casey Bergman18k
3.9 years ago by
United States
Reece250 wrote:

Perhaps this Python solution will be useful to someone:

from biomart import BiomartServer

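# Attributes to retrieve, in output column order
atts = ['external_gene_name', 'external_gene_source', 'ensembl_gene_id',
        'ensembl_transcript_id', 'ensembl_peptide_id']

# Pass the URL of a BioMart server here
server = BiomartServer( "" )
hge = server.datasets['hsapiens_gene_ensembl']

# Query all rows, with a header line, and print them as tab-separated text
s = hge.search({'attributes': atts}, header=1)
for l in s.iter_lines():
    print(l.decode('utf-8'))
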

Associated Gene Name	Associated Gene Source	Ensembl Gene ID	Ensembl Transcript ID	Ensembl Protein ID
MT-TF	HGNC Symbol	ENSG00000210049	ENST00000387314	
MT-RNR1	HGNC Symbol	ENSG00000211459	ENST00000389680	
MT-TV	HGNC Symbol	ENSG00000210077	ENST00000387342	
MT-RNR2	HGNC Symbol	ENSG00000210082	ENST00000387347	
MT-TL1	HGNC Symbol	ENSG00000209082	ENST00000386347	
MT-ND1	HGNC Symbol	ENSG00000198888	ENST00000361390	ENSP00000354687
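
Once you have the tab-separated output, turning it into lookup tables is straightforward. The sketch below is one possible way to do it: it parses two rows copied from the output above with the standard csv module and builds gene/transcript/protein dictionaries. The dictionary names are arbitrary, and transcripts without a protein (such as the tRNA genes) are simply left out of the protein map.

```python
import csv
import io
from collections import defaultdict

# Two rows copied from the output above (tab-separated; the last column may be empty)
tsv = (
    "Associated Gene Name\tAssociated Gene Source\tEnsembl Gene ID\t"
    "Ensembl Transcript ID\tEnsembl Protein ID\n"
    "MT-TF\tHGNC Symbol\tENSG00000210049\tENST00000387314\t\n"
    "MT-ND1\tHGNC Symbol\tENSG00000198888\tENST00000361390\tENSP00000354687\n"
)

gene_to_transcripts = defaultdict(set)
transcript_to_gene = {}
transcript_to_protein = {}

for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    gene = row["Ensembl Gene ID"]
    tx = row["Ensembl Transcript ID"]
    protein = row["Ensembl Protein ID"]
    gene_to_transcripts[gene].add(tx)
    transcript_to_gene[tx] = gene
    # Non-coding transcripts have an empty protein column; skip those
    if protein:
        transcript_to_protein[tx] = protein

print(transcript_to_gene["ENST00000361390"])
print(transcript_to_protein.get("ENST00000387314"))
```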


ADD COMMENTlink written 3.9 years ago by Reece250

Thank you for the post. This is what I am looking for!

ADD REPLYlink written 2.1 years ago by jkkim30
8.4 years ago by
Steve Moss2.2k
United Kingdom
Steve Moss2.2k wrote:

I added a couple of comments, but thought I would clarify with an answer too.

You can use the PyCogent package available here. It has the best API for EnsEMBL access in Python. PyGr has one too, but it doesn't seem to be maintained and only works with version 0.7.

Tutorials are available on EnsEMBL access using PyCogent here with cookbook examples here, although the former is better.

BioMart access is available directly from EnsEMBL via the following link, which has release 60 access.

Also, an additional (non-Python) library is CGL from the Yandell Lab here, written in Perl. It provides easy access to relationships between the different annotation features in a transparent manner.

ADD COMMENTlink written 8.4 years ago by Steve Moss2.2k