Question: How To Get Ensembl Id (Gene, Transcript, Protein) Mapping Information?
13
gravatar for Unode
9.1 years ago by
Unode170
Lisbon, Portugal
Unode170 wrote:

The goal is to map Ensembl identifiers (ENSG, ENST, ENSP) between each other in a programmatic way.

I've searched in the Ensembl website but I couldn't find straightforward instructions to achieve this.

I've also tried to use the public MySQL server, but I couldn't make sense of the schema and the API documentation is overwhelming and quite Perl centered.

Edit: If possible using the Ensembl.org website only as I would like to avoid having to worry about if I'm using the latest version or not.

ensembl mapping identifiers • 61k views
ADD COMMENTlink modified 4.6 years ago by Reece270 • written 9.1 years ago by Unode170
4

Just to point out that the Ensembl site has a BioMart interface - http://www.ensembl.org/biomart/martview - and recommends it for data mining. You don't need to worry about versions; BioMart accesses the latest.

ADD REPLYlink written 9.1 years ago by Neilfws48k

Ah that's great

ADD REPLYlink written 9.1 years ago by Unode170

This question exposes the downsides of the complexity of the Ensembl schema. Ironically, it is easier to achieve this result via SQL from UCSC (see Pierre's solution) than from Ensembl (see Fred's solution) and justifies the existence of BioMart

ADD REPLYlink written 8.4 years ago by Casey Bergman18k
31
gravatar for Neilfws
9.1 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

BioMart is very useful for mapping identifiers. You can use it at the website and there are also libraries for several languages; for example, biomaRt for R Bioconductor.

Here's an example. Say you have the human transcript ENST00000296026 and you want gene (ENSG) and protein (ENSP). You'd do the following at the BioMart website:

  1. Click MARTVIEW (top menu)
  2. Choose ENSEMBL GENES 59 (SANGER UK) for database and Homo sapiens genes GRCh37 for dataset
  3. Click "Filters" (left menu) and expand GENE
  4. Choose "Ensembl Transcript ID(s)" and paste your ID(s) or upload a file of IDs
  5. Click "Attributes" (left menu) and expand GENE
  6. Check Ensembl Gene ID, Transcript ID and Protein ID
  7. Click "Results" (top left menu)

This should return ENSG00000163734, ENST00000296026 and ENSP00000296026. Note that you can export results and map many kinds of IDs.

Here's the same thing using R and biomaRt:

library(biomaRt)
# define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# query biomart
results <- getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id"),
                 filters = "ensembl_transcript_id", values = "ENST00000296026",
                 mart = mart)
results
#   ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
# 1 ENSG00000163734       ENST00000296026    ENSP00000296026
ADD COMMENTlink modified 3 months ago by RamRS25k • written 9.1 years ago by Neilfws48k

+1 for a simple solution. However it suffers from the fact that is using release 59 when release 60 is the current version in Ensembl.org. Edited question to reflect this point.

ADD REPLYlink written 9.1 years ago by Unode170

In addition, although this is not a requirement (using python), from a quick search in the web I couldn't find any decent interface to BioMart using Python. I can still use rpy but I would like to prevent the complexity of the task to grow beyond necessary.

ADD REPLYlink written 9.1 years ago by Unode170

Maybe it is too late but there is a python package for biomart

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Lluís R.880

Use PyCogent! It isn't version specific and has a very simple API! I'd say the best available for Python coders!

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

Also, BioMart has EnsEMBL Genes 60 available!?

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

See this link http://www.ensembl.org/biomart/martview

ADD REPLYlink written 9.1 years ago by Steve Moss2.3k

@gawbul it was updated in the meantime, thanks

ADD REPLYlink written 9.0 years ago by Unode170
6
gravatar for Pierre Lindenbaum
9.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

You can also use the tables ensGene and ensGtp in the UCSC mysql server:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18

mysql> select * from ensGene as G,ensGtp as T
where G.name=T.transcript  and T.gene="ENSG00000215719" limit 1\G
*************************** 1. row ***************************
         bin: 585
        name: ENST00000369958
       chrom: chr1_random
      strand: -
     txStart: 35366
       txEnd: 37336
    cdsStart: 35366
      cdsEnd: 37336
   exonCount: 3
  exonStarts: 35366,36978,37308,
    exonEnds: 35537,37076,37336,
       score: 0
       name2: ENSG00000215719
cdsStartStat: cmpl
  cdsEndStat: incmpl
  exonFrames: 0,1,0,
        gene: ENSG00000215719
  transcript: ENST00000369958
     protein: ENSP00000358974
1 row in set (0.20 sec)
ADD COMMENTlink modified 3 months ago by RamRS25k • written 9.1 years ago by Pierre Lindenbaum124k
4
gravatar for Fred Fleche
9.1 years ago by
Fred Fleche4.3k
Paris, France
Fred Fleche4.3k wrote:

You can get a nice explanation of the Ensembl Database Schema at the following url.

http://www.to.infn.it/ftbio/Ensembl_materials/Core/EnsSchema.ppt

Then you can get the information you are looking for using this SQL query : see below in a PHP script.

CODE

<?php
//Connect to the server
if (!$connectionServer = mysql_connect('ensembldb.ensembl.org:3306', 'anonymous', '')) die('Could not connect: ' . mysql_error());

//Connect to the database
if (!$database = mysql_select_db('homo_sapiens_core_47_36i', $connectionServer)) die('Could not select the database');

//Get the Ensembl IDs for Genes, Transcripts, and Proteins
$result = mysql_query("
SELECT DISTINCT 
gsi.stable_id as geneid, 
tsi.stable_id as transcriptid, 
tlsi.stable_id as translationid

FROM 
gene g, 
gene_stable_id gsi, 
transcript t, 
transcript_stable_id tsi, 
translation tl, 
translation_stable_id tlsi

WHERE g.gene_id = gsi.gene_id 
AND g.gene_id = t.gene_id 
AND t.transcript_id = tsi.transcript_id
AND t.transcript_id = tl.transcript_id
AND tl.translation_id = tlsi.translation_id

LIMIT 10");

if (!$result) {
    die('Invalid query: ' . mysql_error());
}
while ($row = mysql_fetch_assoc($result)) {
    print $row['geneid'] . " - " . $row['transcriptid'] . " - " . $row['translationid'] . "<br />";
}

mysql_free_result($result);
mysql_close($connectionServer);
?>

RESULT

ENSG00000146556 - ENST00000326632 - ENSP00000317668
ENSG00000197194 - ENST00000379481 - ENSP00000368794
ENSG00000197490 - ENST00000359752 - ENSP00000352790
ENSG00000215918 - ENST00000401099 - ENSP00000383878
ENSG00000177757 - ENST00000326734 - ENSP00000317958
ENSG00000188405 - ENST00000338633 - ENSP00000342867
ENSG00000187642 - ENST00000341290 - ENSP00000343864
ENSG00000215917 - ENST00000401098 - ENSP00000383877
ENSG00000215916 - ENST00000379325 - ENSP00000368629
ENSG00000205231 - ENST00000379317 - ENSP00000368621
ADD COMMENTlink modified 3 months ago by RamRS25k • written 9.1 years ago by Fred Fleche4.3k

Your example is useful but the way you constructed the query will cause to only return complete triplets. If the transcript is a pseudogene like the ones corresponding to ENSG00000146556, you won't get them as a result (at least in homo_sapiens_core_60_37e). Thanks for the schema info though.

ADD REPLYlink written 9.0 years ago by Unode170

Thanks Fred, this saved a lot of time for me today.

ADD REPLYlink written 8.4 years ago by Casey Bergman18k
4
gravatar for Reece
4.6 years ago by
Reece270
United States
Reece270 wrote:

Perhaps this Python solution will be useful to someone:

from biomart import BiomartServer

atts = ['external_gene_name','external_gene_source','ensembl_gene_id',
        'ensembl_transcript_id','ensembl_peptide_id']

server = BiomartServer("http://www.biomart.org/biomart")
hge = server.datasets['hsapiens_gene_ensembl']

print(server.databases['ensembl'])
s = hge.search({'attributes': atts}, header=1)
for l in s.iter_lines():
    print(l)

Gives:

ENSEMBL GENES 79 (SANGER UK)
Associated Gene Name    Associated Gene Source  Ensembl Gene ID Ensembl Transcript ID   Ensembl Protein ID
MT-TF   HGNC Symbol ENSG00000210049 ENST00000387314 
MT-RNR1 HGNC Symbol ENSG00000211459 ENST00000389680 
MT-TV   HGNC Symbol ENSG00000210077 ENST00000387342 
MT-RNR2 HGNC Symbol ENSG00000210082 ENST00000387347 
MT-TL1  HGNC Symbol ENSG00000209082 ENST00000386347 
MT-ND1  HGNC Symbol ENSG00000198888 ENST00000361390 ENSP00000354687
ADD COMMENTlink modified 3 months ago by RamRS25k • written 4.6 years ago by Reece270

Thank you for the post. This is what I am looking for!

ADD REPLYlink written 2.7 years ago by jkkim30
3
gravatar for Steve Moss
9.1 years ago by
Steve Moss2.3k
United Kingdom
Steve Moss2.3k wrote:

I added a couple of comments, but thought I would clarity with an answer too.

You can use the PyCogent package available here http://pycogent.sourceforge.net/. It has the best API for EnsEMBL access in Python. PyGr has one too, but it doesn't seem to be maintained and only works with version 0.7.

Tutorials are available on EnsEMBL access using PyCogent here http://pycogent.sourceforge.net/examples/query_ensembl.html with cookbook examples here http://pycogent.sourceforge.net/cookbook/accessing_databases.html#ensemble, although the former is better.

BioMart access is available directly from EnsEMBL via the following link http://www.ensembl.org/biomart/martview, which has release 60 access.

Also, an additional (non-Python) library is CGL from the Yandell Lab here http://www.yandell-lab.org/software/cgl.html, written in Perl. It provides easy access to relationships between the different annotation features in a transparent manner.

ADD COMMENTlink written 9.1 years ago by Steve Moss2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 759 users visited in the last hour