Mapping Uniprot To Ensembl Genes
10.7 years ago
user ▴ 930

I'm finding that in the UniProt to Ensembl mapping (available, among other places, through UniProtKB: <ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/README>) certain UniProt IDs have no Ensembl gene ID in various tables. For example:

Q9Y5I3

It clearly maps to a gene, which in turn has an Ensembl ID, but it does not appear in the tables available from UniProt, like the one linked above. Why is this, and where can I find a complete mapping from any UniProt/UniRef ID to an Ensembl gene ID? Thanks.

uniprot ensembl proteomics • 16k views

As usual, the answer to "how do I map ID X to ID Y" is to use BioMart or the UCSC Table browser. Please search this site for answers on those topics; if you have trouble leave a comment and we can supply brief instructions. I just tried BioMart using UniProt/SwissProt Accession Q9Y5I3 as filter and it returned ENSG00000204970/ENST00000378133 as Gene ID and Transcript ID attributes.


I am very well aware of these resources and have tried them. When you download BioMart tables and ask for the UniProt ID along with ENSG IDs, you get a table back that does not contain Q9Y5I3. The ID is not found. Filtering by each ID individually is not a solution -- I am looking for a table that contains a proper mapping so that it can be searched programmatically.


I managed to get a table from BioMart that has the Q9Y5I3 ID, but now it's missing others like A2VEC9. I don't understand why these IDs are not in the tables.


Why are you downloading tables? The example I have was via the web interface. If certain IDs do not map, it's simply because Ensembl is unsure whether there's a canonical gene for that protein product.


The reason you're not getting ALL the IDs when you download from BioMart is that BioMart cannot handle that amount of data. It just stops working partway through your query. You need to filter by your list.

Alternatively, if you do want a complete list, then you can use the Perl API.


UniProt to Ensembl cross references are currently being cleaned up and corrected. This is not an instantaneous process and will take some time. Try writing to help@uniprot.org for more details.

10.7 years ago

Using the UCSC mysql server and the tables uniProt.extDbRef and uniProt.extDb:

$ echo -e "Q9Y5I3\nQ04721" |\
awk '{printf("select REF.acc,REF.extAcc1,REF.extAcc2,REF.extAcc3 from uniProt.extDbRef as REF, uniProt.extDb as EXT where EXT.val=\"ENSEMBL\" and EXT.id=REF.extDb and REF.acc=\"%s\";\n",$0);}' |\
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N

Q9Y5I3  ENST00000378133 ENSP00000367373 ENSG00000204970
Q04721  ENST00000256646 ENSP00000256646 ENSG00000134250


Is this available as a standard UCSC text table?


Solved my problem. thank you pierre!

10.7 years ago
Chris ★ 1.6k

As usual, when it comes to mappings between UniProt and external IDs, the most reliable approach is to look into the TrEMBL/SwissProt flat files and parse out the UniProt accession and the desired external accession. The UniProt accession is listed in the AC field; external ones are in the DR fields. In the case of Q9Y5I3, it looks like this (uniprot_sprot.dat):

DR   Ensembl; ENST00000378133; ENSP00000367373; ENSG00000204970.


BioPython offers a nice interface for parsing UniProt files without much ado. I'm sure BioPerl/... have similar interfaces.
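
For readers who want to see what the AC/DR parsing amounts to, here is a minimal standard-library sketch (deliberately not the BioPython interface). It assumes the DR line layout shown above, "DR   Ensembl; transcript; protein; gene.", and that entries end with a "//" line; the function name `ensembl_xrefs` and the two-line sample entry are made up for illustration. Newer releases may append extra fields to the DR line, which this sketch does not handle.

```python
import io


def ensembl_xrefs(handle):
    """Yield (primary accession, list of Ensembl DR tuples) per flat-file entry."""
    accessions, xrefs = [], []
    for line in handle:
        code, value = line[:2], line[5:].strip()
        if code == "AC":
            # AC lines hold ';'-separated accessions; the first one is primary.
            accessions.extend(a.strip() for a in value.split(";") if a.strip())
        elif code == "DR" and value.startswith("Ensembl;"):
            # Assumed layout: "Ensembl; ENST...; ENSP...; ENSG...."
            fields = [f.strip() for f in value.rstrip(".").split(";")]
            xrefs.append(tuple(fields[1:]))
        elif code == "//":
            # End of entry: emit what was collected and reset.
            if accessions:
                yield accessions[0], xrefs
            accessions, xrefs = [], []


# Hypothetical two-line excerpt of an entry, for illustration only.
sample = (
    "AC   Q9Y5I3; O75288; Q9NRT7;\n"
    "DR   Ensembl; ENST00000378133; ENSP00000367373; ENSG00000204970.\n"
    "DR   PRIDE; Q9Y5I3; -.\n"
    "//\n"
)
result = dict(ensembl_xrefs(io.StringIO(sample)))
print(result)
```

On a real download you would pass an open file handle for uniprot_sprot.dat instead of the in-memory sample.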

Chris


I could not find this. When I downloaded uniprot_sprot.dat from UniProt, I got this:

$ grep Q9Y5I3 uniprot_sprot.dat
AC   Q9Y5I3; O75288; Q9NRT7;
CC   IsoId=Q9Y5I3-1; Sequence=Displayed;
CC   IsoId=Q9Y5I3-2; Sequence=VSP_000670;
CC   IsoId=Q9Y5I3-3; Sequence=VSP_000671, VSP_000672;
DR   ProteinModelPortal; Q9Y5I3; -.
DR   SMR; Q9Y5I3; 27-678.
DR   IntAct; Q9Y5I3; 1.
DR   STRING; Q9Y5I3; -.
DR   PRIDE; Q9Y5I3; -.
DR   neXtProt; NX_Q9Y5I3; -.
DR   InParanoid; Q9Y5I3; -.
DR   Genevestigator; Q9Y5I3; -.

So no Ensembl reference... where did you get the flat file you mention?

True, the Ensembl reference for Q9Y5I3 seems to have vanished during the last SwissProt update; I cannot find it anymore either. Strange.

10.7 years ago
Andy Yates ▴ 120

UniParc is another way of doing this mapping, since it is based on checksum collisions from numerous protein resources. If you want a 1:1 mapping this is a good place to look.

http://www.uniprot.org/uniparc/UPI00001273C7

10.7 years ago
Julian ▴ 200

As well as the above pieces of software, or going into the flat file, there is a tool called PICR (Protein Identifier Cross-Reference Service; http://www.ebi.ac.uk/Tools/picr/search.do). It was built a few years back to assist with issues like this, but was originally written to help cross-mapping of data placed into PRIDE (http://www.ebi.ac.uk/pride/). It is something I resort to frequently to identify other database entries for a particular protein or gene.

As to why it happens, I am not sure. But SwissProt, from which Q9Y5I3 comes, is the manually created part of UniProt, i.e., the data is fully annotated by a human being. It has probably been taken from a skeleton generated by the software used to create the TrEMBL portion of UniProt. The problem as always is how many links you follow, and which links and annotation you trust. The reason SwissProt has such a great reputation is the degree and quality of annotation it provides.

Another possibility is that data has been updated elsewhere and not been amended in the UniProt entry: databases are continually updated, and keeping databases in sync is a nightmare task.

10.7 years ago
cdsouthan ★ 1.9k

I was writing this at the same time as Julian was adding his useful comment. I am also a Swiss-Prot annotation fan, but I can confirm that there are constitutive problems for ID x-mapping in general for a significant proportion of human proteins, as indicated by the following UniProt queries:

(organism:"Homo sapiens [9606]") AND reviewed:yes = 20,237
(organism:"Homo sapiens [9606]") AND reviewed:yes AND database:(type:ensembl) = 18,685
(organism:"Homo sapiens [9606]") AND reviewed:yes AND database:(type:ensembl) AND database:(type:hgnc) AND database:(type:geneid) = 18,250
Ensembl 67.37 = 21,065 including 568 novel (i.e. not 100% match to UniProt)

The BioMart numbers should be similar, but any way you look at it there is ~8% discordance Swiss-Prot > Ensembl, and a residual for HGNC and EGID. The numbers also indicate that ~1000 Ensembl proteins are not in Swiss-Prot (but some may be in TrEMBL).

For Q9Y5I3 it looks like the flat file had the x-ref but not the UniProt web interface (i.e. I can click UniProt > HGNC > Ensembl but not direct). Maybe this is the sync problem Julian points out. Julian, can you get PICR numbers that are concordant with the type I have shown?

7.7 years ago
ostrokach ▴ 350

I recommend the UniProt idmapping.dat.gz file. After comparing mappings available from UniProt, Ensembl, the UCSC Table Browser and BioMart, I found UniProt's to be the most complete:

$ grep Q9Y5I3 idmapping.dat
Q9Y5I3 Ensembl ENSG00000204970
Q9Y5I3-3 Ensembl_TRS ENST00000378133
Q9Y5I3-3 Ensembl_PRO ENSP00000367373
Q9Y5I3-2 Ensembl_TRS ENST00000394633
Q9Y5I3-2 Ensembl_PRO ENSP00000378129
Q9Y5I3-1 Ensembl_TRS ENST00000504120
Q9Y5I3-1 Ensembl_PRO ENSP00000420840
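
To turn the grep output above into the programmatically searchable table the question asks for, something like the following sketch may help. It assumes idmapping.dat has three tab-separated columns (accession, ID type, external ID) and that gene-level rows carry the type "Ensembl", as in the output above; the function name `load_gene_map` and the in-memory sample are illustrative only.

```python
import io
from collections import defaultdict


def load_gene_map(handle):
    """Map each UniProt accession to the set of Ensembl gene IDs listed for it."""
    genes = defaultdict(set)
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip anything that is not a 3-column row
        accession, id_type, external_id = parts
        if id_type == "Ensembl":
            # Drop any isoform suffix such as "-3" so rows collapse onto the base accession.
            genes[accession.split("-")[0]].add(external_id)
    return genes


# Two rows copied from the grep output above, with tabs as the assumed separator.
sample = (
    "Q9Y5I3\tEnsembl\tENSG00000204970\n"
    "Q9Y5I3-3\tEnsembl_TRS\tENST00000378133\n"
)
gene_map = load_gene_map(io.StringIO(sample))
print(sorted(gene_map["Q9Y5I3"]))
```

For the real file, `gzip.open("idmapping.dat.gz", "rt")` from the standard library gives a text handle that can be passed to the same function without unpacking the archive first.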

This has also been my experience.

A couple of things worth noting:

1. There are organism-level mapping files so you aren't required to use the global one.
2. The gene annotations in the xx_idmapping.dat.gz and xx_idmapping_selected.tab.gz files do not contain the same information -- despite being smaller, the .dat.gz file includes gene annotations which are left empty in the tabular "selected" version of the mapping.
10.7 years ago
cdsouthan ★ 1.9k

OK - so which of the six sources we have referred to (UniProt/UniParc/Ensembl/BioMart/UCSC/PICR) operationally executes the primary UniProt > Ensembl mapping, and how?


As far as I know, Ensembl maps its proteins to UniProtKB accessions either by a 100% identity match (please assume this even though data in release 67 will disagree) or by using a direct association given to Ensembl by UniProt. The BioMart referred to in this post is the Ensembl Gene Mart, so the same rules apply.

UniParc does its own mappings using MD5 digests of the sequence, and clusters identical checksums together. PICR uses UniParc in its mappings but can also use other forms of alignment/lookup (see http://www.ebi.ac.uk/Tools/picr/implementation.do for more information).
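
The checksum idea can be sketched in a few lines. This is only an illustration of grouping identifiers by the MD5 digest of their sequence, not UniParc's actual pipeline; the identifiers and peptide sequences below are invented, and the function name `group_by_md5` is hypothetical.

```python
import hashlib
from collections import defaultdict


def group_by_md5(records):
    """Cluster (identifier, sequence) pairs by the MD5 digest of the sequence."""
    clusters = defaultdict(list)
    for identifier, sequence in records:
        # Normalise case so identical sequences always hash to the same digest.
        digest = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
        clusters[digest].append(identifier)
    return clusters


# Invented identifiers and sequences, for illustration only.
records = [
    ("uniprot:P_EXAMPLE", "MKTAYIAK"),
    ("ensembl:ENSP_EXAMPLE", "MKTAYIAK"),
    ("ensembl:ENSP_OTHER", "MKTAYIAR"),
]
clusters = group_by_md5(records)

# Identifiers sharing a digest have identical sequences, i.e. a 100% identity match.
shared = [sorted(ids) for ids in clusters.values() if len(ids) > 1]
print(shared)
```

A single residue difference yields a different digest, which is one way an identity-based cross-reference can silently disappear when either database updates a sequence.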

UniProt also do a mapping to Ensembl, but I would rather let them comment on this process to avoid misinforming you.