Question: Mapping Uniprot To Ensembl Genes
4
6.8 years ago by
user790
United States
user790 wrote:

Using a UniProt-to-Ensembl mapping (available, among other places, through UniProtKB: <ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/README>), I find that certain UniProt IDs have no Ensembl gene ID in various tables. For example:

Q9Y5I3

This accession clearly maps to a gene, which in turn has an Ensembl ID, but it does not appear in the tables available from UniProt, such as the one linked above. Why is this, and where can I find a complete mapping from any UniProt/UniRef ID to an Ensembl gene ID? Thanks.

proteomics ensembl uniprot • 10.0k views
modified 3.7 years ago by ostrokach280 • written 6.8 years ago by user790
2

As usual, the answer to "how do I map ID X to ID Y" is to use BioMart or the UCSC Table Browser. Please search this site for answers on those topics; if you have trouble, leave a comment and we can supply brief instructions. I just tried BioMart using the UniProt/SwissProt accession Q9Y5I3 as a filter, and it returned ENSG00000204970/ENST00000378133 as the Gene ID and Transcript ID attributes.

modified 6.8 years ago • written 6.8 years ago by Neilfws48k
2

I am very well aware of these resources and have tried them. When you download BioMart tables and ask for the UniProt ID along with the ENSG IDs, you get back a table that does not contain Q9Y5I3. The ID is not found. Filtering on one ID at a time is not a solution: I am looking for a table that contains a proper mapping so that it can be searched programmatically.

written 6.8 years ago by user790

I managed to get a table from BioMart that has the Q9Y5I3 ID, but now it's missing others, such as A2VEC9. I don't understand why these IDs are not in the tables.

written 6.8 years ago by user790

Why are you downloading tables? The example I gave was via the web interface. If certain IDs do not map, it's simply because Ensembl is unsure whether there's a canonical gene for that protein product.

written 6.8 years ago by Neilfws48k

The reason you're not getting ALL the IDs when you download from BioMart is that BioMart cannot handle that amount of data. It just stops working partway through your query. You need to filter by your list.

Alternatively, if you do want a complete list, then you can use the Perl API.

written 6.1 years ago by Emily_Ensembl18k
2

UniProt to Ensembl cross references are currently being cleaned up and corrected. This is not an instantaneous process and will take some time. Try writing to help@uniprot.org for more details.

written 6.8 years ago by Jerven640
2
6.8 years ago by
Chris1.6k
Munich
Chris1.6k wrote:

As usual, when it comes to mappings between UniProt and external IDs, the most reliable approach is to look into the TrEMBL/SwissProt flat files and parse for the UniProt accession and the desired external accession. The UniProt accession is listed in the AC field; external ones are in the DR fields. In the case of Q9Y5I3, the relevant line looks like this (uniprot_sprot.dat):

DR   Ensembl; ENST00000378133; ENSP00000367373; ENSG00000204970.

Biopython offers a nice interface for parsing UniProt files without much ado. I'm sure BioPerl and others have similar interfaces.
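For readers who want to see the AC/DR logic spelled out, here is a minimal sketch that uses only the Python standard library (it is not Biopython's interface). It assumes the usual flat-file layout: a two-letter line code, data starting at column 6, and `//` closing each entry. The function name `ensembl_genes_from_flatfile` and the gene-ID pattern are illustrative assumptions, not part of any library.

```python
import re
from typing import Dict, Iterable, List, Set

# Assumption: Ensembl gene IDs look like ENSG... (human) or
# ENS<species letters>G... (e.g. ENSMUSG...), followed by digits.
GENE_ID = re.compile(r"ENS[A-Z]*G\d+")


def ensembl_genes_from_flatfile(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each entry's primary accession to the Ensembl gene IDs in its DR lines."""
    mapping: Dict[str, Set[str]] = {}
    accessions: List[str] = []
    genes: Set[str] = set()
    for line in lines:
        code = line[:2]
        if code == "AC":
            # AC lines may repeat; the first accession listed is the primary one.
            accessions.extend(a.strip() for a in line[5:].split(";") if a.strip())
        elif code == "DR" and line[5:].startswith("Ensembl;"):
            genes.update(GENE_ID.findall(line))
        elif code == "//":
            # End of entry: store what was collected and reset.
            if accessions:
                mapping[accessions[0]] = genes
            accessions, genes = [], set()
    if accessions:  # tolerate a final entry without a closing "//"
        mapping[accessions[0]] = genes
    return mapping


# In-memory demo using the lines quoted in this thread.
sample = [
    "AC   Q9Y5I3; O75288; Q9NRT7;",
    "DR   Ensembl; ENST00000378133; ENSP00000367373; ENSG00000204970.",
    "DR   SMR; Q9Y5I3; 27-678.",
    "//",
]
print(ensembl_genes_from_flatfile(sample))
```

To run it on a real download, pass an open file handle instead of the list, for example `ensembl_genes_from_flatfile(open("uniprot_sprot.dat"))`. Entries with no Ensembl DR line come back with an empty set, which makes the gaps discussed in this thread easy to list.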

Chris

written 6.8 years ago by Chris1.6k
1

I could not find this. When I downloaded uniprot_sprot.dat from UniProt, I got this:

$ grep Q9Y5I3 uniprot_sprot.dat
AC   Q9Y5I3; O75288; Q9NRT7;
CC       IsoId=Q9Y5I3-1; Sequence=Displayed;
CC       IsoId=Q9Y5I3-2; Sequence=VSP_000670;
CC       IsoId=Q9Y5I3-3; Sequence=VSP_000671, VSP_000672;
DR   ProteinModelPortal; Q9Y5I3; -.
DR   SMR; Q9Y5I3; 27-678.
DR   IntAct; Q9Y5I3; 1.
DR   STRING; Q9Y5I3; -.
DR   PRIDE; Q9Y5I3; -.
DR   neXtProt; NX_Q9Y5I3; -.
DR   InParanoid; Q9Y5I3; -.
DR   Genevestigator; Q9Y5I3; -.

So no Ensembl reference... where did you get the flat file you mention?

written 6.8 years ago by user790
1

True, the Ensembl reference for Q9Y5I3 seems to have vanished during the last SwissProt update; I can no longer find it either. Strange.

written 6.8 years ago by Chris1.6k
2
6.8 years ago by
Andy Yates110
Cambridge
Andy Yates110 wrote:

UniParc is another way of doing this mapping, since it is based on checksum collisions (identical sequence checksums) across numerous protein resources. If you want a 1:1 mapping, this is a good place to look: http://www.uniprot.org/uniparc/UPI00001273C7

written 6.8 years ago by Andy Yates110
1

Just to add another link into this mix from the Ensembl interface; http://www.ensembl.org/Homo_sapiens/Transcript/Similarity?db=core;g=ENSG00000204970;r=5:140165876-140391929;t=ENST00000378133 .

modified 6.8 years ago • written 6.8 years ago by Andy Yates110
1
6.8 years ago by
Julian200
Manchester, UK
Julian200 wrote:

As well as the above pieces of software, or going into the flat file, there is a tool called PICR (Protein Identifier Cross-Reference Service; http://www.ebi.ac.uk/Tools/picr/search.do). It was developed a few years back to assist with issues like this, although it was originally written to help cross-map data deposited in PRIDE (http://www.ebi.ac.uk/pride/). I resort to it frequently to identify other database entries for a particular protein or gene.

As to why it happens, I am not sure. But SwissProt, from which Q9Y5I3 comes, is the manually created part of UniProt, i.e., the data is fully annotated by a human being. The entry has probably been built from a skeleton generated by the software used to create the TrEMBL portion of UniProt. The problem, as always, is how many links you follow, and which links and annotation you trust. SwissProt has such a great reputation because of the degree and quality of annotation it provides. Another possibility is that data has been updated elsewhere and not amended in the UniProt entry: databases are continually updated, and keeping them in sync is a nightmare task.

written 6.8 years ago by Julian200
1
6.8 years ago by
cdsouthan1.8k
cdsouthan1.8k wrote:

I was writing this at the same time as Julian was adding his useful comment. I am also a Swiss-Prot annotation fan, but I can confirm that there are constitutive problems for ID cross-mapping in general for a significant proportion of human proteins, as indicated by the following UniProt queries:

(organism:"Homo sapiens [9606]") AND reviewed:yes = 20,237
(organism:"Homo sapiens [9606]") AND reviewed:yes AND database:(type:ensembl) = 18,685
(organism:"Homo sapiens [9606]") AND reviewed:yes AND database:(type:ensembl) AND database:(type:hgnc) AND database:(type:geneid) = 18,250

Ensembl 67.37 = 21,065 including 568 novel (i.e. not 100% match to UniProt)

The BioMart numbers should be similar, but any way you look at it there is ~8% discordance going from Swiss-Prot to Ensembl, plus a residual for HGNC and EGID (Entrez Gene ID). The numbers also indicate that ~1000 Ensembl proteins are not in Swiss-Prot (though some may be in TrEMBL).
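The ~8% figure can be checked directly from the three query counts quoted above. This is only a sketch of the arithmetic; the variable names are illustrative, and the counts are the ones reported in this post, not fresh query results.

```python
# Counts as reported in the UniProt queries above (human, reviewed entries).
reviewed_human = 20237            # all reviewed human entries
with_ensembl = 18685              # ... that carry an Ensembl cross-reference
with_ensembl_hgnc_geneid = 18250  # ... that also carry HGNC and GeneID cross-references

# Reviewed entries with no Ensembl cross-reference, absolute and as a percentage.
missing_ensembl = reviewed_human - with_ensembl
pct_missing = 100 * missing_ensembl / reviewed_human

# Entries that have Ensembl but lack HGNC and/or GeneID (the "residual").
residual = with_ensembl - with_ensembl_hgnc_geneid

print(f"no Ensembl xref: {missing_ensembl} ({pct_missing:.1f}%)")
print(f"Ensembl but not HGNC+GeneID: {residual}")
```

So roughly 1,550 reviewed human entries had no Ensembl cross-reference at the time, and a further few hundred lacked one of the other two identifiers.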

For Q9Y5I3, it looks like the flat file had the cross-reference but the UniProt web interface did not (i.e. I can click UniProt > HGNC > Ensembl, but not go direct). Maybe this is the sync problem Julian points out.

Julian, can you get PICR numbers that are concordant with the kind I have shown?

modified 6.8 years ago • written 6.8 years ago by cdsouthan1.8k
1
6.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

Using the UCSC mysql server and the tables uniProt.extDbRef and uniProt.extDb :

$ echo -e "Q9Y5I3\nQ04721" |\
awk '{printf("select REF.acc,REF.extAcc1,REF.extAcc2,REF.extAcc3 from uniProt.extDbRef as REF, uniProt.extDb as EXT where EXT.val=\"ENSEMBL\" and EXT.id=REF.extDb and REF.acc=\"%s\";\n",$0);}' |\
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N

Q9Y5I3  ENST00000378133 ENSP00000367373 ENSG00000204970
Q04721  ENST00000256646 ENSP00000256646 ENSG00000134250
written 6.8 years ago by Pierre Lindenbaum119k

Is this available as a standard UCSC text table?

written 6.8 years ago by user790
0
6.8 years ago by
cdsouthan1.8k
cdsouthan1.8k wrote:

OK, so which of the six sources we have referred to (UniProt/UniParc/Ensembl/BioMart/UCSC/PICR) operationally executes the primary UniProt > Ensembl mapping, and how?

modified 6.8 years ago • written 6.8 years ago by cdsouthan1.8k
1

As far as I know, Ensembl maps its proteins to UniProtKB accessions either by a 100% identity match (please assume this even though data in release 67 will disagree) or by a direct association given to Ensembl by UniProt. The BioMart referred to in this post is the Ensembl Gene Mart, so the same rules apply.

UniParc does its own mappings using MD5 digests of the sequence and clusters identical checksums together. PICR uses UniParc in its mappings but can also use other forms of alignment/lookup (see http://www.ebi.ac.uk/Tools/picr/implementation.do for more information).
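To make the checksum idea concrete, here is a minimal sketch of clustering identical sequences by MD5 digest. It only illustrates the principle described above and is not UniParc's actual pipeline; the function name, the normalisation step (stripping whitespace, upper-casing), and the toy identifiers and sequences are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_md5(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group identifiers whose sequences are identical, keyed by MD5 hex digest."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for identifier, sequence in records:
        # Assumed normalisation: drop whitespace/newlines and ignore case.
        normalized = "".join(sequence.split()).upper()
        digest = hashlib.md5(normalized.encode("ascii")).hexdigest()
        clusters[digest].append(identifier)
    return dict(clusters)


# Toy records (hypothetical identifiers and sequences, not real database entries).
records = [
    ("uniprot:toyA", "MVFSRRGGLG"),
    ("ensembl:toyB", "mvfsr rgglg"),  # same sequence, different formatting
    ("ensembl:toyC", "MVFSRRGGLA"),   # differs by one residue
]
for digest, members in cluster_by_md5(records).items():
    print(digest, members)
```

Because the match is exact, a single-residue difference between an Ensembl translation and a UniProt sequence puts them in different clusters, which is one way a cross-reference can drop out between releases.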

UniProt also do a mapping to Ensembl but I would rather let them comment on this process to avoid mis-informing you.

written 6.8 years ago by Andy Yates110
0
3.7 years ago by
ostrokach280
Canada
ostrokach280 wrote:

I recommend the UniProt idmapping.dat.gz file. After comparing the mappings available from UniProt, Ensembl, the UCSC Table Browser and BioMart, I found UniProt's to be the most complete:

$ grep Q9Y5I3 idmapping.dat
Q9Y5I3 Ensembl ENSG00000204970
Q9Y5I3-3 Ensembl_TRS ENST00000378133
Q9Y5I3-3 Ensembl_PRO ENSP00000367373
Q9Y5I3-2 Ensembl_TRS ENST00000394633
Q9Y5I3-2 Ensembl_PRO ENSP00000378129
Q9Y5I3-1 Ensembl_TRS ENST00000504120
Q9Y5I3-1 Ensembl_PRO ENSP00000420840
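Since the original question asked for a table that can be searched programmatically, here is a minimal sketch of loading that file into a dictionary. It assumes the three-column layout shown above (accession, ID type, external ID) and folds isoform accessions such as Q9Y5I3-3 onto the base accession; the function name is illustrative.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, Set


def ensembl_genes_by_accession(handle: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect Ensembl gene IDs per UniProt accession from idmapping.dat lines."""
    genes: Dict[str, Set[str]] = defaultdict(set)
    for line in handle:
        fields = line.split()  # tolerates tab- or space-separated columns
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        accession, id_type, external_id = fields
        if id_type == "Ensembl":  # gene-level rows; Ensembl_TRS/_PRO are ignored
            base_accession = accession.split("-")[0]  # drop any isoform suffix
            genes[base_accession].add(external_id)
    return dict(genes)


# In-memory demo with lines from the grep output above.
demo = io.StringIO(
    "Q9Y5I3\tEnsembl\tENSG00000204970\n"
    "Q9Y5I3-3\tEnsembl_TRS\tENST00000378133\n"
    "Q9Y5I3-3\tEnsembl_PRO\tENSP00000367373\n"
)
print(ensembl_genes_by_accession(demo))
```

For the real file, `gzip.open("idmapping.dat.gz", "rt")` from the standard library returns a text handle that can be passed in directly. The full file is large, so filtering to the accessions of interest while reading may be preferable to holding everything in memory.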
modified 3.7 years ago • written 3.7 years ago by ostrokach280