Unifying IDs across different datasets
1 day ago
zizigolu ★ 4.4k

Hi

I'm working with several mass-spec protein spreadsheets. Each file reports proteins using different UniProt accessions, even when they refer to the same underlying protein/gene.

Examples include:
- reviewed vs unreviewed (SwissProt vs TrEMBL) IDs for the same protein,
- different isoform accessions for the same gene,
- MaxQuant protein-group IDs (e.g. B2RPK0;P09429) that collapse to one protein.

This makes it hard to compare proteins across datasets because the same protein appears under different UniProt IDs.

What I need is a reliable way to collapse multiple UniProt accessions into a single canonical identifier, ideally a gene symbol, so that annotation is consistent across files.
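A first normalization step that usually helps is to break each protein-group string into its member accessions and drop isoform suffixes before any mapping. The sketch below assumes groups are separated by `;` (as in the MaxQuant example above) and that isoforms are written as the accession followed by `-<number>`; the function name `split_protein_group` and the `P09429-2` input are only illustrative.

```python
import re

# Assumption: an isoform is written as "<accession>-<number>", e.g. "P09429-2".
ISOFORM_SUFFIX = re.compile(r"-\d+$")

def split_protein_group(group_id):
    """Split a ';'-separated protein group into unique base accessions, keeping order."""
    accessions = []
    for raw in group_id.split(";"):
        acc = ISOFORM_SUFFIX.sub("", raw.strip())
        if acc and acc not in accessions:
            accessions.append(acc)
    return accessions

print(split_protein_group("B2RPK0;P09429"))     # ['B2RPK0', 'P09429']
print(split_protein_group("P09429-2; P09429"))  # ['P09429']
```

Each base accession can then be mapped on its own, which avoids deciding too early that a whole group is "one protein".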

Has anyone solved this harmonization/mapping problem across multiple MS datasets?

Thanks for any thoughts

UniProt GeneSymbole ID • 393 views

I always map everything back to Ensembl gene IDs, as this is the most stable identifier to me. You would need to collect translation tables, e.g. from biomaRt, that map all the available protein and peptide identifiers to Ensembl gene IDs, and then go down the rabbit hole of merging and joining between the identifiers to end up with a clean annotation. I don't have code to show here, though, as it's always super custom.
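For what it's worth, the generic shape of that merge-and-join step can be sketched as follows. This is not anyone's actual pipeline: the accessions, gene IDs and values are placeholders made up for illustration, and `by_gene` is a hypothetical helper. The idea is simply to re-key every dataset by gene ID and to keep the accessions that fail to map in a separate list.

```python
from collections import defaultdict

# Hypothetical translation table (accession -> gene ID); all names are placeholders.
translation = [
    ("ACC_REVIEWED_1", "GENE_A"),
    ("ACC_UNREVIEWED_1", "GENE_A"),
    ("ACC_REVIEWED_2", "GENE_B"),
]
acc_to_gene = dict(translation)

# Two made-up datasets that report the same gene under different accessions.
dataset_1 = {"ACC_REVIEWED_1": 10.5, "ACC_REVIEWED_2": 7.2}
dataset_2 = {"ACC_UNREVIEWED_1": 9.8, "ACC_NOT_IN_TABLE": 3.1}

def by_gene(dataset):
    """Re-key a dataset by gene ID; return (values per gene, unmapped accessions)."""
    genes, unmapped = defaultdict(list), []
    for acc, value in dataset.items():
        gene = acc_to_gene.get(acc)
        if gene is None:
            unmapped.append(acc)
        else:
            genes[gene].append(value)
    return dict(genes), unmapped

genes_1, unmapped_1 = by_gene(dataset_1)
genes_2, unmapped_2 = by_gene(dataset_2)

print(sorted(genes_1.keys() & genes_2.keys()))  # ['GENE_A']
print(unmapped_2)                               # ['ACC_NOT_IN_TABLE']
```

Values are kept as lists per gene so that several accessions landing on the same gene are not silently overwritten; how to summarize them is a separate decision.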


Thank you. The problem is that when the protein ID column contains, say, Q5VTE0;P68104;Q05639, I am not sure which ENSG I should use, because each of these accessions has a different ENSG. Thanks for any thoughts


You can use the ID mapping tool on the UniProt site (one ID from the list above, Q5VTE0, did not map):

From    To
P68104  ENSG00000156508.19
Q05639  ENSG00000101210.14

If you have a lot of IDs to parse, you could download the underlying data from the same page and do the mapping locally.
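A local lookup along those lines might look like the sketch below. It assumes the downloaded mapping is a tab-separated file with three columns (accession, ID type, mapped ID) and that Ensembl gene rows carry the type label `Ensembl`; please check both against the README that ships with the file. The two sample rows are the mappings shown above, and `load_mapping` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the downloaded file; rows taken from the mapping result above.
SAMPLE = (
    "P68104\tEnsembl\tENSG00000156508.19\n"
    "Q05639\tEnsembl\tENSG00000101210.14\n"
)

def load_mapping(handle, id_type="Ensembl"):
    """Collect accession -> list of mapped IDs for one ID type."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            continue  # skip malformed lines rather than failing
        acc, kind, value = row
        if kind == id_type:
            mapping.setdefault(acc, []).append(value)
    return mapping

mapping = load_mapping(io.StringIO(SAMPLE))
for acc in "Q5VTE0;P68104;Q05639".split(";"):
    print(acc, mapping.get(acc, []))
```

For a real file you would pass an open file handle instead of `io.StringIO(SAMPLE)`. An accession with no row of the requested type simply comes back with an empty list, which is what happens to Q5VTE0 here.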


Sorry, but why are gene names often missing (using biomaRt) even though the UniProt ID is still valid?


In this case the entry notes that this is likely the product of a pseudogene (https://www.uniprot.org/uniprotkb/Q5VTE0/entry) and that the existence of the protein is uncertain.


Thanks a lot for being so helpful.

I’m trying to build a complete UniProtID to GeneSymbol lookup table and I’m running into a wall. I’ve already tried UniProt’s ID mapping service, BioDBnet, g:Profiler, and even processed UniProt’s full idmapping file in the terminal. Still, my dataset ends up with 393 UniProt IDs that come back with no GeneSymbol.

> head(merged)
  Accession GeneSymbol
1    B2R6J3          -
2    Q53EU7          -
3    A8K0T9          -
4    A8K0T9          -
5    Q53GS0          -
6    B2R9T9          -

What’s strange is that if I paste many of these UniProt IDs directly into Google, a gene name appears, but programmatically, through databases or command-line processing, they return blank.

Has anyone figured out a reliable or “complete” way to map UniProt IDs to gene symbols? I’d really appreciate hearing how others have solved this.


Each database has its own method of annotation/validation. Depending on the resources available, there may be a lag between the databases in terms of validation.

https://www.uniprot.org/uniprotkb/Q53GS0/entry - this entry has not been reviewed (TrEMBL) but there is evidence at transcript level.
https://www.uniprot.org/uniprotkb/Q53EU7/entry - same as above.
https://www.uniprot.org/uniprotkb/A8K0T9/entry

You get the idea ..

"I paste many of these UniProt IDs directly into Google, a gene name appears, but programmatically, through databases or command-line processing, they return blank."

You should trust the source database rather than Google in this case. Eventually these may get a gene symbol assignment, but for now there is not much you can do about them. You could take the sequences of these entries and try to assign an ENSG* ID manually by searches/alignments, but only if you feel confident about the assignment.
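If you do go the manual route, it may help to first pull the unresolved accessions out as a de-duplicated list with their entry pages. A minimal sketch, assuming the lookup table marks a missing symbol with `-` as in the `head(merged)` output earlier in the thread; `unresolved` is a hypothetical helper, and the URL pattern is the one used in the entry links above.

```python
# Rows copied from the head(merged) output: (accession, gene symbol).
rows = [
    ("B2R6J3", "-"),
    ("Q53EU7", "-"),
    ("A8K0T9", "-"),
    ("A8K0T9", "-"),
    ("Q53GS0", "-"),
    ("B2R9T9", "-"),
]

def unresolved(rows, missing=("-", "", None)):
    """Return accessions without a gene symbol, de-duplicated, in first-seen order."""
    return list(dict.fromkeys(acc for acc, symbol in rows if symbol in missing))

todo = unresolved(rows)
print(len(todo), todo)
# 5 ['B2R6J3', 'Q53EU7', 'A8K0T9', 'Q53GS0', 'B2R9T9']

for acc in todo:
    print(f"https://www.uniprot.org/uniprotkb/{acc}/entry")
```

Keeping this list alongside the mapped table makes it explicit which proteins were left without a symbol, instead of dropping them during a join.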

1 day ago

B2RPK0 and P09429 are two different proteins, both in UniProtKB/Swiss-Prot (and not in TrEMBL):

https://www.uniprot.org/uniprotkb?query=accession%3AB2RPK0+OR+accession%3AP09429

Both entries have distinct ENSG IDs.
