Question

ID unifying across different datasets

0

1 day ago

zizigolu ★ 4.4k

Hi

I'm working with several mass-spec protein spreadsheets. Each file reports proteins using different UniProt accessions, even when they refer to the same underlying protein/gene.

Examples include:

- reviewed vs unreviewed (SwissProt vs TrEMBL) IDs for the same protein,
- different isoform accessions for the same gene,
- MaxQuant protein-group IDs (e.g. B2RPK0;P09429) that collapse to one protein.

This makes it hard to compare proteins across datasets because the same protein appears under different UniProt IDs.

What I need is a reliable way to collapse multiple UniProt accessions into a single canonical identifier, ideally a gene symbol, so that annotation is consistent across files.

Has anyone solved this harmonization/mapping problem across multiple MS datasets?

Thanks for any thoughts

UniProt GeneSymbole ID • 352 views

ADD COMMENT • link 49 minutes ago by zizigolu ★ 4.4k

1

I always map everything back to Ensembl gene IDs, as this is the most stable identifier in my experience. You would need to collect translation tables, e.g. from biomaRt, that map all the available protein and peptide identifiers to Ensembl gene IDs, and then go down the rabbit hole of merging and joining between the identifiers to end up with a clean annotation. I don't have code to show here, though, as it's always very custom.
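As a rough illustration of the merging step described above, here is a minimal Python sketch. It assumes the translation table has already been reduced to a plain accession-to-gene dictionary; the two entries are the mappings quoted elsewhere in this thread, with version suffixes removed. The names `normalize_accession` and `resolve_group` are hypothetical, and the `-2` isoform suffix in the example input is added purely for illustration. The sketch splits a MaxQuant-style group on `;`, strips isoform suffixes, and returns every gene the group touches rather than silently picking one.

```python
import re

# Toy translation table: UniProt accession -> Ensembl gene ID.
# The two entries are taken from this thread; a real table would come
# from biomaRt or the UniProt ID mapping download.
ACC_TO_GENE = {
    "P68104": "ENSG00000156508",
    "Q05639": "ENSG00000101210",
}

ISOFORM_SUFFIX = re.compile(r"-\d+$")


def normalize_accession(acc):
    """Trim whitespace and drop an isoform suffix such as '-2'."""
    return ISOFORM_SUFFIX.sub("", acc.strip())


def resolve_group(cell, table=ACC_TO_GENE):
    """Map a ';'-separated protein group to its genes.

    Returns (genes, unmapped): the sorted distinct gene IDs and the
    accessions that had no entry in the table.
    """
    genes, unmapped = set(), []
    for raw in cell.split(";"):
        acc = normalize_accession(raw)
        if not acc:
            continue
        gene = table.get(acc)
        if gene is None:
            unmapped.append(acc)
        else:
            genes.add(gene)
    return sorted(genes), unmapped


genes, unmapped = resolve_group("Q5VTE0;P68104-2;Q05639")
print(genes)     # ['ENSG00000101210', 'ENSG00000156508']
print(unmapped)  # ['Q5VTE0']
```

A group that resolves to more than one gene is then an explicit case to handle (keep all, or flag as ambiguous) instead of a hidden one.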

ADD REPLY • link 1 day ago by ATpoint 90k

0

Thank you. The problem is that when the protein ID column contains, say, Q5VTE0;P68104;Q05639, I am not sure which ENSG I should assign, because each of these accessions has a different ENSG. Thanks for any thoughts.

ADD REPLY • link 1 day ago by zizigolu ★ 4.4k

2

You can use the ID mapping tool on the UniProt site (the first ID in the list above, Q5VTE0, did not map):

From    To
P68104  ENSG00000156508.19
Q05639  ENSG00000101210.14

If you have a lot of IDs to parse, you could download the underlying data from the same page and do the mapping locally.
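For the local route, a minimal Python sketch follows. It assumes the downloaded file is tab-separated with three columns (accession, ID type, mapped ID), which is my understanding of the layout of UniProt's `idmapping.dat`, and that the relevant type labels are `Ensembl` and `Gene_Name`; check both against the file you actually download. The function name `load_idmapping` is hypothetical, and the sample rows stand in for the real file.

```python
import csv
import io
from collections import defaultdict


def load_idmapping(handle, wanted=("Ensembl", "Gene_Name")):
    """Read accession / ID-type / ID rows into nested dicts.

    Returns {id_type: {accession: [ids...]}} for the wanted types only.
    Ensembl version suffixes (e.g. '.19') are dropped so IDs join
    cleanly across releases.
    """
    tables = {t: defaultdict(list) for t in wanted}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        acc, id_type, value = row
        if id_type not in tables:
            continue
        if id_type == "Ensembl":
            value = value.split(".")[0]
        if value not in tables[id_type][acc]:
            tables[id_type][acc].append(value)
    return {t: dict(m) for t, m in tables.items()}


# Stand-in for the downloaded file, using the two mappings from this thread.
sample = io.StringIO(
    "P68104\tEnsembl\tENSG00000156508.19\n"
    "Q05639\tEnsembl\tENSG00000101210.14\n"
    "Q05639\tOtherType\tplaceholder\n"  # a row of another type, ignored
)
tables = load_idmapping(sample)
print(tables["Ensembl"])
# {'P68104': ['ENSG00000156508'], 'Q05639': ['ENSG00000101210']}
```

Because the function reads line by line from any file handle, the same call should work on a large file opened with `open(...)`. Each accession maps to a list so that one-to-many cases stay visible.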

ADD REPLY • link 22 hours ago by GenoMax 154k

0

Sorry, but why are gene names often missing even though the UniProt ID is still valid (using biomart)?

ADD REPLY • link 16 hours ago by zizigolu ★ 4.4k

0

In this case the entry (https://www.uniprot.org/uniprotkb/Q5VTE0/entry) notes that this is likely the product of a pseudogene and that the existence of the protein is uncertain.

ADD REPLY • link 13 hours ago by GenoMax 154k

0

Thanks a lot for the help.

I’m trying to build a complete UniProtID to GeneSymbol lookup table and I’m running into a wall. I’ve already tried UniProt’s ID mapping service, BioDBnet, g:Profiler, and even processed UniProt’s full idmapping file in the terminal. Still, my dataset ends up with 393 UniProt IDs that come back with no GeneSymbol.

> head(merged)
  Accession GeneSymbol
1    B2R6J3          -
2    Q53EU7          -
3    A8K0T9          -
4    A8K0T9          -
5    Q53GS0          -
6    B2R9T9          -

What’s strange is that if I paste many of these UniProt IDs directly into Google, a gene name appears, but programmatically, through databases or command-line processing, they return blank.

Has anyone figured out a reliable or “complete” way to map UniProt IDs to gene symbols? I’d really appreciate hearing how others have solved this.
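Before looking for a "complete" resource, it may help to pin down exactly which accessions are unresolved and whether any source fills them. The Python sketch below is one way to do that. It assumes each lookup has been loaded as a plain accession-to-symbol dictionary and that `-` or an empty string marks a missing symbol, as in the table above. `first_symbol` and `unmapped_accessions` are hypothetical helper names, and the second source with its `SYMBOL_X` value is a placeholder, not a real annotation.

```python
MISSING = {"", "-", None}


def first_symbol(acc, sources):
    """Return the first non-missing symbol for acc across sources, else None."""
    for table in sources:
        symbol = table.get(acc)
        if symbol not in MISSING:
            return symbol
    return None


def unmapped_accessions(accessions, sources):
    """Distinct accessions, in first-seen order, with no symbol in any source."""
    seen, out = set(), []
    for acc in accessions:
        if acc in seen:
            continue
        seen.add(acc)
        if first_symbol(acc, sources) is None:
            out.append(acc)
    return out


# Accessions from the head(merged) output above; all are '-' in source A.
accessions = ["B2R6J3", "Q53EU7", "A8K0T9", "A8K0T9", "Q53GS0", "B2R9T9"]
source_a = {acc: "-" for acc in accessions}
# Hypothetical second source that happens to cover one accession.
source_b = {"Q53EU7": "SYMBOL_X"}

print(unmapped_accessions(accessions, [source_a, source_b]))
# ['B2R6J3', 'A8K0T9', 'Q53GS0', 'B2R9T9']
```

Accessions that stay in the unmapped list after every source has been tried are candidates for a manual look at their UniProt entry pages, as was done for Q5VTE0 earlier in the thread.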

ADD REPLY • link 49 minutes ago by zizigolu ★ 4.4k

score 1 · Answer 1 · 2025-11-25

1

1 day ago

Elisabeth Gasteiger ★ 2.4k

B2RPK0 and P09429 are two different proteins, both in UniProtKB/Swiss-Prot (and not in TrEMBL):

https://www.uniprot.org/uniprotkb?query=accession%3AB2RPK0+OR+accession%3AP09429

Both entries have distinct ENSG IDs.

ADD COMMENT • link 1 day ago by Elisabeth Gasteiger ★ 2.4k