ID unifying across different datasets
20 hours ago
zizigolu ★ 4.4k

Hi

I'm working with several mass-spec protein spreadsheets. Each file reports proteins using different UniProt accessions, even when they refer to the same underlying protein/gene.

Examples include:
- reviewed vs unreviewed (SwissProt vs TrEMBL) IDs for the same protein,
- different isoform accessions for the same gene,
- MaxQuant protein-group IDs (e.g. B2RPK0;P09429) that collapse to one protein.

This makes it hard to compare proteins across datasets because the same protein appears under different UniProt IDs.

What I need is a reliable way to collapse multiple UniProt accessions into a single canonical identifier, ideally a gene symbol, so that annotation is consistent across files.
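
Before any mapping, each protein-ID cell can be normalized into a list of base accessions. Below is a minimal Python sketch of that step. It assumes group members are separated by `;` and that isoforms carry a `-N` suffix (e.g. `P09429-2`). The helper name `split_protein_group` is made up for illustration and is not part of any library.

```python
import re

# Isoform suffix such as "-2" at the end of an accession (assumed convention).
_ISOFORM_SUFFIX = re.compile(r"-\d+$")


def split_protein_group(cell):
    """Split a protein-group cell into unique base accessions, keeping order."""
    accessions = []
    for raw in cell.split(";"):
        # Trim whitespace and drop the isoform suffix, if any.
        acc = _ISOFORM_SUFFIX.sub("", raw.strip())
        if acc and acc not in accessions:
            accessions.append(acc)
    return accessions


print(split_protein_group("B2RPK0;P09429"))
```

Dropping the isoform suffix is a deliberate simplification: it is appropriate when the target is a gene-level identifier, but it discards isoform information.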

Has anyone solved this harmonization/mapping problem across multiple MS datasets?

Thanks for any thoughts

UniProt GeneSymbole ID • 286 views

I always map everything back to Ensembl gene IDs, as these are the most stable identifiers in my experience. You would need to collect translation tables, e.g. from biomaRt, that map all the available protein and peptide identifiers to Ensembl gene IDs, and then go down the rabbit hole of merging and joining between the identifiers to end up with a clean annotation. I don't have code to show here, though, as it's always super custom.
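
The general shape of that merging step can still be sketched in Python. This is only an illustration under assumptions: the translation table is already loaded as a plain dict from accession to gene ID, and all accessions, gene IDs and file names below are placeholders, not real identifiers. `index_by_gene` is a hypothetical helper.

```python
from collections import defaultdict


def index_by_gene(datasets, acc_to_gene):
    """Group accessions from several datasets under a shared gene ID.

    datasets:    {dataset_name: [accession, ...]}
    acc_to_gene: {accession: gene_id}  (the translation table)
    Returns {gene_id: {dataset_name: [accession, ...]}}.
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, accessions in datasets.items():
        for acc in accessions:
            gene = acc_to_gene.get(acc)
            if gene is not None:  # unmapped accessions are skipped here
                index[gene][name].append(acc)
    return {gene: dict(per_ds) for gene, per_ds in index.items()}


# Placeholder data: two accessions for the same gene, reported by different files.
acc_to_gene = {
    "ACC_REVIEWED": "GENE_1",
    "ACC_UNREVIEWED": "GENE_1",
    "ACC_OTHER": "GENE_2",
}
datasets = {
    "file_a": ["ACC_REVIEWED", "ACC_OTHER"],
    "file_b": ["ACC_UNREVIEWED"],
}

index = index_by_gene(datasets, acc_to_gene)
# Genes seen in every dataset are the ones that can be compared directly.
shared = sorted(g for g, per_ds in index.items() if len(per_ds) == len(datasets))
print(shared)
```

Keying everything on the gene ID means the comparison across files no longer depends on which accession each file happened to report.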


Thank you. The problem is that when the protein ID column contains, say, Q5VTE0;P68104;Q05639, I am not sure which ENSG I should use, because each of these accessions has a different ENSG. Thanks for any thoughts.


You can use the ID mapping tool on the UniProt site (Q5VTE0, the first ID in the list above, did not map):

From    To
P68104  ENSG00000156508.19
Q05639  ENSG00000101210.14

You can download the underlying data from the same page if you have a lot of IDs to parse, so that you can do the mapping locally.
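
A local lookup could be built along these lines. This is a sketch, not a tested recipe for the real download: it assumes a three-column, tab-separated file (accession, ID type, mapped ID) in which Ensembl gene rows are labeled `Ensembl`, so check the column layout and labels of the file you actually get. `load_mapping` is a hypothetical helper, and version suffixes such as `.19` are stripped so the IDs match unversioned tables.

```python
import re

# Trailing version on an Ensembl ID, e.g. ".19" in ENSG00000156508.19.
_VERSION = re.compile(r"\.\d+$")


def load_mapping(lines, id_type="Ensembl"):
    """Build {accession: set of mapped IDs} from three-column TSV lines."""
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        acc, kind, target = parts
        if kind == id_type:
            mapping.setdefault(acc, set()).add(_VERSION.sub("", target))
    return mapping


# In-memory stand-in for the downloaded file, using the two rows shown above.
example_lines = [
    "P68104\tEnsembl\tENSG00000156508.19\n",
    "Q05639\tEnsembl\tENSG00000101210.14\n",
]
mapping = load_mapping(example_lines)
print(mapping)
```

For a real file, any iterable of text lines works as input, for example an open handle from `gzip.open(path, "rt")` if the download is compressed.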


Sorry, but why are gene names often missing even though the UniProt ID is still valid (using biomaRt)?


In this case, the entry notes that this is likely the product of a pseudogene (https://www.uniprot.org/uniprotkb/Q5VTE0/entry) and that the existence of the protein is uncertain.
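
Since some accessions will not map at all, it may help to flag each protein group explicitly rather than drop or guess. A small Python sketch, assuming the mapping is a dict from accession to a set of unversioned Ensembl gene IDs. The `classify_group` helper and its three status labels are my own invention, not an established convention.

```python
def classify_group(accessions, acc_to_genes):
    """Return (status, sorted gene IDs, unmapped accessions) for one group.

    status is "unique" (exactly one gene), "ambiguous" (several genes)
    or "unmapped" (no member of the group maps to any gene).
    """
    genes = set()
    unmapped = []
    for acc in accessions:
        hits = acc_to_genes.get(acc)
        if hits:
            genes.update(hits)
        else:
            unmapped.append(acc)
    if not genes:
        status = "unmapped"
    elif len(genes) == 1:
        status = "unique"
    else:
        status = "ambiguous"
    return status, sorted(genes), unmapped


# Mapping taken from the ID mapping result earlier in this thread.
acc_to_genes = {
    "P68104": {"ENSG00000156508"},
    "Q05639": {"ENSG00000101210"},
}
result = classify_group(["Q5VTE0", "P68104", "Q05639"], acc_to_genes)
print(result)
# → ('ambiguous', ['ENSG00000101210', 'ENSG00000156508'], ['Q5VTE0'])
```

How to treat "ambiguous" groups (keep all genes, pick the first member, or exclude the group) is a separate analysis decision that this sketch does not make.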

19 hours ago

B2RPK0 and P09429 are two different proteins, both in UniProtKB/Swiss-Prot (and not in TrEMBL):

https://www.uniprot.org/uniprotkb?query=accession%3AB2RPK0+OR+accession%3AP09429

Both entries have distinct ENSG IDs.
