Hi
I'm working with several mass-spec protein spreadsheets. Each file reports proteins using different UniProt accessions, even when they refer to the same underlying protein/gene.
Examples include:
- reviewed vs unreviewed (Swiss-Prot vs TrEMBL) IDs for the same protein,
- different isoform accessions for the same gene,
- MaxQuant protein-group IDs (e.g. B2RPK0;P09429) that collapse to one protein.
This makes it hard to compare proteins across datasets because the same protein appears under different UniProt IDs.
What I need is a reliable way to collapse multiple UniProt accessions into a single canonical identifier, ideally a gene symbol, so annotation is consistent across files.
Has anyone solved this harmonization/mapping problem across multiple MS datasets?
Thanks for any thoughts
I always map everything back to Ensembl gene IDs, as this is the most stable identifier to me. You would need to collect translation tables, e.g. from biomaRt, that map all the available protein and peptide identifiers to Ensembl gene IDs, and then go down the rabbit hole of doing all the merging and joining between the identifiers to end up with a clean annotation. I don't have code to show here though, as it's always super custom.
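As a rough illustration of the splitting-and-joining step only (not the custom workflow mentioned above), here is a minimal Python sketch. It assumes a translation table has already been loaded into a dict; the table contents (`GENE_A`, `GENE_B`) are placeholders rather than real lookups, and the names `TRANSLATION`, `strip_isoform` and `genes_for_group` are made up for this example.

```python
import re

# Toy translation table (accession -> gene identifier). The values are
# placeholders for illustration; a real table would come from biomaRt or
# UniProt and would use Ensembl gene IDs or gene symbols.
TRANSLATION = {
    "P68104": "GENE_A",
    "Q05639": "GENE_B",
}


def strip_isoform(accession):
    """Drop a trailing isoform suffix, e.g. 'P68104-2' -> 'P68104'."""
    return re.sub(r"-\d+$", "", accession.strip())


def genes_for_group(protein_group, table):
    """Split a ';'-separated protein group and look up each accession.

    Returns (sorted unique gene identifiers, accessions with no entry).
    """
    genes = set()
    unmapped = []
    for raw in protein_group.split(";"):
        if not raw.strip():
            continue  # ignore empty fields such as a trailing ';'
        accession = strip_isoform(raw)
        gene = table.get(accession)
        if gene is None:
            unmapped.append(accession)
        else:
            genes.add(gene)
    return sorted(genes), unmapped


genes, unmapped = genes_for_group("Q5VTE0;P68104-2;Q05639", TRANSLATION)
print(genes, unmapped)
```

Returning the unmapped accessions separately keeps them visible instead of silently dropping them. A group that resolves to more than one gene still needs a per-project decision (keep all, or pick one), which is the "super custom" part.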
Thank you. The problem is that, say, in the protein ID column I see Q5VTE0;P68104;Q05639. Here I am not sure which ENSG I should get for each of these, because each one has a different ENSG. Thanks for any thoughts.
You can use the ID mapping tool on the UniProt site (the third ID from the list above did not map).
You could download the underlying data from the same page if you have a lot of IDs to parse, so you can do the mapping locally.
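For the local route, a minimal Python sketch follows. It assumes the downloaded file is in the three-column, tab-separated `idmapping.dat` layout (accession, ID type, value) and that gene symbols are stored under the ID type `Gene_Name`. The function name `load_gene_names` and the sample rows are made up for the example.

```python
import io


def load_gene_names(lines, id_type="Gene_Name"):
    """Build {accession: [values]} from idmapping.dat-style lines.

    Each line is expected to be 'accession<TAB>id_type<TAB>value'.
    Lines with a different number of columns are skipped.
    """
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        accession, kind, value = parts
        if kind != id_type:
            continue
        values = mapping.setdefault(accession, [])
        if value not in values:
            values.append(value)  # keep every distinct name, in file order
    return mapping


# Small in-memory sample standing in for the real file.
sample = io.StringIO(
    "P68104\tUniProtKB-ID\tEF1A1_HUMAN\n"
    "P68104\tGene_Name\tEEF1A1\n"
    "Q05639\tGene_Name\tEEF1A2\n"
)
gene_names = load_gene_names(sample)
print(gene_names)
```

For a real download, any iterable of text lines works, e.g. `gzip.open(path, "rt")` for a gzipped file. Accessions missing from the resulting dict are the ones with no `Gene_Name` row in that file.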
Sorry, but why are gene names often missing even though the UniProt ID is still valid (using BioMart)?
In this case the entry notes that this is likely the product of a pseudogene: https://www.uniprot.org/uniprotkb/Q5VTE0/entry and the existence of the protein is uncertain.
Thanks a lot for your help.
I’m trying to build a complete UniProt ID to Gene Symbol lookup table and I’m running into a wall. I’ve already tried UniProt’s ID mapping service, BioDBnet, g:Profiler, and even processed UniProt’s full idmapping file in the terminal. Still, my dataset ends up with 393 UniProt IDs that come back with no Gene Symbol. What’s strange is that if I paste many of these UniProt IDs directly into Google, a gene name appears, but programmatically, through databases or command-line processing, they return blank.
Has anyone figured out a reliable or “complete” way to map UniProt IDs to gene symbols? I’d really appreciate hearing how others have solved this.
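Not a complete answer, but one triage step that may help narrow down a list like those 393 leftovers is to check what kind of string each one actually is before blaming the mapping tables. The Python sketch below uses the accession pattern published in the UniProt help pages and treats `CON__` / `REV__` prefixes as MaxQuant contaminant and decoy entries. The category names and the `classify` function are made up for the example, and it only sorts identifiers; it does not look anything up.

```python
import re
from collections import Counter

# UniProtKB accession pattern as given in the UniProt help pages.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def classify(identifier):
    """Return a coarse category for one identifier string."""
    ident = identifier.strip()
    if ident.startswith(("CON__", "REV__")):
        return "contaminant_or_decoy"  # MaxQuant-style prefixes
    if ACCESSION.fullmatch(ident):
        return "accession"
    base, sep, suffix = ident.rpartition("-")
    if sep and suffix.isdigit() and ACCESSION.fullmatch(base):
        return "isoform"  # e.g. P68104-2; map via the base accession
    return "not_an_accession"


leftovers = ["Q5VTE0", "P68104-2", "CON__P02768", "EF1A1_HUMAN", "B2RPK0"]
summary = Counter(classify(x) for x in leftovers)
print(summary)
```

Anything in the `isoform` or `contaminant_or_decoy` bins can be handled by cleaning the string first. Only the well-formed `accession` leftovers are genuinely unmapped and worth checking entry by entry.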