Hi! I am currently working with human proteins and need to map between RefSeq, Ensembl, and UniProt identifiers, because my sources of "location on protein" annotations span multiple databases (ClinVar, UniProt, etc.). Since Ensembl provides the great BioMart, which matches many identifiers across databases, I am currently using the Ensembl protein ID (ENSP) as the "standard reference" for protein sequences, and I assume that all identifiers mapped to the same ENSP ID refer to the same protein sequence. Please note that I use every transcript/protein that has records in my evidence source databases, so they are not necessarily the canonical ones.
However, I soon realized that this ID-to-sequence mapping consistency (let's call it that for now...) might not always be guaranteed. Taking mRNA as an example, it looks like only the MANE Select transcripts are verified to be identical between Ensembl and RefSeq, even though BioMart also maps some other ENST IDs to RefSeq transcript IDs (NM). I'm not sure whether protein sequences show such discrepancies too, though I assume they would be less severe since we have a deterministic (?) Met start and Ter end. The UniProt mapping adds yet another difficulty since, as far as I can tell, it uses PDB sequences where possible. Is this issue valid, and if so, is it even solvable? I would appreciate any suggestions/advice!
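To make the question more concrete, the consistency check I have in mind could be sketched roughly as follows. This is only a minimal sketch under assumptions: `find_inconsistent_mappings` is a hypothetical helper name, the ID pairs stand in for a BioMart export, and the IDs and sequences are made-up placeholders rather than real records. It assumes the sequences have already been downloaded from each database into a single dictionary.

```python
def find_inconsistent_mappings(id_pairs, sequences):
    """Return mapped ID pairs whose protein sequences are not identical.

    id_pairs:  iterable of (ensp_id, other_id) tuples, e.g. parsed from a
               BioMart export (assumed input format).
    sequences: dict mapping any ID to its amino-acid sequence.
    """
    mismatches = []
    for ensp_id, other_id in id_pairs:
        ref = sequences.get(ensp_id)
        other = sequences.get(other_id)
        if ref is None or other is None:
            # No sequence available for one side, so nothing to compare.
            continue
        # Ignore a trailing stop-codon symbol, which some FASTA files include.
        ref = ref.rstrip("*")
        other = other.rstrip("*")
        if ref != other:
            mismatches.append((ensp_id, other_id, len(ref), len(other)))
    return mismatches


# Placeholder data for illustration only (not real identifiers or sequences).
sequences = {
    "ENSP_A": "MKTAYIAK",
    "NP_A": "MKTAYIAK*",   # same protein, with a trailing stop symbol
    "UP_A": "MKTAYIAKQR",  # mapped to the same ENSP but a different sequence
}
id_pairs = [("ENSP_A", "NP_A"), ("ENSP_A", "UP_A"), ("ENSP_A", "NP_MISSING")]

mismatches = find_inconsistent_mappings(id_pairs, sequences)
for ensp_id, other_id, ref_len, other_len in mismatches:
    print(f"{ensp_id} vs {other_id}: lengths {ref_len} and {other_len} differ or sequences mismatch")
```

Pairs flagged this way would then need position remapping (e.g. via pairwise alignment) before "location on protein" annotations from the two sources could be combined, which is the part I am unsure how to handle.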