Are mapped RefSeq, Ensembl, and UniProt IDs pointing to the same protein sequence?
6 weeks ago
Yep ▴ 10

Hi! I am working with human proteins and need to map between RefSeq, Ensembl, and UniProt, because my sources of "location on protein" span multiple databases (ClinVar, UniProt, etc.). Since Ensembl provides the great BioMart, which helps match a lot of identifiers, I am currently using its protein IDs (ENSP) as the "standard reference" for protein sequences, and I assume that all identifiers mapped to the same ENSP ID refer to the same protein sequence. Please note that I am using all transcripts/proteins as long as they have records in my evidence source databases, so they do not necessarily need to be canonical ones.

However, I soon realized that this id-sequence mapping consistency (let's call it that for now...) might not always be guaranteed. Taking mRNA as an example, it looks like only the MANE-selected transcripts are verified to be the same, even though BioMart can successfully map some other ENSTs to RefSeq transcript IDs (NM). I'm not sure whether protein sequences also show such variation, though I assume it would be less severe since we have a deterministic (?) MET start and TER end. The UniProt mapping adds yet another difficulty, since it is using PDB sequences where possible. Is this issue valid, and if so, is it even solvable? Would appreciate any suggestions/advice!

mapping uniprot ensembl refseq protein • 875 views
0

Consider using the official HGNC list of human gene symbols (this link will open the file in your browser). This file includes mappings to various databases.

$ awk -F "\t" 'BEGIN{OFS="\t"}{print $2,$20,$22,$24,$26}' hgnc_complete_set.txt

symbol     ensembl_gene_id  ucsc_id     refseq_accession  uniprot_ids
A1BG       ENSG00000121410  uc002qsd.5  NM_130786         P04217
A1BG-AS1   ENSG00000268895  uc002qse.3  NR_015380
A1CF       ENSG00000148584  uc057tgv.1  NM_014576         Q9NQ94
A2M        ENSG00000175899  uc001qvk.2  NM_000014         P01023
A2M-AS1    ENSG00000245105  uc009zgj.2  NR_026971
A2ML1      ENSG00000166535  uc001quz.6  NM_144670         A8K2U0
A2ML1-AS1  ENSG00000256661  uc058kxy.1
A2ML1-AS2  ENSG00000256904  uc058kyb.1
A2MP1      ENSG00000256069              NG_001067
A3GALT2    ENSG00000184389  uc031plq.1  NM_001080438      U3KPV4
A4GALT     ENSG00000128274  uc062ewl.1  NM_017436         Q9NPC4
A4GNT      ENSG00000118017  uc003ers.2  NM_016161         Q9UNA3
AAAS       ENSG00000094914  uc001scr.5  NM_001173466      Q9NRG9
AACS       ENSG00000081760  uc001uhc.4  NM_023928         Q86V21
AACSP1     ENSG00000250420              NR_024035
AAGAB      ENSG00000103591  uc002aqk.6  NM_024666         Q6PD74
AAK1       ENSG00000115977  uc002sfp.3  NM_014911         Q2M2I8
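
The awk command above picks columns by position, which silently breaks if the file layout changes. As a rough alternative, here is a minimal Python sketch that selects the same five columns by the header names shown in the output. The function name `select_columns` and the in-memory demo file (including its `extra` column) are made up for illustration; only the five column names and the two data rows come from the output above.

```python
import csv
import io

# Column names as they appear in the header of the output above.
WANTED = ["symbol", "ensembl_gene_id", "ucsc_id", "refseq_accession", "uniprot_ids"]


def select_columns(handle, wanted=WANTED):
    """Yield one dict per row, keeping only the wanted columns (empty string if blank)."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield {name: (row.get(name) or "") for name in wanted}


# Tiny in-memory stand-in for hgnc_complete_set.txt; the "extra" column is hypothetical.
demo = io.StringIO(
    "extra\tsymbol\tensembl_gene_id\tucsc_id\trefseq_accession\tuniprot_ids\n"
    "x\tA1BG\tENSG00000121410\tuc002qsd.5\tNM_130786\tP04217\n"
    "x\tA1BG-AS1\tENSG00000268895\tuc002qse.3\tNR_015380\t\n"
)
rows = list(select_columns(demo))
for row in rows:
    print("\t".join(row[name] for name in WANTED))
```

For the real file, `open("hgnc_complete_set.txt", newline="", encoding="utf-8")` would take the place of the `io.StringIO` object.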

0

This will miss every protein for which there is no assigned HGNC symbol, which may include proteins that are in Ensembl and UniProt but not in RefSeq.

0

Sure but it would help unambiguously with a large % of proteins that are well known.

Do you know how many there are that don't fall into this category? I see that the HGNC file currently has 43k+ entries.

0

Thanks for sharing the resource! From a brief look, HGNC does not seem to provide all transcripts/proteins of each gene. Since my task is protein-centric (my core evidence comes from ClinVar, which is essentially protein-centric), I do need the different transcripts/peptides from the same gene, which is what raises this issue of aligning IDs and sequences at the same time.

As a demo example, HGNC:20603 DHDDS contains one entry only, but in ENSP and RefSeq there are a lot more.

1

I have moved my post to a comment in light of your needs. Hopefully it will help someone who is looking for a single mapping that is already validated by HGNC.

1
6 weeks ago
LChart 840

This is annoying for gene IDs (HGNC, Ensembl, Entrez, RefSeq) as well as proteins (UniProt, RefSeq, Ensembl).

For a conservative mapping, note that all three sources maintain their own set of mappings, and UniProt provides an ID mapping service. By grabbing the annotation files, you can extract the following directional mappings:

UniProt -> Ensembl (from UniProt)
UniProt -> RefSeq (from UniProt)
RefSeq -> Ensembl (from RefSeq)
RefSeq -> UniProt (from RefSeq)
Ensembl -> RefSeq (from Ensembl)
Ensembl -> UniProt (from Ensembl)


The conservative approach is to accept a mapping between A and B only when both directions agree, i.e. (A -> B (from A)) and (B -> A (from B)) name the same pair.
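
To make the agreement rule concrete, here is a minimal sketch of one way it could be implemented. It assumes each source's mapping has already been loaded as a list of (source_id, target_id) pairs; the function name `reciprocal_pairs` and all identifiers in the demo are hypothetical placeholders, not real accessions.

```python
def reciprocal_pairs(a_to_b, b_to_a):
    """Return the (a, b) pairs that appear in both directions.

    a_to_b: iterable of (a_id, b_id) pairs as published by source A.
    b_to_a: iterable of (b_id, a_id) pairs as published by source B.
    """
    forward = set(a_to_b)
    # Flip B's pairs so both sets are keyed as (a_id, b_id).
    backward = {(a_id, b_id) for (b_id, a_id) in b_to_a}
    return forward & backward


# Hypothetical identifiers, for illustration only.
uniprot_to_ensembl = [("P_1", "ENSP_1"), ("P_2", "ENSP_2"), ("P_3", "ENSP_3")]
ensembl_to_uniprot = [("ENSP_1", "P_1"), ("ENSP_2", "P_9"), ("ENSP_3", "P_3")]

agreed = reciprocal_pairs(uniprot_to_ensembl, ensembl_to_uniprot)
print(sorted(agreed))  # [('P_1', 'ENSP_1'), ('P_3', 'ENSP_3')]
```

Working with sets of pairs rather than dictionaries keeps one-to-many mappings intact, which matters when several identifiers in one database point to a single identifier in another.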

If you want 100% clarity at the cost of 100% overkill, download the .faa files and break out MAFFT (or an equivalent aligner).
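
Before reaching for an aligner, a cheaper first pass is to group identifiers whose sequences are exactly identical. The sketch below does only that: it is plain string comparison, not an alignment, so it will not tell you how near-identical sequences differ. The helper names (`read_fasta`, `group_identical`) and the toy records are made up, and it assumes the identifier is the first word of each FASTA header and that a trailing `*` marks the stop codon.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (identifier, sequence); the identifier is the first word of the header."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def group_identical(*handles):
    """Map each distinct sequence to the set of identifiers that carry it."""
    groups = defaultdict(set)
    for handle in handles:
        for name, seq in read_fasta(handle):
            # Normalise case and drop a trailing stop-codon marker, if present.
            groups[seq.upper().rstrip("*")].add(name)
    return groups


# Toy records with hypothetical identifiers.
refseq_faa = io.StringIO(">NP_demo1 toy record\nMKTAY\nIAKQ\n>NP_demo2\nMKTAYIAKR\n")
ensembl_faa = io.StringIO(">ENSP_demo1\nMKTAYIAKQ*\n")

groups = group_identical(refseq_faa, ensembl_faa)
shared = [sorted(ids) for ids in groups.values() if len(ids) > 1]
print(shared)  # [['ENSP_demo1', 'NP_demo1']]
```

Anything left ungrouped after this pass is a candidate for a proper alignment.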

0

Thanks a lot for your response! Yes, this is totally annoying, and it is the same for the other part of my analysis, phenotype IDs (even more annoying, because you'll need grouping or propagating, etc.).

Conservative mapping sounds like a clever idea! I think I'll still keep a "central standard", either ENSP or UniProt. Will give it a try. Also, do you happen to know where I can find the mapping from RefSeq to Ensembl/UniProt? I can see the mapping service from the other two but not from RefSeq, and I think people usually recommend BioMart (which is from Ensembl) to map between Ensembl and RefSeq.

0

RefSeq maintains a mapping here: https://ftp.ncbi.nlm.nih.gov/refseq/uniprotkb/gene_refseq_uniprotkb_collab.gz

though from the readme (also here) it looks like they're taking UniProt's mappings.

For Ensembl, the mapping is in the MANE gff file which has ENSP and NM_ ids: https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/MANE.GRCh38.v1.0.ensembl_genomic.gff.gz
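
As a rough illustration of pulling ENSP/NM pairs out of a GFF3 file like this one, here is a small sketch. It relies on assumptions that should be checked against the actual file: that the protein ID sits in a `protein_id` attribute and that the RefSeq accession is listed under `Dbxref` with a `RefSeq:` prefix. The function names and the demo line (with placeholder IDs) are invented for the example.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def ensp_refseq_pairs(lines, xref_prefix="RefSeq:"):
    """Collect (protein_id, RefSeq accession) pairs from GFF3 lines.

    ASSUMPTION: attribute names 'protein_id' and 'Dbxref' and the 'RefSeq:'
    prefix are guesses about the file layout, not confirmed here.
    """
    pairs = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a standard nine-column feature line
        attrs = parse_attributes(cols[8])
        protein = attrs.get("protein_id")
        if not protein:
            continue
        for xref in attrs.get("Dbxref", "").split(","):
            if xref.startswith(xref_prefix):
                pairs.add((protein, xref[len(xref_prefix):]))
    return pairs


# Synthetic feature line with placeholder identifiers.
demo_lines = [
    "##gff-version 3\n",
    "chr1\tdemo\tCDS\t100\t200\t.\t+\t0\t"
    "ID=cds1;protein_id=ENSP_demo1;Dbxref=GeneID:0,RefSeq:NM_demo1\n",
]
pairs = ensp_refseq_pairs(demo_lines)
print(pairs)
```

For the downloaded file, `gzip.open(path, "rt")` from the standard library yields lines that can be passed straight to `ensp_refseq_pairs`.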

0

Thanks for sharing the great resource! I'm not too familiar with RefSeq so it helps a lot. From the README it looks like the majority of proteins should be consistent in both ways (ids and seqs). Will do a small check tomorrow.

Luckily the MANE labels have been integrated into BioMart, so I was querying it directly. But MANE really doesn't cover many protein sequences, and some of the non-MANE ENSPs do match bidirectionally to RefSeq NPs. Will see how the coverage goes.

It looks like three-way mapping filters out more proteins, though the amount left is still acceptable. Take NP_001306888, for example: it appears in the MANE mapping between ENSP and NM, and also in the BioMart mapping between ENSP and UniProtKB, but it just doesn't appear in the RefSeq-UniProtKB mapping. There are two NPs matching the corresponding ENSP though. It might be because the matching in BioMart is done at the transcript level.

0

"There are two NPs matching the corresponding ENSP though. It might be because the matching in BioMart is done at the transcript level."

Yes, and an alternate 3' or 5' UTR will generate a distinct transcript that translates to an identical protein.

0

That's fair. After a careful look, it seems MANE still only considers one transcript & protein product for each gene. In other words, there seems to be no mapping from RefSeq NP to Ensembl ENSP at the protein level. For my purpose, I think I can only do a quick validation by comparing the actual sequences of NP and ENSP matched by Ensembl BioMart. Hopefully that won't be too annoying.
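
A quick validation along these lines could look like the sketch below. It assumes the BioMart-matched pairs and the two sets of sequences are already loaded into plain Python structures; the function names, status labels, and toy data are all hypothetical.

```python
def classify_pair(seq_a, seq_b):
    """Give a coarse verdict on two protein sequences."""
    # Normalise case and ignore a trailing stop-codon marker.
    a = seq_a.upper().rstrip("*")
    b = seq_b.upper().rstrip("*")
    if a == b:
        return "identical"
    if len(a) != len(b):
        return "length_differs"
    return "substitutions"


def check_pairs(pairs, seqs_a, seqs_b):
    """Classify every mapped (id_a, id_b) pair by comparing its sequences."""
    report = {}
    for id_a, id_b in pairs:
        if id_a not in seqs_a or id_b not in seqs_b:
            report[(id_a, id_b)] = "missing_sequence"
        else:
            report[(id_a, id_b)] = classify_pair(seqs_a[id_a], seqs_b[id_b])
    return report


# Toy data with placeholder identifiers.
np_seqs = {"NP_demo1": "MKTAYIAKQ", "NP_demo2": "MKTAYIAKQR", "NP_demo3": "MKTAYIAKR"}
ensp_seqs = {"ENSP_demo1": "MKTAYIAKQ*"}
mapped = [
    ("NP_demo1", "ENSP_demo1"),
    ("NP_demo2", "ENSP_demo1"),
    ("NP_demo3", "ENSP_demo1"),
    ("NP_demo4", "ENSP_demo1"),
]

report = check_pairs(mapped, np_seqs, ensp_seqs)
for pair, status in report.items():
    print(pair, status)
```

Keeping a separate "missing_sequence" status makes it easier to spot identifiers that differ only by a version suffix between the mapping table and the FASTA headers.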

1

"MANE still only considers one transcript & protein product for each gene."

That is by design. MANE is targeted for clinical reporting. See: https://www.ncbi.nlm.nih.gov/refseq/MANE/

0

Thanks for sharing the link. I checked the mappings between proteins that are not standardized (by "standardized" I mean covered by either MANE or the RefSeq-UniProtKB mapping), and they are quite messy. After careful thought, I believe it's better to stick to MANE-standardized proteins (similar to using HGNC, as you proposed initially), which most ClinVar records have also been standardized to.