Question

Stability of Ensembl and RefSeq stable IDs

4

8.7 years ago

crisime ▴ 290

Hi everyone,

I am working with Ensembl's stable IDs for transcripts (ENSTs) and thought the idea of a stable ID was to point to transcripts that are identical in sequence (DNA and protein) across different releases of the database.

I have now found some ENSTs whose sequence changes substantially between releases:

Examples:

SYNGAP1 ENST00000418600 - lost coding Exon, -58aa, -88bp, based on different Vega transcript:

ENST00000418600 Ensembl release 78

ENST00000418600 Ensembl release 81

CTDP1 ENST00000299543 - lost coding Exon, -119 aa, -373bp, different annotation method

ENST00000299543 Ensembl release 75

ENST00000299543 Ensembl release 81

So I have some questions arising from this:

1. What is guaranteed to stay stable for Ensembl's "stable" IDs (mainly ENSTs)?

All information I could find on this is:

Ensembl's stable identifiers are mapped between re-annotation processes using a combination of location based mappings and those generated by Exonerate. The process performs exon based mapping and deriving subsequent identifier mappings based upon these findings.

2. Why is the community using ENSTs without version numbers (which exist and, according to the documentation, would guarantee sequence stability), while RefSeq NM_ accessions are usually used with a version? Examples:

CCDS, HGVS (recommending variant annotation on ENST without version), UniProt, ...

3. What stability do RefSeq stable IDs guarantee? Could you point me to any document defining the stable features I can assume for NM_ accessions, with and without a version?

Thanks for any help!

Edit: Reformatted the links and picked better database versions

ensembl refseq • 3.5k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by crisime ▴ 290

0

8.7 years ago

Jean-Karim Heriche 27k

As I understand it, Ensembl keeps IDs the same between two annotations of the genome if there are no significant changes between the two annotations. Versioning makes it possible to keep track of small changes, which most of the time we don't care about, hence it's not often used. In your first example, it looks like the difference relates to the annotation of the 5' and 3' UTRs, probably not enough to change the ID since the exon structure is the same.

It's never been clear to me what RefSeq version numbers represent but then I don't work with RefSeq most of the time.

ADD COMMENT • link 8.7 years ago by Jean-Karim Heriche 27k

0

Hi Jean-Karim,

I think you did not see what I meant in my examples. Please correct me if I'm wrong.

I think you compared SYNGAP1-001 with SYNGAP1-001 in the two versions of Ensembl. Here you are correct. It looks like just a change in UTR length (I didn't check for other changes). I don't see a problem here, because a new ENST (ENST00000629380 version 79) was assigned in version 79 (even for a UTR change!).

The problem I see is that the old ENST of SYNGAP1-001 (ENST00000418600 version78) lives on in SYNGAP1-010 (ENST00000418600 version79), which is 58 aa shorter because of an alternative splice site in exon 18, which leads to a stop, leaving exon 19 untranslated. So there is a big change at the exon and protein level! I would expect that to be called a significant change between annotations, so they should have different stable IDs, as you say.

I would say the same.

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by crisime ▴ 290

0

I did look at ENST00000418600 in both versions, but I now see what you mean. However, it looks as if ENST00000418600 version78 and ENST00000418600 version79 are still annotated as producing the same UniProt entry (Q96PV0) although the Ensembl translation is different. I would ask the Ensembl helpdesk for clarification.

ADD REPLY • link 8.7 years ago by Jean-Karim Heriche 27k

0

Hi Jean-Karim,

In UniProt, an entry can contain different isoforms of a protein. So it is common for many Ensembl transcripts with different protein sequence annotations to map to the same entry.

I already asked the Ensembl helpdesk, but got no helpful answer yet.

I did not find any information or discussion about this and assumed it might be of general interest because of the importance of stable reference sequences.

Thought it would be an easy one for biostars.

ADD REPLY • link 8.7 years ago by crisime ▴ 290

score 1 · Accepted Answer · 2015-09-04

In the meantime I got a reply from the Ensembl helpdesk (there were some problems with reopening the question):

Summarised answer to question 1:

- they consider maintaining the IDs for my two examples wrong, but they never change IDs retrospectively

- they try to be conservative in their mapping, in the sense that they try to keep their IDs alive

- I did not get a precise answer to the general question of what stability is guaranteed, so I guess you have to expect any kind of change.

So the take-home message for me is never to use ENSTs without the version of Ensembl I refer to, or at least to use the ENST with its version number (which, according to the documentation, should guarantee sequence stability at exon level).
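A minimal sketch of how I would enforce that rule in a script, assuming the common `ACCESSION.VERSION` notation. The function names and the regular expression are my own, and the version numbers `.1` and `.2` used in the example are placeholders, not the real versions of this transcript:

```python
import re
from typing import NamedTuple

# Assumed notation: accession, a dot, then an integer version.
# ENST accessions are "ENST" plus 11 digits; NM_ accessions are "NM_" plus digits.
VERSIONED_ID = re.compile(r"^(?P<accession>ENST\d{11}|NM_\d+)\.(?P<version>\d+)$")


class VersionedId(NamedTuple):
    accession: str
    version: int


def parse_versioned_id(text: str) -> VersionedId:
    """Split 'ACCESSION.VERSION'; reject identifiers that carry no version."""
    match = VERSIONED_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned ENST/NM_ identifier: {text!r}")
    return VersionedId(match["accession"], int(match["version"]))


def same_record(a: str, b: str) -> bool:
    """True only if both the accession and the version agree."""
    return parse_versioned_id(a) == parse_versioned_id(b)
```

With this, `same_record("ENST00000418600.1", "ENST00000418600.2")` is false even though the stable ID is the same, and a bare `ENST00000418600` is rejected instead of being silently accepted.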

Regarding question 2, they told me they would discuss making the version number more prominent on their website, which would make its importance more obvious to the community. (edit: they did in Ensembl 85)

Answer to question 3, from the RefSeq FAQ:

What updates to RefSeq records need a simple version number change and which require a new accession number?

The following cases require a simple version change:

- the RefSeq record is updated to make minor corrections (e.g., fix mismatches or indels)

- the RefSeq record is updated by an end extension or trim in the UTRs but does not add or remove exons or change any splice sites

- the RefSeq record is updated by an end extension or trim in the 5’ or 3’ end of the coding sequence and/or UTRs and DOES add or remove terminal exons. In this case, the replaced or updated RefSeq must be completely contained within the other version (i.e., no addition or removal of internal exons or changes to any splice sites).

All other cases require the old record to be suppressed rather than updated, and a new record with a new accession number to be created. In addition, RefSeqGene records cannot be updated when the exon definition or protein length is changed.
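My reading of these FAQ rules as a small decision function, as a rough sketch only. The class and field names are hypothetical, and the flags cover just the kinds of change the FAQ text names explicitly (it does not model the RefSeqGene restriction or any other "all other cases"):

```python
from dataclasses import dataclass


@dataclass
class RefSeqUpdate:
    """Hypothetical description of how an updated record differs from the old one."""
    internal_exons_changed: bool = False  # an internal exon was added or removed
    splice_sites_changed: bool = False    # any splice site was changed
    terminal_exons_changed: bool = False  # an exon was added or removed at the 5' or 3' end
    one_contains_other: bool = True       # one version is completely contained in the other


def needs_new_accession(update: RefSeqUpdate) -> bool:
    """Return True for a new accession, False for a simple version change."""
    # Internal exon or splice-site changes never qualify as a version change.
    if update.internal_exons_changed or update.splice_sites_changed:
        return True
    # Terminal exon changes qualify only if one version contains the other.
    if update.terminal_exons_changed and not update.one_contains_other:
        return True
    # Minor corrections and pure end extensions or trims are version changes.
    return False
```

Under this reading, a change like the one in my SYNGAP1 example (an alternative splice site) corresponds to `RefSeqUpdate(splice_sites_changed=True)` and would require a new accession rather than a version change.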

edit: added answer to question 3 and updated answer 2