Ensembl ID differences between assemblies?
4
1
6.8 years ago
jiab ▴ 60

Hi,

I have a list of transcripts, such as: ENST00000416191. I want to find the Ensembl IDs for the genes they are associated with. Would I get identical results, whether I query GRCh38 or GRCh37? Note that I am not interested in their coordinates, only the IDs.

Thank you

Ensembl • 5.0k views
0

In general, Ensembl IDs do not change between assemblies. A new ID may be introduced (e.g. for a new finding) and an existing ID may disappear (e.g. through deprecation). But for an existing entity (gene, contig, transcript, etc.), IDs do not change between assemblies. This would be the same for NCBI, UCSC, EBI, UniProt, etc.

3

The IDs may stay the same but the information (your entity) they are pointing to can change quite a lot!

Same ENST may point to a

in different builds. Depending on what you are using your ID and the collected data for, this can make a huge difference! Stay careful and don't expect too much identity in the underlying data when you search for the same ID in a new database version a few years later.

0

Unfortunately, in some cases an ID does change even within the same genome version. Here is one example I mentioned in a previous post.

4
6.8 years ago
crisime ▴ 290

Hi jiab, to give you an idea of the differences, here is a little bash script that gives you a list of ENSTs mapping to different ENSGs in GRCh37 and GRCh38:

#!/bin/bash
#download of all ENST mapped to ENSG from Ensembls grch37 biomart
wget -O mapping_37.txt 'http://grch37.ensembl.org/biomart/martservice?query=<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1" count="" datasetConfigVersion="0.6"><Dataset name="hsapiens_gene_ensembl" interface="default"><Attribute name="ensembl_transcript_id"/><Attribute name="ensembl_gene_id"/></Dataset></Query>'

#download of all ENST mapped to ENSG from Ensembls grch38 biomart
wget -O mapping_38.txt 'http://www.ensembl.org/biomart/martservice?query=  <Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1" count="" datasetConfigVersion="0.6">  <Dataset name="hsapiens_gene_ensembl" interface="default"> <Attribute name="ensembl_transcript_id"/> <Attribute name="ensembl_gene_id"/> </Dataset> </Query>'

#join the two mappings on ENST
join <(sort mapping_37.txt) <(sort mapping_38.txt) >joined_on_ENST.txt

#get just the lines, where ENSGs differ for grch37 and grch38 for the same ENST
awk '$2!=$3 {print $1, $2, $3}' joined_on_ENST.txt >different_ENSG_37_38.txt

If you have a look at the number of lines in each file:

  wc -l *
 1421 different_ENSG_37_38.txt
 190566 joined_on_ENST.txt
 215170 mapping_37.txt
 218207 mapping_38.txt

This shows that the two mappings share about 190k ENSTs, with about 25k ENSTs found only in GRCh37 and about 28k found only in GRCh38. Of the shared ENSTs, 1421 map to a different ENSG in the two builds. So out of about 243k distinct ENSTs (190566 + 25k + 28k) that you could look up an ENSG for, about 54k (25k + 28k + 1421) would not give the same result in both builds, either because there is no ENSG in one build or because the ENSG differs.
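As a quick check of the rounded figures, the arithmetic can be redone from the exact line counts. This is only a sketch: it assumes each ENST occupies exactly one line in each mapping file, so that line counts equal transcript counts, and the variable names are made up for illustration.

```bash
# Line counts copied from the wc -l output above.
shared=190566     # joined_on_ENST.txt: ENSTs present in both builds
total_37=215170   # mapping_37.txt
total_38=218207   # mapping_38.txt
differ=1421       # different_ENSG_37_38.txt: shared ENSTs with a different ENSG

# ENSTs present in only one build (assumes one line per ENST in each file).
only_37=$((total_37 - shared))
only_38=$((total_38 - shared))

# Distinct ENSTs across both builds, and those that would not give the same ENSG.
all_enst=$((shared + only_37 + only_38))
mismatch=$((only_37 + only_38 + differ))

echo "only in GRCh37:     $only_37"
echo "only in GRCh38:     $only_38"
echo "distinct ENSTs:     $all_enst"
echo "not the same ENSG:  $mismatch"
```

With the counts above this gives 24604 ENSTs only in GRCh37, 27641 only in GRCh38, 242811 distinct ENSTs in total, and 53666 lookups that would not agree between the builds.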

0

Very on point answer, thanks!

2
6.8 years ago

Most likely the gene IDs would be different, because the annotations of GRCh37 are based on Ensembl v75 and are no longer updated, whereas the current annotations of GRCh38 are at version 89. For the same reason, I wouldn't necessarily expect transcript IDs to be conserved either. By the way, you can easily test this yourself using the API.

EDIT: Just to clarify in light of the other answers: a transcript or gene ID wouldn't change if the structure of the object it represents doesn't change, but as the assemblies are different, genes/transcripts in regions that differ between the two assemblies may end up being different. In short, because the assemblies are different, I wouldn't expect their annotations to be identical.
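One possible way to test this for a single transcript is sketched below. It uses the Ensembl REST service rather than any particular API the answer may have had in mind. The assumptions are that `https://rest.ensembl.org` serves the current GRCh38 annotation, that `https://grch37.rest.ensembl.org` serves GRCh37, and that the `lookup/id` response for a transcript carries the gene ID in a string field called `Parent`. The function names are hypothetical.

```bash
# Pull the "Parent" value (assumed to be the gene ID of a transcript)
# out of a lookup/id JSON response read from stdin.
# Prints nothing if the field is absent, e.g. for an error response.
extract_parent() {
    tr -d '\n' | sed -n 's/.*"Parent"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Ask one REST server for a stable ID and print its parent gene ID.
#   $1 = server base URL (assumed: see the lead-in above)
#   $2 = stable ID, e.g. an ENST
lookup_parent() {
    curl -s -H 'Content-type:application/json' "$1/lookup/id/$2" | extract_parent
}
```

Running `lookup_parent https://rest.ensembl.org ENST00000416191` and `lookup_parent https://grch37.rest.ensembl.org ENST00000416191` and comparing the two lines would then show whether the transcript points to the same ENSG in both builds. An empty result for one server suggests the ID is not present there.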

2
6.8 years ago
John Ma ▴ 310

Just check on the Ensembl browser... For this particular gene, GRCh38 and GRCh37 have the same gene-level ID as of Ensembl 89.

1
6.8 years ago

Hello,

As a transcript is bound to a specific gene, you should get the same result in most cases. What can happen is that a transcript that exists in GRCh37 does not exist in GRCh38, and vice versa. Also, you cannot be sure that a transcript with the same ID has the same sequence and exons/introns in both reference genomes.
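To see which transcripts exist in only one build, one option is to compare the ID columns of two mapping files, such as the `mapping_37.txt` and `mapping_38.txt` produced by the script in the answer above. The following is a rough sketch, assuming two-column tab-separated files with the ENST in the first column. The function name is made up.

```bash
# Print the transcript IDs found in the first mapping file but not in the second.
#   $1, $2 = two-column TSV files (ENST, ENSG), ENST assumed to be in column 1
enst_only_in() {
    ids_a=$(mktemp)
    ids_b=$(mktemp)
    # comm needs both inputs sorted the same way, so fix the locale.
    cut -f1 "$1" | LC_ALL=C sort -u > "$ids_a"
    cut -f1 "$2" | LC_ALL=C sort -u > "$ids_b"
    # -23 drops lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$ids_a" "$ids_b"
    rm -f "$ids_a" "$ids_b"
}
```

For example, `enst_only_in mapping_37.txt mapping_38.txt` would list the ENSTs present only in the GRCh37 mapping, and swapping the arguments gives those present only in GRCh38.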

fin swimmer
