Is the lowest-numbered Ensembl transcript ID always the "canonical" transcript?
2.4 years ago

I would like to identify a "canonical" transcript for every protein-coding gene in Ensembl. For project-related reasons, I'm using the EnsDb.Hsapiens.v75 package in R. I realize, of course, that "canonical" is a working definition at best, and inappropriate in some cases - but for ease of graphing some data I just want one transcript per gene for now. From manually inspecting genes in Ensembl, it looks like the lowest-numbered transcript ID for each corresponds to what I'm looking for. Some code to pull out a few examples:


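library(EnsDb.Hsapiens.v75)
library(dplyr)

# All gene IDs, annotated with symbol and biotype
genes <- keys(EnsDb.Hsapiens.v75, keytype='GENEID')
ensembl <- AnnotationDbi::select(EnsDb.Hsapiens.v75, keys=genes, keytype='GENEID',
                                 columns=c('GENEID', 'SYMBOL', 'GENEBIOTYPE'))
# dplyr:: is spelled out so that no other package's filter() is picked up
ensembl_cds <- dplyr::filter(ensembl, GENEBIOTYPE=='protein_coding')
# Transcripts for the protein-coding genes only (not for every gene)
ensembl_cds_tx <- AnnotationDbi::select(EnsDb.Hsapiens.v75, keys=ensembl_cds$GENEID, keytype='GENEID',
                                        columns=c('SYMBOL', 'TXID'))

gois <- c('RSPO1', 'PRSS1', 'CDH1')
# Sort by symbol, then transcript ID, and keep the first (lowest) ID per symbol
gois_tx <- dplyr::filter(ensembl_cds_tx, SYMBOL %in% gois) %>% arrange(SYMBOL, TXID) %>% print()
gois_tx_lowest <- gois_tx[!duplicated(gois_tx$SYMBOL),] %>% print()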

Each of the lowest transcript IDs pulled out above (ENST00000261769, ENST00000311737, ENST00000356545) corresponds to an Ensembl transcript for the respective gene (CDH1, PRSS1, RSPO1) that matches RefSeq and the Consensus CDS database. (For RSPO1, though, three other transcripts also have RefSeq matches, which speaks to the arbitrariness of picking a single canonical transcript.)

My question is, is this the general practice across the Ensembl transcript database, that the lowest numbered transcript for a gene corresponds to a canonical or semi-canonical transcript, or have I just gotten lucky so far?

transcript ensembl bioconductor r • 1.1k views

Emily_Ensembl can clarify but I doubt that is the case.

We had talked about using data from MANE in one of your other threads. MANE probably represents the most current understanding of human transcripts (since that is an active project). If you are not finding genes in that set then they may have been reassigned/renamed/changed in some way.


I'm pretty sure it's not the case. In fact, if we define canonical as the transcript in the RefSeq or CCDS releases of the same date, then I think there are quite a lot of cases in Ensembl v75 where no Ensembl transcript is a perfect match. I think that in later releases of all three databases, a lot of work has been done to make them more comparable.

2.4 years ago
Emily 23k

No. The numbers are arbitrary. The canonical transcript is the one which is labelled canonical, which you can get as a filter or an attribute.

The stable IDs are assigned in order, so the first transcript ever identified was ENST00000000001, the second ENST00000000002, etc. This means that, for a given gene, the transcript with the lowest number was the first one to be identified. In all probability, the first one identified is also the most highly expressed, highly conserved and well-studied, which makes it coincidentally also the canonical. But that's not always the case.
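
To make the difference concrete, here is a minimal sketch in base R that compares the "lowest ID per gene" heuristic with picking the transcript flagged as canonical. Everything in it is hypothetical: the table `tx`, the gene and transcript IDs, and the `is_canonical` column are made up for illustration and are not taken from EnsDb.Hsapiens.v75. With real data you would first pull the canonical flag from Ensembl. In recent BioMart releases I believe it is exposed as `transcript_is_canonical`, but treat that name as an assumption and check `listAttributes()` for your release.

```r
# Hypothetical transcript table; the IDs and flags are invented for illustration.
tx <- data.frame(
  gene_id      = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  tx_id        = c("TX0000300", "TX0000120", "TX0000450",
                   "TX0000210", "TX0000990"),
  is_canonical = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Heuristic: lowest-numbered ID per gene. The IDs are zero-padded to a fixed
# width, so sorting them as strings gives the same order as sorting by number.
tx_sorted <- tx[order(tx$gene_id, tx$tx_id), ]
lowest <- tx_sorted[!duplicated(tx_sorted$gene_id), c("gene_id", "tx_id")]

# Flag-based choice: keep only the rows marked as canonical.
canonical <- tx[tx$is_canonical, c("gene_id", "tx_id")]

# Put both choices side by side and mark the genes where they agree.
cmp <- merge(lowest, canonical, by = "gene_id",
             suffixes = c("_lowest", "_canonical"))
cmp$agree <- cmp$tx_id_lowest == cmp$tx_id_canonical
print(cmp)
```

In this toy table the two approaches agree for `geneB` but not for `geneA`, which is the kind of case where relying on the ID order would silently pick the wrong transcript.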

