Question

How to get the canonical transcript from an Ensembl gene ID

17 months ago

galbodek ▴ 20

Hello, I want to get the canonical transcript of a particular gene from its Ensembl gene ID. I am using the pyensembl package. With "transcript_ids_of_gene_id" I can get a list of all transcripts associated with the gene ID, but nothing tells me which one is canonical. Any suggestions?

Thank you in advance :)

canonical pyensembl transcripts ensemble python • 1.2k views

updated 17 months ago by benformatics 3.9k • written 17 months ago by galbodek ▴ 20

score 2 · Answer 1 · 2022-11-15

You can use biomaRt in R: https://bioconductor.org/packages/release/bioc/html/biomaRt.html. Restricting the query with a filter would probably be faster, but downloading everything and subsetting it works.

library(biomaRt)

mart <- useEnsembl("ensembl",dataset="hsapiens_gene_ensembl")
## get everything
BM.info <- getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id","hgnc_symbol","transcript_is_canonical"),mart=mart)

subset(BM.info, hgnc_symbol == 'SRSF2')
       ensembl_gene_id ensembl_transcript_id hgnc_symbol transcript_is_canonical
108336 ENSG00000161547       ENST00000359995       SRSF2                       1
108337 ENSG00000161547       ENST00000392485       SRSF2                      NA
108338 ENSG00000161547       ENST00000585202       SRSF2                      NA
108339 ENSG00000161547       ENST00000582449       SRSF2                      NA
108340 ENSG00000161547       ENST00000586778       SRSF2                      NA
108341 ENSG00000161547       ENST00000452355       SRSF2                      NA
108342 ENSG00000161547       ENST00000589919       SRSF2                      NA
108343 ENSG00000161547       ENST00000508921       SRSF2                      NA
108344 ENSG00000161547       ENST00000592676       SRSF2                      NA
108345 ENSG00000161547       ENST00000583836       SRSF2                      NA
108346 ENSG00000161547       ENST00000358156       SRSF2                      NA

## canonical transcripts
BM.info.canon <- subset(BM.info,transcript_is_canonical == 1)

subset(BM.info.canon, hgnc_symbol == 'SRSF2')
       ensembl_gene_id ensembl_transcript_id hgnc_symbol transcript_is_canonical
108336 ENSG00000161547       ENST00000359995       SRSF2                       1

print(head(BM.info.canon$ensembl_transcript_id))
[1] "ENST00000387314" "ENST00000389680" "ENST00000387342" "ENST00000387347" "ENST00000386347" "ENST00000361390"

EDIT: If you want to do this in Python, you should be able to roughly replicate this script with something like https://github.com/jrderuiter/pybiomart.
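For what it's worth, here is a rough, untested-against-the-live-server sketch of what that Python version might look like. It assumes pybiomart's `Dataset.query` accepts the same attribute names as the biomaRt call above and that `use_attr_names=True` keeps those names as column headers. The function names `canonical_transcripts` and `fetch_records` are made up for this example. Like the R script, it downloads everything and subsets afterwards rather than using a server-side filter.

```python
def canonical_transcripts(records, gene_id):
    """Return the transcript IDs flagged as canonical for one Ensembl gene ID.

    `records` is a list of dicts keyed by BioMart attribute name
    (hypothetical helper, not part of pybiomart).
    """
    hits = []
    for row in records:
        if row.get("ensembl_gene_id") != gene_id:
            continue
        flag = row.get("transcript_is_canonical")
        # BioMart reports 1 for the canonical transcript and a missing
        # value (NA in R, NaN in pandas) for all the others.
        if flag == 1 or str(flag).strip() in {"1", "1.0"}:
            hits.append(row["ensembl_transcript_id"])
    return hits


def fetch_records():
    """Download gene/transcript pairs with the canonical flag (needs network)."""
    # Imported here so the helper above can be used without pybiomart installed.
    from pybiomart import Dataset

    dataset = Dataset(name="hsapiens_gene_ensembl", host="http://www.ensembl.org")
    table = dataset.query(
        attributes=[
            "ensembl_gene_id",
            "ensembl_transcript_id",
            "transcript_is_canonical",
        ],
        use_attr_names=True,  # assumption: keep attribute names as column names
    )
    return table.to_dict("records")
```

Calling `canonical_transcripts(fetch_records(), "ENSG00000161547")` should then play the role of the `subset(...)` calls above. Going by the biomaRt output, I would expect it to return ENST00000359995 for SRSF2, but I have not run it against the current Ensembl release.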