Ensembl: Protein coding transcript ids
9.5 years ago
bsmith030465 ▴ 240

Hi,

I wanted to use only protein coding transcripts from ensembl (Biomart - bioconductor). Is there a way to differentiate these from the transcript ids? Some example data that I have:

gene_symbol      ensembl_id ensembl_transcript_id transcript_start transcript_end ensembl_exon_id exon_chrom_start exon_chrom_end strand chromosome_name
1      OR4F29 ENSG00000235249       ENST00000426406           367640         368634 ENSE00002316283           367640         368634      1               1
2      OR4F16 ENSG00000185097       ENST00000332831           621059         622053 ENSE00002324228           621059         622053     -1               1
3      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00001637883           860260         860328      1               1
4      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00001763717           861302         861393      1               1
5      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00002727207           865535         865716      1               1

=========

or, how can I change my query such that only protein coding transcripts are returned? My query is:

gb <- getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name"),
                filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl) 

thanks!

id biomart ensembl transcript protein coding • 5.1k views
9.5 years ago
komal.rathi ★ 4.1k

Add the attribute transcript_biotype to your query, save to a data.frame and filter:

res = getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name","gene_biotype","transcript_biotype"), filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl)

# only keep transcripts that are protein coding
res = res[which(res$transcript_biotype=="protein_coding"),] 
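
As a small offline illustration of the filtering step, here is a sketch that applies the same subsetting to a made-up data.frame and then collapses the exon-level rows to one row per transcript. The object `res_demo`, its placeholder IDs and its biotype labels are invented for illustration and are not real BioMart output.

```r
# Hypothetical stand-in for a getBM() result: one row per exon.
res_demo <- data.frame(
  ensembl_transcript_id = c("ENST_A", "ENST_A", "ENST_B", "ENST_C"),
  ensembl_exon_id       = c("ENSE_1", "ENSE_2", "ENSE_3", "ENSE_4"),
  transcript_biotype    = c("protein_coding", "protein_coding",
                            "retained_intron", "protein_coding"),
  stringsAsFactors = FALSE
)

# Keep only exon rows that belong to protein-coding transcripts.
coding <- res_demo[res_demo$transcript_biotype == "protein_coding", ]

# One row per transcript with its biotype, dropping the exon-level repeats.
tx_biotype <- unique(res_demo[, c("ensembl_transcript_id", "transcript_biotype")])
```

The second table is a quick way to see, per transcript, which biotype it was assigned before any rows are thrown away.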

Thanks!

I probably haven't explained myself adequately or clearly! For a gene, which is protein coding, there may be several transcripts associated with it. Some of these may be protein coding ("protein coding"), and other transcripts (for the same gene) may be non-coding ("retained intron" or something else).

So, for each transcript how can I retrieve whether it is classified as 'protein coding' or something else?

And apologies for an inadequate problem formulation!


bsmith030465 I have edited my answer. Please check. Use the transcript_biotype attribute and filter out any rows that are not protein_coding.

9.5 years ago
onuralp ▴ 190

No, you cannot deduce this information from the transcript ID alone.

You have two options:

  1. If you use the BioMart web interface, you can choose "transcript biotype" from the attributes to report.
  2. If you want to use the biomaRt package, modify your query as follows:

Your query:

gb <- getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name"), filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl)

Modified query - notice the new attribute transcript_biotype:

gb <- getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name","transcript_biotype"), filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl)
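
If you would rather have BioMart return only protein-coding transcripts in the first place, one option is to filter on the server side. The sketch below assumes that `transcript_biotype` is also available as a filter in your mart (worth confirming with `listFilters(ensembl)`), and `fetch_coding_exons` is a hypothetical helper name, not part of biomaRt.

```r
# Sketch only: fetch_coding_exons is a hypothetical helper name.
# Assumes the biomaRt package is installed and that `mart` is a dataset
# object such as the `ensembl` object used in the queries above.
fetch_coding_exons <- function(gene_ids, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_transcript_id", "transcript_start", "transcript_end",
                   "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end",
                   "strand", "chromosome_name", "transcript_biotype"),
    # With more than one filter, values must be a list in the same order.
    filters = c("ensembl_gene_id", "transcript_biotype"),
    values  = list(gene_ids, "protein_coding"),
    mart    = mart
  )
}
```

It would be called as `gb <- fetch_coding_exons(ensembl_id, ensembl)`. Keeping `transcript_biotype` in the attributes makes it easy to check that the filter did what you expected.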

See also: What do the different biotypes in Ensembl mean?


onuralp I think you did not read the other answers but I mentioned that in my answer already. Look above.


Hey Komal! Sorry, I have not seen your answer. I use an RSS reader to sift through the questions. I upvoted your comment.


No worries. I like to reinforce the 'no redundancy' policy whenever and wherever possible :)

9.5 years ago
Manvendra Singh ★ 2.2k

Also add the Ensembl protein ID to your attributes.

Treat that column as present/absent [0,1]: any row with an ENSPXXXXXXXXXXX value carries the Ensembl protein ID of a protein-coding transcript.
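
A minimal sketch of that present/absent check, assuming the protein ID is requested through an attribute named `ensembl_peptide_id` and that transcripts without a protein come back as an empty string or NA. The data.frame `pep_demo` and its placeholder IDs are made up for illustration.

```r
# Hypothetical stand-in for a getBM() result that includes ensembl_peptide_id.
pep_demo <- data.frame(
  ensembl_transcript_id = c("ENST_A", "ENST_B", "ENST_C"),
  ensembl_peptide_id    = c("ENSP_A", "", NA),
  stringsAsFactors = FALSE
)

# TRUE where a protein ID is present (neither NA nor an empty string).
has_protein <- !is.na(pep_demo$ensembl_peptide_id) &
  nzchar(pep_demo$ensembl_peptide_id)

# Rows for transcripts that have an associated protein ID.
with_protein <- pep_demo[has_protein, ]
```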

HTH
