Question

Does Ensembl have a descriptor for the 'full-length' transcript?

0

8.5 years ago

george.wiggins ▴ 10

Simply put, all I want to know is whether there is an identifier in Ensembl GTFs that indicates which transcript is the 'full-length' transcript. My plan is to use the transcript_name provided (e.g. APC-001), where the suffix is the transcript number. I assume that 001 would be the full-length transcript, but I need to be sure before I proceed.

If it is not, does anyone have a suggestion for how to identify the consensus FL transcript across the whole transcriptome?

gtf ENSEMBL transcripts • 2.2k views

ADD COMMENT • link updated 8.5 years ago by Jean-Karim Heriche 27k • written 8.5 years ago by george.wiggins ▴ 10

2

Please do not send identical messages to BioStars and Ensembl helpdesk. It is a waste of effort if we are trying to respond and the people on BioStars are as well. I will delete the Ensembl helpdesk ticket as Jean-Karim has already answered your question.

ADD REPLY • link 8.5 years ago by Emily 23k

0

Thank you for your quick reply. I wasn't aware that Ensembl was so good at keeping up with questions on BioStars. I was assuming I would get a community response here and a more official response from the helpdesk. Nevertheless, I won't repeat questions to the helpdesk in the future.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by george.wiggins ▴ 10

score 1 · Answer 1 · 2015-11-05

1

8.5 years ago

Jean-Karim Heriche 27k

There can be several full-length transcripts for a given gene. Depending on your problem, you may be able to use CCDS transcripts or if you need only one transcript per gene, you'll have to come up with some rule(s) to decide which one to pick among the different transcripts associated with a gene.

Edit: Just remembered that EnsEMBL defines a canonical transcript for a gene which can be retrieved with the API:

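# Assumes $gene_adaptor is an Ensembl Core API gene adaptor and $gene_id holds a stable gene ID.
my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);

# The transcript that Ensembl has designated as canonical for this gene.
my $canonical_transcript = $gene->canonical_transcript();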
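
If you would rather stay with the GTF than use the API, the sketch below filters transcript records by their `tag` attribute. It assumes your GTF carries `tag "Ensembl_canonical"` (or `tag "CCDS"`) on transcript lines. More recent Ensembl GTF releases appear to include these tags, but older files may not, so check yours first. The function names and the `G1`/`T1` identifiers are made up for illustration.

```python
import re

# Matches quoted GTF attributes such as: transcript_id "T1";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

def parse_attributes(field):
    """Return GTF attributes as a dict of lists (keys such as 'tag' can repeat)."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        attrs.setdefault(key, []).append(value)
    return attrs

def tagged_transcripts(gtf_lines, tag="Ensembl_canonical"):
    """Map gene_id -> transcript_ids whose 'transcript' record carries `tag`."""
    result = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # only look at transcript-level records
        attrs = parse_attributes(cols[8])
        if tag in attrs.get("tag", []):
            gene_id = attrs["gene_id"][0]
            result.setdefault(gene_id, []).append(attrs["transcript_id"][0])
    return result

# Hypothetical records, not real Ensembl entries.
demo = [
    '#!genome-build GRCh38\n',
    '5\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "basic"; tag "Ensembl_canonical";\n',
    '5\thavana\ttranscript\t100\t700\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; tag "basic";\n',
    '5\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; tag "Ensembl_canonical";\n',
]
print(tagged_transcripts(demo))  # {'G1': ['T1']}
```

To run it on a real file, pass an open file handle, e.g. `tagged_transcripts(open("annotation.gtf"))`, where the file name is a placeholder.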

ADD COMMENT • link 8.5 years ago by Jean-Karim Heriche 27k

0

Thanks for your reply. I am aware that there are numerous FL transcripts, but I was naively hoping that there might be a system in which people use a single 'agreed' FL transcript.

Maybe if I explain my problem, you might be able to point me in the right direction? I have targeted RNA-Seq data; the assays were designed to overlap only the junctions of genes. What I am trying to do is annotate a junction count file with exon numbering. To do this, I need to select one transcript to be the FL (this can be relatively arbitrary, but it would be good to have a logical system) and number exons relative to this transcript.
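
As a minimal sketch of the numbering step, assuming the exons of the chosen transcript are already available as GTF-style `(start, end)` pairs (1-based, inclusive) together with the strand, exons can be numbered in transcription order like this. The function name is hypothetical.

```python
def number_exons(exons, strand):
    """Number exons 1..n in transcription order.

    exons:  iterable of (start, end) tuples, 1-based inclusive as in GTF.
    strand: '+' or '-'.
    Returns a dict {(start, end): exon_number}.
    """
    # On the minus strand the exon with the highest coordinates is exon 1.
    ordered = sorted(set(exons), reverse=(strand == "-"))
    return {exon: i for i, exon in enumerate(ordered, start=1)}

# Made-up coordinates for illustration.
exons = [(300, 400), (100, 200), (500, 600)]
print(number_exons(exons, "+"))  # {(100, 200): 1, (300, 400): 2, (500, 600): 3}
print(number_exons(exons, "-"))  # {(500, 600): 1, (300, 400): 2, (100, 200): 3}
```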

I can't use transcript assembly tools (Cufflinks etc.) as I am missing too much exonic data. This will have to be a junction-abundance analysis (I might be able to tease more out later).

I have compressed my GTF so that each exon or UTR (based on start and stop positions) is represented only once, with a list of the transcripts that overlap those exact coordinates. Now I need to match these coordinates to my junction file (easy enough) and name each junction 'exon x-y'.
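
The matching step could look roughly like the sketch below. It assumes each junction is given as `(left_exon_end, right_exon_start)` in genomic coordinates, i.e. the last base of the left-hand exon and the first base of the right-hand exon. Junction files from some aligners report intron coordinates instead, in which case the values would need shifting by one base. The function name and coordinates are invented for the example.

```python
def name_junctions(junctions, exon_numbers):
    """Label each junction as 'exon x-y' from exon boundaries.

    junctions:    list of (left_exon_end, right_exon_start) tuples (assumed format).
    exon_numbers: {(start, end): number} for the chosen reference transcript.
    Returns one label per junction; None where a boundary is not an annotated exon edge.
    """
    by_end = {end: n for (_, end), n in exon_numbers.items()}
    by_start = {start: n for (start, _), n in exon_numbers.items()}
    labels = []
    for left_end, right_start in junctions:
        a, b = by_end.get(left_end), by_start.get(right_start)
        if a is None or b is None:
            labels.append(None)  # at least one side is not a known exon boundary
        else:
            first, second = sorted((a, b))  # report in transcription order
            labels.append(f"exon {first}-{second}")
    return labels

# Made-up plus-strand transcript with three exons.
exon_numbers = {(100, 200): 1, (300, 400): 2, (500, 600): 3}
junctions = [(200, 300), (200, 500), (210, 300)]
print(name_junctions(junctions, exon_numbers))  # ['exon 1-2', 'exon 1-3', None]
```

Junctions that come back as `None` use a boundary missing from the chosen transcript, which is a useful way to flag splice sites that only exist in other isoforms.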

ADD REPLY • link 8.5 years ago by george.wiggins ▴ 10

0

If possible, I'd go with the canonical transcripts. In the past, when I needed one representative transcript per gene, I used the one producing the longest protein or, failing that, simply the longest transcript. However, in your case, I think it might be preferable to concatenate all the exons of a gene so that you don't miss alternatively spliced exons that may not be present in the selected representative transcript.
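
Both options can be sketched in a few lines. The example assumes transcripts have already been parsed into a dictionary holding exon and CDS intervals as `(start, end)` pairs. The data layout and function names are hypothetical, and CDS length is used as a stand-in for protein length.

```python
def interval_length(intervals):
    """Total number of bases covered by 1-based inclusive (start, end) intervals."""
    return sum(end - start + 1 for start, end in intervals)

def pick_representative(transcripts):
    """Pick the transcript with the longest CDS; fall back to longest transcript.

    transcripts: {transcript_id: {"exons": [(s, e), ...], "cds": [(s, e), ...]}}
    Non-coding transcripts have an empty "cds" list, so exon length decides.
    """
    return max(
        transcripts,
        key=lambda t: (interval_length(transcripts[t]["cds"]),
                       interval_length(transcripts[t]["exons"])),
    )

def merge_exons(exons):
    """Union of exon intervals pooled from all transcripts of one gene."""
    merged = []
    for start, end in sorted(exons):
        # Overlapping or directly adjacent intervals are fused into one.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]

# Made-up gene with one coding and one non-coding transcript.
gene = {
    "T1": {"exons": [(1, 100), (201, 300)], "cds": [(50, 100), (201, 250)]},
    "T2": {"exons": [(1, 400)], "cds": []},
}
print(pick_representative(gene))                                      # T1
print(merge_exons([(100, 200), (150, 250), (300, 400), (401, 450)]))  # [(100, 250), (300, 450)]
```

One caveat with merging: alternative 5' or 3' splice sites collapse into a single merged exon, so a junction ending inside a merged interval would need to be matched by overlap rather than by exact boundary.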

ADD REPLY • link 8.5 years ago by Jean-Karim Heriche 27k