Question: Does ENSEMBL have a descriptor for the 'full-length' transcript?
3.4 years ago
New Zealand
george.wiggins wrote:

Simply put, all I want to know is whether there is an identifier in Ensembl GTFs that indicates which transcript is the 'full-length' transcript. My plan is to use the transcript_name provided (e.g. APC-001), where the number identifies the exact transcript. I assume the 001 would be the full-length transcript; however, I need to be sure before I proceed.

If it is not, does anyone have a suggestion for how to identify the consensus FL transcript for the whole transcriptome?

3.4 years ago
EMBL Heidelberg, Germany
Jean-Karim Heriche wrote:

There can be several full-length transcripts for a given gene. Depending on your problem, you may be able to use CCDS transcripts. If you need only one transcript per gene, you'll have to come up with some rule(s) to decide which one to pick among the different transcripts associated with a gene.

Edit: Just remembered that EnsEMBL defines a canonical transcript for a gene which can be retrieved with the API:

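# $gene_adaptor is an Ensembl gene adaptor, e.g. one obtained from the Registry
my $gene = $gene_adaptor->fetch_by_stable_id($gene_id);
# Returns the transcript object that Ensembl marks as canonical for this gene
my $canonical_transcript = $gene->canonical_transcript();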
Thanks for your reply. I am aware that there are numerous FL transcripts, but I was naively hoping that there might be a system in which people use a single 'agreed' FL transcript.

Maybe if I explain my problem, you might be able to point me in the right direction? I have targeted RNA-Seq data; the assays were designed to overlap only the exon junctions of genes. What I am trying to do is annotate a junction count file with exon numbering. To do this I need to select one transcript to be the FL (this can be relatively arbitrary, but it would be good to have a logical system) and number exons relative to this transcript.

I can't use transcript assembly tools (Cufflinks etc.) as I am missing too much exonic data. This will have to be a junction-abundance analysis (I might be able to tease more out later).

I have collapsed my GTF so that each exon or UTR (based on start and stop positions) is represented once, with a list of the transcripts that share those exact coordinates. Now I need to match these coordinates to my junction file (easy enough) and name each junction 'exon x-y'.
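
A minimal sketch of the junction-naming step described above, in Python. It assumes the exons of the chosen transcript are available as (start, end) pairs in GTF coordinates (1-based, inclusive), and that each junction is given as the last base of the upstream exon and the first base of the downstream exon. If the junction file reports intron coordinates instead, the positions would need shifting by one base. The function names are hypothetical and not part of any Ensembl tool.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, as in GTF


def number_exons(exons: List[Exon], strand: str) -> List[Exon]:
    """Order exons 5'->3' so that the first element is exon 1 of the transcript."""
    return sorted(exons, reverse=(strand == "-"))


def name_junction(donor: int, acceptor: int,
                  exons: List[Exon], strand: str) -> Optional[str]:
    """Name a junction 'exon x-y'.

    donor    = last transcribed base of the upstream exon
    acceptor = first transcribed base of the downstream exon
    Returns None if either position is not an exon edge of this transcript.
    """
    up = down = None
    for i, (start, end) in enumerate(number_exons(exons, strand), start=1):
        # On the minus strand the 3' edge of an exon is its genomic start.
        three_prime, five_prime = (end, start) if strand == "+" else (start, end)
        if three_prime == donor:
            up = i
        if five_prime == acceptor:
            down = i
    if up is None or down is None:
        return None
    return f"exon {up}-{down}"
```

Junctions that return None use a boundary absent from the chosen transcript, so they can be set aside and checked against the other transcripts of the gene.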

If possible, I'd go with the canonical transcripts. In the past, when I needed one representative transcript per gene, I used the one producing the longest protein or, failing that, simply the longest transcript. However, in your case, I think it might be preferable to concatenate all the exons of a gene so that you don't miss alternatively spliced exons that may not be present in the selected representative transcript.
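
The "longest protein, otherwise longest transcript" rule could be sketched as follows in Python. This is an illustration under stated assumptions, not an Ensembl-provided method: it assumes a tab-separated GTF with `exon` and `CDS` features carrying quoted `gene_id` and `transcript_id` attributes, it uses summed CDS length as a proxy for protein length, and it breaks exact ties by transcript ID only to keep the output deterministic. The function name `pick_representatives` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Matches GTF attributes of the form: key "value";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def pick_representatives(gtf_lines: Iterable[str]) -> Dict[str, str]:
    """Return {gene_id: transcript_id}, preferring the longest summed CDS,
    then the longest summed exon length."""
    cds_len = defaultdict(int)   # (gene, transcript) -> summed CDS bases
    exon_len = defaultdict(int)  # (gene, transcript) -> summed exon bases
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] not in ("exon", "CDS"):
            continue
        attrs = dict(ATTR.findall(fields[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        key = (attrs["gene_id"], attrs["transcript_id"])
        length = int(fields[4]) - int(fields[3]) + 1  # GTF is 1-based, inclusive
        if fields[2] == "CDS":
            cds_len[key] += length
        else:
            exon_len[key] += length

    best = {}
    for (gene, tx), total in exon_len.items():
        # Non-coding transcripts get a CDS length of 0 and compete on exon length.
        score = (cds_len.get((gene, tx), 0), total, tx)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, tx)
    return {gene: tx for gene, (_, tx) in best.items()}
```

It can be called as `pick_representatives(open("annotation.gtf"))`, where the file name is a placeholder.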

