Question

What Is Rationale Behind UCSC/Ensembl Canonical Transcripts?

19

12.6 years ago

Obi Griffith 20k

As I understand it, for each gene locus in the UCSC known genes set, a single "canonical transcript" is reported. Please correct me if I am misunderstanding. I can see where this would be sometimes useful, like when you want a single TSS for each gene, and you need to choose one option. Does anyone know how the canonical transcript is chosen for each gene? What is the rationale behind this? A follow-up thought: It seems dangerous for downstream analysis (e.g., expression analysis) to use a single transcript as representative of the gene. Who is to say that there aren't two equally valid/important transcripts that could both be considered "canonical"?

UPDATE: I edited this post to ask the same question about Ensembl. I'm hoping we can document here the similar thought process that Ensembl uses for choosing canonical transcripts.

ucsc • 16k views

ADD COMMENT • link updated 8.9 years ago by crisime ▴ 290 • written 12.6 years ago by Obi Griffith 20k

score 9 · Answer 1 · 2012-03-23

12.6 years ago

Gjain 5.8k

Hi Obi,

I hope this link clarifies your doubt.

The canonical transcript is defined as either the longest CDS, if the gene has translated transcripts, or the longest cDNA.

https://lists.soe.ucsc.edu/pipermail/genome/2010-April/021963.html
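The rule quoted above can be sketched in a few lines of Python. This is only an illustration of "longest CDS if any transcript is translated, otherwise longest cDNA", not UCSC's actual pipeline; the `Transcript` class, its fields, and the toy lengths are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    name: str
    cdna_length: int  # total exonic length in bases (hypothetical field)
    cds_length: int   # coding sequence length; 0 means not translated


def pick_canonical(transcripts: List[Transcript]) -> Transcript:
    """Longest CDS if the cluster has translated transcripts, else longest cDNA."""
    coding = [t for t in transcripts if t.cds_length > 0]
    if coding:
        return max(coding, key=lambda t: t.cds_length)
    return max(transcripts, key=lambda t: t.cdna_length)


# Toy cluster: tx_c is the longest transcript overall but is non-coding.
cluster = [
    Transcript("tx_a", cdna_length=2400, cds_length=1320),
    Transcript("tx_b", cdna_length=3100, cds_length=1290),
    Transcript("tx_c", cdna_length=3500, cds_length=0),
]
print(pick_canonical(cluster).name)  # tx_a
```

Note that `max` returns the first of several equally long candidates, so ties are broken by input order here; the quoted rule does not say how ties are resolved.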

ADD COMMENT • link 12.6 years ago by Gjain 5.8k

2

Thanks for providing the criteria for the canonical transcript. But this link actually illustrates perfectly my reasons for doubting the wisdom of assigning (or trusting) a canonical transcript for each gene locus. The example being discussed in the post is MYC. We would consider that a single biological "gene", but apparently (at least at the time of that posting) it had been divided into two gene clusters. And for each gene cluster a canonical transcript was chosen based on an arbitrary criterion (length). Who's to say a slightly shorter isoform is not equally or more important?

ADD REPLY • link 12.6 years ago by Obi Griffith 20k

0

Agreed, but you can filter it out using other criteria that matter in your data.

ADD REPLY • link 12.6 years ago by Gjain 5.8k

score 8 · Answer 2 · 2012-03-23

12.6 years ago

Pierre Lindenbaum 164k

The table browser says:

knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.

ADD COMMENT • link 12.6 years ago by Pierre Lindenbaum 164k

1

Pierre gets the up-vote because he was seven minutes faster on the draw. Remember back when it took more than an hour to get questions answered on the interwebs?

ADD REPLY • link 12.6 years ago by David Quigley 11k

1

Thanks for explaining how the canonical isoform gets chosen. This makes sense. It is a criterion that has been in use for years (e.g., the MGC project). But for my downstream analysis I don't see a good argument for using the canonical isoform over any other known transcript. Maybe it is better to average/sum all transcripts to gene level, or to work at the transcript level and treat all isoforms as potentially equally important, regardless of whether they are marked canonical.
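To make the two options concrete, here is a minimal sketch comparing a canonical-only gene value with a sum over all transcripts. The gene and transcript names, the canonical flags, and the expression values are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical transcript-level abundances:
# (gene, transcript, is_canonical, expression)
rows = [
    ("GENE1", "tx1", True, 10.0),
    ("GENE1", "tx2", False, 30.0),
    ("GENE2", "tx3", True, 5.0),
]


def gene_level_sum(rows):
    """Sum expression over every transcript of each gene."""
    totals = defaultdict(float)
    for gene, _tx, _is_canonical, expr in rows:
        totals[gene] += expr
    return dict(totals)


def canonical_only(rows):
    """Keep only the transcript flagged canonical for each gene."""
    return {gene: expr for gene, _tx, is_canonical, expr in rows if is_canonical}


print(gene_level_sum(rows))   # {'GENE1': 40.0, 'GENE2': 5.0}
print(canonical_only(rows))   # {'GENE1': 10.0, 'GENE2': 5.0}
```

In this toy case the non-canonical isoform carries most of GENE1's signal, which is exactly the situation where relying on the canonical transcript alone would be misleading.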

ADD REPLY • link 12.6 years ago by Obi Griffith 20k

1

@David the truth is that the question had already been answered by GJain when I was about to write my answer. But I spent some time checking the validity of my answer using mysql:

    select X.geneSymbol, K.name, K.txEnd-K.txStart as "KL", C.chromEnd-C.chromStart as "CL"
    from (kgXref as X, knownGene as K)
    left join knownCanonical as C on C.transcript=X.kgId
    where X.kgId=K.name

ADD REPLY • link 12.6 years ago by Pierre Lindenbaum 164k

score 2 · Answer 3 · 2013-01-10

The topic of the canonical transcript (and by definition the ORF from it) is a big moot point. Swiss-Prot started the idea at the protein level years back as (IMHO) the best solution for mapping protein variants of all forms back to feature lines spawned from the curatorially selected canonical protein. I'm assuming Obi's question relates to the balance between biological relevance, curatorial judgment and computational tractability.

The longest-by-default rule seems useful because it's simple, i.e. all exons and the longest 5' and 3' (as the last polyA in data-supported cDNA), and you can get a clean non-redundant proteome set. The biological precedents for this default are much more arguable, but there simply isn't enough data on the ubiquity of alternative transcript abundance, or any objective measure of "importance", which could be context-dependent anyway (i.e. alternative promoters, major splice forms and polyA usage all being tissue-specific).

What is more of a problem is how different concepts of "canonical-ness" play out between the major pipelines of RefSeq, UniProt, Ensembl and CCDS, because they generate discordant cross-mappings. For example, Ensembl's canonical selection is "transcript 1", but it also uniquely predicts and/or Vega annotates splice variants with no cDNA. Note also that RefSeq and Swiss-Prot don't always agree on their cDNA-supported splice forms (or even SNP major alleles). Also, Swiss-Prot do not (as far as I know) consider UTRs in their ORF/CDS choice.

The point about UCSC is that they don't generate primary data but just mark it up (excellently). However, here again, their own clustering rules may produce different canonical results from the other pipelines. My guess is this would select the longest RefSeqN in (all?) cases (you can see this in the GUI tracks).

Ram · Answer 4 · 2015-12-02

Answering the UPDATE about Ensembl:

For human it is defined in the Ensembl glossary:

http://www.ensembl.org/Help/Glossary?id=346

Canonical transcript:

For human, the canonical transcript for a gene is set according to the following hierarchy:

1. Longest CCDS translation with no stop codons.

2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons.

3. If no (2), choose the longest translation with no stop codons.

4. If no translation, choose the longest non-protein-coding transcript.
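The hierarchy quoted from the glossary can be read as a tiered fallback. Below is a rough sketch of that reading, not Ensembl's own code; the `Tx` class, its field names, and the example IDs are assumptions. The glossary text does not say what happens when every translation contains stop codons and there is no non-coding transcript, so the sketch simply returns `None` in that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tx:
    stable_id: str
    length: int                    # transcript length (hypothetical field)
    translation_length: int = 0    # 0 means no translation
    has_stop_codons: bool = False  # translation contains stop codons
    is_ccds: bool = False          # translation matches a CCDS
    is_merged: bool = False        # Ensembl/Havana merged model


def ensembl_canonical(txs: List[Tx]) -> Optional[Tx]:
    # Translations with no stop codons are the candidates for the first three tiers.
    clean = [t for t in txs if t.translation_length > 0 and not t.has_stop_codons]
    tiers = [
        [t for t in clean if t.is_ccds],    # longest CCDS translation
        [t for t in clean if t.is_merged],  # longest merged translation
        clean,                              # longest translation of any kind
    ]
    for tier in tiers:
        if tier:
            return max(tier, key=lambda t: t.translation_length)
    # Last tier: longest non-protein-coding transcript.
    noncoding = [t for t in txs if t.translation_length == 0]
    return max(noncoding, key=lambda t: t.length) if noncoding else None


gene = [
    Tx("T1", length=2000, translation_length=400, is_ccds=True),
    Tx("T2", length=2500, translation_length=500, is_merged=True),
    Tx("T3", length=3000, translation_length=600),
]
print(ensembl_canonical(gene).stable_id)  # T1
```

In the toy gene, T1 wins even though T2 and T3 have longer translations, because the CCDS tier is checked first.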

score 1 · Answer 5 · 2013-01-09

11.8 years ago

dsbreak ▴ 170

I can't find the quoted information in the above references. Here's where I was able to find it:

https://lists.soe.ucsc.edu/pipermail/genome/2005-July/008123.html

ADD COMMENT • link 11.8 years ago by dsbreak ▴ 170