As I understand it, for each gene locus in the UCSC known genes set, a single "canonical transcript" is reported. Please correct me if I am misunderstanding. I can see how this would sometimes be useful, for example when you want a single TSS for each gene and need to choose one option. Does anyone know how the canonical transcript is chosen for each gene? What is the rationale behind this? A follow-up thought: it seems dangerous for downstream analysis (e.g., expression analysis) to use a single transcript as representative of the gene. Who is to say that there aren't two equally valid or important transcripts that could both be considered "canonical"?
UPDATE: I edited this post to ask the same question about Ensembl. I'm hoping we can document here the similar thought process that Ensembl uses for choosing canonical transcripts.
Thanks for providing the criteria for the canonical transcript. Actually, though, this link illustrates perfectly my reasons for doubting the wisdom of assigning (or trusting) a canonical transcript for each gene locus. The example discussed in the post is MYC. We would consider that a single biological "gene", but apparently (at least at the time of that posting) it had been divided into two gene clusters. And for each gene cluster, a canonical transcript was chosen based on an arbitrary criterion (length). Who's to say a slightly shorter isoform is not equally or more important?
Agreed, but you can filter using other criteria that are important for your data.
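To make the concern about a length-based choice concrete, here is a minimal Python sketch. It is not the actual UCSC or Ensembl procedure, only an illustration of "pick the longest transcript per gene cluster" as described in the comment above. The cluster IDs, transcript IDs, lengths, the function name `pick_longest`, and the `near_tie_fraction` threshold are all hypothetical and invented for this example.

```python
from collections import defaultdict

# Hypothetical (cluster_id, transcript_id, length_bp) records, invented for illustration.
transcripts = [
    ("cluster_1", "tx_A", 2400),
    ("cluster_1", "tx_B", 2350),
    ("cluster_1", "tx_C", 900),
    ("cluster_2", "tx_D", 1800),
]


def pick_longest(records, near_tie_fraction=0.95):
    """Pick the longest transcript per cluster and list near-ties.

    near_tie_fraction is an arbitrary, assumed cutoff: any other transcript
    at least this fraction of the winner's length is reported as a near-tie.
    """
    by_cluster = defaultdict(list)
    for cluster_id, tx_id, length in records:
        by_cluster[cluster_id].append((tx_id, length))

    result = {}
    for cluster_id, members in by_cluster.items():
        # Longest first; break exact length ties by transcript ID so the
        # outcome is deterministic.
        ranked = sorted(members, key=lambda m: (-m[1], m[0]))
        canonical_id, canonical_len = ranked[0]
        near_ties = [
            tx_id
            for tx_id, length in ranked[1:]
            if length >= near_tie_fraction * canonical_len
        ]
        result[cluster_id] = {"canonical": canonical_id, "near_ties": near_ties}
    return result


choices = pick_longest(transcripts)
for cluster_id in sorted(choices):
    print(cluster_id, choices[cluster_id]["canonical"], choices[cluster_id]["near_ties"])
```

In this toy data, `tx_B` is only 50 bp shorter than the chosen `tx_A`, so it is flagged as a near-tie. Clusters with a non-empty `near_ties` list are the ones where it may be worth applying additional criteria from your own data before trusting a single representative.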