As I understand it, a single "canonical transcript" is reported for each gene locus in the UCSC known genes set. Please correct me if I am misunderstanding. I can see how this would sometimes be useful, for example when you want a single TSS for each gene and need to choose one option. Does anyone know how the canonical transcript is chosen for each gene? What is the rationale behind this? A follow-up thought: it seems dangerous for downstream analysis (e.g., expression analysis) to use a single transcript as representative of the gene. Who is to say that there aren't two equally valid/important transcripts that could both be considered "canonical"?
UPDATE: I edited this post to ask the same question about Ensembl. I'm hoping we can also document here the thought process that Ensembl uses for choosing its canonical transcripts.