Question: Why is the list of genes in the UCSC "knownGene" table strikingly different from the list of genes in the UCSC "knownCanonical" table?
lakhujanivijay (India) wrote (6 months ago):

Hi,

I am trying to generate a BED file using the steps below:

  • Go to human hg19 UCSC Table Browser:

  • track: UCSC Genes

  • table: knownCanonical

  • output format: select fields from primary and related tables

  • press get output

  • select fields from hg19.knownCanonical: chrom, chromStart, chromEnd, transcript

  • select fields from hg19.kgXref: geneSymbol

  • press get output

This BED file does not contain the gene BBS5, although I am sure that this gene exists.

However, when I follow the same steps but select the table knownGene (instead of knownCanonical), I can see the BBS5 gene. Why does this gene not appear in the knownCanonical table?

Any insights?

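The Table Browser steps in the question amount to a join between knownCanonical and kgXref. The following is a minimal sketch of that join, using Python's built-in sqlite3 module and a toy in-memory copy of the two tables. It assumes that knownCanonical.transcript matches kgXref.kgID, models only the columns used in the question (the real tables have more), and takes its three sample rows from the knownCanonical output quoted later in this thread.

```python
import sqlite3

# Toy, in-memory stand-ins for hg19.knownCanonical and hg19.kgXref.
# Only the columns used in the question are modelled (an assumption for brevity).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE knownCanonical (chrom TEXT, chromStart INTEGER, chromEnd INTEGER, transcript TEXT);
    CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);
""")
conn.executemany(
    "INSERT INTO knownCanonical VALUES (?, ?, ?, ?)",
    [
        ("chr1", 11873, 14409, "uc010nxq.1"),
        ("chr1", 14361, 19759, "uc009viu.3"),
        ("chr1", 69090, 70008, "uc001aal.1"),
    ],
)
conn.executemany(
    "INSERT INTO kgXref VALUES (?, ?)",
    [
        ("uc010nxq.1", "DDX11L1"),
        ("uc009viu.3", "WASH7P"),
        ("uc001aal.1", "OR4F5"),
    ],
)


def canonical_with_symbols(connection):
    """Return (chrom, chromStart, chromEnd, transcript, geneSymbol) rows."""
    query = """
        SELECT kc.chrom, kc.chromStart, kc.chromEnd, kc.transcript, x.geneSymbol
        FROM knownCanonical AS kc
        JOIN kgXref AS x ON x.kgID = kc.transcript
        ORDER BY kc.chrom, kc.chromStart
    """
    return connection.execute(query).fetchall()


# Print the joined rows as tab-separated, BED-like lines.
for row in canonical_with_symbols(conn):
    print("\t".join(str(field) for field in row))
```

A gene can only appear in this output if one of its transcripts has a row in knownCanonical, which is why a gene present in knownGene can still be absent here.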

UPDATE#1

In fact, I noticed that there are 1325 genes that are present in the knownGene table but absent from the knownCanonical table.

[Figure: venn.png, Venn diagram of the unique gene entries in knownGene and knownCanonical]

modified 6 months ago • written 6 months ago by lakhujanivijay
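A comparison like the one in UPDATE#1 can be reproduced with a set difference over the gene symbol column of the two exports. This is a sketch only: the two miniature tables below are hypothetical toy data (made-up chromosome, coordinates, and transcript IDs), and the function assumes tab-separated text with the gene symbol in the fifth column.

```python
import csv
import io


def gene_symbols(table_text, symbol_column=4):
    """Collect the unique, non-empty gene symbols from tab-separated text."""
    symbols = set()
    for fields in csv.reader(io.StringIO(table_text), delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the '#' header line
        if fields[symbol_column]:
            symbols.add(fields[symbol_column])
    return symbols


# Hypothetical toy exports; real data would be read from the downloaded files.
KNOWN_GENE = "\n".join([
    "#chrom\ttxStart\ttxEnd\tkgID\tgeneSymbol",
    "chrT\t100\t900\ttoy_tx1\tKLHL41",
    "chrT\t100\t400\ttoy_tx2\tBBS5",
    "chrT\t150\t380\ttoy_tx3\tBBS5",
])
KNOWN_CANONICAL = "\n".join([
    "#chrom\tchromStart\tchromEnd\tkgID\tgeneSymbol",
    "chrT\t100\t900\ttoy_tx1\tKLHL41",
])

# Genes that have at least one transcript in knownGene but no canonical entry.
missing = gene_symbols(KNOWN_GENE) - gene_symbols(KNOWN_CANONICAL)
print(sorted(missing))
```

Because the comparison works on unique symbols rather than rows, it is not affected by knownGene listing several transcripts per gene.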

UPDATE#2

There is a striking difference between the row counts of the two tables.

Schema for knownGene Row Count: 82,960

Schema for knownCanonical Row Count: 31,848

written 6 months ago by lakhujanivijay

knownCanonical generally holds only the longest isoform of each gene, so it is not surprising that its row count is smaller. See the definitions under the Related Data section on this page.

written 6 months ago by genomax

genomax, I agree about the definition of "canonical". However, the numbers shown in the Venn diagram are counts of unique gene entries in each set, so the definition does not explain why the list of genes in "knownCanonical" is smaller (see the Venn diagram above). Every gene must have one canonical isoform.

Actually, I am trying to replicate the steps described in the post below:

How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols

The problem is that there are several genes missing from the final BED file.

modified 6 months ago • written 6 months ago by lakhujanivijay
Luis Nassar wrote (6 months ago):

Hello Vijay,

What you are observing with the missing BBS5 entry in the knownCanonical table is an artifact of how that table was created for hg19. On the hg19 UCSC Genes description page (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene) we define knownCanonical as follows:

knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.

The problem arises when two genes have overlapping coordinates and one lies entirely within the other: the algorithm treats them as isoforms of a single cluster, so the smaller gene is missed by knownCanonical. You can see this with BBS5 in the following session (http://genome.ucsc.edu/s/Lou/hg19_MLQ1). KLHL41 has a transcript with the same start site as BBS5 that extends much further, and all of the BBS5 transcripts fall within it. If you query the knownCanonical table in the Table Browser for these coordinates, you see only KLHL41.

Using coordinates chr2:170,331,250-170,374,046:

chr2 170366211 170382772 17243 uc002ueu.1 uc002ueu.1 KLHL41
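The artifact described above can be illustrated with a deliberately simplified model: group transcripts into clusters of overlapping intervals, then keep the longest transcript per cluster. This is not UCSC's actual hg19 pipeline, only a sketch of the failure mode; the transcript names, gene symbols, and coordinates are all hypothetical.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    symbol: str
    start: int  # 0-start, half-open
    end: int


def cluster_by_overlap(transcripts):
    """Group transcripts (same chromosome and strand assumed) by interval overlap."""
    clusters = []
    cluster_end = None
    for tx in sorted(transcripts, key=lambda t: (t.start, t.end)):
        if cluster_end is not None and tx.start < cluster_end:
            clusters[-1].append(tx)  # overlaps the running cluster
            cluster_end = max(cluster_end, tx.end)
        else:
            clusters.append([tx])  # starts a new cluster
            cluster_end = tx.end
    return clusters


def longest_per_cluster(transcripts):
    """Simplified 'canonical' choice: the longest transcript of each cluster."""
    return [max(cluster, key=lambda t: t.end - t.start)
            for cluster in cluster_by_overlap(transcripts)]


# Hypothetical toy data: every INNER transcript lies within one OUTER transcript.
toy = [
    Transcript("outer.1", "OUTER", 1000, 9000),
    Transcript("outer.2", "OUTER", 6000, 9000),
    Transcript("inner.1", "INNER", 1000, 5000),
    Transcript("inner.2", "INNER", 1500, 4800),
    Transcript("far.1", "FAR", 20000, 21000),
]

canonical = longest_per_cluster(toy)
missing = {t.symbol for t in toy} - {t.symbol for t in canonical}
print([t.name for t in canonical], sorted(missing))
```

In this toy model INNER ends up with no canonical transcript because it shares a cluster with the longer OUTER transcript, which mirrors how BBS5 is absorbed into the KLHL41 cluster.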

To get around this, you can use the complete knownGene table, or you can use the knownCanonical table for hg38. For the hg38 assembly (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene) the table was generated differently:

knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.

This new method does not have the same issue as hg19, because it uses APPRIS tags first, then the GENCODE BASIC set, and only falls back to the longest isoform when neither is available. If you convert the region from the session above to hg38 (View in the top blue bar -> In Other Genomes), you get the coordinates chr2:169,474,740-169,517,536. Querying that position against the knownCanonical table in the Table Browser then gives the following results:

chr2 169479177 169506655 10932 ENST00000295240.7 ENSG00000163093.11 BBS5
chr2 169479479 169525922 41682 ENST00000513963.1 ENSG00000251569.1 AC093899.2
chr2 169509701 169526262 36897 ENST00000284669.1 ENSG00000239474.6 KLHL41
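The hg38 selection order quoted above (APPRIS principal, then BASIC, then longest isoform) can be sketched as a tiered fallback. The dictionary layout and the tag strings "appris_principal" and "basic" are hypothetical names chosen for this illustration, and breaking ties within a tier by length is an assumption, not something the track description states.

```python
def choose_canonical(transcripts):
    """Pick one transcript for a gene cluster using a tiered fallback.

    Each transcript is a dict: {"id": str, "start": int, "end": int, "tags": set}.
    """
    def length(tx):
        return tx["end"] - tx["start"]

    # Tier 1: APPRIS principal; tier 2: BASIC set (tag names are hypothetical).
    for tag in ("appris_principal", "basic"):
        tagged = [tx for tx in transcripts if tag in tx["tags"]]
        if tagged:
            return max(tagged, key=length)  # tie-break by length (assumption)

    # Tier 3: no tagged transcript at all, so fall back to the longest isoform.
    return max(transcripts, key=length)


# Hypothetical toy cluster: the longest transcript is not the APPRIS principal one.
cluster = [
    {"id": "tx_long", "start": 100, "end": 900, "tags": {"basic"}},
    {"id": "tx_principal", "start": 100, "end": 500, "tags": {"basic", "appris_principal"}},
    {"id": "tx_untagged", "start": 200, "end": 300, "tags": set()},
]
print(choose_canonical(cluster)["id"])
```

Because clusters are defined by gene ID rather than by coordinate overlap, a gene nested inside another gene keeps its own cluster and therefore its own canonical transcript.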

modified 6 months ago by genomax • written 6 months ago by Luis Nassar

@Luis: Do you represent UCSC genome support or just happen to be very familiar with their methods?

written 6 months ago by genomax

I do; I currently work in QA at the UCSC Genome Browser. Our main support address is genome@soe.ucsc.edu, but we keep an eye on Biostars when we can (the UCSC tag helps).

written 6 months ago by Luis Nassar

Thanks for the comprehensive explanation Luis Nassar.

Let me give that a try and see if it solves my problem; then I will accept this as the answer so that everyone can benefit.

written 6 months ago by lakhujanivijay

Hi Luis Nassar

How can I get a file from the knownGene table that looks like this?

chrom   transcript_start    transcript_stop transcript_id   gene symbol
chr1    11873   14409   uc010nxq.1  DDX11L1
chr1    14361   19759   uc009viu.3  WASH7P
chr1    14406   29370   uc009viw.2  WASH7P
chr1    34610   36081   uc001aak.3  FAM138F
chr1    69090   70008   uc001aal.1  OR4F5
chr1    134772  140566  uc021oeg.2  LOC729737
chr1    321083  321115  uc001aaq.2  DQ597235
chr1    321145  321207  uc001aar.2  DQ599768
chr1    322036  326938  uc009vjk.2  LOC100133331
modified 6 months ago • written 6 months ago by lakhujanivijay

Hello Vijay,

To get an output from the knownGene table like the one you described you will want to use the "selected fields from primary and related tables" option in the Table Browser:

Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
assembly: hg19
track: UCSC Genes
table: knownGene
output format: selected fields from primary and related tables
get output
Select fields from hg19.knownGene: chrom, txStart, txEnd
Select fields from hg19.kgXref: kgID, geneSymbol
get output

Your output file should look like this (first entries from chr1):

#hg19.knownGene.chrom    hg19.knownGene.txStart    hg19.knownGene.txEnd    hg19.kgXref.kgID    hg19.kgXref.geneSymbol
chr1    11873    14409    uc001aaa.3    DDX11L1
chr1    11873    14409    uc010nxr.1    DDX11L1
chr1    11873    14409    uc010nxq.1    DDX11L1
chr1    14361    16765    uc009vis.3    WASH7P
chr1    14361    19759    uc009vit.3    WASH7P
chr1    14361    19759    uc009viu.3    WASH7P
chr1    14361    19759    uc001aae.4    WASH7P
chr1    14361    29370    uc001aah.4    WASH7P
chr1    14361    29370    uc009vir.3    WASH7P
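The knownGene output above lists one row per transcript, so a gene appears as many times as it has transcripts. A short sketch that counts transcripts per gene symbol in the rows shown above makes this visible. It assumes whitespace-separated columns with the gene symbol last, which holds for this sample; the function name is made up for the illustration.

```python
from collections import Counter

# The knownGene sample rows quoted above, one transcript per line.
SAMPLE = """\
#hg19.knownGene.chrom hg19.knownGene.txStart hg19.knownGene.txEnd hg19.kgXref.kgID hg19.kgXref.geneSymbol
chr1 11873 14409 uc001aaa.3 DDX11L1
chr1 11873 14409 uc010nxr.1 DDX11L1
chr1 11873 14409 uc010nxq.1 DDX11L1
chr1 14361 16765 uc009vis.3 WASH7P
chr1 14361 19759 uc009vit.3 WASH7P
chr1 14361 19759 uc009viu.3 WASH7P
chr1 14361 19759 uc001aae.4 WASH7P
chr1 14361 29370 uc001aah.4 WASH7P
chr1 14361 29370 uc009vir.3 WASH7P
"""


def transcripts_per_gene(table_text):
    """Count rows (transcripts) per gene symbol, skipping '#' header lines."""
    counts = Counter()
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # works for tab- or space-separated columns
        counts[fields[-1]] += 1  # gene symbol is the last column
    return counts


counts = transcripts_per_gene(SAMPLE)
print(dict(counts))
```

Nine knownGene rows collapse to two gene symbols here, which is the kind of redundancy that accounts for much of the row-count gap between knownGene and knownCanonical.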

If you would like to get this output with knownCanonical (like your example) you can follow the steps above with the following changes:

table: knownCanonical
...
Select Fields from hg19.knownCanonical: chrom, chromStart, chromEnd
Select fields from hg19.kgXref: kgID, geneSymbol

Your output file should look like this (first 10 entries from chr1):

#hg19.knownCanonical.chrom    hg19.knownCanonical.chromStart    hg19.knownCanonical.chromEnd    hg19.kgXref.kgID    hg19.kgXref.geneSymbol
chr1    11873    14409    uc010nxq.1    DDX11L1
chr1    14361    19759    uc009viu.3    WASH7P
chr1    14406    29370    uc009viw.2    WASH7P
chr1    34610    36081    uc001aak.3    FAM138F
chr1    69090    70008    uc001aal.1    OR4F5
chr1    134772    140566    uc021oeg.2    LOC729737
chr1    321083    321115    uc001aaq.2    DQ597235
chr1    321145    321207    uc001aar.2    DQ599768
chr1    322036    326938    uc009vjk.2    LOC100133331
chr1    327545    328439    uc021oei.1    LOC388312

It may also be worth mentioning that our data tables (such as the example output) use 0-start, half-open coordinates. If this is relevant to you, we have a blog post on the topic: http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/

modified 6 months ago • written 6 months ago by Luis Nassar
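As a small worked example of the 0-start, half-open convention mentioned in the last reply, the sketch below converts a table interval to the 1-start, fully closed form and back, using the first DDX11L1 row from the output above. The function names are made up for this illustration.

```python
def to_one_based(start0, end0):
    """Convert a 0-start, half-open interval to a 1-start, fully closed one."""
    return start0 + 1, end0


def to_zero_based(start1, end1):
    """Convert a 1-start, fully closed interval back to 0-start, half-open."""
    return start1 - 1, end1


def interval_length(start0, end0):
    """Length in bases of a 0-start, half-open interval."""
    return end0 - start0


# First DDX11L1 row from the table output: chr1 11873 14409
start, end = 11873, 14409
print(to_one_based(start, end))     # (11874, 14409)
print(interval_length(start, end))  # 2536
```

Only the start coordinate shifts between the two systems; the end coordinate is numerically the same in both.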