UCSC: different exon sets for each gene
2
0
Entering edit mode
2.4 years ago
cocchi.e89 ▴ 170

I'm trying to collect the exon start and end coordinates for a set of genes. I downloaded the UCSC tables, but I found that several different exon sets are returned for each gene. For example, the UMOD gene:

#hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.txStart  hg19.knownGene.txEnd    hg19.knownGene.exonCount    hg19.knownGene.exonStarts   hg19.knownGene.exonEnds hg19.kgXref.geneSymbol
uc002dgz.3  chr16   20344372    20364037    11  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010, 20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362161,20364037, UMOD
uc002dha.3  chr16   20344372    20364037    11  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010, 20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362098,20364037, UMOD
uc002dhb.3  chr16   20344372    20364037    12  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361092,20361971,20364010,    20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20361191,20362161,20364037,    UMOD


So my question is: which one am I supposed to rely on? And how do you select it?

Thanks a lot in advance for any help!

ucsc exomes gene coordinates • 591 views
0
Entering edit mode
2.4 years ago

knownGene is a misleading name. What you're seeing here are 3 transcripts (uc002dgz.3, uc002dha.3, uc002dhb.3) for the same gene, UMOD.

"which one am I supposed to rely on?"

There is no quick answer to this; it depends on your needs: the largest transcript, the most covered one, etc., or just use the min / max coordinates.
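For the "largest" and "min / max" options, here is a minimal Python sketch. It assumes the Table Browser output is tab-separated with the same eight columns as in the question. The function names (`summarize`, `longest_transcript`, `gene_span`) are made up for illustration, and "largest" is taken here to mean the greatest summed exon length, which is only one possible definition.

```python
from collections import defaultdict

# The three UMOD rows from the question, rebuilt as tab-separated lines.
SAMPLE = ["\t".join(row) for row in [
    ("uc002dgz.3", "chr16", "20344372", "20364037", "11",
     "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010,",
     "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362161,20364037,",
     "UMOD"),
    ("uc002dha.3", "chr16", "20344372", "20364037", "11",
     "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010,",
     "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362098,20364037,",
     "UMOD"),
    ("uc002dhb.3", "chr16", "20344372", "20364037", "12",
     "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361092,20361971,20364010,",
     "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20361191,20362161,20364037,",
     "UMOD"),
]]


def parse_int_list(field):
    """Turn a UCSC list such as '1,2,3,' into [1, 2, 3]; the trailing comma is ignored."""
    return [int(x) for x in field.split(",") if x.strip()]


def summarize(lines):
    """Group transcripts by gene symbol as (name, txStart, txEnd, summed exon length)."""
    by_gene = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        name, _chrom, tx_start, tx_end, _count, starts, ends, symbol = fields
        exon_bp = sum(e - s for s, e in zip(parse_int_list(starts), parse_int_list(ends)))
        by_gene[symbol].append((name, int(tx_start), int(tx_end), exon_bp))
    return dict(by_gene)


def longest_transcript(transcripts):
    """Pick the transcript with the greatest summed exon length."""
    return max(transcripts, key=lambda t: t[3])


def gene_span(transcripts):
    """Return the min txStart and max txEnd over all transcripts of a gene."""
    return min(t[1] for t in transcripts), max(t[2] for t in transcripts)


for symbol, transcripts in summarize(SAMPLE).items():
    print(symbol, "longest:", longest_transcript(transcripts)[0], "span:", gene_span(transcripts))
```

For the UMOD rows this picks uc002dhb.3, since it contains the extra exon that the other two transcripts lack.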

0
Entering edit mode

Probably the largest. Shall I calculate it for each one, or is there some sort of "indicator" of the largest set?

0
Entering edit mode
2.4 years ago
Luis Nassar ▴ 550

As Pierre mentioned, knownGene includes a large set of transcripts; in total it has 82,960 items. If you are just looking for one representative transcript per gene, then I would recommend you use the knownCanonical table instead. This table is a subset of knownGene, generally the longest isoform of each gene. You may search our forums for more details on its generation (https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome).

You may also get your query from the Table Browser, using the following link (http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_doMainPage=1&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene) and following these steps:

1. change the table to knownCanonical
2. change the output to selected fields from primary and related tables
3. get output
4. select chrom, chromStart, chromEnd, protein, and geneSymbol
5. get output

The results should look as follows:

#hg19.knownCanonical.chrom  hg19.knownCanonical.chromStart  hg19.knownCanonical.chromEnd    hg19.knownCanonical.protein hg19.kgXref.geneSymbol
chr1    11873   14409   uc010nxq.1  DDX11L1
chr1    14361   19759   uc009viu.3  WASH7P
chr1    14406   29370   uc009viw.2  WASH7P
chr1    34610   36081   uc001aak.3  FAM138F
...


Lou UCSC GB

0
Entering edit mode

Thank you very much for the suggestions, Luis, but I need the start and end of each exon, not of the overall gene.

1
Entering edit mode

Ah, in that case you can change the output to BED and then get output. On the following page you will see:

Create one BED record per:

If you select Coding Exons (or Exons plus 0 if you want to include UTR regions), you should get output like the following, with one entry for each exon:

chr1    12189   12227   uc010nxq.1_cds_0_0_chr1_12190_f 0   +
chr1    12594   12721   uc010nxq.1_cds_1_0_chr1_12595_f 0   +
chr1    13402   13639   uc010nxq.1_cds_2_0_chr1_13403_f 0   +
chr1    69090   70008   uc001aal.1_cds_0_0_chr1_69091_f 0   +
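In case it helps with attaching other annotations (such as a gene symbol) to these per-exon BED lines, here is a small Python sketch. It assumes the name column always follows the `<transcript>_cds_<exon index>_...` pattern seen in the lines above; other feature labels would need adjusting. The helper names and the `symbol_by_transcript` dictionary are hypothetical. In practice the dictionary would be filled from a two-column export of transcript name and geneSymbol.

```python
def transcript_from_bed_name(name):
    """Return (transcript ID, exon index) from a name like 'uc010nxq.1_cds_0_0_chr1_12190_f'.

    Splitting on '_cds_' keeps IDs that contain underscores (e.g. 'NM_...') intact.
    """
    if "_cds_" not in name:
        raise ValueError(f"unexpected BED name format: {name!r}")
    transcript, rest = name.split("_cds_", 1)
    exon_index = int(rest.split("_", 1)[0])
    return transcript, exon_index


def annotate_bed(lines, symbol_by_transcript, missing="NA"):
    """Append a gene-symbol column to each BED line (given as whitespace-separated text)."""
    annotated = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        transcript, _exon_index = transcript_from_bed_name(fields[3])
        annotated.append(fields + [symbol_by_transcript.get(transcript, missing)])
    return annotated


# Two of the BED lines shown above; the lookup table is a hypothetical example
# holding only the one transcript-to-symbol pair that appears in this thread.
bed_lines = [
    "chr1\t12189\t12227\tuc010nxq.1_cds_0_0_chr1_12190_f\t0\t+",
    "chr1\t69090\t70008\tuc001aal.1_cds_0_0_chr1_69091_f\t0\t+",
]
symbol_by_transcript = {"uc010nxq.1": "DDX11L1"}

for row in annotate_bed(bed_lines, symbol_by_transcript):
    print("\t".join(row))
```

Transcripts missing from the lookup table get `NA` rather than raising an error, so an incomplete symbol table is easy to spot in the output.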

0
Entering edit mode

If I leave "UCSC Genes" in the query page, it doesn't allow me to select "Coding Exons" in the BED page, but if I change it to "NCBI RefSeq" it does, and I get:

chr1    67000041    67000051    NM_001308203.1_cds_1_0_chr1_67000042_f  0   +
chr1    67091529    67091593    NM_001308203.1_cds_2_0_chr1_67091530_f  0   +
chr1    67098752    67098777    NM_001308203.1_cds_3_0_chr1_67098753_f  0   +


But how can I then add the gene symbol here? Or retrieve it from somewhere else...

0
Entering edit mode

Oh, I see. I believe I understand: you're looking for the individual exon start/stop coordinates for one isoform per gene?

You're right, the knownCanonical table does not have exon start/stop coordinates. The following should work for you though:

1. Choose the knownCanonical table as described above, give a file name to download, then Select fields from primary....

This gives you a file with each canonical transcript ID, e.g.:

#protein
uc010nxq.1
uc009viu.3
uc009viw.2

2. Go back to the Table Browser and switch to the knownGene table
3. For identifiers (names/accessions): choose upload list and select the file you just created

You should see a message that Note: 1 of the 31849 failed to upload; the failed item is the '#protein' file header. At this point you are restricting the knownGene data set to just one isoform per gene.

4. Choose the fields you want, e.g. name, chrom, strand, exonCount, exonStarts, exonEnds, geneSymbol, then get output

This output should give you a list of 31849 canonical isoforms with individual exon start/stop sites, e.g.:

#hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.strand   hg19.knownGene.exonCount    hg19.knownGene.exonStarts   hg19.knownGene.exonEnds hg19.kgXref.geneSymbol
uc010nxq.1  chr1    +   3   11873,12594,13402,  12227,12721,14409,  DDX11L1
uc009viu.3  chr1    -   10  14361,14969,15795,16606,16857,17232,17914,18267,18500,18912,    14829,15038,15947,16765,17055,17742,18061,18369,18554,19759,    WASH7P
uc009viw.2  chr1    -   7   14406,16857,17232,17914,18267,24737,29320,  16765,17055,17742,18061,18366,24891,29370,  WASH7P
uc001aak.3  chr1    -   3   34610,35276,35720,  35174,35481,36081,  FAM138F
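If you need one line per exon rather than the comma-separated lists, the output above can be expanded with a few lines of Python. This is only a sketch: it assumes a tab-separated file with exactly the seven columns shown, and `expand_exons` is a made-up helper name. Note that UCSC table coordinates are 0-based, half-open, so add 1 to each start if you need 1-based positions.

```python
def expand_exons(line):
    """Turn one knownGene output row into a list of per-exon tuples.

    Each tuple is (chrom, exon_start, exon_end, gene_symbol, transcript, strand).
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    name, chrom, strand, exon_count, starts, ends, symbol = fields
    starts = [int(x) for x in starts.split(",") if x.strip()]
    ends = [int(x) for x in ends.split(",") if x.strip()]
    # Sanity check: both lists should agree with the exonCount column.
    if not (len(starts) == len(ends) == int(exon_count)):
        raise ValueError(f"exon count mismatch for {name}")
    return [(chrom, s, e, symbol, name, strand) for s, e in zip(starts, ends)]


# First data row of the output above, rebuilt as a tab-separated line.
row = "\t".join([
    "uc010nxq.1", "chr1", "+", "3",
    "11873,12594,13402,", "12227,12721,14409,", "DDX11L1",
])

for exon in expand_exons(row):
    print(*exon, sep="\t")
```

To process a whole download, call `expand_exons` on every line of the file that does not start with `#`.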