Question: UCSC gives different exon sets for each gene
cocchi.e89 wrote (23 days ago):

I'm trying to collect the exon start and end coordinates for a set of genes. I downloaded the UCSC tables, but I found that several different exon sets are output for each gene. For example, the UMOD gene:

#hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.txStart  hg19.knownGene.txEnd    hg19.knownGene.exonCount    hg19.knownGene.exonStarts   hg19.knownGene.exonEnds hg19.kgXref.geneSymbol
uc002dgz.3  chr16   20344372    20364037    11  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010, 20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362161,20364037, UMOD
uc002dha.3  chr16   20344372    20364037    11  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010, 20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362098,20364037, UMOD
uc002dhb.3  chr16   20344372    20364037    12  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361092,20361971,20364010,    20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20361191,20362161,20364037,    UMOD

So my question is: which one am I supposed to rely on, and how do you select it?

Thanks a lot in advance for any help!

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (23 days ago):

knownGene is a misleading name. What you're seeing here are 3 transcripts (uc002dgz.3, uc002dha.3, uc002dhb.3) of the same gene, UMOD.

"Which one am I supposed to rely on?"

There is no quick answer for this; it depends on your needs: the largest, the most covered, etc., or just use the min/max coordinates.


Probably the largest. Shall I calculate it for each one, or is there some kind of "indicator" of the largest set?

Reply written 23 days ago by cocchi.e89
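
Calculating it for each transcript is straightforward. Here is a minimal sketch that sums the exon lengths of each knownGene row and reports the transcript with the largest total. The rows are the three UMOD transcripts from the question; the helper names (`parse_coords`, `exon_lengths`, `largest_transcript`) are made up for illustration, and "largest" is assumed to mean total exon length rather than genomic span.

```python
# Sketch: pick the transcript with the largest total exon length.
# Rows are (name, exonStarts, exonEnds), copied from the UMOD example.
UMOD_ROWS = [
    ("uc002dgz.3",
     "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010,",
     "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362161,20364037,"),
    ("uc002dha.3",
     "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010,",
     "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362098,20364037,"),
    ("uc002dhb.3",
     "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361092,20361971,20364010,",
     "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20361191,20362161,20364037,"),
]


def parse_coords(field):
    """Turn a comma-terminated list such as '1,2,3,' into [1, 2, 3]."""
    return [int(x) for x in field.split(",") if x]


def exon_lengths(starts, ends):
    # UCSC table starts are 0-based and ends are exclusive,
    # so each exon length is simply end - start.
    return [e - s for s, e in zip(parse_coords(starts), parse_coords(ends))]


def largest_transcript(rows):
    """Return (best_name, totals) where totals maps name -> summed exon length."""
    totals = {name: sum(exon_lengths(s, e)) for name, s, e in rows}
    best = max(totals, key=totals.get)
    return best, totals


best, totals = largest_transcript(UMOD_ROWS)
for name, total in totals.items():
    print(name, total)
print("largest:", best)
```

On these three rows the 12-exon transcript uc002dhb.3 comes out largest. The same function works on a whole knownGene download once the rows are grouped by gene symbol.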
Luis Nassar wrote (23 days ago):

As Pierre mentioned, knownGene includes a large set of transcripts; in total it has 82,960 items. If you are just looking for one representative transcript per gene, then I would recommend you use the knownCanonical table instead. This table is a subset of knownGene that holds one transcript per gene, generally the longest isoform. You may search our forums for more details on its generation (https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome).

Here is a link to the .txt.gz knownCanonical data: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownCanonical.txt.gz

You may also get your query from the Table Browser, using the following link (http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_doMainPage=1&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene) and following these steps:

  1. Change the table to knownCanonical, then change the output format to "selected fields from primary and related tables"
  2. Add a file name to download the file
  3. Click get output
  4. Select chrom, chromStart, chromEnd, protein, and geneSymbol
  5. Click get output

The results should look as follows:

#hg19.knownCanonical.chrom  hg19.knownCanonical.chromStart  hg19.knownCanonical.chromEnd    hg19.knownCanonical.protein hg19.kgXref.geneSymbol
chr1    11873   14409   uc010nxq.1  DDX11L1
chr1    14361   19759   uc009viu.3  WASH7P
chr1    14406   29370   uc009viw.2  WASH7P
chr1    34610   36081   uc001aak.3  FAM138F
...

Lou UCSC GB

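
Note that a gene symbol is not guaranteed to be unique in this output: WASH7P appears twice in the rows above. Here is a minimal sketch for loading the output into a per-symbol lookup that keeps every row. `canonical_by_symbol` is a hypothetical helper, the sample rows are copied from the output above, and the column order is assumed to match the five fields selected in the steps.

```python
from collections import defaultdict

# Sample rows copied from the Table Browser output above.
# The real file is tab-separated; str.split() handles tabs and spaces alike.
CANONICAL_TEXT = """\
#chrom chromStart chromEnd protein geneSymbol
chr1 11873 14409 uc010nxq.1 DDX11L1
chr1 14361 19759 uc009viu.3 WASH7P
chr1 14406 29370 uc009viw.2 WASH7P
chr1 34610 36081 uc001aak.3 FAM138F
"""


def canonical_by_symbol(text):
    """Map gene symbol -> list of (transcript_id, chrom, start, end)."""
    by_symbol = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        chrom, start, end, transcript, symbol = line.split()
        by_symbol[symbol].append((transcript, chrom, int(start), int(end)))
    return dict(by_symbol)


lookup = canonical_by_symbol(CANONICAL_TEXT)
for symbol, entries in lookup.items():
    print(symbol, len(entries), entries)
```

Keeping a list per symbol avoids silently overwriting one of the duplicate entries; how to resolve such duplicates depends on your analysis.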

Thank you very much for the suggestions, Luis, but I need the start and end of each exon, not of the overall gene.

Reply written 23 days ago by cocchi.e89

Ah, in that case you can change the output format to BED and click get output. On the following page you will see:

Create one BED record per:

If you select Coding Exons (or Exons plus 0 if you want to include UTR regions), you should get output like the following, with one entry for each exon:

chr1    12189   12227   uc010nxq.1_cds_0_0_chr1_12190_f 0   +
chr1    12594   12721   uc010nxq.1_cds_1_0_chr1_12595_f 0   +
chr1    13402   13639   uc010nxq.1_cds_2_0_chr1_13403_f 0   +
chr1    69090   70008   uc001aal.1_cds_0_0_chr1_69091_f 0   +
Reply written 23 days ago by Luis Nassar
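
The name column of these BED records packs the transcript ID and the exon index into one string, so it helps to split it apart before further processing. Here is a minimal sketch, based only on the pattern visible in the rows above. The `_exon_` alternative in the regular expression is an assumption about what the "Exons plus" option emits, and `parse_bed_line` is a hypothetical helper.

```python
import re

# Sample rows copied from the BED output above (whitespace-separated here).
BED_TEXT = """\
chr1 12189 12227 uc010nxq.1_cds_0_0_chr1_12190_f 0 +
chr1 12594 12721 uc010nxq.1_cds_1_0_chr1_12595_f 0 +
chr1 13402 13639 uc010nxq.1_cds_2_0_chr1_13403_f 0 +
chr1 69090 70008 uc001aal.1_cds_0_0_chr1_69091_f 0 +
"""

# <transcript>_<cds|exon>_<index>_...  The transcript ID may itself contain
# underscores (e.g. RefSeq NM_ accessions), so anchor on the keyword.
NAME_RE = re.compile(r"^(?P<transcript>.+)_(?P<kind>cds|exon)_(?P<index>\d+)_")


def parse_bed_line(line):
    chrom, start, end, name, _score, strand = line.split()[:6]
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected name field: {name!r}")
    start, end = int(start), int(end)
    return {
        "chrom": chrom,
        "start": start,          # BED start is 0-based
        "end": end,              # BED end is exclusive
        "length": end - start,
        "strand": strand,
        "transcript": match.group("transcript"),
        "index": int(match.group("index")),
    }


records = [parse_bed_line(l) for l in BED_TEXT.splitlines() if l.strip()]
for rec in records:
    print(rec["transcript"], rec["index"], rec["chrom"], rec["start"], rec["end"])
```

Once the transcript ID is isolated, it can be used as a key to look up the gene symbol in any transcript-to-symbol table you have downloaded.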

If I leave "UCSC Genes" selected on the query page, it doesn't allow me to select "Coding Exons" on the BED page, but if I change it to "NCBI RefSeq" it does, and I get:

chr1    67000041    67000051    NM_001308203.1_cds_1_0_chr1_67000042_f  0   +
chr1    67091529    67091593    NM_001308203.1_cds_2_0_chr1_67091530_f  0   +
chr1    67098752    67098777    NM_001308203.1_cds_3_0_chr1_67098753_f  0   +

But how can I then add the gene symbol here, or retrieve it from somewhere else?

Reply written 23 days ago by cocchi.e89

I finally found a good answer at: How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols

Reply written 23 days ago by cocchi.e89

Oh, I see. I believe I understand: you're looking for the individual exon start/stop coordinates for one isoform per gene?

You're right, the knownCanonical table does not have exon start/stop coordinates. The following should work for you though:

  1. Choose the knownCanonical table as described above, give a file name to download, then choose "selected fields from primary and related tables" as the output format
  2. Choose only protein, then click get output to download the file

This gives you a file with each canonical transcript ID, e.g.:

#protein
uc010nxq.1
uc009viu.3
uc009viw.2

  1. Go back to the Table Browser and switch to the knownGene table
  2. For identifiers (names/accessions), choose upload list and select the file you just created

You should see a message reading "Note: 1 of the 31849 failed to upload"; the item that failed is the '#protein' file header. At this point you are restricting the knownGene data set to just one isoform per gene.

  1. Add a file name to download, choose "selected fields from primary and related tables" as the output format, then click get output
  2. Choose the fields you want, e.g. name, chrom, strand, exonCount, exonStarts, exonEnds, geneSymbol, then click get output

This output should give you a list of 31849 canonical isoforms with individual exon start/stop sites, e.g.:

#hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.strand   hg19.knownGene.exonCount    hg19.knownGene.exonStarts   hg19.knownGene.exonEnds hg19.kgXref.geneSymbol
uc010nxq.1  chr1    +   3   11873,12594,13402,  12227,12721,14409,  DDX11L1
uc009viu.3  chr1    -   10  14361,14969,15795,16606,16857,17232,17914,18267,18500,18912,    14829,15038,15947,16765,17055,17742,18061,18369,18554,19759,    WASH7P
uc009viw.2  chr1    -   7   14406,16857,17232,17914,18267,24737,29320,  16765,17055,17742,18061,18366,24891,29370,  WASH7P
uc001aak.3  chr1    -   3   34610,35276,35720,  35174,35481,36081,  FAM138F
Reply written 23 days ago by Luis Nassar
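
If you would rather do the last part offline, the same restriction and the split into one record per exon can be sketched in a few lines. This is only an illustration under stated assumptions: the ID list and the knownGene rows are the samples shown above, the helper names (`load_ids`, `exons_to_bed`) and the `symbol|transcript|exonN` naming scheme are made up, and exons are numbered in genomic order, which on the minus strand is the reverse of transcript order.

```python
# Canonical transcript IDs as downloaded above, including the header line.
CANONICAL_IDS_TEXT = """\
#protein
uc010nxq.1
uc009viu.3
uc009viw.2
"""

# (name, chrom, strand, exonCount, exonStarts, exonEnds, geneSymbol)
KNOWN_GENE_ROWS = [
    ("uc010nxq.1", "chr1", "+", 3, "11873,12594,13402,", "12227,12721,14409,", "DDX11L1"),
    ("uc009viu.3", "chr1", "-", 10,
     "14361,14969,15795,16606,16857,17232,17914,18267,18500,18912,",
     "14829,15038,15947,16765,17055,17742,18061,18369,18554,19759,", "WASH7P"),
    ("uc009viw.2", "chr1", "-", 7,
     "14406,16857,17232,17914,18267,24737,29320,",
     "16765,17055,17742,18061,18366,24891,29370,", "WASH7P"),
    ("uc001aak.3", "chr1", "-", 3, "34610,35276,35720,", "35174,35481,36081,", "FAM138F"),
]


def load_ids(text):
    """Read one ID per line, skipping blanks and '#' header lines."""
    return {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }


def exons_to_bed(row):
    """Yield one 6-column BED line per exon of a knownGene row."""
    name, chrom, strand, exon_count, starts, ends, symbol = row
    starts = [int(x) for x in starts.split(",") if x]
    ends = [int(x) for x in ends.split(",") if x]
    if not (len(starts) == len(ends) == int(exon_count)):
        raise ValueError(f"exon count mismatch for {name}")
    for i, (start, end) in enumerate(zip(starts, ends), start=1):
        # Genomic-order exon number; coordinates are already 0-based, end-exclusive.
        yield f"{chrom}\t{start}\t{end}\t{symbol}|{name}|exon{i}\t0\t{strand}"


canonical_ids = load_ids(CANONICAL_IDS_TEXT)
bed_lines = [
    line
    for row in KNOWN_GENE_ROWS
    if row[0] in canonical_ids          # keep only the listed transcripts
    for line in exons_to_bed(row)
]
print("\n".join(bed_lines))
```

Because the sample ID list contains only three transcripts, uc001aak.3 is dropped here. With the full downloaded list, every canonical transcript would pass the filter.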