Download a bed file for the canonical transcripts using UCSC Table Browser:
- track: UCSC Genes
- table: knownCanonical
- output format: select fields from primary and related tables
- press get output
- select fields from hg19.knownCanonical: chrom, chromStart, chromEnd,
- transcript select fields from hg19.kgXref: geneSymbol
- press get output
The file UCSC_canonical.bed looks like:
#hg19.knownCanonical.chrom      hg19.knownCanonical.chromStart  hg19.knownCanonical.chromEnd    hg19.knownCanonical.transcript  hg19.kgXref.geneSymbol
chr1    11873   14409   uc010nxq.1      DDX11L1
chr1    14361   19759   uc009viu.3      WASH7P
chr1    14406   29370   uc009viw.2      WASH7P
chr1    34610   36081   uc001aak.3      FAM138F
chr1    69090   70008   uc001aal.1      OR4F5
chr1    134772  140566  uc021oeg.2      LOC729737
chr1    321083  321115  uc001aaq.2      DQ597235
chr1    321145  321207  uc001aar.2      DQ599768
chr1    322036  326938  uc009vjk.2      LOC100133331
Download a bed file for all UCSC exons using UCSC Table Browser:
- track: UCSC Genes
- table: knownGene
- output format: BED - browser extensible data
- press get output
- select option Exons
- press get BED
The file UCSC_exons.bed looks like this:
chr1    11873   12227   uc001aaa.3_exon_0_0_chr1_11874_f        0       +
chr1    12612   12721   uc001aaa.3_exon_1_0_chr1_12613_f        0       +
chr1    13220   14409   uc001aaa.3_exon_2_0_chr1_13221_f        0       +
chr1    11873   12227   uc010nxr.1_exon_0_0_chr1_11874_f        0       +
chr1    12645   12697   uc010nxr.1_exon_1_0_chr1_12646_f        0       +
chr1    13220   14409   uc010nxr.1_exon_2_0_chr1_13221_f        0       +
chr1    11873   12227   uc010nxq.1_exon_0_0_chr1_11874_f        0       +
chr1    12594   12721   uc010nxq.1_exon_1_0_chr1_12595_f        0       +
chr1    13402   14409   uc010nxq.1_exon_2_0_chr1_13403_f        0       +
chr1    14361   14829   uc009vis.3_exon_0_0_chr1_14362_r        0       -
Modify the file to separate the transcript name from the rest of the information:
awk '{split ($4,a,"_"); {print $1"\t"$2"\t"$3"\t"a[1]"\t"a[3]"\t"$6}}' UCSC_exons.bed > UCSC_exons_modif.bed
The file UCSC_exons_modif.bed:
chr1    11873   12227   uc001aaa.3      0       +
chr1    12612   12721   uc001aaa.3      1       +
chr1    13220   14409   uc001aaa.3      2       +
chr1    11873   12227   uc010nxr.1      0       +
chr1    12645   12697   uc010nxr.1      1       +
chr1    13220   14409   uc010nxr.1      2       +
chr1    11873   12227   uc010nxq.1      0       +
chr1    12594   12721   uc010nxq.1      1       +
chr1    13402   14409   uc010nxq.1      2       +
chr1    14361   14829   uc009vis.3      0       -
Join the sorted files on the transcript identifier:
join -1 4 -2 4 <(sort -k4,4 UCSC_exons_modif.bed) <(sort -k4,4 UCSC_canonical.bed) | awk '{print $2"\t"$3"\t"$4"\t"$10"\t"$5"\t"$6}' | bedtools sort -i "-" > UCSC_exons_modif_canonical.bed
The final file contains exons of the canonical transcripts:
chr1    11873   12227   DDX11L1 0       +
chr1    12594   12721   DDX11L1 1       +
chr1    13402   14409   DDX11L1 2       +
chr1    14361   14829   WASH7P  0       -
chr1    14406   16765   WASH7P  0       -
chr1    14969   15038   WASH7P  1       -
chr1    15795   15947   WASH7P  2       -
chr1    16606   16765   WASH7P  3       -
chr1    16857   17055   WASH7P  4       -
chr1    16857   17055   WASH7P  1       -
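If sorting both files for join is inconvenient, the same lookup can probably be done in a single awk pass. This is only a sketch under a few assumptions: join_canonical is a made-up helper name, plain sort stands in for bedtools sort, and the demo rows are copied from the files shown above.

```shell
# Hypothetical helper: $1 = canonical table, $2 = modified exon file.
join_canonical() {
    awk 'BEGIN { OFS = "\t" }
         # First file: remember transcript -> gene symbol, skipping the header.
         NR == FNR { if ($0 !~ /^#/) gene[$4] = $5; next }
         # Second file: keep only exons whose transcript is canonical.
         ($4 in gene) { print $1, $2, $3, gene[$4], $5, $6 }
    ' "$1" "$2" | sort -k1,1 -k2,2n
}

# Demo on a few rows taken from the files above.
tmp=$(mktemp -d)
printf 'chr1\t11873\t14409\tuc010nxq.1\tDDX11L1\n' > "$tmp/canonical.bed"
printf 'chr1\t11873\t12227\tuc001aaa.3\t0\t+\n'  > "$tmp/exons.bed"
printf 'chr1\t11873\t12227\tuc010nxq.1\t0\t+\n' >> "$tmp/exons.bed"
printf 'chr1\t12594\t12721\tuc010nxq.1\t1\t+\n' >> "$tmp/exons.bed"
printf 'chr1\t13402\t14409\tuc010nxq.1\t2\t+\n' >> "$tmp/exons.bed"
join_canonical "$tmp/canonical.bed" "$tmp/exons.bed"
```

On these rows it should print only the three DDX11L1 exons, since uc001aaa.3 is not in the canonical table.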
At the step "transcript select fields from hg19.kgXref: geneSymbol", one needs to check kgID in the hg19.kgXref table to produce the expected UCSC_canonical.bed.
Another important thing: exon numbering always runs in the forward direction! So if the gene is on the reverse strand, exon 0 in the file is actually the last exon of the gene.
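If you would rather have exon numbers in transcription order, one possible approach is to flip the index for minus-strand transcripts. A rough sketch, assuming the six-column layout of UCSC_exons_modif.bed and that every exon of a transcript is present in the file. renumber_exons is a made-up name and the txA rows are toy data, not real UCSC records.

```shell
# Hypothetical helper: $1 = file in the UCSC_exons_modif.bed layout
# (chrom, start, end, transcript, exon index, strand).
renumber_exons() {
    awk 'BEGIN { OFS = "\t" }
         # Pass 1: count the exons of each transcript.
         NR == FNR { n[$4]++; next }
         # Pass 2: on the minus strand, index i becomes (count - 1 - i).
         {
             idx = ($6 == "-") ? n[$4] - 1 - $5 : $5
             print $1, $2, $3, $4, idx, $6
         }
    ' "$1" "$1"
}

# Toy demo: a made-up three-exon transcript on the minus strand.
tmp=$(mktemp -d)
printf 'chr1\t100\t200\ttxA\t0\t-\n'  > "$tmp/toy.bed"
printf 'chr1\t300\t400\ttxA\t1\t-\n' >> "$tmp/toy.bed"
printf 'chr1\t500\t600\ttxA\t2\t-\n' >> "$tmp/toy.bed"
renumber_exons "$tmp/toy.bed"
```

The file is read twice on purpose, so the exon count is known before any line is rewritten. Plus-strand lines pass through unchanged.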
Pulling this thread again since I got this error:
Did you solve this issue?
I found that the following reproduced the output shown by pristanna. I made just one minor change to the method pristanna detailed:
Download a bed file for the canonical transcripts using UCSC Table Browser:
- track: UCSC Genes
- table: knownCanonical
- output format: select fields from primary and related tables
- press get output
- select fields from hg19.knownCanonical: chrom, chromStart, chromEnd, transcript
- select fields from hg19.kgXref: geneSymbol
- press get output
I had the same problem. Just download both BED files from UCSC again. You probably need to wait for the download to finish and confirm it.
Hi pristanna
Pulling this thread once again for a question:
Why does the file "UCSC_canonical.bed" contain all the transcripts of a gene with multiple isoforms? Shouldn't a canonical BED file contain only the "longest one"? For example, for the gene "WASH7P" this file has 2 transcripts: uc009viu.3 and uc009viw.2. What I wanted (or rather, what I expected) is that it would have only the canonical transcript (which is mostly the longest one, though there are varied views on that).
If it is expected to contain all the isoforms, is there any way to get ONLY the longest one from UCSC?
I ended up writing a custom Python script to fetch the canonical transcript.
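That script was not posted, so here is a rough awk sketch of one way to do a similar filter locally: keep the entry with the longest genomic span for each gene symbol. The assumptions are mine: longest_per_gene is a made-up name, "longest" means chromEnd minus chromStart rather than exonic length, and entries that share a symbol are treated as one gene even if they sit at different loci.

```shell
# Hypothetical helper: $1 = UCSC_canonical.bed
# (chrom, chromStart, chromEnd, transcript, geneSymbol).
longest_per_gene() {
    awk '/^#/ { next }                      # skip the header line
         {
             len = $3 - $2                  # genomic span, not exonic length
             if (!($5 in best) || len > bestlen[$5]) {
                 best[$5] = $0
                 bestlen[$5] = len
             }
         }
         END { for (g in best) print best[g] }
    ' "$1" | sort -k1,1 -k2,2n
}

# Demo on rows copied from UCSC_canonical.bed above.
tmp=$(mktemp -d)
printf '#hg19.knownCanonical.chrom\thg19.knownCanonical.chromStart\thg19.knownCanonical.chromEnd\thg19.knownCanonical.transcript\thg19.kgXref.geneSymbol\n' > "$tmp/canonical.bed"
printf 'chr1\t11873\t14409\tuc010nxq.1\tDDX11L1\n' >> "$tmp/canonical.bed"
printf 'chr1\t14361\t19759\tuc009viu.3\tWASH7P\n'  >> "$tmp/canonical.bed"
printf 'chr1\t14406\t29370\tuc009viw.2\tWASH7P\n'  >> "$tmp/canonical.bed"
longest_per_gene "$tmp/canonical.bed"
```

On these rows WASH7P is reduced to uc009viw.2, the longer of the two spans. Whether that is the transcript you actually want is a separate question.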
There is a consequential error in your response.
From your answer:
- select fields from hg19.knownCanonical: chrom, chromStart, chromEnd,
- transcript select fields from hg19.kgXref: geneSymbol
Should be:
- select fields from hg19.knownCanonical: chrom, chromStart, chromEnd, transcript
- select fields from hg19.kgXref: geneSymbol
Once I figured that out, it worked perfectly. Thank you!
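For anyone hitting the same problem, a quick check before running the join may save some time. This is a sketch only: check_canonical is a made-up name, and it assumes the Table Browser output is tab-delimited with exactly the five fields listed above.

```shell
# Hypothetical helper: $1 = UCSC_canonical.bed.
# Returns non-zero if any data line does not have 5 tab-separated fields.
check_canonical() {
    awk -F'\t' '!/^#/ && NF != 5 { bad++ }
         END {
             if (bad) {
                 printf "%d line(s) do not have 5 fields\n", bad
                 exit 1
             }
         }
    ' "$1"
}

# Demo: one file with the transcript column, one without it.
tmp=$(mktemp -d)
printf 'chr1\t11873\t14409\tuc010nxq.1\tDDX11L1\n' > "$tmp/good.bed"
printf 'chr1\t11873\t14409\tDDX11L1\n' > "$tmp/bad.bed"
check_canonical "$tmp/good.bed" && echo "good.bed looks fine"
check_canonical "$tmp/bad.bed" || echo "bad.bed: re-download with the transcript field selected"
```

If the transcript field was left out, the join step has no transcript ID to match on, which would fit the problem described in this thread.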