7.0 years ago by
Czech Republic
Download a bed file for the canonical transcripts using UCSC Table Browser:
- track: UCSC Genes
- table: knownCanonical
- output format: select fields from primary and related tables
- press get output
- select fields from hg19.knownCanonical: chrom, chromStart, chromEnd, transcript
- select fields from hg19.kgXref: geneSymbol
- press get output
The file UCSC_canonical.bed looks like:
#hg19.knownCanonical.chrom hg19.knownCanonical.chromStart hg19.knownCanonical.chromEnd hg19.knownCanonical.transcript hg19.kgXref.geneSymbol
chr1 11873 14409 uc010nxq.1 DDX11L1
chr1 14361 19759 uc009viu.3 WASH7P
chr1 14406 29370 uc009viw.2 WASH7P
chr1 34610 36081 uc001aak.3 FAM138F
chr1 69090 70008 uc001aal.1 OR4F5
chr1 134772 140566 uc021oeg.2 LOC729737
chr1 321083 321115 uc001aaq.2 DQ597235
chr1 321145 321207 uc001aar.2 DQ599768
chr1 322036 326938 uc009vjk.2 LOC100133331
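Note that the first line of this download is a header starting with `#`. A minimal sketch of dropping it and keeping only complete rows, assuming the file is tab-separated as shown above; the `demo_*` file names are made up for illustration so that nothing real gets overwritten:

```shell
# Tiny stand-in for UCSC_canonical.bed: the header line plus two rows copied from above.
printf '%s\t%s\t%s\t%s\t%s\n' \
  '#hg19.knownCanonical.chrom' 'hg19.knownCanonical.chromStart' 'hg19.knownCanonical.chromEnd' 'hg19.knownCanonical.transcript' 'hg19.kgXref.geneSymbol' \
  chr1 11873 14409 uc010nxq.1 DDX11L1 \
  chr1 14361 19759 uc009viu.3 WASH7P > demo_canonical.bed

# Drop comment lines and keep only rows with exactly five tab-separated fields.
awk -F'\t' '!/^#/ && NF == 5' demo_canonical.bed > demo_canonical_noheader.bed
cat demo_canonical_noheader.bed
```

The header does no harm in the join step further down, because it never matches a transcript ID, but removing it makes the file safer to pass to other tools.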
Download a bed file for all UCSC exons using UCSC Table Browser:
- track: UCSC Genes
- table: knownGene
- output format: BED - browser extensible data
- press get output
- select option Exons
- press get BED
The file UCSC_exons.bed looks like this:
chr1 11873 12227 uc001aaa.3_exon_0_0_chr1_11874_f 0 +
chr1 12612 12721 uc001aaa.3_exon_1_0_chr1_12613_f 0 +
chr1 13220 14409 uc001aaa.3_exon_2_0_chr1_13221_f 0 +
chr1 11873 12227 uc010nxr.1_exon_0_0_chr1_11874_f 0 +
chr1 12645 12697 uc010nxr.1_exon_1_0_chr1_12646_f 0 +
chr1 13220 14409 uc010nxr.1_exon_2_0_chr1_13221_f 0 +
chr1 11873 12227 uc010nxq.1_exon_0_0_chr1_11874_f 0 +
chr1 12594 12721 uc010nxq.1_exon_1_0_chr1_12595_f 0 +
chr1 13402 14409 uc010nxq.1_exon_2_0_chr1_13403_f 0 +
chr1 14361 14829 uc009vis.3_exon_0_0_chr1_14362_r 0 -
Modify the file to separate the transcript name from the rest of the information:
awk '{split ($4,a,"_"); {print $1"\t"$2"\t"$3"\t"a[1]"\t"a[3]"\t"$6}}' UCSC_exons.bed > UCSC_exons_modif.bed
The file UCSC_exons_modif.bed:
chr1 11873 12227 uc001aaa.3 0 +
chr1 12612 12721 uc001aaa.3 1 +
chr1 13220 14409 uc001aaa.3 2 +
chr1 11873 12227 uc010nxr.1 0 +
chr1 12645 12697 uc010nxr.1 1 +
chr1 13220 14409 uc010nxr.1 2 +
chr1 11873 12227 uc010nxq.1 0 +
chr1 12594 12721 uc010nxq.1 1 +
chr1 13402 14409 uc010nxq.1 2 +
chr1 14361 14829 uc009vis.3 0 -
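The split on `_` works here because UCSC transcript IDs such as `uc001aaa.3` contain no underscore. An ID that does contain one would be cut at its first underscore. The sketch below is a variant, not the command used above: it cuts at the literal `_exon_` marker instead. `name_split` is a hypothetical helper, and the second test name is a made-up placeholder that assumes the same suffix layout:

```shell
# Hypothetical helper: print transcript ID and exon index for one exon name.
name_split() {
  echo "$1" | awk '{
    id = $1;   sub(/_exon_.*$/, "", id)     # keep everything before "_exon_"
    rest = $1; sub(/^.*_exon_/, "", rest)   # keep everything after "_exon_"
    split(rest, a, "_")                     # a[1] is the exon index
    print id "\t" a[1]
  }'
}

name_split 'uc001aaa.3_exon_1_0_chr1_12613_f'   # exon name taken from UCSC_exons.bed
name_split 'NM_000xxx_exon_2_0_chr1_100_f'      # made-up name with an underscore in the ID
```

For the first name this gives the same transcript and exon index as the `split` approach. For the second, the ID stays whole instead of being reduced to `NM`.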
Join the sorted files based on the transcript identifier:
join -1 4 -2 4 <(sort -k4,4 UCSC_exons_modif.bed) <(sort -k4,4 UCSC_canonical.bed) | awk '{print $2"\t"$3"\t"$4"\t"$10"\t"$5"\t"$6}' | bedtools sort -i "-" > UCSC_exons_modif_canonical.bed
The final file contains exons of the canonical transcripts:
chr1 11873 12227 DDX11L1 0 +
chr1 12594 12721 DDX11L1 1 +
chr1 13402 14409 DDX11L1 2 +
chr1 14361 14829 WASH7P 0 -
chr1 14406 16765 WASH7P 0 -
chr1 14969 15038 WASH7P 1 -
chr1 15795 15947 WASH7P 2 -
chr1 16606 16765 WASH7P 3 -
chr1 16857 17055 WASH7P 4 -
chr1 16857 17055 WASH7P 1 -
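To see what the join does field by field, here is a small self-contained sketch on a few rows copied from the listings above. The `demo_*` file names are made up, and plain `sort` stands in for `bedtools sort` so that the example runs without bedtools; both are assumptions for illustration only:

```shell
# Exons for two transcripts; only uc010nxq.1 is canonical in this toy set.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 11873 12227 uc001aaa.3 0 + \
  chr1 12612 12721 uc001aaa.3 1 + \
  chr1 11873 12227 uc010nxq.1 0 + \
  chr1 12594 12721 uc010nxq.1 1 + > demo_exons.bed
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 11873 14409 uc010nxq.1 DDX11L1 > demo_canon.bed

# join expects both inputs sorted on the join field, using the same locale.
LC_ALL=C sort -k4,4 demo_exons.bed > demo_exons.sorted
LC_ALL=C sort -k4,4 demo_canon.bed > demo_canon.sorted

# After join: $1 = transcript, $2-$6 = exon chrom/start/end/index/strand,
# $7-$9 = canonical chrom/start/end, $10 = gene symbol.
LC_ALL=C join -1 4 -2 4 demo_exons.sorted demo_canon.sorted |
  awk -v OFS='\t' '{ print $2, $3, $4, $10, $5, $6 }' |
  sort -k1,1 -k2,2n > demo_result.bed
cat demo_result.bed
```

Only the two exons of uc010nxq.1 survive, labelled with the gene symbol DDX11L1. The exons of the non-canonical uc001aaa.3 are dropped because they have no partner in the canonical file.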
modified 14 months ago by Ram ♦ 32k • written 7.0 years ago by pristanna • 610
Hello.
I would need to get a BED file with coordinates for each exon of the canonical RefSeq transcripts.
I tried the UCSC solution above, but I see two problems:
- UCSC known canonical transcripts do not necessarily correspond to the RefSeq canonical ones (?)
- I cannot get the information per exon (e.g. NM_000xxx_exon1, NM_000xxx_exon2, ...).
As a second step, I would like to limit the gene content of my BED file to the gene list of the Clinical Genomic Database.
Could somebody help me with this?
Thank you in advance!