Question: Multiple Exon Start and End sites in UCSC Exons table
5.3 years ago by
United States
dally190 wrote:

I'm attempting to get some information for hg19 exons:


I need chr, start, end, strand, and gene name

I was able to do this via the UCSC Table Browser -> RefSeq -> refGene -> selected fields from primary and related tables.


I chose the following fields:

chrom, strand, exonStarts, exonEnds, name2

This gives me exactly what I need, except that it puts multiple exonStarts and exonEnds on the same row, so I can't run it through the programs I typically use (bedtools etc.).

I know it's possible to separate these start and end sites into separate rows using something like awk, but after spending a bit of time (Obtaining Exon Lengths:) trying to figure it out, I can't seem to do it.

I was hoping somebody could tell me how to separate these multiple exon starts and ends into different rows and remove duplicate start and end sites (for instance, the first two rows have similar exon start and end sites).
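
For reference, the row-splitting described here could be sketched in Python roughly as follows. This is only a sketch under a few assumptions: the Table Browser output is tab-separated with the five columns in the order chrom, strand, exonStarts, exonEnds, name2; header lines start with `#`; and the exon lists are comma-separated with a trailing comma. The function name `split_exons` and the sample rows (including the gene name `GENE_A`) are made up for illustration.

```python
def split_exons(lines):
    """Yield one (chrom, start, end, gene, strand) tuple per unique exon.

    Assumes tab-separated rows: chrom, strand, exonStarts, exonEnds, name2.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        chrom, strand, starts, ends, gene = line.split("\t")
        # The exon lists end with a trailing comma, so drop empty pieces.
        start_list = [s for s in starts.split(",") if s]
        end_list = [e for e in ends.split(",") if e]
        for start, end in zip(start_list, end_list):
            exon = (chrom, int(start), int(end), gene, strand)
            if exon not in seen:  # exons shared between transcripts appear once
                seen.add(exon)
                yield exon


# Made-up sample rows: two transcripts of one hypothetical gene sharing an exon.
sample = [
    "#chrom\tstrand\texonStarts\texonEnds\tname2\n",
    "chr1\t+\t100,300,\t200,400,\tGENE_A\n",
    "chr1\t+\t100,500,\t200,600,\tGENE_A\n",
]

for chrom, start, end, gene, strand in split_exons(sample):
    # BED6-style columns: chrom, start, end, name, score, strand
    print(chrom, start, end, gene, 0, strand, sep="\t")
```

Writing the rows in BED6 order (with a placeholder score of 0) keeps the gene name in the name column, which should let tools such as bedtools read the file directly.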


Thank you!

awk ucsc • 1.4k views
5.3 years ago by
Vivek2.4k wrote:

You can export the data directly in BED format from the Table Browser, which is more convenient for downstream analysis. You'll end up with the following format:

chr1 66999638 67000051 NM_032291_exon_0_0_chr1_66999639_f 0 +
chr1 67091529 67091593 NM_032291_exon_1_0_chr1_67091530_f 0 +
chr1 67098752 67098777 NM_032291_exon_2_0_chr1_67098753_f 0 +
chr1 67101626 67101698 NM_032291_exon_3_0_chr1_67101627_f 0 +

I've looked into this as well. The problem is that I need the gene name information, which is why I didn't consider it. Is it possible to annotate this BED file and then use that (after removing unnecessary columns)?

Basically, I'll be taking this BED file as a text file and overlapping it against Pol II ChIP-seq peaks.

written 5.3 years ago by dally190

I thought you could probably get that with GTF, but UCSC uses the transcript name as the gene name in its GTF output. The best I can think of is to annotate the BED file with a script that uses a lookup table/hash mapping each transcript name to its gene name and appends the gene name as the last column of the BED file.
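
A minimal Python sketch of that lookup approach might look like this. It assumes the lookup table is a two-column, tab-separated file of transcript name and gene name (for example the `name` and `name2` fields of refGene), and that the transcript name is everything before `_exon_` in the BED name column, as in the rows above. The function names and the gene name `GENE_A` are hypothetical, used only for illustration.

```python
def load_lookup(lines):
    """Build a transcript -> gene dictionary from tab-separated lines."""
    lookup = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        transcript, gene = line.rstrip("\n").split("\t")[:2]
        lookup[transcript] = gene
    return lookup


def annotate_bed(bed_lines, lookup, missing="NA"):
    """Yield each BED row with the gene name appended as the last column."""
    for line in bed_lines:
        fields = line.split()
        if not fields:
            continue
        # Name column looks like NM_032291_exon_0_0_chr1_66999639_f
        transcript = fields[3].split("_exon_")[0]
        fields.append(lookup.get(transcript, missing))
        yield "\t".join(fields)


# GENE_A is a placeholder gene name, not the real annotation.
lookup = load_lookup(["#name\tname2\n", "NM_032291\tGENE_A\n"])
bed = ["chr1 66999638 67000051 NM_032291_exon_0_0_chr1_66999639_f 0 +"]
for row in annotate_bed(bed, lookup):
    print(row)
```

Transcripts missing from the lookup table get `NA` here rather than being dropped, so any gaps in the mapping stay visible.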

written 5.3 years ago by Vivek2.4k

Gotcha. I think I can do that. Thanks.

written 5.3 years ago by dally190