Question: Non-overlapping, de-duplicated BED file download with exons as separate records
4.1 years ago, abaluapuri wrote:

I would like to have a table of all genes (hg19 assembly, RefSeq) with their respective exons as separate records, but without isoforms, duplications or alternative splice variants. How do I download such a BED file?

Secondly, I would like to calculate the percentage of tags falling in exons, introns and intergenic regions of the hg19 assembly using the above BED file (and similar files for introns, etc.). Should I use coverageBed or BEDOPS?

Thanks in advance,

4.1 years ago, tiago211287 wrote:

On the Linux command line, you can use this mysql command to query the UCSC public MySQL server for the start and end positions of all hg19 RefSeq exons, together with the gene name and strand:

mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -N -e 'select chrom,exonStarts,exonEnds,name2,strand from refGene' > h19.genes

After that, run this awk command to split the comma-separated exon start and end fields into one row per exon:

awk '{ n = split($2, a, ","); split($3, b, ","); for(i=1; i<n; ++i) print $1, a[i], b[i], $4, $5 }' h19.genes > h19.genes.bed
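To see what the split does, here is a minimal sketch on a single made-up record (the coordinates and the gene name GENEX are invented for illustration, and split_exons is a hypothetical helper name). The exonStarts and exonEnds fields end with a trailing comma, so split() returns one extra empty element, which is why the loop stops at i<n. This version also sets OFS so the output is tab-separated from the start.

```shell
# Hypothetical helper wrapping the awk command above, with tab-separated output.
# Input columns: chrom, exonStarts, exonEnds, name2, strand
split_exons() {
  awk -v OFS="\t" '{
    # "100,300," splits into "100", "300", "" so n is 3 for two exons
    n = split($2, a, ","); split($3, b, ",")
    for (i = 1; i < n; ++i) print $1, a[i], b[i], $4, $5
  }'
}

# Made-up two-exon record; prints one row per exon
printf 'chr1\t100,300,\t200,400,\tGENEX\t+\n' | split_exons
```

The sample record yields two rows, chr1 100 200 GENEX + and chr1 300 400 GENEX +.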

You will need to sort the BED file by chromosome and then by start position, because bedtools merge expects position-sorted input:

sort -k1,1V -k2,2n h19.genes.bed > h19.genes.sorted.bed

Sometimes the file has spaces instead of tabs, which will crash bedtools. To fix this, use:

sed -i 's/ \+/\t/g' h19.genes.sorted.bed

Then copy the 5th column into a 6th column, since bedtools merge -s expects the strand in column 6:

awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$5}' h19.genes.sorted.bed > h19.genes.bed

Run bedtools merge to collapse the overlapping exons of all isoform variants into single records:

bedtools merge -s -c 4 -o distinct -i h19.genes.bed > tmp && mv tmp h19.genes.bed
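As an illustration of what this collapse does, the following awk sketch merges overlapping or book-ended intervals on the same strand and keeps the distinct names. It is not bedtools and is not meant to replace it; merge_sketch is a hypothetical helper, the sample records are made up, and the input is assumed to be a 6-column BED file sorted by chromosome and start.

```shell
# Illustration only: strand-aware interval merging in awk (not bedtools itself).
# Input: 6-column BED sorted by chrom then start; name in col 4, strand in col 6.
# Output columns: chrom, start, end, distinct names, strand.
merge_sketch() {
  awk -v OFS="\t" '
    function emit(k) { if (k in c) print c[k], st[k], en[k], nm[k], k }
    {
      s = $6
      if ((s in c) && c[s] == $1 && $2 + 0 <= en[s] + 0) {
        # Overlaps or touches the open interval on this strand: extend it
        if ($3 + 0 > en[s] + 0) en[s] = $3
        # Append the name only if it is not already in the list
        if (index("," nm[s] ",", "," $4 ",") == 0) nm[s] = nm[s] "," $4
      } else {
        # No overlap: print the open interval, then start a new one
        emit(s)
        c[s] = $1; st[s] = $2; en[s] = $3; nm[s] = $4
      }
    }
    END { for (k in c) emit(k) }
  '
}

# Made-up exons from two isoforms of the same gene; the first two overlap
printf 'chr1\t100\t200\tGENEX\t+\t+\nchr1\t150\t250\tGENEX\t+\t+\nchr1\t300\t400\tGENEX\t+\t+\n' | merge_sketch
```

The two overlapping exons become one record spanning 100 to 250, and the exon at 300 to 400 is left unchanged. With more than one strand in the input, the order of the final records printed by the END block is not guaranteed, so sort the output again if order matters.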

Then use bedtools multicov to count the number of reads overlapping each BED record.

Alternatively, you can use the UCSC Table Browser and set the options you need (once for exons, and again for introns).

The caveat here is that you do not get the gene names, only UCSC IDs (at least as far as I know).

For intergenic regions, I do not know of any simple way to get this. I would construct a BED file simply by extracting the regions between genes, using the coordinates of each gene.
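The idea of taking the regions between genes could be sketched in awk as follows. This is a minimal sketch under stated assumptions: the input is a BED file of gene intervals sorted by chromosome and start, gaps_between_genes is a hypothetical helper, and the sample coordinates are made up. It only reports gaps between consecutive genes, not the stretch before the first gene or after the last gene on each chromosome. bedtools also has a complement subcommand (bedtools complement -i genes.bed -g genome_file) that covers those ends, given a file of chromosome sizes.

```shell
# Illustration only: report the gaps between consecutive gene intervals.
# Input: BED sorted by chrom then start. Output: chrom, gap start, gap end.
gaps_between_genes() {
  awk -v OFS="\t" '
    # New chromosome: remember its name and the end of the first gene
    $1 != chrom { chrom = $1; maxend = $3 + 0; next }
    {
      # A gene starting beyond the furthest end seen so far leaves a gap
      if ($2 + 0 > maxend) print chrom, maxend, $2
      if ($3 + 0 > maxend) maxend = $3 + 0
    }
  '
}

# Made-up gene intervals: two overlapping genes, a third after a gap, one on chr2
printf 'chr1\t100\t200\nchr1\t150\t300\nchr1\t500\t600\nchr2\t50\t80\n' | gaps_between_genes
```

For the sample input this prints a single gap, chr1 300 500, because the first two genes overlap and chr2 has only one gene.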

Maybe a more senior bioinformatician here could help further.
