Say you are working with hg19 and have a BED file called hg19.extents.bed that defines the bounds of each chromosome. You can generate 5 kb windows over the extents using BEDOPS bedops:
$ bedops --chop 5000 hg19.extents.bed > windows.bed
Say you also grab a GFF3 file containing GENCODE exon annotations of interest, which are converted to a BED file called exons.bed:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3=="exon"' - \
    | convert2bed -i gff - \
    > exons.bed
Using the windows and exon annotation BED files, you then map exons to windows and report the lengths of the mapped exons using BEDOPS bedmap and the --echo-overlap-size operator:
$ bedmap --echo --echo-overlap-size --fraction-map 1 --delim '\t' windows.bed exons.bed > windows_with_exon_lengths.bed
(We add the --fraction-map 1 operator to ensure that each counted exon lies entirely within its window. This eliminates potential double counts: an exon that straddles two adjoining windows is left out of both rather than counted in each. The --delim '\t' operator puts the length results in their own column.)
The fourth column of the file windows_with_exon_lengths.bed is a semicolon-delimited string of numbers representing the lengths of the exons mapped to each window.
You can bring this file into R, parse the length strings in the fourth column into one long vector, and make a histogram, or do whatever calculation makes sense for your experiment.
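If you would rather flatten the length strings before loading them into R, something like the following awk sketch may help. The file sample_windows.bed and its values are made up for illustration, and the sketch assumes windows without mapped exons carry an empty fourth column, so it simply skips empty fields.

```shell
# Hypothetical sample in the shape of windows_with_exon_lengths.bed (values are invented).
printf 'chr1\t0\t5000\t120;85\nchr1\t5000\t10000\t\nchr1\t10000\t15000\t300\n' > sample_windows.bed

# Split the fourth column on ";" and print one exon length per line,
# skipping windows whose fourth column is empty.
awk -F'\t' '$4 != "" {
    n = split($4, len, ";")
    for (i = 1; i <= n; i++) print len[i]
}' sample_windows.bed > exon_lengths.txt

cat exon_lengths.txt   # prints 120, 85 and 300 on separate lines
```

The one-number-per-line file can then be read into R as a numeric vector, for example with scan(), and passed to hist().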
The steps above are for exons. You would repeat this process for introns, either by sourcing a file of intron annotations and converting it to BED, or by using gene and exon boundaries from GENCODE or another annotation source to make a BED file of introns that you can feed to bedmap.
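As a rough sketch of the second route, introns can be derived as the gaps between consecutive exons of the same gene. This is only an illustration under assumptions: exons_by_gene.bed is a hypothetical file, sorted by gene and then start coordinate, with a gene ID in the fourth column, and the coordinates are invented. With BEDOPS you could instead subtract exons from gene extents using bedops --difference.

```shell
# Hypothetical input: exons sorted by gene, then start; column 4 holds a gene ID.
printf 'chr1\t100\t200\tgeneA\nchr1\t300\t400\tgeneA\nchr1\t350\t500\tgeneA\nchr1\t900\t950\tgeneB\n' > exons_by_gene.bed

awk 'BEGIN { OFS = "\t" }
{
    if ($1 == chrom && $4 == gene) {
        # A gap between the furthest exon end seen so far and this exon start is an intron.
        if ($2 > end) print chrom, end, $2, gene
        # Extend the covered region, which also handles overlapping exons.
        if ($3 > end) end = $3
    } else {
        # First exon of a new gene: start tracking from its end coordinate.
        chrom = $1; gene = $4; end = $3
    }
}' exons_by_gene.bed > introns.bed

cat introns.bed   # prints: chr1  200  300  geneA (tab-separated)
```

The resulting introns.bed can take the place of exons.bed in the bedmap step, after sorting with sort-bed if the gene-wise order differs from genomic order.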