Hi everyone!
May I ask a question? I wish to plot the exon and intron length distribution with a given window size (e.x 5000bp) in a genome. Can anyone please find a tool (script) for me?
Thanks
Hi everyone!
May I ask a question? I wish to plot the exon and intron length distribution with a given window size (e.x 5000bp) in a genome. Can anyone please find a tool (script) for me?
Thanks
Say you are working with hg19
and have a BED file called hg19.extents.bed
that defines bounds of each chromosome.
You can generate 5 kbase windows over the extents using BEDOPS bedops:
$ bedops --chop 5000 hg19.extents.bed > windows.bed
Say you also grab a GTF file containing GENCODE exon annotations of interest, which are converted to a BED file called exons.bed
:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3=="exon"' - \
| convert2bed -i gff - \
> exons.bed
Using the windows and exon annotation BED files, you then map exons to windows and count the lengths of mapped exons using BEDOPS bedmap and the --echo-overlap-size
operator:
$ bedmap --echo --echo-overlap-size --fraction-map 1 --delim '\t' windows.bed exons.bed > windows_with_exon_lengths.bed
(We add the --fraction-map 1
operator to ensure that the exon overlaps the window entirely (this eliminates potential double counts, if an exon straddles two adjoining windows). The --delim '\t'
operator puts the length results in their own column.)
The fourth column of the file windows_with_exon_lengths.bed
is a semi-colon-delimited string of numbers representing lengths of exons that overlap their associated window:
123;245;6421;36;2156;272;...
You can bring this file into R, parse the length strings in the fourth column into a long vector and make a histogram. Or do whatever calculation that makes sense for your experiment.
The steps above are for exons. You would repeat this process for introns, either sourcing a file containing intron annotations and making it into a BED file, or you can use gene and exon boundaries in a GENCODE or other annotation source to make a BED file containing introns that you can feed to bedmap.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.