Question

Intron and exon distribution in a genome

0

Entering edit mode

9.0 years ago

xiaofeng.dong12 ▴ 20

Hi everyone!

May I ask a question? I wish to plot the exon and intron length distribution with a given window size (e.x 5000bp) in a genome. Can anyone please find a tool (script) for me?

Thanks

sequencing • 2.5k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 9.0 years ago by xiaofeng.dong12 ▴ 20

Ram · Answer 1 · 2015-05-05

Say you are working with hg19 and have a BED file called hg19.extents.bed that defines bounds of each chromosome.

You can generate 5 kbase windows over the extents using BEDOPS bedops:

$ bedops --chop 5000 hg19.extents.bed > windows.bed

Say you also grab a GTF file containing GENCODE exon annotations of interest, which are converted to a BED file called exons.bed:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz \
   | gunzip --stdout - \
   | awk '$3=="exon"' - \
   | convert2bed -i gff - \
   > exons.bed

Using the windows and exon annotation BED files, you then map exons to windows and count the lengths of mapped exons using BEDOPS bedmap and the --echo-overlap-size operator:

$ bedmap --echo --echo-overlap-size --fraction-map 1 --delim '\t' windows.bed exons.bed > windows_with_exon_lengths.bed

(We add the --fraction-map 1 operator to ensure that the exon overlaps the window entirely (this eliminates potential double counts, if an exon straddles two adjoining windows). The --delim '\t'operator puts the length results in their own column.)

The fourth column of the file windows_with_exon_lengths.bed is a semi-colon-delimited string of numbers representing lengths of exons that overlap their associated window:

123;245;6421;36;2156;272;...

You can bring this file into R, parse the length strings in the fourth column into a long vector and make a histogram. Or do whatever calculation that makes sense for your experiment.

The steps above are for exons. You would repeat this process for introns, either sourcing a file containing intron annotations and making it into a BED file, or you can use gene and exon boundaries in a GENCODE or other annotation source to make a BED file containing introns that you can feed to bedmap.