Intron and exon distribution in a genome
1
0
Entering edit mode
9.0 years ago

Hi everyone!

May I ask a question? I wish to plot the exon and intron length distribution with a given window size (e.x 5000bp) in a genome. Can anyone please find a tool (script) for me?

Thanks

sequencing • 2.5k views
ADD COMMENT
0
Entering edit mode
9.0 years ago

Say you are working with hg19 and have a BED file called hg19.extents.bed that defines bounds of each chromosome.

You can generate 5 kbase windows over the extents using BEDOPS bedops:

$ bedops --chop 5000 hg19.extents.bed > windows.bed

Say you also grab a GTF file containing GENCODE exon annotations of interest, which are converted to a BED file called exons.bed:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz \
   | gunzip --stdout - \
   | awk '$3=="exon"' - \
   | convert2bed -i gff - \
   > exons.bed

Using the windows and exon annotation BED files, you then map exons to windows and count the lengths of mapped exons using BEDOPS bedmap and the --echo-overlap-size operator:

$ bedmap --echo --echo-overlap-size --fraction-map 1 --delim '\t' windows.bed exons.bed > windows_with_exon_lengths.bed

(We add the --fraction-map 1 operator to ensure that the exon overlaps the window entirely (this eliminates potential double counts, if an exon straddles two adjoining windows). The --delim '\t'operator puts the length results in their own column.)

The fourth column of the file windows_with_exon_lengths.bed is a semi-colon-delimited string of numbers representing lengths of exons that overlap their associated window:

123;245;6421;36;2156;272;...

You can bring this file into R, parse the length strings in the fourth column into a long vector and make a histogram. Or do whatever calculation that makes sense for your experiment.

The steps above are for exons. You would repeat this process for introns, either sourcing a file containing intron annotations and making it into a BED file, or you can use gene and exon boundaries in a GENCODE or other annotation source to make a BED file containing introns that you can feed to bedmap.

ADD COMMENT

Login before adding your answer.

Traffic: 1780 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6