Question: Intron and exon distribution in a genome
0
gravatar for xiaofeng.dong12
4.5 years ago by
xiaofeng.dong1220 wrote:

Hi everyone!

May I ask a question? I wish to plot the exon and intron length distribution with a given window size (e.x 5000bp) in a genome. Can anyone please find a tool (script) for me?

Thanks 

sequencing • 1.7k views
ADD COMMENTlink modified 4.5 years ago by Alex Reynolds29k • written 4.5 years ago by xiaofeng.dong1220
0
gravatar for Alex Reynolds
4.5 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Say you are working with hg19 and have a BED file called hg19.extents.bed that defines bounds of each chromosome.

You can generate 5 kbase windows over the extents using BEDOPS bedops:

$ bedops --chop 5000 hg19.extents.bed > windows.bed

Say you also grab a GTF file containing GENCODE exon annotations of interest, which are converted to a BED file called exons.bed:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz \
   | gunzip --stdout - \
   | awk '$3=="exon"' - \
   | convert2bed -i gff - \
   > exons.bed

Using the windows and exon annotation BED files, you then map exons to windows and count the lengths of mapped exons using BEDOPS bedmap and the --echo-overlap-size operator:

$ bedmap --echo --echo-overlap-size --fraction-map 1 --delim '\t' windows.bed exons.bed > windows_with_exon_lengths.bed

(We add the --fraction-map 1 operator to ensure that the exon overlaps the window entirely (this eliminates potential double counts, if an exon straddles two adjoining windows). The --delim '\t' operator puts the length results in their own column.)

The fourth column of the file windows_with_exon_lengths.bed is a semi-colon-delimited string of numbers representing lengths of exons that overlap their associated window:

123;245;6421;36;2156;272;...

You can bring this file into R, parse the length strings in the fourth column into a long vector and make a histogram. Or do whatever calculation that makes sense for your experiment.

The steps above are for exons. You would repeat this process for introns, either sourcing a file containing intron annotations and making it into a BED file, or you can use gene and exon boundaries in a GENCODE or other annotation source to make a BED file containing introns that you can feed to bedmap.

ADD COMMENTlink modified 27 days ago by RamRS24k • written 4.5 years ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1216 users visited in the last hour