Quickly Plotting Distribution Of Points From Bed File?
3
3
Entering edit mode
8.9 years ago

I've got about 10e5-10e6 small regions in a bed file and I want to quickly inspect their genome distribution. I expect them to be evenly distributed in all chromosomes, so just plotting the results in something like a Manhattan plot would be useful. Is there anything like that I can use?

Something like:

plot_genomic_distribution --input myfile.bed --output myfile.png


Even something that just plots on a terminal would be good, like a stem and leaves plot vertically over the chromosomal arms.

bed • 7.9k views
5
Entering edit mode
8.9 years ago
Irsan ★ 7.6k

No need for binning the bed-file to plot a distribution if you use famous ggplot2 R library. Have a look at the ggplot2-solution at Plotting Density Of Snps On Chromosomes for plotting a distribution of features (in the example SNPs) across all chromosomes

3
Entering edit mode
8.9 years ago

One option is to map your regions to disjoint windows over a chromosome, counting the number of regions within each window.

To this end, you can use BEDOPS bedmap and its --count operator.

First, decide on chromosome bounds. For instance, for the hg19 build of the human genome, you could use the extents available here in hg19.extents.bed:

Second, make some 10 kb windows with awk and store them in a file called hg19.disjointWindows.bed (or adjust window size, depending on the granularity you want):

$awk ' \ BEGIN {\ windowSize = 10000; \ } \ { \ extentChromosome =$1; \
extentStart = $2; \ extentStop =$3; \
for (windowStart = extentStart; windowStart < extentStop; windowStart += windowSize) { \
windowStop = windowStart + windowSize; \
print extentChromosome"\t"windowStart"\t"windowStop; \
} \
}' hg19.extents.bed \
> hg19.disjointWindows.bed


This is guaranteed to be sorted, so we don't need to use any other tools (e.g., sort-bed) to sort this data. So we can use these windows with BEDOPS tools directly.

Third, we prepare your regions. Let's say your regions are in a BED file called unsortedRegions.bed, but we don't know its sort order. So we first sort this data with BEDOPS sort-bed:

$sort-bed unsortedRegions.bed > regions.bed  Fourth, we count: $ bedmap --echo-map --count regions.bed hg19.disjointWindows.bed > answer.bed


The file answer.bed will contain data in the following format:

[ window-1 ] | [ number-of-regions-overlapping-window-1 ]
[ window-2 ] | [ number-of-regions-overlapping-window-2 ]
. . .
[ window-n ] | [ number-of-regions-overlapping-window-n ]


Where each [ window-i ] is the i-th element in hg19.disjointWindows.bed, and [ number-of-regions-... ] is the count of overlapping regions over that window.

If you want to split this file further by chromosome, use BEDOPS bedextract. In a bash shell, this is a very fast one-liner:

$for chr in bedextract --list-chr answer.bed; do bedextract$chr answer.bed > $chr.answer.bed; done  You can then bring answer.bed or its separated chromosome files into R or gnuplot, plotting a log-scaled histogram or visualizing it in any number of other ways. ADD COMMENT 0 Entering edit mode Hi Alex, Thanks for posting this, wondering about the way you generated the extents file, I need to do essentially the same thing for mm10. Rob. ADD REPLY 1 Entering edit mode Could use Kent utilities: $ fetchChromSizes mm10 | awk '{ print $1"0\t"$2 }' > mm10.extents.bed

0
Entering edit mode

Yes that works well, weirdly though I get an empty mm10.disjointWindows.bed file after running the awk script.

0
Entering edit mode

Yeah, not sure why that is. Maybe do:

$fetchChromSizes mm10 | head  and see what you get. ADD REPLY 0 Entering edit mode Ah, sorry, I misread. I'm not sure why you get an empty result. I'd break things down and use head to see what comes out at various stages. ADD REPLY 1 Entering edit mode Another option is to use bedops --chop to make the 10k windows, instead of awk: $ bedops --chop 10000 mm10.extents.bed > mm10.disjointWindows.bed


1
Entering edit mode
8.9 years ago

You could load the bed file into a GRanges object and then just use plotGrandLinear from ggbio to get a manhattan plot.