Question

Software For Simple Statistics Of Bed Files

5

13.3 years ago

Bioscientist ★ 1.7k

I've been dealing with structural variants in BED files for a long time. For example:

chr1 232323 434344 deletion
chr2 673484 898788 deletion
...

For SV studies, we always need to plot the frequency of SVs in each size class. Say, to count the deletions shorter than 200 bp, I usually run the following bash command:

cat xxx.bed|awk '{if($3-$2<200) print}'|wc -l

However, with a large number of intervals, and a large number of BED files as well, it is painful to rely only on such commands.

Is there any software that does this job at large scale in a handy way? I usually plot histograms in OpenOffice; now I want to switch to, say, R/Bioconductor, but can anyone show me how to do it?

bed r bioconductor statistics • 4.3k views

ADD COMMENT • link updated 13.3 years ago by Alex Reynolds 36k • written 13.3 years ago by Bioscientist ★ 1.7k

score 6 · Answer 1 · 2012-03-24

6

13.3 years ago

Sean Davis 27k

You will need to do some reading on R, but to get you going:

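# Read the BED file (tab-separated, no header line)
bedfile = read.table('xxx.bed', sep="\t", header=FALSE)
# Interval size = end (column 3) minus start (column 2)
intervalSize = abs(bedfile[,3]-bedfile[,2])
hist(intervalSize)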

There are MANY resources for learning to use R, but a useful place to start is here:

http://manuals.bioinformatics.ucr.edu/
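If there are too many files, or they are too large, to load one at a time, the interval sizes can also be pre-binned on the command line and the resulting small table plotted in R or a spreadsheet. The following is only a minimal sketch in shell and awk: `size_histogram` is a hypothetical helper name, the bin width of 200 is an arbitrary choice, and the two small files are made-up test data.

```shell
# Hypothetical helper: bin interval sizes (end - start) into fixed-width bins
# and print "bin_start<TAB>count", sorted by bin start.
# Usage: size_histogram BIN_WIDTH file1.bed [file2.bed ...]
size_histogram() {
    bin_width="$1"; shift
    awk -v w="$bin_width" '
        # skip short lines and common non-data lines (comments, track/browser headers)
        NF < 3 || /^(#|track|browser)/ { next }
        {
            size = $3 - $2
            if (size < 0) size = -size   # tolerate swapped coordinates
            bin = int(size / w) * w      # lower edge of the bin this size falls in
            count[bin]++
        }
        END { for (b in count) printf "%d\t%d\n", b, count[b] }
    ' "$@" | sort -n
}

# Example with two small made-up files
tmpdir=$(mktemp -d)
printf 'chr1\t100\t250\tdeletion\nchr1\t500\t1200\tdeletion\n' > "$tmpdir/a.bed"
printf 'chr2\t10\t60\tdeletion\nchr2\t300\t480\tdeletion\n' > "$tmpdir/b.bed"
size_histogram 200 "$tmpdir"/a.bed "$tmpdir"/b.bed
rm -rf "$tmpdir"
```

All files passed to the helper are pooled into one table; to get one table per file, call it once for each file.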

ADD COMMENT • link 13.3 years ago by Sean Davis 27k

1

+1 for a very nice R tutorial link

ADD REPLY • link 13.3 years ago by Gjain 5.8k

0

thx! very nice R tutorial!

ADD REPLY • link 13.3 years ago by Bioscientist ★ 1.7k

score 1 · Answer 2 · 2012-06-18

The BEDOPS suite includes bedmap, which performs statistical calculations on BED elements. For your question, the --bases operator returns the number of bases of overlap between the reference element and mapping elements. If we use the input BED file for both reference and mapping, then we simply get back the number of bases that make up each element.

For example, let's say we have the following BED file:

chr2    160 210 id-2    4
chr2    220 490 id-3    10

We can run bedmap on this file to get the number of bases in each element:

$ bedmap --echo --bases --delim '\t' test.bed
chr2    160 210 id-2    4   50
chr2    220 490 id-3    10  270

We can pipe to awk or any other downstream tool for filtering:

$ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print}' -
chr2    160 210 id-2    4   50

This result can be piped to wc -l to get the number of filtered elements:

$ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print $0}' - | wc -l
1

Or we can pipe to cut to get back filtered elements:

$ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print}' - | cut -f1-5
chr2    160 210 id-2    4

In all, this seems a more complicated approach than what you use for your example case. However, BEDOPS bedmap can perform many statistical operations on BED elements, and so you might find it useful for other applications.

BEDOPS also has a very low memory footprint and scales efficiently to very large genomic datasets, whereas it is easier to run into memory, performance and scaling issues when using R with large datasets.
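When all that is needed is a count per file across many BED files, the size filter from the question can be wrapped in a small loop. This is a minimal sketch in plain shell and awk (it does not use BEDOPS): `count_short_intervals` is a hypothetical helper name, and the 200 bp threshold and the example file are assumptions for illustration.

```shell
# Hypothetical helper: for each BED file, print "file<TAB>number of intervals
# shorter than MAX_SIZE".
# Usage: count_short_intervals MAX_SIZE file1.bed [file2.bed ...]
count_short_intervals() {
    max_size="$1"; shift
    for bed in "$@"; do
        # "n + 0" makes awk print 0 instead of an empty string when nothing matches
        n=$(awk -v max="$max_size" \
            'NF >= 3 && ($3 - $2) < max { n++ } END { print n + 0 }' "$bed")
        printf '%s\t%s\n' "$bed" "$n"
    done
}

# Example with a made-up file holding the two elements used above
tmpdir=$(mktemp -d)
printf 'chr2\t160\t210\tid-2\t4\nchr2\t220\t490\tid-3\t10\n' > "$tmpdir/test.bed"
count_short_intervals 200 "$tmpdir"/*.bed
rm -rf "$tmpdir"
```

In practice the last argument would be something like `*.bed` in the directory holding the real files; the output is a two-column table that is easy to load into R.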