Softwares For Simple Statistics Of Bed Files
2
5
Entering edit mode
9.6 years ago
Bioscientist ★ 1.7k

I'm been dealing with structural variants using bed files for a long time. For example:

chr1 232323 434344 deletion
chr2 673484 898788 deletion
...


For SV study, we always need to plot the frequency of SVs of each different sizes. Say, to calculate the frequency of deletions less than 200bp, I usually do the following bash command:

cat xxx.bed|awk '{if($3-$2<200) print}'|wc -l


However, for a large number of intervals, and a large number of bed files as well, it'll be painful to only rely on such commands.

Do we have any software to do such job in large-scale in a handy way? I usually plot histogram in Openoffice; now I wanna switch to, say, R/Bioconfuctor, but can anyone show me how to do it?

bed r bioconductor statistics • 2.9k views
6
Entering edit mode
9.6 years ago

You will need to do some reading on R, but to get you going:

bedfile = read.table('xxx.bed',sep="\t")
intervalSize = abs(bedfile[,3]-bedfile[,2])
hist(intervalSize)


There are MANY resources for learning to use R, but a useful place to start is here:

http://manuals.bioinformatics.ucr.edu/

1
Entering edit mode

+1 for a very nice R tutorial link

0
Entering edit mode

thx! very nice R tutorial!

1
Entering edit mode
9.4 years ago

The BEDOPS suite includes bedmap, which performs statistical calculations on BED elements. For your question, the --bases operator returns the number of bases of overlap between the reference element and mapping elements. If we use the input BED file for both reference and mapping, then we simply get back the number of bases that make up each element.

For example, let's say we have the following BED file:

chr2    160 210 id-2    4
chr2    220 490 id-3    10


We can run bedmap on this file to get the number of bases in each element:

$bedmap --echo --bases --delim '\t' test.bed chr2 160 210 id-2 4 50 chr2 220 390 id-3 10 170  We can pipe to awk or any other downstream tool for filtering: $ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print}' - chr2 160 210 id-2 4 50  This result can be piped to wc -l to get the number of filtered elements: $ bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print$0}' - | wc -l
1


Or we can pipe to cut to get back filtered elements:

$bedmap --echo --bases --delim '\t' test.bed | awk '{if ($6 < 200) print}' - | cut -f1-5
chr2    160 210 id-2    4


In all, this seems a more complicated approach than what you use for your example case. However, BEDOPS bedmap can perform many statistical operations on BED elements, and so you might find it useful for other applications.

BEDOPS also has a very low memory footprint and scales efficiently to very large genomic datasets, whereas it can be easier to run into memory, performance and scaling issues with using R with large datasets.