Question

Can Bedtools/Bedops Used To Extract Regions Where Scores Are Higher Than A Given Value?

0

Entering edit mode

10.8 years ago

Giovanni M Dall'Olio 28k

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value?

For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255

According to the BED format's specs, the fifth column contains a score, between 0 and 1000 (alternatively, in the bedGraph format the score is on the 4th position). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

$: awk '$5 > 20 {print}' mybedfile.bed

However, in order to use awk, I have to keep the BED file in a uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with the score search (e.g. get all scores that overlap a region and are higher than a value).

bedtools awk filter • 5.1k views

ADD COMMENT • link updated 10.8 years ago by Alex Reynolds 35k • written 10.8 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

I don't get it, why can't you just do: zcat file.bed.gz | awk '$5 > 20 {print}' |bgzip -c > out.gz

ADD REPLY • link 10.8 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

mainly two reasons: 1) I want to try the .starch format in BEDOPS, without having to decompress every time; 2) I wonder why there are options to calculate the mean, max scores for a region without having to decompress the file, and not for getting values higher than a threshold.

ADD REPLY • link 10.8 years ago by Giovanni M Dall'Olio 28k

1

Entering edit mode

It seems that the program merely calls burrows-wheeler zip or normal gzip. So internally it will decompress. For 2) I would use tabix indexed files and iterate over regions. I am not familiar enough with bedops to know whether it can do that.

ADD REPLY • link 10.8 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

We actually do quite a bit better than other algorithms by first pre-processing the BED data before compression:

Starch overview

We also offer per-chromosome indexing with the --chrom operator. I am working on modifying our Starch format that would extend our chromosome-indexing approach to support querying regions, but it remains to be seen whether this can be done without sacrificing too much of the compression efficiency improvements in the Starch approach.

Part of the cost of pre-processing means having to re-process on extraction. Querying regions may require some overhead to store "signposts" that allow reconstruction of the original data without starting from the front end of each per-chromosome data stream. (Or we skip pre-processing and just naively compress sorted BED data, but that's not as interesting a problem.)

ADD REPLY • link 10.8 years ago by Alex Reynolds 35k

Ram · Answer 1 · 2013-06-21

First, compress sorted BED data to Starch:

$ sort-bed mybedfile.bed \
    | starch - \
    > mybedfile.starch

Once you have this archive ready, you can extract to standard output, piping to awk to filter on the score value:

$ unstarch mybedfile.starch \
    | awk '$5 > 20' - \
    > filteredFile.bed

If you want to filter just one chromosome, you can specify that as an option to unstarch:

$ unstarch chr7 mybedfile.starch \
    | awk '$5 > 20' - \
    > filteredChr7.bed

The Starch format does indexing and random access by chromosome. It also pre-processes the BED data before compressing again with bzip2 or gzip, providing significantly improved space savings for most BED inputs.

You can compress a large dataset more efficiently than with a straight-up application of bzip2 or gzip, as well as instantaneously pull out and work on only the chromosome you are interested in.

If you want to keep everything compressed, just pipe the result into starch again:

$ unstarch mybedfile.starch \
    | awk '$5 > 20' - \
    | starch - \
    > filteredFile.starch

I suspect that the bedmap tool offers more calculation options than filtering, because it is fairly trivial to use UNIX input/output streams to pass data to tools like awk to implement that kind of custom logic.

There are occasions when filtering-like options are added when it suits some internal need and reduces complexity — I think --indicator is one example, where it replaces a specific combination of --count and awk in a straightforward manner, i.e.:

$ bedmap --count foo bar | awk '{ print ($1 > 0 ? "1" : "0") }' -

is equivalent to:

$ bedmap --indicator foo bar

If you're just interested in results for one chromosome in a larger BED dataset, you can also pass the --chrom operator:

$ bedmap --chrom chr7 --indicator foo bar

The files foo and bar can be a mix of sorted BED files or Starch archives.

The --chrom operator facilitates easy parallelization of whole-genome analyses, or when you just want to do some quick work on one chromosome without wasting time streaming through the entire file.