Can Bedtools/Bedops Used To Extract Regions Where Scores Are Higher Than A Given Value?
1
0
Entering edit mode
8.4 years ago

I have a very basic question about bedtools and bedops. Can I use these tools to filter all the regions where the score is higher (or lower) than a given value?

For example, let's say that I have a BED file like the following:

chr7    127471196  127472363  Pos1  12   +  127471196  127472363  255,0,0
chr7    127472363  127473530  Pos2  200  +  127472363  127473530  255,0,0
chr7    127473530  127474697  Pos3  120  +  127473530  127474697  255,0,0
chr7    127474697  127475864  Pos4  54   +  127474697  127475864  255,0,0
chr7    127475864  127477031  Neg1  2    -  127475864  127477031  0,0,255
chr7    127477031  127478198  Neg2  15   -  127477031  127478198  0,0,255
chr7    127478198  127479365  Neg3  25   -  127478198  127479365  0,0,255
chr7    127479365  127480532  Pos5  2    +  127479365  127480532  255,0,0
chr7    127480532  127481699  Neg4  9    -  127480532  127481699  0,0,255


According to the BED format's specs, the fifth column contains a score, between 0 and 1000 (alternatively, in the bedGraph format the score is on the 4th position). If I want to get all the regions that have a score higher than 20, for example, I can do an awk search:

$: awk '$5 > 20 {print}' mybedfile.bed


However, in order to use awk, I have to keep the BED file in a uncompressed format. It would be much better if I could use the .starch format in Bedops, or if I could combine any Bedops/Bedtools operation with the score search (e.g. get all scores that overlap a region and are higher than a value).

bedtools awk filter • 3.6k views
0
Entering edit mode

I don't get it, why can't you just do: zcat file.bed.gz | awk '$5 > 20 {print}' |bgzip -c > out.gz ADD REPLY 0 Entering edit mode mainly two reasons: 1) I want to try the .starch format in BEDOPS, without having to decompress every time; 2) I wonder why there are options to calculate the mean, max scores for a region without having to decompress the file, and not for getting values higher than a threshold. ADD REPLY 1 Entering edit mode It seems that the program merely calls burrows-wheeler zip or normal gzip. So internally it will decompress. For 2) I would use tabix indexed files and iterate over regions. I am not familiar enough with bedops to know whether it can do that. ADD REPLY 0 Entering edit mode We actually do quite a bit better than other algorithms by first pre-processing the BED data before compression: We also offer per-chromosome indexing with the --chrom operator. I am working on modifying our Starch format that would extend our chromosome-indexing approach to support querying regions, but it remains to be seen whether this can be done without sacrificing too much of the compression efficiency improvements in the Starch approach. Part of the cost of pre-processing means having to re-process on extraction. Querying regions may require some overhead to store "signposts" that allow reconstruction of the original data without starting from the front end of each per-chromosome data stream. (Or we skip pre-processing and just naively compress sorted BED data, but that's not as interesting a problem.) ADD REPLY 2 Entering edit mode 8.4 years ago First, compress sorted BED data to Starch: $ sort-bed mybedfile.bed \
| starch - \
> mybedfile.starch


Once you have this archive ready, you can extract to standard output, piping to awk to filter on the score value:

$unstarch mybedfile.starch \ | awk '$5 > 20' - \
> filteredFile.bed


If you want to filter just one chromosome, you can specify that as an option to unstarch:

$unstarch chr7 mybedfile.starch \ | awk '$5 > 20' - \
> filteredChr7.bed


The Starch format does indexing and random access by chromosome. It also pre-processes the BED data before compressing again with bzip2 or gzip, providing significantly improved space savings for most BED inputs.

You can compress a large dataset more efficiently than with a straight-up application of bzip2 or gzip, as well as instantaneously pull out and work on only the chromosome you are interested in.

If you want to keep everything compressed, just pipe the result into starch again:

$unstarch mybedfile.starch \ | awk '$5 > 20' - \
| starch - \
> filteredFile.starch


I suspect that the bedmap tool offers more calculation options than filtering, because it is fairly trivial to use UNIX input/output streams to pass data to tools like awk to implement that kind of custom logic.

There are occasions when filtering-like options are added when it suits some internal need and reduces complexity — I think --indicator is one example, where it replaces a specific combination of --count and awk in a straightforward manner, i.e.:

$bedmap --count foo bar | awk '{ print ($1 > 0 ? "1" : "0") }' -


is equivalent to:

$bedmap --indicator foo bar  If you're just interested in results for one chromosome in a larger BED dataset, you can also pass the --chrom operator: $ bedmap --chrom chr7 --indicator foo bar


The files foo and bar can be a mix of sorted BED files or Starch archives.

The --chrom operator facilitates easy parallelization of whole-genome analyses, or when you just want to do some quick work on one chromosome without wasting time streaming through the entire file.

0
Entering edit mode

Thank you Alex, in fact I was hoping from an answer from you, to confirm me that this operation is usually left to awk.