Question: Memory Efficient Bedtools Sort And Merge With Millions Of Entries?
14134125465346445 (United Kingdom) wrote, 5.7 years ago:

I would like to know if there is a memory-efficient way of sorting and merging a large number of BED files, each containing millions of entries, into a single BED file in which duplicated or partially overlapping entries are merged so that every interval is unique.

I have tried the following, but it blows past the 32 GB of memory I have available here:

find /my/path -name '*.bed.gz' | xargs gunzip -c | ~/src/bedtools-2.17.0/bin/bedtools sort | ~/src/bedtools-2.17.0/bin/bedtools merge | gzip -c > bed.all.gz

Any suggestions?

Alex Reynolds (Seattle, WA, USA) wrote, 5.7 years ago:

Perhaps consider using BEDOPS sort-bed --max-mem to perform the sort within a fixed amount of system memory (for example, --max-mem 24G asks for 24 GB of your host's 32 GB), and bedops --merge to compute merged elements from the sorted data:

$ find /my/path -name '*.bed.gz' -print0 \
    | xargs -0 gunzip -c \
    | sort-bed --max-mem 24G - \
    | bedops --merge - \
    | gzip -c \
    > answer.gz

In this example, sort-bed --max-mem runs a quicksort on 24 GB chunks of data and then applies a merge sort across the sorted chunks. The bedops --merge operation runs fast (about one-third the execution time of alternatives, a substantial savings at this scale) with a very low, constant memory profile on sorted input, but it discards ID, score, and strand data (if present) when calculating overlapping regions.
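Because the sort dominates the cost, one useful variant is to keep the sorted intermediate on disk and reuse it for later set operations; a sketch, with placeholder output filenames:

$ find /my/path -name '*.bed.gz' -print0 \
    | xargs -0 gunzip -c \
    | sort-bed --max-mem 24G - \
    > all.sorted.bed

# The merge step then streams over the sorted file in constant memory:
$ bedops --merge all.sorted.bed | gzip -c > answer.gz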

brentp (Salt Lake City, UT) wrote, 5.7 years ago:

Use the sort provided by your Linux distribution:

... | sort -k1,1 -k2,2n | ...

That will write temporary files to disk as needed, keeping memory use bounded.
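Put together with the question's pipeline, that looks something like the sketch below; the -S buffer size, -T scratch directory, and LC_ALL=C byte-order collation are illustrative GNU sort choices, not part of the answer above:

$ find /my/path -name '*.bed.gz' -print0 \
    | xargs -0 gunzip -c \
    | LC_ALL=C sort -k1,1 -k2,2n -S 8G -T /scratch/tmp \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c \
    > bed.all.gz

Here -S caps the in-memory buffer, and anything larger spills to sorted temporary files under -T, which sort then merges on its own.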


A minor addition to this: there is a -m option for sort that takes files that are already individually sorted and merges them into one.

Reply by Istvan Albert, 5.7 years ago.
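A sketch of that strategy, assuming GNU sort and placeholder paths (each file is sorted once to disk, then the sorted copies are merged in a single streaming pass):

# Sort each decompressed file once, writing a sorted copy alongside it.
$ for f in /my/path/*.bed.gz; do
      gunzip -c "$f" | LC_ALL=C sort -k1,1 -k2,2n > "${f%.bed.gz}.sorted.bed"
  done

# sort -m merges the already-sorted inputs without a second full sort.
$ LC_ALL=C sort -m -k1,1 -k2,2n /my/path/*.sorted.bed \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c > bed.all.gz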

I think this would work if I weren't gunzip'ing the files.

Reply by 14134125465346445, 5.7 years ago.
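If the files had been sorted before compression, process substitution would let sort -m read the gunzip'ed streams directly, with no intermediate files (two inputs shown; filenames are placeholders):

$ LC_ALL=C sort -m -k1,1 -k2,2n \
    <(gunzip -c a.sorted.bed.gz) <(gunzip -c b.sorted.bed.gz) \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c > bed.all.gz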

I tried this version and it uses a very small amount of memory. It is slower than the equivalent bedtools sort, but it solves my problem.

Reply by 14134125465346445, 5.7 years ago.
Ido Tamir (Austria) wrote, 5.7 years ago:

Sort and merge each individual BED file first, then sort and merge the concatenated output, like so:

find /my/path -name '*.bed.gz' -print0 \
    | while IFS= read -r -d '' f; do
          gunzip -c "$f" \
              | ~/src/bedtools-2.17.0/bin/bedtools sort \
              | ~/src/bedtools-2.17.0/bin/bedtools merge
      done \
    | ~/src/bedtools-2.17.0/bin/bedtools sort \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c > bed.all.gz
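Merging each file first collapses its duplicate and overlapping entries before anything is concatenated, so the final sort and merge see far fewer lines; how much memory this saves depends on how heavily the entries within each file overlap.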