Question: Memory Efficient Bedtools Sort And Merge With Millions Of Entries?
14134125465346445 (United Kingdom) wrote, 5.7 years ago:

I would like to know if there is a memory-efficient way of sorting and merging a large number of BED files, each containing millions of entries, into a single BED file in which duplicated or partially overlapping entries are merged so that every interval is unique.

I have tried the following, but it blows past the 32 GB of memory I have available here:

find /my/path -name '*.bed.gz' | xargs gunzip -c | ~/src/bedtools-2.17.0/bin/bedtools sort | ~/src/bedtools-2.17.0/bin/bedtools merge | gzip -c > bed.all.gz

Any suggestions?

Alex Reynolds (Seattle, WA, USA) wrote, 5.7 years ago:

Perhaps consider using BEDOPS sort-bed --max-mem to perform the sort within a fixed amount of system memory (for example, --max-mem 24G asks for 24 GB of your host's 32 GB), and bedops --merge to compute merged elements from the sorted data:

$ find /my/path -name '*.bed.gz' -print0 \
    | xargs -0 gunzip -c \
    | sort-bed --max-mem 24G - \
    | bedops --merge - \
    | gzip -c \
    > answer.gz

In this example, sort-bed --max-mem runs a quicksort on 24 GB chunks of data and then applies a merge sort across the sorted chunks. The bedops --merge operation runs fast (about one-third the execution time of alternatives, a substantial savings at this scale) with a very low, constant memory profile on sorted input, but it discards ID, score, and strand data (if present) when calculating overlapping regions.
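Because the sort dominates the cost, one useful variant is to keep the sorted intermediate on disk and reuse it for later set operations; a sketch, with placeholder output filenames:

$ find /my/path -name '*.bed.gz' -print0 \
    | xargs -0 gunzip -c \
    | sort-bed --max-mem 24G - \
    > all.sorted.bed

# The merge step then streams over the sorted file in constant memory:
$ bedops --merge all.sorted.bed | gzip -c > answer.gz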

brentp (Salt Lake City, UT) wrote, 5.7 years ago:

Use the sort provided by your Linux distribution:

... | sort -k1,1 -k2,2n | ...

That will write temporary files to disk as needed, keeping memory use bounded.
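Put together with the question's pipeline, that looks something like the sketch below; the -S buffer size, -T scratch directory, and LC_ALL=C byte-order collation are illustrative GNU sort choices, not part of the answer above:

$ find /my/path -name '*.bed.gz' -print0 \
    | xargs -0 gunzip -c \
    | LC_ALL=C sort -k1,1 -k2,2n -S 8G -T /scratch/tmp \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c \
    > bed.all.gz

Here -S caps the in-memory buffer, and anything larger spills to sorted temporary files under -T, which sort then merges on its own.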


A minor addition to this: there is a -m option for sort that takes files that are already individually sorted and merges them into one.

Reply by Istvan Albert, 5.7 years ago.
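A sketch of that strategy, assuming GNU sort and placeholder paths (each file is sorted once to disk, then the sorted copies are merged in a single streaming pass):

# Sort each decompressed file once, writing a sorted copy alongside it.
$ for f in /my/path/*.bed.gz; do
      gunzip -c "$f" | LC_ALL=C sort -k1,1 -k2,2n > "${f%.bed.gz}.sorted.bed"
  done

# sort -m merges the already-sorted inputs without a second full sort.
$ LC_ALL=C sort -m -k1,1 -k2,2n /my/path/*.sorted.bed \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c > bed.all.gz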

I think this would work if I weren't gunzip'ing the files.

Reply by 14134125465346445, 5.7 years ago.
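If the files had been sorted before compression, process substitution would let sort -m read the gunzip'ed streams directly, with no intermediate files (two inputs shown; filenames are placeholders):

$ LC_ALL=C sort -m -k1,1 -k2,2n \
    <(gunzip -c a.sorted.bed.gz) <(gunzip -c b.sorted.bed.gz) \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c > bed.all.gz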

I tried this version and it uses a very small amount of memory. It is slower than the equivalent bedtools sort, but it solves my problem.

Reply by 14134125465346445, 5.7 years ago.
Ido Tamir (Austria) wrote, 5.7 years ago:

Sort and merge each individual BED file first, then sort and merge the concatenated output, like so:

find /my/path -name '*.bed.gz' -print0 \
    | while IFS= read -r -d '' f; do
          gunzip -c "$f" \
              | ~/src/bedtools-2.17.0/bin/bedtools sort \
              | ~/src/bedtools-2.17.0/bin/bedtools merge
      done \
    | ~/src/bedtools-2.17.0/bin/bedtools sort \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c > bed.all.gz
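Merging each file first collapses its duplicate and overlapping entries before anything is concatenated, so the final sort and merge see far fewer lines; how much memory this saves depends on how heavily the entries within each file overlap.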