Memory Efficient Bedtools Sort And Merge With Millions Of Entries?
3
2
7.9 years ago

I would like to know if there is a memory-efficient way of sorting and merging a large number of BED files, each containing millions of entries, into a single BED file in which duplicated or partially overlapping entries are merged, so that every region in the output is unique.

I have tried the following but it blows up in memory beyond the 32G I have available here:

find /my/path -name '*.bed.gz' | xargs gunzip -c | ~/src/bedtools-2.17.0/bin/bedtools sort | ~/src/bedtools-2.17.0/bin/bedtools merge | gzip -c > bed.all.gz

Any suggestions?

bedtools bed • 3.9k views
4
7.9 years ago

Perhaps consider using BEDOPS sort-bed --max-mem to perform a sort within system memory (for example, --max-mem 24G, which asks for 24 GB of your host's 32 GB of system memory) and bedops --merge to calculate merged elements from sorted data:

$ find /my/path -name '*.bed.gz' -print0 \
    | xargs -0 gunzip -c \
    | sort-bed --max-mem 24G - \
    | bedops --merge - \
    | gzip -c \
    > answer.gz

In this example, sort-bed --max-mem quicksorts the input in chunks of up to 24 GB and then merge-sorts the sorted chunks into the final output. On sorted input, bedops --merge runs fast (about one third of the execution time of alternatives, a substantial saving for data of this scale) and with a very low, constant memory profile. Note that it discards ID, score and strand data (if present) when calculating the merged regions.
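To illustrate why a merge over sorted input can run in constant memory, here is a minimal sketch of a one-pass merge written in awk. It is not how bedops or bedtools are implemented; the function name merge_sorted_bed and the sample data are made up for illustration. It assumes tab-separated BED input that is already sorted by chromosome and start, and it keeps only the first three columns.

```shell
# Hypothetical helper: merge overlapping or adjoining intervals read from
# sorted, tab-separated BED on stdin. Only one interval is held in memory.
merge_sorted_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
    # First record: open the first interval.
    NR == 1 { chrom = $1; start = $2 + 0; end = $3 + 0; next }
    # Same chromosome and start at or before the current end: extend it.
    $1 == chrom && ($2 + 0) <= end { if (($3 + 0) > end) end = $3 + 0; next }
    # Otherwise emit the finished interval and open a new one.
    { print chrom, start, end; chrom = $1; start = $2 + 0; end = $3 + 0 }
    END { if (NR > 0) print chrom, start, end }'
}

# Small demonstration on inline sample data.
printf 'chr1\t10\t20\nchr1\t15\t30\nchr1\t40\t50\nchr2\t5\t8\n' | merge_sorted_bed
```

On the sample above, the first two chr1 intervals collapse into chr1 10 30, while the other two pass through unchanged. Because each input line is compared only against the single open interval, memory use does not grow with the size of the input.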

3
7.9 years ago
brentp 23k

Use the sort provided in your Linux distribution:

... | sort -k1,1 -k2,2n | ...

That will write temporary files to disk as needed.
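As a rough sketch of this approach, the wrapper below makes the memory bound explicit. It assumes GNU coreutils sort, where -S caps the in-memory buffer and -T selects the directory for temporary files. The function name sort_bed_lowmem, the SORT_MEM and SORT_TMP variables, and the 512M default are hypothetical placeholders, not recommendations.

```shell
# Hypothetical wrapper around sort for BED data on stdin.
# -S caps the in-memory buffer; once it is full, sort spills sorted runs
# to temporary files under the -T directory and merges them at the end.
# LC_ALL=C selects plain byte-wise comparison for the chromosome column.
sort_bed_lowmem() {
    LC_ALL=C sort -k1,1 -k2,2n \
        -S "${SORT_MEM:-512M}" \
        -T "${SORT_TMP:-${TMPDIR:-/tmp}}"
}

# Small demonstration on inline sample data (made up for illustration).
printf 'chr2\t5\t8\nchr1\t40\t50\nchr1\t10\t20\n' | sort_bed_lowmem
```

In the pipeline from the question, a wrapper like this would take the place of the bedtools sort step, with the temporary directory pointed at a disk that has enough free space for the uncompressed data.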

0

A minor addition to this: sort has an -m option that takes files that are already individually sorted and merges them into one.
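A minimal sketch of the -m idea, assuming every input file was already sorted with the same keys and the same locale as the merge step. The function name merge_presorted_bed and the two sample files are made up for illustration.

```shell
# Hypothetical helper: merge files that are each already sorted by
# chromosome and start. With -m, sort only interleaves its inputs and
# does not re-sort them, so very little memory is needed.
merge_presorted_bed() {
    LC_ALL=C sort -m -k1,1 -k2,2n "$@"
}

# Demonstration with two small pre-sorted files.
tmpdir=$(mktemp -d)
printf 'chr1\t10\t20\nchr2\t5\t8\n' > "$tmpdir/a.bed"
printf 'chr1\t15\t30\nchr1\t40\t50\n' > "$tmpdir/b.bed"
merge_presorted_bed "$tmpdir/a.bed" "$tmpdir/b.bed"
rm -r "$tmpdir"
```

If the inputs are not actually sorted, sort -m will silently produce unsorted output, so it is worth checking a file with sort -c first. For compressed inputs, bash process substitution such as <(gunzip -c a.bed.gz) can stand in for a file name.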

0

I think this would work if I wasn't gunzip'ing the files.

0

I tried this version and it uses a very small amount of memory. It is slower than the equivalent bedtools sort, but it solves my problem.

0
7.9 years ago
Ido Tamir 5.2k

First merge each individual BED file, then merge the merged files, like this:

find /my/path -name '*.bed.gz' -print0 \
    | while IFS= read -r -d '' f; do gunzip -c "$f" | ~/src/bedtools-2.17.0/bin/bedtools sort | ~/src/bedtools-2.17.0/bin/bedtools merge; done \
    | ~/src/bedtools-2.17.0/bin/bedtools sort \
    | ~/src/bedtools-2.17.0/bin/bedtools merge \
    | gzip -c > bed.all.gz
