Question

Any Quick Sort Method For Large Bed File As 20G In Size?

5

Entering edit mode

11.1 years ago

Xianjun ▴ 310

We are often suggested to sort the input bed file by "sort -k1,1 -k2,2n" in order to invokes a memory-efficient algorithm designed for large files, for example, bedtools intersect ( http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html)

But this is slow for a large file with >20G in size. Any quick-sorting problem around?

Here is a solution I can think of:

inputbed=$1
awk -v inputbed=$inputbed '{print $0 >> inputbed"."$1}' $inputbed
for i in $inputbed.*; do sort -k2,2n $i > $i.s & done
# when it's done
sort -m -k1,1 -k2,2n $inputbed.*s > $inputbed.sorted

The basic idea is to split the large file into small files by the first key, then sort each of them by the second key, and finally use "sort -m" to merge the sorted ones. It can save time by parallel sorting individual small file. But you need a way to track when the individual sorting is done.

I can also read the records into a large hash table, and then sort the key before output. Below is an example of sorting by 1st column (using a hash of hash can sort by two keys):

perl -e 'while (<>) {$l=$_; @a=split("\t", $l); push(@{$HoA{$a[0]}}, $l);}{foreach $i (sort keys %HoA) {print join("", @{$HoA{$i}});}}'

But again, this method requires large memory.

I am wondering if any quick-sorting program around that you guys can recommend. Thanks

Edit1: Use the parallel sorting from GNU coreutils with "sort --parallel=N" (change the number of sorts run concurrently to N) option. Also set the main buffer size and traditional locale.

LC_ALL=C sort --parallel=24 --buffer-size=5G -k1,1 -k2,2n input.bed > sorted.bed

Edit2: Use the sort-bed tool from BEDOPS

sort-bed --max-mem 5G input.bed > sorted.bed

sort bed • 11k views

ADD COMMENT • link updated 12 weeks ago by Dan ▴ 530 • written 11.1 years ago by Xianjun ▴ 310

6

Entering edit mode

Also use sort -S to change the in-RAM buffer size. The default buffer size is pretty small, which hurts performance.

ADD REPLY • link 11.1 years ago by lh3 33k

0

Entering edit mode

I am strong support everyone to increase the in-RAM buffer size, since you would get some very strange error report in bedtools operation when the RAM buffer size is not enough. You can try do some simple test to bedtools sort some large bed files. When the memory can not allow for the input, awesome awesome awesome error will give you. such as:

Differing number of BED fields encountered at line: 8695756.  Exiting...

Actually, nothing special for line:8695756, except the memory can only hold 8695755 line and 8695756 line can not hold to memory completely.

ADD REPLY • link 8.1 years ago by Shicheng Guo ★ 9.4k

3

Entering edit mode

and set LC_ALL=C before sorting. see http://stackoverflow.com/questions/28881

ADD REPLY • link 11.1 years ago by Pierre Lindenbaum 161k

score 7 · Accepted Answer · 2013-03-20

7

Entering edit mode

11.1 years ago

Zev.Kronenberg 12k

You might not need to go through the pain of rolling your own.

There are multithreaded sort commands.

have a look @ stackoverflow

http://stackoverflow.com/questions/6365176/multi-threaded-sort

ADD COMMENT • link 11.1 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Cool! Thanks

btw, the latest coreutils-8.21 (which contains the parallel sort and many other enhanced tools) can be downloaded here: http://ftp.gnu.org/gnu/coreutils/

ADD REPLY • link 11.1 years ago by Xianjun ▴ 310

0

Entering edit mode

I'm using parallel and -S with sort, but what about bcftools sort?

ADD REPLY • link 12 weeks ago by Dan ▴ 530

Ram · Accepted Answer · 2013-03-20

You can use BEDOPS sort-bed with the --max-mem option, which will allow you to specify the amount of system memory allocated to sorting a BED file that is larger than system memory.

For example, the following command allocates two gigabytes of memory to sorting a large BED file:

$ sort-bed --max-mem 2G unsortedBigData.bed > sortedBigData.bed

This revision of sort-bed includes the functionality of the former BEDOPS bbms (Big Bed Merge Sort) script, which implemented a merge sort of BED input on memory-constrained workstations.

With bbms, once each of the subsets was sorted in memory with sort-bed, sorted output was constructed from the entire set, applying a multiset union operation with BEDOPS bedops.

With sort-bed, the first step - sorting each subset in memory - uses qsort(). See the lexSortBedData() function in SortDetails.cpp. The qsort() function is an implementation of the quicksort algorithm.

The second step - the multiset union part of the merge sort algorithm - uses very little memory. All that is needed is enough space to compare the first ("front-most") BED element of each now-sorted subset to put the final output in correct order.

The new sort-bed now incorporates the functionality of both steps through the use of --max-mem; it is no longer necessary to use bbms on large BED datasets. Nonetheless, a variant of bbms could be easily parallelized to operate on multicore workstations or multiple hosts via SSH with GNU Parallel, as has been done with BEDOPS starch.

It is not clear to me how much bbms would benefit from parallelization via GNU Parallel, however. In the multicore scenario, we would still be limited to qsort() operations within system memory, in which case we might just as well use sort-bed --max-mem. While delegating each subset-sort to each machine in a group of SSH-capable hosts would get around the memory issue, the I/O overhead of moving large chunks of encrypted data over the network with SSH may significantly hurt performance. Reading data from a local hard disk is slow as it is; network latency would probably be worse, in most cases.

Another option for working with 20 GB (or otherwise very large) BED files is to compress them to BEDOPS Starch-formatted archives. This archive format is lossless and can, depending on the content, reduce a BED file to approximately 5% of its original size. The major BEDOPS tools (bedops, bedmap, bedextract, closest-features) can accept either sorted BED or Starch files as inputs. Further, use of the --chrom operator can quickly focus BEDOPS-based analyses to specific chromosomes, which leads to obvious parallelization approaches.

These features highlight use of BEDOPS for handling whole-genome-sized ("Big Data") analyses, by offering the fastest tools with low memory overhead and easy parallelization options for power users. Sorting inputs once is the only requirement for BEDOPS tools, and I think it is fair to say that sort-bed does a pretty nice job at this task.