Question: Any Quick Sort Method For Large Bed File As 20G In Size?
5
gravatar for Xianjun
4.0 years ago by
Xianjun190
Great Boston Area
Xianjun190 wrote:

We are often suggested to sort the input bed file by "sort -k1,1 -k2,2n" in order to invokes a memory-efficient algorithm designed for large files, for example, bedtools intersect ( http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html)

But this is slow for a large file with >20G in size. Any quick-sorting problem around?

Here is a solution I can think of:

inputbed=$1
awk -v inputbed=$inputbed '{print $0 >> inputbed"."$1}' $inputbed
for i in $inputbed.*; do sort -k2,2n $i > $i.s & done
# when it's done
sort -m -k1,1 -k2,2n $inputbed.*s > $inputbed.sorted

The basic idea is to split the large file into small files by the first key, then sort each of them by the second key, and finally use "sort -m" to merge the sorted ones. It can save time by parallel sorting individual small file. But you need a way to track when the individual sorting is done.

I can also read the records into a large hash table, and then sort the key before output. Below is an example of sorting by 1st column (using a hash of hash can sort by two keys):

perl -e 'while (<>) {$l=$_; @a=split("\t", $l); push(@{$HoA{$a[0]}}, $l);}{foreach $i (sort keys %HoA) {print join("", @{$HoA{$i}});}}'

But again, this method requires large memory.

I am wondering if any quick-sorting program around that you guys can recommend. Thanks

Edit1: Use the parallel sorting from GNU coreutils with "sort --parallel=N" (change the number of sorts run concurrently to N) option. Also set the main buffer size and traditional locale.

LC_ALL=C sort --parallel=24 --buffer-size=5G -k1,1 -k2,2n input.bed > sorted.bed

Edit2: Use the sort-bed tool from BEDOPS

sort-bed --max-mem 5G input.bed > sorted.bed
bed sort • 4.9k views
ADD COMMENTlink modified 2.8 years ago by Biostar ♦♦ 10 • written 4.0 years ago by Xianjun190
6

Also use sort -S to change the in-RAM buffer size. The default buffer size is pretty small, which hurts performance.

ADD REPLYlink written 4.0 years ago by lh328k

I am strong support everyone to increase the in-RAM buffer size, since you would get some very strange error report in bedtools operation when the RAM buffer size is not enough. You can try do some simple test to bedtools sort some large bed files. When the memory can not allow for the input, awesome awesome awesome error will give you. such as:

Differing number of BED fields encountered at line: 8695756.  Exiting...

Actually, nothing special for line:8695756, except the memory can only hold 8695755 line and 8695756 line can not hold to memory completely.

ADD REPLYlink written 12 months ago by Shicheng Guo4.2k
3

and set LC_ALL=C before sorting. see http://stackoverflow.com/questions/28881

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum91k
7
gravatar for Zev.Kronenberg
4.0 years ago by
United States
Zev.Kronenberg10k wrote:

You might not need to go through the pain of rolling your own.

There are multithreaded sort commands.

have a look @ stackoverflow

http://stackoverflow.com/questions/6365176/multi-threaded-sort

ADD COMMENTlink written 4.0 years ago by Zev.Kronenberg10k

Cool! Thanks

btw, the latest coreutils-8.21 (which contains the parallel sort and many other enhanced tools) can be downloaded here: http://ftp.gnu.org/gnu/coreutils/

ADD REPLYlink written 4.0 years ago by Xianjun190
6
gravatar for Alex Reynolds
4.0 years ago by
Alex Reynolds17k
Seattle, WA USA
Alex Reynolds17k wrote:

You can use BEDOPS sort-bed with the --max-mem option, which will allow you to specify the amount of system memory allocated to sorting a BED file that is larger than system memory.

For example, the following command allocates two gigabytes of memory to sorting a large BED file:

$ sort-bed --max-mem 2G unsortedBigData.bed > sortedBigData.bed

This revision of sort-bed includes the functionality of the former BEDOPS bbms (Big Bed Merge Sort) script, which implemented a merge sort of BED input on memory-constrained workstations.

With bbms, once each of the subsets was sorted in memory with sort-bed, sorted output was constructed from the entire set, applying a multiset union operation with BEDOPS bedops.

With sort-bed, the first step — sorting each subset in memory — uses qsort(). See the lexSortBedData() function in SortDetails.cpp. The qsort() function is an implementation of the quicksort algorithm.

The second step — the multiset union part of the merge sort algorithm — uses very little memory. All that is needed is enough space to compare the first ("front-most") BED element of each now-sorted subset to put the final output in correct order.

The new sort-bed now incorporates the functionality of both steps through the use of --max-mem; it is no longer necessary to use bbms on large BED datasets. Nonetheless, a variant of bbms could be easily parallelized to operate on multicore workstations or multiple hosts via SSH with GNU Parallel, as has been Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them with BEDOPS starch.

It is not clear to me how much bbms would benefit from parallelization via GNU Parallel, however. In the multicore scenario, we would still be limited to qsort() operations within system memory, in which case we might just as well use sort-bed --max-mem. While delegating each subset-sort to each machine in a group of SSH-capable hosts would get around the memory issue, the I/O overhead of moving large chunks of encrypted data over the network with SSH may significantly hurt performance. Reading data from a local hard disk is slow as it is; network latency would probably be worse, in most cases.

Another option for working with 20 GB (or otherwise very large) BED files is to compress them to BEDOPS Starch-formatted archives. This archive format is lossless and can, depending on the content, reduce a BED file to approximately 5% of its original size. The major BEDOPS tools (bedops, bedmap, bedextract, closest-features) can accept either sorted BED or Starch files as inputs. Further, use of the --chrom operator can quickly focus BEDOPS-based analyses to specific chromosomes, which leads to obvious parallelization approaches.

These features highlight use of BEDOPS for handling whole-genome-sized ("Big Data") analyses, by offering the fastest tools with low memory overhead and easy parallelization options for power users. Sorting inputs once is the only requirement for BEDOPS tools, and I think it is fair to say that sort-bed does a pretty nice job at this task.

ADD COMMENTlink modified 3.8 years ago • written 4.0 years ago by Alex Reynolds17k

Thanks, Alex. Yes, I noticed the performance comparison of sort-bed on your website. I was about to ask you the algorithm behind. It's good to see your reply here. Definitely I will look into it and play with the starch.

ADD REPLYlink written 4.0 years ago by Xianjun190
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1497 users visited in the last hour