It is often recommended to sort the input BED file with "sort -k1,1 -k2,2n" in order to invoke the memory-efficient algorithm that some tools provide for large files, for example, bedtools intersect ( http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html)
But this is slow for a large file of >20G in size. Is there any quick-sorting program around?
Here is a solution I can think of:
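inputbed=$1
# split by chromosome (column 1); the parentheses keep the output file name unambiguous
awk -v inputbed="$inputbed" '{print $0 >> (inputbed "." $1)}' "$inputbed"
for i in "$inputbed".*; do sort -k2,2n "$i" > "$i.s" & done
# block until all background sorts are done
wait
sort -m -k1,1 -k2,2n "$inputbed".*.s > "$inputbed.sorted"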
The basic idea is to split the large file into small files by the first key, then sort each of them by the second key, and finally use "sort -m" to merge the sorted ones. It can save time by sorting the individual small files in parallel. You do need a way to tell when all the individual sorts have finished; the shell builtin "wait" blocks until every background job has exited.
I can also read the records into a large hash table, and then sort the keys before output. Below is an example of sorting by the 1st column (a hash of hashes would allow sorting by two keys):
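As a toy end-to-end run of the split/sort/merge idea, with "wait" doing the tracking, something like the sketch below should work. The temporary directory, the toy.bed file name, and the four sample records are made up for illustration, and it assumes chromosome names are safe to use inside file names.

```shell
#!/bin/sh
# Toy demonstration of split / parallel sort / merge (hypothetical file names).
set -eu
export LC_ALL=C                      # byte-wise comparisons for every sort below
workdir=$(mktemp -d)
bed="$workdir/toy.bed"
printf 'chr2\t50\t60\nchr1\t300\t400\nchr2\t5\t15\nchr1\t20\t30\n' > "$bed"

# 1. Split by chromosome (column 1) into toy.bed.chr1, toy.bed.chr2, ...
awk -v prefix="$bed" '{ print > (prefix "." $1) }' "$bed"

# 2. Sort every chunk by start position, each in its own background job.
for chunk in "$bed".chr*; do
    sort -k2,2n "$chunk" > "$chunk.s" &
done

# 3. Block until all background sorts have finished.
wait

# 4. Merge the already-sorted chunks into one file.
sort -m -k1,1 -k2,2n "$bed".*.s > "$bed.sorted"
cat "$bed.sorted"
```

Note that a plain "wait" with no arguments does not report whether one of the background sorts failed, so for real data it may be worth checking that every chunk produced a non-empty .s file before merging.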
perl -e 'while (<>) {$l=$_; @a=split("\t", $l); push(@{$HoA{$a[0]}}, $l);}{foreach $i (sort keys %HoA) {print join("", @{$HoA{$i}});}}'
But again, this method requires large memory.
I am wondering if there is any quick-sorting program around that you guys can recommend. Thanks
Edit1: Use the parallel sort from GNU coreutils with the "sort --parallel=N" option (N is the number of sorts run concurrently). Also set the main-memory buffer size and the traditional (C) locale.
LC_ALL=C sort --parallel=24 --buffer-size=5G -k1,1 -k2,2n input.bed > sorted.bed
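Before handing the result to bedtools, it may be worth confirming that the file really is in the expected order. Here is a small sketch on toy data; the helper name is_bed_sorted and the file names are hypothetical. "sort -c" only checks the order and exits non-zero at the first out-of-order line. The "-s" flag (available in GNU sort) restricts the check to the two keys, so ties on chromosome and start are not compared any further.

```shell
#!/bin/sh
# is_bed_sorted FILE: hypothetical helper; succeeds if FILE is ordered by
# chromosome name (byte-wise) and then by numeric start position.
is_bed_sorted() {
    LC_ALL=C sort -c -s -k1,1 -k2,2n "$1" 2>/dev/null
}

tmp=$(mktemp -d)
printf 'chr1\t300\t400\nchr1\t20\t30\n' > "$tmp/unsorted.bed"
LC_ALL=C sort -k1,1 -k2,2n "$tmp/unsorted.bed" > "$tmp/sorted.bed"

is_bed_sorted "$tmp/sorted.bed"   && echo "sorted.bed: ok"
is_bed_sorted "$tmp/unsorted.bed" || echo "unsorted.bed: not sorted"
```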
Edit2: Use the sort-bed tool from BEDOPS
sort-bed --max-mem 5G input.bed > sorted.bed
Also use
sort -S
to change the in-RAM buffer size. The default buffer size is pretty small, which hurts performance. I strongly encourage everyone to increase the in-RAM buffer size, since you can get some very strange error reports from bedtools operations when the RAM buffer is not large enough. You can test this by running bedtools sort on some large BED files: when memory cannot hold the input, you get a baffling error pointing at a specific line (line:8695756 in my case).
Actually, there is nothing special about line 8695756, except that memory could hold only 8695755 lines and line 8695756 did not fit into memory completely.
Also set LC_ALL=C before sorting; see http://stackoverflow.com/questions/28881
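To see what LC_ALL=C does to the ordering, here is a tiny sketch with made-up records. In the C locale the first key is compared byte by byte, so chr10 lands before chr2; skipping locale-aware collation is also why the sort usually runs faster.

```shell
#!/bin/sh
# Three toy records on differently named chromosomes, sorted in the C locale.
c_order=$(printf 'chr2\t1\t2\nchr10\t1\t2\nchr1\t1\t2\n' |
    LC_ALL=C sort -k1,1 -k2,2n |
    cut -f1 |
    paste -s -d ' ' -)      # join the chromosome column onto one line
echo "$c_order"
```

This should print "chr1 chr10 chr2", i.e. plain byte order rather than natural chromosome order.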