2.1 years ago by
MSKCC | New York, NY
Two-Stage Multithreaded Version for Sorted BAMs
While this thread already has some great answers, I wanted to suggest a parallelized version that is robust to open file limits (e.g., > 4096 files). This requires GNU parallel.
Code
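# Stage 1: merge batches of up to 4095 BAMs in 8 parallel jobs; --files saves each job's output to a temp file and prints its name.
# Stage 2: merge the temp files into merged.bam with 8 threads, then delete them.
find "$BAM_DIR" -name '*.bam' |
parallel -j8 -N4095 -m --files samtools merge -u - |
parallel --xargs samtools merge -@8 merged.bam {}";" rm {}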
Overview
This will take all BAM files in $BAM_DIR and run eight (-j8) separate single-threaded merge operations, with the input files (mostly) equally distributed among the different jobs. This results in temporary files, which are then merged into merged.bam in a multithreaded operation. The temporary files are deleted at the end.
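To see how the first stage splits inputs into batches without touching any data, a dry run along these lines may help. It is only a sketch: it uses plain xargs -n as a stand-in for parallel -N, prints the merge commands with echo instead of running them, and the file names and the batch size of 2 are made up for illustration. Unlike parallel -m, xargs fills each batch in order rather than balancing inputs across jobs.

```shell
# Dry run: print (do not execute) one first-stage merge command per batch.
# Hypothetical inputs; a batch size of 2 stands in for -N4095.
batches=$(printf '%s\n' a.bam b.bam c.bam d.bam e.bam |
    xargs -n 2 echo samtools merge -u -)
printf '%s\n' "$batches"
```

Each printed line corresponds to one temporary file that the second stage would then merge.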
Options
The number of simultaneous merge operations in the first round (-j8) need not match the number of threads used for the second round (-@8). The first round is likely to be bottlenecked by too many simultaneous writes, so you may want to keep -j lower.
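As a sketch of how the two settings can be varied independently, the pipeline from the Code section can be wrapped in a function. The name merge_bams, its argument order, and the defaults of 4 jobs and 8 threads are assumptions for illustration, not tuned recommendations.

```shell
# merge_bams BAM_DIR OUT [JOBS] [THREADS]   (hypothetical wrapper)
#   JOBS    = simultaneous first-round merges (parallel -j), default 4
#   THREADS = threads for the second-round merge (samtools -@), default 8
merge_bams() {
    if ! command -v parallel >/dev/null || ! command -v samtools >/dev/null; then
        echo "merge_bams: needs GNU parallel and samtools" >&2
        return 1
    fi
    # Paths are assumed to contain no spaces or shell metacharacters.
    find "$1" -name '*.bam' |
        parallel -j"${3:-4}" -N4095 -m --files samtools merge -u - |
        parallel --xargs samtools merge -@"${4:-8}" "$2" {}";" rm {}
}
```

For example, merge_bams "$BAM_DIR" merged.bam 2 8 would run two first-round merges at a time while still using eight threads for the final merge.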
Use the -N flag to change the maximum number of arguments given to each first-round merge operation. Here 4095 is just a common open files limit (4096) minus one for the output file.
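The limit that applies to your own shell can be read with ulimit -n, and a value for -N can be derived from it. The helper below is a minimal sketch; max_inputs and its reserve argument are names invented here, and the fallback of 4095 for an unlimited setting simply reuses the value from the Code section.

```shell
# max_inputs LIMIT [RESERVE]: hypothetical helper that prints how many input
# files one merge job can open, leaving RESERVE descriptors (default 1, for
# the output file) free.
max_inputs() {
    if [ "$1" = "unlimited" ]; then
        echo 4095            # assumed fallback, same value as -N4095
    else
        echo $(( $1 - ${2:-1} ))
    fi
}

# ulimit -n reports the soft limit on open files for the current shell.
n=$(max_inputs "$(ulimit -n)")
echo "suggested flag: -N$n"
```

Passing a larger reserve, for example max_inputs "$(ulimit -n)" 4, also leaves room for standard input, output, and error, which is the more conservative choice.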
The -u flag is there so the temporary files will be uncompressed, since we're deleting them in the end. It can be removed if you have concerns about storage space for the temp files.
I have 1679 sorted BAM files named like sorted.bam.0000.bam to sorted.bam.1679.bam. How do I merge all of them into a single sorted BAM file? Could you please give me a script using my file names as an example? Thank you.