Anyone else have any issues with convert2bed when dealing with a large input BAM file? convert2bed works fine with a 10 GB BAM file, but with a 20 GB BAM file it runs without errors yet generates a 0-byte BED file. The node I am running on has 384 GB of RAM. Any other thoughts on why convert2bed might fail to generate a >0-byte BED file? Thanks!
What versions of convert2bed and samtools are you using?
Behind the scenes, convert2bed (bam2bed) passes its BED output to sort-bed, which creates properly sorted BED.
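In other words, the default invocation behaves roughly like converting without sorting and piping the result through sort-bed (a sketch of the idea, not the exact internals; reads.bam is a placeholder):

$ convert2bed --input=bam --do-not-sort < reads.bam | sort-bed - > reads.sorted.bed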
One possibility is that your /tmp folder might be on a disk mount or share that is smaller than what is needed to store the intermediate, uncompressed BED data. Because the BAM files being processed are often larger than system memory, sort-bed uses only 2 GB of system memory by default, and it uses your /tmp folder (or whatever your operating system's temporary folder is set to) to store intermediate data for a merge sort.
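A quick way to check whether /tmp is the bottleneck is to compare its free space against your input size; the uncompressed BED can easily be several times larger than the BAM it came from (reads.bam is a placeholder):

$ df -h /tmp
$ ls -lh reads.bam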
If running out of space in /tmp while sorting is the issue, there are a few things you can do by setting appropriate options.

You could set --max-mem to 40G or similar, since you have a system with 384 GB of RAM. Then all the sorting work on uncompressed BED would be done in memory, which will be faster, and you wouldn't need to worry as much about using or running out of space in /tmp.
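For example (assuming reads.bam is your input and that your bam2bed build accepts --max-mem, which it passes through to sort-bed):

$ bam2bed --max-mem 40G < reads.bam > reads.sorted.bed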
Or, you could set the temporary sort directory via --sort-tmpdir to a folder on a disk share that has at least 40 GB of free space.
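For example (here /scratch/tmp is a placeholder for a directory with enough free space):

$ bam2bed --sort-tmpdir /scratch/tmp < reads.bam > reads.sorted.bed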
Or, you could disable BED sorting altogether via --do-not-sort. I really don't recommend this: the sort order of a BAM file can be unknown or non-lexicographical, and the sort order of the resulting BED file would then be unknown or incorrect, possibly making it unusable for set operations with bedops, bedmap, etc.
I would only suggest using --do-not-sort if you pipe to cut to remove columns, e.g.:
$ bam2bed --do-not-sort < reads.bam | cut -f1-6 | sort-bed --max-mem 20G - > reads.sorted.bed
We try to be non-lossy about conversion between formats. You may only be interested in a subset of columns, however, so this is a fast way to discard columns you don't want or need, with the benefit that sort-bed has a lot less input to sort within memory.
If your BAM file is indexed, a different option is to convert BAM to BED via one of several parallelized approaches, such as GNU Parallel or a SLURM or SGE computational cluster. This splits the conversion work by chromosome, and those individual, per-chromosome conversion tasks are much smaller. Conversion will go much faster, too, since we use some tricks that basically reduce the overall job to the time taken to convert the largest chromosome (chr1 tends to be the largest).
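As a rough sketch of the GNU Parallel route (my own outline, not the packaged scripts; it assumes reads.bam is indexed and that samtools, parallel, and the BEDOPS binaries are on your PATH):

# list chromosomes with at least one mapped read, convert each in parallel
$ samtools idxstats reads.bam | awk '$3 > 0 { print $1 }' \
    | parallel 'samtools view -b reads.bam {} | bam2bed > reads.{}.bed'

# each per-chromosome file is sorted, so a multiset union yields sorted output
$ bedops --everything reads.*.bed > reads.sorted.bed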
In any case, there might be a bug in the BAM conversion routines, but I couldn't begin to troubleshoot without knowing the versions of the binaries and having some sample input that reproduces the error. So take a look at the options above and see if adjusting memory settings helps; if parallelization is an option for you, then I definitely recommend that route, if your time is valuable to you and you have those computational resources.