Question

convert2bed issue with large input file?

7.0 years ago

cpak1981 ▴ 140

Has anyone else had issues with convert2bed when dealing with a large input BAM file? convert2bed works fine with a 10 GB BAM file. But with a 20 GB BAM file, convert2bed runs without errors and generates a 0-byte BED file. The node I am running on has 384 GB of RAM. Any thoughts on why convert2bed might be producing an empty BED file? Thanks!

convert2bed bedops bam2bed • 3.7k views

An error message and some more information on how you mapped the file would be helpful. My guess is that the BAM file is truncated or corrupted in some way.
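One way to test the truncation guess is sketched below. It assumes samtools 1.3 or later is installed, since it relies on `samtools quickcheck`; the helper name `check_bam` is hypothetical and only for illustration. A passing check does not prove the file is fully valid, only that the header and end-of-file marker are present.

```shell
# Hypothetical helper: report whether a BAM file looks intact.
check_bam() {
    bam="$1"
    # Fail early if the file is missing or has zero size.
    [ -s "$bam" ] || { echo "missing or empty: $bam" >&2; return 1; }
    # samtools quickcheck exits non-zero when the header or EOF block is missing.
    samtools quickcheck "$bam" || { echo "quickcheck failed: $bam" >&2; return 1; }
    echo "ok: $bam"
}
```

For example, `check_bam reads.bam` prints `ok: reads.bam` when both checks pass.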

7.0 years ago by Sinji ★ 3.2k

Thanks for the response.

As I mentioned, there is no error message. convert2bed appears to run successfully but only generates an empty bed file.

Mapping of paired-end reads onto hg19 was performed with bwa-mem. Duplicates were removed with picard MarkDuplicates. No errors were generated during these steps. The resulting bam file (after de-duplication and sorting) can be used for peak calling (macs2), suggesting the bam file is okay in general.

7.0 years ago by cpak1981 ▴ 140

By the way, the aggregate size of the files in the temp folder is around 35 GB (the input BAM file is around 20 GB, and the resulting BED file is around 100 GB). I thought I'd post it in case anyone was interested.

7.0 years ago by cpak1981 ▴ 140

Please use ADD COMMENT/ADD REPLY when responding to existing posts (or edit original question with new relevant information) to keep threads logically organized.

7.0 years ago by GenoMax 141k

A worst-case scenario may require sorting an uncompressed file that is around 100 GB in size. (Imagine a BAM file where every read is randomly positioned.) So you might need a temporary folder that can hold up to that much data when extracting, converting, and sorting a BAM file that large. Converting to BED in parallel, one chromosome at a time, is a good way to go with large BAM files, if they are indexed or can be indexed.

7.0 years ago by Alex Reynolds 35k

score 1 · Answer 1 · 2017-04-20

What versions of convert2bed, sort-bed and samtools are you using?
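A minimal sketch for collecting those versions in one go, assuming all three tools accept `--version` and are on your `PATH`. The function name `report_versions` is made up for this example.

```shell
# Print the version banner of each tool involved in the conversion.
# A tool missing from PATH is reported instead of aborting the loop.
report_versions() {
    for tool in convert2bed sort-bed samtools; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $tool =="
            # Some tools print their banner to stderr, so capture both streams.
            "$tool" --version 2>&1 | head -n 3
        else
            echo "== $tool: not found on PATH =="
        fi
    done
}
```

Running `report_versions` and pasting the output into the thread would cover the first half of what is needed to reproduce the problem.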

Behind the scenes, convert2bed (bam2bed) passes BED to sort-bed. This sorting tool creates properly sorted BED.

One possibility is that your /tmp folder might be on a disk mount/share smaller than what is needed to store intermediate uncompressed BED data. Because the data being sorted are often larger than system memory, sort-bed uses only 2 GB of system memory by default and writes intermediate data for a merge sort to your /tmp folder (or whatever your operating system's temporary folder is set to).
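A quick way to see whether the temporary folder is the bottleneck is to check its free space before converting. The sketch below is only an illustration: the helper names `free_gb` and `has_free_gb` are hypothetical, and it assumes the temporary folder is `$TMPDIR` if set, otherwise `/tmp`, which may not match how your system is configured.

```shell
# Print free space, in whole gigabytes, on the filesystem holding a directory.
# Assumption: defaults to $TMPDIR, falling back to /tmp.
free_gb() {
    dir="${1:-${TMPDIR:-/tmp}}"
    # df -Pk: POSIX output format, 1024-byte blocks; column 4 is "Available".
    df -Pk "$dir" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# Succeed only if the directory has at least the requested number of free GB.
#   $1 required GB, $2 optional directory
has_free_gb() {
    [ "$(free_gb "${2:-}")" -ge "$1" ]
}
```

For example, `has_free_gb 100 /tmp || echo "not enough scratch space"` would flag a temporary folder too small for a worst-case sort of about 100 GB.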

If running out of space in /tmp while sorting is the issue, there are a few things you can do by setting appropriate convert2bed/bam2bed options.

You could set --max-mem to 40G or similar, since you have a system with 384 GB of RAM. Then all the sorting work on uncompressed BED would be done in memory, which will be faster, and you wouldn't need to worry as much about using or running out of space in /tmp.

Or, you could set the temporary sort directory via --sort-tmpdir to a folder on a disk share that has at least 40 GB of free space.

Or, you could disable BED sorting altogether via --do-not-sort. I really don't recommend this, since the sort order of BAM files can be either unknown or can be non-lexicographical, and the resulting sort order of the BED file will then be unknown or incorrect, possibly making it unusable for set operations with bedops, bedmap, etc.

I would only suggest using --do-not-sort if you pipe to awk or cut to remove columns, e.g.:

$ bam2bed --do-not-sort < reads.bam | cut -f1-6 | sort-bed --max-mem 20G - > reads.sorted.bed

We try to be non-lossy about conversion between formats. You may only be interested in a subset of columns, however, so this is a fast way to discard columns you might not want or need, with the benefit that sort-bed has a lot less input to sort within memory.
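The --max-mem and --sort-tmpdir options described above can be combined in a small wrapper. This is a sketch rather than a recommended recipe: the function name `bam_to_bed` and its argument order are made up for illustration, and the final check simply flags the empty-output symptom reported in this thread.

```shell
# Hypothetical wrapper: convert BAM to sorted BED with explicit sort settings.
#   $1 input BAM, $2 output BED, $3 --max-mem value (e.g. 40G),
#   $4 optional scratch directory passed to --sort-tmpdir
bam_to_bed() {
    in_bam="$1"; out_bed="$2"; max_mem="$3"; tmpdir="${4:-}"
    if [ -n "$tmpdir" ]; then
        bam2bed --max-mem "$max_mem" --sort-tmpdir "$tmpdir" < "$in_bam" > "$out_bed"
    else
        bam2bed --max-mem "$max_mem" < "$in_bam" > "$out_bed"
    fi
    # An empty output file is the failure mode reported here, so flag it.
    [ -s "$out_bed" ] || { echo "warning: $out_bed is empty" >&2; return 1; }
}
```

For example, `bam_to_bed reads.bam reads.bed 40G` raises the in-memory sort limit, and adding a fourth argument points any intermediate files at a larger scratch directory of your choosing.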

If your BAM file is indexed, a different option is to convert BAM to BED via one of several parallelized options, such as via GNU Parallel or via a SLURM or SGE computational cluster. This splits up the conversion work by chromosome, and those individual, per-chromosome conversion tasks are going to be much smaller. Conversion will go much faster, too, since we use some tricks that basically reduce the overall job to the time taken to convert the largest chromosome (chr1 tends to be the largest).
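The per-chromosome idea can be sketched by hand with samtools and xargs. This is not the parallel tooling referred to above, just an illustration of the splitting step: the function names are hypothetical, and it assumes an indexed BAM, an `xargs` that supports `-P` (GNU and BSD versions do), and `bedops --everything` to merge the sorted per-chromosome files.

```shell
# Hypothetical sketch: convert an indexed BAM to BED one chromosome at a time.
#   $1 indexed input BAM, $2 output directory, $3 number of parallel jobs
bam_to_bed_by_chrom() {
    in_bam="$1"; out_dir="$2"; jobs="${3:-4}"
    mkdir -p "$out_dir"
    # samtools idxstats lists one reference name per line; "*" is the unmapped bin.
    samtools idxstats "$in_bam" | cut -f1 | grep -v '^\*$' |
        xargs -P "$jobs" -I{} sh -c \
            'samtools view -b "$1" "$2" | bam2bed > "$3/$2.bed"' _ "$in_bam" {} "$out_dir"
}

# Merge the sorted per-chromosome files into one sorted BED file.
#   $1 directory of per-chromosome BED files, $2 merged output file
merge_chrom_beds() {
    bedops --everything "$1"/*.bed > "$2"
}
```

For example, `bam_to_bed_by_chrom reads.bam bed_by_chrom 8` followed by `merge_chrom_beds bed_by_chrom reads.sorted.bed`. Each job only sorts one chromosome's worth of reads, so the per-job memory and temporary-space needs are much smaller than for the whole file.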

In any case, there might be a bug in the BAM conversion routines, but I couldn't begin to troubleshoot without knowing the versions of the binaries and having some sample input that reproduces the error. So take a look at the options above and see whether adjusting memory settings helps. If parallelization is an option for you, I definitely recommend that route, provided your time is valuable to you and you have those computational resources.