samtools collate runs out of temporary storage space: no -T option!
4.9 years ago
obuzko • 0

Hi All, I have a bit of a dilemma. I have to process a rather large BAM file into an interleaved FASTQ. After it's sorted and indexed, I use the samtools collate command as follows:

 samtools collate -u -o input_sorted_collated.bam input_sorted.bam

The size of input_sorted.bam is ~147GB and after a long execution the process (unsurprisingly) runs out of space on /tmp. Unfortunately, the collate command of samtools doesn't recognize the -T option to direct the temporary files elsewhere. I guess the developer hasn't considered the possibility of really large datasets.

Has anyone run into this issue before? Can you suggest a solution/hack/workaround? Any thoughts are very much appreciated!

Sasha

samtools collate storage temporary • 3.6k views

The solution Pierre Lindenbaum provided should work. Always read the manual first.

I guess the developer hasn't considered the possibility of really large datasets.

Just as a comment: it's better to avoid statements like this, especially when the issue concerns something as basic as memory or disk usage. The samtools developers know what they are doing. These tools are routinely used by thousands of people, and large files are not uncommon: files in the 100+ GB range are typical for standard 30-50x short-read human WGS, and thousands of such samples exist at NCBI. Loading entire files into memory is uncommon, since not every user has a heavy server node available. Typically, if one runs into standard issues such as memory problems, one is doing something wrong.

4.9 years ago

I use the samtools collate command as follows: samtools collate -u -o input_sorted_collated.bam input_sorted.bam

(Not tested) Pass a <prefix> as the last argument; the temporary files are then written under that prefix instead of /tmp.

http://www.htslib.org/doc/samtools.html

 samtools collate [options] in.sam|in.bam|in.cram [<prefix>]
  

<prefix> is optional. If <prefix> is absent, collate will write the temporary files to a system-dependent location (/tmp on UNIX).
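A minimal sketch of the prefix approach, assuming a scratch directory on a volume with enough free space. The SCRATCH_DIR default below is a hypothetical placeholder, not something from the thread; the file names are the ones from the question. The guard only prints the command when samtools or the input file is missing, so nothing here should be read as tested on the 147GB file.

```shell
# Hypothetical scratch location (assumption): any directory on a volume
# with enough free space for the temporary files.
SCRATCH_DIR="${SCRATCH_DIR:-$PWD/collate_tmp}"
mkdir -p "$SCRATCH_DIR"

# collate names its temporary files from this prefix, so they should be
# written under SCRATCH_DIR rather than /tmp.
PREFIX="$SCRATCH_DIR/input_sorted"

# Check how much space is free where the temporary files will go.
df -h "$SCRATCH_DIR"

if command -v samtools >/dev/null 2>&1 && [ -f input_sorted.bam ]; then
    # Same command as in the question, with <prefix> as the last argument.
    samtools collate -u -o input_sorted_collated.bam input_sorted.bam "$PREFIX"
else
    echo "samtools or input_sorted.bam not found; would run:"
    echo "samtools collate -u -o input_sorted_collated.bam input_sorted.bam $PREFIX"
fi
```

The prefix is a path plus a file-name stem, not just a directory, so the directory part has to exist before collate starts.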


Thanks! I'll fire it up and update the comment with the result.


Thank you so much, Pierre. I was stuck here, but specifying the prefix, even though I'm writing the output to stdout, solved it (the node had run out of space on /tmp). The collate docs aren't very clear on this: they just say the prefix is necessary if the explicit output options aren't specified. Brr.
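For the stdout case, a hedged sketch of going straight to the interleaved FASTQ the original question asks for. It assumes samtools collate's -O (write to stdout) and samtools fastq's -n, -0 and -s options behave as in the samtools manual; SCRATCH_DIR and interleaved.fastq are hypothetical names, and the pipeline is untested on the data from this thread.

```shell
# Hypothetical scratch location and output name (assumptions).
SCRATCH_DIR="${SCRATCH_DIR:-$PWD/collate_tmp}"
mkdir -p "$SCRATCH_DIR"
PREFIX="$SCRATCH_DIR/input_sorted"
OUT_FASTQ="interleaved.fastq"

if command -v samtools >/dev/null 2>&1 && [ -f input_sorted.bam ]; then
    # -O sends the collated BAM to stdout; the prefix still controls where
    # the temporary files go. fastq reads from stdin ("-") and, with no
    # -1/-2 given, writes paired reads to stdout; singletons (-s) and
    # reads flagged as neither or both mates (-0) are discarded here.
    samtools collate -u -O input_sorted.bam "$PREFIX" \
        | samtools fastq -n -0 /dev/null -s /dev/null - > "$OUT_FASTQ"
else
    echo "samtools or input_sorted.bam not found; nothing was run."
fi
```

Piping avoids writing the intermediate collated BAM to disk at all, which matters when storage is already the bottleneck.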
