dedupe from BBMAP suite woes with memory and stdout
5.2 years ago
ALchEmiXt ★ 1.9k

I am using dedupe from the BBMap suite (a highly appreciated suite!) to estimate the number of exact duplicates in Illumina PE150 metagenomic data, because I suspect overamplification bias from PCR during library prep.

Two things I hope to get some clues on:

1) Can dedupe write its console output to a stats file? It seems to write those stats to stderr rather than stdout, so simply piping the output to a log file does not work. Neither does redirecting stderr to stdout with 2>&1. I temporarily solved it by running the tool in a logged GNU screen session, but that is silly. I just want the total number of reads and the number of exact duplicates found, written to a stats file.

2) Why does a small data set of two files of 10 GB each run out of memory on a 64 GB node? I gave Java all the memory I could find. On a 128 GB node it went above 95 GB, at which point I had to terminate it. How can I run these files? A simple GNU sort | uniq on just the sequences consumed less than 15 GB of memory. I do want to run this de novo, without reference mapping.
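
For reference, the sort | uniq comparison mentioned above can be sketched roughly as follows. This is only an illustration, not what dedupe does internally: it assumes plain 4-line FASTQ records, looks at one file at a time, and ignores pairing and reverse complements. The function name `count_exact_dups` and the file name in the usage comment are made up for the example.

```shell
#!/usr/bin/env bash
# Count reads and exact-duplicate sequences in a 4-line-per-record FASTQ file.
# Assumption: sequences are not wrapped over several lines.
count_exact_dups() {
    local fq="$1"
    # Line 2 of every 4-line record is the sequence.
    awk 'NR % 4 == 2' "$fq" \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ total += $1; if ($1 > 1) dups += $1 - 1 }
               END { printf "reads=%d duplicates=%d\n", total, dups }'
}

# Hypothetical usage:
#   count_exact_dups reads_R1.fq > stats.txt
```

GNU sort spills to temporary files on disk when its buffer fills up, which is probably why this approach stays well below the memory used by an in-memory hash table.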

bbmap dedupe linux duplicates fastq • 2.5k views
csf=<file>            (clusterstatsfile) Write a list of cluster names and sizes


That should write the stats to a file for dedupe.sh.

Look at clumpify.sh as an option, since it uses temp files to hold data (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). It can slice and dice the duplicates in many ways, and it does not require a reference or alignments. You may still run out of memory, depending on the size of the data.

clumpify might indeed do what I want. I did play around with the clusterstats file but didn't get what I expected. I will reinvestigate it.

If you only need statistics then do not provide out= option(s) to clumpify.sh.

5.2 years ago
h.mon 34k

You may try storename=f storequality=f uniquenames=f to reduce memory usage. In addition, you may use the subset option, which should be set to a value greater than 1; I have never used it, though.

5.2 years ago

Dedupe uses a lot of memory overhead for hash tables, and Java is not particularly memory-efficient anyway. However, you can actually use Clumpify now for calculating the number of exact (or inexact) duplicates:

clumpify.sh in1=r1.fq in2=r2.fq out1=clumped1.fq out2=clumped2.fq dedupe subs=0


This will write temp files if there's not enough memory so it should be able to work with large files on a low-memory node. As for stdout versus stderr, all statistics are written to stderr and you should be able to capture them as you describe; I always do this:

tool.sh argument=whatever 1>out.txt 2>&1


or

tool.sh argument=whatever 1>out.o 2>out.e


...and I've never had a problem. But I can add a way to redirect the stuff that gets printed to the screen to stdout.
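
One common reason a 2>&1 redirect appears not to work is the order of the redirections, so here is a small sketch of the difference. `fake_tool` is a hypothetical stand-in for any script that writes data to stdout and statistics to stderr; the "Input: 4 reads" line is invented for the demo and is not Dedupe's real output format.

```shell
#!/usr/bin/env bash
# fake_tool is a made-up stand-in: data on stdout, statistics on stderr.
fake_tool() {
    echo "read data"
    echo "Input: 4 reads" >&2
}

workdir=$(mktemp -d)

# Correct order: stdout goes to the file first, then stderr follows it.
fake_tool > "$workdir/both.log" 2>&1

# Wrong order for this purpose: stderr is duplicated onto the current stdout
# (the terminal) before stdout is redirected, so the stats never reach the file.
fake_tool 2>&1 > "$workdir/stdout_only.log"

# Keep only the statistics and discard the data stream.
fake_tool 2> "$workdir/stats.log" > /dev/null
```

After running this, both.log holds both lines, stdout_only.log holds only "read data", and stats.log holds only the statistics line.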

I just learned about the clumpify tool. Thanks, and keep up the good work. Appreciated! Shouldn't stderr stay empty unless there is an error? So why not print to stdout?

It's standard practice (or at least, that's my understanding) for programs that support piping to print warnings, status information, and so forth to stderr instead of stdout. That makes it easier to do things like:

reformat.sh in=reads.fq out=stdout.fq samplerate=0.1 | bbduk.sh in=stdin.fq out=results.fq ref=contam.fa


Reformat's read count just gets dumped to stderr and doesn't contaminate the pipe.
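
To make the point concrete without needing BBTools installed, here is a minimal sketch with two made-up shell functions (`emit_reads` and `count_lines` are hypothetical names, not part of any tool). The producer writes records to stdout and a status line to stderr; the consumer only ever sees the records unless stderr is explicitly merged into the pipe.

```shell
#!/usr/bin/env bash
# Hypothetical producer: records on stdout, status message on stderr.
emit_reads() {
    printf 'read1\nread2\nread3\n'
    echo "emit_reads: wrote 3 records" >&2
}

# Hypothetical consumer: counts the lines arriving on stdin.
count_lines() {
    wc -l | tr -d ' '
}

logdir=$(mktemp -d)

# The status line goes to its own log, so the pipe carries only the 3 records.
clean_count=$(emit_reads 2> "$logdir/emit.log" | count_lines)

# Merging stderr into the pipe contaminates the data with the status line.
mixed_count=$(emit_reads 2>&1 | count_lines)

echo "clean=$clean_count mixed=$mixed_count"
```

The second count is one too high, which is the kind of contamination that printing statistics to stderr avoids.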

Speaking about piping, is clumpify.sh pipe-capable?

Yes, it is, actually. But bear in mind that it does try to grab all available memory unless you add a -Xmx flag. Also, it tries to check the size of the input file, and read the first few reads to test compressibility, so it knows whether or not it needs to write temp files and how many to use; if it is streaming from stdin it can't do that so it has to guess, in which case it uses a predetermined number of temp files even if the input will easily fit in memory. You can still override this with the "groups=" flag which will tell Clumpify how many temp files to use. With "groups=1", it won't write temp files, and will be faster.

Clumpify is conservative and by default writes temp files when they are not strictly necessary, because it's nice to have programs that just work. But, if you are certain that your data will fit in memory, you can use "in=stdin.fq out=stdout.fq interleaved ways=1 -Xmx40g" and as long as all the data fits in 40 GB RAM, it will be ~twice as fast as if you did not specify "ways". I added the "interleaved" flag because BBTools can autodetect whether the input data is interleaved when reading from a file, but NOT when reading from stdin, so for non-interleaved reads it should be excluded.

So yes, Clumpify supports piping, and I just tested it to verify that it works, but it's probably not overly useful in most cases; it still has to read and process the entire input file before it can output the first read, unlike BBDuk, BBMap, or similar streaming programs.