Question: dedupe from BBMAP suite woes with memory and stdout
ALchEmiXt (The Netherlands) wrote, 21 months ago:

I am working with dedupe from the BBMap suite (a highly appreciated suite!) to determine the number of exact duplicates in Illumina PE150 metagenomic data, because I suspect overamplification bias from PCR during library prep.

Two things I hope to get some clues on:

1) Can dedupe write its console output to a stats file? It seems to write those stats to STDERR rather than STDOUT, so simply redirecting the output to a log file does not work. Neither does redirecting stderr to stdout with 2>&1. I temporarily solved it by running the tool in a logged GNU screen session, but that is silly. I just want the total number of reads and the number of exact duplicates found in a stats file.

2) Why does a small set of two files of 10 GB each run out of memory on a 64 GB node? I gave Java all the memory I could find. On a 128 GB node it went over 95 GB, at which point I had to terminate it. How can I run those? A simple GNU sort | uniq on just the sequences consumed less than 15 GB of memory. I do want to run this de novo, without reference mapping.
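
For reference, the sort | uniq comparison mentioned above might look roughly like the sketch below. It is only an illustration: the file name demo_reads.fq and its contents are made up, it assumes strict 4-line FASTQ records, and it looks at a single file, so it ignores read pairing.

```shell
# Tiny hypothetical FASTQ file in which one sequence occurs twice.
cat > demo_reads.fq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGACCAA
+
IIIIIIII
@read3
ACGTACGT
+
IIIIIIII
EOF

# Keep only the sequence line of each 4-line record, then count identical sequences.
awk 'NR % 4 == 2' demo_reads.fq | sort | uniq -c > seq_counts.txt

# Total reads, distinct sequences, and the difference (reads that are exact duplicates).
total=$(awk 'NR % 4 == 2' demo_reads.fq | wc -l | tr -d ' ')
unique=$(wc -l < seq_counts.txt | tr -d ' ')
echo "total=$total unique=$unique duplicates=$((total - unique))" > dedupe_stats.txt
cat dedupe_stats.txt
```

Because sort can spill to temporary files on disk, this approach stays within modest memory, which is consistent with the memory difference described above.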

modified 21 months ago by Brian Bushnell • written 21 months ago by ALchEmiXt
csf=<file>            (clusterstatsfile) Write a list of cluster names and sizes

That should write the stats to a file for dedupe.sh.

Look at clumpify.sh as an option, since it uses temp files to hold data (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). It can slice and dice the duplicates in many ways AND it does not require a reference or alignments. You may still run out of memory depending on the size of the data.

modified 21 months ago • written 21 months ago by genomax

clumpify might indeed do what I want. I did play around with the clusterstats file but didn't get what I expected. I will reinvestigate it.

written 21 months ago by ALchEmiXt

If you only need statistics, then do not provide the out= option(s).

modified 21 months ago • written 21 months ago by genomax
h.mon wrote, 21 months ago:

You may try storename=f storequality=f uniquenames=f to reduce memory usage. In addition, you may use the subset option, which you should set to a value greater than 1; I have never used it, though.

written 21 months ago by h.mon
Brian Bushnell (Walnut Creek, USA) wrote, 21 months ago:

Dedupe uses a lot of memory overhead for hash tables, and Java is not particularly memory-efficient anyway. However, you can actually use Clumpify now for calculating the number of exact (or inexact) duplicates: clumpify.sh in1=r1.fq in2=r2.fq out1=clumped1.fq out2=clumped2.fq dedupe subs=0

This will write temp files if there's not enough memory so it should be able to work with large files on a low-memory node. As for stdout versus stderr, all statistics are written to stderr and you should be able to capture them as you describe; I always do this: argument=whatever 1>out.txt 2>&1

or argument=whatever 1>out.o 2>out.e

...and I've never had a problem. But I can add a way to redirect the stuff that gets printed to the screen to stdout.
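
The two redirection forms above can be tried without BBTools installed. The following is a minimal sketch in which fake_tool is a hypothetical stand-in for any program that writes data to stdout and statistics to stderr; the file names are arbitrary. The last command shows one common way such a capture goes wrong, which may or may not be what happened in the question: redirections are processed left to right, so placing 2>&1 before the stdout redirect sends stderr to the terminal instead of the file.

```shell
# Hypothetical stand-in for a tool that writes data to stdout and stats to stderr.
fake_tool() {
    echo "read data"
    echo "Input: 4 reads" >&2
}

# Both streams into one file: stdout goes to the file, then stderr follows stdout.
fake_tool 1>out.txt 2>&1

# Streams kept in separate files.
fake_tool 1>out.o 2>out.e

# Order matters: here stderr is duplicated onto the original stdout (the terminal)
# BEFORE stdout is redirected, so the stats line never reaches wrong.txt.
fake_tool 2>&1 1>wrong.txt
```

If only the statistics are wanted, redirecting stderr alone (2>stats.txt) is enough.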

written 21 months ago by Brian Bushnell

Just learned about the clumpify tool. Thanks and keep up the good work. Appreciated! Shouldn't stderr stay empty unless there is an error? So why not print to stdout?

written 21 months ago by ALchEmiXt

It's standard practice (or at least, that's my understanding) for programs that support piping to print warnings, status information, and so forth to stderr instead of stdout. That makes it easier to do things like: reformat.sh in=reads.fq out=stdout.fq samplerate=0.1 | bbduk.sh in=stdin.fq out=results.fq ref=contam.fa

Reformat's read count just gets dumped to stderr and doesn't contaminate the pipe.
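
The same point can be illustrated with a small sketch that needs no BBTools. Here producer is a hypothetical stand-in for an upstream tool: it sends three records to stdout and a status line to stderr, and wc -l plays the role of the downstream program. Only stdout travels through the pipe, so the status line is not counted as data.

```shell
# Hypothetical upstream tool: records on stdout, status message on stderr.
producer() {
    printf 'r1\nr2\nr3\n'
    echo "Reads processed: 3" >&2
}

# The pipe carries stdout only; stderr is captured separately in a log file.
producer 2>producer.log | wc -l | tr -d ' ' > count.txt

cat count.txt      # number of records the downstream program saw
cat producer.log   # status line that bypassed the pipe
```

Had the status line been printed to stdout, the downstream count would have been 4 instead of 3.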

modified 21 months ago • written 21 months ago by Brian Bushnell

Speaking of piping, is Clumpify pipe-capable?

written 21 months ago by h.mon

Yes, it is, actually. But bear in mind that it does try to grab all available memory unless you add a -Xmx flag. Also, it tries to check the size of the input file, and read the first few reads to test compressibility, so it knows whether or not it needs to write temp files and how many to use; if it is streaming from stdin it can't do that so it has to guess, in which case it uses a predetermined number of temp files even if the input will easily fit in memory. You can still override this with the "groups=" flag which will tell Clumpify how many temp files to use. With "groups=1", it won't write temp files, and will be faster.

Clumpify is conservative and by default writes temp files when they are not strictly necessary, because it's nice to have programs that just work. But, if you are certain that your data will fit in memory, you can use "in=stdin.fq out=stdout.fq interleaved ways=1 -Xmx40g" and as long as all the data fits in 40 GB RAM, it will be ~twice as fast as if you did not specify "ways". I added the "interleaved" flag because BBTools can autodetect whether the input data is interleaved when reading from a file, but NOT when reading from stdin, so for non-interleaved reads it should be excluded.

So yes, Clumpify supports piping, and I just tested it to verify that it works, but it's probably not overly useful in most cases; it still has to read and process the entire input file before it can output the first read, unlike BBDuk, BBMap, or similar streaming programs.

modified 21 months ago • written 21 months ago by Brian Bushnell