Tool: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
gravatar for Brian Bushnell
2.3 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I'd like to introduce a new member of the BBMap package, Clumpify. This is a bit different from other tools in that it does not actually change your data at all, simply reorders it to maximize gzip compression. Therefore, the output files are still fully-compatible gzipped fastq files, and Clumpify has no effect on downstream analysis aside from making it faster. It’s quite simple to use: in=reads.fq.gz out=clumped.fq.gz reorder

This command assumes paired, interleaved reads or single-ended reads. But Clumpify also now works with paired reads in twin files like this: in1=r1.fq.gz in2=r2.fq.gz out1=clumped1.fq.gz out2=clumped2.fq.gz reorder

How does this work? Clumpify operates on a similar principle to that which makes sorted bam files smaller than unsorted bam files – the reads are reordered so that reads with similar sequence are nearby, which makes gzip compression more efficient. But unlike sorted bam, during this process, pairs are kept together so that an interleaved file will remain interleaved with pairing intact. Also unlike a sorted bam, it does not require mapping or a reference, and except in very unusual cases, can be done with an arbitrarily small amount of memory. So, it’s very fast and memory-efficient compared to mapping, and can be done with no knowledge of what organism(s) the reads came from.

Internally, Clumpify forms clumps of reads sharing special ‘pivot’ kmers, implying that those reads overlap. These clumps are then further sorted by position of the kmer in the read so that within a clump the reads are position-sorted. The net result is a list of sorted clumps of reads, yielding compression within a percent or so of sorted bam.

How long does Clumpify take? It's very fast. If all data can fit in memory, Clumpify needs the amount of time it takes to read and write the file once. If the data cannot fit in memory, it takes around twice that long. That's on our cluster, though, with a high-performance parallel filesystem. On some filesystems, or with a single spinning disk, it may take several times longer when data does not fit into memory, because multiple files are constantly being read and written at once.

Why does this increase speed? There are a lot of processes that are I/O limited. For example, on a multicore processor, using BBDuk, BBMerge, Reformat, etc. on a gzipped fastq will generally be rate-limited by gzip decompression (even if you use pigz, which is much faster at decompression than gzip). Gzip decompression seems to be rate-limited by the number of input bytes per second rather than output, meaning that a raw file of a given size will decompress X% faster if it is compressed Y% smaller; here X and Y are proportional, though not quite 1-to-1. In my tests, assembly with Spades and Megahit have time reductions from using Clumpified input that more than pays for the time needed to run Clumpify, largely because both are multi-kmer assemblers which read the input file multiple times (and according to Megahit's author, due to cache locality). Something purely CPU-limited like mapping would normally not benefit much in terms of speed (though still a bit due to improved cache locality).

When and how should Clumpify be used? If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming) so that the clump order is maintained; thus, all intermediate files will benefit from the increased compression and increased speed. I recommend running Clumpify on ALL data that will ever go into long-term storage, or whenever there is a long pipeline with multiple steps and intermediate gzipped files. Also, even when data will not go into long-term storage, if a shared filesystem is being used or files need to be sent over the internet, running Clumpify as early as possible will conserve bandwidth. The only times I would not clumpify data are enumerated below.

When should Clumpify not be used? There are a few cases where it probably won’t help:

1) For reads with a very low kmer depth, due to either very low coverage (like 1x WGS) or super-high-error-rate (like raw PacBio data). It won’t hurt anything but won’t accomplish anything either.

2) For large volumes of amplicon data. This may work, and may not work; but if all of your reads are expected to share the same kmers, they may all form one giant clump and again nothing will be accomplished. Again, it won’t hurt anything, and if pivots are randomly selected from variable regions, it might increase compression.

3) When your process is dependent on the order of reads. If you always grab the first million reads from a file assuming they are a good representation of the rest of the file, Clumpify will cause your assumption to be invalid – just like grabbing the first million reads from a sorted bam file would not be representative. Fortunately, this is never a good practice so if you are currently doing that, now would be a good opportunity to change your pipeline anyway. Randomly subsampling is a much better approach.

4) If you are only going to read data fewer than ~3 times, it will never go into long-term storage, and it's being used on local disk so bandwidth is not an issue, there's no point in using Clumpify (or gzip, for that matter).

Please let me know if you have any questions, and please make sure you are using the latest version of BBTools when trying new functionality.

Edit: Although I was unaware when I wrote Clumpify, I have learned that there are prior examples of programs designed to put reads into buckets containing identical substrings - for example, SCALCE and MINCE. The approaches are quite different but the objective and results are similar. So, I encourage people seeking maximal compression (or planning to build something even better) to investigate those tools as well!

Edit: I've now released version 37.24 which has some nice optical deduplication improvements. It's now faster (many times faster in certain degenerate cases), and there are improvements in precision for NextSeq tile-edge duplicates. Specifically, it is now recommended that they be removed like this: in=nextseq.fq.gz out=clumped.fq.gz dedupe optical spany adjacent

This will remove all normal optical duplicates, and all tile-edge duplicates, but it will only consider reads to be tile-edge duplicates if they are in adjacent tiles and share their Y-coordinate (within dupedist), rather than before, in which they could be in any tiles and could share their X-coordinate. This means that there are fewer false-positives (PCR or coincidental duplicates that were being classified as optical/tile-edge duplicates). This is possible because on NextSeq the tile-edge duplicates are only present on the tile X-edges and the duplicates are only between adjacent tiles.

Normal optical duplicate removal (HiSeq, MiSeq, etc) should use this command: in=nextseq.fq.gz out=clumped.fq.gz dedupe optical

ADD COMMENTlink modified 22 months ago by chen1.8k • written 2.3 years ago by Brian Bushnell16k

Great work. Talking about fastq compression, it is worth mentioning this paper, and fqzcomp and dsrc in particular.

ADD REPLYlink written 2.3 years ago by lh331k

Also worth noting:



BBTools supports both fqzcomp and alapy (and pigz and pbzip2) if they are on the command line, meaning that you can do something like ref=phix.fa.bz2 out=clean.fqz and they will all be appropriately compressed/decompressed. However, in my tests, while Clumpify greatly improves .gz and .bz2 compression, it makes .fqz (fqzcomp) and .ac (alapy) compression slightly worse (presumably because consecutive headers become less similar after reordering). I think I had some kind of problem testing dsrc which is why it's not currently supported, but I should re-evaluate that.

ADD REPLYlink modified 22 months ago • written 22 months ago by Brian Bushnell16k

This is a good job, especially useful for deep sequencing data

ADD REPLYlink written 22 months ago by chen1.8k

I am using clumpify on HiSeq4000 data. As recommended, I increased dupedist to 2500 because of patterned flow cells. However, I get exclusively optical duplicates using this option:

dupedist=2500 dedupe=t optical=t

Gives the same output as:

dupedist=2500 dedupe=t

Any ideas? All optical duplicates sound not right.

ADD REPLYlink written 22 months ago by Pascal250

See if this explanation helps.

dedupe=f optical=f (default)
Nothing happens with regards to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t

Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.
The allduplicates flag makes all copies of duplicates removed, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.`
ADD REPLYlink modified 22 months ago • written 22 months ago by genomax64k

If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically. So please use:

dupedist=2500 dedupe=t optical=t



Perhaps I should clarify that...

ADD REPLYlink written 22 months ago by Brian Bushnell16k

I just tested clumpify with NA12878 fastq files.

The size of original read1 file (SRR952827_1.fastq) is 211G, but the size of clumpified data (clump.SRR952827_1.fastq) is 182G. Why it's smaller?

I didn't enable any deduplication. The command is: in=SRR952827_1.fastq  out=clump.SRR952827_1.fastq
ADD REPLYlink modified 16 months ago • written 16 months ago by chen1.8k

It's smaller because the sorted reads compress more efficiently than unsorted reads, since reads sharing sequence are adjacent in the file. The number of reads is the same. If you decompress the files they will be the same size.

Edit - for some reason I overlooked the fact that your files are not gzipped. The output size should be identical for uncompressed files unless the program was terminated before it was done writing. Can you post the exact command and the screen output, as well as the BBTools version? Also, can you run "wc" on both files?

ADD REPLYlink modified 16 months ago • written 16 months ago by Brian Bushnell16k

@Brian: Based on the names the files appear to be un-compressed though.

ADD REPLYlink written 16 months ago by genomax64k

Oh, hmm, that's odd. I guess I did not read the question carefully enough. Let me edit that post...

ADD REPLYlink written 16 months ago by Brian Bushnell16k

chen : How much RAM did you assign for this job?

ADD REPLYlink written 16 months ago by genomax64k

My system has 1T RAM, and I noticed 738G memory was available and clumpify automatically set the JVM memory limit to be 738G.

ADD REPLYlink written 16 months ago by chen1.8k

So it's based on the similarity of sequences? If two reads (50 bp) overlap with 25 bp, will they be cluster together? Usually, we use Hiseq2500 to do RNAseq, CLIPseq, 3'reads-seq, etc. Does the type of libraries affect the deduplication result? Thanks very much.

ADD REPLYlink written 7 months ago by linjc.xmu10

I like the tool. I'm using followed by Is there a way to sort by the number of copies in the output?

ADD REPLYlink written 24 days ago by genya3510

No. You will need to devise a custom script yourself.

ADD REPLYlink written 24 days ago by genomax64k
gravatar for Brian Bushnell
2.2 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except one - the highest-quality copy is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates", to compensate for sequencing error, but this can be overriden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.

Example commands:

Clumpify only; don't remove duplicates: in=reads.fq.gz out=clumped.fq.gz

Remove exact duplicates: in=reads.fq.gz out=clumped.fq.gz dedupe subs=0

Mark exact duplicates, but don't remove them (they get " duplicate" appended to the name): in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0

Remove duplicates, allowing up to 5 substitutions between copies: in=reads.fq.gz out=clumped.fq.gz dedupe subs=5

Remove ALL copies of reads with duplicates rather than retaining the best copy: in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates

Remove optical duplicates only (duplicates within 40 pixels of each other): in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f

Note that the optimal setting for dist is platform-specific; 40 is fine for NextSeq and HiSeq2500/1T.

Remove optical duplicates and tile-edge duplicates: in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40

Clumpify only detects duplicates within the same clump. Therefore, it will always detect 100% of identical duplicates, but is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping - with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more sensitive to errors than mapping-based duplicate detection. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3: in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5

Each pass will have a chance of identifying more duplicates, because a different kmer will be selected for seeding clumps; thus, eventually, any pair of duplicates will land in the same clump given enough passes if they share a single kmer, regardless of how many errors they have. But in practice the majority are identified in the first pass and you don't really get much more after about the 3rd pass. Decreasing K and (especially) using additional passes will take longer, and there is no point in doing them if you are running with subs=0 (identical duplicates only) because in that case all duplicates are guaranteed to be found in the first pass. If all the data fits in memory, additional passes are extremely fast; the speed penalty is only noticeable when the data does not all fit in memory. Even so, passes=3 will generally still be much faster than using mapping to remove duplicates.

ADD COMMENTlink written 2.2 years ago by Brian Bushnell16k

How do you think this will affect reads coming from two almost identical regions but for a few bases? For example SMN1/SMN2 have regions that are only differentiated by 1 base. Would with the recommended allowed subs parameter remove non-duplicated reads for these two regions?


ADD REPLYlink written 2.1 years ago by Carlos Borroto1.8k

There is no recommended subs parameter. You decide what is acceptable. If you want only 1 difference then you could set subs=1.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax64k

Hi Carlos,

As Genomax indicated, I don't have a recommended setting, but the default is 2. That said, this graph might help you:

Dupes vs Subs

As you can see, for HiSeq data, you get the vast majority of the duplicates allowing 1 mismatch. NextSeq is a different story, however. I would not worry very much about two genomic regions being similar unless you have very high coverage, in which case I'd suggest optical duplicate removal only. If you have 100x coverage and 100bp reads, there is a 37% chance of a given read having a duplicate arising purely by chance. But for paired reads, with an insert size range 200-400bp (let's say linearly distributed for simplicity), the insert size also has to match so you get a 0.18% rate of duplicates occurring by chance. This is pretty low. If your genome has a duplicated region, then the coincidental duplicate rate in that region will be roughly doubled to 0.37% instead. For most purposes, losing 0.37% of your data simply won't matter. But if it does matter, use the "optical" flag, which will virtually eliminate coincidental duplicates.

ADD REPLYlink written 2.1 years ago by Brian Bushnell16k

Is it possible to get a consensus sequence for each set of duplicated reads along with a count for how many reads make up the consensus?

ADD REPLYlink written 2.0 years ago by danielfortin860

Currently, you can get a consensus per clump, and rename the clump indicating how many reads it constitutes. Duplicates are handled by retaining the highest-quality member of the set of duplicates. I could modify it to produce a consensus just for duplicate reads and modify the header to indicate how many it represents, though... doesn't look like it would be overly difficult.

ADD REPLYlink written 2.0 years ago by Brian Bushnell16k

Currently, you can get a consensus per clump, and rename the clump indicating how many reads it constitutes.

I have not gone through the in-line help for clumpify recently but is that something new you recently added? What are the command options to do that?

ADD REPLYlink written 2.0 years ago by genomax64k

Consensus was one of the first functionalities I added, when I was trying to make Clumpify into an assembler... you can enable it with the "consensus" flag. Not sure how it plays with paired reads, so if you run in consensus mode, I suggest adding the flag "unpair".

ADD REPLYlink written 2.0 years ago by Brian Bushnell16k

Does dupesubs= affect consensus? Does it work only with dupesubs=0? Can the clumped files be "re-paired" ( afterwards as I assume they will go out of sync by specifying unpair?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by genomax64k

The way consensus is currently implemented is unrelated to the dedupe and dupesubs flags. Also, consensus (again, as it is currently) will fundamentally change the data in a way that the original pairing, or reads, can never be recovered. I may add per-duplicate consensus which will allow you to keep reads and pairing intact as per danielfortin's request, but right now, it works like this:

Reads sharing a kmer are organized into clumps. I've linked my Clumpify poster, below, which can help visualize what the clumps look like:

With the consensus flag, each clump is reduced to a single consensus sequence, which is usually longer than any original read in the clump since it spans the entire clump. Currently, I think allele-splitting is only enabled for error-correction, not consensus, but I plan to add it for consensus as well.

So, if you have 100x coverage before consensus, you will have perhaps 1.3x to 2x coverage after consensus (depending on the error rate); all of the original sequences and pairing information will be destroyed, and you can't recover the pairing because each consensus sequence will consist of many reads that have mates which were absorbed into different other clusters. The goal is to flatten the reads into something like an assembly. Indeed, there are certain operations (like Sketch, kmer-based operations in general, and alignment) that would perform much better on the consensus than the raw reads, and don't care much about continuity. I think it's probably not useful for most applications.

That said, the way error-correction via Clumpify works is:

1) Organize reads into clumps that share a kmer
2) Produce a consensus for each clump
3) Compare reads to the consensus, and split alleles (as indicated in the poster) so that reads that coincidentally share a kmer, but are not actually genomically overlapping, get put into different clumps
4) Repeat 2-3 until convergence
5) Depending on the strength of the consensus at each position, convert all bases not matching the consensus to the consensus base

Therefore, to get consensus of duplicate reads, you can currently do this: in=reads.fq out=corrected.fq ecc passes=6 unpair repair in=corrected.fq out=deduplicated_consensus.fq dedupe dupesubs=0

Clumpify in ecc mode does not obey the "dupesubs" parameter (which is only for dedupe mode), but it does allow "minid", so for 100bp reads you could add the flag "minid=0.98" to achieve the same effect as "minsubs=2".

So - you can get something like consensus sequence for duplicate reads, using the above method, but it also uses nonduplicate reads as sources of information and thus is not "safe" for purposes like low-frequency variant calling from PCR'd libraries. You cannot currently get consensus for duplicate pairs, retaining pairing. But I will investigate adding that. I don't think it will be difficult.

ADD REPLYlink written 2.0 years ago by Brian Bushnell16k

Thanks for sharing a copy of the poster (and additional details). My interest in clumpify is currently mainly for its ability to identify duplicates (though the ability to do local assembly is something to I plan to check out).

My wish list is slightly different. I am purely interested in duplicates (optical primarily but others too) and want to know how many duplicate sequences were present in a clump (based purely on sequence). If that number could be correlated (sequence - # of dups) in a separate stats file that would be great. AFAIK this is not currently possible directly.

ADD REPLYlink modified 24 months ago • written 24 months ago by genomax64k

That would be perfect for my application. I would really appreciate it if this could be added to the next version of clumpify

ADD REPLYlink written 24 months ago by danielfortin860

@Brian had added count functionality to clumpify (as of v.37.17). in=file.fq out=deduped.fq dedupe addcount

Note from @Brian: this does NOT work in "markduplicates" mode, only in "dedupe" mode. Reads that absorbed duplicates will have "copies=X" appended to the end to indicate how many reads they represent (including themselves, so the minimum you would see is 2).

ADD REPLYlink modified 15 months ago • written 23 months ago by genomax64k

What would be optimal command for removing only optical duplicates specifically from PE reads? Thanks.

ADD REPLYlink written 20 months ago by rbronste230

Depending on what sequencer your data is from change the value for dupedist. in1=reads1.fq.gz in2=reads2.fq.gz out1=clumped1.fq.gz out2=clumped2.fq.gz dedupe optical dupedist=NN

dupedist=40         (dist) Max distance to consider for optical duplicates.
                    Higher removes more duplicates but is more likely to
                    remove PCR rather than optical duplicates.
                    This is platform-specific; recommendations:
                       NextSeq      40  (and spany=t)
                       HiSeq 1T     40
                       HiSeq 2500   40
                       HiSeq 3k/4k  2500
                       Novaseq      12000
ADD REPLYlink modified 20 months ago • written 20 months ago by genomax64k

How are the files changed following this process (example command below), tried after to run them through cutadapt (output below) to remove adapter sequences and got the following:

bash in1= in2= out1= out2= dedupe optical dupedist=40 spany=t

unexpected end of file 
cutadapt: error: gzip process returned non-zero exit code 1. Is the input file truncated or corrupt?

This did not occur with the original fastq directly from NextSeq.

ADD REPLYlink modified 20 months ago • written 20 months ago by rbronste230

Can you give the full command and screen output (and BBMap version) of Clumpify? It sounds like the output was either corrupt, or Clumpify was not finished yet, when you ran Cutadapt.

ADD REPLYlink written 20 months ago by Brian Bushnell16k

Yes you are correct after checking the log:

Using the latest version of clumpify as I just downloaded BBMap, memory allocation being an issue is weird since it was running on cluster node. The node job was dumped but I missed that fact, many temp files accumulated actually. Cutadapt was run separately after that. The full command is as above with output in gz format.

GNU nano 2.0.9 File: STDIN.e17695366

java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
java -ea -Xmx69077m -Xms69077m -cp /mnt/grid/tollkuhn/hpc_norepl/home/data/software/BBMap/bbmap/current/ clump.Clumpify in1=/sonas-hs$
Executing clump.Clumpify [in1=/sonas-hs/tollkuhn/hpc_norepl/home/rbronste/DNase-seq_DATA/atac_adult_R3/Vehicle_020117/298777_S5_R1_001$

Clumpify version 37.36
Read Estimate:          317335522
Memory Estimate:        161714 MB
Memory Available:   54245 MB
Set groups to 29
Executing clump.KmerSplit [in1=/sonas-hs/tollkuhn/hpc_norepl/home/rbronste/DNase-seq_DATA/atac_adult_R3/Vehicle_020117/298777_S5_R1_00$

Reset INTERLEAVED to false because paired input files were specified.
Set INTERLEAVED to false
Input is being processed as paired
Writing interleaved.
Made a comparator with k=31, seed=1, border=1, hashes=4
Time:                           185.291 seconds.
Reads Processed:        159m    859.68k reads/sec
Bases Processed:      12106m    65.34m bases/sec
Executing clump.KmerSort3 [in1=298777_S5_R1_001_dedup_clumpify_p1_temp%_29ab8e4910f898b2.fastq.gz, in2=null, out=/sonas-hs/tollkuhn/h$

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Making 2 fetch threads.
Starting threads.
Fetching reads.
Fetched 2697665 reads:  14.801 seconds.
Making clumps.
Clump time:     1.044 seconds.
Dedupe time:    0.900 seconds.
Fetching reads.
Fetched 2675797 reads:  0.000 seconds.
Making clumps.
Clump time:     0.975 seconds.
Dedupe time:    0.748 seconds.
Fetching reads.
Fetched 2716264 reads:  23.337 seconds.
Making clumps.
Clump time:     1.131 seconds.
Dedupe time:    0.769 seconds.
Fetching reads.
Fetched 2699556 reads:  0.000 seconds.
Making clumps.
Clump time:     1.038 seconds.
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00002af6e1f12000, 12288, 0) failed; error='Cannot allocate memor$
ADD REPLYlink modified 20 months ago • written 20 months ago by rbronste230

@Brian will be along with an official answer but it looks like 161G memory is estimated to be needed and you have 54G available. Did you run out of space on the local disk where the temp files were being written to? Is this a high output NextSeq run (or are these more than one runs pooled?).

ADD REPLYlink written 20 months ago by genomax64k
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00002af6e1f12000, 12288, 0) failed; error='Cannot allocate memor$

This is a weird message.. was the process running alone on the node, or were there other processes as well?

It looks like the free memory on the node got used up by some other process and then the Java virtual machine no longer had enough memory to operate (sort of independent of the memory being used by Clumpify itself). I'd suggest either trying it on a node where nothing else is running, or as genomax suggests, manually setting the -Xmx flag. But, possibly, I'd suggest reducing it slightly, since the memory set by -Xmx is for Clumpify, and actually competes with the free memory available for the Java virtual machine.

ADD REPLYlink written 20 months ago by Brian Bushnell16k

I managed to get it working with the flag:


One other question if I am invoking the following options it removes pcr AND optical duplicates?

dedupe dupedist=40 spany=t -Xmx24g


ADD REPLYlink written 20 months ago by rbronste230

If you ever set the "dupedist" flag, it automatically goes into optical-only mode. To remove all duplicates (PCR and optical), just use "dedupe" and not "dupedist" or "spany".

The modes are, in order of becoming more restrictive:

"dedupe" : Removes all duplicates (keeping one copy of each), including PCR, tile-edge, and optical.
"dedupe dist=X spany=t" : Removes optical duplicates and NextSeq-specific tile-edge duplicates.
"dedupe dist=X" : Removes traditional optical duplicates only.
ADD REPLYlink modified 20 months ago • written 20 months ago by Brian Bushnell16k

Thank you for clarification.

If using just dedupe is there a way to specify that its a NextSeq derived fastq - so optical and tile edge duplicates are removed accordingly?

ADD REPLYlink written 20 months ago by rbronste230

You don't need to tell it the data is NextSeq; just use "dedupe" only to remove PCR, tile-edge, and optical duplicates, or "dedupe dist=40 spany=t" to remove tile-edge and optical duplicates but not PCR duplicates.

ADD REPLYlink written 20 months ago by Brian Bushnell16k

I am mainly interested in the deduplication performance of Clumpify for NovaSeq data. from reading this thread it seems just using 'dedupe' is my best option, but for removal of PCR duplicates AND optical duplicates don't I need to also adjust the 'dupedist' value to 12000? Yet, from my understanding, by adding in that value I am now only removing optical duplicates. Am I able to better tailor the command for all duplicate removal including optical duplicates but with the recommended dist for NovaSeq?

ADD REPLYlink written 7 months ago by figueroakl0

but for removal of PCR duplicates AND optical duplicates don't I need to also adjust the 'dupedist' value to 12000?

Correct. dedupe is going to remove all duplicates leaving one best copy.

I am not sure what kind of data this is but if you are only interested in removing optical duplicates (which are true artifacts) then you should just do that using dupedist=12000 without dedupe.

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax64k

New to BBMap and clumpify. I want to re-ask some of the questions above because I'm still unclear.

If we have data from NextSeq or NovaSeq, and we want to remove PCR AND optical duplicates, we just use the dedupe switch w/o the dupedist, spany, or adjacent switches. This will apparently take into account tile-edge and optical duplicates.

However, I am unclear how dedupe, by itself, understands that the data is from a NextSeq or a NovaSeq. The clumpify documentation makes a big deal about switch recommendations for both of those platforms, but the directions given here indicate that the user should ignore all of those when attempting to remove PCR and optical duplicates.

Am I missing something? Does enabling PCR duplication removal, with clumpify, negate the necessity for those other switch values?

Secondly, I want to confirm a best practices pipeline: We are to run clumpify on raw reads, before quality trimming, adapter trimming, or paired-end read merging. Following clumpify, trimming, and mapping, we shouldn't need to run any other PCR duplicate removal (rmdup, picard, etc) tools?

ADD REPLYlink written 4 months ago by jtack0

Those switches are a combination of two or three individual options that you would need to use. You can just use actual switches you need/want and ignore those aliases so you know what exactly you did.

What kind of sequencing are you working with (RNAseq, WGS etc). You would want to be careful about removing PCR duplicates (you would want to remove optical duplicates irregardless) (Pay attention to C: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remov ). You should run clumpify on raw reads before proceeding with other analysis.

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax64k

Thanks for the quick response. We do want to remove PCR dups, as the data are from low coverage whole genomes or post-captured libraries, both from degraded, low endogenous molecule samples. We tend to have astronomical duplication rates and I was interested to see how clumpify would handle it verse the typical PCR duplicate removal software.

I did see that earlier post, and I now know where I was confused. Running dedupe alone will capture all optical duplicates, including tile-edge duplicates from NextSeq, because it will treat them all like PCR duplicates and remove based on a combination of read 1 and read 2 start positions and the given allowed substitutions subs=. When the user just wants to remove optical duplicates, that is when you need the specific switch values changed for NextSeq,NovaSeq or else you might get increased false positives/negatives. Do I have that correct?

We're testing the pipeline now, and given our large duplication count, we are finding we need to run multiple passes to get them all, or else a large amount of read pairs with exact start and end positions get through. We are currently at k=19 passes=6 subs=5.

ADD REPLYlink written 4 months ago by jtack0

When the user just wants to remove optical duplicates, that is when you need the specific switch values changed for NextSeq,NovaSeq or else you might get increased false positives/negatives.


You appear to have a rather unusual use case. It may indeed need more passes as you have discovered. You are allowing for 5 errors (not sure how long you reads are) but you may want to reconsider that. Illumina sequencing has a less than 0.1% error rate.

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax64k

Good point. Is there a way to slide the threshold of duplicate identification to be based on total read base quality instead of or in addition to direct substitution count comparisons, assuming the reads in question have the same start and end positions?

ADD REPLYlink written 4 months ago by jtack0

Since you are using a high number of substitutions this may be tricky but clupmpify retains the highest quality copy by default.

ADD REPLYlink written 4 months ago by genomax64k

Yes will take a look at the memory issue, did not run out of disk space however. This is indeed from a high output NextSeq run. Just one PE run.

ADD REPLYlink modified 20 months ago • written 20 months ago by rbronste230

Can you try a run with an explicit -XmxNNg (use the max NN you can for your node) declaration in the clumpify command line?

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax64k

What file extensions did you use for out=? BBMap is smart in the sense that if you provide .fq as file name it will keep the output uncompressed. If you do .fq.gz it will gzip the output.

Were the sequences going in in-sync (i.e. original untrimmed files from NextSeq). If the reads were not in sync then you should first them before using clumpify. It may be better to do the adapter trimming before you clumpify them. Use from BBMap suite to do that though @Brian may have a different opinion.

ADD REPLYlink modified 20 months ago • written 20 months ago by genomax64k

I'm running on a SGE compute cluster running Ubuntu 16.04.3. My jobs are dying due to insufficient memory, but I'm using group=auto, which I though prevented insufficient memory issues. An example output from a job is shown below:

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (Zulu (build 1.8.0_121-b15)
OpenJDK 64-Bit Server VM (Zulu (build 25.121-b15, mixed mode)
java -ea -Xmx52g -Xms52g -cp /ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgqc/.snakemake/conda/90596024/opt/bbmap-37.66/current/ clump.Clumpify in=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1.fq.gz in2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2.fq.gz out=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1_dedup.fq.gz out2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2_dedup.fq.gz overwrite=t usetmpdir=t tmpdir=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202 dedupe=t dupedist=2500 optical=t -Xmx52g
Executing clump.Clumpify [in=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1.fq.gz, in2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2.fq.gz, out=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1_dedup.fq.gz, out2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2_dedup.fq.gz, overwrite=t, usetmpdir=t, tmpdir=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202, dedupe=t, dupedist=2500, optical=t, -Xmx52g]

Clumpify version 37.66
Read Estimate:          279821357
Memory Estimate:        181755 MB
Memory Available:       41805 MB
Set groups to 43
Executing clump.KmerSplit [in1=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1.fq.gz, in2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2.fq.gz, out=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1_dedup_clumpify_p1_temp%_3ed633414cb23eec.fq.gz, out2=null, groups=43, ecco=false, addname=f, shortname=f, unpair=false, repair=f, namesort=f, ow=true, dedupe=t, -Xmx52g]

Reset INTERLEAVED to false because paired input files were specified.
Set INTERLEAVED to false
Input is being processed as paired
Writing interleaved.
Made a comparator with k=31, seed=1, border=1, hashes=4
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 24 bytes for AllocateHeap
# An error report file with more information is saved as:
# /ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgqc/hs_err_pid1140.log
ADD REPLYlink written 15 months ago by nyoungb20

I develop and test BBTools under Oracle JDK. For most programs, there's no significant difference between Oracle and OpenJDK. However, in some programs that use nearly all available memory, the conditions for memory crashes are different, as the two JVMs manage memory differently. For example, I recently discovered that in many cases BBNorm will run out of memory and crash in OpenJDK when it would work fine in OracleJDK.

In this case, the error is not caused by running out of heap memory (which is what the programmer controls), but java itself encountering a problem. This may be caused by other concurrent processes on the same node using more of the free memory, or something similar. But at any rate, the problem was that there was not enough free memory outside of java for java itself to function, rather than the program within java's JVM running out of memory. So, ironically, the best way to address it is to reduce the -Xmx flag down to, perhaps, 50g.

I recommend that you try one or all of these:

1) Launch with a lower request for heap memory, for example -Xmx50g
2) Make sure no other processes are running on the node
3) Use Oracle's JDK, which I have found to be more memory efficient, at least with the default garbage collector
ADD REPLYlink modified 15 months ago • written 15 months ago by Brian Bushnell16k

Thanks for the troubleshooting advice! Increasing the memory allocated for each qsub job seems to have fixed the memory errors. It appears that needs >10% more memory than what is set via -Xmx.

In regards to using Oracle's JDK instead of OpenJDK, I've installed bbtools via conda (bioconda channel), and the build recipe is set for OpenJDK (see Can I just install Oracle's JDK via conda or would the process be more elaborate to switch from OpenJDK to Oracle's JDK?

I'd prefer to keep using conda for managing the bbtools install, given that I'm using as part of a snakemake pipeline, where each snakemake rule utilizes it's own conda environment.

ADD REPLYlink written 15 months ago by nyoungb20

How much memory are you allocating to in your SGE script? It looks like clumpify wants to use 181G where you as you have about 42G available. Can you set group=25 and see if that works? Does the tmp location have enough space? I also suggest that you use out1= since you are using an out2=.

ADD REPLYlink written 15 months ago by genomax64k

This is BRILLIANT! I just tried removing duplicates using few small RADseq and deep full genome fastq files and inspected my results carefully. Simple, transparent, easy & user friendly, memory efficient & pretty fast, and makes a lot of sense doing this to fastq files! I won't criticize the similar functions of Picard or Samtools on bam files; they must be good, too, if only they wanted to work on my giant bam files; memory errors kept coming even on a pretty good university cluster at various settings, and I turned to BBTools once again (loved BBDuk before because of using kmers). Thank you.

ADD REPLYlink written 11 months ago by FatihSarigol120
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1419 users visited in the last hour