Tool:Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
1
44
Entering edit mode
5.8 years ago

I'd like to introduce a new member of the BBMap package, Clumpify. This is a bit different from other tools in that it does not actually change your data at all, simply reorders it to maximize gzip compression. Therefore, the output files are still fully-compatible gzipped fastq files, and Clumpify has no effect on downstream analysis aside from making it faster. It’s quite simple to use:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder


This command assumes paired, interleaved reads or single-ended reads. But Clumpify also now works with paired reads in twin files like this:

clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=clumped1.fq.gz out2=clumped2.fq.gz reorder


How does this work? Clumpify operates on a similar principle to that which makes sorted bam files smaller than unsorted bam files – the reads are reordered so that reads with similar sequence are nearby, which makes gzip compression more efficient. But unlike sorted bam, during this process, pairs are kept together so that an interleaved file will remain interleaved with pairing intact. Also unlike a sorted bam, it does not require mapping or a reference, and except in very unusual cases, can be done with an arbitrarily small amount of memory. So, it’s very fast and memory-efficient compared to mapping, and can be done with no knowledge of what organism(s) the reads came from.

Internally, Clumpify forms clumps of reads sharing special ‘pivot’ kmers, implying that those reads overlap. These clumps are then further sorted by position of the kmer in the read so that within a clump the reads are position-sorted. The net result is a list of sorted clumps of reads, yielding compression within a percent or so of sorted bam.

How long does Clumpify take? It's very fast. If all data can fit in memory, Clumpify needs the amount of time it takes to read and write the file once. If the data cannot fit in memory, it takes around twice that long. That's on our cluster, though, with a high-performance parallel filesystem. On some filesystems, or with a single spinning disk, it may take several times longer when data does not fit into memory, because multiple files are constantly being read and written at once.

Why does this increase speed? There are a lot of processes that are I/O limited. For example, on a multicore processor, using BBDuk, BBMerge, Reformat, etc. on a gzipped fastq will generally be rate-limited by gzip decompression (even if you use pigz, which is much faster at decompression than gzip). Gzip decompression seems to be rate-limited by the number of input bytes per second rather than output, meaning that a raw file of a given size will decompress X% faster if it is compressed Y% smaller; here X and Y are proportional, though not quite 1-to-1. In my tests, assembly with Spades and Megahit have time reductions from using Clumpified input that more than pays for the time needed to run Clumpify, largely because both are multi-kmer assemblers which read the input file multiple times (and according to Megahit's author, due to cache locality). Something purely CPU-limited like mapping would normally not benefit much in terms of speed (though still a bit due to improved cache locality).

When and how should Clumpify be used? If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming) so that the clump order is maintained; thus, all intermediate files will benefit from the increased compression and increased speed. I recommend running Clumpify on ALL data that will ever go into long-term storage, or whenever there is a long pipeline with multiple steps and intermediate gzipped files. Also, even when data will not go into long-term storage, if a shared filesystem is being used or files need to be sent over the internet, running Clumpify as early as possible will conserve bandwidth. The only times I would not clumpify data are enumerated below.

When should Clumpify not be used? There are a few cases where it probably won’t help:

1) For reads with a very low kmer depth, due to either very low coverage (like 1x WGS) or super-high-error-rate (like raw PacBio data). It won’t hurt anything but won’t accomplish anything either.

2) For large volumes of amplicon data. This may work, and may not work; but if all of your reads are expected to share the same kmers, they may all form one giant clump and again nothing will be accomplished. Again, it won’t hurt anything, and if pivots are randomly selected from variable regions, it might increase compression.

3) When your process is dependent on the order of reads. If you always grab the first million reads from a file assuming they are a good representation of the rest of the file, Clumpify will cause your assumption to be invalid – just like grabbing the first million reads from a sorted bam file would not be representative. Fortunately, this is never a good practice so if you are currently doing that, now would be a good opportunity to change your pipeline anyway. Randomly subsampling is a much better approach.

4) If you are only going to read data fewer than ~3 times, it will never go into long-term storage, and it's being used on local disk so bandwidth is not an issue, there's no point in using Clumpify (or gzip, for that matter).

Please let me know if you have any questions, and please make sure you are using the latest version of BBTools when trying new functionality.

Edit: Although I was unaware when I wrote Clumpify, I have learned that there are prior examples of programs designed to put reads into buckets containing identical substrings - for example, SCALCE and MINCE. The approaches are quite different but the objective and results are similar. So, I encourage people seeking maximal compression (or planning to build something even better) to investigate those tools as well!

Edit: I've now released version 37.24 which has some nice optical deduplication improvements. It's now faster (many times faster in certain degenerate cases), and there are improvements in precision for NextSeq tile-edge duplicates. Specifically, it is now recommended that they be removed like this:

clumpify.sh in=nextseq.fq.gz out=clumped.fq.gz dedupe optical spany adjacent

This will remove all normal optical duplicates, and all tile-edge duplicates, but it will only consider reads to be tile-edge duplicates if they are in adjacent tiles and share their Y-coordinate (within dupedist), rather than before, in which they could be in any tiles and could share their X-coordinate. This means that there are fewer false-positives (PCR or coincidental duplicates that were being classified as optical/tile-edge duplicates). This is possible because on NextSeq the tile-edge duplicates are only present on the tile X-edges and the duplicates are only between adjacent tiles.

Normal optical duplicate removal (HiSeq, MiSeq, etc) should use this command:

clumpify.sh in=nextseq.fq.gz out=clumped.fq.gz dedupe optical

compression bbmap clumpify pigz storage Tool • 23k views
2
Entering edit mode

Great work. Talking about fastq compression, it is worth mentioning this paper, and fqzcomp and dsrc in particular.

0
Entering edit mode

Also worth noting:

Alapy

uQ

BBTools supports both fqzcomp and alapy (and pigz and pbzip2) if they are on the command line, meaning that you can do something like bbduk.sh in=reads.fq.ac ref=phix.fa.bz2 out=clean.fqz and they will all be appropriately compressed/decompressed. However, in my tests, while Clumpify greatly improves .gz and .bz2 compression, it makes .fqz (fqzcomp) and .ac (alapy) compression slightly worse (presumably because consecutive headers become less similar after reordering). I think I had some kind of problem testing dsrc which is why it's not currently supported, but I should re-evaluate that.

1
Entering edit mode

This is a good job, especially useful for deep sequencing data

0
Entering edit mode

I am using clumpify on HiSeq4000 data. As recommended, I increased dupedist to 2500 because of patterned flow cells. However, I get exclusively optical duplicates using this option:

dupedist=2500 dedupe=t optical=t


Gives the same output as:

dupedist=2500 dedupe=t


Any ideas? All optical duplicates sound not right.

1
Entering edit mode

See if this explanation helps.

dedupe=f optical=f (default)
Nothing happens with regards to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t

Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.
The allduplicates flag makes all copies of duplicates removed, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.

1
Entering edit mode

If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically. So please use:

dupedist=2500 dedupe=t optical=t


or

dedupe=t


Perhaps I should clarify that...

0
Entering edit mode

I just tested clumpify with NA12878 fastq files.

The size of original read1 file (SRR952827_1.fastq) is 211G, but the size of clumpified data (clump.SRR952827_1.fastq) is 182G. Why it's smaller?

I didn't enable any deduplication. The command is:

clumpify.sh in=SRR952827_1.fastq  out=clump.SRR952827_1.fastq

0
Entering edit mode

It's smaller because the sorted reads compress more efficiently than unsorted reads, since reads sharing sequence are adjacent in the file. The number of reads is the same. If you decompress the files they will be the same size.

Edit - for some reason I overlooked the fact that your files are not gzipped. The output size should be identical for uncompressed files unless the program was terminated before it was done writing. Can you post the exact command and the screen output, as well as the BBTools version? Also, can you run "wc" on both files?

0
Entering edit mode

@Brian: Based on the names the files appear to be un-compressed though.

1
Entering edit mode

Oh, hmm, that's odd. I guess I did not read the question carefully enough. Let me edit that post...

0
Entering edit mode

chen : How much RAM did you assign for this job?

0
Entering edit mode

My system has 1T RAM, and I noticed 738G memory was available and clumpify automatically set the JVM memory limit to be 738G.

0
Entering edit mode

So it's based on the similarity of sequences? If two reads (50 bp) overlap with 25 bp, will they be cluster together? Usually, we use Hiseq2500 to do RNAseq, CLIPseq, 3'reads-seq, etc. Does the type of libraries affect the deduplication result? Thanks very much.

0
Entering edit mode

I like the tool. I'm using clumpify.sh followed by reformat.sh. Is there a way to sort by the number of copies in the output?

0
Entering edit mode

No. You will need to devise a custom script yourself.

0
Entering edit mode

Hi, I used the command -

clumpify.sh in=file1.fastq out=file1_dedup.fastq dedupe=t

However, I still see duplicates in my output file. Am I missing something?

0
Entering edit mode

What kind of duplicates and what makes you certain you still see them?

0
Entering edit mode

These should be PCR duplicates. After aligning them to my reference, I see the same read mapping to the same chromosome , having the same start site multiple times.

0
Entering edit mode

Based on your command line this seems to be single-end sequencing data? Even if the start is identical the length of the reads that are mapping may be different. If you have single end data then each of those reads would be considered unique.

Can you post the stats from your clumpify.sh run? You should have the number of clumps formed and the number of duplicates found clearly noted there.

0
Entering edit mode

I have paired end sequencing data from MiSeq. For an analysis, I am required to merge the reads. So, I merged the reads using bbmerge.sh and then use clumpify. I have the same start and the length of the reads is also the same.

Reads In:            2925342
Clumps Formed:          5527
Duplicates Found:    2849931


I did remove duplicates, but I can still find duplicates in my fastq file.

Is it because I am merging the two fastq files that I see something like this?

0
Entering edit mode

Can you please run clumpify.sh on original data files that have not been trimmed or otherwise manipulated? Merge the reads after you remove duplicates.

Note: Have you checked the section on "When clumpify should not be used" in the original post to make sure nothing applies in your case. I assume these are amplicon data?

0
Entering edit mode

I tried it that way too. I used clumpify on original data but still found a lot of duplicates.

This is RNA-Seq data from a virus. The length of the genome is ~3000 bps. I need the duplicates remove because I am look at heterogeneity at a particular location and I need deduplicated reads for it.

1
Entering edit mode

This may be another case where having overly deep coverage may be causing an issue. Why not just map the reads as is (you should not be deduplicating RNAseq data in normal cases) and look for the region you are interested in.

Edit: You are using default subs=2 option since you did not change it. It you want to allow only perfect matches make it more stringent e.g. subs=0 or relax it if you want to allow for more differences in reads considered duplicates.

0
Entering edit mode

Thank you for developing such a great product. Do you have recommendations for parameters to run dedupe with for sequences from an Illumina iSeq? Such as what number to run dist with or if any other parameters should be run with the optical parameter similar to NextSeq? Thank you for your time and help.

0
Entering edit mode

NextSeq flow cells are particularly large in size so it would likely not be appropriate to use those settings with iSeq. You may want to experiment with HiSeq settings to see if you have a problem with duplicates first.

0
Entering edit mode

Okay thank you very much. The HiSeq settings seem to work just fine.

0
Entering edit mode

Very useful tool!

After I removed duplication reads from raw PE reads using the following command:

nohup clumpify.sh in=1.oriData_DNA/T1D10-pa_R1.fq.gz in2=1.oriData_DNA/T1UD10-pa_R2.fq.gz out=2.cleanData_DNA/T1D10-pa_R1_rmDup.fq.gz out2=2.cleanData_DNA/T1D10-pa_R2_rmDup.fq.gz dedupe=t subs=0 &

Sample Name % Dups % GC
T1D10-pa_R1 13.2% 47%

Clumps Formed: 48735198
Duplicates Found: 21921008
Bases Out: 23636280600
Total time: 899.516 seconds.


I rechecked the duplication level of the dedupe PE reads using FASTQC. The result showed that there are still many duplication reads in the remaining reads.

Sample Name % Dups % GC
T1D10-pa_R1_rmDup 2.6% 46%


Any suggestions? Thanks!

0
Entering edit mode

A suggestion on your command usage:

screen <command>


is a LOT better than

nohup <command> &


GNU Screen/tmux are better than nohup because of the amount of control they offer.

0
Entering edit mode

Take a look at this blog post from authors of FastQC on subject of duplication. You are only allowing perfect matches so unless the sequence is identical for R1 read they are not going to be removed.

12
Entering edit mode
5.7 years ago

Clumpify can now do duplicate removal with the "dedupe" flag. Paired reads are only considered duplicates if both reads match. By default, all copies of a duplicate are removed except one - the highest-quality copy is retained. By default subs=2, so 2 substitutions (mismatches) are allowed between "duplicates", to compensate for sequencing error, but this can be overriden. I recommend allowing substitutions during duplicate removal; otherwise, it will enrich the dataset with reads containing errors.

Example commands:

Clumpify only; don't remove duplicates:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz


Remove exact duplicates:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=0


Mark exact duplicates, but don't remove them (they get " duplicate" appended to the name):

clumpify.sh in=reads.fq.gz out=clumped.fq.gz markduplicates subs=0


Remove duplicates, allowing up to 5 substitutions between copies:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe subs=5


Remove ALL copies of reads with duplicates rather than retaining the best copy:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe allduplicates


Remove optical duplicates only (duplicates within 40 pixels of each other):

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40 spantiles=f


Note that the optimal setting for dist is platform-specific; 40 is fine for NextSeq and HiSeq2500/1T.

Remove optical duplicates and tile-edge duplicates:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe optical dist=40


Clumpify only detects duplicates within the same clump. Therefore, it will always detect 100% of identical duplicates, but is not guaranteed to find all duplicates with mismatches. This is similar to deduplication by mapping - with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected. However, Clumpify is more sensitive to errors than mapping-based duplicate detection. To increase sensitivity, you can reduce the kmer length from the default of 31 to a smaller number like 19 with the flag "k=19", and increase the number of passes from the default of 1 to, say, 3:

clumpify.sh in=reads.fq.gz out=clumped.fq.gz dedupe k=19 passes=3 subs=5


Each pass will have a chance of identifying more duplicates, because a different kmer will be selected for seeding clumps; thus, eventually, any pair of duplicates will land in the same clump given enough passes if they share a single kmer, regardless of how many errors they have. But in practice the majority are identified in the first pass and you don't really get much more after about the 3rd pass. Decreasing K and (especially) using additional passes will take longer, and there is no point in doing them if you are running with subs=0 (identical duplicates only) because in that case all duplicates are guaranteed to be found in the first pass. If all the data fits in memory, additional passes are extremely fast; the speed penalty is only noticeable when the data does not all fit in memory. Even so, passes=3 will generally still be much faster than using mapping to remove duplicates.

0
Entering edit mode

How do you think this will affect reads coming from two almost identical regions but for a few bases? For example SMN1/SMN2 have regions that are only differentiated by 1 base. Would clumpify.sh with the recommended allowed subs parameter remove non-duplicated reads for these two regions?

--Carlos

0
Entering edit mode

There is no recommended subs parameter. You decide what is acceptable. If you want only 1 difference then you could set subs=1.

0
Entering edit mode

Hi Carlos,

As Genomax indicated, I don't have a recommended setting, but the default is 2. That said, this graph might help you:

As you can see, for HiSeq data, you get the vast majority of the duplicates allowing 1 mismatch. NextSeq is a different story, however. I would not worry very much about two genomic regions being similar unless you have very high coverage, in which case I'd suggest optical duplicate removal only. If you have 100x coverage and 100bp reads, there is a 37% chance of a given read having a duplicate arising purely by chance. But for paired reads, with an insert size range 200-400bp (let's say linearly distributed for simplicity), the insert size also has to match so you get a 0.18% rate of duplicates occurring by chance. This is pretty low. If your genome has a duplicated region, then the coincidental duplicate rate in that region will be roughly doubled to 0.37% instead. For most purposes, losing 0.37% of your data simply won't matter. But if it does matter, use the "optical" flag, which will virtually eliminate coincidental duplicates.

0
Entering edit mode

Is it possible to get a consensus sequence for each set of duplicated reads along with a count for how many reads make up the consensus?

0
Entering edit mode

Currently, you can get a consensus per clump, and rename the clump indicating how many reads it constitutes. Duplicates are handled by retaining the highest-quality member of the set of duplicates. I could modify it to produce a consensus just for duplicate reads and modify the header to indicate how many it represents, though... doesn't look like it would be overly difficult.

0
Entering edit mode

Currently, you can get a consensus per clump, and rename the clump indicating how many reads it constitutes.

I have not gone through the in-line help for clumpify recently but is that something new you recently added? What are the command options to do that?

0
Entering edit mode

Consensus was one of the first functionalities I added, when I was trying to make Clumpify into an assembler... you can enable it with the "consensus" flag. Not sure how it plays with paired reads, so if you run in consensus mode, I suggest adding the flag "unpair".

0
Entering edit mode

Does dupesubs= affect consensus? Does it work only with dupesubs=0? Can the clumped files be "re-paired" (repair.sh) afterwards as I assume they will go out of sync by specifying unpair?

0
Entering edit mode

The way consensus is currently implemented is unrelated to the dedupe and dupesubs flags. Also, consensus (again, as it is currently) will fundamentally change the data in a way that the original pairing, or reads, can never be recovered. I may add per-duplicate consensus which will allow you to keep reads and pairing intact as per danielfortin's request, but right now, it works like this:

Reads sharing a kmer are organized into clumps. I've linked my Clumpify poster, below, which can help visualize what the clumps look like:

With the consensus flag, each clump is reduced to a single consensus sequence, which is usually longer than any original read in the clump since it spans the entire clump. Currently, I think allele-splitting is only enabled for error-correction, not consensus, but I plan to add it for consensus as well.

So, if you have 100x coverage before consensus, you will have perhaps 1.3x to 2x coverage after consensus (depending on the error rate); all of the original sequences and pairing information will be destroyed, and you can't recover the pairing because each consensus sequence will consist of many reads that have mates which were absorbed into different other clusters. The goal is to flatten the reads into something like an assembly. Indeed, there are certain operations (like Sketch, kmer-based operations in general, and alignment) that would perform much better on the consensus than the raw reads, and don't care much about continuity. I think it's probably not useful for most applications.

That said, the way error-correction via Clumpify works is:

1) Organize reads into clumps that share a kmer
2) Produce a consensus for each clump
3) Compare reads to the consensus, and split alleles (as indicated in the poster) so that reads that coincidentally share a kmer, but are not actually genomically overlapping, get put into different clumps
4) Repeat 2-3 until convergence
5) Depending on the strength of the consensus at each position, convert all bases not matching the consensus to the consensus base


Therefore, to get consensus of duplicate reads, you can currently do this:

clumpify.sh in=reads.fq out=corrected.fq ecc passes=6 unpair repair
clumpify.sh in=corrected.fq out=deduplicated_consensus.fq dedupe dupesubs=0


Clumpify in ecc mode does not obey the "dupesubs" parameter (which is only for dedupe mode), but it does allow "minid", so for 100bp reads you could add the flag "minid=0.98" to achieve the same effect as "minsubs=2".

So - you can get something like consensus sequence for duplicate reads, using the above method, but it also uses nonduplicate reads as sources of information and thus is not "safe" for purposes like low-frequency variant calling from PCR'd libraries. You cannot currently get consensus for duplicate pairs, retaining pairing. But I will investigate adding that. I don't think it will be difficult.

0
Entering edit mode

Thanks for sharing a copy of the poster (and additional details). My interest in clumpify is currently mainly for its ability to identify duplicates (though the ability to do local assembly is something to I plan to check out).

My wish list is slightly different. I am purely interested in duplicates (optical primarily but others too) and want to know how many duplicate sequences were present in a clump (based purely on sequence). If that number could be correlated (sequence - # of dups) in a separate stats file that would be great. AFAIK this is not currently possible directly.

0
Entering edit mode

That would be perfect for my application. I would really appreciate it if this could be added to the next version of clumpify

0
Entering edit mode

clumpify.sh in=file.fq out=deduped.fq dedupe addcount

Note from @Brian: this does NOT work in "markduplicates" mode, only in "dedupe" mode. Reads that absorbed duplicates will have "copies=X" appended to the end to indicate how many reads they represent (including themselves, so the minimum you would see is 2).

0
Entering edit mode

What would be optimal command for removing only optical duplicates specifically from PE reads? Thanks.

3
Entering edit mode

Depending on what sequencer your data is from change the value for dupedist. clumpify.sh in1=reads1.fq.gz in2=reads2.fq.gz out1=clumped1.fq.gz out2=clumped2.fq.gz dedupe optical dupedist=NN

dupedist=40         (dist) Max distance to consider for optical duplicates.
Higher removes more duplicates but is more likely to
remove PCR rather than optical duplicates.
This is platform-specific; recommendations:
NextSeq      40  (and spany=t)
HiSeq 1T     40
HiSeq 2500   40
HiSeq 3k/4k  2500
Novaseq      12000

0
Entering edit mode

How are the files changed following this process (example command below), tried after to run them through cutadapt (output below) to remove adapter sequences and got the following:

bash clumpify.sh in1= in2= out1= out2= dedupe optical dupedist=40 spany=t

unexpected end of file
cutadapt: error: gzip process returned non-zero exit code 1. Is the input file truncated or corrupt?


This did not occur with the original fastq directly from NextSeq.

1
Entering edit mode

Can you give the full command and screen output (and BBMap version) of Clumpify? It sounds like the output was either corrupt, or Clumpify was not finished yet, when you ran Cutadapt.

0
Entering edit mode

Yes you are correct after checking the log:

Using the latest version of clumpify as I just downloaded BBMap, memory allocation being an issue is weird since it was running on cluster node. The node job was dumped but I missed that fact, many temp files accumulated actually. Cutadapt was run separately after that. The full command is as above with output in gz format.

GNU nano 2.0.9 File: STDIN.e17695366

java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
java -ea -Xmx69077m -Xms69077m -cp /mnt/grid/tollkuhn/hpc_norepl/home/data/software/BBMap/bbmap/current/ clump.Clumpify in1=/sonas-hs$Executing clump.Clumpify [in1=/sonas-hs/tollkuhn/hpc_norepl/home/rbronste/DNase-seq_DATA/atac_adult_R3/Vehicle_020117/298777_S5_R1_001$

Clumpify version 37.36
Memory Estimate:        161714 MB
Memory Available:   54245 MB
Set groups to 29
Executing clump.KmerSplit [in1=/sonas-hs/tollkuhn/hpc_norepl/home/rbronste/DNase-seq_DATA/atac_adult_R3/Vehicle_020117/298777_S5_R1_00$Reset INTERLEAVED to false because paired input files were specified. Set INTERLEAVED to false Input is being processed as paired Writing interleaved. Made a comparator with k=31, seed=1, border=1, hashes=4 Time: 185.291 seconds. Reads Processed: 159m 859.68k reads/sec Bases Processed: 12106m 65.34m bases/sec Executing clump.KmerSort3 [in1=298777_S5_R1_001_dedup_clumpify_p1_temp%_29ab8e4910f898b2.fastq.gz, in2=null, out=/sonas-hs/tollkuhn/h$

Making comparator.
Made a comparator with k=31, seed=1, border=1, hashes=4
Making clumps.
Clump time:     1.044 seconds.
Deduping.
Dedupe time:    0.900 seconds.
Writing.
Making clumps.
Clump time:     0.975 seconds.
Deduping.
Dedupe time:    0.748 seconds.
Writing.
Making clumps.
Clump time:     1.131 seconds.
Deduping.
Dedupe time:    0.769 seconds.
Writing.
Making clumps.
Clump time:     1.038 seconds.
Deduping.
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00002af6e1f12000, 12288, 0) failed; error='Cannot allocate memor$ ADD REPLY 1 Entering edit mode @Brian will be along with an official answer but it looks like 161G memory is estimated to be needed and you have 54G available. Did you run out of space on the local disk where the temp files were being written to? Is this a high output NextSeq run (or are these more than one runs pooled?). ADD REPLY 1 Entering edit mode Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00002af6e1f12000, 12288, 0) failed; error='Cannot allocate memor$


This is a weird message.. was the process running alone on the node, or were there other processes as well?

It looks like the free memory on the node got used up by some other process and then the Java virtual machine no longer had enough memory to operate (sort of independent of the memory being used by Clumpify itself). I'd suggest either trying it on a node where nothing else is running, or as genomax suggests, manually setting the -Xmx flag. But, possibly, I'd suggest reducing it slightly, since the memory set by -Xmx is for Clumpify, and actually competes with the free memory available for the Java virtual machine.

0
Entering edit mode

I managed to get it working with the flag:

-Xmx24g


One other question if I am invoking the following options it removes pcr AND optical duplicates?

dedupe dupedist=40 spany=t -Xmx24g


Thanks.

2
Entering edit mode

If you ever set the "dupedist" flag, it automatically goes into optical-only mode. To remove all duplicates (PCR and optical), just use "dedupe" and not "dupedist" or "spany".

The modes are, in order of becoming more restrictive:

"dedupe" : Removes all duplicates (keeping one copy of each), including PCR, tile-edge, and optical.
"dedupe dist=X spany=t" : Removes optical duplicates and NextSeq-specific tile-edge duplicates.
"dedupe dist=X" : Removes traditional optical duplicates only.

0
Entering edit mode

Thank you for clarification.

If using just dedupe is there a way to specify that its a NextSeq derived fastq - so optical and tile edge duplicates are removed accordingly?

1
Entering edit mode

You don't need to tell it the data is NextSeq; just use "dedupe" only to remove PCR, tile-edge, and optical duplicates, or "dedupe dist=40 spany=t" to remove tile-edge and optical duplicates but not PCR duplicates.

0
Entering edit mode

I am mainly interested in the deduplication performance of Clumpify for NovaSeq data. from reading this thread it seems just using 'dedupe' is my best option, but for removal of PCR duplicates AND optical duplicates don't I need to also adjust the 'dupedist' value to 12000? Yet, from my understanding, by adding in that value I am now only removing optical duplicates. Am I able to better tailor the command for all duplicate removal including optical duplicates but with the recommended dist for NovaSeq?

0
Entering edit mode

but for removal of PCR duplicates AND optical duplicates don't I need to also adjust the 'dupedist' value to 12000?

Correct. dedupe is going to remove all duplicates leaving one best copy.

I am not sure what kind of data this is but if you are only interested in removing optical duplicates (which are true artifacts) then you should just do that using dupedist=12000 without dedupe.

0
Entering edit mode

New to BBMap and clumpify. I want to re-ask some of the questions above because I'm still unclear.

If we have data from NextSeq or NovaSeq, and we want to remove PCR AND optical duplicates, we just use the dedupe switch w/o the dupedist, spany, or adjacent switches. This will apparently take into account tile-edge and optical duplicates.

However, I am unclear how dedupe, by itself, understands that the data is from a NextSeq or a NovaSeq. The clumpify documentation makes a big deal about switch recommendations for both of those platforms, but the directions given here indicate that the user should ignore all of those when attempting to remove PCR and optical duplicates.

Am I missing something? Does enabling PCR duplication removal, with clumpify, negate the necessity for those other switch values?

Secondly, I want to confirm a best practices pipeline: We are to run clumpify on raw reads, before quality trimming, adapter trimming, or paired-end read merging. Following clumpify, trimming, and mapping, we shouldn't need to run any other PCR duplicate removal (rmdup, picard, etc) tools?

0
Entering edit mode

Those switches are a combination of two or three individual options that you would need to use. You can just use actual switches you need/want and ignore those aliases so you know what exactly you did.

What kind of sequencing are you working with (RNAseq, WGS etc). You would want to be careful about removing PCR duplicates (you would want to remove optical duplicates irregardless) (Pay attention to C: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remov ). You should run clumpify on raw reads before proceeding with other analysis.

0
Entering edit mode

Thanks for the quick response. We do want to remove PCR dups, as the data are from low coverage whole genomes or post-captured libraries, both from degraded, low endogenous molecule samples. We tend to have astronomical duplication rates and I was interested to see how clumpify would handle it verse the typical PCR duplicate removal software.

I did see that earlier post, and I now know where I was confused. Running dedupe alone will capture all optical duplicates, including tile-edge duplicates from NextSeq, because it will treat them all like PCR duplicates and remove based on a combination of read 1 and read 2 start positions and the given allowed substitutions subs=. When the user just wants to remove optical duplicates, that is when you need the specific switch values changed for NextSeq,NovaSeq or else you might get increased false positives/negatives. Do I have that correct?

We're testing the pipeline now, and given our large duplication count, we are finding we need to run multiple passes to get them all, or else a large amount of read pairs with exact start and end positions get through. We are currently at k=19 passes=6 subs=5.

0
Entering edit mode

When the user just wants to remove optical duplicates, that is when you need the specific switch values changed for NextSeq,NovaSeq or else you might get increased false positives/negatives.

Correct.

You appear to have a rather unusual use case. It may indeed need more passes as you have discovered. You are allowing for 5 errors (not sure how long you reads are) but you may want to reconsider that. Illumina sequencing has a less than 0.1% error rate.

0
Entering edit mode

Good point. Is there a way to slide the threshold of duplicate identification to be based on total read base quality instead of or in addition to direct substitution count comparisons, assuming the reads in question have the same start and end positions?

0
Entering edit mode

Since you are using a high number of substitutions this may be tricky but clupmpify retains the highest quality copy by default.

0
Entering edit mode

Yes will take a look at the memory issue, did not run out of disk space however. This is indeed from a high output NextSeq run. Just one PE run.

1
Entering edit mode

Can you try a run with an explicit -XmxNNg (use the max NN you can for your node) declaration in the clumpify command line?

0
Entering edit mode

What file extensions did you use for out=? BBMap is smart in the sense that if you provide .fq as file name it will keep the output uncompressed. If you do .fq.gz it will gzip the output.

Were the sequences going in in-sync (i.e. original untrimmed files from NextSeq). If the reads were not in sync then you should first repair.sh them before using clumpify. It may be better to do the adapter trimming before you clumpify them. Use bbduk.sh from BBMap suite to do that though @Brian may have a different opinion.

0
Entering edit mode

I'm running clumpify.sh on a SGE compute cluster running Ubuntu 16.04.3. My clumpify.sh jobs are dying due to insufficient memory, but I'm using group=auto, which I though prevented insufficient memory issues. An example output from a job is shown below:

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15)
OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)
java -ea -Xmx52g -Xms52g -cp /ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgqc/.snakemake/conda/90596024/opt/bbmap-37.66/current/ clump.Clumpify in=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1.fq.gz in2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2.fq.gz out=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1_dedup.fq.gz out2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2_dedup.fq.gz overwrite=t usetmpdir=t tmpdir=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202 dedupe=t dupedist=2500 optical=t -Xmx52g
Executing clump.Clumpify [in=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1.fq.gz, in2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2.fq.gz, out=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1_dedup.fq.gz, out2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2_dedup.fq.gz, overwrite=t, usetmpdir=t, tmpdir=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202, dedupe=t, dupedist=2500, optical=t, -Xmx52g]

Clumpify version 37.66
Memory Estimate:        181755 MB
Memory Available:       41805 MB
Set groups to 43
Executing clump.KmerSplit [in1=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1.fq.gz, in2=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R2.fq.gz, out=/ebio/abt3_projects/temp_data/LLMGQC_8670014500/ID15202/R1_dedup_clumpify_p1_temp%_3ed633414cb23eec.fq.gz, out2=null, groups=43, ecco=false, addname=f, shortname=f, unpair=false, repair=f, namesort=f, ow=true, dedupe=t, -Xmx52g]

Reset INTERLEAVED to false because paired input files were specified.
Set INTERLEAVED to false
Input is being processed as paired
Writing interleaved.
Made a comparator with k=31, seed=1, border=1, hashes=4
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 24 bytes for AllocateHeap
# /ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgqc/hs_err_pid1140.log

1
Entering edit mode

I develop and test BBTools under Oracle JDK. For most programs, there's no significant difference between Oracle and OpenJDK. However, in some programs that use nearly all available memory, the conditions for memory crashes are different, as the two JVMs manage memory differently. For example, I recently discovered that in many cases BBNorm will run out of memory and crash in OpenJDK when it would work fine in OracleJDK.

In this case, the error is not caused by running out of heap memory (which is what the programmer controls), but java itself encountering a problem. This may be caused by other concurrent processes on the same node using more of the free memory, or something similar. But at any rate, the problem was that there was not enough free memory outside of java for java itself to function, rather than the program within java's JVM running out of memory. So, ironically, the best way to address it is to reduce the -Xmx flag down to, perhaps, 50g.

I recommend that you try one or all of these:

1) Launch with a lower request for heap memory, for example -Xmx50g
2) Make sure no other processes are running on the node
3) Use Oracle's JDK, which I have found to be more memory efficient, at least with the default garbage collector

0
Entering edit mode

Thanks for the troubleshooting advice! Increasing the memory allocated for each clumpify.sh qsub job seems to have fixed the memory errors. It appears that clumpify.sh needs >10% more memory than what is set via -Xmx.

In regards to using Oracle's JDK instead of OpenJDK, I've installed bbtools via conda (bioconda channel), and the build recipe is set for OpenJDK (see https://github.com/bioconda/bioconda-recipes/blob/master/recipes/bbmap/meta.yaml). Can I just install Oracle's JDK via conda or would the process be more elaborate to switch from OpenJDK to Oracle's JDK?

I'd prefer to keep using conda for managing the bbtools install, given that I'm using clumpify.sh as part of a snakemake pipeline, where each snakemake rule utilizes it's own conda environment.

0
Entering edit mode

How much memory are you allocating to in your SGE script? It looks like clumpify wants to use 181G where you as you have about 42G available. Can you set group=25 and see if that works? Does the tmp location have enough space? I also suggest that you use out1= since you are using an out2=.

0
Entering edit mode

For HiSeq 2000 platform, generating SE reads, is it still dupedist=40 ?

And for RNA-Seq data, we do not want to filter PCR duplicates, only the optical duplicates, correct? If so, the syntax for that would be, for SE reads, something like:

clumpify.sh in=IN.fq.gz out=OUT_clumpd_dedupd.fq.gz dedupe optical dupedist=NN
# where NN=40 or some other number that I hope to get info from you about.


Did I get that right? Thanks!

1
Entering edit mode

Run clumpify.sh on its own to check the in-line help. That lists values @Brian recommends for various sequencers.

For RNAseq you only need to filter cluster duplicates. Those are an issue ONLY if you are using patterned flowcells (HiSeq 4000 or NovaSeq).

0
Entering edit mode

Oh OK, so HiSeq 2000 should not have optical duplicates, copy that. This means file before and after dedupe optical step should be (near) identical, correct ? This suggests there may be no value in this step for data generated on the HiSeq 2000 platform, correct? But performing it, would there be any harm?

BTW, neither help menu for clumpify.sh (pasted below), nor this post or other online posts mention dupedist value for HiSeq 2000, so I asked my question.

Plus I am not sure whether HiSeq 2500 and HiSeq2000 platforms / flow cells are similar enough that the dupedist=40run parameter value for clumpify.sh run can be identical for both HiSeq 2000 (not listed) and HiSeq 2500 (listed).

dupedist=40         (dist) Max distance to consider for optical duplicates.
Higher removes more duplicates but is more likely to
remove PCR rather than optical duplicates.
This is platform-specific; recommendations:
NextSeq      40  (and spany=t)
HiSeq 1T     40
HiSeq 2500   40
HiSeq 3k/4k  2500
Novaseq      12000

1
Entering edit mode

HiSeq 2000 and 2500 are similar. You can use that same value. That said there could be a small number of duplicates in all Illumina platforms. With patterned FC they can be a bigger issue depending on library quality and sample loading etc.

0
Entering edit mode

I'm a bit confused with the you should not be deduplicating RNAseq data in normal cases sentece. I'm working with stranded RNAseq data coming from Novaseq to perform differential expression analysis, and I actually removed optical duplicates (dedupe=t, dupedist=12000. 3-4% of reads were removed). What concerns me is that I don't usually see people removing duplicates from RNAseq. It took me a bit to find an article where, by the way, clumpify is used to remove optical duplicates in RNAseq.

0
Entering edit mode

Since you are interested in counting reads in RNAseq you can't be sure that a duplicate read you see is because of PCR duplication or is a result of multiple copies of the gene that were initially present. Only way to distinguish for sure is to use unique molecular identifiers (UMI) which are attached to RNA molecules before any manipulations/amplification is done. These add cost/effort, so are not commonly used.

As long as all of your samples were processed exactly the same way then PCR bias should be consistent across the samples and should not cause a big issue.

0
Entering edit mode

I agree about PCR duplicates arising during library prep, but then what about the optical ones? Should be removed? Sorry but I find all of this duplicates thing a bit confusing. I have read several good threads like this one itself or this among others, and I thought optical should be removed, but as I said I don't see people doing it so I'm not sure.

0
Entering edit mode

Not everyone runs their samples on patterned FC's (where the issue is likely to arise). Optical duplicates are not always a problem if one manages loading concentrations well. I assume your samples were part of a pool so if you do remove optical duplicates it should be fine as long as you don't see a skewed distribution of them across your samples.

0
Entering edit mode

Thanks GenoMax, truly helpful as always.

0
Entering edit mode

Optical duplicates can be detected by adding the "optical" flag to Clumpify if the Illumina reads still have original headers, though the exact settings depend on the platform. PCR duplicates cannot be detected and differentiated from coincidental duplicates with high confidence in RNA-seq libraries. In fact they can never really be differentiated, but it's only a problem in quantitative experiments like RNA-seq because duplicate removal changes your quantification. And particularly if you use single-ended reads, that will limit your maximum possible coverage to read length, so high-expression transcripts will be selectively flattened.

It's less of a problem with paired ends, but still, it's best to not PCR-amplify quantitative experiments. But if you do, just skip deduplication (optical duplicate removal is still OK) and hope for the best. With non-quantitative experiments deduplication does not cause problems because deduplication does not remove any information useful to the experimental results.

0
Entering edit mode

This is BRILLIANT! I just tried removing duplicates using few small RADseq and deep full genome fastq files and inspected my results carefully. Simple, transparent, easy & user friendly, memory efficient & pretty fast, and makes a lot of sense doing this to fastq files! I won't criticize the similar functions of Picard or Samtools on bam files; they must be good, too, if only they wanted to work on my giant bam files; memory errors kept coming even on a pretty good university cluster at various settings, and I turned to BBTools once again (loved BBDuk before because of using kmers). Thank you.

0
Entering edit mode

This is great, thank you. Is it possible to retain more than one duplicates? For example can we set it to keep say a percentage of duplicates or top N, like top 3, 4... etc?

thanks!

0
Entering edit mode

Not as far as I know.You could simply mark them and then use some custom code to keep copies you need.

0
Entering edit mode

Hi I'm a bit confused. I tried two of the examples above, one removing all and the other just the optical and tile duplications. However it looks like both commands give the same exact removal. Here is what I did and the outputs are indentical.

/bbmap/clumpify.sh in1=${fq1} in2=${fq2} out1=clumped_all_R1.fq.gz out2=clumped_all_R2.fq.gz dedupe subs=2 dist=2500 reorder         ###

/bbmap/clumpify.sh in1=${fq1} in2=${fq2} out1=clumped_opt_R1.fq.gz out2=clumped_opt_R2.fq.gz dedupe optical subs=2 dist=2500 reorder

Results

Write time:     35.181 seconds.
Done!
Time:                           2143.419 seconds.
Bases Processed:        29157m  13.60m bases/sec

Clumps Formed:         8911811
Duplicates Found:      6649376
Bases Out:         28486089894
Total time:     3371.326 seconds.
`

Moreover when the resulting fastq files are larger in size, may be due to compression level though, but overall I feel like something is not optimal here. When I run fastqc the duplication levles is > 50% even after clumpify.

0
Entering edit mode

Only patterned flowcells have optical duplicates. Is your data from such a flow cell? It is not necessary to have optical duplicates if the loading of libraries is optimal. You may only have PCR dups in such a case.

0
Entering edit mode

Hm ok thanks. I guess I was expecting a huge duplication issue here, especially since this set of fastq files in particular was quite large. I don't know what type of flow cells were used but that is a good point and I will ask. thanks!