I have some Illumina MiSeq paired-end reads that show relatively high levels of sequence duplication when checked with FastQC (only around 30-50% of reads would remain after deduplication). The sequence duplication level is always below 10, with the plateau of the "% total sequences" line falling between 1 and 5. Out of curiosity I tried Clumpify to deduplicate the data, but it removed very few duplicates, and the FastQC output remains mostly unchanged even with strict parameters and multiple passes.
Am I doing something wrong, or is my understanding of what Clumpify does wrong? I have included an example command and its output below:
clumpify.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=R1_clumped.fq.gz out2=R2_clumped.fq.gz dedupe k=19 passes=6 subs=0
Time: 2.289 seconds.
Reads Processed: 257k 112.58k reads/sec
Bases Processed: 38789k 16.95m bases/sec
Reads In: 257650
Clumps Formed: 9224
Duplicates Found: 280
Reads Out: 257650
Bases Out: 38789002
Total time: 6.048 seconds.
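One way to sanity-check the gap between the two tools is to count duplicates both ways directly. As far as I understand it, FastQC reports duplication per file and on truncated reads (reads over 75 bp are cut to 50 bp for that module), whereas Clumpify's dedupe on paired input only calls a pair a duplicate when both mates match. The sketch below is a rough approximation of those two views, not a reimplementation of either tool. The function names `read_sequences` and `duplicate_counts` are made up for illustration, and the code assumes plain 4-line FASTQ records with R1 and R2 in the same order.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path):
    """Yield the sequence line of every record in a FASTQ file (plain or .gz).

    Assumes strict 4-line records (no wrapped sequence lines).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of each 4-line record.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def duplicate_counts(r1_path, r2_path, prefix=50):
    """Return (r1_prefix_duplicates, exact_pair_duplicates, total_pairs).

    r1_prefix_duplicates: R1 reads whose first `prefix` bases were already
        seen, ignoring R2 (roughly the per-file, truncated-read view).
    exact_pair_duplicates: pairs whose full R1 AND full R2 were already
        seen together (roughly the pair-aware, zero-mismatch view).
    """
    single = Counter()
    pairs = Counter()
    total = 0
    for s1, s2 in zip(read_sequences(r1_path), read_sequences(r2_path)):
        single[s1[:prefix]] += 1
        pairs[(s1, s2)] += 1
        total += 1
    # Every occurrence beyond the first of a key counts as a duplicate.
    return total - len(single), total - len(pairs), total
```

Calling `duplicate_counts("R1.fastq.gz", "R2.fastq.gz")` on the files from the command above would give the two counts side by side. If the first number is much larger than the second, the difference between the FastQC plot and the Clumpify result may simply reflect the two definitions of "duplicate" rather than a mistake in the command.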
Thanks in anticipation