I have some Illumina MiSeq paired-end reads which show relatively high levels of sequence duplication when QC-checked with FastQC (around 30-50% of reads would remain after deduplication). The sequence duplication level is always below 10, with the plateau of the "% total sequences" line sitting between 1 and 5. Out of curiosity, I tried Clumpify to deduplicate the data, but it removed very few duplicates and the FastQC output remains mostly unchanged, even with strict parameters and multiple passes.
Am I doing something wrong, or is my understanding of what Clumpify does wrong? I have included an example command and its output below.
clumpify.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=R1_clumped.fq.gz out2=R2_clumped.fq.gz dedupe k=19 passes=6 subs=0
Time:               2.289 seconds.
Reads Processed:    257k      112.58k reads/sec
Bases Processed:    38789k    16.95m bases/sec
Reads In:           257650
Clumps Formed:      9224
Duplicates Found:   280
Reads Out:          257650
Bases Out:          38789002
Total time:         6.048 seconds.
Thanks in anticipation
Hi GenoMax, thanks for your response. Here is a screenshot of my FastQC plot.
I had set subs=0 because I assumed the FastQC plot was showing duplicates that were exact matches - is this not the case? I have looked at some excellent posts on here explaining the sequence duplication plot and thought I understood it, but perhaps I don't.
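For what it's worth, here is a minimal sketch (plain Python, not Clumpify's actual algorithm) of why subs=0 is stricter than it may look: a single base-call error is enough to make two copies of the same fragment non-identical, so an exact-match deduplicator keeps both, while FastQC (which, if I recall correctly, truncates long reads to their first 50 bp for its duplication module) may still count them as duplicates.

```python
# Sketch only (not Clumpify itself): exact-match deduplication (subs=0)
# misses duplicates that differ by a sequencing error.
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def is_duplicate(read1: str, read2: str, subs: int) -> bool:
    """Treat reads as duplicates if they differ at <= subs positions."""
    return len(read1) == len(read2) and hamming(read1, read2) <= subs

a = "ACGTACGTACGT"
b = "ACGTACGTACGA"   # same fragment, one base-call error at the end

print(is_duplicate(a, b, subs=0))  # False: exact matching misses it
print(is_duplicate(a, b, subs=2))  # True: allowing mismatches catches it
```

Clumpify's own default for dedupe allows a couple of substitutions for exactly this reason, so forcing subs=0 makes it find fewer duplicates, not more.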
I have previously read the article you link to, but I think I need to go back and look again, so thank you. I don't want to spend a lot of time worrying about one metric on my FastQC plot if it is unlikely to affect my downstream analysis.
Edited to add: as I was playing with the data and not intending to use it, I had not removed the adapters - could this have had an impact?
Sequence duplicates can arise in a couple of different ways: by PCR amplification (which can only be identified for sure if you have UMIs) or optically (mainly on patterned flowcells, depending on loading). Since you have MiSeq data there should be no optical duplicates, and without UMIs one can't be 100% sure about PCR ones (it does not look like you have many duplicates based on the stats above).
I would suggest not worrying about these, since they should not have an impact on downstream analysis (I don't know what kind of data this is, so this advice is general).
Hi GenoMax, thanks again for your quick response. I don't have UMIs, so no, I wouldn't be able to tease out the PCR duplicates, but the library was produced by targeted enrichment with minimal PCR (only 14 cycles at the very last stage). My overall feeling was that I shouldn't be too concerned, but the 48.81% of total sequences left after deduplication sounded high compared to a few practice files I had been working with, and I am very new to working with this kind of data.
It is enriched viral DNA (expected genome size approx. 36 kb), which I will assemble by alignment to a best-match reference genome, if that changes your opinion.
I had missed the 48.81% figure at the top. It makes sense, since you are looking at a small viral genome. Oversampling small genomes will absolutely lead FastQC to think that a lot of the data can be de-duplicated. Since you have been careful with PCR amplification, it should be fine to go forward with downstream analysis.
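The oversampling point can be made concrete with a back-of-the-envelope calculation. Assuming (my simplification, not a model of the real library) that fragment starts are uniform over the ~36,000 possible positions of a 36 kb genome, far more reads than positions guarantees shared start sites even with zero PCR duplication:

```python
# Back-of-the-envelope: when there are far more reads than possible start
# positions, reads must share start sites even with no PCR duplication.
# Assumption (illustrative only): fragment starts are uniform over
# ~36,000 positions (36 kb genome, one strand).
genome_positions = 36_000       # possible fragment start positions
reads = 257_650                 # "Reads In" from the Clumpify log above

# Classic occupancy formula: expected number of distinct positions hit.
expected_distinct = genome_positions * (1 - (1 - 1 / genome_positions) ** reads)

# Upper bound on the fraction of reads a start-position dedup could keep.
kept_fraction = expected_distinct / reads
print(f"{kept_fraction:.1%}")   # roughly 14% -- oversampling alone forces apparent duplication
```

Under this toy model nearly every position is hit many times over, so a high "duplicate" fraction in FastQC is exactly what deep sampling of a tiny genome should produce.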
If you have oversampled the data (is this one genome or several?), you may want to normalize it before trying to assemble.
bbnorm.sh from the BBMap suite (GUIDE) would be useful for this. tadpole.sh (which is a k-mer based assembler) is good for small viral genomes (GUIDE).
Hi GenoMax - sorry for the delay in responding. It will be several thousand copies of the same genome, as it has come from clinical samples (but they should be identical). I had tried normalising after alignment and after variant calling, but not before, so I will try this - thanks for the tip.
I have tried to use Tadpole several times for de novo assembly after reading that it is good for small viral genomes, but I haven't yet managed to get fewer than several hundred short contigs, so I'm not sure what I am doing wrong. I have had the best success with Velvet using VelvetOptimiser on these samples, but it may be that I am not setting the parameters correctly for my data on other assemblers.
Or you simply have too much data. Assemblers actually have a difficult time when genomes are oversampled. You should try taking between 50x and 100x of data from the normalized pool (based on genome size) and then try tadpole.sh again.
Hi GenoMax, I just ran some files quickly through BBNorm and it did improve the number of contigs generated. I'll continue to have a play with it and see how I get on. Thanks for your help.
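As a footnote on the 50x-100x suggestion above, here is a quick sketch of the arithmetic. The 36 kb genome size is from the thread; the 2 x 150 bp read length is my assumption, based on the ~150 bp average implied by the Clumpify log:

```python
# How many read pairs give a target coverage for a small genome.
# Genome size is from the thread; 2 x 150 bp is an assumed MiSeq run length.
genome_size = 36_000        # bp
read_length = 150           # bp per read
target_coverage = 100       # upper end of the 50x-100x suggestion

bases_needed = genome_size * target_coverage           # total bp of data wanted
read_pairs_needed = bases_needed // (2 * read_length)  # 300 bp per pair
print(read_pairs_needed)    # 12000 pairs, i.e. ~24k reads of the 257k sequenced
```

So even at 100x, only a few percent of the sequenced reads are needed, which is why normalizing or subsampling before assembly can help so much here.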