Question: Sequence Duplication Levels and Clumpify
6 weeks ago, RBright21 wrote:

I have some illumina MiSeq paired end reads which show relatively high levels of sequence duplication when QC checked using FASTQC (around 30-50% of reads remaining if deduplicated). The sequence duplication level is always below 10 with the plateau of the %total sequences line being between 1 and 5. I tried Clumpify to deduplicate my data (out of curiosity) but Clumpify removed very few duplicates and the FASTQC output remains mostly unchanged even with strict parameters set and multiple passes.

Am I doing something wrong, or is my understanding of what Clumpify will do wrong? I have included my command and its output below:

clumpify.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=R1_clumped.fq.gz out2=R2_clumped.fq.gz dedupe k=19 passes=6 subs=0

Time: 2.289 seconds.
Reads Processed: 257k (112.58k reads/sec)
Bases Processed: 38789k (16.95m bases/sec)

Reads In: 257650
Clumps Formed: 9224
Duplicates Found: 280
Reads Out: 257650
Bases Out: 38789002
Total time: 6.048 seconds.

Thanks in anticipation

written 6 weeks ago by RBright21
6 weeks ago, GenoMax (United States) wrote:

You are looking for perfectly matching reads with subs=0, so this looks like a reasonable result. You have not provided a screenshot of what your FastQC plot looks like, but if you have not done so, take a look at the blog post from the authors of FastQC about this topic here.
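For reference, a sketch of a more permissive Clumpify run that allows a couple of substitutions, so duplicates carrying sequencing errors are also caught (the subs and passes values here are illustrative, not a recommendation; file names are assumed):

```shell
# Allow up to 2 mismatches per read when calling duplicates, so that
# PCR duplicates containing sequencing errors are removed as well.
clumpify.sh in1=R1.fastq.gz in2=R2.fastq.gz \
    out1=R1_dedup.fq.gz out2=R2_dedup.fq.gz \
    dedupe subs=2 passes=4
```

With subs=0 only byte-identical reads collapse, which is why so few duplicates were found above.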

modified 6 weeks ago • written 6 weeks ago by GenoMax

Hi GenoMax, thanks for your response; here is a screenshot of my FastQC plot.

[screenshot: FastQC Sequence Duplication Levels plot]
I had set subs=0 as I assumed the FastQC plot was showing duplicates that were exact matches - is this not the case? I have looked at some excellent posts on here explaining the sequence duplication plot and thought I understood it, but perhaps I don't.

I have previously read the article you link to, but I think I need to go back and look again, so thank you. I don't want to spend a lot of time worrying about one metric on my FastQC report if it is unlikely to have an impact on my downstream analysis.

Edited to add: as I was playing with the data and not intending to use it, I had not removed the adapters - could this have had an impact?

modified 6 weeks ago • written 6 weeks ago by RBright21

Sequence duplicates can arise in a couple of different ways: by PCR amplification (which can only be identified for sure if you have UMIs) or optically (mainly with patterned flowcells, depending on loading). Since you have MiSeq data there should be no optical duplicates, and without UMIs one can't be 100% sure about PCR ones (it does not look like you have many duplicates based on the stats above).

I would suggest not worrying about these, since they should not have an impact on downstream analysis (I don't know what kind of data this is, so this advice is general).
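For completeness, if the data had come from a patterned flowcell (where optical duplicates do matter), Clumpify can restrict deduplication to optical duplicates only - a sketch, not needed for MiSeq data like this, and the dist value is illustrative and platform-dependent:

```shell
# Remove only optical duplicates: reads whose flowcell coordinates
# are within `dist` pixels of each other. Tune dist per platform.
clumpify.sh in1=R1.fastq.gz in2=R2.fastq.gz \
    out1=R1_optdedup.fq.gz out2=R2_optdedup.fq.gz \
    dedupe optical dist=40
```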

modified 6 weeks ago • written 6 weeks ago by GenoMax

Hi GenoMax, thanks again for your quick response. I don't have UMIs, so no, I wouldn't be able to tease out the PCR duplicates, but the library was produced by targeted enrichment with minimal PCR (only 14 cycles, at the very last stage). My overall feeling was that I shouldn't be too concerned, but the 48.81% of total sequences left after deduplication sounded high compared with a few practice files I had been working with, and I am very new to working with this kind of data.

It is enriched viral DNA (expected genome size approx. 36 kb), which I will assemble by alignment to a best-match reference genome, if that changes your opinion.

written 6 weeks ago by RBright21

I had missed the 48.81% part at the top. It makes sense since you are looking at small viral genomes. Oversampling small genomes will absolutely lead to FastQC thinking that a lot of the data can be de-duplicated. Since you have been careful with PCR amplification it should be fine to go forward with downstream analysis.

If you have over-sampled the data (is this one genome or several?) you may want to normalize it before trying to assemble. BBNorm from the BBMap suite (GUIDE) would be useful for this. Tadpole, which is a k-mer based assembler, is good for small viral genomes (GUIDE).
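A minimal sketch of a BBNorm normalization run; the target depth of 100x and minimum depth of 5 are illustrative values (and the file names are assumptions), not recommendations:

```shell
# Normalize k-mer depth to ~100x, discarding reads whose apparent
# depth is below 5 (likely sequencing errors or contamination).
bbnorm.sh in1=R1.fastq.gz in2=R2.fastq.gz \
    out1=R1_norm.fq.gz out2=R2_norm.fq.gz \
    target=100 min=5
```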

modified 6 weeks ago • written 6 weeks ago by GenoMax

Hi GenoMax - sorry for the delay in responding. It will be several thousand copies of the same genome as it has come from clinical samples (but they should be identical). I had tried normalising after alignment and after variant calling but not before so will try this - thanks for the tip.

I have tried to use Tadpole several times for de novo assembly after reading it was good for small viral genomes, but I haven't yet been able to get fewer than several hundred short contigs, so I'm not sure what I am doing wrong. I have had the best success with Velvet using VelvetOptimiser for these samples, but it may be that I am not setting the parameters correctly for my data on other assemblers.
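For context, the kind of Tadpole invocation being described - a minimal sketch with assumed file names, where the k-mer size is an assumption to experiment with rather than a recommendation:

```shell
# De novo assembly with Tadpole (contig mode is the default);
# k is the assembly k-mer size and usually needs tuning -
# larger k tends to suit high-coverage, low-error data.
tadpole.sh in1=R1_norm.fq.gz in2=R2_norm.fq.gz \
    out=contigs.fa k=93
```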

modified 5 weeks ago • written 5 weeks ago by RBright21

Or you simply have too much data. Assemblers actually have a difficult time when genomes are oversampled. You should try taking between 50x-100x of data from the normalized pool (based on genome size) and then try again.
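The 50x-100x target can be turned into a base count from the genome size; a sketch using reformat.sh from the same BBMap suite, assuming the ~36 kb genome mentioned above (file names are illustrative):

```shell
# ~36,000 bp genome * 100x coverage = number of bases to keep.
GENOME_SIZE=36000
COVERAGE=100
TARGET=$((GENOME_SIZE * COVERAGE))   # 3,600,000 bases

# Subsample the (already normalized) reads down to ~100x.
reformat.sh in1=R1_norm.fq.gz in2=R2_norm.fq.gz \
    out1=R1_100x.fq.gz out2=R2_100x.fq.gz \
    samplebasestarget=${TARGET}
```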

written 5 weeks ago by GenoMax

Hi GenoMax, I just ran some files quickly through BBNorm and it did improve the number of contigs generated. I'll continue to have a play with it and see how I get on. Thanks for your help.

written 4 weeks ago by RBright21

