Question: What causes high sequence duplication levels in MiSeq?
Hi there, 

I have 30 samples (3 replicates * 10 conditions) that I ran on a MiSeq to evaluate library quality. FastQC reported that 2 of the 30 (both from the same condition) have high sequence duplication levels (obviously outliers). I went back and checked RNA quality and read abundance for those two samples and found nothing unusual. After mapping them to the reference transcriptome, their alignment percentages did not differ from the others.

So what makes those two libraries show especially high sequence duplication? Should I remake the libraries before proceeding to HiSeq?

Several factors can generate duplicated reads. Some that I have run into:

A redundant RNA (ssrA in my case, in E. coli, ~1.5% of total reads)
Too many PCR cycles (over-amplification)
Adapters (primer-dimers)
Try to find out which sequences are redundant. Are they genomic? How frequent are they? How many cycles of PCR did you do? How did you select your RNA (poly-A, ribo-depleted, sRNAs, etc.)?
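
To see which sequences are redundant and how frequent they are, you could count identical reads directly in the FASTQ file. The following is only a minimal sketch: the function name `top_sequences` and its parameters are made up for illustration, and it assumes standard 4-line FASTQ records (no wrapped sequences), plain or gzip-compressed. It counts exact full-length matches only, so the numbers will not match FastQC's own duplication estimate.

```python
import gzip
from collections import Counter
from itertools import islice


def top_sequences(fastq_path, n_top=10, max_reads=200_000):
    """Return (reads_counted, [(sequence, count), ...]) for the most frequent reads.

    Hypothetical helper for illustration; assumes 4-line FASTQ records.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # The sequence is the 2nd line of each 4-line record: lines 1, 5, 9, ... (0-based).
        seq_lines = islice(handle, 1, None, 4)
        # Only look at the first max_reads reads to keep memory use modest.
        for seq in islice(seq_lines, max_reads):
            counts[seq.strip()] += 1
    total = sum(counts.values())
    return total, counts.most_common(n_top)
```

Calling `top_sequences` on the R1 file of one of the two outlier samples and dividing each count by the returned total gives the fraction of reads taken up by each sequence. The top sequences can then be searched against the reference (is it one abundant transcript such as an rRNA or ssrA?) or against your adapter sequences (is it primer-dimer?).
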
Are you sure that the two samples with high duplication levels really have far more duplication than the others, or are they just above the arbitrary threshold used to flag duplication? Do they also have comparable mapping percentages when you look only at exonic regions? If so, you can assume that the duplicated reads come from RNA molecules and that the duplication has nothing to do with the MiSeq technology specifically. These two samples may have very high expression of particular genes. It is also possible that only a small amount of input material was available for these samples, causing low library complexity and high duplication levels.
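
One way to check whether the two samples are real outliers, rather than just over a fixed cutoff, is to compare every sample's duplication percentage to the rest of the run. The sketch below uses a robust z-score (median and median absolute deviation). The function name, the 3.5 cutoff, and the input format (a dict of sample name to duplication percentage, e.g. copied by hand from the FastQC reports) are assumptions for illustration, not part of any tool.

```python
from statistics import median


def flag_duplication_outliers(dup_pct, cutoff=3.5):
    """Return {sample: robust_z} for samples with unusually HIGH duplication.

    dup_pct: dict mapping sample name -> duplication percentage (hypothetical input).
    cutoff:  robust z-score above which a sample is flagged (a common rule of thumb).
    """
    values = list(dup_pct.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return {}  # all samples (nearly) identical; nothing to flag
    flagged = {}
    for sample, value in dup_pct.items():
        # 0.6745 rescales the MAD so the score is comparable to a normal z-score.
        z = 0.6745 * (value - med) / mad
        if z > cutoff:
            flagged[sample] = round(z, 2)
    return flagged
```

If the two samples are flagged here while the other 28 cluster tightly, the difference is real and worth chasing (input amount, PCR cycles, a dominant transcript). If nothing is flagged, they are probably just on the wrong side of the reporting threshold.
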

