Question: Should duplicate Illumina reads be removed before hybrid de novo assembly, PacBio read correction, or polishing a PacBio-only assembly?
Hi All,

I was wondering how important it is to get rid of exact duplicate Illumina reads in each of the following cases:

  1. Before using them to correct PacBio reads (planning to use proovread)
  2. Before using them to polish a PacBio-only assembly with Pilon (the assembly was done with miniasm from uncorrected PacBio reads)
  3. Before using them in a hybrid de novo assembly with PBcR

Some of my Illumina libraries contain a significant number of reads that are duplicated more than 10 times. What would you recommend doing with these duplicate reads in the scenarios above?

Many thanks in advance!

It's not a good idea to remove duplicate reads unless your libraries are amplified. If they are amplified, and you have reads appearing 10+ times, I highly recommend you switch to an unamplified protocol, because you are wasting sequencing output. And by duplicates, I mean pairs in which both read 1 and read 2 are duplicated; if only one mate matches, the pairs are not, in fact, duplicates.
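To make the pair-level definition concrete, here is a minimal Python sketch that counts exact duplicate pairs. It assumes the reads are already loaded in memory as (read 1, read 2) sequence tuples; the function name `duplicate_pair_stats` and the `min_copies` parameter are made up for illustration and do not come from any particular tool.

```python
from collections import Counter


def duplicate_pair_stats(pairs, min_copies=10):
    """Summarize exact duplication among read pairs.

    pairs: iterable of (read1_sequence, read2_sequence) tuples.
    A pair is a duplicate only if BOTH mates match another pair exactly.
    """
    # Key on the whole pair, so a shared read 1 alone does not count.
    counts = Counter((r1, r2) for r1, r2 in pairs)
    total = sum(counts.values())
    unique = len(counts)
    return {
        "total_pairs": total,
        "unique_pairs": unique,
        # Copies beyond the first occurrence of each distinct pair.
        "redundant_pairs": total - unique,
        # Distinct pairs seen at least min_copies times.
        "pairs_with_min_copies": sum(1 for c in counts.values() if c >= min_copies),
    }


if __name__ == "__main__":
    example = [
        ("ACGT", "TTGA"),
        ("ACGT", "TTGA"),  # true duplicate: both mates match
        ("ACGT", "CCCC"),  # same read 1, different read 2: not a duplicate
    ]
    print(duplicate_pair_stats(example, min_copies=2))
```

Holding every pair in memory will not scale to a full library, so treat this as a way to check the definition on a subsample rather than as a deduplication tool.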

But if you are using an amplified library and duplicate pairs do occur, I recommend eliminating all duplicates and replacing them with a single copy of their consensus, in any situation other than quantification (e.g. RNA-seq).
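As a rough illustration of "a single copy of their consensus", the sketch below collapses a group of reads already identified as duplicates into one sequence by per-position majority vote. It assumes all reads in the group have the same length and ignores base qualities, which a real deduplication tool would normally take into account; the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Return the per-position majority base for equal-length sequences.

    Ties go to the base that appears first among the tied bases in the
    input order. Quality scores are deliberately ignored in this sketch.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) yields one column of bases per read position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


if __name__ == "__main__":
    # Three copies of one fragment, one carrying a sequencing error.
    print(consensus(["ACGT", "ACGT", "ACTT"]))
```

For paired reads, the same vote would be applied separately to the read 1 group and the read 2 group of each duplicate cluster.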

Thanks, this seems reasonable.
