Question: Remove duplicates from reads: best practices?
3.8 years ago by
Vienna - BOKU
Macspider3.2k wrote:

Hey there,

I am curious about deduplication as a step in processing sequencing reads. So far I have done it a handful of times and it always helped in the end, but I am aware that there is a debate on whether it is actually biologically correct to do.

What I usually do is map the reads, get the BAM file, and run Picard MarkDuplicates on it.
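To make the idea of position-based duplicate marking concrete, here is a minimal sketch. It is not Picard's implementation: the `Alignment` record and `mark_position_duplicates` function are hypothetical names, and the rule used (group reads by chromosome, 5' position and strand, keep the one with the highest base-quality sum) is a simplifying assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (not a real BAM parser)."""
    name: str
    chrom: str
    pos: int        # 5' mapping position
    reverse: bool   # True if mapped to the reverse strand
    qual_sum: int   # sum of base qualities


def mark_position_duplicates(alignments):
    """Return the names of reads flagged as duplicates by mapping position."""
    groups = defaultdict(list)
    for aln in alignments:
        groups[(aln.chrom, aln.pos, aln.reverse)].append(aln)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest quality sum; ties go to the first seen.
        best = max(group, key=lambda a: a.qual_sum)
        for aln in group:
            if aln is not best:
                duplicates.add(aln.name)
    return duplicates
```

Reads are only flagged here, not deleted, so the count of duplicates stays available even if you decide to keep them.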

I have three questions:

  1. How do people deduplicate by mapping position using a psl file?
  2. When would you say that deduplication is too risky?
  3. I personally developed a tool (though some already exist) to remove duplicates by sequence identity. Without going into the details of the algorithm, I can tell you that the reads removed by Picard and by my script overlap by 99% (not 100%, though: some reads differ). Is this approach theoretically correct?
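
For illustration, the simplest form of sequence-identity deduplication could be sketched as below. This is not the script mentioned above, whose algorithm is not described here: `dedupe_by_sequence` is a hypothetical name, and the sketch assumes exact full-length matches only, with no tolerance for sequencing errors and no reverse-complement handling.

```python
def dedupe_by_sequence(records):
    """Keep the first read seen for each distinct sequence.

    records: iterable of (name, sequence, quality) tuples.
    Returns the kept records in their original order.
    """
    seen = set()
    kept = []
    for name, seq, qual in records:
        if seq in seen:
            continue  # exact sequence already seen: treat as duplicate
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept
```

The two criteria are not equivalent, so a partial mismatch with a position-based tool is expected: reads with identical sequences can originate from different copies of a repeat, and reads mapped to the same position can differ by a sequencing error.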
ADD COMMENTlink written 3.8 years ago by Macspider3.2k
3.8 years ago by
United States
genomax91k wrote:

It depends on what kind of sequencer your data is from. There is a known issue where one may end up with optical duplicates on patterned flowcells (HiSeq 3000/4000), depending on insert size and loading concentration. Even if you don't remove them, you would want to know how many are present, since they are artifacts of the technology.

@Brian recently added mark/remove duplicate functionality to a tool from the BBMap suite called Clumpify. The advantage here is that one does not need to align the data (to an external reference) to look for, mark, or remove duplicates of all types, and you can allow for errors in a controlled way.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by genomax91k
  1. Could you point me to a link where I can read up on the known flowcell issue?
  2. If it doesn't use alignment to identify duplicates, what does it use? My script uses sequence identity from the centre of the read.
ADD REPLYlink written 3.8 years ago by Macspider3.2k
1 and this recent thread: Duplicates on Illumina

I meant to say alignment to an external reference (corrected above). The mark-duplicates functionality is an extension of the Clumpify algorithm (details are in this thread: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.), which identifies reads with similar sequences in a file. Those reads are rearranged to be near each other, which leads to more efficient compression of the data files, saving ~25% or so of space. Optical duplicates are marked by taking into account the x,y coordinates of the read clusters and their positional neighborhood.
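
As a rough sketch of the coordinate idea (not Clumpify's actual algorithm), the following assumes Illumina-style read names of the form `instrument:run:flowcell:lane:tile:x:y`, exact sequence matches, and comparison within a single tile only. The function names and the `max_dist` default are hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


def parse_coords(read_name):
    """Extract (lane, tile, x, y) from an Illumina-style read name."""
    fields = read_name.split()[0].lstrip("@").split(":")
    return fields[3], fields[4], int(fields[5]), int(fields[6])


def mark_optical_duplicates(reads, max_dist=40):
    """Flag reads with the same sequence that lie close together on a tile.

    reads: iterable of (name, sequence) tuples.
    max_dist: hypothetical distance threshold in flow cell coordinate units.
    Returns the set of read names flagged as optical duplicates.
    """
    by_key = defaultdict(list)
    for name, seq in reads:
        lane, tile, x, y = parse_coords(name)
        by_key[(seq, lane, tile)].append((name, x, y))

    flagged = set()
    for cluster in by_key.values():
        kept = []  # coordinates of the reads retained so far
        for name, x, y in cluster:
            if any(math.hypot(x - kx, y - ky) <= max_dist for kx, ky in kept):
                flagged.add(name)
            else:
                kept.append((x, y))
    return flagged
```

A suitable distance threshold depends on the platform, so the default above should not be taken as a recommendation.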

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by genomax91k

Thank you very much!

ADD REPLYlink written 3.8 years ago by Macspider3.2k

To add to this:

I recommend optical duplicate removal for all HiSeq platforms, for any kind of project in which you expect high library complexity (such as WGS). By optical duplicate removal, I mean removing duplicates with very close coordinates on the flow cell. And by duplicate removal, I mean removing all duplicate copies except one. Whether you should remove non-positionally-correlated duplicates (such as PCR duplicates) is more experiment-specific. Whether you should do any form of duplicate removal on low-complexity libraries is also experiment-specific, as you'll get false positives even when restricting duplicate detection to nearby clusters.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by Brian Bushnell17k
3.8 years ago by
agata88800 wrote:

In my opinion the best practice is to NOT remove duplicates from reads. I've done a lot of comparisons in which some information was lost because I performed the MarkDuplicates step. In the end it turned out that without this step all of the variants were detected and confirmed by Sanger sequencing.

ADD COMMENTlink written 3.8 years ago by agata88800

I know, and I partially agree, but in genome-wide studies you cannot verify every variant by Sanger (we're talking about millions). You therefore accept losing some data for the sake of noise reduction, to retrieve a list of variants that may be incomplete but that you really, really trust. That's simply why I'm doing it right now, but I'm aware of the drawbacks.

ADD REPLYlink written 3.8 years ago by Macspider3.2k

I was just saying that my comparison was confirmed by Sanger. Of course you won't check that way for genome-wide studies; there the only option is to compare against known variants for the data. I don't get what it is all for ... if I have good results without removing reads...

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by agata88800

Yes, I always try without in the first place, I agree with you. In this case there was just too much noise, and the dedupe actually helped a lot.

ADD REPLYlink written 3.8 years ago by Macspider3.2k