Question: Why Do We Need MarkDuplicates For Variant Detection In The GATK Processing Pipeline?
8.1 years ago by
Lds420 wrote:

Hi fellows,

It's said that MarkDuplicates in Picard matches all read pairs that have identical 5' coordinates and orientations, and marks all but the 'best' pair as duplicates. Suppose I have three such pairs, one of which is the 'best' pair, and all three truly come from the target genome rather than from sequencing artifacts. If I set REMOVE_DUPLICATES=True, it will delete the two non-best pairs and so decrease the coverage for that region. This doesn't make sense to me, so maybe I have misunderstood the purpose of MarkDuplicates. So my question is: what is the purpose of MarkDuplicates, and why does it delete the duplicates?

Thanks in advance

gatk picard markduplicates • 14k views

Lots of previous information in these threads:

Reply written 8.1 years ago by Chris Miller21k
8.1 years ago by
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

Almost all statistical models for variant calling assume some sort of independence between measurements. Duplicates (if one assumes that they arise from PCR artifacts) are not independent. This lack of independence will usually break the statistical model and produce incorrect measures of statistical significance.
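To make the independence point concrete, here is a minimal sketch of a toy genotype likelihood. It is not the model GATK actually uses; the 1% error rate, the read counts, and the function names are all assumptions chosen for illustration. The scenario: a site that is truly homozygous reference, where one fragment picked up an error and was then amplified into five identical reads.

```python
import math

def log10_likelihood(n_ref, n_alt, alt_fraction, err=0.01):
    """log10 P(reads | genotype), treating every read as an independent draw.

    alt_fraction is the expected share of alt alleles for the genotype
    (0.0 for hom-ref, 0.5 for het). err is an assumed per-read error rate.
    """
    p_alt = alt_fraction * (1 - err) + (1 - alt_fraction) * err
    return n_ref * math.log10(1 - p_alt) + n_alt * math.log10(p_alt)

def het_vs_homref(n_ref, n_alt):
    """log10 likelihood ratio: positive favours het, negative favours hom-ref."""
    return log10_likelihood(n_ref, n_alt, 0.5) - log10_likelihood(n_ref, n_alt, 0.0)

# After duplicate marking: 9 ref fragments and 1 erroneous alt fragment.
llr_dedup = het_vs_homref(n_ref=9, n_alt=1)

# Without duplicate marking: the same alt fragment is counted 5 times,
# as if it were 5 independent observations of the alt allele.
llr_with_dups = het_vs_homref(n_ref=9, n_alt=5)

print(f"deduplicated:    {llr_dedup:+.2f}")      # negative: hom-ref favoured
print(f"with duplicates: {llr_with_dups:+.2f}")  # positive: het favoured
```

Under these assumptions the single erroneous fragment flips the call from hom-ref to het purely because its copies are treated as independent evidence.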

There are experiments where one should not make the assumption that reads that have the same start positions are PCR duplicates. In that case, using MarkDuplicates is not justified.


Thanks so much. This is the discussion in seqanswers:

I think that we should be using MarkDuplicates in SNP calling.

Reply written 8.1 years ago by Lds420

Yes, you should.

Reply written 8.1 years ago by Sean Davis26k
8.1 years ago by
Rochester, NY USA
Alex Paciorkowski3.4k wrote:

MarkDuplicates is important in removing PCR duplicates -- which can introduce bias in your variant calling. If you did not mark duplicates, you would risk having over-representation in your sequence of areas preferentially amplified during PCR. One way to think about it is that marking duplicates and removing them does not really have a detrimental effect on your overall depth of coverage -- but increases the quality/reliability of the areas you have covered.
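As a rough illustration of what "marking" means, here is a toy sketch in Python. It is not Picard's implementation: the `ReadPair` class and `mark_duplicates` function are hypothetical, the example reads are made up, and "best" is assumed to mean the pair with the highest summed base quality. Pairs are grouped by the 5' coordinates and orientations of both reads, and every pair except the best one in each group gets the SAM duplicate flag bit (0x400).

```python
from collections import defaultdict
from dataclasses import dataclass

DUP_FLAG = 0x400  # SAM flag bit: "PCR or optical duplicate"

@dataclass
class ReadPair:
    """Hypothetical, simplified stand-in for an aligned read pair."""
    name: str
    chrom: str
    start5_r1: int      # 5' coordinate of read 1
    strand_r1: str
    start5_r2: int      # 5' coordinate of read 2
    strand_r2: str
    base_qual_sum: int  # assumed scoring criterion for the 'best' pair
    flag: int = 0

def mark_duplicates(pairs):
    """Flag all but the highest-scoring pair in each group of identical keys."""
    groups = defaultdict(list)
    for p in pairs:
        key = (p.chrom, p.start5_r1, p.strand_r1, p.start5_r2, p.strand_r2)
        groups[key].append(p)
    for group in groups.values():
        best = max(group, key=lambda p: p.base_qual_sum)
        for p in group:
            if p is not best:
                p.flag |= DUP_FLAG  # mark, do not delete
    return pairs

pairs = [
    ReadPair("a", "chr1", 100, "+", 350, "-", 7000),
    ReadPair("b", "chr1", 100, "+", 350, "-", 7200),
    ReadPair("c", "chr1", 100, "+", 350, "-", 6900),
    ReadPair("d", "chr1", 105, "+", 350, "-", 7100),  # different 5' start
]
mark_duplicates(pairs)
kept = [p.name for p in pairs if not p.flag & DUP_FLAG]
print(kept)  # ['b', 'd']
```

Note that all four pairs are still present after marking; only the flag changes. Flagged reads stay in the file, and a downstream tool can choose to skip them, which is different from deleting them outright with REMOVE_DUPLICATES.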

There is a good discussion covered here.

And also further discussion on the Picard Main Page.

Modified 6 months ago by RamRS26k • written 8.1 years ago by Alex Paciorkowski3.4k