Question

Why Do We Need MarkDuplicates For Variant Detection In The GATK Processing Pipeline?

19

12.1 years ago

Lds ▴ 450

Hi fellows,

It's said that MarkDuplicates in Picard finds all read pairs that have identical 5' coordinates and orientations and marks all but the 'best' pair as duplicates. Suppose I have three such pairs, one of which is the 'best' pair, and all three truly come from the target genome rather than from sequencing artifacts. If I set REMOVE_DUPLICATES=True, it will delete the two non-best pairs and so decrease the coverage for that region. This doesn't make sense to me, so maybe I have misunderstood the purpose of MarkDuplicates. So my question is: what is the purpose of MarkDuplicates, and why does it delete the duplicates?

Thanks in advance

gatk markduplicates picard • 23k views

updated 12.1 years ago by Alex Paciorkowski 3.5k • written 12.1 years ago by Lds ▴ 450

0

Lots of previous information in these threads: http://biostar.stackexchange.com/search?q=duplicates

12.1 years ago by Chris Miller 22k

score 11 · Answer 1 · 2012-03-19

12.1 years ago

Sean Davis 26k

Almost all statistical models for variant calling assume some sort of independence between measurements. Duplicates (if one assumes that they arise from PCR artifacts) are not independent: they are copies of the same original fragment, so they repeat the same evidence, including any error in it. This lack of independence will usually break the statistical model and produce incorrect measures of statistical significance.

There are experiments where one should not make the assumption that reads that have the same start positions are PCR duplicates. In that case, using MarkDuplicates is not justified.
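To make the independence point concrete, here is a minimal sketch in Python. It is a toy model of my own, not what GATK or any real caller computes: each read is treated as an independent draw, and a heterozygous site (alt allele fraction 0.5) is compared against a homozygous-reference site with an assumed per-read error rate of 0.01. The read counts are made up for illustration.

```python
import math

def log_likelihood_ratio(n_reads, n_alt, error_rate=0.01):
    """Return log L(het) - log L(hom-ref), assuming every read is independent.

    Positive values favour a heterozygous call, negative values favour
    homozygous reference. The binomial coefficient cancels in the ratio.
    """
    n_ref = n_reads - n_alt
    log_het = n_reads * math.log(0.5)
    log_hom_ref = n_alt * math.log(error_rate) + n_ref * math.log(1 - error_rate)
    return log_het - log_hom_ref

# Hypothetical site: 10 distinct fragments, one of which carries a PCR error.
deduplicated = log_likelihood_ratio(n_reads=10, n_alt=1)

# Same site, but the erroneous fragment was amplified into 5 identical copies.
with_duplicates = log_likelihood_ratio(n_reads=14, n_alt=5)

print(f"duplicates removed: {deduplicated:+.2f}")
print(f"duplicates kept:    {with_duplicates:+.2f}")
```

With duplicates removed the ratio is negative (homozygous reference is favoured); with the four extra copies counted as if they were independent observations it turns strongly positive, so one PCR error starts to look like a heterozygous variant.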

0

Thanks so much. This is the discussion in seqanswers: http://seqanswers.com/forums/showthread.php?t=6854

I think that we should use MarkDuplicates in SNP calling.

12.1 years ago by Lds ▴ 450

0

Yes, you should.

12.1 years ago by Sean Davis 26k

Ram · Answer 2 · 2012-03-19

MarkDuplicates is important for removing PCR duplicates, which can introduce bias into your variant calling. If you did not mark duplicates, you would risk over-representing regions that were preferentially amplified during PCR. One way to think about it is that marking and removing duplicates does not really have a detrimental effect on your overall depth of coverage, but it does increase the quality and reliability of the areas you have covered.
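For anyone who wants to see the idea in code, here is a rough sketch of the grouping logic described in the question: pairs with the same 5' coordinates and orientations form one group, and all but the 'best' pair are flagged. This is not Picard's implementation. The `ReadPair` fields are hypothetical, and using the sum of base qualities as the 'best' criterion is an assumption on my part; a real tool also reads BAM records, accounts for clipping and libraries, and handles optical duplicates.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ReadPair:
    # All field names are hypothetical, chosen for this illustration only.
    name: str
    chrom: str
    start_5p: int          # 5' coordinate of the first read
    strand: str            # '+' or '-'
    mate_start_5p: int     # 5' coordinate of the mate
    mate_strand: str
    base_quality_sum: int  # assumed 'best pair' score
    is_duplicate: bool = False

def mark_duplicates(pairs):
    """Flag every pair except the highest-scoring one in each group."""
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair.chrom, pair.start_5p, pair.strand,
               pair.mate_start_5p, pair.mate_strand)
        groups[key].append(pair)
    for group in groups.values():
        best = max(group, key=lambda p: p.base_quality_sum)
        for pair in group:
            pair.is_duplicate = pair is not best
    return pairs

pairs = mark_duplicates([
    ReadPair("pair_a", "chr1", 1000, "+", 1300, "-", 5200),
    ReadPair("pair_b", "chr1", 1000, "+", 1300, "-", 4900),
    ReadPair("pair_c", "chr1", 1000, "+", 1300, "-", 4700),
    ReadPair("pair_d", "chr1", 1005, "+", 1310, "-", 5000),
])
for pair in pairs:
    print(pair.name, "duplicate" if pair.is_duplicate else "kept")
```

Note that the pairs are only flagged here, not deleted. Flagging keeps the reads in the file so a downstream caller can decide to ignore them, while actually dropping them is a separate choice.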

There is a good discussion covered here.

And also further discussion on the Picard Main Page.