Question: Variant calling step order: base recalibration or mark duplicates, which comes first?
We're going through & revising our variant calling pipeline on NGS data from cancer patients and a question came up:

Which step should be done first (and why), base recalibration or mark duplicates?

Currently we recalibrate bases first and then mark duplicates.

The reason I'm asking this is that we originally based part of our pipeline on the following article, which said that you recalibrate bases and then mark duplicates:

However, the following Broad Institute best practices page says the opposite: you mark duplicates and then recalibrate bases. I saw the same order in another paper as well:

Thanks in advance!


As per the GATK best practices workflow, mark duplicates first, followed by base recalibration.
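To make that order concrete, here is a minimal sketch that only builds the three command lines in sequence. It assumes GATK4 tool and flag names (`MarkDuplicates`, `BaseRecalibrator`, `ApplyBQSR`); the function name `build_commands`, the file names and the output prefix are hypothetical placeholders, not anything from the original pipeline.

```python
from typing import List


def build_commands(bam: str, ref: str, known_sites: str, prefix: str) -> List[List[str]]:
    """Return GATK4 command lines in the order: mark duplicates, then BQSR.

    All paths are placeholders; adjust them to your own pipeline.
    """
    marked = f"{prefix}.marked.bam"
    metrics = f"{prefix}.dup_metrics.txt"
    table = f"{prefix}.recal.table"
    recal = f"{prefix}.recal.bam"
    return [
        # Step 1: flag duplicate reads so later steps can ignore them.
        ["gatk", "MarkDuplicates", "-I", bam, "-O", marked, "-M", metrics],
        # Step 2: build the recalibration table from the duplicate-marked BAM.
        ["gatk", "BaseRecalibrator", "-I", marked, "-R", ref,
         "--known-sites", known_sites, "-O", table],
        # Step 3: apply the recalibration to the same duplicate-marked BAM.
        ["gatk", "ApplyBQSR", "-I", marked, "-R", ref,
         "--bqsr-recal-file", table, "-O", recal],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is executed by accident.
    for cmd in build_commands("sample.bam", "ref.fasta", "known_sites.vcf.gz", "sample"):
        print(" ".join(cmd))
```

The point of the sketch is only that the recalibration steps read the duplicate-marked BAM, not the raw one. To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)`.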

I'd probably remove duplicates first, since BQSR builds a covariate model from all of the supplied reads. I'm assuming that a bunch of clonal artifacts in your dataset might throw this off a little. But honestly, you should ask the GATK people, as they have a better understanding of the underlying model.
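A toy illustration of why clonal copies could skew a mismatch-based quality estimate. This is not GATK's actual model: it assumes a simplified empirical quality of -10 * log10((mismatches + 1) / (observations + 2)), and the read counts, the `is_duplicate` field and the function names are all made up for the example.

```python
import math


def empirical_quality(mismatches: int, observations: int) -> float:
    """Simplified Phred-scaled error estimate with a +1/+2 pseudocount."""
    return -10 * math.log10((mismatches + 1) / (observations + 2))


def tally(reads, skip_duplicates: bool):
    """Sum mismatches and observed bases, optionally ignoring flagged duplicates."""
    mismatches = 0
    observations = 0
    for read in reads:
        if skip_duplicates and read["is_duplicate"]:
            continue
        mismatches += read["mismatches"]
        observations += read["bases"]
    return mismatches, observations


# 99 clean fragments of 100 bases each (hypothetical numbers).
reads = [{"bases": 100, "mismatches": 0, "is_duplicate": False} for _ in range(99)]
# One fragment carrying a single PCR error...
reads.append({"bases": 100, "mismatches": 1, "is_duplicate": False})
# ...plus 49 PCR copies of it, each repeating the same error.
reads += [{"bases": 100, "mismatches": 1, "is_duplicate": True} for _ in range(49)]

q_all = empirical_quality(*tally(reads, skip_duplicates=False))
q_dedup = empirical_quality(*tally(reads, skip_duplicates=True))

print(f"all reads:          Q ~ {q_all:.1f}")
print(f"duplicates skipped: Q ~ {q_dedup:.1f}")
```

In this toy setup one error is counted 50 times when duplicates are kept, so the estimated quality comes out noticeably lower than when they are skipped.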

Recalibrating bases should not really improve (or affect) duplicate detection. But duplicate removal can improve recalibration, so I'd do that first. And the earlier you remove duplicates, the faster everything else becomes.

