Question

Variant calling step order question: base recalibration & mark duplicates, which is first?

0

8.0 years ago

alons ▴ 270

Hi all,

We're going through & revising our variant calling pipeline on NGS data from cancer patients and a question came up:

Which step should be done first (and why), base recalibration or mark duplicates?

Currently we recalibrate bases first and then mark duplicates.

I'm asking because we originally based part of our pipeline on the following article, which recalibrates bases first and then marks duplicates: http://www.htslib.org/workflow/#mapping_to_variant

However, the following Broad Institute best practices page says the opposite: mark duplicates first, then recalibrate bases. I saw the same order in another paper as well: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS

Thanks in advance!

Alon

NGS variant calling cancer pipeline • 2.8k views

updated 8.0 years ago by Brian Bushnell 20k • written 8.0 years ago by alons ▴ 270

0

As per the GATK best practices workflow (https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png), mark duplicates first, then run base recalibration.

8.0 years ago by cpad0112 21k

score 2 · Accepted Answer · 2017-08-03

I'd probably remove duplicates first, since BQSR builds some sort of covariate model from all of the supplied reads. I'm assuming that having a bunch of clonal artifacts in your dataset might throw this off a little. But honestly, you should ask the GATK people, as they have a better understanding of the underlying model.
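To make that intuition a bit more concrete, here is a toy sketch in Python. It is not GATK's actual model: the pseudocount formula, the bin of Q30 bases, and all the counts are assumptions made up for illustration. It only shows how copies of a single PCR error could drag down the empirical quality estimated for a group of bases if duplicates were counted as independent observations.

```python
import math


def empirical_quality(mismatches: int, observations: int) -> float:
    """Phred-scaled empirical error rate with a simple +1/+2 pseudocount.

    Toy smoothing chosen for illustration; not the estimator GATK uses.
    """
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


# Hypothetical bin of bases that the sequencer reported as Q30.
unique_observations = 10_000
unique_mismatches = 10

# One fragment carrying a PCR error, present as 50 duplicate copies.
# Each copy adds one observation and one mismatch to the same bin.
duplicate_copies = 50

q_dedup = empirical_quality(unique_mismatches, unique_observations)
q_with_dups = empirical_quality(
    unique_mismatches + duplicate_copies,
    unique_observations + duplicate_copies,
)

print(f"empirical quality, duplicates excluded: Q{q_dedup:.1f}")
print(f"empirical quality, duplicates counted:  Q{q_with_dups:.1f}")
```

In this made-up scenario the same sequencing chemistry looks noticeably worse once the duplicated error is counted 50 times, which is the kind of skew I'd worry about.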

2

8.0 years ago

Brian Bushnell 20k

Recalibrating bases should not really improve (or affect) duplicate detection. But duplicate removal can improve recalibration, so I'd do that first. And the earlier you remove duplicates, the faster everything else becomes.
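As a rough sketch of that ordering, the Python snippet below only builds the command lines; it does not run anything. It assumes GATK4-style tool names and arguments (`MarkDuplicates`, `BaseRecalibrator`, `ApplyBQSR`), which differ from the GATK 3.x syntax that was current when this thread was written. The helper name `build_commands` and all file names are hypothetical.

```python
from typing import List


def build_commands(bam: str, ref: str, known_sites: str, prefix: str) -> List[List[str]]:
    """Return GATK4-style commands in the order: mark duplicates, then BQSR.

    Hypothetical helper for illustration; file naming is an assumption.
    """
    marked_bam = f"{prefix}.markdup.bam"
    metrics = f"{prefix}.markdup_metrics.txt"
    recal_table = f"{prefix}.recal.table"
    recal_bam = f"{prefix}.recal.bam"

    return [
        # Step 1: flag duplicates in the aligned, coordinate-sorted BAM.
        ["gatk", "MarkDuplicates", "-I", bam, "-O", marked_bam, "-M", metrics],
        # Step 2: build the recalibration table from the duplicate-marked BAM.
        ["gatk", "BaseRecalibrator", "-I", marked_bam, "-R", ref,
         "--known-sites", known_sites, "-O", recal_table],
        # Step 3: apply the table to the same duplicate-marked BAM.
        ["gatk", "ApplyBQSR", "-I", marked_bam, "-R", ref,
         "--bqsr-recal-file", recal_table, "-O", recal_bam],
    ]


if __name__ == "__main__":
    # Placeholder inputs; replace with real paths before running anything.
    for cmd in build_commands("sample.bam", "ref.fasta", "known.vcf.gz", "sample"):
        print(" ".join(cmd))
```

Each printed line could be passed to `subprocess.run` once the paths are real. The point is simply that the recalibration steps read the output of the duplicate-marking step, not the other way around.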