ReadsPipelineSpark marking duplicates differently than MarkDuplicates

3.4 years ago

from the mountains ▴ 230

I am using GATK 4.1.0.0 to mark duplicates in my BAM and recalibrate its base qualities. My current workflow is:

1. MarkDuplicates
2. BaseRecalibrator
3. ApplyBQSR

Recently I have wanted to replace these steps with Spark-enabled pipelines to increase efficiency. I came across ReadsPipelineSpark, which also marks duplicates in the BAM, but it flags a slightly different number of reads as duplicates (the total read count is the same).

for i in results_*; do
    echo "$i"
    samtools view "$i/bams/sampleA/sampleA.bam" | wc -l            # all records
    samtools view -F 1024 "$i/bams/sampleA/sampleA.bam" | wc -l    # records not flagged as duplicates
done
results_ReadsPipelineSpark
315570
242745
results_regular
315570
243265
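To make the size of the discrepancy explicit, the duplicate counts can be derived from the numbers above (total records minus records without the duplicate flag). This is only arithmetic on the reported output; the variable names are made up for illustration.

```shell
# Counts copied from the samtools output above.
total=315570
nondup_spark=242745      # results_ReadsPipelineSpark, samtools view -F 1024
nondup_regular=243265    # results_regular, samtools view -F 1024

# Records flagged as duplicates = total - records without flag 1024.
dups_spark=$((total - nondup_spark))
dups_regular=$((total - nondup_regular))
extra=$((dups_spark - dups_regular))

echo "results_ReadsPipelineSpark duplicates: $dups_spark"
echo "results_regular duplicates:            $dups_regular"
echo "difference:                            $extra"
```

So ReadsPipelineSpark flags 72825 records as duplicates versus 72305 for the regular workflow, a difference of 520 records (roughly 0.16% of the total).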

I am running both with the SUM_OF_BASE_QUALITIES duplicate scoring strategy (the default for both).

Does anybody understand why the two results would differ?
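One way to narrow this down would be to list the read names that are flagged as duplicates in one BAM but not in the other, and then inspect a few of them by hand. The following is a rough sketch, not a tested recipe: the helper names `dup_names` and `only_in_first` and the temporary file names are hypothetical, and it compares by read name only, so it does not distinguish the two mates of a pair.

```shell
# Print the sorted, unique names of records carrying the duplicate flag
# (SAM flag 0x400 = 1024) in the BAM given as $1.
dup_names() {
    samtools view -f 1024 "$1" | cut -f1 | LC_ALL=C sort -u
}

# Print lines that appear in the first sorted file but not in the second.
only_in_first() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Example usage (paths follow the directory layout shown above):
#   dup_names results_ReadsPipelineSpark/bams/sampleA/sampleA.bam > spark_dups.txt
#   dup_names results_regular/bams/sampleA/sampleA.bam            > regular_dups.txt
#   only_in_first spark_dups.txt regular_dups.txt   # duplicates only in the Spark output
#   only_in_first regular_dups.txt spark_dups.txt   # duplicates only in the MarkDuplicates output
```

Looking up a handful of the resulting names with samtools view in both BAMs should show whether the two tools disagree about which read in a duplicate set to keep, or about which reads belong to a duplicate set at all.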

alignment gatk DNA-Seq • 885 views
