ReadsPipelineSpark marking duplicates differently than MarkDuplicates
3.4 years ago

I am using GATK 4.1.0.0 to mark duplicates in my BAM and recalibrate it. My current workflow is:

1. MarkDuplicates
2. BaseRecalibrator
3. ApplyBQSR

Recently I have wanted to replace these steps with Spark-enabled pipelines to increase efficiency. I came across ReadsPipelineSpark, which also marks duplicates in the BAM, but it yields a slightly different number of duplicate reads (the total read count is the same).

# For each run: total reads, then reads without the duplicate flag (0x400 = 1024)
for i in results_*; do echo "$i"; samtools view "$i/bams/sampleA/sampleA.bam" | wc -l; samtools view -F 1024 "$i/bams/sampleA/sampleA.bam" | wc -l; done
results_ReadsPipelineSpark
315570
242745
results_regular
315570
243265
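
From these counts, ReadsPipelineSpark flags 315570 - 242745 = 72825 reads as duplicates and the regular workflow flags 315570 - 243265 = 72305, a difference of 520. One way to narrow this down might be to list which reads carry the duplicate flag in only one of the two BAMs. The following is a minimal sketch, not a definitive method: `dup_names` is a hypothetical helper name, it only inspects the FLAG field, and the paths in the usage comment are assumed to match the loop above.

```shell
# Sketch: read SAM records on stdin and print "QNAME/mate" for every record
# whose FLAG has the duplicate bit (0x400 = 1024) set.
# dup_names is a hypothetical helper name, not part of samtools or GATK.
dup_names() {
  awk -F'\t' '
    !/^@/ && int($2 / 1024) % 2 == 1 {
      mate = (int($2 / 128) % 2 == 1) ? 2 : 1   # bit 0x80: second in pair
      print $1 "/" mate
    }' | sort -u
}

# Intended usage (bash, requires samtools; paths assumed from the loop above):
#   comm -3 \
#     <(samtools view results_regular/bams/sampleA/sampleA.bam | dup_names) \
#     <(samtools view results_ReadsPipelineSpark/bams/sampleA/sampleA.bam | dup_names)
```

With `comm -3`, the first column would hold reads flagged only by the regular workflow and the second column reads flagged only by ReadsPipelineSpark. Checking a few of those reads in both files may show whether the two tools merely pick a different representative within the same duplicate set or disagree on which reads are duplicates at all.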

I am running both with the SUM_OF_BASE_QUALITIES duplicate scoring strategy (the default for both).

Does anybody understand why the two results would differ?

alignment gatk DNA-Seq • 885 views