Question

Picard MarkDuplicates flag vs remove issue

5.8 years ago

s1469060 ▴ 10

Hi all

I have a quick question about Picard MarkDuplicates. I have ATAC-seq data, already filtered for mitochondrial and unmapped reads. The initial file has about 59 million reads. When I run MarkDuplicates like so:

java -jar /exports/igmm/eddie/hill-lab/Zoe/References_and_Scripts/picard-tools-2.5.0/picard.jar MarkDuplicates I=Mutant1_paired_align_subMitoUnc_sorted.bam O=Mutant1_align_filtered.bam M=Mutant1_test_metrics.txt REMOVE_DUPLICATES=true

the output file has 39 million reads.

However, if I run it with REMOVE_DUPLICATES=false and then use samtools to remove the reads flagged 1024 (0x400), I end up with 56 million reads. I can't understand why REMOVE_DUPLICATES=true makes such a difference. Shouldn't the output of both methods be similar? Thanks in advance!

All the best, Zoe

ATAC-seq Duplicates Picard MarkDuplicates • 6.6k views

score 0 · Answer 1 · 2018-06-14

5.8 years ago

Prakash ★ 2.2k

This issue has been discussed before. This thread may help: C: Samtools rmdup and Piccard Markduplicates

Hi, thanks for the reply, but I'm still a bit confused. It's not that Samtools rmdup is removing fewer reads; I have never even tried it. It's just that when I remove the reads flagged as duplicates by MarkDuplicates, rather than using its own REMOVE_DUPLICATES=TRUE option, I get different results. Shouldn't the REMOVE_DUPLICATES=TRUE option just remove the reads it is flagging? (It appears to remove a hell of a lot more.)

5.8 years ago by s1469060 ▴ 10

"However if I run it with REMOVE_DUPLICATES=FALSE and then use samtools to remove the 1024 flagged reads I end up with 56 million reads"

"It's not that using Samtools rmdup is removing fewer reads, I have never even tried it"

How do you expect an answer if you can't give clear information in the question?

1) State clearly what your goal is.

2) State clearly what you have done.

3) State clearly the results you have got from what you did.

4) State clearly what is confusing you.

5.8 years ago by BioinfGuru ★ 1.7k

Sorry if that wasn't clear enough.

Original read file: 59 million reads. The commands are as follows:

java -jar /exports/igmm/eddie/hill-lab/Zoe/References_and_Scripts/picard-tools-2.5.0/picard.jar MarkDuplicates I=Mutant1_paired_align_subMitoUnc_sorted.bam O=Mutant1_align_filtered.bam M=Mutant1_test_metrics.txt REMOVE_DUPLICATES=true

Output: 39 million reads

OR

java -jar /exports/igmm/eddie/hill-lab/Zoe/References_and_Scripts/picard-tools-2.5.0/picard.jar MarkDuplicates I=Mutant1_paired_align_subMitoUnc_sorted.bam O=Mutant1_align_filtered.bam M=Mutant1_test_metrics.txt REMOVE_DUPLICATES=false

Then:

samtools view -F 0x400 Mutant1_align_filtered.bam > Mutant1_align_filtered_2.bam

Output: 56 million reads

Question: shouldn't the number of reads flagged 0x400 match the number of reads removed when REMOVE_DUPLICATES=true? In this case it does not: one approach removes about 3 million reads and the other about 20 million.

5.8 years ago by s1469060 ▴ 10
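
One way to test this directly is to count the records that carry the 0x400 bit in the marked (REMOVE_DUPLICATES=false) output and compare that count with the drop seen under REMOVE_DUPLICATES=true. A minimal sketch, assuming SAM text arrives on stdin (for example piped from samtools view); the function name count_dups is made up for illustration:

```shell
# Count SAM records whose FLAG (column 2) has the duplicate bit set.
# 0x400 == 1024. Header lines (starting with "@") are skipped.
# Assumption: input is tab-separated SAM text on stdin.
count_dups() {
    awk -F '\t' '!/^@/ && int($2 / 1024) % 2 == 1 { c++ } END { print c + 0 }'
}
```

For example, `samtools view Mutant1_align_filtered.bam | count_dups` on the marked file; `samtools view -c -f 0x400 Mutant1_align_filtered.bam` should report the same number. It may also be worth checking that samtools view writes headerless SAM text unless -b is given, so the redirected Mutant1_align_filtered_2.bam above is probably SAM rather than BAM, which could affect how its reads get counted afterwards.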

score 0 · Answer 2 · 2018-06-14

MarkDuplicates is doing something other than just removing duplicates; otherwise it would just be called remove duplicates. My guess is that 17M have flag 1024, and the 3M have a different flag, but what is it?

I would find a command that lets me print the 3M reads that are the difference between the input (59M) and output (56M) BAM files when REMOVE_DUPLICATES is false. Then I would query the flags present in those 3M reads and find out what those flags mean.
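
The flag check suggested above could be sketched roughly like this. It assumes SAM text on stdin (e.g. from samtools view); flag_tally and explain_flag are hypothetical helper names, and the bit names follow the FLAG table in the SAM specification, using the same abbreviations as samtools flags:

```shell
# Tally how many records carry each FLAG value (column 2 of SAM text).
# Header lines (starting with "@") are skipped. Output: FLAG<TAB>count.
flag_tally() {
    awk -F '\t' '!/^@/ { n[$2]++ } END { for (f in n) print f "\t" n[f] }' | sort -n
}

# Decode a numeric FLAG into the names of the bits that are set.
# Bits run from 0x1 (PAIRED) up to 0x800 (SUPPLEMENTARY), doubling each step.
explain_flag() {
    flag=$1
    out=""
    bit=1
    for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
                READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY; do
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
        bit=$(( bit * 2 ))
    done
    printf '%s\n' "$out"
}
```

Piping `samtools view` output into flag_tally lists each FLAG value with its count, and `explain_flag 1123` prints PAIRED,PROPER_PAIR,MREVERSE,READ1,DUP. samtools can do the decoding itself with `samtools flags 1123`.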

Here are some links I've found useful in the past:

Let us know what the answer is :)