Question about Details in Picard MarkDuplicates
8.8 years ago
DVA ▴ 630

I have a question about how Picard marks duplicates: within one set of duplicate reads, are all of them marked with flag 1024, or only the ones meant for removal? In other words, does Picard distinguish the "best" read in the duplicate set by leaving it unflagged?

I tried to go through Picard's source code, but I'm not familiar with Java, so unfortunately I didn't get anywhere :( Thanks in advance for your help!

sequence • 4.0k views
8.8 years ago

Copied from here: http://broadinstitute.github.io/picard/faq.html

Q: How does MarkDuplicates work?

A: Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taken place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5'-most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities over bases with Q >= 15.

If your reads have been divided into separate BAMs by chromosome, inter-chromosomal pairs will not be identified, but MarkDuplicates will not fail due to inability to find the mate pair for a read.
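
To make the FAQ description concrete, here is a minimal Python sketch of the idea. It is not Picard's actual code: the function names, the dict layout for a read pair, and the tie-breaking (first pair seen wins) are all assumptions for illustration. It shows the two pieces described above: computing the unclipped 5' coordinate from a CIGAR string, and setting the duplicate bit (0x400, i.e. 1024) on every pair in a group except the one with the highest quality sum.

```python
import re
from collections import defaultdict

DUPLICATE_FLAG = 0x400   # 1024 in the SAM FLAG field
MIN_BASE_QUALITY = 15    # threshold quoted in the FAQ

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, reverse):
    """Hypothetical helper: 1-based position where the 5'-most base
    would sit if the clipped bases had been aligned too."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is the leftmost base, so step back
        # over any leading soft/hard clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so take the
    # alignment end on the reference and add any trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail


def quality_score(quals):
    """Sum of base qualities, counting only bases with Q >= 15."""
    return sum(q for q in quals if q >= MIN_BASE_QUALITY)


def mark_duplicates(pairs):
    """Set the duplicate bit on all but the best pair in each group.

    Each pair is a dict with an assumed layout: 'key' (5' coordinates
    and orientations of both reads), 'quals' (base qualities of both
    reads), and 'flag' (SAM flag).
    """
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair["key"]].append(pair)
    for group in groups.values():
        best = max(group, key=lambda p: quality_score(p["quals"]))
        for pair in group:
            if pair is not best:
                pair["flag"] |= DUPLICATE_FLAG
    return pairs


pairs = [
    {"name": "a", "key": ("chr1", 95, "+", "chr1", 349, "-"), "quals": [30, 30, 10], "flag": 99},
    {"name": "b", "key": ("chr1", 95, "+", "chr1", 349, "-"), "quals": [40, 14, 14], "flag": 99},
    {"name": "c", "key": ("chr1", 500, "+", "chr1", 760, "-"), "quals": [20, 20, 20], "flag": 99},
]
mark_duplicates(pairs)
for p in pairs:
    print(p["name"], p["flag"], bool(p["flag"] & DUPLICATE_FLAG))
```

So, in this sketch, as in the FAQ description, the "best" pair of each group keeps its original flag and only the others get 1024 added. You can check any read yourself by testing `flag & 0x400`.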


Oh woo I'm sorry I didn't know Picard has this explained already. Thanks!
