Library Duplicates Vs. Optical Duplicates (Picard MarkDuplicates)
10.7 years ago
toni ★ 2.2k

Hi all,

I am analyzing an Illumina paired-end sequencing experiment. I would like to track the duplicates in my lanes and be able to distinguish between PCR duplicates and optical duplicates.

For this purpose, I use Picard MarkDuplicates. It has an OPTICAL_DUPLICATE_PIXEL_DISTANCE parameter, which is nice, but since the tool simply sets the duplicate flag in the sorted BAM file, there is no way in the end to distinguish between the two kinds. (Am I right?)

So basically I am wondering whether this option is really useful. The documentation explains that MarkDuplicates starts by finding the 5' coordinates and mapping orientations of each read pair, so looking at the coordinates of the cluster on the flowcell seems unnecessary (?), as the pair will be flagged as a duplicate anyway.

Do you use an in-house script or a particular API for this?

Cheers, Tony

EDIT: I am aware that Picard writes a metrics file that reports some values. But in some lanes generated with a PCR-free protocol, I expected a proportion of my duplicates to be optical duplicates. Nevertheless, the Picard metrics file always shows %optical_dup=0. So I am wondering whether some of you have had issues with this measure as well.

picard markduplicates bam • 13k views
10.6 years ago
toni ★ 2.2k

I had neglected to look closely at the read names in my files. A read name has the following format:

 @identifier:lane:tile:x:y


By default, Picard only matches letters and digits in the 'identifier' part. So if your read names contain underscores (which is actually quite common), Picard cannot recover the coordinates, and no optical duplicates will be reported.

Use the READ_NAME_REGEX option of MarkDuplicates to customize the read name matching.
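
To check a pattern before running Picard, something like the following Python sketch may help. It is only an illustration under assumptions: DEFAULT_REGEX is meant to mirror the default pattern documented for older Picard releases (please verify it against your own version), CUSTOM_REGEX is a hypothetical replacement that also accepts underscores and hyphens in the identifier, and the read names are made up. Picard itself uses Java regular expressions, which should behave the same way for simple patterns like these.

```python
import re

# Assumption: mirrors the default READ_NAME_REGEX documented for older
# Picard releases; check the documentation of the version you run.
DEFAULT_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"

# Hypothetical custom pattern: also allows '_' and '-' in the identifier.
CUSTOM_REGEX = r"[a-zA-Z0-9_-]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_coords(read_name, pattern):
    """Return (tile, x, y) as ints, or None if the whole name does not match."""
    m = re.fullmatch(pattern, read_name)
    if m is None:
        return None
    tile, x, y = (int(g) for g in m.groups())
    return tile, x, y


# Made-up read names in the identifier:lane:tile:x:y format.
plain = "HWUSI:3:12:1045:2001"
underscored = "HWUSI_EAS100R:3:12:1045:2001"

print(parse_coords(plain, DEFAULT_REGEX))        # (12, 1045, 2001)
print(parse_coords(underscored, DEFAULT_REGEX))  # None: '_' is not matched
print(parse_coords(underscored, CUSTOM_REGEX))   # (12, 1045, 2001)
```

If the pattern returns the three coordinates for a handful of your real read names, the same expression can be passed to MarkDuplicates through READ_NAME_REGEX.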