Hi all,
We are doing some QC on our HiSeq X genome data. We see a higher rate of duplicates than for the same libraries sequenced on other platforms. This is a known phenomenon, as mentioned in this post on the GATK website:
However, we're trying to work something out on our end, and I'm wondering whether these Picard MarkDuplicates parameters are appropriate for data from the HiSeq X:
READ_NAME_REGEX=[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500
example read name: HISEQX2_43:2:1115:17188:49513
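As a quick sanity check, here is a minimal Python sketch that applies the regex above to the example read name. It assumes the three capture groups are read as tile, x and y, in that order, which is how the Picard documentation describes READ_NAME_REGEX. The function name parse_read_name is just for illustration, and Python's re module stands in for the Java regex engine that Picard actually uses.

```python
import re

# Pattern copied from the READ_NAME_REGEX parameter above.
# Assumption: the three capture groups are tile, x, y (in that order).
READ_NAME_REGEX = re.compile(
    r"[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
)


def parse_read_name(name):
    """Return (tile, x, y) as ints, or None if the name does not match."""
    match = READ_NAME_REGEX.fullmatch(name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


print(parse_read_name("HISEQX2_43:2:1115:17188:49513"))
# (1115, 17188, 49513)
```

So for this read name format the regex does pick up the last three colon-separated fields, which is what you want for the optical duplicate check.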
I'm going to do some testing on the OPTICAL_DUPLICATE_PIXEL_DISTANCE value, but I'd like to hear whether anyone is using a value other than the 2500 above.
I used some test HiSeq X data and tried a number of pixel distances. A distance of about 2500 does seem to be the saturation point for identifying duplicates as optical.
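To illustrate what such a sweep is measuring, here is a rough Python sketch. It assumes a simplified rule: two duplicate reads count as optical if they sit on the same tile and differ by at most the pixel distance in both x and y. Picard's real check also takes read group and orientation into account, so treat this as an approximation. The coordinates in toy_pairs are made up for illustration and are not real HiSeq X data; is_optical_pair and count_optical are hypothetical helper names.

```python
def is_optical_pair(a, b, max_dist):
    """a and b are (tile, x, y) tuples.

    Simplified assumption: optical means same tile and both
    |dx| and |dy| are within max_dist.
    """
    same_tile = a[0] == b[0]
    close_x = abs(a[1] - b[1]) <= max_dist
    close_y = abs(a[2] - b[2]) <= max_dist
    return same_tile and close_x and close_y


def count_optical(pairs, max_dist):
    """Count how many duplicate pairs would be called optical."""
    return sum(is_optical_pair(a, b, max_dist) for a, b in pairs)


# Made-up duplicate pairs, for illustration only.
toy_pairs = [
    ((1115, 17188, 49513), (1115, 17200, 49600)),  # dx=12, dy=87
    ((1115, 17188, 49513), (1115, 18900, 50100)),  # dx=1712, dy=587
    ((1115, 17188, 49513), (1116, 17190, 49515)),  # different tile
    ((2101, 3000, 4000), (2101, 9000, 4100)),      # dx=6000, dy=100
]

# Sweep a few candidate pixel distances and report the optical count.
for dist in (100, 500, 2500, 5000):
    print(dist, count_optical(toy_pairs, dist))
```

On real data the equivalent is to re-run MarkDuplicates at each distance and compare the READ_PAIR_OPTICAL_DUPLICATES column in the metrics file; the point where that count stops growing is the saturation point.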