Question

proximal duplicate parameters - HiSeq X

7.2 years ago

Richard ▴ 590

Hi all,

We are doing some QC on our HiSeq X genome data and see a higher duplicate rate than for the same libraries sequenced on other platforms. This is a known phenomenon, as mentioned in this post on the GATK website:

http://gatkforums.broadinstitute.org/gatk/discussion/6747/how-to-mark-duplicates-with-markduplicates-or-markduplicateswithmatecigar

However, we're trying to work something out on our end, and I find myself wondering whether these Picard MarkDuplicates parameters are appropriate for data from the HiSeq X:

READ_NAME_REGEX=[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500

example read name: HISEQX2_43:2:1115:17188:49513

I'm going to do some testing on the OPTICAL_DUPLICATE_PIXEL_DISTANCE value, but I'd like to hear whether anyone is using a value other than the 2500 above.
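To make the two parameters concrete, here is a minimal Python sketch of what they are meant to do. It applies the READ_NAME_REGEX above to the example read name and then compares two reads by tile and pixel coordinates. The helper names (`parse_location`, `is_optical_pair`) are hypothetical, and the comparison rule (same tile, x and y each within the pixel distance) is my understanding of how Picard uses the three capture groups, not something taken from its source.

```python
import re

# Pattern from the question. The three capture groups are assumed to be
# read by Picard as tile, x and y, in that order.
READ_NAME_REGEX = re.compile(
    r"[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
)


def parse_location(read_name):
    """Return (tile, x, y) as integers, or None if the name does not match."""
    match = READ_NAME_REGEX.fullmatch(read_name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


def is_optical_pair(name_a, name_b, pixel_distance=2500):
    """Hypothetical check: same tile, and x and y each within pixel_distance.

    The lane is not captured by this regex, so both reads are assumed to
    come from the same lane / read group.
    """
    loc_a = parse_location(name_a)
    loc_b = parse_location(name_b)
    if loc_a is None or loc_b is None:
        return False
    same_tile = loc_a[0] == loc_b[0]
    close_x = abs(loc_a[1] - loc_b[1]) <= pixel_distance
    close_y = abs(loc_a[2] - loc_b[2]) <= pixel_distance
    return same_tile and close_x and close_y


print(parse_location("HISEQX2_43:2:1115:17188:49513"))
```

The example read name parses cleanly with this pattern, so the regex itself looks fine for HiSeq X names of this form; the open question is only the distance value.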

duplicates proximal • 2.0k views

updated 7.2 years ago by GenoMax 141k • written 7.2 years ago by Richard ▴ 590

I ran some test HiSeq X data through a range of pixel distances. A distance of about 2500 does seem to be the saturation point for identifying duplicates as optical.

Reply 7.2 years ago by Richard ▴ 590

score 0 · Answer 1 · 2017-02-14

I suggest you give clumpify.sh from BBMap a try to identify duplicates (optical, PCR and other kinds). There is a detailed thread available on SeqAnswers. clumpify.sh works directly on raw reads, so you do not need aligned data. You can use the parameters spantiles=f dupedist=2500 (which work for HiSeq 4000 and should be appropriate for HiSeq X).
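As a rough sketch of how those parameters fit into a full call, the Python snippet below only assembles and prints the command line; it does not run anything. The file names are placeholders, the helper `clumpify_command` is hypothetical, and the `dedupe` and `optical` switches are my assumption about how you would restrict removal to optical duplicates, so check them against the clumpify.sh help text before use.

```python
import shlex


def clumpify_command(in_path, out_path, dupedist=2500, spantiles=False):
    """Build a clumpify.sh argument list (hypothetical helper; runs nothing)."""
    return [
        "clumpify.sh",
        f"in={in_path}",
        f"out={out_path}",
        "dedupe",    # assumed switch: enable duplicate removal
        "optical",   # assumed switch: restrict to optical duplicates
        f"spantiles={'t' if spantiles else 'f'}",
        f"dupedist={dupedist}",
    ]


# Placeholder file names; substitute your own FASTQ paths.
cmd = clumpify_command("reads.fq.gz", "clumped.fq.gz")
print(shlex.join(cmd))
```

Dropping the `optical` switch should make it consider all duplicate types rather than only those that are close together on the flowcell.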