proximal duplicate parameters - HiSeq X
7.2 years ago
Richard ▴ 590

Hi all,

We are doing some QC on our HiSeq X genome data. We see a higher duplicate rate than for the same libraries sequenced on other platforms. This is a known phenomenon, as mentioned in this post on the GATK website:

http://gatkforums.broadinstitute.org/gatk/discussion/6747/how-to-mark-duplicates-with-markduplicates-or-markduplicateswithmatecigar

However, we're trying to work something out on our end and I find myself wondering if these Picard MarkDuplicates parameters are appropriate for data from the HiSeqX:

READ_NAME_REGEX=[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500

example read name: HISEQX2_43:2:1115:17188:49513
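
As a quick sanity check, the regex above can be run against the example read name outside of Picard. This is a minimal sketch in Python, assuming the three capture groups are read as tile, x and y in that order (as the Picard documentation describes for READ_NAME_REGEX). The helper name `parse_read_name` is made up for illustration, and Python's `re` is used here in place of the Java regex engine Picard runs on.

```python
import re

# Same pattern as the READ_NAME_REGEX in the post.
READ_NAME_REGEX = re.compile(
    r"[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
)


def parse_read_name(name):
    """Return (tile, x, y) as integers, or None if the name does not match.

    Hypothetical helper: assumes the capture groups are tile, x, y in order.
    """
    match = READ_NAME_REGEX.fullmatch(name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


print(parse_read_name("HISEQX2_43:2:1115:17188:49513"))
```

For the example read name this yields tile 1115, x 17188 and y 49513, so the pattern does pick up the three fields it needs from this naming scheme.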

I'm going to do some testing on the OPTICAL_DUPLICATE_PIXEL_DISTANCE value, but I'd like to hear whether anyone is using a value other than the 2500 shown above.

duplicates proximal • 2.0k views

I tried a number of pixel distances on some test HiSeq X data. A distance of about 2500 does seem to be the saturation point for identifying duplicates as optical.
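
To make the idea of sweeping the pixel distance concrete, here is a rough sketch in Python. It assumes a simplified criterion: two reads in the same duplicate set count as an optical pair when they are on the same tile and both their x and y offsets are within the chosen distance. That is my understanding of what OPTICAL_DUPLICATE_PIXEL_DISTANCE controls, not a copy of Picard's implementation. The function names and the coordinates are made up for illustration and are not real HiSeq X data.

```python
from itertools import combinations


def is_optical_pair(a, b, max_distance):
    """a and b are (tile, x, y) tuples for two reads in one duplicate set."""
    tile_a, x_a, y_a = a
    tile_b, x_b, y_b = b
    return (
        tile_a == tile_b
        and abs(x_a - x_b) <= max_distance
        and abs(y_a - y_b) <= max_distance
    )


def count_optical_pairs(positions, max_distance):
    """Count read pairs that would be called optical at this distance."""
    return sum(
        is_optical_pair(a, b, max_distance)
        for a, b in combinations(positions, 2)
    )


# Hypothetical duplicate set: three reads on tile 1115, one on tile 1116.
duplicate_set = [
    (1115, 17188, 49513),
    (1115, 17200, 49600),
    (1115, 18900, 50900),
    (1116, 17190, 49515),
]

# Sweep the distance and watch where the count stops growing.
for distance in (100, 1000, 2500, 5000):
    print(distance, count_optical_pairs(duplicate_set, distance))
```

On real data the same kind of sweep, done by rerunning MarkDuplicates with different values and comparing the optical duplicate counts in the metrics file, is what shows where the count levels off.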

7.2 years ago
GenoMax 141k

I would suggest giving clumpify.sh from BBMap a try to identify duplicates (optical, PCR and other kinds). There is a detailed thread available on SeqAnswers. clumpify.sh works on unaligned reads, so you do not need to align your data first. You can use the spantiles=f dupedist=2500 parameters (which work for HiSeq 4K and should be appropriate for HiSeq X).
