Question: proximal duplicate parameters - HiSeq X
2.1 years ago
Richard wrote:

Hi all,

We are doing some QC on our HiSeq X genome data. We see a higher rate of duplicates than for the same libraries sequenced on other platforms. This is a known phenomenon, as mentioned in this post on the GATK website:

However, we're trying to work something out on our end and I find myself wondering if these Picard MarkDuplicates parameters are appropriate for data from the HiSeqX:

READ_NAME_REGEX=[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500

example read name: HISEQX2_43:2:1115:17188:49513
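To make the regex concrete, here is a minimal Python sketch of how its three capture groups could be read as tile, x and y for the example read name, plus a toy proximity check. The helper names `parse_location` and `within_pixel_distance` are made up for illustration, and the check is a simplification (same tile, x and y each within the pixel distance), not Picard's actual code.

```python
import re

# Same pattern as the READ_NAME_REGEX value above.
# The three capture groups are interpreted as tile, x, y (in that order).
READ_NAME_REGEX = re.compile(
    r"[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
)


def parse_location(read_name):
    """Return (tile, x, y) as ints, or None if the name does not match."""
    match = READ_NAME_REGEX.fullmatch(read_name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def within_pixel_distance(name_a, name_b, max_dist=2500):
    """Toy check: same tile, and x and y each differ by at most max_dist."""
    loc_a = parse_location(name_a)
    loc_b = parse_location(name_b)
    if loc_a is None or loc_b is None:
        return False
    same_tile = loc_a[0] == loc_b[0]
    close_x = abs(loc_a[1] - loc_b[1]) <= max_dist
    close_y = abs(loc_a[2] - loc_b[2]) <= max_dist
    return same_tile and close_x and close_y


print(parse_location("HISEQX2_43:2:1115:17188:49513"))
# (1115, 17188, 49513)
```

Note that the lane field (the `2` in the example name) is matched but not captured by this regex.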

I'm going to do some testing on the OPTICAL_DUPLICATE_PIXEL_DISTANCE value, but I'd like to hear whether anyone uses a value other than the 2500 above.

I used some test HiSeq X data and tried a number of pixel distances. It does seem that a distance of about 2500 is roughly the saturation point for identifying duplicates as optical.
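For anyone wanting to repeat this kind of test, a rough sketch of how the sweep could be scripted is below. It only builds and prints the MarkDuplicates command lines. The file names (`sample.bam`, `picard.jar`, the output names) and the list of distances are placeholders I made up, not the values used in the test above.

```python
import shlex

# Regex taken from the question; everything else below is a placeholder.
READ_NAME_REGEX = "[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"


def markduplicates_command(distance, bam="sample.bam", picard_jar="picard.jar"):
    """Build one Picard MarkDuplicates command for a given pixel distance."""
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={bam}",
        f"O=marked_{distance}.bam",      # hypothetical output name
        f"M=metrics_{distance}.txt",     # hypothetical metrics file name
        f"READ_NAME_REGEX={READ_NAME_REGEX}",
        f"OPTICAL_DUPLICATE_PIXEL_DISTANCE={distance}",
    ]


# Arbitrary example distances; choose a range that brackets 2500.
for distance in (100, 500, 1000, 2500, 5000):
    # shlex.join quotes the regex so the line can be pasted into a shell.
    print(shlex.join(markduplicates_command(distance)))
```

Comparing the READ_PAIR_OPTICAL_DUPLICATES column across the resulting metrics files should show where the count stops growing.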

2.1 years ago
United States
genomax wrote:

I suggest that you try the tool from BBMap for identifying duplicates (optical, PCR, and other kinds). There is a detailed thread available on SeqAnswers. With it, you do not need aligned data (or need to align your data). You can use the parameters spantiles=f dupedist=2500 (which work for HiSeq 4K and should be appropriate for HiSeq X).

