Optical duplicates?
22 months ago
abascalfederico ★ 1.2k

While investigating a weird mutation-calling artefact, I found that a considerable fraction of those artefacts (20-30%, a very rough estimate) share certain similarities in their coordinates/tiles. We are using a conservative threshold of 2500 pixels to flag optical duplicates on NovaSeq S4 flow cells. The following examples are further apart than 2500 pixels, but they show striking similarities (only showing lane:tile:x:y). A separation of 1000 in tile numbers is very frequent... Update: I just read that the thousands digit (1 or 2) indicates whether it is "top" or "bottom" in the tile (not sure what that means).

3:1338:9489:28416
3:1338:9489:12195

4:1308:18385:15890
4:2308:17861:17644

3:2630:7835:29684
3:1630:10818:30624
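
For anyone who wants to check such pairs themselves, here is a minimal sketch that parses the lane:tile:x:y strings above and reports the surface digit and the straight-line distance in raw coordinate units. The names `ReadPos`, `parse_pos` and `distance` are made up for this illustration, and it assumes the thousands digit of the tile number is the surface (1 = top, 2 = bottom). A distance computed across two surfaces is only a comparison of raw x:y values, not a physical distance.

```python
import math
from typing import NamedTuple


class ReadPos(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int

    @property
    def surface(self) -> int:
        # Assumption: thousands digit of the tile number (1 = top, 2 = bottom).
        return self.tile // 1000


def parse_pos(text: str) -> ReadPos:
    """Parse a 'lane:tile:x:y' string into a ReadPos."""
    lane, tile, x, y = (int(field) for field in text.split(":"))
    return ReadPos(lane, tile, x, y)


def distance(a: ReadPos, b: ReadPos) -> float:
    """Euclidean distance between the raw x:y coordinates of two reads."""
    return math.hypot(a.x - b.x, a.y - b.y)


pairs = [
    ("3:1338:9489:28416", "3:1338:9489:12195"),
    ("4:1308:18385:15890", "4:2308:17861:17644"),
    ("3:2630:7835:29684", "3:1630:10818:30624"),
]

for first, second in pairs:
    a, b = parse_pos(first), parse_pos(second)
    print(first, second,
          "surfaces:", a.surface, b.surface,
          "distance: %.1f" % distance(a, b))
```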


Does anyone have an idea of what may be going on?

duplicate optical • 2.2k views

2500 may be too low a number for NovaSeq. That number is more appropriate for HiSeq 3K/4K. @Brian Bushnell recommends 12000 for NovaSeq. As I recall, X:Y coordinates do not directly translate to pixels. I think Illumina was not willing to give out the mapping information.


Thank you! We'll increase to 12000. How much higher can one go?

In many cases I see reads that are closer than 2500 (our threshold) but on different surfaces of the flow cell. They must be the same cluster. However, biobambam at least does not flag them if they are on different surfaces. Any idea what's going on here? Is that how other OD flagging programs work too?

If anyone can recommend a reference or post on the optimal OD thresholds that would be much appreciated too.
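
To make the pattern concrete, here is a rough sketch of the check I have in mind. This is not how biobambam or any other tool works; `split_tile` and `is_cross_surface_candidate` are hypothetical helpers. It assumes the thousands digit of the tile number is the surface and the remaining three digits are the position on that surface, and it compares raw x:y values as if both surfaces shared a coordinate frame, which may not hold.

```python
import math


def split_tile(tile):
    """Return (surface, position) for a 4-digit tile number such as 2308.

    Assumption: thousands digit = surface, last three digits = position.
    """
    return tile // 1000, tile % 1000


def is_cross_surface_candidate(a, b, max_dist=2500):
    """a and b are (lane, tile, x, y) tuples.

    True when both reads share lane and tile position, sit on different
    surfaces, and their raw x:y coordinates are within max_dist.
    """
    lane_a, tile_a, xa, ya = a
    lane_b, tile_b, xb, yb = b
    surf_a, pos_a = split_tile(tile_a)
    surf_b, pos_b = split_tile(tile_b)
    return (
        lane_a == lane_b
        and pos_a == pos_b
        and surf_a != surf_b
        and math.hypot(xa - xb, ya - yb) <= max_dist
    )


# Second example pair from the original post.
print(is_cross_surface_candidate((4, 1308, 18385, 15890),
                                 (4, 2308, 17861, 17644)))
```

With `max_dist=2500` the third pair from the original post is not picked up; it only shows up with a larger threshold.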


> How much higher can one go?

Devon Ryan did some empirical testing of this over at SeqAnswers. It probably does not make sense to go much higher.

> but on different surfaces of the flow cell. They must be the same cluster.

If the reads are on different surfaces then they are unlikely to be from the same cluster. Illumina likes to call these cluster duplicates (identical sequences in nearby wells). Just to be sure, are you using clumpify.sh from the BBMap suite for this analysis?


No, we are not using clumpify.sh. We are using biobambam.

Umm... I don't think the same read is on both surfaces, but maybe the machine is seeing the cluster from both sides? It is really a pattern I'm seeing repeatedly.

Reading that post, it is interesting that saturation isn't reached until 20000. I wonder how many reads you would lose with 20000, depending on your duplicate rate... I may try to reproduce Devon's analysis.
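
Something along these lines is what I would try for a threshold sweep. It is only a toy sketch and not Devon's actual method: `count_flagged` is a made-up helper, the coordinates are invented, and it assumes reads have already been grouped so that each group holds the (x, y) positions of reads on the same tile that would count as duplicates of each other.

```python
import math


def count_flagged(groups, threshold):
    """Count reads that would be flagged as optical duplicates.

    groups: list of lists of (x, y) tuples; each inner list holds reads
    from one tile that are duplicates of each other by sequence/position.
    Within a group, a read is flagged if it lies within `threshold` of a
    read that has already been kept.
    """
    flagged = 0
    for positions in groups:
        kept = []
        for x, y in positions:
            if any(math.hypot(x - kx, y - ky) <= threshold for kx, ky in kept):
                flagged += 1
            else:
                kept.append((x, y))
    return flagged


# Invented toy data, for illustration only.
toy_groups = [
    [(100, 100), (150, 2000), (9000, 9000)],
    [(0, 0), (0, 11000)],
]

for threshold in (2500, 12000, 20000):
    print(threshold, count_flagged(toy_groups, threshold))
```

On real data the interesting part is where the count stops growing as the threshold increases.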


The number above was specifically meant to be used with clumpify.sh from the BBMap suite. I don't know how that will translate to biobambam.

> but the machine may be seeing the cluster from both sides?

I doubt that. My understanding is that the imaging should be precise with a laser doing the scanning.


Just want to confirm that I've observed an increased number of duplicates on opposite surfaces, consistently across three sequencers in two facilities (iSeq, NextSeq, NovaSeq). See the picture, which shows the log2 of the number of duplicates between different tiles of a flow cell.