I've come across this picture from an Illumina presentation explaining how duplicate reads can arise. The two types of optical duplicates and the PCR duplicates are pretty clear to me, but the lower right corner has me confused. Could anybody explain what these "sister" duplicates are?
On a related note, what happens to the second strand when the fragments are denatured with 2M base during cluster generation? Does it just fail to attach to the flow cell and get washed out later?
I've identified another type of duplicate not on this chart: tile-edge duplicates. These appear to be caused by the camera not sliding over far enough when moving to a new tile. In my analysis of NextSeq data, these account for the vast majority of duplicates (>80%). The picture shows red dots for unique clusters and blue dots for duplicate clusters, and the X and Y axes are flowcell coordinates. All 13000 duplicate pairs are plotted for the first 9 tiles of this NextSeq run; the non-duplicates were subsampled to 13000 before plotting. These duplicates were detected via Clumpify rather than mapping.
Note that in this image, all 9 tiles are superimposed on top of each other (the X/Y coordinates get reset for each tile). Also, from what I have read, the NextSeq runs 3 tiles wide on a lane. So, I would expect duplicates on both sides of the middle column tiles; right side only for the left tiles; and left side only for the right tiles. Thus, the presence of nonduplicate reads on the edges (at a much lower rate than elsewhere) is probably the contribution from left and right tiles.
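To check whether duplicates in your own run pile up at the tile edges, one rough approach is to pull the tile coordinates out of the read names and measure how many duplicates sit close to an X edge. The sketch below is only an illustration, not how Clumpify works: it assumes the colon-separated Illumina read-name layout instrument:run:flowcell:lane:tile:x:y, and the function names, the tile bounds `x_min`/`x_max`, and the `margin` are hypothetical values you would have to estimate from your own data (for example from the observed coordinate range).

```python
def parse_coords(read_name):
    """Return (lane, tile, x, y) from an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y, optionally
    preceded by '@' and followed by a space-separated comment.
    """
    fields = read_name.lstrip("@").split()[0].split(":")
    lane, tile = fields[3], fields[4]
    x, y = int(fields[5]), int(fields[6])
    return lane, tile, x, y


def edge_fraction(read_names, x_min, x_max, margin):
    """Fraction of reads whose X coordinate is within `margin` of either X edge.

    x_min, x_max and margin are assumptions supplied by the caller; they are
    in the same units as the coordinates in the read names.
    """
    near = total = 0
    for name in read_names:
        _, _, x, _ = parse_coords(name)
        total += 1
        if x - x_min <= margin or x_max - x <= margin:
            near += 1
    return near / total if total else 0.0
```

Running `edge_fraction` once on the names of the duplicate reads and once on the non-duplicates, with the same bounds and margin, gives two numbers to compare; a much higher fraction for the duplicates would be consistent with the tile-edge pattern described above.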
Usually, the molecules sent for sequencing are double-stranded DNA. But in the procedure for loading the sequencer, the DNA is denatured. Therefore, for one double-stranded molecule sent for sequencing, two reverse-complementary molecules are loaded in the machine. Each of them can form a cluster, and these clusters will be indistinguishable. Thus, even in libraries prepared without any amplification step, the sequencing results can produce two identical reads. If the goal of the sequencing is to count molecules (transcriptome sequencing, single-cell epigenome analysis, ...), it is better to count these "sister" sequences as one.
When I first wrote a duplicate-removal program, I was worried about the possible presence of "sister" duplicates, so I designed it to detect both identical and reverse-complementary duplicates. My concern was not that a double-stranded fragment might spawn two clusters, one from each strand - in fact, I doubt that's likely, since I assume that when DNA is fragmented, the two strands will generally break independently. Rather, I was concerned that during PCR, a single fragment might spawn both identical and reverse-complementary fragments, so both would need to be removed.
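For illustration, here is a minimal sketch of how identical and reverse-complementary duplicates can be detected together. It is not the actual program described above: it assumes single-end reads and exact matches only, and the function names are made up for this example. Each read is reduced to a strand-independent key (the lexicographically smaller of the sequence and its reverse complement), and later reads with the same key are classified by whether they match the first-seen read directly or only after reverse-complementing.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_key(seq):
    """Strand-independent key: the smaller of seq and its reverse complement."""
    return min(seq, reverse_complement(seq))


def classify_duplicates(reads):
    """Split reads into unique, identical-duplicate and revcomp-duplicate names.

    reads: iterable of (name, sequence) tuples.
    The first read seen for each canonical key is kept as the unique one.
    """
    first_seen = {}  # canonical key -> sequence of the first read with that key
    unique, identical, revcomp = [], [], []
    for name, seq in reads:
        seq = seq.upper()
        key = canonical_key(seq)
        if key not in first_seen:
            first_seen[key] = seq
            unique.append(name)
        elif first_seen[key] == seq:
            identical.append(name)
        else:
            revcomp.append(name)
    return unique, identical, revcomp
```

Comparing the sizes of the `identical` and `revcomp` lists gives the two rates separately, and keeping only the `unique` names removes both kinds at once. Real data would additionally need tolerance for sequencing errors and handling of read pairs, which this sketch leaves out.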
However, in testing, I found that the rate of identical fragments was high, and the rate of reverse-complementary fragments was extremely low - a rate that could be completely accounted for by coincidence. So, I don't worry about it any more. Though I should mention that I don't really understand why this is the case - for a PCR-amplified library it still seems to me that reverse-complementary duplicates should be present. Perhaps someone else has some insight here?
I have no idea what all these colours are about, but I think they're saying that it is possible for both strands from a single DNA molecule to form separate clusters on the flowcell. In the Illumina "how it's made" videos there's always only 1 strand of DNA annealing to the adapter lawn and kicking off bridge amplification. I guess in reality the other strand could also anneal, and thus 1 DNA molecule could form 2 clusters that would appear to be PCR duplicates, but actually are just "sisters".
Yeah, I'm not sure how serious an issue that is. Perhaps people with barcoded, non-amplified samples saw the same fragment sequenced twice and freaked out, and then it was realised that this is an actual thing that happens. Otherwise you wouldn't be able to tell it from standard PCR duplication.