I've come across this picture from Illumina presentation explaining how duplicate reads can happen. Two types of optical duplicates and PCR duplicates are pretty clear to me, but lower right corner has me confused. Could anybody explain what are these "sister" duplicates?
On a related note, what happens to the second strand when during the cluster generation the fragments are denatured with 2M base? Does it just not attach to the flow cell and is later washed out?
I've identified another type of duplicates not on this chart: tile-edge duplicates. These appear to be caused by the camera not sliding over far enough when moving to a new tile. In my analysis of NextSeq data, these account for the vast majority of duplicates (>80%). The picture shows red dots for unique clusters and blue dots for duplicate clusters, and the X/Y axis are flowcell coordinates. All 13000 duplicate pairs are plotted for the first 9 tiles of this NextSeq run; the non-duplicates were subsampled to 13000 before plotting. These duplicates were detected via Clumpify rather than mapping.
Note that in this image, all 9 tiles are superimposed on top of each other (the X/Y coordinates get reset for each tile). Also, from what I have read, the NextSeq runs 3 tiles wide on a lane. So, I would expect duplicates on both sides of the middle column tiles; right side only for the left tiles; and left side only for the right tiles. Thus, the presence of nonduplicate reads on the edges (at a much lower rate than elsewhere) is probably the contribution from left and right tiles.
Usually, the molecules sent for sequencing are double-stranded DNA. But in the procedure for loading the sequencer, the DNA is denatured. Therefore, for one double-stranded molecule sent for sequencing, two reverse-complementary molecules are loaded in the machine. Each of them can form a cluster, and these clusters will be indistinguishable. Thus, even in libraries prepared without any amplification step, the sequencing results can produce two identical reads. If the goal of the sequencing is to count molecules (transcriptome sequencing, single-cell epigenome analysis, ...), it is better to count these "sister" sequences as one.
When I first wrote a duplicate-removal program, I was worried about the possible presence of "sister" duplicates, so I designed it to detect both identical and reverse-complementary duplicates. I was not concerned about the presence of these duplicates because DNA is double-stranded and thus a fragment might spawn two clusters - in fact, I doubt that's likely. I assume that when DNA is fragmented, the two strands will generally break independently. Rather, I was concerned that during PCR, a single fragment might spawn both identical and reverse-complementary fragments, so both would need to be removed.
However, in testing, I found that the rate of identical fragments was high, and the rate of reverse-complementary fragments was extremely low - a rate that could be completely accounted for by coincidence. So, I don't worry about it any more. Though I should mention that I don't really understand why this is the case - for a PCR-amplified library it still seems to me that reverse-complementary duplicates should be present. Perhaps someone else has some insight here?
I have no idea what all these colours are about, but i think they're saying that it is possible for both strands from a single DNA molecule to form separate clusters on the flowcell. In the Illumina "how its made" videos there's always only 1 strand of DNA annealing to the adapter lawn and kicking off bridge amplification. I guess in reality the other strand could also anneal, and thus 1 DNA molecule could form 2 clusters that would appear to be PCR duplicates, but actually are just "sisters".
Yeah i'm not sure how serious an issue that is. Perhaps people with barcoded, non-amplified samples saw the same fragment sequenced twice and freaked out, then it was realised that this is an actual thing that happens. Otherwise you wouldn't be able to tell it from standard PCR duplication.
Does not it mean, that if you have double stranded (complement strands), when each strand is create independent cluster position. So after sequencing you have the same data set but in reverse/complement orientations, but after alignment to reference is the same starting position = flag as duplicates?
Well I don't think something like Picard markduplicates would mark it as a duplicate if sequence is the same but the strand flag is different. Need to check it to be sure.
Correct - reads that map to the same position but on opposite strands are not marked as duplicates by Picard, nor are they removed by SAMtools rmdup command. But those reads would have to be from independent library molecules (i.e., not duplicates). Read one always proceeds from the same adapter end (I believe it's P5) so, even if sister clusters were formed from the two strands of a single library molecule, their orientations relative to P5 would be identical.
I don't get it. You would have P5 on one strand and P5's complement on the other, would you not? And the complement should not be able to attach to the flow cell.
You're conflating two issues. My point was that, if two reads have opposite orientations, they cannot be duplicates of any type (PCR, sister, or optical) b/c all inserts are sequenced relative to the same adapter end and strand.
Regarding annealing, the flow cell contains both P5 and P7 anchor primers. These serve as PCR primers for bridge amplification (aka clustering). If one strand of the library anneals to P5 (i.e., is the reverse complement), then the other end of the other strand MUST be the reverse complement of P7 (draw a diagram if this is unclear). But one of the two anchor primers is blocked during annealing (see discussion below), which is why sister duplicates are not produced.
Got it, thank you for the explanation.
A blog post has appeared discussing HiSeq 4000 duplicates from Babraham group: https://sequencing.qcfail.com/articles/illumina-patterned-flow-cells-generate-duplicated-sequences
This has also been discussed extensively in this SeqAnswers thread: http://seqanswers.com/forums/showthread.php?t=72901