Duplicates on Illumina
4
36
Entering edit mode
7.7 years ago
predeus ★ 2.0k

Hello all,

I've come across this picture from Illumina presentation explaining how duplicate reads can happen. Two types of optical duplicates and PCR duplicates are pretty clear to me, but lower right corner has me confused. Could anybody explain what are these "sister" duplicates?

On a related note, what happens to the second strand when during the cluster generation the fragments are denatured with 2M base? Does it just not attach to the flow cell and is later washed out?

Thank you in advance.

enter image description here

duplicates illumina • 35k views
ADD COMMENT
1
Entering edit mode

Does not it mean, that if you have double stranded (complement strands), when each strand is create independent cluster position. So after sequencing you have the same data set but in reverse/complement orientations, but after alignment to reference is the same starting position = flag as duplicates?

ADD REPLY
0
Entering edit mode

Well I don't think something like Picard markduplicates would mark it as a duplicate if sequence is the same but the strand flag is different. Need to check it to be sure.

ADD REPLY
2
Entering edit mode

Correct - reads that map to the same position but on opposite strands are not marked as duplicates by Picard, nor are they removed by SAMtools rmdup command. But those reads would have to be from independent library molecules (i.e., not duplicates). Read one always proceeds from the same adapter end (I believe it's P5) so, even if sister clusters were formed from the two strands of a single library molecule, their orientations relative to P5 would be identical.

ADD REPLY
0
Entering edit mode

I don't get it. You would have P5 on one strand and P5's complement on the other, would you not? And the complement should not be able to attach to the flow cell.

ADD REPLY
4
Entering edit mode

You're conflating two issues. My point was that, if two reads have opposite orientations, they cannot be duplicates of any type (PCR, sister, or optical) b/c all inserts are sequenced relative to the same adapter end and strand.

Regarding annealing, the flow cell contains both P5 and P7 anchor primers. These serve as PCR primers for bridge amplification (aka clustering). If one strand of the library anneals to P5 (i.e., is the reverse complement), then the other end of the other strand MUST be the reverse complement of P7 (draw a diagram if this is unclear). But one of the two anchor primers is blocked during annealing (see discussion below), which is why sister duplicates are not produced.

ADD REPLY
0
Entering edit mode

Got it, thank you for the explanation.

ADD REPLY
0
Entering edit mode

A blog post has appeared discussing HiSeq 4000 duplicates from Babraham group: https://sequencing.qcfail.com/articles/illumina-patterned-flow-cells-generate-duplicated-sequences

This has also been discussed extensively in this SeqAnswers thread: http://seqanswers.com/forums/showthread.php?t=72901

ADD REPLY
15
Entering edit mode
7.7 years ago

I've identified another type of duplicates not on this chart: tile-edge duplicates. These appear to be caused by the camera not sliding over far enough when moving to a new tile. In my analysis of NextSeq data, these account for the vast majority of duplicates (>80%). The picture shows red dots for unique clusters and blue dots for duplicate clusters, and the X/Y axis are flowcell coordinates. All 13000 duplicate pairs are plotted for the first 9 tiles of this NextSeq run; the non-duplicates were subsampled to 13000 before plotting. These duplicates were detected via Clumpify rather than mapping.

Note that in this image, all 9 tiles are superimposed on top of each other (the X/Y coordinates get reset for each tile). Also, from what I have read, the NextSeq runs 3 tiles wide on a lane. So, I would expect duplicates on both sides of the middle column tiles; right side only for the left tiles; and left side only for the right tiles. Thus, the presence of nonduplicate reads on the edges (at a much lower rate than elsewhere) is probably the contribution from left and right tiles.

Duplicate Cluster Coordinates

ADD COMMENT
2
Entering edit mode

I tested our brand-new second NextSeq, and it has the same issue. The pattern is very similar. Duplicate rate seems similar too at ~1.5%.

NextSeq2

ADD REPLY
1
Entering edit mode

4 of 4 NextSeqs affected across 3 countries indicates to me that this is a general NextSeq issue :(

ADD REPLY
0
Entering edit mode

I think this is related to the large cluster size (they are on the order of mm in diamter, if I recall correctly, compared to other Illumina sequencers) for NextSeq. I have only tried a couple of samples (from HiSeq 4K and MiSeq) and have not seen this.

ADD REPLY
0
Entering edit mode

Hahahah - oh boy. Fantastic research as always, but wow, what to say.

I guess they wont be marked as optical-duplicates with tools like Picard as they are thought to come from different tiles? I mean, it will detect all duplicates regardless, but detect as optical duplicates.

ADD REPLY
0
Entering edit mode

Excellent detective work. I wonder if this can get fixed with a software update...

ADD REPLY
0
Entering edit mode

I've only tested our NextSeq, so I don't know if it's universal or ours is miscalibrated, but we will be discussing it with Illumina soon. Seems like it should be an easy fix. In the meantime I'm updating BBDuk to support location filtering so we can exclude all tile-edge reads.

ADD REPLY
1
Entering edit mode

If you post your code for generating the image I'll use it to check our NextSeq.

ADD REPLY
2
Entering edit mode

I could do the same for data from our NextSeq to add more data to the story.

ADD REPLY
1
Entering edit mode

OK, starting with interleaved reads in reads.fq.gz, and using BBMap v36.79 (which I just uploaded, as it has deduplication added to Clumpify):

clumpify.sh in=reads.fq.gz out=deduped.fq.gz k=19 subs=5 dedupe allduplicates int
filterbyname.sh in=reads.fq.gz out=duplicates.fq.gz names=deduped.fq.gz include=f
reformat.sh in=duplicates.fq.gz out=duplicates#.header srt=20k
filterbyname.sh in=reads.fq.gz out=unique.fq.gz names=deduped.fq.gz include=t
reformat.sh in=unique.fq.gz out=unique#.header srt=20k

That will give you 20000 randomly-sampled duplicate read headers in "duplicates1.txt" and nonduplicates in "unique1.txt". I pasted that into Excel splitting on spaces and colons. The headers look like this:

NS500302:178:HLNGJBGXY:1:11101:20306:1857 1:N:0:GTAGAG

So the X and Y coordinates are 20306 and 1857. You can, if you want, just start with the first 2 million reads to speed things up a lot and use less memory, if you have a huge file.

ADD REPLY
3
Entering edit mode

I get a lot less of this than you do, but it's still there. This is one of 4 samples on a single NextSeq run, but they all look similar.

Duplication

ADD REPLY
2
Entering edit mode

Hi Brian Bushnell, a bit later than I planned but I made some plots...
Scatter
DensityOnX
DensityOnY

Looks like we have the same issue. Can you keep me posted about your correspondence with Illumina?

Plotting code:

ADD REPLY
0
Entering edit mode

Thanks Wouter! I'll keep you informed. Actually, I think it would be fun to post Illumina's first response:

Hi all, I reviewed the run to see if there are any issues related to a subset or all of the 6 cameras on the instrument by looking at sequence quality overall as well as per camera section to gauge the variability.

On the whole, the run itself performed well by exceeding our quality specs for a 2x150 run: 85.6% vs 75% spec. All 6 cameras are also performing to spec, the variability is 6.8%, where we would expect <20%. For more information on camera variability in general, please refer to the attached technical note, and data specific to this run can be found in the second PDF attachment.

In summary, it looks like the system is imaging as expected and does not need calibration at this time.

So - yeah. Looks like typical Illumina quality control. I did forward them some images in response that, who knows, maybe they didn't get initially... hopefully their next response will be more useful.

ADD REPLY
1
Entering edit mode

Thanks for sharing. Illumina then has a quite limited idea of "quality" and "as expected"... I'll inform those responsible for our sequencers on Monday.

ADD REPLY
1
Entering edit mode

Replicated on data generated from mid-output flow cell run: MidOutput

ADD REPLY
0
Entering edit mode

Hi Brian Bushnell, I'm not sure how the tiles are organized on mid output flow cell, but it's likely different.
Have you investigated if the edges have overrepresented sequences or only looked at duplicates?

ADD REPLY
1
Entering edit mode

I've only looked at duplicates. Note, though, that you can use BBDuk's new positional filter (xmin ymin xmax ymax flags) to select reads in a rectangular region, if you want to run FastQC on just a subset.

ADD REPLY
1
Entering edit mode

I selected the edges using BBDuk with everything left from x=2000 and everything right from x=25000 and compared the fastQC report from this set with the all the reads, and those are very similar, nothing odd.

ADD REPLY
0
Entering edit mode

Awesome. Will check it out.

ADD REPLY
0
Entering edit mode

Legally they are in the clear. NextSeq specs are listed on this page. Only criteria are % of reads with Q30 and number of reads passing quality filter. That is pretty much the norm for all Illumina sequencers.

ADD REPLY
0
Entering edit mode

I suppose fixing this would decrease the apparent output of the machine, so they're in a no-win situation :P

ADD REPLY
0
Entering edit mode

Glad it's not just me with oddly high percent duplication rates by this method :)

ADD REPLY
0
Entering edit mode

Well this method selects the same number of duplicate reads as unique reads so it's not an absolute comparison :-)

ADD REPLY
1
Entering edit mode

I'm going to look at a few more runs to judge run-to-run variability.

As an aside, I'm trying to come up with a better visualization. I'd prefer a density plot of some sort, but they keep producing edge-effects that counteract what's otherwise visually evident.

ADD REPLY
0
Entering edit mode

I was thinking of density plots on X and Y axis, expecting to see the duplicates-effect when collapsing the data to X and expecting to see nothing special when collapsing the data to the Y chromosome. But that isn't showing the desired data?

Could it be that the run you processed already contained a lot of duplicates, therefore relatively decreasing the effect of the tile-edge duplicates?

ADD REPLY
0
Entering edit mode

I've been playing around with density plots but haven't struck on anything great yet, but will keep playing around.

But yeah, this run may have had other issues. I'm generalizing my snakemake script now to more easily test this on other runs.

ADD REPLY
0
Entering edit mode

Aww I should learn snakemake. I'll look at 2 SE transcriptome sequencing run and PE exomes sequencing run...

ADD REPLY
0
Entering edit mode

Snakemake is pretty useful. I should note that this is probably total RNAseq, so a highish duplication rate is expected.

ADD REPLY
0
Entering edit mode

I've settled on binning both axes and then plotting the % duplication in each, since otherwise the read density tends to skew the density plots. This shows a nice uptick at the edges.

ADD REPLY
0
Entering edit mode

Thanks, Devon, that's really helpful. It looks like our NextSeq is miscalibrated.

ADD REPLY
2
Entering edit mode

Brian Bushnell is allowing 5 substitutions in this clumpify.sh command line (per read) so keep that in mind. Technically those are not optical duplicates.

NextSeq clusters being huge may be a completely different beast. Brian Bushnell I will take a look at this on the HiSeq 4K data we were looking at.

ADD REPLY
0
Entering edit mode

OK, thanks, I'm curious about the effect of patterned flowcells.

ADD REPLY
0
Entering edit mode

@Brian: Was the graph example above from a single sample or a sample that came from a pool?

ADD REPLY
0
Entering edit mode

It's a single sample that came from a pool. I haven't looked at any other samples or runs yet.

ADD REPLY
1
Entering edit mode

FYI, in version 36.81 of clumpify, the allduplicates option only works if groups=1 (or you use small enough files that it sticks everything in memory by default). I assume this is a bug.

ADD REPLY
0
Entering edit mode

Ah! Nice catch. I'll look into that ASAP.

ADD REPLY
1
Entering edit mode

I come across the same (?) issue in Clumpify 36.81 (unknown parameter allduplicates)

ADD REPLY
1
Entering edit mode

Sorry about that - it's fixed now in 36.82.

ADD REPLY
0
Entering edit mode

Alright, will have a look at this tomorrow (eh technically today in my time zone). I will throw in a bit of R code for the plotting ;-)

ADD REPLY
0
Entering edit mode

This is great stuff! So, are they NOT identified as optical duplicates by Picard MarkDuplicates?

I wonder if this sort of problems happen on Hiseq3000/Hiseq4000 as well.

ADD REPLY
0
Entering edit mode

I would expect them to not be identified as optical duplicates by Picard because they are on different tiles and their coordinates are different by at least 25000 pixels on the X axis. They should still be identified as duplicates, though, since they would map to the same location. The HiSeq 4000 data I have seen so far does not show this problem.

ADD REPLY
0
Entering edit mode

Has this been resolved, and if yes, how? I checked some of our recent NextSeq runs and it seems to not be a problem. Thanks!!

ADD REPLY
0
Entering edit mode

I have not heard anything about a fix, and Illumina's suggestion for a long-term solution when we discussed it with them last was simply to cut off the leftmost and rightmost X pixels from every tile; there was no indication that they planned to change anything.

I (just now) tested a NextSeq run from last week to check, and it still shows the same problem. It's possible that you are using a newer version of Illumina's software that fixed it, though. How did you test it to determine the problem appears to be solved?

ADD REPLY
0
Entering edit mode

I screen all NextSeq runs we do as part of our standard pipeline and typically find some small portion of duplicates (the one that finished last night had up to 1%, depending on the sample). So no, the issue hasn't gone away.

ADD REPLY
0
Entering edit mode

I used the following code:

seqtk mergepe R1.fastq.gz R2.fastq.gz > merged.fastq
cat merged.fastq | awk -F'[: ]' '/@NB5/ {x=$6; y=$7; read=$8; line=NR} {if (NR==line+1 && read == 1){print $1, x, y}}' > reads_positions.txt
sort reads_positions.txt > reads_positions_sorted.txt
python identify_dups.py < reads_positions_sorted.txt > read_kinds.txt
shuf read_kinds.txt > read_kinds_shuffeled.txt
head -n 2000000 read_kinds_shuffeled.txt > read_kinds_sampled.txt

Here is the python code:

I then plotted the last output file (read_kinds_sampled.txt) using the R script shared earlier in this thread.

Some of our recent NextSeq runs had samples with 140 million 2x150bp read pairs, so using mostly stream-based tools and downsampling to 2 million reads made this analysis faster.

enter image description here enter image description here enter image description here

ADD REPLY
0
Entering edit mode

I also just confirmed with the lab that both of our NextSeqs are NextSeq 500 and we use RTA V2 for basecalling.

ADD REPLY
0
Entering edit mode

Are your NextSeq's newish (recently purchased)? Older sequencers tend to have issues that may be fixed in newer iterations. Illumina may not always advertise changes they make internally to the hardware.

ADD REPLY
0
Entering edit mode

One NextSeq was new, I believe, in March 2016 and we just got a second one 1-2 months ago that was used. Not sure exactly how old that one is (I will try and find out)

ADD REPLY
0
Entering edit mode

I don't recommend downsampling when doing positional duplicate analysis. Instead, if the full dataset is too big, just use the first 2m reads instead of 2m random reads (as long as you get reads from at least 4 tiles you should have enough).

Also, note that the duplicate rate in this analysis drops at the edges of the tiles. This is because the quality decreases as the tile goes out of focus away from the center, so duplicates are harder to detect. It's useful to allow one or two mismatches when doing duplicate detection, for this reason, particularly when you expect duplicates to be at the edges. If you have 151bp reads, you may also want to trim them to 150bp first, as base 151 tends to be very low quality.

ADD REPLY
0
Entering edit mode

Hi, is it possible for you to re-port the picture? Thanks.

ADD REPLY
0
Entering edit mode

It still loads for me - perhaps need to wait a bit?

Here it is using biostars hosting

ADD REPLY
1
Entering edit mode

predeus : @Deepak is asking about an image that @Brian has posted in his answer. Unfortunately it is not longer available.

ADD REPLY
2
Entering edit mode
7.7 years ago
Charles Plessy ★ 2.9k

Usually, the molecules sent for sequencing are double-stranded DNA. But in the procedure for loading the sequencer, the DNA is denatured. Therefore, for one double-stranded molecule sent for sequencing, two reverse-complementary molecules are loaded in the machine. Each of them can form a cluster, and these clusters will be indistinguishable. Thus, even in libraries prepared without any amplification step, the sequencing results can produce two identical reads. If the goal of the sequencing is to count molecules (transcriptome sequencing, single-cell epigenome analysis, ...), it is better to count these "sister" sequences as one.

ADD COMMENT
1
Entering edit mode

The way I understand flowcell attachment works, only one strand should be able to attach and form a cluster. I might be wrong though.. so the other strand is attaching with P7?

ADD REPLY
2
Entering edit mode
7.7 years ago

When I first wrote a duplicate-removal program, I was worried about the possible presence of "sister" duplicates, so I designed it to detect both identical and reverse-complementary duplicates. I was not concerned about the presence of these duplicates because DNA is double-stranded and thus a fragment might spawn two clusters - in fact, I doubt that's likely. I assume that when DNA is fragmented, the two strands will generally break independently. Rather, I was concerned that during PCR, a single fragment might spawn both identical and reverse-complementary fragments, so both would need to be removed.

However, in testing, I found that the rate of identical fragments was high, and the rate of reverse-complementary fragments was extremely low - a rate that could be completely accounted for by coincidence. So, I don't worry about it any more. Though I should mention that I don't really understand why this is the case - for a PCR-amplified library it still seems to me that reverse-complementary duplicates should be present. Perhaps someone else has some insight here?

ADD COMMENT
2
Entering edit mode

Illumina always shows fragments clustering in one color (orientation) in illustrations. Possibly the other class of anchors are blocked (until made available during the turnaround phase)?

ADD REPLY
0
Entering edit mode

I imagine so - but as always a few reads might not get blocked every now and again. Or perhaps it's a very different situation, where the denaturing of the original template occurs after the entire flowcell has been flooded with dsDNA. I imagine if you flooded a flowcell with ssDNA, there would be far more clusters at the top of the lane than the bottom. So they probably keep things dsDNA until the flowcell is covered, then denature. This would, and i'm just guessing here, give a more even cluster formation across the lane, and sister fragments would likely also bind to the same physical location as each other. Thus, if they do form independent clusters, they're likely to become merged. Only if sisters travel a significant distance apart after denaturation can they become two independent clusters.Thats just a total guess though. Someone needs to look at the actual protocol :P

ADD REPLY
5
Entering edit mode

@John, the library is denatured with NaOH (and neutralized in hyb buffer) before loading onto the instrument. And yes, there is usually a cluster density gradient (high -> low) across the flow cell from inflow to outflow.

P.S.-I checked my notes from an old (circa 2011) Illumina presentation; as @genomax2 suggested, one of the two primers on the flow cell is blocked during library loading.

ADD REPLY
0
Entering edit mode

Ah, awesome, great info Harold - in that case the mystery is solved. I guess only a small number of sisters are seen because only a small number make it past the block, or aren't fully washed out after the block is removed.

Off-topic, but do you happen to know why flowcells can't be reused? I've always thought that adding an exonuclease to cut off the synthesised fragment at the end would essentially reset a flowcell to being just a lawn of adapters as before. The exo to do that is used in step 3 or 4 of the protocol (after bridge amplification in the first round of sequencing to get everything homogenous 5' -> 3'), so i'm surprised i haven't heard stories of people at least trying it out, particularly since flowcells are $25,000 each or whathaveyou.

ADD REPLY
3
Entering edit mode

Flow cells aren't quite THAT expensive (more like $6K for the current PE version, and that includes reagents). Plus, we all know that the real expense of NGS is the bioinformaticist's time ;-).

The flow cells can't be reused because the anchor primers are cleaved after clustering. Both of the anchor primers are chemically modified to allow differential cleavage. After initial library loading/clustering, primer A is cleaved in the middle so that only the strand tethered to primer B remains as a template for sequencing. Upon completion of read 1 and index reading, the stub of primer A is used to synthesize the complement of the primer B-tethered strand. Then, primer B is cleaved, leaving only the newly synthesized strand as template for read 2.

ADD REPLY
2
Entering edit mode

Ah, see, this is the difference between watching the videos, and knowing the science. Thanks again Harold :) Also, word to the wise, do not Google Image search "differential cleavage" in a Starbucks. You'll get weird looks.

ADD REPLY
3
Entering edit mode

It must take into account your past browsing history; my results were completely tame :)

ADD REPLY
0
Entering edit mode

According to the supplemental material of Bentley et al., 2008, clusters in single-read flow cells were linearised with a chemical method, but the ones in paired-end read flow cells were linearised using enzymes related to base excision repair mechanisms, which I doubt will function on the anchor primers when they are still single-stranded. Thus, it might be possible to reuse the paired-end flow cells. I do not know if this still holds true on Illumina sequencers using ExAmp instead of bridge PCR to generate clusters (for instance HiSeq 4000).

ADD REPLY
0
Entering edit mode

Cleavage occurs after clustering, and the anchor primers serve as templates for the clustering PCR, so they're not single-stranded during the cleavage reactions. And whatever they use for cleavage clearly works. Unless it's some offset restriction enzyme like SapI, the cluster/cleavage cycle removes the modification from the anchor primer, so it's hard to envision a method for removing the read 2 template strand and precisely regenerating the two anchor primers for reuse.

ADD REPLY
0
Entering edit mode

Yes, the cleavage is done on the double-stranded DNA that makes the clusters. But I thought that the question was whether there would be single-stranded primers left for re-using the flow cell ? In that case, the point I want to make is that the unreacted primers in the 2008 PE/bridge-PCR design may still be pristine and available for a new clustering reaction.

ADD REPLY
0
Entering edit mode

@John's original question was whether exo (or similar) treatment would regenerate a pristine flow cell, so I responded to that question. But I can envision a few problems even in your scenario of unreacted primers:

1) Strand 2 of the first run would remain, and would be a template for read 2 of the second run, which would invariably affect base-calling of adjacent run 2 clusters (and probably S/N metrics as well, since they're calculated independently for each read). 2) The block to sister duplicate formation would be missing, so ~half of the data would be redundant. 3) The loading capacity (cluster density, number of reads) of the flow cell would be significantly reduced.

ADD REPLY
0
Entering edit mode

While that may have been true in 2008 there has been a change to this since. If I recall, now both SE/PE flowcells use a similar mechanism. @harold: does this ring a bell by any chance?

ADD REPLY
0
Entering edit mode

Indeed, I think that in sequencers like MiSeq, all flowcells are of the PE design, which also allows SE sequencing.

ADD REPLY
0
Entering edit mode

MiSeq (and I believe NextSeq) kits work for both SR and PE sequencing. But the HiSeq and Rapid kits come in both SR and PE versions, and the flow cells are not interchangeable. Of course, you could always perform SE sequencing with the PE kit but, given the substantial difference in price, why would you want to?

But this thread has drifted far from bioinformatics, so I'm signing off. Additional queries on this topic should be posted on SEQanswers.

ADD REPLY
2
Entering edit mode

Just not worth the hassle, even if you could do it. Plus illumina sells them as a part of a kit so there is no way to just buy the reagents (no third-party suppliers).

People are worried about getting cross contamination of reads from different tags in a pool and you are wanting to re-use a flowcell :)

ADD REPLY
0
Entering edit mode

Hehe, well, i'd simply call it the mystery-spike-in control ;-)

ADD REPLY
1
Entering edit mode
7.7 years ago
John 13k

I have no idea what all these colours are about, but i think they're saying that it is possible for both strands from a single DNA molecule to form separate clusters on the flowcell. In the Illumina "how its made" videos there's always only 1 strand of DNA annealing to the adapter lawn and kicking off bridge amplification. I guess in reality the other strand could also anneal, and thus 1 DNA molecule could form 2 clusters that would appear to be PCR duplicates, but actually are just "sisters".

Yeah i'm not sure how serious an issue that is. Perhaps people with barcoded, non-amplified samples saw the same fragment sequenced twice and freaked out, then it was realised that this is an actual thing that happens. Otherwise you wouldn't be able to tell it from standard PCR duplication.

ADD COMMENT
1
Entering edit mode

Shouldn't be technically possible, right? You would only have a correct P5 sequence on one strand, the other one would have a complement. That's what I am confused about.

ADD REPLY
1
Entering edit mode

I think thats just true in the marketing videos. I don't see a reason why the other wouldn't compliment on the lawn if it made it into the flow cell.

ADD REPLY

Login before adding your answer.

Traffic: 1185 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6