Question: Duplicates on Illumina
12
gravatar for predeus
7 weeks ago by
predeus320
Russia
predeus320 wrote:

Hello all,

I've come across this picture from Illumina presentation explaining how duplicate reads can happen. Two types of optical duplicates and PCR duplicates are pretty clear to me, but lower right corner has me confused. Could anybody explain what are these "sister" duplicates?

On a related note, what happens to the second strand when during the cluster generation the fragments are denatured with 2M base? Does it just not attach to the flow cell and is later washed out?

Thank you in advance.

enter image description here

illumina duplicates • 1.7k views
ADD COMMENTlink modified 7 weeks ago by Brian Bushnell9.8k • written 7 weeks ago by predeus320
1

Does not it mean, that if you have double stranded (complement strands), when each strand is create independent cluster position. So after sequencing you have the same data set but in reverse/complement orientations, but after alignment to reference is the same starting position = flag as duplicates?

ADD REPLYlink modified 7 weeks ago by Brian Bushnell9.8k • written 7 weeks ago by Paul490

Well I don't think something like Picard markduplicates would mark it as a duplicate if sequence is the same but the strand flag is different. Need to check it to be sure.

ADD REPLYlink written 7 weeks ago by predeus320
1

Correct - reads that map to the same position but on opposite strands are not marked as duplicates by Picard, nor are they removed by SAMtools rmdup command. But those reads would have to be from independent library molecules (i.e., not duplicates). Read one always proceeds from the same adapter end (I believe it's P5) so, even if sister clusters were formed from the two strands of a single library molecule, their orientations relative to P5 would be identical.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by harold.smith.tarheel3.5k

I don't get it. You would have P5 on one strand and P5's complement on the other, would you not? And the complement should not be able to attach to the flow cell.

ADD REPLYlink written 7 weeks ago by predeus320
2

You're conflating two issues. My point was that, if two reads have opposite orientations, they cannot be duplicates of any type (PCR, sister, or optical) b/c all inserts are sequenced relative to the same adapter end and strand.

Regarding annealing, the flow cell contains both P5 and P7 anchor primers. These serve as PCR primers for bridge amplification (aka clustering). If one strand of the library anneals to P5 (i.e., is the reverse complement), then the other end of the other strand MUST be the reverse complement of P7 (draw a diagram if this is unclear). But one of the two anchor primers is blocked during annealing (see discussion below), which is why sister duplicates are not produced.

ADD REPLYlink written 7 weeks ago by harold.smith.tarheel3.5k

Got it, thank you for the explanation.

ADD REPLYlink written 7 weeks ago by predeus320
9
gravatar for Brian Bushnell
7 weeks ago by
Brian Bushnell9.8k
Walnut Creek, USA
Brian Bushnell9.8k wrote:

I've identified another type of duplicates not on this chart: tile-edge duplicates. These appear to be caused by the camera not sliding over far enough when moving to a new tile. In my analysis of NextSeq data, these account for the vast majority of duplicates (>80%). The picture shows red dots for unique clusters and blue dots for duplicate clusters, and the X/Y axis are flowcell coordinates. All 13000 duplicate pairs are plotted for the first 9 tiles of this NextSeq run; the non-duplicates were subsampled to 13000 before plotting. These duplicates were detected via Clumpify rather than mapping.

Note that in this image, all 9 tiles are superimposed on top of each other (the X/Y coordinates get reset for each tile). Also, from what I have read, the NextSeq runs 3 tiles wide on a lane. So, I would expect duplicates on both sides of the middle column tiles; right side only for the left tiles; and left side only for the right tiles. Thus, the presence of nonduplicate reads on the edges (at a much lower rate than elsewhere) is probably the contribution from left and right tiles.

Duplicate Cluster Coordinates

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by Brian Bushnell9.8k
2

I tested our brand-new second NextSeq, and it has the same issue. The pattern is very similar. Duplicate rate seems similar too at ~1.5%.

NextSeq2

ADD REPLYlink written 6 weeks ago by Brian Bushnell9.8k
1

4 of 4 NextSeqs affected across 3 countries indicates to me that this is a general NextSeq issue :(

ADD REPLYlink written 6 weeks ago by Devon Ryan62k

I think this is related to the large cluster size (they are on the order of mm in diamter, if I recall correctly, compared to other Illumina sequencers) for NextSeq. I have only tried a couple of samples (from HiSeq 4K and MiSeq) and have not seen this.

ADD REPLYlink written 6 weeks ago by genomax222k

Hahahah - oh boy. Fantastic research as always, but wow, what to say.

I guess they wont be marked as optical-duplicates with tools like Picard as they are thought to come from different tiles? I mean, it will detect all duplicates regardless, but detect as optical duplicates.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by John10k

Excellent detective work. I wonder if this can get fixed with a software update...

ADD REPLYlink written 7 weeks ago by Devon Ryan62k

I've only tested our NextSeq, so I don't know if it's universal or ours is miscalibrated, but we will be discussing it with Illumina soon. Seems like it should be an easy fix. In the meantime I'm updating BBDuk to support location filtering so we can exclude all tile-edge reads.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k
1

If you post your code for generating the image I'll use it to check our NextSeq.

ADD REPLYlink written 7 weeks ago by Devon Ryan62k
2

I could do the same for data from our NextSeq to add more data to the story.

ADD REPLYlink written 7 weeks ago by WouterDeCoster13k
1

OK, starting with interleaved reads in reads.fq.gz, and using BBMap v36.79 (which I just uploaded, as it has deduplication added to Clumpify):

clumpify.sh in=reads.fq.gz out=deduped.fq.gz k=19 subs=5 dedupe allduplicates int
filterbyname.sh in=reads.fq.gz out=duplicates.fq.gz names=deduped.fq.gz include=f
reformat.sh in=duplicates.fq.gz out=duplicates#.header srt=20k
filterbyname.sh in=reads.fq.gz out=unique.fq.gz names=deduped.fq.gz include=t
reformat.sh in=unique.fq.gz out=unique#.header srt=20k

That will give you 20000 randomly-sampled duplicate read headers in "duplicates1.txt" and nonduplicates in "unique1.txt". I pasted that into Excel splitting on spaces and colons. The headers look like this:

NS500302:178:HLNGJBGXY:1:11101:20306:1857 1:N:0:GTAGAG

So the X and Y coordinates are 20306 and 1857. You can, if you want, just start with the first 2 million reads to speed things up a lot and use less memory, if you have a huge file.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k
3

I get a lot less of this than you do, but it's still there. This is one of 4 samples on a single NextSeq run, but they all look similar.

Duplication

ADD REPLYlink written 7 weeks ago by Devon Ryan62k
1

I'm going to look at a few more runs to judge run-to-run variability.

As an aside, I'm trying to come up with a better visualization. I'd prefer a density plot of some sort, but they keep producing edge-effects that counteract what's otherwise visually evident.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Devon Ryan62k

I was thinking of density plots on X and Y axis, expecting to see the duplicates-effect when collapsing the data to X and expecting to see nothing special when collapsing the data to the Y chromosome. But that isn't showing the desired data?

Could it be that the run you processed already contained a lot of duplicates, therefore relatively decreasing the effect of the tile-edge duplicates?

ADD REPLYlink written 7 weeks ago by WouterDeCoster13k

I've been playing around with density plots but haven't struck on anything great yet, but will keep playing around.

But yeah, this run may have had other issues. I'm generalizing my snakemake script now to more easily test this on other runs.

ADD REPLYlink written 7 weeks ago by Devon Ryan62k

Aww I should learn snakemake. I'll look at 2 SE transcriptome sequencing run and PE exomes sequencing run...

ADD REPLYlink written 7 weeks ago by WouterDeCoster13k

Snakemake is pretty useful. I should note that this is probably total RNAseq, so a highish duplication rate is expected.

ADD REPLYlink written 7 weeks ago by Devon Ryan62k

I've settled on binning both axes and then plotting the % duplication in each, since otherwise the read density tends to skew the density plots. This shows a nice uptick at the edges.

ADD REPLYlink written 7 weeks ago by Devon Ryan62k
1

Hi Brian Bushnell, a bit later than I planned but I made some plots...
Scatter
DensityOnX
DensityOnY

Looks like we have the same issue. Can you keep me posted about your correspondence with Illumina?

Plotting code:

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by WouterDeCoster13k

Thanks Wouter! I'll keep you informed. Actually, I think it would be fun to post Illumina's first response:

Hi all, I reviewed the run to see if there are any issues related to a subset or all of the 6 cameras on the instrument by looking at sequence quality overall as well as per camera section to gauge the variability.

On the whole, the run itself performed well by exceeding our quality specs for a 2x150 run: 85.6% vs 75% spec. All 6 cameras are also performing to spec, the variability is 6.8%, where we would expect <20%. For more information on camera variability in general, please refer to the attached technical note, and data specific to this run can be found in the second PDF attachment.

In summary, it looks like the system is imaging as expected and does not need calibration at this time.

So - yeah. Looks like typical Illumina quality control. I did forward them some images in response that, who knows, maybe they didn't get initially... hopefully their next response will be more useful.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Brian Bushnell9.8k
1

Thanks for sharing. Illumina then has a quite limited idea of "quality" and "as expected"... I'll inform those responsible for our sequencers on Monday.

ADD REPLYlink written 7 weeks ago by WouterDeCoster13k
1

Replicated on data generated from mid-output flow cell run: MidOutput

ADD REPLYlink written 6 weeks ago by WouterDeCoster13k

Hi Brian Bushnell, I'm not sure how the tiles are organized on mid output flow cell, but it's likely different.
Have you investigated if the edges have overrepresented sequences or only looked at duplicates?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by WouterDeCoster13k
1

I've only looked at duplicates. Note, though, that you can use BBDuk's new positional filter (xmin ymin xmax ymax flags) to select reads in a rectangular region, if you want to run FastQC on just a subset.

ADD REPLYlink written 6 weeks ago by Brian Bushnell9.8k
1

I selected the edges using BBDuk with everything left from x=2000 and everything right from x=25000 and compared the fastQC report from this set with the all the reads, and those are very similar, nothing odd.

ADD REPLYlink written 6 weeks ago by WouterDeCoster13k

Awesome. Will check it out.

ADD REPLYlink written 6 weeks ago by WouterDeCoster13k

Legally they are in the clear. NextSeq specs are listed on this page. Only criteria are % of reads with Q30 and number of reads passing quality filter. That is pretty much the norm for all Illumina sequencers.

ADD REPLYlink written 7 weeks ago by genomax222k

I suppose fixing this would decrease the apparent output of the machine, so they're in a no-win situation :P

ADD REPLYlink written 7 weeks ago by Devon Ryan62k

Glad it's not just me with oddly high percent duplication rates by this method :)

ADD REPLYlink written 7 weeks ago by Devon Ryan62k

Well this method selects the same number of duplicate reads as unique reads so it's not an absolute comparison :-)

ADD REPLYlink written 7 weeks ago by WouterDeCoster13k

Thanks, Devon, that's really helpful. It looks like our NextSeq is miscalibrated.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k
2

Brian Bushnell is allowing 5 substitutions in this clumpify.sh command line (per read) so keep that in mind. Technically those are not optical duplicates.

NextSeq clusters being huge may be a completely different beast. Brian Bushnell I will take a look at this on the HiSeq 4K data we were looking at.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by genomax222k

OK, thanks, I'm curious about the effect of patterned flowcells.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k

@Brian: Was the graph example above from a single sample or a sample that came from a pool?

ADD REPLYlink written 7 weeks ago by genomax222k

It's a single sample that came from a pool. I haven't looked at any other samples or runs yet.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k
1

FYI, in version 36.81 of clumpify, the allduplicates option only works if groups=1 (or you use small enough files that it sticks everything in memory by default). I assume this is a bug.

ADD REPLYlink written 7 weeks ago by Devon Ryan62k

Ah! Nice catch. I'll look into that ASAP.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k
1

I come across the same (?) issue in Clumpify 36.81 (unknown parameter allduplicates)

ADD REPLYlink written 7 weeks ago by WouterDeCoster13k
1

Sorry about that - it's fixed now in 36.82.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k

Alright, will have a look at this tomorrow (eh technically today in my time zone). I will throw in a bit of R code for the plotting ;-)

ADD REPLYlink written 7 weeks ago by WouterDeCoster13k

This is great stuff! So, are they NOT identified as optical duplicates by Picard MarkDuplicates?

I wonder if this sort of problems happen on Hiseq3000/Hiseq4000 as well.

ADD REPLYlink written 7 weeks ago by predeus320

I would expect them to not be identified as optical duplicates by Picard because they are on different tiles and their coordinates are different by at least 25000 pixels on the X axis. They should still be identified as duplicates, though, since they would map to the same location. The HiSeq 4000 data I have seen so far does not show this problem.

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k
1
gravatar for Charles Plessy
7 weeks ago by
Charles Plessy1.3k
Japan
Charles Plessy1.3k wrote:

Usually, the molecules sent for sequencing are double-stranded DNA. But in the procedure for loading the sequencer, the DNA is denatured. Therefore, for one double-stranded molecule sent for sequencing, two reverse-complementary molecules are loaded in the machine. Each of them can form a cluster, and these clusters will be indistinguishable. Thus, even in libraries prepared without any amplification step, the sequencing results can produce two identical reads. If the goal of the sequencing is to count molecules (transcriptome sequencing, single-cell epigenome analysis, ...), it is better to count these "sister" sequences as one.

ADD COMMENTlink written 7 weeks ago by Charles Plessy1.3k
1

The way I understand flowcell attachment works, only one strand should be able to attach and form a cluster. I might be wrong though.. so the other strand is attaching with P7?

ADD REPLYlink written 7 weeks ago by predeus320
1
gravatar for John
7 weeks ago by
John10k
Germany
John10k wrote:

I have no idea what all these colours are about, but i think they're saying that it is possible for both strands from a single DNA molecule to form separate clusters on the flowcell. In the Illumina "how its made" videos there's always only 1 strand of DNA annealing to the adapter lawn and kicking off bridge amplification. I guess in reality the other strand could also anneal, and thus 1 DNA molecule could form 2 clusters that would appear to be PCR duplicates, but actually are just "sisters".

Yeah i'm not sure how serious an issue that is. Perhaps people with barcoded, non-amplified samples saw the same fragment sequenced twice and freaked out, then it was realised that this is an actual thing that happens. Otherwise you wouldn't be able to tell it from standard PCR duplication.

ADD COMMENTlink written 7 weeks ago by John10k
1

Shouldn't be technically possible, right? You would only have a correct P5 sequence on one strand, the other one would have a complement. That's what I am confused about.

ADD REPLYlink written 7 weeks ago by predeus320
1

I think thats just true in the marketing videos. I don't see a reason why the other wouldn't compliment on the lawn if it made it into the flow cell.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by John10k
1
gravatar for Brian Bushnell
7 weeks ago by
Brian Bushnell9.8k
Walnut Creek, USA
Brian Bushnell9.8k wrote:

When I first wrote a duplicate-removal program, I was worried about the possible presence of "sister" duplicates, so I designed it to detect both identical and reverse-complementary duplicates. I was not concerned about the presence of these duplicates because DNA is double-stranded and thus a fragment might spawn two clusters - in fact, I doubt that's likely. I assume that when DNA is fragmented, the two strands will generally break independently. Rather, I was concerned that during PCR, a single fragment might spawn both identical and reverse-complementary fragments, so both would need to be removed.

However, in testing, I found that the rate of identical fragments was high, and the rate of reverse-complementary fragments was extremely low - a rate that could be completely accounted for by coincidence. So, I don't worry about it any more. Though I should mention that I don't really understand why this is the case - for a PCR-amplified library it still seems to me that reverse-complementary duplicates should be present. Perhaps someone else has some insight here?

ADD COMMENTlink modified 7 weeks ago • written 7 weeks ago by Brian Bushnell9.8k
2

Illumina always shows fragments clustering in one color (orientation) in illustrations. Possibly the other class of anchors are blocked (until made available during the turnaround phase)?

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by genomax222k

I imagine so - but as always a few reads might not get blocked every now and again. Or perhaps it's a very different situation, where the denaturing of the original template occurs after the entire flowcell has been flooded with dsDNA. I imagine if you flooded a flowcell with ssDNA, there would be far more clusters at the top of the lane than the bottom. So they probably keep things dsDNA until the flowcell is covered, then denature. This would, and i'm just guessing here, give a more even cluster formation across the lane, and sister fragments would likely also bind to the same physical location as each other. Thus, if they do form independent clusters, they're likely to become merged. Only if sisters travel a significant distance apart after denaturation can they become two independent clusters.Thats just a total guess though. Someone needs to look at the actual protocol :P

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by John10k
4

@John, the library is denatured with NaOH (and neutralized in hyb buffer) before loading onto the instrument. And yes, there is usually a cluster density gradient (high -> low) across the flow cell from inflow to outflow.

P.S.-I checked my notes from an old (circa 2011) Illumina presentation; as @genomax2 suggested, one of the two primers on the flow cell is blocked during library loading.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by harold.smith.tarheel3.5k

Ah, awesome, great info Harold - in that case the mystery is solved. I guess only a small number of sisters are seen because only a small number make it past the block, or aren't fully washed out after the block is removed.

Off-topic, but do you happen to know why flowcells can't be reused? I've always thought that adding an exonuclease to cut off the synthesised fragment at the end would essentially reset a flowcell to being just a lawn of adapters as before. The exo to do that is used in step 3 or 4 of the protocol (after bridge amplification in the first round of sequencing to get everything homogenous 5' -> 3'), so i'm surprised i haven't heard stories of people at least trying it out, particularly since flowcells are $25,000 each or whathaveyou.

ADD REPLYlink written 7 weeks ago by John10k
2

Just not worth the hassle, even if you could do it. Plus illumina sells them as a part of a kit so there is no way to just buy the reagents (no third-party suppliers).

People are worried about getting cross contamination of reads from different tags in a pool and you are wanting to re-use a flowcell :)

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by genomax222k

Hehe, well, i'd simply call it the mystery-spike-in control ;-)

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by John10k
2

Flow cells aren't quite THAT expensive (more like $6K for the current PE version, and that includes reagents). Plus, we all know that the real expense of NGS is the bioinformaticist's time ;-).

The flow cells can't be reused because the anchor primers are cleaved after clustering. Both of the anchor primers are chemically modified to allow differential cleavage. After initial library loading/clustering, primer A is cleaved in the middle so that only the strand tethered to primer B remains as a template for sequencing. Upon completion of read 1 and index reading, the stub of primer A is used to synthesize the complement of the primer B-tethered strand. Then, primer B is cleaved, leaving only the newly synthesized strand as template for read 2.

ADD REPLYlink written 7 weeks ago by harold.smith.tarheel3.5k
2

Ah, see, this is the difference between watching the videos, and knowing the science. Thanks again Harold :) Also, word to the wise, do not Google Image search "differential cleavage" in a Starbucks. You'll get weird looks.

ADD REPLYlink written 7 weeks ago by John10k
3

It must take into account your past browsing history; my results were completely tame :)

ADD REPLYlink written 7 weeks ago by Brian Bushnell9.8k

According to the supplemental material of Bentley et al., 2008, clusters in single-read flow cells were linearised with a chemical method, but the ones in paired-end read flow cells were linearised using enzymes related to base excision repair mechanisms, which I doubt will function on the anchor primers when they are still single-stranded. Thus, it might be possible to reuse the paired-end flow cells. I do not know if this still holds true on Illumina sequencers using ExAmp instead of bridge PCR to generate clusters (for instance HiSeq 4000).

ADD REPLYlink written 7 weeks ago by Charles Plessy1.3k

Cleavage occurs after clustering, and the anchor primers serve as templates for the clustering PCR, so they're not single-stranded during the cleavage reactions. And whatever they use for cleavage clearly works. Unless it's some offset restriction enzyme like SapI, the cluster/cleavage cycle removes the modification from the anchor primer, so it's hard to envision a method for removing the read 2 template strand and precisely regenerating the two anchor primers for reuse.

ADD REPLYlink written 7 weeks ago by harold.smith.tarheel3.5k

Yes, the cleavage is done on the double-stranded DNA that makes the clusters. But I thought that the question was whether there would be single-stranded primers left for re-using the flow cell ? In that case, the point I want to make is that the unreacted primers in the 2008 PE/bridge-PCR design may still be pristine and available for a new clustering reaction.

ADD REPLYlink written 7 weeks ago by Charles Plessy1.3k

@John's original question was whether exo (or similar) treatment would regenerate a pristine flow cell, so I responded to that question. But I can envision a few problems even in your scenario of unreacted primers:

1) Strand 2 of the first run would remain, and would be a template for read 2 of the second run, which would invariably affect base-calling of adjacent run 2 clusters (and probably S/N metrics as well, since they're calculated independently for each read). 2) The block to sister duplicate formation would be missing, so ~half of the data would be redundant. 3) The loading capacity (cluster density, number of reads) of the flow cell would be significantly reduced.

ADD REPLYlink written 7 weeks ago by harold.smith.tarheel3.5k

While that may have been true in 2008 there has been a change to this since. If I recall, now both SE/PE flowcells use a similar mechanism. @harold: does this ring a bell by any chance?

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by genomax222k

Indeed, I think that in sequencers like MiSeq, all flowcells are of the PE design, which also allows SE sequencing.

ADD REPLYlink written 7 weeks ago by Charles Plessy1.3k

MiSeq (and I believe NextSeq) kits work for both SR and PE sequencing. But the HiSeq and Rapid kits come in both SR and PE versions, and the flow cells are not interchangeable. Of course, you could always perform SE sequencing with the PE kit but, given the substantial difference in price, why would you want to?

But this thread has drifted far from bioinformatics, so I'm signing off. Additional queries on this topic should be posted on SEQanswers.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by harold.smith.tarheel3.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 527 users visited in the last hour