Question: Should quality scores be considered when discriminating between optical and library duplicates?
gravatar for Dan D
6.0 years ago by
Dan D7.1k
Dan D7.1k wrote:

Picard's MarkDuplicates tool is very useful. As far as I know it's the standard for identifying duplicates in BAM files. However, a lengthy discussion with an investigator about the relationship between optical duplication and sequence complexity got me thinking in deeper detail about its current methodology.

It seems that in the case of a low-complexity library, you would have a substantial number of library duplicates incorrectly getting labeled as optical duplicates simply because they're close together on the flowcell surface. If this is true, then perhaps a way to prevent this would be to compare the quality scores of the suspected optical duplicates. If there's a substantial divergence in the quality scores between two clusters whose coordinates indicate close proximity, then this would indicate that the the reads originate from separate clusters and are therefore not optical duplicates but library duplicates.

Is this line of thinking correct?

If no one has a good argument to the contrary, I'll update the Picard MarkDuplicates tool to add an option for considering quality scores. There's a question of what threshold to use for that divergence number, but that can be set to some default, much like the distance threshold variable which is currently present in the tool (DEFAULT_OPTICAL_DUPLICATE_DISTANCE).

ADD COMMENTlink modified 6.0 years ago by mikhail.shugay3.4k • written 6.0 years ago by Dan D7.1k

I would think that there would be significant divergence in quality scores due to optical duplication (in that the peripheral faux-clusters would have lower quality scores). Have you compared the apparent optical duplication rate between low and regular complexity libraries? I would think that that would be informative.

ADD REPLYlink written 6.0 years ago by Devon Ryan97k

Perhaps a metric could be developed that, given a library complexity would quantify the expected levels of  duplication that also look like optical duplicates . Then the mark duplicate program should only remove duplicates until this expected level is reached.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Istvan Albert ♦♦ 85k
gravatar for mikhail.shugay
6.0 years ago by
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:

I was really interested in this question since I've started working with UMI (unique molecular identifier) - barcoded libraries. In these libraries in case two reads have the same sequence in degenerate region (usually 12 N nucleotides inside primer sequence), they could be attributed to the same molecule. However we had some issues for HiSeq data which we deemed to be resulting from misread UMI sequence in close clusters.

I've done some basic analysis using UMI sequences and the tile/coordinates from header. I've analyzed 10mln read pairs from a ~30mln reads FASTQ file (data from here), analyzing separately read pairs with same and distinct UMI. There were 143723 and 139649 pairs on the same tile for matching and distinct UMI sequences respectively, those were further analyzed by calculating distance and quality for reads in pair.

Those UMI duplicates are clearly present for clusters closer than 100 pixels:

Also if plotting the (distance) vs. (minimal quality in pair) density plot for all data and those putative duplicates ("Same-close"), one can see that some of them are indeed low-quality reads:

Hope this information will be useful. 

ADD COMMENTlink written 6.0 years ago by mikhail.shugay3.4k

Interesting analysis! One thing I don't fully get though is why isn't there anything in the middle? How come these are all either far away or close to one another?

ADD REPLYlink written 6.0 years ago by Istvan Albert ♦♦ 85k

My rather naive suggestions was that when selecting at random point pairs at a very large area the probability of selecting two close points is negligibly small.

If I recall it correctly, there are 96 tiles per flow cell in HiSeq, so for 10mln pairs one should get 100 000 pairs on the same tile (approx. what I get). Given 200mln reads per lane yield, there would be 2mln reads per tile, that is 4x10^12 possible pairs >> 10^5 pairs we check here.

PS And those clusters that are close to each other are only observed in case of duplicate UMI sequence

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by mikhail.shugay3.4k

I see, that makes sense.

ADD REPLYlink written 6.0 years ago by Istvan Albert ♦♦ 85k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1632 users visited in the last hour