Picard's MarkDuplicates tool is very useful, and as far as I know it's the standard for identifying duplicates in BAM files. However, a lengthy discussion with an investigator about the relationship between optical duplication and sequence complexity got me thinking more carefully about how it currently identifies optical duplicates.
It seems that with a low-complexity library, a substantial number of library duplicates would be incorrectly labeled as optical duplicates simply because they happen to sit close together on the flowcell surface. If this is true, then one way to prevent it would be to compare the quality scores of the suspected optical duplicates. If the quality scores of two clusters diverge substantially even though their coordinates indicate close proximity, that would suggest the reads originate from separate clusters and are therefore library duplicates rather than optical duplicates.
Is this line of thinking correct?
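To make "divergence in the quality scores" concrete, here is a rough Python sketch of one possible metric: the mean absolute per-base difference in Phred score between two reads. This is only an illustration of the idea, not anything Picard does today. The function names are made up, the choice of metric is my assumption, and the quality strings are assumed to be Phred+33 encoded.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    Phred+33 encoding is assumed; pass a different offset otherwise.
    """
    return [ord(ch) - offset for ch in quality_string]


def mean_quality_divergence(qual_a, qual_b):
    """Mean absolute per-base Phred difference between two reads.

    Hypothetical metric: other choices (difference of mean quality,
    maximum per-base difference, and so on) would be just as plausible.
    """
    scores_a = phred_scores(qual_a)
    scores_b = phred_scores(qual_b)
    # Only the cycles present in both reads are compared.
    n = min(len(scores_a), len(scores_b))
    if n == 0:
        raise ValueError("cannot compare empty quality strings")
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / n
```

Two reads with identical quality strings would score 0.0, and the value grows as the per-cycle qualities drift apart.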
If no one has a good argument to the contrary, I'll update the Picard MarkDuplicates tool to add an option for considering quality scores. There's still the question of what threshold to use for the divergence, but it can be given a default value, much like the distance threshold that already exists in the tool (DEFAULT_OPTICAL_DUPLICATE_DISTANCE).
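As a sketch of how the two thresholds might work together, the Python below classifies a pair of reads that have already been identified as duplicates of each other. It is not Picard's code. The distance value of 100, the box-style comparison on x and y within one tile, the DEFAULT_QUALITY_DIVERGENCE name and its value, and the classify_duplicate_pair function are all assumptions for illustration, and the divergence is assumed to have been computed elsewhere.

```python
# Name taken from the tool; the value 100 is assumed here for illustration.
DEFAULT_OPTICAL_DUPLICATE_DISTANCE = 100
# Hypothetical new default; a sensible value would need to be tuned on real data.
DEFAULT_QUALITY_DIVERGENCE = 5.0


def classify_duplicate_pair(pos_a, pos_b, quality_divergence,
                            max_distance=DEFAULT_OPTICAL_DUPLICATE_DISTANCE,
                            max_quality_divergence=DEFAULT_QUALITY_DIVERGENCE):
    """Label a known duplicate pair as 'optical' or 'library'.

    pos_a and pos_b are (tile, x, y) tuples. quality_divergence is any
    precomputed measure of how different the two reads' qualities are.
    """
    tile_a, x_a, y_a = pos_a
    tile_b, x_b, y_b = pos_b
    close_together = (
        tile_a == tile_b
        and abs(x_a - x_b) <= max_distance
        and abs(y_a - y_b) <= max_distance
    )
    # Proposed extra condition: nearby clusters whose quality scores diverge
    # strongly are treated as separate clusters, i.e. library duplicates.
    if close_together and quality_divergence <= max_quality_divergence:
        return "optical"
    return "library"
```

For example, `classify_duplicate_pair((1101, 1000, 2000), (1101, 1040, 2030), 1.2)` returns `"optical"`, while the same coordinates with a divergence of 12.0 return `"library"`.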