Common advice in DNA-seq experiments is to remove duplicate reads, on the presumption that they are optical or PCR duplicates. However, when samples are sequenced deeply (more than 10X), it becomes entirely expected that some reads will be duplicated purely by chance. If we insist on throwing duplicates away, we effectively cap the usable depth at one read per start position per strand, no matter how deeply we sequence.
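To put a rough number on that intuition, here is a back-of-the-envelope Poisson calculation (a sketch only, assuming uniform coverage and single-end reads; the depth and read length below are placeholder values):

```python
# Even with zero PCR duplication, deep sequencing produces reads that share
# a start position purely by chance. Assumes uniform coverage and single-end
# reads; the numbers are illustrative placeholders.
from math import exp

coverage = 30    # mean depth (X)
read_len = 100   # read length (bp)

# Expected number of reads starting at any given position on one strand
lam = coverage / (2 * read_len)

# Duplicate removal keeps at most one read per occupied start position, so
# under a Poisson model the surviving fraction is (1 - exp(-lam)) / lam.
kept = (1 - exp(-lam)) / lam
print(f"lambda = {lam:.3f}; reads discarded as 'duplicates': {1 - kept:.1%}")
```

At 30X with 100 bp single-end reads this already discards roughly 7% of perfectly good reads, and the fraction grows with depth.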
In cases where there is a dramatic shift toward higher GC relative to the input and a strongly skewed distribution, the case for removing duplicates is clear.
However, in many of the deep-sequencing datasets I work with, I see very little shift toward higher GC: the GC-content histograms are very nearly symmetric and quite close to the input.
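For reference, this is roughly how I compute those per-read GC histograms (a minimal sketch assuming pysam is available; "sample.bam" is a placeholder path, and the same function is run on the input):

```python
# Minimal sketch: histogram of per-read GC fraction for one BAM file.
# Assumes pysam is installed; "sample.bam" is a placeholder file name.
import pysam
from collections import Counter

def gc_histogram(bam_path, bins=50):
    """Bin the GC fraction of every read that carries a sequence."""
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if not read.query_sequence:
                continue
            seq = read.query_sequence.upper()
            gc = (seq.count("G") + seq.count("C")) / len(seq)
            counts[min(int(gc * bins), bins - 1)] += 1
    return counts

print(sorted(gc_histogram("sample.bam").items()))
```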
In these cases, I often feel I should leave the duplicated reads in. On the other hand, for certain regions of the genome I see huge numbers of tags, with overlapping-tag counts in the tens of thousands, and these do not seem to represent genuine biology.
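To make the scale of these pile-ups concrete, this is the kind of check I run (a sketch assuming a coordinate-sorted, indexed BAM and pysam; the file and contig names are placeholders):

```python
# Count reads sharing the same 5' start position and strand on one contig,
# then report the most extreme pile-ups. Assumes a coordinate-sorted,
# indexed BAM; "sample.bam" and "chr1" are placeholder names.
import pysam
from collections import Counter

def start_position_counts(bam_path, contig):
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig):
            if read.is_unmapped:
                continue
            # 5' end: reference_end for reverse-strand reads, start otherwise.
            pos = read.reference_end if read.is_reverse else read.reference_start
            counts[(read.is_reverse, pos)] += 1
    return counts

print(start_position_counts("sample.bam", "chr1").most_common(5))
```

In the problematic regions these counts reach the tens of thousands mentioned above, far beyond what chance alone could produce at these depths.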
What solutions are there for those of us who would like to use deep sequencing but want a principled way to filter out these clear artifacts?