Basically, duplicates are of two kinds:
- natural duplicates - identical DNA fragments that were genuinely present in the sample, produced by the biological system itself
- artificial duplicates - identical reads produced during library preparation (typically PCR amplification) or by the sequencing instrument itself
Of course, we'd want to keep the first kind of duplicate and remove the second. But rarely, if ever, can the two be told apart: both show up as identical reads mapping to identical coordinates. Hence the conundrum.
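To make the conundrum concrete, here is a minimal sketch of how duplicate detection typically operates: reads are grouped by mapping coordinates, and any group larger than one is a duplicate candidate. This is an illustration using the pysam library, not anyone's production method; the path `example.bam` is a placeholder, and real tools such as Picard MarkDuplicates also consider mate position and orientation. Note that nothing in the grouping key can tell a natural duplicate from an artificial one.

```python
from collections import defaultdict

import pysam


def duplicate_groups(bam_path):
    """Group mapped reads by (reference, start, strand).

    Groups with more than one read are duplicate candidates;
    whether they are natural or artificial is invisible here.
    """
    groups = defaultdict(list)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            # Skip reads that carry no usable coordinate.
            if read.is_unmapped or read.is_secondary:
                continue
            key = (read.reference_name, read.reference_start, read.is_reverse)
            groups[key].append(read.query_name)
    return {k: v for k, v in groups.items() if len(v) > 1}


# "example.bam" is a hypothetical input file.
for coords, names in duplicate_groups("example.bam").items():
    print(coords, "->", len(names), "reads share these coordinates")
```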
While we are at it, an empirical observation of mine: data with a high rate of artificial duplication is often useless even after the duplicates are removed. High duplication rates tend to come along with many other problems, so it does not really matter what you do with the data - it remains useless.
In general, from what I understand, people tend to deduplicate their data when uniform coverage is expected across the genome and when the coverage at a given position has major implications for the results. For example, in SNP calling the number of reads supporting a variant is an essential factor in deciding whether to trust that variant, and we'd want to avoid counting artificial duplicates there.
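As a hedged sketch of why this matters for SNP calling: once duplicates have been flagged by a tool such as Picard MarkDuplicates or samtools markdup, the read depth supporting a site can be counted with and without them. The file name and coordinates below are placeholders, and the BAM is assumed to be coordinate-sorted and indexed (fetch requires an index).

```python
import pysam


def support_at(bam_path, chrom, pos):
    """Return (total, non-duplicate) read counts covering chrom:pos."""
    total = unique = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # fetch() needs a coordinate-sorted, indexed BAM.
        for read in bam.fetch(chrom, pos, pos + 1):
            if read.is_unmapped:
                continue
            total += 1
            # is_duplicate reflects the flag set by the marking tool.
            if not read.is_duplicate:
                unique += 1
    return total, unique


# Hypothetical file and position, for illustration only.
total, unique = support_at("marked.bam", "chr1", 123456)
print(f"{total} reads cover the site; {unique} remain after removing duplicates")
```

If the two counts differ wildly, the apparent support for a variant at that site was largely an artifact of duplication.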
In most other cases, and especially when the expected coverage varies wildly and there are legitimate reasons for the same fragment to occur very frequently (for example, a highly expressed short transcript in a transcriptome study), duplicate removal is not recommended.