I have data from sequencing ancient DNA; the genome in question is mitochondrial. Libraries were sometimes paired-end, sometimes single-end. With duplicates we get very high coverage, ~1000-2000x; after removing duplicates we still maintain ~150x on average. However, the data loss is substantial, and it calls into question how much of the sequencing is effectively wasted by Picard duplicate removal, which does discard a large amount of data. One thing is certain: the mitochondrion is very short (~16,700 bp), so given the amount of sequencing it is only natural that many reads are flagged as duplicates. The other thing is that with aDNA there is no guarantee of target concentration nor of its completeness (some fragments may simply not be present). So this situation represents the most extreme case for Heng Li's simple equation:
dups = 0.5 * m / N, where m = number of sequenced reads and N = number of DNA molecules before amplification (http://seqanswers.com/forums/showthread.php?t=6854)
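To get a feel for the numbers, here is a minimal sketch plugging rough values into that equation. The read length (~60 bp), the 1500x target coverage and the candidate values of N are illustrative assumptions, not measured properties of my libraries:

```python
# Toy calculation of Heng Li's expected duplicate fraction: dups ~= 0.5 * m / N.
# Read length, coverage and the N guesses below are assumptions for illustration.

GENOME_LEN = 16_700   # human mtDNA length, bp
READ_LEN = 60         # typical short aDNA fragment (assumed)

def reads_for_coverage(coverage, genome_len=GENOME_LEN, read_len=READ_LEN):
    """Number of reads m needed to reach a given average coverage."""
    return coverage * genome_len / read_len

def expected_dup_fraction(m, n_molecules):
    """Heng Li's approximation: dups = 0.5 * m / N (only valid while m << N)."""
    return 0.5 * m / n_molecules

if __name__ == "__main__":
    m = reads_for_coverage(1500)        # roughly my raw coverage
    for n in (1e5, 1e6, 1e7):           # guesses for pre-amplification molecules
        print(f"m = {m:,.0f} reads, N = {n:,.0f} molecules "
              f"-> expected dup fraction ~ {expected_dup_fraction(m, n):.2%}")
```

With these numbers the linear approximation already predicts fractions above 100% for small N, which is exactly the extreme regime I mean: many more reads than unique starting molecules, so high duplicate rates are expected rather than surprising.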
Finally, the mitochondrial sequences are to be assembled into a per-sample consensus, and SNPs are to be called. Is it wise to remove duplicates from these data sets? Neither the consensus nor the SNP calls change whether duplicates are removed or not.
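As a toy illustration (not my real pipeline) of why a majority-rule consensus at ~150x barely moves when reads are duplicated at random, here is a small simulation; the depth, error rate and duplication factor are made-up parameters:

```python
import random

# Toy simulation: duplicate observations at one site by resampling with
# replacement and check whether the majority-rule base ever changes.
random.seed(1)

def consensus(bases):
    return max(set(bases), key=bases.count)

def simulate_site(depth=150, error_rate=0.05, dup_factor=10, trials=1000):
    """Return the fraction of trials where duplication flips the consensus base."""
    flips = 0
    for _ in range(trials):
        # unique molecules covering the site; the true base is 'A'
        bases = ['A' if random.random() > error_rate else 'G' for _ in range(depth)]
        # PCR-like duplication: resample the same observations with replacement
        duplicated = [random.choice(bases) for _ in range(depth * dup_factor)]
        if consensus(bases) != consensus(duplicated):
            flips += 1
    return flips / trials

if __name__ == "__main__":
    print("consensus changed in", simulate_site(), "of trials")
```

At this depth the flip rate comes out at essentially zero, which matches what I observe: the consensus is insensitive to keeping or removing duplicates, even though quality and allele-support numbers are obviously inflated.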
Also worth mentioning: even after enrichment for the mitochondrion, the target is in the minority, with most of the sequences stemming from bacterial contamination.
So: remove or not remove? Or is there any other way to estimate the amount of duplication (beyond theoretical equations)? To support this, I also ran preseq library-complexity estimation, looking for correlations in case I cannot find a definitive answer to this PCR-duplicates problem.
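For reference, this is roughly how I invoke preseq (a minimal sketch, wrapped in Python only for convenience). The file names are placeholders, and the flag spellings (-B for BAM input, -P for paired-end, -o for output) should be checked against your preseq version's help:

```python
import subprocess

def run_preseq(bam, prefix, paired=True):
    """Run preseq c_curve and lc_extrap on a position-sorted BAM (flags assumed)."""
    pe = ["-P"] if paired else []
    # observed complexity curve: distinct reads vs. total reads
    subprocess.run(["preseq", "c_curve", "-B", *pe, "-o", f"{prefix}.c_curve.txt", bam],
                   check=True)
    # extrapolated yield of distinct reads if sequencing were continued deeper
    subprocess.run(["preseq", "lc_extrap", "-B", *pe, "-o", f"{prefix}.lc_extrap.txt", bam],
                   check=True)

if __name__ == "__main__":
    run_preseq("sample1.mito.sorted.bam", "sample1")  # placeholder file names
```

If the lc_extrap curve flattens out early, that would suggest the libraries are close to saturation and further sequencing mostly produces duplicates, which is what I am trying to correlate with the Picard duplicate rates.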