I'm working with high-depth, paired-end whole-exome sequencing data. Some samples are targeted at 200X and others at 400X average read depth across the exome. The duplicate rate after alignment is very high. I suspect a good portion of these marked duplicates are not actual PCR duplicates, but distinct fragments that happen to align to the same coordinates by chance. However, I'm not sure how best to calculate coverage over the exome territory, because I can theoretically justify both omitting and including marked duplicates in the calculation.
I haven't yet found anything in the literature addressing this specific issue. I'm thinking there's probably some middle ground that factors in the territory size, number of reads, and perhaps insert size profile in the case of paired-end sequencing. Does anyone have any guidance?
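To make the "middle ground" idea concrete, here is a rough sketch of the kind of estimate I have in mind: the fraction of fragments expected to share coordinates purely by chance, given the number of fragments, the territory size, and an insert size distribution. It is only a back-of-the-envelope occupancy model, not an established method. The function name, the uniform-start assumption, the two-orientation assumption, and all numbers in the example (50 Mb territory, 100 bp reads, insert sizes spread evenly over 150 to 349 bp) are my own hypothetical choices.

```python
import math


def expected_chance_duplicate_fraction(n_fragments, territory_bp,
                                       insert_size_probs, orientations=2):
    """Expected fraction of fragments that duplicate another one by chance.

    Hypothetical model: fragment starts are uniform over the territory and
    insert sizes are drawn independently from insert_size_probs
    (a dict mapping insert size -> probability, summing to 1).
    """
    if n_fragments <= 0:
        return 0.0
    # Number of (start position, orientation) combinations a fragment can take.
    n_starts = territory_bp * orientations
    expected_distinct = 0.0
    for prob in insert_size_probs.values():
        if prob <= 0:
            continue
        # Mean number of fragments landing in one (start, orientation, size) slot.
        lam = n_fragments * prob / n_starts
        # Poisson occupancy: P(slot holds at least one fragment) = 1 - exp(-lam).
        expected_distinct += n_starts * -math.expm1(-lam)
    # Every fragment beyond the first in an occupied slot counts as a duplicate.
    return 1.0 - expected_distinct / n_fragments


# Illustrative numbers only (assumptions, not my real data).
territory_bp = 50_000_000
read_length = 100
target_depth = 400
# Fragments needed for the target depth, assuming non-overlapping mates
# and every read on target.
n_fragments = int(target_depth * territory_bp / (2 * read_length))
insert_sizes = {size: 1 / 200 for size in range(150, 350)}

fraction = expected_chance_duplicate_fraction(n_fragments, territory_bp, insert_sizes)
print(f"Expected chance-duplicate fraction: {fraction:.4%}")
```

Comparing a number like this with the observed duplicate rate might indicate how many marked duplicates could plausibly be coincidental. I'd expect uneven capture efficiency across targets to push the real chance-duplicate rate higher than a uniform model suggests, so I would treat this as a lower-end estimate at best.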
EDIT: I should add that each sample has a single library preparation.