I'm doing BS-seq on ChIP DNA. To get 500M reads from <1 ng of ChIP DNA, you can imagine the duplication level is huge: FastQC reported duplication rates of 39% and 66% for my two libraries. In my case, I think the proper way to deduplicate is to set a cutoff value, say 5, so that some PCR duplication (and possibly amplification of distinct DNA fragments with identical ends) is tolerated. How can I do this in a customized way? The reads are paired-end, and I'd prefer to start from an alignment file (BAM/SAM).
There's no generally applicable way to deduplicate targeted sequencing data (the same is true for things like RRBS). Because the reads pile up in a limited set of regions, many distinct fragments legitimately share the same start and end coordinates, so coordinate-based deduplication produces many false positives. For that reason, people traditionally just don't deduplicate such datasets. You can set a threshold if you want, but then you'll have to tailor it to each experiment and write your own program to do it.
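If you do go the threshold route, the core logic is small. Here's a minimal sketch in Python, reading the cutoff as "keep at most N read pairs per fragment position". It assumes you've already pulled fragment coordinates and strand out of your BAM/SAM with whatever parser you prefer; `Fragment` and `cap_duplicates` are names made up for this illustration and aren't part of any existing tool. Strand is included in the key because, for bisulfite data, reads from the two strands shouldn't be treated as copies of each other.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Fragment(NamedTuple):
    """One properly paired fragment (hypothetical record type for this sketch)."""
    name: str    # read-pair name
    chrom: str   # reference sequence name
    start: int   # leftmost aligned position of the fragment
    end: int     # rightmost aligned position of the fragment
    strand: str  # "+" or "-", the strand the fragment was assigned to


def cap_duplicates(fragments: Iterable[Fragment],
                   max_copies: int = 5) -> Iterator[Fragment]:
    """Yield at most `max_copies` fragments per (chrom, start, end, strand).

    Fragments are kept in the order they are encountered; anything beyond
    the cap for a given position is dropped.
    """
    seen = Counter()  # how many fragments have been kept at each position
    for frag in fragments:
        key = (frag.chrom, frag.start, frag.end, frag.strand)
        if seen[key] < max_copies:
            seen[key] += 1
            yield frag


if __name__ == "__main__":
    # Eight identical fragments on "+" plus one at the same position on "-".
    demo = [Fragment(f"pair{i}", "chr1", 100, 250, "+") for i in range(8)]
    demo.append(Fragment("pair8", "chr1", 100, 250, "-"))
    kept = list(cap_duplicates(demo, max_copies=5))
    print(len(kept), "of", len(demo), "fragments kept")
```

Note that this keeps the first N pairs it sees rather than, say, the highest-quality ones, and the counter grows with the number of distinct positions. With coordinate-sorted input you could clear it whenever the start position changes to keep memory down. What value of N makes sense is exactly the part you'd have to tune per experiment.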