We always remove duplicates from ChIP-seq data. If your sequencing is paired-end, you'll want to do this in a paired-end-aware manner. Normally this is done after mapping. We use MarkDuplicates from Picard for ChIP-seq; samtools also has rmdup. We use Picard because, back in the day, MarkDuplicates was more intelligent than rmdup about how it detected duplicates, but I don't know if that is still true. If you are using MACS for your peak calling, you'll want to mark duplicates rather than remove them.
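For what it's worth, here is a minimal sketch of how a MarkDuplicates call could be assembled from Python. The file names and the `mark_duplicates_cmd` helper are made up for illustration, and it assumes a `picard` wrapper script is on your PATH (as with a conda install; otherwise substitute `java -jar picard.jar`). It uses the classic `KEY=value` argument style; newer Picard releases also accept `-I`/`-O`/`-M`, so check what your version expects. The input BAM should be coordinate-sorted.

```python
import shlex


def mark_duplicates_cmd(in_bam, out_bam, metrics, remove=False):
    """Build a Picard MarkDuplicates command as an argument list.

    remove=False only flags duplicates in the output BAM (what you want
    if the peak caller handles duplicates itself); remove=True drops them.
    """
    return [
        "picard", "MarkDuplicates",   # assumption: 'picard' wrapper on PATH
        f"I={in_bam}",                # coordinate-sorted input BAM
        f"O={out_bam}",               # output BAM with duplicates flagged
        f"M={metrics}",               # duplication metrics text file
        f"REMOVE_DUPLICATES={'true' if remove else 'false'}",
    ]


# Hypothetical file names, for illustration only.
cmd = mark_duplicates_cmd(
    "chip.sorted.bam", "chip.markdup.bam", "chip.dup_metrics.txt"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

The metrics file reports the duplication rate per library, which is a quick way to see how bad the problem is before deciding anything.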
As suggested in this post, you should expect duplicates in ChIP-seq data because you sequenced only a very small part of the genome. It all depends on your coverage.
Try to find the proportion of duplicates you have. If you have 98% duplicates, try the following:
A good way to catch PCR duplicates, from @harold.smith.tarheel's answer in the post above: "You can discriminate via genome browser of your non-deduplicated data. Bona fide peaks will have multiple overlapping reads with offsets, while samples with only PCR duplicates will stack up perfectly without offsets."
If you see reads that "stack up perfectly without offsets", that is a problem (or at least you will have to decide whether to keep the duplicates or not). Conversely, if you see "multiple overlapping reads with offsets", you can keep the duplicates.
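If you would rather put rough numbers on this than eyeball it in a browser, something like the sketch below might help. It is only an illustration on toy data: reads are represented as `(chrom, start, strand)` tuples (in practice you would pull these from your BAM), the function names are made up, and it treats single-end reads with the same chromosome, start and strand as duplicates, which is a simplification of what Picard or samtools actually do.

```python
from collections import Counter


def duplicate_fraction(reads):
    """Fraction of reads that repeat an already-seen (chrom, start, strand)."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first at a given position counts as a duplicate.
    return (total - len(counts)) / total


def distinct_start_fraction(reads, chrom, region_start, region_end):
    """Distinct start positions divided by reads starting in the region.

    Near 1.0: reads are offset from each other (looks like a real peak).
    Near 0.0: reads stack at the same position (looks like PCR duplicates).
    Returns None if no read starts in the region.
    """
    starts = [
        start for c, start, _ in reads
        if c == chrom and region_start <= start < region_end
    ]
    if not starts:
        return None
    return len(set(starts)) / len(starts)


# Toy data: ten reads at exactly the same position vs. ten offset reads.
stacked = [("chr1", 1000, "+")] * 10
offset = [("chr1", 1000 + 7 * i, "+") for i in range(10)]

print(duplicate_fraction(stacked), distinct_start_fraction(stacked, "chr1", 900, 1200))
print(duplicate_fraction(offset), distinct_start_fraction(offset, "chr1", 900, 1200))
```

On the stacked toy reads the duplicate fraction is high and almost no start positions are distinct; on the offset reads it is the other way around. Where to draw the line on real data is a judgment call and depends on your coverage.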