I have been analyzing WGBS data for our organism, which has a highly repetitive genome. I aligned the reads with the Bismark pipeline, using Bowtie 2 as the underlying aligner. Deduplication is typically recommended for WGBS libraries to remove PCR duplicates, so I ran deduplicate_bismark and then extracted methylation statistics from both the deduplicated and the non-deduplicated alignments to see how the results differ.
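For reference, these are roughly the commands I ran (file names are simplified placeholders, and I am assuming paired-end reads here):

```bash
# Alignment with Bismark, using Bowtie 2 under the hood
bismark --genome /path/to/genome_folder -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz

# Remove PCR duplicates from the paired-end alignment
deduplicate_bismark --paired --bam reads_R1_bismark_bt2_pe.bam

# Extract methylation calls from both BAMs for comparison
bismark_methylation_extractor --paired-end --comprehensive --bedGraph \
    reads_R1_bismark_bt2_pe.bam
bismark_methylation_extractor --paired-end --comprehensive --bedGraph \
    reads_R1_bismark_bt2_pe.deduplicated.bam
```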
Initial results show that deduplication removes ~40% of the aligned reads, i.e. these alignments are being flagged as PCR duplicates. Average CpG coverage also drops substantially, from ~4x in the non-deduplicated data to ~1x after deduplication.
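For the coverage comparison, I used the per-cytosine counts in Bismark's coverage output (columns 5 and 6 of the *.bismark.cov.gz files are the methylated and unmethylated read counts); roughly:

```bash
# Mean per-CpG coverage = mean of (methylated + unmethylated) counts
for cov in reads_R1_bismark_bt2_pe.bismark.cov.gz \
           reads_R1_bismark_bt2_pe.deduplicated.bismark.cov.gz; do
    echo -n "$cov mean CpG coverage: "
    zcat "$cov" | awk '{sum += $5 + $6} END {if (NR) print sum / NR}'
done
```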
Given the large reduction in reads and coverage, and the highly repetitive genome I am working with, I am questioning whether deduplication is appropriate here. Is deduplication still advisable in this case, or not? Additionally, if there are any QC steps that would help me make a more informed decision, I would like to hear them!
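One QC step I have considered (I am not sure whether it is the right approach, so corrections welcome) is estimating library complexity with preseq, to check whether the removed reads behave like PCR artifacts or like genuine re-sampling of a repetitive genome. A sketch, assuming the Bismark BAM is first coordinate-sorted:

```bash
# preseq needs a coordinate-sorted BAM (Bismark output is in read order)
samtools sort -o reads_sorted.bam reads_R1_bismark_bt2_pe.bam

# Observed complexity curve: distinct reads vs. total reads sequenced
preseq c_curve -B -o complexity_observed.txt reads_sorted.bam

# Extrapolated curve: expected distinct reads with deeper sequencing
preseq lc_extrap -B -o complexity_extrap.txt reads_sorted.bam
```

My understanding is that a curve that saturates early would indicate a genuinely low-complexity library (duplicates likely PCR-derived), whereas a curve that keeps climbing would suggest the duplicates are largely coincidental. Does that reasoning hold here?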