Question about deduplication in a highly repetitive genome
varunorama wrote, 12 weeks ago:

Hello Biostars,

I have been analyzing WGBS data for our organism, which has a highly repetitive genome. I am using the Bismark pipeline and mapped the reads with bowtie2 within Bismark. The pipeline recommends deduplicating WGBS datasets to remove PCR duplicates, so I ran the deduplication step (deduplicate_bismark) and then extracted methylation statistics from both the deduplicated and the non-deduplicated data to see how they differed.

Initial findings show that deduplication removes ~40% of the reads, which would suggest they are PCR duplicates. The overall coverage of CpG sites is also greatly reduced, from an average of 4x (non-deduplicated) down to about 1x (deduplicated).
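As a quick arithmetic sanity check (illustrative only, using the numbers quoted above): coverage scales roughly linearly with read count, so removing 40% of the reads from a 4x average should leave about 2.4x, noticeably more than the observed ~1x. It may be worth double-checking how the averages were computed (e.g. over all CpGs vs. only covered CpGs).

```python
def expected_coverage_after_dedup(coverage_before: float, dup_fraction: float) -> float:
    """Average coverage expected if deduplication removes `dup_fraction`
    of the reads and nothing else changes (read length, mapped set)."""
    return coverage_before * (1.0 - dup_fraction)

cov_before = 4.0      # average CpG coverage before deduplication (from the post)
dup_fraction = 0.40   # ~40% of reads flagged as duplicates (from the post)

cov_expected = expected_coverage_after_dedup(cov_before, dup_fraction)
print(f"Expected post-dedup coverage: ~{cov_expected:.1f}x")
```

This prints roughly 2.4x, so the drop to ~1x is larger than duplicate removal alone would explain.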

Given the large reduction in data and coverage, and that I am working with an organism with a highly repetitive genome, I am wondering whether deduplication should actually be applied in this case. Is it still needed here, or not? Additionally, if there are any QC steps that would help me make a more informed decision, I would like to hear them!
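One simple QC heuristic worth sketching here (an assumption-laden illustration, not something from this thread): compare the observed duplicate rate to what random fragmentation alone would produce at this depth. If read start positions are roughly uniform over the genome, duplicates-by-chance follow a Poisson model; an observed rate far above that expectation points either at genuine PCR duplication or at distinct reads collapsing onto repetitive loci. The read count and genome size below are hypothetical placeholders.

```python
import math

def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Fraction of reads expected to be flagged as duplicates if read
    start positions are uniform random (Poisson approximation).
    With rate c = n_reads / n_positions, the expected number of
    distinct start positions is n_positions * (1 - exp(-c)); every
    read beyond the first at a position counts as a duplicate."""
    c = n_reads / n_positions
    unique_starts = n_positions * (1.0 - math.exp(-c))
    return 1.0 - unique_starts / n_reads

# Hypothetical numbers for illustration: 40 million reads over a
# 1 Gb genome of candidate start positions (paired-end duplicates are
# defined by both ends, so adjust to your own library and organism).
n_reads = 40_000_000
genome_positions = 1_000_000_000
frac = expected_duplicate_fraction(n_reads, genome_positions)
print(f"Expected duplicate fraction by chance: {frac:.2%}")
```

With these placeholder numbers the by-chance expectation comes out around 2%, so an observed ~40% would be hard to attribute to sampling depth alone.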

Thank you!

Tags: sequencing, wgbs
Answer (varunorama, 12 weeks ago):

Felix Krueger posted a very insightful and thoughtful response to this question on the Bismark GitHub page:

https://github.com/FelixKrueger/Bismark/issues/400
