Question

"Normal" percent duplication for RRBS reads?

2

3.7 years ago

eb13 ▴ 20

Hi all,

I have recently been working on a reduced representation bisulfite sequencing (RRBS) project where I needed 15-18 cycles of PCR to get libraries to a sufficient concentration for sequencing. Because of this, I ended up with a lot of duplicated sequences: 75-90% as determined by FastQC. While 75-90% duplication is obviously an issue, I am having a hard time finding an "expected" range for the percentage of duplicated sequences in RRBS. Given that RRBS data is low in diversity by nature and FastQC works under the assumption that libraries should be very diverse, I am curious to know at what point other people start considering deduplication steps. All I could find regarding this range was the FastQC example report for RRBS, which has duplication levels at ~25% (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/RRBS_fastqc.html), and a post on Biostars ("PCR duplicates in RRBS data"), which reports 40% of reads being duplicated in an RRBS experiment.

Does anyone have a reference for a specific range or have any insight in what could be considered "normal" levels of duplication for RRBS?

Thank you in advance!

RRBS duplication sequencing PCR bias • 2.2k views

updated 2.4 years ago by robinycfang ▴ 20 • written 3.7 years ago by eb13 ▴ 20

2

I have worked with 75-90% duplication (FastQC), analyzing with Bismark, and have never performed de-duplication, as it is not recommended for RRBS with Bismark.

For example, check out this tutorial by Felix Krueger and Simon R Andrews (Bismark creators), where they say that up to 95% duplication can be "normal" for RRBS (the tutorial is a bit old, though).

Another observation: I have handled paired-end RRBS reads, and read2 always has more duplication than read1.

3.6 years ago by Papyrus ★ 2.9k

score 0 · Answer 1 · 2021-12-05

0

2.4 years ago

robinycfang ▴ 20

The MspI restriction enzyme used in RRBS cuts specifically at the CCGG motif (if I remember correctly), so you will get many reads with the same 5' and 3' sequence. In other words, for a given genomic location you will likely get many identical reads, not because of PCR amplification but because of the motif selection.
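To make that point concrete, here is a minimal Python sketch of an in silico MspI digest. It is only an illustration under stated assumptions: the toy sequence, the number of genome copies, the read length, and the helper names `mspi_fragments` and `percent_duplicated` are all made up for this example. It ignores bisulfite conversion, the reverse strand, and size selection, and the exact-match duplicate count is not FastQC's own estimate.

```python
import re
from collections import Counter


def mspi_fragments(seq):
    """Cut seq at every CCGG site, between the first and second C (C^CGG)."""
    # Lookahead so that overlapping or adjacent sites are all found.
    cuts = [m.start() + 1 for m in re.finditer(r"(?=CCGG)", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def percent_duplicated(reads):
    """Percentage of reads that are exact copies of another read."""
    unique = len(Counter(reads))
    return 100.0 * (len(reads) - unique) / len(reads)


# Hypothetical toy "genome" with three MspI sites (assumption for illustration).
genome = "ATTACCGGTTAGCACCGGAATGCACCGGTTTA"
n_copies = 10   # independent genome copies (cells); no PCR is modelled
read_len = 6    # toy read length

fragments = mspi_fragments(genome)

# Every fragment downstream of a cut site starts at the same position in
# every genome copy, so each copy contributes an identical read.
reads = [frag[:read_len] for frag in fragments[1:] for _ in range(n_copies)]

print(fragments)
print(f"{percent_duplicated(reads):.1f}% duplicated")
```

With this toy input, every fragment after the first starts with CGG, and the 30 reads collapse to 3 unique sequences, so the sketch reports 90% duplication even though no amplification step is modelled.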

2.4 years ago by robinycfang ▴ 20