PCR duplicates in RRBS data
18 months ago
linelr ▴ 10

Hi!

I am working with DNA methylation in salmon and have recently acquired data from an RRBS experiment. FastQC reports that around 40% of my reads are PCR duplicates, which is quite high. However, I have read that I should not remove duplicates from RRBS data, e.g. by simply removing reads that have the exact same start and stop position in the genome, but this did not come with a proper explanation. It sort of makes sense to me because of the way the library prep is performed: MspI cleaves only at CCGG sites, and the fragments are then size-selected, so you will probably end up with many fragments that are very similar, and FastQC might therefore call them PCR duplicates of each other. This is of course based on my non-exhaustive understanding of these processes.

I can't seem to find any good explanations of how to perform proper PCR duplicate removal for RRBS data, if that is indeed called for (which I suspect it is).

Does anyone know how to do this or can anyone point me to where I might find this information?

Best, Line

All right! This makes sense. Thanks a lot! I'll keep what you write about FastQC in mind for next time.

Have a good day!

Sure! Thanks for the reminder

18 months ago

I strongly recommend that you not remove alleged PCR duplicates in RRBS data processing. In RRBS data we expect very high levels of reads that look like PCR duplicates: because MspI always cuts at the same CCGG sites, fragments from independent DNA molecules start and end at identical genomic positions. These are not real PCR duplicates (for the most part at least). Please note that FastQC's defaults are all intended for whole-genome sequencing and will give warnings that you should ignore if you run it on RRBS datasets.
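To make the point concrete, here is a minimal Python sketch of an in-silico MspI digest. Everything in it is an assumption made up for illustration: the toy sequence, the 40-220 bp size window, and the function name `mspi_fragments` are not taken from any real protocol or tool. It only shows that independent copies of the same genome yield fragments with identical coordinates, so a position-based duplicate filter would collapse them even though no PCR was involved.

```python
import re
from collections import Counter

def mspi_fragments(seq, min_len=40, max_len=220):
    """In-silico MspI digest (cuts C^CGG), followed by size selection.

    Returns (start, end) coordinates of the retained fragments.
    The size window is a hypothetical choice for this toy example.
    """
    # MspI cuts after the first C of each CCGG site
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    # Fragments lie between consecutive cut sites
    frags = zip(cuts, cuts[1:])
    return [(s, e) for s, e in frags if min_len <= e - s <= max_len]

# Toy "genome" with four CCGG sites (made up for illustration)
genome = "A" * 10 + "CCGG" + "AT" * 30 + "CCGG" + "ACGT" * 20 + "CCGG" + "T" * 500 + "CCGG" + "A" * 5

# Three independent genome copies (e.g. three cells), no PCR at all
n_copies = 3
reads = [frag for _ in range(n_copies) for frag in mspi_fragments(genome)]

counts = Counter(reads)   # how often each (start, end) pair occurs
unique = set(reads)       # what a position-based dedup would keep

for coords, n in sorted(counts.items()):
    print(coords, "seen", n, "times")
print(len(reads), "fragments before dedup,", len(unique), "after")
```

In this toy case, every retained fragment appears once per genome copy, and position-based deduplication would discard all but one copy of each, throwing away independent methylation measurements.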

