Hi, I have some questions about sequence duplicates in ChIP-seq analysis:

a) When I run FastQC on my samples, they are flagged on 'Sequence Duplication Levels'. I'm going to use MACS2 as the peak caller, and I've read in several places that MACS2 removes duplicates automatically. Does this mean I don't need to worry about removing duplicates with 'samtools rmdup' before peak calling?

b) Some say we should remove duplicates because they are the result of PCR bias introduced by the amplification step prior to sequencing. Others argue that these duplicates could be biological, so we shouldn't remove them. What do they mean by biological?

c) If we use a control sample (input DNA), do we still need to remove duplicates?

  1. Correct. By default MACS2 keeps only one read per position ('--keep-dup 1') and ignores the rest, so you don't need to remove duplicates yourself.
  2. The question is mostly "Are they really PCR duplicates or not?" If you sequence a given area deeply enough, you will always get reads/fragments with the same coordinates. From the coordinates alone it's impossible to know whether these are PCR duplicates or whether there is simply a lot of signal at that spot. The latter is what people mean by "biological" duplicates.
  3. Having an input sample is irrelevant to this.
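
To make point 2 concrete, here is a rough simulation sketch, not anything MACS2 or samtools actually does. It assumes a hypothetical enriched region with 500 possible read start positions and draws every read as an independent fragment, so there is no PCR step at all. The function names and the region size are made up for illustration, and strand is ignored. Even so, the share of reads that land on an already-used coordinate climbs quickly with depth:

```python
import random


def duplicate_fraction(n_reads, n_positions, seed=0):
    """Simulate independent fragments and return the fraction of reads
    whose start coordinate was already used by an earlier read."""
    rng = random.Random(seed)
    seen = set()
    duplicates = 0
    for _ in range(n_reads):
        pos = rng.randrange(n_positions)  # uniform start site, no PCR involved
        if pos in seen:
            duplicates += 1
        else:
            seen.add(pos)
    return duplicates / n_reads


def expected_duplicate_fraction(n_reads, n_positions):
    """Analytic expectation of the same quantity under uniform sampling:
    expected distinct positions = N * (1 - (1 - 1/N) ** n)."""
    distinct = n_positions * (1 - (1 - 1 / n_positions) ** n_reads)
    return 1 - distinct / n_reads


if __name__ == "__main__":
    region = 500  # hypothetical number of start positions in one peak
    for depth in (10, 100, 1000, 10000):
        sim = duplicate_fraction(depth, region)
        exp = expected_duplicate_fraction(depth, region)
        print(f"{depth:>6} reads: simulated {sim:.3f}, expected {exp:.3f}")
```

None of these simulated "duplicates" come from amplification, yet a coordinate-based deduplication step would treat them exactly like PCR duplicates, which is why the two cannot be told apart after the fact.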
