The Duplicates Dilemma of ChIP-Seq
18 months ago

I have been analyzing transcription factor ChIP-Seq data for several months and I am still confused about the handling of duplicates during peak calling with MACS2. This topic has been raised many times by various users, and I have gone through multiple posts without arriving at any concrete solution.

  • Some analysts claim they always remove duplicates.
  • A few say they don't remove duplicates, but instead filter reads by mapping quality (samtools view -q 20 or samtools view -q 30) and retain all reads thereafter (see the sketch after this list).
  • Some have recommended using Picard to remove duplicates, though that doesn't really clear up the confusion around the original question.
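For context, the quality-filtering variant I have seen suggested looks like this (file names are mine):

    # keep alignments with mapping quality >= 20; duplicates are left untouched
    samtools view -b -q 20 aligned.bam > filtered.bam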

My questions are as follows:

  1. From what I have read, there are "good" duplicates and "bad" duplicates. How do I know beforehand which is which? For instance, my ChIP-Seq samples have an average of 30-40 million reads. The sequencing depth is understandably very high, and naturally there will be a lot of duplicates. If I remove all duplicates in MACS2, then what was the point of increasing the sequencing depth?
  2. The FastQC reports for some of my samples show high sequence duplication levels, and the percentage of sequences remaining after de-duplication varies from 15-30%. Does this mean that MACS2 will perform peak calling with only ~20% of my total sequenced reads? This seems like a waste of both money and resources; what can I do to avoid it?
  3. Once in a while, some ChIP-Seq samples turn out to have a lower redundancy rate (duplicates as per the MACS2 report), say 15% or 30%, but in most cases there are around 50-60% duplicates. Although I followed the same protocol while performing ChIP, I cannot explain why those few samples showed low redundancy rates compared to the others.
  4. In MACS2, there is an option to specify the maximum number of times a read may be duplicated and still be kept. Should I use this parameter at all?
  5. In paired-end ChIP-Seq, do I need to provide the fragment size 'd' or --extsize parameters? I have used macs2 predictd, and it says it works only for single-end reads. Should I skip these parameters and let MACS2 decide what to use?

I would be grateful if anyone with expertise in handling ChIP-Seq data could answer these queries; I believe it will also help many other budding analysts.

MACS2 ChIP-Seq duplicates TF-ChIP ChIP

A couple of things to note:

  1. The duplication rate reported by FastQC is the single-end duplication rate (it processes each end separately for paired-end data). In many experiments this will inevitably be high, since there are only a limited number of positions at which a read can start within a peak. A much more useful estimate of the duplication level is given by Picard's MarkDuplicates.
  2. I think most people run MarkDuplicates and then leave MACS2 to decide what to do with the marked duplicates, rather than removing them themselves (see the sketch below).
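A minimal sketch of that workflow (file names and the human genome size flag -g hs are placeholders):

    # mark, but do not remove, PCR duplicates; sets SAM flag 1024
    picard MarkDuplicates I=chip.bam O=chip.marked.bam M=dup_metrics.txt

    # let MACS2 apply its own duplicate policy during peak calling
    macs2 callpeak -t chip.marked.bam -c input.bam -g hs -n sample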

I used Picard previously to mark duplicates, and I have a question here. Can MACS2 inherently identify which reads are marked as duplicates by Picard, or do I have to point it at them with an additional parameter? If it detects that Picard MarkDuplicates has been run (via SAM flag 1024), will it turn off its own duplicate-removal step and instead ignore all reads marked by Picard?
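For reference, a quick way to confirm that reads actually carry the duplicate flag (file name is a placeholder):

    # count reads with the PCR-duplicate flag set (0x400 = 1024)
    samtools view -c -f 1024 chip.marked.bam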


~As far as I'm aware, flag 1024 is how MACS2 detects duplicates. Without having run MarkDuplicates, MACS2's duplicate removal does nothing.~

EDIT: Actually, looking at the latest versions of the documentation, it's not clear whether that is still true. From the text of the manual, it looks like duplicate marking might now be built in.

LChart 3.9k • 18 months ago

The "obvious" answer here is that, if you have control over the laboratory process, you should incorporate unique molecular identifiers (UMI) into your libraries prior to amplification. You can then de-duplicate at the UMI level and run your peak caller in a mode that ignores positional duplicates.

A PLOS article from a few years back suggested a rather involved pipeline that estimates the true versus duplicated rate from the rates of non-redundant and duplicate reads inside versus outside peaks, and then partially de-duplicates on that basis:

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0214723

As for specific questions:

(1) For narrow peaks (peak size of ~400 bp, fragment size of ~300 bp) you should be able to disambiguate on the order of 1,000 distinct molecules per peak, with those in the middle far more likely than those towards the ends, assuming a library of infinite complexity. For 50,000 peaks that's ~50M molecules; so even though you hit diminishing returns with more depth due to de-duplication, you are still pulling in distinct molecules at some rate. In addition, some peaks are shared between many cell types in your tissue, while others are restricted to a small population. Increasing sequencing depth will recover more of the "rare" peaks, at the cost of generating redundant fragments in "common" peaks.

(2) You can run MACS2 with --keep-dup=all to turn off its positional de-duplication.
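For example (sample names are placeholders):

    macs2 callpeak -t chip.bam -c input.bam -g hs -n sample_alldup --keep-dup all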

(3) The duplication rate is highly dependent on library complexity, which in turn depends on the specific antibody you are using (down to catalog # and lot #), input amounts, cell viability of the input material, crosslinking efficiency, etc. One thing that immediately results in a low duplication rate is a high background (!)

(4) If you have the time, it's worth playing with these parameters and seeing what impact they have; and if you like the results with one setting vs another.

(5) For paired-end reads, the distance between the first and last (reference) positions of the read pair is taken to be the fragment size, since those positions mark the beginning and end of the molecule. These parameters only need to be provided for single-end reads, and (if the depth is high enough) strand cross-correlation can be used to infer the fragment-size distribution for single-end sequencing.
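A hedged sketch of both cases (file names and the 300 bp estimate are placeholders):

    # paired-end: BAMPE mode takes the fragment size from each read pair
    macs2 callpeak -t chip.bam -c input.bam -f BAMPE -g hs -n pe_sample

    # single-end: skip model building and supply a fragment size explicitly
    macs2 callpeak -t chip.bam -c input.bam -f BAM -g hs -n se_sample --nomodel --extsize 300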

seidel 11k • 18 months ago

There are no perfect answers to your questions without knowing the specifics of your system. 40 million reads - high depth for yeast? Yes! For human cells? Meh.

(1) Good duplicates vs. bad - got a reference for that? The point of high sequencing depth is to find the more obscure peaks in your data, though depending on many things, that doesn't mean they will be biologically meaningful peaks. Removal of duplicates should still allow peaks to be found. However, you might explore this explicitly: find peaks under normal parameters, and then compare those results to peaks found with the --keep-dup option to see if they differ. You can also explore how your peak sets change by sampling your data, e.g. peaks found with 10M, 20M, 30M reads, etc.; this will allow you to create a saturation curve (see the sketch below). In addition, you should examine your data in a genome browser to see whether you believe what you're finding via MACS.
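One way to build that saturation curve (file names are placeholders; in samtools view -s, the integer part is a random seed and the decimal part is the fraction of reads kept):

    # call peaks on 25/50/75% subsamples and count peaks at each depth
    for frac in .25 .50 .75; do
        samtools view -b -s 42$frac chip.bam > chip$frac.bam
        macs2 callpeak -t chip$frac.bam -c input.bam -g hs -n sub$frac
    done
    wc -l sub*_peaks.narrowPeak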

Another thing to consider would be subjecting your data to IDR analysis to score the reproducibility of your peaks among replicates. You should also examine all your data sets for the fraction of reads found in peaks (FRiP); a rough recipe follows below. Both of these are relatively easy to do and provide good metrics for comparing data sets and replicates to each other.
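A rough FRiP calculation with samtools (file names are placeholders; -F 260 skips unmapped and secondary alignments):

    total=$(samtools view -c -F 260 chip.bam)
    in_peaks=$(samtools view -c -F 260 -L sample_peaks.narrowPeak chip.bam)
    echo "scale=3; $in_peaks / $total" | bc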

(2) Explicitly generate your own saturation curve (see (1)), to see what contribution depth is making to your ability to find peaks.

(3) Are you making your own libraries? Your samples may simply be over-amplified, or may start from low amounts of IP material. Are these results antibody-dependent?

(4) Explicitly test this parameter and see how your peak sets change. Do you know how to do this? If you use R, the GenomicRanges library makes comparing peak sets very easy (but one can also do this with bedtools or other methods, as sketched below).
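For instance, with bedtools (peak files from two runs or replicates; names are placeholders):

    # peaks in run A that overlap any peak in run B
    bedtools intersect -u -a runA_peaks.narrowPeak -b runB_peaks.narrowPeak | wc -l

    # peaks unique to run A
    bedtools intersect -v -a runA_peaks.narrowPeak -b runB_peaks.narrowPeak | wc -l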

(5) For paired-end data, the fragment length is determined for each pair from the data itself, since the mapped positions of the two ends can be used to calculate it. So you don't need a fragment-size parameter.


I believe by "good" duplicates the OP is referring to cases where two fragments get the same ends by chance but still come from independent biological molecules, rather than arising from PCR duplication of a single fragment.


I have used different --keep-dup values, and these are the numbers of peaks I am getting with MACS2. One thing I forgot to mention is that I am calling peaks against input, but the number of reads in the input file is quite low (~16 million).

For --keep-dup 1 (the default), I am getting around 0.2 million peaks; for --keep-dup all, around 0.5 million peaks; and for --keep-dup auto, close to 0.4 million peaks.

The redundancy rate for the above case is as high as 89% when removing duplicates. The number of peaks seems way too high for the TF involved, and though I haven't viewed the data in a genome browser, I'm pretty sure most of the peaks will resemble noise rather than true peaks.

As per your suggestions, I will do IDR analysis and check. However, right now I am not sure where the problem lies.


So if you were studying the human genome, you're finding roughly 10x as many peaks as there are genes in the genome. That doesn't sound very specific for a TF :( Have you examined the distributions of the MACS peak statistics, for instance histograms of the Q-values or fold enrichments? I'm glad you're pursuing the metrics mentioned above (also described in: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia, Landt et al., 2012). Do you have replicates? I would assume so if you are going to perform IDR analysis. Does your TF have a known motif? If you rank your peaks by Q-value and examine enrichment for the motif, you should see that it dies off at some rate (a sketch of the ranking step follows below).
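A sketch of that ranking step, assuming MACS2 narrowPeak output (column 9 holds the -log10 Q-value; the cutoff of 1,000 peaks and file names are placeholders):

    # sort peaks from most to least significant by Q-value
    sort -k9,9gr sample_peaks.narrowPeak > peaks_ranked.narrowPeak

    # take the top 1000 for motif enrichment (e.g. with HOMER or MEME-ChIP)
    head -n 1000 peaks_ranked.narrowPeak | cut -f1-3 > top1k.bed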
