ChIP-seq or Cut and Run Differential Binding Analysis
2
0
11 months ago
Pappu ★ 2.0k

I am trying to do a differential binding analysis of ChIP-seq and Cut and Run data using DiffBind. I have 2 normal samples, 2 normal IGG control samples, 2 treated samples and 2 treated IGG control samples. If I do peak calling with MACS2 after Bowtie2 alignment and duplicate removal, I get the following peaks:

Normal 1 over Normal IGG Control 1
Normal 1 over Normal IGG Control 1
Normal 2 over Normal IGG Control 2
Normal 2 over Normal IGG Control 2


and

Treated 1 over Treated IGG Control 1
Treated 1 over Treated IGG Control 1
Treated 2 over Treated IGG Control 2
Treated 2 over Treated IGG Control 2


My question: is it standard practice to use both controls for such an analysis with DiffBind? If not, what is the standard workflow?

ChIP-Seq • 1.1k views
1

You will probably not need the IgG samples during DE analysis. I personally only find them useful during peak calling, to correct for background enrichment. They are simply too different from the IPs to be included in the DE analysis as part of some kind of interaction model, so it comes down to a standard 2 vs 2 comparison. DiffBind is an option; alternatively, check csaw or simply feed the count matrix directly into edgeR. Either way, proper QC, as discussed in both the DiffBind and csaw vignettes, should be done before any DE testing.

0

Thanks. So for macs2 peak calling, I would merge the two controls. Then I will have 2 peak sets for Normal and 2 peak sets for Treated. Is there any requirement that the IGG controls for Normal and Treated be quite similar, in order to avoid confounded results? The existence of 4 controls is the reason for my confusion.

1
11 months ago
Rory Stark ★ 1.2k

I'm not sure why you have two sets of peaks for each pair? Or is it e.g. Normal 1 vs Normal IGG 1 and Normal 1 vs Normal IGG 2 etc?

There are a few ways to handle the controls in DiffBind:

1. Ignore the controls. You can run a differential analysis without reference to the controls, using a consensus peakset derived from the peaks that were called against the controls. The idea is that the controls were used in identifying the enriched areas, and now you can look for consistent changes in read counts within those areas.
2. Greylists: You can derive "greylists" (experiment-specific blacklists) from the IGG controls to identify anomalous regions that should be excluded from subsequent analysis. These can be generated automatically from within DiffBind (this is now the default way to handle controls).
3. Subtracting IGG reads: if you don't use greylists, you can specify the "matched" control for each primary sample, and subtract the IGG reads. If there is a large pileup in the IGG, it will dampen or cancel out the main signal. DiffBind will handle this case as well, including scaling the control reads if the library sequencing depth is mismatched.
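To make option 3 concrete, here is a minimal sketch of depth-scaled control subtraction on made-up per-peak counts. It is not DiffBind's implementation: the scaling rule (ChIP depth divided by control depth), the floor at zero, and all numbers are assumptions chosen for illustration.

```python
def subtract_control(chip_counts, control_counts, chip_depth, control_depth, floor=0):
    """Subtract depth-scaled control counts from ChIP counts, peak by peak.

    Assumption: the control is scaled by the ratio of library depths and the
    result is floored at `floor`. DiffBind's own rules may differ in detail.
    """
    scale = chip_depth / control_depth
    adjusted = []
    for chip, ctrl in zip(chip_counts, control_counts):
        adjusted.append(max(floor, chip - round(ctrl * scale)))
    return adjusted


# Hypothetical read counts in three consensus peaks.
chip = [120, 80, 300]
igg = [10, 70, 700]  # the third peak sits on a large IgG pileup

# The control library is twice as deep as the ChIP library, so it is halved.
adjusted = subtract_control(chip, igg, chip_depth=20_000_000, control_depth=40_000_000)
print(adjusted)
```

The third peak shows the behaviour described above: a large pileup in the control cancels the main signal entirely.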

When you make your samplesheet, you can specify the appropriate control for each sample (e.g. Normal IGG Control 1 for Normal 1).
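As a rough illustration of such a samplesheet, the sketch below writes one row per IP sample with its matched control. The column names follow the DiffBind samplesheet convention (SampleID, Condition, Replicate, bamReads, ControlID, bamControl, Peaks, PeakCaller); every file path and sample ID is hypothetical, and the `narrow` peak caller value assumes MACS2 narrowPeak output.

```python
import csv
import io


def build_sample_sheet():
    """Return one row per IP sample, each pointing at its matched IgG control."""
    rows = []
    for condition in ("Normal", "Treated"):
        for rep in (1, 2):
            sample = f"{condition}{rep}"
            rows.append({
                "SampleID": sample,
                "Condition": condition,
                "Replicate": rep,
                "bamReads": f"bam/{sample}.bam",               # hypothetical path
                "ControlID": f"{condition}IgG{rep}",
                "bamControl": f"bam/{condition}_IgG{rep}.bam",  # hypothetical path
                "Peaks": f"peaks/{sample}_peaks.narrowPeak",    # hypothetical path
                "PeakCaller": "narrow",
            })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sheet = build_sample_sheet()
csv_text = to_csv(sheet)
print(csv_text)
```

Saving `csv_text` to a file gives a four-row sheet (2 Normal, 2 Treated) in which the IgG samples appear only as controls, not as rows of their own.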

0

Thanks. I have two controls. So if I use the two controls separately, I would get two peak sets for each sample. Another option is to combine the controls for macs2 peak calling. Then I would get:

Normal 1 over IGG Control 1+2 for Normal
Normal 2 over IGG Control 1+2 for Normal
Treated 1 over IGG Control 1+2 for Treated
Treated 2 over IGG Control 1+2 for Treated

0

Why actually merge the controls? At first glance, it seems to me that one retains the maximum amount of information without merging Control 1+2, by using each sample with its own control 1/2.

0
11 months ago
Rory Stark ★ 1.2k

If the controls are matched -- IGG Control 1 was done in the same "batch" as Normal 1 -- you can just call peaks over the matched control.
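A small sketch of what "calling peaks over the matched control" could look like in practice: it only builds and prints the `macs2 callpeak` command lines, it does not run them. The BAM paths, the `hs` genome size, and the `BAM` format flag are assumptions to adapt (for example `BAMPE` for paired-end data).

```python
import shlex


def macs2_command(name, treatment_bam, control_bams, genome="hs", outdir="peaks"):
    """Build a `macs2 callpeak` argument list for one IP sample.

    Passing several control BAMs to -c pools them, which is the alternative
    to the matched design discussed in this thread.
    """
    return [
        "macs2", "callpeak",
        "-t", treatment_bam,
        "-c", *control_bams,
        "-f", "BAM",
        "-g", genome,
        "-n", name,
        "--outdir", outdir,
    ]


# Matched design: each IP sample is paired with the IgG from the same batch.
# All file names are hypothetical.
matched = {
    "Normal1": "bam/Normal_IgG1.bam",
    "Normal2": "bam/Normal_IgG2.bam",
    "Treated1": "bam/Treated_IgG1.bam",
    "Treated2": "bam/Treated_IgG2.bam",
}

commands = [
    macs2_command(sample, f"bam/{sample}.bam", [control])
    for sample, control in matched.items()
]
for cmd in commands:
    print(shlex.join(cmd))
```

This yields four peak sets, one per IP sample, rather than two per sample.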

As you are doing a quantitative differential analysis with these data, there's no need to over-think the controls. The peak calling is just a step to identify potential sites of interest, which will only be identified as being differential if the counts consistently differ. I'd be more concerned that you only have two replicates for each condition than getting overly fancy with the IgG controls; if there is much variance in your data, there may not be enough replicates to confidently identify differential sites.

If I were doing this analysis, I'd take the following steps:

1. Generate greylists from the four IgG samples, and merge them.
2. Filter reads from the Normal/Treated samples that overlap greylisted regions (as well as ENCODE blacklisted regions, if a blacklist exists for your reference genome).
3. Call four sets of peaks (Normal 1/IgG Control 1 etc.)
4. Form a consensus peakset from the four sets; count overlapping reads
5. Normalize to background bins over the filtered reads
6. Perform a differential analysis on this count matrix

A simplified version of the above is to calculate/apply the blacklists/greylists after peak calling and filter peaks instead of reads; this can all be done very straightforwardly in DiffBind once peaks are called with the primary bam files.
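The simplified version (filter peaks, then build a consensus) can be sketched in plain Python on toy intervals. This is an illustration of the logic only, not what DiffBind does internally; the coordinates are invented, intervals are half-open `(chrom, start, end)` tuples, and the "found in at least 2 samples" rule is an assumption standing in for whatever overlap threshold you choose.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share any base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_peaks(peaks, excluded):
    """Drop every peak that overlaps an excluded (greylist/blacklist) region."""
    return [p for p in peaks if not any(overlaps(p, g) for g in excluded)]


def consensus_peaks(peaksets, min_samples=2):
    """Merge overlapping peaks across samples; keep merged regions that are
    supported by peaks from at least `min_samples` different samples."""
    tagged = sorted(
        (chrom, start, end, i)
        for i, peaks in enumerate(peaksets)
        for chrom, start, end in peaks
    )
    merged = []  # each entry: [chrom, start, end, set of sample indices]
    for chrom, start, end, i in tagged:
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
            merged[-1][3].add(i)
        else:
            merged.append([chrom, start, end, {i}])
    return [(c, s, e) for c, s, e, samples in merged if len(samples) >= min_samples]


greylist = [("chr1", 5000, 6000)]  # hypothetical merged greylist
peaksets = [
    [("chr1", 100, 300), ("chr1", 5100, 5400)],  # Normal 1
    [("chr1", 150, 350), ("chr1", 9000, 9200)],  # Normal 2
    [("chr1", 200, 400), ("chr2", 50, 250)],     # Treated 1
    [("chr2", 100, 300)],                        # Treated 2
]

filtered = [filter_peaks(peaks, greylist) for peaks in peaksets]
consensus = consensus_peaks(filtered)
print(consensus)
```

Reads would then be counted in each consensus region for all four samples, whether or not that sample contributed a peak there.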

0

Why would one want to merge the greylists from all four IgG samples prior to removing regions that overlap the greylist? Couldn't it be that there is a greylist area in Treated which is not a greylist area in Normal, but an appropriate peak? Isn't the whole idea of greylists that they are cell line (or sample) specific?

1

Suppose there is a region in the Treated samples that has an anomaly in the Treated control, but not in the Normal control. If you include a peak in this region because it was identified in the Normal, how can you tell if it is differentially bound in Treated if you can't rely on the Treated reads? Remember that we need to calculate a read count for every consensus peak for every sample, whether or not it was identified in a peak in that sample. (The same applies to a csaw style windowing analysis: you need overlapping read counts in each window for every sample, so if you exclude the reads for one sample because of an anomaly in its control, you can't obtain a p-value for that window).

If you are doing a differential analysis, then any region excluded in any sample should be excluded from the analysis, which is why we merge the control-specific greylists into a master greylist.
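A toy sketch of that merge, under the assumption that each control's greylist is a list of half-open `(chrom, start, end)` intervals (the coordinates are made up). The master greylist is simply the union of the four lists, and a consensus peak is dropped if it overlaps the master list, regardless of which control flagged the region.

```python
def merge_intervals(interval_lists):
    """Union several interval lists into one sorted, non-overlapping list."""
    pooled = sorted(iv for intervals in interval_lists for iv in intervals)
    merged = []
    for chrom, start, end in pooled:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def is_excluded(peak, greylist):
    """True if the peak overlaps any greylisted interval."""
    return any(
        peak[0] == g[0] and peak[1] < g[2] and g[1] < peak[2] for g in greylist
    )


# Hypothetical per-control greylists.
greylists = {
    "Normal IgG 1": [("chr1", 1000, 2000)],
    "Normal IgG 2": [("chr1", 1500, 2500)],
    "Treated IgG 1": [("chr1", 1800, 2200), ("chr3", 400, 900)],
    "Treated IgG 2": [("chr3", 800, 1200)],
}

master = merge_intervals(greylists.values())
print(master)

# A peak called only in Normal, inside a region flagged only by the Treated
# controls: it is still excluded, because the Treated counts there are unreliable.
normal_only_peak = ("chr3", 500, 700)
print(is_excluded(normal_only_peak, master))
print(is_excluded(("chr2", 100, 300), master))
```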