Different batches of Chipseq data
5 months ago
Svetlana ▴ 10

Hi All,

I'm currently analyzing a ChIP-seq dataset with 6 biological replicates for each of two conditions: 6 Condition1 vs 6 Condition2. The experiments were done some time ago in 3 different batches, i.e. on 3 different days. I call peaks separately for every sample (n = 12), using the input of that batch as the control. Then I use DiffBind 3.0 to derive a consensus peakset and find differentially bound peaks.

The problem I'm stuck with: when I use all the available replicates, I get very few differentially bound peaks (11!). Combining selected replicates from different batches increases the number of differentially bound regions (though I'm not sure how much of that is due to batch effect).

I would really appreciate your tips on the following:

1) How can I pick proper replicates for the DiffBind analysis, given that the ChIP-seq fingerprint plots look similar for the majority of replicates?

2) Is it appropriate to use samples from different batches, like 1+2+3 for Condition1 and 1+4+5 for Condition2? Maybe I need to use a multi-factor design in DiffBind to account for the batch effect?

Svetlana

DiffBind Chipseq batch batch-effect • 356 views

Did you provide the batch information to the design, something like ~batch + condition? This is how one commonly corrects for batch effects, provided that each condition has replicates in every batch. Otherwise the condition would be confounded with batch and you could not correct for it.
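To make the "each condition has replicates in every batch" requirement concrete, here is a minimal sketch in plain Python (not DiffBind code) that cross-tabulates batch against condition. The sample layout is hypothetical, since the thread does not say which replicate came from which batch, and the function names are made up for illustration.

```python
from collections import Counter

def batch_condition_table(samples):
    """Count samples for every (batch, condition) pair."""
    return Counter((s["batch"], s["condition"]) for s in samples)

def every_batch_has_every_condition(samples):
    """True if each batch contains at least one sample of each condition."""
    table = batch_condition_table(samples)
    batches = {s["batch"] for s in samples}
    conditions = {s["condition"] for s in samples}
    return all(table[(b, c)] > 0 for b in batches for c in conditions)

# ASSUMPTION: 3 batches with 2 replicates per condition in each batch.
samples = [
    {"batch": batch, "condition": cond}
    for batch in (1, 2, 3)
    for cond in ("Condition1", "Condition2")
    for _rep in (1, 2)
]

# A fully confounded layout for contrast: each batch holds one condition only.
confounded = [
    {"batch": 1, "condition": "Condition1"},
    {"batch": 1, "condition": "Condition1"},
    {"batch": 2, "condition": "Condition2"},
    {"batch": 2, "condition": "Condition2"},
]

print(every_batch_has_every_condition(samples))
print(every_batch_has_every_condition(confounded))
```

If the check fails for the real sample sheet, a ~batch + condition design cannot cleanly separate the two factors for the affected batches.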


Thanks for the fast reply! Yes, I saw this option in the DiffBind vignette and was wondering about it. However, I am still not sure whether it would be biologically appropriate to go forward with picking replicates from different batches, even though doing so gave me a much higher number of differentially bound sites (~1,500).

5 months ago
Rory Stark ★ 1.2k

If each batch included only one of the conditions, then batch is confounded with condition and there is no way to differentiate between technical and biological variance. Assuming batch and condition are not confounded, it is definitely worth looking at the clustering plots (dba.plotPCA() and dba.plotHeatmap()), colored by batch and by condition, to see which is the greater source of variance. If the samples cluster by batch, including the batch factor in the model formula (as suggested by ATpoint) could make a big difference.
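As a rough numeric companion to eyeballing the plots, the sketch below computes how much of the spread in a single per-sample score (for example, a first principal component coordinate) is explained by batch versus condition, using between-group over total sum of squares. This is an illustration in plain Python, not something DiffBind does internally, and the scores are invented so that batch dominates.

```python
def fraction_of_variance_explained(values, labels):
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss if total_ss else 0.0

# ASSUMPTION: made-up scores for 12 samples (3 batches x 2 conditions x 2 reps).
scores = [0.0, 0.5, 1.0, 1.5, 10.0, 10.5, 11.0, 11.5, 20.0, 20.5, 21.0, 21.5]
batch = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
condition = ["C1", "C1", "C2", "C2"] * 3

batch_share = fraction_of_variance_explained(scores, batch)
condition_share = fraction_of_variance_explained(scores, condition)
print(f"batch: {batch_share:.3f}, condition: {condition_share:.3f}")
```

When the batch share is much larger than the condition share, as in this toy data, that matches the "samples cluster by batch" situation where adding the batch factor to the model is most likely to help.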

I would be extremely wary about picking which replicates to keep and which to discard based on how many DB sites you get. This is a classic way to introduce bias into your result. Optimally you would include all replicates (possibly excluding only those that failed some objective quality control, e.g. a failed IP). In general, while more replicates usually increase all types of variance, they provide more power to detect biological variance and hence more DB sites.