2.9 years ago by
Denmark / Copenhagen / BRIC
I think that is a very relevant (and common) question to ask - not really odd at all. There is in my opinion too much focus on the absolute number of peaks, and the questioning you do is healthy.
I agree with Sinji that, formally, you can't conclude that. As mentioned, the number of peaks varies with sequencing depth.
We actually did a systematic test of that here in figure 3e, where we randomly downscaled H3K4me3 and input from 26M reads to 2M reads in steps of 2M reads and did peakfinding for all combinations in MACS1.4, MACS 2.0 and EaSeq. As expected, the number of obtained peaks varies a lot depending on the number of reads: more reads generally improve your signal-to-noise ratio and allow more peaks to be identified with higher reliability. Interestingly, the number of found peaks scaled quite differently with the number of reads in MACS and EaSeq, and this will likely also be the case if you test other algorithms. The variation in the number of peaks found with the different algorithms is also quite evident in our Figure 3a and in this paper.
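To make the downscaling idea concrete, here is a minimal Python sketch of how such a series could be set up. It is not the code used for the figure; the function names `depth_series` and `downscale` are made up for illustration, and it assumes that the reads fit in memory as a simple list.

```python
import random


def depth_series(max_reads, step):
    """Return target depths from max_reads down to step, in steps of step."""
    return list(range(max_reads, 0, -step))


def downscale(reads, target, seed=0):
    """Randomly keep `target` reads, sampled without replacement.

    `reads` is assumed to be a list holding one entry per read.
    A fixed seed makes the subsample reproducible.
    """
    if target > len(reads):
        raise ValueError("target exceeds the number of available reads")
    rng = random.Random(seed)
    return rng.sample(reads, target)


# Depths corresponding to 26M down to 2M reads in steps of 2M.
depths = depth_series(26_000_000, 2_000_000)
```

Each downscaled ChIP set would then be paired with each downscaled input set and passed to the peak finder of your choice. For real BAM files you would normally subsample with a dedicated tool rather than in plain Python.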
Finally, even if you took the dataset sizes into account and matched them perfectly, then I agree with Devon that many parts of the experimental conditions can vary a lot, and that the variation between ChIP-seq replicates can be quite high, so it would still be impossible to make that conclusion formally.
Nonetheless, I think that you do see people make the conclusion that a difference in peak numbers indicates different binding - and in many cases it is probably also true. But there are a lot of unknowns whose extent we don't even know, and we cannot reliably measure how much they will affect your conclusion.
If the question is central to your work then I would:
- Make biological replicates - triplicates preferably, but your PI might not agree on that :-)
- Ideally scale the datasets to the same size before peakfinding, and see if the difference in peak numbers is reproduced. That still does not rule out the effects that Devon mentions of cell line genomic sequence identity & copy numbers. However, if your samples are differently treated cells of the same origin, then this should not pose an issue and the variation boils down to 'only' being biology, IP-efficiency, and library creation, which your independent biological triplicates will give you an idea of.
- Use one of the replicates for each condition for peak-calling, and the other(s) for quantifying signal strengths at the peaks in the different samples. Then you might get an indication of how much of the signal is rediscovered in the different samples for each peakset. Using the same samples for identifying peaks and quantifying signal will lead to a bias in the quantitation.
- Make sure that the central conclusions are supported by independent methods as well. ChIP-seq is absolutely not flawless.
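
As a rough illustration of the third point, the sketch below quantifies signal from a held-out replicate at peaks that were called on a different replicate. It is only a sketch under stated assumptions: peaks are given as (chrom, start, end) tuples with inclusive coordinates, each read is reduced to a single (chrom, position) tuple, and signal is expressed as reads per million so that replicates of different size can be compared. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def index_reads(reads):
    """Group read positions by chromosome and sort them for binary search."""
    by_chrom = defaultdict(list)
    for chrom, pos in reads:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()
    return by_chrom


def peak_signal_rpm(peaks, reads):
    """Return reads per million falling inside each peak (ends inclusive).

    `peaks` should come from one replicate and `reads` from another,
    so that peak calling and quantification use independent data.
    """
    total = len(reads)
    if total == 0:
        raise ValueError("no reads to quantify")
    index = index_reads(reads)
    scale = 1_000_000 / total
    signals = []
    for chrom, start, end in peaks:
        positions = index.get(chrom, [])
        count = bisect_right(positions, end) - bisect_left(positions, start)
        signals.append(count * scale)
    return signals


# Toy example: peaks from replicate 1, reads from replicate 2.
peaks_rep1 = [("chr1", 100, 200), ("chr2", 50, 60)]
reads_rep2 = [("chr1", 100), ("chr1", 150), ("chr1", 200),
              ("chr1", 201), ("chr2", 10)]
signal = peak_signal_rpm(peaks_rep1, reads_rep2)
```

Comparing these values across conditions for each peakset gives an indication of how much of the signal is rediscovered, without the bias of quantifying on the same sample that defined the peaks.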