Is there any tool that will tell me how different or similar two ChIP-seq peak sets are in two different parts of the genome? For example, if I have a ~10 kb region in the genome with a series of peaks and another ~10 kb region with another set of peaks from the same experiment, can I calculate a distance measure between these two peak profiles with any available tool?
This is a problem that I am actively investigating. I have come up with a potential (homegrown) approach, but it has not been fully vetted, so keep that in mind. It builds on the following assumptions:
- We assume that the position of each peak is defined independently of the rest
- Within one peak, the reads follow an approximately normal distribution
Thus, if we can detect each peak, find the corresponding peak in the other dataset, and extract only the reads that belong to these two peaks, we can run a statistical test to detect differences between the two read distributions.
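A minimal sketch of such a per-peak test, not the tool mentioned in this answer. It assumes the peaks have already been matched and that the read positions (in bp) falling inside each peak have been extracted; the function name `compare_peak_reads` is hypothetical. It uses a large-sample normal approximation to Welch's test for the difference in means, which is only reasonable when each peak contains many reads.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def compare_peak_reads(reads_a, reads_b):
    """Compare read positions (bp) of one matched peak in two datasets.

    Returns (shift, variance_ratio, p_value). The p-value is two-sided for
    the difference in means and relies on a normal approximation, so it is
    an assumption-laden estimate rather than an exact test.
    """
    if len(reads_a) < 2 or len(reads_b) < 2:
        raise ValueError("need at least two reads per peak")

    mean_a, mean_b = mean(reads_a), mean(reads_b)
    var_a, var_b = variance(reads_a), variance(reads_b)  # sample variances
    if var_a == 0 or var_b == 0:
        raise ValueError("reads within a peak must not all share one position")

    shift = mean_b - mean_a                 # positive: peak moved downstream
    variance_ratio = var_b / var_a          # far from 1: peak width changed
    std_err = sqrt(var_a / len(reads_a) + var_b / len(reads_b))
    z = shift / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return shift, variance_ratio, p_value


# Toy data: the second peak is the first one moved 20 bp downstream.
reads_a = list(range(90, 111))
reads_b = [pos + 20 for pos in reads_a]
shift, variance_ratio, p_value = compare_peak_reads(reads_a, reads_b)
print(shift, variance_ratio, p_value)
```

Testing many peaks this way would also call for a multiple-testing correction, which the sketch leaves out.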
The results characterize each peak individually rather than the entire shape. The differences may manifest themselves as a difference in the mean or the variance of a peak (the first indicates a shift of the peak, the second a change in occupancy). For example, below are the results from a script that I wrote that compares peaks around TSS for two experiments:
The upper panel shows the original peaks, the lower panel shows the underlying read distributions, and the little boxes below show the shift and p-values respectively. The interpretation is that the last 2 peaks show a statistically significant shift in the mean value of 10 bp and +20 bp respectively.
I do have a tool that does this pretty much automatically, but since I am not yet convinced of the correctness of the approach as a whole, it is not yet publicly available.
Not so long ago I was advised that this problem can be thought of as a time series analysis, but I have not yet looked into this possibility. That is something to also investigate.
Looking only at the beautiful plots and not knowing what they really mean, it seems to me that a homegrown approach is unnecessary, because after normalization you end up with two probability mass functions. There are multiple distance measures on probability distributions, but the Jensen-Shannon divergence seems (to me) to be the most useful here because it can be generalized to more than two distributions and has a probabilistic interpretation (it relates to the likelihood that the two runs are samples drawn from the same background distribution).
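A rough sketch of that calculation, assuming each region has already been reduced to read counts over the same number of equally sized bins, with bins aligned by offset from the region start. The helper names are hypothetical, and the base-2 logarithm is a choice that bounds the result between 0 (identical profiles) and 1 (no overlap).

```python
from math import log2


def normalize(counts):
    """Turn per-bin read counts into a probability mass function."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("profile contains no reads")
    return [c / total for c in counts]


def jensen_shannon_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence (in bits) between two binned profiles."""
    if len(counts_p) != len(counts_q):
        raise ValueError("profiles must use the same number of bins")
    p, q = normalize(counts_p), normalize(counts_q)

    divergence = 0.0
    for p_i, q_i in zip(p, q):
        m_i = 0.5 * (p_i + q_i)  # mixture of the two distributions
        # Bins with zero probability contribute nothing (0 * log 0 := 0).
        if p_i > 0:
            divergence += 0.5 * p_i * log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * log2(q_i / m_i)
    return divergence


# Toy profiles: read counts in five bins across each region.
region_1 = [0, 12, 40, 11, 1]
region_2 = [2, 35, 20, 5, 0]
print(jensen_shannon_divergence(region_1, region_2))
```

The result depends on the bin size and on how the two regions are aligned, so both would need to be fixed before comparing values across region pairs.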
El-Yaniv, R., Fine, S. & Tishby, N. Agnostic classification of Markovian sequences. In Advances in Neural Information Processing Systems (NIPS-97) (1997).