Question: Distance Measure Between Chip-Seq Peak Set Profiles
3
10.6 years ago by
2184687-1231-83-5.0k wrote:

Is there any tool that will tell me how different/similar two chip-seq peak sets are in two different parts of the genome? For example, if I have a ~10Kb region in the genome with a series of peaks and another ~10Kb region in the genome with another set of peaks from the same experiment, can I calculate a distance measure between these two peak set profiles with any available tool?

chip-seq distance • 3.9k views
4
10.6 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

Hi,

This is a problem that I am actively investigating. I have come up with a potential (homegrown) approach, but it has not been fully vetted, so keep that in mind. It builds on the following assumptions:

1. We assume that the position of each peak is defined independently of the rest
2. Within one peak the distribution of the reads is governed by a reasonably normal distribution

Thus, if we can detect each peak, find the corresponding peak in the other dataset, and extract only the reads that belong to each of these two peaks, then we can run a statistical test to detect differences between the two read distributions.
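As a rough illustration of the per-peak comparison described above (this is not the tool mentioned in this answer), here is a minimal sketch. It assumes each peak has already been matched across the two datasets and that each read is reduced to a single genomic position. The post does not say which statistical test is used, so a permutation test is swapped in here simply because it needs nothing beyond the standard library. The names `compare_peak` and `simulate_reads` are hypothetical.

```python
import math
import random


def _mean(xs):
    return sum(xs) / len(xs)


def _sd(xs):
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def permutation_test(a, b, stat, n_perm=1000, seed=0):
    """Two-sided permutation p-value for stat(b) - stat(a)."""
    rng = random.Random(seed)
    observed = stat(b) - stat(a)
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = stat(pooled[n_a:]) - stat(pooled[:n_a])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


def compare_peak(reads_a, reads_b, n_perm=1000, seed=0):
    """Compare read positions of one matched peak in two datasets."""
    shift, p_shift = permutation_test(reads_a, reads_b, _mean, n_perm, seed)
    # Center each sample first so a shift in the mean does not
    # masquerade as a difference in spread.
    mean_a, mean_b = _mean(reads_a), _mean(reads_b)
    centered_a = [x - mean_a for x in reads_a]
    centered_b = [x - mean_b for x in reads_b]
    spread, p_spread = permutation_test(centered_a, centered_b, _sd, n_perm, seed)
    return {"shift": shift, "p_shift": p_shift,
            "spread_diff": spread, "p_spread": p_spread}


def simulate_reads(center, sd, n, seed):
    """Hypothetical helper: normally distributed read positions for one peak."""
    rng = random.Random(seed)
    return [rng.gauss(center, sd) for _ in range(n)]
```

Calling `compare_peak(simulate_reads(100, 15, 150, 1), simulate_reads(120, 15, 150, 2))` should report a shift of roughly +20 bp with a small `p_shift`. With real data, peaks would be tested one at a time, and the p-values would need a multiple-testing correction.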

The results characterize each peak individually rather than the shape of the entire profile. Differences may show up as a change in the mean or the variance of a peak (the former indicates a shift of the peak, the latter a change in occupancy). For example, below are the results from a script that I wrote that compares peaks around TSS for two experiments:

The upper panel shows the original peaks, the lower panel shows the underlying read distributions, and the little boxes below show the shift and p-values, respectively. The interpretation is that the last two peaks show a statistically significant shift in the mean value of 10 bp and +20 bp, respectively.

I do have a tool that does this pretty automatically but since I am not yet convinced of the correctness of the approach as a whole it is not yet publicly available.

Not so long ago I was advised that this problem can be thought of as a time series analysis, but I have not yet looked into this possibility. That is something to also investigate.

10.4 years later this may be a long shot, but I'm curious if you explored this any further/made your tool publicly available? I have a problem where I am less interested in identifying the presence/absence of peaks, and more concerned with characterizing changes in peak position/distribution at a given locus. I can't seem to find many established methods of looking at this, but maybe I'm looking in the wrong places!

1
10.6 years ago by
Marcin Cieslik 520 wrote:

Looking only at the beautiful plots, and not knowing what they really mean, it seems to me that a homegrown approach is unnecessary, because after normalization you end up with two probability mass functions. There are multiple distance measures on probability distributions, but the Jensen-Shannon divergence seems (to me) to be the most useful here because it can be generalized to multiple reads and has a probabilistic interpretation (the probability that the two runs represent samples drawn from the same background distribution).

see:

El-Yaniv, R., Fine, S. & Tishby, N. Agnostic classification of Markovian sequences. In Advances in Neural Information Processing Systems (NIPS-97) (1997).
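To make the suggestion above concrete, here is a minimal sketch of the Jensen-Shannon divergence between two peak profiles. It assumes each ~10 kb region has already been summarized as a list of non-negative read counts over the same number of bins; the binning and the function names are illustrative assumptions, not part of any existing tool.

```python
import math


def normalize(profile):
    """Turn non-negative bin counts into a probability mass function."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must contain at least one read")
    return [x / total for x in profile]


def js_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    if len(counts_p) != len(counts_q):
        raise ValueError("profiles must have the same number of bins")
    p = normalize(counts_p)
    q = normalize(counts_q)
    # Mixture distribution halfway between p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Bins where a is zero contribute nothing to the sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A value of 0 means the two normalized profiles are identical (for example, `[1, 2, 3]` versus `[2, 4, 6]`), and 1 means they do not overlap at all. The square root of this quantity is a proper metric, which may be preferable if a true distance is needed.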

One shortcoming of approaches that detect overall differences is that they don't answer what will be the next logical question: in what way are the two peak distributions different?