I have a list of gene pairs and I want to know whether the two genes in each pair are differentially enriched for a particular histone mark under the same condition. I did some research and it seems that all current software is designed to compare the same gene across different conditions.

Initially I thought of using the number of peaks (1). Then I realised that some histone marks can be really broad, so I thought of using the sum of bp under peaks in the gene region / length of the gene region (2). But some of the marks that interest me, H3K4me2 for example, are reported as gapped peaks (ENCODE project), which means they can be both broad and narrow. Some people have told me that in my case I should use the read counts directly and fit a linear model on log(ReadCountGene1 / ReadCountGene2) (3). We noted the lack of normalisation, so we concluded that instead of the raw read count we should use ReadCountGene1IP / ReadCountGene1Input (4). But it seems to me that these last two methods lack the statistical significance of peak calling and simply estimate the fold enrichment, which to me is a marker of the abundance of the mark in the cell population and adds little information about differences in the number of peaks or in the region covered by peaks.
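
To make options (2) to (4) concrete, here is a minimal Python sketch of what I understand them to compute. The function names, the pseudocount and the counts-per-million scaling are assumptions made for illustration only; they do not come from any existing tool, and the per-region read counts are assumed to have been obtained beforehand.

```python
import math


def covered_fraction(peaks, region_start, region_end):
    """Option (2): bp under peaks in the gene region / length of the region.

    `peaks` is a list of (start, end) half-open intervals on one chromosome.
    Overlapping peaks are merged so that no base is counted twice.
    """
    # Keep only peaks that overlap the region, clipped to its boundaries.
    clipped = sorted(
        (max(start, region_start), min(end, region_end))
        for start, end in peaks
        if start < region_end and end > region_start
    )
    covered = 0
    cur_start, cur_end = None, None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # A new, non-overlapping block begins: close the previous one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent peak: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (region_end - region_start)


def log2_enrichment(ip_reads, input_reads, ip_total, input_total, pseudocount=1.0):
    """Option (4): log2(IP / Input) for one gene region.

    Counts are scaled to reads per million (an assumed normalisation) and a
    pseudocount (also an assumption) avoids division by zero.
    """
    ip_cpm = (ip_reads + pseudocount) / ip_total * 1e6
    input_cpm = (input_reads + pseudocount) / input_total * 1e6
    return math.log2(ip_cpm / input_cpm)


def log2_gene_pair_ratio(gene1, gene2, ip_total, input_total):
    """Options (3) and (4) combined: log2 ratio of input-normalised
    enrichment between the two genes of a pair.

    `gene1` and `gene2` are (ip_reads, input_reads) tuples.
    """
    return (
        log2_enrichment(gene1[0], gene1[1], ip_total, input_total)
        - log2_enrichment(gene2[0], gene2[1], ip_total, input_total)
    )
```

Because IP and Input are counted over the same region, the region length cancels in option (4). This sketch only produces a fold enrichment per pair; it attaches no significance to it, which is exactly my concern.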

I am really confused right now about how to analyse these data and about the biological relevance of each of these methods.

Any hint or relevant remark is welcome!


It's not clear at the moment if ChIP-seq is a quantitative method.

What does the height of a peak actually represent? Is it the number of cells? The amount of protein? The affinity of binding? The length of time a protein is bound?

The simplistic view is to assume 'more protein', but since there is a finite upper limit to how many molecules can occupy a given stretch of DNA, the degree of variation in peak heights suggests that a number of other factors are involved.

This may be why people do not generally compare differential ChIP-seq enrichments within the same sample.

