I am looking at the recently published signal tracks from the Roadmap Epigenomics project, namely section d at http://egg2.wustl.edu/roadmap/web_portal/processed_data.html.
They suggest using their -log10 p-value track for analysis. How the track is generated is explained here as well as in their Nature paper. The MACS2 function bdgcmp seems to have done the heavy lifting (see line 217 of the script). As described in the paper, it is supposed to give a p-value, under a Poisson distribution, that the signal/noise ratio is higher than expected by chance. I am interested in this p-value.
However, I want to look at non-overlapping windows in the genome (i.e. bins) of some fixed size, say 100 bp.
The output produced by MACS (and thus the consortium) is of the form:
chr1 0 9853 0.01005
chr1 9853 9927 0.05026
chr1 9927 9971 0.13816
The first three columns describe the genomic interval and the last column is the p-value (on the -log10 scale in this track).
Since I am interested in fixed-length bins I wonder how one would define the p-value for fixed-length bins using this data, keeping the underlying Poisson model in mind. For instance, what is the p-value associated with bin chr1:9800-9900?
A quick and dirty way is, of course, to use Ernst and Kellis's approach and "take a base-level average of signal overlapping each [25bp] bin". I wonder whether this is the best approach in the statistical sense, or whether there is a better one.
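To make the quick and dirty option concrete, here is a minimal sketch of the base-level average for the bin chr1:9800-9900, using the three example intervals from the track. The function name bin_average and the choice to count uncovered bases as zero are my own assumptions, not anything taken from MACS2 or the consortium's script.

```python
def bin_average(intervals, chrom, bin_start, bin_end):
    """Length-weighted mean of the track value over [bin_start, bin_end).

    intervals: iterable of (chrom, start, end, value) in bedGraph-style
    half-open coordinates. Bases not covered by any interval contribute 0
    (an assumption made for this sketch).
    """
    total = 0.0
    for c, start, end, value in intervals:
        if c != chrom:
            continue
        # Number of bases this interval shares with the bin.
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap > 0:
            total += overlap * value
    return total / (bin_end - bin_start)


# The three example records from the track.
intervals = [
    ("chr1", 0, 9853, 0.01005),
    ("chr1", 9853, 9927, 0.05026),
    ("chr1", 9927, 9971, 0.13816),
]

# 53 bp at 0.01005 plus 47 bp at 0.05026, divided by 100 bp.
avg = bin_average(intervals, "chr1", 9800, 9900)
print(round(avg, 5))  # 0.02895
```

Note that averaging -log10 p-values is the same as taking -log10 of the geometric mean of the per-base p-values, so the result is a summary score rather than a Poisson p-value for the bin itself.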
I don't think any of them would make much difference in the downstream analysis, but I am still wondering, more out of curiosity than practicality, I guess.