I want to determine the best way to normalize ChIP-seq replicates that differ in total read count. I am analyzing ChIP-seq data for a factor that is found near transcription start sites (TSSs), and my analysis focuses on a relatively small window around each TSS. Such an experiment may yield 10 million mappable reads, but fewer than 1 million of them map to the window I am interested in, say +/- 1000 bp. I want to normalize for differences in tag counts both between technical replicates and between replicates generated under different conditions.
It seems I could normalize by total reads, i.e. scale each replicate so that its total is equal to 10 million reads, and then proceed to mapping the reads to a window around TSSs. This method uses ALL the data for normalization, more than 90% of which I will discard early in the analysis. So I would be normalizing the signal of interest by what amounts to a great deal of excess noise.
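To make this first option concrete, here is a minimal sketch of total-read scaling. The function name, the per-window counts, and the replicate totals are all made up for illustration; only the 10 million target comes from the description above.

```python
def scale_to_depth(window_counts, total_mapped_reads, target_depth=10_000_000):
    """Scale per-window tag counts as if the library had target_depth mapped reads.

    The scaling factor depends on ALL mapped reads in the replicate,
    not only on the reads that fall inside the TSS windows.
    """
    factor = target_depth / total_mapped_reads
    return [count * factor for count in window_counts]


# Hypothetical example: tag counts in three TSS windows for two replicates.
rep1 = scale_to_depth([120, 40, 300], total_mapped_reads=10_000_000)
rep2 = scale_to_depth([90, 30, 240], total_mapped_reads=8_000_000)

print(rep1)
print(rep2)
```

Here the second replicate is multiplied by 1.25 purely because it has fewer mapped reads genome-wide, regardless of how many of those reads lie near TSSs.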
Alternatively, I could first map the reads to the window of interest around TSSs and then normalize only the data that lies within this window. In this case I can first check how the number of tags within the window compares to the total number of tags in each replicate. If an equal proportion of total reads from each replicate maps to +/- 1 kb around TSSs, both methods should yield similar results. However, to me it makes sense to refine the data first, isolating the data I will ultimately analyze, and then normalize between replicates to adjust for read counts. This seems especially true when the biology predicts that replicates from different cellular conditions will differ in a narrow window around a subset of genes: a small percentage of a large dataset.
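A rough sketch of this second option, again with hypothetical function names and invented counts: first compute the fraction of each replicate's reads that falls in the windows, then scale only the in-window counts to a common total.

```python
def fraction_in_window(window_counts, total_mapped_reads):
    """Share of a replicate's mapped reads that fall inside the TSS windows."""
    return sum(window_counts) / total_mapped_reads


def scale_to_window_total(window_counts, target_window_total):
    """Scale counts so that the in-window tags sum to target_window_total."""
    factor = target_window_total / sum(window_counts)
    return [count * factor for count in window_counts]


# Hypothetical replicates: per-window tag counts and total mapped reads.
rep1_counts, rep1_total = [100, 50, 350], 10_000_000
rep2_counts, rep2_total = [80, 40, 280], 8_000_000

frac1 = fraction_in_window(rep1_counts, rep1_total)
frac2 = fraction_in_window(rep2_counts, rep2_total)

# Arbitrary common in-window total chosen for this illustration.
target = 500
rep1_scaled = scale_to_window_total(rep1_counts, target)
rep2_scaled = scale_to_window_total(rep2_counts, target)

print(frac1, frac2)
print(rep1_scaled, rep2_scaled)
```

In this made-up case both replicates have the same fraction of reads in the windows, so scaling by in-window total and scaling by total reads give the same relative values. The two approaches would diverge only when the fractions differ.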
Does a consensus exist as to the best approach?