Question

ChIP-seq datasets: input samples omitted?

3 months ago

vanbelj ▴ 40

I have been analyzing RNA Polymerase II ChIP-seq datasets available at NCBI's Gene Expression Omnibus (GEO). I'm working with datasets that have been published by reputable labs.

I'm repeatedly finding that only the IP fraction is provided, and I assume the input fraction was not sequenced.

When I refer to the publications where the data is reported, I find y-axis labels such as "Counts Per Million", "Fold Enrichment", or "Spike-In Normalized". It appears that many labs have forgone input normalization completely and are relying solely on a spike-in control, generally in the form of chromatin from an independent species, for normalization. I understand that this type of control would allow for normalization of library size or technical variation between samples. However, I do not see how a spike-in control could be used to normalize for site-based relative enrichment.

Am I missing something? Isn't an input sample a necessary control for accurate peak calling in ChIP-seq?

ChIP-seq Normalization NGS • 440 views

updated 12 weeks ago by i.sudbery 20k • written 3 months ago by vanbelj ▴ 40

score 5 · Accepted Answer · 2024-04-28

In my hands (and from what I have seen over many years here), inputs are used almost exclusively during peak calling to correct for locus-specific amplification bias, and are then omitted entirely, especially during differential analysis. Normalization to input is something I have never done because I am not aware of reliable tools for it.

Also, since input sequencing is basically low-depth WGS, you would need good coverage across the genome to avoid many zeros in peak regions, and that requires depth that usually nobody wants to pay for. Say you want only 3x coverage with a typical ChIP-seq sequencing setup of 1x75 bp reads and a 3 billion bp genome: you would then need (see How To Calculate Coverage) 120 million reads for a single input. A typical ChIP-seq sample is sequenced to maybe 30 million reads, so that obviously drives costs up quite a bit. On the other hand, if you sequence the input at 30 million reads with 1x75 bp, you get an average coverage of 0.75x, so mostly zeros, and hence no information for many potential peak regions.

In practice, the input often just eliminates obvious amplification-biased regions (thinking aloud here, I don't have an in-depth analysis to show that), so depending on experimental design and budget one might simply not do inputs. The same goes for IgG controls, which are probably even worse because you essentially build a library with almost zero DNA content, as an IgG IP returns almost nothing.
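For anyone who wants to check the coverage arithmetic above, here is a minimal sketch. It uses the usual approximation coverage = reads x read length / genome size. The function names are made up for illustration, and the last step assumes reads land uniformly at random (a Poisson model), which real input libraries only roughly follow.

```python
import math

def reads_for_coverage(coverage, genome_size_bp, read_length_bp):
    """Reads needed so that reads * read_length / genome_size equals the target coverage."""
    return coverage * genome_size_bp / read_length_bp

def coverage_from_reads(n_reads, genome_size_bp, read_length_bp):
    """Average per-base coverage obtained from a given number of single-end reads."""
    return n_reads * read_length_bp / genome_size_bp

GENOME_SIZE = 3_000_000_000  # ~3 billion bp genome, as in the example above
READ_LENGTH = 75             # 1x75 bp single-end reads

# Reads required for 3x coverage of a single input sample
needed = reads_for_coverage(3, GENOME_SIZE, READ_LENGTH)

# Average coverage when the input is sequenced like a typical ChIP (30 million reads)
cov = coverage_from_reads(30_000_000, GENOME_SIZE, READ_LENGTH)

# Assumption: under a Poisson model, the fraction of bases with zero reads is exp(-coverage)
uncovered = math.exp(-cov)

print(f"reads needed for 3x: {needed:,.0f}")
print(f"coverage at 30 million reads: {cov:.2f}x")
print(f"approx. fraction of bases with no reads: {uncovered:.2f}")
```

Under that uniform-sampling assumption, close to half of all bases get no input read at 0.75x, which is the "mostly zeros" problem in numbers.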

That said: if your goal is to precisely define binding sites (peaks), then you definitely should do proper inputs. If your goal is differential analysis between conditions, then you don't really need inputs, as they're not really used at all. As always, it depends on what you do.

Note that this is all my interpretation of the problem, based on my experience. I have no quality analysis I could show to back that up, but it's how I tackled ChIP-seq when I was still working with it in the lab.