Question

Importance of INPUT for ChIP-seq

2.3 years ago

N ▴ 10

The importance of an Input sample (i.e. genomic DNA that hasn't been immunoprecipitated) confuses me. First of all, is its only purpose to serve as a control for background peaks? And would you expect it to contain more reads, the same, or fewer reads than the treatment samples? How do the library sizes compare between the two and are they normalized?

Thanks

ChIP-seq MACS2 • 1.2k views

updated 2.3 years ago by ATpoint 81k • written 2.3 years ago by N ▴ 10

score 2 · Accepted Answer · 2021-12-14

First of all, is its only purpose to serve as a control for background peaks?

Others might use it differently, but yes: I use it during peak calling as a control against background. Not all genomic regions amplify and sequence equally well, so a DNA input control especially benefits peaks with weak protein binding and experiments with poor antibodies. I do not use it for anything else, because the absence of peaks makes it hard, if not impossible, to properly normalize IP vs. input beyond reads per million (RPM), and RPM performs poorly for between-sample comparisons.

Further reading on why RPM performs poorly:

TMM-Normalization

ATAC-seq sample normalization
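To make the RPM point concrete, here is a toy sketch in Python. All numbers are made up for illustration, and the assumption that half of the IP reads fall into peaks is hypothetical. It only shows the general idea: with identical library sizes, the same background region gets a lower RPM in the IP simply because peaks soak up part of the reads.

```python
def rpm(region_count, total_mapped_reads):
    """Reads per million: scale a region's read count by library size."""
    return region_count * 1_000_000 / total_mapped_reads

# Hypothetical libraries of identical size (illustrative numbers only).
input_total = 10_000_000
ip_total = 10_000_000

# A background region that collects 100 reads in the input.
input_background = 100

# Assumption: half of the IP reads land in peaks, so only half of the
# library is left to cover background regions like this one.
ip_fraction_outside_peaks = 0.5
ip_background = int(input_background * ip_fraction_outside_peaks)

print(rpm(input_background, input_total))  # 10.0
print(rpm(ip_background, ip_total))        # 5.0
```

The region is pure background in both samples, yet RPM suggests a two-fold difference. That is the kind of composition effect the linked normalization threads discuss.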

And would you expect it to contain more reads, the same, or fewer reads than the treatment samples?

That depends on how you sequence it. Read numbers depend on the library concentration on the flow cell, so there is no direct link between the type of library and the number of reads obtained. ChIP input is basically low-coverage whole-genome sequencing (WGS), but if you really want to use it as an input control you have to sequence it at least to a depth similar to the IP.

You can calculate the number of reads required to get a given coverage in inputs (= WGS). Say you want at least 2x coverage (idealized, assuming even coverage). With a human genome size of ~3e9 base pairs and 150 sequencing cycles (e.g. 2x75 on a NextSeq), you would need 2 x 3e9 / 150 = 40 million read pairs. That is more than most people are willing to spend on inputs. Often in published data you see a few million input reads. That then (imho) really only captures the top bias regions of the genome that pop up as outliers during PCR/sequencing and as such appear as false-positive peaks in the IP. But I understand the struggle. If I had limited money/resources I would probably always add an extra IP/antibody rather than sequence an input deeper, unless I knew that the IP was very poor and a good input was key to getting any confident peaks.

If the input is sequenced too shallowly, coverage will be zero for most parts of the genome, making it a waste of resources.
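The back-of-the-envelope depth calculation can be sketched in a few lines of Python. The function names are made up for this example, and the math assumes idealized, perfectly even coverage, just like the estimate in the answer. "Fragments" means reads for single-end data and read pairs for paired-end data.

```python
import math

def required_fragments(target_coverage, genome_size_bp, bases_per_fragment):
    """Fragments needed to reach a target coverage, assuming even coverage."""
    return math.ceil(target_coverage * genome_size_bp / bases_per_fragment)

def expected_coverage(n_fragments, genome_size_bp, bases_per_fragment):
    """Idealized mean coverage obtained from a given number of fragments."""
    return n_fragments * bases_per_fragment / genome_size_bp

HUMAN_GENOME_BP = 3_000_000_000  # ~3e9 bp, the rough figure used in the answer
BASES_PER_PAIR = 2 * 75          # 2x75 paired-end = 150 sequenced bases per pair

# 2x coverage of the human genome with 2x75 reads
print(required_fragments(2, HUMAN_GENOME_BP, BASES_PER_PAIR))         # 40000000

# Coverage from an input with 5 million read pairs
print(expected_coverage(5_000_000, HUMAN_GENOME_BP, BASES_PER_PAIR))  # 0.25
```

So an input with 5 million 2x75 read pairs averages about 0.25x coverage under these assumptions, which is why most of the genome ends up with no reads at all.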

How do the library sizes compare between the two and are they normalized?

As said above, the IP should give a peaky profile, whereas input is basically low-coverage WGS with no enrichment beyond the "normal" amplification/PCR and sequencing noise of the genome. That means some regions (e.g. depending on GC content) amplify better and others worse, and this is then reflected in the read counts for these regions.
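If you want to check whether input coverage tracks GC content, a first step is to compute GC content per genomic window. Below is a minimal Python sketch on a toy sequence. The function names and the window size are arbitrary choices for this example, and a real analysis would read windows from a reference FASTA instead.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return float("nan")  # window contains only N or other ambiguous bases
    return (seq.count("G") + seq.count("C")) / counted

def windowed_gc(seq, window):
    """GC fraction for consecutive, non-overlapping windows of fixed size."""
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]

toy_sequence = "ATATATATGCGCGCGC"
print(windowed_gc(toy_sequence, 8))  # [0.0, 1.0]
```

Plotting input read counts per window against these GC fractions would show how strongly the amplification bias mentioned above affects a given library.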

Does that answer your questions?