Why is an Input file used as a control during MACS peak calling?
8.7 years ago

I am new to the computational biology field, but I am very familiar with the biology side of genome-wide experiments. I am currently trying to call peaks with MACS 1.4 on aligned BAM files. I noticed that people tend to use Input as the control when peak calling, and I was wondering if someone could explain why. By my logic, MACS uses the control it is given to find regions of binding that are enriched in the treatment file over the control. However, if the input is given as the control, there technically shouldn't be ANY regions where the treatment file is more enriched than the control, given that the treatment is an IP and the input contains all genomic DNA. If someone could clear this up for me I would be grateful!

Cheers

MACS ChIP-Seq peak-calling • 7.8k views
ADD COMMENT
8.7 years ago

There is a treatment sample and two types of control samples: (1) the "input" control and (2) the "mock" control. The "input" control is just genomic DNA without any immunoprecipitation. The "mock" control is usually done with an IgG antibody. So the "input" control is not the "treatment" sample.

MACS uses the control sample to calculate an empirical FDR: it swaps treatment and control, and any peaks called in the control are treated as false discoveries. If no control is provided, no FDR is reported in the output. So control samples help in identifying reliable peaks. MACS also uses the control to estimate the local background read density against which candidate peaks are tested.
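To make the empirical FDR idea concrete, here is a minimal sketch. It assumes the sample-swap approach (call peaks once with the IP as treatment, once with the two files swapped) and takes the FDR at a cutoff to be the number of control peaks divided by the number of treatment peaks. The function name and the peak scores are made up for illustration; this is not MACS's own code.

```python
def empirical_fdr(treatment_scores, swapped_scores, threshold):
    """Estimate FDR at a score cutoff as (#control peaks) / (#treatment peaks)."""
    n_treatment = sum(1 for s in treatment_scores if s >= threshold)
    n_control = sum(1 for s in swapped_scores if s >= threshold)
    if n_treatment == 0:
        return 0.0  # no peaks pass the cutoff, so nothing can be a false discovery
    return min(1.0, n_control / n_treatment)


# Hypothetical peak scores (higher = more significant), invented for this example.
ip_vs_input = [45, 50, 60, 80, 120, 300]  # peaks called with the IP as treatment
input_vs_ip = [47, 52]                    # peaks called after swapping the two files

for cutoff in (50, 100):
    print(cutoff, empirical_fdr(ip_vs_input, input_vs_ip, cutoff))
```

A stricter cutoff leaves fewer swapped-sample peaks standing, so the estimated FDR drops.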

Some tools try to normalise the data using control samples.

ADD COMMENT

...by definition every peak that you see in your treatment file will be seen in the control file, because the treatment file is just a portion of the input bound by a specific transcription factor.

Treatment is just a portion of the input, correct, but it is the portion that remains after the non-bound input is washed away. The input represents the starting DNA concentration of all locations in your genome. The immunoprecipitation presumably pulls down some regions (those attached to the protein of interest) and not others, which get washed away during the experiment. At the end, the IP contains some regions of DNA, but most of the input DNA has dropped out. The IP is therefore enriched, and after normalization/scaling, as mentioned by Devon, one should see the locations of enrichment. Since non-specific locations get washed away, those locations will have relatively fewer reads in the IP than in the input, so the bound regions show up as peaks in the IP relative to the input.

ADD REPLY

Right, I get that, but if you use input as a control, by definition every peak that you see in your treatment file will be seen in the control file, because the treatment file is just a portion of the input bound by a specific transcription factor. This is where my confusion lies.

ADD REPLY

My feeling is that your confusion comes from the misinterpretation that the IP sample will contain only those sites where the antibody found its target. That is wrong. In fact, the vast majority of DNA found in an IP sample will represent the entire genome (= background = akin to what you expect to see in the input). After all, enrichment values of a good IP tend to be between 1 and 10%, which means that 1 to 10% of the entire genome is found more often in the IP sample than in the input sample. It does not mean that the IP contains only those 1 to 10%!
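A quick back-of-the-envelope calculation shows why most IP reads are still background. This is only a toy model with assumed numbers: a fraction of the genome is bound, bound regions are pulled down some fold more efficiently than everything else, and reads are drawn in proportion. The function name and the values are hypothetical.

```python
def background_read_fraction(bound_fraction, fold_enrichment):
    """Fraction of IP reads that come from unbound (background) regions.

    Toy model: unbound regions contribute reads at relative rate 1,
    bound regions at relative rate `fold_enrichment`.
    """
    signal = bound_fraction * fold_enrichment
    background = 1.0 - bound_fraction
    return background / (signal + background)


# Assumed example: 1% of the genome is bound, with 10-fold enrichment at bound sites.
print(background_read_fraction(0.01, 10))
```

Under these assumptions roughly nine out of ten IP reads still come from background, even though the bound sites stand out clearly once the two samples are compared.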

The need for an input sample comes from the observation that the genome is usually never represented uniformly, i.e. even if you simply sequence genomic DNA (= input in ChIP-speak), you will have regions that you see more often than others (for various reasons, e.g. PCR artifacts). That means that if you ran MACS on a simple genomic DNA sequencing experiment, it would most likely identify peaks, i.e. regions with more read coverage than expected. These "aberrant" peaks will of course also be present in your IP sample, but they are not the kind of peaks you are interested in reporting (because they are not specific to your DNA-binding protein of interest). Hence, MACS encourages you to compare the peaks found in your IP sample to those found in a matching input sample so that you can focus on those that are presumably indicative of your protein of interest binding the genome.
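Here is a small sketch of how an input-derived background changes the verdict on a pileup. It assumes a Poisson model in which the expected read count for a window is the largest of the genome-wide background and the (already depth-scaled) control rates around that window, which is a simplified version of the local-lambda idea MACS describes. Function names and numbers are invented for illustration.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing terms from k upward."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    # P(X = k), computed in log space to avoid overflow in k!.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and term > 1e-17 * total:
        total += term
        i += 1
        term *= lam / i
    return min(1.0, total)


def local_lambda(genome_background, control_rates):
    """Expected reads per window: the most conservative of the available estimates."""
    return max(genome_background, *control_rates)


ip_reads = 65          # reads observed in the IP window (hypothetical)
genome_background = 5  # expected reads per window genome-wide (hypothetical)

# Window A: the input also piles up here (e.g. a PCR or mappability artifact).
p_artifact = poisson_upper_tail(ip_reads, local_lambda(genome_background, [60, 40, 30]))
# Window B: the input is flat here, so the IP pileup is specific.
p_specific = poisson_upper_tail(ip_reads, local_lambda(genome_background, [6, 5, 5]))

print("artifact window p-value:", p_artifact)
print("specific window p-value:", p_specific)
```

The same IP read count is unremarkable where the input is also high and highly significant where the input is flat, which is the filtering role the input control plays.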

ADD REPLY

I updated my post.

ADD REPLY

Yep, we're still not on the same page haha. I understand what input and mock IgG controls are, and if the IgG control were used as the "control" file when peak calling, everything would make sense to me. However, from everything that I've read, it seems like people use the input as the "control" file and the IP as the "treatment" file when doing MACS peak calling, in which case nothing in the treatment file should really be amplified over the "control" (input) file, because the "treatment" IP by definition is only a percentage of the input anyway. And from what I understand, MACS calls peaks where the treatment (IP) is enriched over the control (input), which technically should never happen...

ADD REPLY

The control is not immunoprecipitated with an antibody against the protein of interest, whereas the treatment is. So the peaks (due to immunoprecipitation) in the treatment should not be seen in the control; if they are, they will be false positives. Anyway, let's wait for responses from others.

ADD REPLY

You're missing that things are scaled, so IP will indeed show vast over-representation in certain areas versus input.
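To illustrate the scaling point, here is a toy sketch that scales the input count in a window to the IP's sequencing depth before comparing. The linear scaling by total read count and all numbers are assumptions for illustration, not a description of MACS's exact procedure.

```python
def scaled_fold_enrichment(ip_count, input_count, ip_total, input_total):
    """Fold enrichment of IP over input in one window after depth scaling."""
    scale = ip_total / input_total    # bring the input to the IP's depth
    expected = input_count * scale    # reads expected if there were no enrichment
    if expected == 0:
        return float("inf")
    return ip_count / expected


# Hypothetical libraries: 20M input reads, 10M IP reads.
# In one window the input has 40 reads and the IP has 200.
print(scaled_fold_enrichment(200, 40, 10_000_000, 20_000_000))
```

What matters is each window's share of its own library, not the absolute amount of DNA, so a region can easily be over-represented in the IP even though the IP is physically a small part of the input.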

ADD REPLY
