Question: How does MACS algorithm work on ATAC-seq data if there is no control sample and no model estimation?
salamandra wrote, 14 months ago:

I understand that, unlike for ChIP-seq, MACS doesn't build the model for ATAC-seq, which means it doesn't estimate the distance 'd' between the tags and therefore doesn't shift the tags by d/2 in the 3' direction (which is why the --nomodel --shift 0 parameters are set for ATAC-seq).

But if 'd' is not estimated, how can MACS slide a window of 2d across the genome to calculate peak enrichment?

Also, for ChIP-seq there is a control sample (input DNA or DNA pulled down with a nonspecific antibody) that is used to calculate the local lambda and determine peak enrichment. In ATAC-seq there is no control pulled down with a nonspecific antibody. What serves as the control in ATAC-seq?


Thanks for the discussion. I am working through the same question and, as a non-bioinformatician, I much appreciate this!

written 9 weeks ago by jihed.chouaref
ATpoint (Germany) wrote, 14 months ago:

MACS (for ChIP or ATAC) does not necessarily need a control. There is actually a section in the paper describing exactly the situation where no control is available. In this case, it determines the local lambda in certain windows. The paragraph starts with:

Therefore, instead of using a uniform λBG, estimated from the whole genome, MACS uses a dynamic parameter, λlocal, defined for each candidate peak as (...)

In the absence of a control, one would naively estimate the background level of the experiment uniformly. That means if you have, say, 30 million reads and you randomly threw them onto the genome, every base (given a certain fragment length of the library) would have a coverage of x. In practice, genome coverage is never uniform (you will see this most impressively once you analyze your first WGS sample, it is really a hilly landscape) due to differences in chromatin structure, PCR/GC bias, local copy number alterations, etc. So instead of a genome-wide λBG, MACS checks the vicinity of the peak centers (up to 10 kb) to estimate how prone this genomic region is to accumulate reads. In my understanding, as a typical ChIP-seq experiment produces sharp peaks, the local environment should be depleted of enriched signal. Therefore, notable read counts in the vicinity indicate a local bias. As a result, the peak enrichment needs to be penalized (down-corrected), as the region itself is prone to accumulate reads irrespective of the protein target.
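To put rough numbers on the uniform-versus-local point, here is a minimal Python sketch (this is not MACS code). The 2.7e9 bp effective genome size, the 400 bp window, the observed count of 20 reads and the local rate of 15 are illustrative assumptions, and `uniform_lambda` and `poisson_sf` are hypothetical helper names.

```python
import math


def uniform_lambda(total_reads, window_bp, genome_bp):
    """Expected read count in a window if reads were spread uniformly."""
    return total_reads * window_bp / genome_bp


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term              # add P(X = i)
        term *= lam / (i + 1)    # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


# Assumed numbers: 30 million reads, 400 bp window, 2.7e9 bp genome.
lam_bg = uniform_lambda(30_000_000, 400, 2.7e9)

# Assumed local rate for a region that accumulates reads anyway.
lam_local = 15.0

observed = 20  # assumed read count in the candidate window
print("lambda_BG:", lam_bg)
print("p-value with uniform background:", poisson_sf(observed, lam_bg))
print("p-value with local background:  ", poisson_sf(observed, lam_local))
```

The same 20 reads look far less surprising against the higher local rate than against the genome-wide rate, which is the down-correction described above.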


Hi, I have some questions about your explanation. As you said, when there is no control, the paragraph says MACS uses a dynamic parameter. But in the paper, the full passage reads:

"For example, at the FoxA1 candidate peak locations, tag counts are well correlated between ChIP and control samples (Figure 1c,d). Many possible sources for these biases include local chromatin structure, DNA amplification and sequencing bias, and genome copy number variation. Therefore, instead of using a uniform λBG estimated from the whole genome, MACS uses a dynamic parameter, λlocal, defined for each candidate peak as:"

Does it mean that λlocal works to eliminate the influence of local biases? Furthermore, in the following passage, there is another sentence:

"where λ1k, λ5k and λ10k are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case λ1k is not used)."

Given this, I don't think λlocal is specifically meant for the situation where no control is available.

By the way, I don't quite understand what the control really is in ATAC-seq analysis, or whether a control is necessary at all. I wonder whether the data with nucleosome signal could work as a control, since the nucleosome-free regions would probably not be detected in those data (as in Figure 3A of Buenrostro J D, et al. Nature Methods, 2013, 10(12): 1213-1218).

modified 13 months ago • written 13 months ago by ghostforever.shi

"Does it mean that λlocal works to eliminate the influence of local biases?"

I would say it tries to estimate and correct for the bias.

"where λ1k, λ5k and λ10k are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case λ1k is not used)."

Without a control, only the 5 kb and 10 kb windows are used, and they are taken from the ChIP sample itself.
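As a rough illustration of that rule, here is a minimal Python sketch of taking the maximum over the background estimates, with the 1 kb window dropped when there is no control. It is not the MACS implementation: `lambda_local` is a hypothetical function, scaling each window count to the peak width is my assumption, and the library-size scaling between ChIP and control is left out.

```python
def lambda_local(lambda_bg, window_counts, peak_width, has_control):
    """Pick the largest background estimate around a candidate peak.

    lambda_bg     : genome-wide expected count per peak-sized window
    window_counts : read counts in windows centered on the peak,
                    keyed by window size in bp (1000, 5000, 10000);
                    counted in the control if there is one, otherwise
                    in the ChIP/ATAC sample itself
    peak_width    : width of the candidate peak in bp
    has_control   : whether a control sample is available
    """
    # Without a control the 1 kb window is skipped, as in the quoted passage.
    sizes = (1000, 5000, 10000) if has_control else (5000, 10000)
    candidates = [lambda_bg]
    for size in sizes:
        # Scale the window count to the peak width (assumption).
        candidates.append(window_counts[size] * peak_width / size)
    return max(candidates)


# Made-up counts around one candidate peak of 400 bp.
counts = {1000: 50, 5000: 100, 10000: 150}
print(lambda_local(4.44, counts, 400, has_control=True))
print(lambda_local(4.44, counts, 400, has_control=False))
```

In a quiet region where all window estimates fall below the genome-wide value, the function simply returns `lambda_bg`, so the local estimate can only make the test stricter, never more lenient.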

"By the way, I don't quite understand what the control really is in ATAC-seq analysis, or whether a control is necessary at all. I wonder whether the data with nucleosome signal could work as a control, since the nucleosome-free regions would probably not be detected in those data (as in Figure 3A of Buenrostro J D, et al. Nature Methods, 2013, 10(12): 1213-1218)."

In ATAC-seq, you do not have a control. Both nucleosomal and nucleosome-free signals are located in open chromatin. Do not mistake open chromatin for nucleosome-free DNA. Open chromatin is a combination of distinctly positioned nucleosomes that flank nucleosome-free DNA. ATAC-seq peaks contain both nucleosomal and nucleosome-free signals.

written 13 months ago by ATpoint

Oh, I get it, much appreciated. However, as you know, Tn5 is an enzyme. Without a control, how can we correct for the bias caused by the enzyme itself? This problem really confuses me.

written 13 months ago by ghostforever.shi

Double-check the early ATAC papers (maybe the original one and the one on the NucleoATAC software). They show that the intrinsic cutting preference/bias of the Tn5 transposase is minimal, if I remember correctly. The only proper control would probably be to use Tn5 on "naked" genomic DNA. But then you need quite a lot of reads to get a proper signal due to the size of mammalian genomes, so the cost/benefit ratio is simply not economical and nobody routinely does it (this last sentence is only me thinking aloud ^^).

modified 13 months ago • written 13 months ago by ATpoint

Yep, it appears in the original one; I missed that. >_<|||

Thanks a lot!

written 13 months ago by ghostforever.shi
YaGalbi (Biocomputing, MRC Harwell Institute, Oxford, UK) wrote, 14 months ago:

Have you seen these posts? 1) biostars1 2) biostars2 3) MACSgithub


I have, and none of them seem to answer my questions, but as I'm no expert in bioinformatics it is possible I am just not understanding what they are saying.

written 14 months ago by salamandra