Question: ATAC-seq peak calling with MACS
13
gravatar for igor
2.2 years ago by
igor7.0k
United States
igor7.0k wrote:

I am trying to call peaks in ATAC-seq data. Not surprisingly, MACS is a popular option. According to the MACS documentation:

  1. To find enriched cutting sites such as some DNAse-Seq datasets. In this case, all 5' ends of sequenced reads should be extended in both direction to smooth the pileup signals. If the wanted smoothing window is 200bps, then use '--nomodel --shift -100 --extsize 200'.

  2. For certain nucleosome-seq data, we need to pileup the centers of nucleosomes using a half-nucleosome size for wavelet analysis (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about 147bps, this option can be used: '--nomodel --shift 37 --extsize 73'.

Based on a brief literature review, people use both --shift -100 --extsize 200 and --shift 37 --extsize 73 for ATAC-seq. Is one option more appropriate? Are there maybe different sub-types of ATAC-seq where one is better than the other?

atac-seq chip-seq peaks macs • 14k views
ADD COMMENTlink modified 6 months ago by ZheFrench150 • written 2.2 years ago by igor7.0k

I was wondering which parameters to use too. Assuming this is the pipeline ENCODE is using (https://github.com/kundajelab/atac_dnase_pipelines), specifically the ATAC-seq pipeline protocol definition (https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit?usp=sharing)

I am seeing that in the step 3a. Peak calling - MACS2 : function macs2()

shiftsize := round( smooth_window.parseReal()/2.0 )

macs2 callpeak \ -t $tag -f BED -n "$prefix" -g "$gensz" -p $pval_thresh \ --nomodel --shift -$shiftsize --extsize $smooth_window --broad --keep-dup all

The shift is - shiftsize

If you look down in the document:

function main() {

// By default, smoothing window for MACS2 is 150. IDR threshold is 0.1
// for ENCODE3, smoothing window for MACS2 is 73. IDR threshold is 0.05
if ( ENCODE3 ) {
      smooth_win = 73
      idr_thresh = 0.05
}
else {
      smooth_win = 150
      idr_thresh = 0.1
}
  

So effectively it would be either

macs2 callpeak ... --nomodel --shift -37 --extsize 73 --broad --keep-dup all

or

macs2 callpeak ... --nomodel --shift -75 --extsize 150 --broad --keep-dup all

Could it be that the first one will be sensitive to peaks and valleys in signal due to nucleosomes, while the second one will extend the cutting sites to cover 1 nucleosome?

ADD REPLYlink written 20 months ago by jaime.alvarez.benayas10
6
gravatar for benformatics
2.2 years ago by
benformatics320
benformatics320 wrote:

Comparison of cuts by technique

Assuming that in ATAC-Seq, we make a cut at "any" open chromatin regions, I imagine that using nucleosome specific parameters would not be optimal. Given that for things like MNAse you assume your average fragment is as long as the DNA wrapped around the nucleosome (~150 nt) which is not necessarily true for ATAC. Thus my guess would be method #1 would provide more optimal results.

Although if you check this workflow neither of those parameters are used for MACS: http://informatics.fas.harvard.edu/atac-workshop-capellini-lab.html

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by benformatics320

In that workflow, they also don't adjust for larger fragment size during Bowtie2 alignment (-X/--maxins), which is specifically discussed in the original ATAC-seq paper, so I wouldn't treat it as the gold standard.

Also, that's a really nice illustration. Where is it from?

ADD REPLYlink written 2.2 years ago by igor7.0k

The source of the image is here: http://www.the-scientist.com/?articles.view/articleNo/44772/title/Reveling-in-the-Revealed/

In regards to the larger fragment size, it does seem smart to try to adjust the --X/--maxins parameter to match your dataset. In my case it increased the number mapped reads by 1-2%.

I found this resource to be very helpful. It almost seems to suggest that increasing -X anytime you use bowtie2 is vital. http://lab.loman.net/2013/05/02/use-x-with-bowtie2-to-set-minimum-and-maximum-insert-sizes-for-nextera-libraries/

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by benformatics320

It shouldn't change the alignment rate by much, since it will just attempt to align the mates independently by default. However, it will count them as discordant. Also, the fragment size distribution will be cut off.

And I agree. Increasing that parameter is helpful in general. You don't always know what your library size is going to be. The only downside is it will take longer to run.

ADD REPLYlink written 2.2 years ago by igor7.0k
1

I think an -X 2000 for bowtie2 makes sense, as most current sequencing platforms are not able to sequence fragments > 2000bp. So this is rather a filter against false-positive alignments and pretty much standard in most pipelines.

EDIT(02/18): Still, larger insert sizes than expected can indicate structural variants and might be discarded with -X 2000. Still, when one is interested in structural variants, you 1) would probably not use ATAC-seq data and 2) rather use BWA mem for alignment. For standard ATAC-seq -X 1000 or 2000 makes perfect sense to me.

ADD REPLYlink modified 8 months ago • written 2.2 years ago by ATpoint9.3k
4
gravatar for igor
2.2 years ago by
igor7.0k
United States
igor7.0k wrote:

I got another alternative from the MACS developer:

If you followed original protocol for ATAC-Seq, you should get Paired-End reads. If so, I would suggest you just use "--format BAMPE" to let MACS2 pileup the whole fragments in general. But if you want to focus on looking for where the 'cutting sites' are, then '--nomodel --shift -100 --extsize 200' should work.

And then more recently:

Why do so many papers use --shift -100 --extsize 200 for MACS2 rather than -f BAMPE if this is not recommended for paired-end data?

--shift -100 --extsize 200 will amplify the 'cutting sites' enrichment from ATAC-seq data. So in the end, the 'peak' is where Tn5 transposase likes to attack. The fact is that, although many information such as the insertion length and the other mate alignment is ignored, such result is still usable. Especially when the short fragment population is extremely dominant, the final output won't be off much.

Source: https://github.com/taoliu/MACS/issues/145

ADD COMMENTlink modified 6 weeks ago • written 2.2 years ago by igor7.0k

Why specifying --shift in BAMPE? The manual says that in BAMPE, --shift can only be zero. Also, if you provide paired-end data, why using an --extsize. The fragment size is already determined by the paired-end sequcening.

I used to follow the strategy from the NucleoATAC and single-cell ATAC paper, [macs2 -g hs --nomodel]

ADD REPLYlink written 2.2 years ago by ATpoint9.3k

I think he meant specifying either BAMPE or shift, not both. If you don't specify BAMPE, reads are not treated as pairs.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by igor7.0k

That's a good suggestion concerning altering the macs2 call to use the "real" insert size for PE data. We will try it and see how it compares to past peak calling.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by benformatics320

Although then a double nucleosome produces a peak twice the size of a single nucleosome, which I am not sure I am entirely comfortable with either.

ADD REPLYlink written 2.2 years ago by igor7.0k
1

The regions where ATAC detects nucleosomes are open chromatin regions, where I think you always have a mixture of signals, both NFR, mono, and maybe di-or trinucleosomes. Inferring these features requires more specialized software, such as NucleoATAC. The peakcalling, in my understanding, just gives you a region where transposase cutting events are enriched and in which "you can dig deeper". The NucleoATAC software then takes a bed with peaks as input and the BAM sequencing files to perform modelling of the fragment size distribution and to infer NFR and nucleosome positions. So I think MACS peaks are (by themselves) not suitable for base-pair-resolved analysis.

I think it depends what you want to use your data for. If you just want to say which regions are accessable or not, then MACS peaks should do the job. If in-depth analysis, such as NFR and nucleosome positions or TF footprints are to be obtained, you need specialized software/ custom approaches, such as NucleoATAC, but also excellent data quality and good sequencing depth.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by ATpoint9.3k
0
gravatar for Simply Bioinformatics
12 months ago by
WashingtonDC
Simply Bioinformatics110 wrote:

We Are Using This:

macs2 callpeak -t input -f BED -g "hs" -p 0.01 --nomodel --shift -37 --extsize 73 -B --SPMR --keep-dup all --call-summits

Currently, there is an existing issue on MACS2 which intermittent -1 as pValue which will crash the IDR as downstream analysis

as a heads up: convert all -1 to zero using sed

ADD COMMENTlink written 12 months ago by Simply Bioinformatics110
0
gravatar for ZheFrench
6 months ago by
ZheFrench150
France
ZheFrench150 wrote:

So finally what is the consensus to use for MACS2 with ATAC-SEQ paired-end ?

I read that -f BAMPE was a good pratice . But then should it be used with --nomodel option ?

For what it worth, this is what I'm using until now on my bam without duplicates:

macs2 callpeak --verbose 3 --treatment my.bam --name myname -g hs --bdg -q 0.05 --keep-dup all -f BAMPE --broad
ADD COMMENTlink written 6 months ago by ZheFrench150

Hi, ZheFrench, the read position should be shifted +4 and -5 bp in the BAM file before macs2 callpeak? Thank you!

ADD REPLYlink written 4 months ago by afli140

Is --nomodel not required if you are using -f BAMPE?

ADD REPLYlink written 11 weeks ago by Lucy0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 937 users visited in the last hour