ATAC-seq peak calling with MACS
3
22
Entering edit mode
6.3 years ago
igor 13k

I am trying to call peaks in ATAC-seq data. Not surprisingly, MACS is a popular option. According to the MACS documentation:

1. To find enriched cutting sites such as some DNAse-Seq datasets. In this case, all 5' ends of sequenced reads should be extended in both direction to smooth the pileup signals. If the wanted smoothing window is 200bps, then use '--nomodel --shift -100 --extsize 200'.

2. For certain nucleosome-seq data, we need to pileup the centers of nucleosomes using a half-nucleosome size for wavelet analysis (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about 147bps, this option can be used: '--nomodel --shift 37 --extsize 73'.

Based on a brief literature review, people use both --shift -100 --extsize 200 and --shift 37 --extsize 73 for ATAC-seq. Is one option more appropriate? Are there maybe different sub-types of ATAC-seq where one is better than the other?

atac-seq macs ChIP-Seq peaks • 30k views
0
Entering edit mode

I was wondering which parameters to use too. Assuming this is the pipeline ENCODE is using (https://github.com/kundajelab/atac_dnase_pipelines), specifically the ATAC-seq pipeline protocol definition (https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit?usp=sharing)

I am seeing that in the step 3a. Peak calling - MACS2 : function macs2()

shiftsize := round( smooth_window.parseReal()/2.0 )

macs2 callpeak \ -t $tag -f BED -n "$prefix" -g "$gensz" -p$pval_thresh \ --nomodel --shift -$shiftsize --extsize$smooth_window --broad --keep-dup all

The shift is - shiftsize

If you look down in the document:

function main() {

// By default, smoothing window for MACS2 is 150. IDR threshold is 0.1
// for ENCODE3, smoothing window for MACS2 is 73. IDR threshold is 0.05
if ( ENCODE3 ) {
smooth_win = 73
idr_thresh = 0.05
}
else {
smooth_win = 150
idr_thresh = 0.1
}


So effectively it would be either

macs2 callpeak ... --nomodel --shift -37 --extsize 73 --broad --keep-dup all

or

macs2 callpeak ... --nomodel --shift -75 --extsize 150 --broad --keep-dup all

Could it be that the first one will be sensitive to peaks and valleys in signal due to nucleosomes, while the second one will extend the cutting sites to cover 1 nucleosome?

0
Entering edit mode

So finally what is the consensus to use for MACS2 with ATAC-SEQ paired-end ?

I read that -f BAMPE was a good pratice . But then should it be used with --nomodel option ?

For what it worth, this is what I'm using until now on my bam without duplicates:

macs2 callpeak --verbose 3 --treatment my.bam --name myname -g hs --bdg -q 0.05 --keep-dup all -f BAMPE --broad

0
Entering edit mode

Hi, ZheFrench, the read position should be shifted +4 and -5 bp in the BAM file before macs2 callpeak? Thank you!

0
Entering edit mode

Is --nomodel not required if you are using -f BAMPE?

11
Entering edit mode
6.2 years ago
igor 13k

I got another alternative from the MACS developer:

If you followed original protocol for ATAC-Seq, you should get Paired-End reads. If so, I would suggest you just use "--format BAMPE" to let MACS2 pileup the whole fragments in general. But if you want to focus on looking for where the 'cutting sites' are, then '--nomodel --shift -100 --extsize 200' should work.

And then more recently:

Why do so many papers use --shift -100 --extsize 200 for MACS2 rather than -f BAMPE if this is not recommended for paired-end data?

--shift -100 --extsize 200 will amplify the 'cutting sites' enrichment from ATAC-seq data. So in the end, the 'peak' is where Tn5 transposase likes to attack. The fact is that, although many information such as the insertion length and the other mate alignment is ignored, such result is still usable. Especially when the short fragment population is extremely dominant, the final output won't be off much.

0
Entering edit mode

That's a good suggestion concerning altering the macs2 call to use the "real" insert size for PE data. We will try it and see how it compares to past peak calling.

0
Entering edit mode

Although then a double nucleosome produces a peak twice the size of a single nucleosome, which I am not sure I am entirely comfortable with either.

2
Entering edit mode

The regions where ATAC detects nucleosomes are open chromatin regions, where I think you always have a mixture of signals, both NFR, mono, and maybe di-or trinucleosomes. Inferring these features requires more specialized software, such as NucleoATAC. The peakcalling, in my understanding, just gives you a region where transposase cutting events are enriched and in which "you can dig deeper". The NucleoATAC software then takes a bed with peaks as input and the BAM sequencing files to perform modelling of the fragment size distribution and to infer NFR and nucleosome positions. So I think MACS peaks are (by themselves) not suitable for base-pair-resolved analysis.

I think it depends what you want to use your data for. If you just want to say which regions are accessable or not, then MACS peaks should do the job. If in-depth analysis, such as NFR and nucleosome positions or TF footprints are to be obtained, you need specialized software/ custom approaches, such as NucleoATAC, but also excellent data quality and good sequencing depth.

11
Entering edit mode
6.2 years ago

Assuming that in ATAC-Seq, we make a cut at "any" open chromatin regions, I imagine that using nucleosome specific parameters would not be optimal. Given that for things like MNAse you assume your average fragment is as long as the DNA wrapped around the nucleosome (~150 nt) which is not necessarily true for ATAC. Thus my guess would be method #1 would provide more optimal results.

Although if you check this workflow neither of those parameters are used for MACS [fixed link thanks to Dataman]: http://informatics1.rc.fas.harvard.edu/atac-workshop-capellini-lab.html

2
Entering edit mode

The link to the workflow/workshop is broken. The working link is: http://informatics1.rc.fas.harvard.edu/atac-workshop-capellini-lab.html

0
Entering edit mode

In that workflow, they also don't adjust for larger fragment size during Bowtie2 alignment (-X/--maxins), which is specifically discussed in the original ATAC-seq paper, so I wouldn't treat it as the gold standard.

Also, that's a really nice illustration. Where is it from?

0
Entering edit mode

The source of the image is here: http://www.the-scientist.com/?articles.view/articleNo/44772/title/Reveling-in-the-Revealed/

In regards to the larger fragment size, it does seem smart to try to adjust the --X/--maxins parameter to match your dataset. In my case it increased the number mapped reads by 1-2%.

I found this resource to be very helpful. It almost seems to suggest that increasing -X anytime you use bowtie2 is vital. http://lab.loman.net/2013/05/02/use-x-with-bowtie2-to-set-minimum-and-maximum-insert-sizes-for-nextera-libraries/

0
Entering edit mode

It shouldn't change the alignment rate by much, since it will just attempt to align the mates independently by default. However, it will count them as discordant. Also, the fragment size distribution will be cut off.

And I agree. Increasing that parameter is helpful in general. You don't always know what your library size is going to be. The only downside is it will take longer to run.

1
Entering edit mode
5.1 years ago

We Are Using This:

macs2 callpeak -t input -f BED -g "hs" -p 0.01 --nomodel --shift -37 --extsize 73 -B --SPMR --keep-dup all --call-summits

Currently, there is an existing issue on MACS2 which intermittent -1 as pValue which will crash the IDR as downstream analysis

as a heads up: convert all -1 to zero using sed