I am new to ATAC-seq analysis, hence the following basic questions regarding the consensus in the field:
Should I perform peak calling (I am using MACS2 in PE mode) using the entire BAM file or should I first split the BAM file in nucleosome free, mono-nucleosome, di-nucleosome etc. and then call peaks with these separate files?
What is the average/expected number of called peak regions for a high-quality ATAC dataset?
Also, I have 3 biological replicates per condition. Should I merge the replicates prior to peak calling? Or treat them separately throughout?
There is no clear consensus on this problem. I currently like to use Genrich to call peaks over replicated samples.
A command line could be: Genrich -t sample1.bam,sample2.bam,sample3.bam -j -l 200 -q 0.05 -o out.narrowPeak. You will have to sort your BAMs by read name first.
This will call peaks over every single sample and then combine the p-values of each sample using Fisher's method.
This saves you from annoying filtering like peak must be present in all samples or in 2 of 3 samples, or from using the IDR framework which only works for n=2. You can also use macs to call peaks over the combined BAM files but this in my experience gives a lot of peaks from which you do not know if they are reproducible across replicates. In any case, I merge the peak lists of the replicate groups to get a template to make a count matrix from (with featureCounts). I do it that way because I personally prefer to have a conservative list of confidence peaks that I use throughout the project. Others might have different opinions. A popular different opinion is given in the manual of the csaw package where the author recommends to avoid peak calling at all in favor of using sliding windows for differential analysis. Still, as I said, I prefer a list of confidence peaks as one might need this peak list for other applications in the project beyond diff. analysis. Choice is yours.
Once you have the count matrix you can use packages like DESeq2 or edgeRfor differential analysis. If you use edgeR I suggest to check the csaw manual which contains example code for ChIP/ATAC-like analysis. Normalization can be done either using the peak counts or using the counts of 10kb windows across the genome (again, check csaw manual for details). The latter I find useful if the samples are expected to be very different from each other.