MACS2 --SPMR and --nolambda usage for version 2.1.2
1
4
Entering edit mode
3.7 years ago
varunorama ▴ 50

Hello all,

I've been spending some time researching the advanced options for macs2 and found some inconsistencies that I am not understanding. Specifically, I looked at the Building signal track tutorial ( https://github.com/taoliu/MACS/wiki/Build-Signal-Track ) and saw that the option --SPMR was used prior to bdgcmp. Looking at the bin source code for version 2.1.2, it is stated that

"If you plan to use the signal output in bedGraph to call peaks using bdgcmp and bdgpeakcall, you shouldn't use this option because you will end up with different results. However, this option is recommended for displaying normalized pileup tracks across many datasets."

I understand that SPMR is great for normalizing the treat_pileup.bdg and control_lambda.bdg to visualize directly on a genome browser, but the tutorial seems to be inconsistent with what is written. Is an --SPMR normalization in callpeak truly valid to use before contructing a signal track using bdgcmp? I thought it was initially a new change for version 2.1.2, but the version change log did not show any changes in the options for the version that the tutorial is referring to.

Additionally, if no input controls are available, is the --nolambda option still the way to go (as recommended by the authors in their Jeng et al. 2011 paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120977/).

Thank you! Varun

macs2 ChIP-Seq • 5.9k views
1
Entering edit mode

In my understanding --nolambda in the absence of a control is only valid for broad peaks, so diffuse histone modification ChIPs. If you apply it on sharp peak ChIPs like transcription factors, you deactivate the local background estimation that aims to decide if a peak is true or false based on the signal landscape surrounding the putative peak.

1
Entering edit mode

Thank you for that sensible explanation.

1
Entering edit mode

Hi, so I'm aware a few months have gone by but the Building signal track tutorial still shows --SPMR and there is the inconsistency that you explained above and wasn't commented.

Any thoughts on this? Is the use of --SPMR in that context not correct anymore and the tutorial should be updated?

And what about the use of --nomodel? is it recommended without control?

Thank you for your time

1
Entering edit mode
3.2 years ago
kalavattam ▴ 80

Is an --SPMR normalization in callpeak truly valid to use before constructing a signal track using bdgcmp?

It appears to be valid for construction of a signal track (a); however, if I understand correctly, such an approach may not be appropriate for (b) peak calling and (c) comparisons between libraries:

(a) with respect to building signal tracks

This approach has been used (with or without modifications) in publications, including this recently published (at the time of this writing) manuscript: Tata et al., Dev Cell 2018; https://doi.org/10.1016/j.devcel.2018.02.024

From the Methods:

Normalized fold enrichment tracks were generated by using the callpeak function with the -SPMR flag and a 200bp extension size, then passing the bedgraph outputs into the bdgcmp function with the setting -m FE (fold enrichment).

Another publication that used this basic approach (albeit adapted): Murphy et al., Cell 2018; https://doi.org/10.1016/j.cell.2018.01.022

From the Methods:

Data was then processed using MACS2 for enrichment scoring and peak calling. Briefly, read count normalization was performed on alignment files to account for sequencing depth differences, and then background corrected based on input using the bdgdiff function of Macs2. Finally, Log10 fold enrichment over input lambda was then calculated for each nucleotide across the entire genome using the bdgcmp function of Macs2.

More MACS2 details at the related GEO entries, e.g., https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2494905

Macs2 was then used to generate read count normalized genome wide pileup tracks and lambda tracks for precipitated samples and input. (callpeak --nomodel --extsize 150 --SPMR) Pileup tracks were then Input corrected using the Macs2 subtract function (bdgcmp -m subtract). Log10 Fold enrichment was then calculated for each sample comparing input corrected pileups to input lambda. (bdgcmp -m logFE). Peaks were then called as regions with greater than 3 fold enrichment over lambda using the Macs2 bdgpeakcall function. (bdgpeakcall -c 0.477 -l 100 -g 100)

More relevant information at this Biostars Q&A entry: macs2 normalization on broadPeak

(b) with respect to peak calling

From the \$ macs2 callpeak --help output (version 2.1.2):

If you plan to use the signal output in bedGraph to call peaks using bdgcmp and bdgpeakcall, you shouldn't use this option because you will end up with different results.

(c) with respect to comparisons between libraries

From the following URL: https://github.com/taoliu/MACS/issues/55

If you want to compare different libraries, like in your case, first choice is that you can try 'bdgcmp' to get some statistic measurements; or you can ask MACS2 callpeak to generate pileup per million basepairs for you with --SPMR by downscaling. But if you want to use the second method, you have to set both replicates the same fragment size by '--nomodel --extsize XXX' to make the comparison fair. Also bear in mind that such scaling will affect statistic significance since 100 vs 50 is more significant than 10 vs 5, so if library sizes differ a lot, scaling such as SPMR may not be a good option.

I hope others can add more to this discussion.

More interesting and (somewhat) related links:

1
Entering edit mode

Just my two cents: One should use macs for what it was originally designed and that is calling peaks on single samples. Nothing more, nothing less. It is not suitable for comparing peak profiles across samples because it does not respect differences in signal-to-noise errors and only performs naive per-million scaling. It will give you information on peak locations per sample which you can then feed into proper statistical frameworks to compare peaks.

1
Entering edit mode

Check this which will be helpful. It's the same situation as in the differential gene expression analysis -- some algorithms use FPKM or TPM to compare conditions and some algorithms use counts directly then statistically estimate size-factors to normalize counts. Since MACS2 uses the Poisson test, we expect the values to be compared are discrete "counts".