Question: Peak Calling with MACS (treat and control samples)
1
gravatar for ukpersia1
4.1 years ago by
ukpersia1 20
United Kingdom
ukpersia1 20 wrote:

I would like to use MACS for differential peak calling.

I have a set of data (GSM947524 ) (cancer data as treatment data) and was wondering if I can use ( GSM947411) as control for it?  or should I use ( GSM947523) as control? 

whole data sets can be accessed by Query DataSets for GSE38685  (which includes 12 samples)

I have the following command line for peak calling:

macs14 -t treat.sam -c control.sam -g hs -n cancer -w -S

 

where “treat” is the cancer file GSM947524 

and  “control” is the one that I’m confused about!!!

 

chip-seq peak calling macs • 8.1k views
ADD COMMENTlink modified 4.1 years ago by Gary450 • written 4.1 years ago by ukpersia1 20

Just a suggestion: Since the data you are looking at is H3k4me3, one would expect broad peaks. It is good to run macs whenever you are dealing with epigenome data, with --nomodel parameter and if you know the fragment size, with --shiftsize equal to half the fragment size (usually its ~200 for 40-50bp reads). 

ADD REPLYlink written 4.1 years ago by poisonAlien2.8k
1

I do not think this is a good advice, especially for beginner. First, H3K4me3 is not really the broadest histone mark there is, I think point peak calling should work. If it doesn't, the new version of MACS2 has a --broad parameter which should combine the nearby point peaks. Saying it is good to run macs [..] with `--nomodel` parameter just like that is quite irresponsible, in my opinion, especially to a beginner. 

Namely, these parameters are automatically estimated when running MACS in the default mode. If you provide them manually you are basically saying "I believe I can do better than MACS at estimating them". It is a strong assumption. You need to be extremely confident about the inner workings of MACS to do this in my opinion. Particularly, one needs to know is the MACS2 model in the first place, what `--shiftsize`* is for, what is the fragment length, how to figure it out, and once it is known, how to set the appropriate parameters of the software accordingly. Most importantly you would need to justify your choice of parameters, verify that the assumptions you make are actually correct, and be aware how that influences the resulting p-values. I, personally, would not feel too confident about doing this.

 

* Actually, `shiftsize` is now removed from the latest MACS version, in favour of `--extsize` which should be set to twice the length of the old shift size (see the `--help`).

P.S. How does one do inline code blocks like the author of the comment? Backticks don't work :(

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Saulius Lukauskas530
1

The MACS parameters Dr. Liu suggested for H3K4me3 ChIP-Seq as below. I have tried different parameters for our H3K4me3 ChIP-Seq data, and found that the parameters Dr. Liu suggested are best. However, which parameters are better heavily depends on our data quality and data structure. Actually, it’s case by case. Please see the detail of this great paper and other related papers.

Feng JX, Liu T, Qin B, Zhang Y, Liu XS. 2012. Identifying ChIP-seq enrichment using MACS. Nature Protocols 7:1728-1740.

> macs14 -t BROAD_GM12878_H3K4me3.bam –c BROAD_GM12878_H3K4me3_Control.bam -g hs -n BROAD_GM12878_H3K4me3 --nomodel --shiftsize 73 -B -S --call-subpeaks

ADD REPLYlink written 4.1 years ago by Gary450

The 'case per case' bit is precisely what I meant -- sticking with defaults at first is always a good shout.  I believe we might be starting to confuse the OP, but, since this thread is starting to become a complete reference on MACS peak calling I will add that Roadmap Epigenome also use --nomodel and manual --ext size for their data calling. They estimate the fragment length using strand cross correlation (link, section c).

ADD REPLYlink written 4.1 years ago by Saulius Lukauskas530

I would like to add this reference as well from the same author: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3120977/

Take a look at the section "Critical Parametsrs and troubleshooting", where they explain suggested argumenst for epigenome data. Anyhow, along with --nomodel and --shiftsize, you can also increase the --mfold from default 10,30 to 10,100 (or even higher if you are looking for super enhancers using H3K27Ac) since most of the histone modifications are highly accumulated and generally cross the default threshold of 30, which would be missed by MACS for model building. 

ADD REPLYlink written 4.1 years ago by poisonAlien2.8k

Hi poisonAlien, may I have a question here? Now, I use ROSE (https://bitbucket.org/young_computation/rose) to identify super-enhancers based on H3K27ac or Med1 ChIP-Seq. I don’t know MACS can identify super-enhancers, too. Could you tell me how to manipulate it? Many thanks.

ADD REPLYlink written 4.1 years ago by Gary450
2

MACS does not identify super enhancers. It just gives you peaks (or just "regular enhancers"). You need to stich those closely spaced peaks and rank them acording to their intensity, which is what ROSE does. You first call peaks with MACS, then use the called peaks (e.g: H3K27Ac.bed) as input to rose. 

ROSE_main.py -g hg19 -i H3K27Ac_peaks.bed -r H3K27Ac.bam -o outPut_dir -c Input.bam

You may want to add -t 2000 if you want to remove those peaks overlapping with tss. Again I am no expert, but this is what I follow.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by poisonAlien2.8k

Thanks. It is my misunderstanding.

ADD REPLYlink written 4.1 years ago by Gary450

Thanks very much for your input to this. I am so glad that this website exist :) at least for beginners like me is a massive help. 

ADD REPLYlink written 4.1 years ago by ukpersia1 20
1
gravatar for Saulius Lukauskas
4.1 years ago by
London, UK
Saulius Lukauskas530 wrote:

As I already described in your previous post with this issue (peak calling for cancer and healthy data? ), I believe the best solution is to use the input sample (GSM947411) and do the peak calling twice. That's what I would do.

EDIT: Just to reiterate this quickly, you need to distinguish between biological control and signal (normal cell line v. cancer cells) from technical control and signal (background noise levels v. supposed signal levels).

I think you are slightly confusing what peak callers like MACS do. In the essence, peak callers are just tools to distinguish between signal and noise regions. The input they take is just a set of x versus y measurements and output x coordinates that have significantly large y values. They have no biological assumptions encoded between themselves, just the assumptions about how the background noise levels are distributed. Since this noise is not uniformly distributed across the x axis (i.e. genome), MACS needs the input sample (which they call 'control') to estimate the per-base-pair noise levels. Once these levels are estimated, one can proceed to look at y coordinates that are significant. 

Now your question is: "do the peaks differ between normal and cancer cell lines?". Translated to the x-y analogy above, it would be 'do the x coordinates where y is significantly large for normal cell line and the x coordinates where y is significantly large for cancer cell line differ?'. In this language it is perhaps clearer why MACS need to be run twice, and that it does not answer that question for you*. Hope this helps.

* Well, the standard options don't. As I mention in your previous post (C: peak calling for cancer and healthy data? ) there are bdgcmp and diffpeaks options in MACS that might be helpful. 

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Saulius Lukauskas530

Once again thanks :) this is really helpful and indeed is the answer to my question. 

aulius Lukauskas do you know about any useful course about NGS (chip-seq) that is held in London? 

Best regards

ADD REPLYlink written 4.1 years ago by ukpersia1 20
1

For a beginner, this one could be very helpful for your information.

A step-by-step guide to ChIP-seq data analysis (abcam free webinar)

http://www.abcam.com/events/a-step-by-step-guide-to-chip-seq-data-analysis-free-webinar

ADD REPLYlink written 4.1 years ago by Gary450

Thanks for the link. it's very informative 

ADD REPLYlink written 4.1 years ago by ukpersia1 20
1
gravatar for Gary
4.1 years ago by
Gary450
Taiwan/Taichung/China Medical University Hospital
Gary450 wrote:

Both samples (i.e. GSM947411 & GSM947523) can be your control. Which one is better depends on your biological question and experimental design. In the beginning (i.e. in 2007), people can use only one sample without Input (control) to identify peaks. Today, more and more people use replicate samples with Input or IgG control to identify peaks. No one can tell you what method is absolute better, although what Saulius Lukauskas said is the mainstream strategy people use today. In fact, only people really understand the biological background, usually it’s you, can suggest a better experimental design and a better control sample. In practice, you can try both controls to determine with one is better based on whether their results are consistent with your biological phenomena. I hope this might be helpful. 

ADD COMMENTlink written 4.1 years ago by Gary450

Thanks Gary 

ADD REPLYlink written 4.1 years ago by ukpersia1 20
0
gravatar for geek_y
4.1 years ago by
geek_y9.6k
Barcelona/CRG/London/Imperial
geek_y9.6k wrote:

MACS can be run without control sample. From MACS manual:

-t/--treatment FILENAME
This is the only REQUIRED parameter for MACS.

Read the section 'Control sample' from Encode paper to understand exactly what is control sample and its uses. 

ADD COMMENTlink written 4.1 years ago by geek_y9.6k

Yes indeed. but if I want to use a control sample which of the above samples is suitable?  

 

ADD REPLYlink written 4.1 years ago by ukpersia1 20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1404 users visited in the last hour