Question: Output of ChIP-seq data analysis using MACS
Vikas Bansal (Berlin, Germany) wrote, 4.9 years ago:

Hi everyone,

I am analyzing ChIP-seq data for 60 samples. I am using MACS for peak calling and I have some questions about the different outputs. To make it easier, I will give an example using two of my ChIP-seq datasets: one control and one treatment. The control sample has been sequenced twice (replicates, on different lanes), so I have two fastq files for the control. According to the protocol given in this paper, we should merge the replicates with samtools after mapping so that we have a single file for the control. I got ~400 peaks when using the merged file as control (let's call this Analysis A). I did the analysis again (Analysis B), this time using only one of my two control files (picked at random), and got ~200 peaks with MACS.

Questions

  1. The FDR values in Analysis B are much higher than in Analysis A: not a single peak in Analysis B has an FDR below 25%, whereas in Analysis A all FDR values are below 5%. Is it normal to see such a difference? Which output would you prefer, Analysis A or B?

  2. Could anyone please explain how the FDR values are calculated by MACS? I have read about it in the above-mentioned paper but did not understand it. Maybe an example would help (I read the one example given in the paper).

Thanks in advance.

KCC (Cambridge, MA) wrote, 4.9 years ago:

I use MACS frequently. First, you should definitely trust the analysis where you used the merged control files. The pooled control is better for the same reason that a larger sample size is generally better in most statistical analyses. MACS uses the control as an estimate of the background enrichment.

Second, if I remember correctly, the FDR is estimated by counting the number of peaks called in the control relative to the treatment. The assumption is that the control should contain no peaks, so any peaks found there are false positives. (I think the control tends to be somewhat enriched around at least some binding sites, so this isn't a perfect assumption; there is some mention of this in the PeakSeq paper from the Gerstein lab.) Note that the control will still tend to be less enriched at actual peaks.

Using the 'false positives' as a model, you can calculate the FDR for the control sample at different thresholds.
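The "false positives as a model" idea can be sketched in a few lines of Python. This is only an illustration of the empirical-FDR concept, not MACS's actual code; the peak p-value lists are hypothetical.

```python
def empirical_fdr(treatment_pvals, control_pvals, threshold):
    """Empirical FDR at a p-value threshold: peaks surviving the cut-off
    in the swapped control-vs-treatment run, divided by peaks surviving
    it in the real treatment-vs-control run."""
    n_treat = sum(p <= threshold for p in treatment_pvals)
    n_ctrl = sum(p <= threshold for p in control_pvals)
    return n_ctrl / n_treat if n_treat else 0.0

# Hypothetical peak p-values from the two runs:
treat = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]
ctrl = [1e-6, 1e-4]
print(empirical_fdr(treat, ctrl, 1e-4))  # 0.4 (2 control / 5 treatment)
```

Sweeping `threshold` over the observed p-values gives the FDR at each cut-off, which is exactly "calculating the FDR at different thresholds" as described above.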

One extra nitpick: the values reported for a single peak are q-values. The FDR is just the maximum possible q-value.

The algorithm MACS uses to merge regions into peaks is a little irregular and can lead to strange results. In this case, however, the idea that the number of peaks would double with a better (i.e. pooled) control sounds reasonable to me.


Thanks for your reply. In the paper it is written: "For a particular P value threshold, the empirical FDR is then calculated as the number of control peaks passing the threshold divided by the number of ChIP-seq peaks passing the same threshold." Are these FDR values somehow related to the P-values, or are they calculated separately? I don't see a particular trend between the P-values and the FDRs.

(Reply by Vikas Bansal)

MACS has multiple windows. A Poisson distribution is fit to each window, and each distribution gives a different p-value under the model. The p-value assigned by MACS is the maximum of these p-values. You can read the original paper to see the different window sizes. There is also a more recent paper on MACS2.

The q-values come from the procedure I mentioned: the FDR is estimated at different thresholds by counting the number of false peaks at each threshold.

(Reply by KCC)
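The multi-window Poisson idea can be sketched as follows. The window sizes, background rates, and tag count below are hypothetical; the mechanism shown is that taking the largest local λ across windows is equivalent to taking the largest (most conservative) p-value.

```python
import math

def poisson_pvalue(k, lam):
    """Upper-tail Poisson probability P(X >= k): the enrichment p-value."""
    # Sum P(X = i) for i = 0..k-1 iteratively, then take the complement.
    cdf, term = 0.0, math.exp(-lam)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf

# Hypothetical background rates estimated from windows of different sizes
# around one candidate region:
window_lambdas = {"1k": 3.5, "5k": 2.8, "10k": 2.2}
observed_tags = 12
# The largest lambda gives the largest p-value for the same observed count.
p_value = max(poisson_pvalue(observed_tags, lam)
              for lam in window_lambdas.values())
```

The upper-tail p-value is monotonically increasing in λ, so maximizing over λ and maximizing over the resulting p-values pick the same window.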
Sukhdeep Singh (Netherlands) wrote, 4.9 years ago:

The point that confuses me is how MACS can output fewer peaks with a smaller number of tags (a single file). In principle, when one merges two files the tag density increases and the number of peaks should drop: places where the read density was previously not enough to make a peak in the control now become peaks, and when compared with the sample they won't be called peaks. Just to make sure, are you talking about positive peaks (enrichment of sample over control) or negative peaks (enrichment of control over sample)? Are you using MACS14 or MACS2?

A few things can be tried:

1) Merge the fastq files (technical replicates) at the start, then map, remove duplicates with samtools rmdup, and filter by mapping quality with the -q parameter of samtools view. Try calling peaks then.

2) Try merging as you are doing, and then apply read-count normalization as Luke suggests.

I have a feeling samtools merge is doing something; can you also check the read density of the merged file? Also, check the --keep-dup parameter of MACS.

Cheers


Sorry, I should have mentioned that. I am using MACS14 and yes, I am talking about positive peaks. I checked the merged SAM file (replicates of control): after merging, the file size is almost double, and MACS detects 65.9 million tags in the merged control file (the sum of the tag counts of the two control files). Is that normal?

(Reply by Vikas Bansal)

It makes sense: you merged, and you are getting what you should get. Try read-density normalization and removal of duplicates as I mentioned earlier. Now that there are more tags, you should get fewer positive peaks, but you are reporting the opposite, which is not normal.

(Reply by Sukhdeep Singh)

Read-density normalization and removal of duplicates are the default parameters in MACS and I did not change them (please correct me if I am wrong). Could you also comment on the FDR values? With Analysis A, I am getting much lower FDRs.

(Reply by Vikas Bansal)

They are, but not perfect; you will see a difference when you do it manually. That's the confusing part: a low FDR means enrichment is higher in the sample than in the control, which contradicts the fact that the control has more reads (the FDR should be high), unless MACS's normalization is completely removing the duplicate reads and flagging those regions, so that the sample shows significant binding there. You should also ask on the MACS mailing list.

(Reply by Sukhdeep Singh)
Vikas Bansal (Berlin, Germany) wrote, 4.9 years ago:

Thanks everyone for your answers. I will try to answer my own questions from my original post; please correct me if I am wrong. I will answer my second question first.

Could anyone please explain, how the FDR values are calculated by MACS?

Let's say we found a peak (PEAK A) in the treatment with p-value 0.000001 (compared to the local λ). MACS calculates how many peaks there are in the treatment at this p-value cut-off; let's say it finds 100. Then MACS swaps the treatment and control and calculates how many peaks there are in the control (which is now the treatment) at the same cut-off; let's say it finds 2. So the FDR for PEAK A is (2/100)*100, i.e. 2%.
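The arithmetic of this worked example, using the hypothetical counts above:

```python
# Counts from the worked example in the text (hypothetical):
treatment_peaks = 100  # peaks in treatment at the p-value cut-off
swapped_peaks = 2      # peaks after swapping treatment and control
fdr_percent = 100 * swapped_peaks / treatment_peaks
print(fdr_percent)  # 2.0
```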

FDR values in Analysis B are much higher as compared to Analysis A. Not even a single peak in Analysis B has FDR below 25% whereas in Analysis A all FDR values are less than 5%. Is it normal to see such a difference? Which output would you prefer - Analysis A or B?

Since in Analysis A I merged the control files (replicates) but not the treatment, I have more reads in the control. MACS scales the larger dataset toward the smaller one by dividing the local λ by r, where r = (number of reads in control) / (number of reads in treatment). Let's say we have 15 million reads in the treatment and 30 million in the control; then r is 2. Now, while calculating the p-value for PEAK A, we divide the local λ by 2, which gives a smaller p-value, and when we swap the samples we find fewer peaks in the control (because the p-value cut-off is low), which ultimately decreases the FDR. (This is not always true, because with a lower p-value cut-off we also find fewer peaks in the treatment, though not as few as in the control.)
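This scaling step can be illustrated numerically. Everything here is hypothetical except the mechanism: dividing the local λ by r shrinks λ, and a smaller λ yields a smaller Poisson p-value for the same observed tag count.

```python
import math

def poisson_pvalue(k, lam):
    """Upper-tail Poisson probability P(X >= k)."""
    cdf, term = 0.0, math.exp(-lam)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf

r = 30e6 / 15e6   # 30M control reads / 15M treatment reads = 2
lam_local = 8.0   # hypothetical local background rate from the control
observed = 14     # hypothetical tag count at PEAK A

p_unscaled = poisson_pvalue(observed, lam_local)
p_scaled = poisson_pvalue(observed, lam_local / r)
assert p_scaled < p_unscaled  # scaling down lambda lowers the p-value
```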

I have some more questions but for that I will start a new post. Please let me know, if I got something wrong or if I missed something.

Luke (Turin, Italy) wrote, 4.9 years ago:

Hi Vikas,

Have a look at this paper. In the discussion, the authors say:

"[...] when tag counts from ChIP and controls are not balanced, the sample with more tags often gives more peaks even though MACS normalizes the total tag counts between the two samples. [...] we suggest to ChIP-Seq users that if they sequence more ChIP tags than controls, the FDR estimate of their ChIP peaks might be overly optimistic."

How many tags do you have in ChIP and control samples?

Best Regards


For the ChIP treatment I have ~33.8 million tags, and for the control, as I said earlier, I have two replicates: the first has ~33.4 million tags and the second ~32.5 million.

(Reply by Vikas Bansal)

Luke, this doesn't imply what Vikas is seeing. It's actually the opposite: how can a control with fewer tags output fewer peaks than a control with more tags (the merged file), unless we are confusing positive peaks (enrichment of sample over control) with negative peaks (enrichment of control over sample)?

(Reply by Sukhdeep Singh)
Ian (University of Manchester, UK) wrote, 4.9 years ago:

Just a small point about FDR calculations and MACS: they become inaccurate when one of the samples is much larger than the other, and this warning can be seen on STDOUT. Are you seeing such a warning when using the combined input files?

Also, including reads mapped to the mitochondrial genome (chrM) adversely affects the FDR calculation. I routinely map reads to all the available reference sequences (e.g. for human: chr1-22, X, Y, chrM, and unassembled "random" contigs), but only use the reads that map to chr1-22, X and Y for peak calling. This seems to work best.


Thanks for your reply. I did not get this warning and yes, I am also excluding chrM, etc.

(Reply by Vikas Bansal)
muhe1985 (United States) wrote, 2.9 years ago:

I don't think 400 peaks and 200 peaks are, in principle, very different, but just to point out: both numbers indicate a failure of the ChIP or of the ChIP sequencing itself. The number of peaks is far too small for a successful ChIP-seq experiment. Such a criterion is also mentioned in the ENCODE ChIP-seq guidelines ("hundreds or low thousands point to failure of ChIP-seq").

As for the question itself: it is understandable that, as depth increases, the number of peaks in the control file increases, which leads to fewer called peaks in the ChIP sample.

I would also suggest checking whether the called peaks are in the ENCODE blacklist.
Powered by Biostar version 2.3.0