It seems most peak callers default return only significant peaks (like I use MACS2 with parameter -q 0.05). Then these sig-peaks' narrowPeak file (or bed files) can be load into software like Diffbind, MAnorm2 for normalisation, consensus counting .etc. Finally, differential analysis will be done on them.
However, I tried to ask MACS2 return me all peaks (-q 1), and the distribution is different with sig peaks (below figure), which means if we do peak filtering at initial stage, no matter what analysis we do later, we are not dealing with peaks origin distribution. It may not influence much if I want to find differential regions, draw correlation heatmap .etc. However, I think some situations may be influenced, like:
For example, after differential calling, we may do p.value adjustedment based on number of total peaks compared. I think the value would be different if we include all peaks or only select those sig-peaks.
Another example is, I want to have a null distribution for peaks (exactly the below density plot), or compared statistics after differential analysis. If I initially do peak filtering, I can never get null distribution right? It's more like we select just the top 100 students from A school and B school each, then compare scores between them, no matter how we compare, we can never know the score distribution of all students for school A or school B right?
The third example is, if I want to check how many peaks (sig or not) are enriched in a certain gene, without all peak information, I can never know. I can only know how many sig peaks enriched each gene.
I am new to ChIP-SEQ, and I want to know if it's possible that we don't do any filtering, just put all peaks (-q -1) returned into nearly all analysis. The significance filtering can be done manually after calculation at any stage. Is that work?
In another word, I want to know peaks with higher q value means they have less intensity (like a unexpression gene, or low methylated CpG), or they are low-quality, unreliable results should be discarded.
I think you're conflating two questions: (1) is this a peak, and (2) does it show different behavior between conditions. Do you want to know how top students differ between schools? Or do you want to know what a top student is? In your school analogy, you might imagine the same students at two different schools in a parallel universe. Then you could assess the relationship between student topness and school environment - which would be a different question than student topness and student in general.
The common approach is to find peaks under each condition, reduce them to a single set of genomic locations (these locations then define a universe of peak potential within your experimental paradigm), then you can ask questions about conditions and peak behavior. The question of peak behavior versus the rest of the genome, is a different, broader question. It occurs me that even with a q-value cutoff of 1, the peaks in your set would still have some p-value threshold by which they were selected (and I don't know what that is), otherwise the null set would simply be all the bins of reads on the genome. What is the p-value cutoff in your set of -q 1 peaks? Have you examined these peaks in a genome browser to see if they really resemble anything you would consider worth pursuing?
Thanks for your quick reply. Yes, I think my key question is: should a peak with less significance should also seem as peak? Or they are low-quality noise should be removed.
Some of my research targets:
Top student ratio between schools: I found that in my cancer samples, sig-peak/all-peak ratio is much lower than normal. If I don't do all-peak, I can only make a conclusion that cancer same has fewer sig (-q 0.05) peaks than normal. However, if I consider all-peak, that's another story.
Find true top students. Assume we have two genes, A gene has 1 sig-peak (q-value 0.04), but B gene have 10 non-sig peaks (q-value 0.06), I would focus more on gene B. I want to loess smooth over all peaks enriched in gene regions I am interested to get a rank or them. I know this idea is a bit similar to "broad peak", but I don't want to use two different calling methods in my research as they seem uncomparable as I posted here. So I coded my own script.
Compare two school's general situations, not only top students: I want to draw two density lines like above for cancer/normal. If I only take sig-peaks, I am actually drawing the difference between only top students, I think that's not "global", that's "selected".
Sorry to type so long. Continue here.
The common pipeline works in my project well except for the above three questions. However, I think even if I include all peaks for differential analysis, it should also work right? I don't mind having a lot of non-significant results. In DNA methylation or RNA-seq analysis, we always have a lot of undifferential signals, unexpression genes, unmethylated CpGs. Can I treat all peak set like them as well?
I did not tune the p-value, I thought q is enough, I can try to modify that, but still, the question is the same: If I use threshold as 1, 0.5, 0.3, 0.1, can the pipelines still work?
For now, seem most result/evidence I found to support my naive idea:
(quick IGV plot, y-axis is ratio between Merged_Bnd/Merged_Input from bamCompare)
In this answer, the author visualized sig-peaks from different p/q parameters. To me, it looks like a lower threshold leads to more peaks eye-visible above. So I suspect if he uses q/p 1 as the threshold, he will get peaks cover all above the red lane. So sig or not is totally an intensity thing.
I will try to do differential calling on all-peak. Then compare the result with the sig-peak result. If most signals I found in sig-peak result also can be found in all-peak, I can see no reason why it's not working.
Sorry for asking too much, I used to be good at DNA methylation analysis, where normally I have 850k CpGs, many CpGs are not-methylated inside, which did not influence my work. On the contrary, a large matrix in DNA methylation allows me to do various statistics on it much easier. So I am new to ChIP-seq, after following the pipeline, I am thinking if it's actually similar between an all-peak count matrix and a methylation matrix.
By the way, my data is MeDIP-seq, which was supposed to capture DNA methylation signals, thus maybe that's why I think they are similar and my algorithms designed for DNA methylation also work well in my current project. Maybe for Histone ChIP-seq, my naive idea is not applicable.