Does DiffBind correctly designate peaks with very small fold-change to be significant
2
0
Entering edit mode
3.3 years ago
Aspire ▴ 300

In the process of analyzing ChIP-Seq replicates with DiffBind (drosophila data, 3 replicates for each sample, choosing DESeq2 as the analysis method), peaks with very small Fold-change were found to be significant.

The median absolute value of the fold change among the significant peaks is 1.55 and the mean is 1.93 (not in log scale). The Concentration is also not that high, the median among the significant values being 225 reads.

I have plotted an MA plot; the red points are the significant ones, the black points are the non-significant ones. I have plotted the non-significant points after plotting the red points, so if a non-significant peak would have a large fold, it would be easily visible.

enter image description here

All points with even slight log fold changes are significant. I wonder, why could this be the case?

For example, the peak which has the smallest fold change, and is still significant is (Fold and Conc are in log2 scale here):

seqnames     start        end         width     strand 
3L           11657351   11659235       1885      * 

Conc    Conc_Mut     Conc_WT       Fold        p.value        FDR     
8.48     8.63         8.3           0.23       0.00655      0.0376

Individual counts (not in log2 scale):

S7        S8        S9        S1         S2        S3 
404.42     391     396.29     305.76     317.14     321.53

Had these been the results with genes in RNA-Seq, I'd be surprised. Is it merely the fact that the ChIP-Seq has much less "genes" to make a multiple hypothesis correction for? Could it be something else?

DiffBind • 2.3k views
ADD COMMENT
1
Entering edit mode

If nobody answers, you may consider the Bioconductor Support forum, where the DiffBind developer regularly checks for questions: https://support.bioconductor.org/

ADD REPLY
0
Entering edit mode
3.3 years ago
ATpoint 81k

Small fold changes can become statistically significant if dispersion is low enough and counts are high enough. That is not uncommon. I personally use glmTreat within the edgeR QL framework to test against a minimum fold change (default is 0) of e.g. log2(1.25), or higher, depending on the assay. That ensures that significant genes/regions have statistical support for a fold change that is not zero but greater than this value you chose. I doubt that these tiny fold changes you see are indeed biologically meeaningful. Not sure whether DiffBind implements that. By the way, I would suggest to reduce ylimits if you present a plot like this, it is probably not too informative if most data points are squished close to the center of the plot, e.g. try c(-4,4) or so.

ADD COMMENT
0
Entering edit mode

I have used DiffBind with the DESeq2 method; after counting reads and normalizing, it uses the DESeq2 functions internally to test for statistical significance. You are right that most changes are probably not biologically significant; but even statistical significance using DESeq2 surprised me in this case.

ADD REPLY
0
Entering edit mode

DESeq2 has a lfc argument in results() and lfcShrink() that allows testing against a fold change. Again, not sure whether DiffBind has that implemented. If not I guess that would be a good feature request. Can you show a plot with updated ylimits, it looks unexpectedly "clean" to the left of the plot, so not much of the expected large FCs to the left. is this shrunken?

ADD REPLY
0
Entering edit mode

Yes, DiffBind shrinks the fold changes.

Updated ylimits

ADD REPLY
0
Entering edit mode

According to the source code testing against a fold change is not implemented ¯_(ツ)_/¯

ADD REPLY
0
Entering edit mode

Is it possible simply to filter for those peaks where the fold change > 2?

ADD REPLY
2
Entering edit mode

Possible yes, but it flavours genes/regions on the left of the MA-plot. Look at this plot:

enter image description here

If you filter by a fold change threshold then you start to remove the genes/regions on the right first, and those on the left inevitably are flavoured. The ones on the left have on average fewer counts and therefore lower statistical power, plus fold change estimates are more uncertain. I am not sure of how strong this bias would manifest effectively, but the statisticians recommend to use dedicated testing so genes on the left and right get evenly removed if you start testing against FC threshold. Not sure how it is with shrunken values, I can only say that I prefer not to do manual filtering but simply test against a small minimal fold change.

ADD REPLY
0
Entering edit mode

Sure, but those sort of ad hoc filtering approaches tend to destroy FDR control and favor highly variable, lowly expressed genes. Not that that stops it from being a very commonly used method.

ADD REPLY
0
Entering edit mode

Looking at the counts posted above, it does seem that the dispersion is low. (S1-3 are one group, and S7-9 are another). So I think that the statistical significance is explained; sorry for the bother. Your points about biological significance still hold.

ADD REPLY
0
Entering edit mode
2.9 years ago
Rory Stark ★ 2.0k

From Version 3.3.2 onwards, DiffBind will recompute the statistics if you specify a fold threshold in dba.report(), dba.plotMA(), or dba.plotVolcano().

For edgeR analysis, it uses edgeR::glmTreat() (instead of edgeR::glmQLFTest()), while for DESeq2 analysis, the lfcThreshold parameter is passed to DESeq2::results() (then re-calculating the fold changes).

ADD COMMENT

Login before adding your answer.

Traffic: 2412 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6