I have a question regarding the results of DESeq2.
With one dataset, changing lfcThreshold from 0 to 0.75 reduced the number of DEGs from 100 to 3. For another dataset, DESeq2 gave 215 DEGs with an lfcThreshold of 0 and 101 with an lfcThreshold of 0.75.
Why is the drop in DEGs so much larger in the first case? Is it because of the variation in the read counts among the replicate samples?
Hi BS,
Here is the reason why the number of DEGs drops if you raise the lfcThreshold: link
Thank you, Andres :) I understand the implication of using a strict lfcThreshold.
Thanks Andres, that's an important answer from James on Bioconductor to link to here. BS, I am not sure of your experience but, unless you have a good reason to modify the default value of lfcThreshold, it may be better to leave it at 0 and filter for fold change in the final results table that is generated. A typical cut-off there would be adjusted p < 0.05 and absolute log2FC > 2.
I personally test against a threshold of log2(1.2), simply to get rid of genes that are statistically significant but show tiny fold changes, which (in my head) are unlikely to drive any meaningful biological differences. A small threshold like 1.2 is (I think) better than post-hoc filtering for 1.5 or 2, because post-hoc filtering favours genes that are lowly expressed and therefore more prone to show large FCs (assuming you did not shrink the FCs with DESeq2). This is pretty much what the edgeR authors recommend.
Thank you very much Kevin for the answer :) The experimental design and the species studied are different. However, the data-prep and sequencing methods were similar. In DESeq2, if you specify the lfcThreshold, the results function will ignore the p-value. Note that I have used p<0.05 in both cases.
The results function doesn't ignore the p-value.
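To make that concrete, here is a minimal Python sketch of a Wald test against a non-zero threshold, i.e. the null hypothesis |LFC| <= threshold instead of LFC = 0. It is a simplified illustration under a normal approximation, not DESeq2's own code. The function name wald_pvalue and the two example genes (their LFCs and standard errors) are made up for the illustration, and the values are raw p-values without multiple-testing adjustment.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def wald_pvalue(lfc: float, se: float, lfc_threshold: float = 0.0) -> float:
    """Two-sided Wald p-value for the null hypothesis |LFC| <= lfc_threshold.

    With lfc_threshold = 0 this reduces to the usual two-sided test of LFC = 0.
    """
    # Tail probability for the alternative LFC > +threshold
    p_above = norm_cdf((lfc_threshold - lfc) / se)
    # Tail probability for the alternative LFC < -threshold
    p_below = norm_cdf((lfc_threshold + lfc) / se)
    return min(1.0, 2.0 * min(p_above, p_below))


# Hypothetical genes: (name, log2 fold change, standard error of the LFC)
genes = [
    ("well_measured", 1.0, 0.1),   # modest LFC, estimated precisely
    ("noisy_low_count", 3.0, 1.2), # large LFC, estimated imprecisely
]

for name, lfc, se in genes:
    p_zero = wald_pvalue(lfc, se, 0.0)
    p_thresh = wald_pvalue(lfc, se, 0.75)
    # Post-hoc filtering: test against 0, then require |LFC| > 0.75
    passes_posthoc = p_zero < 0.05 and abs(lfc) > 0.75
    # Threshold test: the p-value itself is computed against 0.75
    passes_threshold = p_thresh < 0.05
    print(f"{name}: p(0)={p_zero:.3g}, p(0.75)={p_thresh:.3g}, "
          f"post-hoc={passes_posthoc}, threshold test={passes_threshold}")
```

In this toy case the noisy gene passes the post-hoc filter but not the threshold test, because its p-value is recomputed against the threshold and its large standard error leaves too little evidence that the true LFC exceeds 0.75. How many genes are lost when the threshold is raised therefore depends on the standard errors as well as on the fold changes.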
Kevin has given the one and only good answer you can give here: "due to a virtually infinite number of reasons". You have independent datasets, so results will always be different. Statistical power, the true underlying biological effect, batch effects, sequencing depth, variance and dispersion, the number of genes surviving the independent filtering: it can be anything. Results are not predictable, which is why you run experiments and apply rigorous statistics to get meaningful and reliable results. There is no simple answer for this.