I don't know if I would call it more reliable, but it does do additional calculations that other methods don't. For one, DESeq2 "shrinks" the fold changes of genes that have low read counts. I don't pretend to understand the math behind it, but in general it pulls the log2 fold change toward zero for any gene that has low read counts in one or both conditions, because genes with low read counts can have exaggerated fold changes. For example, imagine you have two conditions, each with 3 replicates. For gene A, the control read counts are 1, 2 and 2 (average 1.67) and the experiment read counts are 4, 3 and 4 (average 3.67). Gene B has control counts of 100, 200 and 200 (average 166.67) and experimental counts of 400, 300 and 400 (average 366.67). The fold change for both genes is 2.2 (a log2 fold change of about 1.14), and both may also be significant according to the adjusted p-value (I've checked, it happens). However, a difference of 2 reads on average is not much, and I would not call gene A differentially expressed unless the change is reproduced across many replicates, so DESeq2 shrinks its fold change accordingly.
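The arithmetic in the example can be checked in a few lines of Python; both ratios come out to exactly 11/5 = 2.2. The `shrunk_lfc` part is a toy normal-prior (empirical-Bayes style) calculation, not DESeq2's actual implementation: it assumes Poisson counts, ignores size factors and dispersion, and uses a hypothetical prior standard deviation (`prior_sd=1.0`) on the log2 fold change. It only illustrates why the same 2.2-fold change is pulled toward zero when it rests on a handful of reads.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def fold_change(control, treated):
    """Return (fold change, log2 fold change) of treated vs. control means."""
    fc = mean(treated) / mean(control)
    return fc, math.log2(fc)

def shrunk_lfc(control, treated, prior_sd=1.0):
    """Toy normal-prior shrinkage of the log2 fold change (NOT DESeq2's code).

    Assumes Poisson counts, so the variance of the log of a mean over n
    replicates is roughly 1 / (n * mean). Size factors and overdispersion
    are ignored. prior_sd is a hypothetical prior SD on the log2FC.
    """
    _, lfc = fold_change(control, treated)
    # approximate sampling variance of the natural-log ratio of means
    var_ln = 1 / (len(control) * mean(control)) + 1 / (len(treated) * mean(treated))
    # convert that variance to log2 units
    var_log2 = var_ln / math.log(2) ** 2
    # posterior mean under a N(0, prior_sd^2) prior: noisier estimates shrink more
    weight = prior_sd ** 2 / (prior_sd ** 2 + var_log2)
    return weight * lfc

genes = {
    "A": ([1, 2, 2], [4, 3, 4]),
    "B": ([100, 200, 200], [400, 300, 400]),
}
for name, (ctrl, trt) in genes.items():
    fc, lfc = fold_change(ctrl, trt)
    print(f"gene {name}: FC = {fc:.2f}, log2FC = {lfc:.2f}, "
          f"toy shrunken log2FC = {shrunk_lfc(ctrl, trt):.2f}")
```

Both genes print a log2FC of 1.14, but the toy shrunken value is about 0.71 for gene A versus 1.13 for gene B: the low-count estimate is far noisier, so the prior pulls it much harder toward zero.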
I've compared DESeq2 to edgeR, and while I like both methods, edgeR does return many significant genes whose fold changes are exaggerated by low (or zero) read counts, whereas DESeq2 shrinks those fold changes to where they generally fall below my cutoff for differential expression. When I filter for differential expression, I generally use both the padj value and the fold change. Unless you have a lot of replicates, fold changes estimated from low counts may not be very accurate. Thus, I use DESeq2 specifically because it adjusts the fold changes of genes with low read counts.
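A sketch of the two-cutoff filter described above, written in plain Python rather than R. In a real analysis the columns would come from DESeq2's `results()` and `lfcShrink()` output; here the cutoffs (padj < 0.05, |log2FC| ≥ 1) are assumptions rather than the values used above, and the gene names and numbers are made up to show how a low-count gene with an inflated raw fold change drops out once the shrunken value is used.

```python
from dataclasses import dataclass

@dataclass
class DEResult:
    gene: str
    log2fc: float         # unshrunken log2 fold change
    shrunk_log2fc: float  # shrunken log2 fold change
    padj: float           # BH-adjusted p-value

def call_de(results, use_shrunk=True, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes passing both the padj cutoff and the |log2FC| cutoff."""
    hits = []
    for r in results:
        lfc = r.shrunk_log2fc if use_shrunk else r.log2fc
        if r.padj < padj_cutoff and abs(lfc) >= lfc_cutoff:
            hits.append(r.gene)
    return hits

# Made-up values for illustration only (hypothetical genes)
results = [
    DEResult("lowCountGene", log2fc=2.8, shrunk_log2fc=0.4, padj=0.01),
    DEResult("highCountGene", log2fc=1.6, shrunk_log2fc=1.5, padj=0.001),
    DEResult("noisyGene", log2fc=2.0, shrunk_log2fc=1.9, padj=0.30),
]

print(call_de(results, use_shrunk=False))  # ['lowCountGene', 'highCountGene']
print(call_de(results, use_shrunk=True))   # ['highCountGene']
```

Requiring both cutoffs means a gene needs a convincing p-value and an effect size large enough to matter; using the shrunken fold change for the second cutoff is what removes the low-count case.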
It's unlikely that any one method is best for all experiments, and methods can be evaluated on many metrics (accuracy in estimating effect sizes, FDR control, sensitivity, robustness, etc.). Here are just a few important ways in which even a standard bulk RNA-seq experiment can differ:
- number of biological replicates per group
- number of groups
- experimental design
- batch effects
- amount of within-group biological variability (this differs greatly between a controlled experiment and an observational study)
- scale of the effect sizes (big or small differences between groups)
- proportion of genes/features that show differences between groups
- presence of outliers
We like to remind users that, with many replicates and exchangeable samples, rank tests or permutation tests are great because you don't have to make distributional assumptions. It's just that investigators often don't want to spend money on extra samples when, e.g., 3 or 5 replicates per group will suffice to find the large effects while letting them examine more conditions.
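To make the replicate-count point concrete, here is a sketch of a generic exact two-sided permutation test on the difference in group means (a textbook test, not something DESeq2 runs). With 3 vs. 3 samples there are only C(6,3) = 20 ways to assign the labels, so even gene B from the earlier example, whose groups do not overlap at all, cannot get a p-value below 0.1, and multiple-testing correction can only push that value higher.

```python
from itertools import combinations
from math import comb
from statistics import mean

def perm_test(x, y):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = list(x) + list(y)
    n, k = len(pooled), len(x)
    observed = abs(mean(y) - mean(x))
    extreme = total = 0
    # enumerate every way of choosing which samples are labelled as group x
    for idx in combinations(range(n), k):
        chosen = set(idx)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # small tolerance so floating-point ties with the observed split still count
        if abs(mean(g2) - mean(g1)) >= observed - 1e-9:
            extreme += 1
    return extreme / total

def min_two_sided_p(n_per_group):
    """Smallest achievable two-sided p-value with two equal-sized groups:
    the observed labelling and its mirror image are always equally extreme."""
    return 2 / comb(2 * n_per_group, n_per_group)

print(perm_test([100, 200, 200], [400, 300, 400]))  # 0.1
for n in (3, 5, 10):
    print(n, min_two_sided_p(n))
```

The floor drops from 0.1 with 3 per group to about 0.008 with 5 and about 1e-5 with 10, which is why these assumption-free tests only become practical with many replicates.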
With these differences in mind, I'd recommend looking for independent evaluations by third parties.