I think that for an experimentalist who wants to understand why to use one of these packages over another, or any of them at all, most of these explanations are rather vague. They may not have any idea what is meant by "different statistical model."
This is what I would do:
- briefly restate the purpose of the experiment (just to get on the same page)
- briefly review how we test differential abundance
- state why we don't just use standard parametric or nonparametric methods to find significant differential abundance
In general, the main concern in analyzing an experiment is to look for differences between groups. In the case of abundance data (microarray, RNA-Seq, MiSeq, pathway analysis, etc.) we have a list of features (genes, proteins, microbiota, etc.) and a set of samples. These samples have different descriptors or characteristics associated with them such as gender, age, treatment group, treatment time, etc. Once we have our abundances, we want to know how different the abundances are between groups when we look across all samples. This is called "differential abundance," and is typically measured in terms of the "fold change" for each feature, which is the ratio of the mean abundance between groups.
Perhaps the most direct way to determine differential abundance would be to fit a separate linear regression for each feature, with the abundance as the dependent variable and the descriptor variables across samples as the predictors. For example,
y_n = b0 + b1*x1 + b2*x2 + b3*x3
where y_n is the abundance of the nth feature (e.g. gene) and (x1, x2, x3) are variables such as gender, genotype, treatment.
This linear regression gives us a p-value for each regression coefficient (the b's), telling us whether that variable is significantly associated with the abundance of the feature.
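To make the per-feature idea concrete, here is a minimal sketch in plain Python. It is a simplification, not what any of the packages do: it uses a single 0/1 descriptor instead of three, the data and feature names are made up, and the p-value comes from a permutation test (shuffling the sample labels) rather than the usual t-test, just to keep the example free of external libraries.

```python
import random
from statistics import mean

def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value for b1, obtained by shuffling the descriptor labels."""
    rng = random.Random(seed)
    observed = abs(fit_simple_regression(x, y)[1])
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so that floating-point noise does not hide ties.
        if abs(fit_simple_regression(shuffled, y)[1]) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 8 samples, one descriptor coded 0/1, two features.
x = [0, 0, 0, 0, 1, 1, 1, 1]
abundances = {
    "feature_1": [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1],  # no real group difference
    "feature_2": [5.0, 5.2, 4.9, 5.1, 9.8, 10.1, 10.3, 9.9],     # clear group difference
}

# One independent regression per feature.
results = {}
for name, y in abundances.items():
    b0, b1 = fit_simple_regression(x, y)
    results[name] = {"b0": b0, "b1": b1, "p": permutation_p_value(x, y)}
    print(name, results[name])
```

With a 0/1 descriptor, b1 is simply the difference between the two group means, so feature_2 gets a large coefficient and a small p-value while feature_1 does not.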
Suppose feature y_10 has a significant result for x1, which happens to be gender. Then we would calculate the fold change FC = mean(Female)/mean(Male), and say that
feature y_10 is significantly differentially abundant between genders, with a fold change of FC
Note: mean(Female) means "take the mean of the abundances of y_10 for all samples with gender=Female"
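As a quick illustration of the fold change calculation, here is a small sketch with made-up abundances for y_10. The numbers and the helper name group_mean are hypothetical. I also print the log2 fold change, since that is the form in which fold changes are commonly reported.

```python
import math
from statistics import mean

# Hypothetical abundances of feature y_10 and the gender of each sample.
abundance = [120.0, 135.0, 110.0, 125.0, 60.0, 55.0, 70.0, 65.0]
gender = ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"]

def group_mean(values, labels, group):
    """Mean of the values whose label equals `group`."""
    return mean(v for v, g in zip(values, labels) if g == group)

mean_female = group_mean(abundance, gender, "Female")
mean_male = group_mean(abundance, gender, "Male")

fold_change = mean_female / mean_male        # ratio of the group means
log2_fold_change = math.log2(fold_change)    # positive means higher in Female

print(f"FC = {fold_change:.2f}, log2(FC) = {log2_fold_change:.2f}")
```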
The reason we don't use this simplistic approach in most cases is that the distributions of abundances don't meet the requirements of our standard tests (normality, homogeneity of variances, etc.). Therefore, to perform our tests we need to use validated algorithms that generally perform these steps:
- Start with an idea of what kind of distribution our data has (negative binomial, Poisson, lognormal, etc.)
- Apply specific transformations to the data so that it fits the chosen model adequately
- Perform some sort of regression analysis to obtain significance. This may involve multiple additional steps and statistical techniques.
- Use more advanced theory to estimate the fold changes taking into account all of the previous work.
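To see why the distribution matters, here is a rough sketch that checks the mean-variance relationship on made-up count data. Under a Poisson model the variance should be about equal to the mean; under a negative binomial model the variance is mean + alpha*mean^2, where alpha is the dispersion. The counts and feature names are hypothetical, and the moment-based estimate of alpha below is deliberately crude; as far as I know the packages estimate dispersion with considerably more care.

```python
from statistics import mean, variance

# Hypothetical raw counts for two features across six samples.
counts = {
    "feature_low": [2, 5, 4, 7, 1, 5],
    "feature_high": [100, 180, 60, 150, 90, 140],
}

stats = {}
for name, values in counts.items():
    m = mean(values)
    v = variance(values)  # sample variance (n - 1 in the denominator)
    stats[name] = {
        "mean": m,
        "variance": v,
        # A ratio near 1 is what a Poisson model would predict.
        "variance_to_mean": v / m,
        # Crude method-of-moments dispersion from var = mean + alpha * mean^2.
        "alpha": max((v - m) / m ** 2, 0.0),
    }
    print(name, stats[name])
```

Here the low-abundance feature looks roughly Poisson, while the high-abundance feature has a variance many times its mean. The variance changes with the abundance level, which is exactly the "homogeneity of variances" assumption being violated.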
edgeR, limma, and DESeq2 (as well as DESeq) each perform these steps with a specific set of assumptions and techniques, chosen to work as well as possible for the type of data the package was designed for.
It is still up to the user to
- Choose the correct tool
- Use the tool correctly by selecting values for different parameters the model can include
- Properly interpret the results.
The last part is why they pay you lots of money!