Limma t- and F-tests: should the degrees of freedom correspond to all samples if only some treatments are used in the contrast?
Sarah • 1 day ago

I noticed that the t-test degrees of freedom in limma (and F-test denominator degrees of freedom) are always the same regardless of which treatments are in a contrast. However, I have a lot of missing data for some specific features in some specific treatments in a phosphoproteomics dataset.

Is it statistically correct that a t- or F-test involving only treatments with few or no missing values suffers a loss in degrees of freedom because treatments _not in the contrast(s) being tested_ have missing values?

This can lead to an unintuitive ordering of t and F-statistics for features in my data.


For example, I have a toy dataset where some features have many more missing values than others, but only in certain treatments:

[image: example_data — example features plotted per treatment, with missing values in some treatments]

When I do a t-test on treatments where the sample number is the same between features 1 and 2 (n = 3 for all the blue treatments), the degrees of freedom are still very different because of the other treatments that have missing values:

library(limma)

# Specify model: means model, no intercept
design <- model.matrix(~0 + treatment, data = metadata)

# Fit model
fit <- lmFit(object = counts, design = design)  # warning is that some treatments have all NA values

# Specify contrasts
# Choose contrasts where sample size for example features is the same
contr <- makeContrasts(
    contrasts = c(
        "treatmenttreatment2 - treatmenttreatment1",
        "treatmenttreatment3 - treatmenttreatment1"
    ),
    levels = colnames(coef(fit))
)

# Estimate contrasts
# Use empirical bayes moderation
fit_contrasts <- contrasts.fit(fit, contr)
fit_contrasts_ebayes <- eBayes(fit_contrasts, trend = TRUE)

# Investigate results for features 1 and 2
print(fit_contrasts_ebayes$df.residual[1:2])
print(topTable(fit_contrasts_ebayes, coef = 1, number = 2, sort.by = "none"))

This leads to the opposite order of significance from what I would expect (feature 2 is more significant than feature 1, even though feature 1's logFC is larger in absolute value):

> print(fit_contrasts_ebayes$df.residual[1:2])
[1] 36 58

> print(topTable(fit_contrasts_ebayes, coef = 1, number = 2, sort.by = "none"))
             logFC  AveExpr         t      P.Value    adj.P.Val        B
feature1  5.440525 15.84160  7.507528 4.740642e-09 2.370321e-07 10.49628
feature2 -2.776424 22.47795 -8.531085 5.767286e-12 5.767286e-10 16.92264
limma Proteomics
Gordon Smyth ★ 8.4k • 13 hours ago

If you have a mass spectrometry phosphoproteomics dataset with lots of missing values, then you should be using the limpa package (https://doi.org/doi:10.18129/B9.bioc.limpa), which is specifically designed for this type of data. limpa provides an enhanced limma pipeline that takes into account the information provided by the missing values. For phosphoproteomics data, you should use the dpcImpute() pipeline briefly mentioned in the vignette, which preserves the data at the feature level instead of quantifying by protein.

The standard limma pipeline is statistically correct assuming that (i) the missing values are missing at random and (ii) all observations for a given feature have the same precision.

The residual degrees of freedom are determined by the total number of non-missing observations relative to the number of groups. They do not depend on which contrast you are forming, and are not supposed to. You will still have more power to detect DE for groups with more non-missing observations, but that power comes directly from the number of observations in those groups rather than from the degrees of freedom.
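To make the df arithmetic concrete, here is a minimal base-R sketch of the same calculation (this is my reading of what lmFit does per feature when there are NAs, not limma's actual code): the residual df for a feature is its number of non-missing observations minus the rank of the design matrix restricted to those observations.

```r
# Sketch of the residual-df arithmetic (assumed to mirror lmFit's per-feature
# behaviour with NAs; not limma's actual code).
resid_df <- function(y, design) {
  ok <- !is.na(y)
  sum(ok) - qr(design[ok, , drop = FALSE])$rank
}

treatment <- factor(rep(paste0("t", 1:4), each = 3))  # 4 groups, n = 3 each
design <- model.matrix(~0 + treatment)

set.seed(1)
y1 <- rnorm(12)              # feature with no missing values
y2 <- y1; y2[10:12] <- NA    # feature missing one whole group

resid_df(y1, design)         # 12 obs - 4 estimable coefficients = 8
resid_df(y2, design)         # 9 obs - 3 estimable coefficients = 6
```

Note that no contrast appears anywhere in this calculation: the df are a property of the feature's observations and the design, not of the comparison you later test.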

I do not see anything strange or unintuitive about the results that you report. Statistical significance depends not just on logFCs but also on variability and on the number of observations averaged to get the logFCs. Different features will have different precisions, so the DE results will not simply rank by fold-change. If you were going to do something so simplistic as ranking by fold-change, then you wouldn't need statistics at all.

limpa enhances the standard limma pipeline by leveraging the fact that missing values are not at random.

LChart 5.1k • 20 hours ago

I'm sure Gordon Smyth can provide a far more detailed explanation, but until he chimes in I'll give it a shot:

From the perspective of the t-test, you have a setup that looks like

y = Xb + e

where y is the feature response (n x 1), X is your treatment matrix (n x k; binary), b is the coefficient vector, and e is the error. We'll assume everything is 0-centered.

The least squares operator (X'X)^-1 X' is what links the observations to the statistics (indeed, for a contrast design C, the expectation of the estimated contrast is simply C (X'X)^-1 X' E[y] = Cb).

This breaks the full set of n observations into k constrained variates (coming from the (k x k) matrix X'X) and n - k residual variates. Critically, this does not depend on the structure of X (except in degenerate cases, e.g. when the column rank of X is less than k). The "residual variates" can be thought of as the effective sample that determines the (single!) error variance. (Single, here, is important.)
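A quick base-R illustration of this algebra (toy data; all numbers and names are assumptions for the example): the operator recovers the k group means, a contrast is just a difference of two of them, and the residual degrees of freedom are n - k regardless of which contrast you pick.

```r
# Toy illustration of the least-squares algebra above (base R).
set.seed(7)
k <- 4; n_per <- 3
g <- factor(rep(1:k, each = n_per))
X <- model.matrix(~0 + g)                  # n x k indicator (block) matrix
y <- rnorm(k * n_per)

b_hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^-1 X'y = the k group means
ctr   <- c(1, 0, 0, -1)                    # contrast: group 1 vs group 4

drop(ctr %*% b_hat)                        # same as mean(y[g == 1]) - mean(y[g == 4])
nrow(X) - qr(X)$rank                       # n - k = 8 residual degrees of freedom
```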

In your case, X is binary, and your question is really about what it means when the two following cases disagree (here I will assume X has only 4 columns and that the observations are ordered by condition):

Case 1:

C = [1 0 0 -1]; X = [I1, I2, I3, I4] (indicators, block diagonal); Y = [Y1 | Y2 | Y3 | Y4]

Case 2:

C = [1 -1]; X=[I1, I4] (indicators, block diagonal); Y = [Y1 | Y4]

If there are only differences in mean (homoskedasticity across groups), then Case 2 kills your power for no gain, as you have lost confidence in the error variance due to the "lost" residual degrees of freedom.

If there are differences in mean and variance (heteroskedasticity), then Case 1 will appear to kill your power, since high-variance groups will increase the estimate of the (assumed to be constant) error variance.
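A deterministic toy example of that second effect (all numbers made up for illustration): one high-variance group that is not even in the contrast dominates the pooled error-variance estimate that the contrast between the two quiet groups must use.

```r
# Toy numbers (made up for illustration): two quiet groups and one noisy one.
quiet1 <- c(0.1, -0.2, 0.0, 0.2, -0.1)   # low-variance group, in the contrast
quiet2 <- quiet1 + 1                      # low-variance group, in the contrast
noisy  <- c(5, -4, 3, -6, 2)              # high-variance group, NOT in the contrast

# Case 1: error variance pooled across all three groups (equal n per group,
# so this is a simple average of the within-group variances)
pooled_var <- (var(quiet1) + var(quiet2) + var(noisy)) / 3

# Case 2: error variance estimated from the two contrasted groups only
pair_var <- (var(quiet1) + var(quiet2)) / 2

pooled_var / pair_var   # about 300x: the noisy group dominates the pooled estimate
```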

In your case, feature 1 looks by eye to be heteroskedastic, with treatments 13, 14, and 16 clearly having larger errors than the others. I think it's this increase in overall variance (driven by a few groups) that is leading to the results you observe, and that it has little to do with loss of degrees of freedom due to missing values.

In general, if you feel like you're getting "better results" by treating groups in a pairwise fashion - it's probably a good bet that the groups are heteroskedastic.

Thank you for the quick and thorough answer! Indeed, I checked and the variance differs between my two features. However, I could imagine a scenario where the variance is the same but a difference in missingness still alters the t-statistic via the degrees of freedom. I understand from your answer that this is expected: in that case the difference in t-statistics would reflect a different level of certainty about the same variance, because one feature has more points with which to estimate it than the other.

For my case I think I will restrict the question to your Case 2 to get an "intuitive" ranking of features for the contrast I care about and to ignore some of the heteroskedasticity in other treatments.

Edit: I answered too soon! Thanks Gordon for your answer as well and for pointing me towards limpa. What you write makes sense and I'll check it out.

Chopping up your data into lots of two group comparisons is almost never the right way to go. Almost always, it is better to analyse all the data at once. You haven't shown any evidence of heteroskedasticity between treatment groups for your real data but, if it does exist, then just tell limma to downweight groups according to their variability. In limpa, you do this by setting var.group=treatment when you run dpcDE().

