Limma analysis for multiple groups
Entering edit mode
5.7 years ago
vyom84 ▴ 30

I have around 600 proteins quantification data from proteomics for 3 groups. The first group is Normal (111 samples) and the second group is diseased further split into two groups medium (58) and high (31) samples.

My limma design looks like this:

contrast.matrix<-makeContrasts(Normal-Medium, Normal-High, Medium-High, levels=design)

My question is whether the result obtained for Normal-High are right or validate (as the normal group consist of 111 sample and high groups has only 31 samples). Please advised whether I am going in right direction. The code I am using is:

design<-model.matrix(~ 0+Group)
colnames(design) <- c("Normal","Medium","High")
contrast.matrix<-makeContrasts(Normal-Medium, Normal-High, Medium-High, levels=design)


limma differential analysis • 8.9k views
Entering edit mode

Just to make sure, why did you apply eBayes twice, and to fit2 (which isn't created in the code you included)?

Entering edit mode
5.7 years ago
russhh 5.6k

This is a really important point. Yes, I believe the method remains valid despite the unbalanced numbers in the design. Hopefully I can show you this is the case (and hopefully I'm not completely wrong...):


Unbalanced designs in standard 'limma' workflow:

Randomly sample some differential expression: Assume all of the 500 genes have mean of 0 in normal and mean in medium and high sampled as follows:

coefs <- matrix(c(rep(0, 500), rnorm(1000, mean = 0, sd = 10)), ncol = 3)
     [,1]        [,2]      [,3]
[1,]    0  11.3496509  8.500435
[2,]    0  11.1193185 -9.253130
[3,]    0  -8.7077763  8.935812
[4,]    0   2.1073159 -9.410097
[5,]    0   0.6939565  5.389521
[6,]    0 -16.6264885 -1.819744

So for the first gene we expect mean of: 0 in normal; 11.35 in medium; 8.50 in high

Take all genes to have sd of 0.1 within the classes so the original coefs should be easily fitted

labels <- factor(
    c(rep('Normal', 111), rep('Medium', 58), rep('High', 31)),
    levels = c('Normal', 'Medium', 'High')

design <- model.matrix(~ -1 + labels)
colnames(design) <- levels(labels)

# sample some linear model coefficients for the genes
assay.mat <- t(apply(
        means = design %*% coef.row
        rnorm(length(means), mean = means, sd = 0.1)

You can see that this should give you some statistical differences by eye, if needs be (not shown here):

par(mfrow = c(2,2))
for(i in 1:4){
  plot(assay.mat[i, ] ~ labels, ylim = c(-20, 20))

Now we run the limma analysis workflow:

-- Convert assay.mat to an eset

eset <- ExpressionSet(
    assayData = assay.mat

Note the subtle change to the sign of the contrasts (compared with your contrast matrix)

contrast.matrix <- makeContrasts(
    MN = Medium - Normal,
    HN = High - Normal,
    HM = High - Medium,
    levels = design

fit <- lmFit(eset, design)
fit.cont <-, contrast.matrix)
fit.eb   <- eBayes(fit.cont)

The raw coefficients are close to what we wanted:

head(fit$coefficients, 2) # the fitted coefficients
        Normal      Medium      High
1 -0.012117565  11.3503072  8.514598
2 -0.001331074  11.1288127 -9.240092

 head(coefs[1:2, ]) # the sampled coefficients
         [,1]     [,2]      [,3]
    [1,]    0 11.34965  8.500435
    [2,]    0 11.11932 -9.253130

What should the expected contrast value be?

expected.contrasts <- data.frame(
  MN = coefs[, 2] - coefs[, 1],
  HN = coefs[, 3] - coefs[, 1],
  HM = coefs[, 3] - coefs[, 2]

The fitted contrasts were:

head($coefficients), n = 2)
        MN        HN         HM
1 11.36242  8.526716  -2.835709
2 11.13014 -9.238761 -20.368905

... and the expected contrasts were

head(expected.contrasts, n = 2)
        MN        HN         HM
1 11.34965  8.500435  -2.849216
2 11.11932 -9.253130 -20.372448

Which is smashing, so the limma workflow holds up despite the unbalanced numbers. The applicability of limma to proteomic datasets is outwith my knowledge though: do the data even look normal?

I have some reservations about your research question though. The most important contrasts are surely i) 'disease' vs 'normal' (which you haven't tested); and ii) severe disease vs moderate disease (which you have tested).

Entering edit mode

Thanks for the elaborative answer. It seems I am using the limma in correct way. I have tested diseases vs normal also. The result of diseases vs normal is more similar to diseases vs high (severe disease) as compare to diseases vs medium (moderate disease). Probably the protein relative abundance has high difference between normal and high as compared to normal vs medium.

I compared the result of disease vs normal design with a proteomics analysis package and the result do agree. But in case of disease vs normal, there is almost equal number of diseased and normal sample. My only confusion was whether normal vs high results are also significant, although looking at the results the biology of results says the result are right.

Thanks for the answer.


Login before adding your answer.

Traffic: 2469 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6