Question

Limma analysis for multiple groups

3

Entering edit mode

8.1 years ago

vyom84 ▴ 40

I have around 600 proteins quantification data from proteomics for 3 groups. The first group is Normal (111 samples) and the second group is diseased further split into two groups medium (58) and high (31) samples.

My limma design looks like this:

contrast.matrix<-makeContrasts(Normal-Medium, Normal-High, Medium-High, levels=design)

My question is whether the result obtained for Normal-High are right or validate (as the normal group consist of 111 sample and high groups has only 31 samples). Please advised whether I am going in right direction. The code I am using is:

design<-model.matrix(~ 0+Group)
colnames(design) <- c("Normal","Medium","High")
fit<-lmFit(a1c[,1:200],design)
contrast.matrix<-makeContrasts(Normal-Medium, Normal-High, Medium-High, levels=design)
fita1c<-contrasts.fit(fit,contrast.matrix)
fita1c<-eBayes(fit2)
fita1c<-eBayes(fita1c)

Thanks

limma differential analysis • 13k views

ADD COMMENT • link updated 8.1 years ago by russhh 5.7k • written 8.1 years ago by vyom84 ▴ 40

0

Entering edit mode

Just to make sure, why did you apply eBayes twice, and to fit2 (which isn't created in the code you included)?

ADD REPLY • link 8.1 years ago by andrew.j.skelton73 6.5k

score 3 · Answer 1 · 2016-03-31

This is a really important point. Yes, I believe the method remains valid despite the unbalanced numbers in the design. Hopefully I can show you this is the case (and hopefully I'm not completely wrong...):

library(Biobase)
library(limma)
set.seed(1)

Unbalanced designs in standard 'limma' workflow:

Randomly sample some differential expression: Assume all of the 500 genes have mean of 0 in normal and mean in medium and high sampled as follows:

coefs <- matrix(c(rep(0, 500), rnorm(1000, mean = 0, sd = 10)), ncol = 3)
head(coefs)
     [,1]        [,2]      [,3]
[1,]    0  11.3496509  8.500435
[2,]    0  11.1193185 -9.253130
[3,]    0  -8.7077763  8.935812
[4,]    0   2.1073159 -9.410097
[5,]    0   0.6939565  5.389521
[6,]    0 -16.6264885 -1.819744

So for the first gene we expect mean of: 0 in normal; 11.35 in medium; 8.50 in high

Take all genes to have sd of 0.1 within the classes so the original coefs should be easily fitted

labels <- factor(
    c(rep('Normal', 111), rep('Medium', 58), rep('High', 31)),
    levels = c('Normal', 'Medium', 'High')
    )

design <- model.matrix(~ -1 + labels)
colnames(design) <- levels(labels)

# sample some linear model coefficients for the genes
assay.mat <- t(apply(
    coefs,
    1,
    function(coef.row){
        means = design %*% coef.row
        rnorm(length(means), mean = means, sd = 0.1)
        }
    ))

You can see that this should give you some statistical differences by eye, if needs be (not shown here):

par(mfrow = c(2,2))
for(i in 1:4){
  plot(assay.mat[i, ] ~ labels, ylim = c(-20, 20))
  }

Now we run the limma analysis workflow:

-- Convert assay.mat to an eset

eset <- ExpressionSet(
    assayData = assay.mat
    )

Note the subtle change to the sign of the contrasts (compared with your contrast matrix)

contrast.matrix <- makeContrasts(
    MN = Medium - Normal,
    HN = High - Normal,
    HM = High - Medium,
    levels = design
    )

fit <- lmFit(eset, design)
fit.cont <- contrasts.fit(fit, contrast.matrix)
fit.eb   <- eBayes(fit.cont)

The raw coefficients are close to what we wanted:

head(fit$coefficients, 2) # the fitted coefficients
        Normal      Medium      High
1 -0.012117565  11.3503072  8.514598
2 -0.001331074  11.1288127 -9.240092

 head(coefs[1:2, ]) # the sampled coefficients
         [,1]     [,2]      [,3]
    [1,]    0 11.34965  8.500435
    [2,]    0 11.11932 -9.253130

What should the expected contrast value be?

expected.contrasts <- data.frame(
  MN = coefs[, 2] - coefs[, 1],
  HN = coefs[, 3] - coefs[, 1],
  HM = coefs[, 3] - coefs[, 2]
  )

The fitted contrasts were:

head(as.data.frame(fit.eb$coefficients), n = 2)
        MN        HN         HM
1 11.36242  8.526716  -2.835709
2 11.13014 -9.238761 -20.368905

... and the expected contrasts were

head(expected.contrasts, n = 2)
        MN        HN         HM
1 11.34965  8.500435  -2.849216
2 11.11932 -9.253130 -20.372448

Which is smashing, so the limma workflow holds up despite the unbalanced numbers. The applicability of limma to proteomic datasets is outwith my knowledge though: do the data even look normal?

I have some reservations about your research question though. The most important contrasts are surely i) 'disease' vs 'normal' (which you haven't tested); and ii) severe disease vs moderate disease (which you have tested).