Question

Limma Contrast Question

1

Entering edit mode

10.9 years ago

pichai.raman ▴ 40

Hello All,

I'm using the Limma package to perform some microarray analysis. In the past I've typically compared 2 groups in contrasts such as "GroupA-GroupB". Occasionally, I've even done comparisons where I've normalized before comparing 2 groups such as "(GroupAExp-GroupAControl)-(GroupBExp-GroupBControl)". In my current analysis I have 4 levels for a particular factor and I'd like to compare 2 of the levels to 2 of the other levels. So, my targets file is as follows.

Sample     Condition
S1         A
S2         A
S3         A
S4         B
S5         B
S6         B
S7         C
S8         C
S9         C
S10        D
S11        D
S12        D

And basically my contrast is "(A+B)-(C+D)". Which is to say I'd like to find genes that are significantly higher(or lower) in the pooled A and B group compared to the C and D group. Unfortunately, I'm not quite sure I understand the log fold changes that are output. I don't quite understand how Limma is calculating these. I could always just create another column/factor in which I put A & B together as one factor and C & D as another factor but in truth these 4 levels are distinct. So my questions to the group are

Is it better to make the contrast as I have or would it be more appropriate to create a new factor and pool A+b into 1 level and C+D into another level.
If the contrast I've chosen is correct how does the logFC get calculated, I know that its simply not taking the all the values in A+B and subtracting all the values in C+D as that doesn't yield the same result as what I see in the output.

Any thoughts would be much appreciated.

Cheers,
Pichai

Limma Contrasts • 4.8k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by pichai.raman ▴ 40

1

Entering edit mode

Thanks Brent, yeah I actually think I have no choice but to create a new column. The contrast is essentially taking the mean of the first level adding it to the mean of the second level and then subtracting the mean of the 3rd and 4th levels which is inflating the LogFC of course. I assumed somewhat naively that it would just mean all values in A and B together and subtract the mean of all values in C and D.

library(limma);

#Create Data
s1 <- abs(rnorm(10))*5; # Group A
s2 <- abs(rnorm(10))*5; # Group A
s3 <- abs(rnorm(10))*4; # Group B
s4 <- abs(rnorm(10))*4; # Group B
s5 <- abs(rnorm(10))*3; # Group C
s6 <- abs(rnorm(10))*3; # Group C
s7 <- abs(rnorm(10))*2; # Group D
s8 <- abs(rnorm(10))*2; # Group D

dataExp <- data.frame(s1, s2, s3, s4, s5, s6, s7, s8);
rownames(dataExp) <- paste("g", c(1:10), sep="");
colnames(dataExp) <- paste("s", c(1:8), sep="");

#Targets File
fTarget <- factor(c(rep("A",2), rep("B",2), rep("C",2), rep("D",2)));
design <- model.matrix(~0+fTarget);
colnames(design) <- gsub("fTarget", "", colnames(design));

#Run Limma
fit <- lmFit(dataExp, design);
myConts <- "(A+B)-(C+D)";
contrast.matrix <- makeContrasts(contrasts=paste(myConts), levels=design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
topTable(fit2)

    ID      logFC   AveExpr          t     P.Value  adj.P.Val         B
1   g1  6.4588390 4.1901582  2.8160003 0.009826156 0.09826156 -2.771584
3   g3  3.7026979 2.6308296  1.4732837 0.154284956 0.53257352 -4.466917
4   g4  4.0849422 3.8189196  1.4530554 0.159772057 0.53257352 -4.486924
5   g5 -2.7718791 1.9597527 -1.1644998 0.256207255 0.55965665 -4.747696
2   g2  2.9266876 2.9289445  1.1068964 0.279828327 0.55965665 -4.793931
9   g9  1.9081786 3.6318367  0.7606780 0.454609288 0.63881922 -5.027231
8   g8  1.7709799 1.6842968  0.7325484 0.471264278 0.63881922 -5.042671
7   g7 -1.5356162 1.1441038 -0.6676071 0.511055378 0.63881922 -5.076215
10 g10 -0.6502641 1.7959415 -0.2567874 0.799635371 0.82904464 -5.218137
6   g6 -0.4985768 0.9833397 -0.2184091 0.829044644 0.82904464 -5.225028

# Calculate logFC manually

calcFC <- data.frame(rowMeans(dataExp[1:4])-rowMeans(dataExp[5:8]));
calcFC
    rowMeans.dataExp.1.4.....rowMeans.dataExp.5.8..
g1                                        3.2294195
g2                                        1.4633438
g3                                        1.8513490
g4                                        2.0424711
g5                                       -1.3859396
g6                                       -0.2492884
g7                                       -0.7678081
g8                                        0.8854899
g9                                        0.9540893
g10                                      -0.3251320

# These do not match up, i.e. logFC doesn't equate with the calculated logFC becuase the contast isn't simply grouping them together

# But when we calculate in this way (which obviously matches the expression(
calcFC2 <- data.frame(rowMeans(dataExp[1:2])+ rowMeans(dataExp[3:4])-rowMeans(dataExp[5:6])-rowMeans(dataExp[7:8]));

calcFC2
    rowMeans.dataExp.1.2.....rowMeans.dataExp.3.4.....rowMeans.dataExp.5.6.....
g1                                                                    6.4588390
g2                                                                    2.9266876
g3                                                                    3.7026979
g4                                                                    4.0849422
g5                                                                   -2.7718791
g6                                                                   -0.4985768
g7                                                                   -1.5356162
g8                                                                    1.7709799
g9                                                                    1.9081786
g10                                                                  -0.6502641

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by pichai.raman ▴ 40

0

Entering edit mode

re 1) for simplicity, you could make a new column as you suggest:

df$condAB = df$Condition %in% c("A", "B")

and 2) you can find lots of information on how limma calculates logFC, e.g. :https://stat.ethz.ch/pipermail/bioconductor/2012-January/042950.html

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by brentp 24k

0

Entering edit mode

If you want A and B to be pooled then you won't be doing a contrast, pretty much by definition.

ADD REPLY • link 10.9 years ago by Devon Ryan 105k

0

Entering edit mode

All I wanted to do was contrast samples that are members of groups A or B with samples that are members of groups C and D. This equates to creating a new factor with 2 levels e.g. E and F , E) all samples in A and B, F) all samples in C and D. so our contrast may be "E-F"

What I did was "(A+B)-(C+D)" which I believe is still a contrast. You're contrasting the sum of A and B with that of C and D which I'm sure has some use case somewhere (just not for me right now). Is there another term for this expression, if its not a contrast?

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by pichai.raman ▴ 40

1

Entering edit mode

I think your confusion arises from thinking that the contrast (here (A+B)-(C+D)) applies to the raw values rather than to the fit model. By the time you perform the contrast, you've already fit the 4 factor model and are using the coefficients and their deviations. If you pool things, you're not doing a contrast and you're also decreasing your power since you're not compensating for a known effect (thus, the variance is increased). Perhaps this ends up making sense for the underlying biological question, perhaps not.

ADD REPLY • link 10.9 years ago by Devon Ryan 105k