DE analysis with limma (What to include in the model?)
Entering edit mode
3.7 years ago
grayapply2009 ▴ 280

In my RNA-seq data, I have 6 groups (A, B, C, D, E, F) and I'd like to compare A and D. I did the DE analysis with limma and found that including all groups in the model versus including only the two groups being compared gave me very different numbers of DEGs (at P < 0.05). For example, the first method below detected 6000 DEGs while the second detected only 900. Why is there such a huge difference between the two methods, and which one should I use?

Here are 2 ways to do the DE analysis:

1. Include all the groups and comparisons in the design matrix and contrast matrix:

design <- model.matrix(~ 0 + group, data)

contrast <- makeContrasts(
    A.vs.B = A-B, A.vs.C = A-C, A.vs.D = A-D, A.vs.E = A-E, A.vs.F = A-F,
    B.vs.C = B-C, B.vs.D = B-D, B.vs.E = B-E, B.vs.F = B-F,
    C.vs.D = C-D, C.vs.E = C-E, C.vs.F = C-F,
    D.vs.E = D-E, D.vs.F = D-F,
    E.vs.F = E-F,
    levels = design)

fit1 <- lmFit(object = data, design = design)
fit2 <- contrasts.fit(fit1, contrast)
fit2 <- eBayes(fit2)
topTable(fit2, coef = "A.vs.D")

2. Subset groups A and D, and fit the model on only these 2 groups:

design <- model.matrix(~ 0 + group_sub, data_sub)

contrast <- makeContrasts(A.vs.D = A-D, levels = design)

(then lmFit, contrasts.fit, eBayes and topTable as in #1)

I wrote some code specifically to compare the above 2 methods. Here are the code and data.

DE analysis • limma • Differential Expression
Entering edit mode

It's hard to believe that subsetting the data down to just the samples you are testing would change the number of DEGs from 6000 to 900. My guess is that 6000 DEGs is the correct result and that something went wrong in the second limma run.

Entering edit mode

Davide, we are working on minimal information from the user.

Entering edit mode

Thank you, Davide and Kevin. I included my script and data in this post. Please take a look at them; I would really appreciate any help.

Entering edit mode

Thanks, but there is still much code missing from your second 'subgroup' analysis. It would also help to show the output results tables so that we could see that information. Anything else that you could output would help too, for example, even something simple like str(fit2) from both #1 and #2.
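For example, something along these lines (a minimal sketch using the fit2 object name from your post; substitute whatever your two fits are actually called):

str(fit2)                                        # structure of the MArrayLM object after eBayes()
head(topTable(fit2, coef = "A.vs.D", n = 20))    # top of the results table for the A vs D contrast
summary(decideTests(fit2))                       # up / down / not-significant counts per contrast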

In general, though, your approach #1 should be used, unless your sub-groups are so completely different in their transcriptomic profiles such that analysing them in the same dataset introduces bias (e.g., analysing neurons and melanocytes together).

Entering edit mode

Hi Kevin, I literally included all my data and code in the post. Please take a look at the last sentence in the above post. There is a link to my data and code. Sorry for not making it clear.

Entering edit mode

Thanks, here is your full script:

library(edgeR)
library(limma)
library(Biobase)

# Read counts and sample metadata, filter and normalise
dat <- read.delim('data.tsv', as.is = TRUE)
rownames(dat) <- dat[, 1]
#dat <- dat[, colnames(dat) != 'S24']
dat <- as.matrix(dat[, -1])
meta <- read.delim('meta.csv', as.is = TRUE, sep = ",")
rownames(meta) <- meta$SampleID_2
meta <- meta[colnames(dat), ]
dge <- DGEList(counts = dat, samples = meta)
isexpr <- rowSums(cpm(dge) >= 1) >= 1   # keep genes with CPM >= 1 in at least one sample
print(table(isexpr))
dge <- dge[isexpr, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)
expr <- voom(dge, plot = FALSE)


# DE approach #1: include all groups and comparisons in the model
design_full <- model.matrix(~ 0 + Sex_Geno_Treatment + RIN, data = expr$targets)
colnames(design_full) <- gsub("Sex_Geno_Treatment", "", colnames(design_full))
contrast_full <- makeContrasts(
        KD.vs.Ctrl.in.E4.Female = F_E4_KD_12 - F_E4_Ctrl_12,
        KD.vs.Ctrl.in.E3.Female = F_E3_KD_12 - F_E3_Ctrl_12,
        OE.vs.Ctrl.in.E4.Female = F_E4_OE_34 - F_E4_Ctrl_34,
        OE.vs.Ctrl.in.E3.Female = F_E3_OE_34 - F_E3_Ctrl_34,
        levels = design_full)

fit1_full <- lmFit(object = expr, design = design_full)
fit2_full <- contrasts.fit(fit1_full, contrast_full)
fit2_full <- eBayes(fit2_full)
deg_full_model <- topTable(fit2_full, coef = "KD.vs.Ctrl.in.E3.Female", n = Inf, sort.by = "p", p.value = 0.05, lfc = log2(1.2))
print(nrow(deg_full_model))



# DE approach #2: include only the two groups in the current comparison
exprls_sub <- expr[, expr$targets$Sex_Geno_Treatment %in% c("F_E3_KD_12", "F_E3_Ctrl_12")]
design_part <- model.matrix(~ 0 + Sex_Geno_Treatment + RIN, data = exprls_sub$targets)
colnames(design_part) <- c("F_E3_Ctrl_12", "F_E3_KD_12", "RIN")
contrast_part <- makeContrasts(contrasts = "F_E3_KD_12-F_E3_Ctrl_12", levels = design_part) 
fit1_part <- lmFit(object = exprls_sub, design = design_part)
fit2_part <- contrasts.fit(fit1_part, contrast_part)
fit2_part <- eBayes(fit2_part)

deg_part_model <- topTable(fit2_part, n = Inf, sort.by = "p", p.value = 0.05, lfc = log2(1.2))
print(nrow(deg_part_model))

Now I know more about your data and workflow.

Why do you feel that you need to adjust for sex, too, and RIN?

You should also be filtering by adjusted p-values, not nominal p-values (Edit: according to the current default, not specifying an adjustment method means that BH is chosen). Your threshold for log2 fold change is also very low.
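For example, a minimal sketch of that filter, assuming the object names from your script above (topTable uses BH adjustment by default and reports it in the adj.P.Val column):

tt_full <- topTable(fit2_full, coef = "KD.vs.Ctrl.in.E3.Female", n = Inf, adjust.method = "BH")
sum(tt_full$adj.P.Val < 0.05)                            # DEGs at FDR 5%
sum(tt_full$adj.P.Val < 0.05 & abs(tt_full$logFC) > 1)   # with a more stringent |log2FC| > 1 cut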

Irrespective of everything, three people in this thread have reached the same general consensus that approach #1 is better, except in this situation:

In general, though, your approach #1 should be used, unless your sub-groups are so completely different in their transcriptomic profiles such that analysing them in the same dataset introduces bias (e.g., analysing neurons and melanocytes together).

Entering edit mode

grayapply2009, if you want a definitive answer, I encourage you to post this question on Bioconductor Support (where I am also a moderator). There, the edgeR/limma developers are more likely to respond, and it feels like only hearing from them will close this issue off for you. Please link back to this thread when you post there. Thanks!

Entering edit mode

Thank you, Kevin. I just don't understand why subsetting the data could cause such a huge difference. I think I'll just have to read more about the algorithm.

By the way, p.value = 0.05 in topTable represents adjusted p < 0.05. FC > 1.2 is the conventional threshold our lab has been using. I didn't adjust for sex because I only compared within females. I adjusted for RIN because RIN is significantly correlated with most of the variables.
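A quick way to confirm that (a sketch using the object names from the script above): the p.value argument of topTable() is applied to the adjusted p-values, so the two counts below should agree when no lfc cutoff is used.

nrow(topTable(fit2_part, n = Inf, p.value = 0.05))   # filtered inside topTable()
tt_part <- topTable(fit2_part, n = Inf)
sum(tt_part$adj.P.Val <= 0.05)                       # filtered manually on adj.P.Val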

Entering edit mode

Why don't you create a single R script that takes one input file, then compare the results from running it on the complete dataset and on a file containing only the subset?

You can use the same code eg:

design <- model.matrix(~ 0 + group, data)

contrast <- makeContrasts(A.vs.D = A-D, levels = design)

That would give you some confidence that the code is identical and that the only thing changing is the input file, and then you can compare the DEGs produced by each run.
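A minimal sketch of that idea (the run_de function and variable names are hypothetical; the file names, column names and pipeline steps are taken from the script posted above, with the subset selected via an argument rather than a second file, which has the same effect):

library(edgeR)
library(limma)

run_de <- function(counts_file, meta_file, groups_to_keep = NULL) {
    # Identical pipeline for every call; only the input (or the optional subsetting) changes.
    dat <- read.delim(counts_file, as.is = TRUE)
    rownames(dat) <- dat[, 1]
    dat <- as.matrix(dat[, -1])
    meta <- read.delim(meta_file, as.is = TRUE, sep = ",")
    rownames(meta) <- meta$SampleID_2
    meta <- meta[colnames(dat), ]
    if (!is.null(groups_to_keep)) {
        keep <- meta$Sex_Geno_Treatment %in% groups_to_keep
        dat  <- dat[, keep]
        meta <- meta[keep, ]
    }
    dge <- DGEList(counts = dat, samples = meta)
    dge <- dge[rowSums(cpm(dge) >= 1) >= 1, , keep.lib.sizes = FALSE]
    dge <- calcNormFactors(dge)
    v <- voom(dge)
    design <- model.matrix(~ 0 + Sex_Geno_Treatment + RIN, data = v$targets)
    colnames(design) <- gsub("Sex_Geno_Treatment", "", colnames(design))
    contrast <- makeContrasts(contrasts = "F_E3_KD_12-F_E3_Ctrl_12", levels = design)
    fit <- eBayes(contrasts.fit(lmFit(v, design), contrast))
    topTable(fit, n = Inf, sort.by = "p")
}

res_full   <- run_de("data.tsv", "meta.csv")
res_subset <- run_de("data.tsv", "meta.csv", groups_to_keep = c("F_E3_KD_12", "F_E3_Ctrl_12"))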

Entering edit mode

That IS what I did. Please take a look at my data and the single script I provided in this post. The link is in the last sentence of the post.

Entering edit mode

I agree with the points Kevin is making. What are the DEG counts after filtering on adjusted p-values for both sets?

Entering edit mode
3.7 years ago
h.mon 35k

limma was originally developed for microarray data. One can use limma for RNA-seq analysis via the voom transformation; are you using it?

I am almost certain I have seen the same question discussed in depth before, either here at Biostars or at the Bioconductor support forum. However, I can't find the relevant post, so I will give a superficial explanation: limma uses an empirical Bayes method to squeeze the residual variance of each gene towards a global trend and, in doing so, also increases the degrees of freedom for the individual variance estimates. This means the presence of the additional samples will modify these estimates and thus affect the test results.
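You can see this directly in the two fits from the script posted in the comments above (a sketch, assuming those object names): the prior degrees of freedom and the squeezed variances are estimated from the genes and residual degrees of freedom available in each fit, so they differ between the full and subsetted analyses.

fit2_full$df.residual[1]     # residual d.f. per gene: larger when all groups are kept
fit2_part$df.residual[1]
fit2_full$df.prior           # extra d.f. contributed by the empirical Bayes prior
fit2_part$df.prior
summary(fit2_full$s2.post)   # posterior (squeezed) residual variances used in the moderated t-tests
summary(fit2_part$s2.post)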

Entering edit mode

Thank you for your reply. Yes, the data were voom-transformed. Which method do you recommend? I've been using method 2 all along and it has worked fine for me. The huge difference between these two methods surprised me. By the way, even the logFC values were slightly different between the 2 methods. Any ideas why this happened?

Entering edit mode

You're subsetting the entire dataset in the second approach, so the parameter estimates will differ. You should use approach #1, in my opinion.
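If it helps, a sketch of how the two sets of estimates could be compared directly (object names taken from the script above). Because RIN is a shared covariate, its coefficient is estimated from all samples in approach #1 but only from the subset in approach #2, which shifts the logFCs slightly; the residual variances and empirical Bayes shrinkage also differ, which shifts the t-statistics and p-values.

tt_full <- topTable(fit2_full, coef = "KD.vs.Ctrl.in.E3.Female", n = Inf)
tt_part <- topTable(fit2_part, n = Inf)
common  <- intersect(rownames(tt_full), rownames(tt_part))
cor(tt_full[common, "logFC"], tt_part[common, "logFC"])
plot(tt_full[common, "logFC"], tt_part[common, "logFC"],
     xlab = "logFC, full model", ylab = "logFC, two-group model")
abline(0, 1, col = "red")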
