Question: DE analysis with limma (What to include in the model?)
0
grayapply2009220 wrote:

In my RNAseq data, I have 6 groups (A, B, C, D, E, F). I'd like to compare between A and D. I did DE analysis with limma. I found including all groups in the model and including only the 2 comparison groups gave me significantly different numbers of DEGs (P < 0.05). For example, the first method below detected 6000 DEGs while the 2nd method only detected 900 DEGs. Why is the huge difference between the two methods and which one should I use?

Here are 2 ways to do the DE analysis:

1. Include all the groups and comparisons in the design matrix and contrast matrix:

``````design <- model.matrix(~ 0 + group, data)

contrast <- makeContrasts(A.vs.B = A-B, A.vs.C = A-C, A.vs.D = A-D, A.vs.E = A-E, A.vs.F = A-F, B.vs.C = B-C, B.vs.D = B-D, B.vs.E = B-E, B.vs.F = B-F, C.vs.D = C-D,  C.vs.E = C-E,  C.vs.F = C-F,  D.vs.E = D-E,  D.vs.F = D-F,  E.vs.F = E-F)

fit1 <- lmFit(object = data, design = design)
fit2 <- contrasts.fit(fit1, contrast)
fit2 <- eBayes(fit2)
topTable(fit2, coef = A.vs.D)
``````

2. Subset group A and D, and focus only on these 2 groups:

``````design <- model.matrix(~ 0 + group_sub, data_sub)

contrast <- makeContrasts(A.vs.D = A-D)
``````

I wrote some code specifically to compare the above 2 methods. Here are the code and data.

modified 6 weeks ago • written 7 weeks ago by grayapply2009220

Thats hard to believe that by subsetting the data by columns to just the samples you are testing will differ in DEG from 6000 vs 900 is OK with folks here. My guess is 6000 DEG is the correct one and you messed up somewhere on the second Limma run.

Davide, we are working on minimal information from the user.

Thank you, Davide and Kevin. I included my script and data in this post. Please take a look at them. I really appreciate it if you can help me.

1

Thanks, but, there is much code missing from your second 'subgroup' analysis. Also, it would help to show output results tables so that we could see that information. Anything else that you could output would help, too, for example, even something simple like `str(fit2)` from both #1 and #2.

In general, though, your approach #1 should be used, unless your sub-groups are so completely different in their transcriptomic profiles such that analysing them in the same dataset introduces bias (e.g., analysing neurons and melanocytes together).

Hi Kevin, I literally included all my data and code in the post. Please take a look at the last sentence in the above post. There is a link to my data and code. Sorry for not making it clear.

Thanks, here is your full script:

``````library(edgeR)
library(limma)
library(Biobase)

#data norm
dat <- read.delim('data.tsv', as.is = TRUE)
rownames(dat) <- dat[, 1]
#dat <- dat[, colnames(dat) != 'S24']
dat <- as.matrix(dat[, -1])
meta <- read.delim('meta.csv', as.is = TRUE, sep = ",")
rownames(meta) <- meta\$SampleID_2
meta <- meta[colnames(dat), ]
dge <- DGEList(counts = dat, samples = meta)
isexpr <-rowSums(cpm(dge) >= 1) >= 1
print(table(isexpr))
dge <- dge[isexpr, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)
expr <- voom(dge, plot = F)

# DE (include all other comparison groups)
design_full <- model.matrix(~ 0 + Sex_Geno_Treatment + RIN, data = expr\$targets)
colnames(design_full) <- gsub("Sex_Geno_Treatment", "", colnames(design_full))
contrast_full <- makeContrasts(
KD.vs.Ctrl.in.E4.Female = F_E4_KD_12 - F_E4_Ctrl_12,
KD.vs.Ctrl.in.E3.Female = F_E3_KD_12 - F_E3_Ctrl_12,
OE.vs.Ctrl.in.E4.Female = F_E4_OE_34 - F_E4_Ctrl_34,
OE.vs.Ctrl.in.E3.Female = F_E3_OE_34 - F_E3_Ctrl_34,
levels = design_full)

fit1_full <- lmFit(object = expr, design = design_full)
fit2_full <- contrasts.fit(fit1_full, contrast_full)
fit2_full <- eBayes(fit2_full)
deg_full_model <- topTable(fit2_full, coef = "KD.vs.Ctrl.in.E3.Female", n = Inf, sort.by = "p", p.value = 0.05, lfc = log2(1.2))
print(nrow(deg_full_model))

# DE (include only the current comparison group)
exprls_sub <- expr[, expr\$targets\$Sex_Geno_Treatment %in% c("F_E3_KD_12", "F_E3_Ctrl_12")]
design_part <- model.matrix(~ 0 + Sex_Geno_Treatment + RIN, data = exprls_sub\$targets)
colnames(design_part) <- c("F_E3_Ctrl_12", "F_E3_KD_12", "RIN")
contrast_part <- makeContrasts(contrasts = "F_E3_KD_12-F_E3_Ctrl_12", levels = design_part)
fit1_part <- lmFit(object = exprls_sub, design = design_part)
fit2_part <- contrasts.fit(fit1_part, contrast_part)
fit2_part <- eBayes(fit2_part)

deg_part_model <- topTable(fit2_part, n = Inf, sort.by = "p", p.value = 0.05, lfc = log2(1.2))
print(nrow(deg_part_model))
``````

Why do you feel that you need to adjust for `sex`, too, and `RIN`?

You should also be filtering by adjusted p-values, not nominal p-values (Edit: according to current default, not specifying an adjustment method will mean that BH is chosen). Your threshold for Log [base 2] fold-change is also very low.

Irrespective of everything, 3 people in this thread have given the general consensus that approach #1 is better, except for this situation:

In general, though, your approach #1 should be used, unless your sub-groups are so completely different in their transcriptomic profiles such that analysing them in the same dataset introduces bias (e.g., analysing neurons and melanocytes together).

grayapply2009, if you want a definitive answer, I encourage you to post this question on Bioconductor Support (where I am also Moderator). There, the EdgeR developer is more likely to respond, and it feels like only hearing from him will close this issue off for you. Please link back to this thread when you post there. Thanks!

Thank you, Kevin. I just don't understand why subsetting the data could cause such a huge difference. I think I'll just have to read more about their algorithm.

By the way, p.value = 0.05 in the topTable represents adjusted p < 0.05. FC > 1.2 is the conventional threshold our lab's been using. I didn't adjusted sex because I only compared within females. I adjusted RIN because RIN is significantly correlated with most of the variables.

Why dont you create a single R script that you feed one file, compare the results using the complete set and another file containing the subset.

You can use the same code eg:

`design <- model.matrix(~ 0 + group, data)`

`contrast <- makeContrasts(A.vs.D = A-D)`

That would provide you some confidence that your code is the same while the only variable changing is your input file and then you can determine what the DEG while using the code.

That IS what I did. Please take a look at my data and the single script I provided in this post. The link is in the last sentence of the post.

I agree on the points Kevin is making. What are the DEG after using "adjusted p-values" for both sets ?

1
h.mon31k wrote:

limma was developed for microarray data. One can use limma for for RNAseq analysis, using the voom transformation, are you using it?

I am almost certain I have seen the same question discussed in depth before, either here at BioStars or at the BioConductor support forum. However, I can't find any relevant post, so I will give a superficial explanation: limma uses an empirical Bayes method to squeeze the residual variances for each gene towards a global trend, and, in doing so, also increases the degrees of freedom for the individual variances. This means the presence of the additional samples will modify these estimates, thus will affect the tests results.

Thank you for your reply. Yes, the data were voom normalized. Which method do you recommend? I've been using method 2 all the time and it worked perfectly fine for me. The huge difference between these two methods surprised me. By the way, even logFC was slightly different between the 2 methods. Any ideas why this happened?