Question: DE analysis with limma (What to include in the model?)
0
gravatar for grayapply2009
7 weeks ago by
grayapply2009220
United States
grayapply2009220 wrote:

In my RNAseq data, I have 6 groups (A, B, C, D, E, F). I'd like to compare between A and D. I did DE analysis with limma. I found including all groups in the model and including only the 2 comparison groups gave me significantly different numbers of DEGs (P < 0.05). For example, the first method below detected 6000 DEGs while the 2nd method only detected 900 DEGs. Why is the huge difference between the two methods and which one should I use?

Here are 2 ways to do the DE analysis:

1. Include all the groups and comparisons in the design matrix and contrast matrix:

design <- model.matrix(~ 0 + group, data)

contrast <- makeContrasts(A.vs.B = A-B, A.vs.C = A-C, A.vs.D = A-D, A.vs.E = A-E, A.vs.F = A-F, B.vs.C = B-C, B.vs.D = B-D, B.vs.E = B-E, B.vs.F = B-F, C.vs.D = C-D,  C.vs.E = C-E,  C.vs.F = C-F,  D.vs.E = D-E,  D.vs.F = D-F,  E.vs.F = E-F)

fit1 <- lmFit(object = data, design = design)
fit2 <- contrasts.fit(fit1, contrast)
fit2 <- eBayes(fit2)
topTable(fit2, coef = A.vs.D)

2. Subset group A and D, and focus only on these 2 groups:

design <- model.matrix(~ 0 + group_sub, data_sub)

contrast <- makeContrasts(A.vs.D = A-D)

I wrote some code specifically to compare the above 2 methods. Here are the code and data.

ADD COMMENTlink modified 6 weeks ago • written 7 weeks ago by grayapply2009220

Thats hard to believe that by subsetting the data by columns to just the samples you are testing will differ in DEG from 6000 vs 900 is OK with folks here. My guess is 6000 DEG is the correct one and you messed up somewhere on the second Limma run.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Davide10

Davide, we are working on minimal information from the user.

ADD REPLYlink written 6 weeks ago by Kevin Blighe65k

Thank you, Davide and Kevin. I included my script and data in this post. Please take a look at them. I really appreciate it if you can help me.

ADD REPLYlink written 6 weeks ago by grayapply2009220
1

Thanks, but, there is much code missing from your second 'subgroup' analysis. Also, it would help to show output results tables so that we could see that information. Anything else that you could output would help, too, for example, even something simple like str(fit2) from both #1 and #2.

In general, though, your approach #1 should be used, unless your sub-groups are so completely different in their transcriptomic profiles such that analysing them in the same dataset introduces bias (e.g., analysing neurons and melanocytes together).

ADD REPLYlink written 6 weeks ago by Kevin Blighe65k

Hi Kevin, I literally included all my data and code in the post. Please take a look at the last sentence in the above post. There is a link to my data and code. Sorry for not making it clear.

ADD REPLYlink written 6 weeks ago by grayapply2009220

Thanks, here is your full script:

library(edgeR)
library(limma)
library(Biobase)


#data norm
dat <- read.delim('data.tsv', as.is = TRUE)
rownames(dat) <- dat[, 1]
#dat <- dat[, colnames(dat) != 'S24']
dat <- as.matrix(dat[, -1])
meta <- read.delim('meta.csv', as.is = TRUE, sep = ",")
rownames(meta) <- meta$SampleID_2
meta <- meta[colnames(dat), ]
dge <- DGEList(counts = dat, samples = meta)
isexpr <-rowSums(cpm(dge) >= 1) >= 1
print(table(isexpr))
dge <- dge[isexpr, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)
expr <- voom(dge, plot = F)


# DE (include all other comparison groups)
design_full <- model.matrix(~ 0 + Sex_Geno_Treatment + RIN, data = expr$targets)
colnames(design_full) <- gsub("Sex_Geno_Treatment", "", colnames(design_full))
contrast_full <- makeContrasts(
        KD.vs.Ctrl.in.E4.Female = F_E4_KD_12 - F_E4_Ctrl_12,
        KD.vs.Ctrl.in.E3.Female = F_E3_KD_12 - F_E3_Ctrl_12,
        OE.vs.Ctrl.in.E4.Female = F_E4_OE_34 - F_E4_Ctrl_34,
        OE.vs.Ctrl.in.E3.Female = F_E3_OE_34 - F_E3_Ctrl_34,
        levels = design_full)

fit1_full <- lmFit(object = expr, design = design_full)
fit2_full <- contrasts.fit(fit1_full, contrast_full)
fit2_full <- eBayes(fit2_full)
deg_full_model <- topTable(fit2_full, coef = "KD.vs.Ctrl.in.E3.Female", n = Inf, sort.by = "p", p.value = 0.05, lfc = log2(1.2))
print(nrow(deg_full_model))



# DE (include only the current comparison group)
exprls_sub <- expr[, expr$targets$Sex_Geno_Treatment %in% c("F_E3_KD_12", "F_E3_Ctrl_12")]
design_part <- model.matrix(~ 0 + Sex_Geno_Treatment + RIN, data = exprls_sub$targets)
colnames(design_part) <- c("F_E3_Ctrl_12", "F_E3_KD_12", "RIN")
contrast_part <- makeContrasts(contrasts = "F_E3_KD_12-F_E3_Ctrl_12", levels = design_part) 
fit1_part <- lmFit(object = exprls_sub, design = design_part)
fit2_part <- contrasts.fit(fit1_part, contrast_part)
fit2_part <- eBayes(fit2_part)

deg_part_model <- topTable(fit2_part, n = Inf, sort.by = "p", p.value = 0.05, lfc = log2(1.2))
print(nrow(deg_part_model))

Now I know more about your data and workflow.

Why do you feel that you need to adjust for sex, too, and RIN?

You should also be filtering by adjusted p-values, not nominal p-values (Edit: according to current default, not specifying an adjustment method will mean that BH is chosen). Your threshold for Log [base 2] fold-change is also very low.

Irrespective of everything, 3 people in this thread have given the general consensus that approach #1 is better, except for this situation:

In general, though, your approach #1 should be used, unless your sub-groups are so completely different in their transcriptomic profiles such that analysing them in the same dataset introduces bias (e.g., analysing neurons and melanocytes together).

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kevin Blighe65k

grayapply2009, if you want a definitive answer, I encourage you to post this question on Bioconductor Support (where I am also Moderator). There, the EdgeR developer is more likely to respond, and it feels like only hearing from him will close this issue off for you. Please link back to this thread when you post there. Thanks!

ADD REPLYlink written 6 weeks ago by Kevin Blighe65k

Thank you, Kevin. I just don't understand why subsetting the data could cause such a huge difference. I think I'll just have to read more about their algorithm.

By the way, p.value = 0.05 in the topTable represents adjusted p < 0.05. FC > 1.2 is the conventional threshold our lab's been using. I didn't adjusted sex because I only compared within females. I adjusted RIN because RIN is significantly correlated with most of the variables.

ADD REPLYlink written 6 weeks ago by grayapply2009220

Why dont you create a single R script that you feed one file, compare the results using the complete set and another file containing the subset.

You can use the same code eg:

design <- model.matrix(~ 0 + group, data)

contrast <- makeContrasts(A.vs.D = A-D)

That would provide you some confidence that your code is the same while the only variable changing is your input file and then you can determine what the DEG while using the code.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Davide10

That IS what I did. Please take a look at my data and the single script I provided in this post. The link is in the last sentence of the post.

ADD REPLYlink written 6 weeks ago by grayapply2009220

I agree on the points Kevin is making. What are the DEG after using "adjusted p-values" for both sets ?

ADD REPLYlink written 6 weeks ago by Davide10
1
gravatar for h.mon
7 weeks ago by
h.mon31k
Brazil
h.mon31k wrote:

limma was developed for microarray data. One can use limma for for RNAseq analysis, using the voom transformation, are you using it?

I am almost certain I have seen the same question discussed in depth before, either here at BioStars or at the BioConductor support forum. However, I can't find any relevant post, so I will give a superficial explanation: limma uses an empirical Bayes method to squeeze the residual variances for each gene towards a global trend, and, in doing so, also increases the degrees of freedom for the individual variances. This means the presence of the additional samples will modify these estimates, thus will affect the tests results.

ADD COMMENTlink written 7 weeks ago by h.mon31k

Thank you for your reply. Yes, the data were voom normalized. Which method do you recommend? I've been using method 2 all the time and it worked perfectly fine for me. The huge difference between these two methods surprised me. By the way, even logFC was slightly different between the 2 methods. Any ideas why this happened?

ADD REPLYlink written 7 weeks ago by grayapply2009220

You're subsetting the entire dataset in the second approach, so, parameter estimations will differ. You should use approach #1, in my opinion.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Kevin Blighe65k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1640 users visited in the last hour