DESeq2 multiple group comparison
1
5
Entering edit mode
2.7 years ago
dazhudou1122 ▴ 110

Dear Biostars community:

I ran into a question when I was doing RNA-seq analysis using DeSEQ2. I have 4 groups, 3 samples per group, and set my groups using the following code:

(condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3))))


Can anyone tell me how the comparison in DeSEQ2 is done? I am not sure what the p value and log2FC mean in the default output. Supposedly it should be outputting ctl vs A in the default output. But the data is different when I have only ctl and A as input (2 groups). Also, is there a way to specify how the program performs the comparison? I read some previous threads but the answers were not very clear. Any help will be highly appreciated!

Best,

Wenhan

Below is the code I used:

countdata <- read.table("B6_B6MHV_B6MHVWY.txt", header=TRUE, row.names=1)
countdata <- countdata[ ,6:ncol(countdata)]
colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata)) countdata <- as.matrix(countdata) head(countdata) (condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3)))) library(DESeq2) (coldata <- data.frame(row.names=colnames(countdata), condition)) dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition) dds dds <- DESeq(dds) # Plot dispersions png("qc-dispersions.png", 2000, 2000, pointsize=20) plotDispEsts(dds, main="Dispersion plot") dev.off() # Regularized log transformation for clustering/heatmaps, etc rld <- rlogTransformation(dds) head(assay(rld)) hist(assay(rld)) # Colors for plots below ## Ugly: ## (mycols <- 1:length(unique(condition))) ## Use RColorBrewer, better library(RColorBrewer) (mycols <- brewer.pal(8, "Dark2")[1:length(unique(condition))]) # Sample distance heatmap sampleDists <- as.matrix(dist(t(assay(rld)))) library(gplots) png("qc-heatmap-samples.png", w=1500, h=2500, pointsize=1500) heatmap.2(as.matrix(sampleDists), key=F, trace="none", col=colorpanel(100, "black", "white"), ColSideColors=mycols[condition], RowSideColors=mycols[condition], margin=c(10, 10), main="Sample Distance Matrix") dev.off() # Principal components analysis ## Could do with built-in DESeq2 function: ## DESeq2::plotPCA(rld, intgroup="condition") ## I like mine better: rld_pca <- function (rld, intgroup = "condition", ntop = 500, colors=NULL, legendpos="bottomleft", main="PCA Biplot", textcx=1, ...) { require(genefilter) require(calibrate) require(RColorBrewer) rv = rowVars(assay(rld)) select = order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))] pca = prcomp(t(assay(rld)[select, ])) fac = factor(apply(as.data.frame(colData(rld)[, intgroup, drop = FALSE]), 1, paste, collapse = " : ")) if (is.null(colors)) { if (nlevels(fac) >= 3) { colors = brewer.pal(nlevels(fac), "Paired") } else { colors = c("black", "red") } } pc1var <- round(summary(pca)$importance[2,1]*100, digits=1)
pc2var <- round(summary(pca)$importance[2,2]*100, digits=1) pc1lab <- paste0("PC1 (",as.character(pc1var),"%)") pc2lab <- paste0("PC1 (",as.character(pc2var),"%)") plot(PC2~PC1, data=as.data.frame(pca$x), bg=colors[fac], pch=21, xlab=pc1lab, ylab=pc2lab, main=main, ...)
with(as.data.frame(pca$x), textxy(PC1, PC2, labs=rownames(as.data.frame(pca$x)), cex=textcx))
legend(legendpos, legend=levels(fac), col=colors, pch=20)
#     rldyplot(PC2 ~ PC1, groups = fac, data = as.data.frame(pca$rld), # pch = 16, cerld = 2, aspect = "iso", col = colours, main = draw.key(key = list(rect = list(col = colours), # terldt = list(levels(fac)), rep = FALSE))) } png("qc-pca.png", 1500, 1500, pointsize=25) rld_pca(rld, colors=mycols, intgroup="condition", xlim=c(-75, 35)) dev.off() # Get differential expression results res <- results(dds) table(res$padj<0.05)
res <- res[order(res$padj), ] ## Merge with normalized count data resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE) names(resdata)[1] <- "Gene" head(resdata) ## Write results write.csv(resdata, file="diffexpr-results.csv")  RNA-Seq Deseq2 • 14k views ADD COMMENT 18 Entering edit mode 2.7 years ago Your problem may be in setting the reference level for your condition factor. Here is a reproducible example for anyone to follow: # create random data of 12 samples and 4800 genes x <- round(matrix(rexp(480 * 10, rate=.1), ncol=12), 0) rownames(x) <- paste("gene", 1:nrow(x)) colnames(x) <- paste("sample", 1:ncol(x)) # fake metadata coldata <- data.frame( condition = factor(c( rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3)))) # set ctl as reference level coldata$condition <- relevel(coldata$condition, ref = "ctl") # run DESeq2 library(DESeq2) dds <- DESeqDataSetFromMatrix( countData = x, colData = coldata, design= ~ condition) dds <- DESeq(dds, betaPrior = FALSE) resultsNames(dds) [1] "Intercept" "condition_A_vs_ctl" "condition_B_vs_ctl" [4] "condition_C_vs_ctl" # generate results table for A vs ctl res <- results(dds, name="condition_A_vs_ctl") # same as above but with lfc shrinkage res <- lfcShrink(dds, coef = 'condition_A_vs_ctl', type = 'apeglm', res = res) res log2 fold change (MAP): condition A vs ctl Wald test p-value: condition A vs ctl DataFrame with 400 rows and 5 columns baseMean log2FoldChange lfcSE <numeric> <numeric> <numeric> gene 1 6.52931993577912 0.072476893785378 0.357142842617753 gene 2 5.76235176928007 -0.0933872559160501 0.363373514145294 gene 3 6.50856065196759 -0.0204496319490765 0.345984423938138 gene 4 8.22529367562065 -0.214564268854677 0.443132280960639 gene 5 5.79513765732215 -0.124262442880464 0.378075429134928 ... ... ... ... gene 396 6.31562199001634 0.216432088296947 0.433285297905755 gene 397 7.06387542770804 -0.0514375974511503 0.347860712050382 gene 398 11.7113061577201 0.00284345055606203 0.345680871683025 gene 399 6.49584500337424 -0.0476168389832278 0.345546422494701 gene 400 9.9964224972449 -0.163803486191863 0.398777125778805 pvalue padj <numeric> <numeric> gene 1 0.483733657727805 0.868376605216856 gene 2 0.37470748747808 0.833726100572576 gene 3 0.849966094025773 0.980516244268461 gene 4 0.0516046628774167 0.562928133989655 gene 5 0.228734151969893 0.786123281264348 ... ... ... gene 396 0.0892978383641832 0.562928133989655 gene 397 0.649120107927275 0.943542041143732 gene 398 0.983171679048765 0.997357437589209 gene 399 0.684227860534656 0.953646890096339 gene 400 0.139850562852436 0.647215395526392  You can also compare other levels, e.g., A vs C results(dds, c("condition", "A", "C"))  See the DESeq2 vignette for all other possible functionality. Kevin ADD COMMENT 0 Entering edit mode Hello Kevin, I am using your example it is working well. I have few question from my data: 1. How can I remove batch effect from my data with DESEQ2. 2. I know the VST function available in DESEQ2 for remove batch effect, but after this step how can we do DE analysis from the adjusted data. Hope you can understand my problem. ADD REPLY 0 Entering edit mode Hey, there is a function called removeBatchEffects (limma), which may be of use to you. There is a practical example given here with DESeq2: https://support.bioconductor.org/p/76099/#93176 It is neither VST nor rlog that are removing batch, technically speaking. ADD REPLY 0 Entering edit mode Sorry Kevin I am confused with data I have, can you please have a look I have gene expression raw counts like this  A1 B1 C1 D1 E1 F1 A2M 511 1623 665 208 553 469 AADAT 34 137 372 7 52 124 ABCB1 119 114 123 22 25 89 ABCB11 27 186 200 27 30 49 ABCC2 1 10 21 1 8 3  For each patients I also have some characteristics > head(meta) a b c d d A1 3 1 3 0 2 B1 2 1 0 0 1 C1 2 1 0 0 2 D1 0 0 0 0 2 E1 1 0 0 1 3 F1 0 0 0 0 1 >  I want differentially expressed genes, but I can not figure out how to let all of these meta data in my DEseq2 design For example differential expression which takes all of these variables into calculation Can you help me please? Thank you ADD REPLY 1 Entering edit mode The DESeq2 design formula behaves much the same as standard regression formulae. So, if you want to include these, you just need to add them like this: ~ IC + PD.L1 + TC + TILs + TILs.group  That assumes that the effects of these are 'additive'. It is possible to have interactions and multiplicative effects, too, but these are more complex. If you are unsure, then it is better to keep the design formula very simple. You may want to consider, for example, the fact that some of these columns are describing virtually the same trait / characteristic, in which case their usage in a model must be considered. ADD REPLY 0 Entering edit mode Thank you I done so > head(meta) IC PD.L1 TC TILs TILs.group A1 3 1 3 0 2 B1 2 1 0 0 1 C1 2 1 0 0 2 D1 0 0 0 0 2 E1 1 0 0 1 3 F1 0 0 0 0 1 > > IC=as.factor(c(rep ("0",3), rep("1", 1),rep("2", 2),rep("3", 2))) > PD.L1=as.factor(c(rep ("High",4), rep("Low", 4))) > TC=as.factor(c(rep ("0",6), rep("1", 1),rep("3", 1))) > TILs=as.factor(c(rep ("0",6), rep("1", 2))) > TILs.group=as.factor(c(rep ("1",3), rep("2", 3),rep("3", 2))) > dds=DESeqDataSetFromMatrix(countData = data, colData = meta, design = ~ IC + PD.L1 + TC + TILs + TILs.group) the design formula contains a numeric variable with integer values, specifying a model with increasing fold change for higher values. did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function > > dds <- DESeq(dds) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship -- note: fitType='parametric', but the dispersion trend was not well captured by the function: y = a/x + b, and a local regression fit was automatically substituted. specify fitType='local' or 'mean' to avoid this message next time. final dispersion estimates fitting model and testing > dds$IC
[1] 3 2 2 0 1 0 3 0
> res <-results (dds, contrast=c("IC", "0", "1","2","3"))
Error in checkContrast(contrast, resNames) :
'contrast', as a character vector of length 3, should have the form:
contrast = c('factorName','numeratorLevel','denominatorLevel'),
>
> res <-results (dds, contrast=c("IC", "0", "1"))
Error in if (!expanded & (hasIntercept | noInterceptPullCoef)) { :
argument is of length zero
>
>

2
Entering edit mode

Sorry, please take some time to debug this on your own. I have no time to take you through your analysis step-by-step. If you have a supervisor, please contact her / him / it

0
Entering edit mode

Thank you, Kevin, but i noticed that

You can also compare other levels, e.g., A vs C

results(dds, c("condition", "A", "C"))


If compare "A" vs "C", lfcShrink( ) with type = "apeglm" will next be impossible to do due to the lack of corresponding coef parameter. Does this mean I need to regenerate a new dds with "C" as a reference?

1
Entering edit mode

Hey, in that case, you can go back and re-level the factors such that you can obtain the required coefficient, as to which I allude here: A: How to change conditions in DESeq2

It is also now mentioned briefly in the vignette:

Although apeglm cannot be used with contrast, we note that many designs can be easily rearranged such that what was a contrast becomes its own coefficient

0
Entering edit mode

I have same problem. i have 3 sample , one with WT and 2 others are treatment1 and treatment2. It is confusing for me when I try either approach and e.g. for WT vs treatment1 contrast i am getting deseq2 result diiferent ,

-dds2<- DESeq(dds, betaPrior = FALSE) -resultsNames(dds2) -res <- results(dds2, name="condition_treatment1_vs_WT", alpha = 0.05) -summary(res)

-out of 26151 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 2463, 9.4% LFC < 0 (down) : 2982, 11% outliers [1] : 2, 0.0076% low counts [2] : 5070, 19% (mean count < 5) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

-compared with running it alone through pipeline

-des <- DESeq(dds) -res <- results(des, alpha = 0.05) -summary(res)

-out of 24475 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 2657, 11% LFC < 0 (down) : 3140, 13% outliers [1] : 0, 0% low counts [2] : 3322, 14% (mean count < 5) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results

Can anyone explain why and which approach is right i.e. DESeq 1 vs 1 comparison or multiple group comparison?