Question: DESeq2 multiple group comparison
1
gravatar for dazhudou1122
10 months ago by
dazhudou112250
dazhudou112250 wrote:

Dear Biostars community:

I ran into a question when I was doing RNA-seq analysis using DeSEQ2. I have 4 groups, 3 samples per group, and set my groups using the following code:

(condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3))))

Can anyone tell me how the comparison in DeSEQ2 is done? I am not sure what the p value and log2FC mean in the default output. Supposedly it should be outputting ctl vs A in the default output. But the data is different when I have only ctl and A as input (2 groups). Also, is there a way to specify how the program performs the comparison? I read some previous threads but the answers were not very clear. Any help will be highly appreciated!

Best,

Wenhan

Below is the code I used:

countdata <- read.table("B6_B6MHV_B6MHVWY.txt", header=TRUE, row.names=1)
countdata <- countdata[ ,6:ncol(countdata)]
colnames(countdata) <- gsub("\\.[sb]am$", "", colnames(countdata))
countdata <- as.matrix(countdata)
head(countdata)
(condition <- factor(c(rep("ctl", 3), rep("A", 3), rep("B", 3), rep("C", 3))))
library(DESeq2)
(coldata <- data.frame(row.names=colnames(countdata), condition))
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds
dds <- DESeq(dds)
# Plot
dispersions
png("qc-dispersions.png", 2000, 2000, pointsize=20)
plotDispEsts(dds, main="Dispersion plot")
dev.off()

# Regularized log transformation for clustering/heatmaps, etc
rld <- rlogTransformation(dds)
head(assay(rld))
hist(assay(rld))

# Colors for plots below
## Ugly:
## (mycols <- 1:length(unique(condition)))
## Use RColorBrewer, better
library(RColorBrewer)
(mycols <- brewer.pal(8, "Dark2")[1:length(unique(condition))])

# Sample distance heatmap
sampleDists <- as.matrix(dist(t(assay(rld))))
library(gplots)
png("qc-heatmap-samples.png", w=1500, h=2500, pointsize=1500)
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          ColSideColors=mycols[condition], RowSideColors=mycols[condition],
          margin=c(10, 10), main="Sample Distance Matrix")
dev.off()

# Principal components analysis
## Could do with built-in DESeq2 function:
## DESeq2::plotPCA(rld, intgroup="condition")
## I like mine better:
rld_pca <- function (rld, intgroup = "condition", ntop = 500, colors=NULL, legendpos="bottomleft", main="PCA Biplot", textcx=1, ...) {
  require(genefilter)
  require(calibrate)
  require(RColorBrewer)
  rv = rowVars(assay(rld))
  select = order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  pca = prcomp(t(assay(rld)[select, ]))
  fac = factor(apply(as.data.frame(colData(rld)[, intgroup, drop = FALSE]), 1, paste, collapse = " : "))
  if (is.null(colors)) {
    if (nlevels(fac) >= 3) {
      colors = brewer.pal(nlevels(fac), "Paired")
    }   else {
      colors = c("black", "red")
    }
  }
  pc1var <- round(summary(pca)$importance[2,1]*100, digits=1)
  pc2var <- round(summary(pca)$importance[2,2]*100, digits=1)
  pc1lab <- paste0("PC1 (",as.character(pc1var),"%)")
  pc2lab <- paste0("PC1 (",as.character(pc2var),"%)")
  plot(PC2~PC1, data=as.data.frame(pca$x), bg=colors[fac], pch=21, xlab=pc1lab, ylab=pc2lab, main=main, ...)
  with(as.data.frame(pca$x), textxy(PC1, PC2, labs=rownames(as.data.frame(pca$x)), cex=textcx))
  legend(legendpos, legend=levels(fac), col=colors, pch=20)
  #     rldyplot(PC2 ~ PC1, groups = fac, data = as.data.frame(pca$rld),
  #            pch = 16, cerld = 2, aspect = "iso", col = colours, main = draw.key(key = list(rect = list(col = colours),
  #                                                                                         terldt = list(levels(fac)), rep = FALSE)))
}
png("qc-pca.png", 1500, 1500, pointsize=25)
rld_pca(rld, colors=mycols, intgroup="condition", xlim=c(-75, 35))
dev.off()

# Get differential expression results
res <- results(dds)
table(res$padj<0.05)
## Order by adjusted p-value
res <- res[order(res$padj), ]
## Merge with normalized count data
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
head(resdata)
## Write results
write.csv(resdata, file="diffexpr-results.csv")
rna-seq deseq2 • 2.7k views
ADD COMMENTlink modified 10 months ago by Kevin Blighe51k • written 10 months ago by dazhudou112250
10
gravatar for Kevin Blighe
10 months ago by
Kevin Blighe51k
Kevin Blighe51k wrote:

Your problem may be in setting the reference level for your condition factor.

Here is a reproducible example for anyone to follow:

# create random data of 12 samples and 4800 genes
x <- round(matrix(rexp(480 * 10, rate=.1), ncol=12), 0)
rownames(x) <- paste("gene", 1:nrow(x))
colnames(x) <- paste("sample", 1:ncol(x))

# fake metadata
coldata <- data.frame(
  condition = factor(c(
    rep("ctl", 3),
    rep("A", 3),
    rep("B", 3),
    rep("C", 3))))

# set `ctl` as reference level
coldata$condition <- relevel(coldata$condition, ref = "ctl")

# run DESeq2
library(DESeq2)
dds <- DESeqDataSetFromMatrix(
  countData = x,
  colData = coldata,
  design= ~ condition)
dds <- DESeq(dds, betaPrior = FALSE)

resultsNames(dds)
[1] "Intercept"          "condition_A_vs_ctl" "condition_B_vs_ctl"
[4] "condition_C_vs_ctl"

# generate results table for A vs ctl
res <- results(dds, name="condition_A_vs_ctl")

# same as above but with lfc shrinkage
res <- lfcShrink(dds, coef="condition_A_vs_ctl", type="apeglm")
res
log2 fold change (MAP): condition A vs ctl 
Wald test p-value: condition A vs ctl 
DataFrame with 400 rows and 5 columns
                 baseMean      log2FoldChange             lfcSE
                <numeric>           <numeric>         <numeric>
gene 1   6.52931993577912   0.072476893785378 0.357142842617753
gene 2   5.76235176928007 -0.0933872559160501 0.363373514145294
gene 3   6.50856065196759 -0.0204496319490765 0.345984423938138
gene 4   8.22529367562065  -0.214564268854677 0.443132280960639
gene 5   5.79513765732215  -0.124262442880464 0.378075429134928
...                   ...                 ...               ...
gene 396 6.31562199001634   0.216432088296947 0.433285297905755
gene 397 7.06387542770804 -0.0514375974511503 0.347860712050382
gene 398 11.7113061577201 0.00284345055606203 0.345680871683025
gene 399 6.49584500337424 -0.0476168389832278 0.345546422494701
gene 400  9.9964224972449  -0.163803486191863 0.398777125778805
                     pvalue              padj
                  <numeric>         <numeric>
gene 1    0.483733657727805 0.868376605216856
gene 2     0.37470748747808 0.833726100572576
gene 3    0.849966094025773 0.980516244268461
gene 4   0.0516046628774167 0.562928133989655
gene 5    0.228734151969893 0.786123281264348
...                     ...               ...
gene 396 0.0892978383641832 0.562928133989655
gene 397  0.649120107927275 0.943542041143732
gene 398  0.983171679048765 0.997357437589209
gene 399  0.684227860534656 0.953646890096339
gene 400  0.139850562852436 0.647215395526392

You can also compare other levels, e.g., A vs C

results(dds, c("condition", "A", "C"))

See the DESeq2 vignette for all other possible functionality.

Kevin

ADD COMMENTlink modified 7 weeks ago • written 10 months ago by Kevin Blighe51k

Hello Kevin, I am using your example it is working well. I have few question from my data: 1. How can I remove batch effect from my data with DESEQ2. 2. I know the VST function available in DESEQ2 for remove batch effect, but after this step how can we do DE analysis from the adjusted data. Hope you can understand my problem.

ADD REPLYlink written 7 months ago by anshulmbi0

Hey, there is a function called removeBatchEffects (limma), which may be of use to you. There is a practical example given here with DESeq2: https://support.bioconductor.org/p/76099/#93176

It is neither VST nor rlog that are removing batch, technically speaking.

ADD REPLYlink written 7 months ago by Kevin Blighe51k

Sorry Kevin I am confused with data I have, can you please have a look

I have gene expression raw counts like this

          A1 B1   C1     D1  E1  F1
A2M     511 1623 665    208 553 469
AADAT   34  137 372     7   52  124
ABCB1   119 114 123    22   25  89
ABCB11  27  186 200    27   30  49
ABCC2   1   10  21     1    8   3

For each patients I also have some characteristics

> head(meta)
    a b c d d
A1  3     1  3    0          2
B1  2     1  0    0          1
C1  2     1  0    0          2
D1  0     0  0    0          2
E1  1     0  0    1          3
F1  0     0  0    0          1
>

I want differentially expressed genes, but I can not figure out how to let all of these meta data in my DEseq2 design For example differential expression which takes all of these variables into calculation

Can you help me please?

Thank you

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by A3.6k
1

The DESeq2 design formula behaves much the same as standard regression formulae. So, if you want to include these, you just need to add them like this:

~ IC + PD.L1 + TC + TILs + TILs.group

That assumes that the effects of these are 'additive'. It is possible to have interactions and multiplicative effects, too, but these are more complex. If you are unsure, then it is better to keep the design formula very simple.

You may want to consider, for example, the fact that some of these columns are describing virtually the same trait / characteristic, in which case their usage in a model must be considered.

ADD REPLYlink written 4 weeks ago by Kevin Blighe51k

Thank you

I done so

> head(meta)
   IC PD.L1 TC TILs TILs.group
A1  3     1  3    0          2
B1  2     1  0    0          1
C1  2     1  0    0          2
D1  0     0  0    0          2
E1  1     0  0    1          3
F1  0     0  0    0          1
> 

> IC=as.factor(c(rep ("0",3), rep("1", 1),rep("2", 2),rep("3", 2)))
> PD.L1=as.factor(c(rep ("High",4), rep("Low", 4)))
> TC=as.factor(c(rep ("0",6), rep("1", 1),rep("3", 1)))
> TILs=as.factor(c(rep ("0",6), rep("1", 2)))
> TILs.group=as.factor(c(rep ("1",3), rep("2", 3),rep("3", 2)))
> dds=DESeqDataSetFromMatrix(countData = data, colData = meta, design        =       ~ IC + PD.L1 + TC + TILs + TILs.group)
  the design formula contains a numeric variable with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function
> 
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates
fitting model and testing
> dds$IC
[1] 3 2 2 0 1 0 3 0
> res <-results (dds, contrast=c("IC", "0", "1","2","3"))
Error in checkContrast(contrast, resNames) : 
  'contrast', as a character vector of length 3, should have the form:
contrast = c('factorName','numeratorLevel','denominatorLevel'),
see the manual page of ?results for more information
> 
> res <-results (dds, contrast=c("IC", "0", "1"))
Error in if (!expanded & (hasIntercept | noInterceptPullCoef)) { : 
  argument is of length zero
> 
>
ADD REPLYlink written 4 weeks ago by A3.6k
2

Sorry, please take some time to debug this on your own. I have no time to take you through your analysis step-by-step. If you have a supervisor, please contact her / him / it

ADD REPLYlink written 4 weeks ago by Kevin Blighe51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2156 users visited in the last hour