Hi all,
I have some difficulty understanding my results from DE analysis. Briefly, I have 2 groups (Dneg & Dpos) with 2 replicates each (total 4 samples). I performed RNA-seq and then used kallisto
for transcript quantification. I imported the data using tximport
with the scaledTPM
option to get gene-level counts.
For DE, I used edgeR
to perform the analysis. Upon exploration of the data using MDS plot, I found that the 2 replicates in Dpos are very dissimilar to each other. However, when I plot a PCA plot, I got the inverse which shows that my 2 Dneg replicates are dissimilar. I was afraid I did some mistake so I did DE analysis using DESeq2
and I got the same results.
From the looks of the plots, I'm sure 1 of my samples is an outlier or problematic, but I don't know whether it is the one from the Dneg group or the one from the Dpos group. The code I used for my analysis is as below:
(1) Read in counts data
load("~/txi.kallisto.Rdata")
head(txi.kallisto$counts, n = 5)
Dneg_1 Dneg_9 Dpos_10 Dpos_2
ENSG00000000003 497.45843 579.17559 563.3364 571.37096
ENSG00000000005 0.00000 0.00000 0.0000 0.00000
ENSG00000000419 1451.35256 1764.77228 1503.6384 1427.12914
ENSG00000000457 69.25668 66.55058 74.7108 78.89496
ENSG00000000460 244.89625 263.26950 182.9352 200.74267
(2) edgeR analysis
# Prepare DGEList object
cts <- txi.kallisto$counts
group <- c("Dneg", "Dneg", "Dpos", "Dpos")
df.edgeR <- DGEList(cts, group = group)
symbol <- mapIds(org.Hs.eg.db, keys = rownames(df.edgeR), keytype = "ENSEMBL", column = "SYMBOL")
df.edgeR$genes <- data.frame(geneSYM = symbol)
# Filter out lowly expressed genes
keep <- filterByExpr(df.edgeR, group = group)
df.edgeR <- df.edgeR[keep, , keep.lib.sizes=FALSE]
# Normalization and estimate dispersions
df.edgeR <- calcNormFactors(df.edgeR)
# Plot MDS
groupPlot <- factor(c("Dneg", "Dneg", "Dpos", "Dpos"))
points <- c(15,16)
colors <- c("blue", "red")
plotMDS(df.edgeR, col = colors[groupPlot], pch = points[groupPlot])
legend("bottomright", legend = levels(groupPlot), pch = points, col = colors)
# Plot PCA
edgeR.DDS <- DESeqDataSetFromMatrix(countData = round(df.edgeR$counts), colData = df.edgeR$samples, design = ~group)
transform.edgeR.DDS <- rlog(edgeR.DDS, blind = TRUE)
pcaData <- plotPCA(transform.edgeR.DDS, intgroup = c("group"), returnData = TRUE, ntop = 1000)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = group)) +
geom_point(size = 5) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed() +
scale_color_manual(values = c("blue", "red"))
(3) DESeq2 analysis
# Prepare table of sample ID and experimental condition
sampleInfo <- data.frame(sampleID, group, stringsAsFactors = FALSE)
# Convert counts to DESeq2 object
ddsTxi <- DESeqDataSetFromTximport(txi.kallisto, colData = sampleInfo, design = ~group)
# Pre-filtering to remove genes with low counts
keep <- rowSums(counts(ddsTxi)) >= 10
ddsTxi <- ddsTxi[keep, ]
# Log-transform data for visualization
transform.ds <- rlog(deseq.ds, blind = TRUE)
# Plot PCA
pcaData <- plotPCA(transform.ds, intgroup = c("group"), returnData = TRUE, ntop = 1000)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = group)) +
geom_point(size = 5) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed() +
scale_color_manual(values = c("blue", "red"))
# Plot MDS
library(DEFormats)
deseqMDS <- as.DGEList(ddsTxi)
plotMDS(deseqMDS, col = colors[groupPlot], pch = points[groupPlot])
legend("topright", legend = levels(groupPlot), pch = points, col = colors)
My question here is: (1) Which result should I trust / use? The MDS or PCA plots? (2) Are there any better ways to check for sample similarity / difference?
Thank you very much!
No, I did
rlog
transformation for both data before plotting the PCA. For edgeR analysis, therlog
function is under the # Plot PCA comment.ntop=100
is an argument for DESeq2plotPCA
function where it decides the number of top genes to use for principal components selected by highest row variance. I repeated and specified to use all available genes and still get the same results.